Re: [PATCH v4 13/46] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM

2021-03-31 Thread Paul Mackerras
On Tue, Mar 23, 2021 at 11:02:32AM +1000, Nicholas Piggin wrote:
> Move the GUEST_MODE_SKIP logic into KVM code. This is quite a KVM
> internal detail that has no real need to be in common handlers.
> 
> Also add a comment explaining why this thing exists.

[snip]

> diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
> b/arch/powerpc/kvm/book3s_64_entry.S
> index 7a039ea78f15..a5412e24cc05 100644
> --- a/arch/powerpc/kvm/book3s_64_entry.S
> +++ b/arch/powerpc/kvm/book3s_64_entry.S
> @@ -1,6 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0-only */
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -20,9 +21,12 @@ kvmppc_interrupt:
>* guest R12 saved in shadow VCPU SCRATCH0
>* guest R13 saved in SPRN_SCRATCH0
>*/
> -#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>   std r9,HSTATE_SCRATCH2(r13)
>   lbz r9,HSTATE_IN_GUEST(r13)
> + cmpwi   r9,KVM_GUEST_MODE_SKIP
> + beq-.Lmaybe_skip
> +.Lno_skip:
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>   cmpwi   r9,KVM_GUEST_MODE_HOST_HV
>   beq kvmppc_bad_host_intr
>  #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
> @@ -34,3 +38,48 @@ kvmppc_interrupt:
>  #else
>   b   kvmppc_interrupt_pr
>  #endif

It's a bit hard to see without more context, but I think that in the
PR-only case (CONFIG_KVM_BOOK3S_HV_POSSIBLE undefined), this will
corrupt R9.  You need to restore R9 before the unconditional branch to
kvmppc_interrupt_pr.  (I realize this code gets modified further, but
I'd rather not break bisection.)

Paul.


Re: [PATCH v4 12/46] KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point

2021-03-31 Thread Paul Mackerras
On Tue, Mar 23, 2021 at 11:02:31AM +1000, Nicholas Piggin wrote:
> Rather than bifurcate the call depending on whether or not HV is
> possible, and have the HV entry test for PR, just make a single
> common point which does the demultiplexing. This makes it simpler
> to add another type of exit handler.
> 
> Reviewed-by: Daniel Axtens 
> Reviewed-by: Fabiano Rosas 
> Signed-off-by: Nicholas Piggin 

Acked-by: Paul Mackerras 


Re: [PATCH v4 15/46] KVM: PPC: Book3S 64: Move hcall early register setup to KVM

2021-03-31 Thread Paul Mackerras
On Tue, Mar 23, 2021 at 11:02:34AM +1000, Nicholas Piggin wrote:
> System calls / hcalls have a different calling convention than
> other interrupts, so there is code in the KVMTEST to massage these
> into the same form as other interrupt handlers.
> 
> Move this work into the KVM hcall handler. This means teaching KVM
> a little more about the low level interrupt handler setup, PACA save
> areas, etc., although that's not obviously worse than the current
> approach of coming up with an entirely different interrupt register
> / save convention.

[snip]

> @@ -1964,29 +1948,8 @@ EXC_VIRT_END(system_call, 0x4c00, 0x100)
>  
>  #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
>  TRAMP_REAL_BEGIN(system_call_kvm)
> - /*
> -  * This is a hcall, so register convention is as above, with these
> -  * differences:

I haven't checked all the code changes in detail yet, but this comment
at least is slightly misleading, since under PR KVM, system calls (to
the guest kernel) and hypercalls both come through this path.

Paul.


Re: [PATCH v4 29/46] KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C

2021-03-31 Thread Alexey Kardashevskiy




On 3/23/21 12:02 PM, Nicholas Piggin wrote:

Almost all logic is moved to C, by introducing a new in_guest mode that
selects and branches very early in the interrupt handler to the P9 exit
code.

The remaining assembly is only about 160 lines of low level stack setup,
with VCPU vs host register save and restore, plus a small shim to the
legacy paths in the interrupt handler.

There are two motivations for this. The first is simply to make the code
more maintainable by having it in C. The second is to reduce the amount of
code running in a special KVM mode, "realmode". I put that in quotes because
with radix it is no longer necessarily real-mode in the MMU, but it
still has to be treated specially because it may be in real-mode, and
has various important registers like PID, DEC, TB, etc. set to guest
values. This is hostile to the rest of Linux and can't use arbitrary kernel
functionality or be instrumented well.

This initial patch is a reasonably faithful conversion of the asm code.
It does lack any loop to return quickly back into the guest without
switching out of realmode in the case of unimportant or easily handled
interrupts; as explained in the previous change, handling HV interrupts
in real mode is not so important for P9.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/asm-prototypes.h |   3 +-
  arch/powerpc/include/asm/kvm_asm.h|   3 +-
  arch/powerpc/include/asm/kvm_book3s_64.h  |   8 +
  arch/powerpc/kernel/security.c|   5 +-
  arch/powerpc/kvm/Makefile |   3 +
  arch/powerpc/kvm/book3s_64_entry.S| 246 ++
  arch/powerpc/kvm/book3s_hv.c  |   9 +-
  arch/powerpc/kvm/book3s_hv_interrupt.c| 223 
  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 123 +--
  9 files changed, 500 insertions(+), 123 deletions(-)
  create mode 100644 arch/powerpc/kvm/book3s_hv_interrupt.c

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 939f3c94c8f3..7c74c80ed994 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -122,6 +122,7 @@ extern s32 patch__call_flush_branch_caches3;
  extern s32 patch__flush_count_cache_return;
  extern s32 patch__flush_link_stack_return;
  extern s32 patch__call_kvm_flush_link_stack;
+extern s32 patch__call_kvm_flush_link_stack_p9;
  extern s32 patch__memset_nocache, patch__memcpy_nocache;
  
  extern long flush_branch_caches;

@@ -142,7 +143,7 @@ void kvmhv_load_host_pmu(void);
  void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
  void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
  
-int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);

+void kvmppc_p9_enter_guest(struct kvm_vcpu *vcpu);
  
  long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr);

  long kvmppc_h_set_xdabr(struct kvm_vcpu *vcpu, unsigned long dabr,
diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index a3633560493b..b4f9996bd331 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -146,7 +146,8 @@
  #define KVM_GUEST_MODE_GUEST  1
  #define KVM_GUEST_MODE_SKIP   2
  #define KVM_GUEST_MODE_GUEST_HV   3
-#define KVM_GUEST_MODE_HOST_HV 4
+#define KVM_GUEST_MODE_GUEST_HV_FAST   4 /* ISA v3.0 with host radix mode */
+#define KVM_GUEST_MODE_HOST_HV 5
  
  #define KVM_INST_FETCH_FAILED	-1
  
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h

index 9bb9bb370b53..c214bcffb441 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -153,9 +153,17 @@ static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu 
*vcpu)
return radix;
  }
  
+int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);

+
  #define KVM_DEFAULT_HPT_ORDER 24  /* 16MB HPT by default */
  #endif
  
+/*

+ * Invalid HDSISR value which is used to indicate when HW has not set the reg.
+ * Used to work around an erratum.
+ */
+#define HDSISR_CANARY  0x7fff
+
  /*
   * We use a lock bit in HPTE dword 0 to synchronize updates and
   * accesses to each HPTE, and another bit to indicate non-present
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index e4e1a94ccf6a..3a607c11f20f 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -430,16 +430,19 @@ device_initcall(stf_barrier_debugfs_init);
  
  static void update_branch_cache_flush(void)

  {
-   u32 *site;
+   u32 *site, __maybe_unused *site2;
  
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE

	site = &patch__call_kvm_flush_link_stack;
+   site2 = &patch__call_kvm_flush_link_stack_p9;
// This controls the branch from guest_exit_cont to kvm_flush_link_stack
if (link_stack_flush_type == BRANCH_CACHE_FLUSH_NONE) {
patch_instruction_site(site, ppc_inst(PPC_INST_NOP));
+   patch_instruction_site(site2, 

Re: [PATCH v10 01/10] powerpc/mm: Implement set_memory() routines

2021-03-31 Thread Aneesh Kumar K.V
Jordan Niethe  writes:

> From: Russell Currey 
>
> The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX,
> and are generally useful primitives to have.  This implementation is
> designed to be completely generic across powerpc's many MMUs.
>
> It's possible that this could be optimised to be faster for specific
> MMUs, but the focus is on having a generic and safe implementation for
> now.
>
> This implementation does not handle cases where the caller is attempting
> to change the mapping of the page it is executing from, or if another
> CPU is concurrently using the page being altered.  These cases likely
> shouldn't happen, but a more complex implementation with MMU-specific code
> could safely handle them, so that is left as a TODO for now.
>
> On hash the linear mapping is not kept in the linux pagetable, so this
> will not change the protection if used on that range. Currently these
> functions are not used on the linear map so just WARN for now.
>
> These functions do nothing if STRICT_KERNEL_RWX is not enabled.
>
> Reviewed-by: Daniel Axtens 
> Signed-off-by: Russell Currey 
> Signed-off-by: Christophe Leroy 
> [jpn: -rebase on next plus "powerpc/mm/64s: Allow STRICT_KERNEL_RWX again"
>   - WARN on hash linear map]
> Signed-off-by: Jordan Niethe 
> ---
> v10: WARN if trying to change the hash linear map
> ---
>  arch/powerpc/Kconfig  |  1 +
>  arch/powerpc/include/asm/set_memory.h | 32 ++
>  arch/powerpc/mm/Makefile  |  2 +-
>  arch/powerpc/mm/pageattr.c| 88 +++
>  4 files changed, 122 insertions(+), 1 deletion(-)
>  create mode 100644 arch/powerpc/include/asm/set_memory.h
>  create mode 100644 arch/powerpc/mm/pageattr.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index fc7f5c5933e6..4498a27ac9db 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -135,6 +135,7 @@ config PPC
>   select ARCH_HAS_MEMBARRIER_CALLBACKS
>   select ARCH_HAS_MEMBARRIER_SYNC_CORE
>   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
> && PPC_BOOK3S_64
> + select ARCH_HAS_SET_MEMORY
>   select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
> !HIBERNATION)
>   select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
>   select ARCH_HAS_UACCESS_FLUSHCACHE
> diff --git a/arch/powerpc/include/asm/set_memory.h 
> b/arch/powerpc/include/asm/set_memory.h
> new file mode 100644
> index ..64011ea444b4
> --- /dev/null
> +++ b/arch/powerpc/include/asm/set_memory.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_SET_MEMORY_H
> +#define _ASM_POWERPC_SET_MEMORY_H
> +
> +#define SET_MEMORY_RO0
> +#define SET_MEMORY_RW1
> +#define SET_MEMORY_NX2
> +#define SET_MEMORY_X 3
> +
> +int change_memory_attr(unsigned long addr, int numpages, long action);
> +
> +static inline int set_memory_ro(unsigned long addr, int numpages)
> +{
> + return change_memory_attr(addr, numpages, SET_MEMORY_RO);
> +}
> +
> +static inline int set_memory_rw(unsigned long addr, int numpages)
> +{
> + return change_memory_attr(addr, numpages, SET_MEMORY_RW);
> +}
> +
> +static inline int set_memory_nx(unsigned long addr, int numpages)
> +{
> + return change_memory_attr(addr, numpages, SET_MEMORY_NX);
> +}
> +
> +static inline int set_memory_x(unsigned long addr, int numpages)
> +{
> + return change_memory_attr(addr, numpages, SET_MEMORY_X);
> +}
> +
> +#endif
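
For illustration, a caller flipping one page read-only with this API would
look something like the sketch below (a hypothetical caller with a
page-aligned buffer; not part of the patch):

	#include <asm/set_memory.h>

	static int protect_page(void *page_aligned_buf)
	{
		/* make a single page read-only; returns 0 on success */
		return set_memory_ro((unsigned long)page_aligned_buf, 1);
	}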
> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
> index 3b4e9e4e25ea..d8a08abde1ae 100644
> --- a/arch/powerpc/mm/Makefile
> +++ b/arch/powerpc/mm/Makefile
> @@ -5,7 +5,7 @@
>  
>  ccflags-$(CONFIG_PPC64)  := $(NO_MINIMAL_TOC)
>  
> -obj-y:= fault.o mem.o pgtable.o mmap.o 
> maccess.o \
> +obj-y:= fault.o mem.o pgtable.o mmap.o 
> maccess.o pageattr.o \
>  init_$(BITS).o pgtable_$(BITS).o \
>  pgtable-frag.o ioremap.o ioremap_$(BITS).o \
>  init-common.o mmu_context.o drmem.o
> diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
> new file mode 100644
> index ..9efcb01088da
> --- /dev/null
> +++ b/arch/powerpc/mm/pageattr.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * MMU-generic set_memory implementation for powerpc
> + *
> + * Copyright 2019, IBM Corporation.
> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +
> +/*
> + * Updates the attributes of a page in three steps:
> + *
> + * 1. invalidate the page table entry
> + * 2. flush the TLB
> + * 3. install the new entry with the updated attributes
> + *
> + * This is unsafe if the caller is attempting to change the mapping of the
> + * page it is executing from, or if another CPU is 
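
The three-step sequence that comment describes could be sketched roughly as
below (illustrative only, assuming it runs under the PTE lock; the patch
itself applies a similar callback through apply_to_page_range()):

	static int change_page_attr_sketch(pte_t *ptep, unsigned long addr, long action)
	{
		/* 1. invalidate the page table entry */
		pte_t pte = ptep_get_and_clear(&init_mm, addr, ptep);

		/* 2. flush the TLB so no CPU keeps using the old translation */
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

		/* 3. install the new entry with the updated attributes */
		switch (action) {
		case SET_MEMORY_RO:
			pte = pte_wrprotect(pte);
			break;
		case SET_MEMORY_RW:
			pte = pte_mkwrite(pte);
			break;
		case SET_MEMORY_NX:
			pte = pte_exprotect(pte);
			break;
		case SET_MEMORY_X:
			pte = pte_mkexec(pte);
			break;
		}
		set_pte_at(&init_mm, addr, ptep, pte);
		return 0;
	}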

Re: [PATCH] powerpc/8xx: Load modules closer to kernel text

2021-03-31 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 31/03/2021 à 15:39, Michael Ellerman a écrit :
>> Christophe Leroy  writes:
>>> On the 8xx, TASK_SIZE is 0x8000. The space between TASK_SIZE and
>>> PAGE_OFFSET is not used.
>>>
>>> Use it to load modules in order to minimise the distance between
>>> kernel text and modules and avoid trampolines in modules to access
>>> kernel functions or other module functions.
>>>
>>> Define a 16Mbytes area for modules, that's more than enough.
>> 
>> 16MB seems kind of small.
>> 
>> At least on 64-bit we could potentially have hundreds of MBs of modules.
>> 
>
> Well, with a 16 MB kernel and 16 MB modules, my board is full :)

Heh.

> Even on the more recent board that has 128 MB, I don't expect more than a few 
> MBs of modules in 
> addition to the kernel which is approx 8M.
>
> But ok, I'll do something more generic, though it will conflict with Jordan's 
> series.

Don't feel you have to. You're the expert on 8xx, not me.

cheers


Re: [PATCH v2] powerpc/traps: Enhance readability for trap types

2021-03-31 Thread Xiongwei Song
Segher Boessenkool  于2021年4月1日周四 上午6:15写道:

> On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote:
> > So perhaps:
> >
> >   EXC_SYSTEM_RESET
> >   EXC_MACHINE_CHECK
> >   EXC_DATA_STORAGE
> >   EXC_DATA_SEGMENT
> >   EXC_INST_STORAGE
> >   EXC_INST_SEGMENT
> >   EXC_EXTERNAL_INTERRUPT
> >   EXC_ALIGNMENT
> >   EXC_PROGRAM_CHECK
> >   EXC_FP_UNAVAILABLE
> >   EXC_DECREMENTER
> >   EXC_HV_DECREMENTER
> >   EXC_SYSTEM_CALL
> >   EXC_HV_DATA_STORAGE
> >   EXC_PERF_MONITOR
>
> These are interrupt (vectors), not exceptions.  It doesn't matter all
> that much, but confusing things more isn't useful either!  There can be
> multiple exceptions that all can trigger the same interrupt.
>
When looking at the reference manuals of the e500 and e600 from NXP,
they call them interrupts. While looking at "The Programming
Environments", which is also from NXP, they call them exceptions. It
looks like there is no explicit distinction between interrupts and
exceptions.

Here are the links for the reference manual of e600 and e500:
https://www.nxp.com.cn/docs/en/reference-manual/E600CORERM.pdf
https://www.nxp.com/files-static/32bit/doc/ref_manual/EREF_RM.pdf

Here is the "The Programming Environments" link:
https://www.nxp.com.cn/docs/en/user-guide/MPCFPE_AD_R1.pdf

As far as I know, the values of the interrupts or exceptions above are
defined explicitly in the reference manuals or the programming
environments. Could you please give more details on multiple exceptions
triggering the same interrupt?

Xiongwei


Re: [PATCH v2] powerpc/traps: Enhance readability for trap types

2021-03-31 Thread Xiongwei Song
Michael Ellerman  于2021年3月31日周三 下午5:58写道:

> Xiongwei Song  writes:
> > From: Xiongwei Song 
> >
> > Create a new header named traps.h, define macros to list ppc exception
> > types in traps.h, replace the reference of the real trap values with
> > these macros.
>
> Personally I find the hex values easier to recognise, but I realise
> that's probably not true of other people :)
>
> I'm one of the "other people".

> ...
> > diff --git a/arch/powerpc/include/asm/traps.h
> b/arch/powerpc/include/asm/traps.h
> > new file mode 100644
> > index ..a31b6122de23
> > --- /dev/null
> > +++ b/arch/powerpc/include/asm/traps.h
> > @@ -0,0 +1,19 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_PPC_TRAPS_H
> > +#define _ASM_PPC_TRAPS_H
> > +
> > +#define TRAP_RESET   0x100 /* System reset */
> > +#define TRAP_MCE 0x200 /* Machine check */
> > +#define TRAP_DSI 0x300 /* Data storage */
> > +#define TRAP_DSEGI   0x380 /* Data segment */
> > +#define TRAP_ISI 0x400 /* Instruction storage */
> > +#define TRAP_ISEGI   0x480 /* Instruction segment */
> > +#define TRAP_ALIGN   0x600 /* Alignment */
> > +#define TRAP_PROG0x700 /* Program */
> > +#define TRAP_DEC 0x900 /* Decrementer */
> > +#define TRAP_SYSCALL 0xc00 /* System call */
> > +#define TRAP_TRACEI  0xd00 /* Trace */
> > +#define TRAP_FPA 0xe00 /* Floating-point Assist */
> > +#define TRAP_PMI 0xf00 /* Performance monitor */
>
> I know the macro is called TRAP and the field in pt_regs is called trap,
> but the terminology in the architecture is "exception", and we already
> have many uses of that. In particular we have a lot of uses of "exc" as
> an abbreviation for "exception". So I think I'd rather we use that than
> "TRAP".
>
Ok.

>
> I think we should probably use the names from the ISA, unless they are
> really over long.
>
> Which are:
>
>   0x100   System Reset
>   0x200   Machine Check
>   0x300   Data Storage
>   0x380   Data Segment
>   0x400   Instruction Storage
>   0x480   Instruction Segment
>   0x500   External
>   0x600   Alignment
>   0x700   Program
>   0x800   Floating-Point Unavailable
>   0x900   Decrementer
>   0x980   Hypervisor Decrementer
>   0xA00   Directed Privileged Doorbell
>   0xC00   System Call
>   0xD00   Trace
>   0xE00   Hypervisor Data Storage
>   0xE20   Hypervisor Instruction Storage
>   0xE40   Hypervisor Emulation Assistance
>   0xE60   Hypervisor Maintenance
>   0xE80   Directed Hypervisor Doorbell
>   0xEA0   Hypervisor Virtualization
>   0xF00   Performance Monitor
>   0xF20   Vector Unavailable
>   0xF40   VSX Unavailable
>   0xF60   Facility Unavailable
>   0xF80   Hypervisor Facility Unavailable
>   0xFA0   Directed Ultravisor Doorbell
>
>
> So perhaps:
>
>   EXC_SYSTEM_RESET
>   EXC_MACHINE_CHECK
>   EXC_DATA_STORAGE
>   EXC_DATA_SEGMENT
>   EXC_INST_STORAGE
>   EXC_INST_SEGMENT
>   EXC_EXTERNAL_INTERRUPT
>   EXC_ALIGNMENT
>   EXC_PROGRAM_CHECK
>   EXC_FP_UNAVAILABLE
>   EXC_DECREMENTER
>   EXC_HV_DECREMENTER
>   EXC_SYSTEM_CALL
>   EXC_HV_DATA_STORAGE
>   EXC_PERF_MONITOR
>
Thanks for the suggestions. I'm ok with the prefix. Let me change.
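
For example, with these names a handler could test the trap value
symbolically instead of against a raw vector (illustrative use, assuming
the existing TRAP() accessor on pt_regs; handle_data_storage() is a
hypothetical handler):

	if (TRAP(regs) == EXC_DATA_STORAGE)	/* instead of == 0x300 */
		handle_data_storage(regs);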


Re: [PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property

2021-03-31 Thread David Gibson
On Wed, Mar 31, 2021 at 04:45:06PM +0200, Cédric Le Goater wrote:
> The 'chip_id' field of the XIVE CPU structure is used to choose a
> target for a source located on the same chip when possible. The XIVE
> driver queries the chip id value from the "ibm,chip-id" DT property
> but this property is not available on all platforms. It was first
> introduced on the PowerNV platform and later, under QEMU for pseries.
> However, the property does not exist under PowerVM since it is not
> specified in PAPR.
> 
> cpu_to_node() is a better alternative. On the PowerNV platform, the
> node id is computed from the "ibm,associativity" property of the CPU.
> Its value is built in the OPAL firmware from the physical chip id and
> is equivalent to "ibm,chip-id".

Hrm... I mean, for powernv this is certainly correct, but it seems to be
relying on pretty specific details of the OPAL / chip behaviour,
namely that the NUMA id == chip ID.

> On pSeries, the hcall
> H_HOME_NODE_ASSOCIATIVITY returns the node id.

AFAICT, the chip_id field is never actually used in the PAPR version
of XIVE.  The only access to the field outside native.c is in
xive_pick_irq_target(), and it only looks at chip_id if src_chip is
valid.  But src_chip is initialized to XIVE_INVALID_CHIP_ID in papr.c

So it would make more sense to me to also initialize chip_id to
XIVE_INVALID_CHIP_ID for PAPR to make it clearer that it's not
relevant.
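
Something like this in the PAPR backend's per-CPU setup, presumably
(sketch only, where xc is the per-CPU XIVE structure):

	/* chip_id is never consulted on PAPR; make that explicit */
	xc->chip_id = XIVE_INVALID_CHIP_ID;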

> Also to be noted that under QEMU/KVM "ibm,chip-id" is badly calculated
> with unusual SMT configuration. This leads to a bogus chip id value
> being returned by of_get_ibm_chip_id().

I *still* don't clearly understand what you think is bogus about the
chip id value that qemu generates.  It's clearly not a problem for
XIVE, since PAPR XIVE never uses it.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v2] powerpc/traps: Enhance readability for trap types

2021-03-31 Thread Michael Ellerman
Segher Boessenkool  writes:
> On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote:
>> So perhaps:
>> 
>>   EXC_SYSTEM_RESET
>>   EXC_MACHINE_CHECK
>>   EXC_DATA_STORAGE
>>   EXC_DATA_SEGMENT
>>   EXC_INST_STORAGE
>>   EXC_INST_SEGMENT
>>   EXC_EXTERNAL_INTERRUPT
>>   EXC_ALIGNMENT
>>   EXC_PROGRAM_CHECK
>>   EXC_FP_UNAVAILABLE
>>   EXC_DECREMENTER
>>   EXC_HV_DECREMENTER
>>   EXC_SYSTEM_CALL
>>   EXC_HV_DATA_STORAGE
>>   EXC_PERF_MONITOR
>
> These are interrupt (vectors), not exceptions.  It doesn't matter all
> that much, but confusing things more isn't useful either!  There can be
> multiple exceptions that all can trigger the same interrupt.

Yeah I know, but I think that ship has already sailed as far as the
naming we have in the kernel.

We have over 250 uses of "exc", and several files called "exception"
something.

Using "interrupt" can also be confusing because Linux uses that to mean
"external interrupt".

But I dunno, maybe INT or VEC is clearer? .. or TRAP :)

cheers


Re: [PATCH] powerpc/vdso: Separate vvar vma from vdso

2021-03-31 Thread Dmitry Safonov
On 3/31/21 10:59 AM, Michael Ellerman wrote:
> Christophe Leroy  writes:
[..]
>>
>>> @@ -133,7 +135,13 @@ static int __arch_setup_additional_pages(struct 
>>> linux_binprm *bprm, int uses_int
>>>  * install_special_mapping or the perf counter mmap tracking code
>>>  * will fail to recognise it as a vDSO.
>>>  */
>>> -   mm->context.vdso = (void __user *)vdso_base + PAGE_SIZE;
>>> +   mm->context.vdso = (void __user *)vdso_base + vvar_size;
>>> +
>>> +   vma = _install_special_mapping(mm, vdso_base, vvar_size,
>>> +  VM_READ | VM_MAYREAD | VM_IO |
>>> +  VM_DONTDUMP | VM_PFNMAP, _spec);
>>> +   if (IS_ERR(vma))
>>> +   return PTR_ERR(vma);
>>>   
>>> /*
>>>  * our vma flags don't have VM_WRITE so by default, the process isn't
>>
>>
>> IIUC, VM_PFNMAP is for when we have a vvar_fault handler.
>> Allthough we will soon have one for handle TIME_NS, at the moment
>> powerpc doesn't have that handler.
>> Isn't it dangerous to set VM_PFNMAP then ?

I believe, it's fine, special_mapping_fault() does:
:   if (sm->fault)
:   return sm->fault(sm, vmf->vma, vmf);

> Some of the other flags seem odd too.
> eg. VM_IO ? VM_DONTDUMP ?

Yeah, so:
VM_PFNMAP | VM_IO is protection against remote access to the pages. So
one can't access such a page with ptrace(), /proc/$pid/mem or
process_vm_writev(). Otherwise, the access would create a COW mapping
and the tracee would be left working with a stale vvar page.
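
The gate itself sits in the GUP path; roughly paraphrased (not verbatim
kernel code), remote access bails out on such VMAs:

	/* mm/gup.c, check_vma_flags(), paraphrased */
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return -EFAULT;	/* so ptrace()/process_vm_writev() access fails */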

VM_DONTDUMP restricts the area from coredumping and gdb will also avoid
accessing those[1][2].

I agree that VM_PFNMAP was probably excessive in this patch alone and
rather served to synchronize the code with other architectures, but it
makes more sense now in the new patch set by Christophe:
https://lore.kernel.org/linux-arch/cover.1617209141.git.christophe.le...@csgroup.eu/


[1] https://lore.kernel.org/lkml/550731af.6080...@redhat.com/T/
[2] https://sourceware.org/legacy-ml/gdb-patches/2015-03/msg00383.html

Thanks,
  Dmitry


Re: [PATCH v2] powerpc/traps: Enhance readability for trap types

2021-03-31 Thread Segher Boessenkool
On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote:
> So perhaps:
> 
>   EXC_SYSTEM_RESET
>   EXC_MACHINE_CHECK
>   EXC_DATA_STORAGE
>   EXC_DATA_SEGMENT
>   EXC_INST_STORAGE
>   EXC_INST_SEGMENT
>   EXC_EXTERNAL_INTERRUPT
>   EXC_ALIGNMENT
>   EXC_PROGRAM_CHECK
>   EXC_FP_UNAVAILABLE
>   EXC_DECREMENTER
>   EXC_HV_DECREMENTER
>   EXC_SYSTEM_CALL
>   EXC_HV_DATA_STORAGE
>   EXC_PERF_MONITOR

These are interrupt (vectors), not exceptions.  It doesn't matter all
that much, but confusing things more isn't useful either!  There can be
multiple exceptions that all can trigger the same interrupt.
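(For example, on Book3S a hash-table miss, a storage-protection violation
and a DAWR match are different exceptions, but every one of them raises
the same 0x300 Data Storage interrupt.)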


Segher


Re: [PATCH] powerpc/vdso: Separate vvar vma from vdso

2021-03-31 Thread Dmitry Safonov
On 3/30/21 11:17 AM, Christophe Leroy wrote:
> 
> 
> Le 26/03/2021 à 20:17, Dmitry Safonov a écrit :
[..]
>> --- a/arch/powerpc/kernel/vdso.c
>> +++ b/arch/powerpc/kernel/vdso.c
>> @@ -55,10 +55,10 @@ static int vdso_mremap(const struct
>> vm_special_mapping *sm, struct vm_area_struc
>>   {
>>   unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
>>   -    if (new_size != text_size + PAGE_SIZE)
>> +    if (new_size != text_size)
>>   return -EINVAL;
> 
> In ARM64 you have removed the above test in commit 871402e05b24cb56
> ("mm: forbid splitting special mappings"). Do we need to keep it here ?
> 
>>   -    current->mm->context.vdso = (void __user *)new_vma->vm_start +
>> PAGE_SIZE;
>> +    current->mm->context.vdso = (void __user *)new_vma->vm_start;
>>     return 0;
>>   }
> 

Yes, right you are, this can be dropped.

Thanks,
  Dmitry


Re: [PATCH] docs: powerpc: Fix misspellings and grammar errors

2021-03-31 Thread Jonathan Corbet
He Ying  writes:

> Reported-by: Hulk Robot 
> Signed-off-by: He Ying 
> ---
>  Documentation/powerpc/booting.rst| 2 +-
>  Documentation/powerpc/dawr-power9.rst| 2 +-
>  Documentation/powerpc/eeh-pci-error-recovery.rst | 2 +-
>  Documentation/powerpc/elfnote.rst| 2 +-
>  Documentation/powerpc/firmware-assisted-dump.rst | 2 +-
>  Documentation/powerpc/kaslr-booke32.rst  | 2 +-
>  Documentation/powerpc/mpc52xx.rst| 2 +-
>  Documentation/powerpc/papr_hcalls.rst| 4 ++--
>  Documentation/powerpc/transactional_memory.rst   | 4 ++--
>  9 files changed, 11 insertions(+), 11 deletions(-)

Applied, thanks.

jon


Re: [PATCH 6/8] drivers: firmware: efi: libstub: enable generic commandline

2021-03-31 Thread Daniel Walker
On Wed, Mar 31, 2021 at 06:10:08PM +0200, Ard Biesheuvel wrote:
> (+ Arvind)
> 
> On Tue, 30 Mar 2021 at 19:57, Daniel Walker  wrote:
> >
> > This adds code to handle the generic command line changes.
> > The EFI code, it appears, doesn't benefit as much from this design
> > as it could.
> >
> > For example, if you had a prepend command line with "nokaslr" then
> > it might be helpful to be able to re-enable it in the boot loader or dts,
> > but there appears to be no way to re-enable kaslr or some of the
> > other options.
> >
> > Cc: xe-linux-exter...@cisco.com
> > Signed-off-by: Daniel Walker 
> > ---
> >  .../firmware/efi/libstub/efi-stub-helper.c| 35 +++
> >  drivers/firmware/efi/libstub/efi-stub.c   |  7 
> >  drivers/firmware/efi/libstub/efistub.h|  1 +
> >  drivers/firmware/efi/libstub/x86-stub.c   | 13 +--
> >  4 files changed, 54 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c 
> > b/drivers/firmware/efi/libstub/efi-stub-helper.c
> > index aa8da0a49829..c155837cedc9 100644
> > --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> > +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> > @@ -13,6 +13,7 @@
> >  #include 
> >  #include 
> >  #include  /* For CONSOLE_LOGLEVEL_* */
> > +#include 
> >  #include 
> >  #include 
> >
> > @@ -172,6 +173,40 @@ int efi_printk(const char *fmt, ...)
> > return printed;
> >  }
> >
> > +/**
> > + * efi_handle_cmdline() - handle adding in building parts of the command 
> > line
> > + * @cmdline:   kernel command line
> > + *
> > + * Add in the generic parts of the commandline and start the parsing of the
> > + * command line.
> > + *
> > + * Return: status code
> > + */
> > +efi_status_t efi_handle_cmdline(char const *cmdline)
> > +{
> > +   efi_status_t status;
> > +
> > +   status = efi_parse_options(CMDLINE_PREPEND);
> > +   if (status != EFI_SUCCESS) {
> > +   efi_err("Failed to parse options\n");
> > +   return status;
> > +   }
> 
> Even though I am not a fan of the 'success handling' pattern,
> duplicating the exact same error handling three times is not great
> either. Could we reuse more of the code here?

How about

efi_status_t status = 0;

status |= efi_parse_options(CMDLINE_PREPEND);

then error checking once ?
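
(efi_status_t values aren't bit flags, though, so OR-ing them together can
mangle the returned code. A loop with a single error check gets the same
"check once" effect -- a rough sketch reusing the helpers from the patch:)

	efi_status_t efi_handle_cmdline(char const *cmdline)
	{
		const char *parts[] = { CMDLINE_PREPEND, cmdline, CMDLINE_APPEND };
		efi_status_t status;
		int i;

		for (i = 0; i < ARRAY_SIZE(parts); i++) {
			status = efi_parse_options(parts[i]);
			if (status != EFI_SUCCESS) {
				efi_err("Failed to parse options\n");
				return status;
			}
		}
		return EFI_SUCCESS;
	}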

> > +
> > +   status = efi_parse_options(IS_ENABLED(CONFIG_CMDLINE_OVERRIDE) ? "" 
> > : cmdline);
> 
> What is the point of calling efi_parse_options() with an empty string?
 
I could change it to if ((IS_ENABLED(CONFIG_CMDLINE_OVERRIDE)) ?

> > --- a/drivers/firmware/efi/libstub/efi-stub.c
> > +++ b/drivers/firmware/efi/libstub/efi-stub.c
> > @@ -172,6 +172,12 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
> > goto fail;
> > }
> >
> > +#ifdef CONFIG_GENERIC_CMDLINE
> > +   status = efi_handle_cmdline(cmdline_ptr);
> > +   if (status != EFI_SUCCESS) {
> > +   goto fail_free_cmdline;
> > +   }
> > +#else
> > if (IS_ENABLED(CONFIG_CMDLINE_EXTEND) ||
> > IS_ENABLED(CONFIG_CMDLINE_FORCE) ||
> 
> Does this mean CONFIG_GENERIC_CMDLINE does not replace CMDLINE_EXTEND
> / CMDLINE_FORCE etc, but introduces yet another variant on top of
> those?
> 
> That does not seem like an improvement to me. I think it is great that
> you are cleaning this up, but only if it means we can get rid of the
> old implementation.
 
It does replace extend and force. I was under the impression this code was
shared between arm64 and arm32. If that's not the case I can delete the extend
and force section. I haven't submitted a conversion for arm32 yet.

Daniel


Re: [PATCH v2 3/7] powerpc: convert config files to generic cmdline

2021-03-31 Thread Daniel Walker
On Wed, Mar 31, 2021 at 12:52:19PM +0100, Will Deacon wrote:
> On Tue, Mar 30, 2021 at 10:35:21AM -0700, Daniel Walker wrote:
> > On Mon, Mar 29, 2021 at 11:07:51AM +0100, Will Deacon wrote:
> > > On Thu, Mar 25, 2021 at 12:59:56PM -0700, Daniel Walker wrote:
> > > > On Thu, Mar 25, 2021 at 01:03:55PM +0100, Christophe Leroy wrote:
> > > > > 
> > > > > Ok, so you agree we don't need to provide two CMDLINE, one to be 
> > > > > appended and one to be prepended.
> > > > > 
> > > > > Let's only provide once CMDLINE as of today, and ask the user to 
> > > > > select
> > > > > whether he wants it appended or prepended or replacee. Then no need to
> > > > > change all existing config to rename CONFIG_CMDLINE into either of 
> > > > > the new
> > > > > ones.
> > > > > 
> > > > > That's the main difference between my series and Daniel's series. So 
> > > > > I'll
> > > > > finish taking Will's comment into account and we'll send out a v3 
> > > > > soon.
> > > > 
> > > > It doesn't solve the needs of Cisco, I've stated many times your 
> > > > changes have
> > > > little value. Please stop submitting them.
> > > 
> > > FWIW, they're useful for arm64 and I will gladly review the updated 
> > > series.
> > > 
> > > I don't think asking people to stop submitting patches is ever the right
> > > answer. Please don't do that.
> > 
> > Why ? It's me nacking his series, is that not allowed anymore ?
> 
> If you're that way inclined then you can "nack" whatever you want, but
> please allow the rest of us to continue reviewing the patches. You don't
> have any basis on which to veto other people's contributions and so
> demanding that somebody stops posting code is neither constructive nor
> meaningful.

I understand , but that's not what's happening. I've dealt with Christophe on
these changes repeatedly, and he's demonstrated he doesn't understand the 
feature set or
the motivation of the changes. I've tried to work with him in the past, but it
doesn't work unless he's giving me comments on my changes.

His changes don't solve Cisco problems, and likely never will regardless of
feedback. Maybe that could change, but I don't see that happening.

Daniel


[PATCH RESEND v1 4/4] powerpc/vdso: Add support for time namespaces

2021-03-31 Thread Christophe Leroy
This patch adds the necessary glue to provide time namespaces.

Things are mainly copied from ARM64.

__arch_get_timens_vdso_data() calculates timens vdso data position
based on the vdso data position, knowing it is the next page in vvar.
This avoids having to redo the mflr/bcl/mflr/mtlr dance to locate
the page relative to running code position.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig |   3 +-
 arch/powerpc/include/asm/vdso/gettimeofday.h |  10 ++
 arch/powerpc/include/asm/vdso_datapage.h |   2 -
 arch/powerpc/kernel/vdso.c   | 116 ---
 arch/powerpc/kernel/vdso32/vdso32.lds.S  |   2 +-
 arch/powerpc/kernel/vdso64/vdso64.lds.S  |   2 +-
 6 files changed, 114 insertions(+), 21 deletions(-)
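
The resulting mappings, for reference (the vvar VMA grows to two pages,
with the timens page directly after the vdso data page, which is what
lets __arch_get_timens_vdso_data() below simply add PAGE_SIZE):

	[vvar] page 0: vdso_data	(VVAR_DATA_PAGE_OFFSET)
	[vvar] page 1: timens vdso_data	(VVAR_TIMENS_PAGE_OFFSET)
	[vdso] the vdso text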

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c1344c05226c..71daff5f15d5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -172,6 +172,7 @@ config PPC
select GENERIC_CPU_AUTOPROBE
select GENERIC_CPU_VULNERABILITIES  if PPC_BARRIER_NOSPEC
select GENERIC_EARLY_IOREMAP
+   select GENERIC_GETTIMEOFDAY
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAPif PCI
@@ -179,7 +180,7 @@ config PPC
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
select GENERIC_TIME_VSYSCALL
-   select GENERIC_GETTIMEOFDAY
+   select GENERIC_VDSO_TIME_NS
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP  if PPC_BOOK3S_64 && 
PPC_RADIX_MMU
select HAVE_ARCH_JUMP_LABEL
diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h 
b/arch/powerpc/include/asm/vdso/gettimeofday.h
index d453e725c79f..e448df1dd071 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_POWERPC_VDSO_GETTIMEOFDAY_H
 #define _ASM_POWERPC_VDSO_GETTIMEOFDAY_H
 
+#include 
+
 #ifdef __ASSEMBLY__
 
 #include 
@@ -153,6 +155,14 @@ static __always_inline u64 __arch_get_hw_counter(s32 
clock_mode,
 
 const struct vdso_data *__arch_get_vdso_data(void);
 
+#ifdef CONFIG_TIME_NS
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
+{
+   return (void *)vd + PAGE_SIZE;
+}
+#endif
+
 static inline bool vdso_clocksource_ok(const struct vdso_data *vd)
 {
return true;
diff --git a/arch/powerpc/include/asm/vdso_datapage.h 
b/arch/powerpc/include/asm/vdso_datapage.h
index 3f958ecf2beb..a585c8e538ff 100644
--- a/arch/powerpc/include/asm/vdso_datapage.h
+++ b/arch/powerpc/include/asm/vdso_datapage.h
@@ -107,9 +107,7 @@ extern struct vdso_arch_data *vdso_data;
bcl 20, 31, .+4
 999:
mflr\ptr
-#if CONFIG_PPC_PAGE_SHIFT > 14
addis   \ptr, \ptr, (_vdso_datapage - 999b)@ha
-#endif
addi\ptr, \ptr, (_vdso_datapage - 999b)@l
 .endm
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index b14907209822..717f2c9a7573 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -50,6 +51,12 @@ static union {
 } vdso_data_store __page_aligned_data;
struct vdso_arch_data *vdso_data = &vdso_data_store.data;
 
+enum vvar_pages {
+   VVAR_DATA_PAGE_OFFSET,
+   VVAR_TIMENS_PAGE_OFFSET,
+   VVAR_NR_PAGES,
+};
+
 static int vdso_mremap(const struct vm_special_mapping *sm, struct 
vm_area_struct *new_vma,
   unsigned long text_size)
 {
@@ -73,8 +80,12 @@ static int vdso64_mremap(const struct vm_special_mapping 
*sm, struct vm_area_str
	return vdso_mremap(sm, new_vma, &vdso64_end - &vdso64_start);
 }
 
+static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
+struct vm_area_struct *vma, struct vm_fault *vmf);
+
 static struct vm_special_mapping vvar_spec __ro_after_init = {
.name = "[vvar]",
+   .fault = vvar_fault,
 };
 
 static struct vm_special_mapping vdso32_spec __ro_after_init = {
@@ -87,6 +98,94 @@ static struct vm_special_mapping vdso64_spec __ro_after_init 
= {
.mremap = vdso64_mremap,
 };
 
+#ifdef CONFIG_TIME_NS
+struct vdso_data *arch_get_vdso_data(void *vvar_page)
+{
+   return ((struct vdso_arch_data *)vvar_page)->data;
+}
+
+/*
+ * The vvar mapping contains data for a specific time namespace, so when a task
+ * changes namespace we must unmap its vvar data for the old namespace.
+ * Subsequent faults will map in data for the new namespace.
+ *
+ * For more details see timens_setup_vdso_data().
+ */
+int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
+{
+   struct mm_struct *mm = task->mm;
+   struct vm_area_struct *vma;
+
+   mmap_read_lock(mm);
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   unsigned long size = vma->vm_end - vma->vm_start;
+
+   if 

[PATCH RESEND v1 3/4] powerpc/vdso: Separate vvar vma from vdso

2021-03-31 Thread Christophe Leroy
From: Dmitry Safonov 

Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front")
the VVAR page is in front of the VDSO area. As a result it breaks CRIU
(Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]"
from /proc/../maps points at the ELF/vdso image, rather than at the VVAR data
page. Laurent made a patch to keep CRIU working (by reading the aux vector).
But I think it still makes sense to separate the two mappings into different
VMAs. It will also make ppc64 less "special" for userspace and as
a side-bonus will make the VVAR page un-writable by debuggers (which
previously would COW the page, which can be unexpected).

I opportunistically Cc stable on it: I understand that usually such
stuff isn't stable material, but that will allow us in CRIU to have
one workaround less that is needed just for one release (v5.11) on
one platform (ppc64), which we otherwise have to maintain.
I wouldn't go as far as to say that the commit 511157ab641e is ABI
regression as no other userspace got broken, but I'd really appreciate
if it gets backported to v5.11 after v5.12 is released, so as not
to complicate already non-simple CRIU-vdso code. Thanks!

Cc: Andrei Vagin 
Cc: Andy Lutomirski 
Cc: Benjamin Herrenschmidt 
Cc: Christophe Leroy 
Cc: Laurent Dufour 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sta...@vger.kernel.org # v5.11
[1]: https://github.com/checkpoint-restore/criu/issues/1417
Signed-off-by: Dmitry Safonov 
Tested-by: Christophe Leroy 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/kernel/vdso.c | 54 +++---
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 652ce85f9410..4bc45d3ed8b0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -263,7 +263,7 @@ extern void arch_exit_mmap(struct mm_struct *mm);
 static inline void arch_unmap(struct mm_struct *mm,
  unsigned long start, unsigned long end)
 {
-   unsigned long vdso_base = (unsigned long)mm->context.vdso - PAGE_SIZE;
+   unsigned long vdso_base = (unsigned long)mm->context.vdso;
 
if (start <= vdso_base && vdso_base < end)
mm->context.vdso = NULL;
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index e839a906fdf2..b14907209822 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -55,10 +55,10 @@ static int vdso_mremap(const struct vm_special_mapping *sm, 
struct vm_area_struc
 {
unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
 
-   if (new_size != text_size + PAGE_SIZE)
+   if (new_size != text_size)
return -EINVAL;
 
-   current->mm->context.vdso = (void __user *)new_vma->vm_start + 
PAGE_SIZE;
+   current->mm->context.vdso = (void __user *)new_vma->vm_start;
 
return 0;
 }
@@ -73,6 +73,10 @@ static int vdso64_mremap(const struct vm_special_mapping 
*sm, struct vm_area_str
	return vdso_mremap(sm, new_vma, &vdso64_end - &vdso64_start);
 }
 
+static struct vm_special_mapping vvar_spec __ro_after_init = {
+   .name = "[vvar]",
+};
+
 static struct vm_special_mapping vdso32_spec __ro_after_init = {
.name = "[vdso]",
.mremap = vdso32_mremap,
@@ -89,11 +93,11 @@ static struct vm_special_mapping vdso64_spec 
__ro_after_init = {
  */
 static int __arch_setup_additional_pages(struct linux_binprm *bprm, int 
uses_interp)
 {
-   struct mm_struct *mm = current->mm;
+   unsigned long vdso_size, vdso_base, mappings_size;
struct vm_special_mapping *vdso_spec;
+   unsigned long vvar_size = PAGE_SIZE;
+   struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   unsigned long vdso_size;
-   unsigned long vdso_base;
 
if (is_32bit_task()) {
		vdso_spec = &vdso32_spec;
@@ -110,8 +114,8 @@ static int __arch_setup_additional_pages(struct 
linux_binprm *bprm, int uses_int
vdso_base = 0;
}
 
-   /* Add a page to the vdso size for the data page */
-   vdso_size += PAGE_SIZE;
+   mappings_size = vdso_size + vvar_size;
+   mappings_size += (VDSO_ALIGNMENT - 1) & PAGE_MASK;
 
/*
 * pick a base address for the vDSO in process space. We try to put it
@@ -119,9 +123,7 @@ static int __arch_setup_additional_pages(struct 
linux_binprm *bprm, int uses_int
 * and end up putting it elsewhere.
 * Add enough to the size so that the result can be aligned.
 */
-   vdso_base = get_unmapped_area(NULL, vdso_base,
- vdso_size + ((VDSO_ALIGNMENT - 1) & 
PAGE_MASK),
- 0, 0);
+   vdso_base = get_unmapped_area(NULL, vdso_base, mappings_size, 0, 0);
if (IS_ERR_VALUE(vdso_base))
return vdso_base;
 
@@ -133,7 +135,13 @@ 

[PATCH RESEND v1 1/4] lib/vdso: Mark do_hres_timens() and do_coarse_timens() __always_inline()

2021-03-31 Thread Christophe Leroy
In the same spirit as commit c966533f8c6c ("lib/vdso: Mark do_hres()
and do_coarse() as __always_inline"), mark do_hres_timens() and
do_coarse_timens() __always_inline.

The measurement below is on a non-timens process, i.e. on the fastest path.

On powerpc32, without the patch:

clock-gettime-monotonic-raw:vdso: 1155 nsec/call
clock-gettime-monotonic-coarse:vdso: 813 nsec/call
clock-gettime-monotonic:vdso: 1076 nsec/call

With the patch:

clock-gettime-monotonic-raw:vdso: 1100 nsec/call
clock-gettime-monotonic-coarse:vdso: 667 nsec/call
clock-gettime-monotonic:vdso: 1025 nsec/call

Signed-off-by: Christophe Leroy 
---
 lib/vdso/gettimeofday.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 2919f1698140..c6f6dee08746 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -46,8 +46,8 @@ static inline bool vdso_cycles_ok(u64 cycles)
 #endif
 
 #ifdef CONFIG_TIME_NS
-static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
- struct __kernel_timespec *ts)
+static __always_inline int do_hres_timens(const struct vdso_data *vdns, 
clockid_t clk,
+ struct __kernel_timespec *ts)
 {
const struct vdso_data *vd = __arch_get_timens_vdso_data();
	const struct timens_offset *offs = &vdns->offset[clk];
@@ -97,8 +97,8 @@ static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
return NULL;
 }
 
-static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
- struct __kernel_timespec *ts)
+static __always_inline int do_hres_timens(const struct vdso_data *vdns, 
clockid_t clk,
+ struct __kernel_timespec *ts)
 {
return -EINVAL;
 }
@@ -159,8 +159,8 @@ static __always_inline int do_hres(const struct vdso_data 
*vd, clockid_t clk,
 }
 
 #ifdef CONFIG_TIME_NS
-static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
-   struct __kernel_timespec *ts)
+static __always_inline int do_coarse_timens(const struct vdso_data *vdns, 
clockid_t clk,
+   struct __kernel_timespec *ts)
 {
const struct vdso_data *vd = __arch_get_timens_vdso_data();
	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
@@ -188,8 +188,8 @@ static int do_coarse_timens(const struct vdso_data *vdns, 
clockid_t clk,
return 0;
 }
 #else
-static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
-   struct __kernel_timespec *ts)
+static __always_inline int do_coarse_timens(const struct vdso_data *vdns, 
clockid_t clk,
+   struct __kernel_timespec *ts)
 {
return -1;
 }
-- 
2.25.0



[PATCH RESEND v1 2/4] lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()

2021-03-31 Thread Christophe Leroy
For the same reason as commit e876f0b69dc9 ("lib/vdso: Allow
architectures to provide the vdso data pointer"), powerpc wants to
avoid calculation of relative position to code.

As the timens_vdso_data is next page to vdso_data, provide
vdso_data pointer to __arch_get_timens_vdso_data() in order
to ease the calculation on powerpc in following patches.
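
The subtlety is that do_hres_timens() is handed a per-clock pointer into
the vdso_data array, so it first has to rewind to the array base before
the translation, as the hunk below does:

	vd = vdns - (clk == CLOCK_MONOTONIC_RAW ? CS_RAW : CS_HRES_COARSE);
	vd = __arch_get_timens_vdso_data(vd);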

Signed-off-by: Christophe Leroy 
---
 arch/arm64/include/asm/vdso/compat_gettimeofday.h |  3 ++-
 arch/arm64/include/asm/vdso/gettimeofday.h|  2 +-
 arch/s390/include/asm/vdso/gettimeofday.h |  3 ++-
 arch/x86/include/asm/vdso/gettimeofday.h  |  3 ++-
 lib/vdso/gettimeofday.c   | 15 +--
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h 
b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index 7508b0ac1d21..ecb6fd4c3c64 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -155,7 +155,8 @@ static __always_inline const struct vdso_data 
*__arch_get_vdso_data(void)
 }
 
 #ifdef CONFIG_TIME_NS
-static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
const struct vdso_data *ret;
 
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h 
b/arch/arm64/include/asm/vdso/gettimeofday.h
index 631ab1281633..de86230a9436 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -100,7 +100,7 @@ const struct vdso_data *__arch_get_vdso_data(void)
 
 #ifdef CONFIG_TIME_NS
 static __always_inline
-const struct vdso_data *__arch_get_timens_vdso_data(void)
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
return _timens_data;
 }
diff --git a/arch/s390/include/asm/vdso/gettimeofday.h 
b/arch/s390/include/asm/vdso/gettimeofday.h
index ed89ef742530..383c53c3 100644
--- a/arch/s390/include/asm/vdso/gettimeofday.h
+++ b/arch/s390/include/asm/vdso/gettimeofday.h
@@ -68,7 +68,8 @@ long clock_getres_fallback(clockid_t clkid, struct 
__kernel_timespec *ts)
 }
 
 #ifdef CONFIG_TIME_NS
-static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
return _timens_data;
 }
diff --git a/arch/x86/include/asm/vdso/gettimeofday.h 
b/arch/x86/include/asm/vdso/gettimeofday.h
index df01d7349d79..1936f21ed8cd 100644
--- a/arch/x86/include/asm/vdso/gettimeofday.h
+++ b/arch/x86/include/asm/vdso/gettimeofday.h
@@ -58,7 +58,8 @@ extern struct ms_hyperv_tsc_page hvclock_page
 #endif
 
 #ifdef CONFIG_TIME_NS
-static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
return __timens_vdso_data;
 }
diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index c6f6dee08746..ce2f69552003 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -49,13 +49,15 @@ static inline bool vdso_cycles_ok(u64 cycles)
 static __always_inline int do_hres_timens(const struct vdso_data *vdns, 
clockid_t clk,
  struct __kernel_timespec *ts)
 {
-   const struct vdso_data *vd = __arch_get_timens_vdso_data();
+   const struct vdso_data *vd;
	const struct timens_offset *offs = &vdns->offset[clk];
const struct vdso_timestamp *vdso_ts;
u64 cycles, last, ns;
u32 seq;
s64 sec;
 
+   vd = vdns - (clk == CLOCK_MONOTONIC_RAW ? CS_RAW : CS_HRES_COARSE);
+   vd = __arch_get_timens_vdso_data(vd);
if (clk != CLOCK_MONOTONIC_RAW)
		vd = &vd[CS_HRES_COARSE];
else
@@ -92,7 +94,8 @@ static __always_inline int do_hres_timens(const struct 
vdso_data *vdns, clockid_
return 0;
 }
 #else
-static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
return NULL;
 }
@@ -162,7 +165,7 @@ static __always_inline int do_hres(const struct vdso_data 
*vd, clockid_t clk,
 static __always_inline int do_coarse_timens(const struct vdso_data *vdns, 
clockid_t clk,
struct __kernel_timespec *ts)
 {
-   const struct vdso_data *vd = __arch_get_timens_vdso_data();
+   const struct vdso_data *vd = __arch_get_timens_vdso_data(vdns);
	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
	const struct timens_offset *offs = &vd->offset[clk];
u64 nsec;
@@ -310,7 +313,7 @@ __cvdso_gettimeofday_data(const struct vdso_data *vd,
if (unlikely(tz != NULL)) {
if (IS_ENABLED(CONFIG_TIME_NS) &&

[PATCH RESEND v1 0/4] powerpc/vdso: Add support for time namespaces

2021-03-31 Thread Christophe Leroy
[Sorry, resending with complete destination list, I used the wrong script on 
the first delivery]

This series adds support for time namespaces on powerpc.

All timens selftests are successfull.

Christophe Leroy (3):
  lib/vdso: Mark do_hres_timens() and do_coarse_timens()
__always_inline()
  lib/vdso: Add vdso_data pointer as input to
__arch_get_timens_vdso_data()
  powerpc/vdso: Add support for time namespaces

Dmitry Safonov (1):
  powerpc/vdso: Separate vvar vma from vdso

 .../include/asm/vdso/compat_gettimeofday.h|   3 +-
 arch/arm64/include/asm/vdso/gettimeofday.h|   2 +-
 arch/powerpc/Kconfig  |   3 +-
 arch/powerpc/include/asm/mmu_context.h|   2 +-
 arch/powerpc/include/asm/vdso/gettimeofday.h  |  10 ++
 arch/powerpc/include/asm/vdso_datapage.h  |   2 -
 arch/powerpc/kernel/vdso.c| 138 --
 arch/powerpc/kernel/vdso32/vdso32.lds.S   |   2 +-
 arch/powerpc/kernel/vdso64/vdso64.lds.S   |   2 +-
 arch/s390/include/asm/vdso/gettimeofday.h |   3 +-
 arch/x86/include/asm/vdso/gettimeofday.h  |   3 +-
 lib/vdso/gettimeofday.c   |  31 ++--
 12 files changed, 162 insertions(+), 39 deletions(-)

-- 
2.25.0



Re: [PATCH 6/8] drivers: firmware: efi: libstub: enable generic commandline

2021-03-31 Thread Ard Biesheuvel
(+ Arvind)

On Tue, 30 Mar 2021 at 19:57, Daniel Walker  wrote:
>
> This adds code to handle the generic command line changes.
> The EFI code, it appears, doesn't benefit as much from this design
> as it could.
>
> For example, if you had a prepend command line with "nokaslr" then
> it might be helpful to be able to re-enable it in the boot loader or dts,
> but there appears to be no way to re-enable kaslr or some of the
> other options.
>
> Cc: xe-linux-exter...@cisco.com
> Signed-off-by: Daniel Walker 
> ---
>  .../firmware/efi/libstub/efi-stub-helper.c| 35 +++
>  drivers/firmware/efi/libstub/efi-stub.c   |  7 
>  drivers/firmware/efi/libstub/efistub.h|  1 +
>  drivers/firmware/efi/libstub/x86-stub.c   | 13 +--
>  4 files changed, 54 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c 
> b/drivers/firmware/efi/libstub/efi-stub-helper.c
> index aa8da0a49829..c155837cedc9 100644
> --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include  /* For CONSOLE_LOGLEVEL_* */
> +#include 
>  #include 
>  #include 
>
> @@ -172,6 +173,40 @@ int efi_printk(const char *fmt, ...)
> return printed;
>  }
>
> +/**
> + * efi_handle_cmdline() - handle adding in building parts of the command line
> + * @cmdline:   kernel command line
> + *
> + * Add in the generic parts of the commandline and start the parsing of the
> + * command line.
> + *
> + * Return: status code
> + */
> +efi_status_t efi_handle_cmdline(char const *cmdline)
> +{
> +   efi_status_t status;
> +
> +   status = efi_parse_options(CMDLINE_PREPEND);
> +   if (status != EFI_SUCCESS) {
> +   efi_err("Failed to parse options\n");
> +   return status;
> +   }

Even though I am not a fan of the 'success handling' pattern,
duplicating the exact same error handling three times is not great
either. Could we reuse more of the code here?

> +
> +   status = efi_parse_options(IS_ENABLED(CONFIG_CMDLINE_OVERRIDE) ? "" : 
> cmdline);

What is the point of calling efi_parse_options() with an empty string?



> +   if (status != EFI_SUCCESS) {
> +   efi_err("Failed to parse options\n");
> +   return status;
> +   }
> +
> +   status = efi_parse_options(CMDLINE_APPEND);
> +   if (status != EFI_SUCCESS) {
> +   efi_err("Failed to parse options\n");
> +   return status;
> +   }
> +
> +   return EFI_SUCCESS;
> +}
> +
>  /**
>   * efi_parse_options() - Parse EFI command line options
>   * @cmdline:   kernel command line
> diff --git a/drivers/firmware/efi/libstub/efi-stub.c 
> b/drivers/firmware/efi/libstub/efi-stub.c
> index 26e69788f27a..760480248adf 100644
> --- a/drivers/firmware/efi/libstub/efi-stub.c
> +++ b/drivers/firmware/efi/libstub/efi-stub.c
> @@ -172,6 +172,12 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
> goto fail;
> }
>
> +#ifdef CONFIG_GENERIC_CMDLINE
> +   status = efi_handle_cmdline(cmdline_ptr);
> +   if (status != EFI_SUCCESS) {
> +   goto fail_free_cmdline;
> +   }
> +#else
> if (IS_ENABLED(CONFIG_CMDLINE_EXTEND) ||
> IS_ENABLED(CONFIG_CMDLINE_FORCE) ||

Does this mean CONFIG_GENERIC_CMDLINE does not replace CMDLINE_EXTEND
/ CMDLINE_FORCE etc, but introduces yet another variant on top of
those?

That does not seem like an improvement to me. I think it is great that
you are cleaning this up, but only if it means we can get rid of the
old implementation.

> cmdline_size == 0) {
> @@ -189,6 +195,7 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
> goto fail_free_cmdline;
> }
> }
> +#endif
>
> efi_info("Booting Linux Kernel...\n");
>
> diff --git a/drivers/firmware/efi/libstub/efistub.h 
> b/drivers/firmware/efi/libstub/efistub.h
> index cde0a2ef507d..07c7f9fdfffc 100644
> --- a/drivers/firmware/efi/libstub/efistub.h
> +++ b/drivers/firmware/efi/libstub/efistub.h
> @@ -800,6 +800,7 @@ efi_status_t efi_relocate_kernel(unsigned long 
> *image_addr,
>  unsigned long alignment,
>  unsigned long min_addr);
>
> +efi_status_t efi_handle_cmdline(char const *cmdline);
>  efi_status_t efi_parse_options(char const *cmdline);
>
>  void efi_parse_option_graphics(char *option);
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> b/drivers/firmware/efi/libstub/x86-stub.c
> index f14c4ff5839f..30ad8fb7122d 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -673,6 +673,8 @@ unsigned long efi_main(efi_handle_t handle,
> unsigned long bzimage_addr = (unsigned long)startup_32;
> unsigned long buffer_start, buffer_end;
> struct setup_header *hdr = _params->hdr;
> +  

Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-16 15:38, Christoph Hellwig wrote:
[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f1e38526d5bd40..996dfdf9d375dd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2017,7 +2017,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.iommu_dev  = smmu->dev,
};
  
-	if (smmu_domain->non_strict)

+   if (!iommu_get_dma_strict())


As Will raised, this also needs to be checking "domain->type == 
IOMMU_DOMAIN_DMA" to maintain equivalent behaviour to the attribute code 
below.
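
i.e. presumably something like:

	if (domain->type == IOMMU_DOMAIN_DMA && !iommu_get_dma_strict())
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;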



pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
  
	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);

@@ -2449,52 +2449,6 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
  }
  
-static int arm_smmu_domain_get_attr(struct iommu_domain *domain,

-   enum iommu_attr attr, void *data)
-{
-   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-
-   switch (domain->type) {
-   case IOMMU_DOMAIN_DMA:
-   switch (attr) {
-   case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = smmu_domain->non_strict;
-   return 0;
-   default:
-   return -ENODEV;
-   }
-   break;
-   default:
-   return -EINVAL;
-   }
-}

[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index f985817c967a25..edb1de479dd1a7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -668,7 +668,6 @@ struct arm_smmu_domain {
struct mutexinit_mutex; /* Protects smmu pointer */
  
  	struct io_pgtable_ops		*pgtbl_ops;

-   boolnon_strict;
atomic_tnr_ats_masters;
  
  	enum arm_smmu_domain_stage	stage;

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 0aa6d667274970..3dde22b1f8ffb0 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -761,6 +761,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain 
*domain,
.iommu_dev  = smmu->dev,
};
  
+	if (!iommu_get_dma_strict())


Ditto here.

Sorry for not spotting that sooner :(

Robin.


+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+
if (smmu->impl && smmu->impl->init_context) {
		ret = smmu->impl->init_context(smmu_domain, &pgtbl_cfg, dev);
if (ret)


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-31 16:32, Will Deacon wrote:

On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote:

On 2021-03-31 12:49, Will Deacon wrote:

On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:

On 2021-03-30 14:58, Will Deacon wrote:

On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:

On 2021-03-30 14:11, Will Deacon wrote:

On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:

From: Robin Murphy 

Instead make the global iommu_dma_strict parameter in iommu.c canonical by
exporting helpers to get and set it and use those directly in the drivers.

This makes sure that the iommu.strict parameter also works for the AMD and
Intel IOMMU drivers on x86.  As those default to lazy flushing, a new
IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
represent the default if not overridden by an explicit parameter.

Signed-off-by: Robin Murphy .
[ported on top of the other iommu_attr changes and added a few small
 missing bits]
Signed-off-by: Christoph Hellwig 
---
 drivers/iommu/amd/iommu.c   | 23 +---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
 drivers/iommu/dma-iommu.c   |  9 +--
 drivers/iommu/intel/iommu.c | 64 -
 drivers/iommu/iommu.c   | 27 ++---
 include/linux/iommu.h   |  4 +-
 8 files changed, 40 insertions(+), 165 deletions(-)


I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.


The intent here was just to streamline the existing behaviour of stuffing a
global property into a domain attribute then pulling it out again in the
illusion that it was in any way per-domain. We're still checking
dev_is_untrusted() before making an actual decision, and it's not like we
can't add more factors at that point if we want to.


Like I say, the cleanup is great. I'm just wondering whether there's a
better way to express the complicated logic to decide whether or not to use
the flush queue than what we end up with:

if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict())

which is mixing up globals, device properties and domain properties. The
result is that the driver code ends up just using the global to determine
whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
which is a departure from the current way of doing things.


But previously, SMMU only ever saw the global policy piped through the
domain attribute by iommu_group_alloc_default_domain(), so there's no
functional change there.


For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.


Eh? This is only relevant to DMA domains anyway. Flush queues are part of
the IOVA allocator that VFIO doesn't even use. It's always been the case
that unmanaged domains only use strict invalidation.


Maybe I'm going mad. With this patch, the SMMU driver unconditionally sets
IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is
true, no? In which case, that will get set for page-tables corresponding
to unmanaged domains as well as DMA domains when it is enabled. That didn't
happen before because you couldn't set the attribute for unmanaged domains.

What am I missing?


Oh cock... sorry, all this time I've been saying what I *expect* it to 
do, while overlooking the fact that the IO_PGTABLE_QUIRK_NON_STRICT 
hunks were the bits I forgot to write and Christoph had to fix up. 
Indeed, those should be checking the domain type too to preserve the 
existing behaviour. Apologies for the confusion.


Robin.


Obviously some of the above checks could be factored out into some kind of
iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
to keep in sync. Or maybe we just allow iommu-dma to set
IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
treating that as a generic thing now.


I think a helper that takes a domain would be a good starting point.


You mean device, right? The one condition we currently have is at the device
level, and there's really nothing inherent to the domain itself that matters
(since the type is implicitly IOMMU_DOMAIN_DMA to even care about this).


Device would probably work too; you'd pass the first device to attach to the
domain when querying this from the SMMU driver, I suppose.

Will



Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Will Deacon
On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote:
> On 2021-03-31 12:49, Will Deacon wrote:
> > On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:
> > > On 2021-03-30 14:58, Will Deacon wrote:
> > > > On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:
> > > > > On 2021-03-30 14:11, Will Deacon wrote:
> > > > > > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:
> > > > > > > From: Robin Murphy 
> > > > > > > 
> > > > > > > Instead make the global iommu_dma_strict parameter in iommu.c
> > > > > > > canonical by exporting helpers to get and set it and use those
> > > > > > > directly in the drivers.
> > > > > > > 
> > > > > > > This makes sure that the iommu.strict parameter also works for
> > > > > > > the AMD and Intel IOMMU drivers on x86.  As those default to lazy
> > > > > > > flushing, a new IOMMU_CMD_LINE_STRICT is used to turn the value
> > > > > > > into a tristate to represent the default if not overridden by an
> > > > > > > explicit parameter.
> > > > > > > 
> > > > > > > Signed-off-by: Robin Murphy .
> > > > > > > [ported on top of the other iommu_attr changes and added a few 
> > > > > > > small
> > > > > > > missing bits]
> > > > > > > Signed-off-by: Christoph Hellwig 
> > > > > > > ---
> > > > > > > drivers/iommu/amd/iommu.c   | 23 +---
> > > > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 
> > > > > > > +---
> > > > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
> > > > > > > drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
> > > > > > > drivers/iommu/dma-iommu.c   |  9 +--
> > > > > > > drivers/iommu/intel/iommu.c | 64 
> > > > > > > -
> > > > > > > drivers/iommu/iommu.c   | 27 ++---
> > > > > > > include/linux/iommu.h   |  4 +-
> > > > > > > 8 files changed, 40 insertions(+), 165 deletions(-)
> > > > > > 
> > > > > > I really like this cleanup, but I can't help wonder if it's going 
> > > > > > in the
> > > > > > wrong direction. With SoCs often having multiple IOMMU instances 
> > > > > > and a
> > > > > > distinction between "trusted" and "untrusted" devices, then having 
> > > > > > the
> > > > > > flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
> > > > > > unreasonable to me, but this change makes it a global property.
> > > > > 
> > > > > The intent here was just to streamline the existing behaviour of 
> > > > > stuffing a
> > > > > global property into a domain attribute then pulling it out again in 
> > > > > the
> > > > > illusion that it was in any way per-domain. We're still checking
> > > > > dev_is_untrusted() before making an actual decision, and it's not 
> > > > > like we
> > > > > can't add more factors at that point if we want to.
> > > > 
> > > > Like I say, the cleanup is great. I'm just wondering whether there's a
> > > > better way to express the complicated logic to decide whether or not to 
> > > > use
> > > > the flush queue than what we end up with:
> > > > 
> > > > if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
> > > > domain->ops->flush_iotlb_all && !iommu_get_dma_strict())
> > > > 
> > > > which is mixing up globals, device properties and domain properties. The
> > > > result is that the driver code ends up just using the global to 
> > > > determine
> > > > whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table 
> > > > code,
> > > > which is a departure from the current way of doing things.
> > > 
> > > But previously, SMMU only ever saw the global policy piped through the
> > > domain attribute by iommu_group_alloc_default_domain(), so there's no
> > > functional change there.
> > 
> > For DMA domains sure, but I don't think that's the case for unmanaged
> > domains such as those used by VFIO.
> 
> Eh? This is only relevant to DMA domains anyway. Flush queues are part of
> the IOVA allocator that VFIO doesn't even use. It's always been the case
> that unmanaged domains only use strict invalidation.

Maybe I'm going mad. With this patch, the SMMU driver unconditionally sets
IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is
true, no? In which case, that will get set for page-tables corresponding
to unmanaged domains as well as DMA domains when it is enabled. That didn't
happen before because you couldn't set the attribute for unmanaged domains.

What am I missing?

> > > Obviously some of the above checks could be factored out into some kind of
> > > iommu_use_flush_queue() helper that IOMMU drivers can also call if they 
> > > need
> > > to keep in sync. Or maybe we just allow iommu-dma to set
> > > IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if 
> > > we're
> > > treating that as a generic thing now.
> > 
> > I think a helper that takes a domain would be a good starting point.
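
For illustration, the kind of helper being discussed might look roughly
like this (a sketch only: the name iommu_use_flush_queue() comes from the
thread above, but the exact signature and factoring are hypothetical):

/* Fold the global setting, the device property and the domain type
 * into one predicate that iommu-dma and the drivers could share. */
static bool iommu_use_flush_queue(struct iommu_domain *domain,
				  struct device *dev)
{
	/* Unmanaged domains (e.g. VFIO) always use strict invalidation */
	if (domain->type != IOMMU_DOMAIN_DMA)
		return false;

	/* Untrusted devices stay strict too */
	if (dev && dev_is_untrusted(dev))
		return false;

	/* Lazy mode needs a way to flush the whole TLB on queue rollover */
	if (!domain->ops->flush_iotlb_all)
		return false;

	return !iommu_get_dma_strict();
}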

[PATCH v1 4/4] powerpc/vdso: Add support for time namespaces

2021-03-31 Thread Christophe Leroy
This patch adds the necessary glue to provide time namespaces.

Things are mainly copied from ARM64.

__arch_get_timens_vdso_data() calculates the timens vdso data position
based on the vdso data position, knowing it is the next page in the vvar
area. This avoids having to redo the mflr/bcl/mflr/mtlr dance to locate
the page relative to the running code's position.
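
Concretely, with the vvar layout set up in this patch (the vdso data page
first, the timens page immediately after it), the helper reduces to plain
pointer arithmetic, as in the gettimeofday.h hunk below:

/* vvar layout assumed here:
 *   page 0: vdso_arch_data (the vdso_data pointer lands in this page)
 *   page 1: vdso_data adjusted for the task's time namespace
 */
static __always_inline
const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
{
	return (void *)vd + PAGE_SIZE;
}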

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig |   3 +-
 arch/powerpc/include/asm/vdso/gettimeofday.h |  10 ++
 arch/powerpc/include/asm/vdso_datapage.h |   2 -
 arch/powerpc/kernel/vdso.c   | 116 ---
 arch/powerpc/kernel/vdso32/vdso32.lds.S  |   2 +-
 arch/powerpc/kernel/vdso64/vdso64.lds.S  |   2 +-
 6 files changed, 114 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c1344c05226c..71daff5f15d5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -172,6 +172,7 @@ config PPC
select GENERIC_CPU_AUTOPROBE
select GENERIC_CPU_VULNERABILITIES  if PPC_BARRIER_NOSPEC
select GENERIC_EARLY_IOREMAP
+   select GENERIC_GETTIMEOFDAY
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAPif PCI
@@ -179,7 +180,7 @@ config PPC
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
select GENERIC_TIME_VSYSCALL
-   select GENERIC_GETTIMEOFDAY
+   select GENERIC_VDSO_TIME_NS
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP  if PPC_BOOK3S_64 && 
PPC_RADIX_MMU
select HAVE_ARCH_JUMP_LABEL
diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h 
b/arch/powerpc/include/asm/vdso/gettimeofday.h
index d453e725c79f..e448df1dd071 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_POWERPC_VDSO_GETTIMEOFDAY_H
 #define _ASM_POWERPC_VDSO_GETTIMEOFDAY_H
 
+#include 
+
 #ifdef __ASSEMBLY__
 
 #include 
@@ -153,6 +155,14 @@ static __always_inline u64 __arch_get_hw_counter(s32 
clock_mode,
 
 const struct vdso_data *__arch_get_vdso_data(void);
 
+#ifdef CONFIG_TIME_NS
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
+{
+   return (void *)vd + PAGE_SIZE;
+}
+#endif
+
 static inline bool vdso_clocksource_ok(const struct vdso_data *vd)
 {
return true;
diff --git a/arch/powerpc/include/asm/vdso_datapage.h 
b/arch/powerpc/include/asm/vdso_datapage.h
index 3f958ecf2beb..a585c8e538ff 100644
--- a/arch/powerpc/include/asm/vdso_datapage.h
+++ b/arch/powerpc/include/asm/vdso_datapage.h
@@ -107,9 +107,7 @@ extern struct vdso_arch_data *vdso_data;
bcl 20, 31, .+4
 999:
mflr\ptr
-#if CONFIG_PPC_PAGE_SHIFT > 14
addis   \ptr, \ptr, (_vdso_datapage - 999b)@ha
-#endif
addi\ptr, \ptr, (_vdso_datapage - 999b)@l
 .endm
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index b14907209822..717f2c9a7573 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -50,6 +51,12 @@ static union {
 } vdso_data_store __page_aligned_data;
struct vdso_arch_data *vdso_data = &vdso_data_store.data;
 
+enum vvar_pages {
+   VVAR_DATA_PAGE_OFFSET,
+   VVAR_TIMENS_PAGE_OFFSET,
+   VVAR_NR_PAGES,
+};
+
 static int vdso_mremap(const struct vm_special_mapping *sm, struct 
vm_area_struct *new_vma,
   unsigned long text_size)
 {
@@ -73,8 +80,12 @@ static int vdso64_mremap(const struct vm_special_mapping 
*sm, struct vm_area_str
	return vdso_mremap(sm, new_vma, &vdso64_end - &vdso64_start);
 }
 
+static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
+struct vm_area_struct *vma, struct vm_fault *vmf);
+
 static struct vm_special_mapping vvar_spec __ro_after_init = {
.name = "[vvar]",
+   .fault = vvar_fault,
 };
 
 static struct vm_special_mapping vdso32_spec __ro_after_init = {
@@ -87,6 +98,94 @@ static struct vm_special_mapping vdso64_spec __ro_after_init 
= {
.mremap = vdso64_mremap,
 };
 
+#ifdef CONFIG_TIME_NS
+struct vdso_data *arch_get_vdso_data(void *vvar_page)
+{
+   return ((struct vdso_arch_data *)vvar_page)->data;
+}
+
+/*
+ * The vvar mapping contains data for a specific time namespace, so when a task
+ * changes namespace we must unmap its vvar data for the old namespace.
+ * Subsequent faults will map in data for the new namespace.
+ *
+ * For more details see timens_setup_vdso_data().
+ */
+int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
+{
+   struct mm_struct *mm = task->mm;
+   struct vm_area_struct *vma;
+
+   mmap_read_lock(mm);
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   unsigned long size = vma->vm_end - vma->vm_start;
+
+   if 

[PATCH v1 3/4] powerpc/vdso: Separate vvar vma from vdso

2021-03-31 Thread Christophe Leroy
From: Dmitry Safonov 

Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front")
the VVAR page is in front of the VDSO area. As a result it breaks CRIU
(Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]"
from /proc/../maps points at the ELF/vdso image, rather than at the VVAR
data page. Laurent made a patch to keep CRIU working (by reading the aux
vector). But I think it still makes sense to separate the two mappings
into different VMAs. It will also make ppc64 less "special" for userspace
and, as a side-bonus, will make the VVAR page un-writable by debuggers
(which previously would COW the page, which can be unexpected).

I opportunistically Cc stable on it: I understand that usually such
stuff isn't stable material, but it will allow us in CRIU to have one
less workaround, one that is needed just for one release (v5.11) on
one platform (ppc64) and that we otherwise have to maintain.
I wouldn't go as far as to say that commit 511157ab641e is an ABI
regression, as no other userspace got broken, but I'd really appreciate
it if it gets backported to v5.11 after v5.12 is released, so as not
to complicate the already non-simple CRIU-vdso code. Thanks!

Cc: Andrei Vagin 
Cc: Andy Lutomirski 
Cc: Benjamin Herrenschmidt 
Cc: Christophe Leroy 
Cc: Laurent Dufour 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sta...@vger.kernel.org # v5.11
[1]: https://github.com/checkpoint-restore/criu/issues/1417
Signed-off-by: Dmitry Safonov 
Tested-by: Christophe Leroy 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/kernel/vdso.c | 54 +++---
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 652ce85f9410..4bc45d3ed8b0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -263,7 +263,7 @@ extern void arch_exit_mmap(struct mm_struct *mm);
 static inline void arch_unmap(struct mm_struct *mm,
  unsigned long start, unsigned long end)
 {
-   unsigned long vdso_base = (unsigned long)mm->context.vdso - PAGE_SIZE;
+   unsigned long vdso_base = (unsigned long)mm->context.vdso;
 
if (start <= vdso_base && vdso_base < end)
mm->context.vdso = NULL;
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index e839a906fdf2..b14907209822 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -55,10 +55,10 @@ static int vdso_mremap(const struct vm_special_mapping *sm, 
struct vm_area_struc
 {
unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
 
-   if (new_size != text_size + PAGE_SIZE)
+   if (new_size != text_size)
return -EINVAL;
 
-   current->mm->context.vdso = (void __user *)new_vma->vm_start + 
PAGE_SIZE;
+   current->mm->context.vdso = (void __user *)new_vma->vm_start;
 
return 0;
 }
@@ -73,6 +73,10 @@ static int vdso64_mremap(const struct vm_special_mapping 
*sm, struct vm_area_str
	return vdso_mremap(sm, new_vma, &vdso64_end - &vdso64_start);
 }
 
+static struct vm_special_mapping vvar_spec __ro_after_init = {
+   .name = "[vvar]",
+};
+
 static struct vm_special_mapping vdso32_spec __ro_after_init = {
.name = "[vdso]",
.mremap = vdso32_mremap,
@@ -89,11 +93,11 @@ static struct vm_special_mapping vdso64_spec 
__ro_after_init = {
  */
 static int __arch_setup_additional_pages(struct linux_binprm *bprm, int 
uses_interp)
 {
-   struct mm_struct *mm = current->mm;
+   unsigned long vdso_size, vdso_base, mappings_size;
struct vm_special_mapping *vdso_spec;
+   unsigned long vvar_size = PAGE_SIZE;
+   struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   unsigned long vdso_size;
-   unsigned long vdso_base;
 
if (is_32bit_task()) {
		vdso_spec = &vdso32_spec;
@@ -110,8 +114,8 @@ static int __arch_setup_additional_pages(struct 
linux_binprm *bprm, int uses_int
vdso_base = 0;
}
 
-   /* Add a page to the vdso size for the data page */
-   vdso_size += PAGE_SIZE;
+   mappings_size = vdso_size + vvar_size;
+   mappings_size += (VDSO_ALIGNMENT - 1) & PAGE_MASK;
 
/*
 * pick a base address for the vDSO in process space. We try to put it
@@ -119,9 +123,7 @@ static int __arch_setup_additional_pages(struct 
linux_binprm *bprm, int uses_int
 * and end up putting it elsewhere.
 * Add enough to the size so that the result can be aligned.
 */
-   vdso_base = get_unmapped_area(NULL, vdso_base,
- vdso_size + ((VDSO_ALIGNMENT - 1) & 
PAGE_MASK),
- 0, 0);
+   vdso_base = get_unmapped_area(NULL, vdso_base, mappings_size, 0, 0);
if (IS_ERR_VALUE(vdso_base))
return vdso_base;
 
@@ -133,7 +135,13 @@ 

[PATCH v1 1/4] lib/vdso: Mark do_hres_timens() and do_coarse_timens() __always_inline()

2021-03-31 Thread Christophe Leroy
In the same spirit as commit c966533f8c6c ("lib/vdso: Mark do_hres()
and do_coarse() as __always_inline"), mark do_hres_timens() and
do_coarse_timens() __always_inline.

The measurement below is on a non-timens process, i.e. on the fastest path.

On powerpc32, without the patch:

clock-gettime-monotonic-raw:vdso: 1155 nsec/call
clock-gettime-monotonic-coarse:vdso: 813 nsec/call
clock-gettime-monotonic:vdso: 1076 nsec/call

With the patch:

clock-gettime-monotonic-raw:vdso: 1100 nsec/call
clock-gettime-monotonic-coarse:vdso: 667 nsec/call
clock-gettime-monotonic:vdso: 1025 nsec/call

Signed-off-by: Christophe Leroy 
---
 lib/vdso/gettimeofday.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 2919f1698140..c6f6dee08746 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -46,8 +46,8 @@ static inline bool vdso_cycles_ok(u64 cycles)
 #endif
 
 #ifdef CONFIG_TIME_NS
-static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
- struct __kernel_timespec *ts)
+static __always_inline int do_hres_timens(const struct vdso_data *vdns, 
clockid_t clk,
+ struct __kernel_timespec *ts)
 {
const struct vdso_data *vd = __arch_get_timens_vdso_data();
	const struct timens_offset *offs = &vdns->offset[clk];
@@ -97,8 +97,8 @@ static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
return NULL;
 }
 
-static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
- struct __kernel_timespec *ts)
+static __always_inline int do_hres_timens(const struct vdso_data *vdns, 
clockid_t clk,
+ struct __kernel_timespec *ts)
 {
return -EINVAL;
 }
@@ -159,8 +159,8 @@ static __always_inline int do_hres(const struct vdso_data 
*vd, clockid_t clk,
 }
 
 #ifdef CONFIG_TIME_NS
-static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
-   struct __kernel_timespec *ts)
+static __always_inline int do_coarse_timens(const struct vdso_data *vdns, 
clockid_t clk,
+   struct __kernel_timespec *ts)
 {
const struct vdso_data *vd = __arch_get_timens_vdso_data();
	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
@@ -188,8 +188,8 @@ static int do_coarse_timens(const struct vdso_data *vdns, 
clockid_t clk,
return 0;
 }
 #else
-static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
-   struct __kernel_timespec *ts)
+static __always_inline int do_coarse_timens(const struct vdso_data *vdns, 
clockid_t clk,
+   struct __kernel_timespec *ts)
 {
return -1;
 }
-- 
2.25.0



[PATCH v1 2/4] lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()

2021-03-31 Thread Christophe Leroy
For the same reason as commit e876f0b69dc9 ("lib/vdso: Allow
architectures to provide the vdso data pointer"), powerpc wants to
avoid calculating positions relative to code.

As the timens vdso_data is the page next to the vdso_data, provide
the vdso_data pointer to __arch_get_timens_vdso_data() in order
to ease the calculation on powerpc in the following patches.
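
The subtle part is that the generic code only holds a pointer to one
element of the vdso_data array, so it first steps back to the array base
before calling the arch hook, as in the do_hres_timens() hunk below:

	/* vdns points at vd[CS_RAW] or vd[CS_HRES_COARSE]; recover the
	 * array base, which the arch hook translates to the timens page */
	vd = vdns - (clk == CLOCK_MONOTONIC_RAW ? CS_RAW : CS_HRES_COARSE);
	vd = __arch_get_timens_vdso_data(vd);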

Signed-off-by: Christophe Leroy 
---
 arch/arm64/include/asm/vdso/compat_gettimeofday.h |  3 ++-
 arch/arm64/include/asm/vdso/gettimeofday.h|  2 +-
 arch/s390/include/asm/vdso/gettimeofday.h |  3 ++-
 arch/x86/include/asm/vdso/gettimeofday.h  |  3 ++-
 lib/vdso/gettimeofday.c   | 15 +--
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h 
b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index 7508b0ac1d21..ecb6fd4c3c64 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -155,7 +155,8 @@ static __always_inline const struct vdso_data 
*__arch_get_vdso_data(void)
 }
 
 #ifdef CONFIG_TIME_NS
-static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
const struct vdso_data *ret;
 
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h 
b/arch/arm64/include/asm/vdso/gettimeofday.h
index 631ab1281633..de86230a9436 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -100,7 +100,7 @@ const struct vdso_data *__arch_get_vdso_data(void)
 
 #ifdef CONFIG_TIME_NS
 static __always_inline
-const struct vdso_data *__arch_get_timens_vdso_data(void)
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
return _timens_data;
 }
diff --git a/arch/s390/include/asm/vdso/gettimeofday.h 
b/arch/s390/include/asm/vdso/gettimeofday.h
index ed89ef742530..383c53c3 100644
--- a/arch/s390/include/asm/vdso/gettimeofday.h
+++ b/arch/s390/include/asm/vdso/gettimeofday.h
@@ -68,7 +68,8 @@ long clock_getres_fallback(clockid_t clkid, struct 
__kernel_timespec *ts)
 }
 
 #ifdef CONFIG_TIME_NS
-static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
return _timens_data;
 }
diff --git a/arch/x86/include/asm/vdso/gettimeofday.h 
b/arch/x86/include/asm/vdso/gettimeofday.h
index df01d7349d79..1936f21ed8cd 100644
--- a/arch/x86/include/asm/vdso/gettimeofday.h
+++ b/arch/x86/include/asm/vdso/gettimeofday.h
@@ -58,7 +58,8 @@ extern struct ms_hyperv_tsc_page hvclock_page
 #endif
 
 #ifdef CONFIG_TIME_NS
-static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
return __timens_vdso_data;
 }
diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index c6f6dee08746..ce2f69552003 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -49,13 +49,15 @@ static inline bool vdso_cycles_ok(u64 cycles)
 static __always_inline int do_hres_timens(const struct vdso_data *vdns, 
clockid_t clk,
  struct __kernel_timespec *ts)
 {
-   const struct vdso_data *vd = __arch_get_timens_vdso_data();
+   const struct vdso_data *vd;
	const struct timens_offset *offs = &vdns->offset[clk];
const struct vdso_timestamp *vdso_ts;
u64 cycles, last, ns;
u32 seq;
s64 sec;
 
+   vd = vdns - (clk == CLOCK_MONOTONIC_RAW ? CS_RAW : CS_HRES_COARSE);
+   vd = __arch_get_timens_vdso_data(vd);
if (clk != CLOCK_MONOTONIC_RAW)
		vd = &vd[CS_HRES_COARSE];
else
@@ -92,7 +94,8 @@ static __always_inline int do_hres_timens(const struct 
vdso_data *vdns, clockid_
return 0;
 }
 #else
-static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
return NULL;
 }
@@ -162,7 +165,7 @@ static __always_inline int do_hres(const struct vdso_data 
*vd, clockid_t clk,
 static __always_inline int do_coarse_timens(const struct vdso_data *vdns, 
clockid_t clk,
struct __kernel_timespec *ts)
 {
-   const struct vdso_data *vd = __arch_get_timens_vdso_data();
+   const struct vdso_data *vd = __arch_get_timens_vdso_data(vdns);
	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
	const struct timens_offset *offs = &vdns->offset[clk];
u64 nsec;
@@ -310,7 +313,7 @@ __cvdso_gettimeofday_data(const struct vdso_data *vd,
if (unlikely(tz != NULL)) {
if (IS_ENABLED(CONFIG_TIME_NS) &&

[PATCH v1 0/4] powerpc/vdso: Add support for time namespaces

2021-03-31 Thread Christophe Leroy
This series adds support for time namespaces on powerpc.

All timens selftests are successfull.

Christophe Leroy (3):
  lib/vdso: Mark do_hres_timens() and do_coarse_timens()
__always_inline()
  lib/vdso: Add vdso_data pointer as input to
__arch_get_timens_vdso_data()
  powerpc/vdso: Add support for time namespaces

Dmitry Safonov (1):
  powerpc/vdso: Separate vvar vma from vdso

 .../include/asm/vdso/compat_gettimeofday.h|   3 +-
 arch/arm64/include/asm/vdso/gettimeofday.h|   2 +-
 arch/powerpc/Kconfig  |   3 +-
 arch/powerpc/include/asm/mmu_context.h|   2 +-
 arch/powerpc/include/asm/vdso/gettimeofday.h  |  10 ++
 arch/powerpc/include/asm/vdso_datapage.h  |   2 -
 arch/powerpc/kernel/vdso.c| 138 --
 arch/powerpc/kernel/vdso32/vdso32.lds.S   |   2 +-
 arch/powerpc/kernel/vdso64/vdso64.lds.S   |   2 +-
 arch/s390/include/asm/vdso/gettimeofday.h |   3 +-
 arch/x86/include/asm/vdso/gettimeofday.h  |   3 +-
 lib/vdso/gettimeofday.c   |  31 ++--
 12 files changed, 162 insertions(+), 39 deletions(-)

-- 
2.25.0



[PATCH v3 9/9] powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler

2021-03-31 Thread Cédric Le Goater
Instead of calling irq_create_mapping() to map the IPI for a node,
introduce an 'alloc' handler. This is usually an extension to support
hierarchical irq_domains, which is not exactly the case for the XIVE-IPI
domain. However, we can now use the irq_domain_alloc_irqs() routine,
which allocates the IRQ descriptor on the specified node; this is even
better for cache performance on multi-node machines.
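
The change boils down to the following (cf. the hunk below;
irq_domain_alloc_irqs() takes an explicit NUMA node for the descriptor
allocation, irq_create_mapping() does not):

	/* before: descriptor allocated with no node preference */
	xid->irq = irq_create_mapping(ipi_domain, node /* hwirq */);

	/* after: descriptor allocated on 'node', hwirq carried in arg */
	struct xive_ipi_alloc_info info = { .hwirq = node };

	xid->irq = irq_domain_alloc_irqs(ipi_domain, 1, node, &info);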

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---

 I didn't rerun the benchmark to check for a difference.
 
 arch/powerpc/sysdev/xive/common.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 06f29cd07448..bb7435ffe99c 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1103,15 +1103,26 @@ static struct irq_chip xive_ipi_chip = {
  * IPIs are marked per-cpu. We use separate HW interrupts under the
  * hood but associated with the same "linux" interrupt
  */
-static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq,
-  irq_hw_number_t hw)
+struct xive_ipi_alloc_info {
+   irq_hw_number_t hwirq;
+};
+
+static int xive_ipi_irq_domain_alloc(struct irq_domain *domain, unsigned int 
virq,
+unsigned int nr_irqs, void *arg)
 {
-   irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq);
+   struct xive_ipi_alloc_info *info = arg;
+   int i;
+
+   for (i = 0; i < nr_irqs; i++) {
+   irq_domain_set_info(domain, virq + i, info->hwirq + i, &xive_ipi_chip,
+   domain->host_data, handle_percpu_irq,
+   NULL, NULL);
+   }
return 0;
 }
 
 static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
-   .map = xive_ipi_irq_domain_map,
+   .alloc  = xive_ipi_irq_domain_alloc,
 };
 
 static int __init xive_request_ipi(void)
@@ -1136,7 +1147,7 @@ static int __init xive_request_ipi(void)
 
for_each_node(node) {
		struct xive_ipi_desc *xid = &xive_ipis[node];
-   irq_hw_number_t ipi_hwirq = node;
+   struct xive_ipi_alloc_info info = { node };
 
/* Skip nodes without CPUs */
if (cpumask_empty(cpumask_of_node(node)))
@@ -1147,9 +1158,9 @@ static int __init xive_request_ipi(void)
 * Since the HW interrupt number doesn't have any meaning,
 * simply use the node number.
 */
-   xid->irq = irq_create_mapping(ipi_domain, ipi_hwirq);
-   if (!xid->irq) {
-   ret = -EINVAL;
+   xid->irq = irq_domain_alloc_irqs(ipi_domain, 1, node, &info);
+   if (xid->irq < 0) {
+   ret = xid->irq;
goto out_free_xive_ipis;
}
 
-- 
2.26.3



[PATCH v3 8/9] powerpc/xive: Map one IPI interrupt per node

2021-03-31 Thread Cédric Le Goater
ipistorm [*] can be used to benchmark the raw interrupt rate of an
interrupt controller by measuring the number of IPIs a system can
sustain. When applied to the XIVE interrupt controller of POWER9 and
POWER10 systems, a significant drop of the interrupt rate can be
observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all
CPUs of the system. The structure is shared, and the cache line updates
greatly impact the traffic between nodes and the overall IPI
performance.

As a workaround, the impact can be reduced by deactivating the IRQ
lockup detector ("noirqdebug") which does a lot of accounting in the
Linux IRQ descriptor structure and is responsible for most of the
performance penalty.

As a fix, this proposal allocates an IPI interrupt per node, to be
shared by all CPUs of that node. It solves the scaling issue: the IRQ
lockup detector still has an impact, but the XIVE interrupt rate scales
linearly. It also improves the "noirqdebug" case, as shown in the
tables below.

 * P9 DD2.2 - 2s * 64 threads

   "noirqdebug"
Mint/sMint/s
 chips  cpus  IPI/sys   IPI/chip   IPI/chipIPI/sys
 --
 1  0-15 4.984023   4.875405   4.996536   5.048892
0-3110.879164  10.544040  10.757632  11.037859
0-4715.345301  14.688764  14.926520  15.310053
0-6317.064907  17.066812  17.613416  17.874511
 2  0-7911.768764  21.650749  22.689120  22.566508
0-9510.616812  26.878789  28.434703  28.320324
0-111   10.151693  31.397803  31.771773  32.388122
0-1279.948502  33.139336  34.875716  35.224548

 * P10 DD1 - 4s (not homogeneous) 352 threads

   "noirqdebug"
Mint/sMint/s
 chips  cpus  IPI/sys   IPI/chip   IPI/chipIPI/sys
 --
 1  0-15 2.409402   2.364108   2.383303   2.395091
0-31 6.028325   6.046075   6.08   6.073750
0-47 8.655178   8.644531   8.712830   8.724702
0-6311.629652  11.735953  12.088203  12.055979
0-7914.392321  14.729959  14.986701  14.973073
0-9512.604158  13.004034  17.528748  17.568095
 2  0-1119.767753  13.719831  19.968606  20.024218
0-1276.744566  16.418854  22.898066  22.995110
0-1436.005699  19.174421  25.425622  25.417541
0-1595.649719  21.938836  27.952662  28.059603
0-1755.441410  24.109484  31.133915  31.127996
 3  0-1915.318341  24.405322  33.999221  33.775354
0-2075.191382  26.449769  36.050161  35.867307
0-2235.102790  29.356943  39.544135  39.508169
0-2395.035295  31.933051  42.135075  42.071975
0-2554.969209  34.477367  44.655395  44.757074
 4  0-2714.907652  35.887016  47.080545  47.318537
0-2874.839581  38.076137  50.464307  50.636219
0-3034.786031  40.881319  53.478684  53.310759
0-3194.743750  43.448424  56.388102  55.973969
0-3354.709936  45.623532  59.400930  58.926857
0-3514.681413  45.646151  62.035804  61.830057

[*] https://github.com/antonblanchard/ipistorm
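
The gist of the fix is one IPI descriptor per node, so that each node's
IRQ bookkeeping lives on its own cache lines. Roughly (a sketch; the
array size and name length are illustrative):

static struct xive_ipi_desc {
	unsigned int irq;	/* Linux IRQ shared by all CPUs of the node */
	char name[16];		/* per-node name shown in /proc/interrupts */
} xive_ipis[MAX_NUMNODES];	/* indexed by node id */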

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---

 Changes in v3:

  - increased IPI name length
  - use of early_cpu_to_node() for hotplugged CPUs
  - filter CPU-less nodes
  - dropped Greg's Reviewed-by because of the changes
  
 arch/powerpc/sysdev/xive/xive-internal.h |  2 -
 arch/powerpc/sysdev/xive/common.c| 60 +++-
 2 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
b/arch/powerpc/sysdev/xive/xive-internal.h
index 9cf57c722faa..b3a456fdd3a5 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -5,8 +5,6 @@
 #ifndef __XIVE_INTERNAL_H
 #define __XIVE_INTERNAL_H
 
-#define XIVE_IPI_HW_IRQ0 /* interrupt source # for IPIs */
-
 /*
  * A "disabled" interrupt should never fire, to catch problems
  * we set its logical number to this
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 69980b49afb7..06f29cd07448 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -63,8 +63,19 @@ static const struct xive_ops *xive_ops;
 static struct irq_domain *xive_irq_domain;
 
 #ifdef CONFIG_SMP
-/* The IPIs all use the same logical irq number */
-static u32 xive_ipi_irq;
+/* The IPIs use the same logical irq number when on the same chip */
+static struct xive_ipi_desc {
+   unsigned int irq;
+   char 

[PATCH v3 2/9] powerpc/xive: Introduce an IPI interrupt domain

2021-03-31 Thread Cédric Le Goater
The IPI interrupt is a special case of the XIVE IRQ domain. When
mapping and unmapping the interrupts in the Linux interrupt number
space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to
distinguish the IPI interrupt from other interrupts of the system.

Simplify the XIVE interrupt domain by introducing a specific domain
for the IPI.

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---

 Changes in v3:

  - better error handling of xive_request_ipi()
  - use of a fwnode_handle to name the new domain
  - dropped Greg's Reviewed-by because of the changes

 arch/powerpc/sysdev/xive/common.c | 79 ++-
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 776871274b69..98f4dc916fa1 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1068,24 +1068,58 @@ static struct irq_chip xive_ipi_chip = {
.irq_unmask = xive_ipi_do_nothing,
 };
 
-static void __init xive_request_ipi(void)
+/*
+ * IPIs are marked per-cpu. We use separate HW interrupts under the
+ * hood but associated with the same "linux" interrupt
+ */
+static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq,
+  irq_hw_number_t hw)
 {
+   irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq);
+   return 0;
+}
+
+static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
+   .map = xive_ipi_irq_domain_map,
+};
+
+static int __init xive_request_ipi(void)
+{
+   struct fwnode_handle *fwnode;
+   struct irq_domain *ipi_domain;
unsigned int virq;
+   int ret = -ENOMEM;
 
-   /*
-* Initialization failed, move on, we might manage to
-* reach the point where we display our errors before
-* the system falls appart
-*/
-   if (!xive_irq_domain)
-   return;
+   fwnode = irq_domain_alloc_named_fwnode("XIVE-IPI");
+   if (!fwnode)
+   goto out;
+
+   ipi_domain = irq_domain_create_linear(fwnode, 1,
+ &xive_ipi_irq_domain_ops, NULL);
+   if (!ipi_domain)
+   goto out_free_fwnode;
 
/* Initialize it */
-   virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ);
+   virq = irq_create_mapping(ipi_domain, XIVE_IPI_HW_IRQ);
+   if (!virq) {
+   ret = -EINVAL;
+   goto out_free_domain;
+   }
+
xive_ipi_irq = virq;
 
-   WARN_ON(request_irq(virq, xive_muxed_ipi_action,
-   IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL));
+   ret = request_irq(virq, xive_muxed_ipi_action,
+ IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL);
+
+   WARN(ret < 0, "Failed to request IPI %d: %d\n", virq, ret);
+   return ret;
+
+out_free_domain:
+   irq_domain_remove(ipi_domain);
+out_free_fwnode:
+   irq_domain_free_fwnode(fwnode);
+out:
+   return ret;
 }
 
 static int xive_setup_cpu_ipi(unsigned int cpu)
@@ -1179,19 +1213,6 @@ static int xive_irq_domain_map(struct irq_domain *h, 
unsigned int virq,
 */
irq_clear_status_flags(virq, IRQ_LEVEL);
 
-#ifdef CONFIG_SMP
-   /* IPIs are special and come up with HW number 0 */
-   if (hw == XIVE_IPI_HW_IRQ) {
-   /*
-* IPIs are marked per-cpu. We use separate HW interrupts under
-* the hood but associated with the same "linux" interrupt
-*/
-   irq_set_chip_and_handler(virq, &xive_ipi_chip,
-handle_percpu_irq);
-   return 0;
-   }
-#endif
-
rc = xive_irq_alloc_data(virq, hw);
if (rc)
return rc;
@@ -1203,15 +1224,7 @@ static int xive_irq_domain_map(struct irq_domain *h, 
unsigned int virq,
 
 static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq)
 {
-   struct irq_data *data = irq_get_irq_data(virq);
-   unsigned int hw_irq;
-
-   /* XXX Assign BAD number */
-   if (!data)
-   return;
-   hw_irq = (unsigned int)irqd_to_hwirq(data);
-   if (hw_irq != XIVE_IPI_HW_IRQ)
-   xive_irq_free_data(virq);
+   xive_irq_free_data(virq);
 }
 
 static int xive_irq_domain_xlate(struct irq_domain *h, struct device_node *ct,
-- 
2.26.3



[PATCH v3 6/9] powerpc/xive: Simplify the dump of XIVE interrupts under xmon

2021-03-31 Thread Cédric Le Goater
Move the xmon routine under the XIVE subsystem and rework the loop over
the interrupts, taking the xive_irq_domain into account to filter out IPIs.

Reviewed-by: Greg Kurz 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h   |  1 +
 arch/powerpc/sysdev/xive/common.c | 14 ++
 arch/powerpc/xmon/xmon.c  | 28 ++--
 3 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 9a312b975ca8..aa094a8655b0 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -102,6 +102,7 @@ void xive_flush_interrupt(void);
 /* xmon hook */
 void xmon_xive_do_dump(int cpu);
 int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d);
+void xmon_xive_get_irq_all(void);
 
 /* APIs used by KVM */
 u32 xive_native_default_eq_shift(void);
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index f5fe60c194bc..4c6e2e1289f5 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -289,6 +289,20 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data 
*d)
return 0;
 }
 
+void xmon_xive_get_irq_all(void)
+{
+   unsigned int i;
+   struct irq_desc *desc;
+
+   for_each_irq_desc(i, desc) {
+   struct irq_data *d = irq_desc_get_irq_data(desc);
+   unsigned int hwirq = (unsigned int)irqd_to_hwirq(d);
+
+   if (d->domain == xive_irq_domain)
+   xmon_xive_get_irq_config(hwirq, d);
+   }
+}
+
 #endif /* CONFIG_XMON */
 
 static unsigned int xive_get_irq(void)
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 3fe37495f63d..80fbf8968f77 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2727,30 +2727,6 @@ static void dump_all_xives(void)
dump_one_xive(cpu);
 }
 
-static void dump_one_xive_irq(u32 num, struct irq_data *d)
-{
-   xmon_xive_get_irq_config(num, d);
-}
-
-static void dump_all_xive_irq(void)
-{
-   unsigned int i;
-   struct irq_desc *desc;
-
-   for_each_irq_desc(i, desc) {
-   struct irq_data *d = irq_desc_get_irq_data(desc);
-   unsigned int hwirq;
-
-   if (!d)
-   continue;
-
-   hwirq = (unsigned int)irqd_to_hwirq(d);
-   /* IPIs are special (HW number 0) */
-   if (hwirq)
-   dump_one_xive_irq(hwirq, d);
-   }
-}
-
 static void dump_xives(void)
 {
unsigned long num;
@@ -2767,9 +2743,9 @@ static void dump_xives(void)
return;
} else if (c == 'i') {
		if (scanhex(&num))
-   dump_one_xive_irq(num, NULL);
+   xmon_xive_get_irq_config(num, NULL);
else
-   dump_all_xive_irq();
+   xmon_xive_get_irq_all();
return;
}
 
-- 
2.26.3



[PATCH v3 7/9] powerpc/xive: Fix xmon command "dxi"

2021-03-31 Thread Cédric Le Goater
When under xmon, the "dxi" command dumps the state of the XIVE
interrupts. If an interrupt number is specified, only the state of
the associated XIVE interrupt is dumped. This form of the command
lacks an irq_data parameter which is nevertheless used by
xmon_xive_get_irq_config(), leading to an xmon crash.

Fix that by doing a lookup in the system IRQ mapping to query the IRQ
descriptor data. Invalid interrupt numbers, or numbers not belonging to the
XIVE IRQ domain (the OPAL event interrupt number, for instance), should be
caught by the previous query done at the firmware level.

Reported-by: kernel test robot 
Reported-by: Dan Carpenter 
Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform")
Tested-by: Greg Kurz 
Reviewed-by: Greg Kurz 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 4c6e2e1289f5..69980b49afb7 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -253,17 +253,20 @@ notrace void xmon_xive_do_dump(int cpu)
xmon_printf("\n");
 }
 
+static struct irq_data *xive_get_irq_data(u32 hw_irq)
+{
+   unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq);
+
+   return irq ? irq_get_irq_data(irq) : NULL;
+}
+
 int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 {
-   struct irq_chip *chip = irq_data_get_irq_chip(d);
int rc;
u32 target;
u8 prio;
u32 lirq;
 
-   if (!is_xive_irq(chip))
-   return -EINVAL;
-
	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
if (rc) {
xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
@@ -273,6 +276,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
hw_irq, target, prio, lirq);
 
+   if (!d)
+   d = xive_get_irq_data(hw_irq);
+
if (d) {
struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
u64 val = xive_esb_read(xd, XIVE_ESB_GET);
-- 
2.26.3



[PATCH v3 5/9] powerpc/xive: Drop check on irq_data in xive_core_debug_show()

2021-03-31 Thread Cédric Le Goater
When looping over IRQ descriptors, irq_data is always valid.

Reported-by: kernel test robot 
Reported-by: Dan Carpenter 
Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE 
state")
Reviewed-by: Greg Kurz 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 4149ca846e7c..f5fe60c194bc 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1607,6 +1607,8 @@ static void xive_debug_show_irq(struct seq_file *m, 
struct irq_data *d)
u32 target;
u8 prio;
u32 lirq;
+   struct xive_irq_data *xd;
+   u64 val;
 
	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
if (rc) {
@@ -1617,17 +1619,14 @@ static void xive_debug_show_irq(struct seq_file *m, 
struct irq_data *d)
seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
   hw_irq, target, prio, lirq);
 
-   if (d) {
-   struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
-   u64 val = xive_esb_read(xd, XIVE_ESB_GET);
-
-   seq_printf(m, "flags=%c%c%c PQ=%c%c",
-  xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
-  xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
-  xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
-  val & XIVE_ESB_VAL_P ? 'P' : '-',
-  val & XIVE_ESB_VAL_Q ? 'Q' : '-');
-   }
+   xd = irq_data_get_irq_handler_data(d);
+   val = xive_esb_read(xd, XIVE_ESB_GET);
+   seq_printf(m, "flags=%c%c%c PQ=%c%c",
+  xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
+  xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
+  xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
+  val & XIVE_ESB_VAL_P ? 'P' : '-',
+  val & XIVE_ESB_VAL_Q ? 'Q' : '-');
seq_puts(m, "\n");
 }
 
-- 
2.26.3



[PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-03-31 Thread Cédric Le Goater


Hello,

ipistorm [*] can be used to benchmark the raw interrupt rate of an
interrupt controller by measuring the number of IPIs a system can
sustain. When applied to the XIVE interrupt controller of POWER9 and
POWER10 systems, a significant drop of the interrupt rate can be
observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all
CPUs of the system. The structure is shared, and the cache line updates
greatly impact the traffic between nodes and the overall IPI
performance.

As a workaround, the impact can be reduced by deactivating the IRQ
lockup detector ("noirqdebug") which does a lot of accounting in the
Linux IRQ descriptor structure and is responsible for most of the
performance penalty.

As a fix, this proposal allocates an IPI interrupt per node, to be
shared by all CPUs of that node. It solves the scaling issue: the IRQ
lockup detector still has an impact, but the XIVE interrupt rate scales
linearly. It also improves the "noirqdebug" case, as shown in the
tables below.

 * P9 DD2.2 - 2s * 64 threads

   "noirqdebug"
Mint/sMint/s   
 chips  cpus  IPI/sys   IPI/chip   IPI/chipIPI/sys 
 --
 1  0-15 4.984023   4.875405   4.996536   5.048892
0-3110.879164  10.544040  10.757632  11.037859
0-4715.345301  14.688764  14.926520  15.310053
0-6317.064907  17.066812  17.613416  17.874511
 2  0-7911.768764  21.650749  22.689120  22.566508
0-9510.616812  26.878789  28.434703  28.320324
0-111   10.151693  31.397803  31.771773  32.388122
0-1279.948502  33.139336  34.875716  35.224548


 * P10 DD1 - 4s (not homogeneous) 352 threads

   "noirqdebug"
Mint/sMint/s   
 chips  cpus  IPI/sys   IPI/chip   IPI/chipIPI/sys 
 --
 1  0-15 2.409402   2.364108   2.383303   2.395091
0-31 6.028325   6.046075   6.08   6.073750
0-47 8.655178   8.644531   8.712830   8.724702
0-6311.629652  11.735953  12.088203  12.055979
0-7914.392321  14.729959  14.986701  14.973073
0-9512.604158  13.004034  17.528748  17.568095
 2  0-1119.767753  13.719831  19.968606  20.024218
0-1276.744566  16.418854  22.898066  22.995110
0-1436.005699  19.174421  25.425622  25.417541
0-1595.649719  21.938836  27.952662  28.059603
0-1755.441410  24.109484  31.133915  31.127996
 3  0-1915.318341  24.405322  33.999221  33.775354
0-2075.191382  26.449769  36.050161  35.867307
0-2235.102790  29.356943  39.544135  39.508169
0-2395.035295  31.933051  42.135075  42.071975
0-2554.969209  34.477367  44.655395  44.757074
 4  0-2714.907652  35.887016  47.080545  47.318537
0-2874.839581  38.076137  50.464307  50.636219
0-3034.786031  40.881319  53.478684  53.310759
0-3194.743750  43.448424  56.388102  55.973969
0-3354.709936  45.623532  59.400930  58.926857
0-3514.681413  45.646151  62.035804  61.830057

[*] https://github.com/antonblanchard/ipistorm

Thanks,

C.

Changes in v3:

  - improved commit log for the misuse of "ibm,chip-id"
  - better error handling of xive_request_ipi()
  - use of a fwnode_handle to name the new domain 
  - increased IPI name length
  - use of early_cpu_to_node() for hotplugged CPUs
  - filter CPU-less nodes

Changes in v2:

  - extra simplification on xmon
  - fixes on issues reported by the kernel test robot

Cédric Le Goater (9):
  powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
  powerpc/xive: Introduce an IPI interrupt domain
  powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
  powerpc/xive: Simplify xive_core_debug_show()
  powerpc/xive: Drop check on irq_data in xive_core_debug_show()
  powerpc/xive: Simplify the dump of XIVE interrupts under xmon
  powerpc/xive: Fix xmon command "dxi"
  powerpc/xive: Map one IPI interrupt per node
  powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler

 arch/powerpc/include/asm/xive.h  |   1 +
 arch/powerpc/sysdev/xive/xive-internal.h |   2 -
 arch/powerpc/sysdev/xive/common.c| 211 +++
 arch/powerpc/xmon/xmon.c |  28 +--
 4 files changed, 139 insertions(+), 103 deletions(-)

-- 
2.26.3



[PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm, chip-id" property

2021-03-31 Thread Cédric Le Goater
The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip when possible. The XIVE
driver queries the chip id value from the "ibm,chip-id" DT property
but this property is not available on all platforms. It was first
introduced on the PowerNV platform and later, under QEMU for pseries.
However, the property does not exist under PowerVM since it is not
specified in PAPR.

cpu_to_node() is a better alternative. On the PowerNV platform, the
node id is computed from the "ibm,associativity" property of the CPU.
Its value is built in the OPAL firmware from the physical chip id and
is equivalent to "ibm,chip-id". On pSeries, the hcall
H_HOME_NODE_ASSOCIATIVITY returns the node id.

It should also be noted that under QEMU/KVM "ibm,chip-id" is badly
calculated with unusual SMT configurations. This leads to a bogus chip
id value being returned by of_get_ibm_chip_id().

Cc: David Gibson 
Signed-off-by: Cédric Le Goater 
---

 Changes in v3:

  - improved commit log for the misuse of "ibm,chip-id"

 arch/powerpc/sysdev/xive/common.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 7e08be5e5e4a..776871274b69 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1336,16 +1336,11 @@ static int xive_prepare_cpu(unsigned int cpu)
 
xc = per_cpu(xive_cpu, cpu);
if (!xc) {
-   struct device_node *np;
-
xc = kzalloc_node(sizeof(struct xive_cpu),
  GFP_KERNEL, cpu_to_node(cpu));
if (!xc)
return -ENOMEM;
-   np = of_get_cpu_node(cpu, NULL);
-   if (np)
-   xc->chip_id = of_get_ibm_chip_id(np);
-   of_node_put(np);
+   xc->chip_id = cpu_to_node(cpu);
xc->hw_ipi = XIVE_BAD_IRQ;
 
per_cpu(xive_cpu, cpu) = xc;
-- 
2.26.3



[PATCH v3 4/9] powerpc/xive: Simplify xive_core_debug_show()

2021-03-31 Thread Cédric Le Goater
Now that the IPI interrupt has its own domain, the checks on the HW
interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
check on the domain.

Reviewed-by: Greg Kurz 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 18 --
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 8bca9aca0607..4149ca846e7c 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1600,17 +1600,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int 
cpu)
seq_puts(m, "\n");
 }
 
-static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct 
irq_data *d)
+static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
 {
-   struct irq_chip *chip = irq_data_get_irq_chip(d);
+   unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
int rc;
u32 target;
u8 prio;
u32 lirq;
 
-   if (!is_xive_irq(chip))
-   return;
-
	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
if (rc) {
seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
@@ -1648,16 +1645,9 @@ static int xive_core_debug_show(struct seq_file *m, void 
*private)
 
for_each_irq_desc(i, desc) {
struct irq_data *d = irq_desc_get_irq_data(desc);
-   unsigned int hw_irq;
-
-   if (!d)
-   continue;
-
-   hw_irq = (unsigned int)irqd_to_hwirq(d);
 
-   /* IPIs are special (HW number 0) */
-   if (hw_irq != XIVE_IPI_HW_IRQ)
-   xive_debug_show_irq(m, hw_irq, d);
+   if (d->domain == xive_irq_domain)
+   xive_debug_show_irq(m, d);
}
return 0;
 }
-- 
2.26.3



[PATCH v3 3/9] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ

2021-03-31 Thread Cédric Le Goater
The IPI interrupt has its own domain now. Testing the HW interrupt
number is not needed anymore.

Reviewed-by: Greg Kurz 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 98f4dc916fa1..8bca9aca0607 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1417,13 +1417,12 @@ static void xive_flush_cpu_queue(unsigned int cpu, 
struct xive_cpu *xc)
struct irq_desc *desc = irq_to_desc(irq);
struct irq_data *d = irq_desc_get_irq_data(desc);
struct xive_irq_data *xd;
-   unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
 
/*
 * Ignore anything that isn't a XIVE irq and ignore
 * IPIs, so can just be dropped.
 */
-   if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ)
+   if (d->domain != xive_irq_domain)
continue;
 
/*
-- 
2.26.3



[PATCH v6 9/9] powerpc/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-03-31 Thread guoren
From: Guo Ren 

We don't have a native hw xchg16 instruction, so let the generic
qspinlock code deal with it.

Implementing xchg16 with full-word atomic xchg instructions carries
a semantic risk for atomic operations.

This patch removes the generic qspinlock code's dependency on the
architecture's xchg16.

It is also not needed when PPC_LBARX_LWARX is enabled; see the link below.
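
For reference, without a 16-bit xchg the generic code falls back to a
cmpxchg loop on the whole 32-bit lock word, roughly as in
kernel/locking/qspinlock.c:

/* Exchange the tail bits by retrying a cmpxchg on the full word,
 * so no sub-word atomic operation is ever issued. */
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	u32 old, new, val = atomic_read(&lock->val);

	for (;;) {
		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
	return old;	/* caller extracts the previous tail */
}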

Signed-off-by: Guo Ren 
Link: 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20201107032328.2454582-1-npig...@gmail.com/
Cc: Christophe Leroy 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 386ae12d8523..6133ad51690e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -151,6 +151,7 @@ config PPC
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKS_XCHG32 if PPC_QUEUED_SPINLOCKS && 
!PPC_LBARX_LWARX
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WANT_LD_ORPHAN_WARN
-- 
2.17.1



[PATCH v6 8/9] xtensa: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-03-31 Thread guoren
From: Guo Ren 

We don't have a native hw xchg16 instruction, so let the generic
qspinlock code deal with it.

Implementing xchg16 with full-word atomic xchg instructions carries
a semantic risk for atomic operations.

This patch removes the generic qspinlock code's dependency on the
architecture's xchg16.

Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Chris Zankel 
Cc: Max Filippov 
---
 arch/xtensa/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 9ad6b7b82707..f19d780638f7 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -9,6 +9,7 @@ config XTENSA
select ARCH_HAS_DMA_SET_UNCACHED if MMU
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_IPC_PARSE_VERSION
select BUILDTIME_TABLE_SORT
-- 
2.17.1



[PATCH v6 7/9] sparc: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-03-31 Thread guoren
From: Guo Ren 

We don't have a native hw xchg16 instruction, so let the generic
qspinlock code deal with it.

Implementing xchg16 with full-word atomic xchg instructions carries
a semantic risk for atomic operations.

This patch removes the generic qspinlock code's dependency on the
architecture's xchg16.

Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: David S. Miller 
Cc: Rob Gardner 
---
 arch/sparc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 164a5254c91c..1079fe3f058c 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -91,6 +91,7 @@ config SPARC64
select HAVE_REGS_AND_STACK_ACCESS_API
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
select GENERIC_TIME_VSYSCALL
select ARCH_CLOCKSOURCE_DATA
select ARCH_HAS_PTE_SPECIAL
-- 
2.17.1



[PATCH v6 6/9] openrisc: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-03-31 Thread guoren
From: Guo Ren 

We don't have a native hw xchg16 instruction, so let the qspinlock
generic code deal with it.

Using full-word atomic xchg instructions to implement xchg16 carries a
semantic risk for atomic operations.

This patch removes the qspinlock generic code's dependency on the
architecture's xchg16.

Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Jonas Bonn 
Cc: Stefan Kristiansson 
Cc: Stafford Horne 
Cc: openr...@lists.librecores.org
---
 arch/openrisc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 591acc5990dc..b299e409429f 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -33,6 +33,7 @@ config OPENRISC
select OR1K_PIC
select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1
select ARCH_USE_QUEUED_SPINLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
select ARCH_USE_QUEUED_RWLOCKS
select OMPIC if SMP
select ARCH_WANT_FRAME_POINTERS
-- 
2.17.1



[PATCH v6 5/9] csky: Convert custom spinlock/rwlock to generic qspinlock/qrwlock

2021-03-31 Thread guoren
From: Guo Ren 

Update the C-SKY port to use the generic qspinlock and qrwlock.

C-SKY only supports ldex.w/stex.w with word (double-word) size and
aligned access, so it must select XCHG32 so that qspinlock only uses
the word-sized atomic xchg_tail.

The default is still the ticket lock.

Signed-off-by: Guo Ren 
Cc: Waiman Long 
Cc: Peter Zijlstra 
Cc: Will Deacon 
Cc: Arnd Bergmann 
---
 arch/csky/Kconfig  | 8 
 arch/csky/include/asm/Kbuild   | 2 ++
 arch/csky/include/asm/spinlock.h   | 4 
 arch/csky/include/asm/spinlock_types.h | 4 
 4 files changed, 18 insertions(+)

diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 34e91224adc3..ae12332edb7b 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -8,6 +8,8 @@ config CSKY
select ARCH_HAS_SYNC_DMA_FOR_DEVICE
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_QUEUED_RWLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKSif !CSKY_TICKET_LOCK
+   select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
select ARCH_WANT_FRAME_POINTERS if !CPU_CK610
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select COMMON_CLK
@@ -304,6 +306,12 @@ config NR_CPUS
depends on SMP
default "4"
 
+config CSKY_TICKET_LOCK
+   bool "Ticket-based spin-locking"
+   default y
+   help
+ Say Y here to use ticket-based spin-locking.
+
 config HIGHMEM
bool "High Memory Support"
depends on !CPU_CK610
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index cc24bb8e539f..2a2d09963bb9 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -2,6 +2,8 @@
 generic-y += asm-offsets.h
 generic-y += gpio.h
 generic-y += kvm_para.h
+generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
+generic-y += qspinlock.h
 generic-y += user.h
 generic-y += vmlinux.lds.h
diff --git a/arch/csky/include/asm/spinlock.h b/arch/csky/include/asm/spinlock.h
index 69677167977a..fe98ad8ece51 100644
--- a/arch/csky/include/asm/spinlock.h
+++ b/arch/csky/include/asm/spinlock.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 
+#ifdef CONFIG_CSKY_TICKET_LOCK
 /*
  * Ticket-based spin-locking.
  */
@@ -80,6 +81,9 @@ static inline int arch_spin_is_contended(arch_spinlock_t 
*lock)
return (tickets.next - tickets.owner) > 1;
 }
 #define arch_spin_is_contended arch_spin_is_contended
+#else /* CONFIG_CSKY_TICKET_LOCK */
+#include 
+#endif /* CONFIG_CSKY_TICKET_LOCK */
 
 #include 
 
diff --git a/arch/csky/include/asm/spinlock_types.h 
b/arch/csky/include/asm/spinlock_types.h
index 8ff0f6ff3a00..547f035f6dd5 100644
--- a/arch/csky/include/asm/spinlock_types.h
+++ b/arch/csky/include/asm/spinlock_types.h
@@ -7,6 +7,7 @@
 # error "please don't include this file directly"
 #endif
 
+#ifdef CONFIG_CSKY_TICKET_LOCK
 #define TICKET_NEXT	16
 
 typedef struct {
@@ -21,6 +22,9 @@ typedef struct {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#else
+#include 
+#endif
 
 #include 
 
-- 
2.17.1



[PATCH v6 4/9] csky: locks: Optimize coding convention

2021-03-31 Thread guoren
From: Guo Ren 

 - Use smp_cond_load_acquire in arch_spin_lock, per Peter's advice.
 - Use __smp_acquire_fence in arch_spin_trylock.
 - Use smp_store_release in arch_spin_unlock.

All of the above are coding-convention changes and won't affect
functionality.

TODO in smp_cond_load_acquire for the architecture:
 - current csky only has:
   lr.w val, <p0>
   sc.w <p0>, val2
   (any other store to p0 makes the sc.w fail)

 - but smp_cond_load_acquire needs:
   lr.w val, <p0>
   wfe
   (any store to p0 sends the event that lets the wfe retire)
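
For context, on architectures without a native wait primitive the
generic smp_cond_load_acquire() boils down to a spin loop plus an
acquire fence, roughly the following sketch (simplified from the
asm-generic definition, not the exact kernel macro):

#define smp_cond_load_acquire(ptr, cond_expr) ({			\
	typeof(*(ptr)) VAL;						\
	for (;;) {							\
		VAL = READ_ONCE(*(ptr));				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
	}								\
	smp_acquire__after_ctrl_dep();	/* upgrade read to acquire */	\
	VAL;								\
})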

Signed-off-by: Guo Ren 
Link: 
https://lore.kernel.org/linux-riscv/caahsdy1jhlufwu7rucaq+ruwrbks2ksdva7eprt8--4zfof...@mail.gmail.com/T/#m13adac285b7f51f4f879a5d6b65753ecb1a7524e
Cc: Peter Zijlstra 
Cc: Arnd Bergmann 
---
 arch/csky/include/asm/spinlock.h | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/csky/include/asm/spinlock.h b/arch/csky/include/asm/spinlock.h
index 69f5aa249c5f..69677167977a 100644
--- a/arch/csky/include/asm/spinlock.h
+++ b/arch/csky/include/asm/spinlock.h
@@ -26,10 +26,8 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
: "r"(p), "r"(ticket_next)
: "cc");
 
-   while (lockval.tickets.next != lockval.tickets.owner)
-   lockval.tickets.owner = READ_ONCE(lock->tickets.owner);
-
-   smp_mb();
+   smp_cond_load_acquire(&lock->tickets.owner,
+   VAL == lockval.tickets.next);
 }
 
 static inline int arch_spin_trylock(arch_spinlock_t *lock)
@@ -55,15 +53,14 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
} while (!res);
 
if (!contended)
-   smp_mb();
+   __smp_acquire_fence();
 
return !contended;
 }
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-   smp_mb();
-   WRITE_ONCE(lock->tickets.owner, lock->tickets.owner + 1);
+   smp_store_release(&lock->tickets.owner, lock->tickets.owner + 1);
 }
 
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
-- 
2.17.1



[PATCH v6 3/9] riscv: locks: Introduce ticket-based spinlock implementation

2021-03-31 Thread guoren
From: Guo Ren 

This patch introduces a ticket lock implementation for riscv, along the
same lines as the implementations in arch/arm & arch/csky.

We still use qspinlock by default.
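
For readers less familiar with the inline asm below, the lock path is
equivalent to this C sketch (illustrative only; the real code must do
the ticket fetch-and-add as a single LR/SC sequence):

static inline void arch_spin_lock(arch_spinlock_t *lock)
{
	/* atomically take a ticket; 'old' holds the previous lock word */
	u32 old = atomic_fetch_add(1 << TICKET_NEXT, (atomic_t *)&lock->lock);
	u16 ticket = old >> TICKET_NEXT;	/* the 'next' field we claimed */

	/* spin until the owner count reaches our ticket */
	smp_cond_load_acquire(&lock->tickets.owner, VAL == ticket);
}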

Signed-off-by: Guo Ren 
Cc: Peter Zijlstra 
Cc: Anup Patel 
Cc: Arnd Bergmann 
---
 arch/riscv/Kconfig  |  7 ++-
 arch/riscv/include/asm/spinlock.h   | 84 +
 arch/riscv/include/asm/spinlock_types.h | 17 +
 3 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 67cc65ba1ea1..34d0276f01d5 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -34,7 +34,7 @@ config RISCV
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
select ARCH_USE_QUEUED_RWLOCKS
-   select ARCH_USE_QUEUED_SPINLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKSif !RISCV_TICKET_LOCK
select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
select CLONE_BACKWARDS
select CLINT_TIMER if !MMU
@@ -344,6 +344,11 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
depends on NUMA
 
+config RISCV_TICKET_LOCK
+   bool "Ticket-based spin-locking"
+   help
+ Say Y here to use ticket-based spin-locking.
+
 config RISCV_ISA_C
bool "Emit compressed instructions when building Linux"
default y
diff --git a/arch/riscv/include/asm/spinlock.h 
b/arch/riscv/include/asm/spinlock.h
index a557de67a425..90b7eaa950cf 100644
--- a/arch/riscv/include/asm/spinlock.h
+++ b/arch/riscv/include/asm/spinlock.h
@@ -7,7 +7,91 @@
 #ifndef _ASM_RISCV_SPINLOCK_H
 #define _ASM_RISCV_SPINLOCK_H
 
+#ifdef CONFIG_RISCV_TICKET_LOCK
+#ifdef CONFIG_32BIT
+#define __ASM_SLLIW "slli\t"
+#define __ASM_SRLIW "srli\t"
+#else
+#define __ASM_SLLIW "slliw\t"
+#define __ASM_SRLIW "srliw\t"
+#endif
+
+/*
+ * Ticket-based spin-locking.
+ */
+static inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+   arch_spinlock_t lockval;
+   u32 tmp;
+
+   asm volatile (
+   "1: lr.w%0, %2  \n"
+   "   mv  %1, %0  \n"
+   "   addw%0, %0, %3  \n"
+   "   sc.w%0, %0, %2  \n"
+   "   bnez%0, 1b  \n"
+   : "=&r" (tmp), "=&r" (lockval), "+A" (lock->lock)
+   : "r" (1 << TICKET_NEXT)
+   : "memory");
+
+   smp_cond_load_acquire(&lock->tickets.owner,
+   VAL == lockval.tickets.next);
+}
+
+static inline int arch_spin_trylock(arch_spinlock_t *lock)
+{
+   u32 tmp, contended, res;
+
+   do {
+   asm volatile (
+   "   lr.w%0, %3  \n"
+   __ASM_SRLIW"%1, %0, %5  \n"
+   __ASM_SLLIW"%2, %0, %5  \n"
+   "   or  %1, %2, %1  \n"
+   "   li  %2, 0   \n"
+   "   sub %1, %1, %0  \n"
+   "   bnez%1, 1f  \n"
+   "   addw%0, %0, %4  \n"
+   "   sc.w%2, %0, %3  \n"
+   "1: \n"
+   : "=&r" (tmp), "=&r" (contended), "=&r" (res),
+ "+A" (lock->lock)
+   : "r" (1 << TICKET_NEXT), "I" (TICKET_NEXT)
+   : "memory");
+   } while (res);
+
+   if (!contended)
+   __atomic_acquire_fence();
+
+   return !contended;
+}
+
+static inline void arch_spin_unlock(arch_spinlock_t *lock)
+{
+   smp_store_release(&lock->tickets.owner, lock->tickets.owner + 1);
+}
+
+static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+   return lock.tickets.owner == lock.tickets.next;
+}
+
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
+{
+   return !arch_spin_value_unlocked(READ_ONCE(*lock));
+}
+
+static inline int arch_spin_is_contended(arch_spinlock_t *lock)
+{
+   struct __raw_tickets tickets = READ_ONCE(lock->tickets);
+
+   return (tickets.next - tickets.owner) > 1;
+}
+#define arch_spin_is_contended arch_spin_is_contended
+#else /* CONFIG_RISCV_TICKET_LOCK */
 #include 
+#endif /* CONFIG_RISCV_TICKET_LOCK */
+
 #include 
 
 #endif /* _ASM_RISCV_SPINLOCK_H */
diff --git a/arch/riscv/include/asm/spinlock_types.h 
b/arch/riscv/include/asm/spinlock_types.h
index d033a973f287..afbb19841d0f 100644
--- a/arch/riscv/include/asm/spinlock_types.h
+++ b/arch/riscv/include/asm/spinlock_types.h
@@ -10,7 +10,24 @@
 # error "please don't include this file directly"
 #endif
 
+#ifdef CONFIG_RISCV_TICKET_LOCK
+#define TICKET_NEXT	16
+
+typedef struct {
+   union {
+   u32 lock;
+   struct __raw_tickets {
+   /* little endian */
+   u16 owner;
+   u16 next;
+   } tickets;
+   };
+} arch_spinlock_t;
+
+#define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#else
 #include 
+#endif
 

[PATCH v6 2/9] riscv: Convert custom spinlock/rwlock to generic qspinlock/qrwlock

2021-03-31 Thread guoren
From: Michael Clark 

Update the RISC-V port to use the generic qspinlock and qrwlock.

This patch requires full-word xchg_tail support, which is added by a
previous patch:

Guo added select ARCH_USE_QUEUED_SPINLOCKS_XCHG32 in Kconfig

Guo fixed up a compile error caused by the include sequence below:
+#include 
+#include 

Signed-off-by: Michael Clark 
Co-developed-by: Guo Ren 
Tested-by: Guo Ren 
Signed-off-by: Guo Ren 
Link: 
https://lore.kernel.org/linux-riscv/20190211043829.30096-3-michaeljcl...@mac.com/
Cc: Peter Zijlstra 
Cc: Anup Patel 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
---
 arch/riscv/Kconfig  |   3 +
 arch/riscv/include/asm/Kbuild   |   3 +
 arch/riscv/include/asm/spinlock.h   | 126 +---
 arch/riscv/include/asm/spinlock_types.h |  15 +--
 4 files changed, 11 insertions(+), 136 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 87d7b52f278f..67cc65ba1ea1 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -33,6 +33,9 @@ config RISCV
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
+   select ARCH_USE_QUEUED_RWLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
select CLONE_BACKWARDS
select CLINT_TIMER if !MMU
select COMMON_CLK
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index 445ccc97305a..750c1056b90f 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -3,5 +3,8 @@ generic-y += early_ioremap.h
 generic-y += extable.h
 generic-y += flat.h
 generic-y += kvm_para.h
+generic-y += mcs_spinlock.h
+generic-y += qrwlock.h
+generic-y += qspinlock.h
 generic-y += user.h
 generic-y += vmlinux.lds.h
diff --git a/arch/riscv/include/asm/spinlock.h 
b/arch/riscv/include/asm/spinlock.h
index f4f7fa1b7ca8..a557de67a425 100644
--- a/arch/riscv/include/asm/spinlock.h
+++ b/arch/riscv/include/asm/spinlock.h
@@ -7,129 +7,7 @@
 #ifndef _ASM_RISCV_SPINLOCK_H
 #define _ASM_RISCV_SPINLOCK_H
 
-#include 
-#include 
-#include 
-
-/*
- * Simple spin lock operations.  These provide no fairness guarantees.
- */
-
-/* FIXME: Replace this with a ticket lock, like MIPS. */
-
-#define arch_spin_is_locked(x) (READ_ONCE((x)->lock) != 0)
-
-static inline void arch_spin_unlock(arch_spinlock_t *lock)
-{
-   smp_store_release(&lock->lock, 0);
-}
-
-static inline int arch_spin_trylock(arch_spinlock_t *lock)
-{
-   int tmp = 1, busy;
-
-   __asm__ __volatile__ (
-   "   amoswap.w %0, %2, %1\n"
-   RISCV_ACQUIRE_BARRIER
-   : "=r" (busy), "+A" (lock->lock)
-   : "r" (tmp)
-   : "memory");
-
-   return !busy;
-}
-
-static inline void arch_spin_lock(arch_spinlock_t *lock)
-{
-   while (1) {
-   if (arch_spin_is_locked(lock))
-   continue;
-
-   if (arch_spin_trylock(lock))
-   break;
-   }
-}
-
-/***/
-
-static inline void arch_read_lock(arch_rwlock_t *lock)
-{
-   int tmp;
-
-   __asm__ __volatile__(
-   "1: lr.w%1, %0\n"
-   "   bltz%1, 1b\n"
-   "   addi%1, %1, 1\n"
-   "   sc.w%1, %1, %0\n"
-   "   bnez%1, 1b\n"
-   RISCV_ACQUIRE_BARRIER
-   : "+A" (lock->lock), "=&r" (tmp)
-   :: "memory");
-}
-
-static inline void arch_write_lock(arch_rwlock_t *lock)
-{
-   int tmp;
-
-   __asm__ __volatile__(
-   "1: lr.w%1, %0\n"
-   "   bnez%1, 1b\n"
-   "   li  %1, -1\n"
-   "   sc.w%1, %1, %0\n"
-   "   bnez%1, 1b\n"
-   RISCV_ACQUIRE_BARRIER
-   : "+A" (lock->lock), "=&r" (tmp)
-   :: "memory");
-}
-
-static inline int arch_read_trylock(arch_rwlock_t *lock)
-{
-   int busy;
-
-   __asm__ __volatile__(
-   "1: lr.w%1, %0\n"
-   "   bltz%1, 1f\n"
-   "   addi%1, %1, 1\n"
-   "   sc.w%1, %1, %0\n"
-   "   bnez%1, 1b\n"
-   RISCV_ACQUIRE_BARRIER
-   "1:\n"
-   : "+A" (lock->lock), "=&r" (busy)
-   :: "memory");
-
-   return !busy;
-}
-
-static inline int arch_write_trylock(arch_rwlock_t *lock)
-{
-   int busy;
-
-   __asm__ __volatile__(
-   "1: lr.w%1, %0\n"
-   "   bnez%1, 1f\n"
-   "   li  %1, -1\n"
-   "   sc.w%1, %1, %0\n"
-   "   bnez%1, 1b\n"
-   RISCV_ACQUIRE_BARRIER
-   "1:\n"
-   : "+A" (lock->lock), "=&r" (busy)
-   :: "memory");
-
-   

[PATCH v6 1/9] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-03-31 Thread guoren
From: Guo Ren 

Some architectures don't have a sub-word atomic swap instruction; they
only have a full-word one.

The sub-word swap only improves performance when NR_CPUS < 16K, with
the lock word laid out as:
 *  0- 7: locked byte
 * 8: pending
 *  9-15: not used
 * 16-17: tail index
 * 18-31: tail cpu (+1)

Bits 9-15 are wasted in order to use xchg16 in xchg_tail.

Let the architecture select xchg16/xchg32 to implement xchg_tail.
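
For architectures that select the new option, xchg_tail() falls back to
the full-word cmpxchg loop that qspinlock.c already carries for the
_Q_PENDING_BITS != 8 case, roughly:

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	u32 old, new, val = atomic_read(&lock->val);

	for (;;) {
		/* keep the locked and pending bits, replace only the tail */
		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
		/* relaxed is fine: the MCS node is initialized beforehand */
		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
	return old;
}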

Signed-off-by: Guo Ren 
Cc: Peter Zijlstra 
Cc: Will Deacon 
Cc: Ingo Molnar 
Cc: Waiman Long 
Cc: Arnd Bergmann 
Cc: Anup Patel 
---
 kernel/Kconfig.locks   |  3 +++
 kernel/locking/qspinlock.c | 46 +-
 2 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 3de8fd11873b..d02f1261f73f 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -239,6 +239,9 @@ config LOCK_SPIN_ON_OWNER
 config ARCH_USE_QUEUED_SPINLOCKS
bool
 
+config ARCH_USE_QUEUED_SPINLOCKS_XCHG32
+   bool
+
 config QUEUED_SPINLOCKS
def_bool y if ARCH_USE_QUEUED_SPINLOCKS
depends on SMP
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba53d56..4bfaa969bd15 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -163,26 +163,6 @@ static __always_inline void 
clear_pending_set_locked(struct qspinlock *lock)
WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
-/*
- * xchg_tail - Put in the new queue tail code word & retrieve previous one
- * @lock : Pointer to queued spinlock structure
- * @tail : The new queue tail code word
- * Return: The previous queue tail code word
- *
- * xchg(lock, tail), which heads an address dependency
- *
- * p,*,* -> n,*,* ; prev = xchg(lock, node)
- */
-static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
-{
-   /*
-* We can use relaxed semantics since the caller ensures that the
-* MCS node is properly initialized before updating the tail.
-*/
-   return (u32)xchg_relaxed(&lock->tail,
-tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
-}
-
 #else /* _Q_PENDING_BITS == 8 */
 
 /**
@@ -206,6 +186,30 @@ static __always_inline void 
clear_pending_set_locked(struct qspinlock *lock)
 {
atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
 }
+#endif /* _Q_PENDING_BITS == 8 */
+
+#if _Q_PENDING_BITS == 8 && !defined(CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32)
+/*
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail), which heads an address dependency
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   /*
+* We can use relaxed semantics since the caller ensures that the
+* MCS node is properly initialized before updating the tail.
+*/
+   return (u32)xchg_relaxed(&lock->tail,
+tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else
 
 /**
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
@@ -236,7 +240,7 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
}
return old;
 }
-#endif /* _Q_PENDING_BITS == 8 */
+#endif
 
 /**
  * queued_fetch_set_pending_acquire - fetch the whole lock value and set 
pending
-- 
2.17.1



[PATCH v6 0/9] riscv: Add qspinlock/qrwlock

2021-03-31 Thread guoren
From: Guo Ren 

riscv currently still uses a naive spinlock implementation, which causes
fairness and cache-line bouncing problems. Many people have been involved
in the effort to improve it:

 - The first version of the patch was made in 2019.1:
   
https://lore.kernel.org/linux-riscv/20190211043829.30096-1-michaeljcl...@mac.com/#r

 - The second version was made in 2020.11:
   
https://lore.kernel.org/linux-riscv/1606225437-22948-2-git-send-email-guo...@kernel.org/

 - A good discussion at Platform HSC.2021-03-08:
   https://drive.google.com/drive/folders/1ooqdnIsYx7XKor5O1XTtM6D1CHp4hc0p

 - A good discussion on V4 in mailling list:
   
https://lore.kernel.org/linux-riscv/1616868399-82848-1-git-send-email-guo...@kernel.org/T/#t

 - OpenRISC's maintainer wants to implement the arch_cmpxchg infrastructure.
   
https://lore.kernel.org/linux-riscv/1616868399-82848-1-git-send-email-guo...@kernel.org/T/#m11b712fb6a4fda043811b1f4c3d61446951ed65a

Your comments and Tested-by, Co-developed-by or Reviewed-by tags are
welcome.

Let's get qspinlock into riscv right now (also for architectures that
have no native xchg16 atomic instruction).

Change V6:
 - Add  ticket-lock for riscv, default is qspinlock
 - Keep ticket-lock for csky,  default is ticketlock
 - Using smp_cond_load for riscv ticket-lock
 - Optimize csky ticketlock with smp_cond_load, store_release
 - Add PPC_LBARX_LWARX for powerpc 

Change V5:
 - Fixup #endif comment typo by Waiman
 - Remove cmpxchg coding convention patches which will get into a
   separate patchset later by Arnd's advice
 - Try to involve more architectures in the discussion

Change V4:
 - Remove custom sub-word xchg implementation
 - Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32 in locking/qspinlock

Change V3:
 - Coding convention by Peter Zijlstra's advices

Change V2:
 - Coding convention in cmpxchg.h
 - Re-implement short xchg
 - Remove char & cmpxchg implementations

Guo Ren (8):
  locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
  riscv: locks: Introduce ticket-based spinlock implementation
  csky: locks: Optimize coding convention
  csky: Convert custom spinlock/rwlock to generic qspinlock/qrwlock
  openrisc: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
  sparc: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
  xtensa: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
  powerpc/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

Michael Clark (1):
  riscv: Convert custom spinlock/rwlock to generic qspinlock/qrwlock

 arch/csky/Kconfig   |   8 ++
 arch/csky/include/asm/Kbuild|   2 +
 arch/csky/include/asm/spinlock.h|  15 +--
 arch/csky/include/asm/spinlock_types.h  |   4 +
 arch/openrisc/Kconfig   |   1 +
 arch/powerpc/Kconfig|   1 +
 arch/riscv/Kconfig  |   8 ++
 arch/riscv/include/asm/Kbuild   |   3 +
 arch/riscv/include/asm/spinlock.h   | 158 +---
 arch/riscv/include/asm/spinlock_types.h |  26 ++--
 arch/sparc/Kconfig  |   1 +
 arch/xtensa/Kconfig |   1 +
 kernel/Kconfig.locks|   3 +
 kernel/locking/qspinlock.c  |  46 +++
 14 files changed, 142 insertions(+), 135 deletions(-)

-- 
2.17.1



[PATCH] powerpc/ptrace: Don't return error when getting/setting FP regs without CONFIG_PPC_FPU_REGS

2021-03-31 Thread Christophe Leroy
An #ifdef CONFIG_PPC_FPU_REGS is missing in arch_ptrace(), leading
to the following Oops because the [REGSET_FPR] entry is not initialised
in native_regsets[].

[   41.917608] BUG: Unable to handle kernel instruction fetch
[   41.922849] Faulting instruction address: 0xff8fd228
[   41.927760] Oops: Kernel access of bad area, sig: 11 [#1]
[   41.933089] BE PAGE_SIZE=4K PREEMPT CMPC885
[   41.940753] Modules linked in:
[   41.943768] CPU: 0 PID: 366 Comm: gdb Not tainted 
5.12.0-rc5-s3k-dev-01666-g7aac86a0f057-dirty #4835
[   41.952800] NIP:  ff8fd228 LR: c004d9e0 CTR: ff8fd228
[   41.957790] REGS: caae9df0 TRAP: 0400   Not tainted  
(5.12.0-rc5-s3k-dev-01666-g7aac86a0f057-dirty)
[   41.966741] MSR:  40009032   CR: 82004248  XER: 2000
[   41.973540]
[   41.973540] GPR00: c004d9b4 caae9eb0 c1b64f60 c1b64520 c0713cd4 caae9eb8 
c1bacdfc 0004
[   41.973540] GPR08: 0200 ff8fd228 c1bac700 1032 28004242 1061aaf4 
0001 106d64a0
[   41.973540] GPR16:   7fa0a774 1061 7fa0aef9  
1061 7fa0a538
[   41.973540] GPR24: 7fa0a580 7fa0a570 c1bacc00 c1b64520 c1bacc00 caae9ee8 
0108 c0713cd4
[   42.009685] NIP [ff8fd228] 0xff8fd228
[   42.013300] LR [c004d9e0] __regset_get+0x100/0x124
[   42.018036] Call Trace:
[   42.020443] [caae9eb0] [c004d9b4] __regset_get+0xd4/0x124 (unreliable)
[   42.026899] [caae9ee0] [c004da94] copy_regset_to_user+0x5c/0xb0
[   42.032751] [caae9f10] [c002f640] sys_ptrace+0xe4/0x588
[   42.037915] [caae9f30] [c0011010] ret_from_syscall+0x0/0x28
[   42.043422] --- interrupt: c00 at 0xfd1f8e4
[   42.047553] NIP:  0fd1f8e4 LR: 1004a688 CTR: 
[   42.052544] REGS: caae9f40 TRAP: 0c00   Not tainted  
(5.12.0-rc5-s3k-dev-01666-g7aac86a0f057-dirty)
[   42.061494] MSR:  d032   CR: 48004442  XER: 
[   42.068551]
[   42.068551] GPR00: 001a 7fa0a040 77dad7e0 000e 0170  
7fa0a078 0004
[   42.068551] GPR08:  108deb88 108dda40 106d6010 44004442 1061aaf4 
0001 106d64a0
[   42.068551] GPR16:   7fa0a774 1061 7fa0aef9  
1061 7fa0a538
[   42.068551] GPR24: 7fa0a580 7fa0a570 1078fe00 1078fd70 1078fd70 0170 
0fdd3244 000d
[   42.104696] NIP [0fd1f8e4] 0xfd1f8e4
[   42.108225] LR [1004a688] 0x1004a688
[   42.111753] --- interrupt: c00
[   42.114768] Instruction dump:
[   42.117698]        

[   42.125443]        

[   42.133195] ---[ end trace d35616f22ab2100c ]---

Adding the missing #ifdef is not a good fix, because gdb doesn't like
getting an error when reading registers.

Instead, make ptrace return zeroes when CONFIG_PPC_FPU_REGS is not set.

Fixes: b6254ced4da6 ("powerpc/signal: Don't manage floating point regs when no 
FPU")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/ptrace/Makefile   |  4 ++--
 arch/powerpc/kernel/ptrace/ptrace-decl.h  | 14 --
 arch/powerpc/kernel/ptrace/ptrace-fpu.c   | 10 ++
 arch/powerpc/kernel/ptrace/ptrace-novsx.c |  8 
 arch/powerpc/kernel/ptrace/ptrace-view.c  |  2 --
 5 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace/Makefile 
b/arch/powerpc/kernel/ptrace/Makefile
index 8ebc11d1168d..77abd1a5a508 100644
--- a/arch/powerpc/kernel/ptrace/Makefile
+++ b/arch/powerpc/kernel/ptrace/Makefile
@@ -6,11 +6,11 @@
 CFLAGS_ptrace-view.o   += -DUTS_MACHINE='"$(UTS_MACHINE)"'
 
 obj-y  += ptrace.o ptrace-view.o
-obj-$(CONFIG_PPC_FPU_REGS) += ptrace-fpu.o
+obj-y  += ptrace-fpu.o
 obj-$(CONFIG_COMPAT)   += ptrace32.o
 obj-$(CONFIG_VSX)  += ptrace-vsx.o
 ifneq ($(CONFIG_VSX),y)
-obj-$(CONFIG_PPC_FPU_REGS) += ptrace-novsx.o
+obj-y  += ptrace-novsx.o
 endif
 obj-$(CONFIG_ALTIVEC)  += ptrace-altivec.o
 obj-$(CONFIG_SPE)  += ptrace-spe.o
diff --git a/arch/powerpc/kernel/ptrace/ptrace-decl.h 
b/arch/powerpc/kernel/ptrace/ptrace-decl.h
index 3487f2c9735c..eafe5f0f6289 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-decl.h
+++ b/arch/powerpc/kernel/ptrace/ptrace-decl.h
@@ -165,22 +165,8 @@ int ptrace_put_reg(struct task_struct *task, int regno, 
unsigned long data);
 extern const struct user_regset_view user_ppc_native_view;
 
 /* ptrace-fpu */
-#ifdef CONFIG_PPC_FPU_REGS
 int ptrace_get_fpr(struct task_struct *child, int index, unsigned long *data);
 int ptrace_put_fpr(struct task_struct *child, int index, unsigned long data);
-#else
-static inline int
-ptrace_get_fpr(struct task_struct *child, int index, unsigned long *data)
-{
-   return -EIO;
-}
-
-static inline int
-ptrace_put_fpr(struct task_struct *child, int index, unsigned long data)
-{
-   return -EIO;
-}
-#endif
 
 /* ptrace-(no)adv */
 void ppc_gethwdinfo(struct ppc_debug_info *dbginfo);
diff --git a/arch/powerpc/kernel/ptrace/ptrace-fpu.c 
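
The ptrace-fpu.c hunk is truncated above. The change it describes,
guarding only the register access so the helpers succeed and hand back
zeroes, is along these lines (a sketch of the idea, not the actual
hunk; the helper internals shown here are assumptions):

int ptrace_get_fpr(struct task_struct *child, int index, unsigned long *data)
{
#ifdef CONFIG_PPC_FPU_REGS
	unsigned int fpidx = index - PT_FPR0;
#endif

	if (index > PT_FPSCR)
		return -EIO;

#ifdef CONFIG_PPC_FPU_REGS
	flush_fp_to_thread(child);
	if (fpidx < (PT_FPSCR - PT_FPR0))
		memcpy(data, &child->thread.TS_FPR(fpidx), sizeof(long));
	else
		*data = child->thread.fp_state.fpscr;
#else
	*data = 0;	/* pretend the FP regs exist and read as zero */
#endif

	return 0;
}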

Re: [PATCH] powerpc/8xx: Load modules closer to kernel text

2021-03-31 Thread Christophe Leroy




Le 31/03/2021 à 15:39, Michael Ellerman a écrit :

Christophe Leroy  writes:

On the 8xx, TASK_SIZE is 0x8000. The space between TASK_SIZE and
PAGE_OFFSET is not used.

Use it to load modules in order to minimise the distance between
kernel text and modules and avoid trampolines in modules to access
kernel functions or other module functions.

Define a 16Mbytes area for modules, that's more than enough.


16MB seems kind of small.

At least on 64-bit we could potentially have hundreds of MBs of modules.



Well, with a 16 MB kernel and 16 MB modules, my board is full :)

Even on the more recent board that has 128 MB, I don't expect more than a few MB of modules in
addition to the kernel, which is approx 8 MB.



But ok, I'll do something more generic, though it will conflict with Jordan's 
series.

Christophe


[PATCH v2] powerpc/signal32: Fix Oops on sigreturn with unmapped VDSO

2021-03-31 Thread Christophe Leroy
PPC32 encounters a KUAP fault when trying to handle a signal with
VDSO unmapped.

Kernel attempted to read user page (7fc07ec0) - exploit attempt? (uid: 
0)
BUG: Unable to handle kernel data access on read at 0x7fc07ec0
Faulting instruction address: 0xc00111d4
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=16K PREEMPT CMPC885
CPU: 0 PID: 353 Comm: sigreturn_vdso Not tainted 
5.12.0-rc4-s3k-dev-01553-gb30c310ea220 #4814
NIP:  c00111d4 LR: c0005a28 CTR: 
REGS: cadb3dd0 TRAP: 0300   Not tainted  
(5.12.0-rc4-s3k-dev-01553-gb30c310ea220)
MSR:  9032   CR: 48000884  XER: 2000
DAR: 7fc07ec0 DSISR: 8800
GPR00: c0007788 cadb3e90 c28d4a40 7fc07ec0 7fc07ed0 04e0 7fc07ce0 

GPR08: 0001 0001 7fc07ec0  28000282 1001b828 100a0920 

GPR16: 100cac0c 100b 105c43a4 105c5685 100d 100d 100d 
100b2e9e
GPR24:  105c43c8  7fc07ec8 cadb3f40 cadb3ec8 c28d4a40 

NIP [c00111d4] flush_icache_range+0x90/0xb4
LR [c0005a28] handle_signal32+0x1bc/0x1c4
Call Trace:
[cadb3e90] [100d] 0x100d (unreliable)
[cadb3ec0] [c0007788] do_notify_resume+0x260/0x314
[cadb3f20] [c000c764] syscall_exit_prepare+0x120/0x184
[cadb3f30] [c00100b4] ret_from_syscall+0xc/0x28
--- interrupt: c00 at 0xfe807f8
NIP:  0fe807f8 LR: 10001060 CTR: c0139378
REGS: cadb3f40 TRAP: 0c00   Not tainted  
(5.12.0-rc4-s3k-dev-01553-gb30c310ea220)
MSR:  d032   CR: 28000482  XER: 2000

GPR00: 0025 7fc081c0 77bb1690  000a 28000482 0001 
0ff03a38
GPR08: d032 6de5 c28d4a40 0009 88000482 1001b828 100a0920 

GPR16: 100cac0c 100b 105c43a4 105c5685 100d 100d 100d 
100b2e9e
GPR24:  105c43c8  77ba7628 10002398 1001 10002124 
00024000
NIP [0fe807f8] 0xfe807f8
LR [10001060] 0x10001060
--- interrupt: c00
Instruction dump:
38630010 7c001fac 38630010 4200fff0 7c0004ac 4c00012c 4e800020 7c001fac
2c0a 38630010 4082ffcc 4be4 <7c00186c> 2c07 39430010 
4082ff8c
---[ end trace 3973fb72b049cb06 ]---

This is because flush_icache_range() is called on user addresses.

The same problem was detected some time ago on PPC64. It was fixed by
enabling KUAP in commit 59bee45b9712 ("powerpc/mm: Fix missing KUAP
disable in flush_coherent_icache()").

PPC32 doesn't use flush_coherent_icache() and falls back on
clean_dcache_range() and invalidate_icache_range().

We could fix it similarly by enabling user access in those functions,
but this is overkill for just flushing two instructions.

The two instructions are 8-byte aligned, so a single dcbst/icbi pair is
enough to flush them. As __patch_instruction() does, inline a dcbst
followed by an icbi just after the write of the instructions, while
user access is still allowed. The isync is not required because rfi
will be used to return to userspace.

icbi is handled as a read, so read-write user access is needed.

Signed-off-by: Christophe Leroy 
---
v2: Do read-write user access.
---
 arch/powerpc/kernel/signal_32.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index c505b444a613..09884af693aa 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -775,7 +775,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
*oldset,
else
prepare_save_user_regs(1);
 
-   if (!user_write_access_begin(frame, sizeof(*frame)))
+   if (!user_access_begin(frame, sizeof(*frame)))
goto badframe;
 
/* Put the siginfo & fill in most of the ucontext */
@@ -809,17 +809,15 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
*oldset,
unsafe_put_user(PPC_INST_ADDI + __NR_rt_sigreturn, 
&mctx->mc_pad[0],
failed);
unsafe_put_user(PPC_INST_SC, &mctx->mc_pad[1], failed);
+   asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
}
unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
 
-   user_write_access_end();
+   user_access_end();
 
if (copy_siginfo_to_user(&frame->info, &ksig->info))
goto badframe;
 
-   if (tramp == (unsigned long)mctx->mc_pad)
-   flush_icache_range(tramp, tramp + 2 * sizeof(unsigned long));
-
regs->link = tramp;
 
 #ifdef CONFIG_PPC_FPU_REGS
@@ -844,7 +842,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
*oldset,
return 0;
 
 failed:
-   user_write_access_end();
+   user_access_end();
 
 badframe:
signal_fault(tsk, regs, "handle_rt_signal32", frame);
@@ -879,7 +877,7 @@ int handle_signal32(struct ksignal *ksig, 

[PATCH] selftests: timens: Fix gettime_perf to work on powerpc

2021-03-31 Thread Christophe Leroy
On powerpc:
- VDSO library is named linux-vdso32.so.1 or linux-vdso64.so.1
- clock_gettime is named __kernel_clock_gettime()

Ensure gettime_perf tries these names before giving up.

Signed-off-by: Christophe Leroy 
---
 tools/testing/selftests/timens/gettime_perf.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/timens/gettime_perf.c 
b/tools/testing/selftests/timens/gettime_perf.c
index 7bf841a3967b..6b13dc277724 100644
--- a/tools/testing/selftests/timens/gettime_perf.c
+++ b/tools/testing/selftests/timens/gettime_perf.c
@@ -25,12 +25,20 @@ static void fill_function_pointers(void)
if (!vdso)
vdso = dlopen("linux-gate.so.1",
  RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
+   if (!vdso)
+   vdso = dlopen("linux-vdso32.so.1",
+ RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
+   if (!vdso)
+   vdso = dlopen("linux-vdso64.so.1",
+ RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
if (!vdso) {
pr_err("[WARN]\tfailed to find vDSO\n");
return;
}
 
vdso_clock_gettime = (vgettime_t)dlsym(vdso, "__vdso_clock_gettime");
+   if (!vdso_clock_gettime)
+   vdso_clock_gettime = (vgettime_t)dlsym(vdso, 
"__kernel_clock_gettime");
if (!vdso_clock_gettime)
pr_err("Warning: failed to find clock_gettime in vDSO\n");
 
-- 
2.25.0



Re: [PATCH] powerpc/8xx: Load modules closer to kernel text

2021-03-31 Thread Michael Ellerman
Christophe Leroy  writes:
> On the 8xx, TASK_SIZE is 0x8000. The space between TASK_SIZE and
> PAGE_OFFSET is not used.
>
> Use it to load modules in order to minimise the distance between
> kernel text and modules and avoid trampolines in modules to access
> kernel functions or other module functions.
>
> Define a 16Mbytes area for modules, that's more than enough.

16MB seems kind of small.

At least on 64-bit we could potentially have hundreds of MBs of modules.

cheers


Re: [PATCH] powerpc/signal32: Fix Oops on sigreturn with unmapped VDSO

2021-03-31 Thread Michael Ellerman
Christophe Leroy  writes:
> PPC32 encounters a KUAP fault when trying to handle a signal with
> VDSO unmapped.
>
>   Kernel attempted to read user page (7fc07ec0) - exploit attempt? (uid: 
> 0)
>   BUG: Unable to handle kernel data access on read at 0x7fc07ec0
>   Faulting instruction address: 0xc00111d4
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   BE PAGE_SIZE=16K PREEMPT CMPC885
>   CPU: 0 PID: 353 Comm: sigreturn_vdso Not tainted 
> 5.12.0-rc4-s3k-dev-01553-gb30c310ea220 #4814
>   NIP:  c00111d4 LR: c0005a28 CTR: 
>   REGS: cadb3dd0 TRAP: 0300   Not tainted  
> (5.12.0-rc4-s3k-dev-01553-gb30c310ea220)
>   MSR:  9032   CR: 48000884  XER: 2000
>   DAR: 7fc07ec0 DSISR: 8800
>   GPR00: c0007788 cadb3e90 c28d4a40 7fc07ec0 7fc07ed0 04e0 7fc07ce0 
> 
>   GPR08: 0001 0001 7fc07ec0  28000282 1001b828 100a0920 
> 
>   GPR16: 100cac0c 100b 105c43a4 105c5685 100d 100d 100d 
> 100b2e9e
>   GPR24:  105c43c8  7fc07ec8 cadb3f40 cadb3ec8 c28d4a40 
> 
>   NIP [c00111d4] flush_icache_range+0x90/0xb4
>   LR [c0005a28] handle_signal32+0x1bc/0x1c4
>   Call Trace:
>   [cadb3e90] [100d] 0x100d (unreliable)
>   [cadb3ec0] [c0007788] do_notify_resume+0x260/0x314
>   [cadb3f20] [c000c764] syscall_exit_prepare+0x120/0x184
>   [cadb3f30] [c00100b4] ret_from_syscall+0xc/0x28
>   --- interrupt: c00 at 0xfe807f8
>   NIP:  0fe807f8 LR: 10001060 CTR: c0139378
>   REGS: cadb3f40 TRAP: 0c00   Not tainted  
> (5.12.0-rc4-s3k-dev-01553-gb30c310ea220)
>   MSR:  d032   CR: 28000482  XER: 2000
>
>   GPR00: 0025 7fc081c0 77bb1690  000a 28000482 0001 
> 0ff03a38
>   GPR08: d032 6de5 c28d4a40 0009 88000482 1001b828 100a0920 
> 
>   GPR16: 100cac0c 100b 105c43a4 105c5685 100d 100d 100d 
> 100b2e9e
>   GPR24:  105c43c8  77ba7628 10002398 1001 10002124 
> 00024000
>   NIP [0fe807f8] 0xfe807f8
>   LR [10001060] 0x10001060
>   --- interrupt: c00
>   Instruction dump:
>   38630010 7c001fac 38630010 4200fff0 7c0004ac 4c00012c 4e800020 7c001fac
>   2c0a 38630010 4082ffcc 4be4 <7c00186c> 2c07 39430010 
> 4082ff8c
>   ---[ end trace 3973fb72b049cb06 ]---
>
> This is because flush_icache_range() is called on user addresses.
>
> The same problem was detected some time ago on PPC64. It was fixed by
> enabling KUAP in commit 59bee45b9712 ("powerpc/mm: Fix missing KUAP
> disable in flush_coherent_icache()").
>
> PPC32 doesn't use flush_coherent_icache() and fallbacks on
> clean_dcache_range() and invalidate_icache_range().

But this code is also used for compat tasks on 64-bit.

> We could fix it similarly by enabling user access in those functions,
> but this is overkill for just flushing two instructions.
>
> The two instructions are 8 bytes aligned, so a single dcbst/icbi is
> enough to flush them. Do like __patch_instruction() and inline
> a dcbst followed by an icbi just after the write of the instructions,
> while user access is still allowed. The isync is not required because
> rfi will be used to return to user.
>
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/signal_32.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> index 75ee918a120a..5b2ba2731957 100644
> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -809,6 +809,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
> *oldset,
>   unsafe_put_user(PPC_INST_ADDI + __NR_rt_sigreturn, 
> &mctx->mc_pad[0],
>   failed);
>   unsafe_put_user(PPC_INST_SC, &mctx->mc_pad[1], failed);
> + asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));

If I'm reading that right you're pointing the icbi at the user address.

That's going to cause a KUAP fault just like we fixed in commit
59bee45b9712 ("powerpc/mm: Fix missing KUAP disable in 
flush_coherent_icache()").

We have user write access enabled, but the icbi is treated as a load.

So I don't think that's going to work.

cheers


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-31 12:49, Will Deacon wrote:

On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:

On 2021-03-30 14:58, Will Deacon wrote:

On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:

On 2021-03-30 14:11, Will Deacon wrote:

On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:

From: Robin Murphy 

Instead make the global iommu_dma_strict parameter in iommu.c canonical by
exporting helpers to get and set it and use those directly in the drivers.

This makes sure that the iommu.strict parameter also works for the AMD and
Intel IOMMU drivers on x86.  As those default to lazy flushing, a new
IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
represent the default if not overridden by an explicit parameter.

Signed-off-by: Robin Murphy .
[ported on top of the other iommu_attr changes and added a few small
missing bits]
Signed-off-by: Christoph Hellwig 
---
drivers/iommu/amd/iommu.c   | 23 +---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
drivers/iommu/dma-iommu.c   |  9 +--
drivers/iommu/intel/iommu.c | 64 -
drivers/iommu/iommu.c   | 27 ++---
include/linux/iommu.h   |  4 +-
8 files changed, 40 insertions(+), 165 deletions(-)


I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.


The intent here was just to streamline the existing behaviour of stuffing a
global property into a domain attribute then pulling it out again in the
illusion that it was in any way per-domain. We're still checking
dev_is_untrusted() before making an actual decision, and it's not like we
can't add more factors at that point if we want to.


Like I say, the cleanup is great. I'm just wondering whether there's a
better way to express the complicated logic to decide whether or not to use
the flush queue than what we end up with:

if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict())

which is mixing up globals, device properties and domain properties. The
result is that the driver code ends up just using the global to determine
whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
which is a departure from the current way of doing things.


But previously, SMMU only ever saw the global policy piped through the
domain attribute by iommu_group_alloc_default_domain(), so there's no
functional change there.


For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.


Eh? This is only relevant to DMA domains anyway. Flush queues are part 
of the IOVA allocator that VFIO doesn't even use. It's always been the 
case that unmanaged domains only use strict invalidation.



Obviously some of the above checks could be factored out into some kind of
iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
to keep in sync. Or maybe we just allow iommu-dma to set
IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
treating that as a generic thing now.


I think a helper that takes a domain would be a good starting point.


You mean device, right? The one condition we currently have is at the 
device level, and there's really nothing inherent to the domain itself 
that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care 
about this).


Another idea that's just come to mind is now that IOMMU_DOMAIN_DMA has a 
standard meaning, maybe we could split out a separate 
IOMMU_DOMAIN_DMA_STRICT type such that it can all propagate from 
iommu_get_def_domain_type()? That feels like it might be quite 
promising, but I'd still do it as an improvement on top of this patch, 
since it's beyond just cleaning up the abuse of domain attributes to 
pass a command-line option around.
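
Hypothetically (the new domain type and its placement are illustrative,
not existing code), that might end up looking like:

	/* e.g. when choosing the default domain type for a group */
	if (iommu_get_dma_strict() || (dev && dev_is_untrusted(dev)))
		type = IOMMU_DOMAIN_DMA_STRICT;	/* always strict invalidation */
	else
		type = IOMMU_DOMAIN_DMA;	/* may use the flush queue */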


Robin.


Re: [PATCH -next] powerpc/eeh: Remove unused inline function eeh_dev_phb_init_dynamic()

2021-03-31 Thread YueHaibing
On 2021/3/26 13:08, Daniel Axtens wrote:
> Hi,
> 
>> commit 475028efc708 ("powerpc/eeh: Remove eeh_dev_phb_init_dynamic()")
>> left this behind, so we can remove it.
> 
> I had a look: the inline that you are removing here is for the
> !CONFIG_EEH case, which explains why it was missed the first time.
> 
> This looks like a good change. Out of interest, what tool are you using
> to find these unused inlines? If there are many more, it might make
> sense to combine future patches removing them into a single patch, but
> I'm not sure.

Just used some grep skills; I'll do that if there are any more.

> 
> checkpatch likes this patch, so that's also good :)
> 
> Reviewed-by: Daniel Axtens 
> 
> Kind regards,
> Daniel
> .
> 


[PATCH -next] powerpc/eeh: Add correct inline functions

2021-03-31 Thread YueHaibing
pseries_eeh_add_device_early()/pseries_eeh_add_device_tree_early() have
never been used since they were added; however, pseries_eeh_init_edev()
and pseries_eeh_init_edev_recursive() do need inline stub versions.

Fixes: b6eebb093cad ("powerpc/eeh: Make early EEH init pseries specific")
Signed-off-by: YueHaibing 
---
 arch/powerpc/include/asm/eeh.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index b1a5bba2e0b9..0b6c2a6711d3 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -357,8 +357,8 @@ static inline int eeh_phb_pe_create(struct pci_controller 
*phb) { return 0; }
 void pseries_eeh_init_edev(struct pci_dn *pdn);
 void pseries_eeh_init_edev_recursive(struct pci_dn *pdn);
 #else
-static inline void pseries_eeh_add_device_early(struct pci_dn *pdn) { }
-static inline void pseries_eeh_add_device_tree_early(struct pci_dn *pdn) { }
+static inline void pseries_eeh_init_edev(struct pci_dn *pdn) { }
+static inline void pseries_eeh_init_edev_recursive(struct pci_dn *pdn) { }
 #endif
 
 #ifdef CONFIG_PPC64
-- 
2.17.1



Re: [PATCH v10 01/10] powerpc/mm: Implement set_memory() routines

2021-03-31 Thread Christophe Leroy




Le 31/03/2021 à 13:16, Michael Ellerman a écrit :

Hi Jordan,

A few nits below ...

Jordan Niethe  writes:

From: Russell Currey 

The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX,
and are generally useful primitives to have.  This implementation is
designed to be completely generic across powerpc's many MMUs.

It's possible that this could be optimised to be faster for specific
MMUs, but the focus is on having a generic and safe implementation for
now.

This implementation does not handle cases where the caller is attempting
to change the mapping of the page it is executing from, or if another
CPU is concurrently using the page being altered.  These cases likely
shouldn't happen, but a more complex implementation with MMU-specific code
could safely handle them, so that is left as a TODO for now.

On hash the linear mapping is not kept in the linux pagetable, so this
will not change the protection if used on that range. Currently these
functions are not used on the linear map so just WARN for now.

These functions do nothing if STRICT_KERNEL_RWX is not enabled.

Reviewed-by: Daniel Axtens 
Signed-off-by: Russell Currey 
Signed-off-by: Christophe Leroy 
[jpn: -rebase on next plus "powerpc/mm/64s: Allow STRICT_KERNEL_RWX again"
   - WARN on hash linear map]
Signed-off-by: Jordan Niethe 
---
v10: WARN if trying to change the hash linear map
---




This ↓ should have a comment explaining what it's doing:


+#ifdef CONFIG_PPC_BOOK3S_64
+   if (WARN_ON_ONCE(!radix_enabled() &&
+get_region_id(addr) == LINEAR_MAP_REGION_ID)) {
+   return -1;
+   }
+#endif


Maybe:

if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) &&
WARN_ON_ONCE(!radix_enabled() && get_region_id(addr) == 
LINEAR_MAP_REGION_ID)) {
return -1;
}


get_region_id() only exists for book3s/64 at the time being, and 
LINEAR_MAP_REGION_ID as well.




But then Aneesh pointed out that we should also block VMEMMAP_REGION_ID.

It might be better to just check for the permitted regions.

if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled()) {
int region = get_region_id(addr);

if (WARN_ON_ONCE(region != VMALLOC_REGION_ID && region != 
IO_REGION_ID))
return -1;
}


+
+   return apply_to_existing_page_range(&init_mm, start, sz,
+   change_page_attr, (void *)action);
+}



cheers



Re: [PATCH v2 3/7] powerpc: convert config files to generic cmdline

2021-03-31 Thread Will Deacon
On Tue, Mar 30, 2021 at 10:35:21AM -0700, Daniel Walker wrote:
> On Mon, Mar 29, 2021 at 11:07:51AM +0100, Will Deacon wrote:
> > On Thu, Mar 25, 2021 at 12:59:56PM -0700, Daniel Walker wrote:
> > > On Thu, Mar 25, 2021 at 01:03:55PM +0100, Christophe Leroy wrote:
> > > > 
> > > > Ok, so you agree we don't need to provide two CMDLINE, one to be 
> > > > appended and one to be prepended.
> > > > 
> > > > Let's only provide once CMDLINE as of today, and ask the user to select
> > > > whether he wants it appended or prepended or replacee. Then no need to
> > > > change all existing config to rename CONFIG_CMDLINE into either of the 
> > > > new
> > > > ones.
> > > > 
> > > > That's the main difference between my series and Daniel's series. So 
> > > > I'll
> > > > finish taking Will's comment into account and we'll send out a v3 soon.
> > > 
> > > It doesn't solve the needs of Cisco, I've stated many times your changes 
> > > have
> > > little value. Please stop submitting them.
> > 
> > FWIW, they're useful for arm64 and I will gladly review the updated series.
> > 
> > I don't think asking people to stop submitting patches is ever the right
> > answer. Please don't do that.
> 
> Why ? It's me nacking his series, is that not allowed anymore ?

If you're that way inclined then you can "nack" whatever you want, but
please allow the rest of us to continue reviewing the patches. You don't
have any basis on which to veto other people's contributions and so
demanding that somebody stops posting code is neither constructive nor
meaningful.

Will


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Will Deacon
On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:
> On 2021-03-30 14:58, Will Deacon wrote:
> > On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:
> > > On 2021-03-30 14:11, Will Deacon wrote:
> > > > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:
> > > > > From: Robin Murphy 
> > > > > 
> > > > > Instead make the global iommu_dma_strict parameter in iommu.c 
> > > > > canonical by
> > > > > exporting helpers to get and set it and use those directly in the 
> > > > > drivers.
> > > > > 
> > > > > This makes sure that the iommu.strict parameter also works for the AMD 
> > > > > and
> > > > > Intel IOMMU drivers on x86.  As those default to lazy flushing, a new
> > > > > IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
> > > > > represent the default if not overridden by an explicit parameter.
> > > > > 
> > > > > Signed-off-by: Robin Murphy .
> > > > > [ported on top of the other iommu_attr changes and added a few small
> > > > >missing bits]
> > > > > Signed-off-by: Christoph Hellwig 
> > > > > ---
> > > > >drivers/iommu/amd/iommu.c   | 23 +---
> > > > >drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
> > > > >drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
> > > > >drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
> > > > >drivers/iommu/dma-iommu.c   |  9 +--
> > > > >drivers/iommu/intel/iommu.c | 64 
> > > > > -
> > > > >drivers/iommu/iommu.c   | 27 ++---
> > > > >include/linux/iommu.h   |  4 +-
> > > > >8 files changed, 40 insertions(+), 165 deletions(-)
> > > > 
> > > > I really like this cleanup, but I can't help wonder if it's going in the
> > > > wrong direction. With SoCs often having multiple IOMMU instances and a
> > > > distinction between "trusted" and "untrusted" devices, then having the
> > > > flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
> > > > unreasonable to me, but this change makes it a global property.
> > > 
> > > The intent here was just to streamline the existing behaviour of stuffing 
> > > a
> > > global property into a domain attribute then pulling it out again in the
> > > illusion that it was in any way per-domain. We're still checking
> > > dev_is_untrusted() before making an actual decision, and it's not like we
> > > can't add more factors at that point if we want to.
> > 
> > Like I say, the cleanup is great. I'm just wondering whether there's a
> > better way to express the complicated logic to decide whether or not to use
> > the flush queue than what we end up with:
> > 
> > if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
> > domain->ops->flush_iotlb_all && !iommu_get_dma_strict())
> > 
> > which is mixing up globals, device properties and domain properties. The
> > result is that the driver code ends up just using the global to determine
> > whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
> > which is a departure from the current way of doing things.
> 
> But previously, SMMU only ever saw the global policy piped through the
> domain attribute by iommu_group_alloc_default_domain(), so there's no
> functional change there.

For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.

> Obviously some of the above checks could be factored out into some kind of
> iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
> to keep in sync. Or maybe we just allow iommu-dma to set
> IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
> treating that as a generic thing now.

I think a helper that takes a domain would be a good starting point.
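
Something along these lines, perhaps (a hypothetical sketch; the helper
name and the exact set of factors are illustrative):

static bool iommu_use_flush_queue(struct device *dev, struct iommu_domain *domain)
{
	/* the global strict mode always wins */
	if (iommu_get_dma_strict())
		return false;
	/* untrusted devices stay strict */
	if (dev && dev_is_untrusted(dev))
		return false;
	/* lazy invalidation needs a flush-all callback to drain the queue */
	return domain->ops->flush_iotlb_all != NULL;
}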

Will


Re: [PATCH v10 06/10] powerpc/mm/ptdump: debugfs handler for W+X checks at runtime

2021-03-31 Thread Michael Ellerman
Jordan Niethe  writes:
> From: Russell Currey 
>
> Optionally run W+X checks when dumping pagetable information to
> debugfs' kernel_page_tables.
>
> To use:
> $ echo 1 > /sys/kernel/debug/check_wx_pages
> $ cat /sys/kernel/debug/kernel_page_tables
>
> and check the kernel log.  Useful for testing strict module RWX.
>
> To disable W+X checks:
>   $ echo 0 > /sys/kernel/debug/check_wx_pages
>
> Update the Kconfig entry to reflect this.
>
> Also fix a typo.
>
> Reviewed-by: Kees Cook 
> Signed-off-by: Russell Currey 
> [jpn: Change check_wx_pages to act as mode bit affecting
>   kernel_page_tables instead of triggering action on its own]
> Signed-off-by: Jordan Niethe 
> ---
> v10: check_wx_pages now affects kernel_page_tables rather than triggering
>  its own action.

Hmm. I liked the old version better :)

I think you changed it based on Christophe's comment:

  Why not just perform the test every time someone dumps kernel_page_tables?


But I think he meant *always* do the check when someone dumps
kernel_page_tables, not have another file to enable checking and then
require someone to dump kernel_page_tables to do the actual check.

Still I like the previous version where you can do the checks
separately, without having to dump the page tables, because dumping can
sometimes take quite a while.

What would be even better is if ptdump_check_wx() returned an error when
W+X pages were found, and that was plumbed out to the debugfs file. That
way you can script around it.
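
For instance (a hypothetical sketch, assuming ptdump_check_wx() were
changed to return the number of W+X pages it found):

static int check_wx_get(void *data, u64 *val)
{
	*val = ptdump_check_wx();	/* hypothetical: count of W+X pages */
	return 0;
}
DEFINE_DEBUGFS_ATTRIBUTE(check_wx_fops, check_wx_get, NULL, "%llu\n");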

cheers


Re: [PATCH v10 01/10] powerpc/mm: Implement set_memory() routines

2021-03-31 Thread Michael Ellerman
Hi Jordan,

A few nits below ...

Jordan Niethe  writes:
> From: Russell Currey 
>
> The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX,
> and are generally useful primitives to have.  This implementation is
> designed to be completely generic across powerpc's many MMUs.
>
> It's possible that this could be optimised to be faster for specific
> MMUs, but the focus is on having a generic and safe implementation for
> now.
>
> This implementation does not handle cases where the caller is attempting
> to change the mapping of the page it is executing from, or if another
> CPU is concurrently using the page being altered.  These cases likely
> shouldn't happen, but a more complex implementation with MMU-specific code
> could safely handle them, so that is left as a TODO for now.
>
> On hash the linear mapping is not kept in the linux pagetable, so this
> will not change the protection if used on that range. Currently these
> functions are not used on the linear map so just WARN for now.
>
> These functions do nothing if STRICT_KERNEL_RWX is not enabled.
>
> Reviewed-by: Daniel Axtens 
> Signed-off-by: Russell Currey 
> Signed-off-by: Christophe Leroy 
> [jpn: -rebase on next plus "powerpc/mm/64s: Allow STRICT_KERNEL_RWX again"
>   - WARN on hash linear map]
> Signed-off-by: Jordan Niethe 
> ---
> v10: WARN if trying to change the hash linear map
> ---
>  arch/powerpc/Kconfig  |  1 +
>  arch/powerpc/include/asm/set_memory.h | 32 ++
>  arch/powerpc/mm/Makefile  |  2 +-
>  arch/powerpc/mm/pageattr.c| 88 +++
>  4 files changed, 122 insertions(+), 1 deletion(-)
>  create mode 100644 arch/powerpc/include/asm/set_memory.h
>  create mode 100644 arch/powerpc/mm/pageattr.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index fc7f5c5933e6..4498a27ac9db 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -135,6 +135,7 @@ config PPC
>   select ARCH_HAS_MEMBARRIER_CALLBACKS
>   select ARCH_HAS_MEMBARRIER_SYNC_CORE
>   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
> && PPC_BOOK3S_64
> + select ARCH_HAS_SET_MEMORY

Below you do:

if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
return 0;

Which suggests we should instead only select ARCH_HAS_SET_MEMORY if
STRICT_KERNEL_RWX?


> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
> index 3b4e9e4e25ea..d8a08abde1ae 100644
> --- a/arch/powerpc/mm/Makefile
> +++ b/arch/powerpc/mm/Makefile
> @@ -5,7 +5,7 @@
>  
>  ccflags-$(CONFIG_PPC64)  := $(NO_MINIMAL_TOC)
>  
> -obj-y:= fault.o mem.o pgtable.o mmap.o 
> maccess.o \
> +obj-y:= fault.o mem.o pgtable.o mmap.o 
> maccess.o pageattr.o \

.. and then the file should only be built if ARCH_HAS_SET_MEMORY = y.

>  init_$(BITS).o pgtable_$(BITS).o \
>  pgtable-frag.o ioremap.o ioremap_$(BITS).o \
>  init-common.o mmu_context.o drmem.o
> diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
> new file mode 100644
> index ..9efcb01088da
> --- /dev/null
> +++ b/arch/powerpc/mm/pageattr.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * MMU-generic set_memory implementation for powerpc
> + *
> + * Copyright 2019, IBM Corporation.

Should be 2019-2021.

> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +
> +/*
> + * Updates the attributes of a page in three steps:
> + *
> + * 1. invalidate the page table entry
> + * 2. flush the TLB
> + * 3. install the new entry with the updated attributes
> + *
> + * This is unsafe if the caller is attempting to change the mapping of the
> + * page it is executing from, or if another CPU is concurrently using the
> + * page being altered.

Is the 2nd part of that statement true?

Or, I guess maybe it is true depending on what "unsafe" means.

AIUI it's unsafe to use this on the page you're executing from, and by
unsafe we mean the kernel will potentially crash because it will lose
the mapping for the currently executing text.

Using this on a page that another CPU is accessing could be safe, if eg.
the other CPU is reading from the page and we are just changing it from
RW->RO.

So I'm not sure they're the same type of "unsafe".

> + * TODO make the implementation resistant to this.
> + *
> + * NOTE: can be dangerous to call without STRICT_KERNEL_RWX

I don't think we need that anymore?

> + */
> +static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
> +{
> + long action = (long)data;
> + pte_t pte;
> +
> + spin_lock(&init_mm.page_table_lock);
> +
> + /* invalidate the PTE so it's safe to modify */
> + pte = ptep_get_and_clear(&init_mm, addr, ptep);
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +
> + /* 
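
The hunk above is truncated here; the remaining steps implied by the
three-step comment would look roughly like this (a sketch only, assuming
the SET_MEMORY_* action values defined by the set_memory.h header this
patch adds; details may differ from the actual patch):

	/* modify the PTE bits as desired, then install the new entry */
	switch (action) {
	case SET_MEMORY_RO:
		pte = pte_wrprotect(pte);
		break;
	case SET_MEMORY_RW:
		pte = pte_mkwrite(pte);
		break;
	case SET_MEMORY_NX:
		pte = pte_exprotect(pte);
		break;
	case SET_MEMORY_X:
		pte = pte_mkexec(pte);
		break;
	}

	set_pte_at(&init_mm, addr, ptep, pte);
	spin_unlock(&init_mm.page_table_lock);
	return 0;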

Re: [PATCH v10 05/10] powerpc/bpf: Write protect JIT code

2021-03-31 Thread Christophe Leroy




On 31/03/2021 at 12:37, Michael Ellerman wrote:

Jordan Niethe  writes:


Once CONFIG_STRICT_MODULE_RWX is enabled there will be no need to
override bpf_jit_free() because it is now possible to set images
read-only. So use the default implementation.

Also add the necessary call to bpf_jit_binary_lock_ro() which will
remove write protection and add exec protection to the JIT image after
it has finished being written.

Signed-off-by: Jordan Niethe 
---
v10: New to series
---
  arch/powerpc/net/bpf_jit_comp.c   | 5 -
  arch/powerpc/net/bpf_jit_comp64.c | 4 
  2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index e809cb5a1631..8015e4a7d2d4 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -659,12 +659,15 @@ void bpf_jit_compile(struct bpf_prog *fp)
bpf_jit_dump(flen, proglen, pass, code_base);
  
  	bpf_flush_icache(code_base, code_base + (proglen/4));

-
  #ifdef CONFIG_PPC64
/* Function descriptor nastiness: Address + TOC */
((u64 *)image)[0] = (u64)code_base;
((u64 *)image)[1] = local_paca->kernel_toc;
  #endif
+   if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX)) {
+   set_memory_ro((unsigned long)image, alloclen >> PAGE_SHIFT);
+   set_memory_x((unsigned long)image, alloclen >> PAGE_SHIFT);
+   }


You don't need to check the ifdef in a caller, there are stubs that
compile to nothing when CONFIG_ARCH_HAS_SET_MEMORY=n.


I was about to make the same comment, but

CONFIG_STRICT_MODULE_RWX is not CONFIG_ARCH_HAS_SET_MEMORY
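
For reference, the stubs in question are keyed off
CONFIG_ARCH_HAS_SET_MEMORY in include/linux/set_memory.h, roughly along
these lines (paraphrased from memory, not an exact quote of the header):

	#ifdef CONFIG_ARCH_HAS_SET_MEMORY
	int set_memory_ro(unsigned long addr, int numpages);
	int set_memory_rw(unsigned long addr, int numpages);
	int set_memory_x(unsigned long addr, int numpages);
	int set_memory_nx(unsigned long addr, int numpages);
	#else
	static inline int set_memory_ro(unsigned long addr, int numpages) { return 0; }
	static inline int set_memory_rw(unsigned long addr, int numpages) { return 0; }
	static inline int set_memory_x(unsigned long addr, int numpages) { return 0; }
	static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
	#endif

So the calls only compile to nothing when ARCH_HAS_SET_MEMORY is unset,
which is a different condition from STRICT_MODULE_RWX.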




diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index aaf1a887f653..1484ad588685 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -1240,6 +1240,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
fp->jited_len = alloclen;
  
  	bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));

+   if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
+   bpf_jit_binary_lock_ro(bpf_hdr);


Do we need the ifdef here either? Looks like it should be safe to call
due to the stubs.


Same




@@ -1262,6 +1264,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
  }
  
  /* Overriding bpf_jit_free() as we don't set images read-only. */

+#ifndef CONFIG_STRICT_MODULE_RWX


Did you test without this and notice something broken?

Looking at the generic version I can't tell why we need to override
this. Maybe we don't (anymore?) ?

cheers


  void bpf_jit_free(struct bpf_prog *fp)
  {
unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
@@ -1272,3 +1275,4 @@ void bpf_jit_free(struct bpf_prog *fp)
  
  	bpf_prog_unlock_free(fp);

  }
+#endif
--
2.25.1


Re: [PATCH v10 05/10] powerpc/bpf: Write protect JIT code

2021-03-31 Thread Michael Ellerman
Jordan Niethe  writes:

> Once CONFIG_STRICT_MODULE_RWX is enabled there will be no need to
> override bpf_jit_free() because it is now possible to set images
> read-only. So use the default implementation.
>
> Also add the necessary call to bpf_jit_binary_lock_ro() which will
> remove write protection and add exec protection to the JIT image after
> it has finished being written.
>
> Signed-off-by: Jordan Niethe 
> ---
> v10: New to series
> ---
>  arch/powerpc/net/bpf_jit_comp.c   | 5 -
>  arch/powerpc/net/bpf_jit_comp64.c | 4 
>  2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index e809cb5a1631..8015e4a7d2d4 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c
> @@ -659,12 +659,15 @@ void bpf_jit_compile(struct bpf_prog *fp)
>   bpf_jit_dump(flen, proglen, pass, code_base);
>  
>   bpf_flush_icache(code_base, code_base + (proglen/4));
> -
>  #ifdef CONFIG_PPC64
>   /* Function descriptor nastiness: Address + TOC */
>   ((u64 *)image)[0] = (u64)code_base;
>   ((u64 *)image)[1] = local_paca->kernel_toc;
>  #endif
> + if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX)) {
> + set_memory_ro((unsigned long)image, alloclen >> PAGE_SHIFT);
> + set_memory_x((unsigned long)image, alloclen >> PAGE_SHIFT);
> + }

You don't need to check the ifdef in a caller, there are stubs that
compile to nothing when CONFIG_ARCH_HAS_SET_MEMORY=n.

> diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> b/arch/powerpc/net/bpf_jit_comp64.c
> index aaf1a887f653..1484ad588685 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -1240,6 +1240,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
> *fp)
>   fp->jited_len = alloclen;
>  
>   bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
> + if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
> + bpf_jit_binary_lock_ro(bpf_hdr);

Do we need the ifdef here either? Looks like it should be safe to call
due to the stubs.
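
For context, bpf_jit_binary_lock_ro() itself is a small inline in
include/linux/filter.h that just applies the set_memory helpers across
the JIT image; roughly, from memory:

	static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
	{
		set_memory_ro((unsigned long)hdr, hdr->pages);
		set_memory_x((unsigned long)hdr, hdr->pages);
	}

so with the no-op stubs it collapses to nothing as well.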

> @@ -1262,6 +1264,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
> *fp)
>  }
>  
>  /* Overriding bpf_jit_free() as we don't set images read-only. */
> +#ifndef CONFIG_STRICT_MODULE_RWX

Did you test without this and notice something broken?

Looking at the generic version I can't tell why we need to override
this. Maybe we don't (anymore?) ?

cheers

>  void bpf_jit_free(struct bpf_prog *fp)
>  {
>   unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
> @@ -1272,3 +1275,4 @@ void bpf_jit_free(struct bpf_prog *fp)
>  
>   bpf_prog_unlock_free(fp);
>  }
> +#endif
> -- 
> 2.25.1


Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-03-31 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:
> Shivaprasad G Bhat  writes:
>
>> Add support for ND_REGION_ASYNC capability if the device tree
>> indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
>> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
>>
>> If the flush request fails, the hypervisor is expected to
>> reflect the problem in the subsequent nvdimm H_SCM_HEALTH call.
>>
>> This patch prevents mmap of namespaces with MAP_SYNC flag if the
>> nvdimm requires an explicit flush[1].
>>
>> References:
>> [1] 
>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c
>
>
> Reviewed-by: Aneesh Kumar K.V 

Do we need an ack from nvdimm folks on this?

Or is it entirely powerpc internal (seems like it from the diffstat)?

cheers


Re: [PATCH] powerpc/vdso: Separate vvar vma from vdso

2021-03-31 Thread Michael Ellerman
Christophe Leroy  writes:
> On 26/03/2021 at 20:17, Dmitry Safonov wrote:
>> Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front")
>> the VVAR page is in front of the VDSO area. As a result, it breaks CRIU
>> (Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]"
>> from /proc/../maps points at ELF/vdso image, rather than at VVAR data page.
>> Laurent made a patch to keep CRIU working (by reading aux vector).
>> But I think it still makes sense to separate the two mappings into different
>> VMAs. It will also make ppc64 less "special" for userspace and, as a
>> side-bonus, will make the VVAR page un-writable by debuggers (previously a
>> debugger write would COW the page, which can be unexpected).
>> 
>> I opportunistically Cc stable on it: I understand that usually such
>> stuff isn't stable material, but it will allow us in CRIU to drop a
>> workaround that is needed just for one release (v5.11) on one
>> platform (ppc64), and which we would otherwise have to maintain.
>> I wouldn't go as far as to say that the commit 511157ab641e is ABI
>> regression as no other userspace got broken, but I'd really appreciate
>> if it gets backported to v5.11 after v5.12 is released, so as not
>> to complicate already non-simple CRIU-vdso code. Thanks!
>> 
>> Cc: Andrei Vagin 
>> Cc: Andy Lutomirski 
>> Cc: Benjamin Herrenschmidt 
>> Cc: Christophe Leroy 
>> Cc: Laurent Dufour 
>> Cc: Michael Ellerman 
>> Cc: Paul Mackerras 
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: sta...@vger.kernel.org # v5.11
>> [1]: https://github.com/checkpoint-restore/criu/issues/1417
>> Signed-off-by: Dmitry Safonov 
>> Tested-by: Christophe Leroy 
>> ---
>>   arch/powerpc/include/asm/mmu_context.h |  2 +-
>>   arch/powerpc/kernel/vdso.c | 54 +++---
>>   2 files changed, 40 insertions(+), 16 deletions(-)
>> 
>
>> @@ -133,7 +135,13 @@ static int __arch_setup_additional_pages(struct 
>> linux_binprm *bprm, int uses_int
>>   * install_special_mapping or the perf counter mmap tracking code
>>   * will fail to recognise it as a vDSO.
>>   */
>> -mm->context.vdso = (void __user *)vdso_base + PAGE_SIZE;
>> +mm->context.vdso = (void __user *)vdso_base + vvar_size;
>> +
>> +vma = _install_special_mapping(mm, vdso_base, vvar_size,
>> +   VM_READ | VM_MAYREAD | VM_IO |
>> +   VM_DONTDUMP | VM_PFNMAP, &vvar_spec);
>> +if (IS_ERR(vma))
>> +return PTR_ERR(vma);
>>   
>>  /*
>>   * our vma flags don't have VM_WRITE so by default, the process isn't
>
>
> IIUC, VM_PFNMAP is for when we have a vvar_fault handler.

Some of the other flags seem odd too.
eg. VM_IO ? VM_DONTDUMP ?
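
VM_PFNMAP normally pairs with a .fault handler on the special mapping,
which hands out the page on demand. A minimal sketch of that pattern,
modelled on what other architectures do for their vvar page; the body
here is illustrative (it assumes a single vdso_data page) and is not
taken from this patch:

	static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
				     struct vm_area_struct *vma,
				     struct vm_fault *vmf)
	{
		/* hand out the PFN of the vvar/vdso data page on demand */
		return vmf_insert_pfn(vma, vmf->address,
				      virt_to_pfn(vdso_data));
	}

	static const struct vm_special_mapping vvar_spec = {
		.name  = "[vvar]",
		.fault = vvar_fault,
	};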


cheers


Re: [PATCH v2] powerpc/traps: Enhance readability for trap types

2021-03-31 Thread Michael Ellerman
Xiongwei Song  writes:
> From: Xiongwei Song 
>
> Create a new header named traps.h, define macros there to list the ppc
> exception types, and replace references to the raw trap values with
> these macros.

Personally I find the hex values easier to recognise, but I realise
that's probably not true of other people :)

...
> diff --git a/arch/powerpc/include/asm/traps.h 
> b/arch/powerpc/include/asm/traps.h
> new file mode 100644
> index ..a31b6122de23
> --- /dev/null
> +++ b/arch/powerpc/include/asm/traps.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_PPC_TRAPS_H
> +#define _ASM_PPC_TRAPS_H
> +
> +#define TRAP_RESET   0x100 /* System reset */
> +#define TRAP_MCE 0x200 /* Machine check */
> +#define TRAP_DSI 0x300 /* Data storage */
> +#define TRAP_DSEGI   0x380 /* Data segment */
> +#define TRAP_ISI 0x400 /* Instruction storage */
> +#define TRAP_ISEGI   0x480 /* Instruction segment */
> +#define TRAP_ALIGN   0x600 /* Alignment */
> +#define TRAP_PROG0x700 /* Program */
> +#define TRAP_DEC 0x900 /* Decrementer */
> +#define TRAP_SYSCALL 0xc00 /* System call */
> +#define TRAP_TRACEI  0xd00 /* Trace */
> +#define TRAP_FPA 0xe00 /* Floating-point Assist */
> +#define TRAP_PMI 0xf00 /* Performance monitor */

I know the macro is called TRAP and the field in pt_regs is called trap,
but the terminology in the architecture is "exception", and we already
have many uses of that. In particular we have a lot of uses of "exc" as
an abbreviation for "exception". So I think I'd rather we use that than
"TRAP".

I think we should probably use the names from the ISA, unless they are
really over long.

Which are:

  0x100   System Reset
  0x200   Machine Check
  0x300   Data Storage
  0x380   Data Segment
  0x400   Instruction Storage
  0x480   Instruction Segment
  0x500   External
  0x600   Alignment
  0x700   Program
  0x800   Floating-Point Unavailable
  0x900   Decrementer
  0x980   Hypervisor Decrementer
  0xA00   Directed Privileged Doorbell
  0xC00   System Call
  0xD00   Trace
  0xE00   Hypervisor Data Storage
  0xE20   Hypervisor Instruction Storage
  0xE40   Hypervisor Emulation Assistance
  0xE60   Hypervisor Maintenance
  0xE80   Directed Hypervisor Doorbell
  0xEA0   Hypervisor Virtualization
  0xF00   Performance Monitor
  0xF20   Vector Unavailable
  0xF40   VSX Unavailable
  0xF60   Facility Unavailable
  0xF80   Hypervisor Facility Unavailable
  0xFA0   Directed Ultravisor Doorbell


So perhaps:

  EXC_SYSTEM_RESET
  EXC_MACHINE_CHECK
  EXC_DATA_STORAGE
  EXC_DATA_SEGMENT
  EXC_INST_STORAGE
  EXC_INST_SEGMENT
  EXC_EXTERNAL_INTERRUPT
  EXC_ALIGNMENT
  EXC_PROGRAM_CHECK
  EXC_FP_UNAVAILABLE
  EXC_DECREMENTER
  EXC_HV_DECREMENTER
  EXC_SYSTEM_CALL
  EXC_HV_DATA_STORAGE
  EXC_PERF_MONITOR
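
Spelled out as a header, using the ISA values from the table above
(purely illustrative, just the proposal written down):

	#define EXC_SYSTEM_RESET	0x100
	#define EXC_MACHINE_CHECK	0x200
	#define EXC_DATA_STORAGE	0x300
	#define EXC_DATA_SEGMENT	0x380
	#define EXC_INST_STORAGE	0x400
	#define EXC_INST_SEGMENT	0x480
	#define EXC_EXTERNAL_INTERRUPT	0x500
	#define EXC_ALIGNMENT		0x600
	#define EXC_PROGRAM_CHECK	0x700
	#define EXC_FP_UNAVAILABLE	0x800
	#define EXC_DECREMENTER		0x900
	#define EXC_HV_DECREMENTER	0x980
	#define EXC_SYSTEM_CALL		0xc00
	#define EXC_HV_DATA_STORAGE	0xe00
	#define EXC_PERF_MONITOR	0xf00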


cheers


Re: [PATCH v2] mm: Move mem_init_print_info() into mm_init()

2021-03-31 Thread Russell King - ARM Linux admin
On Wed, Mar 17, 2021 at 09:52:10AM +0800, Kefeng Wang wrote:
> mem_init_print_info() is called in mem_init() on each architecture,
> always with a NULL argument, so make it take no arguments and move it into mm_init().
> 
> Acked-by: Dave Hansen 
> Signed-off-by: Kefeng Wang 

Acked-by: Russell King  # for arm bits
-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [PATCH v2] mm: Move mem_init_print_info() into mm_init()

2021-03-31 Thread Mike Rapoport
On Wed, Mar 17, 2021 at 09:52:10AM +0800, Kefeng Wang wrote:
> mem_init_print_info() is called in mem_init() on each architecture,
> always with a NULL argument, so make it take no arguments and move it into mm_init().
> 
> Acked-by: Dave Hansen 
> Signed-off-by: Kefeng Wang 

Acked-by: Mike Rapoport 

> ---
> v2:
> - Cleanup 'str' line suggested by Christophe and ACK
> 


[PATCH -next] powerpc: Remove duplicated include from time.c

2021-03-31 Thread Qiheng Lin
Remove duplicated include.

Reported-by: Hulk Robot 
Signed-off-by: Qiheng Lin 
---
 arch/powerpc/kernel/time.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b67d93a609a2..2c8762002b21 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -53,7 +53,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 



Re: WARNING: CPU: 0 PID: 1 at arch/powerpc/lib/feature-fixups.c:109 do_feature_fixups+0xb0/0xf0

2021-03-31 Thread Christophe Leroy

Hi Paul,

On 30/03/2021 at 12:37, Paul Menzel wrote:

Dear Linux folks,


On the POWER8 system IBM S822LC, Linux 5.12-rc5+ logs the warning below.

```
[    0.724118] Unable to patch feature section at (ptrval) - (ptrval) with 
(ptrval) - (ptrval)

[    0.724185] pstore: Registered nvram as persistent store backend
```

Please find the output of `dmesg` attached.




Did you do a 'make clean' before building?

If not, can you try patch 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/8bb015bc98c51d8ced581415b7e3d157e18da7c9.1617181918.git.christophe.le...@csgroup.eu/


Thanks
Christophe


[PATCH] powerpc/vdso: Make sure vdso_wrapper.o is rebuilt everytime vdso.so is rebuilt

2021-03-31 Thread Christophe Leroy
Commit bce74491c300 ("powerpc/vdso: fix unnecessary rebuilds of
vgettimeofday.o") moved vdso32_wrapper.o and vdso64_wrapper.o out
of arch/powerpc/kernel/vdso[32/64]/ and removed the dependencies in
the Makefile. This leads to the wrappers not being rebuilt, hence the
kernel embedding the old vdso library.

Add back missing dependencies to ensure vdso32_wrapper.o and vdso64_wrapper.o
are rebuilt when vdso32.so.dbg and vdso64.so.dbg are changed.

Fixes: bce74491c300 ("powerpc/vdso: fix unnecessary rebuilds of 
vgettimeofday.o")
Cc: sta...@vger.kernel.org
Cc: Masahiro Yamada 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/Makefile | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 6084fa499aa3..f66b63e81c3b 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -191,3 +191,7 @@ $(obj)/prom_init_check: $(src)/prom_init_check.sh 
$(obj)/prom_init.o FORCE
 targets += prom_init_check
 
 clean-files := vmlinux.lds
+
+# Force dependency (incbin is bad)
+$(obj)/vdso32_wrapper.o : $(obj)/vdso32/vdso32.so.dbg
+$(obj)/vdso64_wrapper.o : $(obj)/vdso64/vdso64.so.dbg
-- 
2.25.0



Re: [PATCH 1/2] powerpc/vdso: fix unnecessary rebuilds of vgettimeofday.o

2021-03-31 Thread Christophe Leroy




On 28/01/2021 at 05:01, Masahiro Yamada wrote:

On Thu, Dec 24, 2020 at 2:12 AM Masahiro Yamada  wrote:


vgettimeofday.o is unnecessarily rebuilt. Adding it to 'targets' is not
enough to fix the issue. Kbuild is correctly rebuilding it because the
command line is changed.

PowerPC builds each vdso directory twice; first in vdso_prepare to
generate vdso{32,64}-offsets.h, second as part of the ordinary build
process to embed vdso{32,64}.so.dbg into the kernel.

The problem shows up when CONFIG_PPC_WERROR=y due to the following line
in arch/powerpc/Kbuild:

   subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror

In the preparation stage, Kbuild directly visits the vdso directories,
hence it does not inherit subdir-ccflags-y. In the second descend,
Kbuild adds -Werror, which results in the command line flipping
with/without -Werror.

This implies a potential danger: if a more critical flag flipped in the
same way and impacted the resulting vdso, the offsets recorded in the
headers might differ from the real offsets in the embedded vdso images.

Removing the unneeded second descend solves the problem.

Link: https://lore.kernel.org/linuxppc-dev/87tuslxhry@mpe.ellerman.id.au/
Reported-by: Michael Ellerman 
Signed-off-by: Masahiro Yamada 
---



Michael, please take a look at this.

The unneeded rebuild problem still remains.


Seems like with this patch, vdso32_wrapper.o is not rebuilt when vdso32.so.dbg 
is rebuilt.

Leading to ... disasters.

I'll send a patch







  arch/powerpc/kernel/Makefile  | 4 ++--
  arch/powerpc/kernel/vdso32/Makefile   | 5 +
  arch/powerpc/kernel/{vdso32 => }/vdso32_wrapper.S | 0
  arch/powerpc/kernel/vdso64/Makefile   | 6 +-
  arch/powerpc/kernel/{vdso64 => }/vdso64_wrapper.S | 0
  5 files changed, 4 insertions(+), 11 deletions(-)
  rename arch/powerpc/kernel/{vdso32 => }/vdso32_wrapper.S (100%)
  rename arch/powerpc/kernel/{vdso64 => }/vdso64_wrapper.S (100%)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index fe2ef598e2ea..79ee7750937d 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -51,7 +51,7 @@ obj-y += ptrace/
  obj-$(CONFIG_PPC64)+= setup_64.o \
paca.o nvram_64.o note.o syscall_64.o
  obj-$(CONFIG_COMPAT)   += sys_ppc32.o signal_32.o
-obj-$(CONFIG_VDSO32)   += vdso32/
+obj-$(CONFIG_VDSO32)   += vdso32_wrapper.o
  obj-$(CONFIG_PPC_WATCHDOG) += watchdog.o
  obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += hw_breakpoint.o
  obj-$(CONFIG_PPC_DAWR) += dawr.o
@@ -60,7 +60,7 @@ obj-$(CONFIG_PPC_BOOK3S_64)   += cpu_setup_power.o
  obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o
  obj-$(CONFIG_PPC_BOOK3E_64)+= exceptions-64e.o idle_book3e.o
  obj-$(CONFIG_PPC_BARRIER_NOSPEC) += security.o
-obj-$(CONFIG_PPC64)+= vdso64/
+obj-$(CONFIG_PPC64)+= vdso64_wrapper.o
  obj-$(CONFIG_ALTIVEC)  += vecemu.o
  obj-$(CONFIG_PPC_BOOK3S_IDLE)  += idle_book3s.o
  procfs-y   := proc_powerpc.o
diff --git a/arch/powerpc/kernel/vdso32/Makefile 
b/arch/powerpc/kernel/vdso32/Makefile
index 59aa2944ecae..42fc3de89b39 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -30,7 +30,7 @@ CC32FLAGS += -m32
  KBUILD_CFLAGS := $(filter-out -mcmodel=medium,$(KBUILD_CFLAGS))
  endif

-targets := $(obj-vdso32) vdso32.so.dbg
+targets := $(obj-vdso32) vdso32.so.dbg vgettimeofday.o
  obj-vdso32 := $(addprefix $(obj)/, $(obj-vdso32))

  GCOV_PROFILE := n
@@ -46,9 +46,6 @@ obj-y += vdso32_wrapper.o
  targets += vdso32.lds
  CPPFLAGS_vdso32.lds += -P -C -Upowerpc

-# Force dependency (incbin is bad)
-$(obj)/vdso32_wrapper.o : $(obj)/vdso32.so.dbg
-
  # link rule for the .so file, .lds has to be first
  $(obj)/vdso32.so.dbg: $(src)/vdso32.lds $(obj-vdso32) $(obj)/vgettimeofday.o 
FORCE
 $(call if_changed,vdso32ld_and_check)
diff --git a/arch/powerpc/kernel/vdso32/vdso32_wrapper.S 
b/arch/powerpc/kernel/vdso32_wrapper.S
similarity index 100%
rename from arch/powerpc/kernel/vdso32/vdso32_wrapper.S
rename to arch/powerpc/kernel/vdso32_wrapper.S
diff --git a/arch/powerpc/kernel/vdso64/Makefile 
b/arch/powerpc/kernel/vdso64/Makefile
index d365810a689a..b50b39fedf74 100644
--- a/arch/powerpc/kernel/vdso64/Makefile
+++ b/arch/powerpc/kernel/vdso64/Makefile
@@ -17,7 +17,7 @@ endif

  # Build rules

-targets := $(obj-vdso64) vdso64.so.dbg
+targets := $(obj-vdso64) vdso64.so.dbg vgettimeofday.o
  obj-vdso64 := $(addprefix $(obj)/, $(obj-vdso64))

  GCOV_PROFILE := n
@@ -29,15 +29,11 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
 -Wl,-soname=linux-vdso64.so.1 -Wl,--hash-style=both
  asflags-y := -D__VDSO64__ -s

-obj-y += vdso64_wrapper.o
  targets += vdso64.lds
  CPPFLAGS_vdso64.lds += -P -C -U$(ARCH)

  $(obj)/vgettimeofday.o: %.o: %.c FORCE

-# Force dependency (incbin is bad)
-$(obj)/vdso64_wrapper.o 

Re: [PATCH] PCI: Try to find two continuous regions for child resource

2021-03-31 Thread Kai-Heng Feng
On Tue, Mar 30, 2021 at 12:23 AM Bjorn Helgaas  wrote:
>
> On Mon, Mar 29, 2021 at 04:47:59PM +0800, Kai-Heng Feng wrote:
> > Built-in grahpics on HP EliteDesk 805 G6 doesn't work because graphics
> > can't get the BAR it needs:
> > [0.611504] pci_bus :00: root bus resource [mem 
> > 0x1002020-0x100303f window]
> > [0.611505] pci_bus :00: root bus resource [mem 
> > 0x1003040-0x100401f window]
> > ...
> > [0.638083] pci :00:08.1:   bridge window [mem 0xd200-0xd23f]
> > [0.638086] pci :00:08.1:   bridge window [mem 
> > 0x1003000-0x100401f 64bit pref]
> > [0.962086] pci :00:08.1: can't claim BAR 15 [mem 
> > 0x1003000-0x100401f 64bit pref]: no compatible bridge window
> > [0.962086] pci :00:08.1: [mem 0x1003000-0x100401f 64bit 
> > pref] clipped to [mem 0x1003000-0x100303f 64bit pref]
> > [0.962086] pci :00:08.1:   bridge window [mem 
> > 0x1003000-0x100303f 64bit pref]
> > [0.962086] pci :07:00.0: can't claim BAR 0 [mem 
> > 0x1003000-0x1003fff 64bit pref]: no compatible bridge window
> > [0.962086] pci :07:00.0: can't claim BAR 2 [mem 
> > 0x1004000-0x100401f 64bit pref]: no compatible bridge window
> >
> > However, the root bus has two continuous regions that can contain the
> > child resource requested.
> >
> > So try to find another parent region if two regions are continuous and
> > can contain child resource. This change makes the grahpics works on the
> > system in question.
>
> The BIOS description of PCI0 is interesting:
>
>   pci_bus :00: root bus resource [mem 0x100-0x100201f window]
>   pci_bus :00: root bus resource [mem 0x1002020-0x100303f window]
>   pci_bus :00: root bus resource [mem 0x1003040-0x100401f window]
>
> So the PCI0 _CRS apparently gave us:
>
>   [mem 0x100-0x100201f] size 0x2020 (512MB + 2MB)
>   [mem 0x1002020-0x100303f] size 0x1020 (256MB + 2MB)
>   [mem 0x1003040-0x100401f] size 0x0fe0 (254MB)
>
> These are all contiguous, so we'd have no problem if we coalesced them
> into a single window:
>
>   [mem 0x100-0x100401f window] size 0x4020 (1GB + 2MB)
>
> I think we currently keep these root bus resources separate because if
> we ever support _SRS for host bridges, the argument we give to _SRS
> must be exactly the same format as what we got from _CRS (see ACPI
> v6.3, sec 6.2.16, and pnpacpi_set_resources()).
>
> pnpacpi_encode_resources() is currently very simple-minded and copies
> each device resource back into a single _SRS entry.  But (1) we don't
> support _SRS for host bridges, and (2) if we ever do, we can make
> pnpacpi_encode_resources() smarter so it breaks things back up.
>
> So I think we should try to fix this by coalescing these adjacent
> resources from _CRS so we end up with a single root bus resource that
> covers all contiguous regions.

Thanks for the tip! Working on v2 patch.
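
For what it's worth, a rough sketch of the coalescing idea (purely
illustrative, not the actual v2 patch; the helper name is made up):

	/* Merge host bridge windows whose ranges are contiguous. */
	static void coalesce_contiguous_windows(struct list_head *windows)
	{
		struct resource_entry *entry, *next;

		resource_list_for_each_entry_safe(entry, next, windows) {
			if (&next->node == windows)
				break;
			if (resource_type(entry->res) != resource_type(next->res))
				continue;
			if (entry->res->end + 1 == next->res->start) {
				/* grow the later window backwards, drop the earlier one */
				next->res->start = entry->res->start;
				resource_list_destroy_entry(entry);
			}
		}
	}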

>
> Typos, etc:
>   - No need for the timestamps; they're not relevant to the problem.
>   - s/grahpics/graphics/ (two occurrences above)
>   - s/continuous/contiguous/ (three occurrences above)

Will also update those in v2.

Kai-Heng

>
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212013
> > Signed-off-by: Kai-Heng Feng 
> > ---
> >  arch/microblaze/pci/pci-common.c |  4 +--
> >  arch/powerpc/kernel/pci-common.c |  8 ++---
> >  arch/sparc/kernel/pci.c  |  4 +--
> >  drivers/pci/pci.c| 60 +++-
> >  drivers/pci/setup-res.c  | 21 +++
> >  drivers/pcmcia/rsrc_nonstatic.c  |  4 +--
> >  include/linux/pci.h  |  6 ++--
> >  7 files changed, 80 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/microblaze/pci/pci-common.c 
> > b/arch/microblaze/pci/pci-common.c
> > index 557585f1be41..8e65832fb510 100644
> > --- a/arch/microblaze/pci/pci-common.c
> > +++ b/arch/microblaze/pci/pci-common.c
> > @@ -669,7 +669,7 @@ static void pcibios_allocate_bus_resources(struct 
> > pci_bus *bus)
> >  {
> >   struct pci_bus *b;
> >   int i;
> > - struct resource *res, *pr;
> > + struct resource *res, *pr = NULL;
> >
> >   pr_debug("PCI: Allocating bus resources for %04x:%02x...\n",
> >pci_domain_nr(bus), bus->number);
> > @@ -688,7 +688,7 @@ static void pcibios_allocate_bus_resources(struct 
> > pci_bus *bus)
> >* and as such ensure proper re-allocation
> >* later.
> >*/
> > - pr = pci_find_parent_resource(bus->self, res);
> > + pci_find_parent_resource(bus->self, res, &pr, NULL);
> >   if (pr == res) {
> >   /* this happens when the generic PCI
> >* code (wrongly) decides that this
> > diff --git a/arch/powerpc/kernel/pci-common.c 
> > 

Re: [PATCH printk v2 2/5] printk: remove safe buffers

2021-03-31 Thread John Ogness
On 2021-03-30, John Ogness  wrote:
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index e971c0a9ec9e..f090d6a1b39e 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1772,16 +1759,21 @@ static struct task_struct *console_owner;
>  static bool console_waiter;
>  
>  /**
> - * console_lock_spinning_enable - mark beginning of code where another
> + * console_lock_spinning_enable_irqsave - mark beginning of code where 
> another
>   *   thread might safely busy wait
>   *
>   * This basically converts console_lock into a spinlock. This marks
>   * the section where the console_lock owner can not sleep, because
>   * there may be a waiter spinning (like a spinlock). Also it must be
>   * ready to hand over the lock at the end of the section.
> + *
> + * This disables interrupts because the hand over to a waiter must not be
> + * interrupted until the hand over is completed (@console_waiter is cleared).
>   */
> -static void console_lock_spinning_enable(void)
> +static void console_lock_spinning_enable_irqsave(unsigned long *flags)

I missed the prototype change for the !CONFIG_PRINTK case, resulting in:

linux/kernel/printk/printk.c:2707:3: error: implicit declaration of function 
‘console_lock_spinning_enable_irqsave’; did you mean 
‘console_lock_spinning_enable’? [-Werror=implicit-function-declaration]
   console_lock_spinning_enable_irqsave(&flags);
   ^~~~
   console_lock_spinning_enable

Will be fixed for v3.
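
The fix is just to keep the !CONFIG_PRINTK stub in step with the new
prototype, something like (a sketch; the exact surrounding context is in
kernel/printk/printk.c):

	/* in the !CONFIG_PRINTK branch */
	static void console_lock_spinning_enable_irqsave(unsigned long *flags) { }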

(I have now officially added !CONFIG_PRINTK to my CI tests.)

John Ogness