Re: [PATCH v4 13/46] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM
On Tue, Mar 23, 2021 at 11:02:32AM +1000, Nicholas Piggin wrote:
> Move the GUEST_MODE_SKIP logic into KVM code. This is quite a KVM
> internal detail that has no real need to be in common handlers.
>
> Also add a comment explaining why this thing exists.

[snip]

> diff --git a/arch/powerpc/kvm/book3s_64_entry.S b/arch/powerpc/kvm/book3s_64_entry.S
> index 7a039ea78f15..a5412e24cc05 100644
> --- a/arch/powerpc/kvm/book3s_64_entry.S
> +++ b/arch/powerpc/kvm/book3s_64_entry.S
> @@ -1,6 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0-only */
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -20,9 +21,12 @@ kvmppc_interrupt:
>   * guest R12 saved in shadow VCPU SCRATCH0
>   * guest R13 saved in SPRN_SCRATCH0
>   */
> -#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> 	std	r9,HSTATE_SCRATCH2(r13)
> 	lbz	r9,HSTATE_IN_GUEST(r13)
> +	cmpwi	r9,KVM_GUEST_MODE_SKIP
> +	beq-	.Lmaybe_skip
> +.Lno_skip:
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> 	cmpwi	r9,KVM_GUEST_MODE_HOST_HV
> 	beq	kvmppc_bad_host_intr
> #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
> @@ -34,3 +38,48 @@ kvmppc_interrupt:
> #else
> 	b	kvmppc_interrupt_pr
> #endif

It's a bit hard to see without more context, but I think that in the
PR-only case (CONFIG_KVM_BOOK3S_HV_POSSIBLE undefined), this will corrupt
R9. You need to restore R9 before the unconditional branch to
kvmppc_interrupt_pr. (I realize this code gets modified further, but I'd
rather not break bisection.)

Paul.
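Concretely, since the patch now saves guest R9 to HSTATE_SCRATCH2 before the `#ifdef`, the PR-only path would need to reload it from there before branching. A sketch only, against the hunk as posted (not the author's actual fix):

```
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
	cmpwi	r9,KVM_GUEST_MODE_HOST_HV
	beq	kvmppc_bad_host_intr
	/* ... PR demultiplexing as in the patch ... */
#else
	ld	r9,HSTATE_SCRATCH2(r13)	/* restore guest R9 saved above */
	b	kvmppc_interrupt_pr
#endif
```

This keeps the register state kvmppc_interrupt_pr expects identical in both configurations, so bisection through this commit stays safe.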
Re: [PATCH v4 12/46] KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point
On Tue, Mar 23, 2021 at 11:02:31AM +1000, Nicholas Piggin wrote: > Rather than bifurcate the call depending on whether or not HV is > possible, and have the HV entry test for PR, just make a single > common point which does the demultiplexing. This makes it simpler > to add another type of exit handler. > > Reviewed-by: Daniel Axtens > Reviewed-by: Fabiano Rosas > Signed-off-by: Nicholas Piggin Acked-by: Paul Mackerras
Re: [PATCH v4 15/46] KVM: PPC: Book3S 64: Move hcall early register setup to KVM
On Tue, Mar 23, 2021 at 11:02:34AM +1000, Nicholas Piggin wrote: > System calls / hcalls have a different calling convention than > other interrupts, so there is code in the KVMTEST to massage these > into the same form as other interrupt handlers. > > Move this work into the KVM hcall handler. This means teaching KVM > a little more about the low level interrupt handler setup, PACA save > areas, etc., although that's not obviously worse than the current > approach of coming up with an entirely different interrupt register > / save convention. [snip] > @@ -1964,29 +1948,8 @@ EXC_VIRT_END(system_call, 0x4c00, 0x100) > > #ifdef CONFIG_KVM_BOOK3S_64_HANDLER > TRAMP_REAL_BEGIN(system_call_kvm) > - /* > - * This is a hcall, so register convention is as above, with these > - * differences: I haven't checked all the code changes in detail yet, but this comment at least is slightly misleading, since under PR KVM, system calls (to the guest kernel) and hypercalls both come through this path. Paul.
Re: [PATCH v4 29/46] KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C
On 3/23/21 12:02 PM, Nicholas Piggin wrote:

Almost all logic is moved to C, by introducing a new in_guest mode that
selects and branches very early in the interrupt handler to the P9 exit
code. The remaining assembly is only about 160 lines of low level stack
setup, with VCPU vs host register save and restore, plus a small shim to
the legacy paths in the interrupt handler.

There are two motivations for this. The first is simply that the code is
more maintainable in C. The second is to reduce the amount of code running
in a special KVM mode, "realmode". I put that in quotes because with radix
it is no longer necessarily real-mode in the MMU, but it still has to be
treated specially because it may be in real-mode, and has various
important registers like PID, DEC, TB, etc. set to guest values. This is
hostile to the rest of Linux and can't use arbitrary kernel functionality
or be instrumented well.

This initial patch is a reasonably faithful conversion of the asm code. It
does lack any loop to return quickly back into the guest without switching
out of realmode in the case of unimportant or easily handled interrupts;
as explained in the previous change, handling HV interrupts in real mode
is not so important for P9.
Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/asm-prototypes.h | 3 +- arch/powerpc/include/asm/kvm_asm.h| 3 +- arch/powerpc/include/asm/kvm_book3s_64.h | 8 + arch/powerpc/kernel/security.c| 5 +- arch/powerpc/kvm/Makefile | 3 + arch/powerpc/kvm/book3s_64_entry.S| 246 ++ arch/powerpc/kvm/book3s_hv.c | 9 +- arch/powerpc/kvm/book3s_hv_interrupt.c| 223 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 123 +-- 9 files changed, 500 insertions(+), 123 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_hv_interrupt.c diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h index 939f3c94c8f3..7c74c80ed994 100644 --- a/arch/powerpc/include/asm/asm-prototypes.h +++ b/arch/powerpc/include/asm/asm-prototypes.h @@ -122,6 +122,7 @@ extern s32 patch__call_flush_branch_caches3; extern s32 patch__flush_count_cache_return; extern s32 patch__flush_link_stack_return; extern s32 patch__call_kvm_flush_link_stack; +extern s32 patch__call_kvm_flush_link_stack_p9; extern s32 patch__memset_nocache, patch__memcpy_nocache; extern long flush_branch_caches; @@ -142,7 +143,7 @@ void kvmhv_load_host_pmu(void); void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use); void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu); -int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu); +void kvmppc_p9_enter_guest(struct kvm_vcpu *vcpu); long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr); long kvmppc_h_set_xdabr(struct kvm_vcpu *vcpu, unsigned long dabr, diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h index a3633560493b..b4f9996bd331 100644 --- a/arch/powerpc/include/asm/kvm_asm.h +++ b/arch/powerpc/include/asm/kvm_asm.h @@ -146,7 +146,8 @@ #define KVM_GUEST_MODE_GUEST 1 #define KVM_GUEST_MODE_SKIP 2 #define KVM_GUEST_MODE_GUEST_HV 3 -#define KVM_GUEST_MODE_HOST_HV 4 +#define KVM_GUEST_MODE_GUEST_HV_FAST 4 /* ISA v3.0 with host radix mode */ +#define KVM_GUEST_MODE_HOST_HV 5 #define KVM_INST_FETCH_FAILED 
-1 diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 9bb9bb370b53..c214bcffb441 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -153,9 +153,17 @@ static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu *vcpu) return radix; } +int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu); + #define KVM_DEFAULT_HPT_ORDER 24 /* 16MB HPT by default */ #endif +/* + * Invalid HDSISR value which is used to indicate when HW has not set the reg. + * Used to work around an errata. + */ +#define HDSISR_CANARY 0x7fff + /* * We use a lock bit in HPTE dword 0 to synchronize updates and * accesses to each HPTE, and another bit to indicate non-present diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c index e4e1a94ccf6a..3a607c11f20f 100644 --- a/arch/powerpc/kernel/security.c +++ b/arch/powerpc/kernel/security.c @@ -430,16 +430,19 @@ device_initcall(stf_barrier_debugfs_init); static void update_branch_cache_flush(void) { - u32 *site; + u32 *site, __maybe_unused *site2; #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE site = __call_kvm_flush_link_stack; + site2 = __call_kvm_flush_link_stack_p9; // This controls the branch from guest_exit_cont to kvm_flush_link_stack if (link_stack_flush_type == BRANCH_CACHE_FLUSH_NONE) { patch_instruction_site(site, ppc_inst(PPC_INST_NOP)); + patch_instruction_site(site2,
Re: [PATCH v10 01/10] powerpc/mm: Implement set_memory() routines
Jordan Niethe writes: > From: Russell Currey > > The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX, > and are generally useful primitives to have. This implementation is > designed to be completely generic across powerpc's many MMUs. > > It's possible that this could be optimised to be faster for specific > MMUs, but the focus is on having a generic and safe implementation for > now. > > This implementation does not handle cases where the caller is attempting > to change the mapping of the page it is executing from, or if another > CPU is concurrently using the page being altered. These cases likely > shouldn't happen, but a more complex implementation with MMU-specific code > could safely handle them, so that is left as a TODO for now. > > On hash the linear mapping is not kept in the linux pagetable, so this > will not change the protection if used on that range. Currently these > functions are not used on the linear map so just WARN for now. > > These functions do nothing if STRICT_KERNEL_RWX is not enabled. 
> > Reviewed-by: Daniel Axtens > Signed-off-by: Russell Currey > Signed-off-by: Christophe Leroy > [jpn: -rebase on next plus "powerpc/mm/64s: Allow STRICT_KERNEL_RWX again" > - WARN on hash linear map] > Signed-off-by: Jordan Niethe > --- > v10: WARN if trying to change the hash linear map > --- > arch/powerpc/Kconfig | 1 + > arch/powerpc/include/asm/set_memory.h | 32 ++ > arch/powerpc/mm/Makefile | 2 +- > arch/powerpc/mm/pageattr.c| 88 +++ > 4 files changed, 122 insertions(+), 1 deletion(-) > create mode 100644 arch/powerpc/include/asm/set_memory.h > create mode 100644 arch/powerpc/mm/pageattr.c > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index fc7f5c5933e6..4498a27ac9db 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -135,6 +135,7 @@ config PPC > select ARCH_HAS_MEMBARRIER_CALLBACKS > select ARCH_HAS_MEMBARRIER_SYNC_CORE > select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE > && PPC_BOOK3S_64 > + select ARCH_HAS_SET_MEMORY > select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && > !HIBERNATION) > select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST > select ARCH_HAS_UACCESS_FLUSHCACHE > diff --git a/arch/powerpc/include/asm/set_memory.h > b/arch/powerpc/include/asm/set_memory.h > new file mode 100644 > index ..64011ea444b4 > --- /dev/null > +++ b/arch/powerpc/include/asm/set_memory.h > @@ -0,0 +1,32 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _ASM_POWERPC_SET_MEMORY_H > +#define _ASM_POWERPC_SET_MEMORY_H > + > +#define SET_MEMORY_RO0 > +#define SET_MEMORY_RW1 > +#define SET_MEMORY_NX2 > +#define SET_MEMORY_X 3 > + > +int change_memory_attr(unsigned long addr, int numpages, long action); > + > +static inline int set_memory_ro(unsigned long addr, int numpages) > +{ > + return change_memory_attr(addr, numpages, SET_MEMORY_RO); > +} > + > +static inline int set_memory_rw(unsigned long addr, int numpages) > +{ > + return change_memory_attr(addr, numpages, SET_MEMORY_RW); > +} > + > 
+static inline int set_memory_nx(unsigned long addr, int numpages) > +{ > + return change_memory_attr(addr, numpages, SET_MEMORY_NX); > +} > + > +static inline int set_memory_x(unsigned long addr, int numpages) > +{ > + return change_memory_attr(addr, numpages, SET_MEMORY_X); > +} > + > +#endif > diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile > index 3b4e9e4e25ea..d8a08abde1ae 100644 > --- a/arch/powerpc/mm/Makefile > +++ b/arch/powerpc/mm/Makefile > @@ -5,7 +5,7 @@ > > ccflags-$(CONFIG_PPC64) := $(NO_MINIMAL_TOC) > > -obj-y:= fault.o mem.o pgtable.o mmap.o > maccess.o \ > +obj-y:= fault.o mem.o pgtable.o mmap.o > maccess.o pageattr.o \ > init_$(BITS).o pgtable_$(BITS).o \ > pgtable-frag.o ioremap.o ioremap_$(BITS).o \ > init-common.o mmu_context.o drmem.o > diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c > new file mode 100644 > index ..9efcb01088da > --- /dev/null > +++ b/arch/powerpc/mm/pageattr.c > @@ -0,0 +1,88 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * MMU-generic set_memory implementation for powerpc > + * > + * Copyright 2019, IBM Corporation. > + */ > + > +#include > +#include > + > +#include > +#include > +#include > + > + > +/* > + * Updates the attributes of a page in three steps: > + * > + * 1. invalidate the page table entry > + * 2. flush the TLB > + * 3. install the new entry with the updated attributes > + * > + * This is unsafe if the caller is attempting to change the mapping of the > + * page it is executing from, or if another CPU is
Re: [PATCH] powerpc/8xx: Load modules closer to kernel text
Christophe Leroy writes:
> On 31/03/2021 at 15:39, Michael Ellerman wrote:
>> Christophe Leroy writes:
>>> On the 8xx, TASK_SIZE is 0x8000. The space between TASK_SIZE and
>>> PAGE_OFFSET is not used.
>>>
>>> Use it to load modules in order to minimise the distance between
>>> kernel text and modules and avoid trampolines in modules to access
>>> kernel functions or other module functions.
>>>
>>> Define a 16Mbytes area for modules, that's more than enough.
>>
>> 16MB seems kind of small.
>>
>> At least on 64-bit we could potentially have hundreds of MBs of modules.
>
> Well, with a 16 MB kernel and 16 MB modules, my board is full :)

Heh.

> Even on the more recent board that has 128 MB, I don't expect more than
> a few MBs of modules in addition to the kernel which is approx 8M.
>
> But ok, I'll do something more generic, though it will conflict with
> Jordan's series.

Don't feel you have to. You're the expert on 8xx, not me.

cheers
Re: [PATCH v2] powerpc/traps: Enhance readability for trap types
Segher Boessenkool wrote on Thu, Apr 1, 2021 at 6:15 AM:
> On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote:
>> So perhaps:
>>
>> EXC_SYSTEM_RESET
>> EXC_MACHINE_CHECK
>> EXC_DATA_STORAGE
>> EXC_DATA_SEGMENT
>> EXC_INST_STORAGE
>> EXC_INST_SEGMENT
>> EXC_EXTERNAL_INTERRUPT
>> EXC_ALIGNMENT
>> EXC_PROGRAM_CHECK
>> EXC_FP_UNAVAILABLE
>> EXC_DECREMENTER
>> EXC_HV_DECREMENTER
>> EXC_SYSTEM_CALL
>> EXC_HV_DATA_STORAGE
>> EXC_PERF_MONITOR
>
> These are interrupt (vectors), not exceptions. It doesn't matter all
> that much, but confusing things more isn't useful either! There can be
> multiple exceptions that all can trigger the same interrupt.

Looking at the official NXP reference manuals for the e500 and e600, they
call them interrupts, while "The Programming Environments", also from NXP,
calls them exceptions. It looks like there is no explicit distinction
between interrupts and exceptions.

Here are the links for the reference manuals of the e600 and e500:
https://www.nxp.com.cn/docs/en/reference-manual/E600CORERM.pdf
https://www.nxp.com/files-static/32bit/doc/ref_manual/EREF_RM.pdf

Here is the "The Programming Environments" link:
https://www.nxp.com.cn/docs/en/user-guide/MPCFPE_AD_R1.pdf

As far as I know, the values of the interrupts or exceptions above are
defined explicitly in the reference manuals or "The Programming
Environments". Could you please provide more details about multiple
exceptions triggering the same interrupt?

Xiongwei
Re: [PATCH v2] powerpc/traps: Enhance readability for trap types
Michael Ellerman wrote on Wed, Mar 31, 2021 at 5:58 PM:
> Xiongwei Song writes:
>> From: Xiongwei Song
>>
>> Create a new header named traps.h, define macros to list ppc exception
>> types in traps.h, replace the reference of the real trap values with
>> these macros.
>
> Personally I find the hex values easier to recognise, but I realise
> that's probably not true of other people :)

I'm one of the "other people".

> ...
>> diff --git a/arch/powerpc/include/asm/traps.h b/arch/powerpc/include/asm/traps.h
>> new file mode 100644
>> index ..a31b6122de23
>> --- /dev/null
>> +++ b/arch/powerpc/include/asm/traps.h
>> @@ -0,0 +1,19 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _ASM_PPC_TRAPS_H
>> +#define _ASM_PPC_TRAPS_H
>> +
>> +#define TRAP_RESET   0x100 /* System reset */
>> +#define TRAP_MCE     0x200 /* Machine check */
>> +#define TRAP_DSI     0x300 /* Data storage */
>> +#define TRAP_DSEGI   0x380 /* Data segment */
>> +#define TRAP_ISI     0x400 /* Instruction storage */
>> +#define TRAP_ISEGI   0x480 /* Instruction segment */
>> +#define TRAP_ALIGN   0x600 /* Alignment */
>> +#define TRAP_PROG    0x700 /* Program */
>> +#define TRAP_DEC     0x900 /* Decrementer */
>> +#define TRAP_SYSCALL 0xc00 /* System call */
>> +#define TRAP_TRACEI  0xd00 /* Trace */
>> +#define TRAP_FPA     0xe00 /* Floating-point Assist */
>> +#define TRAP_PMI     0xf00 /* Performance monitor */
>
> I know the macro is called TRAP and the field in pt_regs is called trap,
> but the terminology in the architecture is "exception", and we already
> have many uses of that. In particular we have a lot of uses of "exc" as
> an abbreviation for "exception". So I think I'd rather we use that than
> "TRAP".

Ok.

> I think we should probably use the names from the ISA, unless they are
> really over long.
> > Which are: > > 0x100 System Reset > 0x200 Machine Check > 0x300 Data Storage > 0x380 Data Segment > 0x400 Instruction Storage > 0x480 Instruction Segment > 0x500 External > 0x600 Alignment > 0x700 Program > 0x800 Floating-Point Unavailable > 0x900 Decrementer > 0x980 Hypervisor Decrementer > 0xA00 Directed Privileged Doorbell > 0xC00 System Call > 0xD00 Trace > 0xE00 Hypervisor Data Storage > 0xE20 Hypervisor Instruction Storage > 0xE40 Hypervisor Emulation Assistance > 0xE60 Hypervisor Maintenance > 0xE80 Directed Hypervisor Doorbell > 0xEA0 Hypervisor Virtualization > 0xF00 Performance Monitor > 0xF20 Vector Unavailable > 0xF40 VSX Unavailable > 0xF60 Facility Unavailable > 0xF80 Hypervisor Facility Unavailable > 0xFA0 Directed Ultravisor Doorbell > > > So perhaps: > > EXC_SYSTEM_RESET > EXC_MACHINE_CHECK > EXC_DATA_STORAGE > EXC_DATA_SEGMENT > EXC_INST_STORAGE > EXC_INST_SEGMENT > EXC_EXTERNAL_INTERRUPT > EXC_ALIGNMENT > EXC_PROGRAM_CHECK > EXC_FP_UNAVAILABLE > EXC_DECREMENTER > EXC_HV_DECREMENTER > EXC_SYSTEM_CALL > EXC_HV_DATA_STORAGE > EXC_PERF_MONITOR > > Thanks for the suggestions. I'm ok with the prefix. Let me change.
Re: [PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
On Wed, Mar 31, 2021 at 04:45:06PM +0200, Cédric Le Goater wrote:
> The 'chip_id' field of the XIVE CPU structure is used to choose a
> target for a source located on the same chip when possible. The XIVE
> driver queries the chip id value from the "ibm,chip-id" DT property
> but this property is not available on all platforms. It was first
> introduced on the PowerNV platform and later, under QEMU for pseries.
> However, the property does not exist under PowerVM since it is not
> specified in PAPR.
>
> cpu_to_node() is a better alternative. On the PowerNV platform, the
> node id is computed from the "ibm,associativity" property of the CPU.
> Its value is built in the OPAL firmware from the physical chip id and
> is equivalent to "ibm,chip-id".

Hrm... I mean, for powernv this is certainly correct, but it seems to be
relying on pretty specific details of the OPAL / chip behaviour, namely
that the NUMA id == chip ID.

> On pSeries, the hcall H_HOME_NODE_ASSOCIATIVITY returns the node id.

AFAICT, the chip_id field is never actually used in the PAPR version of
XIVE. The only access to the field outside native.c is in
xive_pick_irq_target(), and it only looks at chip_id if src_chip is
valid. But src_chip is initialized to XIVE_INVALID_CHIP_ID in papr.c

So it would make more sense to me to also initialize chip_id to
XIVE_INVALID_CHIP_ID for PAPR to make it clearer that it's not relevant.

> Also to be noted that under QEMU/KVM "ibm,chip-id" is badly calculated
> with unusual SMT configuration. This leads to a bogus chip id value
> being returned by of_get_ibm_chip_id().

I *still* don't clearly understand what you think is bogus about the chip
id value that qemu generates. It's clearly not a problem for XIVE, since
PAPR XIVE never uses it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH v2] powerpc/traps: Enhance readability for trap types
Segher Boessenkool writes: > On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote: >> So perhaps: >> >> EXC_SYSTEM_RESET >> EXC_MACHINE_CHECK >> EXC_DATA_STORAGE >> EXC_DATA_SEGMENT >> EXC_INST_STORAGE >> EXC_INST_SEGMENT >> EXC_EXTERNAL_INTERRUPT >> EXC_ALIGNMENT >> EXC_PROGRAM_CHECK >> EXC_FP_UNAVAILABLE >> EXC_DECREMENTER >> EXC_HV_DECREMENTER >> EXC_SYSTEM_CALL >> EXC_HV_DATA_STORAGE >> EXC_PERF_MONITOR > > These are interrupt (vectors), not exceptions. It doesn't matter all > that much, but confusing things more isn't useful either! There can be > multiple exceptions that all can trigger the same interrupt. Yeah I know, but I think that ship has already sailed as far as the naming we have in the kernel. We have over 250 uses of "exc", and several files called "exception" something. Using "interrupt" can also be confusing because Linux uses that to mean "external interrupt". But I dunno, maybe INT or VEC is clearer? .. or TRAP :) cheers
Re: [PATCH] powerpc/vdso: Separate vvar vma from vdso
On 3/31/21 10:59 AM, Michael Ellerman wrote: > Christophe Leroy writes: [..] >> >>> @@ -133,7 +135,13 @@ static int __arch_setup_additional_pages(struct >>> linux_binprm *bprm, int uses_int >>> * install_special_mapping or the perf counter mmap tracking code >>> * will fail to recognise it as a vDSO. >>> */ >>> - mm->context.vdso = (void __user *)vdso_base + PAGE_SIZE; >>> + mm->context.vdso = (void __user *)vdso_base + vvar_size; >>> + >>> + vma = _install_special_mapping(mm, vdso_base, vvar_size, >>> + VM_READ | VM_MAYREAD | VM_IO | >>> + VM_DONTDUMP | VM_PFNMAP, _spec); >>> + if (IS_ERR(vma)) >>> + return PTR_ERR(vma); >>> >>> /* >>> * our vma flags don't have VM_WRITE so by default, the process isn't >> >> >> IIUC, VM_PFNMAP is for when we have a vvar_fault handler. >> Allthough we will soon have one for handle TIME_NS, at the moment >> powerpc doesn't have that handler. >> Isn't it dangerous to set VM_PFNMAP then ? I believe, it's fine, special_mapping_fault() does: : if (sm->fault) : return sm->fault(sm, vmf->vma, vmf); > Some of the other flags seem odd too. > eg. VM_IO ? VM_DONTDUMP ? Yeah, so: VM_PFNMAP | VM_IO is a protection from remote access on pages. So one can't access such page with ptrace(), /proc/$pid/mem or process_vm_write(). Otherwise, it would create COW mapping and the tracee will stop working with stale vvar. VM_DONTDUMP restricts the area from coredumping and gdb will also avoid accessing those[1][2]. I agree that VM_PFNMAP was probably excessive in this patch alone and rather synchronized code with other architectures, but it makes more sense now in the new patches set by Christophe: https://lore.kernel.org/linux-arch/cover.1617209141.git.christophe.le...@csgroup.eu/ [1] https://lore.kernel.org/lkml/550731af.6080...@redhat.com/T/ [2] https://sourceware.org/legacy-ml/gdb-patches/2015-03/msg00383.html Thanks, Dmitry
Re: [PATCH v2] powerpc/traps: Enhance readability for trap types
On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote: > So perhaps: > > EXC_SYSTEM_RESET > EXC_MACHINE_CHECK > EXC_DATA_STORAGE > EXC_DATA_SEGMENT > EXC_INST_STORAGE > EXC_INST_SEGMENT > EXC_EXTERNAL_INTERRUPT > EXC_ALIGNMENT > EXC_PROGRAM_CHECK > EXC_FP_UNAVAILABLE > EXC_DECREMENTER > EXC_HV_DECREMENTER > EXC_SYSTEM_CALL > EXC_HV_DATA_STORAGE > EXC_PERF_MONITOR These are interrupt (vectors), not exceptions. It doesn't matter all that much, but confusing things more isn't useful either! There can be multiple exceptions that all can trigger the same interrupt. Segher
Re: [PATCH] powerpc/vdso: Separate vvar vma from vdso
On 3/30/21 11:17 AM, Christophe Leroy wrote: > > > Le 26/03/2021 à 20:17, Dmitry Safonov a écrit : [..] >> --- a/arch/powerpc/kernel/vdso.c >> +++ b/arch/powerpc/kernel/vdso.c >> @@ -55,10 +55,10 @@ static int vdso_mremap(const struct >> vm_special_mapping *sm, struct vm_area_struc >> { >> unsigned long new_size = new_vma->vm_end - new_vma->vm_start; >> - if (new_size != text_size + PAGE_SIZE) >> + if (new_size != text_size) >> return -EINVAL; > > In ARM64 you have removed the above test in commit 871402e05b24cb56 > ("mm: forbid splitting special mappings"). Do we need to keep it here ? > >> - current->mm->context.vdso = (void __user *)new_vma->vm_start + >> PAGE_SIZE; >> + current->mm->context.vdso = (void __user *)new_vma->vm_start; >> return 0; >> } > Yes, right you are, this can be dropped. Thanks, Dmitry
Re: [PATCH] docs: powerpc: Fix misspellings and grammar errors
He Ying writes: > Reported-by: Hulk Robot > Signed-off-by: He Ying > --- > Documentation/powerpc/booting.rst| 2 +- > Documentation/powerpc/dawr-power9.rst| 2 +- > Documentation/powerpc/eeh-pci-error-recovery.rst | 2 +- > Documentation/powerpc/elfnote.rst| 2 +- > Documentation/powerpc/firmware-assisted-dump.rst | 2 +- > Documentation/powerpc/kaslr-booke32.rst | 2 +- > Documentation/powerpc/mpc52xx.rst| 2 +- > Documentation/powerpc/papr_hcalls.rst| 4 ++-- > Documentation/powerpc/transactional_memory.rst | 4 ++-- > 9 files changed, 11 insertions(+), 11 deletions(-) Applied, thanks. jon
Re: [PATCH 6/8] drivers: firmware: efi: libstub: enable generic commandline
On Wed, Mar 31, 2021 at 06:10:08PM +0200, Ard Biesheuvel wrote: > (+ Arvind) > > On Tue, 30 Mar 2021 at 19:57, Daniel Walker wrote: > > > > This adds code to handle the generic command line changes. > > The efi code appears that it doesn't benefit as much from this design > > as it could. > > > > For example, if you had a prepend command line with "nokaslr" then > > you might be helpful to re-enable it in the boot loader or dts, > > but there appears to be no way to re-enable kaslr or some of the > > other options. > > > > Cc: xe-linux-exter...@cisco.com > > Signed-off-by: Daniel Walker > > --- > > .../firmware/efi/libstub/efi-stub-helper.c| 35 +++ > > drivers/firmware/efi/libstub/efi-stub.c | 7 > > drivers/firmware/efi/libstub/efistub.h| 1 + > > drivers/firmware/efi/libstub/x86-stub.c | 13 +-- > > 4 files changed, 54 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c > > b/drivers/firmware/efi/libstub/efi-stub-helper.c > > index aa8da0a49829..c155837cedc9 100644 > > --- a/drivers/firmware/efi/libstub/efi-stub-helper.c > > +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c > > @@ -13,6 +13,7 @@ > > #include > > #include > > #include /* For CONSOLE_LOGLEVEL_* */ > > +#include > > #include > > #include > > > > @@ -172,6 +173,40 @@ int efi_printk(const char *fmt, ...) > > return printed; > > } > > > > +/** > > + * efi_handle_cmdline() - handle adding in building parts of the command > > line > > + * @cmdline: kernel command line > > + * > > + * Add in the generic parts of the commandline and start the parsing of the > > + * command line. 
> > + * > > + * Return: status code > > + */ > > +efi_status_t efi_handle_cmdline(char const *cmdline) > > +{ > > + efi_status_t status; > > + > > + status = efi_parse_options(CMDLINE_PREPEND); > > + if (status != EFI_SUCCESS) { > > + efi_err("Failed to parse options\n"); > > + return status; > > + } > > Even though I am not a fan of the 'success handling' pattern, > duplicating the exact same error handling three times is not great > either. Could we reuse more of the code here? How about efi_status_t status = 0; status |= efi_parse_options(CMDLINE_PREPEND); then error checking once ? > > + > > + status = efi_parse_options(IS_ENABLED(CONFIG_CMDLINE_OVERRIDE) ? "" > > : cmdline); > > What is the point of calling efi_parse_options() with an empty string? I could change it to if ((IS_ENABLED(CONFIG_CMDLINE_OVERRIDE)) ? > > --- a/drivers/firmware/efi/libstub/efi-stub.c > > +++ b/drivers/firmware/efi/libstub/efi-stub.c > > @@ -172,6 +172,12 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle, > > goto fail; > > } > > > > +#ifdef CONFIG_GENERIC_CMDLINE > > + status = efi_handle_cmdline(cmdline_ptr); > > + if (status != EFI_SUCCESS) { > > + goto fail_free_cmdline; > > + } > > +#else > > if (IS_ENABLED(CONFIG_CMDLINE_EXTEND) || > > IS_ENABLED(CONFIG_CMDLINE_FORCE) || > > Does this mean CONFIG_GENERIC_CMDLINE does not replace CMDLINE_EXTEND > / CMDLINE_FORCE etc, but introduces yet another variant on top of > those? > > That does not seem like an improvement to me. I think it is great that > you are cleaning this up, but only if it means we can get rid of the > old implementation. It does replace extend and force. I was under the impression this code was shared between arm64 and arm32. If that's not the case I can delete the extend and force section. I haven't submitted a conversion for arm32 yet. Daniel
Re: [PATCH v2 3/7] powerpc: convert config files to generic cmdline
On Wed, Mar 31, 2021 at 12:52:19PM +0100, Will Deacon wrote: > On Tue, Mar 30, 2021 at 10:35:21AM -0700, Daniel Walker wrote: > > On Mon, Mar 29, 2021 at 11:07:51AM +0100, Will Deacon wrote: > > > On Thu, Mar 25, 2021 at 12:59:56PM -0700, Daniel Walker wrote: > > > > On Thu, Mar 25, 2021 at 01:03:55PM +0100, Christophe Leroy wrote: > > > > > > > > > > Ok, so you agree we don't need to provide two CMDLINE, one to be > > > > > appended and one to be prepended. > > > > > > > > > > Let's only provide once CMDLINE as of today, and ask the user to > > > > > select > > > > > whether he wants it appended or prepended or replacee. Then no need to > > > > > change all existing config to rename CONFIG_CMDLINE into either of > > > > > the new > > > > > ones. > > > > > > > > > > That's the main difference between my series and Daniel's series. So > > > > > I'll > > > > > finish taking Will's comment into account and we'll send out a v3 > > > > > soon. > > > > > > > > It doesn't solve the needs of Cisco, I've stated many times your > > > > changes have > > > > little value. Please stop submitting them. > > > > > > FWIW, they're useful for arm64 and I will gladly review the updated > > > series. > > > > > > I don't think asking people to stop submitting patches is ever the right > > > answer. Please don't do that. > > > > Why ? It's me nacking his series, is that not allowed anymore ? > > If you're that way inclined then you can "nack" whatever you want, but > please allow the rest of us to continue reviewing the patches. You don't > have any basis on which to veto other people's contributions and so > demanding that somebody stops posting code is neither constructive nor > meaningful. I understand , but that's not what's happening. I've dealt with Christophe on these changes repeatedly, and he's demonstrated he doesn't understand the feature set or the motivation of the changes. 
I've tried to work with him in the past, but it doesn't work unless he's giving me comments on my changes. His changes don't solve Cisco problems, and likely never will regardless of feedback. Maybe that could change, but I don't see that happening. Daniel
[PATCH RESEND v1 4/4] powerpc/vdso: Add support for time namespaces
This patch adds the necessary glue to provide time namespaces. Things are mainly copied from ARM64. __arch_get_timens_vdso_data() calculates timens vdso data position based on the vdso data position, knowing it is the next page in vvar. This avoids having to redo the mflr/bcl/mflr/mtlr dance to locate the page relative to running code position. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 3 +- arch/powerpc/include/asm/vdso/gettimeofday.h | 10 ++ arch/powerpc/include/asm/vdso_datapage.h | 2 - arch/powerpc/kernel/vdso.c | 116 --- arch/powerpc/kernel/vdso32/vdso32.lds.S | 2 +- arch/powerpc/kernel/vdso64/vdso64.lds.S | 2 +- 6 files changed, 114 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index c1344c05226c..71daff5f15d5 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -172,6 +172,7 @@ config PPC select GENERIC_CPU_AUTOPROBE select GENERIC_CPU_VULNERABILITIES if PPC_BARRIER_NOSPEC select GENERIC_EARLY_IOREMAP + select GENERIC_GETTIMEOFDAY select GENERIC_IRQ_SHOW select GENERIC_IRQ_SHOW_LEVEL select GENERIC_PCI_IOMAPif PCI @@ -179,7 +180,7 @@ config PPC select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select GENERIC_TIME_VSYSCALL - select GENERIC_GETTIMEOFDAY + select GENERIC_VDSO_TIME_NS select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU select HAVE_ARCH_JUMP_LABEL diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h index d453e725c79f..e448df1dd071 100644 --- a/arch/powerpc/include/asm/vdso/gettimeofday.h +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h @@ -2,6 +2,8 @@ #ifndef _ASM_POWERPC_VDSO_GETTIMEOFDAY_H #define _ASM_POWERPC_VDSO_GETTIMEOFDAY_H +#include + #ifdef __ASSEMBLY__ #include @@ -153,6 +155,14 @@ static __always_inline u64 __arch_get_hw_counter(s32 clock_mode, const struct vdso_data *__arch_get_vdso_data(void); +#ifdef CONFIG_TIME_NS +static __always_inline +const struct 
vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) +{ + return (void *)vd + PAGE_SIZE; +} +#endif + static inline bool vdso_clocksource_ok(const struct vdso_data *vd) { return true; diff --git a/arch/powerpc/include/asm/vdso_datapage.h b/arch/powerpc/include/asm/vdso_datapage.h index 3f958ecf2beb..a585c8e538ff 100644 --- a/arch/powerpc/include/asm/vdso_datapage.h +++ b/arch/powerpc/include/asm/vdso_datapage.h @@ -107,9 +107,7 @@ extern struct vdso_arch_data *vdso_data; bcl 20, 31, .+4 999: mflr\ptr -#if CONFIG_PPC_PAGE_SHIFT > 14 addis \ptr, \ptr, (_vdso_datapage - 999b)@ha -#endif addi\ptr, \ptr, (_vdso_datapage - 999b)@l .endm diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c index b14907209822..717f2c9a7573 100644 --- a/arch/powerpc/kernel/vdso.c +++ b/arch/powerpc/kernel/vdso.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -50,6 +51,12 @@ static union { } vdso_data_store __page_aligned_data; struct vdso_arch_data *vdso_data = _data_store.data; +enum vvar_pages { + VVAR_DATA_PAGE_OFFSET, + VVAR_TIMENS_PAGE_OFFSET, + VVAR_NR_PAGES, +}; + static int vdso_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma, unsigned long text_size) { @@ -73,8 +80,12 @@ static int vdso64_mremap(const struct vm_special_mapping *sm, struct vm_area_str return vdso_mremap(sm, new_vma, _end - _start); } +static vm_fault_t vvar_fault(const struct vm_special_mapping *sm, +struct vm_area_struct *vma, struct vm_fault *vmf); + static struct vm_special_mapping vvar_spec __ro_after_init = { .name = "[vvar]", + .fault = vvar_fault, }; static struct vm_special_mapping vdso32_spec __ro_after_init = { @@ -87,6 +98,94 @@ static struct vm_special_mapping vdso64_spec __ro_after_init = { .mremap = vdso64_mremap, }; +#ifdef CONFIG_TIME_NS +struct vdso_data *arch_get_vdso_data(void *vvar_page) +{ + return ((struct vdso_arch_data *)vvar_page)->data; +} + +/* + * The vvar mapping contains data for a specific time 
namespace, so when a task + * changes namespace we must unmap its vvar data for the old namespace. + * Subsequent faults will map in data for the new namespace. + * + * For more details see timens_setup_vdso_data(). + */ +int vdso_join_timens(struct task_struct *task, struct time_namespace *ns) +{ + struct mm_struct *mm = task->mm; + struct vm_area_struct *vma; + + mmap_read_lock(mm); + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + unsigned long size = vma->vm_end - vma->vm_start; + + if
[PATCH RESEND v1 3/4] powerpc/vdso: Separate vvar vma from vdso
From: Dmitry Safonov Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front") the VVAR page is in front of the VDSO area. As a result it breaks CRIU (Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]" from /proc/../maps points at the ELF/vdso image, rather than at the VVAR data page. Laurent made a patch to keep CRIU working (by reading the aux vector). But I think it still makes sense to separate the two mappings into different VMAs. It will also make ppc64 less "special" for userspace and, as a side-bonus, will make the VVAR page un-writable by debuggers (which previously would COW the page, which can be unexpected). I opportunistically Cc stable on it: I understand that usually such stuff isn't stable material, but it will allow us in CRIU to have one less workaround that is needed just for one release (v5.11) on one platform (ppc64), which we otherwise have to maintain. I wouldn't go as far as to say that commit 511157ab641e is an ABI regression as no other userspace got broken, but I'd really appreciate it if it gets backported to v5.11 after v5.12 is released, so as not to complicate the already non-simple CRIU-vdso code. Thanks! 
Cc: Andrei Vagin Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Christophe Leroy Cc: Laurent Dufour Cc: Michael Ellerman Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Cc: sta...@vger.kernel.org # v5.11 [1]: https://github.com/checkpoint-restore/criu/issues/1417 Signed-off-by: Dmitry Safonov Tested-by: Christophe Leroy Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/mmu_context.h | 2 +- arch/powerpc/kernel/vdso.c | 54 +++--- 2 files changed, 40 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 652ce85f9410..4bc45d3ed8b0 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -263,7 +263,7 @@ extern void arch_exit_mmap(struct mm_struct *mm); static inline void arch_unmap(struct mm_struct *mm, unsigned long start, unsigned long end) { - unsigned long vdso_base = (unsigned long)mm->context.vdso - PAGE_SIZE; + unsigned long vdso_base = (unsigned long)mm->context.vdso; if (start <= vdso_base && vdso_base < end) mm->context.vdso = NULL; diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c index e839a906fdf2..b14907209822 100644 --- a/arch/powerpc/kernel/vdso.c +++ b/arch/powerpc/kernel/vdso.c @@ -55,10 +55,10 @@ static int vdso_mremap(const struct vm_special_mapping *sm, struct vm_area_struc { unsigned long new_size = new_vma->vm_end - new_vma->vm_start; - if (new_size != text_size + PAGE_SIZE) + if (new_size != text_size) return -EINVAL; - current->mm->context.vdso = (void __user *)new_vma->vm_start + PAGE_SIZE; + current->mm->context.vdso = (void __user *)new_vma->vm_start; return 0; } @@ -73,6 +73,10 @@ static int vdso64_mremap(const struct vm_special_mapping *sm, struct vm_area_str return vdso_mremap(sm, new_vma, _end - _start); } +static struct vm_special_mapping vvar_spec __ro_after_init = { + .name = "[vvar]", +}; + static struct vm_special_mapping vdso32_spec __ro_after_init = { .name = "[vdso]", 
.mremap = vdso32_mremap, @@ -89,11 +93,11 @@ static struct vm_special_mapping vdso64_spec __ro_after_init = { */ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { - struct mm_struct *mm = current->mm; + unsigned long vdso_size, vdso_base, mappings_size; struct vm_special_mapping *vdso_spec; + unsigned long vvar_size = PAGE_SIZE; + struct mm_struct *mm = current->mm; struct vm_area_struct *vma; - unsigned long vdso_size; - unsigned long vdso_base; if (is_32bit_task()) { vdso_spec = _spec; @@ -110,8 +114,8 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_int vdso_base = 0; } - /* Add a page to the vdso size for the data page */ - vdso_size += PAGE_SIZE; + mappings_size = vdso_size + vvar_size; + mappings_size += (VDSO_ALIGNMENT - 1) & PAGE_MASK; /* * pick a base address for the vDSO in process space. We try to put it @@ -119,9 +123,7 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_int * and end up putting it elsewhere. * Add enough to the size so that the result can be aligned. */ - vdso_base = get_unmapped_area(NULL, vdso_base, - vdso_size + ((VDSO_ALIGNMENT - 1) & PAGE_MASK), - 0, 0); + vdso_base = get_unmapped_area(NULL, vdso_base, mappings_size, 0, 0); if (IS_ERR_VALUE(vdso_base)) return vdso_base; @@ -133,7 +135,13 @@
[PATCH RESEND v1 1/4] lib/vdso: Mark do_hres_timens() and do_coarse_timens() __always_inline()
In the same spirit as commit c966533f8c6c ("lib/vdso: Mark do_hres() and do_coarse() as __always_inline"), mark do_hres_timens() and do_coarse_timens() __always_inline. The measurements below are on a non-timens process, i.e. on the fastest path. On powerpc32, without the patch: clock-gettime-monotonic-raw:vdso: 1155 nsec/call clock-gettime-monotonic-coarse:vdso: 813 nsec/call clock-gettime-monotonic:vdso: 1076 nsec/call With the patch: clock-gettime-monotonic-raw:vdso: 1100 nsec/call clock-gettime-monotonic-coarse:vdso: 667 nsec/call clock-gettime-monotonic:vdso: 1025 nsec/call Signed-off-by: Christophe Leroy --- lib/vdso/gettimeofday.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index 2919f1698140..c6f6dee08746 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -46,8 +46,8 @@ static inline bool vdso_cycles_ok(u64 cycles) #endif #ifdef CONFIG_TIME_NS -static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, - struct __kernel_timespec *ts) +static __always_inline int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, + struct __kernel_timespec *ts) { const struct vdso_data *vd = __arch_get_timens_vdso_data(); const struct timens_offset *offs = >offset[clk]; @@ -97,8 +97,8 @@ static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) return NULL; } -static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, - struct __kernel_timespec *ts) +static __always_inline int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, + struct __kernel_timespec *ts) { return -EINVAL; } @@ -159,8 +159,8 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk, } #ifdef CONFIG_TIME_NS -static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk, - struct __kernel_timespec *ts) +static __always_inline int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk, + struct __kernel_timespec *ts) { 
const struct vdso_data *vd = __arch_get_timens_vdso_data(); const struct vdso_timestamp *vdso_ts = >basetime[clk]; @@ -188,8 +188,8 @@ static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk, return 0; } #else -static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk, - struct __kernel_timespec *ts) +static __always_inline int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk, + struct __kernel_timespec *ts) { return -1; } -- 2.25.0
[PATCH RESEND v1 2/4] lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()
For the same reason as commit e876f0b69dc9 ("lib/vdso: Allow architectures to provide the vdso data pointer"), powerpc wants to avoid calculation of relative position to code. As the timens_vdso_data is next page to vdso_data, provide vdso_data pointer to __arch_get_timens_vdso_data() in order to ease the calculation on powerpc in following patches. Signed-off-by: Christophe Leroy --- arch/arm64/include/asm/vdso/compat_gettimeofday.h | 3 ++- arch/arm64/include/asm/vdso/gettimeofday.h| 2 +- arch/s390/include/asm/vdso/gettimeofday.h | 3 ++- arch/x86/include/asm/vdso/gettimeofday.h | 3 ++- lib/vdso/gettimeofday.c | 15 +-- 5 files changed, 16 insertions(+), 10 deletions(-) diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h b/arch/arm64/include/asm/vdso/compat_gettimeofday.h index 7508b0ac1d21..ecb6fd4c3c64 100644 --- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h +++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h @@ -155,7 +155,8 @@ static __always_inline const struct vdso_data *__arch_get_vdso_data(void) } #ifdef CONFIG_TIME_NS -static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) +static __always_inline +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { const struct vdso_data *ret; diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h b/arch/arm64/include/asm/vdso/gettimeofday.h index 631ab1281633..de86230a9436 100644 --- a/arch/arm64/include/asm/vdso/gettimeofday.h +++ b/arch/arm64/include/asm/vdso/gettimeofday.h @@ -100,7 +100,7 @@ const struct vdso_data *__arch_get_vdso_data(void) #ifdef CONFIG_TIME_NS static __always_inline -const struct vdso_data *__arch_get_timens_vdso_data(void) +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { return _timens_data; } diff --git a/arch/s390/include/asm/vdso/gettimeofday.h b/arch/s390/include/asm/vdso/gettimeofday.h index ed89ef742530..383c53c3 100644 --- a/arch/s390/include/asm/vdso/gettimeofday.h +++ 
b/arch/s390/include/asm/vdso/gettimeofday.h @@ -68,7 +68,8 @@ long clock_getres_fallback(clockid_t clkid, struct __kernel_timespec *ts) } #ifdef CONFIG_TIME_NS -static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) +static __always_inline +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { return _timens_data; } diff --git a/arch/x86/include/asm/vdso/gettimeofday.h b/arch/x86/include/asm/vdso/gettimeofday.h index df01d7349d79..1936f21ed8cd 100644 --- a/arch/x86/include/asm/vdso/gettimeofday.h +++ b/arch/x86/include/asm/vdso/gettimeofday.h @@ -58,7 +58,8 @@ extern struct ms_hyperv_tsc_page hvclock_page #endif #ifdef CONFIG_TIME_NS -static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) +static __always_inline +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { return __timens_vdso_data; } diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index c6f6dee08746..ce2f69552003 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -49,13 +49,15 @@ static inline bool vdso_cycles_ok(u64 cycles) static __always_inline int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, struct __kernel_timespec *ts) { - const struct vdso_data *vd = __arch_get_timens_vdso_data(); + const struct vdso_data *vd; const struct timens_offset *offs = >offset[clk]; const struct vdso_timestamp *vdso_ts; u64 cycles, last, ns; u32 seq; s64 sec; + vd = vdns - (clk == CLOCK_MONOTONIC_RAW ? 
CS_RAW : CS_HRES_COARSE); + vd = __arch_get_timens_vdso_data(vd); if (clk != CLOCK_MONOTONIC_RAW) vd = [CS_HRES_COARSE]; else @@ -92,7 +94,8 @@ static __always_inline int do_hres_timens(const struct vdso_data *vdns, clockid_ return 0; } #else -static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) +static __always_inline +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { return NULL; } @@ -162,7 +165,7 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk, static __always_inline int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk, struct __kernel_timespec *ts) { - const struct vdso_data *vd = __arch_get_timens_vdso_data(); + const struct vdso_data *vd = __arch_get_timens_vdso_data(vdns); const struct vdso_timestamp *vdso_ts = >basetime[clk]; const struct timens_offset *offs = >offset[clk]; u64 nsec; @@ -310,7 +313,7 @@ __cvdso_gettimeofday_data(const struct vdso_data *vd, if (unlikely(tz != NULL)) { if (IS_ENABLED(CONFIG_TIME_NS) &&
[PATCH RESEND v1 0/4] powerpc/vdso: Add support for time namespaces
[Sorry, resending with complete destination list, I used the wrong script on the first delivery] This series adds support for time namespaces on powerpc. All timens selftests are successful. Christophe Leroy (3): lib/vdso: Mark do_hres_timens() and do_coarse_timens() __always_inline() lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data() powerpc/vdso: Add support for time namespaces Dmitry Safonov (1): powerpc/vdso: Separate vvar vma from vdso .../include/asm/vdso/compat_gettimeofday.h| 3 +- arch/arm64/include/asm/vdso/gettimeofday.h| 2 +- arch/powerpc/Kconfig | 3 +- arch/powerpc/include/asm/mmu_context.h| 2 +- arch/powerpc/include/asm/vdso/gettimeofday.h | 10 ++ arch/powerpc/include/asm/vdso_datapage.h | 2 - arch/powerpc/kernel/vdso.c| 138 -- arch/powerpc/kernel/vdso32/vdso32.lds.S | 2 +- arch/powerpc/kernel/vdso64/vdso64.lds.S | 2 +- arch/s390/include/asm/vdso/gettimeofday.h | 3 +- arch/x86/include/asm/vdso/gettimeofday.h | 3 +- lib/vdso/gettimeofday.c | 31 ++-- 12 files changed, 162 insertions(+), 39 deletions(-) -- 2.25.0
Re: [PATCH 6/8] drivers: firmware: efi: libstub: enable generic commandline
(+ Arvind) On Tue, 30 Mar 2021 at 19:57, Daniel Walker wrote: > > This adds code to handle the generic command line changes. > The efi code appears that it doesn't benefit as much from this design > as it could. > > For example, if you had a prepend command line with "nokaslr" then > you might be helpful to re-enable it in the boot loader or dts, > but there appears to be no way to re-enable kaslr or some of the > other options. > > Cc: xe-linux-exter...@cisco.com > Signed-off-by: Daniel Walker > --- > .../firmware/efi/libstub/efi-stub-helper.c| 35 +++ > drivers/firmware/efi/libstub/efi-stub.c | 7 > drivers/firmware/efi/libstub/efistub.h| 1 + > drivers/firmware/efi/libstub/x86-stub.c | 13 +-- > 4 files changed, 54 insertions(+), 2 deletions(-) > > diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c > b/drivers/firmware/efi/libstub/efi-stub-helper.c > index aa8da0a49829..c155837cedc9 100644 > --- a/drivers/firmware/efi/libstub/efi-stub-helper.c > +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c > @@ -13,6 +13,7 @@ > #include > #include > #include /* For CONSOLE_LOGLEVEL_* */ > +#include > #include > #include > > @@ -172,6 +173,40 @@ int efi_printk(const char *fmt, ...) > return printed; > } > > +/** > + * efi_handle_cmdline() - handle adding in building parts of the command line > + * @cmdline: kernel command line > + * > + * Add in the generic parts of the commandline and start the parsing of the > + * command line. > + * > + * Return: status code > + */ > +efi_status_t efi_handle_cmdline(char const *cmdline) > +{ > + efi_status_t status; > + > + status = efi_parse_options(CMDLINE_PREPEND); > + if (status != EFI_SUCCESS) { > + efi_err("Failed to parse options\n"); > + return status; > + } Even though I am not a fan of the 'success handling' pattern, duplicating the exact same error handling three times is not great either. Could we reuse more of the code here? > + > + status = efi_parse_options(IS_ENABLED(CONFIG_CMDLINE_OVERRIDE) ? 
"" : > cmdline); What is the point of calling efi_parse_options() with an empty string? > + if (status != EFI_SUCCESS) { > + efi_err("Failed to parse options\n"); > + return status; > + } > + > + status = efi_parse_options(CMDLINE_APPEND); > + if (status != EFI_SUCCESS) { > + efi_err("Failed to parse options\n"); > + return status; > + } > + > + return EFI_SUCCESS; > +} > + > /** > * efi_parse_options() - Parse EFI command line options > * @cmdline: kernel command line > diff --git a/drivers/firmware/efi/libstub/efi-stub.c > b/drivers/firmware/efi/libstub/efi-stub.c > index 26e69788f27a..760480248adf 100644 > --- a/drivers/firmware/efi/libstub/efi-stub.c > +++ b/drivers/firmware/efi/libstub/efi-stub.c > @@ -172,6 +172,12 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle, > goto fail; > } > > +#ifdef CONFIG_GENERIC_CMDLINE > + status = efi_handle_cmdline(cmdline_ptr); > + if (status != EFI_SUCCESS) { > + goto fail_free_cmdline; > + } > +#else > if (IS_ENABLED(CONFIG_CMDLINE_EXTEND) || > IS_ENABLED(CONFIG_CMDLINE_FORCE) || Does this mean CONFIG_GENERIC_CMDLINE does not replace CMDLINE_EXTEND / CMDLINE_FORCE etc, but introduces yet another variant on top of those? That does not seem like an improvement to me. I think it is great that you are cleaning this up, but only if it means we can get rid of the old implementation. 
> cmdline_size == 0) { > @@ -189,6 +195,7 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle, > goto fail_free_cmdline; > } > } > +#endif > > efi_info("Booting Linux Kernel...\n"); > > diff --git a/drivers/firmware/efi/libstub/efistub.h > b/drivers/firmware/efi/libstub/efistub.h > index cde0a2ef507d..07c7f9fdfffc 100644 > --- a/drivers/firmware/efi/libstub/efistub.h > +++ b/drivers/firmware/efi/libstub/efistub.h > @@ -800,6 +800,7 @@ efi_status_t efi_relocate_kernel(unsigned long > *image_addr, > unsigned long alignment, > unsigned long min_addr); > > +efi_status_t efi_handle_cmdline(char const *cmdline); > efi_status_t efi_parse_options(char const *cmdline); > > void efi_parse_option_graphics(char *option); > diff --git a/drivers/firmware/efi/libstub/x86-stub.c > b/drivers/firmware/efi/libstub/x86-stub.c > index f14c4ff5839f..30ad8fb7122d 100644 > --- a/drivers/firmware/efi/libstub/x86-stub.c > +++ b/drivers/firmware/efi/libstub/x86-stub.c > @@ -673,6 +673,8 @@ unsigned long efi_main(efi_handle_t handle, > unsigned long bzimage_addr = (unsigned long)startup_32; > unsigned long buffer_start, buffer_end; > struct setup_header *hdr = _params->hdr; > +
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-16 15:38, Christoph Hellwig wrote: [...] diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index f1e38526d5bd40..996dfdf9d375dd 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2017,7 +2017,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain, .iommu_dev = smmu->dev, }; - if (smmu_domain->non_strict) + if (!iommu_get_dma_strict()) As Will raised, this also needs to be checking "domain->type == IOMMU_DOMAIN_DMA" to maintain equivalent behaviour to the attribute code below. pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT; pgtbl_ops = alloc_io_pgtable_ops(fmt, _cfg, smmu_domain); @@ -2449,52 +2449,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } -static int arm_smmu_domain_get_attr(struct iommu_domain *domain, - enum iommu_attr attr, void *data) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - - switch (domain->type) { - case IOMMU_DOMAIN_DMA: - switch (attr) { - case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = smmu_domain->non_strict; - return 0; - default: - return -ENODEV; - } - break; - default: - return -EINVAL; - } -} [...] 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index f985817c967a25..edb1de479dd1a7 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -668,7 +668,6 @@ struct arm_smmu_domain { struct mutexinit_mutex; /* Protects smmu pointer */ struct io_pgtable_ops *pgtbl_ops; - boolnon_strict; atomic_tnr_ats_masters; enum arm_smmu_domain_stage stage; diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 0aa6d667274970..3dde22b1f8ffb0 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -761,6 +761,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain, .iommu_dev = smmu->dev, }; + if (!iommu_get_dma_strict()) Ditto here. Sorry for not spotting that sooner :( Robin. + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT; + if (smmu->impl && smmu->impl->init_context) { ret = smmu->impl->init_context(smmu_domain, _cfg, dev); if (ret)
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-31 16:32, Will Deacon wrote: On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote: On 2021-03-31 12:49, Will Deacon wrote: On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote: On 2021-03-30 14:58, Will Deacon wrote: On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict paramete in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This make sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overriden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. 
We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. Like I say, the cleanup is great. I'm just wondering whether there's a better way to express the complicated logic to decide whether or not to use the flush queue than what we end up with: if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) which is mixing up globals, device properties and domain properties. The result is that the driver code ends up just using the global to determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, which is a departure from the current way of doing things. But previously, SMMU only ever saw the global policy piped through the domain attribute by iommu_group_alloc_default_domain(), so there's no functional change there. For DMA domains sure, but I don't think that's the case for unmanaged domains such as those used by VFIO. Eh? This is only relevant to DMA domains anyway. Flush queues are part of the IOVA allocator that VFIO doesn't even use. It's always been the case that unmanaged domains only use strict invalidation. Maybe I'm going mad. With this patch, the SMMU driver unconditionally sets IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is true, no? In which case, that will get set for page-tables corresponding to unmanaged domains as well as DMA domains when it is enabled. That didn't happen before because you couldn't set the attribute for unmanaged domains. What am I missing? Oh cock... sorry, all this time I've been saying what I *expect* it to do, while overlooking the fact that the IO_PGTABLE_QUIRK_NON_STRICT hunks were the bits I forgot to write and Christoph had to fix up. Indeed, those should be checking the domain type too to preserve the existing behaviour. Apologies for the confusion. Robin. 
Obviously some of the above checks could be factored out into some kind of iommu_use_flush_queue() helper that IOMMU drivers can also call if they need to keep in sync. Or maybe we just allow iommu-dma to set IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're treating that as a generic thing now. I think a helper that takes a domain would be a good starting point. You mean device, right? The one condition we currently have is at the device level, and there's really nothing inherent to the domain itself that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care about this). Device would probably work too; you'd pass the first device to attach to the domain when querying this from the SMMU driver, I suppose. Will
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote: > On 2021-03-31 12:49, Will Deacon wrote: > > On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote: > > > On 2021-03-30 14:58, Will Deacon wrote: > > > > On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: > > > > > On 2021-03-30 14:11, Will Deacon wrote: > > > > > > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: > > > > > > > From: Robin Murphy > > > > > > > > > > > > > > Instead make the global iommu_dma_strict paramete in iommu.c > > > > > > > canonical by > > > > > > > exporting helpers to get and set it and use those directly in the > > > > > > > drivers. > > > > > > > > > > > > > > This make sure that the iommu.strict parameter also works for the > > > > > > > AMD and > > > > > > > Intel IOMMU drivers on x86. As those default to lazy flushing a > > > > > > > new > > > > > > > IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to > > > > > > > represent the default if not overriden by an explicit parameter. > > > > > > > > > > > > > > Signed-off-by: Robin Murphy . > > > > > > > [ported on top of the other iommu_attr changes and added a few > > > > > > > small > > > > > > > missing bits] > > > > > > > Signed-off-by: Christoph Hellwig > > > > > > > --- > > > > > > > drivers/iommu/amd/iommu.c | 23 +--- > > > > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 > > > > > > > +--- > > > > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - > > > > > > > drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + > > > > > > > drivers/iommu/dma-iommu.c | 9 +-- > > > > > > > drivers/iommu/intel/iommu.c | 64 > > > > > > > - > > > > > > > drivers/iommu/iommu.c | 27 ++--- > > > > > > > include/linux/iommu.h | 4 +- > > > > > > > 8 files changed, 40 insertions(+), 165 deletions(-) > > > > > > > > > > > > I really like this cleanup, but I can't help wonder if it's going > > > > > > in the > > > > > > wrong direction. 
With SoCs often having multiple IOMMU instances > > > > > > and a > > > > > > distinction between "trusted" and "untrusted" devices, then having > > > > > > the > > > > > > flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound > > > > > > unreasonable to me, but this change makes it a global property. > > > > > > > > > > The intent here was just to streamline the existing behaviour of > > > > > stuffing a > > > > > global property into a domain attribute then pulling it out again in > > > > > the > > > > > illusion that it was in any way per-domain. We're still checking > > > > > dev_is_untrusted() before making an actual decision, and it's not > > > > > like we > > > > > can't add more factors at that point if we want to. > > > > > > > > Like I say, the cleanup is great. I'm just wondering whether there's a > > > > better way to express the complicated logic to decide whether or not to > > > > use > > > > the flush queue than what we end up with: > > > > > > > > if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && > > > > domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) > > > > > > > > which is mixing up globals, device properties and domain properties. The > > > > result is that the driver code ends up just using the global to > > > > determine > > > > whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table > > > > code, > > > > which is a departure from the current way of doing things. > > > > > > But previously, SMMU only ever saw the global policy piped through the > > > domain attribute by iommu_group_alloc_default_domain(), so there's no > > > functional change there. > > > > For DMA domains sure, but I don't think that's the case for unmanaged > > domains such as those used by VFIO. > > Eh? This is only relevant to DMA domains anyway. Flush queues are part of > the IOVA allocator that VFIO doesn't even use. It's always been the case > that unmanaged domains only use strict invalidation. Maybe I'm going mad. 
With this patch, the SMMU driver unconditionally sets IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is true, no? In which case, that will get set for page-tables corresponding to unmanaged domains as well as DMA domains when it is enabled. That didn't happen before because you couldn't set the attribute for unmanaged domains. What am I missing? > > > Obviously some of the above checks could be factored out into some kind of > > > iommu_use_flush_queue() helper that IOMMU drivers can also call if they > > > need > > > to keep in sync. Or maybe we just allow iommu-dma to set > > > IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if > > > we're > > > treating that as a generic thing now. > > > > I think a helper that takes a domain would be a good starting
[PATCH v1 3/4] powerpc/vdso: Separate vvar vma from vdso
From: Dmitry Safonov

Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front") the VVAR page is in front of the VDSO area. As a result it breaks CRIU (Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]" from /proc/../maps points at the ELF/vdso image, rather than at the VVAR data page.

Laurent made a patch to keep CRIU working (by reading the aux vector). But I think it still makes sense to separate the two mappings into different VMAs. It will also make ppc64 less "special" for userspace and as a side-bonus will make the VVAR page un-writable by debuggers (which would previously COW the page, which can be unexpected).

I opportunistically Cc stable on it: I understand that usually such stuff isn't stable material, but that will allow us in CRIU to have one less workaround that is needed just for one release (v5.11) on one platform (ppc64), which we otherwise have to maintain.

I wouldn't go as far as to say that commit 511157ab641e is an ABI regression, as no other userspace got broken, but I'd really appreciate it if it gets backported to v5.11 after v5.12 is released, so as not to complicate the already non-simple CRIU-vdso code. Thanks!
Cc: Andrei Vagin Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Christophe Leroy Cc: Laurent Dufour Cc: Michael Ellerman Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Cc: sta...@vger.kernel.org # v5.11 [1]: https://github.com/checkpoint-restore/criu/issues/1417 Signed-off-by: Dmitry Safonov Tested-by: Christophe Leroy Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/mmu_context.h | 2 +- arch/powerpc/kernel/vdso.c | 54 +++--- 2 files changed, 40 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 652ce85f9410..4bc45d3ed8b0 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -263,7 +263,7 @@ extern void arch_exit_mmap(struct mm_struct *mm); static inline void arch_unmap(struct mm_struct *mm, unsigned long start, unsigned long end) { - unsigned long vdso_base = (unsigned long)mm->context.vdso - PAGE_SIZE; + unsigned long vdso_base = (unsigned long)mm->context.vdso; if (start <= vdso_base && vdso_base < end) mm->context.vdso = NULL; diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c index e839a906fdf2..b14907209822 100644 --- a/arch/powerpc/kernel/vdso.c +++ b/arch/powerpc/kernel/vdso.c @@ -55,10 +55,10 @@ static int vdso_mremap(const struct vm_special_mapping *sm, struct vm_area_struc { unsigned long new_size = new_vma->vm_end - new_vma->vm_start; - if (new_size != text_size + PAGE_SIZE) + if (new_size != text_size) return -EINVAL; - current->mm->context.vdso = (void __user *)new_vma->vm_start + PAGE_SIZE; + current->mm->context.vdso = (void __user *)new_vma->vm_start; return 0; } @@ -73,6 +73,10 @@ static int vdso64_mremap(const struct vm_special_mapping *sm, struct vm_area_str return vdso_mremap(sm, new_vma, _end - _start); } +static struct vm_special_mapping vvar_spec __ro_after_init = { + .name = "[vvar]", +}; + static struct vm_special_mapping vdso32_spec __ro_after_init = { .name = "[vdso]", 
.mremap = vdso32_mremap, @@ -89,11 +93,11 @@ static struct vm_special_mapping vdso64_spec __ro_after_init = { */ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { - struct mm_struct *mm = current->mm; + unsigned long vdso_size, vdso_base, mappings_size; struct vm_special_mapping *vdso_spec; + unsigned long vvar_size = PAGE_SIZE; + struct mm_struct *mm = current->mm; struct vm_area_struct *vma; - unsigned long vdso_size; - unsigned long vdso_base; if (is_32bit_task()) { vdso_spec = _spec; @@ -110,8 +114,8 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_int vdso_base = 0; } - /* Add a page to the vdso size for the data page */ - vdso_size += PAGE_SIZE; + mappings_size = vdso_size + vvar_size; + mappings_size += (VDSO_ALIGNMENT - 1) & PAGE_MASK; /* * pick a base address for the vDSO in process space. We try to put it @@ -119,9 +123,7 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_int * and end up putting it elsewhere. * Add enough to the size so that the result can be aligned. */ - vdso_base = get_unmapped_area(NULL, vdso_base, - vdso_size + ((VDSO_ALIGNMENT - 1) & PAGE_MASK), - 0, 0); + vdso_base = get_unmapped_area(NULL, vdso_base, mappings_size, 0, 0); if (IS_ERR_VALUE(vdso_base)) return vdso_base; @@ -133,7 +135,13 @@
[PATCH v1 1/4] lib/vdso: Mark do_hres_timens() and do_coarse_timens() __always_inline()
In the same spirit as commit c966533f8c6c ("lib/vdso: Mark do_hres() and do_coarse() as __always_inline"), mark do_hres_timens() and do_coarse_timens() __always_inline.

The measurement below is on a non-timens process, i.e. on the fastest path.

On powerpc32, without the patch:

clock-gettime-monotonic-raw:       vdso: 1155 nsec/call
clock-gettime-monotonic-coarse:    vdso: 813 nsec/call
clock-gettime-monotonic:           vdso: 1076 nsec/call

With the patch:

clock-gettime-monotonic-raw:       vdso: 1100 nsec/call
clock-gettime-monotonic-coarse:    vdso: 667 nsec/call
clock-gettime-monotonic:           vdso: 1025 nsec/call

Signed-off-by: Christophe Leroy
---
 lib/vdso/gettimeofday.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 2919f1698140..c6f6dee08746 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -46,8 +46,8 @@ static inline bool vdso_cycles_ok(u64 cycles)
 #endif
 
 #ifdef CONFIG_TIME_NS
-static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
-			  struct __kernel_timespec *ts)
+static __always_inline int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
+					  struct __kernel_timespec *ts)
 {
 	const struct vdso_data *vd = __arch_get_timens_vdso_data();
 	const struct timens_offset *offs = &vdns->offset[clk];
@@ -97,8 +97,8 @@ static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void)
 	return NULL;
 }
 
-static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
-			  struct __kernel_timespec *ts)
+static __always_inline int do_hres_timens(const struct vdso_data *vdns, clockid_t clk,
+					  struct __kernel_timespec *ts)
 {
 	return -EINVAL;
 }
@@ -159,8 +159,8 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk,
 }
 
 #ifdef CONFIG_TIME_NS
-static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
-			    struct __kernel_timespec *ts)
+static __always_inline int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
+					    struct __kernel_timespec *ts)
 {
 	const struct vdso_data *vd = __arch_get_timens_vdso_data();
 	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
@@ -188,8 +188,8 @@ static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
 	return 0;
 }
 #else
-static int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
-			    struct __kernel_timespec *ts)
+static __always_inline int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk,
+					    struct __kernel_timespec *ts)
 {
 	return -1;
 }
-- 
2.25.0
[PATCH v1 2/4] lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()
For the same reason as commit e876f0b69dc9 ("lib/vdso: Allow architectures to provide the vdso data pointer"), powerpc wants to avoid calculation of relative position to code. As the timens_vdso_data is next page to vdso_data, provide vdso_data pointer to __arch_get_timens_vdso_data() in order to ease the calculation on powerpc in following patches. Signed-off-by: Christophe Leroy --- arch/arm64/include/asm/vdso/compat_gettimeofday.h | 3 ++- arch/arm64/include/asm/vdso/gettimeofday.h| 2 +- arch/s390/include/asm/vdso/gettimeofday.h | 3 ++- arch/x86/include/asm/vdso/gettimeofday.h | 3 ++- lib/vdso/gettimeofday.c | 15 +-- 5 files changed, 16 insertions(+), 10 deletions(-) diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h b/arch/arm64/include/asm/vdso/compat_gettimeofday.h index 7508b0ac1d21..ecb6fd4c3c64 100644 --- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h +++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h @@ -155,7 +155,8 @@ static __always_inline const struct vdso_data *__arch_get_vdso_data(void) } #ifdef CONFIG_TIME_NS -static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) +static __always_inline +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { const struct vdso_data *ret; diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h b/arch/arm64/include/asm/vdso/gettimeofday.h index 631ab1281633..de86230a9436 100644 --- a/arch/arm64/include/asm/vdso/gettimeofday.h +++ b/arch/arm64/include/asm/vdso/gettimeofday.h @@ -100,7 +100,7 @@ const struct vdso_data *__arch_get_vdso_data(void) #ifdef CONFIG_TIME_NS static __always_inline -const struct vdso_data *__arch_get_timens_vdso_data(void) +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { return _timens_data; } diff --git a/arch/s390/include/asm/vdso/gettimeofday.h b/arch/s390/include/asm/vdso/gettimeofday.h index ed89ef742530..383c53c3 100644 --- a/arch/s390/include/asm/vdso/gettimeofday.h +++ 
b/arch/s390/include/asm/vdso/gettimeofday.h @@ -68,7 +68,8 @@ long clock_getres_fallback(clockid_t clkid, struct __kernel_timespec *ts) } #ifdef CONFIG_TIME_NS -static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) +static __always_inline +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { return _timens_data; } diff --git a/arch/x86/include/asm/vdso/gettimeofday.h b/arch/x86/include/asm/vdso/gettimeofday.h index df01d7349d79..1936f21ed8cd 100644 --- a/arch/x86/include/asm/vdso/gettimeofday.h +++ b/arch/x86/include/asm/vdso/gettimeofday.h @@ -58,7 +58,8 @@ extern struct ms_hyperv_tsc_page hvclock_page #endif #ifdef CONFIG_TIME_NS -static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) +static __always_inline +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { return __timens_vdso_data; } diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index c6f6dee08746..ce2f69552003 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -49,13 +49,15 @@ static inline bool vdso_cycles_ok(u64 cycles) static __always_inline int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, struct __kernel_timespec *ts) { - const struct vdso_data *vd = __arch_get_timens_vdso_data(); + const struct vdso_data *vd; const struct timens_offset *offs = >offset[clk]; const struct vdso_timestamp *vdso_ts; u64 cycles, last, ns; u32 seq; s64 sec; + vd = vdns - (clk == CLOCK_MONOTONIC_RAW ? 
CS_RAW : CS_HRES_COARSE); + vd = __arch_get_timens_vdso_data(vd); if (clk != CLOCK_MONOTONIC_RAW) vd = [CS_HRES_COARSE]; else @@ -92,7 +94,8 @@ static __always_inline int do_hres_timens(const struct vdso_data *vdns, clockid_ return 0; } #else -static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void) +static __always_inline +const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd) { return NULL; } @@ -162,7 +165,7 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk, static __always_inline int do_coarse_timens(const struct vdso_data *vdns, clockid_t clk, struct __kernel_timespec *ts) { - const struct vdso_data *vd = __arch_get_timens_vdso_data(); + const struct vdso_data *vd = __arch_get_timens_vdso_data(vdns); const struct vdso_timestamp *vdso_ts = >basetime[clk]; const struct timens_offset *offs = >offset[clk]; u64 nsec; @@ -310,7 +313,7 @@ __cvdso_gettimeofday_data(const struct vdso_data *vd, if (unlikely(tz != NULL)) { if (IS_ENABLED(CONFIG_TIME_NS) &&
[PATCH v1 0/4] powerpc/vdso: Add support for time namespaces
This series adds support for time namespaces on powerpc.

All timens selftests are successful.

Christophe Leroy (3):
  lib/vdso: Mark do_hres_timens() and do_coarse_timens() __always_inline()
  lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()
  powerpc/vdso: Add support for time namespaces

Dmitry Safonov (1):
  powerpc/vdso: Separate vvar vma from vdso

 .../include/asm/vdso/compat_gettimeofday.h   |   3 +-
 arch/arm64/include/asm/vdso/gettimeofday.h   |   2 +-
 arch/powerpc/Kconfig                         |   3 +-
 arch/powerpc/include/asm/mmu_context.h       |   2 +-
 arch/powerpc/include/asm/vdso/gettimeofday.h |  10 ++
 arch/powerpc/include/asm/vdso_datapage.h     |   2 -
 arch/powerpc/kernel/vdso.c                   | 138 --
 arch/powerpc/kernel/vdso32/vdso32.lds.S      |   2 +-
 arch/powerpc/kernel/vdso64/vdso64.lds.S      |   2 +-
 arch/s390/include/asm/vdso/gettimeofday.h    |   3 +-
 arch/x86/include/asm/vdso/gettimeofday.h     |   3 +-
 lib/vdso/gettimeofday.c                      |  31 ++--
 12 files changed, 162 insertions(+), 39 deletions(-)

-- 
2.25.0
[PATCH v3 9/9] powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler
Instead of calling irq_create_mapping() to map the IPI for a node, introduce an 'alloc' handler. This is usually an extension to support hierarchy irq_domains which is not exactly the case for XIVE-IPI domain. However, we can now use the irq_domain_alloc_irqs() routine which allocates the IRQ descriptor on the specified node, even better for cache performance on multi node machines. Cc: Thomas Gleixner Signed-off-by: Cédric Le Goater --- I didn't rerun the benchmark to check for a difference. arch/powerpc/sysdev/xive/common.c | 27 +++ 1 file changed, 19 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 06f29cd07448..bb7435ffe99c 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1103,15 +1103,26 @@ static struct irq_chip xive_ipi_chip = { * IPIs are marked per-cpu. We use separate HW interrupts under the * hood but associated with the same "linux" interrupt */ -static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq, - irq_hw_number_t hw) +struct xive_ipi_alloc_info { + irq_hw_number_t hwirq; +}; + +static int xive_ipi_irq_domain_alloc(struct irq_domain *domain, unsigned int virq, +unsigned int nr_irqs, void *arg) { - irq_set_chip_and_handler(virq, _ipi_chip, handle_percpu_irq); + struct xive_ipi_alloc_info *info = arg; + int i; + + for (i = 0; i < nr_irqs; i++) { + irq_domain_set_info(domain, virq + i, info->hwirq + i, _ipi_chip, + domain->host_data, handle_percpu_irq, + NULL, NULL); + } return 0; } static const struct irq_domain_ops xive_ipi_irq_domain_ops = { - .map = xive_ipi_irq_domain_map, + .alloc = xive_ipi_irq_domain_alloc, }; static int __init xive_request_ipi(void) @@ -1136,7 +1147,7 @@ static int __init xive_request_ipi(void) for_each_node(node) { struct xive_ipi_desc *xid = _ipis[node]; - irq_hw_number_t ipi_hwirq = node; + struct xive_ipi_alloc_info info = { node }; /* Skip nodes without CPUs */ if 
(cpumask_empty(cpumask_of_node(node))) @@ -1147,9 +1158,9 @@ static int __init xive_request_ipi(void) * Since the HW interrupt number doesn't have any meaning, * simply use the node number. */ - xid->irq = irq_create_mapping(ipi_domain, ipi_hwirq); - if (!xid->irq) { - ret = -EINVAL; + xid->irq = irq_domain_alloc_irqs(ipi_domain, 1, node, ); + if (xid->irq < 0) { + ret = xid->irq; goto out_free_xive_ipis; } -- 2.26.3
[PATCH v3 8/9] powerpc/xive: Map one IPI interrupt per node
ipistorm [*] can be used to benchmark the raw interrupt rate of an interrupt controller by measuring the number of IPIs a system can sustain. When applied to the XIVE interrupt controller of POWER9 and POWER10 systems, a significant drop of the interrupt rate can be observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all CPUs of the system. The structure is shared and the cache line updates greatly impact the traffic between nodes and the overall IPI performance.

As a workaround, the impact can be reduced by deactivating the IRQ lockup detector ("noirqdebug") which does a lot of accounting in the Linux IRQ descriptor structure and is responsible for most of the performance penalty.

As a fix, this proposal allocates an IPI interrupt per node, to be shared by all CPUs of that node. It solves the scaling issue; the IRQ lockup detector still has an impact, but the XIVE interrupt rate scales linearly. It also improves the "noirqdebug" case, as shown in the tables below.
 * P9 DD2.2 - 2s * 64 threads

                                             "noirqdebug"
                     Mint/s                    Mint/s
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
 --------------------------------------------------------------
 1      0-15    4.984023   4.875405       4.996536   5.048892
        0-31   10.879164  10.544040      10.757632  11.037859
        0-47   15.345301  14.688764      14.926520  15.310053
        0-63   17.064907  17.066812      17.613416  17.874511
 2      0-79   11.768764  21.650749      22.689120  22.566508
        0-95   10.616812  26.878789      28.434703  28.320324
        0-111  10.151693  31.397803      31.771773  32.388122
        0-127   9.948502  33.139336      34.875716  35.224548

 * P10 DD1 - 4s (not homogeneous) 352 threads

                                             "noirqdebug"
                     Mint/s                    Mint/s
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
 --------------------------------------------------------------
 1      0-15    2.409402   2.364108       2.383303   2.395091
        0-31    6.028325   6.046075       6.08       6.073750
        0-47    8.655178   8.644531       8.712830   8.724702
        0-63   11.629652  11.735953      12.088203  12.055979
        0-79   14.392321  14.729959      14.986701  14.973073
        0-95   12.604158  13.004034      17.528748  17.568095
 2      0-111   9.767753  13.719831      19.968606  20.024218
        0-127   6.744566  16.418854      22.898066  22.995110
        0-143   6.005699  19.174421      25.425622  25.417541
        0-159   5.649719  21.938836      27.952662  28.059603
        0-175   5.441410  24.109484      31.133915  31.127996
 3      0-191   5.318341  24.405322      33.999221  33.775354
        0-207   5.191382  26.449769      36.050161  35.867307
        0-223   5.102790  29.356943      39.544135  39.508169
        0-239   5.035295  31.933051      42.135075  42.071975
        0-255   4.969209  34.477367      44.655395  44.757074
 4      0-271   4.907652  35.887016      47.080545  47.318537
        0-287   4.839581  38.076137      50.464307  50.636219
        0-303   4.786031  40.881319      53.478684  53.310759
        0-319   4.743750  43.448424      56.388102  55.973969
        0-335   4.709936  45.623532      59.400930  58.926857
        0-351   4.681413  45.646151      62.035804  61.830057

[*] https://github.com/antonblanchard/ipistorm

Cc: Thomas Gleixner
Signed-off-by: Cédric Le Goater
---
Changes in v3:

 - increased IPI name length
 - use of early_cpu_to_node() for hotplugged CPUs
 - filter CPU-less nodes
 - dropped Greg's Reviewed-by because of the changes

 arch/powerpc/sysdev/xive/xive-internal.h |  2 -
 arch/powerpc/sysdev/xive/common.c        | 60 +++-
 2 files changed, 47 insertions(+), 15 deletions(-)

diff --git
a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h index 9cf57c722faa..b3a456fdd3a5 100644 --- a/arch/powerpc/sysdev/xive/xive-internal.h +++ b/arch/powerpc/sysdev/xive/xive-internal.h @@ -5,8 +5,6 @@ #ifndef __XIVE_INTERNAL_H #define __XIVE_INTERNAL_H -#define XIVE_IPI_HW_IRQ0 /* interrupt source # for IPIs */ - /* * A "disabled" interrupt should never fire, to catch problems * we set its logical number to this diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 69980b49afb7..06f29cd07448 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -63,8 +63,19 @@ static const struct xive_ops *xive_ops; static struct irq_domain *xive_irq_domain; #ifdef CONFIG_SMP -/* The IPIs all use the same logical irq number */ -static u32 xive_ipi_irq; +/* The IPIs use the same logical irq number when on the same chip */ +static struct xive_ipi_desc { + unsigned int irq; + char
[PATCH v3 2/9] powerpc/xive: Introduce an IPI interrupt domain
The IPI interrupt is a special case of the XIVE IRQ domain. When mapping and unmapping the interrupts in the Linux interrupt number space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to distinguish the IPI interrupt from other interrupts of the system. Simplify the XIVE interrupt domain by introducing a specific domain for the IPI. Cc: Thomas Gleixner Signed-off-by: Cédric Le Goater --- Changes in v3: - better error handling of xive_request_ipi() - use of a fwnode_handle to name the new domain - dropped Greg's Reviewed-by because of the changes arch/powerpc/sysdev/xive/common.c | 79 ++- 1 file changed, 46 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 776871274b69..98f4dc916fa1 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1068,24 +1068,58 @@ static struct irq_chip xive_ipi_chip = { .irq_unmask = xive_ipi_do_nothing, }; -static void __init xive_request_ipi(void) +/* + * IPIs are marked per-cpu. 
We use separate HW interrupts under the + * hood but associated with the same "linux" interrupt + */ +static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq, + irq_hw_number_t hw) { + irq_set_chip_and_handler(virq, _ipi_chip, handle_percpu_irq); + return 0; +} + +static const struct irq_domain_ops xive_ipi_irq_domain_ops = { + .map = xive_ipi_irq_domain_map, +}; + +static int __init xive_request_ipi(void) +{ + struct fwnode_handle *fwnode; + struct irq_domain *ipi_domain; unsigned int virq; + int ret = -ENOMEM; - /* -* Initialization failed, move on, we might manage to -* reach the point where we display our errors before -* the system falls appart -*/ - if (!xive_irq_domain) - return; + fwnode = irq_domain_alloc_named_fwnode("XIVE-IPI"); + if (!fwnode) + goto out; + + ipi_domain = irq_domain_create_linear(fwnode, 1, + _ipi_irq_domain_ops, NULL); + if (!ipi_domain) + goto out_free_fwnode; /* Initialize it */ - virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ); + virq = irq_create_mapping(ipi_domain, XIVE_IPI_HW_IRQ); + if (!virq) { + ret = -EINVAL; + goto out_free_domain; + } + xive_ipi_irq = virq; - WARN_ON(request_irq(virq, xive_muxed_ipi_action, - IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL)); + ret = request_irq(virq, xive_muxed_ipi_action, + IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL); + + WARN(ret < 0, "Failed to request IPI %d: %d\n", virq, ret); + return ret; + +out_free_domain: + irq_domain_remove(ipi_domain); +out_free_fwnode: + irq_domain_free_fwnode(fwnode); +out: + return ret; } static int xive_setup_cpu_ipi(unsigned int cpu) @@ -1179,19 +1213,6 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq, */ irq_clear_status_flags(virq, IRQ_LEVEL); -#ifdef CONFIG_SMP - /* IPIs are special and come up with HW number 0 */ - if (hw == XIVE_IPI_HW_IRQ) { - /* -* IPIs are marked per-cpu. 
We use separate HW interrupts under -* the hood but associated with the same "linux" interrupt -*/ - irq_set_chip_and_handler(virq, _ipi_chip, -handle_percpu_irq); - return 0; - } -#endif - rc = xive_irq_alloc_data(virq, hw); if (rc) return rc; @@ -1203,15 +1224,7 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq, static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq) { - struct irq_data *data = irq_get_irq_data(virq); - unsigned int hw_irq; - - /* XXX Assign BAD number */ - if (!data) - return; - hw_irq = (unsigned int)irqd_to_hwirq(data); - if (hw_irq != XIVE_IPI_HW_IRQ) - xive_irq_free_data(virq); + xive_irq_free_data(virq); } static int xive_irq_domain_xlate(struct irq_domain *h, struct device_node *ct, -- 2.26.3
[PATCH v3 6/9] powerpc/xive: Simplify the dump of XIVE interrupts under xmon
Move the xmon routine under XIVE subsystem and rework the loop on the interrupts taking into account the xive_irq_domain to filter out IPIs. Reviewed-by: Greg Kurz Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/xive.h | 1 + arch/powerpc/sysdev/xive/common.c | 14 ++ arch/powerpc/xmon/xmon.c | 28 ++-- 3 files changed, 17 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 9a312b975ca8..aa094a8655b0 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -102,6 +102,7 @@ void xive_flush_interrupt(void); /* xmon hook */ void xmon_xive_do_dump(int cpu); int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d); +void xmon_xive_get_irq_all(void); /* APIs used by KVM */ u32 xive_native_default_eq_shift(void); diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index f5fe60c194bc..4c6e2e1289f5 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -289,6 +289,20 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) return 0; } +void xmon_xive_get_irq_all(void) +{ + unsigned int i; + struct irq_desc *desc; + + for_each_irq_desc(i, desc) { + struct irq_data *d = irq_desc_get_irq_data(desc); + unsigned int hwirq = (unsigned int)irqd_to_hwirq(d); + + if (d->domain == xive_irq_domain) + xmon_xive_get_irq_config(hwirq, d); + } +} + #endif /* CONFIG_XMON */ static unsigned int xive_get_irq(void) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 3fe37495f63d..80fbf8968f77 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2727,30 +2727,6 @@ static void dump_all_xives(void) dump_one_xive(cpu); } -static void dump_one_xive_irq(u32 num, struct irq_data *d) -{ - xmon_xive_get_irq_config(num, d); -} - -static void dump_all_xive_irq(void) -{ - unsigned int i; - struct irq_desc *desc; - - for_each_irq_desc(i, desc) { - struct irq_data *d = 
irq_desc_get_irq_data(desc); - unsigned int hwirq; - - if (!d) - continue; - - hwirq = (unsigned int)irqd_to_hwirq(d); - /* IPIs are special (HW number 0) */ - if (hwirq) - dump_one_xive_irq(hwirq, d); - } -} - static void dump_xives(void) { unsigned long num; @@ -2767,9 +2743,9 @@ static void dump_xives(void) return; } else if (c == 'i') { if (scanhex()) - dump_one_xive_irq(num, NULL); + xmon_xive_get_irq_config(num, NULL); else - dump_all_xive_irq(); + xmon_xive_get_irq_all(); return; } -- 2.26.3
[PATCH v3 7/9] powerpc/xive: Fix xmon command "dxi"
When under xmon, the "dxi" command dumps the state of the XIVE interrupts. If an interrupt number is specified, only the state of the associated XIVE interrupt is dumped. This form of the command lacks an irq_data parameter which is nevertheless used by xmon_xive_get_irq_config(), leading to an xmon crash. Fix that by doing a lookup in the system IRQ mapping to query the IRQ descriptor data. Invalid interrupt numbers, or not belonging to the XIVE IRQ domain, OPAL event interrupt number for instance, should be caught by the previous query done at the firmware level. Reported-by: kernel test robot Reported-by: Dan Carpenter Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform") Tested-by: Greg Kurz Reviewed-by: Greg Kurz Signed-off-by: Cédric Le Goater --- arch/powerpc/sysdev/xive/common.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 4c6e2e1289f5..69980b49afb7 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -253,17 +253,20 @@ notrace void xmon_xive_do_dump(int cpu) xmon_printf("\n"); } +static struct irq_data *xive_get_irq_data(u32 hw_irq) +{ + unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq); + + return irq ? irq_get_irq_data(irq) : NULL; +} + int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) { - struct irq_chip *chip = irq_data_get_irq_chip(d); int rc; u32 target; u8 prio; u32 lirq; - if (!is_xive_irq(chip)) - return -EINVAL; - rc = xive_ops->get_irq_config(hw_irq, , , ); if (rc) { xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); @@ -273,6 +276,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ", hw_irq, target, prio, lirq); + if (!d) + d = xive_get_irq_data(hw_irq); + if (d) { struct xive_irq_data *xd = irq_data_get_irq_handler_data(d); u64 val = xive_esb_read(xd, XIVE_ESB_GET); -- 2.26.3
[PATCH v3 5/9] powerpc/xive: Drop check on irq_data in xive_core_debug_show()
When looping on IRQ descriptor, irq_data is always valid. Reported-by: kernel test robot Reported-by: Dan Carpenter Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") Reviewed-by: Greg Kurz Signed-off-by: Cédric Le Goater --- arch/powerpc/sysdev/xive/common.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 4149ca846e7c..f5fe60c194bc 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1607,6 +1607,8 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) u32 target; u8 prio; u32 lirq; + struct xive_irq_data *xd; + u64 val; rc = xive_ops->get_irq_config(hw_irq, , , ); if (rc) { @@ -1617,17 +1619,14 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ", hw_irq, target, prio, lirq); - if (d) { - struct xive_irq_data *xd = irq_data_get_irq_handler_data(d); - u64 val = xive_esb_read(xd, XIVE_ESB_GET); - - seq_printf(m, "flags=%c%c%c PQ=%c%c", - xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ', - xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ', - xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ', - val & XIVE_ESB_VAL_P ? 'P' : '-', - val & XIVE_ESB_VAL_Q ? 'Q' : '-'); - } + xd = irq_data_get_irq_handler_data(d); + val = xive_esb_read(xd, XIVE_ESB_GET); + seq_printf(m, "flags=%c%c%c PQ=%c%c", + xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ', + xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ', + xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ', + val & XIVE_ESB_VAL_P ? 'P' : '-', + val & XIVE_ESB_VAL_Q ? 'Q' : '-'); seq_puts(m, "\n"); } -- 2.26.3
[PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node
Hello,

ipistorm [*] can be used to benchmark the raw interrupt rate of an interrupt controller by measuring the number of IPIs a system can sustain. When applied to the XIVE interrupt controller of POWER9 and POWER10 systems, a significant drop of the interrupt rate can be observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all CPUs of the system. The structure is shared and the cache line updates greatly impact the traffic between nodes and the overall IPI performance.

As a workaround, the impact can be reduced by deactivating the IRQ lockup detector ("noirqdebug") which does a lot of accounting in the Linux IRQ descriptor structure and is responsible for most of the performance penalty.

As a fix, this proposal allocates an IPI interrupt per node, to be shared by all CPUs of that node. It solves the scaling issue; the IRQ lockup detector still has an impact, but the XIVE interrupt rate scales linearly. It also improves the "noirqdebug" case, as shown in the tables below.
* P9 DD2.2 - 2s * 64 threads

                                            "noirqdebug"
                        Mint/s                  Mint/s
 chips  cpus       IPI/sys   IPI/chip   IPI/chip    IPI/sys
 ----------------------------------------------------------
 1      0-15      4.984023   4.875405   4.996536   5.048892
        0-31     10.879164  10.544040  10.757632  11.037859
        0-47     15.345301  14.688764  14.926520  15.310053
        0-63     17.064907  17.066812  17.613416  17.874511
 2      0-79     11.768764  21.650749  22.689120  22.566508
        0-95     10.616812  26.878789  28.434703  28.320324
        0-111    10.151693  31.397803  31.771773  32.388122
        0-127     9.948502  33.139336  34.875716  35.224548

* P10 DD1 - 4s (not homogeneous) 352 threads

                                            "noirqdebug"
                        Mint/s                  Mint/s
 chips  cpus       IPI/sys   IPI/chip   IPI/chip    IPI/sys
 ----------------------------------------------------------
 1      0-15      2.409402   2.364108   2.383303   2.395091
        0-31      6.028325   6.046075   6.08       6.073750
        0-47      8.655178   8.644531   8.712830   8.724702
        0-63     11.629652  11.735953  12.088203  12.055979
        0-79     14.392321  14.729959  14.986701  14.973073
        0-95     12.604158  13.004034  17.528748  17.568095
 2      0-111     9.767753  13.719831  19.968606  20.024218
        0-127     6.744566  16.418854  22.898066  22.995110
        0-143     6.005699  19.174421  25.425622  25.417541
        0-159     5.649719  21.938836  27.952662  28.059603
        0-175     5.441410  24.109484  31.133915  31.127996
 3      0-191     5.318341  24.405322  33.999221  33.775354
        0-207     5.191382  26.449769  36.050161  35.867307
        0-223     5.102790  29.356943  39.544135  39.508169
        0-239     5.035295  31.933051  42.135075  42.071975
        0-255     4.969209  34.477367  44.655395  44.757074
 4      0-271     4.907652  35.887016  47.080545  47.318537
        0-287     4.839581  38.076137  50.464307  50.636219
        0-303     4.786031  40.881319  53.478684  53.310759
        0-319     4.743750  43.448424  56.388102  55.973969
        0-335     4.709936  45.623532  59.400930  58.926857
        0-351     4.681413  45.646151  62.035804  61.830057

[*] https://github.com/antonblanchard/ipistorm

Thanks,

C.
Changes in v3: - improved commit log for the misuse of "ibm,chip-id" - better error handling of xive_request_ipi() - use of a fwnode_handle to name the new domain - increased IPI name length - use of early_cpu_to_node() for hotplugged CPUs - filter CPU-less nodes Changes in v2: - extra simplification on xmon - fixes on issues reported by the kernel test robot Cédric Le Goater (9): powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property powerpc/xive: Introduce an IPI interrupt domain powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ powerpc/xive: Simplify xive_core_debug_show() powerpc/xive: Drop check on irq_data in xive_core_debug_show() powerpc/xive: Simplify the dump of XIVE interrupts under xmon powerpc/xive: Fix xmon command "dxi" powerpc/xive: Map one IPI interrupt per node powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler arch/powerpc/include/asm/xive.h | 1 + arch/powerpc/sysdev/xive/xive-internal.h | 2 - arch/powerpc/sysdev/xive/common.c| 211 +++ arch/powerpc/xmon/xmon.c | 28 +-- 4 files changed, 139 insertions(+), 103 deletions(-) -- 2.26.3
[PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
The 'chip_id' field of the XIVE CPU structure is used to choose a target for a source located on the same chip when possible. The XIVE driver queries the chip id value from the "ibm,chip-id" DT property, but this property is not available on all platforms. It was first introduced on the PowerNV platform and later, under QEMU for pseries. However, the property does not exist under PowerVM since it is not specified in PAPR.

cpu_to_node() is a better alternative. On the PowerNV platform, the node id is computed from the "ibm,associativity" property of the CPU. Its value is built in the OPAL firmware from the physical chip id and is equivalent to "ibm,chip-id". On pSeries, the hcall H_HOME_NODE_ASSOCIATIVITY returns the node id.

It should also be noted that under QEMU/KVM, "ibm,chip-id" is miscalculated with unusual SMT configurations. This leads to a bogus chip id value being returned by of_get_ibm_chip_id().

Cc: David Gibson
Signed-off-by: Cédric Le Goater
---
Changes in v3:

  - improved commit log for the misuse of "ibm,chip-id"

 arch/powerpc/sysdev/xive/common.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 7e08be5e5e4a..776871274b69 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1336,16 +1336,11 @@ static int xive_prepare_cpu(unsigned int cpu)
 	xc = per_cpu(xive_cpu, cpu);
 	if (!xc) {
-		struct device_node *np;
-
 		xc = kzalloc_node(sizeof(struct xive_cpu),
 				  GFP_KERNEL, cpu_to_node(cpu));
 		if (!xc)
 			return -ENOMEM;
-		np = of_get_cpu_node(cpu, NULL);
-		if (np)
-			xc->chip_id = of_get_ibm_chip_id(np);
-		of_node_put(np);
+		xc->chip_id = cpu_to_node(cpu);
 		xc->hw_ipi = XIVE_BAD_IRQ;
 
 		per_cpu(xive_cpu, cpu) = xc;
-- 
2.26.3
[PATCH v3 4/9] powerpc/xive: Simplify xive_core_debug_show()
Now that the IPI interrupt has its own domain, the checks on the HW interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a check on the domain.

Reviewed-by: Greg Kurz
Signed-off-by: Cédric Le Goater
---
 arch/powerpc/sysdev/xive/common.c | 18 --
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 8bca9aca0607..4149ca846e7c 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1600,17 +1600,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu)
 	seq_puts(m, "\n");
 }
 
-static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
+static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
 {
-	struct irq_chip *chip = irq_data_get_irq_chip(d);
+	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
 	int rc;
 	u32 target;
 	u8 prio;
 	u32 lirq;
 
-	if (!is_xive_irq(chip))
-		return;
-
 	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
 	if (rc) {
 		seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
@@ -1648,16 +1645,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
 	for_each_irq_desc(i, desc) {
 		struct irq_data *d = irq_desc_get_irq_data(desc);
-		unsigned int hw_irq;
-
-		if (!d)
-			continue;
-
-		hw_irq = (unsigned int)irqd_to_hwirq(d);
-
-		/* IPIs are special (HW number 0) */
-		if (hw_irq != XIVE_IPI_HW_IRQ)
-			xive_debug_show_irq(m, hw_irq, d);
+		if (d->domain == xive_irq_domain)
+			xive_debug_show_irq(m, d);
 	}
 	return 0;
 }
-- 
2.26.3
[PATCH v3 3/9] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
The IPI interrupt has its own domain now. Testing the HW interrupt number is not needed anymore. Reviewed-by: Greg Kurz Signed-off-by: Cédric Le Goater --- arch/powerpc/sysdev/xive/common.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 98f4dc916fa1..8bca9aca0607 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1417,13 +1417,12 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc) struct irq_desc *desc = irq_to_desc(irq); struct irq_data *d = irq_desc_get_irq_data(desc); struct xive_irq_data *xd; - unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); /* * Ignore anything that isn't a XIVE irq and ignore * IPIs, so can just be dropped. */ - if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ) + if (d->domain != xive_irq_domain) continue; /* -- 2.26.3
[PATCH v6 9/9] powerpc/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
From: Guo Ren

We don't have a native hw xchg16 instruction, so let the generic qspinlock code deal with it. Implementing xchg16 with full-word atomic xchg instructions carries a semantic risk for atomic operations. This patch removes the generic qspinlock code's dependency on the architecture's xchg16.

It is also not needed when PPC_LBARX_LWARX is enabled; see the link below.

Signed-off-by: Guo Ren
Link: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20201107032328.2454582-1-npig...@gmail.com/
Cc: Christophe Leroy
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 386ae12d8523..6133ad51690e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -151,6 +151,7 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
+	select ARCH_USE_QUEUED_SPINLOCKS_XCHG32	if PPC_QUEUED_SPINLOCKS && !PPC_LBARX_LWARX
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WANT_LD_ORPHAN_WARN
-- 
2.17.1
[PATCH v6 8/9] xtensa: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
From: Guo Ren

We don't have a native hw xchg16 instruction, so let the generic qspinlock code deal with it. Implementing xchg16 with full-word atomic xchg instructions carries a semantic risk for atomic operations. This patch removes the generic qspinlock code's dependency on the architecture's xchg16.

Signed-off-by: Guo Ren
Cc: Arnd Bergmann
Cc: Chris Zankel
Cc: Max Filippov
---
 arch/xtensa/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 9ad6b7b82707..f19d780638f7 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -9,6 +9,7 @@ config XTENSA
 	select ARCH_HAS_DMA_SET_UNCACHED if MMU
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BUILDTIME_TABLE_SORT
-- 
2.17.1
[PATCH v6 7/9] sparc: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
From: Guo Ren

We don't have a native hw xchg16 instruction, so let the generic qspinlock code deal with it. Implementing xchg16 with full-word atomic xchg instructions carries a semantic risk for atomic operations. This patch removes the generic qspinlock code's dependency on the architecture's xchg16.

Signed-off-by: Guo Ren
Cc: Arnd Bergmann
Cc: David S. Miller
Cc: Rob Gardner
---
 arch/sparc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 164a5254c91c..1079fe3f058c 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -91,6 +91,7 @@ config SPARC64
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
 	select GENERIC_TIME_VSYSCALL
 	select ARCH_CLOCKSOURCE_DATA
 	select ARCH_HAS_PTE_SPECIAL
-- 
2.17.1
[PATCH v6 6/9] openrisc: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
From: Guo Ren

We don't have a native hw xchg16 instruction, so let the generic qspinlock code deal with it. Implementing xchg16 with full-word atomic xchg instructions carries a semantic risk for atomic operations. This patch removes the generic qspinlock code's dependency on the architecture's xchg16.

Signed-off-by: Guo Ren
Cc: Arnd Bergmann
Cc: Jonas Bonn
Cc: Stefan Kristiansson
Cc: Stafford Horne
Cc: openr...@lists.librecores.org
---
 arch/openrisc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 591acc5990dc..b299e409429f 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -33,6 +33,7 @@ config OPENRISC
 	select OR1K_PIC
 	select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1
 	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
 	select ARCH_USE_QUEUED_RWLOCKS
 	select OMPIC if SMP
 	select ARCH_WANT_FRAME_POINTERS
-- 
2.17.1
[PATCH v6 5/9] csky: Convert custom spinlock/rwlock to generic qspinlock/qrwlock
From: Guo Ren

Update the C-SKY port to use the generic qspinlock and qrwlock. C-SKY only supports ldex.w/stex.w with word (double-word) sized & aligned accesses, so it must select XCHG32 to make qspinlock use only word atomics in xchg_tail. The default is still the ticket lock.

Signed-off-by: Guo Ren
Cc: Waiman Long
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: Arnd Bergmann
---
 arch/csky/Kconfig                      | 8 ++++++++
 arch/csky/include/asm/Kbuild           | 2 ++
 arch/csky/include/asm/spinlock.h       | 4 ++++
 arch/csky/include/asm/spinlock_types.h | 4 ++++
 4 files changed, 18 insertions(+)

diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 34e91224adc3..ae12332edb7b 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -8,6 +8,8 @@ config CSKY
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS
+	select ARCH_USE_QUEUED_SPINLOCKS	if !CSKY_TICKET_LOCK
+	select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
 	select ARCH_WANT_FRAME_POINTERS	if !CPU_CK610
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select COMMON_CLK
@@ -304,6 +306,12 @@ config NR_CPUS
 	depends on SMP
 	default "4"
 
+config CSKY_TICKET_LOCK
+	bool "Ticket-based spin-locking"
+	default y
+	help
+	  Say Y here to use ticket-based spin-locking.
+
 config HIGHMEM
 	bool "High Memory Support"
 	depends on !CPU_CK610
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index cc24bb8e539f..2a2d09963bb9 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -2,6 +2,8 @@
 generic-y += asm-offsets.h
 generic-y += gpio.h
 generic-y += kvm_para.h
+generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
+generic-y += qspinlock.h
 generic-y += user.h
 generic-y += vmlinux.lds.h
diff --git a/arch/csky/include/asm/spinlock.h b/arch/csky/include/asm/spinlock.h
index 69677167977a..fe98ad8ece51 100644
--- a/arch/csky/include/asm/spinlock.h
+++ b/arch/csky/include/asm/spinlock.h
@@ -6,6 +6,7 @@
 
 #include
 #include
+#ifdef CONFIG_CSKY_TICKET_LOCK
 /*
  * Ticket-based spin-locking.
  */
@@ -80,6 +81,9 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 	return (tickets.next - tickets.owner) > 1;
 }
 #define arch_spin_is_contended	arch_spin_is_contended
+#else /* CONFIG_CSKY_TICKET_LOCK */
+#include
+#endif /* CONFIG_CSKY_TICKET_LOCK */
 
 #include
diff --git a/arch/csky/include/asm/spinlock_types.h b/arch/csky/include/asm/spinlock_types.h
index 8ff0f6ff3a00..547f035f6dd5 100644
--- a/arch/csky/include/asm/spinlock_types.h
+++ b/arch/csky/include/asm/spinlock_types.h
@@ -7,6 +7,7 @@
 # error "please don't include this file directly"
 #endif
 
+#ifdef CONFIG_CSKY_TICKET_LOCK
 #define TICKET_NEXT	16
 
 typedef struct {
@@ -21,6 +22,9 @@ typedef struct {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#else
+#include
+#endif
 
 #include
-- 
2.17.1
[PATCH v6 4/9] csky: locks: Optimize coding convention
From: Guo Ren

- Use smp_cond_load_acquire in arch_spin_lock, on Peter's advice.
- Use __smp_acquire_fence in arch_spin_trylock.
- Use smp_store_release in arch_spin_unlock.

All of the above are just coding conventions and don't affect the functionality.

TODO in smp_cond_load_acquire for the architecture:

- current csky only has: lr.w val / sc.w val2 (any other store to p0 will make the sc.w fail)
- but smp_cond_load_acquire needs: lr.w val / wfe (any store to p0 will send the event to let wfe retire)

Signed-off-by: Guo Ren
Link: https://lore.kernel.org/linux-riscv/caahsdy1jhlufwu7rucaq+ruwrbks2ksdva7eprt8--4zfof...@mail.gmail.com/T/#m13adac285b7f51f4f879a5d6b65753ecb1a7524e
Cc: Peter Zijlstra
Cc: Arnd Bergmann
---
 arch/csky/include/asm/spinlock.h | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/csky/include/asm/spinlock.h b/arch/csky/include/asm/spinlock.h
index 69f5aa249c5f..69677167977a 100644
--- a/arch/csky/include/asm/spinlock.h
+++ b/arch/csky/include/asm/spinlock.h
@@ -26,10 +26,8 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 		: "r"(p), "r"(ticket_next)
 		: "cc");
 
-	while (lockval.tickets.next != lockval.tickets.owner)
-		lockval.tickets.owner = READ_ONCE(lock->tickets.owner);
-
-	smp_mb();
+	smp_cond_load_acquire(&lock->tickets.owner,
+			      VAL == lockval.tickets.next);
 }
 
 static inline int arch_spin_trylock(arch_spinlock_t *lock)
@@ -55,15 +53,14 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 	} while (!res);
 
 	if (!contended)
-		smp_mb();
+		__smp_acquire_fence();
 
 	return !contended;
 }
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-	smp_mb();
-	WRITE_ONCE(lock->tickets.owner, lock->tickets.owner + 1);
+	smp_store_release(&lock->tickets.owner, lock->tickets.owner + 1);
 }
 
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
-- 
2.17.1
[PATCH v6 3/9] riscv: locks: Introduce ticket-based spinlock implementation
From: Guo Ren

This patch introduces a ticket lock implementation for riscv, along the same lines as the implementations for arch/arm & arch/csky. We still use qspinlock by default.

Signed-off-by: Guo Ren
Cc: Peter Zijlstra
Cc: Anup Patel
Cc: Arnd Bergmann
---
 arch/riscv/Kconfig                      |  7 ++-
 arch/riscv/include/asm/spinlock.h       | 84 +++++++++++++++++++++++++
 arch/riscv/include/asm/spinlock_types.h | 17 +++++
 3 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 67cc65ba1ea1..34d0276f01d5 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -34,7 +34,7 @@ config RISCV
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
 	select ARCH_USE_QUEUED_RWLOCKS
-	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_USE_QUEUED_SPINLOCKS	if !RISCV_TICKET_LOCK
 	select ARCH_USE_QUEUED_SPINLOCKS_XCHG32
 	select CLONE_BACKWARDS
 	select CLINT_TIMER if !MMU
@@ -344,6 +344,11 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
 	def_bool y
 	depends on NUMA
 
+config RISCV_TICKET_LOCK
+	bool "Ticket-based spin-locking"
+	help
+	  Say Y here to use ticket-based spin-locking.
+
 config RISCV_ISA_C
 	bool "Emit compressed instructions when building Linux"
 	default y
diff --git a/arch/riscv/include/asm/spinlock.h b/arch/riscv/include/asm/spinlock.h
index a557de67a425..90b7eaa950cf 100644
--- a/arch/riscv/include/asm/spinlock.h
+++ b/arch/riscv/include/asm/spinlock.h
@@ -7,7 +7,91 @@
 #ifndef _ASM_RISCV_SPINLOCK_H
 #define _ASM_RISCV_SPINLOCK_H
 
+#ifdef CONFIG_RISCV_TICKET_LOCK
+#ifdef CONFIG_32BIT
+#define __ASM_SLLIW	"slli\t"
+#define __ASM_SRLIW	"srli\t"
+#else
+#define __ASM_SLLIW	"slliw\t"
+#define __ASM_SRLIW	"srliw\t"
+#endif
+
+/*
+ * Ticket-based spin-locking.
+ */
+static inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+	arch_spinlock_t lockval;
+	u32 tmp;
+
+	asm volatile (
+		"1:	lr.w	%0, %2	\n"
+		"	mv	%1, %0	\n"
+		"	addw	%0, %0, %3	\n"
+		"	sc.w	%0, %0, %2	\n"
+		"	bnez	%0, 1b	\n"
+		: "=&r" (tmp), "=&r" (lockval), "+A" (lock->lock)
+		: "r" (1 << TICKET_NEXT)
+		: "memory");
+
+	smp_cond_load_acquire(&lock->tickets.owner,
+			      VAL == lockval.tickets.next);
+}
+
+static inline int arch_spin_trylock(arch_spinlock_t *lock)
+{
+	u32 tmp, contended, res;
+
+	do {
+		asm volatile (
+		"	lr.w	%0, %3	\n"
+		__ASM_SRLIW"%1, %0, %5	\n"
+		__ASM_SLLIW"%2, %0, %5	\n"
+		"	or	%1, %2, %1	\n"
+		"	li	%2, 0	\n"
+		"	sub	%1, %1, %0	\n"
+		"	bnez	%1, 1f	\n"
+		"	addw	%0, %0, %4	\n"
+		"	sc.w	%2, %0, %3	\n"
+		"1:	\n"
+		: "=&r" (tmp), "=&r" (contended), "=&r" (res),
+		  "+A" (lock->lock)
+		: "r" (1 << TICKET_NEXT), "I" (TICKET_NEXT)
+		: "memory");
+	} while (res);
+
+	if (!contended)
+		__atomic_acquire_fence();
+
+	return !contended;
+}
+
+static inline void arch_spin_unlock(arch_spinlock_t *lock)
+{
+	smp_store_release(&lock->tickets.owner, lock->tickets.owner + 1);
+}
+
+static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+	return lock.tickets.owner == lock.tickets.next;
+}
+
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
+{
+	return !arch_spin_value_unlocked(READ_ONCE(*lock));
+}
+
+static inline int arch_spin_is_contended(arch_spinlock_t *lock)
+{
+	struct __raw_tickets tickets = READ_ONCE(lock->tickets);
+
+	return (tickets.next - tickets.owner) > 1;
+}
+#define arch_spin_is_contended	arch_spin_is_contended
+#else /* CONFIG_RISCV_TICKET_LOCK */
 #include
+#endif /* CONFIG_RISCV_TICKET_LOCK */
+
 #include
 
 #endif /* _ASM_RISCV_SPINLOCK_H */
diff --git a/arch/riscv/include/asm/spinlock_types.h b/arch/riscv/include/asm/spinlock_types.h
index d033a973f287..afbb19841d0f 100644
--- a/arch/riscv/include/asm/spinlock_types.h
+++ b/arch/riscv/include/asm/spinlock_types.h
@@ -10,7 +10,24 @@
 # error "please don't include this file directly"
 #endif
 
+#ifdef CONFIG_RISCV_TICKET_LOCK
+#define TICKET_NEXT	16
+
+typedef struct {
+	union {
+		u32 lock;
+		struct __raw_tickets {
+			/* little endian */
+			u16 owner;
+			u16 next;
+		} tickets;
+	};
+} arch_spinlock_t;
+
+#define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#else
 #include
+#endif
[PATCH v6 2/9] riscv: Convert custom spinlock/rwlock to generic qspinlock/qrwlock
From: Michael Clark Update the RISC-V port to use the generic qspinlock and qrwlock. This patch requires support for xchg_xtail for full-word which are added by a previous patch: Guo added select ARCH_USE_QUEUED_SPINLOCKS_XCHG32 in Kconfig Guo fixed up compile error which made by below include sequence: +#include +#include Signed-off-by: Michael Clark Co-developed-by: Guo Ren Tested-by: Guo Ren Signed-off-by: Guo Ren Link: https://lore.kernel.org/linux-riscv/20190211043829.30096-3-michaeljcl...@mac.com/ Cc: Peter Zijlstra Cc: Anup Patel Cc: Arnd Bergmann Cc: Palmer Dabbelt --- arch/riscv/Kconfig | 3 + arch/riscv/include/asm/Kbuild | 3 + arch/riscv/include/asm/spinlock.h | 126 +--- arch/riscv/include/asm/spinlock_types.h | 15 +-- 4 files changed, 11 insertions(+), 136 deletions(-) diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 87d7b52f278f..67cc65ba1ea1 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -33,6 +33,9 @@ config RISCV select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU select ARCH_WANT_FRAME_POINTERS select ARCH_WANT_HUGE_PMD_SHARE if 64BIT + select ARCH_USE_QUEUED_RWLOCKS + select ARCH_USE_QUEUED_SPINLOCKS + select ARCH_USE_QUEUED_SPINLOCKS_XCHG32 select CLONE_BACKWARDS select CLINT_TIMER if !MMU select COMMON_CLK diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild index 445ccc97305a..750c1056b90f 100644 --- a/arch/riscv/include/asm/Kbuild +++ b/arch/riscv/include/asm/Kbuild @@ -3,5 +3,8 @@ generic-y += early_ioremap.h generic-y += extable.h generic-y += flat.h generic-y += kvm_para.h +generic-y += mcs_spinlock.h +generic-y += qrwlock.h +generic-y += qspinlock.h generic-y += user.h generic-y += vmlinux.lds.h diff --git a/arch/riscv/include/asm/spinlock.h b/arch/riscv/include/asm/spinlock.h index f4f7fa1b7ca8..a557de67a425 100644 --- a/arch/riscv/include/asm/spinlock.h +++ b/arch/riscv/include/asm/spinlock.h @@ -7,129 +7,7 @@ #ifndef _ASM_RISCV_SPINLOCK_H #define _ASM_RISCV_SPINLOCK_H -#include -#include 
-#include
-
-/*
- * Simple spin lock operations.  These provide no fairness guarantees.
- */
-
-/* FIXME: Replace this with a ticket lock, like MIPS. */
-
-#define arch_spin_is_locked(x)	(READ_ONCE((x)->lock) != 0)
-
-static inline void arch_spin_unlock(arch_spinlock_t *lock)
-{
-	smp_store_release(&lock->lock, 0);
-}
-
-static inline int arch_spin_trylock(arch_spinlock_t *lock)
-{
-	int tmp = 1, busy;
-
-	__asm__ __volatile__ (
-		"	amoswap.w %0, %2, %1\n"
-		RISCV_ACQUIRE_BARRIER
-		: "=r" (busy), "+A" (lock->lock)
-		: "r" (tmp)
-		: "memory");
-
-	return !busy;
-}
-
-static inline void arch_spin_lock(arch_spinlock_t *lock)
-{
-	while (1) {
-		if (arch_spin_is_locked(lock))
-			continue;
-
-		if (arch_spin_trylock(lock))
-			break;
-	}
-}
-
-/***********************************************************/
-
-static inline void arch_read_lock(arch_rwlock_t *lock)
-{
-	int tmp;
-
-	__asm__ __volatile__(
-		"1:	lr.w	%1, %0\n"
-		"	bltz	%1, 1b\n"
-		"	addi	%1, %1, 1\n"
-		"	sc.w	%1, %1, %0\n"
-		"	bnez	%1, 1b\n"
-		RISCV_ACQUIRE_BARRIER
-		: "+A" (lock->lock), "=&r" (tmp)
-		:: "memory");
-}
-
-static inline void arch_write_lock(arch_rwlock_t *lock)
-{
-	int tmp;
-
-	__asm__ __volatile__(
-		"1:	lr.w	%1, %0\n"
-		"	bnez	%1, 1b\n"
-		"	li	%1, -1\n"
-		"	sc.w	%1, %1, %0\n"
-		"	bnez	%1, 1b\n"
-		RISCV_ACQUIRE_BARRIER
-		: "+A" (lock->lock), "=&r" (tmp)
-		:: "memory");
-}
-
-static inline int arch_read_trylock(arch_rwlock_t *lock)
-{
-	int busy;
-
-	__asm__ __volatile__(
-		"1:	lr.w	%1, %0\n"
-		"	bltz	%1, 1f\n"
-		"	addi	%1, %1, 1\n"
-		"	sc.w	%1, %1, %0\n"
-		"	bnez	%1, 1b\n"
-		RISCV_ACQUIRE_BARRIER
-		"1:\n"
-		: "+A" (lock->lock), "=&r" (busy)
-		:: "memory");
-
-	return !busy;
-}
-
-static inline int arch_write_trylock(arch_rwlock_t *lock)
-{
-	int busy;
-
-	__asm__ __volatile__(
-		"1:	lr.w	%1, %0\n"
-		"	bnez	%1, 1f\n"
-		"	li	%1, -1\n"
-		"	sc.w	%1, %1, %0\n"
-		"	bnez	%1, 1b\n"
-		RISCV_ACQUIRE_BARRIER
-		"1:\n"
-		: "+A" (lock->lock), "=&r" (busy)
-		:: "memory");
-
[PATCH v6 1/9] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
From: Guo Ren

Some architectures don't have a sub-word swap atomic instruction; they only have the full-word one. The sub-word swap only improves performance when NR_CPUS < 16K:

 *  0- 7: locked byte
 *     8: pending
 *  9-15: not used
 * 16-17: tail index
 * 18-31: tail cpu (+1)

Bits 9-15 are wasted in order to use xchg16 in xchg_tail(). Let the architecture select xchg16/xchg32 to implement xchg_tail.

Signed-off-by: Guo Ren
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: Ingo Molnar
Cc: Waiman Long
Cc: Arnd Bergmann
Cc: Anup Patel
---
 kernel/Kconfig.locks       |  3 +++
 kernel/locking/qspinlock.c | 46 +++++++++++++++++++++-----------------
 2 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 3de8fd11873b..d02f1261f73f 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -239,6 +239,9 @@ config LOCK_SPIN_ON_OWNER
 config ARCH_USE_QUEUED_SPINLOCKS
 	bool
 
+config ARCH_USE_QUEUED_SPINLOCKS_XCHG32
+	bool
+
 config QUEUED_SPINLOCKS
 	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
 	depends on SMP
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba53d56..4bfaa969bd15 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -163,26 +163,6 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
-/*
- * xchg_tail - Put in the new queue tail code word & retrieve previous one
- * @lock : Pointer to queued spinlock structure
- * @tail : The new queue tail code word
- * Return: The previous queue tail code word
- *
- * xchg(lock, tail), which heads an address dependency
- *
- * p,*,* -> n,*,* ; prev = xchg(lock, node)
- */
-static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
-{
-	/*
-	 * We can use relaxed semantics since the caller ensures that the
-	 * MCS node is properly initialized before updating the tail.
-	 */
-	return (u32)xchg_relaxed(&lock->tail,
-				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
-}
-
 #else /* _Q_PENDING_BITS == 8 */
 
 /**
@@ -206,6 +186,30 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 {
 	atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
 }
+#endif /* _Q_PENDING_BITS == 8 */
+
+#if _Q_PENDING_BITS == 8 && !defined(CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32)
+/*
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail), which heads an address dependency
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+	/*
+	 * We can use relaxed semantics since the caller ensures that the
+	 * MCS node is properly initialized before updating the tail.
+	 */
+	return (u32)xchg_relaxed(&lock->tail,
+				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else
 
 /**
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
@@ -236,7 +240,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	}
 	return old;
 }
-#endif /* _Q_PENDING_BITS == 8 */
+#endif
 
 /**
  * queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
-- 
2.17.1
[PATCH v6 0/9] riscv: Add qspinlock/qrwlock
From: Guo Ren

Current riscv still uses a baby spinlock implementation. It causes fairness and cache line bouncing problems. Many people are involved and have put in effort to improve it:

- The first version of the patch was made in 2019.1:
  https://lore.kernel.org/linux-riscv/20190211043829.30096-1-michaeljcl...@mac.com/#r
- The second version was made in 2020.11:
  https://lore.kernel.org/linux-riscv/1606225437-22948-2-git-send-email-guo...@kernel.org/
- A good discussion at Platform HSC.2021-03-08:
  https://drive.google.com/drive/folders/1ooqdnIsYx7XKor5O1XTtM6D1CHp4hc0p
- A good discussion on V4 in the mailing list:
  https://lore.kernel.org/linux-riscv/1616868399-82848-1-git-send-email-guo...@kernel.org/T/#t
- Openrisc's maintainer wants to implement the arch_cmpxchg infrastructure:
  https://lore.kernel.org/linux-riscv/1616868399-82848-1-git-send-email-guo...@kernel.org/T/#m11b712fb6a4fda043811b1f4c3d61446951ed65a

Hoping for your comments and Tested-by or Co-developed-by or Reviewed-by ... Let's kick qspinlock into riscv right now (also for the architectures which don't have an xchg16 atomic instruction).
Change V6: - Add ticket-lock for riscv, default is qspinlock - Keep ticket-lock for csky, default is ticketlock - Using smp_cond_load for riscv ticket-lock - Optimize csky ticketlock with smp_cond_load, store_release - Add PPC_LBARX_LWARX for powerpc Change V5: - Fixup #endif comment typo by Waiman - Remove cmpxchg coding convention patches which will get into a separate patchset later by Arnd's advice - Try to involve more architectures in the discussion Change V4: - Remove custom sub-word xchg implementation - Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32 in locking/qspinlock Change V3: - Coding convention by Peter Zijlstra's advices Change V2: - Coding convention in cmpxchg.h - Re-implement short xchg - Remove char & cmpxchg implementations Guo Ren (8): locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32 riscv: locks: Introduce ticket-based spinlock implementation csky: locks: Optimize coding convention csky: Convert custom spinlock/rwlock to generic qspinlock/qrwlock openrisc: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32 sparc: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32 xtensa: qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32 powerpc/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32 Michael Clark (1): riscv: Convert custom spinlock/rwlock to generic qspinlock/qrwlock arch/csky/Kconfig | 8 ++ arch/csky/include/asm/Kbuild| 2 + arch/csky/include/asm/spinlock.h| 15 +-- arch/csky/include/asm/spinlock_types.h | 4 + arch/openrisc/Kconfig | 1 + arch/powerpc/Kconfig| 1 + arch/riscv/Kconfig | 8 ++ arch/riscv/include/asm/Kbuild | 3 + arch/riscv/include/asm/spinlock.h | 158 +--- arch/riscv/include/asm/spinlock_types.h | 26 ++-- arch/sparc/Kconfig | 1 + arch/xtensa/Kconfig | 1 + kernel/Kconfig.locks| 3 + kernel/locking/qspinlock.c | 46 +++ 14 files changed, 142 insertions(+), 135 deletions(-) -- 2.17.1
[PATCH] powerpc/ptrace: Don't return error when getting/setting FP regs without CONFIG_PPC_FPU_REGS
An #ifdef CONFIG_PPC_FPU_REGS is missing in arch_ptrace() leading to the following Oops because [REGSET_FPR] entry is not initialised in native_regsets[]. [ 41.917608] BUG: Unable to handle kernel instruction fetch [ 41.922849] Faulting instruction address: 0xff8fd228 [ 41.927760] Oops: Kernel access of bad area, sig: 11 [#1] [ 41.933089] BE PAGE_SIZE=4K PREEMPT CMPC885 [ 41.940753] Modules linked in: [ 41.943768] CPU: 0 PID: 366 Comm: gdb Not tainted 5.12.0-rc5-s3k-dev-01666-g7aac86a0f057-dirty #4835 [ 41.952800] NIP: ff8fd228 LR: c004d9e0 CTR: ff8fd228 [ 41.957790] REGS: caae9df0 TRAP: 0400 Not tainted (5.12.0-rc5-s3k-dev-01666-g7aac86a0f057-dirty) [ 41.966741] MSR: 40009032 CR: 82004248 XER: 2000 [ 41.973540] [ 41.973540] GPR00: c004d9b4 caae9eb0 c1b64f60 c1b64520 c0713cd4 caae9eb8 c1bacdfc 0004 [ 41.973540] GPR08: 0200 ff8fd228 c1bac700 1032 28004242 1061aaf4 0001 106d64a0 [ 41.973540] GPR16: 7fa0a774 1061 7fa0aef9 1061 7fa0a538 [ 41.973540] GPR24: 7fa0a580 7fa0a570 c1bacc00 c1b64520 c1bacc00 caae9ee8 0108 c0713cd4 [ 42.009685] NIP [ff8fd228] 0xff8fd228 [ 42.013300] LR [c004d9e0] __regset_get+0x100/0x124 [ 42.018036] Call Trace: [ 42.020443] [caae9eb0] [c004d9b4] __regset_get+0xd4/0x124 (unreliable) [ 42.026899] [caae9ee0] [c004da94] copy_regset_to_user+0x5c/0xb0 [ 42.032751] [caae9f10] [c002f640] sys_ptrace+0xe4/0x588 [ 42.037915] [caae9f30] [c0011010] ret_from_syscall+0x0/0x28 [ 42.043422] --- interrupt: c00 at 0xfd1f8e4 [ 42.047553] NIP: 0fd1f8e4 LR: 1004a688 CTR: [ 42.052544] REGS: caae9f40 TRAP: 0c00 Not tainted (5.12.0-rc5-s3k-dev-01666-g7aac86a0f057-dirty) [ 42.061494] MSR: d032 CR: 48004442 XER: [ 42.068551] [ 42.068551] GPR00: 001a 7fa0a040 77dad7e0 000e 0170 7fa0a078 0004 [ 42.068551] GPR08: 108deb88 108dda40 106d6010 44004442 1061aaf4 0001 106d64a0 [ 42.068551] GPR16: 7fa0a774 1061 7fa0aef9 1061 7fa0a538 [ 42.068551] GPR24: 7fa0a580 7fa0a570 1078fe00 1078fd70 1078fd70 0170 0fdd3244 000d [ 42.104696] NIP [0fd1f8e4] 0xfd1f8e4 [ 42.108225] LR [1004a688] 
0x1004a688 [ 42.111753] --- interrupt: c00 [ 42.114768] Instruction dump: [ 42.117698] [ 42.125443] [ 42.133195] ---[ end trace d35616f22ab2100c ]--- Adding the missing #ifdef is not good because gdb doesn't like getting an error when getting registers. Instead, make ptrace return 0s when CONFIG_PPC_FPU_REGS is not set. Fixes: b6254ced4da6 ("powerpc/signal: Don't manage floating point regs when no FPU") Cc: sta...@vger.kernel.org Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/ptrace/Makefile | 4 ++-- arch/powerpc/kernel/ptrace/ptrace-decl.h | 14 -- arch/powerpc/kernel/ptrace/ptrace-fpu.c | 10 ++ arch/powerpc/kernel/ptrace/ptrace-novsx.c | 8 arch/powerpc/kernel/ptrace/ptrace-view.c | 2 -- 5 files changed, 20 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/kernel/ptrace/Makefile b/arch/powerpc/kernel/ptrace/Makefile index 8ebc11d1168d..77abd1a5a508 100644 --- a/arch/powerpc/kernel/ptrace/Makefile +++ b/arch/powerpc/kernel/ptrace/Makefile @@ -6,11 +6,11 @@ CFLAGS_ptrace-view.o += -DUTS_MACHINE='"$(UTS_MACHINE)"' obj-y += ptrace.o ptrace-view.o -obj-$(CONFIG_PPC_FPU_REGS) += ptrace-fpu.o +obj-y += ptrace-fpu.o obj-$(CONFIG_COMPAT) += ptrace32.o obj-$(CONFIG_VSX) += ptrace-vsx.o ifneq ($(CONFIG_VSX),y) -obj-$(CONFIG_PPC_FPU_REGS) += ptrace-novsx.o +obj-y += ptrace-novsx.o endif obj-$(CONFIG_ALTIVEC) += ptrace-altivec.o obj-$(CONFIG_SPE) += ptrace-spe.o diff --git a/arch/powerpc/kernel/ptrace/ptrace-decl.h b/arch/powerpc/kernel/ptrace/ptrace-decl.h index 3487f2c9735c..eafe5f0f6289 100644 --- a/arch/powerpc/kernel/ptrace/ptrace-decl.h +++ b/arch/powerpc/kernel/ptrace/ptrace-decl.h @@ -165,22 +165,8 @@ int ptrace_put_reg(struct task_struct *task, int regno, unsigned long data); extern const struct user_regset_view user_ppc_native_view; /* ptrace-fpu */ -#ifdef CONFIG_PPC_FPU_REGS int ptrace_get_fpr(struct task_struct *child, int index, unsigned long *data); int ptrace_put_fpr(struct task_struct *child, int index, unsigned long data); -#else -static inline 
int -ptrace_get_fpr(struct task_struct *child, int index, unsigned long *data) -{ - return -EIO; -} - -static inline int -ptrace_put_fpr(struct task_struct *child, int index, unsigned long data) -{ - return -EIO; -} -#endif /* ptrace-(no)adv */ void ppc_gethwdinfo(struct ppc_debug_info *dbginfo); diff --git a/arch/powerpc/kernel/ptrace/ptrace-fpu.c
Re: [PATCH] powerpc/8xx: Load modules closer to kernel text
On 31/03/2021 at 15:39, Michael Ellerman wrote:
> Christophe Leroy writes:
>> On the 8xx, TASK_SIZE is 0x80000000. The space between TASK_SIZE and
>> PAGE_OFFSET is not used.
>>
>> Use it to load modules in order to minimise the distance between
>> kernel text and modules and avoid trampolines in modules to access
>> kernel functions or other module functions.
>>
>> Define a 16Mbytes area for modules, that's more than enough.
>
> 16MB seems kind of small. At least on 64-bit we could potentially have
> hundreds of MBs of modules.

Well, with a 16 MB kernel and 16 MB modules, my board is full :)

Even on the more recent board that has 128 MB, I don't expect more than a
few MBs of modules in addition to the kernel, which is approx 8M.

But ok, I'll do something more generic, though it will conflict with
Jordan's series.

Christophe
[PATCH v2] powerpc/signal32: Fix Oops on sigreturn with unmapped VDSO
PPC32 encounters a KUAP fault when trying to handle a signal with VDSO unmapped. Kernel attempted to read user page (7fc07ec0) - exploit attempt? (uid: 0) BUG: Unable to handle kernel data access on read at 0x7fc07ec0 Faulting instruction address: 0xc00111d4 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=16K PREEMPT CMPC885 CPU: 0 PID: 353 Comm: sigreturn_vdso Not tainted 5.12.0-rc4-s3k-dev-01553-gb30c310ea220 #4814 NIP: c00111d4 LR: c0005a28 CTR: REGS: cadb3dd0 TRAP: 0300 Not tainted (5.12.0-rc4-s3k-dev-01553-gb30c310ea220) MSR: 9032 CR: 48000884 XER: 2000 DAR: 7fc07ec0 DSISR: 8800 GPR00: c0007788 cadb3e90 c28d4a40 7fc07ec0 7fc07ed0 04e0 7fc07ce0 GPR08: 0001 0001 7fc07ec0 28000282 1001b828 100a0920 GPR16: 100cac0c 100b 105c43a4 105c5685 100d 100d 100d 100b2e9e GPR24: 105c43c8 7fc07ec8 cadb3f40 cadb3ec8 c28d4a40 NIP [c00111d4] flush_icache_range+0x90/0xb4 LR [c0005a28] handle_signal32+0x1bc/0x1c4 Call Trace: [cadb3e90] [100d] 0x100d (unreliable) [cadb3ec0] [c0007788] do_notify_resume+0x260/0x314 [cadb3f20] [c000c764] syscall_exit_prepare+0x120/0x184 [cadb3f30] [c00100b4] ret_from_syscall+0xc/0x28 --- interrupt: c00 at 0xfe807f8 NIP: 0fe807f8 LR: 10001060 CTR: c0139378 REGS: cadb3f40 TRAP: 0c00 Not tainted (5.12.0-rc4-s3k-dev-01553-gb30c310ea220) MSR: d032 CR: 28000482 XER: 2000 GPR00: 0025 7fc081c0 77bb1690 000a 28000482 0001 0ff03a38 GPR08: d032 6de5 c28d4a40 0009 88000482 1001b828 100a0920 GPR16: 100cac0c 100b 105c43a4 105c5685 100d 100d 100d 100b2e9e GPR24: 105c43c8 77ba7628 10002398 1001 10002124 00024000 NIP [0fe807f8] 0xfe807f8 LR [10001060] 0x10001060 --- interrupt: c00 Instruction dump: 38630010 7c001fac 38630010 4200fff0 7c0004ac 4c00012c 4e800020 7c001fac 2c0a 38630010 4082ffcc 4be4 <7c00186c> 2c07 39430010 4082ff8c ---[ end trace 3973fb72b049cb06 ]--- This is because flush_icache_range() is called on user addresses. The same problem was detected some time ago on PPC64. 
It was fixed by enabling KUAP in commit 59bee45b9712 ("powerpc/mm: Fix missing KUAP disable in flush_coherent_icache()"). PPC32 doesn't use flush_coherent_icache() and fallbacks on clean_dcache_range() and invalidate_icache_range(). We could fix it similarly by enabling user access in those functions, but this is overkill for just flushing two instructions. The two instructions are 8 bytes aligned, so a single dcbst/icbi is enough to flush them. Do like __patch_instruction() and inline a dcbst followed by an icbi just after the write of the instructions, while user access is still allowed. The isync is not required because rfi will be used to return to user. icbi() is handled as a read so read-write user access is needed. Signed-off-by: Christophe Leroy --- v2: Do read-write user access. --- arch/powerpc/kernel/signal_32.c | 20 1 file changed, 8 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index c505b444a613..09884af693aa 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -775,7 +775,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset, else prepare_save_user_regs(1); - if (!user_write_access_begin(frame, sizeof(*frame))) + if (!user_access_begin(frame, sizeof(*frame))) goto badframe; /* Put the siginfo & fill in most of the ucontext */ @@ -809,17 +809,15 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset, unsafe_put_user(PPC_INST_ADDI + __NR_rt_sigreturn, >mc_pad[0], failed); unsafe_put_user(PPC_INST_SC, >mc_pad[1], failed); + asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0])); } unsafe_put_sigset_t(>uc.uc_sigmask, oldset, failed); - user_write_access_end(); + user_access_end(); if (copy_siginfo_to_user(>info, >info)) goto badframe; - if (tramp == (unsigned long)mctx->mc_pad) - flush_icache_range(tramp, tramp + 2 * sizeof(unsigned long)); - regs->link = tramp; #ifdef CONFIG_PPC_FPU_REGS @@ -844,7 +842,7 @@ int 
handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset, return 0; failed: - user_write_access_end(); + user_access_end(); badframe: signal_fault(tsk, regs, "handle_rt_signal32", frame); @@ -879,7 +877,7 @@ int handle_signal32(struct ksignal *ksig,
[PATCH] selftests: timens: Fix gettime_perf to work on powerpc
On powerpc:
- VDSO library is named linux-vdso32.so.1 or linux-vdso64.so.1
- clock_gettime is named __kernel_clock_gettime()

Ensure gettime_perf tries these names before giving up.

Signed-off-by: Christophe Leroy
---
 tools/testing/selftests/timens/gettime_perf.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/timens/gettime_perf.c b/tools/testing/selftests/timens/gettime_perf.c
index 7bf841a3967b..6b13dc277724 100644
--- a/tools/testing/selftests/timens/gettime_perf.c
+++ b/tools/testing/selftests/timens/gettime_perf.c
@@ -25,12 +25,20 @@ static void fill_function_pointers(void)
 	if (!vdso)
 		vdso = dlopen("linux-gate.so.1",
 			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
+	if (!vdso)
+		vdso = dlopen("linux-vdso32.so.1",
+			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
+	if (!vdso)
+		vdso = dlopen("linux-vdso64.so.1",
+			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
 	if (!vdso) {
 		pr_err("[WARN]\tfailed to find vDSO\n");
 		return;
 	}
 
 	vdso_clock_gettime = (vgettime_t)dlsym(vdso, "__vdso_clock_gettime");
+	if (!vdso_clock_gettime)
+		vdso_clock_gettime = (vgettime_t)dlsym(vdso, "__kernel_clock_gettime");
 	if (!vdso_clock_gettime)
 		pr_err("Warning: failed to find clock_gettime in vDSO\n");
-- 
2.25.0
Re: [PATCH] powerpc/8xx: Load modules closer to kernel text
Christophe Leroy writes:
> On the 8xx, TASK_SIZE is 0x80000000. The space between TASK_SIZE and
> PAGE_OFFSET is not used.
>
> Use it to load modules in order to minimise the distance between
> kernel text and modules and avoid trampolines in modules to access
> kernel functions or other module functions.
>
> Define a 16Mbytes area for modules, that's more than enough.

16MB seems kind of small. At least on 64-bit we could potentially have
hundreds of MBs of modules.

cheers
Re: [PATCH] powerpc/signal32: Fix Oops on sigreturn with unmapped VDSO
Christophe Leroy writes: > PPC32 encounters a KUAP fault when trying to handle a signal with > VDSO unmapped. > > Kernel attempted to read user page (7fc07ec0) - exploit attempt? (uid: > 0) > BUG: Unable to handle kernel data access on read at 0x7fc07ec0 > Faulting instruction address: 0xc00111d4 > Oops: Kernel access of bad area, sig: 11 [#1] > BE PAGE_SIZE=16K PREEMPT CMPC885 > CPU: 0 PID: 353 Comm: sigreturn_vdso Not tainted > 5.12.0-rc4-s3k-dev-01553-gb30c310ea220 #4814 > NIP: c00111d4 LR: c0005a28 CTR: > REGS: cadb3dd0 TRAP: 0300 Not tainted > (5.12.0-rc4-s3k-dev-01553-gb30c310ea220) > MSR: 9032 CR: 48000884 XER: 2000 > DAR: 7fc07ec0 DSISR: 8800 > GPR00: c0007788 cadb3e90 c28d4a40 7fc07ec0 7fc07ed0 04e0 7fc07ce0 > > GPR08: 0001 0001 7fc07ec0 28000282 1001b828 100a0920 > > GPR16: 100cac0c 100b 105c43a4 105c5685 100d 100d 100d > 100b2e9e > GPR24: 105c43c8 7fc07ec8 cadb3f40 cadb3ec8 c28d4a40 > > NIP [c00111d4] flush_icache_range+0x90/0xb4 > LR [c0005a28] handle_signal32+0x1bc/0x1c4 > Call Trace: > [cadb3e90] [100d] 0x100d (unreliable) > [cadb3ec0] [c0007788] do_notify_resume+0x260/0x314 > [cadb3f20] [c000c764] syscall_exit_prepare+0x120/0x184 > [cadb3f30] [c00100b4] ret_from_syscall+0xc/0x28 > --- interrupt: c00 at 0xfe807f8 > NIP: 0fe807f8 LR: 10001060 CTR: c0139378 > REGS: cadb3f40 TRAP: 0c00 Not tainted > (5.12.0-rc4-s3k-dev-01553-gb30c310ea220) > MSR: d032 CR: 28000482 XER: 2000 > > GPR00: 0025 7fc081c0 77bb1690 000a 28000482 0001 > 0ff03a38 > GPR08: d032 6de5 c28d4a40 0009 88000482 1001b828 100a0920 > > GPR16: 100cac0c 100b 105c43a4 105c5685 100d 100d 100d > 100b2e9e > GPR24: 105c43c8 77ba7628 10002398 1001 10002124 > 00024000 > NIP [0fe807f8] 0xfe807f8 > LR [10001060] 0x10001060 > --- interrupt: c00 > Instruction dump: > 38630010 7c001fac 38630010 4200fff0 7c0004ac 4c00012c 4e800020 7c001fac > 2c0a 38630010 4082ffcc 4be4 <7c00186c> 2c07 39430010 > 4082ff8c > ---[ end trace 3973fb72b049cb06 ]--- > > This is because flush_icache_range() is called on user 
addresses. > > The same problem was detected some time ago on PPC64. It was fixed by > enabling KUAP in commit 59bee45b9712 ("powerpc/mm: Fix missing KUAP > disable in flush_coherent_icache()"). > > PPC32 doesn't use flush_coherent_icache() and fallbacks on > clean_dcache_range() and invalidate_icache_range(). But this code is also used for compat tasks on 64-bit. > We could fix it similarly by enabling user access in those functions, > but this is overkill for just flushing two instructions. > > The two instructions are 8 bytes aligned, so a single dcbst/icbi is > enough to flush them. Do like __patch_instruction() and inline > a dcbst followed by an icbi just after the write of the instructions, > while user access is still allowed. The isync is not required because > rfi will be used to return to user. > > Signed-off-by: Christophe Leroy > --- > arch/powerpc/kernel/signal_32.c | 8 ++-- > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c > index 75ee918a120a..5b2ba2731957 100644 > --- a/arch/powerpc/kernel/signal_32.c > +++ b/arch/powerpc/kernel/signal_32.c > @@ -809,6 +809,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t > *oldset, > unsafe_put_user(PPC_INST_ADDI + __NR_rt_sigreturn, > >mc_pad[0], > failed); > unsafe_put_user(PPC_INST_SC, >mc_pad[1], failed); > + asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0])); If I'm reading that right you're pointing the icbi at the user address. That's going to cause a KUAP fault just like we fixed in commit 59bee45b9712 ("powerpc/mm: Fix missing KUAP disable in flush_coherent_icache()"). We have user write access enabled, but the icbi is treated as a load. So I don't think that's going to work. cheers
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-31 12:49, Will Deacon wrote: On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote: On 2021-03-30 14:58, Will Deacon wrote: On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict paramete in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This make sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overriden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. Like I say, the cleanup is great. 
I'm just wondering whether there's a better way to express the complicated logic to decide whether or not to use the flush queue than what we end up with: if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) which is mixing up globals, device properties and domain properties. The result is that the driver code ends up just using the global to determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, which is a departure from the current way of doing things. But previously, SMMU only ever saw the global policy piped through the domain attribute by iommu_group_alloc_default_domain(), so there's no functional change there. For DMA domains sure, but I don't think that's the case for unmanaged domains such as those used by VFIO. Eh? This is only relevant to DMA domains anyway. Flush queues are part of the IOVA allocator that VFIO doesn't even use. It's always been the case that unmanaged domains only use strict invalidation. Obviously some of the above checks could be factored out into some kind of iommu_use_flush_queue() helper that IOMMU drivers can also call if they need to keep in sync. Or maybe we just allow iommu-dma to set IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're treating that as a generic thing now. I think a helper that takes a domain would be a good starting point. You mean device, right? The one condition we currently have is at the device level, and there's really nothing inherent to the domain itself that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care about this). Another idea that's just come to mind is now that IOMMU_DOMAIN_DMA has a standard meaning, maybe we could split out a separate IOMMU_DOMAIN_DMA_STRICT type such that it can all propagate from iommu_get_def_domain_type()? 
That feels like it might be quite promising, but I'd still do it as an improvement on top of this patch, since it's beyond just cleaning up the abuse of domain attributes to pass a command-line option around. Robin.
Re: [PATCH -next] powerpc/eeh: Remove unused inline function eeh_dev_phb_init_dynamic()
On 2021/3/26 13:08, Daniel Axtens wrote:
> Hi,
>
>> commit 475028efc708 ("powerpc/eeh: Remove eeh_dev_phb_init_dynamic()")
>> left behind this, so can remove it.
>
> I had a look: the inline that you are removing here is for the
> !CONFIG_EEH case, which explains why it was missed the first time.
>
> This looks like a good change. Out of interest, what tool are you using
> to find these unused inlines? If there are many more, it might make
> sense to combine future patches removing them into a single patch, but
> I'm not sure.

Just used some grep skills; will do that if there are any more.

> checkpatch likes this patch, so that's also good :)
>
> Reviewed-by: Daniel Axtens
>
> Kind regards,
> Daniel
[PATCH -next] powerpc/eeh: Add correct inline functions
pseries_eeh_add_device_early()/pseries_eeh_add_device_tree_early() have
never been used since they were added; however, pseries_eeh_init_edev()
and pseries_eeh_init_edev_recursive() are missing their inline versions
for the !CONFIG_EEH case. Rename the stubs accordingly.

Fixes: b6eebb093cad ("powerpc/eeh: Make early EEH init pseries specific")
Signed-off-by: YueHaibing
---
 arch/powerpc/include/asm/eeh.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index b1a5bba2e0b9..0b6c2a6711d3 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -357,8 +357,8 @@ static inline int eeh_phb_pe_create(struct pci_controller *phb) { return 0; }
 void pseries_eeh_init_edev(struct pci_dn *pdn);
 void pseries_eeh_init_edev_recursive(struct pci_dn *pdn);
 #else
-static inline void pseries_eeh_add_device_early(struct pci_dn *pdn) { }
-static inline void pseries_eeh_add_device_tree_early(struct pci_dn *pdn) { }
+static inline void pseries_eeh_init_edev(struct pci_dn *pdn) { }
+static inline void pseries_eeh_init_edev_recursive(struct pci_dn *pdn) { }
 #endif
 
 #ifdef CONFIG_PPC64
-- 
2.17.1
Re: [PATCH v10 01/10] powerpc/mm: Implement set_memory() routines
Le 31/03/2021 à 13:16, Michael Ellerman a écrit : Hi Jordan, A few nits below ... Jordan Niethe writes: From: Russell Currey The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX, and are generally useful primitives to have. This implementation is designed to be completely generic across powerpc's many MMUs. It's possible that this could be optimised to be faster for specific MMUs, but the focus is on having a generic and safe implementation for now. This implementation does not handle cases where the caller is attempting to change the mapping of the page it is executing from, or if another CPU is concurrently using the page being altered. These cases likely shouldn't happen, but a more complex implementation with MMU-specific code could safely handle them, so that is left as a TODO for now. On hash the linear mapping is not kept in the linux pagetable, so this will not change the protection if used on that range. Currently these functions are not used on the linear map so just WARN for now. These functions do nothing if STRICT_KERNEL_RWX is not enabled. Reviewed-by: Daniel Axtens Signed-off-by: Russell Currey Signed-off-by: Christophe Leroy [jpn: -rebase on next plus "powerpc/mm/64s: Allow STRICT_KERNEL_RWX again" - WARN on hash linear map] Signed-off-by: Jordan Niethe --- v10: WARN if trying to change the hash linear map --- This ↓ should have a comment explaining what it's doing: +#ifdef CONFIG_PPC_BOOK3S_64 + if (WARN_ON_ONCE(!radix_enabled() && +get_region_id(addr) == LINEAR_MAP_REGION_ID)) { + return -1; + } +#endif Maybe: if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && WARN_ON_ONCE(!radix_enabled() && get_region_id(addr) == LINEAR_MAP_REGION_ID)) { return -1; } get_region_id() only exists for book3s/64 at the time being, and LINEAR_MAP_REGION_ID as well. But then Aneesh pointed out that we should also block VMEMMAP_REGION_ID. It might be better to just check for the permitted regions. 
	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled()) {
		int region = get_region_id(addr);

		if (WARN_ON_ONCE(region != VMALLOC_REGION_ID && region != IO_REGION_ID))
			return -1;
	}

+
+	return apply_to_existing_page_range(&init_mm, start, sz,
+					    change_page_attr, (void *)action);
+}

cheers
Re: [PATCH v2 3/7] powerpc: convert config files to generic cmdline
On Tue, Mar 30, 2021 at 10:35:21AM -0700, Daniel Walker wrote: > On Mon, Mar 29, 2021 at 11:07:51AM +0100, Will Deacon wrote: > > On Thu, Mar 25, 2021 at 12:59:56PM -0700, Daniel Walker wrote: > > > On Thu, Mar 25, 2021 at 01:03:55PM +0100, Christophe Leroy wrote: > > > > > > > > Ok, so you agree we don't need to provide two CMDLINE, one to be > > > > appended and one to be prepended. > > > > > > > > Let's only provide once CMDLINE as of today, and ask the user to select > > > > whether he wants it appended or prepended or replacee. Then no need to > > > > change all existing config to rename CONFIG_CMDLINE into either of the > > > > new > > > > ones. > > > > > > > > That's the main difference between my series and Daniel's series. So > > > > I'll > > > > finish taking Will's comment into account and we'll send out a v3 soon. > > > > > > It doesn't solve the needs of Cisco, I've stated many times your changes > > > have > > > little value. Please stop submitting them. > > > > FWIW, they're useful for arm64 and I will gladly review the updated series. > > > > I don't think asking people to stop submitting patches is ever the right > > answer. Please don't do that. > > Why ? It's me nacking his series, is that not allowed anymore ? If you're that way inclined then you can "nack" whatever you want, but please allow the rest of us to continue reviewing the patches. You don't have any basis on which to veto other people's contributions and so demanding that somebody stops posting code is neither constructive nor meaningful. Will
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote: > On 2021-03-30 14:58, Will Deacon wrote: > > On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: > > > On 2021-03-30 14:11, Will Deacon wrote: > > > > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: > > > > > From: Robin Murphy > > > > > > > > > > Instead make the global iommu_dma_strict paramete in iommu.c > > > > > canonical by > > > > > exporting helpers to get and set it and use those directly in the > > > > > drivers. > > > > > > > > > > This make sure that the iommu.strict parameter also works for the AMD > > > > > and > > > > > Intel IOMMU drivers on x86. As those default to lazy flushing a new > > > > > IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to > > > > > represent the default if not overriden by an explicit parameter. > > > > > > > > > > Signed-off-by: Robin Murphy . > > > > > [ported on top of the other iommu_attr changes and added a few small > > > > >missing bits] > > > > > Signed-off-by: Christoph Hellwig > > > > > --- > > > > >drivers/iommu/amd/iommu.c | 23 +--- > > > > >drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- > > > > >drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - > > > > >drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + > > > > >drivers/iommu/dma-iommu.c | 9 +-- > > > > >drivers/iommu/intel/iommu.c | 64 > > > > > - > > > > >drivers/iommu/iommu.c | 27 ++--- > > > > >include/linux/iommu.h | 4 +- > > > > >8 files changed, 40 insertions(+), 165 deletions(-) > > > > > > > > I really like this cleanup, but I can't help wonder if it's going in the > > > > wrong direction. With SoCs often having multiple IOMMU instances and a > > > > distinction between "trusted" and "untrusted" devices, then having the > > > > flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound > > > > unreasonable to me, but this change makes it a global property. 
> > > > > > The intent here was just to streamline the existing behaviour of stuffing > > > a > > > global property into a domain attribute then pulling it out again in the > > > illusion that it was in any way per-domain. We're still checking > > > dev_is_untrusted() before making an actual decision, and it's not like we > > > can't add more factors at that point if we want to. > > > > Like I say, the cleanup is great. I'm just wondering whether there's a > > better way to express the complicated logic to decide whether or not to use > > the flush queue than what we end up with: > > > > if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && > > domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) > > > > which is mixing up globals, device properties and domain properties. The > > result is that the driver code ends up just using the global to determine > > whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, > > which is a departure from the current way of doing things. > > But previously, SMMU only ever saw the global policy piped through the > domain attribute by iommu_group_alloc_default_domain(), so there's no > functional change there. For DMA domains sure, but I don't think that's the case for unmanaged domains such as those used by VFIO. > Obviously some of the above checks could be factored out into some kind of > iommu_use_flush_queue() helper that IOMMU drivers can also call if they need > to keep in sync. Or maybe we just allow iommu-dma to set > IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're > treating that as a generic thing now. I think a helper that takes a domain would be a good starting point. Will
Re: [PATCH v10 06/10] powerpc/mm/ptdump: debugfs handler for W+X checks at runtime
Jordan Niethe writes: > From: Russell Currey > > Optionally run W+X checks when dumping pagetable information to > debugfs' kernel_page_tables. > > To use: > $ echo 1 > /sys/kernel/debug/check_wx_pages > $ cat /sys/kernel/debug/kernel_page_tables > > and check the kernel log. Useful for testing strict module RWX. > > To disable W+X checks: > $ echo 0 > /sys/kernel/debug/check_wx_pages > > Update the Kconfig entry to reflect this. > > Also fix a typo. > > Reviewed-by: Kees Cook > Signed-off-by: Russell Currey > [jpn: Change check_wx_pages to act as mode bit affecting > kernel_page_tables instead of triggering action on its own] > Signed-off-by: Jordan Niethe > --- > v10: check_wx_pages now affects kernel_page_tables rather then triggers > its own action. Hmm. I liked the old version better :) I think you changed it based on Christophe's comment: Why not just perform the test everytime someone dumps kernel_page_tables ? But I think he meant *always* do the check when someone dumps kernel_page_tables, not have another file to enable checking and then require someone to dump kernel_page_tables to do the actual check. Still I like the previous version where you can do the checks separately, without having to dump the page tables, because dumping can sometimes take quite a while. What would be even better is if ptdump_check_wx() returned an error when wx pages were found, and that was plumbed out to the debugs file. That way you can script around it. cheers
Re: [PATCH v10 01/10] powerpc/mm: Implement set_memory() routines
Hi Jordan, A few nits below ... Jordan Niethe writes: > From: Russell Currey > > The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX, > and are generally useful primitives to have. This implementation is > designed to be completely generic across powerpc's many MMUs. > > It's possible that this could be optimised to be faster for specific > MMUs, but the focus is on having a generic and safe implementation for > now. > > This implementation does not handle cases where the caller is attempting > to change the mapping of the page it is executing from, or if another > CPU is concurrently using the page being altered. These cases likely > shouldn't happen, but a more complex implementation with MMU-specific code > could safely handle them, so that is left as a TODO for now. > > On hash the linear mapping is not kept in the linux pagetable, so this > will not change the protection if used on that range. Currently these > functions are not used on the linear map so just WARN for now. > > These functions do nothing if STRICT_KERNEL_RWX is not enabled. 
> > Reviewed-by: Daniel Axtens > Signed-off-by: Russell Currey > Signed-off-by: Christophe Leroy > [jpn: -rebase on next plus "powerpc/mm/64s: Allow STRICT_KERNEL_RWX again" > - WARN on hash linear map] > Signed-off-by: Jordan Niethe > --- > v10: WARN if trying to change the hash linear map > --- > arch/powerpc/Kconfig | 1 + > arch/powerpc/include/asm/set_memory.h | 32 ++ > arch/powerpc/mm/Makefile | 2 +- > arch/powerpc/mm/pageattr.c| 88 +++ > 4 files changed, 122 insertions(+), 1 deletion(-) > create mode 100644 arch/powerpc/include/asm/set_memory.h > create mode 100644 arch/powerpc/mm/pageattr.c > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index fc7f5c5933e6..4498a27ac9db 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -135,6 +135,7 @@ config PPC > select ARCH_HAS_MEMBARRIER_CALLBACKS > select ARCH_HAS_MEMBARRIER_SYNC_CORE > select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE > && PPC_BOOK3S_64 > + select ARCH_HAS_SET_MEMORY Below you do: if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)) return 0; Which suggests we should instead just only select ARCH_HAS_SET_MEMORY if STRICT_KERNEL_RWX ? > diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile > index 3b4e9e4e25ea..d8a08abde1ae 100644 > --- a/arch/powerpc/mm/Makefile > +++ b/arch/powerpc/mm/Makefile > @@ -5,7 +5,7 @@ > > ccflags-$(CONFIG_PPC64) := $(NO_MINIMAL_TOC) > > -obj-y:= fault.o mem.o pgtable.o mmap.o > maccess.o \ > +obj-y:= fault.o mem.o pgtable.o mmap.o > maccess.o pageattr.o \ .. and then the file should only be built if ARCH_HAS_SET_MEMORY = y. 
> init_$(BITS).o pgtable_$(BITS).o \ > pgtable-frag.o ioremap.o ioremap_$(BITS).o \ > init-common.o mmu_context.o drmem.o > diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c > new file mode 100644 > index ..9efcb01088da > --- /dev/null > +++ b/arch/powerpc/mm/pageattr.c > @@ -0,0 +1,88 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * MMU-generic set_memory implementation for powerpc > + * > + * Copyright 2019, IBM Corporation. Should be 2019-2021. > + */ > + > +#include > +#include > + > +#include > +#include > +#include > + > + > +/* > + * Updates the attributes of a page in three steps: > + * > + * 1. invalidate the page table entry > + * 2. flush the TLB > + * 3. install the new entry with the updated attributes > + * > + * This is unsafe if the caller is attempting to change the mapping of the > + * page it is executing from, or if another CPU is concurrently using the > + * page being altered. Is the 2nd part of that statement true? Or, I guess maybe it is true depending on what "unsafe" means. AIUI it's unsafe to use this on the page you're executing from, and by unsafe we mean the kernel will potentially crash because it will lose the mapping for the currently executing text. Using this on a page that another CPU is accessing could be safe, if eg. the other CPU is reading from the page and we are just changing it from RW->RO. So I'm not sure they're the same type of "unsafe". > + * TODO make the implementation resistant to this. > + * > + * NOTE: can be dangerous to call without STRICT_KERNEL_RWX I don't think we need that anymore? > + */ > +static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) > +{ > + long action = (long)data; > + pte_t pte; > + > + spin_lock(_mm.page_table_lock); > + > + /* invalidate the PTE so it's safe to modify */ > + pte = ptep_get_and_clear(_mm, addr, ptep); > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > + > + /*
Re: [PATCH v10 05/10] powerpc/bpf: Write protect JIT code
Le 31/03/2021 à 12:37, Michael Ellerman a écrit :
> Jordan Niethe writes:
>> Once CONFIG_STRICT_MODULE_RWX is enabled there will be no need to
>> override bpf_jit_free() because it is now possible to set images
>> read-only. So use the default implementation.
>>
>> Also add the necessary call to bpf_jit_binary_lock_ro() which will
>> remove write protection and add exec protection to the JIT image after
>> it has finished being written.
>>
>> Signed-off-by: Jordan Niethe
>> ---
>> v10: New to series
>> ---
>>  arch/powerpc/net/bpf_jit_comp.c   | 5 -
>>  arch/powerpc/net/bpf_jit_comp64.c | 4
>>  2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
>> index e809cb5a1631..8015e4a7d2d4 100644
>> --- a/arch/powerpc/net/bpf_jit_comp.c
>> +++ b/arch/powerpc/net/bpf_jit_comp.c
>> @@ -659,12 +659,15 @@ void bpf_jit_compile(struct bpf_prog *fp)
>>  		bpf_jit_dump(flen, proglen, pass, code_base);
>>
>>  	bpf_flush_icache(code_base, code_base + (proglen/4));
>> -
>>  #ifdef CONFIG_PPC64
>>  	/* Function descriptor nastiness: Address + TOC */
>>  	((u64 *)image)[0] = (u64)code_base;
>>  	((u64 *)image)[1] = local_paca->kernel_toc;
>>  #endif
>> +	if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX)) {
>> +		set_memory_ro((unsigned long)image, alloclen >> PAGE_SHIFT);
>> +		set_memory_x((unsigned long)image, alloclen >> PAGE_SHIFT);
>> +	}
>
> You don't need to check the ifdef in a caller, there are stubs that
> compile to nothing when CONFIG_ARCH_HAS_SET_MEMORY=n.

I was about to do the same comment, but CONFIG_STRICT_MODULE_RWX is not
CONFIG_ARCH_HAS_SET_MEMORY

>> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
>> index aaf1a887f653..1484ad588685 100644
>> --- a/arch/powerpc/net/bpf_jit_comp64.c
>> +++ b/arch/powerpc/net/bpf_jit_comp64.c
>> @@ -1240,6 +1240,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>>  	fp->jited_len = alloclen;
>>
>>  	bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
>> +	if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
>> +		bpf_jit_binary_lock_ro(bpf_hdr);
>
> Do we need the ifdef here either?
> Looks like it should be safe to call due to the stubs.

Same

>> @@ -1262,6 +1264,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>>  	}
>>
>>  	/* Overriding bpf_jit_free() as we don't set images read-only. */
>> +#ifndef CONFIG_STRICT_MODULE_RWX
>
> Did you test without this and notice something broken? Looking at the
> generic version I can't tell why we need to override this.
>
> Maybe we don't (anymore?) ?
>
> cheers
>
>> void bpf_jit_free(struct bpf_prog *fp)
>> {
>> 	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
>> @@ -1272,3 +1275,4 @@ void bpf_jit_free(struct bpf_prog *fp)
>>
>> 	bpf_prog_unlock_free(fp);
>> }
>> +#endif
>> --
>> 2.25.1
Re: [PATCH v10 05/10] powerpc/bpf: Write protect JIT code
Jordan Niethe writes: > Once CONFIG_STRICT_MODULE_RWX is enabled there will be no need to > override bpf_jit_free() because it is now possible to set images > read-only. So use the default implementation. > > Also add the necessary call to bpf_jit_binary_lock_ro() which will > remove write protection and add exec protection to the JIT image after > it has finished being written. > > Signed-off-by: Jordan Niethe > --- > v10: New to series > --- > arch/powerpc/net/bpf_jit_comp.c | 5 - > arch/powerpc/net/bpf_jit_comp64.c | 4 > 2 files changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c > index e809cb5a1631..8015e4a7d2d4 100644 > --- a/arch/powerpc/net/bpf_jit_comp.c > +++ b/arch/powerpc/net/bpf_jit_comp.c > @@ -659,12 +659,15 @@ void bpf_jit_compile(struct bpf_prog *fp) > bpf_jit_dump(flen, proglen, pass, code_base); > > bpf_flush_icache(code_base, code_base + (proglen/4)); > - > #ifdef CONFIG_PPC64 > /* Function descriptor nastiness: Address + TOC */ > ((u64 *)image)[0] = (u64)code_base; > ((u64 *)image)[1] = local_paca->kernel_toc; > #endif > + if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX)) { > + set_memory_ro((unsigned long)image, alloclen >> PAGE_SHIFT); > + set_memory_x((unsigned long)image, alloclen >> PAGE_SHIFT); > + } You don't need to check the ifdef in a caller, there are stubs that compile to nothing when CONFIG_ARCH_HAS_SET_MEMORY=n. > diff --git a/arch/powerpc/net/bpf_jit_comp64.c > b/arch/powerpc/net/bpf_jit_comp64.c > index aaf1a887f653..1484ad588685 100644 > --- a/arch/powerpc/net/bpf_jit_comp64.c > +++ b/arch/powerpc/net/bpf_jit_comp64.c > @@ -1240,6 +1240,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog > *fp) > fp->jited_len = alloclen; > > bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE)); > + if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX)) > + bpf_jit_binary_lock_ro(bpf_hdr); Do we need the ifdef here either? 
Looks like it should be safe to call due to the stubs. > @@ -1262,6 +1264,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog > *fp) > } > > /* Overriding bpf_jit_free() as we don't set images read-only. */ > +#ifndef CONFIG_STRICT_MODULE_RWX Did you test without this and notice something broken? Looking at the generic version I can't tell why we need to override this. Maybe we don't (anymore?) ? cheers > void bpf_jit_free(struct bpf_prog *fp) > { > unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK; > @@ -1272,3 +1275,4 @@ void bpf_jit_free(struct bpf_prog *fp) > > bpf_prog_unlock_free(fp); > } > +#endif > -- > 2.25.1
Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall
"Aneesh Kumar K.V" writes:
> Shivaprasad G Bhat writes:
>
>> Add support for ND_REGION_ASYNC capability if the device tree
>> indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
>> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
>>
>> If the flush request failed, the hypervisor is expected to
>> reflect the problem in the subsequent nvdimm H_SCM_HEALTH call.
>>
>> This patch prevents mmap of namespaces with MAP_SYNC flag if the
>> nvdimm requires an explicit flush[1].
>>
>> References:
>> [1] https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c
>
> Reviewed-by: Aneesh Kumar K.V

Do we need an ack from nvdimm folks on this?

Or is it entirely powerpc internal (seems like it from the diffstat)?

cheers
Re: [PATCH] powerpc/vdso: Separate vvar vma from vdso
Christophe Leroy writes:
> Le 26/03/2021 à 20:17, Dmitry Safonov a écrit :
>> Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front")
>> VVAR page is in front of the VDSO area. In result it breaks CRIU
>> (Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]"
>> from /proc/../maps points at ELF/vdso image, rather than at VVAR data page.
>> Laurent made a patch to keep CRIU working (by reading aux vector).
>> But I think it still makes sense to separate two mappings into different
>> VMAs. It will also make ppc64 less "special" for userspace and as
>> a side-bonus will make VVAR page un-writable by debugger (which previously
>> would COW page and can be unexpected).
>>
>> I opportunistically Cc stable on it: I understand that usually such
>> stuff isn't a stable material, but that will allow us in CRIU have
>> one workaround less that is needed just for one release (v5.11) on
>> one platform (ppc64), which we otherwise have to maintain.
>> I wouldn't go as far as to say that the commit 511157ab641e is ABI
>> regression as no other userspace got broken, but I'd really appreciate
>> if it gets backported to v5.11 after v5.12 is released, so as not
>> to complicate already non-simple CRIU-vdso code. Thanks!
>>
>> Cc: Andrei Vagin
>> Cc: Andy Lutomirski
>> Cc: Benjamin Herrenschmidt
>> Cc: Christophe Leroy
>> Cc: Laurent Dufour
>> Cc: Michael Ellerman
>> Cc: Paul Mackerras
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: sta...@vger.kernel.org # v5.11
>> [1]: https://github.com/checkpoint-restore/criu/issues/1417
>> Signed-off-by: Dmitry Safonov
>> Tested-by: Christophe Leroy
>> ---
>>  arch/powerpc/include/asm/mmu_context.h |  2 +-
>>  arch/powerpc/kernel/vdso.c             | 54 +++---
>>  2 files changed, 40 insertions(+), 16 deletions(-)
>>
>
>> @@ -133,7 +135,13 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>>  	 * install_special_mapping or the perf counter mmap tracking code
>>  	 * will fail to recognise it as a vDSO.
>>  	 */
>> -	mm->context.vdso = (void __user *)vdso_base + PAGE_SIZE;
>> +	mm->context.vdso = (void __user *)vdso_base + vvar_size;
>> +
>> +	vma = _install_special_mapping(mm, vdso_base, vvar_size,
>> +				       VM_READ | VM_MAYREAD | VM_IO |
>> +				       VM_DONTDUMP | VM_PFNMAP, &vvar_spec);
>> +	if (IS_ERR(vma))
>> +		return PTR_ERR(vma);
>>
>>  	/*
>>  	 * our vma flags don't have VM_WRITE so by default, the process isn't

> IIUC, VM_PFNMAP is for when we have a vvar_fault handler.

Some of the other flags seem odd too.

eg. VM_IO ?

VM_DONTDUMP ?

cheers
Re: [PATCH v2] powerpc/traps: Enhance readability for trap types
Xiongwei Song writes:
> From: Xiongwei Song
>
> Create a new header named traps.h, define macros to list ppc exception
> types in traps.h, replace the reference of the real trap values with
> these macros.

Personally I find the hex values easier to recognise, but I realise
that's probably not true of other people :)

...

> diff --git a/arch/powerpc/include/asm/traps.h b/arch/powerpc/include/asm/traps.h
> new file mode 100644
> index ..a31b6122de23
> --- /dev/null
> +++ b/arch/powerpc/include/asm/traps.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_PPC_TRAPS_H
> +#define _ASM_PPC_TRAPS_H
> +
> +#define TRAP_RESET   0x100 /* System reset */
> +#define TRAP_MCE     0x200 /* Machine check */
> +#define TRAP_DSI     0x300 /* Data storage */
> +#define TRAP_DSEGI   0x380 /* Data segment */
> +#define TRAP_ISI     0x400 /* Instruction storage */
> +#define TRAP_ISEGI   0x480 /* Instruction segment */
> +#define TRAP_ALIGN   0x600 /* Alignment */
> +#define TRAP_PROG    0x700 /* Program */
> +#define TRAP_DEC     0x900 /* Decrementer */
> +#define TRAP_SYSCALL 0xc00 /* System call */
> +#define TRAP_TRACEI  0xd00 /* Trace */
> +#define TRAP_FPA     0xe00 /* Floating-point Assist */
> +#define TRAP_PMI     0xf00 /* Performance monitor */

I know the macro is called TRAP and the field in pt_regs is called trap,
but the terminology in the architecture is "exception", and we already
have many uses of that. In particular we have a lot of uses of "exc" as
an abbreviation for "exception". So I think I'd rather we use that than
"TRAP".

I think we should probably use the names from the ISA, unless they are
really over long.
Which are:

0x100 System Reset
0x200 Machine Check
0x300 Data Storage
0x380 Data Segment
0x400 Instruction Storage
0x480 Instruction Segment
0x500 External
0x600 Alignment
0x700 Program
0x800 Floating-Point Unavailable
0x900 Decrementer
0x980 Hypervisor Decrementer
0xA00 Directed Privileged Doorbell
0xC00 System Call
0xD00 Trace
0xE00 Hypervisor Data Storage
0xE20 Hypervisor Instruction Storage
0xE40 Hypervisor Emulation Assistance
0xE60 Hypervisor Maintenance
0xE80 Directed Hypervisor Doorbell
0xEA0 Hypervisor Virtualization
0xF00 Performance Monitor
0xF20 Vector Unavailable
0xF40 VSX Unavailable
0xF60 Facility Unavailable
0xF80 Hypervisor Facility Unavailable
0xFA0 Directed Ultravisor Doorbell

So perhaps:

EXC_SYSTEM_RESET
EXC_MACHINE_CHECK
EXC_DATA_STORAGE
EXC_DATA_SEGMENT
EXC_INST_STORAGE
EXC_INST_SEGMENT
EXC_EXTERNAL_INTERRUPT
EXC_ALIGNMENT
EXC_PROGRAM_CHECK
EXC_FP_UNAVAILABLE
EXC_DECREMENTER
EXC_HV_DECREMENTER
EXC_SYSTEM_CALL
EXC_HV_DATA_STORAGE
EXC_PERF_MONITOR

cheers
Re: [PATCH v2] mm: Move mem_init_print_info() into mm_init()
On Wed, Mar 17, 2021 at 09:52:10AM +0800, Kefeng Wang wrote: > mem_init_print_info() is called in mem_init() on each architecture, > and pass NULL argument, so using void argument and move it into mm_init(). > > Acked-by: Dave Hansen > Signed-off-by: Kefeng Wang Acked-by: Russell King # for arm bits -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: [PATCH v2] mm: Move mem_init_print_info() into mm_init()
On Wed, Mar 17, 2021 at 09:52:10AM +0800, Kefeng Wang wrote: > mem_init_print_info() is called in mem_init() on each architecture, > and pass NULL argument, so using void argument and move it into mm_init(). > > Acked-by: Dave Hansen > Signed-off-by: Kefeng Wang Acked-by: Mike Rapoport > --- > v2: > - Cleanup 'str' line suggested by Christophe and ACK >
[PATCH -next] powerpc: Remove duplicated include from time.c
Remove duplicated include. Reported-by: Hulk Robot Signed-off-by: Qiheng Lin --- arch/powerpc/kernel/time.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index b67d93a609a2..2c8762002b21 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -53,7 +53,6 @@ #include #include #include -#include #include #include
Re: WARNING: CPU: 0 PID: 1 at arch/powerpc/lib/feature-fixups.c:109 do_feature_fixups+0xb0/0xf0
Hi Paul,

Le 30/03/2021 à 12:37, Paul Menzel a écrit :
> Dear Linux folks,
>
> On the POWER8 system IBM S822LC, Linux 5.12-rc5+ logs the warning below.
>
> ```
> [    0.724118] Unable to patch feature section at (____ptrval____) - (____ptrval____) with (____ptrval____) - (____ptrval____)
> [    0.724185] pstore: Registered nvram as persistent store backend
> ```
>
> Please find the output of `dmesg` attached.

Did you do a 'make clean' before building ?

If not, can you try patch
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/8bb015bc98c51d8ced581415b7e3d157e18da7c9.1617181918.git.christophe.le...@csgroup.eu/

Thanks
Christophe
[PATCH] powerpc/vdso: Make sure vdso_wrapper.o is rebuilt everytime vdso.so is rebuilt
Commit bce74491c300 ("powerpc/vdso: fix unnecessary rebuilds of vgettimeofday.o") moved vdso32_wrapper.o and vdso64_wrapper.o out of arch/powerpc/kernel/vdso[32/64]/ and removed the dependencies in the Makefile. This leads to the wrappers not being re-build hence the kernel embedding the old vdso library. Add back missing dependencies to ensure vdso32_wrapper.o and vdso64_wrapper.o are rebuilt when vdso32.so.dbg and vdso64.so.dbg are changed. Fixes: bce74491c300 ("powerpc/vdso: fix unnecessary rebuilds of vgettimeofday.o") Cc: sta...@vger.kernel.org Cc: Masahiro Yamada Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/Makefile | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 6084fa499aa3..f66b63e81c3b 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -191,3 +191,7 @@ $(obj)/prom_init_check: $(src)/prom_init_check.sh $(obj)/prom_init.o FORCE targets += prom_init_check clean-files := vmlinux.lds + +# Force dependency (incbin is bad) +$(obj)/vdso32_wrapper.o : $(obj)/vdso32/vdso32.so.dbg +$(obj)/vdso64_wrapper.o : $(obj)/vdso64/vdso64.so.dbg -- 2.25.0
Re: [PATCH 1/2] powerpc/vdso: fix unnecessary rebuilds of vgettimeofday.o
Le 28/01/2021 à 05:01, Masahiro Yamada a écrit : On Thu, Dec 24, 2020 at 2:12 AM Masahiro Yamada wrote: vgettimeofday.o is unnecessarily rebuilt. Adding it to 'targets' is not enough to fix the issue. Kbuild is correctly rebuilding it because the command line is changed. PowerPC builds each vdso directory twice; first in vdso_prepare to generate vdso{32,64}-offsets.h, second as part of the ordinary build process to embed vdso{32,64}.so.dbg into the kernel. The problem shows up when CONFIG_PPC_WERROR=y due to the following line in arch/powerpc/Kbuild: subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror In the preparation stage, Kbuild directly visits the vdso directories, hence it does not inherit subdir-ccflags-y. In the second descend, Kbuild adds -Werror, which results in the command line flipping with/without -Werror. It implies a potential danger; if a more critical flag that would impact the resulted vdso, the offsets recorded in the headers might be different from real offsets in the embedded vdso images. Removing the unneeded second descend solves the problem. Link: https://lore.kernel.org/linuxppc-dev/87tuslxhry@mpe.ellerman.id.au/ Reported-by: Michael Ellerman Signed-off-by: Masahiro Yamada --- Michael, please take a look at this. The unneeded rebuild problem is still remaining. Seems like with this patch, vdso32_wrapper.o is not rebuilt when vdso32.so.dbg is rebuilt. Leading to ... disasters. 
I'll send a patch arch/powerpc/kernel/Makefile | 4 ++-- arch/powerpc/kernel/vdso32/Makefile | 5 + arch/powerpc/kernel/{vdso32 => }/vdso32_wrapper.S | 0 arch/powerpc/kernel/vdso64/Makefile | 6 +- arch/powerpc/kernel/{vdso64 => }/vdso64_wrapper.S | 0 5 files changed, 4 insertions(+), 11 deletions(-) rename arch/powerpc/kernel/{vdso32 => }/vdso32_wrapper.S (100%) rename arch/powerpc/kernel/{vdso64 => }/vdso64_wrapper.S (100%) diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index fe2ef598e2ea..79ee7750937d 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -51,7 +51,7 @@ obj-y += ptrace/ obj-$(CONFIG_PPC64)+= setup_64.o \ paca.o nvram_64.o note.o syscall_64.o obj-$(CONFIG_COMPAT) += sys_ppc32.o signal_32.o -obj-$(CONFIG_VDSO32) += vdso32/ +obj-$(CONFIG_VDSO32) += vdso32_wrapper.o obj-$(CONFIG_PPC_WATCHDOG) += watchdog.o obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o obj-$(CONFIG_PPC_DAWR) += dawr.o @@ -60,7 +60,7 @@ obj-$(CONFIG_PPC_BOOK3S_64) += cpu_setup_power.o obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o obj-$(CONFIG_PPC_BOOK3E_64)+= exceptions-64e.o idle_book3e.o obj-$(CONFIG_PPC_BARRIER_NOSPEC) += security.o -obj-$(CONFIG_PPC64)+= vdso64/ +obj-$(CONFIG_PPC64)+= vdso64_wrapper.o obj-$(CONFIG_ALTIVEC) += vecemu.o obj-$(CONFIG_PPC_BOOK3S_IDLE) += idle_book3s.o procfs-y := proc_powerpc.o diff --git a/arch/powerpc/kernel/vdso32/Makefile b/arch/powerpc/kernel/vdso32/Makefile index 59aa2944ecae..42fc3de89b39 100644 --- a/arch/powerpc/kernel/vdso32/Makefile +++ b/arch/powerpc/kernel/vdso32/Makefile @@ -30,7 +30,7 @@ CC32FLAGS += -m32 KBUILD_CFLAGS := $(filter-out -mcmodel=medium,$(KBUILD_CFLAGS)) endif -targets := $(obj-vdso32) vdso32.so.dbg +targets := $(obj-vdso32) vdso32.so.dbg vgettimeofday.o obj-vdso32 := $(addprefix $(obj)/, $(obj-vdso32)) GCOV_PROFILE := n @@ -46,9 +46,6 @@ obj-y += vdso32_wrapper.o targets += vdso32.lds CPPFLAGS_vdso32.lds += -P -C -Upowerpc -# Force dependency (incbin is bad) 
-$(obj)/vdso32_wrapper.o : $(obj)/vdso32.so.dbg - # link rule for the .so file, .lds has to be first $(obj)/vdso32.so.dbg: $(src)/vdso32.lds $(obj-vdso32) $(obj)/vgettimeofday.o FORCE $(call if_changed,vdso32ld_and_check) diff --git a/arch/powerpc/kernel/vdso32/vdso32_wrapper.S b/arch/powerpc/kernel/vdso32_wrapper.S similarity index 100% rename from arch/powerpc/kernel/vdso32/vdso32_wrapper.S rename to arch/powerpc/kernel/vdso32_wrapper.S diff --git a/arch/powerpc/kernel/vdso64/Makefile b/arch/powerpc/kernel/vdso64/Makefile index d365810a689a..b50b39fedf74 100644 --- a/arch/powerpc/kernel/vdso64/Makefile +++ b/arch/powerpc/kernel/vdso64/Makefile @@ -17,7 +17,7 @@ endif # Build rules -targets := $(obj-vdso64) vdso64.so.dbg +targets := $(obj-vdso64) vdso64.so.dbg vgettimeofday.o obj-vdso64 := $(addprefix $(obj)/, $(obj-vdso64)) GCOV_PROFILE := n @@ -29,15 +29,11 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \ -Wl,-soname=linux-vdso64.so.1 -Wl,--hash-style=both asflags-y := -D__VDSO64__ -s -obj-y += vdso64_wrapper.o targets += vdso64.lds CPPFLAGS_vdso64.lds += -P -C -U$(ARCH) $(obj)/vgettimeofday.o: %.o: %.c FORCE -# Force dependency (incbin is bad) -$(obj)/vdso64_wrapper.o
Re: [PATCH] PCI: Try to find two continuous regions for child resource
On Tue, Mar 30, 2021 at 12:23 AM Bjorn Helgaas wrote: > > On Mon, Mar 29, 2021 at 04:47:59PM +0800, Kai-Heng Feng wrote: > > Built-in grahpics on HP EliteDesk 805 G6 doesn't work because graphics > > can't get the BAR it needs: > > [0.611504] pci_bus :00: root bus resource [mem > > 0x1002020-0x100303f window] > > [0.611505] pci_bus :00: root bus resource [mem > > 0x1003040-0x100401f window] > > ... > > [0.638083] pci :00:08.1: bridge window [mem 0xd200-0xd23f] > > [0.638086] pci :00:08.1: bridge window [mem > > 0x1003000-0x100401f 64bit pref] > > [0.962086] pci :00:08.1: can't claim BAR 15 [mem > > 0x1003000-0x100401f 64bit pref]: no compatible bridge window > > [0.962086] pci :00:08.1: [mem 0x1003000-0x100401f 64bit > > pref] clipped to [mem 0x1003000-0x100303f 64bit pref] > > [0.962086] pci :00:08.1: bridge window [mem > > 0x1003000-0x100303f 64bit pref] > > [0.962086] pci :07:00.0: can't claim BAR 0 [mem > > 0x1003000-0x1003fff 64bit pref]: no compatible bridge window > > [0.962086] pci :07:00.0: can't claim BAR 2 [mem > > 0x1004000-0x100401f 64bit pref]: no compatible bridge window > > > > However, the root bus has two continuous regions that can contain the > > child resource requested. > > > > So try to find another parent region if two regions are continuous and > > can contain child resource. This change makes the grahpics works on the > > system in question. 
> > The BIOS description of PCI0 is interesting: > > pci_bus :00: root bus resource [mem 0x100-0x100201f window] > pci_bus :00: root bus resource [mem 0x1002020-0x100303f window] > pci_bus :00: root bus resource [mem 0x1003040-0x100401f window] > > So the PCI0 _CRS apparently gave us: > > [mem 0x100-0x100201f] size 0x2020 (512MB + 2MB) > [mem 0x1002020-0x100303f] size 0x1020 (256MB + 2MB) > [mem 0x1003040-0x100401f] size 0x0fe0 (254MB) > > These are all contiguous, so we'd have no problem if we coalesced them > into a single window: > > [mem 0x100-0x100401f window] size 0x4020 (1GB + 2MB) > > I think we currently keep these root bus resources separate because if > we ever support _SRS for host bridges, the argument we give to _SRS > must be exactly the same format as what we got from _CRS (see ACPI > v6.3, sec 6.2.16, and pnpacpi_set_resources()). > > pnpacpi_encode_resources() is currently very simple-minded and copies > each device resource back into a single _SRS entry. But (1) we don't > support _SRS for host bridges, and (2) if we ever do, we can make > pnpacpi_encode_resources() smarter so it breaks things back up. > > So I think we should try to fix this by coalescing these adjacent > resources from _CRS so we end up with a single root bus resource that > covers all contiguous regions. Thanks for the tip! Working on v2 patch. > > Typos, etc: > - No need for the timestamps; they're not relevant to the problem. > - s/grahpics/graphics/ (two occurrences above) > - s/continuous/contiguous/ (three occurrences above) Will also update those in v2. 
Kai-Heng

>
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212013
> > Signed-off-by: Kai-Heng Feng
> > ---
> >  arch/microblaze/pci/pci-common.c |  4 +--
> >  arch/powerpc/kernel/pci-common.c |  8 ++---
> >  arch/sparc/kernel/pci.c          |  4 +--
> >  drivers/pci/pci.c                | 60 +++-
> >  drivers/pci/setup-res.c          | 21 +++
> >  drivers/pcmcia/rsrc_nonstatic.c  |  4 +--
> >  include/linux/pci.h              |  6 ++--
> >  7 files changed, 80 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
> > index 557585f1be41..8e65832fb510 100644
> > --- a/arch/microblaze/pci/pci-common.c
> > +++ b/arch/microblaze/pci/pci-common.c
> > @@ -669,7 +669,7 @@ static void pcibios_allocate_bus_resources(struct pci_bus *bus)
> >  {
> >  	struct pci_bus *b;
> >  	int i;
> > -	struct resource *res, *pr;
> > +	struct resource *res, *pr = NULL;
> >
> >  	pr_debug("PCI: Allocating bus resources for %04x:%02x...\n",
> >  		 pci_domain_nr(bus), bus->number);
> > @@ -688,7 +688,7 @@ static void pcibios_allocate_bus_resources(struct pci_bus *bus)
> >  	 * and as such ensure proper re-allocation
> >  	 * later.
> >  	 */
> > -	pr = pci_find_parent_resource(bus->self, res);
> > +	pci_find_parent_resource(bus->self, res, &pr, NULL);
> >  	if (pr == res) {
> >  		/* this happens when the generic PCI
> >  		 * code (wrongly) decides that this
> > diff --git a/arch/powerpc/kernel/pci-common.c
> >
Re: [PATCH printk v2 2/5] printk: remove safe buffers
On 2021-03-30, John Ogness wrote: > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c > index e971c0a9ec9e..f090d6a1b39e 100644 > --- a/kernel/printk/printk.c > +++ b/kernel/printk/printk.c > @@ -1772,16 +1759,21 @@ static struct task_struct *console_owner; > static bool console_waiter; > > /** > - * console_lock_spinning_enable - mark beginning of code where another > + * console_lock_spinning_enable_irqsave - mark beginning of code where > another > * thread might safely busy wait > * > * This basically converts console_lock into a spinlock. This marks > * the section where the console_lock owner can not sleep, because > * there may be a waiter spinning (like a spinlock). Also it must be > * ready to hand over the lock at the end of the section. > + * > + * This disables interrupts because the hand over to a waiter must not be > + * interrupted until the hand over is completed (@console_waiter is cleared). > */ > -static void console_lock_spinning_enable(void) > +static void console_lock_spinning_enable_irqsave(unsigned long *flags) I missed the prototype change for the !CONFIG_PRINTK case, resulting in: linux/kernel/printk/printk.c:2707:3: error: implicit declaration of function ‘console_lock_spinning_enable_irqsave’; did you mean ‘console_lock_spinning_enable’? [-Werror=implicit-function-declaration] console_lock_spinning_enable_irqsave(); ^~~~ console_lock_spinning_enable Will be fixed for v3. (I have now officially added !CONFIG_PRINTK to my CI tests.) John Ogness