[PATCH 2/2] powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE
I have tested this with the Radix MMU and everything seems to work, and
the previous patch for Hash seems to fix everything too.
STRICT_KERNEL_RWX should still be disabled by default for now.

Please test STRICT_KERNEL_RWX + RELOCATABLE!

Signed-off-by: Russell Currey
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1ec34e16ed65..6093c48976bf 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -133,7 +133,7 @@ config PPC
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_MEMBARRIER_CALLBACKS
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
-	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION)
+	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !HIBERNATION)
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UACCESS_FLUSHCACHE
 	select ARCH_HAS_UACCESS_MCSAFE		if PPC64
-- 
2.24.1
[PATCH 1/2] powerpc/book3s64/hash: Disable 16M linear mapping size if not aligned
With STRICT_KERNEL_RWX on in a relocatable kernel under the hash MMU, if
the position the kernel is loaded at is not 16M aligned, the kernel
miscalculates its ALIGN*()s and things go horribly wrong.

We can easily avoid this when selecting the linear mapping size, so do
so and print a warning.  I tested this for various alignments and as
long as the position is 64K aligned it's fine (the base requirement for
powerpc).

Signed-off-by: Russell Currey
---
 arch/powerpc/mm/book3s64/hash_utils.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index b30435c7d804..523d4d39d11e 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -652,6 +652,7 @@ static void init_hpte_page_sizes(void)
 
 static void __init htab_init_page_sizes(void)
 {
+	bool aligned = true;
 	init_hpte_page_sizes();
 
 	if (!debug_pagealloc_enabled()) {
@@ -659,7 +660,15 @@ static void __init htab_init_page_sizes(void)
 		 * Pick a size for the linear mapping. Currently, we only
 		 * support 16M, 1M and 4K which is the default
 		 */
-		if (mmu_psize_defs[MMU_PAGE_16M].shift)
+		if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) &&
+		    (unsigned long)_stext % 0x1000000) {
+			if (mmu_psize_defs[MMU_PAGE_16M].shift)
+				pr_warn("Kernel not 16M aligned, disabling 16M linear map alignment\n");
+			aligned = false;
+		}
+
+		if (mmu_psize_defs[MMU_PAGE_16M].shift && aligned)
 			mmu_linear_psize = MMU_PAGE_16M;
 		else if (mmu_psize_defs[MMU_PAGE_1M].shift)
 			mmu_linear_psize = MMU_PAGE_1M;
-- 
2.24.1
[PATCH v6 0/5] Implement STRICT_MODULE_RWX for powerpc
v5 cover letter: https://lore.kernel.org/kernel-hardening/20191030073111.140493-1-rus...@russell.cc/
v4 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html
v3 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html

Changes since v5:
	[1/5]: Addressed review comments from Christophe Leroy (thanks!)
	[2/5]: Use patch_instruction() instead of memcpy() thanks to mpe

Thanks for the feedback, hopefully this is the final iteration.

I have a patch to remove the STRICT_KERNEL_RWX incompatibility with
RELOCATABLE for book3s64 coming soon, so with that we should have a
great basis for powerpc RWX going forward.

Russell Currey (5):
  powerpc/mm: Implement set_memory() routines
  powerpc/kprobes: Mark newly allocated probes as RO
  powerpc/mm/ptdump: debugfs handler for W+X checks at runtime
  powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
  powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig

 arch/powerpc/Kconfig                   |  2 +
 arch/powerpc/Kconfig.debug             |  6 +-
 arch/powerpc/configs/skiroot_defconfig |  1 +
 arch/powerpc/include/asm/set_memory.h  | 32 ++++++++
 arch/powerpc/kernel/kprobes.c          |  6 +-
 arch/powerpc/mm/Makefile               |  1 +
 arch/powerpc/mm/pageattr.c             | 83 ++++++++++++++++++
 arch/powerpc/mm/ptdump/ptdump.c        | 21 ++++-
 8 files changed, 147 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/include/asm/set_memory.h
 create mode 100644 arch/powerpc/mm/pageattr.c

-- 
2.24.1
[PATCH v6 5/5] powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig
skiroot_defconfig is the only powerpc defconfig with STRICT_KERNEL_RWX
enabled, and if you want memory protection for kernel text you'd want
it for modules too, so enable STRICT_MODULE_RWX there.

Acked-by: Joel Stanley
Signed-off-by: Russell Currey
---
 arch/powerpc/configs/skiroot_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/skiroot_defconfig b/arch/powerpc/configs/skiroot_defconfig
index 069f67f12731..b74358c3ede8 100644
--- a/arch/powerpc/configs/skiroot_defconfig
+++ b/arch/powerpc/configs/skiroot_defconfig
@@ -31,6 +31,7 @@ CONFIG_PERF_EVENTS=y
 CONFIG_SLAB_FREELIST_HARDENED=y
 CONFIG_JUMP_LABEL=y
 CONFIG_STRICT_KERNEL_RWX=y
+CONFIG_STRICT_MODULE_RWX=y
 CONFIG_MODULES=y
 CONFIG_MODULE_UNLOAD=y
 CONFIG_MODULE_SIG=y
-- 
2.24.1
[PATCH v6 4/5] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
To enable strict module RWX on powerpc, set:

    CONFIG_STRICT_MODULE_RWX=y

You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real
security benefit.

ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX.
This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that
makes STRICT_MODULE_RWX *on by default* in configurations where
STRICT_KERNEL_RWX is *unavailable*.

Since this doesn't make much sense, and module RWX without kernel RWX
doesn't make much sense, having the same dependencies as kernel RWX
works around this problem.

Signed-off-by: Russell Currey
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f0b9b47b5353..97ea012fdff9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,6 +135,7 @@ config PPC
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION)
+	select ARCH_HAS_STRICT_MODULE_RWX	if ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UACCESS_FLUSHCACHE
 	select ARCH_HAS_UACCESS_MCSAFE		if PPC64
-- 
2.24.1
[PATCH v6 3/5] powerpc/mm/ptdump: debugfs handler for W+X checks at runtime
Very rudimentary, just echo 1 > [debugfs]/check_wx_pages and check the
kernel log.  Useful for testing strict module RWX.

Updated the Kconfig entry to reflect this.

Also fixed a typo.

Signed-off-by: Russell Currey
---
 arch/powerpc/Kconfig.debug      |  6 ++++--
 arch/powerpc/mm/ptdump/ptdump.c | 21 ++++++++++++++++++++-
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 4e1d39847462..7c14c9728bc0 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -370,7 +370,7 @@ config PPC_PTDUMP
 	  If you are unsure, say N.
 
 config PPC_DEBUG_WX
-	bool "Warn on W+X mappings at boot"
+	bool "Warn on W+X mappings at boot & enable manual checks at runtime"
 	depends on PPC_PTDUMP
 	help
 	  Generate a warning if any W+X mappings are found at boot.
@@ -384,7 +384,9 @@ config PPC_DEBUG_WX
 	  of other unfixed kernel bugs easier.
 
 	  There is no runtime or memory usage effect of this option
-	  once the kernel has booted up - it's a one time check.
+	  once the kernel has booted up, it only automatically checks once.
+
+	  Enables the "check_wx_pages" debugfs entry for checking at runtime.
 
 	  If in doubt, say "Y".
 
diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index 2f9ddc29c535..b6cba29ae4a0 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@ -4,7 +4,7 @@
  *
  * This traverses the kernel pagetables and dumps the
  * information about the used sections of memory to
- * /sys/kernel/debug/kernel_pagetables.
+ * /sys/kernel/debug/kernel_page_tables.
  *
  * Derived from the arm64 implementation:
  * Copyright (c) 2014, The Linux Foundation, Laura Abbott.
@@ -409,6 +409,25 @@ void ptdump_check_wx(void)
 	else
 		pr_info("Checked W+X mappings: passed, no W+X pages found\n");
 }
+
+static int check_wx_debugfs_set(void *data, u64 val)
+{
+	if (val != 1ULL)
+		return -EINVAL;
+
+	ptdump_check_wx();
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(check_wx_fops, NULL, check_wx_debugfs_set, "%llu\n");
+
+static int ptdump_check_wx_init(void)
+{
+	return debugfs_create_file("check_wx_pages", 0200, NULL,
+				   NULL, &check_wx_fops) ? 0 : -ENOMEM;
+}
+device_initcall(ptdump_check_wx_init);
 #endif
 
 static int ptdump_init(void)
-- 
2.24.1
[PATCH v6 1/5] powerpc/mm: Implement set_memory() routines
The set_memory_{ro/rw/nx/x}() functions are required for
STRICT_MODULE_RWX, and are generally useful primitives to have.  This
implementation is designed to be completely generic across powerpc's
many MMUs.

It's possible that this could be optimised to be faster for specific
MMUs, but the focus is on having a generic and safe implementation for
now.

This implementation does not handle cases where the caller is attempting
to change the mapping of the page it is executing from, or if another
CPU is concurrently using the page being altered.  These cases likely
shouldn't happen, but a more complex implementation with MMU-specific
code could safely handle them, so that is left as a TODO for now.

Signed-off-by: Russell Currey
---
 arch/powerpc/Kconfig                  |  1 +
 arch/powerpc/include/asm/set_memory.h | 32 ++++++++++
 arch/powerpc/mm/Makefile              |  1 +
 arch/powerpc/mm/pageattr.c            | 83 +++++++++++++++++++++
 4 files changed, 117 insertions(+)
 create mode 100644 arch/powerpc/include/asm/set_memory.h
 create mode 100644 arch/powerpc/mm/pageattr.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1ec34e16ed65..f0b9b47b5353 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -133,6 +133,7 @@ config PPC
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_MEMBARRIER_CALLBACKS
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
+	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION)
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UACCESS_FLUSHCACHE
diff --git a/arch/powerpc/include/asm/set_memory.h b/arch/powerpc/include/asm/set_memory.h
new file mode 100644
index 000000000000..5230ddb2fefd
--- /dev/null
+++ b/arch/powerpc/include/asm/set_memory.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SET_MEMORY_H
+#define _ASM_POWERPC_SET_MEMORY_H
+
+#define SET_MEMORY_RO	1
+#define SET_MEMORY_RW	2
+#define SET_MEMORY_NX	3
+#define SET_MEMORY_X	4
+
+int change_memory_attr(unsigned long addr, int numpages, int action);
+
+static inline int set_memory_ro(unsigned long addr, int numpages)
+{
+	return change_memory_attr(addr, numpages, SET_MEMORY_RO);
+}
+
+static inline int set_memory_rw(unsigned long addr, int numpages)
+{
+	return change_memory_attr(addr, numpages, SET_MEMORY_RW);
+}
+
+static inline int set_memory_nx(unsigned long addr, int numpages)
+{
+	return change_memory_attr(addr, numpages, SET_MEMORY_NX);
+}
+
+static inline int set_memory_x(unsigned long addr, int numpages)
+{
+	return change_memory_attr(addr, numpages, SET_MEMORY_X);
+}
+
+#endif
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 5e147986400d..d0a0bcbc9289 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM)		+= highmem.o
 obj-$(CONFIG_PPC_COPRO_BASE)	+= copro_fault.o
 obj-$(CONFIG_PPC_PTDUMP)	+= ptdump/
 obj-$(CONFIG_KASAN)		+= kasan/
+obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pageattr.o
diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
new file mode 100644
index 000000000000..15d5fb04f531
--- /dev/null
+++ b/arch/powerpc/mm/pageattr.c
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * MMU-generic set_memory implementation for powerpc
+ *
+ * Copyright 2019, IBM Corporation.
+ */
+
+#include <linux/mm.h>
+#include <linux/set_memory.h>
+
+#include <asm/mmu.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+
+
+/*
+ * Updates the attributes of a page in three steps:
+ *
+ * 1. invalidate the page table entry
+ * 2. flush the TLB
+ * 3. install the new entry with the updated attributes
+ *
+ * This is unsafe if the caller is attempting to change the mapping of the
+ * page it is executing from, or if another CPU is concurrently using the
+ * page being altered.
+ *
+ * TODO make the implementation resistant to this.
+ */
+static int __change_page_attr(pte_t *ptep, unsigned long addr, void *data)
+{
+	int action = *((int *)data);
+	pte_t pte_val;
+
+	// invalidate the PTE so it's safe to modify
+	pte_val = ptep_get_and_clear(&init_mm, addr, ptep);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+	// modify the PTE bits as desired, then apply
+	switch (action) {
+	case SET_MEMORY_RO:
+		pte_val = pte_wrprotect(pte_val);
+		break;
+	case SET_MEMORY_RW:
+		pte_val = pte_mkwrite(pte_val);
+		break;
+	case SET_MEMORY_NX:
+		pte_val = pte_exprotect(pte_val);
+		break;
+	case SET_MEMORY_X:
+		pte_val = pte_mkexec(pte_val);
+		break;
+	default:
+		WARN_ON(true);
+		return -EINVAL;
+	}
+
+	set_pte_at(&init_mm, addr, ptep, pte_val);
+
+	return 0;
+}
+
+static int
[PATCH v6 2/5] powerpc/kprobes: Mark newly allocated probes as RO
With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one
W+X page at boot by default.  This can be tested with
CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the
kernel log during boot.

powerpc doesn't implement its own alloc() for kprobes like other
architectures do, but we couldn't immediately mark RO anyway since we do
a memcpy to the page we allocate later.  After that, nothing should be
allowed to modify the page, and write permissions are removed well
before the kprobe is armed.

The memcpy() would fail if >1 probes were allocated, so use
patch_instruction() instead which is safe for RO.

Reviewed-by: Daniel Axtens
Signed-off-by: Russell Currey
---
 arch/powerpc/kernel/kprobes.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 2d27ec4feee4..b72761f0c9e3 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -24,6 +24,7 @@
 #include <asm/code-patching.h>
 #include <asm/cacheflush.h>
 #include <asm/sstep.h>
+#include <asm/set_memory.h>
 
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
 DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
@@ -124,13 +125,14 @@ int arch_prepare_kprobe(struct kprobe *p)
 	}
 
 	if (!ret) {
-		memcpy(p->ainsn.insn, p->addr,
-		       MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+		patch_instruction(p->ainsn.insn, *p->addr);
 		p->opcode = *p->addr;
 		flush_icache_range((unsigned long)p->ainsn.insn,
 			(unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t));
 	}
 
+	set_memory_ro((unsigned long)p->ainsn.insn, 1);
+
 	p->ainsn.boostable = 0;
 	return ret;
 }
-- 
2.24.1
[PATCH V11 RESEND] mm/debug: Add tests validating architecture page table helpers
This adds tests which will validate architecture page table helpers and
other accessors in their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or addition of new ones.

This test covers basic page table entry transformations including but
not limited to old, young, dirty, clean, write, write protect etc. at
various levels along with populating intermediate entries with next page
table page and validating them.

Test page table pages are allocated from system memory with required
size and alignments.  The mapped pfns at page table levels are derived
from a real pfn representing a valid kernel text symbol.

This test gets called right after page_alloc_init_late().  This gets
built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
CONFIG_DEBUG_VM.  Architectures willing to subscribe to this test also
need to select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is
limited to x86 and arm64.  Going forward, other architectures too can
enable this after fixing build or runtime problems (if any) with their
page table helpers.

Folks interested in making sure that a given platform's page table
helpers conform to expected generic MM semantics should enable the above
config, which will just trigger this test during boot.  Any
non-conformity here will be reported as a warning which would need to be
fixed.  This test will help catch any changes to the agreed upon
semantics expected from generic MM and enable platforms to accommodate
it thereafter.

Cc: Andrew Morton
Cc: Vlastimil Babka
Cc: Greg Kroah-Hartman
Cc: Thomas Gleixner
Cc: Mike Rapoport
Cc: Jason Gunthorpe
Cc: Dan Williams
Cc: Peter Zijlstra
Cc: Michal Hocko
Cc: Mark Rutland
Cc: Mark Brown
Cc: Steven Price
Cc: Ard Biesheuvel
Cc: Masahiro Yamada
Cc: Kees Cook
Cc: Tetsuo Handa
Cc: Matthew Wilcox
Cc: Sri Krishna chowdary
Cc: Dave Hansen
Cc: Russell King - ARM Linux
Cc: Michael Ellerman
Cc: Paul Mackerras
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: "David S. Miller"
Cc: Vineet Gupta
Cc: James Hogan
Cc: Paul Burton
Cc: Ralf Baechle
Cc: Kirill A. Shutemov
Cc: Gerald Schaefer
Cc: Christophe Leroy
Cc: Ingo Molnar
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org
Tested-by: Christophe Leroy # PPC32
Reviewed-by: Ingo Molnar
Suggested-by: Catalin Marinas
Signed-off-by: Andrew Morton
Signed-off-by: Christophe Leroy
Signed-off-by: Anshuman Khandual
---
This adds a test validation for architecture exported page table
helpers.  Patch adds basic transformation tests at various levels of the
page table.

This test was originally suggested by Catalin during arm64 THP migration
RFC discussion earlier.  Going forward it can include more specific
tests with respect to various generic MM functions like THP, HugeTLB
etc and platform specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Needs to be applied on linux 5.5-rc2

Changes in V11:

- Rebased the patch on 5.5-rc2

Changes in V10: (https://patchwork.kernel.org/project/linux-mm/list/?series=205529)

- Always enable DEBUG_VM_PGTABLE when DEBUG_VM is enabled per Ingo
- Added tags from Ingo

Changes in V9: (https://patchwork.kernel.org/project/linux-mm/list/?series=201429)

- Changed feature support enumeration for powerpc platforms per Christophe
- Changed config wrapper for basic_[pmd|pud]_tests() to enable ARC platform
- Enabled the test on ARC platform

Changes in V8: (https://patchwork.kernel.org/project/linux-mm/list/?series=194297)

- Enabled ARCH_HAS_DEBUG_VM_PGTABLE on PPC32 platform per Christophe
- Updated feature documentation as DEBUG_VM_PGTABLE is now enabled on PPC32 platform
- Moved ARCH_HAS_DEBUG_VM_PGTABLE earlier to indent it with DEBUG_VM per Christophe
- Added an information message in debug_vm_pgtable() per Christophe
- Dropped random_vaddr boundary condition checks per Christophe and Qian
- Replaced virt_addr_valid() check with pfn_valid() check in debug_vm_pgtable()
- Slightly changed pr_fmt(fmt) information

Changes in V7: (https://patchwork.kernel.org/project/linux-mm/list/?series=193051)

- Memory allocation and free routines for mapped pages have been dropped
- Mapped pfns are derived from standard kernel text symbol per Matthew
- Moved debug_vm_pgtable() after page_alloc_init_late() per Michal and Qian
- Updated the commit message per Michal
- Updated W=1 GCC warning problem on x86 per Qian Cai
- Addition of new alloc_contig_pages() helper has been submitted separately

Changes in V6: (https://patchwork.kernel.org/project/linux-mm/list/?series=187589)

- Moved alloc_gigantic_page_order() into mm/page_alloc.c per Michal
- Moved
Re: [RFC PATCH v2 05/10] lib: vdso: inline do_hres()
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy wrote:
>
> do_hres() is called from several places, so GCC doesn't inline
> it at first.
>
> do_hres() takes a struct __kernel_timespec * parameter for
> passing the result. In the 32 bits case, this parameter corresponds
> to a local var in the caller. In order to provide a pointer
> to this structure, the caller has to put it in its stack and
> do_hres() has to write the result in the stack. This is suboptimal,
> especially on RISC processors like powerpc.
>
> By making GCC inline the function, the struct __kernel_timespec
> remains a local var using registers, avoiding the need to write and
> read stack.
>
> The improvement is significant on powerpc.

I'm okay with it, mainly because I don't expect many workloads to have
more than one copy of the code hot at the same time.
Re: [RFC PATCH v2 04/10] lib: vdso: get pointer to vdso data from the arch
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy wrote:
>
> On powerpc, __arch_get_vdso_data() clobbers the link register,
> requiring the caller to set a stack frame in order to save it.
>
> As the parent function already has to set a stack frame and save
> the link register to call the C vdso function, retrieving the
> vdso data pointer there is lighter.

I'm confused.  Can't you inline __arch_get_vdso_data()?  Or is the issue
that you can't retrieve the program counter on power without clobbering
the link register?

I would imagine that this patch generates worse code on any architecture
with PC-relative addressing modes (which includes at least x86_64, and I
would guess includes most modern architectures).

--Andy
Re: [RFC PATCH v2 02/10] lib: vdso: move call to fallback out of common code.
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy wrote:
>
> On powerpc, VDSO functions and syscalls cannot be implemented in C
> because the Linux kernel ABI requires that CR[SO] bit is set in case
> of error and cleared when no error.
>
> As this cannot be done in C, C VDSO functions and syscall-based
> fallback need a trampoline in ASM.
>
> By moving the fallback calls out of the common code, arches like
> powerpc can implement both the call to C VDSO and the fallback call
> in a single trampoline function.

Maybe the issue is that I'm not a powerpc person, but I don't understand
this.  The common vDSO code is in C.  Presumably this means that you
need an asm trampoline no matter what to call the C code.  Is the
improvement that, with this change, you can have the asm trampoline do a
single branch, so it's logically:

    ret = [call the C code];
    if (ret == 0) {
        set success bit;
    } else {
        ret = fallback;
        if (ret == 0)
            set success bit;
        else
            set failure bit;
    }

    return ret;

instead of:

    ret = [call the C code, which includes the fallback];
    if (ret == 0)
        set success bit;
    else
        set failure bit;

It's not obvious to me that the former ought to be faster.

> The two advantages are:
> - No need to play back and forth with CR[SO] and negative return value.
> - No stack frame is required in VDSO C functions for the fallbacks.

How is no stack frame required?  Do you mean that the presence of the
fallback causes worse code generation?  Can you improve the fallback
instead?
Re: [RFC PATCH v2 01/10] lib: vdso: ensure all arches have 32bit fallback
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy wrote:
>
> In order to simplify the next step, which moves the fallback call to
> the arch level, ensure all arches have a 32bit fallback instead of
> handling the lack of a 32bit fallback in the common code based on
> VDSO_HAS_32BIT_FALLBACK.

I don't like this.  You've implemented what appear to be nonsensical
fallbacks (the 32-bit fallback for a 64-bit vDSO build?  There's no such
thing).

How exactly does this simplify patch 2?

--Andy
Re: [RFC PATCH v2 08/10] lib: vdso: Avoid duplication in __cvdso_clock_getres()
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy wrote:
>
> VDSO_HRES and VDSO_RAW clocks are handled the same way.
>
> Don't duplicate code.
>
> Signed-off-by: Christophe Leroy

Reviewed-by: Andy Lutomirski
Re: [RFC PATCH v2 07/10] lib: vdso: don't use READ_ONCE() in __c_kernel_time()
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy wrote:
>
> READ_ONCE() forces the read of the 64 bit value of
> vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec although
> only the lower part is needed.

Seems reasonable and very unlikely to be harmful.  That being said, this
function really ought to be considered deprecated -- 32-bit time_t is
insufficient.

Do you get even better code if you move the read into the if statement?

Reviewed-by: Andy Lutomirski

--Andy
Re: [PATCH kernel v3] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()
On 23/12/2019 22:18, Michael Ellerman wrote:
> Alexey Kardashevskiy writes:
>
>> The last jump to free_exit in mm_iommu_do_alloc() happens after page
>> pointers in struct mm_iommu_table_group_mem_t were already converted to
>> physical addresses. Thus calling put_page() on these physical addresses
>> will likely crash.
>>
>> This moves the loop which calculates the pageshift and converts page
>> struct pointers to physical addresses later after the point when
>> we cannot fail; thus eliminating the need to convert pointers back.
>>
>> Fixes: eb9d7a62c386 ("powerpc/mm_iommu: Fix potential deadlock")
>> Reported-by: Jan Kara
>> Signed-off-by: Alexey Kardashevskiy
>> ---
>> Changes:
>> v3:
>> * move pointers conversion after the last possible failure point
>> ---
>>  arch/powerpc/mm/book3s64/iommu_api.c | 39 +++++++++++++++++++---------------
>>  1 file changed, 21 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
>> index 56cc84520577..ef164851738b 100644
>> --- a/arch/powerpc/mm/book3s64/iommu_api.c
>> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
>> @@ -121,24 +121,6 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>>  		goto free_exit;
>>  	}
>>
>> -	pageshift = PAGE_SHIFT;
>> -	for (i = 0; i < entries; ++i) {
>> -		struct page *page = mem->hpages[i];
>> -
>> -		/*
>> -		 * Allow to use larger than 64k IOMMU pages. Only do that
>> -		 * if we are backed by hugetlb.
>> -		 */
>> -		if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
>> -			pageshift = page_shift(compound_head(page));
>> -		mem->pageshift = min(mem->pageshift, pageshift);
>> -		/*
>> -		 * We don't need struct page reference any more, switch
>> -		 * to physical address.
>> -		 */
>> -		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>> -	}
>> -
>>  good_exit:
>>  	atomic64_set(&mem->mapped, 1);
>>  	mem->used = 1;
>> @@ -158,6 +140,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>>  		}
>>  	}
>>
>> +	if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
>
> Couldn't you avoid testing this again ...
>
>> +		/*
>> +		 * Allow to use larger than 64k IOMMU pages. Only do that
>> +		 * if we are backed by hugetlb. Skip device memory as it is not
>> +		 * backed with page structs.
>> +		 */
>> +		pageshift = PAGE_SHIFT;
>> +		for (i = 0; i < entries; ++i) {
>
> ... by making this loop up to `pinned`.
>
> `pinned` is only incremented in the loop that does the GUP, and there's
> a check that pinned == entries after that loop.
>
> So when we get here we know pinned == entries, and if pinned is zero
> it's because we took the (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) case at
> the start of the function to get here.
>
> Or do you think that's too subtle to rely on?

I had 4 choices:

1. for (;i < pinned;)
2. if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)  (dev_hpa is a function parameter)
3. if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
4. if (mem->hpages)

The function is already ugly.  3) seemed as the most obvious way of
telling what is going on here: "we have just initialized @mem and it is
not for a device memory, lets finish the initialization".

I could rearrange the code even more but since there is no NVLink3
coming ever, I'd avoid changing it more than necessary.

Thanks,

>
> cheers
>
>> +			struct page *page = mem->hpages[i];
>> +
>> +			if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
>> +				pageshift = page_shift(compound_head(page));
>> +			mem->pageshift = min(mem->pageshift, pageshift);
>> +			/*
>> +			 * We don't need struct page reference any more, switch
>> +			 * to physical address.
>> +			 */
>> +			mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>> +		}
>> +	}
>> +
>>  	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
>>
>>  	mutex_unlock(&mem_list_mutex);
>> --
>> 2.17.1

-- 
Alexey
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Fri, Dec 20, 2019 at 04:32:13PM -0800, Dan Williams wrote:
> > > There's already a limit, it's just a much larger one. :) What does "no
> > > limit" really mean, numerically, to you in this case?
> >
> > I guess I mean 'hidden limit' - hitting the limit and failing would
> > be manageable.
> >
> > I think 7 is probably too low though, but we are not using 1GB huge
> > pages, only 2M..
>
> What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?

I don't think the failing testing is doing that.  It is also less likely
that 1GB regions will need multi-mapping, IMHO.

Jason
[RFC PATCH 8/8] powerpc/irq: drop softirq stack
There are two IRQ stacks: softirq_ctx and hardirq_ctx do_softirq_own_stack() switches stack to softirq_ctx do_IRQ() switches stack to hardirq_ctx However, when soft and hard IRQs are nested, only one of the two stacks is used: - When on softirq stack, do_IRQ() doesn't switch to hardirq stack. - irq_exit() runs softirqs on hardirq stack. There is no added value in having two IRQ stacks as only one is used when hard and soft irqs are nested. Remove softirq_ctx and use hardirq_ctx for both hard and soft IRQs. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/irq.h | 1 - arch/powerpc/kernel/irq.c | 8 +++- arch/powerpc/kernel/process.c | 4 arch/powerpc/kernel/setup_32.c | 4 +--- arch/powerpc/kernel/setup_64.c | 4 +--- 5 files changed, 5 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h index e4a92f0b4ad4..7cb2c76aa3ed 100644 --- a/arch/powerpc/include/asm/irq.h +++ b/arch/powerpc/include/asm/irq.h @@ -54,7 +54,6 @@ extern void *mcheckirq_ctx[NR_CPUS]; * Per-cpu stacks for handling hard and soft interrupts. */ extern void *hardirq_ctx[NR_CPUS]; -extern void *softirq_ctx[NR_CPUS]; #ifdef CONFIG_PPC64 void call_do_softirq(void *sp); diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index a1122ef4a16c..3af0d1897354 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -680,15 +680,14 @@ void __do_irq(struct pt_regs *regs) void do_IRQ(struct pt_regs *regs) { - void *cursp, *irqsp, *sirqsp; + void *cursp, *irqsp; /* Switch to the irq stack to handle this */ cursp = (void *)(stack_pointer() & ~(THREAD_SIZE - 1)); irqsp = hardirq_ctx[raw_smp_processor_id()]; - sirqsp = softirq_ctx[raw_smp_processor_id()]; /* Already there ? 
Otherwise switch stack and call */ - if (unlikely(cursp == irqsp || cursp == sirqsp)) + if (unlikely(cursp == irqsp)) __do_irq(regs); else call_do_irq(regs, irqsp); @@ -706,12 +705,11 @@ void*dbgirq_ctx[NR_CPUS] __read_mostly; void *mcheckirq_ctx[NR_CPUS] __read_mostly; #endif -void *softirq_ctx[NR_CPUS] __read_mostly; void *hardirq_ctx[NR_CPUS] __read_mostly; void do_softirq_own_stack(void) { - call_do_softirq(softirq_ctx[smp_processor_id()]); + call_do_softirq(hardirq_ctx[smp_processor_id()]); } irq_hw_number_t virq_to_hw(unsigned int virq) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 49d0ebf28ab9..be3e64cf28b4 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1963,10 +1963,6 @@ static inline int valid_irq_stack(unsigned long sp, struct task_struct *p, if (sp >= stack_page && sp <= stack_page + THREAD_SIZE - nbytes) return 1; - stack_page = (unsigned long)softirq_ctx[cpu]; - if (sp >= stack_page && sp <= stack_page + THREAD_SIZE - nbytes) - return 1; - return 0; } diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c index dcffe927f5b9..8752aae06177 100644 --- a/arch/powerpc/kernel/setup_32.c +++ b/arch/powerpc/kernel/setup_32.c @@ -155,10 +155,8 @@ void __init irqstack_early_init(void) /* interrupt stacks must be in lowmem, we get that for free on ppc32 * as the memblock is limited to lowmem by default */ - for_each_possible_cpu(i) { - softirq_ctx[i] = alloc_stack(); + for_each_possible_cpu(i) hardirq_ctx[i] = alloc_stack(); - } } #if defined(CONFIG_BOOKE) || defined(CONFIG_40x) diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index 6104917a282d..96ee7627eda6 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -652,10 +652,8 @@ void __init irqstack_early_init(void) * cannot afford to take SLB misses on them. They are not * accessed in realmode. 
*/ - for_each_possible_cpu(i) { - softirq_ctx[i] = alloc_stack(limit, i); + for_each_possible_cpu(i) hardirq_ctx[i] = alloc_stack(limit, i); - } } #ifdef CONFIG_PPC_BOOK3E -- 2.13.3
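The "already there?" test in do_IRQ() above works because thread and IRQ stacks are THREAD_SIZE-aligned, so masking the low bits of any stack pointer yields the base of the stack it lives on. A minimal userspace sketch of that arithmetic (the THREAD_SIZE value and function names are illustrative, not the kernel's):

```c
#include <stdint.h>

#define THREAD_SIZE 0x4000UL /* stand-in for the kernel's per-config value */

/* Mirror "cursp = sp & ~(THREAD_SIZE - 1)" from do_IRQ(): masking the
 * low bits of a stack pointer yields the base of its aligned stack. */
static uintptr_t stack_base(uintptr_t sp)
{
	return sp & ~(THREAD_SIZE - 1);
}

/* With softirq_ctx removed, "already on the IRQ stack" is one compare
 * against the aligned hardirq stack base instead of two. */
static int already_on_irq_stack(uintptr_t sp, uintptr_t irq_stack_base)
{
	return stack_base(sp) == irq_stack_base;
}
```

With a 16K THREAD_SIZE, any pointer inside the 0x4000–0x7fff stack masks down to 0x4000, so the comparison is a single AND plus compare.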
[RFC PATCH 7/8] powerpc/32: use IRQ stack immediately on IRQ exception
Exception entries run on the kernel thread stack, then do_IRQ() switches to the IRQ stack. Instead of taking this first step on the thread stack, which increases the risk of stack overflow and costs two stack switches when coming from userspace, set the stack to the IRQ stack immediately in the EXCEPTION entry. As on ARM64, consider that when the stack pointer is not within the kernel thread stack, it is already on the IRQ stack. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_32.S | 2 +- arch/powerpc/kernel/head_32.h | 32 +--- arch/powerpc/kernel/head_40x.S | 2 +- arch/powerpc/kernel/head_8xx.S | 2 +- 4 files changed, 32 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S index 4a24f8f026c7..0c36fba5b861 100644 --- a/arch/powerpc/kernel/head_32.S +++ b/arch/powerpc/kernel/head_32.S @@ -332,7 +332,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE) EXC_XFER_LITE(0x400, handle_page_fault) /* External interrupt */ - EXCEPTION(0x500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE) + EXCEPTION_IRQ(0x500, HardwareInterrupt, __do_irq, EXC_XFER_LITE) /* Alignment exception */ . = 0x600 diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h index 8abc7783dbe5..f9e77e51723e 100644 --- a/arch/powerpc/kernel/head_32.h +++ b/arch/powerpc/kernel/head_32.h @@ -11,21 +11,41 @@ * task's thread_struct. */ -.macro EXCEPTION_PROLOG +.macro EXCEPTION_PROLOG is_irq=0 mtspr SPRN_SPRG_SCRATCH0,r10 mtspr SPRN_SPRG_SCRATCH1,r11 mfcr r10 - EXCEPTION_PROLOG_1 + EXCEPTION_PROLOG_1 is_irq=\is_irq EXCEPTION_PROLOG_2 .endm -.macro EXCEPTION_PROLOG_1 +.macro EXCEPTION_PROLOG_1 is_irq=0 mfspr r11,SPRN_SRR1 /* check whether user or kernel */ andi.
r11,r11,MSR_PR + .if \is_irq + bne 2f + mfspr r11, SPRN_SPRG_THREAD + lwz r11, TASK_STACK - THREAD(r11) + xor r11, r11, r1 + cmplwi cr7, r11, THREAD_SIZE - 1 + tophys(r11, r1) /* use tophys(r1) if not thread stack */ + bgt cr7, 1f +2: +#ifdef CONFIG_SMP + mfspr r11, SPRN_SPRG_THREAD + lwz r11, TASK_CPU - THREAD(r11) + slwir11, r11, 3 + addis r11, r11, (hardirq_ctx - PAGE_OFFSET)@ha +#else + lis r11, (hardirq_ctx - PAGE_OFFSET)@ha +#endif + lwz r11, (hardirq_ctx - PAGE_OFFSET)@l(r11) + .else tophys(r11,r1) /* use tophys(r1) if kernel */ beq 1f mfspr r11,SPRN_SPRG_THREAD lwz r11,TASK_STACK-THREAD(r11) + .endif addir11,r11,THREAD_SIZE tophys(r11,r11) 1: subir11,r11,INT_FRAME_SIZE /* alloc exc. frame */ @@ -171,6 +191,12 @@ addir3,r1,STACK_FRAME_OVERHEAD; \ xfer(n, hdlr) +#define EXCEPTION_IRQ(n, label, hdlr, xfer)\ + START_EXCEPTION(n, label) \ + EXCEPTION_PROLOG is_irq=1; \ + addir3,r1,STACK_FRAME_OVERHEAD; \ + xfer(n, hdlr) + #define EXC_XFER_TEMPLATE(hdlr, trap, msr, tfer, ret) \ li r10,trap; \ stw r10,_TRAP(r11); \ diff --git a/arch/powerpc/kernel/head_40x.S b/arch/powerpc/kernel/head_40x.S index 4511fc1549f7..dd236f596c0b 100644 --- a/arch/powerpc/kernel/head_40x.S +++ b/arch/powerpc/kernel/head_40x.S @@ -315,7 +315,7 @@ _ENTRY(crit_srr1) EXC_XFER_LITE(0x400, handle_page_fault) /* 0x0500 - External Interrupt Exception */ - EXCEPTION(0x0500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE) + EXCEPTION_IRQ(0x0500, HardwareInterrupt, __do_irq, EXC_XFER_LITE) /* 0x0600 - Alignment Exception */ START_EXCEPTION(0x0600, Alignment) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 19f583e18402..5a6cdbc89e26 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -150,7 +150,7 @@ DataAccess: InstructionAccess: /* External interrupt */ - EXCEPTION(0x500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE) + EXCEPTION_IRQ(0x500, HardwareInterrupt, __do_irq, EXC_XFER_LITE) /* Alignment exception */ . = 0x600 -- 2.13.3
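The prolog's `xor r11, r11, r1` / `cmplwi cr7, r11, THREAD_SIZE - 1` sequence decides whether r1 currently points into the task's thread stack. A hedged C sketch of the same logic (constants and names are illustrative; the real check runs in assembly with a THREAD_SIZE-aligned stack):

```c
#include <stdint.h>

#define THREAD_SIZE 0x4000UL /* illustrative; PPC32 uses a per-config value */

/* C equivalent of "xor r11,r11,r1; cmplwi cr7,r11,THREAD_SIZE-1": sp lies
 * within the THREAD_SIZE-aligned thread stack iff it differs from the
 * stack base only in the low log2(THREAD_SIZE) bits. */
static int on_thread_stack(uintptr_t sp, uintptr_t task_stack)
{
	return (sp ^ task_stack) <= THREAD_SIZE - 1;
}

/* Per the commit message: switch to the hardirq stack when coming from
 * user mode or from the thread stack; any other kernel sp is assumed to
 * already be on the IRQ stack. */
static int must_switch_to_irq_stack(uintptr_t sp, uintptr_t task_stack,
				    int user_mode)
{
	return user_mode || on_thread_stack(sp, task_stack);
}
```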
[RFC PATCH 5/8] powerpc/irq: move stack overflow verification
As we are going to switch to the IRQ stack immediately in the exception handler, it will no longer be possible to check for stack overflow by reading the stack pointer. Do the verification on regs->gpr[1], which contains the stack pointer at the time the IRQ happened, and move it to __do_irq() so that the verification is also done when calling __do_irq() directly once the exception entry does the stack switch. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/irq.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 28414c6665cc..4df49f6e9987 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -596,15 +596,16 @@ u64 arch_irq_stat_cpu(unsigned int cpu) return sum; } -static inline void check_stack_overflow(void) +static inline void check_stack_overflow(struct pt_regs *regs) { #ifdef CONFIG_DEBUG_STACKOVERFLOW + bool is_user = user_mode(regs); long sp; - sp = current_stack_pointer() & (THREAD_SIZE-1); + sp = regs->gpr[1] & (THREAD_SIZE - 1); /* check for stack overflow: is there less than 2KB free? */ - if (unlikely(sp < 2048)) { + if (unlikely(!is_user && sp < 2048)) { pr_err("do_IRQ: stack overflow: %ld\n", sp); dump_stack(); } @@ -654,6 +655,8 @@ void __do_irq(struct pt_regs *regs) trace_irq_entry(regs); + check_stack_overflow(regs); + /* * Query the platform PIC for the interrupt & ack it. * @@ -685,8 +688,6 @@ void do_IRQ(struct pt_regs *regs) irqsp = hardirq_ctx[raw_smp_processor_id()]; sirqsp = softirq_ctx[raw_smp_processor_id()]; - check_stack_overflow(); - /* Already there ? Otherwise switch stack and call */ if (unlikely(cursp == irqsp || cursp == sirqsp)) __do_irq(regs); -- 2.13.3
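The check itself is just modular arithmetic: since kernel stacks grow down inside a THREAD_SIZE-aligned region, the saved stack pointer's offset within that region is exactly the free space left. A userspace sketch of the patched check (THREAD_SIZE value is illustrative):

```c
#define THREAD_SIZE 0x4000UL /* illustrative */

/* Stacks grow down from the top of a THREAD_SIZE-aligned region, so
 * regs->gpr[1] & (THREAD_SIZE - 1) is the number of free bytes left
 * below the interrupted stack pointer. */
static int stack_overflowing(unsigned long gpr1, int user_mode)
{
	long free = gpr1 & (THREAD_SIZE - 1);

	/* a user-mode sp points into user memory; the check is meaningless */
	return !user_mode && free < 2048;
}
```

This is why the patch adds the `!is_user` guard: once the check runs on regs->gpr[1] rather than the current (kernel) stack pointer, it may see a user-space value.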
[RFC PATCH 6/8] powerpc/irq: cleanup check_stack_overflow() a bit
Instead of #ifdef, use IS_ENABLED(CONFIG_DEBUG_STACKOVERFLOW). This enables GCC to check the code for validity even when the option is not selected. The function no longer uses current_stack_pointer(), so there is no need to declare it inline; let GCC decide. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/irq.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 4df49f6e9987..a1122ef4a16c 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -596,20 +596,19 @@ u64 arch_irq_stat_cpu(unsigned int cpu) return sum; } -static inline void check_stack_overflow(struct pt_regs *regs) +static void check_stack_overflow(struct pt_regs *regs) { -#ifdef CONFIG_DEBUG_STACKOVERFLOW bool is_user = user_mode(regs); - long sp; + long sp = regs->gpr[1] & (THREAD_SIZE - 1); - sp = regs->gpr[1] & (THREAD_SIZE - 1); + if (!IS_ENABLED(CONFIG_DEBUG_STACKOVERFLOW)) + return; /* check for stack overflow: is there less than 2KB free? */ if (unlikely(!is_user && sp < 2048)) { pr_err("do_IRQ: stack overflow: %ld\n", sp); dump_stack(); } -} #ifdef CONFIG_PPC32 -- 2.13.3
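The IS_ENABLED() pattern trades preprocessor exclusion for a compile-time-constant early return: the guarded code is always parsed and type-checked, then discarded by the optimizer when the option is off. A userspace sketch with a plain 0/1 macro standing in for the kernel's IS_ENABLED():

```c
/* Stand-in for IS_ENABLED(CONFIG_DEBUG_STACKOVERFLOW): a compile-time 0/1.
 * Unlike code hidden behind #ifdef, everything below the early return is
 * still compiled and type-checked, then eliminated as dead code. */
#define DEBUG_STACKOVERFLOW_ENABLED 0

static int check_stack_overflow(long sp_offset)
{
	if (!DEBUG_STACKOVERFLOW_ENABLED)
		return 0;

	/* still type-checked even though unreachable in this config */
	return sp_offset < 2048;
}
```

Flipping the macro to 1 activates the check with no other source change, which is exactly the maintainability win the changelog describes.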
[RFC PATCH 4/8] powerpc/irq: move set_irq_regs() closer to irq_enter/exit()
set_irq_regs() is called by do_IRQ() while irq_enter() and irq_exit() are called by __do_irq(). Move set_irq_regs() in __do_irq() Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/irq.c | 16 ++-- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 410accba865d..28414c6665cc 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -647,6 +647,7 @@ static inline void call_do_irq(struct pt_regs *regs, void *sp) void __do_irq(struct pt_regs *regs) { + struct pt_regs *old_regs = set_irq_regs(regs); unsigned int irq; irq_enter(); @@ -672,11 +673,11 @@ void __do_irq(struct pt_regs *regs) trace_irq_exit(regs); irq_exit(); + set_irq_regs(old_regs); } void do_IRQ(struct pt_regs *regs) { - struct pt_regs *old_regs = set_irq_regs(regs); void *cursp, *irqsp, *sirqsp; /* Switch to the irq stack to handle this */ @@ -686,16 +687,11 @@ void do_IRQ(struct pt_regs *regs) check_stack_overflow(); - /* Already there ? */ - if (unlikely(cursp == irqsp || cursp == sirqsp)) { + /* Already there ? Otherwise switch stack and call */ + if (unlikely(cursp == irqsp || cursp == sirqsp)) __do_irq(regs); - set_irq_regs(old_regs); - return; - } - /* Switch stack and call */ - call_do_irq(regs, irqsp); - - set_irq_regs(old_regs); + else + call_do_irq(regs, irqsp); } void __init init_IRQ(void) -- 2.13.3
[RFC PATCH 3/8] powerpc/irq: don't use current_stack_pointer() in do_IRQ()
Before commit 7306e83ccf5c ("powerpc: Don't use CURRENT_THREAD_INFO to find the stack"), the current stack base address was obtained by calling current_thread_info(). That inline function simply masked out the value of r1. In that commit, it was changed to use current_stack_pointer(), which is a heavier function: it is an out-of-line assembly function which cannot be inlined and which reads the content of the stack at 0(r1). Create a stack_pointer() function which returns the value of r1 and use it instead. Signed-off-by: Christophe Leroy Fixes: 7306e83ccf5c ("powerpc: Don't use CURRENT_THREAD_INFO to find the stack") --- arch/powerpc/include/asm/reg.h | 8 arch/powerpc/kernel/irq.c | 2 +- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 1aa46dff0957..bc14fca9b13b 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -1466,6 +1466,14 @@ static inline void update_power8_hid0(unsigned long hid0) */ asm volatile("sync; mtspr %0,%1; isync":: "i"(SPRN_HID0), "r"(hid0)); } + +static __always_inline unsigned long stack_pointer(void) +{ + register unsigned long r1 asm("r1"); + + return r1; +} + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_REG_H */ diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 4690e5270806..410accba865d 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -680,7 +680,7 @@ void do_IRQ(struct pt_regs *regs) void *cursp, *irqsp, *sirqsp; /* Switch to the irq stack to handle this */ - cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1)); + cursp = (void *)(stack_pointer() & ~(THREAD_SIZE - 1)); irqsp = hardirq_ctx[raw_smp_processor_id()]; sirqsp = softirq_ctx[raw_smp_processor_id()]; -- 2.13.3
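The patch's stack_pointer() uses GCC's named register variable syntax (`register unsigned long r1 asm("r1")`), which reads the register with zero memory traffic but is powerpc-specific. A portable userspace approximation of the same idea uses GCC/Clang's `__builtin_frame_address(0)`; this sketch only illustrates the "mask the stack address" use, not the exact register read:

```c
#include <stdint.h>

/* Portable analogue of the patch's stack_pointer(): fetch an address on
 * the current stack without a function call or memory load. The removed
 * current_stack_pointer() was an out-of-line call that also loaded 0(r1). */
static inline uintptr_t approx_stack_pointer(void)
{
	return (uintptr_t)__builtin_frame_address(0);
}

/* Same masking do_IRQ() performs to find the enclosing stack's base. */
static uintptr_t enclosing_stack_base(uintptr_t thread_size)
{
	return approx_stack_pointer() & ~(thread_size - 1);
}
```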
[RFC PATCH 2/8] powerpc/irq: inline call_do_irq() and call_do_softirq() on PPC32
call_do_irq() and call_do_softirq() are simple enough to be worth inlining. Inlining them avoids an mflr/mtlr pair plus a save/reload on stack. It also allows GCC to keep the saved ksp_limit in an nonvolatile reg. This is inspired from S390 arch. Several other arches do more or less the same. The way sparc arch does seems odd thought. Signed-off-by: Christophe Leroy Reviewed-by: Segher Boessenkool --- v2: no change. v3: no change. v4: - comment reminding the purpose of the inline asm block. - added r2 as clobbered reg v5: - Limiting the change to PPC32 for now. - removed r2 from the clobbered regs list (on PPC32 r2 points to current all the time) - Removed patch 1 and merged ksp_limit handling in here. v6: - rebased after removal of ksp_limit --- arch/powerpc/include/asm/irq.h | 2 ++ arch/powerpc/kernel/irq.c | 34 ++ arch/powerpc/kernel/misc_32.S | 25 - 3 files changed, 36 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h index 814dfab7e392..e4a92f0b4ad4 100644 --- a/arch/powerpc/include/asm/irq.h +++ b/arch/powerpc/include/asm/irq.h @@ -56,8 +56,10 @@ extern void *mcheckirq_ctx[NR_CPUS]; extern void *hardirq_ctx[NR_CPUS]; extern void *softirq_ctx[NR_CPUS]; +#ifdef CONFIG_PPC64 void call_do_softirq(void *sp); void call_do_irq(struct pt_regs *regs, void *sp); +#endif extern void do_IRQ(struct pt_regs *regs); extern void __init init_IRQ(void); extern void __do_irq(struct pt_regs *regs); diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index add67498c126..4690e5270806 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -611,6 +611,40 @@ static inline void check_stack_overflow(void) #endif } +#ifdef CONFIG_PPC32 +static inline void call_do_softirq(const void *sp) +{ + register unsigned long ret asm("r3"); + + /* Temporarily switch r1 to sp, call __do_softirq() then restore r1. 
*/ + asm volatile( + " "PPC_STLU" 1, %2(%1);\n" + " mr 1, %1;\n" + " bl %3;\n" + " "PPC_LL"1, 0(1);\n" : + "=r"(ret) : + "b"(sp), "i"(THREAD_SIZE - STACK_FRAME_OVERHEAD), "i"(__do_softirq) : + "lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6", "cr7", + "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12"); +} + +static inline void call_do_irq(struct pt_regs *regs, void *sp) +{ + register unsigned long r3 asm("r3") = (unsigned long)regs; + + /* Temporarily switch r1 to sp, call __do_irq() then restore r1 */ + asm volatile( + " "PPC_STLU" 1, %2(%1);\n" + " mr 1, %1;\n" + " bl %3;\n" + " "PPC_LL"1, 0(1);\n" : + "+r"(r3) : + "b"(sp), "i"(THREAD_SIZE - STACK_FRAME_OVERHEAD), "i"(__do_irq) : + "lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6", "cr7", + "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12"); +} +#endif + void __do_irq(struct pt_regs *regs) { unsigned int irq; diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S index bb5995fa6884..341a3cd199cb 100644 --- a/arch/powerpc/kernel/misc_32.S +++ b/arch/powerpc/kernel/misc_32.S @@ -27,31 +27,6 @@ .text -_GLOBAL(call_do_softirq) - mflrr0 - stw r0,4(r1) - stwur1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r3) - mr r1,r3 - bl __do_softirq - lwz r1,0(r1) - lwz r0,4(r1) - mtlrr0 - blr - -/* - * void call_do_irq(struct pt_regs *regs, void *sp); - */ -_GLOBAL(call_do_irq) - mflrr0 - stw r0,4(r1) - stwur1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4) - mr r1,r4 - bl __do_irq - lwz r1,0(r1) - lwz r0,4(r1) - mtlrr0 - blr - /* * This returns the high 64 bits of the product of two 64-bit numbers. */ -- 2.13.3
[RFC PATCH 1/8] powerpc/32: drop ksp_limit based stack overflow detection
PPC32 implements a specific early stack overflow detection. This detection is inherited from the ppc arch (before the merge of ppc and ppc64 into powerpc). At that time, there were no irqstacks and the verification simply checked that the stack pointer was still above the stack base. But when irqstacks were implemented, a simple check was no longer possible, so a thread-specific value called ksp_limit was introduced in the task_struct and is updated at every stack switch in order to keep track of the limit and perform the verification. ppc64 didn't have this but had a verification during IRQs. This verification was then extended to PPC32 and can be selected through CONFIG_DEBUG_STACKOVERFLOW. In the meantime, thread_info has moved away from the stack, reducing the impact of a stack overflow. In addition, there is CONFIG_SCHED_STACK_END_CHECK which can be used to check that the magic stored at the stack base has not been overwritten. Remove this PPC32-specific stack overflow mechanism in order to simplify ongoing work which also aims at further reducing the risk of stack overflow: - Switch to irqstack in IRQ exception entry in ASM - VMAP stack Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/asm-prototypes.h | 1 - arch/powerpc/include/asm/processor.h | 3 -- arch/powerpc/kernel/asm-offsets.c | 2 -- arch/powerpc/kernel/entry_32.S| 57 --- arch/powerpc/kernel/head_40x.S| 2 -- arch/powerpc/kernel/head_booke.h | 1 - arch/powerpc/kernel/misc_32.S | 14 arch/powerpc/kernel/process.c | 3 -- arch/powerpc/kernel/traps.c | 9 - arch/powerpc/lib/sstep.c | 9 10 files changed, 101 deletions(-) diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h index 983c0084fb3f..90e9c6e415af 100644 --- a/arch/powerpc/include/asm/asm-prototypes.h +++ b/arch/powerpc/include/asm/asm-prototypes.h @@ -66,7 +66,6 @@ void RunModeException(struct pt_regs *regs); void single_step_exception(struct pt_regs *regs); void 
program_check_exception(struct pt_regs *regs); void alignment_exception(struct pt_regs *regs); -void StackOverflow(struct pt_regs *regs); void kernel_fp_unavailable_exception(struct pt_regs *regs); void altivec_unavailable_exception(struct pt_regs *regs); void vsx_unavailable_exception(struct pt_regs *regs); diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index a9993e7a443b..a9552048c20b 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -155,7 +155,6 @@ struct thread_struct { #endif #ifdef CONFIG_PPC32 void*pgdir; /* root of page-table tree */ - unsigned long ksp_limit; /* if ksp <= ksp_limit stack overflow */ #ifdef CONFIG_PPC_RTAS unsigned long rtas_sp;/* stack pointer for when in RTAS */ #endif @@ -269,7 +268,6 @@ struct thread_struct { #define ARCH_MIN_TASKALIGN 16 #define INIT_SP(sizeof(init_stack) + (unsigned long) _stack) -#define INIT_SP_LIMIT ((unsigned long)_stack) #ifdef CONFIG_SPE #define SPEFSCR_INIT \ @@ -282,7 +280,6 @@ struct thread_struct { #ifdef CONFIG_PPC32 #define INIT_THREAD { \ .ksp = INIT_SP, \ - .ksp_limit = INIT_SP_LIMIT, \ .addr_limit = KERNEL_DS, \ .pgdir = swapper_pg_dir, \ .fpexc_mode = MSR_FE0 | MSR_FE1, \ diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 3d47aec7becf..d936db6b702f 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -88,7 +88,6 @@ int main(void) DEFINE(SIGSEGV, SIGSEGV); DEFINE(NMI_MASK, NMI_MASK); #else - OFFSET(KSP_LIMIT, thread_struct, ksp_limit); #ifdef CONFIG_PPC_RTAS OFFSET(RTAS_SP, thread_struct, rtas_sp); #endif @@ -353,7 +352,6 @@ int main(void) DEFINE(_CSRR1, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, csrr1)); DEFINE(_DSRR0, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, dsrr0)); DEFINE(_DSRR1, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, dsrr1)); - DEFINE(SAVED_KSP_LIMIT, STACK_INT_FRAME_SIZE+offsetof(struct 
exception_regs, saved_ksp_limit)); #endif #endif diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index d60908ea37fb..bf11b464a17b 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -86,13 +86,6 @@ crit_transfer_to_handler: stw r0,_SRR0(r11) mfspr r0,SPRN_SRR1 stw r0,_SRR1(r11) - - /* set the stack limit to the current stack */ - mfspr r8,SPRN_SPRG_THREAD - lwz r0,KSP_LIMIT(r8) - stw r0,SAVED_KSP_LIMIT(r11) - rlwinm r0,r1,0,0,(31 - THREAD_SHIFT) - stw r0,KSP_LIMIT(r8) /* fall through */ #endif
[RFC PATCH 0/8] Accelerate IRQ entry
The purpose of this series is to accelerate IRQ entry by avoiding unnecessary trampoline functions like call_do_irq() and call_do_softirq() and by switching to the IRQ stack immediately in the exception handler. For now, it is an RFC as it is still a bit messy. Please provide feedback and I'll improve it next year. Christophe Leroy (8): powerpc/32: drop ksp_limit based stack overflow detection powerpc/irq: inline call_do_irq() and call_do_softirq() on PPC32 powerpc/irq: don't use current_stack_pointer() in do_IRQ() powerpc/irq: move set_irq_regs() closer to irq_enter/exit() powerpc/irq: move stack overflow verification powerpc/irq: cleanup check_stack_overflow() a bit powerpc/32: use IRQ stack immediately on IRQ exception powerpc/irq: drop softirq stack arch/powerpc/include/asm/asm-prototypes.h | 1 - arch/powerpc/include/asm/irq.h| 3 +- arch/powerpc/include/asm/processor.h | 3 -- arch/powerpc/include/asm/reg.h| 8 arch/powerpc/kernel/asm-offsets.c | 2 - arch/powerpc/kernel/entry_32.S| 57 arch/powerpc/kernel/head_32.S | 2 +- arch/powerpc/kernel/head_32.h | 32 +++-- arch/powerpc/kernel/head_40x.S| 4 +- arch/powerpc/kernel/head_8xx.S| 2 +- arch/powerpc/kernel/head_booke.h | 1 - arch/powerpc/kernel/irq.c | 74 +-- arch/powerpc/kernel/misc_32.S | 39 arch/powerpc/kernel/process.c | 7 --- arch/powerpc/kernel/setup_32.c| 4 +- arch/powerpc/kernel/setup_64.c| 4 +- arch/powerpc/kernel/traps.c | 9 arch/powerpc/lib/sstep.c | 9 18 files changed, 95 insertions(+), 166 deletions(-) -- 2.13.3
[RFC PATCH v2 10/10] powerpc/32: Switch VDSO to C implementation.
This is a tentative to switch powerpc/32 vdso to generic C implementation. It will likely not work on 64 bits or even build properly at the moment, hence the RFC status. powerpc is a bit special for VDSO as well as system calls in the way that it requires setting CR SO bit which cannot be done in C. Therefore, entry/exit and fallback needs to be performed in ASM. On powerpc 8xx, performance is degraded by 30-40% for gettime and by 15-20% for getres On a powerpc885 at 132MHz: With current powerpc/32 ASM VDSO: gettimeofday:vdso: 737 nsec/call clock-getres-realtime-coarse:vdso: 3081 nsec/call clock-gettime-realtime-coarse:vdso: 2861 nsec/call clock-getres-realtime:vdso: 475 nsec/call clock-gettime-realtime:vdso: 892 nsec/call clock-getres-boottime:vdso: 2621 nsec/call clock-gettime-boottime:vdso: 3857 nsec/call clock-getres-tai:vdso: 2620 nsec/call clock-gettime-tai:vdso: 3854 nsec/call clock-getres-monotonic-raw:vdso: 2621 nsec/call clock-gettime-monotonic-raw:vdso: 3499 nsec/call clock-getres-monotonic-coarse:vdso: 3083 nsec/call clock-gettime-monotonic-coarse:vdso: 3082 nsec/call clock-getres-monotonic:vdso: 475 nsec/call clock-gettime-monotonic:vdso: 1014 nsec/call Once switched to C implementation: gettimeofday:vdso: 1016 nsec/call clock-getres-realtime-coarse:vdso: 614 nsec/call clock-gettime-realtime-coarse:vdso: 760 nsec/call clock-getres-realtime:vdso: 560 nsec/call clock-gettime-realtime:vdso: 1192 nsec/call clock-getres-boottime:vdso: 560 nsec/call clock-gettime-boottime:vdso: 1194 nsec/call clock-getres-tai:vdso: 560 nsec/call clock-gettime-tai:vdso: 1192 nsec/call clock-getres-monotonic-raw:vdso: 560 nsec/call clock-gettime-monotonic-raw:vdso: 1248 nsec/call clock-getres-monotonic-coarse:vdso: 614 nsec/call clock-gettime-monotonic-coarse:vdso: 760 nsec/call clock-getres-monotonic:vdso: 560 nsec/call clock-gettime-monotonic:vdso: 1192 nsec/call On a powerpc 8321 running at 333MHz With current powerpc/32 ASM VDSO: gettimeofday:vdso: 190 nsec/call 
clock-getres-realtime-coarse:vdso: 1449 nsec/call clock-gettime-realtime-coarse:vdso: 1352 nsec/call clock-getres-realtime:vdso: 135 nsec/call clock-gettime-realtime:vdso: 244 nsec/call clock-getres-boottime:vdso: 1313 nsec/call clock-gettime-boottime:vdso: 1701 nsec/call clock-getres-tai:vdso: 1268 nsec/call clock-gettime-tai:vdso: 1742 nsec/call clock-getres-monotonic-raw:vdso: 1310 nsec/call clock-gettime-monotonic-raw:vdso: 1584 nsec/call clock-getres-monotonic-coarse:vdso: 1488 nsec/call clock-gettime-monotonic-coarse:vdso: 1503 nsec/call clock-getres-monotonic:vdso: 135 nsec/call clock-gettime-monotonic:vdso: 283 nsec/call Once switched to C implementation: gettimeofday:vdso: 347 nsec/call clock-getres-realtime-coarse:vdso: 169 nsec/call clock-gettime-realtime-coarse:vdso: 271 nsec/call clock-getres-realtime:vdso: 150 nsec/call clock-gettime-realtime:vdso: 383 nsec/call clock-getres-boottime:vdso: 157 nsec/call clock-gettime-boottime:vdso: 377 nsec/call clock-getres-tai:vdso: 150 nsec/call clock-gettime-tai:vdso: 380 nsec/call clock-getres-monotonic-raw:vdso: 153 nsec/call clock-gettime-monotonic-raw:vdso: 407 nsec/call clock-getres-monotonic-coarse:vdso: 169 nsec/call clock-gettime-monotonic-coarse:vdso: 271 nsec/call clock-getres-monotonic:vdso: 153 nsec/call clock-gettime-monotonic:vdso: 377 nsec/call Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/vdso/gettimeofday.h | 45 + arch/powerpc/include/asm/vdso/vsyscall.h | 27 +++ arch/powerpc/include/asm/vdso_datapage.h | 18 +- arch/powerpc/kernel/asm-offsets.c| 23 +-- arch/powerpc/kernel/time.c | 92 +- arch/powerpc/kernel/vdso.c | 19 +- arch/powerpc/kernel/vdso32/Makefile | 19 +- arch/powerpc/kernel/vdso32/gettimeofday.S| 261 --- arch/powerpc/kernel/vdso32/vgettimeofday.c | 32 10 files changed, 178 insertions(+), 360 deletions(-) create mode 100644 arch/powerpc/include/asm/vdso/gettimeofday.h create mode 100644 arch/powerpc/include/asm/vdso/vsyscall.h create mode 100644 
arch/powerpc/kernel/vdso32/vgettimeofday.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 1ec34e16ed65..bd04c68baf91 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -169,6 +169,7 @@ config PPC select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select GENERIC_TIME_VSYSCALL + select GENERIC_GETTIMEOFDAY select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU select HAVE_ARCH_JUMP_LABEL @@ -198,6 +199,7 @@ config PPC select
[RFC PATCH v2 09/10] powerpc/vdso32: inline __get_datapage()
__get_datapage() is only a few instructions to retrieve the address of the page where the kernel stores data to the VDSO. By inlining this function into its users, a bl/blr pair and a mflr/mtlr pair is avoided, plus a few reg moves. The improvement is noticeable (about 55 nsec/call on an 8xx) vdsotest before the patch: gettimeofday:vdso: 731 nsec/call clock-gettime-realtime-coarse:vdso: 668 nsec/call clock-gettime-monotonic-coarse:vdso: 745 nsec/call vdsotest after the patch: gettimeofday:vdso: 677 nsec/call clock-gettime-realtime-coarse:vdso: 613 nsec/call clock-gettime-monotonic-coarse:vdso: 690 nsec/call Signed-off-by: Christophe Leroy --- v3: define get_datapage macro in asm/vdso_datapage.h v4: fixed build failure with old binutils --- arch/powerpc/include/asm/vdso_datapage.h | 10 ++ arch/powerpc/kernel/vdso32/cacheflush.S | 9 - arch/powerpc/kernel/vdso32/datapage.S | 28 +++- arch/powerpc/kernel/vdso32/gettimeofday.S | 12 +--- 4 files changed, 22 insertions(+), 37 deletions(-) diff --git a/arch/powerpc/include/asm/vdso_datapage.h b/arch/powerpc/include/asm/vdso_datapage.h index 40f13f3626d3..ee5319a6f4e3 100644 --- a/arch/powerpc/include/asm/vdso_datapage.h +++ b/arch/powerpc/include/asm/vdso_datapage.h @@ -118,6 +118,16 @@ struct vdso_data { extern struct vdso_data *vdso_data; +#else /* __ASSEMBLY__ */ + +.macro get_datapage ptr, tmp + bcl 20, 31, .+4 + mflr\ptr + addi\ptr, \ptr, (__kernel_datapage_offset - (.-4))@l + lwz \tmp, 0(\ptr) + add \ptr, \tmp, \ptr +.endm + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ diff --git a/arch/powerpc/kernel/vdso32/cacheflush.S b/arch/powerpc/kernel/vdso32/cacheflush.S index 7f882e7b9f43..d178ec8c279d 100644 --- a/arch/powerpc/kernel/vdso32/cacheflush.S +++ b/arch/powerpc/kernel/vdso32/cacheflush.S @@ -8,6 +8,7 @@ #include #include #include +#include #include .text @@ -24,14 +25,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache) .cfi_startproc mflrr12 .cfi_register lr,r12 - mr r11,r3 - bl __get_datapage@local + 
get_datapager10, r0 mtlrr12 - mr r10,r3 lwz r7,CFG_DCACHE_BLOCKSZ(r10) addir5,r7,-1 - andcr6,r11,r5 /* round low to line bdy */ + andcr6,r3,r5/* round low to line bdy */ subfr8,r6,r4/* compute length */ add r8,r8,r5/* ensure we get enough */ lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10) @@ -48,7 +47,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache) lwz r7,CFG_ICACHE_BLOCKSZ(r10) addir5,r7,-1 - andcr6,r11,r5 /* round low to line bdy */ + andcr6,r3,r5/* round low to line bdy */ subfr8,r6,r4/* compute length */ add r8,r8,r5 lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10) diff --git a/arch/powerpc/kernel/vdso32/datapage.S b/arch/powerpc/kernel/vdso32/datapage.S index 6c7401bd284e..1095d818f94a 100644 --- a/arch/powerpc/kernel/vdso32/datapage.S +++ b/arch/powerpc/kernel/vdso32/datapage.S @@ -10,35 +10,13 @@ #include #include #include +#include .text .global __kernel_datapage_offset; __kernel_datapage_offset: .long 0 -V_FUNCTION_BEGIN(__get_datapage) - .cfi_startproc - /* We don't want that exposed or overridable as we want other objects -* to be able to bl directly to here -*/ - .protected __get_datapage - .hidden __get_datapage - - mflrr0 - .cfi_register lr,r0 - - bcl 20,31,data_page_branch -data_page_branch: - mflrr3 - mtlrr0 - addir3, r3, __kernel_datapage_offset-data_page_branch - lwz r0,0(r3) - .cfi_restore lr - add r3,r0,r3 - blr - .cfi_endproc -V_FUNCTION_END(__get_datapage) - /* * void *__kernel_get_syscall_map(unsigned int *syscall_count) ; * @@ -53,7 +31,7 @@ V_FUNCTION_BEGIN(__kernel_get_syscall_map) mflrr12 .cfi_register lr,r12 mr r4,r3 - bl __get_datapage@local + get_datapager3, r0 mtlrr12 addir3,r3,CFG_SYSCALL_MAP32 cmpli cr0,r4,0 @@ -75,7 +53,7 @@ V_FUNCTION_BEGIN(__kernel_get_tbfreq) .cfi_startproc mflrr12 .cfi_register lr,r12 - bl __get_datapage@local + get_datapager3, r0 lwz r4,(CFG_TB_TICKS_PER_SEC + 4)(r3) lwz r3,CFG_TB_TICKS_PER_SEC(r3) mtlrr12 diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S b/arch/powerpc/kernel/vdso32/gettimeofday.S index 3306672f57a9..d6c1d331e8cb 
100644 --- a/arch/powerpc/kernel/vdso32/gettimeofday.S +++ b/arch/powerpc/kernel/vdso32/gettimeofday.S @@ -9,6 +9,7 @@ #include #include #include +#include #include #include
[RFC PATCH v2 07/10] lib: vdso: don't use READ_ONCE() in __c_kernel_time()
READ_ONCE() forces the read of the 64-bit value of vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec although only the lower part is needed. This results in suboptimal code: 0af4 <__c_kernel_time>: af4: 2c 03 00 00 cmpwi r3,0 af8: 81 44 00 20 lwz r10,32(r4) afc: 81 64 00 24 lwz r11,36(r4) b00: 41 82 00 08 beq b08 <__c_kernel_time+0x14> b04: 91 63 00 00 stw r11,0(r3) b08: 7d 63 5b 78 mr r3,r11 b0c: 4e 80 00 20 blr By removing the READ_ONCE(), only the lower part is read from memory, and the code is cleaner: 0af4 <__c_kernel_time>: af4: 7c 69 1b 79 mr. r9,r3 af8: 80 64 00 24 lwz r3,36(r4) afc: 4d 82 00 20 beqlr b00: 90 69 00 00 stw r3,0(r9) b04: 4e 80 00 20 blr Signed-off-by: Christophe Leroy --- lib/vdso/gettimeofday.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index 17b4cff6e5f0..5a17a9d2e6cd 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -144,7 +144,7 @@ __cvdso_gettimeofday(const struct vdso_data *vd, struct __kernel_old_timeval *tv static __maybe_unused __kernel_old_time_t __cvdso_time(const struct vdso_data *vd, __kernel_old_time_t *time) { - __kernel_old_time_t t = READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec); + __kernel_old_time_t t = vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec; if (time) *time = t; -- 2.13.3
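The codegen difference comes from what READ_ONCE() means: its volatile cast forces a single access of the full declared width, so a 64-bit field must be loaded as two 32-bit words on PPC32 even if the caller keeps only the low half. A minimal userspace sketch of the semantics (macro and value are a simplified stand-in for the kernel's READ_ONCE()):

```c
#include <stdint.h>

/* Minimal READ_ONCE(): the volatile cast forces one access of the full
 * declared width, so a 64-bit load cannot be narrowed to its low word. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static uint64_t realtime_sec = 0x100000002ULL; /* illustrative value */

/* Both 32-bit halves are loaded, then the result is truncated. */
static uint32_t time_with_read_once(void)
{
	return (uint32_t)READ_ONCE(realtime_sec);
}

/* Here the compiler may load only the low word: same result, less code. */
static uint32_t time_plain(void)
{
	return (uint32_t)realtime_sec;
}
```

Both functions return the same value; only the generated loads differ, which is exactly what the objdump listings in the changelog show.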
[RFC PATCH v2 06/10] lib: vdso: make do_coarse() return 0
do_coarse() is similar to do_hres() except that it never fails. Change its return type from void to int and have it return 0 unconditionally. This cleans up the code a bit. Signed-off-by: Christophe Leroy --- lib/vdso/gettimeofday.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index 86d5b1c8796b..17b4cff6e5f0 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -64,7 +64,7 @@ static inline int do_hres(const struct vdso_data *vd, clockid_t clk, return 0; } -static void do_coarse(const struct vdso_data *vd, clockid_t clk, +static int do_coarse(const struct vdso_data *vd, clockid_t clk, struct __kernel_timespec *ts) { const struct vdso_timestamp *vdso_ts = &vd->basetime[clk]; @@ -75,6 +75,8 @@ static void do_coarse(const struct vdso_data *vd, clockid_t clk, ts->tv_sec = vdso_ts->sec; ts->tv_nsec = vdso_ts->nsec; } while (unlikely(vdso_read_retry(vd, seq))); + + return 0; } static __maybe_unused int @@ -92,14 +94,13 @@ __cvdso_clock_gettime(const struct vdso_data *vd, clockid_t clock, * clocks are handled in the VDSO directly. */ msk = 1U << clock; - if (likely(msk & VDSO_HRES)) { + if (likely(msk & VDSO_HRES)) return do_hres(&vd[CS_HRES_COARSE], clock, ts); - } else if (msk & VDSO_COARSE) { - do_coarse(&vd[CS_HRES_COARSE], clock, ts); - return 0; - } else if (msk & VDSO_RAW) { + else if (msk & VDSO_COARSE) + return do_coarse(&vd[CS_HRES_COARSE], clock, ts); + else if (msk & VDSO_RAW) + return do_hres(&vd[CS_RAW], clock, ts); - } + return -1; } -- 2.13.3
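do_coarse() "never fails" because its body is a sequence-count retry loop: the reader spins while a writer is mid-update and retries if one happened during the read, but it always terminates with a consistent snapshot. A single-threaded userspace sketch of that shape (names are illustrative; the kernel version relies on memory barriers omitted here):

```c
#include <stdint.h>

struct coarse_ts { uint64_t sec; uint64_t nsec; };

static unsigned int seq;       /* even = stable, odd = update in flight */
static struct coarse_ts coarse;

/* Writer: bracket the update with two increments (barriers omitted in
 * this single-threaded sketch). */
static void update_coarse(uint64_t s, uint64_t ns)
{
	seq++;
	coarse.sec = s;
	coarse.nsec = ns;
	seq++;
}

/* Reader, shaped like do_coarse(): wait out an in-flight write, retry if
 * one raced with the read. It cannot fail, hence the unconditional 0. */
static int do_coarse(struct coarse_ts *ts)
{
	unsigned int start;

	do {
		while ((start = seq) & 1)
			; /* writer active */
		ts->sec = coarse.sec;
		ts->nsec = coarse.nsec;
	} while (seq != start);

	return 0;
}

/* Test helper: 1 when the snapshot matches what was just written. */
static int snapshot_matches(uint64_t s, uint64_t ns)
{
	struct coarse_ts ts;

	update_coarse(s, ns);
	return do_coarse(&ts) == 0 && ts.sec == s && ts.nsec == ns;
}
```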
[RFC PATCH v2 05/10] lib: vdso: inline do_hres()
do_hres() is called from several places, so GCC doesn't inline it at first. do_hres() takes a struct __kernel_timespec * parameter for passing the result. In the 32-bit case, this parameter corresponds to a local variable in the caller. In order to provide a pointer to this structure, the caller has to put it on its stack, and do_hres() has to write the result to the stack. This is suboptimal, especially on RISC processors like powerpc. By making GCC inline the function, the struct __kernel_timespec remains a local variable held in registers, avoiding the need to write to and read from the stack. The improvement is significant on powerpc. Signed-off-by: Christophe Leroy --- lib/vdso/gettimeofday.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index 24e1ba838260..86d5b1c8796b 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -34,8 +34,8 @@ u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult) } #endif -static int do_hres(const struct vdso_data *vd, clockid_t clk, - struct __kernel_timespec *ts) +static inline int do_hres(const struct vdso_data *vd, clockid_t clk, + struct __kernel_timespec *ts) { const struct vdso_timestamp *vdso_ts = &vd->basetime[clk]; u64 cycles, last, sec, ns; -- 2.13.3
[RFC PATCH v2 08/10] lib: vdso: Avoid duplication in __cvdso_clock_getres()
VDSO_HRES and VDSO_RAW clocks are handled the same way. Don't duplicate code.

Signed-off-by: Christophe Leroy
---
 lib/vdso/gettimeofday.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 5a17a9d2e6cd..aa4a167bf1e0 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -172,7 +172,7 @@ int __cvdso_clock_getres(const struct vdso_data *vd, clockid_t clock,
 	 * clocks are handled in the VDSO directly.
 	 */
 	msk = 1U << clock;
-	if (msk & VDSO_HRES) {
+	if (msk & (VDSO_HRES | VDSO_RAW)) {
 		/*
 		 * Preserves the behaviour of posix_get_hrtimer_res().
 		 */
@@ -182,11 +182,6 @@ int __cvdso_clock_getres(const struct vdso_data *vd, clockid_t clock,
 		 * Preserves the behaviour of posix_get_coarse_res().
 		 */
 		ns = LOW_RES_NSEC;
-	} else if (msk & VDSO_RAW) {
-		/*
-		 * Preserves the behaviour of posix_get_hrtimer_res().
-		 */
-		ns = hrtimer_res;
 	} else {
 		return -1;
 	}
-- 
2.13.3
[RFC PATCH v2 04/10] lib: vdso: get pointer to vdso data from the arch
On powerpc, __arch_get_vdso_data() clobbers the link register, requiring the caller to set up a stack frame in order to save it.

As the parent function already has to set up a stack frame and save the link register to call the C vdso function, retrieving the vdso data pointer there is lighter.

Give arches the opportunity to hand the vdso data pointer to C vdso functions.

Signed-off-by: Christophe Leroy
---
 arch/arm/vdso/vgettimeofday.c            | 12
 arch/arm64/kernel/vdso/vgettimeofday.c   |  9 ++---
 arch/arm64/kernel/vdso32/vgettimeofday.c | 12
 arch/mips/vdso/vgettimeofday.c           | 21 ++---
 arch/x86/entry/vdso/vclock_gettime.c     | 22 +++---
 lib/vdso/gettimeofday.c                  | 28 ++--
 6 files changed, 65 insertions(+), 39 deletions(-)

diff --git a/arch/arm/vdso/vgettimeofday.c b/arch/arm/vdso/vgettimeofday.c
index 5451afb715e6..efad7d508d06 100644
--- a/arch/arm/vdso/vgettimeofday.c
+++ b/arch/arm/vdso/vgettimeofday.c
@@ -10,7 +10,8 @@
 int __vdso_clock_gettime(clockid_t clock,
 			 struct old_timespec32 *ts)
 {
-	int ret = __cvdso_clock_gettime32(clock, ts);
+	const struct vdso_data *vd = __arch_get_vdso_data();
+	int ret = __cvdso_clock_gettime32(vd, clock, ts);
 
 	if (likely(!ret))
 		return ret;
@@ -21,7 +22,8 @@ int __vdso_clock_gettime(clockid_t clock,
 int __vdso_clock_gettime64(clockid_t clock,
 			   struct __kernel_timespec *ts)
 {
-	int ret = __cvdso_clock_gettime(clock, ts);
+	const struct vdso_data *vd = __arch_get_vdso_data();
+	int ret = __cvdso_clock_gettime(vd, clock, ts);
 
 	if (likely(!ret))
 		return ret;
@@ -32,7 +34,8 @@ int __vdso_clock_gettime64(clockid_t clock,
 int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
 			struct timezone *tz)
 {
-	int ret = __cvdso_gettimeofday(tv, tz);
+	const struct vdso_data *vd = __arch_get_vdso_data();
+	int ret = __cvdso_gettimeofday(vd, tv, tz);
 
 	if (likely(!ret))
 		return ret;
@@ -43,7 +46,8 @@ int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
 int __vdso_clock_getres(clockid_t clock_id,
 			struct old_timespec32 *res)
 {
-	int ret = __cvdso_clock_getres_time32(clock_id, res);
+	const struct vdso_data *vd = __arch_get_vdso_data();
+	int ret = __cvdso_clock_getres_time32(vd, clock_id, res);
 
 	if (likely(!ret))
 		return ret;
diff --git a/arch/arm64/kernel/vdso/vgettimeofday.c b/arch/arm64/kernel/vdso/vgettimeofday.c
index 62694876b216..9a7122ec6d17 100644
--- a/arch/arm64/kernel/vdso/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso/vgettimeofday.c
@@ -11,7 +11,8 @@
 int __kernel_clock_gettime(clockid_t clock,
 			   struct __kernel_timespec *ts)
 {
-	int ret = __cvdso_clock_gettime(clock, ts);
+	const struct vdso_data *vd = __arch_get_vdso_data();
+	int ret = __cvdso_clock_gettime(vd, clock, ts);
 
 	if (likely(!ret))
 		return ret;
@@ -22,7 +23,8 @@ int __kernel_clock_gettime(clockid_t clock,
 int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
 			  struct timezone *tz)
 {
-	int ret = __cvdso_gettimeofday(tv, tz);
+	const struct vdso_data *vd = __arch_get_vdso_data();
+	int ret = __cvdso_gettimeofday(vd, tv, tz);
 
 	if (likely(!ret))
 		return ret;
@@ -33,7 +35,8 @@ int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
 int __kernel_clock_getres(clockid_t clock_id,
 			  struct __kernel_timespec *res)
 {
-	int ret = __cvdso_clock_getres(clock_id, res);
+	const struct vdso_data *vd = __arch_get_vdso_data();
+	int ret = __cvdso_clock_getres(vd, clock_id, res);
 
 	if (likely(!ret))
 		return ret;
diff --git a/arch/arm64/kernel/vdso32/vgettimeofday.c b/arch/arm64/kernel/vdso32/vgettimeofday.c
index 6770d2bedd1f..3eb6a82c1c25 100644
--- a/arch/arm64/kernel/vdso32/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso32/vgettimeofday.c
@@ -11,13 +11,14 @@
 int __vdso_clock_gettime(clockid_t clock,
 			 struct old_timespec32 *ts)
 {
+	const struct vdso_data *vd = __arch_get_vdso_data();
 	int ret;
 
 	/* The checks below are required for ABI consistency with arm */
 	if ((u32)ts >= TASK_SIZE_32)
 		return -EFAULT;
 
-	ret = __cvdso_clock_gettime32(clock, ts);
+	ret = __cvdso_clock_gettime32(vd, clock, ts);
 
 	if (likely(!ret))
 		return ret;
@@ -28,13 +29,14 @@ int __vdso_clock_gettime(clockid_t clock,
 int __vdso_clock_gettime64(clockid_t clock,
 			   struct __kernel_timespec *ts)
 {
+	const struct vdso_data *vd = __arch_get_vdso_data();
 	int ret;
 
 	/* The checks below are required for ABI consistency
[RFC PATCH v2 02/10] lib: vdso: move call to fallback out of common code.
On powerpc, VDSO functions and syscalls cannot be implemented in C because the Linux kernel ABI requires that the CR[SO] bit is set in case of error and cleared when there is no error.

As this cannot be done in C, C VDSO functions and the syscall-based fallback need a trampoline in ASM.

By moving the fallback calls out of the common code, arches like powerpc can implement both the call to the C VDSO and the fallback call in a single trampoline function.

The two advantages are:
- No need to play back and forth with CR[SO] and negative return values.
- No stack frame is required in VDSO C functions for the fallbacks.

The performance improvement is significant on powerpc.

Signed-off-by: Christophe Leroy
---
 arch/arm/vdso/vgettimeofday.c            | 28 +++---
 arch/arm64/kernel/vdso/vgettimeofday.c   | 21 --
 arch/arm64/kernel/vdso32/vgettimeofday.c | 35 ---
 arch/mips/vdso/vgettimeofday.c           | 49 +++-
 arch/x86/entry/vdso/vclock_gettime.c     | 42 +++
 lib/vdso/gettimeofday.c                  | 31
 6 files changed, 156 insertions(+), 50 deletions(-)

diff --git a/arch/arm/vdso/vgettimeofday.c b/arch/arm/vdso/vgettimeofday.c
index 1976c6f325a4..5451afb715e6 100644
--- a/arch/arm/vdso/vgettimeofday.c
+++ b/arch/arm/vdso/vgettimeofday.c
@@ -10,25 +10,45 @@
 int __vdso_clock_gettime(clockid_t clock,
 			 struct old_timespec32 *ts)
 {
-	return __cvdso_clock_gettime32(clock, ts);
+	int ret = __cvdso_clock_gettime32(clock, ts);
+
+	if (likely(!ret))
+		return ret;
+
+	return clock_gettime32_fallback(clock, ts);
 }
 
 int __vdso_clock_gettime64(clockid_t clock,
 			   struct __kernel_timespec *ts)
 {
-	return __cvdso_clock_gettime(clock, ts);
+	int ret = __cvdso_clock_gettime(clock, ts);
+
+	if (likely(!ret))
+		return ret;
+
+	return clock_gettime_fallback(clock, ts);
 }
 
 int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
 			struct timezone *tz)
 {
-	return __cvdso_gettimeofday(tv, tz);
+	int ret = __cvdso_gettimeofday(tv, tz);
+
+	if (likely(!ret))
+		return ret;
+
+	return gettimeofday_fallback(tv, tz);
 }
 
 int __vdso_clock_getres(clockid_t clock_id,
 			struct old_timespec32 *res)
 {
-	return __cvdso_clock_getres_time32(clock_id, res);
+	int ret = __cvdso_clock_getres_time32(clock_id, res);
+
+	if (likely(!ret))
+		return ret;
+
+	return clock_getres32_fallback(clock_id, res);
 }
 
 /* Avoid unresolved references emitted by GCC */
diff --git a/arch/arm64/kernel/vdso/vgettimeofday.c b/arch/arm64/kernel/vdso/vgettimeofday.c
index 747635501a14..62694876b216 100644
--- a/arch/arm64/kernel/vdso/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso/vgettimeofday.c
@@ -11,17 +11,32 @@
 int __kernel_clock_gettime(clockid_t clock,
 			   struct __kernel_timespec *ts)
 {
-	return __cvdso_clock_gettime(clock, ts);
+	int ret = __cvdso_clock_gettime(clock, ts);
+
+	if (likely(!ret))
+		return ret;
+
+	return clock_gettime_fallback(clock, ts);
 }
 
 int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
 			  struct timezone *tz)
 {
-	return __cvdso_gettimeofday(tv, tz);
+	int ret = __cvdso_gettimeofday(tv, tz);
+
+	if (likely(!ret))
+		return ret;
+
+	return gettimeofday_fallback(tv, tz);
 }
 
 int __kernel_clock_getres(clockid_t clock_id,
 			  struct __kernel_timespec *res)
 {
-	return __cvdso_clock_getres(clock_id, res);
+	int ret = __cvdso_clock_getres(clock_id, res);
+
+	if (likely(!ret))
+		return ret;
+
+	return clock_getres_fallback(clock_id, res);
 }
diff --git a/arch/arm64/kernel/vdso32/vgettimeofday.c b/arch/arm64/kernel/vdso32/vgettimeofday.c
index 54fc1c2ce93f..6770d2bedd1f 100644
--- a/arch/arm64/kernel/vdso32/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso32/vgettimeofday.c
@@ -11,37 +11,64 @@
 int __vdso_clock_gettime(clockid_t clock,
 			 struct old_timespec32 *ts)
 {
+	int ret;
+
 	/* The checks below are required for ABI consistency with arm */
 	if ((u32)ts >= TASK_SIZE_32)
 		return -EFAULT;
 
-	return __cvdso_clock_gettime32(clock, ts);
+	ret = __cvdso_clock_gettime32(clock, ts);
+
+	if (likely(!ret))
+		return ret;
+
+	return clock_gettime32_fallback(clock, ts);
 }
 
 int __vdso_clock_gettime64(clockid_t clock,
 			   struct __kernel_timespec *ts)
 {
+	int ret;
+
 	/* The checks below are required for ABI consistency with arm */
 	if ((u32)ts >= TASK_SIZE_32)
 		return -EFAULT;
 
-	return __cvdso_clock_gettime(clock,
[RFC PATCH v2 03/10] lib: vdso: Change __cvdso_clock_gettime/getres_common() to __cvdso_clock_gettime/getres()
__cvdso_clock_getres() just calls __cvdso_clock_getres_common().
__cvdso_clock_gettime() just calls __cvdso_clock_gettime_common().

Drop __cvdso_clock_getres() and __cvdso_clock_gettime().
Rename __cvdso_clock_gettime_common() into __cvdso_clock_gettime().
Rename __cvdso_clock_getres_common() into __cvdso_clock_getres().

Signed-off-by: Christophe Leroy
---
 lib/vdso/gettimeofday.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 4618e274f1d5..c6eeeb47f446 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -79,7 +79,7 @@ static void do_coarse(const struct vdso_data *vd, clockid_t clk,
 }
 
 static __maybe_unused int
-__cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts)
+__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
 {
 	const struct vdso_data *vd = __arch_get_vdso_data();
 	u32 msk;
@@ -105,16 +105,10 @@ __cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts)
 }
 
 static __maybe_unused int
-__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
-{
-	return __cvdso_clock_gettime_common(clock, ts);
-}
-
-static __maybe_unused int
 __cvdso_clock_gettime32(clockid_t clock, struct old_timespec32 *res)
 {
 	struct __kernel_timespec ts;
-	int ret = __cvdso_clock_gettime_common(clock, &ts);
+	int ret = __cvdso_clock_gettime(clock, &ts);
 
 	if (likely(!ret)) {
 		res->tv_sec = ts.tv_sec;
@@ -161,7 +155,7 @@ static __maybe_unused __kernel_old_time_t __cvdso_time(__kernel_old_time_t *time
 
 #ifdef VDSO_HAS_CLOCK_GETRES
 static __maybe_unused
-int __cvdso_clock_getres_common(clockid_t clock, struct __kernel_timespec *res)
+int __cvdso_clock_getres(clockid_t clock, struct __kernel_timespec *res)
 {
 	const struct vdso_data *vd = __arch_get_vdso_data();
 	u64 hrtimer_res;
@@ -204,16 +198,11 @@ int __cvdso_clock_getres_common(clockid_t clock, struct __kernel_timespec *res)
 	return 0;
 }
 
-int __cvdso_clock_getres(clockid_t clock, struct __kernel_timespec *res)
-{
-	return __cvdso_clock_getres_common(clock, res);
-}
-
 static __maybe_unused int
 __cvdso_clock_getres_time32(clockid_t clock, struct old_timespec32 *res)
 {
 	struct __kernel_timespec ts;
-	int ret = __cvdso_clock_getres_common(clock, &ts);
+	int ret = __cvdso_clock_getres(clock, &ts);
 
 	if (likely(!ret && res)) {
 		res->tv_sec = ts.tv_sec;
-- 
2.13.3
[RFC PATCH v2 01/10] lib: vdso: ensure all arches have 32bit fallback
In order to simplify the next step, which moves the fallback calls to the arch level, ensure all arches have a 32-bit fallback instead of handling the lack of a 32-bit fallback in the common code based on VDSO_HAS_32BIT_FALLBACK.

Signed-off-by: Christophe Leroy
---
 arch/arm/include/asm/vdso/gettimeofday.h          | 26 +
 arch/arm64/include/asm/vdso/compat_gettimeofday.h |  2 --
 arch/arm64/include/asm/vdso/gettimeofday.h        | 26 +
 arch/mips/include/asm/vdso/gettimeofday.h         | 28 +--
 arch/x86/include/asm/vdso/gettimeofday.h          | 28 +--
 lib/vdso/gettimeofday.c                           | 10
 6 files changed, 104 insertions(+), 16 deletions(-)

diff --git a/arch/arm/include/asm/vdso/gettimeofday.h b/arch/arm/include/asm/vdso/gettimeofday.h
index 0ad2429c324f..55f8ad6e 100644
--- a/arch/arm/include/asm/vdso/gettimeofday.h
+++ b/arch/arm/include/asm/vdso/gettimeofday.h
@@ -70,6 +70,32 @@ static __always_inline int clock_getres_fallback(
 	return ret;
 }
 
+static __always_inline
+long clock_gettime32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+	struct __kernel_timespec ts;
+	int ret = clock_gettime_fallback(_clkid, &ts);
+
+	if (likely(!ret)) {
+		_ts->tv_sec = ts.tv_sec;
+		_ts->tv_nsec = ts.tv_nsec;
+	}
+	return ret;
+}
+
+static __always_inline
+long clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+	struct __kernel_timespec ts;
+	int ret = clock_getres_fallback(_clkid, &ts);
+
+	if (likely(!ret && _ts)) {
+		_ts->tv_sec = ts.tv_sec;
+		_ts->tv_nsec = ts.tv_nsec;
+	}
+	return ret;
+}
+
 static __always_inline u64 __arch_get_hw_counter(int clock_mode)
 {
 #ifdef CONFIG_ARM_ARCH_TIMER
diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index c50ee1b7d5cd..bab700e37a03 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -16,8 +16,6 @@
 
 #define VDSO_HAS_CLOCK_GETRES		1
 
-#define VDSO_HAS_32BIT_FALLBACK	1
-
 static __always_inline
 int gettimeofday_fallback(struct __kernel_old_timeval *_tv,
 			  struct timezone *_tz)
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h b/arch/arm64/include/asm/vdso/gettimeofday.h
index b08f476b72b4..c41c86a07423 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -66,6 +66,32 @@ int clock_getres_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
 	return ret;
 }
 
+static __always_inline
+long clock_gettime32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+	struct __kernel_timespec ts;
+	int ret = clock_gettime_fallback(_clkid, &ts);
+
+	if (likely(!ret)) {
+		_ts->tv_sec = ts.tv_sec;
+		_ts->tv_nsec = ts.tv_nsec;
+	}
+	return ret;
+}
+
+static __always_inline
+long clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+	struct __kernel_timespec ts;
+	int ret = clock_getres_fallback(_clkid, &ts);
+
+	if (likely(!ret && _ts)) {
+		_ts->tv_sec = ts.tv_sec;
+		_ts->tv_nsec = ts.tv_nsec;
+	}
+	return ret;
+}
+
 static __always_inline u64 __arch_get_hw_counter(s32 clock_mode)
 {
 	u64 res;
diff --git a/arch/mips/include/asm/vdso/gettimeofday.h b/arch/mips/include/asm/vdso/gettimeofday.h
index b08825531e9f..60608e930a5c 100644
--- a/arch/mips/include/asm/vdso/gettimeofday.h
+++ b/arch/mips/include/asm/vdso/gettimeofday.h
@@ -109,8 +109,6 @@ static __always_inline int clock_getres_fallback(
 
 #if _MIPS_SIM != _MIPS_SIM_ABI64
 
-#define VDSO_HAS_32BIT_FALLBACK	1
-
 static __always_inline long clock_gettime32_fallback(
 	clockid_t _clkid,
 	struct old_timespec32 *_ts)
@@ -150,6 +148,32 @@ static __always_inline int clock_getres32_fallback(
 	return error ? -ret : ret;
 }
 
+#else
+static __always_inline
+long clock_gettime32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+	struct __kernel_timespec ts;
+	int ret = clock_gettime_fallback(_clkid, &ts);
+
+	if (likely(!ret)) {
+		_ts->tv_sec = ts.tv_sec;
+		_ts->tv_nsec = ts.tv_nsec;
+	}
+	return ret;
+}
+
+static __always_inline
+long clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+	struct __kernel_timespec ts;
+	int ret = clock_getres_fallback(_clkid, &ts);
+
+	if (likely(!ret && _ts)) {
+		_ts->tv_sec = ts.tv_sec;
+		_ts->tv_nsec = ts.tv_nsec;
+	}
+	return ret;
+}
 #endif
 
 #ifdef CONFIG_CSRC_R4K
diff --git a/arch/x86/include/asm/vdso/gettimeofday.h
[RFC PATCH v2 00/10] powerpc/32: switch VDSO to C implementation.
This is a second attempt at switching the powerpc/32 VDSO to the generic C implementation. It will likely not work on 64 bits or even build properly at the moment.

powerpc is a bit special for VDSO as well as system calls in that it requires setting the CR SO bit, which cannot be done in C. Therefore, entry/exit and fallback need to be performed in ASM.

To allow that, the fallback calls are moved out of the common code and left to the arches. A few other changes in the common code have allowed performance improvements.

The performance has improved since the first RFC, but it is still lower than the current assembly VDSO.

On a powerpc 8xx, with the current powerpc/32 ASM VDSO:

gettimeofday:            vdso: 737 nsec/call
clock-getres-realtime:   vdso: 475 nsec/call
clock-gettime-realtime:  vdso: 892 nsec/call
clock-getres-monotonic:  vdso: 475 nsec/call
clock-gettime-monotonic: vdso: 1014 nsec/call

First try of C implementation:

gettimeofday:            vdso: 1533 nsec/call
clock-getres-realtime:   vdso: 853 nsec/call
clock-gettime-realtime:  vdso: 1570 nsec/call
clock-getres-monotonic:  vdso: 835 nsec/call
clock-gettime-monotonic: vdso: 1605 nsec/call

With this series:

gettimeofday:            vdso: 1016 nsec/call
clock-getres-realtime:   vdso: 560 nsec/call
clock-gettime-realtime:  vdso: 1192 nsec/call
clock-getres-monotonic:  vdso: 560 nsec/call
clock-gettime-monotonic: vdso: 1192 nsec/call

Changes made to other arches are untested, not even compiled.

Christophe Leroy (10):
  lib: vdso: ensure all arches have 32bit fallback
  lib: vdso: move call to fallback out of common code.
  lib: vdso: Change __cvdso_clock_gettime/getres_common() to __cvdso_clock_gettime/getres()
  lib: vdso: get pointer to vdso data from the arch
  lib: vdso: inline do_hres()
  lib: vdso: make do_coarse() return 0
  lib: vdso: don't use READ_ONCE() in __c_kernel_time()
  lib: vdso: Avoid duplication in __cvdso_clock_getres()
  powerpc/vdso32: inline __get_datapage()
  powerpc/32: Switch VDSO to C implementation.
 arch/arm/include/asm/vdso/gettimeofday.h          |  26 +++
 arch/arm/vdso/vgettimeofday.c                     |  32 ++-
 arch/arm64/include/asm/vdso/compat_gettimeofday.h |   2 -
 arch/arm64/include/asm/vdso/gettimeofday.h        |  26 +++
 arch/arm64/kernel/vdso/vgettimeofday.c            |  24 +-
 arch/arm64/kernel/vdso32/vgettimeofday.c          |  39 +++-
 arch/mips/include/asm/vdso/gettimeofday.h         |  28 ++-
 arch/mips/vdso/vgettimeofday.c                    |  56 -
 arch/powerpc/Kconfig                              |   2 +
 arch/powerpc/include/asm/vdso/gettimeofday.h      |  45 ++++
 arch/powerpc/include/asm/vdso/vsyscall.h          |  27 +++
 arch/powerpc/include/asm/vdso_datapage.h          |  28 +--
 arch/powerpc/kernel/asm-offsets.c                 |  23 +-
 arch/powerpc/kernel/time.c                        |  92 +---
 arch/powerpc/kernel/vdso.c                        |  19 +-
 arch/powerpc/kernel/vdso32/Makefile               |  19 +-
 arch/powerpc/kernel/vdso32/cacheflush.S           |   9 +-
 arch/powerpc/kernel/vdso32/datapage.S             |  28 +--
 arch/powerpc/kernel/vdso32/gettimeofday.S         | 265 +++---
 arch/powerpc/kernel/vdso32/vgettimeofday.c        |  32 +++
 arch/x86/entry/vdso/vclock_gettime.c              |  52 -
 arch/x86/include/asm/vdso/gettimeofday.h          |  28 ++-
 lib/vdso/gettimeofday.c                           | 100 +++-
 23 files changed, 505 insertions(+), 497 deletions(-)
 create mode 100644 arch/powerpc/include/asm/vdso/gettimeofday.h
 create mode 100644 arch/powerpc/include/asm/vdso/vsyscall.h
 create mode 100644 arch/powerpc/kernel/vdso32/vgettimeofday.c

-- 
2.13.3
[PATCH] powerpc/shared: include correct header for static key
Recently, the spinlock implementation grew a static key optimization, but the jump_label.h header include was left out, leading to build errors:

linux/arch/powerpc/include/asm/spinlock.h:44:7: error: implicit declaration of function ‘static_branch_unlikely’ [-Werror=implicit-function-declaration]
   44 |  if (!static_branch_unlikely(&shared_processor))

This commit adds the missing header.

Fixes: 656c21d6af5d ("powerpc/shared: Use static key to detect shared processor")
Cc: Srikar Dronamraju
Signed-off-by: Jason A. Donenfeld
---
 arch/powerpc/include/asm/spinlock.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 1b55fc08f853..860228e917dc 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -15,6 +15,7 @@
  *
  * (the type definitions are in asm/spinlock_types.h)
  */
+#include <linux/jump_label.h>
 #include <linux/irqflags.h>
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
-- 
2.24.1
Re: [PATCH kernel v3] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()
Alexey Kardashevskiy writes:
> The last jump to free_exit in mm_iommu_do_alloc() happens after page
> pointers in struct mm_iommu_table_group_mem_t were already converted to
> physical addresses. Thus calling put_page() on these physical addresses
> will likely crash.
>
> This moves the loop which calculates the pageshift and converts page
> struct pointers to physical addresses later after the point when
> we cannot fail; thus eliminating the need to convert pointers back.
>
> Fixes: eb9d7a62c386 ("powerpc/mm_iommu: Fix potential deadlock")
> Reported-by: Jan Kara
> Signed-off-by: Alexey Kardashevskiy
> ---
> Changes:
> v3:
> * move pointers conversion after the last possible failure point
> ---
>  arch/powerpc/mm/book3s64/iommu_api.c | 39 +++-
>  1 file changed, 21 insertions(+), 18 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
> index 56cc84520577..ef164851738b 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -121,24 +121,6 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>  		goto free_exit;
>  	}
>
> -	pageshift = PAGE_SHIFT;
> -	for (i = 0; i < entries; ++i) {
> -		struct page *page = mem->hpages[i];
> -
> -		/*
> -		 * Allow to use larger than 64k IOMMU pages. Only do that
> -		 * if we are backed by hugetlb.
> -		 */
> -		if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
> -			pageshift = page_shift(compound_head(page));
> -		mem->pageshift = min(mem->pageshift, pageshift);
> -		/*
> -		 * We don't need struct page reference any more, switch
> -		 * to physical address.
> -		 */
> -		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> -	}
> -
>  good_exit:
>  	atomic64_set(&mem->mapped, 1);
>  	mem->used = 1;
> @@ -158,6 +140,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>  		}
>  	}
>
> +	if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {

Couldn't you avoid testing this again ...

> +		/*
> +		 * Allow to use larger than 64k IOMMU pages. Only do that
> +		 * if we are backed by hugetlb. Skip device memory as it is not
> +		 * backed with page structs.
> +		 */
> +		pageshift = PAGE_SHIFT;
> +		for (i = 0; i < entries; ++i) {

... by making this loop up to `pinned`.

`pinned` is only incremented in the loop that does the GUP, and there's a check that pinned == entries after that loop.

So when we get here we know pinned == entries, and if pinned is zero it's because we took the (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) case at the start of the function to get here.

Or do you think that's too subtle to rely on?

cheers

> +			struct page *page = mem->hpages[i];
> +
> +			if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
> +				pageshift = page_shift(compound_head(page));
> +			mem->pageshift = min(mem->pageshift, pageshift);
> +			/*
> +			 * We don't need struct page reference any more, switch
> +			 * to physical address.
> +			 */
> +			mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> +		}
> +	}
> +
> 	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
>
> 	mutex_unlock(&mem_list_mutex);
> --
> 2.17.1