Re: [PATCH v6 00/15] Restricted DMA
v7: https://lore.kernel.org/patchwork/cover/1431031/ On Mon, May 10, 2021 at 5:50 PM Claire Chang wrote: > > From: Claire Chang > > This series implements mitigations for lack of DMA access control on > systems without an IOMMU, which could result in the DMA accessing the > system memory at unexpected times and/or unexpected addresses, possibly > leading to data leakage or corruption. > > For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is > not behind an IOMMU. As PCI-e, by design, gives the device full access to > system memory, a vulnerability in the Wi-Fi firmware could easily escalate > to a full system exploit (remote wifi exploits: [1a], [1b] that shows a > full chain of exploits; [2], [3]). > > To mitigate the security concerns, we introduce restricted DMA. Restricted > DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a > specially allocated region and does memory allocation from the same region. > The feature on its own provides a basic level of protection against the DMA > overwriting buffer contents at unexpected times. However, to protect > against general data leakage and system memory corruption, the system needs > to provide a way to restrict the DMA to a predefined memory region (this is > usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]). > > [1a] > https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html > [1b] > https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html > [2] https://blade.tencent.com/en/advisories/qualpwn/ > [3] > https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/ > [4] > https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132 > > v6: > Address the comments in v5 > > v5: > Rebase on latest linux-next > https://lore.kernel.org/patchwork/cover/1416899/ > > v4: > - Fix spinlock bad magic > - Use rmem->name for debugfs entry > - Address the comments in v3 > https://lore.kernel.org/patchwork/cover/1378113/ > > v3: > Using only one reserved memory region for both streaming DMA and memory > allocation. > https://lore.kernel.org/patchwork/cover/1360992/ > > v2: > Building on top of swiotlb. > https://lore.kernel.org/patchwork/cover/1280705/ > > v1: > Using dma_map_ops. > https://lore.kernel.org/patchwork/cover/1271660/ > *** BLURB HERE *** > > Claire Chang (15): > swiotlb: Refactor swiotlb init functions > swiotlb: Refactor swiotlb_create_debugfs > swiotlb: Add DMA_RESTRICTED_POOL > swiotlb: Add restricted DMA pool initialization > swiotlb: Add a new get_io_tlb_mem getter > swiotlb: Update is_swiotlb_buffer to add a struct device argument > swiotlb: Update is_swiotlb_active to add a struct device argument > swiotlb: Bounce data from/to restricted DMA pool if available > swiotlb: Move alloc_size to find_slots > swiotlb: Refactor swiotlb_tbl_unmap_single > dma-direct: Add a new wrapper __dma_direct_free_pages() > swiotlb: Add restricted DMA alloc/free support. 
> dma-direct: Allocate memory from restricted DMA pool if available > dt-bindings: of: Add restricted DMA pool > of: Add plumbing for restricted DMA pool > > .../reserved-memory/reserved-memory.txt | 27 ++ > drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +- > drivers/gpu/drm/nouveau/nouveau_ttm.c | 2 +- > drivers/iommu/dma-iommu.c | 12 +- > drivers/of/address.c | 25 ++ > drivers/of/device.c | 3 + > drivers/of/of_private.h | 5 + > drivers/pci/xen-pcifront.c| 2 +- > drivers/xen/swiotlb-xen.c | 2 +- > include/linux/device.h| 4 + > include/linux/swiotlb.h | 41 ++- > kernel/dma/Kconfig| 14 + > kernel/dma/direct.c | 63 +++-- > kernel/dma/direct.h | 9 +- > kernel/dma/swiotlb.c | 242 +- > 15 files changed, 356 insertions(+), 97 deletions(-) > > -- > 2.31.1.607.g51e8a6a459-goog >
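The core idea above - bounce streaming DMA through one pre-allocated, firmware-protected region instead of letting the device reach arbitrary system memory - can be illustrated with a small userspace toy. This is only a sketch of the bouncing concept, not the swiotlb code: the pool size, the bounce_map()/bounce_unmap() names and the memcpy standing in for device DMA are all made up for illustration.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model of a restricted bounce pool: the "device" is only ever
 * handed offsets inside pool[], never pointers to arbitrary memory. */
#define POOL_SIZE 4096
static uint8_t pool[POOL_SIZE];
static size_t pool_used;

/* "Map" a buffer for device access: copy it into the pool and return
 * the pool offset the device is allowed to touch. */
static long bounce_map(const void *buf, size_t len)
{
	if (pool_used + len > POOL_SIZE)
		return -1;			/* pool exhausted */
	memcpy(pool + pool_used, buf, len);
	pool_used += len;
	return (long)(pool_used - len);
}

/* "Unmap" after the device wrote into the pool: copy the result back. */
static void bounce_unmap(long slot, void *buf, size_t len)
{
	memcpy(buf, pool + slot, len);
}

int main(void)
{
	char req[] = "tx frame";
	char resp[16] = { 0 };
	long slot = bounce_map(req, sizeof(req));

	if (slot < 0)
		return 1;
	/* A real device would DMA into the pool here; fake it. */
	memcpy(pool + slot, "rx frame", 9);
	bounce_unmap(slot, resp, sizeof(resp));
	printf("device wrote: %s\n", resp);
	return 0;
}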
[PATCH v6 1/3] riscv: Introduce CONFIG_RELOCATABLE
This config allows to compile 64b kernel as PIE and to relocate it at any virtual address at runtime: this paves the way to KASLR. Runtime relocation is possible since relocation metadata are embedded into the kernel. Note that relocating at runtime introduces an overhead even if the kernel is loaded at the same address it was linked at and that the compiler options are those used in arm64 which uses the same RELA relocation format. Signed-off-by: Alexandre Ghiti --- arch/riscv/Kconfig | 12 arch/riscv/Makefile | 5 +++- arch/riscv/kernel/vmlinux.lds.S | 6 arch/riscv/mm/Makefile | 4 +++ arch/riscv/mm/init.c| 53 - 5 files changed, 78 insertions(+), 2 deletions(-) diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index a8ad8eb76120..7d49c9fa9a91 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -205,6 +205,18 @@ config PGTABLE_LEVELS config LOCKDEP_SUPPORT def_bool y +config RELOCATABLE + bool + depends on MMU && 64BIT && !XIP_KERNEL + help + This builds a kernel as a Position Independent Executable (PIE), + which retains all relocation metadata required to relocate the + kernel binary at runtime to a different virtual address than the + address it was linked at. + Since RISCV uses the RELA relocation format, this requires a + relocation pass at runtime even if the kernel is loaded at the + same address it was linked at. + source "arch/riscv/Kconfig.socs" source "arch/riscv/Kconfig.erratas" diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile index 3eb9590a0775..2d217ecb6e6b 100644 --- a/arch/riscv/Makefile +++ b/arch/riscv/Makefile @@ -9,7 +9,10 @@ # OBJCOPYFLAGS:= -O binary -LDFLAGS_vmlinux := +ifeq ($(CONFIG_RELOCATABLE),y) +LDFLAGS_vmlinux := -shared -Bsymbolic -z notext -z norelro +KBUILD_CFLAGS += -fPIE +endif ifeq ($(CONFIG_DYNAMIC_FTRACE),y) LDFLAGS_vmlinux := --no-relax KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S index 891742ff75a7..1517fd1c7246 100644 --- a/arch/riscv/kernel/vmlinux.lds.S +++ b/arch/riscv/kernel/vmlinux.lds.S @@ -133,6 +133,12 @@ SECTIONS BSS_SECTION(PAGE_SIZE, PAGE_SIZE, 0) + .rela.dyn : ALIGN(8) { + __rela_dyn_start = .; + *(.rela .rela*) + __rela_dyn_end = .; + } + #ifdef CONFIG_EFI . = ALIGN(PECOFF_SECTION_ALIGNMENT); __pecoff_data_virt_size = ABSOLUTE(. - __pecoff_text_end); diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile index 7ebaef10ea1b..2d33ec574bbb 100644 --- a/arch/riscv/mm/Makefile +++ b/arch/riscv/mm/Makefile @@ -1,6 +1,10 @@ # SPDX-License-Identifier: GPL-2.0-only CFLAGS_init.o := -mcmodel=medany +ifdef CONFIG_RELOCATABLE +CFLAGS_init.o += -fno-pie +endif + ifdef CONFIG_FTRACE CFLAGS_REMOVE_init.o = $(CC_FLAGS_FTRACE) CFLAGS_REMOVE_cacheflush.o = $(CC_FLAGS_FTRACE) diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index 4faf8bd157ea..5e0a19d9d8fa 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -18,6 +18,9 @@ #include #include #include +#ifdef CONFIG_RELOCATABLE +#include +#endif #include #include @@ -99,7 +102,7 @@ static void __init print_vm_layout(void) print_mlm("lowmem", (unsigned long)PAGE_OFFSET, (unsigned long)high_memory); #ifdef CONFIG_64BIT - print_mlm("kernel", (unsigned long)KERNEL_LINK_ADDR, + print_mlm("kernel", (unsigned long)kernel_virt_addr, (unsigned long)ADDRESS_SPACE_END); #endif } @@ -454,6 +457,44 @@ asmlinkage void __init __copy_data(void) #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing." 
#endif +#ifdef CONFIG_RELOCATABLE +extern unsigned long __rela_dyn_start, __rela_dyn_end; + +void __init relocate_kernel(uintptr_t load_pa) +{ + Elf64_Rela *rela = (Elf64_Rela *)&__rela_dyn_start; + /* +* This holds the offset between the linked virtual address and the +* relocated virtual address. +*/ + uintptr_t reloc_offset = kernel_virt_addr - KERNEL_LINK_ADDR; + /* +* This holds the offset between kernel linked virtual address and +* physical address. +*/ + uintptr_t va_kernel_link_pa_offset = KERNEL_LINK_ADDR - load_pa; + + for ( ; rela < (Elf64_Rela *)&__rela_dyn_end; rela++) { + Elf64_Addr addr = (rela->r_offset - va_kernel_link_pa_offset); + Elf64_Addr relocated_addr = rela->r_addend; + + if (rela->r_info != R_RISCV_RELATIVE) + continue; + + /* +* Make sure to not relocate vdso symbols like rt_sigreturn +* which are linked from the address 0 in vmlinux since +* vdso symbol a
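The relocate_kernel() loop above (cut off in this excerpt) walks the .rela.dyn entries and, for every R_RISCV_RELATIVE entry, writes the link-time addend plus the runtime offset back into the image. Below is a minimal standalone sketch of that idea, assuming a fake in-memory "image" and made-up link/run addresses; it is not the kernel code, which also has to skip vdso symbols linked at address 0.

#include <elf.h>
#include <stdint.h>
#include <stdio.h>

#ifndef R_RISCV_RELATIVE
#define R_RISCV_RELATIVE 3
#endif

/* Pretend the kernel was linked at 0x1000 but actually runs at 0x9000. */
#define LINK_ADDR 0x1000UL
#define RUN_ADDR  0x9000UL

static uint64_t image[32];	/* stand-in for the loaded kernel image */

static void apply_relative(const Elf64_Rela *rela, size_t n)
{
	uint64_t reloc_offset = RUN_ADDR - LINK_ADDR;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ELF64_R_TYPE(rela[i].r_info) != R_RISCV_RELATIVE)
			continue;
		/* r_offset is a link-time address: turn it into a location
		 * inside the fake image, then store addend + runtime offset. */
		uint64_t *where = (uint64_t *)((uint8_t *)image +
					       (rela[i].r_offset - LINK_ADDR));
		*where = rela[i].r_addend + reloc_offset;
	}
}

int main(void)
{
	/* One entry: the word at link address 0x1010 holds a pointer that
	 * was 0x1020 at link time and must become 0x9020 at run time. */
	Elf64_Rela r = {
		.r_offset = 0x1010,
		.r_info   = ELF64_R_INFO(0, R_RISCV_RELATIVE),
		.r_addend = 0x1020,
	};

	apply_relative(&r, 1);
	printf("patched value: 0x%llx\n", (unsigned long long)image[2]);
	return 0;
}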
[PATCH v6 3/3] riscv: Check relocations at compile time
Relocating kernel at runtime is done very early in the boot process, so it is not convenient to check for relocations there and react in case a relocation was not expected. There exists a script in scripts/ that extracts the relocations from vmlinux that is then used at postlink to check the relocations. Signed-off-by: Alexandre Ghiti Reviewed-by: Anup Patel --- arch/riscv/Makefile.postlink | 36 arch/riscv/tools/relocs_check.sh | 26 +++ 2 files changed, 62 insertions(+) create mode 100644 arch/riscv/Makefile.postlink create mode 100755 arch/riscv/tools/relocs_check.sh diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink new file mode 100644 index ..bf2b2bca1845 --- /dev/null +++ b/arch/riscv/Makefile.postlink @@ -0,0 +1,36 @@ +# SPDX-License-Identifier: GPL-2.0 +# === +# Post-link riscv pass +# === +# +# Check that vmlinux relocations look sane + +PHONY := __archpost +__archpost: + +-include include/config/auto.conf +include scripts/Kbuild.include + +quiet_cmd_relocs_check = CHKREL $@ +cmd_relocs_check = \ + $(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@" + +# `@true` prevents complaint when there is nothing to be done + +vmlinux: FORCE + @true +ifdef CONFIG_RELOCATABLE + $(call if_changed,relocs_check) +endif + +%.ko: FORCE + @true + +clean: + @true + +PHONY += FORCE clean + +FORCE: + +.PHONY: $(PHONY) diff --git a/arch/riscv/tools/relocs_check.sh b/arch/riscv/tools/relocs_check.sh new file mode 100755 index ..baeb2e7b2290 --- /dev/null +++ b/arch/riscv/tools/relocs_check.sh @@ -0,0 +1,26 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0-or-later +# Based on powerpc relocs_check.sh + +# This script checks the relocations of a vmlinux for "suspicious" +# relocations. + +if [ $# -lt 3 ]; then +echo "$0 [path to objdump] [path to nm] [path to vmlinux]" 1>&2 +exit 1 +fi + +bad_relocs=$( +${srctree}/scripts/relocs_check.sh "$@" | + # These relocations are okay + # R_RISCV_RELATIVE + grep -F -w -v 'R_RISCV_RELATIVE' +) + +if [ -z "$bad_relocs" ]; then + exit 0 +fi + +num_bad=$(echo "$bad_relocs" | wc -l) +echo "WARNING: $num_bad bad relocations" +echo "$bad_relocs" -- 2.30.2
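The check itself is just a filter: list every dynamic relocation in vmlinux, drop the types that are expected in a relocatable kernel (here only R_RISCV_RELATIVE), and complain about the rest. As a rough illustration of that logic in C rather than shell - the objdump -R field layout and the allowed-type list are assumptions of this sketch, not something the patch defines:

#include <stdio.h>
#include <string.h>

/* Relocation types considered fine in a relocatable vmlinux. */
static const char *allowed[] = { "R_RISCV_RELATIVE", NULL };

static int is_allowed(const char *type)
{
	int i;

	for (i = 0; allowed[i]; i++)
		if (strcmp(type, allowed[i]) == 0)
			return 1;
	return 0;
}

int main(void)
{
	char line[512], addr[64], type[64];
	unsigned long bad = 0;

	/* objdump -R lines look roughly like:
	 *   0000000000001000 R_RISCV_RELATIVE  *ABS*+0x80001000
	 * header lines and anything that does not parse are skipped. */
	while (fgets(line, sizeof(line), stdin)) {
		if (sscanf(line, "%63s %63s", addr, type) != 2)
			continue;
		if (strncmp(type, "R_", 2) != 0 || is_allowed(type))
			continue;
		bad++;
		fputs(line, stdout);
	}
	if (bad)
		fprintf(stderr, "WARNING: %lu bad relocations\n", bad);
	return 0;
}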
[PATCH v6 0/3] Introduce 64b relocatable kernel
After multiple attempts, this patchset is now based on the fact that the 64b kernel mapping was moved outside the linear mapping. The first patch allows to build relocatable kernels but is not selected by default. That patch should ease KASLR implementation a lot. The second and third patches take advantage of an already existing powerpc script that checks relocations at compile-time, and uses it for riscv. This patchset was tested on: * kernel: - rv32: OK - rv64 with RELOCATABLE: OK and checked that "suspicious" relocations are caught. - rv64 without RELOCATABLE: OK - powerpc: build only and checked that "suspicious" relocations are caught. * xipkernel: - rv32: build only - rv64: OK * nommukernel: - rv64: build only Changes in v6: * Remove the kernel move to vmalloc zone * Rebased on top of for-next * Remove relocatable property from 32b kernel as the kernel is mapped in the linear mapping and would then need to be copied physically too * CONFIG_RELOCATABLE depends on !XIP_KERNEL * Remove Reviewed-by from first patch as it changed a bit Changes in v5: * Add "static __init" to create_kernel_page_table function as reported by Kbuild test robot * Add reviewed-by from Zong * Rebase onto v5.7 Changes in v4: * Fix BPF region that overlapped with kernel's as suggested by Zong * Fix end of module region that could be larger than 2GB as suggested by Zong * Fix the size of the vm area reserved for the kernel as we could lose PMD_SIZE if the size was already aligned on PMD_SIZE * Split compile time relocations check patch into 2 patches as suggested by Anup * Applied Reviewed-by from Zong and Anup Changes in v3: * Move kernel mapping to vmalloc Changes in v2: * Make RELOCATABLE depend on MMU as suggested by Anup * Rename kernel_load_addr into kernel_virt_addr as suggested by Anup * Use __pa_symbol instead of __pa, as suggested by Zong * Rebased on top of v5.6-rc3 * Tested with sv48 patchset * Add Reviewed/Tested-by from Zong and Anup Alexandre Ghiti (3): riscv: Introduce CONFIG_RELOCATABLE powerpc: Move script to check relocations at compile time in scripts/ riscv: Check relocations at compile time arch/powerpc/tools/relocs_check.sh | 18 ++ arch/riscv/Kconfig | 12 +++ arch/riscv/Makefile| 5 ++- arch/riscv/Makefile.postlink | 36 arch/riscv/kernel/vmlinux.lds.S| 6 arch/riscv/mm/Makefile | 4 +++ arch/riscv/mm/init.c | 53 +- arch/riscv/tools/relocs_check.sh | 26 +++ scripts/relocs_check.sh| 20 +++ 9 files changed, 162 insertions(+), 18 deletions(-) create mode 100644 arch/riscv/Makefile.postlink create mode 100755 arch/riscv/tools/relocs_check.sh create mode 100755 scripts/relocs_check.sh -- 2.30.2
[PATCH v6 2/3] powerpc: Move script to check relocations at compile time in scripts/
Relocating kernel at runtime is done very early in the boot process, so it is not convenient to check for relocations there and react in case a relocation was not expected. Powerpc architecture has a script that allows to check at compile time for such unexpected relocations: extract the common logic to scripts/ so that other architectures can take advantage of it. Signed-off-by: Alexandre Ghiti Reviewed-by: Anup Patel --- arch/powerpc/tools/relocs_check.sh | 18 ++ scripts/relocs_check.sh| 20 2 files changed, 22 insertions(+), 16 deletions(-) create mode 100755 scripts/relocs_check.sh diff --git a/arch/powerpc/tools/relocs_check.sh b/arch/powerpc/tools/relocs_check.sh index 014e00e74d2b..e367895941ae 100755 --- a/arch/powerpc/tools/relocs_check.sh +++ b/arch/powerpc/tools/relocs_check.sh @@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then exit 1 fi -# Have Kbuild supply the path to objdump and nm so we handle cross compilation. -objdump="$1" -nm="$2" -vmlinux="$3" - -# Remove from the bad relocations those that match an undefined weak symbol -# which will result in an absolute relocation to 0. -# Weak unresolved symbols are of that form in nm output: -# " w _binary__btf_vmlinux_bin_end" -undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }') - bad_relocs=$( -$objdump -R "$vmlinux" | - # Only look at relocation lines. - grep -E '\
[PATCH v2 2/2] mm: replace contig_page_data with node_data
Replace contig_page_data with node_data. Change the definition of NODE_DATA(nid) from (&contig_page_data) to (node_data[0]). Remove contig_page_data from the tree. Cc: Mike Rapoport Cc: Baoquan He Cc: Kazu Signed-off-by: Miles Chen --- Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 - arch/powerpc/kexec/core.c | 5 - include/linux/gfp.h| 3 --- include/linux/mmzone.h | 3 +-- kernel/crash_core.c| 1 - mm/memblock.c | 2 -- 6 files changed, 1 insertion(+), 26 deletions(-) diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 3861a25faae1..74185245c580 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -81,14 +81,6 @@ into that mem_map array. Used to map an address to the corresponding struct page. -contig_page_data - - -Makedumpfile gets the pglist_data structure from this symbol, which is -used to describe the memory layout. - -User-space tools use this to exclude free pages when dumping memory. - mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map) -- @@ -531,11 +523,6 @@ node_data|(node_data, MAX_NUMNODES) See above. -contig_page_data - - -See above. - vmemmap_list diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c index 56da5eb2b923..41f31dfb540c 100644 --- a/arch/powerpc/kexec/core.c +++ b/arch/powerpc/kexec/core.c @@ -68,13 +68,8 @@ void machine_kexec_cleanup(struct kimage *image) void arch_crash_save_vmcoreinfo(void) { -#ifdef CONFIG_NEED_MULTIPLE_NODES VMCOREINFO_SYMBOL(node_data); VMCOREINFO_LENGTH(node_data, MAX_NUMNODES); -#endif -#ifndef CONFIG_NEED_MULTIPLE_NODES - VMCOREINFO_SYMBOL(contig_page_data); -#endif #if defined(CONFIG_PPC64) && defined(CONFIG_SPARSEMEM_VMEMMAP) VMCOREINFO_SYMBOL(vmemmap_list); VMCOREINFO_SYMBOL(mmu_vmemmap_psize); diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 11da8af06704..ba8c511c402f 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -493,9 +493,6 @@ static inline int gfp_zonelist(gfp_t flags) * This zone list contains a maximum of MAX_NUMNODES*MAX_NR_ZONES zones. * There are two zonelists per node, one for all zones with memory and * one containing just zones from the node the zonelist belongs to. - * - * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets - * optimized to &contig_page_data at compile-time. 
*/ static inline struct zonelist *node_zonelist(int nid, gfp_t flags) { diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 557918dcc755..c0769292187c 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1043,9 +1043,8 @@ extern char numa_zonelist_order[]; #ifndef CONFIG_NEED_MULTIPLE_NODES -extern struct pglist_data contig_page_data; -#define NODE_DATA(nid) (&contig_page_data) extern struct pglist_data *node_data[]; +#define NODE_DATA(nid) (node_data[0]) #define NODE_MEM_MAP(nid) mem_map #else /* CONFIG_NEED_MULTIPLE_NODES */ diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 825284baaf46..d1e324be67f9 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -457,7 +457,6 @@ static int __init crash_save_vmcoreinfo_init(void) #ifndef CONFIG_NEED_MULTIPLE_NODES VMCOREINFO_SYMBOL(mem_map); - VMCOREINFO_SYMBOL(contig_page_data); #endif #ifdef CONFIG_SPARSEMEM VMCOREINFO_SYMBOL_ARRAY(mem_section); diff --git a/mm/memblock.c b/mm/memblock.c index ebddb57ea62d..7cfc9a9d6243 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -93,8 +93,6 @@ */ #ifndef CONFIG_NEED_MULTIPLE_NODES -struct pglist_data __refdata contig_page_data; -EXPORT_SYMBOL(contig_page_data); struct pglist_data *node_data[MAX_NUMNODES]; #endif -- 2.18.0
[PATCH v2 0/2] mm: unify the allocation of pglist_data instances
This patches is created to fix the __pa() warning messages when CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data instances. In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y, pglist_data is allocated by a memblock API. If CONFIG_NEED_MULTIPLE_NODES=n, we use a global variable named "contig_page_data". If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set, we will have the "virt_to_phys used for non-linear address" warning when booting. To fix the warning, always allocate pglist_data by memblock APIs and remove the usage of contig_page_data. Warning message: [0.00] [ cut here ] [0.00] virt_to_phys used for non-linear address: (ptrval) (contig_page_data+0x0/0x1c00) [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x58/0x68 [0.00] Modules linked in: [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW 5.13.0-rc1-00074-g1140ab592e2e #3 [0.00] Hardware name: linux,dummy-virt (DT) [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--) [0.00] pc : __virt_to_phys+0x58/0x68 [0.00] lr : __virt_to_phys+0x54/0x68 [0.00] sp : 800011833e70 [0.00] x29: 800011833e70 x28: 418a0018 x27: [0.00] x26: 000a x25: 800011b7 x24: 800011b7 [0.00] x23: fc0001c0 x22: 800011b7 x21: 47b0 [0.00] x20: 0008 x19: 800011b082c0 x18: [0.00] x17: x16: 800011833bf9 x15: 0004 [0.00] x14: 0fff x13: 80001186a548 x12: [0.00] x11: x10: x9 : [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : 800011b62ef8 [0.00] x5 : x4 : 0001 x3 : [0.00] x2 : x1 : 80001159585e x0 : 0058 [0.00] Call trace: [0.00] __virt_to_phys+0x58/0x68 [0.00] check_usemap_section_nr+0x50/0xfc [0.00] sparse_init_nid+0x1ac/0x28c [0.00] sparse_init+0x1c4/0x1e0 [0.00] bootmem_init+0x60/0x90 [0.00] setup_arch+0x184/0x1f0 [0.00] start_kernel+0x78/0x488 [0.00] ---[ end trace f68728a0d3053b60 ]--- [1] https://lore.kernel.org/patchwork/patch/1425110/ Change since v1: - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n Miles Chen (2): mm: introduce prepare_node_data mm: replace contig_page_data with node_data Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 - arch/powerpc/kexec/core.c | 5 - include/linux/gfp.h| 3 --- include/linux/mm.h | 2 ++ include/linux/mmzone.h | 4 ++-- kernel/crash_core.c| 1 - mm/memblock.c | 3 +-- mm/page_alloc.c| 16 mm/sparse.c| 2 ++ 9 files changed, 23 insertions(+), 26 deletions(-) base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f -- 2.18.0
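For background, the warning comes from CONFIG_DEBUG_VIRTUAL's check that __pa()/virt_to_phys() is only handed addresses in the linear (direct) map; a kernel-image symbol such as contig_page_data fails that check and would need __pa_symbol() instead, while a memblock allocation passes it. Here is a toy model of the check, with made-up addresses and a made-up linear-map window, just to show why the static symbol trips it and an allocated pglist_data does not:

#include <stdint.h>
#include <stdio.h>

/* Pretend the linear map covers [LINEAR_START, LINEAR_END) and the
 * kernel image (where static symbols live) sits elsewhere. */
#define LINEAR_START 0x100000UL
#define LINEAR_END   0x200000UL

static uintptr_t fake_virt_to_phys(uintptr_t va)
{
	if (va < LINEAR_START || va >= LINEAR_END)
		fprintf(stderr,
			"virt_to_phys used for non-linear address: %#lx\n",
			(unsigned long)va);
	return va - LINEAR_START;	/* linear map: fixed offset */
}

int main(void)
{
	uintptr_t image_symbol = 0x300000;	/* like &contig_page_data */
	uintptr_t heap_alloc   = 0x140000;	/* like a memblock_alloc() result */

	fake_virt_to_phys(image_symbol);	/* warns */
	fake_virt_to_phys(heap_alloc);		/* silent */
	return 0;
}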
[PATCH v2 1/2] mm: introduce prepare_node_data
When CONFIG_NEED_MULTIPLE_NODES=y (CONFIG_NUMA=y), the pglist_data is allocated by a memblock API and stored in an array named node_data[]. When CONFIG_NEED_MULTIPLE_NODES=n (CONFIG_NUMA=n), the pglist_data is defined as global variable contig_page_data. The difference causes problems when we enable CONFIG_DEBUG_VIRTUAL and use __pa() to get the physical address of NODE_DATA. To solve the issue, introduce prepare_node_data() to allocate pglist_data when CONFIG_NUMA=n and stored it to node_data. i.e., Use the same way to allocate node_data[] when CONFIG_NUMA=y or CONFIG_NUMA=n. prepare_node_data() is called in sparer_init() and free_area_init(). This is the first step to replace contig_page_data with allocated pglist_data. Cc: Mike Rapoport Cc: Baoquan He Cc: Kazu Signed-off-by: Miles Chen --- include/linux/mm.h | 2 ++ include/linux/mmzone.h | 1 + mm/memblock.c | 1 + mm/page_alloc.c| 16 mm/sparse.c| 2 ++ 5 files changed, 22 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index c274f75efcf9..3052eeb87455 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2462,9 +2462,11 @@ static inline int early_pfn_to_nid(unsigned long pfn) { return 0; } +extern void prepare_node_data(void); #else /* please see mm/page_alloc.c */ extern int __meminit early_pfn_to_nid(unsigned long pfn); +static inline void prepare_node_data(void) {}; #endif extern void set_dma_reserve(unsigned long new_dma_reserve); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 0d53eba1c383..557918dcc755 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1045,6 +1045,7 @@ extern char numa_zonelist_order[]; extern struct pglist_data contig_page_data; #define NODE_DATA(nid) (&contig_page_data) +extern struct pglist_data *node_data[]; #define NODE_MEM_MAP(nid) mem_map #else /* CONFIG_NEED_MULTIPLE_NODES */ diff --git a/mm/memblock.c b/mm/memblock.c index afaefa8fc6ab..ebddb57ea62d 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -95,6 +95,7 @@ #ifndef CONFIG_NEED_MULTIPLE_NODES struct pglist_data __refdata contig_page_data; EXPORT_SYMBOL(contig_page_data); +struct pglist_data *node_data[MAX_NUMNODES]; #endif unsigned long max_low_pfn; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aaa1655cf682..0c6d421f4cfb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1659,6 +1659,20 @@ int __meminit early_pfn_to_nid(unsigned long pfn) return nid; } +#else +void __init prepare_node_data(void) +{ + if (node_data[0]) + return; + + node_data[0] = memblock_alloc(sizeof(struct pglist_data), + SMP_CACHE_BYTES); + + if (!node_data[0]) + panic("Cannot allocate node_data\n"); + + memset(node_data[0], 0, sizeof(struct pglist_data)); +} #endif /* CONFIG_NEED_MULTIPLE_NODES */ void __init memblock_free_pages(struct page *page, unsigned long pfn, @@ -7697,6 +7711,8 @@ void __init free_area_init(unsigned long *max_zone_pfn) int i, nid, zone; bool descending; + prepare_node_data(); + /* Record where the zone boundaries are */ memset(arch_zone_lowest_possible_pfn, 0, sizeof(arch_zone_lowest_possible_pfn)); diff --git a/mm/sparse.c b/mm/sparse.c index b2ada9dc00cb..afcfe7463b4a 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -580,6 +580,8 @@ void __init sparse_init(void) memblocks_present(); + prepare_node_data(); + pnum_begin = first_present_section_nr(); nid_begin = sparse_early_nid(__nr_to_section(pnum_begin)); -- 2.18.0
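Shape-wise, prepare_node_data() is an allocate-once helper plus the node_data[] indirection that patch 2/2 switches NODE_DATA() to. A cut-down standalone sketch of that pattern, using calloc() in place of memblock_alloc() and a one-field stand-in for pglist_data (both are simplifications, not the kernel types):

#include <stdio.h>
#include <stdlib.h>

struct pglist_data { unsigned long node_start_pfn; };

#define MAX_NUMNODES 1
static struct pglist_data *node_data[MAX_NUMNODES];
#define NODE_DATA(nid) (node_data[0])	/* !NUMA: everything is node 0 */

/* Allocate-once helper, safe to call from more than one init path
 * (the kernel version is called from both sparse_init() and
 * free_area_init(), whichever runs first wins). */
static void prepare_node_data(void)
{
	if (node_data[0])
		return;
	node_data[0] = calloc(1, sizeof(struct pglist_data));
	if (!node_data[0]) {
		fprintf(stderr, "Cannot allocate node_data\n");
		exit(1);
	}
}

int main(void)
{
	prepare_node_data();
	prepare_node_data();		/* second call is a no-op */
	NODE_DATA(0)->node_start_pfn = 0x1000;
	printf("node 0 starts at pfn %#lx\n", NODE_DATA(0)->node_start_pfn);
	free(node_data[0]);
	return 0;
}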
Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances
Hello Miles, On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote: > This patches is created to fix the __pa() warning messages when > CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data > instances. > > In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y, > pglist_data is allocated by a memblock API. If CONFIG_NEED_MULTIPLE_NODES=n, > we use a global variable named "contig_page_data". > > If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both > allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set, > we will have the "virt_to_phys used for non-linear address" warning > when booting. > > To fix the warning, always allocate pglist_data by memblock APIs and > remove the usage of contig_page_data. Somehow I was sure that we can allocate pglist_data before it is accessed in sparse_init() somewhere outside mm/sparse.c. It's really not the case and having two places that may allocated this structure is surely worth than your previous suggestion. Sorry about that. > Warning message: > [0.00] [ cut here ] > [0.00] virt_to_phys used for non-linear address: (ptrval) > (contig_page_data+0x0/0x1c00) > [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 > __virt_to_phys+0x58/0x68 > [0.00] Modules linked in: > [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW > 5.13.0-rc1-00074-g1140ab592e2e #3 > [0.00] Hardware name: linux,dummy-virt (DT) > [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--) > [0.00] pc : __virt_to_phys+0x58/0x68 > [0.00] lr : __virt_to_phys+0x54/0x68 > [0.00] sp : 800011833e70 > [0.00] x29: 800011833e70 x28: 418a0018 x27: > > [0.00] x26: 000a x25: 800011b7 x24: > 800011b7 > [0.00] x23: fc0001c0 x22: 800011b7 x21: > 47b0 > [0.00] x20: 0008 x19: 800011b082c0 x18: > > [0.00] x17: x16: 800011833bf9 x15: > 0004 > [0.00] x14: 0fff x13: 80001186a548 x12: > > [0.00] x11: x10: x9 : > > [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : > 800011b62ef8 > [0.00] x5 : x4 : 0001 x3 : > > [0.00] x2 : x1 : 80001159585e x0 : > 0058 > [0.00] Call trace: > [0.00] __virt_to_phys+0x58/0x68 > [0.00] check_usemap_section_nr+0x50/0xfc > [0.00] sparse_init_nid+0x1ac/0x28c > [0.00] sparse_init+0x1c4/0x1e0 > [0.00] bootmem_init+0x60/0x90 > [0.00] setup_arch+0x184/0x1f0 > [0.00] start_kernel+0x78/0x488 > [0.00] ---[ end trace f68728a0d3053b60 ]--- > > [1] https://lore.kernel.org/patchwork/patch/1425110/ > > Change since v1: > - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n > > Miles Chen (2): > mm: introduce prepare_node_data > mm: replace contig_page_data with node_data > > Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 - > arch/powerpc/kexec/core.c | 5 - > include/linux/gfp.h| 3 --- > include/linux/mm.h | 2 ++ > include/linux/mmzone.h | 4 ++-- > kernel/crash_core.c| 1 - > mm/memblock.c | 3 +-- > mm/page_alloc.c| 16 > mm/sparse.c| 2 ++ > 9 files changed, 23 insertions(+), 26 deletions(-) > > > base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f > -- > 2.18.0 > -- Sincerely yours, Mike.
Re: [PATCH v8 27/30] powerpc/kprobes: Don't allow breakpoints on suffixes
Le 06/05/2020 à 05:40, Jordan Niethe a écrit : Do not allow inserting breakpoints on the suffix of a prefix instruction in kprobes. Signed-off-by: Jordan Niethe --- v8: Add this back from v3 --- arch/powerpc/kernel/kprobes.c | 13 + 1 file changed, 13 insertions(+) diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index 33d54b091c70..227510df8c55 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -106,7 +106,9 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset) int arch_prepare_kprobe(struct kprobe *p) { int ret = 0; + struct kprobe *prev; struct ppc_inst insn = ppc_inst_read((struct ppc_inst *)p->addr); + struct ppc_inst prefix = ppc_inst_read((struct ppc_inst *)(p->addr - 1)); What if p->addr is the first word of a page and the previous page is not mapped ? if ((unsigned long)p->addr & 0x03) { printk("Attempt to register kprobe at an unaligned address\n"); @@ -114,6 +116,17 @@ int arch_prepare_kprobe(struct kprobe *p) } else if (IS_MTMSRD(insn) || IS_RFID(insn) || IS_RFI(insn)) { printk("Cannot register a kprobe on rfi/rfid or mtmsr[d]\n"); ret = -EINVAL; + } else if (ppc_inst_prefixed(prefix)) { If p->addr - 2 contains a valid prefixed instruction, then p->addr - 1 contains the suffix of that prefixed instruction. Are we sure a suffix can never ever be misinterpreted as the prefix of a prefixed instruction ? + printk("Cannot register a kprobe on the second word of prefixed instruction\n"); + ret = -EINVAL; + } + preempt_disable(); + prev = get_kprobe(p->addr - 1); + preempt_enable_no_resched(); + if (prev && + ppc_inst_prefixed(ppc_inst_read((struct ppc_inst *)prev->ainsn.insn))) { + printk("Cannot register a kprobe on the second word of prefixed instruction\n"); + ret = -EINVAL; } /* insn must be on a special executable page on ppc64. This is
Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote: > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote: > > On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote: > > > include/linux/sched/topology.h | 1 + > > > kernel/sched/fair.c| 101 + > > > 2 files changed, 102 insertions(+) > > > > > > diff --git a/include/linux/sched/topology.h > > > b/include/linux/sched/topology.h > > > index 8f0f778b7c91..43bdb8b1e1df 100644 > > > --- a/include/linux/sched/topology.h > > > +++ b/include/linux/sched/topology.h > > > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void) > > > #endif > > > > > > extern int arch_asym_cpu_priority(int cpu); > > > +extern bool arch_asym_check_smt_siblings(void); > > > > > > struct sched_domain_attr { > > > int relax_domain_level; > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index c8b66a5d593e..3d6cc027e6e6 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu) > > > return -cpu; > > > } > > > > > > +/* > > > + * For asym packing, first check the state of SMT siblings before > > > deciding to > > > + * pull tasks. > > > + */ > > > +bool __weak arch_asym_check_smt_siblings(void) > > > +{ > > > + return false; > > > +} > > > + > > > /* > > > * The margin used when comparing utilization with CPU capacity. > > > * > > > > > @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats > > > *sds, struct sg_lb_stats *sgs > > > if (group == sds->local) > > > return false; > > > > > > + if (arch_asym_check_smt_siblings()) > > > + return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group); > > > + > > > return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu); > > > } > > > > So I'm thinking that this is a property of having ASYM_PACKING at a core > > level, rather than some arch special. Wouldn't something like this be > > more appropriate? > > > > --- > > --- a/include/linux/sched/topology.h > > +++ b/include/linux/sched/topology.h > > @@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void) > > #endif > > > > extern int arch_asym_cpu_priority(int cpu); > > -extern bool arch_asym_check_smt_siblings(void); > > > > struct sched_domain_attr { > > int relax_domain_level; > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cp > > } > > > > /* > > - * For asym packing, first check the state of SMT siblings before deciding > > to > > - * pull tasks. > > - */ > > -bool __weak arch_asym_check_smt_siblings(void) > > -{ > > - return false; > > -} > > - > > -/* > > * The margin used when comparing utilization with CPU capacity. > > * > > * (default: ~20%) > > @@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd > > if (group == sds->local) > > return false; > > > > - if (arch_asym_check_smt_siblings()) > > + if ((sds->local->flags & SD_SHARE_CPUCAPACITY) || > > + (group->flags & SD_SHARE_CPUCAPACITY)) > > return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group); > > Thanks Peter for the quick review! This makes sense to me. The only > reason we proposed arch_asym_check_smt_siblings() is because we were > about breaking powerpc (I need to study how they set priorities for SMT, > if applicable). If you think this is not an issue I can post a > v4 with this update. As far as I can see, priorities in powerpc are set by the CPU number. However, I am not sure how CPUs are enumerated? 
If the CPUs in brackets are SMT siblings, does an enumeration look like A) [0, 1], [2, 3] or B) [0, 2], [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a core would need to be busy before a new core is used. Still, I think the issue described in the cover letter may be reproducible on powerpc as well. If CPU3 is offlined, and [0, 2] pulled tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be able to help since CPU0 has the highest priority. I am cc'ing the linuxppc list to get some feedback. Thanks and BR, Ricardo
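For reference, the default arch_asym_cpu_priority() in the hunk quoted above is simply -cpu, so "highest priority" means "lowest CPU number". The following is only a small model of that comparison (not the load-balancer code; the CPU numbers follow enumeration B above) showing why CPU1 never pulls from CPU0's group:

#include <stdbool.h>
#include <stdio.h>

/* Default arch_asym_cpu_priority(): lower CPU number = higher priority. */
static int asym_prio(int cpu)
{
	return -cpu;
}

/* sched_asym_prefer(a, b): does CPU a beat CPU b for asym packing? */
static bool prefers(int a, int b)
{
	return asym_prio(a) > asym_prio(b);
}

int main(void)
{
	/* Cores as sibling pairs [0,2] and [1,3], CPU3 offline, CPU0 and
	 * CPU2 busy, CPU1 idle.  CPU1 only pulls if it "prefers" over the
	 * busy group's leader, CPU0 - which it never does by CPU number. */
	int dst_cpu = 1, busy_group_leader = 0;

	printf("CPU%d pulls from CPU%d's group: %s\n",
	       dst_cpu, busy_group_leader,
	       prefers(dst_cpu, busy_group_leader) ? "yes" : "no");
	return 0;
}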
Re: [PATCH v8 27/30] powerpc/kprobes: Don't allow breakpoints on suffixes
On Tue, May 18, 2021 at 08:43:39PM +0200, Christophe Leroy wrote: > > > Le 06/05/2020 à 05:40, Jordan Niethe a écrit : > > Do not allow inserting breakpoints on the suffix of a prefix instruction > > in kprobes. > > > > Signed-off-by: Jordan Niethe > > --- > > v8: Add this back from v3 > > --- > > arch/powerpc/kernel/kprobes.c | 13 + > > 1 file changed, 13 insertions(+) > > > > diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c > > index 33d54b091c70..227510df8c55 100644 > > --- a/arch/powerpc/kernel/kprobes.c > > +++ b/arch/powerpc/kernel/kprobes.c > > @@ -106,7 +106,9 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, > > unsigned int offset) > > int arch_prepare_kprobe(struct kprobe *p) > > { > > int ret = 0; > > + struct kprobe *prev; > > struct ppc_inst insn = ppc_inst_read((struct ppc_inst *)p->addr); > > + struct ppc_inst prefix = ppc_inst_read((struct ppc_inst *)(p->addr - > > 1)); > > What if p->addr is the first word of a page and the previous page is not > mapped ? IIRC prefixed instructions can't straddle 64 byte boundaries (or was it 128 bytes?), much less page boundaries. > > > if ((unsigned long)p->addr & 0x03) { > > printk("Attempt to register kprobe at an unaligned address\n"); > > @@ -114,6 +116,17 @@ int arch_prepare_kprobe(struct kprobe *p) > > } else if (IS_MTMSRD(insn) || IS_RFID(insn) || IS_RFI(insn)) { > > printk("Cannot register a kprobe on rfi/rfid or mtmsr[d]\n"); > > ret = -EINVAL; > > + } else if (ppc_inst_prefixed(prefix)) { > > If p->addr - 2 contains a valid prefixed instruction, then p->addr - 1 > contains the suffix of that prefixed instruction. Are we sure a suffix can > never ever be misinterpreted as the prefix of a prefixed instruction ? > Prefixes are easy to decode, the 6 MSB are 0b01 (from memory). After some digging on the 'net: "All prefixes have the major opcode 1. A prefix will never be a valid word instruction. A suffix may be an existing word instruction or a new instruction." IOW, detecting prefixes is trivial. It's not x86... Gabriel > > > + printk("Cannot register a kprobe on the second word of prefixed > > instruction\n"); > > + ret = -EINVAL; > > + } > > + preempt_disable(); > > + prev = get_kprobe(p->addr - 1); > > + preempt_enable_no_resched(); > > + if (prev && > > + ppc_inst_prefixed(ppc_inst_read((struct ppc_inst > > *)prev->ainsn.insn))) { > > + printk("Cannot register a kprobe on the second word of prefixed > > instruction\n"); > > + ret = -EINVAL; > > } > > /* insn must be on a special executable page on ppc64. This is > >
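To make the detection concrete: in Power ISA v3.1 every prefix has primary opcode 1 (top six bits 0b000001), a prefix is never a valid word instruction by itself, and a prefixed instruction may not cross a 64-byte boundary. A minimal standalone check along those lines - not the kernel's ppc_inst helpers, and the example encodings are only illustrative:

#include <stdint.h>
#include <stdio.h>

/* Primary opcode = top 6 bits of the 32-bit instruction word. */
static unsigned int primary_opcode(uint32_t word)
{
	return word >> 26;
}

/* Power ISA v3.1: every prefix has primary opcode 1. */
static int is_prefix(uint32_t word)
{
	return primary_opcode(word) == 1;
}

int main(void)
{
	uint32_t prefix_word = 0x06000000;	/* opcode 1: a prefix */
	uint32_t addi        = 0x38630001;	/* opcode 14: addi r3,r3,1 */

	printf("%#010x: %s\n", prefix_word,
	       is_prefix(prefix_word) ? "prefix" : "word instruction");
	printf("%#010x: %s\n", addi,
	       is_prefix(addi) ? "prefix" : "word instruction");
	return 0;
}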
Re: [PATCH v5 3/9] mm/mremap: Use pmd/pud_poplulate to update page table entries
Hi Aneesh, On Thu, Apr 22, 2021 at 11:13:17AM +0530, Aneesh Kumar K.V wrote: > pmd/pud_populate is the right interface to be used to set the respective > page table entries. Some architectures like ppc64 do assume that > set_pmd/pud_at > can only be used to set a hugepage PTE. Since we are not setting up a hugepage > PTE here, use the pmd/pud_populate interface. > > Signed-off-by: Aneesh Kumar K.V > --- > mm/mremap.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/mm/mremap.c b/mm/mremap.c > index ec8f840399ed..574287f9bb39 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -26,6 +26,7 @@ > > #include > #include > +#include > > #include "internal.h" > > @@ -257,9 +258,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, > unsigned long old_addr, > pmd_clear(old_pmd); > > VM_BUG_ON(!pmd_none(*new_pmd)); > + pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd)); > > - /* Set the new pmd */ > - set_pmd_at(mm, new_addr, new_pmd, pmd); > flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE); > if (new_ptl != old_ptl) > spin_unlock(new_ptl); > @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, > unsigned long old_addr, > > VM_BUG_ON(!pud_none(*new_pud)); > > - /* Set the new pud */ > - set_pud_at(mm, new_addr, new_pud, pud); > + pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud)); > flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE); > if (new_ptl != old_ptl) > spin_unlock(new_ptl); > -- > 2.30.2 > > This commit causes my WSL2 VM to close when compiling something memory intensive, such as an x86_64_defconfig + CONFIG_LTO_CLANG_FULL=y kernel or LLVM/Clang. Unfortunately, I do not have much further information to provide since I do not see any sort of splat in dmesg right before it closes and I have found zero information about getting the previous kernel message in WSL2 (custom init so no systemd or anything). The config file is the stock one from Microsoft: https://github.com/microsoft/WSL2-Linux-Kernel/blob/a571dc8cedc8e0e56487c0dc93243e0b5db8960a/Microsoft/config-wsl I have attached my .config anyways, which includes CONFIG_DEBUG_VM, which does not appear to show anything out of the ordinary. I have also attached a dmesg just in case anything sticks out. I am happy to provide any additional information or perform additional debugging steps as needed. 
Cheers, Nathan $ git bisect log # bad: [cd557f1c605fc5a2c0eb0b540610f50dc67dd849] Add linux-next specific files for 20210514 # good: [315d99318179b9cd5077ccc9f7f26a164c9fa998] Merge tag 'pm-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm git bisect start 'cd557f1c605fc5a2c0eb0b540610f50dc67dd849' '315d99318179b9cd5077ccc9f7f26a164c9fa998' # good: [9634d7cb3c506ae886a5136d12b4af696b9cee8a] Merge remote-tracking branch 'drm-misc/for-linux-next' git bisect good 9634d7cb3c506ae886a5136d12b4af696b9cee8a # good: [294636a24ae819a7caf0807d05d8eb5b964ec06f] Merge remote-tracking branch 'rcu/rcu/next' git bisect good 294636a24ae819a7caf0807d05d8eb5b964ec06f # good: [cb753d0611f912439c8e814f4254d15fa8fa1d75] Merge remote-tracking branch 'gpio-brgl/gpio/for-next' git bisect good cb753d0611f912439c8e814f4254d15fa8fa1d75 # bad: [b1e7389449084b74a044a70860c6a1c7466781cb] lib/string_helpers: switch to use BIT() macro git bisect bad b1e7389449084b74a044a70860c6a1c7466781cb # bad: [bf5570ed0654a21000e5dad9243ea1ba30bfe208] kasan: use dump_stack_lvl(KERN_ERR) to print stacks git bisect bad bf5570ed0654a21000e5dad9243ea1ba30bfe208 # good: [4a292ff7a819404039588c7a9af272aca22c869e] fixup! mm: gup: pack has_pinned in MMF_HAS_PINNED git bisect good 4a292ff7a819404039588c7a9af272aca22c869e # good: [5ed68c90c7fb884c3c493d5529aca79dcf125848] mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock git bisect good 5ed68c90c7fb884c3c493d5529aca79dcf125848 # good: [f96ae2c1e63b71134e216e9940df3f2793a9a4b1] mm/memory.c: fix comment of finish_mkwrite_fault() git bisect good f96ae2c1e63b71134e216e9940df3f2793a9a4b1 # bad: [5b0a28a7f9f5fdc2fe5a5e2cce7ea17b98e5eaeb] mm/mremap: use range flush that does TLB and page walk cache flush git bisect bad 5b0a28a7f9f5fdc2fe5a5e2cce7ea17b98e5eaeb # bad: [dbee97d1f49a2f2f1f5c26bf15151cc998572e89] mm/mremap: use pmd/pud_poplulate to update page table entries git bisect bad dbee97d1f49a2f2f1f5c26bf15151cc998572e89 # good: [c4c8a76d96a7d38d1ec8732e3f852418d18a7424] selftest/mremap_test: avoid crash with static build git bisect good c4c8a76d96a7d38d1ec8732e3f852418d18a7424 # first bad commit: [dbee97d1f49a2f2f1f5c26bf15151cc998572e89] mm/mremap: use pmd/pud_poplulate to update page table entries # # Automatically generated file; DO NOT EDIT. # Linux/x86 5.13.0-rc2 Kernel Configuration # CONFIG_CC_VERSION_TEXT="gcc (GCC) 11.1.0" CONFIG_CC_IS_GCC=y CONFIG_GCC_VERSION=110100 CONFIG_CLANG_VERSION=0 CONFIG_AS_IS_
Re: [PATCH v2] powerpc/powernv/pci: fix header guard
On 5/18/2021 1:40 PM, Nick Desaulniers wrote: While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a possible candidate to convert to #ifdef CONFIG_EEH. It seems that based on Kconfig dependencies it's not possible to build this file without CONFIG_EEH enabled, but based on upstream discussion, it's not clear yet that CONFIG_EEH should be enabled by default. For now, simply fix the -Wundef warning. Suggested-by: Nathan Chancellor Suggested-by: Joe Perches Link: https://github.com/ClangBuiltLinux/linux/issues/570 Link: https://lore.kernel.org/lkml/67f6cd269684c9aa8463ff4812c3b4605e6739c3.ca...@perches.com/ Link: https://lore.kernel.org/lkml/CAOSf1CGoN5R0LUrU=Y=uwho1z_9slgcx8s3sbfjxwjxc5by...@mail.gmail.com/ Signed-off-by: Nick Desaulniers Makes sense, thanks for the patch! Reviewed-by: Nathan Chancellor --- arch/powerpc/platforms/powernv/pci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index b18468dc31ff..6bb3c52633fb 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -711,7 +711,7 @@ int pnv_pci_cfg_write(struct pci_dn *pdn, return PCIBIOS_SUCCESSFUL; } -#if CONFIG_EEH +#ifdef CONFIG_EEH static bool pnv_pci_cfg_check(struct pci_dn *pdn) { struct eeh_dev *edev = NULL;
Re: [PATCH] powerpc: Kconfig: disable CONFIG_COMPAT for clang < 12
On 5/18/2021 1:58 PM, Nick Desaulniers wrote: Until clang-12, clang would attempt to assemble 32b powerpc assembler in 64b emulation mode when using a 64b target triple with -m32, leading to errors during the build of the compat VDSO. Simply disable all of CONFIG_COMPAT; users should upgrade to the latest release of clang for proper support. Link: https://github.com/ClangBuiltLinux/linux/issues/1160 Link: https://github.com/llvm/llvm-project/commits/2288319733cd5f525bf7e24dece08bfcf9d0ff9e Link: https://groups.google.com/g/clang-built-linux/c/ayNmi3HoNdY/m/XJAGj_G2AgAJ Suggested-by: Nathan Chancellor Signed-off-by: Nick Desaulniers Reviewed-by: Nathan Chancellor --- arch/powerpc/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index ce3f59531b51..2a02784b7ef0 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -289,6 +289,7 @@ config PANIC_TIMEOUT config COMPAT bool "Enable support for 32bit binaries" depends on PPC64 + depends on !CC_IS_CLANG || CLANG_VERSION >= 12 default y if !CPU_LITTLE_ENDIAN select ARCH_WANT_OLD_COMPAT_IPC select COMPAT_OLD_SIGACTION
Re: Linux powerpc new system call instruction and ABI
Hi, On Thu, Jun 11, 2020 at 06:12:01PM +1000, Nicholas Piggin wrote: [...] > - Error handling: The consensus among kernel, glibc, and musl is to move to > using negative return values in r3 rather than CR0[SO]=1 to indicate error, > which matches most other architectures, and is closer to a function call. Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used. This includes syscall_get_error() and all its users including PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable when scv is used. See also https://bugzilla.redhat.com/1929836 -- ldv
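For reference, the scv convention matches what most architectures already do: a raw return value in [-4095, -1] means -errno, anything else (including "negative-looking" values such as high mmap() addresses) is success. A small userspace sketch of that decoding - the helper name and test values are made up, and MAX_ERRNO mirrors the kernel's 4095 limit:

#include <stdio.h>

/* Convention shared by most Linux architectures (and by scv on powerpc):
 * a raw return in [-4095, -1] encodes -errno, anything else is a
 * successful result -- so test against -4095, not just the sign bit. */
#define MAX_ERRNO 4095UL

static long decode_syscall_return(long raw, int *err)
{
	if ((unsigned long)raw >= -MAX_ERRNO) {
		*err = (int)-raw;
		return -1;
	}
	*err = 0;
	return raw;
}

int main(void)
{
	/* -14 is -EFAULT; -4096 stands in for a large success value such
	 * as a high mmap() address, which a plain "< 0" test would
	 * misreport as an error. */
	long vals[] = { 42, -14, -4096 };
	size_t i;

	for (i = 0; i < sizeof(vals) / sizeof(vals[0]); i++) {
		int err;
		long ret = decode_syscall_return(vals[i], &err);

		printf("raw %ld -> ret %ld, errno %d\n", vals[i], ret, err);
	}
	return 0;
}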
Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances
On Tue, 2021-05-18 at 19:09 +0300, Mike Rapoport wrote: > Hello Miles, > > On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote: > > This patches is created to fix the __pa() warning messages when > > CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data > > instances. > > > > In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y, > > pglist_data is allocated by a memblock API. If CONFIG_NEED_MULTIPLE_NODES=n, > > we use a global variable named "contig_page_data". > > > > If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both > > allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set, > > we will have the "virt_to_phys used for non-linear address" warning > > when booting. > > > > To fix the warning, always allocate pglist_data by memblock APIs and > > remove the usage of contig_page_data. > > Somehow I was sure that we can allocate pglist_data before it is accessed > in sparse_init() somewhere outside mm/sparse.c. It's really not the case > and having two places that may allocated this structure is surely worth > than your previous suggestion. > > Sorry about that. Do you mean taht to call allocation function arch/*, somewhere after paging_init() (so we can access pglist_data) and before sparse_init() and free_area_init()? Miles > > > Warning message: > > [0.00] [ cut here ] > > [0.00] virt_to_phys used for non-linear address: (ptrval) > > (contig_page_data+0x0/0x1c00) > > [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 > > __virt_to_phys+0x58/0x68 > > [0.00] Modules linked in: > > [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW > > 5.13.0-rc1-00074-g1140ab592e2e #3 > > [0.00] Hardware name: linux,dummy-virt (DT) > > [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--) > > [0.00] pc : __virt_to_phys+0x58/0x68 > > [0.00] lr : __virt_to_phys+0x54/0x68 > > [0.00] sp : 800011833e70 > > [0.00] x29: 800011833e70 x28: 418a0018 x27: > > > > [0.00] x26: 000a x25: 800011b7 x24: > > 800011b7 > > [0.00] x23: fc0001c0 x22: 800011b7 x21: > > 47b0 > > [0.00] x20: 0008 x19: 800011b082c0 x18: > > > > [0.00] x17: x16: 800011833bf9 x15: > > 0004 > > [0.00] x14: 0fff x13: 80001186a548 x12: > > > > [0.00] x11: x10: x9 : > > > > [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : > > 800011b62ef8 > > [0.00] x5 : x4 : 0001 x3 : > > > > [0.00] x2 : x1 : 80001159585e x0 : > > 0058 > > [0.00] Call trace: > > [0.00] __virt_to_phys+0x58/0x68 > > [0.00] check_usemap_section_nr+0x50/0xfc > > [0.00] sparse_init_nid+0x1ac/0x28c > > [0.00] sparse_init+0x1c4/0x1e0 > > [0.00] bootmem_init+0x60/0x90 > > [0.00] setup_arch+0x184/0x1f0 > > [0.00] start_kernel+0x78/0x488 > > [0.00] ---[ end trace f68728a0d3053b60 ]--- > > > > [1] > > https://urldefense.com/v3/__https://lore.kernel.org/patchwork/patch/1425110/__;!!CTRNKA9wMg0ARbw!x-wGFEC1wLzXho2kI1CrC2fjXNaQm5f-n0ADQyJDckCOKZHAP_q055DCSWYcQ7Zdcw$ > > > > > > Change since v1: > > - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n > > > > Miles Chen (2): > > mm: introduce prepare_node_data > > mm: replace contig_page_data with node_data > > > > Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 - > > arch/powerpc/kexec/core.c | 5 - > > include/linux/gfp.h| 3 --- > > include/linux/mm.h | 2 ++ > > include/linux/mmzone.h | 4 ++-- > > kernel/crash_core.c| 1 - > > mm/memblock.c | 3 +-- > > mm/page_alloc.c| 16 > > mm/sparse.c| 2 ++ > > 9 files changed, 23 insertions(+), 26 deletions(-) > > > > > > base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f > > -- > > 2.18.0 > > >
Re: [PATCH v5 5/9] powerpc/mm/book3s64: Update tlb flush routines to take a page walk cache flush argument
Guenter Roeck writes: > On 5/17/21 6:55 AM, Aneesh Kumar K.V wrote: >> Guenter Roeck writes: >> >>> On 5/17/21 1:40 AM, Aneesh Kumar K.V wrote: On 5/15/21 10:05 PM, Guenter Roeck wrote: > On Thu, Apr 22, 2021 at 11:13:19AM +0530, Aneesh Kumar K.V wrote: >> >> ... >> > extern void radix__local_flush_all_mm(struct mm_struct *mm); >> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h >> b/arch/powerpc/include/asm/book3s/64/tlbflush.h >> index 215973b4cb26..f9f8a3a264f7 100644 >> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h >> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h >> @@ -45,13 +45,30 @@ static inline void tlbiel_all_lpid(bool radix) >> hash__tlbiel_all(TLB_INVAL_SCOPE_LPID); >> } >> +static inline void flush_pmd_tlb_pwc_range(struct vm_area_struct *vma, > >> + unsigned long start, >> + unsigned long end, >> + bool flush_pwc) >> +{ >> + if (radix_enabled()) >> + return radix__flush_pmd_tlb_range(vma, start, end, flush_pwc); >> + return hash__flush_tlb_range(vma, start, end); > ^^ > >> +} In this specific case we won't have build errors because, static inline void hash__flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { >>> >>> Sorry, you completely lost me. >>> >>> Building parisc:allnoconfig ... failed >>> -- >>> Error log: >>> In file included from arch/parisc/include/asm/cacheflush.h:7, >>>from include/linux/highmem.h:12, >>>from include/linux/pagemap.h:11, >>>from include/linux/ksm.h:13, >>>from mm/mremap.c:14: >>> mm/mremap.c: In function 'flush_pte_tlb_pwc_range': >>> arch/parisc/include/asm/tlbflush.h:20:2: error: 'return' with a value, in >>> function returning void >> >> As replied here >> https://lore.kernel.org/mm-commits/8eedb441-a612-1ec8-8bf7-b40184de9...@linux.ibm.com/ >> >> That was the generic header change in the patch. I was commenting about the >> ppc64 specific change causing build failures. > > Ah, sorry. I wasn't aware that the following is valid C code > > void f1() > { > return f2(); > ^^ > } > > as long as f2() is void as well. Confusing, but we live and learn. It might be valid, but it's still bad IMHO. It's confusing to readers, and serves no useful purpose. cheers
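The pattern in question, boiled down to a standalone example (function names invented for illustration): both forms compile with gcc - strict ISO C objects to returning an expression from a void function, and -pedantic will warn - but the second form says the same thing without making the reader pause.

#include <stdio.h>

static void log_msg(const char *msg)
{
	fprintf(stderr, "%s\n", msg);
}

/* Style being questioned: a void function "returning" another void call.
 * gcc accepts this (add -pedantic to get a warning), but it reads as if
 * a value were being passed back. */
static void notify_terse(const char *msg)
{
	return log_msg(msg);
}

/* Equivalent, and what the thread argues for: call, then fall through. */
static void notify_plain(const char *msg)
{
	log_msg(msg);
}

int main(void)
{
	notify_terse("terse form");
	notify_plain("plain form");
	return 0;
}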
[PATCH v2] powerpc/powernv/pci: fix header guard
While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a possible candidate to convert to #ifdef CONFIG_EEH. It seems that based on Kconfig dependencies it's not possible to build this file without CONFIG_EEH enabled, but based on upstream discussion, it's not clear yet that CONFIG_EEH should be enabled by default. For now, simply fix the -Wundef warning. Suggested-by: Nathan Chancellor Suggested-by: Joe Perches Link: https://github.com/ClangBuiltLinux/linux/issues/570 Link: https://lore.kernel.org/lkml/67f6cd269684c9aa8463ff4812c3b4605e6739c3.ca...@perches.com/ Link: https://lore.kernel.org/lkml/CAOSf1CGoN5R0LUrU=Y=uwho1z_9slgcx8s3sbfjxwjxc5by...@mail.gmail.com/ Signed-off-by: Nick Desaulniers --- arch/powerpc/platforms/powernv/pci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index b18468dc31ff..6bb3c52633fb 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -711,7 +711,7 @@ int pnv_pci_cfg_write(struct pci_dn *pdn, return PCIBIOS_SUCCESSFUL; } -#if CONFIG_EEH +#ifdef CONFIG_EEH static bool pnv_pci_cfg_check(struct pci_dn *pdn) { struct eeh_dev *edev = NULL; -- 2.31.1.751.gd2f1c929bd-goog
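The distinction the patch relies on: "#if CONFIG_EEH" evaluates the macro, so an undefined CONFIG_EEH trips -Wundef and silently becomes 0, whereas "#ifdef" only asks whether the macro is defined - which matches how Kconfig emits bool options (defined to 1 when enabled, absent when disabled). A small demonstration you can build with and without -DCONFIG_EEH, plus -Wundef; the string values are just markers:

#include <stdio.h>

/* With -Wundef and CONFIG_EEH not defined, the next directive warns that
 * CONFIG_EEH is not defined and evaluates to 0. */
#if CONFIG_EEH
static const char *eeh_if = "built in (#if)";
#else
static const char *eeh_if = "left out (#if)";
#endif

/* #ifdef never warns: it only tests whether the macro exists at all,
 * which is exactly what a Kconfig bool needs. */
#ifdef CONFIG_EEH
static const char *eeh_ifdef = "built in (#ifdef)";
#else
static const char *eeh_ifdef = "left out (#ifdef)";
#endif

int main(void)
{
	printf("%s / %s\n", eeh_if, eeh_ifdef);
	return 0;
}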
Re: [PATCH v5 5/9] powerpc/mm/book3s64: Update tlb flush routines to take a page walk cache flush argument
On Wed, May 19, 2021 at 10:26:22AM +1000, Michael Ellerman wrote: > Guenter Roeck writes: > > Ah, sorry. I wasn't aware that the following is valid C code > > > > void f1() > > { > > return f2(); > > ^^ > > } > > > > as long as f2() is void as well. Confusing, but we live and learn. > > It might be valid, but it's still bad IMHO. > > It's confusing to readers, and serves no useful purpose. And it actually explicitly is undefined behaviour in C90 already (3.6.6.4 in C90, 6.8.6.4 in C99 and later). Segher
[PATCH] powerpc: Kconfig: disable CONFIG_COMPAT for clang < 12
Until clang-12, clang would attempt to assemble 32b powerpc assembler in 64b emulation mode when using a 64b target triple with -m32, leading to errors during the build of the compat VDSO. Simply disable all of CONFIG_COMPAT; users should upgrade to the latest release of clang for proper support. Link: https://github.com/ClangBuiltLinux/linux/issues/1160 Link: https://github.com/llvm/llvm-project/commits/2288319733cd5f525bf7e24dece08bfcf9d0ff9e Link: https://groups.google.com/g/clang-built-linux/c/ayNmi3HoNdY/m/XJAGj_G2AgAJ Suggested-by: Nathan Chancellor Signed-off-by: Nick Desaulniers --- arch/powerpc/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index ce3f59531b51..2a02784b7ef0 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -289,6 +289,7 @@ config PANIC_TIMEOUT config COMPAT bool "Enable support for 32bit binaries" depends on PPC64 + depends on !CC_IS_CLANG || CLANG_VERSION >= 12 default y if !CPU_LITTLE_ENDIAN select ARCH_WANT_OLD_COMPAT_IPC select COMPAT_OLD_SIGACTION -- 2.31.1.751.gd2f1c929bd-goog
Re: [PATCH v5 5/9] powerpc/mm/book3s64: Update tlb flush routines to take a page walk cache flush argument
On 5/18/21 5:26 PM, Michael Ellerman wrote:
[ ... ]
That was the generic header change in the patch. I was commenting about the ppc64 specific change causing build failures.
Ah, sorry. I wasn't aware that the following is valid C code
void f1()
{
	return f2();
	       ^^
}
as long as f2() is void as well. Confusing, but we live and learn.
It might be valid, but it's still bad IMHO.
It's confusing to readers, and serves no useful purpose.

Agreed, but it is surprisingly wide-spread. Try to run the coccinelle
script below, just for fun. The script doesn't even catch instances in
include files, yet there are more than 450 hits.

Guenter

---
virtual report

@d@
identifier f;
expression e;
position p;
@@

void f(...)
{ <...
return e@p;
...> }

@script:python depends on report@
f << d.f;
p << d.p;
@@

print "void function %s:%s() with non-void return in line %s" % (p[0].file, f, p[0].line)
Re: [PATCH v6 2/3] powerpc: Move script to check relocations at compile time in scripts/
Alexandre Ghiti writes: > Relocating kernel at runtime is done very early in the boot process, so > it is not convenient to check for relocations there and react in case a > relocation was not expected. > > Powerpc architecture has a script that allows to check at compile time > for such unexpected relocations: extract the common logic to scripts/ > so that other architectures can take advantage of it. > > Signed-off-by: Alexandre Ghiti > Reviewed-by: Anup Patel > --- > arch/powerpc/tools/relocs_check.sh | 18 ++ > scripts/relocs_check.sh| 20 > 2 files changed, 22 insertions(+), 16 deletions(-) > create mode 100755 scripts/relocs_check.sh I'm not sure that script is really big/complicated enough to warrant sharing vs just copying, but I don't mind either. Acked-by: Michael Ellerman (powerpc) cheers > diff --git a/arch/powerpc/tools/relocs_check.sh > b/arch/powerpc/tools/relocs_check.sh > index 014e00e74d2b..e367895941ae 100755 > --- a/arch/powerpc/tools/relocs_check.sh > +++ b/arch/powerpc/tools/relocs_check.sh > @@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then > exit 1 > fi > > -# Have Kbuild supply the path to objdump and nm so we handle cross > compilation. > -objdump="$1" > -nm="$2" > -vmlinux="$3" > - > -# Remove from the bad relocations those that match an undefined weak symbol > -# which will result in an absolute relocation to 0. > -# Weak unresolved symbols are of that form in nm output: > -# " w _binary__btf_vmlinux_bin_end" > -undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }') > - > bad_relocs=$( > -$objdump -R "$vmlinux" | > - # Only look at relocation lines. > - grep -E '\ +${srctree}/scripts/relocs_check.sh "$@" | > # These relocations are okay > # On PPC64: > # R_PPC64_RELATIVE, R_PPC64_NONE > @@ -43,8 +30,7 @@ R_PPC_ADDR16_LO > R_PPC_ADDR16_HI > R_PPC_ADDR16_HA > R_PPC_RELATIVE > -R_PPC_NONE' | > - ([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || > cat) > +R_PPC_NONE' > ) > > if [ -z "$bad_relocs" ]; then > diff --git a/scripts/relocs_check.sh b/scripts/relocs_check.sh > new file mode 100755 > index ..137c660499f3 > --- /dev/null > +++ b/scripts/relocs_check.sh > @@ -0,0 +1,20 @@ > +#!/bin/sh > +# SPDX-License-Identifier: GPL-2.0-or-later > + > +# Get a list of all the relocations, remove from it the relocations > +# that are known to be legitimate and return this list to arch specific > +# script that will look for suspicious relocations. > + > +objdump="$1" > +nm="$2" > +vmlinux="$3" > + > +# Remove from the possible bad relocations those that match an undefined > +# weak symbol which will result in an absolute relocation to 0. > +# Weak unresolved symbols are of that form in nm output: > +# " w _binary__btf_vmlinux_bin_end" > +undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }') > + > +$objdump -R "$vmlinux" | > + grep -E '\ + ([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || > cat) > -- > 2.30.2 > > > ___ > linux-riscv mailing list > linux-ri...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv
[Bug 213069] kernel BUG at arch/powerpc/include/asm/book3s/64/hash-4k.h:147! Oops: Exception in kernel mode, sig: 5 [#1]
https://bugzilla.kernel.org/show_bug.cgi?id=213069

Michael Ellerman (mich...@ellerman.id.au) changed:

           What      |Removed    |Added
           Status    |ASSIGNED   |RESOLVED
           Resolution|---        |CODE_FIX

--
You may reply to this email to add a comment.
You are receiving this mail because:
You are watching the assignee of the bug.
[PATCH net-next] ibmveth: fix kobj_to_dev.cocci warnings
Use kobj_to_dev() instead of container_of()

Generated by: scripts/coccinelle/api/kobj_to_dev.cocci

Signed-off-by: YueHaibing
---
 drivers/net/ethernet/ibm/ibmveth.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 7fea9ae60f13..bc67a7ee872b 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1799,8 +1799,7 @@ static ssize_t veth_pool_store(struct kobject *kobj, struct attribute *attr,
 	struct ibmveth_buff_pool *pool = container_of(kobj,
 						      struct ibmveth_buff_pool,
 						      kobj);
-	struct net_device *netdev = dev_get_drvdata(
-			container_of(kobj->parent, struct device, kobj));
+	struct net_device *netdev = dev_get_drvdata(kobj_to_dev(kobj->parent));
 	struct ibmveth_adapter *adapter = netdev_priv(netdev);
 	long value = simple_strtol(buf, NULL, 10);
 	long rc;
-- 
2.17.1
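For context, kobj_to_dev() is just a typed container_of() from an embedded kobject back to its struct device, so the old and new expressions in the diff are equivalent. A userspace sketch with stub types (simplified stand-ins, not the kernel's definitions):

#include <stddef.h>
#include <stdio.h>

struct kobject { const char *name; };
struct device  { int id; struct kobject kobj; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct device *kobj_to_dev(struct kobject *kobj)
{
	return container_of(kobj, struct device, kobj);
}

int main(void)
{
	struct device dev = { .id = 42, .kobj = { .name = "pool0" } };
	struct kobject *kobj = &dev.kobj;

	/* Both expressions recover the same struct device. */
	printf("%d %d\n",
	       container_of(kobj, struct device, kobj)->id,
	       kobj_to_dev(kobj)->id);
	return 0;
}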
Re: Linux powerpc new system call instruction and ABI
Excerpts from Dmitry V. Levin's message of May 19, 2021 9:13 am: > Hi, > > On Thu, Jun 11, 2020 at 06:12:01PM +1000, Nicholas Piggin wrote: > [...] >> - Error handling: The consensus among kernel, glibc, and musl is to move to >> using negative return values in r3 rather than CR0[SO]=1 to indicate error, >> which matches most other architectures, and is closer to a function call. > > Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was > incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and > arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used. > This includes syscall_get_error() and all its users including > PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable > when scv is used. > > See also https://bugzilla.redhat.com/1929836 I see, thanks. Using latest strace from github.com, the attached kernel patch makes strace -k check results a lot greener. Some of the remaining failing tests look like this (I didn't look at all of them yet): signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL) write(1, "signal(SIGUSR1, 0xfacefeeddeadbe"..., 50signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL) ) = 50 signal(SIGUSR1, SIG_IGN)= 0xfacefeeddeadbeef write(2, "errno2name.c:461: unknown errno "..., 41errno2name.c:461: unknown errno 559038737) = 41 write(2, ": Unknown error 559038737\n", 26: Unknown error 559038737 ) = 26 exit_group(1) = ? I think the problem is glibc testing for -ve, but it should be comparing against -4095 (+cc Matheus) #define RET_SCV \ cmpdi r3,0; \ bgelr+; \ neg r3,r3; With this patch, I think the ptrace ABI should mostly be fixed. I think a problem remains with applications that look at system call return registers directly and have powerpc specific error cases. Those probably will just need to be updated unfortunately. Michael thought it might be possible to return an indication via ptrace somehow that the syscall is using a new ABI, so such apps can be updated to test for it. I don't know how that would be done. 
Thanks, Nick -- diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h index 9c9ab2746168..b476a685f066 100644 --- a/arch/powerpc/include/asm/ptrace.h +++ b/arch/powerpc/include/asm/ptrace.h @@ -19,6 +19,7 @@ #ifndef _ASM_POWERPC_PTRACE_H #define _ASM_POWERPC_PTRACE_H +#include #include #include @@ -152,25 +153,6 @@ extern unsigned long profile_pc(struct pt_regs *regs); long do_syscall_trace_enter(struct pt_regs *regs); void do_syscall_trace_leave(struct pt_regs *regs); -#define kernel_stack_pointer(regs) ((regs)->gpr[1]) -static inline int is_syscall_success(struct pt_regs *regs) -{ - return !(regs->ccr & 0x1000); -} - -static inline long regs_return_value(struct pt_regs *regs) -{ - if (is_syscall_success(regs)) - return regs->gpr[3]; - else - return -regs->gpr[3]; -} - -static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc) -{ - regs->gpr[3] = rc; -} - #ifdef __powerpc64__ #define user_mode(regs) regs)->msr) >> MSR_PR_LG) & 0x1) #else @@ -235,6 +217,31 @@ static __always_inline void set_trap_norestart(struct pt_regs *regs) regs->trap |= 0x1; } +#define kernel_stack_pointer(regs) ((regs)->gpr[1]) +static inline int is_syscall_success(struct pt_regs *regs) +{ + if (trap_is_scv(regs)) + return !IS_ERR_VALUE((unsigned long)regs->gpr[3]); + else + return !(regs->ccr & 0x1000); +} + +static inline long regs_return_value(struct pt_regs *regs) +{ + if (trap_is_scv(regs)) + return regs->gpr[3]; + + if (is_syscall_success(regs)) + return regs->gpr[3]; + else + return -regs->gpr[3]; +} + +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc) +{ + regs->gpr[3] = rc; +} + #define arch_has_single_step() (1) #define arch_has_block_step() (true) #define ARCH_HAS_USER_SINGLE_STEP_REPORT diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index fd1b518eed17..e8b40149bf7e 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -41,11 +41,20 @@ static inline void syscall_rollback(struct task_struct *task, static inline long syscall_get_error(struct task_struct *task, struct pt_regs *regs) { - /* -* If the system call failed, -* regs->gpr[3] contains a positive ERRORCODE. -*/ - return (regs->ccr & 0x1000UL) ? -regs->gpr[3] : 0; + if (trap_is_scv(regs)) { + unsigned long error = regs->gpr[3]; + + if (task_is_32bit(task)) + error = (long)(int)error; + + return IS_ERR_VALUE(error) ? error : 0; + } else { + /* +* If the system cal
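The diff adds a trap_is_scv() branch to each of the ccr-based helpers: sc flags an error via CR0[SO] with a positive errno in r3, while scv returns a negative errno in r3, where "error" means a value in the -4095..-1 range. A rough userspace sketch of the two conventions (a stub struct, not the kernel's pt_regs; 0x10000000 is the CR0[SO] bit; 64-bit build assumed):

#include <stdbool.h>
#include <stdio.h>

#define MAX_ERRNO	4095UL
/* Error values live in the top 4095 values of the unsigned range. */
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* Stub, not the kernel's pt_regs. */
struct regs {
	unsigned long gpr3;	/* syscall return register r3 */
	unsigned long ccr;	/* condition register image */
	bool scv;		/* entered via scv rather than sc? */
};

/* 0 on success, negative errno on failure, mirroring syscall_get_error(). */
static long syscall_error(const struct regs *r)
{
	if (r->scv)	/* scv: negative errno directly in r3 */
		return IS_ERR_VALUE(r->gpr3) ? (long)r->gpr3 : 0;

	/* sc: CR0[SO] set means r3 holds a positive errno */
	return (r->ccr & 0x10000000UL) ? -(long)r->gpr3 : 0;
}

int main(void)
{
	struct regs sc_fail  = { .gpr3 = 14, .ccr = 0x10000000UL, .scv = false };
	struct regs scv_fail = { .gpr3 = (unsigned long)-14, .scv = true };
	struct regs scv_ok   = { .gpr3 = 0xfacefeeddeadbeefUL, .scv = true };

	printf("%ld %ld %ld\n", syscall_error(&sc_fail),
	       syscall_error(&scv_fail), syscall_error(&scv_ok));
	return 0;
}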
Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances
On Wed, May 19, 2021 at 08:12:06AM +0800, Miles Chen wrote: > On Tue, 2021-05-18 at 19:09 +0300, Mike Rapoport wrote: > > Hello Miles, > > > > On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote: > > > This patches is created to fix the __pa() warning messages when > > > CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data > > > instances. > > > > > > In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y, > > > pglist_data is allocated by a memblock API. If > > > CONFIG_NEED_MULTIPLE_NODES=n, > > > we use a global variable named "contig_page_data". > > > > > > If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both > > > allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set, > > > we will have the "virt_to_phys used for non-linear address" warning > > > when booting. > > > > > > To fix the warning, always allocate pglist_data by memblock APIs and > > > remove the usage of contig_page_data. > > > > Somehow I was sure that we can allocate pglist_data before it is accessed > > in sparse_init() somewhere outside mm/sparse.c. It's really not the case > > and having two places that may allocated this structure is surely worth > > than your previous suggestion. > > > > Sorry about that. > > Do you mean taht to call allocation function arch/*, somewhere after > paging_init() (so we can access pglist_data) and before sparse_init() > and free_area_init()? No, I meant that your original patch is better than adding allocation of NODE_DATA(0) in two places. > Miles > > > > > > Warning message: > > > [0.00] [ cut here ] > > > [0.00] virt_to_phys used for non-linear address: (ptrval) > > > (contig_page_data+0x0/0x1c00) > > > [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 > > > __virt_to_phys+0x58/0x68 > > > [0.00] Modules linked in: > > > [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW > > > 5.13.0-rc1-00074-g1140ab592e2e #3 > > > [0.00] Hardware name: linux,dummy-virt (DT) > > > [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--) > > > [0.00] pc : __virt_to_phys+0x58/0x68 > > > [0.00] lr : __virt_to_phys+0x54/0x68 > > > [0.00] sp : 800011833e70 > > > [0.00] x29: 800011833e70 x28: 418a0018 x27: > > > > > > [0.00] x26: 000a x25: 800011b7 x24: > > > 800011b7 > > > [0.00] x23: fc0001c0 x22: 800011b7 x21: > > > 47b0 > > > [0.00] x20: 0008 x19: 800011b082c0 x18: > > > > > > [0.00] x17: x16: 800011833bf9 x15: > > > 0004 > > > [0.00] x14: 0fff x13: 80001186a548 x12: > > > > > > [0.00] x11: x10: x9 : > > > > > > [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : > > > 800011b62ef8 > > > [0.00] x5 : x4 : 0001 x3 : > > > > > > [0.00] x2 : x1 : 80001159585e x0 : > > > 0058 > > > [0.00] Call trace: > > > [0.00] __virt_to_phys+0x58/0x68 > > > [0.00] check_usemap_section_nr+0x50/0xfc > > > [0.00] sparse_init_nid+0x1ac/0x28c > > > [0.00] sparse_init+0x1c4/0x1e0 > > > [0.00] bootmem_init+0x60/0x90 > > > [0.00] setup_arch+0x184/0x1f0 > > > [0.00] start_kernel+0x78/0x488 > > > [0.00] ---[ end trace f68728a0d3053b60 ]--- > > > > > > [1] > > > https://urldefense.com/v3/__https://lore.kernel.org/patchwork/patch/1425110/__;!!CTRNKA9wMg0ARbw!x-wGFEC1wLzXho2kI1CrC2fjXNaQm5f-n0ADQyJDckCOKZHAP_q055DCSWYcQ7Zdcw$ > > > > > > > > > Change since v1: > > > - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n > > > > > > Miles Chen (2): > > > mm: introduce prepare_node_data > > > mm: replace contig_page_data with node_data > > > > > > Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 - > > > arch/powerpc/kexec/core.c | 5 - > > > include/linux/gfp.h| 3 
--- > > > include/linux/mm.h | 2 ++ > > > include/linux/mmzone.h | 4 ++-- > > > kernel/crash_core.c| 1 - > > > mm/memblock.c | 3 +-- > > > mm/page_alloc.c| 16 > > > mm/sparse.c| 2 ++ > > > 9 files changed, 23 insertions(+), 26 deletions(-) > > > > > > > > > base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f > > > -- > > > 2.18.0 > > > > > > -- Sincerely yours, Mike.
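A rough sketch of the direction discussed in this thread, using a hypothetical helper name rather than the actual patch: allocate node 0's pglist_data from memblock so NODE_DATA(0) always refers to a linear-map address and __pa()/virt_to_phys() have nothing to warn about under CONFIG_DEBUG_VIRTUAL. It assumes NODE_DATA(0) becomes an assignable pointer, as the series proposes.

#include <linux/cache.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/memblock.h>
#include <linux/mmzone.h>

/* Hypothetical helper, not the actual patch: replace the statically
 * linked contig_page_data with a memblock allocation for node 0. */
static void __init alloc_node0_pgdat(void)
{
	pg_data_t *pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);

	if (!pgdat)
		panic("%s: failed to allocate %zu bytes\n",
		      __func__, sizeof(*pgdat));

	/* A linear-map address now, so __pa(NODE_DATA(0)) is well defined. */
	NODE_DATA(0) = pgdat;
}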
Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances
On Wed, 2021-05-19 at 06:48 +0300, Mike Rapoport wrote: > On Wed, May 19, 2021 at 08:12:06AM +0800, Miles Chen wrote: > > On Tue, 2021-05-18 at 19:09 +0300, Mike Rapoport wrote: > > > Hello Miles, > > > > > > On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote: > > > > This patches is created to fix the __pa() warning messages when > > > > CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data > > > > instances. > > > > > > > > In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y, > > > > pglist_data is allocated by a memblock API. If > > > > CONFIG_NEED_MULTIPLE_NODES=n, > > > > we use a global variable named "contig_page_data". > > > > > > > > If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both > > > > allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set, > > > > we will have the "virt_to_phys used for non-linear address" warning > > > > when booting. > > > > > > > > To fix the warning, always allocate pglist_data by memblock APIs and > > > > remove the usage of contig_page_data. > > > > > > Somehow I was sure that we can allocate pglist_data before it is accessed > > > in sparse_init() somewhere outside mm/sparse.c. It's really not the case > > > and having two places that may allocated this structure is surely worth > > > than your previous suggestion. > > > > > > Sorry about that. > > > > Do you mean taht to call allocation function arch/*, somewhere after > > paging_init() (so we can access pglist_data) and before sparse_init() > > and free_area_init()? > > No, I meant that your original patch is better than adding allocation of > NODE_DATA(0) in two places. Got it. will you re-review the original patch? > > > Miles > > > > > > > > > Warning message: > > > > [0.00] [ cut here ] > > > > [0.00] virt_to_phys used for non-linear address: > > > > (ptrval) (contig_page_data+0x0/0x1c00) > > > > [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 > > > > __virt_to_phys+0x58/0x68 > > > > [0.00] Modules linked in: > > > > [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW > > > > 5.13.0-rc1-00074-g1140ab592e2e #3 > > > > [0.00] Hardware name: linux,dummy-virt (DT) > > > > [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--) > > > > [0.00] pc : __virt_to_phys+0x58/0x68 > > > > [0.00] lr : __virt_to_phys+0x54/0x68 > > > > [0.00] sp : 800011833e70 > > > > [0.00] x29: 800011833e70 x28: 418a0018 x27: > > > > > > > > [0.00] x26: 000a x25: 800011b7 x24: > > > > 800011b7 > > > > [0.00] x23: fc0001c0 x22: 800011b7 x21: > > > > 47b0 > > > > [0.00] x20: 0008 x19: 800011b082c0 x18: > > > > > > > > [0.00] x17: x16: 800011833bf9 x15: > > > > 0004 > > > > [0.00] x14: 0fff x13: 80001186a548 x12: > > > > > > > > [0.00] x11: x10: x9 : > > > > > > > > [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : > > > > 800011b62ef8 > > > > [0.00] x5 : x4 : 0001 x3 : > > > > > > > > [0.00] x2 : x1 : 80001159585e x0 : > > > > 0058 > > > > [0.00] Call trace: > > > > [0.00] __virt_to_phys+0x58/0x68 > > > > [0.00] check_usemap_section_nr+0x50/0xfc > > > > [0.00] sparse_init_nid+0x1ac/0x28c > > > > [0.00] sparse_init+0x1c4/0x1e0 > > > > [0.00] bootmem_init+0x60/0x90 > > > > [0.00] setup_arch+0x184/0x1f0 > > > > [0.00] start_kernel+0x78/0x488 > > > > [0.00] ---[ end trace f68728a0d3053b60 ]--- > > > > > > > > [1] > > > > https://urldefense.com/v3/__https://lore.kernel.org/patchwork/patch/1425110/__;!!CTRNKA9wMg0ARbw!x-wGFEC1wLzXho2kI1CrC2fjXNaQm5f-n0ADQyJDckCOKZHAP_q055DCSWYcQ7Zdcw$ > > > > > > > > > > > > Change since v1: > > > > - use memblock_alloc() to create 
pglist_data when CONFIG_NUMA=n > > > > > > > > Miles Chen (2): > > > > mm: introduce prepare_node_data > > > > mm: replace contig_page_data with node_data > > > > > > > > Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 - > > > > arch/powerpc/kexec/core.c | 5 - > > > > include/linux/gfp.h| 3 --- > > > > include/linux/mm.h | 2 ++ > > > > include/linux/mmzone.h | 4 ++-- > > > > kernel/crash_core.c| 1 - > > > > mm/memblock.c | 3 +-- > > > > mm/page_alloc.c| 16 > > > > mm/sparse.c| 2 ++ > > > > 9 files
Re: [PATCH v5 3/9] mm/mremap: Use pmd/pud_poplulate to update page table entries
Nathan Chancellor writes: > Hi Aneesh, > > On Thu, Apr 22, 2021 at 11:13:17AM +0530, Aneesh Kumar K.V wrote: >> pmd/pud_populate is the right interface to be used to set the respective >> page table entries. Some architectures like ppc64 do assume that >> set_pmd/pud_at >> can only be used to set a hugepage PTE. Since we are not setting up a >> hugepage >> PTE here, use the pmd/pud_populate interface. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> mm/mremap.c | 7 +++ >> 1 file changed, 3 insertions(+), 4 deletions(-) >> >> diff --git a/mm/mremap.c b/mm/mremap.c >> index ec8f840399ed..574287f9bb39 100644 >> --- a/mm/mremap.c >> +++ b/mm/mremap.c >> @@ -26,6 +26,7 @@ >> >> #include >> #include >> +#include >> >> #include "internal.h" >> >> @@ -257,9 +258,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, >> unsigned long old_addr, >> pmd_clear(old_pmd); >> >> VM_BUG_ON(!pmd_none(*new_pmd)); >> +pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd)); >> >> -/* Set the new pmd */ >> -set_pmd_at(mm, new_addr, new_pmd, pmd); >> flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE); >> if (new_ptl != old_ptl) >> spin_unlock(new_ptl); >> @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, >> unsigned long old_addr, >> >> VM_BUG_ON(!pud_none(*new_pud)); >> >> -/* Set the new pud */ >> -set_pud_at(mm, new_addr, new_pud, pud); >> +pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud)); >> flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE); >> if (new_ptl != old_ptl) >> spin_unlock(new_ptl); >> -- >> 2.30.2 >> >> > > This commit causes my WSL2 VM to close when compiling something memory > intensive, such as an x86_64_defconfig + CONFIG_LTO_CLANG_FULL=y kernel > or LLVM/Clang. Unfortunately, I do not have much further information to > provide since I do not see any sort of splat in dmesg right before it > closes and I have found zero information about getting the previous > kernel message in WSL2 (custom init so no systemd or anything). > > The config file is the stock one from Microsoft: > > https://github.com/microsoft/WSL2-Linux-Kernel/blob/a571dc8cedc8e0e56487c0dc93243e0b5db8960a/Microsoft/config-wsl > > I have attached my .config anyways, which includes CONFIG_DEBUG_VM, > which does not appear to show anything out of the ordinary. I have also > attached a dmesg just in case anything sticks out. I am happy to provide > any additional information or perform additional debugging steps as > needed. > Can you try this change? modified mm/mremap.c @@ -279,7 +279,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, pmd_clear(old_pmd); VM_BUG_ON(!pmd_none(*new_pmd)); - pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd)); + pmd_populate(mm, new_pmd, pmd_pgtable(pmd)); if (new_ptl != old_ptl) spin_unlock(new_ptl);
Re: Linux powerpc new system call instruction and ABI
Excerpts from Nicholas Piggin's message of May 19, 2021 12:50 pm: > Excerpts from Dmitry V. Levin's message of May 19, 2021 9:13 am: >> Hi, >> >> On Thu, Jun 11, 2020 at 06:12:01PM +1000, Nicholas Piggin wrote: >> [...] >>> - Error handling: The consensus among kernel, glibc, and musl is to move to >>> using negative return values in r3 rather than CR0[SO]=1 to indicate >>> error, >>> which matches most other architectures, and is closer to a function call. >> >> Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was >> incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and >> arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used. >> This includes syscall_get_error() and all its users including >> PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable >> when scv is used. >> >> See also https://bugzilla.redhat.com/1929836 > > I see, thanks. Using latest strace from github.com, the attached kernel > patch makes strace -k check results a lot greener. > > Some of the remaining failing tests look like this (I didn't look at all > of them yet): > > signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL) > write(1, "signal(SIGUSR1, 0xfacefeeddeadbe"..., 50signal(SIGUSR1, > 0xfacefeeddeadbeef) = 0 (SIG_DFL) > ) = 50 > signal(SIGUSR1, SIG_IGN)= 0xfacefeeddeadbeef > write(2, "errno2name.c:461: unknown errno "..., 41errno2name.c:461: unknown > errno 559038737) = 41 > write(2, ": Unknown error 559038737\n", 26: Unknown error 559038737 > ) = 26 > exit_group(1) = ? > > I think the problem is glibc testing for -ve, but it should be comparing > against -4095 (+cc Matheus) > > #define RET_SCV \ > cmpdi r3,0; \ > bgelr+; \ > neg r3,r3; This glibc patch at least gets that signal test working. Haven't run the full suite yet because of trouble making it work with a local glibc install... Thanks, Nick --- diff --git a/sysdeps/powerpc/powerpc64/sysdep.h b/sysdeps/powerpc/powerpc64/sysdep.h index c57bb1c05d..1ea4c3b917 100644 --- a/sysdeps/powerpc/powerpc64/sysdep.h +++ b/sysdeps/powerpc/powerpc64/sysdep.h @@ -398,8 +398,9 @@ LT_LABELSUFFIX(name,_name_end): ; \ #endif #define RET_SCV \ -cmpdi r3,0; \ -bgelr+; \ +li r9,-4095; \ +cmpld r3,r9; \ +bltlr+; \ neg r3,r3; #define RET_SC \
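In C terms the fix encodes the usual Linux convention that only raw return values in the range [-4095, -1] denote errors; a plain "is it negative" test misclassifies large successful returns such as the 0xfacefeeddeadbeef handler value in the failing signal test. A userspace sketch (not glibc's actual syscall wrapper macros; 64-bit build assumed):

#include <stdio.h>

#define MAX_ERRNO 4095UL

/* Convert a raw scv return value into the libc -1/errno convention. */
static long scv_ret_to_libc(unsigned long r3, int *err)
{
	if (r3 >= -MAX_ERRNO) {		/* r3 in [-4095, -1]: an error */
		*err = (int)-(long)r3;	/* positive errno */
		return -1;
	}
	*err = 0;
	return (long)r3;		/* success: raw value passed through */
}

int main(void)
{
	int err;

	long a = scv_ret_to_libc((unsigned long)-14, &err);	/* EFAULT */
	printf("ret=%ld errno=%d\n", a, err);

	long b = scv_ret_to_libc(0xfacefeeddeadbeefUL, &err);	/* success */
	printf("ret=%ld errno=%d\n", b, err);
	return 0;
}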
Re: [PATCH net-next] ibmveth: fix kobj_to_dev.cocci warnings
> On May 18, 2021, at 9:28 PM, YueHaibing wrote:
> 
> Use kobj_to_dev() instead of container_of()
> 
> Generated by: scripts/coccinelle/api/kobj_to_dev.cocci
> 
> Signed-off-by: YueHaibing
> ---

Acked-by: Lijun Pan

> drivers/net/ethernet/ibm/ibmveth.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c
> b/drivers/net/ethernet/ibm/ibmveth.c
> index 7fea9ae60f13..bc67a7ee872b 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1799,8 +1799,7 @@ static ssize_t veth_pool_store(struct kobject *kobj,
> struct attribute *attr,
> 	struct ibmveth_buff_pool *pool = container_of(kobj,
> 						      struct ibmveth_buff_pool,
> 						      kobj);
> -	struct net_device *netdev = dev_get_drvdata(
> -		container_of(kobj->parent, struct device, kobj));
> +	struct net_device *netdev = dev_get_drvdata(kobj_to_dev(kobj->parent));
> 	struct ibmveth_adapter *adapter = netdev_priv(netdev);
> 	long value = simple_strtol(buf, NULL, 10);
> 	long rc;
> -- 
> 2.17.1
> 