[PATCH v2] powernv/opal-prd: Silence memcpy() run-time false positive warnings
opal_prd_msg_notifier extracts the OPAL PRD message size from the message
header and uses it to allocate an opal_prd_msg_queue_item sized for the
message to be copied. However, when running with CONFIG_FORTIFY_SOURCE=y,
it triggers the following run-time warning:

[ 6458.234352] memcpy: detected field-spanning write (size 32) of single field "&item->msg" at arch/powerpc/platforms/powernv/opal-prd.c:355 (size 4)
[ 6458.234390] WARNING: CPU: 9 PID: 660 at arch/powerpc/platforms/powernv/opal-prd.c:355 opal_prd_msg_notifier+0x174/0x188 [opal_prd]
[...]
[ 6458.234709] NIP [c0080e0c0e6c] opal_prd_msg_notifier+0x174/0x188 [opal_prd]
[ 6458.234723] LR [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 [opal_prd]
[ 6458.234736] Call Trace:
[ 6458.234742] [c002acb23c10] [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 [opal_prd] (unreliable)
[ 6458.234759] [c002acb23ca0] [c019ccc0] notifier_call_chain+0xc0/0x1b0
[ 6458.234774] [c002acb23d00] [c019ceac] atomic_notifier_call_chain+0x2c/0x40
[ 6458.234788] [c002acb23d20] [c00d69b4] opal_message_notify+0xf4/0x2c0
[...]

Split the copy to avoid the false positive run-time warning.

Reported-by: Aneesh Kumar K.V
Signed-off-by: Mahesh Salgaonkar
---
Change from v1:
- Rework the memcpy to copy the message header and the rest of the message
  separately, without adding a flex array.
---
 arch/powerpc/platforms/powernv/opal-prd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c b/arch/powerpc/platforms/powernv/opal-prd.c
index 113bdb151f687..a068cbc4d43c0 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -352,7 +352,9 @@ static int opal_prd_msg_notifier(struct notifier_block *nb,
 	if (!item)
 		return -ENOMEM;
 
-	memcpy(&item->msg, msg->params, msg_size);
+	item->msg = *hdr;
+	hdr++;
+	memcpy((char *)&item->msg + sizeof(*hdr), hdr, msg_size - sizeof(*hdr));
 
 	spin_lock_irqsave(&opal_prd_msg_queue_lock, flags);
 	list_add_tail(&item->list, &opal_prd_msg_queue);
Re: [PATCH] powerpc64/kasan: Call kasan_early_init() after PACA initialised
On 07/07/2023 at 04:20, Benjamin Gray wrote:
> The inclusion of kasan_early_init() is conditional on CONFIG_KASAN, so
>
>	if (IS_ENABLED(CONFIG_KASAN))
>		kasan_early_init();
>
> wouldn't work.

See my comment on the main patch: just do like kasan_init() and others in
arch/powerpc/include/asm/kasan.h, and define a stub for when CONFIG_KASAN
is not selected.
Re: [PATCH] powerpc64/kasan: Call kasan_early_init() after PACA initialised
On 07/07/2023 at 03:31, Benjamin Gray wrote:
> The KCOV handler __sanitizer_cov_trace_pc() uses the PACA, so initialise
> the PACA first. This fixes a hang during boot when KASAN and KCOV are
> both enabled, where the coverage tracer in kasan_early_init() tries to
> access a field of the (currently null) PACA.

mm/kasan/Makefile has

	KCOV_INSTRUMENT := n

Should we add the same in arch/powerpc/mm/kasan/Makefile to be consistent?

>
> Signed-off-by: Benjamin Gray
>
> ---
>
> I tried annotating kasan_early_init() with 'notrace', but it still
> seemed to hang. It would also be less robust, because kasan_early_init()
> may in future call generic code that should keep coverage.
> ---
>  arch/powerpc/kernel/head_64.S  | 3 ---
>  arch/powerpc/kernel/setup_64.c | 4 ++++
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
> index f132d8704263..21a78a849ca8 100644
> --- a/arch/powerpc/kernel/head_64.S
> +++ b/arch/powerpc/kernel/head_64.S
> @@ -1004,9 +1004,6 @@ start_here_multiplatform:
>  	 * and SLB setup before we turn on relocation.
>  	 */
>
> -#ifdef CONFIG_KASAN
> -	bl	CFUNC(kasan_early_init)
> -#endif
>  	/* Restore parameters passed from prom_init/kexec */
>  	mr	r3,r31
>  	LOAD_REG_ADDR(r12, DOTSYM(early_setup))
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 246201d0d879..a3f5decbc041 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -369,6 +369,10 @@ void __init early_setup(unsigned long dt_ptr)
>
>  	/* printk is now safe to use --- */
>
> +#ifdef CONFIG_KASAN
> +	kasan_early_init();
> +#endif

Don't #ifdef. In arch/powerpc/include/asm/kasan.h, add a stub for
kasan_early_init() just like already done for kasan_init(),
kasan_mmu_init() and kasan_late_init().

> +
>  	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && (mfmsr() & MSR_HV))
>  		enable_machine_check();
>
Re: [PATCH] powerpc/powermac: Use early_* IO variants in via_calibrate_decr
On Thu, 2023-07-06 at 11:08 +1000, Benjamin Gray wrote:
> The issue is pre-existing, but is surfaced by commit 721255b9826b
> ("genirq: Use a maple tree for interrupt descriptor management").
> It's not clear to me why this causes it to surface.

From the thread chain in [1], it looks like the maple tree implementation
just has different allocation behaviour, which follows a pre-existing code
path to kmem_cache_alloc_bulk(), which unconditionally enables interrupts.

(thanks Jordan Niethe for finding this thread)

[1]: https://lore.kernel.org/all/87o7qdzfay.ffs@tglx/
Re: [PATCH] powerpc64/kasan: Call kasan_early_init() after PACA initialised
The inclusion of kasan_early_init() is conditional on CONFIG_KASAN, so

	if (IS_ENABLED(CONFIG_KASAN))
		kasan_early_init();

wouldn't work.
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Fri, Jul 7, 2023 at 12:58 AM Isaku Yamahata wrote:
>
> On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang wrote:
> > >
> > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > > From: David Stevens
> > > >
> > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > > memory into the guest that is backed by un-refcounted struct pages - for
> > > > example, higher order non-compound pages allocated by the amdgpu driver
> > > > via ttm_pool_alloc_page.
> > >
> > > I guess you mean the tail pages of the higher order non-compound pages?
> > > And as to the head page, it is said to be set to one coincidentally[*],
> > > and shall not be considered as refcounted. IIUC, refcount of this head
> > > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > > this may not be a problem(?). But treating this head page differently,
> > > as a refcounted one (e.g., to set the A/D flags), is weird.
> > >
> > > Or maybe I missed some context, e.g., can the head page be allocated to
> > > the guest at all?
> >
> > Yes, this is to allow mapping the tail pages of higher order
> > non-compound pages - I should have been more precise in my wording.
> > The head pages can already be mapped into the guest.
> >
> > Treating the head and tail pages differently would require changing how
> > KVM behaves in a situation it supports today (rather than just adding
> > support for an unsupported situation). Currently, without this series,
> > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> > guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> > that's actually valid behavior, nor do I know whether anyone actually
> > cares about it. But it's what KVM does today, and I would shy away
> > from modifying that behavior without good reason.
> > > >
> > > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > > done by storing the flag in an unused bit in the sptes.
> > >
> > > Also, maybe we should mention this only works on x86-64.
> > >
> > > >
> > > > Signed-off-by: David Stevens
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c          | 44 +
> > > >  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> > > >  arch/x86/kvm/mmu/paging_tmpl.h  |  9 ---
> > > >  arch/x86/kvm/mmu/spte.c         |  4 ++-
> > > >  arch/x86/kvm/mmu/spte.h         | 12 -
> > > >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++---
> > > >  6 files changed, 62 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e44ab512c3a1..b1607e314497 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > >
> > > ...
> > >
> > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > >  	bool host_writable = !fault || fault->map_writable;
> > > >  	bool prefetch = !fault || fault->prefetch;
> > > >  	bool write_fault = fault && fault->write;
> > > > +	bool is_refcounted = !fault || fault->is_refcounted_page;
> > >
> > > Just wonder, what if a non-refcounted page is prefetched? Or is it
> > > possible in practice?
> >
> > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > FOLL_GET. That's fixable, but it's not something this series currently
> > does.
>
> So if we prefetch a page, the REFCOUNTED bit is cleared unconditionally with
> this hunk. kvm_set_page_{dirty,accessed} won't be called as expected for
> prefetched sptes. If I read the patch correctly, the REFCOUNTED bit in the
> SPTE should represent whether the corresponding page is ref-countable or
> not, right?
> Because direct_pte_prefetch_many() is for the legacy KVM MMU and
> FNAME(prefetch_pte) is shadow paging, we need to test it with the legacy
> KVM MMU or shadow paging to hit the issue, though.

direct_pte_prefetch_many() and prefetch_gpte() both pass NULL for the fault
parameter, so is_refcounted will evaluate to true. So the spte's refcounted
bit will get set in that case.

-David
[PATCH] powerpc64/kasan: Call kasan_early_init() after PACA initialised
The KCOV handler __sanitizer_cov_trace_pc() uses the PACA, so initialise
the PACA first. This fixes a hang during boot when KASAN and KCOV are
both enabled, where the coverage tracer in kasan_early_init() tries to
access a field of the (currently null) PACA.

Signed-off-by: Benjamin Gray

---

I tried annotating kasan_early_init() with 'notrace', but it still
seemed to hang. It would also be less robust, because kasan_early_init()
may in future call generic code that should keep coverage.
---
 arch/powerpc/kernel/head_64.S  | 3 ---
 arch/powerpc/kernel/setup_64.c | 4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index f132d8704263..21a78a849ca8 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -1004,9 +1004,6 @@ start_here_multiplatform:
 	 * and SLB setup before we turn on relocation.
 	 */
 
-#ifdef CONFIG_KASAN
-	bl	CFUNC(kasan_early_init)
-#endif
 	/* Restore parameters passed from prom_init/kexec */
 	mr	r3,r31
 	LOAD_REG_ADDR(r12, DOTSYM(early_setup))
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 246201d0d879..a3f5decbc041 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -369,6 +369,10 @@ void __init early_setup(unsigned long dt_ptr)
 
 	/* printk is now safe to use --- */
 
+#ifdef CONFIG_KASAN
+	kasan_early_init();
+#endif
+
 	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && (mfmsr() & MSR_HV))
 		enable_machine_check();
 
--
2.41.0
Re: [PATCH v5 11/13] riscv/kexec: refactor for kernel/Kconfig.kexec
On Thu, 06 Jul 2023 15:20:25 PDT (-0700), eric.devol...@oracle.com wrote:
> The kexec and crash kernel options are provided in the common
> kernel/Kconfig.kexec. Utilize the common options and provide
> the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
> equivalent set of KEXEC and CRASH options.
>
> Signed-off-by: Eric DeVolder
> ---
>  arch/riscv/Kconfig | 44 +---
>  1 file changed, 13 insertions(+), 31 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index b49793cf34eb..8a3af850597a 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -647,48 +647,30 @@ config RISCV_BOOT_SPINWAIT
>  	  If unsure what to do here, say N.
>
> -config KEXEC
> -	bool "Kexec system call"
> -	depends on MMU
> +config ARCH_SUPPORTS_KEXEC
> +	def_bool MMU
> +
> +config ARCH_SELECTS_KEXEC
> +	def_bool y
> +	depends on KEXEC
>  	select HOTPLUG_CPU if SMP
> -	select KEXEC_CORE
> -	help
> -	  kexec is a system call that implements the ability to shutdown your
> -	  current kernel, and to start another kernel. It is like a reboot
> -	  but it is independent of the system firmware. And like a reboot
> -	  you can start any kernel with it, not just Linux.
> -
> -	  The name comes from the similarity to the exec system call.
>
> +config ARCH_SUPPORTS_KEXEC_FILE
> +	def_bool 64BIT && MMU
>
> -config KEXEC_FILE
> -	bool "kexec file based systmem call"
> -	depends on 64BIT && MMU
> +config ARCH_SELECTS_KEXEC_FILE
> +	def_bool y
> +	depends on KEXEC_FILE
>  	select HAVE_IMA_KEXEC if IMA
> -	select KEXEC_CORE
>  	select KEXEC_ELF
> -	help
> -	  This is new version of kexec system call. This system call is
> -	  file based and takes file descriptors as system call argument
> -	  for kernel and initramfs as opposed to list of segments as
> -	  accepted by previous system call.
> -
> -	  If you don't know what to do here, say Y.
>
>  config ARCH_HAS_KEXEC_PURGATORY
>  	def_bool KEXEC_FILE
>  	depends on CRYPTO=y
>  	depends on CRYPTO_SHA256=y
>
> -config CRASH_DUMP
> -	bool "Build kdump crash kernel"
> -	help
> -	  Generate crash dump after being started by kexec. This should
> -	  be normally only set in special crash dump kernels which are
> -	  loaded in the main kernel with kexec-tools into a specially
> -	  reserved region and then later executed after a crash by
> -	  kdump/kexec.
> -
> -	  For more details see Documentation/admin-guide/kdump/kdump.rst
> +config ARCH_SUPPORTS_CRASH_DUMP
> +	def_bool y
>
>  config COMPAT
>  	bool "Kernel support for 32-bit U-mode"

Acked-by: Palmer Dabbelt
[PATCH v5 13/13] sh/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Acked-by: John Paul Adrian Glaubitz
---
 arch/sh/Kconfig | 46 --
 1 file changed, 8 insertions(+), 38 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 2b3ce4fd3956..1cf6603781c7 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -548,44 +548,14 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-	bool "kexec system call (EXPERIMENTAL)"
-	depends on MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config CRASH_DUMP
-	bool "kernel crash dumps (EXPERIMENTAL)"
-	depends on BROKEN_ON_SMP
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel using
-	  PHYSICAL_START.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
-
-config KEXEC_JUMP
-	bool "kexec jump (EXPERIMENTAL)"
-	depends on KEXEC && HIBERNATION
-	help
-	  Jump between original kernel and kexeced kernel and invoke
-	  code via KEXEC
+config ARCH_SUPPORTS_KEXEC
+	def_bool MMU
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool BROKEN_ON_SMP
+
+config ARCH_SUPPORTS_KEXEC_JUMP
+	def_bool y
 
 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
--
2.31.1
[PATCH v5 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

NOTE: The original Kconfig has a KEXEC_SIG which depends on
MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT
dependency (using the strategy outlined in this series, and other
techniques) results in 'error: recursive dependency detected' on CRYPTO.

Per Alexander Gordeev: "the MODULE_SIG_FORMAT dependency was introduced
with [the git commit below] and in fact was not necessary, since s390
did/does not use mod_check_sig() anyway.

  commit c8424e776b09 ("MODSIGN: Export module signature definitions")

MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But
SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping
MODULE_SIG_FORMAT does not hurt."

Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency
from KEXEC_SIG. This still results in equivalent .config files for s390.
Signed-off-by: Eric DeVolder
Acked-by: Alexander Gordeev
---
 arch/s390/Kconfig | 69 ---
 1 file changed, 23 insertions(+), 46 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5b39918b7042..44154246e4b7 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -213,6 +213,7 @@ config S390
 	select HAVE_VIRT_CPU_ACCOUNTING_IDLE
 	select IOMMU_HELPER		if PCI
 	select IOMMU_SUPPORT		if PCI
+	select KEXEC
 	select MMU_GATHER_MERGE_VMAS
 	select MMU_GATHER_NO_GATHER
 	select MMU_GATHER_RCU_TABLE_FREE
@@ -244,6 +245,28 @@ config PGTABLE_LEVELS
 
 source "kernel/livepatch/Kconfig"
 
+config ARCH_DEFAULT_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390
+
+config ARCH_SUPPORTS_KEXEC_SIG
+	def_bool y
+
+config ARCH_HAS_KEXEC_PURGATORY
+	def_bool KEXEC_FILE
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
+	help
+	  Refer to for more details on this.
+	  This option also enables s390 zfcpdump.
+	  See also
+
 menu "Processor type and features"
 
 config HAVE_MARCH_Z10_FEATURES
@@ -482,36 +505,6 @@ config SCHED_TOPOLOGY
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-	def_bool y
-	select KEXEC_CORE
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	depends on CRYPTO
-	depends on CRYPTO_SHA256
-	depends on CRYPTO_SHA256_S390
-	help
-	  Enable the kexec file based system call. In contrast to the normal
-	  kexec system call this system call takes file descriptors for the
-	  kernel and initramfs as arguments.
-
-config ARCH_HAS_KEXEC_PURGATORY
-	def_bool y
-	depends on KEXEC_FILE
-
-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
-	depends on KEXEC_FILE && MODULE_SIG_FORMAT
-	help
-	  This option makes kernel signature verification mandatory for
-	  the kexec_file_load() syscall.
-
-	  In addition to that option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
-
 config KERNEL_NOBP
 	def_bool n
 	prompt "Enable modified branch prediction for the kernel by default"
@@ -733,22 +726,6 @@ config VFIO_AP
 
 endmenu
 
-menu "Dump support"
-
-config CRASH_DUMP
-	bool "kernel crash dumps"
-	select KEXEC
-	help
-	  Generate crash dump after being started by kexec.
-	  Crash dump kernels are loaded in the main kernel with kexec-tools
-	  into a specially reserved region and then later executed after
-	  a crash by kdump/kexec.
-	  Refer to for more details on this.
-	  This option also enables s390 zfcpdump.
-	  See also
-
-endmenu
-
 config CCW
 	def_bool y
--
2.31.1
[PATCH v5 11/13] riscv/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/riscv/Kconfig | 44 +---
 1 file changed, 13 insertions(+), 31 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index b49793cf34eb..8a3af850597a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -647,48 +647,30 @@ config RISCV_BOOT_SPINWAIT
 	  If unsure what to do here, say N.
 
-config KEXEC
-	bool "Kexec system call"
-	depends on MMU
+config ARCH_SUPPORTS_KEXEC
+	def_bool MMU
+
+config ARCH_SELECTS_KEXEC
+	def_bool y
+	depends on KEXEC
 	select HOTPLUG_CPU if SMP
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
 
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool 64BIT && MMU
 
-config KEXEC_FILE
-	bool "kexec file based systmem call"
-	depends on 64BIT && MMU
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
 	select HAVE_IMA_KEXEC if IMA
-	select KEXEC_CORE
 	select KEXEC_ELF
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
-	  If you don't know what to do here, say Y.
 
 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE
 	depends on CRYPTO=y
 	depends on CRYPTO_SHA256=y
 
-config CRASH_DUMP
-	bool "Build kdump crash kernel"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
 config COMPAT
 	bool "Kernel support for 32-bit U-mode"
--
2.31.1
[PATCH v5 10/13] powerpc/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
---
 arch/powerpc/Kconfig | 55 ++--
 1 file changed, 17 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b1172cbeccb..1695a71777f0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -589,41 +589,21 @@ config PPC64_SUPPORTS_MEMORY_FAILURE
 	default "y" if PPC_POWERNV
 	select ARCH_SUPPORTS_MEMORY_FAILURE
 
-config KEXEC
-	bool "kexec system call"
-	depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP)
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	select KEXEC_ELF
-	depends on PPC64
-	depends on CRYPTO=y
-	depends on CRYPTO_SHA256=y
-	help
-	  This is a new version of the kexec system call. This call is
-	  file based and takes in file descriptors as system call arguments
-	  for kernel and initramfs as opposed to a list of segments as is the
-	  case for the older kexec call.
+config ARCH_SUPPORTS_KEXEC
+	def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y
 
 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE
 
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select KEXEC_ELF
+	select HAVE_IMA_KEXEC if IMA
+
 config PPC64_BIG_ENDIAN_ELF_ABI_V2
 	# Option is available to BFD, but LLD does not support ELFv1 so this is
 	# always true there.
@@ -683,14 +663,13 @@ config RELOCATABLE_TEST
 	  loaded at, which tends to be non-zero and therefore test the
 	  relocation code.
 
-config CRASH_DUMP
-	bool "Build a dump capture kernel"
-	depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+
+config ARCH_SELECTS_CRASH_DUMP
+	def_bool y
+	depends on CRASH_DUMP
 	select RELOCATABLE if PPC64 || 44x || PPC_85xx
-	help
-	  Build a kernel suitable for use as a dump capture kernel.
-	  The same kernel binary can be used as production kernel and dump
-	  capture kernel.
 
 config FA_DUMP
 	bool "Firmware-assisted dump"
--
2.31.1
[PATCH v5 09/13] parisc/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/parisc/Kconfig | 34 +++---
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 4cb46d5c64a2..2ef6843aae60 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -339,29 +339,17 @@ config NR_CPUS
 	default "8" if 64BIT
 	default "16"
 
-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  shutdown, so do not be surprised if this code does not
-	  initially work for you.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select KEXEC_ELF
-	help
-	  This enables the kexec_file_load() System call. This is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
 endmenu
 
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool y
+
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select KEXEC_ELF
+
 source "drivers/parisc/Kconfig"
--
2.31.1
[PATCH v5 08/13] mips/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Acked-by: Thomas Bogendoerfer
---
 arch/mips/Kconfig | 32 +---
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index fc6fba925aea..bc8421859006 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2878,33 +2878,11 @@ config HZ
 
 config SCHED_HRTICK
 	def_bool HIGH_RES_TIMERS
 
-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config CRASH_DUMP
-	bool "Kernel crash dumps"
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel or firmware using
-	  PHYSICAL_START.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded"
--
2.31.1
[PATCH v5 07/13] m68k/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Reviewed-by: Geert Uytterhoeven
Acked-by: Geert Uytterhoeven
---
 arch/m68k/Kconfig | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index dc792b321f1e..3e318bf9504c 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -89,23 +89,8 @@ config MMU_SUN3
 	bool
 	depends on MMU && !MMU_MOTOROLA && !MMU_COLDFIRE
 
-config KEXEC
-	bool "kexec system call"
-	depends on M68KCLASSIC && MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
+config ARCH_SUPPORTS_KEXEC
+	def_bool M68KCLASSIC && MMU
 
 config BOOTINFO_PROC
 	bool "Export bootinfo in procfs"
--
2.31.1
[PATCH v5 06/13] loongarch/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/loongarch/Kconfig | 26 +++---
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index e55511af4c77..397203e18800 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -537,28 +537,16 @@ config CPU_HAS_PREFETCH
 	bool
 	default y
 
-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
 
-	  The name comes from the similarity to the exec system call.
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
-config CRASH_DUMP
-	bool "Build kdump crash kernel"
+config ARCH_SELECTS_CRASH_DUMP
+	def_bool y
+	depends on CRASH_DUMP
 	select RELOCATABLE
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
 
 config RELOCATABLE
 	bool "Relocatable kernel"
--
2.31.1
[PATCH v5 05/13] arm64/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/arm64/Kconfig | 64 --
 1 file changed, 16 insertions(+), 48 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7856c3a3e35a..3a9b06a693a9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1459,60 +1459,28 @@ config PARAVIRT_TIME_ACCOUNTING
 	  If in doubt, say N here.
 
-config KEXEC
-	depends on PM_SLEEP_SMP
-	select KEXEC_CORE
-	bool "kexec system call"
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
+config ARCH_SUPPORTS_KEXEC
+	def_bool PM_SLEEP_SMP
 
-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
-	depends on KEXEC_FILE
-	help
-	  Select this option to verify a signature with loaded kernel
-	  image. If configured, any attempt of loading a image without
-	  valid signature will fail.
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool y
 
-	  In addition to that option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select HAVE_IMA_KEXEC if IMA
 
-config KEXEC_IMAGE_VERIFY_SIG
-	bool "Enable Image signature verification support"
-	default y
-	depends on KEXEC_SIG
-	depends on EFI && SIGNED_PE_FILE_VERIFICATION
-	help
-	  Enable Image signature verification support.
+config ARCH_SUPPORTS_KEXEC_SIG
+	def_bool y
 
-comment "Support for PE file signature verification disabled"
-	depends on KEXEC_SIG
-	depends on !EFI || !SIGNED_PE_FILE_VERIFICATION
+config ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG
+	def_bool y
 
-config CRASH_DUMP
-	bool "Build kdump crash kernel"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
+config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+	def_bool y
 
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
 config TRANS_TABLE
 	def_bool y
--
2.31.1
[PATCH v5 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
The config options for kexec and crash features are consolidated into
new file kernel/Kconfig.kexec. Under the "General Setup" submenu is a
new submenu "Kexec and crash handling". All the kexec and crash options
that were once in the arch-dependent submenu "Processor type and
features" are now consolidated in the new submenu.

The following options are impacted:
 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP.

Architectures specify support of certain KEXEC and CRASH features with
similarly named new ARCH_SUPPORTS_<option> config options.

Architectures can utilize the new ARCH_SELECTS_<option> config options
to specify additional components when <option> is enabled.

To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (ie.
select statements).

Signed-off-by: Eric DeVolder
---
 arch/Kconfig         |  13 -
 init/Kconfig         |   2 +
 kernel/Kconfig.kexec | 116 +++
 3 files changed, 118 insertions(+), 13 deletions(-)
 create mode 100644 kernel/Kconfig.kexec

diff --git a/arch/Kconfig b/arch/Kconfig
index aff2746c8af2..b2872e9d3760 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -11,19 +11,6 @@ source "arch/$(SRCARCH)/Kconfig"
 
 menu "General architecture-dependent options"
 
-config CRASH_CORE
-        bool
-
-config KEXEC_CORE
-        select CRASH_CORE
-        bool
-
-config KEXEC_ELF
-        bool
-
-config HAVE_IMA_KEXEC
-        bool
-
 config ARCH_HAS_SUBPAGE_FAULTS
         bool
         help
diff --git a/init/Kconfig b/init/Kconfig
index f7f65af4ee12..639e8a3363c3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1927,6 +1927,8 @@ config BINDGEN_VERSION_TEXT
 config TRACEPOINTS
         bool
 
+source "kernel/Kconfig.kexec"
+
 endmenu         # General setup
 
 source "arch/Kconfig"
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
new file mode 100644
index 000000000000..ff72e45cfaef
--- /dev/null
+++ b/kernel/Kconfig.kexec
@@ -0,0 +1,116 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Kexec and crash features"
+
+config CRASH_CORE
+        bool
+
+config KEXEC_CORE
+        select CRASH_CORE
+        bool
+
+config KEXEC_ELF
+        bool
+
+config HAVE_IMA_KEXEC
+        bool
+
+config KEXEC
+        bool "Enable kexec system call"
+        default ARCH_DEFAULT_KEXEC
+        depends on ARCH_SUPPORTS_KEXEC
+        select KEXEC_CORE
+        help
+          kexec is a system call that implements the ability to shutdown your
+          current kernel, and to start another kernel. It is like a reboot
+          but it is independent of the system firmware. And like a reboot
+          you can start any kernel with it, not just Linux.
+
+          The name comes from the similarity to the exec system call.
+
+          It is an ongoing process to be certain the hardware in a machine
+          is properly shutdown, so do not be surprised if this code does not
+          initially work for you. As of this writing the exact hardware
+          interface is strongly in flux, so no good recommendation can be
+          made.
+
+config KEXEC_FILE
+        bool "Enable kexec file based system call"
+        depends on ARCH_SUPPORTS_KEXEC_FILE
+        select KEXEC_CORE
+        help
+          This is new version of kexec system call. This system call is
+          file based and takes file descriptors as system call argument
+          for kernel and initramfs as opposed to list of segments as
+          accepted by kexec system call.
+
+config KEXEC_SIG
+        bool "Verify kernel signature during kexec_file_load() syscall"
+        depends on ARCH_SUPPORTS_KEXEC_SIG
+        depends on KEXEC_FILE
+        help
+          This option makes the kexec_file_load() syscall check for a valid
+          signature of the kernel image. The image can still be loaded without
+          a valid signature unless you also enable KEXEC_SIG_FORCE, though if
+          there's a signature that we can check, then it must be valid.
+
+          In addition to this option, you need to enable signature
+          verification for the corresponding kernel image type being
+          loaded in order for this to work.
+
+config KEXEC_SIG_FORCE
+        bool "Require a valid signature in kexec_file_load() syscall"
+        depends on ARCH_SUPPORTS_KEXEC_SIG_FORCE
+        depends on KEXEC_SIG
+        help
+          This option makes kernel signature verification mandatory for
+          the kexec_file_load() syscall.
+
+config KEXEC_IMAGE_VERIFY_SIG
+        bool "Enable Image signature verification support (ARM)"
+        default ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+        depends on ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG
+        depends on KEXEC_SIG
+        depends on EFI && SIGNED_PE_FILE_VERIFICATION
+        help
+          Enable Image signature verification support.
+
+config KEXEC_BZIMAGE_VERIFY_SIG
+        bool "Enable bzImage signature verification
[PATCH v5 00/13] refactor Kconfig to consolidate KEXEC and CRASH options
The Kconfig is refactored to consolidate KEXEC and CRASH options from
various arch/<arch>/Kconfig files into new file kernel/Kconfig.kexec.

The Kconfig.kexec is now a submenu titled "Kexec and crash features"
located under "General Setup".

The following options are impacted:
 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_IMAGE_VERIFY_SIG
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

Over time, these options have been copied between Kconfig files and
are very similar to one another, but with slight differences.

The following architectures are impacted by the refactor (because of
use of one or more KEXEC/CRASH options):
 - arm
 - arm64
 - ia64
 - loongarch
 - m68k
 - mips
 - parisc
 - powerpc
 - riscv
 - s390
 - sh
 - x86

More information:

In the patch series "crash: Kernel handling of CPU and memory hot
un/plug"
https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devol...@oracle.com/
the new kernel feature introduces the config option CRASH_HOTPLUG.

In reviewing, Thomas Gleixner requested that the new config option not
be placed in x86 Kconfig. Rather the option needs a generic/common
home. To Thomas' point, the KEXEC and CRASH options have largely been
duplicated in the various arch/<arch>/Kconfig files, with minor
differences. This kind of proliferation is to be avoided/stopped.

https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/

To that end, I have refactored the arch Kconfigs so as to consolidate
the various KEXEC and CRASH options.

Generally speaking, this work has the following themes:

- KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec
- These items from arch/Kconfig:
    CRASH_CORE
    KEXEC_CORE
    KEXEC_ELF
    HAVE_IMA_KEXEC
- These items from arch/x86/Kconfig form the common options:
    KEXEC
    KEXEC_FILE
    KEXEC_SIG
    KEXEC_SIG_FORCE
    KEXEC_BZIMAGE_VERIFY_SIG
    KEXEC_JUMP
    CRASH_DUMP
- These items from arch/arm64/Kconfig form the common options:
    KEXEC_IMAGE_VERIFY_SIG
- The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec
- The Kconfig.kexec is now a submenu titled "Kexec and crash features"
  and is now listed in "General Setup" submenu from init/Kconfig.
- To control the common options, each has a new ARCH_SUPPORTS_<option>
  option. These gateway options determine whether the common options
  are valid for the architecture.
- To account for the slight differences in the original architecture
  coding of the common options, each now has a corresponding
  ARCH_SELECTS_<option> which are used to elicit the same side effects
  as the original arch/<arch>/Kconfig files for KEXEC and CRASH options.
- NOTE: The existing ARCH_HAS_KEXEC_PURGATORY remains unchanged.

An example, 'make menuconfig' illustrating the submenu:

  > General setup > Kexec and crash features
  [*] Enable kexec system call
  [*] Enable kexec file based system call
  [*]   Verify kernel signature during kexec_file_load() syscall
  [ ]     Require a valid signature in kexec_file_load() syscall
  [ ]     Enable bzImage signature verification support
  [*] kexec jump
  [*] kernel crash dumps
  [*]   Update the crash elfcorehdr on system configuration changes

In the process of consolidating the common options, I encountered
slight differences in the coding of these options in several of the
architectures. As a result, I settled on the following solution:

- Each of the common options has a 'depends on ARCH_SUPPORTS_<option>'
  statement. For example, the KEXEC_FILE option has a 'depends on
  ARCH_SUPPORTS_KEXEC_FILE' statement.

  This approach is needed on all common options so as to prevent
  options from appearing for architectures which previously did not
  allow/enable them. For example, arm supports KEXEC but not
  KEXEC_FILE. The arch/arm/Kconfig does not provide
  ARCH_SUPPORTS_KEXEC_FILE and so KEXEC_FILE and related options are
  not available to arm.

- The boolean ARCH_SUPPORTS_<option> in effect allows the arch to
  determine when the feature is allowed. Archs which don't have the
  feature simply do not provide the corresponding
  ARCH_SUPPORTS_<option>. For each arch, where there previously were
  KEXEC and/or CRASH options, these have been replaced with the
  corresponding boolean ARCH_SUPPORTS_<option>, and an appropriate
  def_bool statement.

  For example, if the arch supports KEXEC_FILE, then the
  ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits
  the KEXEC_FILE option to be available.

  If the arch has a 'depends on' statement in its original coding of
  the option, then that expression becomes part of the def_bool
  expression. For example, arm64 had:

    config KEXEC
      depends on PM_SLEEP_SMP

  and in this solution, this converts to:

    config ARCH_SUPPORTS_KEXEC
      def_bool PM_SLEEP_SMP

- In order to account for the architecture differences in the coding
  for the common options, the ARCH_SELECTS_<option> in the
  arch/<arch>/Kconfig is used. This option has a 'depends on <option>'
  statement to couple it to the main option, and from there can insert
  the differences from the common
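A minimal Kconfig sketch of the gateway/selects pattern described above. This is an illustration only: MYARCH_DEP and MYARCH_EXTRA_OPTION are placeholder symbols, not symbols from the series; the real symbol names and expressions are in the per-arch patches below.

```kconfig
# In kernel/Kconfig.kexec: the common option, gated on the arch-provided
# ARCH_SUPPORTS_<option> symbol.
config KEXEC
	bool "Enable kexec system call"
	depends on ARCH_SUPPORTS_KEXEC
	select KEXEC_CORE

# In arch/<arch>/Kconfig: the gateway, carrying any former 'depends on'
# expression of the old per-arch option (MYARCH_DEP is a placeholder).
config ARCH_SUPPORTS_KEXEC
	def_bool MYARCH_DEP

# In arch/<arch>/Kconfig: side effects that fire only when the common
# option is enabled (MYARCH_EXTRA_OPTION is a placeholder).
config ARCH_SELECTS_KEXEC
	def_bool y
	depends on KEXEC
	select MYARCH_EXTRA_OPTION
```

The 'def_bool y' + 'depends on' + 'select' shape of ARCH_SELECTS_<option> is what lets an always-on helper symbol exist purely to carry selects, without ever being user-visible.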
[PATCH v5 04/13] ia64/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/ia64/Kconfig | 28 +---
 1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 2cd93e6bf0fe..88382f105301 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -361,31 +361,13 @@ config IA64_HP_AML_NFW
           the "force" module parameter, e.g., with the "aml_nfw.force"
           kernel command line option.
 
-config KEXEC
-        bool "kexec system call"
-        depends on !SMP || HOTPLUG_CPU
-        select KEXEC_CORE
-        help
-          kexec is a system call that implements the ability to shutdown your
-          current kernel, and to start another kernel. It is like a reboot
-          but it is independent of the system firmware. And like a reboot
-          you can start any kernel with it, not just Linux.
-
-          The name comes from the similarity to the exec system call.
-
-          It is an ongoing process to be certain the hardware in a machine
-          is properly shutdown, so do not be surprised if this code does not
-          initially work for you. As of this writing the exact hardware
-          interface is strongly in flux, so no good recommendation can be
-          made.
+endmenu
 
-config CRASH_DUMP
-        bool "kernel crash dumps"
-        depends on IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
-        help
-          Generate crash dump after being started by kexec.
+config ARCH_SUPPORTS_KEXEC
+        def_bool !SMP || HOTPLUG_CPU
 
-endmenu
+config ARCH_SUPPORTS_CRASH_DUMP
+        def_bool IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
 
 menu "Power management and ACPI options"
-- 
2.31.1
[PATCH v5 02/13] x86/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/x86/Kconfig | 92 ++--
 1 file changed, 19 insertions(+), 73 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7422db409770..9767a343f7c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2040,88 +2040,34 @@ config EFI_RUNTIME_MAP
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-        bool "kexec system call"
-        select KEXEC_CORE
-        help
-          kexec is a system call that implements the ability to shutdown your
-          current kernel, and to start another kernel. It is like a reboot
-          but it is independent of the system firmware. And like a reboot
-          you can start any kernel with it, not just Linux.
-
-          The name comes from the similarity to the exec system call.
-
-          It is an ongoing process to be certain the hardware in a machine
-          is properly shutdown, so do not be surprised if this code does not
-          initially work for you. As of this writing the exact hardware
-          interface is strongly in flux, so no good recommendation can be
-          made.
-
-config KEXEC_FILE
-        bool "kexec file based system call"
-        select KEXEC_CORE
-        select HAVE_IMA_KEXEC if IMA
-        depends on X86_64
-        depends on CRYPTO=y
-        depends on CRYPTO_SHA256=y
-        help
-          This is new version of kexec system call. This system call is
-          file based and takes file descriptors as system call argument
-          for kernel and initramfs as opposed to list of segments as
-          accepted by previous system call.
+config ARCH_SUPPORTS_KEXEC
+        def_bool y
 
-config ARCH_HAS_KEXEC_PURGATORY
-        def_bool KEXEC_FILE
+config ARCH_SUPPORTS_KEXEC_FILE
+        def_bool X86_64 && CRYPTO && CRYPTO_SHA256
 
-config KEXEC_SIG
-        bool "Verify kernel signature during kexec_file_load() syscall"
+config ARCH_SELECTS_KEXEC_FILE
+        def_bool y
         depends on KEXEC_FILE
-        help
+        select HAVE_IMA_KEXEC if IMA
 
-          This option makes the kexec_file_load() syscall check for a valid
-          signature of the kernel image. The image can still be loaded without
-          a valid signature unless you also enable KEXEC_SIG_FORCE, though if
-          there's a signature that we can check, then it must be valid.
+config ARCH_HAS_KEXEC_PURGATORY
+        def_bool KEXEC_FILE
 
-          In addition to this option, you need to enable signature
-          verification for the corresponding kernel image type being
-          loaded in order for this to work.
+config ARCH_SUPPORTS_KEXEC_SIG
+        def_bool y
 
-config KEXEC_SIG_FORCE
-        bool "Require a valid signature in kexec_file_load() syscall"
-        depends on KEXEC_SIG
-        help
-          This option makes kernel signature verification mandatory for
-          the kexec_file_load() syscall.
+config ARCH_SUPPORTS_KEXEC_SIG_FORCE
+        def_bool y
 
-config KEXEC_BZIMAGE_VERIFY_SIG
-        bool "Enable bzImage signature verification support"
-        depends on KEXEC_SIG
-        depends on SIGNED_PE_FILE_VERIFICATION
-        select SYSTEM_TRUSTED_KEYRING
-        help
-          Enable bzImage signature verification support.
+config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG
+        def_bool y
 
-config CRASH_DUMP
-        bool "kernel crash dumps"
-        depends on X86_64 || (X86_32 && HIGHMEM)
-        help
-          Generate crash dump after being started by kexec.
-          This should be normally only set in special crash dump kernels
-          which are loaded in the main kernel with kexec-tools into
-          a specially reserved region and then later executed after
-          a crash by kdump/kexec. The crash dump kernel must be compiled
-          to a memory address not used by the main kernel or BIOS using
-          PHYSICAL_START, or it must be built as a relocatable image
-          (CONFIG_RELOCATABLE=y).
-          For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_KEXEC_JUMP
+        def_bool y
 
-config KEXEC_JUMP
-        bool "kexec jump"
-        depends on KEXEC && HIBERNATION
-        help
-          Jump between original kernel and kexeced kernel and invoke
-          code in physical address mode via KEXEC
+config ARCH_SUPPORTS_CRASH_DUMP
+        def_bool X86_64 || (X86_32 && HIGHMEM)
 
 config PHYSICAL_START
         hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
-- 
2.31.1
[PATCH v5 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/arm/Kconfig | 29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7a27550ff3c1..1a6a6eb48a15 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1645,20 +1645,8 @@ config XIP_DEFLATED_DATA
           copied, saving some precious ROM space. A possible drawback is a
           slightly longer boot delay.
 
-config KEXEC
-        bool "Kexec system call (EXPERIMENTAL)"
-        depends on (!SMP || PM_SLEEP_SMP)
-        depends on MMU
-        select KEXEC_CORE
-        help
-          kexec is a system call that implements the ability to shutdown your
-          current kernel, and to start another kernel. It is like a reboot
-          but it is independent of the system firmware. And like a reboot
-          you can start any kernel with it, not just Linux.
-
-          It is an ongoing process to be certain the hardware in a machine
-          is properly shutdown, so do not be surprised if this code does not
-          initially work for you.
+config ARCH_SUPPORTS_KEXEC
+        def_bool (!SMP || PM_SLEEP_SMP) && MMU
 
 config ATAGS_PROC
         bool "Export atags in procfs"
@@ -1668,17 +1656,8 @@ config ATAGS_PROC
           Should the atags used to boot the kernel be exported in an "atags"
           file in procfs. Useful with kexec.
 
-config CRASH_DUMP
-        bool "Build kdump crash kernel (EXPERIMENTAL)"
-        help
-          Generate crash dump after being started by kexec. This should
-          be normally only set in special crash dump kernels which are
-          loaded in the main kernel with kexec-tools into a specially
-          reserved region and then later executed after a crash by
-          kdump/kexec. The crash dump kernel must be compiled to a
-          memory address not used by the main kernel
-
-          For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+        def_bool y
 
 config AUTO_ZRELADDR
         bool "Auto calculation of the decompressed kernel image address" if !ARCH_MULTIPLATFORM
-- 
2.31.1
Re: [apparmor] [PATCH v2 08/92] fs: new helper: simple_rename_timestamp
On Wed, Jul 05, 2023 at 08:04:41PM -0400, Jeff Layton wrote:
>
> I don't believe it's an issue. I've seen nothing in the POSIX spec that
> mandates that timestamp updates to different inodes involved in an
> operation be set to the _same_ value. It just says they must be updated.
>
> It's also hard to believe that any software would depend on this either,
> given that it's very inconsistent across filesystems today. AFAICT, this
> was mostly done in the past just as a matter of convenience.

I've seen this assumption in several programs:

mutt buffy.c
https://sources.debian.org/src/mutt/2.2.9-1/buffy.c/?hl=625#L625

  if (mailbox->newly_created &&
      (sb->st_ctime != sb->st_mtime || sb->st_ctime != sb->st_atime))
    mailbox->newly_created = 0;

neomutt mbox/mbox.c
https://sources.debian.org/src/neomutt/20220429+dfsg1-4.1/mbox/mbox.c/?hl=1820#L1820

  if (m->newly_created &&
      ((st.st_ctime != st.st_mtime) || (st.st_ctime != st.st_atime)))
    m->newly_created = false;

screen logfile.c
https://sources.debian.org/src/screen/4.9.0-4/logfile.c/?hl=130#L130

  if ((!s->st_dev && !s->st_ino) ||   /* stat failed, that's new! */
      !s->st_nlink ||                 /* red alert: file unlinked */
      (s->st_size < o.st_size) ||     /* file truncated */
      (s->st_mtime != o.st_mtime) ||  /* file modified */
      ((s->st_ctime != o.st_ctime) && /* file changed (moved) */
       !(s->st_mtime == s->st_ctime &&  /* and it was not a change */
         o.st_ctime < s->st_ctime)))    /* due to delayed nfs write */
  {

nemo libnemo-private/nemo-vfs-file.c
https://sources.debian.org/src/nemo/5.6.5-1/libnemo-private/nemo-vfs-file.c/?hl=344#L344

  /* mtime is when the contents changed; ctime is when the
   * contents or the permissions (inc. owner/group) changed.
   * So we can only know when the permissions changed if mtime
   * and ctime are different.
   */
  if (file->details->mtime == file->details->ctime) {
      return FALSE;
  }

While looking for more examples, I found a perl test that seems to
suggest that at least Solaris, AFS, AmigaOS, DragonFly BSD do as you
suggest:
https://sources.debian.org/src/perl/5.36.0-7/t/op/stat.t/?hl=158#L140

Thanks
Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime
On Thu, 2023-07-06 at 10:16 -0500, Eric W. Biederman wrote:
> Jeff Layton writes:
> 
> > On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
> > > v2:
> > > - prepend patches to add missing ctime updates
> > > - add simple_rename_timestamp helper function
> > > - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> > > - drop individual inode_ctime_set_{sec,nsec} helpers
> > > 
> > > I've been working on a patchset to change how the inode->i_ctime is
> > > accessed in order to give us conditional, high-res timestamps for the
> > > ctime and mtime. struct timespec64 has unused bits in it that we can use
> > > to implement this. In order to do that however, we need to wrap all
> > > accesses of inode->i_ctime to ensure that bits used as flags are
> > > appropriately handled.
> > > 
> > > The patchset starts with reposts of some missing ctime updates that I
> > > spotted in the tree. It then adds a new helper function for updating the
> > > timestamp after a successful rename, and new ctime accessor
> > > infrastructure.
> > > 
> > > The bulk of the patchset is individual conversions of different
> > > subsystems to use the new infrastructure. Finally, the patchset renames
> > > the i_ctime field to __i_ctime to help ensure that I didn't miss
> > > anything.
> > > 
> > > This should apply cleanly to linux-next as of this morning.
> > > 
> > > Most of this conversion was done via 5 different coccinelle scripts, run
> > > in succession, with a large swath of by-hand conversions to clean up the
> > > remainder.
> > 
> > A couple of other things I should note:
> > 
> > If you sent me an Acked-by or Reviewed-by in the previous set, then I
> > tried to keep it on the patch here, since the respun patches are mostly
> > just renaming stuff from v1. Let me know if I've missed any.
> > 
> > I've also pushed the pile to my tree as this tag:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705
> > 
> > In case that's easier to work with.
> 
> Are there any preliminary patches showing what you want your introduced
> accessors to turn into? It is hard to judge the sanity of the
> introduction of wrappers without seeing what the wrappers are ultimately
> going to do.
> 
> Eric

I have a draft version of the multigrain patches on top of the wrapper
conversion I've already posted in my "mgctime-experimental" branch:

https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=mgctime-experimental

The rationale is best explained in this changelog:

https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=mgctime-experimental&id=face437a144d3375afb7f70c233b0644b4edccba

The idea will be to enable this on a per-fs basis.
-- 
Jeff Layton
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On 7/6/23 10:58, Alexander Gordeev wrote:

On Wed, Jul 05, 2023 at 08:49:58AM -0700, Nathan Chancellor wrote:
...
I just bisected the following build failure visible with 'ARCH=s390
allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec:
refactor for kernel/Kconfig.kexec") in -next.

arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
  120 | static bool kdump_csum_valid(struct kimage *image)
      |                              ^~
arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
  188 | int machine_kexec_prepare(struct kimage *image)
      |                           ^~
arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare':
arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage'
  192 |         if (image->type == KEXEC_TYPE_CRASH)
      |                  ^~
arch/s390/kernel/machine_kexec.c:192:28: error: 'KEXEC_TYPE_CRASH' undeclared (first use in this function); did you mean 'KEXEC_ON_CRASH'?
  192 |         if (image->type == KEXEC_TYPE_CRASH)
      |                            ^~~~
      |                            KEXEC_ON_CRASH
arch/s390/kernel/machine_kexec.c:192:28: note: each undeclared identifier is reported only once for each function it appears in
arch/s390/kernel/machine_kexec.c:196:18: error: invalid use of undefined type 'struct kimage'
  196 |         if (image->type != KEXEC_TYPE_DEFAULT)
      |                  ^~
arch/s390/kernel/machine_kexec.c:196:28: error: 'KEXEC_TYPE_DEFAULT' undeclared (first use in this function); did you mean 'KEXEC_ARCH_DEFAULT'?
  196 |         if (image->type != KEXEC_TYPE_DEFAULT)
      |                            ^~
      |                            KEXEC_ARCH_DEFAULT
In file included from arch/s390/include/asm/thread_info.h:31,
                 from include/linux/thread_info.h:60,
                 from arch/s390/include/asm/preempt.h:6,
                 from include/linux/preempt.h:79,
                 from arch/s390/include/asm/percpu.h:5,
                 from include/linux/irqflags.h:18,
                 from include/linux/rcupdate.h:26,
                 from include/linux/rculist.h:11,
                 from include/linux/pid.h:5,
                 from include/linux/sched.h:14,
                 from include/linux/ratelimit.h:6,
                 from include/linux/dev_printk.h:16,
                 from include/linux/device.h:15,
                 from arch/s390/kernel/machine_kexec.c:9:
arch/s390/kernel/machine_kexec.c:200:48: error: invalid use of undefined type 'struct kimage'
  200 |         reboot_code_buffer = page_to_virt(image->control_code_page);
      |                                                ^~
arch/s390/include/asm/page.h:186:58: note: in definition of macro '__va'
  186 | #define __va(x) ((void *)(unsigned long)(x))
      |                                                          ^
arch/s390/include/asm/page.h:194:38: note: in expansion of macro 'pfn_to_phys'
  194 | #define pfn_to_virt(pfn)        __va(pfn_to_phys(pfn))
      |                                      ^~~
arch/s390/include/asm/page.h:199:33: note: in expansion of macro 'pfn_to_virt'
  199 | #define page_to_virt(page)      pfn_to_virt(page_to_pfn(page))
      |                                 ^~~
include/asm-generic/memory_model.h:64:21: note: in expansion of macro '__page_to_pfn'
   64 | #define page_to_pfn __page_to_pfn
      |                     ^
arch/s390/kernel/machine_kexec.c:200:30: note: in expansion of macro 'page_to_virt'
  200 |         reboot_code_buffer = page_to_virt(image->control_code_page);
      |                              ^~~~
arch/s390/kernel/machine_kexec.c: At top level:
arch/s390/kernel/machine_kexec.c:207:35: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
  207 | void machine_kexec_cleanup(struct kimage *image)
      |                            ^~
arch/s390/kernel/machine_kexec.c: In function '__do_machine_kexec':
arch/s390/kernel/machine_kexec.c:243:40: error: invalid use of undefined type 'struct kimage'
  243 |         data_mover = page_to_phys(image->control_code_page);
      |                                        ^~
arch/s390/include/asm/page.h:189:35: note: in definition of macro 'pfn_to_phys'
  189 | #define pfn_to_phys(pfn)        ((pfn) << PAGE_SHIFT)
      |                                   ^~~
include/asm-generic/memory_model.h:64:21: note: in expansion
Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime
Jeff Layton writes:

> On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
>> v2:
>> - prepend patches to add missing ctime updates
>> - add simple_rename_timestamp helper function
>> - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
>> - drop individual inode_ctime_set_{sec,nsec} helpers
>>
>> I've been working on a patchset to change how the inode->i_ctime is
>> accessed in order to give us conditional, high-res timestamps for the
>> ctime and mtime. struct timespec64 has unused bits in it that we can use
>> to implement this. In order to do that however, we need to wrap all
>> accesses of inode->i_ctime to ensure that bits used as flags are
>> appropriately handled.
>>
>> The patchset starts with reposts of some missing ctime updates that I
>> spotted in the tree. It then adds a new helper function for updating the
>> timestamp after a successful rename, and new ctime accessor
>> infrastructure.
>>
>> The bulk of the patchset is individual conversions of different
>> subsystems to use the new infrastructure. Finally, the patchset renames
>> the i_ctime field to __i_ctime to help ensure that I didn't miss
>> anything.
>>
>> This should apply cleanly to linux-next as of this morning.
>>
>> Most of this conversion was done via 5 different coccinelle scripts, run
>> in succession, with a large swath of by-hand conversions to clean up the
>> remainder.
>>
>
> A couple of other things I should note:
>
> If you sent me an Acked-by or Reviewed-by in the previous set, then I
> tried to keep it on the patch here, since the respun patches are mostly
> just renaming stuff from v1. Let me know if I've missed any.
>
> I've also pushed the pile to my tree as this tag:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705
>
> In case that's easier to work with.

Are there any preliminary patches showing what you want your introduced
accessors to turn into? It is hard to judge the sanity of the
introduction of wrappers without seeing what the wrappers are ultimately
going to do.

Eric
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On Wed, Jul 05, 2023 at 08:49:58AM -0700, Nathan Chancellor wrote:
...
> I just bisected the following build failure visible with 'ARCH=s390
> allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec:
> refactor for kernel/Kconfig.kexec") in -next.
>
> arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
>   120 | static bool kdump_csum_valid(struct kimage *image)
>       |                              ^~
> arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
>   188 | int machine_kexec_prepare(struct kimage *image)
>       |                           ^~
> arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare':
> arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage'
>   192 |         if (image->type == KEXEC_TYPE_CRASH)
>       |                  ^~
> arch/s390/kernel/machine_kexec.c:192:28: error: 'KEXEC_TYPE_CRASH' undeclared (first use in this function); did you mean 'KEXEC_ON_CRASH'?
>   192 |         if (image->type == KEXEC_TYPE_CRASH)
>       |                            ^~~~
>       |                            KEXEC_ON_CRASH
> arch/s390/kernel/machine_kexec.c:192:28: note: each undeclared identifier is reported only once for each function it appears in
> arch/s390/kernel/machine_kexec.c:196:18: error: invalid use of undefined type 'struct kimage'
>   196 |         if (image->type != KEXEC_TYPE_DEFAULT)
>       |                  ^~
> arch/s390/kernel/machine_kexec.c:196:28: error: 'KEXEC_TYPE_DEFAULT' undeclared (first use in this function); did you mean 'KEXEC_ARCH_DEFAULT'?
>   196 |         if (image->type != KEXEC_TYPE_DEFAULT)
>       |                            ^~
>       |                            KEXEC_ARCH_DEFAULT
> In file included from arch/s390/include/asm/thread_info.h:31,
>                  from include/linux/thread_info.h:60,
>                  from arch/s390/include/asm/preempt.h:6,
>                  from include/linux/preempt.h:79,
>                  from arch/s390/include/asm/percpu.h:5,
>                  from include/linux/irqflags.h:18,
>                  from include/linux/rcupdate.h:26,
>                  from include/linux/rculist.h:11,
>                  from include/linux/pid.h:5,
>                  from include/linux/sched.h:14,
>                  from include/linux/ratelimit.h:6,
>                  from include/linux/dev_printk.h:16,
>                  from include/linux/device.h:15,
>                  from arch/s390/kernel/machine_kexec.c:9:
> arch/s390/kernel/machine_kexec.c:200:48: error: invalid use of undefined type 'struct kimage'
>   200 |         reboot_code_buffer = page_to_virt(image->control_code_page);
>       |                                                ^~
> arch/s390/include/asm/page.h:186:58: note: in definition of macro '__va'
>   186 | #define __va(x) ((void *)(unsigned long)(x))
>       |                                                          ^
> arch/s390/include/asm/page.h:194:38: note: in expansion of macro 'pfn_to_phys'
>   194 | #define pfn_to_virt(pfn)        __va(pfn_to_phys(pfn))
>       |                                      ^~~
> arch/s390/include/asm/page.h:199:33: note: in expansion of macro 'pfn_to_virt'
>   199 | #define page_to_virt(page)      pfn_to_virt(page_to_pfn(page))
>       |                                 ^~~
> include/asm-generic/memory_model.h:64:21: note: in expansion of macro '__page_to_pfn'
>    64 | #define page_to_pfn __page_to_pfn
>       |                     ^
> arch/s390/kernel/machine_kexec.c:200:30: note: in expansion of macro 'page_to_virt'
>   200 |         reboot_code_buffer = page_to_virt(image->control_code_page);
>       |                              ^~~~
> arch/s390/kernel/machine_kexec.c: At top level:
> arch/s390/kernel/machine_kexec.c:207:35: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
>   207 | void machine_kexec_cleanup(struct kimage *image)
>       |                            ^~
> arch/s390/kernel/machine_kexec.c: In function '__do_machine_kexec':
> arch/s390/kernel/machine_kexec.c:243:40: error: invalid use of undefined type 'struct kimage'
>   243 |         data_mover = page_to_phys(image->control_code_page);
>       |                                        ^~
> arch/s390/include/asm/page.h:189:35: note: in definition of macro 'pfn_to_phys'
>   189 | #define pfn_to_phys(pfn)        ((pfn) << PAGE_SHIFT)
>       |                                   ^~~
>
Re: [PATCH v2 92/92] fs: rename i_ctime field to __i_ctime
On Wed 05-07-23 14:58:12, Jeff Layton wrote: > Now that everything in-tree is converted to use the accessor functions, > rename the i_ctime field in the inode to discourage direct access. > > Signed-off-by: Jeff Layton Looks good. Feel free to add: Reviewed-by: Jan Kara Honza > --- > include/linux/fs.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 14e38bd900f1..b66442f91835 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -642,7 +642,7 @@ struct inode { > loff_t i_size; > struct timespec64 i_atime; > struct timespec64 i_mtime; > - struct timespec64 i_ctime; > + struct timespec64 __i_ctime; /* use inode_*_ctime accessors! */ > spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ > unsigned short i_bytes; > u8 i_blkbits; > @@ -1485,7 +1485,7 @@ struct timespec64 inode_set_ctime_current(struct inode > *inode); > */ > static inline struct timespec64 inode_get_ctime(const struct inode *inode) > { > - return inode->i_ctime; > + return inode->__i_ctime; > } > > /** > @@ -1498,7 +1498,7 @@ static inline struct timespec64 inode_get_ctime(const > struct inode *inode) > static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode, > struct timespec64 ts) > { > - inode->i_ctime = ts; > + inode->__i_ctime = ts; > return ts; > } > > -- > 2.41.0 > -- Jan Kara SUSE Labs, CR
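The accessor pattern this patch completes is easy to mimic outside the kernel. A hypothetical userspace sketch, with plain struct timespec standing in for the kernel's timespec64:

```c
#include <time.h>

/* Simplified stand-in for struct inode; only the renamed field is shown. */
struct inode {
	struct timespec __i_ctime;	/* use inode_*_ctime accessors! */
};

/* Reads go through a helper, never through the raw field. */
static struct timespec inode_get_ctime(const struct inode *inode)
{
	return inode->__i_ctime;
}

/* Writes also go through a helper, which returns the stored value. */
static struct timespec inode_set_ctime_to_ts(struct inode *inode,
					     struct timespec ts)
{
	inode->__i_ctime = ts;
	return ts;
}
```

The double-underscore prefix does not enforce anything by itself; it only signals at review time that direct access is discouraged.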
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
On Thu, Jul 6, 2023 at 8:19 AM Shengjiu Wang wrote: > The clean way for fixing is to remove the code in fsl_sai_set_bclk() > and add "fsl,sai-mclk-direction-output;" property in dts for some > node. Yes, after thinking more about it, I agree. So what you are proposing is basically a revert of the patch in Subject and I assume that the reason you did the original patch was that you missed passing fsl,sai-mclk-direction-output in DT. I will send the revert. Thanks
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Thu, 6 Jul 2023, Gerald Schaefer wrote:
> Since none of my remarks on the comments seem valid or strictly necessary
> any more, and I also could not find functional issues, I think you can add
> this patch as new version for 07/12. And I can now give you this:
>
> Reviewed-by: Gerald Schaefer

Great, thanks a lot Gerald. The one change I'm making to it is then this
merged in:

--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -455,6 +455,11 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	page = virt_to_page(pgtable);
 	SetPageActive(page);
 	page_table_free(mm, (unsigned long *)pgtable);
+	/*
+	 * page_table_free() does not do the pgste gmap_unlink() which
+	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
+	 */
+	WARN_ON_ONCE(mm_alloc_pgste(mm));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 5/15/23 5:16 PM, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures. Nevertheless, s390 can have scheduler groups of unequal
> size.
>
> This introduces a performance degradation in the following scenario:
>
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
>
> Placing some workload ( x = one task ) yields the following
> scenarios:
>
> The first 5 tasks are distributed evenly across the two groups.
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
>          x x x          x x
>
> Adding a 6th task yields the following distribution:
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
> SMT1     x x x          x x
> SMT2                    x
>
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT
> thread, which comes with a performance penalty. This means, that in
> the window of 6-8 tasks, load balancing is done suboptimally, because
> SMT is used although there is no reason to do so as fully idle CPUs
> are still available.
>
> Taking the weight of the scheduler groups into account ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
>
> Signed-off-by: Tobias Huschle
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	 * group's child domain.
>  	 */
>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
> +	    busiest->sum_nr_running * local->group_weight >
> +	    local->sum_nr_running * busiest->group_weight + 1)
>  		goto force_balance;
>

I assume it's SMT2 here. A sched group is enclosed in [busy_cpus/idle_cpus/weight].

Let's take an example: we will begin with the said imbalance.

[3/9/12] -- local group
[3/1/4]  -- busy group

We will evaluate 3*12 > (4*(3+1)), which is true, and proceed further to
balance. But calculate_imbalance will lead to zero as the imbalance, no? In
the prefer-sibling case it finds the difference of sum_nr_running of the two
groups: it will be 3-3 = 0.

This would need modifications to calculate_imbalance. Maybe something like
this will help? NOT TESTED AT ALL.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e027d4086edc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}

-	if (busiest->group_weight == 1 || sds->prefer_sibling) {
+	if (busiest->group_weight == 1 ||
+	    (sds->prefer_sibling &&
+	     busiest->group_weight == local->group_weight)) {
 		unsigned int nr_diff = busiest->sum_nr_running;
 		/*
 		 * When prefer sibling, evenly spread running tasks on
@@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		env->migration_type = migrate_task;
 		lsub_positive(&nr_diff, local->sum_nr_running);
 		env->imbalance = nr_diff;
+	}
+	if (sds->prefer_sibling &&
+	    busiest->group_weight != local->group_weight) {
+		env->migration_type = migrate_task;
+		env->imbalance = 1;
 	} else {

---

On a tangential dimension. Since prefer_siblings makes the inherent
assumption that all groups have equal weight, it will cause complications
when group_weights are different. I think that
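Stepping back from the diff: the arithmetic in the [3/9/12] vs [3/1/4] example can be checked in isolation. The following is a hypothetical userspace sketch, not kernel code — it models only the two fields the discussed conditions read, with `prefer_sibling_imbalance()` mimicking what lsub_positive() does in calculate_imbalance() today:

```c
/* Local stand-ins for the two fields the discussed conditions read. */
struct group_stats {
	unsigned int sum_nr_running;
	unsigned int group_weight;
};

/* Weighted force-balance test from the patch under discussion. */
static int force_balance(const struct group_stats *busiest,
			 const struct group_stats *local)
{
	return busiest->sum_nr_running * local->group_weight >
	       local->sum_nr_running * busiest->group_weight + 1;
}

/*
 * prefer_sibling imbalance as calculate_imbalance() computes it today:
 * the difference of the two groups' task counts, clamped at zero
 * (which is what lsub_positive() does).
 */
static unsigned int prefer_sibling_imbalance(const struct group_stats *busiest,
					     const struct group_stats *local)
{
	unsigned int nr_diff = busiest->sum_nr_running;

	if (nr_diff > local->sum_nr_running)
		nr_diff -= local->sum_nr_running;
	else
		nr_diff = 0;
	return nr_diff;
}
```

For local [3/9/12] and busiest [3/1/4], force_balance() fires (3*12 = 36 against 3*4 + 1 = 13), yet the computed imbalance is 3 - 3 = 0 — exactly the inconsistency the reply points out.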
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 7/6/23 6:29 PM, David Hildenbrand wrote: > On 06.07.23 14:32, Aneesh Kumar K V wrote: >> On 7/6/23 4:44 PM, David Hildenbrand wrote: >>> On 06.07.23 11:36, Aneesh Kumar K V wrote: On 7/6/23 2:48 PM, David Hildenbrand wrote: > On 06.07.23 10:50, Aneesh Kumar K.V wrote: >> With memmap on memory, some architecture needs more details w.r.t altmap >> such as base_pfn, end_pfn, etc to unmap vmemmap memory. > > Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't? > > IOW, why can't ppc64 simply allocate the vmemmap from the start of the > memblock (-> base_pfn) and use the stored number of vmemmap pages to > calculate the end_pfn? > > To rephrase: if the vmemmap is not at the beginning and doesn't cover > full apgeblocks, memory onlining/offlining would be broken. > > [...] With ppc64 and 64K pagesize and different memory block sizes, we can end up allocating vmemmap backing memory from outside altmap because a single page vmemmap can cover 1024 pages (64 *1024/sizeof(struct page)). and that can point to pages outside the dev_pagemap range. So on free we check >>> >>> So you end up with a mixture of altmap and ordinarily-allocated vmemmap >>> pages? That sound wrong (and is counter-intuitive to the feature in >>> general, where we *don't* want to allocate the vmemmap from outside the >>> altmap). >>> >>> (64 * 1024) / sizeof(struct page) -> 1024 pages >>> >>> 1024 pages * 64k = 64 MiB. >>> >>> What's the memory block size on these systems? If it's >= 64 MiB the >>> vmemmap of a single memory block fits into a single page and we should be >>> fine. >>> >>> Smells like you want to disable the feature on a 64k system. >>> >> >> But that part of vmemmap_free is common for both dax,dax kmem and the new >> memmap on memory feature. ie, ppc64 vmemmap_free have checks which require >> a full altmap structure with all the details in. So for memmap on memmory to >> work on ppc64 we do require similar altmap struct. 
>> Hence the idea
>> of adding vmemmap_altmap to struct memory_block

> I'd suggest making sure that for the memmap_on_memory case you really
> *always* allocate from the altmap (that's what the feature is about after
> all), and otherwise block the feature (i.e., arch_mhp_supports_... should
> reject it).

Sure. How about?

bool mhp_supports_memmap_on_memory(unsigned long size)
{
	unsigned long nr_pages = size >> PAGE_SHIFT;
	unsigned long vmemmap_size = nr_pages * sizeof(struct page);

	if (!radix_enabled())
		return false;
	/*
	 * memmap on memory only supported with memory block size add/remove
	 */
	if (size != memory_block_size_bytes())
		return false;
	/*
	 * Also make sure the vmemmap allocation is fully contained
	 * so that we always allocate vmemmap memory from altmap area.
	 */
	if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE))
		return false;
	/*
	 * The pageblock alignment requirement is met by using
	 * reserve blocks in altmap.
	 */
	return true;
}

> Then, you can reconstruct the altmap layout trivially
>
> base_pfn: start of the range to unplug
> end_pfn: base_pfn + nr_vmemmap_pages
>
> and pass that to the removal code, which will do the right thing, no?
>
> Sure, remembering the altmap might be a potential cleanup (eventually?), but
> the basic reasoning why this is required as patch #1 IMHO is wrong: if you
> say you support memmap_on_memory for a configuration, then you should also
> properly support it (allocate from the hotplugged memory), not silently fall
> back to something else.

I guess you want to keep the altmap introduction as a later patch in the
series and not the preparatory patch? Or are you ok with just adding the
additional check I mentioned above w.r.t size value and keep this patch as
patch 1 as a generic cleanup (avoiding the recomputation of
altmap->alloc/base_pfn/end_pfn)?

-aneesh
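The IS_ALIGNED() test in that check is plain arithmetic. A hypothetical userspace sketch — assuming 64K base pages and a 64-byte struct page, both assumptions chosen to match the ppc64 discussion — shows which memory block sizes keep the vmemmap fully inside the altmap:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed configuration, mirroring the 64K-page ppc64 case discussed. */
#define PAGE_SHIFT		16	/* 64K base pages */
#define PAGE_SIZE		(1ULL << PAGE_SHIFT)
#define STRUCT_PAGE_SIZE	64ULL	/* assumed sizeof(struct page) */

static bool is_aligned(uint64_t val, uint64_t align)
{
	return (val & (align - 1)) == 0;
}

/*
 * One vmemmap page describes PAGE_SIZE / STRUCT_PAGE_SIZE = 1024 pages,
 * i.e. 64 MiB of memory.  If a memory block's vmemmap does not fill
 * whole pages, the last vmemmap page would also describe pfns outside
 * the block, so it could not be allocated from the block's altmap.
 */
static bool vmemmap_fully_contained(uint64_t memory_block_bytes)
{
	uint64_t nr_pages = memory_block_bytes >> PAGE_SHIFT;
	uint64_t vmemmap_size = nr_pages * STRUCT_PAGE_SIZE;

	return is_aligned(vmemmap_size, PAGE_SIZE);
}
```

Under these assumptions, a 16 MiB block needs only 16 KiB of vmemmap (a partial page, rejected), while any multiple of 64 MiB passes.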
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote: > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang wrote: > > > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > > > From: David Stevens > > > > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > > > memory into the guest that is backed by un-refcounted struct pages - for > > > example, higher order non-compound pages allocated by the amdgpu driver > > > via ttm_pool_alloc_page. > > > > I guess you mean the tail pages of the higher order non-compound pages? > > And as to the head page, it is said to be set to one coincidentally[*], > > and shall not be considered as refcounted. IIUC, refcount of this head > > page will be increased and decreased soon in hva_to_pfn_remapped(), so > > this may not be a problem(?). But treating this head page differently, > > as a refcounted one(e.g., to set the A/D flags), is weired. > > > > Or maybe I missed some context, e.g., can the head page be allocted to > > guest at all? > > Yes, this is to allow mapping the tail pages of higher order > non-compound pages - I should have been more precise in my wording. > The head pages can already be mapped into the guest. > > Treating the head and tail pages would require changing how KVM > behaves in a situation it supports today (rather than just adding > support for an unsupported situation). Currently, without this series, > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the > guest. When that happens, KVM sets the A/D flags. I'm not sure whether > that's actually valid behavior, nor do I know whether anyone actually > cares about it. But it's what KVM does today, and I would shy away > from modifying that behavior without good reason. > > > > > > > The bulk of this change is tracking the is_refcounted_page flag so that > > > non-refcounted pages don't trigger page_count() == 0 warnings. This is > > > done by storing the flag in an unused bit in the sptes. 
> > > > Also, maybe we should mention this only works on x86-64. > > > > > > > > Signed-off-by: David Stevens > > > --- > > > arch/x86/kvm/mmu/mmu.c | 44 + > > > arch/x86/kvm/mmu/mmu_internal.h | 1 + > > > arch/x86/kvm/mmu/paging_tmpl.h | 9 --- > > > arch/x86/kvm/mmu/spte.c | 4 ++- > > > arch/x86/kvm/mmu/spte.h | 12 - > > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++--- > > > 6 files changed, 62 insertions(+), 30 deletions(-) > > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > > index e44ab512c3a1..b1607e314497 100644 > > > --- a/arch/x86/kvm/mmu/mmu.c > > > +++ b/arch/x86/kvm/mmu/mmu.c > > > > ... > > > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, > > > struct kvm_memory_slot *slot, > > > bool host_writable = !fault || fault->map_writable; > > > bool prefetch = !fault || fault->prefetch; > > > bool write_fault = fault && fault->write; > > > + bool is_refcounted = !fault || fault->is_refcounted_page; > > > > Just wonder, what if a non-refcounted page is prefetched? Or is it > > possible in > > practice? > > Prefetching is still done via gfn_to_page_many_atomic, which sets > FOLL_GET. That's fixable, but it's not something this series currently > does. So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent whether the corresponding page is ref-countable or not, right? Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte) is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit the issue, though. Thanks, -- Isaku Yamahata
[PATCH v8 17/19] powerpc: mm: Convert to GENERIC_IOREMAP
From: Christophe Leroy By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(), generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and iounmap() are all visible and available to arch. Arch needs to provide wrapper functions to override the generic versions if there's arch specific handling in its ioremap_prot(), ioremap() or iounmap(). This change will simplify implementation by removing duplicated code with generic_ioremap_prot() and generic_iounmap(), and has the equivalent functioality as before. Here, add wrapper functions ioremap_prot() and iounmap() for powerpc's special operation when ioremap() and iounmap(). Signed-off-by: Christophe Leroy Signed-off-by: Baoquan He Reviewed-by: Christoph Hellwig Reviewed-by: Mike Rapoport (IBM) Cc: Michael Ellerman Cc: Nicholas Piggin Cc: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/io.h | 8 +++- arch/powerpc/mm/ioremap.c | 26 +- arch/powerpc/mm/ioremap_32.c | 19 +-- arch/powerpc/mm/ioremap_64.c | 12 ++-- 5 files changed, 16 insertions(+), 50 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 0b1172cbeccb..9222c138c457 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -193,6 +193,7 @@ config PPC select GENERIC_CPU_VULNERABILITIES if PPC_BARRIER_NOSPEC select GENERIC_EARLY_IOREMAP select GENERIC_GETTIMEOFDAY + select GENERIC_IOREMAP select GENERIC_IRQ_SHOW select GENERIC_IRQ_SHOW_LEVEL select GENERIC_PCI_IOMAPif PCI diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 67a3fb6de498..0732b743e099 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -889,8 +889,8 @@ static inline void iosync(void) * */ extern void __iomem *ioremap(phys_addr_t address, unsigned long size); -extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size, - unsigned long flags); +#define ioremap ioremap +#define ioremap_prot ioremap_prot extern void __iomem 
*ioremap_wc(phys_addr_t address, unsigned long size); #define ioremap_wc ioremap_wc @@ -904,14 +904,12 @@ void __iomem *ioremap_coherent(phys_addr_t address, unsigned long size); #define ioremap_cache(addr, size) \ ioremap_prot((addr), (size), pgprot_val(PAGE_KERNEL)) -extern void iounmap(volatile void __iomem *addr); +#define iounmap iounmap void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size); int early_ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot); -void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size, -pgprot_t prot, void *caller); extern void __iomem *__ioremap_caller(phys_addr_t, unsigned long size, pgprot_t prot, void *caller); diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c index 4f12504fb405..705e8e8ffde4 100644 --- a/arch/powerpc/mm/ioremap.c +++ b/arch/powerpc/mm/ioremap.c @@ -41,7 +41,7 @@ void __iomem *ioremap_coherent(phys_addr_t addr, unsigned long size) return __ioremap_caller(addr, size, prot, caller); } -void __iomem *ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags) +void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long flags) { pte_t pte = __pte(flags); void *caller = __builtin_return_address(0); @@ -74,27 +74,3 @@ int early_ioremap_range(unsigned long ea, phys_addr_t pa, return 0; } - -void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size, -pgprot_t prot, void *caller) -{ - struct vm_struct *area; - int ret; - unsigned long va; - - area = __get_vm_area_caller(size, VM_IOREMAP, IOREMAP_START, IOREMAP_END, caller); - if (area == NULL) - return NULL; - - area->phys_addr = pa; - va = (unsigned long)area->addr; - - ret = ioremap_page_range(va, va + size, pa, prot); - if (!ret) - return (void __iomem *)area->addr + offset; - - vunmap_range(va, va + size); - free_vm_area(area); - - return NULL; -} diff --git a/arch/powerpc/mm/ioremap_32.c b/arch/powerpc/mm/ioremap_32.c index 
9d13143b8be4..ca5bc6be3e6f 100644 --- a/arch/powerpc/mm/ioremap_32.c +++ b/arch/powerpc/mm/ioremap_32.c @@ -21,6 +21,13 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, pgprot_t prot, void *call phys_addr_t p, offset; int err; + /* +* If the address lies within the first 16 MB, assume it's in ISA +* memory space +*/ + if (addr < SZ_16M) + addr += _ISA_MEM_BASE; + /* * Choose an address to map it to. * Once the vmalloc system is running, we use
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Wed, 5 Jul 2023 18:20:21 -0700 (PDT) Hugh Dickins wrote: > On Wed, 5 Jul 2023, Gerald Schaefer wrote: > > On Tue, 4 Jul 2023 10:03:57 -0700 (PDT) > > Hugh Dickins wrote: > > > On Tue, 4 Jul 2023, Gerald Schaefer wrote: > > > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT) > > > > Hugh Dickins wrote: > > > > > On Thu, 29 Jun 2023, Hugh Dickins wrote: > > > ... > > > > > --- a/arch/s390/mm/pgalloc.c > > > > > +++ b/arch/s390/mm/pgalloc.c > > > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page) > > > > > * logic described above. Both AA bits are set to 1 to denote a > > > > > 4KB-pgtable > > > > > * while the PP bits are never used, nor such a page is added to or > > > > > removed > > > > > * from mm_context_t::pgtable_list. > > > > > + * > > > > > + * pte_free_defer() overrides those rules: it takes the page off > > > > > pgtable_list, > > > > > + * and prevents both 2K fragments from being reused. > > > > > pte_free_defer() has to > > > > > + * guarantee that its pgtable cannot be reused before the RCU grace > > > > > period > > > > > + * has elapsed (which page_table_free_rcu() does not actually > > > > > guarantee). > > > > > > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not > > > > allow reuse before grace period elapsed. And I hope that it does so, by > > > > setting the PP bits, which would be noticed in page_table_alloc(), in > > > > case the page would be seen there. > > > > > > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to > > > > the > > > > end of the list, and so they could be seen in page_table_alloc(), but > > > > they > > > > should not be reused before grace period elapsed and > > > > __tlb_remove_table() > > > > cleared the PP bits, as far as I understand. > > > > > > > > So what exactly do you mean with "which page_table_free_rcu() does not > > > > actually > > > > guarantee"? 
> > > > > > I'll answer without locating and re-reading what Jason explained earlier, > > > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table(): > > > he may have explained it better. And without working out again all the > > > MMU_GATHER #defines, and which of them do and do not apply to s390 here. > > > > > > The detail that sticks in my mind is the fallback in tlb_remove_table() > > > > Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(), > > but that is rather a generic issue, and not s390-specific. > > Yes. > > > I thought you > > meant some s390-oddity here, of which we have a lot, unfortunately... > > Of course, we call tlb_remove_table() from our page_table_free_rcu(), so > > I guess you could say that page_table_free_rcu() cannot guarantee what > > tlb_remove_table() cannot guarantee. > > > > Maybe change to "which page_table_free_rcu() does not actually guarantee, > > by calling tlb_remove_table()", to make it clear that this is not a problem > > of page_table_free_rcu() itself. > > Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-) > > > > > > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot > > > batch the tables for freeing by RCU, and resorts instead to an immediate > > > TLB flush (I think: that again involves chasing definitions) followed by > > > tlb_remove_table_sync_one() - which just delivers an interrupt to each > > > CPU, > > > and is commented: > > > /* > > > * This isn't an RCU grace period and hence the page-tables cannot be > > > * assumed to be actually RCU-freed. > > > * > > > * It is however sufficient for software page-table walkers that rely on > > > * IRQ disabling. > > > */ > > > > > > Whether that's good for your PP pages or not, I've given no thought: > > > I've just taken it on trust that what s390 has working today is good. 
> > > > Yes, we should be fine with that, current code can be trusted :-) > > Glad to hear it :-) Yes, I think it's not actually relying on the "rcu" > implied by the function name. Ah ok, now I get it. Never noticed that naming it "free_rcu" could be misleading. It is only ever called from pte_free_tlb(), so always in that "semi-RCU" context. If you just look at the name, you could expect this to always free pagetables by RCU, which would be exactly what you need for pte_free_defer(), and which of course cannot be guaranteed by our page_table_free_rcu(). IOW, exactly what your comment says, and now I think it is actually fine as it is :-) I guess I am a bit lamebrained this week, due to early shift and not enough sleep... > > > > > > > > > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(), > > > then I would not have written "(which page_table_free_rcu() does not > > > actually guarantee)". But it cannot use call_rcu() because it does > > > not have an rcu_head to work with - it's in some generic code, and > > > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set. > > > > > >
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Thu, Jul 06, 2023 at 02:29:24PM +0900, David Stevens wrote: > On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang wrote: > > > > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote: > > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang > > > wrote: > > > > > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long > > > > > addr, bool write_fault, > > > > > * The slow path to get the pfn of the specified host virtual > > > > > address, > > > > > * 1 indicates success, -errno is returned if error is detected. > > > > > */ > > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > > > > write_fault, > > > > > -bool interruptible, bool *writable, > > > > > kvm_pfn_t *pfn) > > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t > > > > > *pfn) > > > > > { > > > > > - unsigned int flags = FOLL_HWPOISON; > > > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > > > > struct page *page; > > > > > int npages; > > > > > > > > > > might_sleep(); > > > > > > > > > > - if (writable) > > > > > - *writable = write_fault; > > > > > - > > > > > - if (write_fault) > > > > > - flags |= FOLL_WRITE; > > > > > - if (async) > > > > > - flags |= FOLL_NOWAIT; > > > > > - if (interruptible) > > > > > - flags |= FOLL_INTERRUPTIBLE; > > > > > - > > > > > - npages = get_user_pages_unlocked(addr, 1, , flags); > > > > > + npages = get_user_pages_unlocked(foll->hva, 1, , flags); > > > > > if (npages != 1) > > > > > return npages; > > > > > > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > > > > foll->allow_write_mapping; > > > > > + > > > > > /* map read fault as writable if possible */ > > > > > - if (unlikely(!write_fault) && writable) { > > > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > > > > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here. 
> > > > > > The two statements are logically equivalent, although I guess using > > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more > > > verbose. > > > > Well, as the comment says, we wanna try to map the read fault as writable > > whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE > > for write faults. So I guess using !foll->writable will not allow this. > > Did I miss anything? > > We just set the foll->writable out parameter to be equal to > ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a = > foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have > !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a && > b. Ouch, my bad again... I typed "!foll->writable", but missed the "!" in my head while calculating... Thanks! :) B.R. Yu
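The boolean identity stepped through above is easy to check exhaustively. A throwaway sketch, with a standing in for (foll->flags & FOLL_WRITE) and b for foll->allow_write_mapping (local stand-ins, not the KVM code itself):

```c
#include <stdbool.h>

/* The guard as written: foll->writable = a && b, then !foll->writable && b. */
static bool guard_as_written(bool a, bool b)
{
	bool writable = a && b;

	return !writable && b;
}

/* The simplification derived in the thread: !(a && b) && b == !a && b. */
static bool guard_simplified(bool a, bool b)
{
	return !a && b;
}
```

Both forms agree on all four input combinations, confirming the derivation !(a && b) && b -> (!a || !b) && b -> !a && b.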
Re: [PATCH] powernv/opal-prd: Silence memcpy() run-time false positive warnings
On 2023-07-05 11:06:46 Wed, Jordan Niethe wrote: > > > On 26/6/23 5:04 pm, Mahesh Salgaonkar wrote: > > opal_prd_msg_notifier extracts the opal prd message size from the message > > header and uses it for allocating opal_prd_msg_queue_item that includes > > the correct message size to be copied. However, while running under > > CONFIG_FORTIFY_SOURCE=y, it triggers following run-time warning: > > > > [ 6458.234352] memcpy: detected field-spanning write (size 32) of single > > field ">msg" at arch/powerpc/platforms/powernv/opal-prd.c:355 (size 4) > > [ 6458.234390] WARNING: CPU: 9 PID: 660 at > > arch/powerpc/platforms/powernv/opal-prd.c:355 > > opal_prd_msg_notifier+0x174/0x188 [opal_prd] > > [...] > > [ 6458.234709] NIP [c0080e0c0e6c] opal_prd_msg_notifier+0x174/0x188 > > [opal_prd] > > [ 6458.234723] LR [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 > > [opal_prd] > > [ 6458.234736] Call Trace: > > [ 6458.234742] [c002acb23c10] [c0080e0c0e68] > > opal_prd_msg_notifier+0x170/0x188 [opal_prd] (unreliable) > > [ 6458.234759] [c002acb23ca0] [c019ccc0] > > notifier_call_chain+0xc0/0x1b0 > > [ 6458.234774] [c002acb23d00] [c019ceac] > > atomic_notifier_call_chain+0x2c/0x40 > > [ 6458.234788] [c002acb23d20] [c00d69b4] > > opal_message_notify+0xf4/0x2c0 > > [...] > > > > Add a flexible array member to avoid false positive run-time warning. 
> >
> > Reported-by: Aneesh Kumar K.V
> > Signed-off-by: Mahesh Salgaonkar
> > ---
> >  arch/powerpc/platforms/powernv/opal-prd.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/powernv/opal-prd.c b/arch/powerpc/platforms/powernv/opal-prd.c
> > index 113bdb151f687..9e2c4775f75f5 100644
> > --- a/arch/powerpc/platforms/powernv/opal-prd.c
> > +++ b/arch/powerpc/platforms/powernv/opal-prd.c
> > @@ -30,7 +30,10 @@
> >   */
> >  struct opal_prd_msg_queue_item {
> >  	struct list_head		list;
> > -	struct opal_prd_msg_header	msg;
> > +	union {
> > +		struct opal_prd_msg_header	msg;
> > +		DECLARE_FLEX_ARRAY(__u8, msg_flex);
> > +	};
> >  };
> >
> >  static struct device_node *prd_node;
> > @@ -352,7 +355,7 @@ static int opal_prd_msg_notifier(struct notifier_block *nb,
> >  	if (!item)
> >  		return -ENOMEM;
> >
> > -	memcpy(&item->msg, msg->params, msg_size);
> > +	memcpy(&item->msg_flex, msg->params, msg_size);

> This does silence the warning but it seems like we might be able to go a
> little further.
>
> What about not adding that flex array to struct opal_prd_msg_queue_item,
> but adding one to struct opal_prd_msg. That is what the data format
> actually is.
>
> So we'd have something like:
>
> > struct opal_prd_msg {
> > 	struct opal_prd_msg_header hdr;
> > 	char msg[];
> > }
>
> and change things to use that instead?
>
> But that might be more trouble than it is worth,

Yup, it would cause more changes to the code elsewhere as well.

> alternatively we can just do:
>
> item->msg = *hdr;
> memcpy((char *)&item->msg + sizeof(*hdr), (char *)hdr + sizeof(*hdr),
>        msg_size - sizeof(*hdr));

Agree, this helps. I tried with a small change to the above code which also
works fine:

	item->msg = *hdr;
	hdr++;
	memcpy((char *)&item->msg + sizeof(*hdr), hdr, msg_size - sizeof(*hdr));

Will send out v2.

Thanks,
-Mahesh.
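The split copy settled on for v2 can be demonstrated as a standalone userspace sketch. The structures here are simplified stand-ins, not the real OPAL definitions; the point is that assigning the header by struct assignment and memcpy()ing only the variable-length tail keeps every write within a single named field, which is what CONFIG_FORTIFY_SOURCE's field-spanning check looks at:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct opal_prd_msg_header. */
struct msg_header {
	uint8_t  type;
	uint8_t  pad;
	uint16_t size;		/* total message size, header included */
};

/* Stand-in for struct opal_prd_msg_queue_item (list head omitted). */
struct queue_item {
	struct msg_header msg;	/* header of a variable-sized message */
};

/*
 * The caller allocates 'item' large enough for the whole message.  The
 * header is copied by struct assignment; only the tail goes through
 * memcpy(), starting at hdr + 1, mirroring the hdr++ form in the v2 patch.
 */
static void queue_msg(struct queue_item *item, const struct msg_header *hdr,
		      size_t msg_size)
{
	item->msg = *hdr;
	memcpy((char *)&item->msg + sizeof(*hdr), hdr + 1,
	       msg_size - sizeof(*hdr));
}
```

Because each write now targets either the whole `msg` field or bytes past it through an explicit char pointer, the fortified memcpy() no longer sees a "field-spanning write" of the 4-byte header.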
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
> Thanks for your patch, which is now commit 322dbe898f82fd8a > ("ARM: dma-mapping: split out arch_dma_mark_clean() helper") in > esmil/jh7100-dmapool. Well, something is wrong with that branch then, and this series still needs more work, and should eventually be merged through the dma-mapping tree.
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
Hi Andreas, On Thu, Jul 6, 2023 at 9:34 AM Andreas Henriksson wrote: > I think your initial review comment was spot on Fabio. There probably > needs to be a(n imx8mm) specific flag that says when this workaround > should be applied and gate the code in bclk function on that. > Atleast that's the only thing I can think of if my interpretation of the > problem for imx8mm is correct. Yes, deciding whether MCLK_EN should act as a clock gate based on the number of SAI registers seems fragile to me. I have sent a proposal as RFC. Please give it a try if you have a chance. I would like Shengjiu to confirm if imx8mn and imx93 should also behave like imx8mm in this aspect. Cheers
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
Hello Fabio,

Maybe I shouldn't comment as I'm already in deep water, but...

On Thu, Jul 06, 2023 at 08:32:57AM -0300, Fabio Estevam wrote:
> On Thu, Jul 6, 2023 at 8:19 AM Shengjiu Wang wrote:
>
> > No, this is the code in probe().
> > The code with the issue is in fsl_sai_set_bclk().
>
> Yes, I put it in the wrong place.
>
> > The clean way for fixing is to remove the code in fsl_sai_set_bclk()
> > and add "fsl,sai-mclk-direction-output;" property in dts for some
> > node.
>
> Yes, what about this?
>
> --- a/sound/soc/fsl/fsl_sai.c
> +++ b/sound/soc/fsl/fsl_sai.c
> @@ -507,7 +507,7 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
>  				   savediv / 2 - 1);
>  	}
>
> -	if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
> +	if (sai->soc_data->max_register >= FSL_SAI_MCTL && sai->mclk_direction_output) {
>  		/* SAI is in master mode at this point, so enable MCLK */
>  		regmap_update_bits(sai->regmap, FSL_SAI_MCTL,
>  				   FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN);

This is exactly the same as, and thus redundant to, the code already in the
probe function?!

If I understood it correctly, the problem Shengjiu was trying to address was
that apparently on i.MX8MM (and only imx8mm?!), even when using MCLK as
*input*, you still need to enable it because otherwise BCLK will not be
generated. (I personally don't know anything about this or the imx8mm
variant though.)

The problem could probably equally well be worked around by always setting
the fsl,sai-mclk-direction-output; devicetree property on imx8mm, even when
using MCLK as input. But I'm just guessing here to be honest. This is just
how I understood the purpose of the initial patch that was merged.

The current state affects more devices than imx8mm though, making MCLK in
input mode not possible. I think your initial review comment was spot on,
Fabio. There probably needs to be a(n imx8mm-)specific flag that says when
this workaround should be applied, and the code in the bclk function should
be gated on that.

At least that's the only thing I can think of, if my interpretation of the
problem for imx8mm is correct.

Regards,
Andreas Henriksson
Re: [PATCH v4 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
On 7/6/23 07:18, Arnd Bergmann wrote:
> On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote:
> > +
> > +config CRASH_DUMP
> > +	bool "kernel crash dumps"
> > +	depends on ARCH_SUPPORTS_CRASH_DUMP
> > +	select CRASH_CORE
> > +	select KEXEC
>
> Today's linux-next now runs into a warning on arm64 and presumably
> others, with the same problem as on arm earlier:
>
> WARNING: unmet direct dependencies detected for KEXEC
>   Depends on [n]: ARCH_SUPPORTS_KEXEC [=n]
>   Selected by [y]:
>   - CRASH_DUMP [=y] && ARCH_SUPPORTS_CRASH_DUMP [=y]
>
> I think the easiest way to make this reliable would be this fixup:
>
> diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> index d82a7ce59c051..e58ca6128d6ee 100644
> --- a/kernel/Kconfig.kexec
> +++ b/kernel/Kconfig.kexec
> @@ -91,6 +91,7 @@ config KEXEC_JUMP
>  config CRASH_DUMP
>  	bool "kernel crash dumps"
>  	depends on ARCH_SUPPORTS_CRASH_DUMP
> +	depends on ARCH_SUPPORTS_KEXEC
>  	select CRASH_CORE
>  	select KEXEC
>  	help
>
>     Arnd

Will do, thanks! I changed my testing to include allnoconfig and allyesconfig,
and that revealed some minor issues as well. Almost there!
eric
Re: [PATCH v4 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote:
> +
> +config CRASH_DUMP
> +	bool "kernel crash dumps"
> +	depends on ARCH_SUPPORTS_CRASH_DUMP
> +	select CRASH_CORE
> +	select KEXEC

Today's linux-next now runs into a warning on arm64 and presumably
others, with the same problem as on arm earlier:

WARNING: unmet direct dependencies detected for KEXEC
  Depends on [n]: ARCH_SUPPORTS_KEXEC [=n]
  Selected by [y]:
  - CRASH_DUMP [=y] && ARCH_SUPPORTS_CRASH_DUMP [=y]

I think the easiest way to make this reliable would be this fixup:

diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index d82a7ce59c051..e58ca6128d6ee 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -91,6 +91,7 @@ config KEXEC_JUMP
 config CRASH_DUMP
 	bool "kernel crash dumps"
 	depends on ARCH_SUPPORTS_CRASH_DUMP
+	depends on ARCH_SUPPORTS_KEXEC
 	select CRASH_CORE
 	select KEXEC
 	help

     Arnd
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
On Thu, Jul 6, 2023 at 7:08 PM Fabio Estevam wrote:
> Hi Andreas,
>
> On Thu, Jul 6, 2023 at 5:47 AM Andreas Henriksson wrote:
>
> > We've been working on an i.MX8MP where MCLK needs to be input and found
> > that this enables the MCLK as output despite not having set the
> > `fsl,sai-mclk-direction-output;` devicetree property in our DT.
> > Reverting the patch fixes the issues for us.
> >
> > I have to say that the code comment made me a bit confused, but once
> > I found the commit message I understood why this code existed.
> > If this is really i.MX8MM specific maybe mention that in the code
> > comment and please make the code actually only trigger on i.MX8MM.
> > It seems to me like these all fulfill the current criteria:
> > imx7ulp, imx8mq, imx8mm, imx8mp, imx8ulp, imx93
> >
> > Should I report this in bugzilla.kernel.org ?
>
> Should we do a fix like this?
>
> --- a/sound/soc/fsl/fsl_sai.c
> +++ b/sound/soc/fsl/fsl_sai.c
> @@ -1453,7 +1453,7 @@ static int fsl_sai_probe(struct platform_device *pdev)
>
>  	/* Select MCLK direction */
>  	if (sai->mclk_direction_output &&
> -	    sai->soc_data->max_register >= FSL_SAI_MCTL) {
> +	    sai->soc_data->max_register >= FSL_SAI_MCTL && sai->mclk_direction_output) {
>  		regmap_update_bits(sai->regmap, FSL_SAI_MCTL,
>  				   FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN);
>  	}

No, this is the code in probe().
The code with the issue is in fsl_sai_set_bclk().

The clean way for fixing is to remove the code in fsl_sai_set_bclk()
and add the "fsl,sai-mclk-direction-output;" property in dts for some
node.

best regards
wang shengjiu
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
On Thu, Jul 6, 2023 at 4:47 PM Andreas Henriksson wrote:
> Hello Shengjiu, Fabio,
>
> On Thu, May 19, 2022 at 10:23:06AM -0300, Fabio Estevam wrote:
> > Hi Shengjiu,
> >
> > On Thu, May 19, 2022 at 9:49 AM Shengjiu Wang wrote:
> >
> > > diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
> > > index fa950dde5310..dae16a14f177 100644
> > > --- a/sound/soc/fsl/fsl_sai.c
> > > +++ b/sound/soc/fsl/fsl_sai.c
> > > @@ -437,6 +437,12 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
> > >  			   FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
> > >  			   savediv / 2 - 1);
> > >
> > > +	if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
> >
> > Isn't it a bit fragile to take this decision based on the number of
> > SAI registers in the SoC?
> >
> > What about adding a specific field in soc_data for such a purpose?
>
> We've been working on an i.MX8MP where MCLK needs to be input and found
> that this enables the MCLK as output despite not having set the
> `fsl,sai-mclk-direction-output;` devicetree property in our DT.
> Reverting the patch fixes the issues for us.

Good catch. Seems there is an issue here.

best regards
wang shengjiu

> I have to say that the code comment made me a bit confused, but once
> I found the commit message I understood why this code existed.
> If this is really i.MX8MM specific maybe mention that in the code
> comment and please make the code actually only trigger on i.MX8MM.
> It seems to me like these all fulfill the current criteria:
> imx7ulp, imx8mq, imx8mm, imx8mp, imx8ulp, imx93
>
> Should I report this in bugzilla.kernel.org ?
>
> Regards,
> Andreas Henriksson
Re: [PATCH v2 08/92] fs: new helper: simple_rename_timestamp
On Wed 05-07-23 14:58:11, Jeff Layton wrote:
> A rename potentially involves updating 4 different inode timestamps. Add
> a function that handles the details sanely, and convert the libfs.c
> callers to use it.
>
> Signed-off-by: Jeff Layton

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara

								Honza

> ---
>  fs/libfs.c         | 36 +++-
>  include/linux/fs.h |  2 ++
>  2 files changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a7e56baf8bbd..9ee79668c909 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry *dentry)
>  }
>  EXPORT_SYMBOL(simple_rmdir);
>
> +/**
> + * simple_rename_timestamp - update the various inode timestamps for rename
> + * @old_dir: old parent directory
> + * @old_dentry: dentry that is being renamed
> + * @new_dir: new parent directory
> + * @new_dentry: target for rename
> + *
> + * POSIX mandates that the old and new parent directories have their ctime and
> + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have
> + * their ctime updated.
> + */
> +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
> +			     struct inode *new_dir, struct dentry *new_dentry)
> +{
> +	struct inode *newino = d_inode(new_dentry);
> +
> +	old_dir->i_mtime = inode_set_ctime_current(old_dir);
> +	if (new_dir != old_dir)
> +		new_dir->i_mtime = inode_set_ctime_current(new_dir);
> +	inode_set_ctime_current(d_inode(old_dentry));
> +	if (newino)
> +		inode_set_ctime_current(newino);
> +}
> +EXPORT_SYMBOL_GPL(simple_rename_timestamp);
> +
>  int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
>  			   struct inode *new_dir, struct dentry *new_dentry)
>  {
> @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
>  			inc_nlink(old_dir);
>  		}
>  	}
> -	old_dir->i_ctime = old_dir->i_mtime =
> -		new_dir->i_ctime = new_dir->i_mtime =
> -		d_inode(old_dentry)->i_ctime =
> -		d_inode(new_dentry)->i_ctime = current_time(old_dir);
> -
> +	simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(simple_rename_exchange);
> @@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
>  		  struct dentry *old_dentry, struct inode *new_dir,
>  		  struct dentry *new_dentry, unsigned int flags)
>  {
> -	struct inode *inode = d_inode(old_dentry);
>  	int they_are_dirs = d_is_dir(old_dentry);
>
>  	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
> @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
>  			inc_nlink(new_dir);
>  		}
>
> -	old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
> -		new_dir->i_mtime = inode->i_ctime = current_time(old_dir);
> -
> +	simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
>  	return 0;
>  }
>  EXPORT_SYMBOL(simple_rename);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bdfbd11a5811..14e38bd900f1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2979,6 +2979,8 @@ extern int simple_open(struct inode *inode, struct file *file);
>  extern int simple_link(struct dentry *, struct inode *, struct dentry *);
>  extern int simple_unlink(struct inode *, struct dentry *);
>  extern int simple_rmdir(struct inode *, struct dentry *);
> +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
> +			     struct inode *new_dir, struct dentry *new_dentry);
>  extern int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
>  				  struct inode *new_dir, struct dentry *new_dentry);
>  extern int simple_rename(struct mnt_idmap *, struct inode *,
> --
> 2.41.0
>

-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
Hello Shengjiu, Fabio,

On Thu, May 19, 2022 at 10:23:06AM -0300, Fabio Estevam wrote:
> Hi Shengjiu,
>
> On Thu, May 19, 2022 at 9:49 AM Shengjiu Wang wrote:
>
> > diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
> > index fa950dde5310..dae16a14f177 100644
> > --- a/sound/soc/fsl/fsl_sai.c
> > +++ b/sound/soc/fsl/fsl_sai.c
> > @@ -437,6 +437,12 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
> >  			   FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
> >  			   savediv / 2 - 1);
> >
> > +	if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
>
> Isn't it a bit fragile to take this decision based on the number of
> SAI registers in the SoC?
>
> What about adding a specific field in soc_data for such a purpose?

We've been working on an i.MX8MP where MCLK needs to be input and found
that this enables the MCLK as output despite not having set the
`fsl,sai-mclk-direction-output;` devicetree property in our DT.
Reverting the patch fixes the issues for us.

I have to say that the code comment made me a bit confused, but once
I found the commit message I understood why this code existed.
If this is really i.MX8MM specific maybe mention that in the code
comment and please make the code actually only trigger on i.MX8MM.
It seems to me like these all fulfill the current criteria:
imx7ulp, imx8mq, imx8mm, imx8mp, imx8ulp, imx93

Should I report this in bugzilla.kernel.org?

Regards,
Andreas Henriksson
Re: [next-20230705] kernel BUG mm/memcontrol.c:3715! (ltp/madvise06)
On 2023-07-06 11:41:38+0530, Sachin Sant wrote:
> While running LTP tests (madvise06) on IBM Power9 LPAR booted with
> 6.4.0-next-20230705 following crash is seen
>
> Injecting memory failure for pfn 0x3f79 at process virtual address 0x7fff9b74
> Memory failure: 0x3f79: recovery action for clean LRU page: Recovered
> madvise06 (133636): drop_caches: 3
> [ cut here ]
> kernel BUG at mm/memcontrol.c:3715!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
> Modules linked in: brd overlay exfat vfat fat xfs loop sctp ip6_udp_tunnel
> udp_tunnel dm_mod nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding ip_set tls rfkill
> nf_tables libcrc32c nfnetlink sunrpc pseries_rng vmx_crypto ext4 mbcache jbd2
> sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth
> fuse [last unloaded: init_module(O)]
> CPU: 10 PID: 133636 Comm: madvise06 Tainted: G O 6.4.0-next-20230705 #1
> Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.80
> (VL950_131) hv:phyp pSeries
> NIP: c054ea88 LR: c028b2a8 CTR: c054e8d0
> REGS: c0029dd7b890 TRAP: 0700 Tainted: G O (6.4.0-next-20230705)
> MSR: 80029033 CR: 28008288 XER:
> CFAR: c054e904 IRQMASK: 0
> GPR00: c028b2a8 c0029dd7bb30 c1431600 c002bc978000
> GPR04: c2b3b288 00010192 0001
> GPR08: c000f9abb180 0002 c002bc978580
> GPR12: c054e8d0 c0001ec53f00
> GPR16:
> GPR20: c0001b2e6578 00400cc0 7fff f000
> GPR24: c0029dd7bd30 c0029dd7bd58 c0001b2e6568
> GPR28: c0029dd7bde0 0001 0001 c0001b2e6540
> NIP [c054ea88] mem_cgroup_read_u64+0x1b8/0x1d0
> LR [c028b2a8] cgroup_seqfile_show+0xb8/0x160
> Call Trace:
> [c0029dd7bb50] [c028b2a8] cgroup_seqfile_show+0xb8/0x160
> [c0029dd7bbc0] [c0673ba4] kernfs_seq_show+0x44/0x60
> [c0029dd7bbe0] [c05c4238] seq_read_iter+0x238/0x620
> [c0029dd7bcb0] [c0675064] kernfs_fop_read_iter+0x1d4/0x2c0
> [c0029dd7bd00] [c057fbac] vfs_read+0x26c/0x350
> [c0029dd7bdc0] [c058077c] ksys_read+0x7c/0x140
> [c0029dd7be10] [c0036900] system_call_exception+0x140/0x350
> [c0029dd7be50] [c000d6a0] system_call_common+0x160/0x2e4
> --- interrupt: c00 at 0x7fff9eb41484
> NIP: 7fff9eb41484 LR: 10008540 CTR:
> REGS: c0029dd7be80 TRAP: 0c00 Tainted: G O (6.4.0-next-20230705)
> MSR: 8280f033 CR: 28002282 XER:
> IRQMASK: 0
> GPR00: 0003 7fffc33de7d0 7fff9ec27300 0013
> GPR04: 7fffc33e0aa0 1fff 0013
> GPR08: 7fffc33e0aa0
> GPR12: 7fff9ecca3a0
> GPR16: 10035520 10035b90 100347a8
> GPR20: 1002fb68 10063900 2000 1002fb68
> GPR24: 004c 1002fa78 7fffc33e0aa0
> GPR28: 0013 1fff 1fff
> NIP [7fff9eb41484] 0x7fff9eb41484
> LR [10008540] 0x10008540
> --- interrupt: c00
> Code: 7fa34800 409effc4 7c0802a6 3861 f8010030 4bfffdfd e8010030 786383e4
> 7c0803a6 4b6c 7c0802a6 f8010030 <0fe0> 7c0802a6 f8010030 0fe0
> ---[ end trace ]---
> pstore: backend (nvram) writing error (-1)
>
> Kernel panic - not syncing: Fatal exception
> Rebooting in 10 seconds..
>
> Git bisect points to following patch:
>
> commit 29bf1eb7d2abbdfc24c4ef7acf7a51b72dc43d2b
> memcg: drop kmem.limit_in_bytes
>
> Does the testcase madvise06 need an update?
>
> 90	tst_res(TINFO, "\tCached: %ld Kb",
> 91		SAFE_READ_MEMINFO("Cached:") - init_cached);
> 92
> 93	print_cgmem("memory.current");
> 94	print_cgmem("memory.swap.current");
> 95	print_cgmem("memory.kmem.usage_in_bytes");   <<== this line
> 96 }
>
> If I comment line 95 from the testcase, it completes successfully.

The handling for _KMEM was removed from mem_cgroup_read_u64() incorrectly.
It is used by the still existing kmem.*usage*_in_bytes in addition to the
now removed kmem.*limit*_in_bytes. (And kmem.max_usage_in_bytes,
kmem.failcnt.)

The testcase seems to be fine; it actually did its job.
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 06.07.23 14:32, Aneesh Kumar K V wrote:
> On 7/6/23 4:44 PM, David Hildenbrand wrote:
>> On 06.07.23 11:36, Aneesh Kumar K V wrote:
>>> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>>>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>>>> With memmap on memory, some architecture needs more details w.r.t altmap
>>>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>>>
>>>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>>>
>>>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the
>>>> memblock (-> base_pfn) and use the stored number of vmemmap pages to
>>>> calculate the end_pfn?
>>>>
>>>> To rephrase: if the vmemmap is not at the beginning and doesn't cover
>>>> full pageblocks, memory onlining/offlining would be broken.
>>>>
>>>> [...]
>>>
>>> With ppc64 and 64K pagesize and different memory block sizes, we can end
>>> up allocating vmemmap backing memory from outside altmap because a single
>>> page vmemmap can cover 1024 pages (64 * 1024 / sizeof(struct page)), and
>>> that can point to pages outside the dev_pagemap range. So on free we check
>>
>> So you end up with a mixture of altmap and ordinarily-allocated vmemmap
>> pages? That sounds wrong (and is counter-intuitive to the feature in
>> general, where we *don't* want to allocate the vmemmap from outside the
>> altmap).
>>
>> (64 * 1024) / sizeof(struct page) -> 1024 pages
>>
>> 1024 pages * 64k = 64 MiB.
>>
>> What's the memory block size on these systems? If it's >= 64 MiB the
>> vmemmap of a single memory block fits into a single page and we should be
>> fine.
>>
>> Smells like you want to disable the feature on a 64k system.
>
> But that part of vmemmap_free is common for both dax, dax kmem and the new
> memmap on memory feature. ie, ppc64 vmemmap_free has checks which require a
> full altmap structure with all the details in. So for memmap on memory to
> work on ppc64 we do require a similar altmap struct. Hence the idea of
> adding vmemmap_altmap to struct memory_block.

I'd suggest making sure that for the memmap_on_memory case you really
*always* allocate from the altmap (that's what the feature is about after
all), and otherwise block the feature (i.e., arch_mhp_supports_... should
reject it).

Then, you can reconstruct the altmap layout trivially:

base_pfn: start of the range to unplug
end_pfn: base_pfn + nr_vmemmap_pages

and pass that to the removal code, which will do the right thing, no?

Sure, remembering the altmap might be a potential cleanup (eventually?), but
the basic reasoning why this is required as patch #1 IMHO is wrong: if you
say you support memmap_on_memory for a configuration, then you should also
properly support it (allocate from the hotplugged memory), not silently fall
back to something else.

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 7/6/23 4:44 PM, David Hildenbrand wrote:
> On 06.07.23 11:36, Aneesh Kumar K V wrote:
>> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>>> With memmap on memory, some architecture needs more details w.r.t altmap
>>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>>
>>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>>
>>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the
>>> memblock (-> base_pfn) and use the stored number of vmemmap pages to
>>> calculate the end_pfn?
>>>
>>> To rephrase: if the vmemmap is not at the beginning and doesn't cover full
>>> pageblocks, memory onlining/offlining would be broken.
>>>
>>> [...]
>>
>> With ppc64 and 64K pagesize and different memory block sizes, we can end up
>> allocating vmemmap backing memory from outside altmap because a single page
>> vmemmap can cover 1024 pages (64 * 1024 / sizeof(struct page)), and that
>> can point to pages outside the dev_pagemap range. So on free we check
>
> So you end up with a mixture of altmap and ordinarily-allocated vmemmap
> pages? That sounds wrong (and is counter-intuitive to the feature in
> general, where we *don't* want to allocate the vmemmap from outside the
> altmap).
>
> (64 * 1024) / sizeof(struct page) -> 1024 pages
>
> 1024 pages * 64k = 64 MiB.
>
> What's the memory block size on these systems? If it's >= 64 MiB the
> vmemmap of a single memory block fits into a single page and we should be
> fine.
>
> Smells like you want to disable the feature on a 64k system.

But that part of vmemmap_free is common for both dax, dax kmem and the new
memmap on memory feature. ie, ppc64 vmemmap_free has checks which require a
full altmap structure with all the details in. So for memmap on memory to
work on ppc64 we do require a similar altmap struct. Hence the idea of
adding vmemmap_altmap to struct memory_block.

>> vmemmap_free() {
>> ...
>> 	if (altmap) {
>> 		alt_start = altmap->base_pfn;
>> 		alt_end = altmap->base_pfn + altmap->reserve +
>> 			  altmap->free + altmap->alloc + altmap->align;
>> 	}
>> ...
>> 	if (base_pfn >= alt_start && base_pfn < alt_end) {
>> 		vmem_altmap_free(altmap, nr_pages);
>>
>> to see whether we did use altmap for the vmemmap allocation.
>> -aneesh
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
On Thu, Jul 6, 2023 at 9:50 AM Geert Uytterhoeven wrote:
> On Thu, Jul 6, 2023 at 8:14 AM Benjamin Gray wrote:
> > On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote:
> > > Le 05/07/2023 à 02:30, Benjamin Gray a écrit :
> > > > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on
> > > > struct regmap_config, which is guarded behind CONFIG_REGMAP.
> > > >
> > > > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP")
> > > > exposed this by disabling the default pick unless KUNIT_ALL_TESTS is
> > > > set, causing the ppc64be allnoconfig build to fail.
> > > >
> > > > Signed-off-by: Benjamin Gray
> > > > ---
> > > >  drivers/rtc/Kconfig | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
> > > > index ffca9a8bb878..7455ebd189fe 100644
> > > > --- a/drivers/rtc/Kconfig
> > > > +++ b/drivers/rtc/Kconfig
> > > > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722
> > > >
> > > >  config RTC_DRV_DS1307
> > > >  	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, ISL12057"
> > > > +	select REGMAP
> > >
> > > As far as I can see, REGMAP defaults to Y when REGMAP_I2C is selected.
> > > Can you explain in more detail why you have to select it explicitly?
> > > If there is something wrong with the logic, then the logic should be
> > > fixed instead of just adding a selection of REGMAP for that particular
> > > RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or RTC_DRV_ABEOZ9
> > > might have the exact same problem.
> >
> > Right, yeah, I don't want to assert this patch is the correct solution,
> > sending it was more to offer a fix and allow discussion if it should be
> > resolved some other way (so thanks for replying, I appreciate it).
> >
> > In terms of why I made this patch, the way I see it, if a config option
> > requires another config option be set, then "selects" is the natural
> > way of phrasing this dependency. "default" on the REGMAP side seems
> > weird. If it's a default, does that mean it can be overridden? But
> > RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The
> > build breaks without it.
> >
> > But maybe KConfig works differently to my assumptions. Maybe the
> > referenced patch that causes the build failure is actually incorrect
> > (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not
> > supposed to depend on REGMAP like KUnit does?
>
> Thanks for CCing me!
>
> Looks like I made a really silly mistake here: my patch not only allows
> the user to enable REGMAP manually (for the test), but also to disable
> it manually, regardless if there are users or not :-(
>
> I think the proper fix is to replace the "default y if ..." by
> "select REGMAP" for all users.

I have sent a patch to do so, followed by a few related fixes:
https://lore.kernel.org/r/525c37a568b10623ffb2d108850afd7e37f9350e.1688643442.git.ge...@linux-m68k.org

Thanks!

Gr{oetje,eeting}s,

						Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
						-- Linus Torvalds
Re: [PATCH v2 15/92] spufs: convert to ctime accessor functions
On Wed, Jul 5, 2023, at 21:00, Jeff Layton wrote:
> In later patches, we're going to change how the inode's ctime field is
> used. Switch to using accessor functions instead of raw accesses of
> inode->i_ctime.
>
> Acked-by: Jeremy Kerr
> Reviewed-by: Jan Kara
> Signed-off-by: Jeff Layton

Acked-by: Arnd Bergmann
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
On Thu, Jul 6, 2023 at 8:19 AM Shengjiu Wang wrote:
> No, this is the code in probe().
> The code with the issue is in fsl_sai_set_bclk().

Yes, I put it in the wrong place.

> The clean way for fixing is to remove the code in fsl_sai_set_bclk()
> and add "fsl,sai-mclk-direction-output;" property in dts for some
> node.

Yes, what about this?

--- a/sound/soc/fsl/fsl_sai.c
+++ b/sound/soc/fsl/fsl_sai.c
@@ -507,7 +507,7 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
 				   savediv / 2 - 1);
 	}

-	if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
+	if (sai->soc_data->max_register >= FSL_SAI_MCTL && sai->mclk_direction_output) {
 		/* SAI is in master mode at this point, so enable MCLK */
 		regmap_update_bits(sai->regmap, FSL_SAI_MCTL,
 				   FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN);
Re: [PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
On 06.07.23 12:04, Aneesh Kumar K V wrote:
> On 7/6/23 2:54 PM, David Hildenbrand wrote:
>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>> Instead of checking for memmap on memory feature enablement within the
>>> functions checking for alignment, use the kernel parameter to control the
>>> memory hotplug flags. The generic kernel now enables memmap on memory
>>> feature if the hotplug flag request for the same.
>>>
>>> The ACPI code now can pass the flag unconditionally because the kernel
>>> will fallback to not using the feature if the alignment rules are not met.
>>>
>>> Signed-off-by: Aneesh Kumar K.V
>>> ---
>>>  drivers/acpi/acpi_memhotplug.c |  3 +--
>>>  include/linux/memory_hotplug.h | 14 ++
>>>  mm/memory_hotplug.c            | 35 +++---
>>>  3 files changed, 26 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
>>> index 24f662d8bd39..4d0096fc4cc2 100644
>>> --- a/drivers/acpi/acpi_memhotplug.c
>>> +++ b/drivers/acpi/acpi_memhotplug.c
>>> @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>>>  		if (!info->length)
>>>  			continue;
>>> -		if (mhp_supports_memmap_on_memory(info->length))
>>> -			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>>> +		mhp_flags |= get_memmap_on_memory_flags();
>>>  		result = __add_memory(mgid, info->start_addr, info->length,
>>>  				      mhp_flags);
>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>> index a769f44b8368..af7017122506 100644
>>> --- a/include/linux/memory_hotplug.h
>>> +++ b/include/linux/memory_hotplug.h
>>> @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size);
>>>  bool __mhp_supports_memmap_on_memory(unsigned long size);
>>>  #endif /* CONFIG_MEMORY_HOTPLUG */
>>> +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
>>> +extern bool memmap_on_memory;
>>> +static inline unsigned long get_memmap_on_memory_flags(void)
>>> +{
>>> +	if (memmap_on_memory)
>>> +		return MHP_MEMMAP_ON_MEMORY;
>>> +	return 0;
>>> +}
>>> +#else
>>> +static inline unsigned long get_memmap_on_memory_flags(void)
>>> +{
>>> +	return 0;
>>> +}
>>> +#endif
>>
>> That's kind-of ugly TBH. Why do we need this change?
>
> I was trying to avoid rest of the kernel doing
>
> 	if (mhp_supports_memmap_on_memory(info->length))
> 		mhp_flags |= MHP_MEMMAP_ON_MEMORY;

It would look much cleaner if you would simply have:

	mhp_flags |= MHP_MEMMAP_ON_MEMORY;
	result = __add_memory(mgid, info->start_addr, info->length, mhp_flags);

And modify the semantics of MHP_MEMMAP_ON_MEMORY to mean "allocate the
memmap from hotplugged memory if supported and enabled globally".

Then, we can simply ignore the flag in __add_memory() if either the global
toggle is off or if it's not supported for the range / by the arch.

Maybe, in the future we want more fine-grained control (as asked by
dax/kmem) and maybe not have the global toggle. But for now, that should be
good enough I guess.

-- 
Cheers,

David / dhildenb
Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 04/07/2023 11:11, Tobias Huschle wrote:
> On 2023-05-16 18:35, Dietmar Eggemann wrote:
>> On 15/05/2023 13:46, Tobias Huschle wrote:
>>> The current load balancer implementation implies that scheduler groups,
>>> within the same scheduler domain, all host the same number of CPUs.
>>>
>>> This appears to be valid for non-s390 architectures. Nevertheless, s390
>>> can actually have scheduler groups of unequal size.
>>
>> Arm (classical) big.Little had this for years before we switched to flat
>> scheduling (only MC sched domain) over CPU capacity boundaries for Arm
>> DynamIQ.
>>
>> Arm64 Juno platform in mainline:
>>
>> root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/cluster_cpus_list
>> 0,3-5
>> 1-2
>> 1-2
>> 0,3-5
>> 0,3-5
>> 0,3-5
>>
>> root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'
>>
>> domain0 39 <--
>> domain1 3f
>> domain0 06 <--
>> domain1 3f
>> domain0 06
>> domain1 3f
>> domain0 39
>> domain1 3f
>> domain0 39
>> domain1 3f
>> domain0 39
>> domain1 3f
>>
>> root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>> MC
>> DIE
>>
>> But we don't have SMT on the mobile processors.
>>
>> It looks like you are only interested to get group_weight dependency
>> into this 'prefer_sibling' condition of find_busiest_group()?
>
> Sorry, looks like your reply hit some bad filter of my mail program.
> Let me answer, although it's a bit late.
>
> Yes, I would like to get the group_weight into the prefer_sibling path.
> Unfortunately, we cannot go for a flat hierarchy, as the s390 hardware
> allows CPUs to be pretty far apart (cache-wise), which means the load
> balancer should avoid moving tasks back and forth between those CPUs if
> possible.
>
> We can't remove SD_PREFER_SIBLING either, as this would cause the load
> balancer to aim for having the same number of idle CPUs in all groups,
> which is a problem as well in asymmetric groups, for example:
>
> With SD_PREFER_SIBLING, aiming for same number of non-idle CPUs:
> 00 01 02 03 04 05 06 07 08 09 10 11  ||  12 13 14 15
>  x     x     x     x                      x  x  x  x
>
> Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs:
> 00 01 02 03 04 05 06 07 08 09 10 11  ||  12 13 14 15
>  x  x  x  x  x  x  x  x
>
> Hence the idea to add the group_weight to the prefer_sibling path.
>
> I was wondering if this would be the right place to address this issue
> or if I should go down another route.

Yes, it's the right place to fix it for you. IMHO, there is still some
discussion needed about the correct condition and changes in
calculate_imbalance() for your case, if I read the comments on this thread
correctly.

Arm64 big.Little wouldn't be affected since we explicitly remove
SD_PREFER_SIBLING on MC for our legacy MC,DIE setups to avoid spreading
tasks across DIE sched groups holding CPUs with different capacities.

[...]
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 06.07.23 11:36, Aneesh Kumar K V wrote:
> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>> With memmap on memory, some architecture needs more details w.r.t altmap
>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>
>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>
>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the
>> memblock (-> base_pfn) and use the stored number of vmemmap pages to
>> calculate the end_pfn?
>>
>> To rephrase: if the vmemmap is not at the beginning and doesn't cover full
>> pageblocks, memory onlining/offlining would be broken.
>>
>> [...]
>
> With ppc64 and 64K pagesize and different memory block sizes, we can end up
> allocating vmemmap backing memory from outside altmap because a single page
> vmemmap can cover 1024 pages (64 * 1024 / sizeof(struct page)), and that
> can point to pages outside the dev_pagemap range. So on free we check

So you end up with a mixture of altmap and ordinarily-allocated vmemmap
pages? That sounds wrong (and is counter-intuitive to the feature in
general, where we *don't* want to allocate the vmemmap from outside the
altmap).

(64 * 1024) / sizeof(struct page) -> 1024 pages

1024 pages * 64k = 64 MiB.

What's the memory block size on these systems? If it's >= 64 MiB the vmemmap
of a single memory block fits into a single page and we should be fine.

Smells like you want to disable the feature on a 64k system.

> vmemmap_free() {
> ...
> 	if (altmap) {
> 		alt_start = altmap->base_pfn;
> 		alt_end = altmap->base_pfn + altmap->reserve +
> 			  altmap->free + altmap->alloc + altmap->align;
> 	}
> ...
> 	if (base_pfn >= alt_start && base_pfn < alt_end) {
> 		vmem_altmap_free(altmap, nr_pages);
>
> to see whether we did use altmap for the vmemmap allocation.

>>> +/**
>>> + * struct vmem_altmap - pre-allocated storage for vmemmap_populate
>>> + * @base_pfn: base of the entire dev_pagemap mapping
>>> + * @reserve: pages mapped, but reserved for driver use (relative to @base)
>>> + * @free: free pages set aside in the mapping for memmap storage
>>> + * @align: pages reserved to meet allocation alignments
>>> + * @alloc: track pages consumed, private to vmemmap_populate()
>>> + */
>>> +struct vmem_altmap {
>>> +	unsigned long base_pfn;
>>> +	const unsigned long end_pfn;
>>> +	const unsigned long reserve;
>>> +	unsigned long free;
>>> +	unsigned long align;
>>> +	unsigned long alloc;
>>> +};
>>
>> Instead of embedding that, what about conditionally allocating it and
>> storing a pointer to it in "struct memory_block"?
>>
>> In the general case as of today, we don't have an altmap.
>
> Sure, but with the memmap on memory option it is essentially adding that,
> right?

At least on x86_64 and aarch64 only for 128 MiB DIMMs (and especially, not
memory added by hv-balloon, virtio-mem, xen-balloon). So in the general case
it's not that frequently used. Maybe on ppc64 once wired up.

> Is the concern related to the increase in the size of struct memory_block?

Partially. It looks cleaner to have !mem->altmap if there is no altmap.

-- 
Cheers,

David / dhildenb
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
Hi Andreas, On Thu, Jul 6, 2023 at 5:47 AM Andreas Henriksson wrote: > We've been working on an i.MX8MP where MCLK needs to be input and found > that this enables the MCLK as output despite not having set the > `fsl,sai-mclk-direction-output;` devicetree property in our DT. > Reverting the patch fixes the issues for us. > > I have to say that the code comment made me a bit confused, but once > I found the commit message I understood why this code existed. > If this is really i.MX8MM specific maybe mention that in the code > comment and please make the code actually only trigger on i.MX8MM. > It seems to me like these all fulfill the current criteria: > imx7ulp, imx8mq, imx8mm, imx8mp, imx8ulp, imx93 > > Should I report this in bugzilla.kernel.org ? Should we do a fix like this? --- a/sound/soc/fsl/fsl_sai.c +++ b/sound/soc/fsl/fsl_sai.c @@ -1453,7 +1453,8 @@ static int fsl_sai_probe(struct platform_device *pdev) /* Select MCLK direction */ - if (sai->soc_data->max_register >= FSL_SAI_MCTL) { + if (sai->mclk_direction_output && + sai->soc_data->max_register >= FSL_SAI_MCTL) { regmap_update_bits(sai->regmap, FSL_SAI_MCTL, FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN); }
Re: [PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
On 7/6/23 2:54 PM, David Hildenbrand wrote: > On 06.07.23 10:50, Aneesh Kumar K.V wrote: >> Instead of checking for memmap on memory feature enablement within the >> functions checking for alignment, use the kernel parameter to control the >> memory hotplug flags. The generic kernel now enables memmap on memory >> feature if the hotplug flag requests the same. >> >> The ACPI code now can pass the flag unconditionally because the kernel will >> fall back to not using the feature if the alignment rules are not met. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> drivers/acpi/acpi_memhotplug.c | 3 +-- >> include/linux/memory_hotplug.h | 14 ++ >> mm/memory_hotplug.c | 35 +++--- >> 3 files changed, 26 insertions(+), 26 deletions(-) >> >> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c >> index 24f662d8bd39..4d0096fc4cc2 100644 >> --- a/drivers/acpi/acpi_memhotplug.c >> +++ b/drivers/acpi/acpi_memhotplug.c >> @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct >> acpi_memory_device *mem_device) >> if (!info->length) >> continue; >> - if (mhp_supports_memmap_on_memory(info->length)) >> - mhp_flags |= MHP_MEMMAP_ON_MEMORY; >> + mhp_flags |= get_memmap_on_memory_flags(); >> result = __add_memory(mgid, info->start_addr, info->length, >> mhp_flags); >> diff --git a/include/linux/memory_hotplug.h >> b/include/linux/memory_hotplug.h >> index a769f44b8368..af7017122506 100644 >> --- a/include/linux/memory_hotplug.h >> +++ b/include/linux/memory_hotplug.h >> @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size); >> bool __mhp_supports_memmap_on_memory(unsigned long size); >> #endif /* CONFIG_MEMORY_HOTPLUG */ >> +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY >> +extern bool memmap_on_memory; >> +static inline unsigned long get_memmap_on_memory_flags(void) >> +{ >> + if (memmap_on_memory) >> + return MHP_MEMMAP_ON_MEMORY; >> + return 0; >> +} >> +#else >> +static inline unsigned long get_memmap_on_memory_flags(void) >> +{ >> + return 0; >> +} >> +#endif > > That's kind-of ugly TBH. > > Why do we need this change? I was trying to avoid the rest of the kernel doing if (mhp_supports_memmap_on_memory(info->length)) mhp_flags |= MHP_MEMMAP_ON_MEMORY; I was also following a pattern similar to static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void) { return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT; } The overall goal of the patch is to handle the fallback of not using altmap/memmap on memory inside add_memory_resource, so the callsites only express the desire to use memmap on memory if possible/enabled. -aneesh
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 7/6/23 2:48 PM, David Hildenbrand wrote: > On 06.07.23 10:50, Aneesh Kumar K.V wrote: >> With memmap on memory, some architecture needs more details w.r.t altmap >> such as base_pfn, end_pfn, etc to unmap vmemmap memory. > > Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't? > > IOW, why can't ppc64 simply allocate the vmemmap from the start of the > memblock (-> base_pfn) and use the stored number of vmemmap pages to > calculate the end_pfn? > > To rephrase: if the vmemmap is not at the beginning and doesn't cover full > pageblocks, memory onlining/offlining would be broken. > > [...] With ppc64 and 64K pagesize and different memory block sizes, we can end up allocating vmemmap backing memory from outside altmap because a single page vmemmap can cover 1024 pages (64 * 1024 / sizeof(struct page)), and that can point to pages outside the dev_pagemap range. So on free we check vmemmap_free() { ... if (altmap) { alt_start = altmap->base_pfn; alt_end = altmap->base_pfn + altmap->reserve + altmap->free + altmap->alloc + altmap->align; } ... if (base_pfn >= alt_start && base_pfn < alt_end) { vmem_altmap_free(altmap, nr_pages); to see whether we did use altmap for the vmemmap allocation. > >> +/** >> + * struct vmem_altmap - pre-allocated storage for vmemmap_populate >> + * @base_pfn: base of the entire dev_pagemap mapping >> + * @reserve: pages mapped, but reserved for driver use (relative to @base) >> + * @free: free pages set aside in the mapping for memmap storage >> + * @align: pages reserved to meet allocation alignments >> + * @alloc: track pages consumed, private to vmemmap_populate() >> + */ >> +struct vmem_altmap { >> + unsigned long base_pfn; >> + const unsigned long end_pfn; >> + const unsigned long reserve; >> + unsigned long free; >> + unsigned long align; >> + unsigned long alloc; >> +}; > > Instead of embedding that, what about conditionally allocating it and store a > pointer to it in the "struct memory_block"?
> > In the general case as of today, we don't have an altmap. > Sure but with memmap on memory option it is essentially adding that right? Is the concern related to the increase in the size of struct memory_block ? >> + >> struct memory_block { >> unsigned long start_section_nr; >> unsigned long state; /* serialized by the dev->lock */ >> @@ -77,11 +94,7 @@ struct memory_block { >> */ >> struct zone *zone; >> struct device dev; >> - /* >> - * Number of vmemmap pages. These pages >> - * lay at the beginning of the memory block. >> - */ >> - unsigned long nr_vmemmap_pages; >> + struct vmem_altmap altmap; >> struct memory_group *group; /* group (if any) for this block */ >> struct list_head group_next; /* next block inside memory group */ >> #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_MEMORY_HOTPLUG) >> @@ -147,7 +160,7 @@ static inline int hotplug_memory_notifier(notifier_fn_t >> fn, int pri) >> extern int register_memory_notifier(struct notifier_block *nb); >> extern void unregister_memory_notifier(struct notifier_block *nb); >> int create_memory_block_devices(unsigned long start, unsigned long size, > > [...] > >> static int check_cpu_on_node(int nid) >> @@ -2036,9 +2042,8 @@ EXPORT_SYMBOL(try_offline_node); >> static int __ref try_remove_memory(u64 start, u64 size) >> { >> - struct vmem_altmap mhp_altmap = {}; >> + int ret; >> struct vmem_altmap *altmap = NULL; >> - unsigned long nr_vmemmap_pages; >> int rc = 0, nid = NUMA_NO_NODE; >> BUG_ON(check_hotplug_memory_range(start, size)); >> @@ -2060,24 +2065,16 @@ static int __ref try_remove_memory(u64 start, u64 >> size) >> * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in >> * the same granularity it was added - a single memory block. >> */ >> + > > ^ unrelated change? >
Re: [PATCH v2 5/5] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 7/6/23 2:37 PM, David Hildenbrand wrote: > On 06.07.23 10:50, Aneesh Kumar K.V wrote: >> Radix vmemmap mapping can map things correctly at the PMD level or PTE >> level based on different device boundary checks. We also use altmap.reserve >> feature to align things correctly at pageblock granularity. We can end up >> losing some pages in memory with this. For ex: with 256MB memory block >> size, we require 4 pages to map vmemmap pages, In order to align things >> correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages >> 28 pages get reserved. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> arch/powerpc/Kconfig | 1 + >> arch/powerpc/mm/book3s64/radix_pgtable.c | 28 +++ >> .../platforms/pseries/hotplug-memory.c | 4 ++- >> 3 files changed, 32 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig >> index 116d6add0bb0..f890907e5bbf 100644 >> --- a/arch/powerpc/Kconfig >> +++ b/arch/powerpc/Kconfig >> @@ -157,6 +157,7 @@ config PPC >> select ARCH_HAS_UBSAN_SANITIZE_ALL >> select ARCH_HAVE_NMI_SAFE_CMPXCHG >> select ARCH_KEEP_MEMBLOCK >> + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU >> select ARCH_MIGHT_HAVE_PC_PARPORT >> select ARCH_MIGHT_HAVE_PC_SERIO >> select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX >> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c >> b/arch/powerpc/mm/book3s64/radix_pgtable.c >> index a62729f70f2a..c0bd60b5fb64 100644 >> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c >> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c >> @@ -1678,3 +1678,31 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) >> return 1; >> } >> + >> +/* >> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details >> + * some of the restrictions. We don't check for PMD_SIZE because our >> + * vmemmap allocation code can fallback correctly.
>> The pageblock > > x86 also has the fallback IIRC; the concern is rather that you end up with a > PTE-mapped vmemmap, which is slower at runtime than a PMD-mapped vmemmap. The memory block size is dependent on a config option at the hypervisor for power and we can have values ranging from 16MB - 512MB. With these different memory block sizes I was thinking relaxing these checks makes the feature more useful. Also with a page size of 64K, using a 2M vmemmap requires a larger memory block size. > >> + * alignment requirement is met using altmap->reserve blocks. >> + */ >> +bool mhp_supports_memmap_on_memory(unsigned long size) >> +{ >> + if (!radix_enabled()) >> + return false; >> + /* >> + * The pageblock alignment requirement is met by using >> + * reserve blocks in altmap. >> + */ >> + return size == memory_block_size_bytes(); >> +} > > I really dislike such arch overrides. > > I think the flow should be something like that, having a generic: > > bool mhp_supports_memmap_on_memory(unsigned long size) > { > ... > return arch_mhp_supports_memmap_on_memory(size) && > size == memory_block_size_bytes() && > ... > } Sure we can do that. But for ppc64 we also want to skip the PMD_SIZE and pageblock alignment restrictions. > where we'll also keep the pageblock check here. For ppc64, I converted that pageblock rule as a reserve blocks allocation with altmap. I am not sure whether we want to do that outside ppc64? > > And a ppc specific > > bool arch_mhp_supports_memmap_on_memory(unsigned long size) > { > /* > * Don't check for the vmemmap covering PMD_SIZE, we accept that > * we might get a PTE-mapped vmemmap. > */ > return radix_enabled(); > } > > We can then move the PMD_SIZE check into arch specific code (x86-aarch64). Sure, will rework the changes accordingly. -aneesh
Re: [PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
On 06.07.23 10:50, Aneesh Kumar K.V wrote: Instead of checking for memmap on memory feature enablement within the functions checking for alignment, use the kernel parameter to control the memory hotplug flags. The generic kernel now enables memmap on memory feature if the hotplug flag requests the same. The ACPI code now can pass the flag unconditionally because the kernel will fall back to not using the feature if the alignment rules are not met. Signed-off-by: Aneesh Kumar K.V --- drivers/acpi/acpi_memhotplug.c | 3 +-- include/linux/memory_hotplug.h | 14 ++ mm/memory_hotplug.c | 35 +++--- 3 files changed, 26 insertions(+), 26 deletions(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index 24f662d8bd39..4d0096fc4cc2 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (!info->length) continue; - if (mhp_supports_memmap_on_memory(info->length)) - mhp_flags |= MHP_MEMMAP_ON_MEMORY; + mhp_flags |= get_memmap_on_memory_flags(); result = __add_memory(mgid, info->start_addr, info->length, mhp_flags); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index a769f44b8368..af7017122506 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size); bool __mhp_supports_memmap_on_memory(unsigned long size); #endif /* CONFIG_MEMORY_HOTPLUG */ +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY +extern bool memmap_on_memory; +static inline unsigned long get_memmap_on_memory_flags(void) +{ + if (memmap_on_memory) + return MHP_MEMMAP_ON_MEMORY; + return 0; +} +#else +static inline unsigned long get_memmap_on_memory_flags(void) +{ + return 0; +} +#endif That's kind-of ugly TBH. Why do we need this change? -- Cheers, David / dhildenb
Re: [PATCH v2 2/5] mm/hotplug: Allow architecture override for memmap on memory feature
On 06.07.23 10:50, Aneesh Kumar K.V wrote: Some architectures like ppc64 want to enable this feature only with radix translation, and their vmemmap mappings have different alignment requirements. Add an override for mhp_supports_memmap_on_memory() and also use the altmap.reserve feature to adjust the pageblock alignment requirement. The patch also falls back to allocation of memmap outside memblock if the alignment rules are not met for memmap on memory allocation. This allows using the feature more widely, allocating memmap as much as possible within the memory block getting added. A follow-up patch to enable memmap on memory for ppc64 will use this. See my other reply. Rather have another arch check than overriding it completely. -- Cheers, David / dhildenb
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 06.07.23 10:50, Aneesh Kumar K.V wrote: With memmap on memory, some architecture needs more details w.r.t altmap such as base_pfn, end_pfn, etc to unmap vmemmap memory. Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't? IOW, why can't ppc64 simply allocate the vmemmap from the start of the memblock (-> base_pfn) and use the stored number of vmemmap pages to calculate the end_pfn? To rephrase: if the vmemmap is not at the beginning and doesn't cover full pageblocks, memory onlining/offlining would be broken. [...] +/** + * struct vmem_altmap - pre-allocated storage for vmemmap_populate + * @base_pfn: base of the entire dev_pagemap mapping + * @reserve: pages mapped, but reserved for driver use (relative to @base) + * @free: free pages set aside in the mapping for memmap storage + * @align: pages reserved to meet allocation alignments + * @alloc: track pages consumed, private to vmemmap_populate() + */ +struct vmem_altmap { + unsigned long base_pfn; + const unsigned long end_pfn; + const unsigned long reserve; + unsigned long free; + unsigned long align; + unsigned long alloc; +}; Instead of embedding that, what about conditionally allocating it and store a pointer to it in the "struct memory_block"? In the general case as of today, we don't have an altmap. + struct memory_block { unsigned long start_section_nr; unsigned long state; /* serialized by the dev->lock */ @@ -77,11 +94,7 @@ struct memory_block { */ struct zone *zone; struct device dev; - /* -* Number of vmemmap pages. These pages -* lay at the beginning of the memory block.
-*/ - unsigned long nr_vmemmap_pages; + struct vmem_altmap altmap; struct memory_group *group; /* group (if any) for this block */ struct list_head group_next;/* next block inside memory group */ #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_MEMORY_HOTPLUG) @@ -147,7 +160,7 @@ static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri) extern int register_memory_notifier(struct notifier_block *nb); extern void unregister_memory_notifier(struct notifier_block *nb); int create_memory_block_devices(unsigned long start, unsigned long size, [...] static int check_cpu_on_node(int nid) @@ -2036,9 +2042,8 @@ EXPORT_SYMBOL(try_offline_node); static int __ref try_remove_memory(u64 start, u64 size) { - struct vmem_altmap mhp_altmap = {}; + int ret; struct vmem_altmap *altmap = NULL; - unsigned long nr_vmemmap_pages; int rc = 0, nid = NUMA_NO_NODE; BUG_ON(check_hotplug_memory_range(start, size)); @@ -2060,24 +2065,16 @@ static int __ref try_remove_memory(u64 start, u64 size) * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in * the same granularity it was added - a single memory block. */ + ^ unrelated change? -- Cheers, David / dhildenb
Re: [PATCH v2 5/5] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 06.07.23 10:50, Aneesh Kumar K.V wrote: Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. We also use altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For ex: with 256MB memory block size, we require 4 pages to map vmemmap pages, In order to align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 +++ .../platforms/pseries/hotplug-memory.c | 4 ++- 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index a62729f70f2a..c0bd60b5fb64 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1678,3 +1678,31 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) return 1; } + +/* + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fallback correctly. The pageblock x86 also has the fallback IIRC; the concern is rather that you end up with a PTE-mapped vmemmap, which is slower at runtime than a PMD-mapped vmemmap. + * alignment requirement is met using altmap->reserve blocks.
+ */ +bool mhp_supports_memmap_on_memory(unsigned long size) +{ + if (!radix_enabled()) + return false; + /* +* The pageblock alignment requirement is met by using +* reserve blocks in altmap. +*/ + return size == memory_block_size_bytes(); +} I really dislike such arch overrides. I think the flow should be something like that, having a generic: bool mhp_supports_memmap_on_memory(unsigned long size) { ... return arch_mhp_supports_memmap_on_memory(size) && size == memory_block_size_bytes() && ... } where we'll also keep the pageblock check here. And a ppc specific bool arch_mhp_supports_memmap_on_memory(unsigned long size) { /* * Don't check for the vmemmap covering PMD_SIZE, we accept that * we might get a PTE-mapped vmemmap. */ return radix_enabled(); } We can then move the PMD_SIZE check into arch specific code (x86-aarch64). -- Cheers, David / dhildenb
[PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
With memmap on memory, some architectures need more details w.r.t altmap such as base_pfn, end_pfn, etc to unmap vmemmap memory. Embed the vmem_altmap data structure in memory_block and use that instead of nr_vmemmap_pages. On memory unplug, if the kernel finds any memory block in the range to be using vmem_altmap, the kernel fails to unplug the memory if the request is not a single memory block unplug. Signed-off-by: Aneesh Kumar K.V --- drivers/base/memory.c | 28 +++- include/linux/memory.h | 25 +++-- include/linux/memremap.h | 18 +- mm/memory_hotplug.c | 31 ++- 4 files changed, 53 insertions(+), 49 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index b456ac213610..523cc1d37c81 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -106,6 +106,7 @@ static void memory_block_release(struct device *dev) { struct memory_block *mem = to_memory_block(dev); + WARN(mem->altmap.alloc, "Altmap not fully unmapped"); kfree(mem); } @@ -183,7 +184,7 @@ static int memory_block_online(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; - unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + unsigned long nr_vmemmap_pages = 0; struct zone *zone; int ret; @@ -200,6 +201,9 @@ static int memory_block_online(struct memory_block *mem) * stage helps to keep accounting easier to follow - e.g vmemmaps * belong to the same zone as the memory they backed.
*/ + if (mem->altmap.alloc) + nr_vmemmap_pages = mem->altmap.alloc + mem->altmap.reserve; + if (nr_vmemmap_pages) { ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone); if (ret) @@ -230,7 +234,7 @@ static int memory_block_offline(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; - unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + unsigned long nr_vmemmap_pages = 0; int ret; if (!mem->zone) @@ -240,6 +244,9 @@ static int memory_block_offline(struct memory_block *mem) * Unaccount before offlining, such that unpopulated zone and kthreads * can properly be torn down in offline_pages(). */ + if (mem->altmap.alloc) + nr_vmemmap_pages = mem->altmap.alloc + mem->altmap.reserve; + if (nr_vmemmap_pages) adjust_present_page_count(pfn_to_page(start_pfn), mem->group, -nr_vmemmap_pages); @@ -726,7 +733,7 @@ void memory_block_add_nid(struct memory_block *mem, int nid, #endif static int add_memory_block(unsigned long block_id, unsigned long state, - unsigned long nr_vmemmap_pages, + struct vmem_altmap *altmap, struct memory_group *group) { struct memory_block *mem; @@ -744,7 +751,10 @@ static int add_memory_block(unsigned long block_id, unsigned long state, mem->start_section_nr = block_id * sections_per_block; mem->state = state; mem->nid = NUMA_NO_NODE; - mem->nr_vmemmap_pages = nr_vmemmap_pages; + if (altmap) + memcpy(&mem->altmap, altmap, sizeof(*altmap)); + else + mem->altmap.alloc = 0; INIT_LIST_HEAD(&mem->group_next); #ifndef CONFIG_NUMA @@ -783,14 +793,14 @@ static int __init add_boot_memory_block(unsigned long base_section_nr) if (section_count == 0) return 0; return add_memory_block(memory_block_id(base_section_nr), - MEM_ONLINE, 0, NULL); + MEM_ONLINE, NULL, NULL); } static int add_hotplug_memory_block(unsigned long block_id, - unsigned long nr_vmemmap_pages, + struct vmem_altmap *altmap, struct memory_group *group) { - return
add_memory_block(block_id, MEM_OFFLINE, nr_vmemmap_pages, group); + return add_memory_block(block_id, MEM_OFFLINE, altmap, group); } static void remove_memory_block(struct memory_block *memory) @@ -818,7 +828,7 @@ static void remove_memory_block(struct memory_block *memory) * Called under device_hotplug_lock. */ int create_memory_block_devices(unsigned long start, unsigned long size, - unsigned long vmemmap_pages, + struct vmem_altmap *altmap, struct memory_group *group) { const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start)); @@ -832,7 +842,7 @@ int
[PATCH v3 13/13] powerpc/book3s64/radix: Add debug message to give more details of vmemmap allocation
Add some extra vmemmap pr_debug message that will indicate the type of vmemmap allocations. For ex: with DAX vmemmap optimization we can find the below details: [ 187.166580] radix-mmu: PAGE_SIZE vmemmap mapping [ 187.166587] radix-mmu: PAGE_SIZE vmemmap mapping [ 187.166591] radix-mmu: Tail page reuse vmemmap mapping [ 187.166594] radix-mmu: Tail page reuse vmemmap mapping [ 187.166598] radix-mmu: Tail page reuse vmemmap mapping [ 187.166601] radix-mmu: Tail page reuse vmemmap mapping [ 187.166604] radix-mmu: Tail page reuse vmemmap mapping [ 187.166608] radix-mmu: Tail page reuse vmemmap mapping [ 187.166611] radix-mmu: Tail page reuse vmemmap mapping [ 187.166614] radix-mmu: Tail page reuse vmemmap mapping [ 187.166617] radix-mmu: Tail page reuse vmemmap mapping [ 187.166620] radix-mmu: Tail page reuse vmemmap mapping [ 187.166623] radix-mmu: Tail page reuse vmemmap mapping [ 187.166626] radix-mmu: Tail page reuse vmemmap mapping [ 187.166629] radix-mmu: Tail page reuse vmemmap mapping [ 187.166632] radix-mmu: Tail page reuse vmemmap mapping And without vmemmap optimization [ 293.549931] radix-mmu: PMD_SIZE vmemmap mapping [ 293.549984] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550032] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550076] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550117] radix-mmu: PMD_SIZE vmemmap mapping Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index dac56c883099..a62729f70f2a 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1026,6 +1026,7 @@ static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmdp, unsigned long p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL); if (!p) return NULL; + pr_debug("PAGE_SIZE vmemmap mapping\n"); } else { /* * When a PTE/PMD entry is freed from the init_mm @@ -1038,6 +1039,7 @@ static 
pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmdp, unsigned long */ get_page(reuse); p = page_to_virt(reuse); + pr_debug("Tail page reuse vmemmap mapping\n"); } VM_BUG_ON(!PAGE_ALIGNED(addr)); @@ -1147,6 +1149,7 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); if (p) { vmemmap_set_pmd(pmd, p, node, addr, next); + pr_debug("PMD_SIZE vmemmap mapping\n"); continue; } else if (altmap) { /* -- 2.41.0
[PATCH v3 12/13] powerpc/book3s64/radix: Remove mmu_vmemmap_psize
This is not used by radix anymore. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 11 --- arch/powerpc/mm/init_64.c| 21 ++--- 2 files changed, 14 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index c05783466562..dac56c883099 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -601,17 +601,6 @@ void __init radix__early_init_mmu(void) #else mmu_virtual_psize = MMU_PAGE_4K; #endif - -#ifdef CONFIG_SPARSEMEM_VMEMMAP - /* vmemmap mapping */ - if (mmu_psize_defs[MMU_PAGE_2M].shift) { - /* -* map vmemmap using 2M if available -*/ - mmu_vmemmap_psize = MMU_PAGE_2M; - } else - mmu_vmemmap_psize = mmu_virtual_psize; -#endif #endif /* * initialize page table size diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 5701faca39ef..6db7a063ba63 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -198,17 +198,12 @@ bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, return false; } -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, - struct vmem_altmap *altmap) +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node, +struct vmem_altmap *altmap) { bool altmap_alloc; unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; -#ifdef CONFIG_PPC_BOOK3S_64 - if (radix_enabled()) - return radix__vmemmap_populate(start, end, node, altmap); -#endif - /* Align to the page size of the linear mapping. 
*/ start = ALIGN_DOWN(start, page_size); @@ -277,6 +272,18 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, return 0; } +int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, + struct vmem_altmap *altmap) +{ + +#ifdef CONFIG_PPC_BOOK3S_64 + if (radix_enabled()) + return radix__vmemmap_populate(start, end, node, altmap); +#endif + + return __vmemmap_populate(start, end, node, altmap); +} + #ifdef CONFIG_MEMORY_HOTPLUG static unsigned long vmemmap_list_free(unsigned long start) { -- 2.41.0
[PATCH v3 11/13] powerpc/book3s64/radix: Add support for vmemmap optimization for radix
With 2M PMD-level mapping, we require 32 struct pages and a single vmemmap page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence with 64K page size, we don't use vmemmap deduplication for PMD-level mapping. Signed-off-by: Aneesh Kumar K.V --- Documentation/mm/vmemmap_dedup.rst | 1 + Documentation/powerpc/index.rst| 1 + Documentation/powerpc/vmemmap_dedup.rst| 101 ++ arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/radix.h | 9 + arch/powerpc/mm/book3s64/radix_pgtable.c | 203 + 6 files changed, 316 insertions(+) create mode 100644 Documentation/powerpc/vmemmap_dedup.rst diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst index a4b12ff906c4..c573e08b5043 100644 --- a/Documentation/mm/vmemmap_dedup.rst +++ b/Documentation/mm/vmemmap_dedup.rst @@ -210,6 +210,7 @@ the device (altmap). The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64), PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64). +For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst The differences with HugeTLB are relatively minor. diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst index d33b554ca7ba..a50834798454 100644 --- a/Documentation/powerpc/index.rst +++ b/Documentation/powerpc/index.rst @@ -36,6 +36,7 @@ powerpc ultravisor vas-api vcpudispatch_stats +vmemmap_dedup features diff --git a/Documentation/powerpc/vmemmap_dedup.rst b/Documentation/powerpc/vmemmap_dedup.rst new file mode 100644 index ..dc4db59fdf87 --- /dev/null +++ b/Documentation/powerpc/vmemmap_dedup.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +Device DAX +== + +The device-dax interface uses the tail deduplication technique explained in +Documentation/mm/vmemmap_dedup.rst + +On powerpc, vmemmap deduplication is only used with radix MMU translation. Also +with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap +deduplication. 
+ +With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap +page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no +vmemmap deduplication possible. + +With 1G PUD level mapping, we require 16384 struct pages and a single 64K +vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we +require 16 64K pages in vmemmap to map the struct page for 1G PUD level mapping. + +Here's how things look like on device-dax after the sections are populated:: + +---+ ---virt_to_page---> +---+ mapping to +---+ + | | | 0 | -> | 0 | + | | +---++---+ + | | | 1 | -> | 1 | + | | +---++---+ + | | | 2 | ^ ^ ^ ^ ^ ^ + | | +---+ | | | | | + | | | 3 | --+ | | | | + | | +---+ | | | | + | | | 4 | + | | | + |PUD| +---+ | | | + | level | | . | --+ | | + | mapping | +---+ | | + | | | . | + | + | | +---+ | + | | | 15| --+ + | | +---+ + | | + | | + | | + +---+ + + +With 4K page size, 2M PMD level mapping requires 512 struct pages and a single +4K vmemmap page contains 64 struct pages(4K/sizeof(struct page)). Hence we +require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping. + +Here's how things look like on device-dax after the sections are populated:: + + +---+ ---virt_to_page---> +---+ mapping to +---+ + | | | 0 | -> | 0 | + | | +---++---+ + | | | 1 | -> | 1 | + | | +---++---+ + | | | 2 | ^ ^ ^ ^ ^ ^ + | | +---+ | | | | | + | | | 3 | --+ | | | | + | |
[PATCH v3 10/13] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function
This is in preparation to update radix to implement vmemmap optimization for devdax. Below are the rules w.r.t radix vmemmap mapping 1. First try to map things using PMD (2M) 2. With altmap if altmap cross-boundary check returns true, fall back to PAGE_SIZE 3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to PAGE_SIZE On removing vmemmap mapping, check if every subsection that is using the vmemmap area is invalid. If found to be invalid, that implies we can safely free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86 because with 64K page size, we need to do the above check even at the PAGE_SIZE granularity. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/radix.h | 2 + arch/powerpc/include/asm/pgtable.h | 4 + arch/powerpc/mm/book3s64/radix_pgtable.c | 318 +++-- arch/powerpc/mm/init_64.c | 26 +- 4 files changed, 319 insertions(+), 31 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h index 2ef92f36340f..f1461289643a 100644 --- a/arch/powerpc/include/asm/book3s/64/radix.h +++ b/arch/powerpc/include/asm/book3s/64/radix.h @@ -331,6 +331,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned long start, unsigned long phys); int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node, struct vmem_altmap *altmap); +void __ref radix__vmemmap_free(unsigned long start, unsigned long end, + struct vmem_altmap *altmap); extern void radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size); diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 6a88bfdaa69b..68817ea7f994 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -165,6 +165,10 @@ static inline bool is_ioremap_addr(const void *x) return addr >= IOREMAP_BASE && addr < IOREMAP_END; } + +int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size); 
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, + unsigned long page_size); #endif /* CONFIG_PPC64 */ #endif /* __ASSEMBLY__ */ diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 227fea53c217..8a03e1005fd3 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -744,8 +744,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d) p4d_clear(p4d); } +static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long end) +{ + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE); + + return !vmemmap_populated(start, PMD_SIZE); +} + +static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long end) +{ + unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE); + + return !vmemmap_populated(start, PAGE_SIZE); + +} + +static void __meminit free_vmemmap_pages(struct page *page, +struct vmem_altmap *altmap, +int order) +{ + unsigned int nr_pages = 1 << order; + + if (altmap) { + unsigned long alt_start, alt_end; + unsigned long base_pfn = page_to_pfn(page); + + /* +* with 1G vmemmap mapping we can have things set up +* such that even though an altmap is specified we never +* use the altmap. 
+*/ + alt_start = altmap->base_pfn; + alt_end = altmap->base_pfn + altmap->reserve + + altmap->free + altmap->alloc + altmap->align; + + if (base_pfn >= alt_start && base_pfn < alt_end) { + vmem_altmap_free(altmap, nr_pages); + return; + } + } + + if (PageReserved(page)) { + /* allocated from memblock */ + while (nr_pages--) + free_reserved_page(page++); + } else + free_pages((unsigned long)page_address(page), order); +} + static void remove_pte_table(pte_t *pte_start, unsigned long addr, -unsigned long end, bool direct) +unsigned long end, bool direct, +struct vmem_altmap *altmap) { unsigned long next, pages = 0; pte_t *pte; @@ -759,24 +808,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr, if (!pte_present(*pte)) continue; - if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) { - /* -* The vmemmap_free() and
[PATCH v3 09/13] powerpc/book3s64/mm: Enable transparent pud hugepage
This is enabled only with radix translation and 1G hugepage size. This will be used with devdax device memory with a namespace alignment of 1G. Anon transparent hugepage is not supported even though we do have helpers checking pud_trans_huge(). We should never find that return true. The only expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP. Some of the helpers are never expected to get called on hash translation and hence is marked to call BUG() in such a case. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 156 -- arch/powerpc/include/asm/book3s/64/radix.h| 36 .../include/asm/book3s/64/tlbflush-radix.h| 2 + arch/powerpc/include/asm/book3s/64/tlbflush.h | 8 + arch/powerpc/mm/book3s64/pgtable.c| 78 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 arch/powerpc/mm/book3s64/radix_tlb.c | 7 + arch/powerpc/platforms/Kconfig.cputype| 1 + include/trace/events/thp.h| 10 ++ 9 files changed, 315 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 4acc9690f599..9a05de007956 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte) { return __pud_raw(pte_raw(pte)); } + +static inline pte_t *pudp_ptep(pud_t *pud) +{ + return (pte_t *)pud; +} + +#define pud_pfn(pud) pte_pfn(pud_pte(pud)) +#define pud_dirty(pud) pte_dirty(pud_pte(pud)) +#define pud_young(pud) pte_young(pud_pte(pud)) +#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud))) +#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud))) +#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud))) +#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud))) +#define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud))) +#define pud_mkwrite(pud) pte_pud(pte_mkwrite(pud_pte(pud))) #define pud_write(pud) pte_write(pud_pte(pud)) +#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY +#define 
pud_soft_dirty(pmd)pte_soft_dirty(pud_pte(pud)) +#define pud_mksoft_dirty(pmd) pte_pud(pte_mksoft_dirty(pud_pte(pud))) +#define pud_clear_soft_dirty(pmd) pte_pud(pte_clear_soft_dirty(pud_pte(pud))) +#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ + static inline int pud_bad(pud_t pud) { if (radix_enabled()) @@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write) #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); +extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot); extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd); +extern void set_pud_at(struct mm_struct *mm, unsigned long addr, + pud_t *pudp, pud_t pud); + static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd) { } +static inline void update_mmu_cache_pud(struct vm_area_struct *vma, + unsigned long addr, pud_t *pud) +{ +} + extern int hash__has_transparent_hugepage(void); static inline int has_transparent_hugepage(void) { @@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void) } #define has_transparent_hugepage has_transparent_hugepage +static inline int has_transparent_pud_hugepage(void) +{ + if (radix_enabled()) + return radix__has_transparent_pud_hugepage(); + return 0; +} +#define has_transparent_pud_hugepage has_transparent_pud_hugepage + static inline unsigned long pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, unsigned long clr, unsigned long set) @@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set); } +static inline unsigned long +pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp, + unsigned long clr, unsigned long set) +{ + if (radix_enabled()) + return 
radix__pud_hugepage_update(mm, addr, pudp, clr, set); + BUG(); + return pud_val(*pudp); +} + /* * returns true for pmd migration entries, THP, devmap, hugetlb * But compile time dependent on THP config @@ -1151,6 +1199,11 @@ static inline int pmd_large(pmd_t pmd) return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE)); } +static inline int pud_large(pud_t pud) +{ + return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE)); +} + /* * For radix we should always find H_PAGE_HASHPTE zero. Hence * the below will work for radix too @@
[PATCH v3 08/13] powerpc/mm/trace: Convert trace event to trace event class
A follow-up patch will add a pud variant for this same event. Using event class makes that addition simpler. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/hash_pgtable.c | 2 +- arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +- include/trace/events/thp.h | 23 --- 3 files changed, 18 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c index 51f48984abca..988948d69bc1 100644 --- a/arch/powerpc/mm/book3s64/hash_pgtable.c +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c @@ -214,7 +214,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr old = be64_to_cpu(old_be); - trace_hugepage_update(addr, old, clr, set); + trace_hugepage_update_pmd(addr, old, clr, set); if (old & H_PAGE_HASHPTE) hpte_do_hugepage_flush(mm, addr, pmdp, old); return old; diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index e7ea492ac510..02e185d2e4d6 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -962,7 +962,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add #endif old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1); - trace_hugepage_update(addr, old, clr, set); + trace_hugepage_update_pmd(addr, old, clr, set); return old; } diff --git a/include/trace/events/thp.h b/include/trace/events/thp.h index 202b3e3e67ff..a95c78b10561 100644 --- a/include/trace/events/thp.h +++ b/include/trace/events/thp.h @@ -8,25 +8,29 @@ #include #include -TRACE_EVENT(hugepage_set_pmd, +DECLARE_EVENT_CLASS(hugepage_set, - TP_PROTO(unsigned long addr, unsigned long pmd), - TP_ARGS(addr, pmd), + TP_PROTO(unsigned long addr, unsigned long pte), + TP_ARGS(addr, pte), TP_STRUCT__entry( __field(unsigned long, addr) - __field(unsigned long, pmd) + __field(unsigned long, pte) ), TP_fast_assign( __entry->addr = addr; - __entry->pmd = 
pmd; + __entry->pte = pte; ), - TP_printk("Set pmd with 0x%lx with 0x%lx", __entry->addr, __entry->pmd) + TP_printk("Set page table entry with 0x%lx with 0x%lx", __entry->addr, __entry->pte) ); +DEFINE_EVENT(hugepage_set, hugepage_set_pmd, + TP_PROTO(unsigned long addr, unsigned long pmd), + TP_ARGS(addr, pmd) +); -TRACE_EVENT(hugepage_update, +DECLARE_EVENT_CLASS(hugepage_update, TP_PROTO(unsigned long addr, unsigned long pte, unsigned long clr, unsigned long set), TP_ARGS(addr, pte, clr, set), @@ -48,6 +52,11 @@ TRACE_EVENT(hugepage_update, TP_printk("hugepage update at addr 0x%lx and pte = 0x%lx clr = 0x%lx, set = 0x%lx", __entry->addr, __entry->pte, __entry->clr, __entry->set) ); +DEFINE_EVENT(hugepage_update, hugepage_update_pmd, + TP_PROTO(unsigned long addr, unsigned long pmd, unsigned long clr, unsigned long set), + TP_ARGS(addr, pmd, clr, set) +); + DECLARE_EVENT_CLASS(migration_pmd, TP_PROTO(unsigned long addr, unsigned long pmd), -- 2.41.0
[PATCH v3 07/13] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap optimization includes an update of both the permissions (writeable to read-only) and the output address (pfn) of the vmemmap ptes. That is not supported without unmapping of pte(marking it invalid) by some architectures. With DAX vmemmap optimization we don't require such pte updates and architectures can enable DAX vmemmap optimization while having hugetlb vmemmap optimization disabled. Hence split DAX optimization support into a different config. s390, loongarch and riscv don't have devdax support. So the DAX config is not enabled for them. With this change, arm64 should be able to select DAX optimization [1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP") Signed-off-by: Aneesh Kumar K.V --- arch/loongarch/Kconfig | 2 +- arch/riscv/Kconfig | 2 +- arch/s390/Kconfig | 2 +- arch/x86/Kconfig | 3 ++- fs/Kconfig | 2 +- include/linux/mm.h | 2 +- mm/Kconfig | 5 - 7 files changed, 11 insertions(+), 7 deletions(-) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index e55511af4c77..537ca2a4005a 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -59,7 +59,7 @@ config LOONGARCH select ARCH_USE_QUEUED_SPINLOCKS select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT select ARCH_WANT_LD_ORPHAN_WARN - select ARCH_WANT_OPTIMIZE_VMEMMAP + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP select ARCH_WANTS_NO_INSTR select BUILDTIME_TABLE_SORT select COMMON_CLK diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index b49793cf34eb..fd9b683c4263 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -53,7 +53,7 @@ config RISCV select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT select ARCH_WANT_HUGE_PMD_SHARE if 64BIT select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL - select ARCH_WANT_OPTIMIZE_VMEMMAP + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU select 
BUILDTIME_TABLE_SORT if MMU diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5b39918b7042..975fd06e4f4d 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -127,7 +127,7 @@ config S390 select ARCH_WANTS_NO_INSTR select ARCH_WANT_DEFAULT_BPF_JIT select ARCH_WANT_IPC_PARSE_VERSION - select ARCH_WANT_OPTIMIZE_VMEMMAP + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP select BUILDTIME_TABLE_SORT select CLONE_BACKWARDS2 select DMA_OPS if PCI diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7422db409770..78224aa76409 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -128,7 +128,8 @@ config X86 select ARCH_WANT_GENERAL_HUGETLB select ARCH_WANT_HUGE_PMD_SHARE select ARCH_WANT_LD_ORPHAN_WARN - select ARCH_WANT_OPTIMIZE_VMEMMAP if X86_64 + select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if X86_64 + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64 select ARCH_WANTS_THP_SWAP if X86_64 select ARCH_HAS_PARANOID_L1D_FLUSH select BUILDTIME_TABLE_SORT diff --git a/fs/Kconfig b/fs/Kconfig index 18d034ec7953..9c104c130a6e 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -252,7 +252,7 @@ config HUGETLB_PAGE config HUGETLB_PAGE_OPTIMIZE_VMEMMAP def_bool HUGETLB_PAGE - depends on ARCH_WANT_OPTIMIZE_VMEMMAP + depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP depends on SPARSEMEM_VMEMMAP config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON diff --git a/include/linux/mm.h b/include/linux/mm.h index 1a2234ee14d2..83f51ec0897d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3640,7 +3640,7 @@ void vmemmap_free(unsigned long start, unsigned long end, #endif #define VMEMMAP_RESERVE_NR 2 -#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap) { diff --git a/mm/Kconfig b/mm/Kconfig index 09130434e30d..923bd35f81f2 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -487,7 +487,10 @@ config SPARSEMEM_VMEMMAP # Select this config option from the architecture Kconfig, 
if it is preferred # to enable the feature of HugeTLB/dev_dax vmemmap optimization. # -config ARCH_WANT_OPTIMIZE_VMEMMAP +config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP + bool + +config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP bool config HAVE_MEMBLOCK_PHYS_MAP -- 2.41.0
[PATCH v3 06/13] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE
pudp_set_wrprotect and move_huge_pud helpers are only used when CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to the pmdp_set_wrprotect and move_huge_pmd helpers, use the architecture override only if CONFIG_TRANSPARENT_HUGEPAGE is set. Signed-off-by: Aneesh Kumar K.V --- include/linux/pgtable.h | 2 ++ mm/mremap.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 91def34f7784..b5af3e014606 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -558,6 +558,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, #endif #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +#ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline void pudp_set_wrprotect(struct mm_struct *mm, unsigned long address, pud_t *pudp) { @@ -571,6 +572,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm, { BUILD_BUG(); } +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ #endif diff --git a/mm/mremap.c b/mm/mremap.c index 11e06e4ab33b..056478c106ee 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -349,7 +349,7 @@ static inline bool move_normal_pud(struct vm_area_struct *vma, } #endif -#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr, unsigned long new_addr, pud_t *old_pud, pud_t *new_pud) { -- 2.41.0
[PATCH v3 05/13] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
This helps architectures to override pmd_same and pud_same independently. Signed-off-by: Aneesh Kumar K.V --- include/linux/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 6fd9b2831338..91def34f7784 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -693,7 +693,9 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) { return pmd_val(pmd_a) == pmd_val(pmd_b); } +#endif +#ifndef __HAVE_ARCH_PUD_SAME static inline int pud_same(pud_t pud_a, pud_t pud_b) { return pud_val(pud_a) == pud_val(pud_b); -- 2.41.0
[PATCH v3 04/13] mm/vmemmap: Allow architectures to override how vmemmap optimization works
Architectures like powerpc would like to use different page table allocators and mapping mechanisms to implement vmemmap optimization. Similar to vmemmap_populate, allow architectures to implement vmemmap_populate_compound_pages. Signed-off-by: Aneesh Kumar K.V --- mm/sparse-vmemmap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index a044a130405b..541b3f69a481 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -141,6 +141,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node, start, end - 1); } +#ifndef vmemmap_populate_compound_pages pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node, struct vmem_altmap *altmap, struct page *reuse) @@ -446,6 +447,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn, return 0; } +#endif + struct page * __meminit __populate_section_memmap(unsigned long pfn, unsigned long nr_pages, int nid, struct vmem_altmap *altmap, struct dev_pagemap *pgmap) -- 2.41.0
[PATCH v3 03/13] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override
dax vmemmap optimization requires a minimum of 2 PAGE_SIZE area within vmemmap such that tail page mapping can point to the second PAGE_SIZE area. Enforce that in vmemmap_can_optimize() function. Architectures like powerpc also want to enable vmemmap optimization conditionally (only with radix MMU translation). Hence allow architecture override. Signed-off-by: Aneesh Kumar K.V --- include/linux/mm.h | 27 +++ mm/mm_init.c | 2 +- 2 files changed, 24 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 2dd73e4f3d8e..1a2234ee14d2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3639,13 +3639,32 @@ void vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap); #endif +#define VMEMMAP_RESERVE_NR 2 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP -static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap, - struct dev_pagemap *pgmap) +static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap, + struct dev_pagemap *pgmap) { - return is_power_of_2(sizeof(struct page)) && - pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap; + unsigned long nr_pages; + unsigned long nr_vmemmap_pages; + + if (!pgmap || !is_power_of_2(sizeof(struct page))) + return false; + + nr_pages = pgmap_vmemmap_nr(pgmap); + nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT); + /* +* For vmemmap optimization with DAX we need minimum 2 vmemmap +* pages. 
See layout diagram in Documentation/mm/vmemmap_dedup.rst +*/ + return !altmap && (nr_vmemmap_pages > VMEMMAP_RESERVE_NR); } +/* + * If we don't have an architecture override, use the generic rule + */ +#ifndef vmemmap_can_optimize +#define vmemmap_can_optimize __vmemmap_can_optimize +#endif + #else static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap) diff --git a/mm/mm_init.c b/mm/mm_init.c index a1963c3322af..245ac69b66a5 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1020,7 +1020,7 @@ static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap, if (!vmemmap_can_optimize(altmap, pgmap)) return pgmap_vmemmap_nr(pgmap); - return 2 * (PAGE_SIZE / sizeof(struct page)); + return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page)); } static void __ref memmap_init_compound(struct page *head, -- 2.41.0
[PATCH v3 01/13] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support
Architectures like powerpc would like to enable transparent huge page pud support only with radix translation. To support that add has_transparent_pud_hugepage() helper that architectures can override. Signed-off-by: Aneesh Kumar K.V --- drivers/nvdimm/pfn_devs.c | 2 +- include/linux/pgtable.h | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index af7d9301520c..18ad315581ca 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -100,7 +100,7 @@ static unsigned long *nd_pfn_supported_alignments(unsigned long *alignments) if (has_transparent_hugepage()) { alignments[1] = HPAGE_PMD_SIZE; - if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)) + if (has_transparent_pud_hugepage()) alignments[2] = HPAGE_PUD_SIZE; } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 5063b482e34f..cf13f8d938a8 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1499,6 +1499,9 @@ typedef unsigned int pgtbl_mod_mask; #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE) #endif +#ifndef has_transparent_pud_hugepage +#define has_transparent_pud_hugepage() IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) +#endif /* * On some architectures it depends on the mm if the p4d/pud or pmd * layer of the page table hierarchy is folded or not. -- 2.41.0
[PATCH v3 02/13] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
We will use this in a later patch to do tlb flush when clearing pud entries on powerpc. This is similar to commit 93a98695f2f9 ("mm: change pmdp_huge_get_and_clear_full take vm_area_struct as arg") Signed-off-by: Aneesh Kumar K.V --- include/linux/pgtable.h | 4 ++-- mm/debug_vm_pgtable.c | 2 +- mm/huge_memory.c| 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index cf13f8d938a8..6fd9b2831338 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -450,11 +450,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL -static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm, +static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma, unsigned long address, pud_t *pudp, int full) { - return pudp_huge_get_and_clear(mm, address, pudp); + return pudp_huge_get_and_clear(vma->vm_mm, address, pudp); } #endif #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index ee119e33fef1..ee2c4c1dcfc8 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -385,7 +385,7 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) WARN_ON(!(pud_write(pud) && pud_dirty(pud))); #ifndef __PAGETABLE_PMD_FOLDED - pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1); + pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1); pud = READ_ONCE(*args->pudp); WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index eb3678360b97..ba20cef681a4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1981,7 +1981,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, if (!ptl) return 0; - pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); + pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm); tlb_remove_pud_tlb_entry(tlb, pud, 
addr); if (vma_is_special_huge(vma)) { spin_unlock(ptl); -- 2.41.0
[PATCH v3 00/13] Add support for DAX vmemmap optimization for ppc64
This patch series implements changes required to support DAX vmemmap optimization for ppc64. The vmemmap optimization is only enabled with radix MMU translation and 1GB PUD mapping with 64K page size. The patch series also split hugetlb vmemmap optimization as a separate Kconfig variable so that architectures can enable DAX vmemmap optimization without enabling hugetlb vmemmap optimization. This should enable architectures like arm64 to enable DAX vmemmap optimization while they can't enable hugetlb vmemmap optimization. More details of the same are in patch "mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization" Changes from v1: * Rebase to latest linus tree * Address review feedback Changes from V1: * Fix make htmldocs warning * Fix vmemmap allocation bugs with different alignment values. * Correctly check for section validity to before we free vmemmap area Aneesh Kumar K.V (13): mm/hugepage pud: Allow arch-specific helper function to check huge page pud support mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override mm/vmemmap: Allow architectures to override how vmemmap optimization works mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization powerpc/mm/trace: Convert trace event to trace event class powerpc/book3s64/mm: Enable transparent pud hugepage powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function powerpc/book3s64/radix: Add support for vmemmap optimization for radix powerpc/book3s64/radix: Remove mmu_vmemmap_psize powerpc/book3s64/radix: Add debug message to give more details of vmemmap allocation Documentation/mm/vmemmap_dedup.rst| 1 + Documentation/powerpc/index.rst | 1 + Documentation/powerpc/vmemmap_dedup.rst | 101 arch/loongarch/Kconfig| 2 +- 
arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/pgtable.h | 156 - arch/powerpc/include/asm/book3s/64/radix.h| 47 ++ .../include/asm/book3s/64/tlbflush-radix.h| 2 + arch/powerpc/include/asm/book3s/64/tlbflush.h | 8 + arch/powerpc/include/asm/pgtable.h| 4 + arch/powerpc/mm/book3s64/hash_pgtable.c | 2 +- arch/powerpc/mm/book3s64/pgtable.c| 78 +++ arch/powerpc/mm/book3s64/radix_pgtable.c | 565 -- arch/powerpc/mm/book3s64/radix_tlb.c | 7 + arch/powerpc/mm/init_64.c | 37 +- arch/powerpc/platforms/Kconfig.cputype| 1 + arch/riscv/Kconfig| 2 +- arch/s390/Kconfig | 2 +- arch/x86/Kconfig | 3 +- drivers/nvdimm/pfn_devs.c | 2 +- fs/Kconfig| 2 +- include/linux/mm.h| 29 +- include/linux/pgtable.h | 11 +- include/trace/events/thp.h| 33 +- mm/Kconfig| 5 +- mm/debug_vm_pgtable.c | 2 +- mm/huge_memory.c | 2 +- mm/mm_init.c | 2 +- mm/mremap.c | 2 +- mm/sparse-vmemmap.c | 3 + 30 files changed, 1031 insertions(+), 82 deletions(-) create mode 100644 Documentation/powerpc/vmemmap_dedup.rst -- 2.41.0
Re: [PATCH v2 4/5] mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
On 06.07.23 10:50, Aneesh Kumar K.V wrote: Instead of adding menu entry with all supported architectures, add mm/Kconfig variable and select the same from supported architectures. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/arm64/Kconfig | 4 +--- arch/x86/Kconfig | 4 +--- mm/Kconfig | 3 +++ 3 files changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7856c3a3e35a..7e5985c018f8 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -78,6 +78,7 @@ config ARM64 select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_GNU_PROPERTY select ARCH_USE_MEMTEST @@ -346,9 +347,6 @@ config GENERIC_CSUM config GENERIC_CALIBRATE_DELAY def_bool y -config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE - def_bool y - config SMP def_bool y diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 78224aa76409..d0258e92a8af 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -102,6 +102,7 @@ config X86 select ARCH_HAS_DEBUG_WX select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO @@ -2610,9 +2611,6 @@ config ARCH_HAS_ADD_PAGES def_bool y depends on ARCH_ENABLE_MEMORY_HOTPLUG -config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE - def_bool y - menu "Power management and ACPI options" config ARCH_HIBERNATION_HEADER diff --git a/mm/Kconfig b/mm/Kconfig index 923bd35f81f2..2f9d28fee75d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -570,6 +570,9 @@ config MHP_MEMMAP_ON_MEMORY depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + bool + endif # MEMORY_HOTPLUG # Heavily threaded applications may benefit from splitting the mm-wide 
Acked-by: David Hildenbrand -- Cheers, David / dhildenb
[PATCH v2 5/5] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. We also use the altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For example, with a 256MB memory block size, we require 4 pages to map vmemmap pages. In order to align things correctly we end up adding a reserve of 28 pages. That is, for every 4096 pages, 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 +++ .../platforms/pseries/hotplug-memory.c| 4 ++- 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index a62729f70f2a..c0bd60b5fb64 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1678,3 +1678,31 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) return 1; } + +/* + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * of some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fall back correctly. The pageblock + * alignment requirement is met using altmap->reserve blocks. + */ +bool mhp_supports_memmap_on_memory(unsigned long size) +{ + if (!radix_enabled()) + return false; + /* +* The pageblock alignment requirement is met by using +* reserve blocks in altmap. 
+*/ + return size == memory_block_size_bytes(); +} + +unsigned long memory_block_align_base(struct resource *res) +{ + unsigned long base_pfn = PHYS_PFN(res->start); + unsigned long align, size = resource_size(res); + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = (nr_vmemmap_pages * sizeof(struct page)) >> PAGE_SHIFT; + + align = pageblock_align(base_pfn + vmemmap_size) - (base_pfn + vmemmap_size); + return align; +} diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 9c62c2c3b3d0..326db26d773e 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -617,6 +617,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) static int dlpar_add_lmb(struct drmem_lmb *lmb) { + mhp_t mhp_flags = MHP_NONE; unsigned long block_sz; int nid, rc; @@ -637,7 +638,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) nid = first_online_node; /* Add the memory */ - rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE); + mhp_flags |= get_memmap_on_memory_flags(); + rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags); if (rc) { invalidate_lmb_associativity_index(lmb); return rc; -- 2.41.0
[PATCH v2 4/5] mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
Instead of adding a menu entry listing all supported architectures, add an mm/Kconfig variable and select it from the supported architectures. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/arm64/Kconfig | 4 +--- arch/x86/Kconfig | 4 +--- mm/Kconfig | 3 +++ 3 files changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7856c3a3e35a..7e5985c018f8 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -78,6 +78,7 @@ config ARM64 select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_GNU_PROPERTY select ARCH_USE_MEMTEST @@ -346,9 +347,6 @@ config GENERIC_CSUM config GENERIC_CALIBRATE_DELAY def_bool y -config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE - def_bool y - config SMP def_bool y diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 78224aa76409..d0258e92a8af 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -102,6 +102,7 @@ config X86 select ARCH_HAS_DEBUG_WX select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO @@ -2610,9 +2611,6 @@ config ARCH_HAS_ADD_PAGES def_bool y depends on ARCH_ENABLE_MEMORY_HOTPLUG -config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE - def_bool y - menu "Power management and ACPI options" config ARCH_HIBERNATION_HEADER diff --git a/mm/Kconfig b/mm/Kconfig index 923bd35f81f2..2f9d28fee75d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -570,6 +570,9 @@ config MHP_MEMMAP_ON_MEMORY depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + bool + endif # MEMORY_HOTPLUG # Heavily threaded applications may benefit from splitting the mm-wide -- 2.41.0
[PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
Instead of checking for memmap on memory feature enablement within the functions checking for alignment, use the kernel parameter to control the memory hotplug flags. The generic kernel now enables the memmap on memory feature if the hotplug flags request it. The ACPI code can now pass the flag unconditionally because the kernel will fall back to not using the feature if the alignment rules are not met. Signed-off-by: Aneesh Kumar K.V --- drivers/acpi/acpi_memhotplug.c | 3 +-- include/linux/memory_hotplug.h | 14 ++ mm/memory_hotplug.c| 35 +++--- 3 files changed, 26 insertions(+), 26 deletions(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index 24f662d8bd39..4d0096fc4cc2 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (!info->length) continue; - if (mhp_supports_memmap_on_memory(info->length)) - mhp_flags |= MHP_MEMMAP_ON_MEMORY; + mhp_flags |= get_memmap_on_memory_flags(); result = __add_memory(mgid, info->start_addr, info->length, mhp_flags); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index a769f44b8368..af7017122506 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size); bool __mhp_supports_memmap_on_memory(unsigned long size); #endif /* CONFIG_MEMORY_HOTPLUG */ +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY +extern bool memmap_on_memory; +static inline unsigned long get_memmap_on_memory_flags(void) +{ + if (memmap_on_memory) + return MHP_MEMMAP_ON_MEMORY; + return 0; +} +#else +static inline unsigned long get_memmap_on_memory_flags(void) +{ + return 0; +} +#endif #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 423f96dd5481..a1542b8f64e6 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -45,19 +45,9 @@ /* * 
memory_hotplug.memmap_on_memory parameter */ -static bool memmap_on_memory __ro_after_init; +bool memmap_on_memory __ro_after_init; module_param(memmap_on_memory, bool, 0444); MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug"); - -static inline bool mhp_memmap_on_memory(void) -{ - return memmap_on_memory; -} -#else -static inline bool mhp_memmap_on_memory(void) -{ - return false; -} #endif enum { @@ -1280,10 +1270,9 @@ bool __mhp_supports_memmap_on_memory(unsigned long size) * altmap as an alternative source of memory, and we do not exactly * populate a single PMD. */ - return mhp_memmap_on_memory() && - size == memory_block_size_bytes() && - IS_ALIGNED(vmemmap_size, PMD_SIZE) && - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); + return size == memory_block_size_bytes() && + IS_ALIGNED(vmemmap_size, PMD_SIZE) && + IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); } bool __weak mhp_supports_memmap_on_memory(unsigned long size) @@ -2082,15 +2071,13 @@ static int __ref try_remove_memory(u64 start, u64 size) * the same granularity it was added - a single memory block. */ - if (mhp_memmap_on_memory()) { - ret = walk_memory_blocks(start, size, , get_vmemmap_altmap_cb); - if (ret) { - if (size != memory_block_size_bytes()) { - pr_warn("Refuse to remove %#llx - %#llx," - "wrong granularity\n", - start, start + size); - return -EINVAL; - } + ret = walk_memory_blocks(start, size, , get_vmemmap_altmap_cb); + if (ret) { + if (size != memory_block_size_bytes()) { + pr_warn("Refuse to remove %#llx - %#llx," + "wrong granularity\n", + start, start + size); + return -EINVAL; } } -- 2.41.0
[PATCH v2 2/5] mm/hotplug: Allow architecture override for memmap on memory feature
Some architectures like ppc64 want to enable this feature only with radix translation, and their vmemmap mappings have different alignment requirements. Add an override for mhp_supports_memmap_on_memory() and also use the altmap.reserve feature to adjust the pageblock alignment requirement. The patch also falls back to allocating the memmap outside the memory block if the alignment rules are not met for memmap on memory allocation. This allows the feature to be used more widely, allocating as much of the memmap as possible within the memory block being added. A follow-up patch to enable memmap on memory for ppc64 will use this. Signed-off-by: Aneesh Kumar K.V --- arch/arm64/mm/mmu.c| 5 + arch/x86/mm/init_64.c | 6 ++ include/linux/memory_hotplug.h | 3 ++- mm/memory_hotplug.c| 36 -- 4 files changed, 39 insertions(+), 11 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 95d360805f8a..0ff2ec634a58 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1340,6 +1340,11 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size); } +bool mhp_supports_memmap_on_memory(unsigned long size) +{ + return __mhp_supports_memmap_on_memory(size); +} + /* * This memory hotplug notifier helps prevent boot memory from being * inadvertently removed as it blocks pfn range offlining process in diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index a190aae8ceaf..b318d26a70d4 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1264,6 +1264,12 @@ void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) __remove_pages(start_pfn, nr_pages, altmap); kernel_physical_mapping_remove(start, start + size); } + +bool mhp_supports_memmap_on_memory(unsigned long size) +{ + return __mhp_supports_memmap_on_memory(size); +} + #endif /* CONFIG_MEMORY_HOTPLUG */ static struct kcore_list kcore_vsyscall; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h 
index 013c69753c91..a769f44b8368 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -354,7 +354,8 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, extern int arch_create_linear_mapping(int nid, u64 start, u64 size, struct mhp_params *params); void arch_remove_linear_mapping(u64 start, u64 size); -extern bool mhp_supports_memmap_on_memory(unsigned long size); +bool mhp_supports_memmap_on_memory(unsigned long size); +bool __mhp_supports_memmap_on_memory(unsigned long size); #endif /* CONFIG_MEMORY_HOTPLUG */ #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index c4bac38cc147..423f96dd5481 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1247,7 +1247,8 @@ static int online_memory_block(struct memory_block *mem, void *arg) return device_online(>dev); } -bool mhp_supports_memmap_on_memory(unsigned long size) +/* Helper function for architecture to use. */ +bool __mhp_supports_memmap_on_memory(unsigned long size) { unsigned long nr_vmemmap_pages = size / PAGE_SIZE; unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); @@ -1285,6 +1286,20 @@ bool mhp_supports_memmap_on_memory(unsigned long size) IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); } +bool __weak mhp_supports_memmap_on_memory(unsigned long size) +{ + return false; +} + +/* + * Architectures may want to override the altmap reserve details based + * on the alignment requirement for vmemmap mapping. + */ +unsigned __weak long memory_block_align_base(struct resource *res) +{ + return 0; +} + /* * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations (triggered e.g. by sysfs). 
@@ -1295,7 +1310,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) { struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; enum memblock_flags memblock_flags = MEMBLOCK_NONE; - struct vmem_altmap mhp_altmap = {}; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(res->start), + .end_pfn = PHYS_PFN(res->end), + .reserve = memory_block_align_base(res), + }; struct memory_group *group = NULL; u64 start, size; bool new_node = false; @@ -1339,13 +1358,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { - if (!mhp_supports_memmap_on_memory(size)) { - ret = -EINVAL; - goto error; + if (mhp_supports_memmap_on_memory(size)) { +
[PATCH v2 0/5] Add support for memmap on memory feature on ppc64
This patch series updates the memmap on memory feature to fall back to memmap allocation outside the memory block if the alignment rules are not met. This makes the feature more useful on architectures like ppc64 where the alignment rules are different with a 64K page size. This patch series is dependent on the dax vmemmap optimization series posted here https://lore.kernel.org/linux-mm/20230616110826.344417-1-aneesh.ku...@linux.ibm.com Changes from v1: * update the memblock to store vmemmap_altmap details. This is required so that when we remove the memory we can find the altmap details, which are needed on some architectures. * rebase to latest linus tree Aneesh Kumar K.V (5): mm/hotplug: Embed vmem_altmap details in memory block mm/hotplug: Allow architecture override for memmap on memory feature mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig powerpc/book3s64/memhotplug: Enable memmap on memory for radix arch/arm64/Kconfig| 4 +- arch/arm64/mm/mmu.c | 5 + arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 ++ .../platforms/pseries/hotplug-memory.c| 4 +- arch/x86/Kconfig | 4 +- arch/x86/mm/init_64.c | 6 ++ drivers/acpi/acpi_memhotplug.c| 3 +- drivers/base/memory.c | 28 -- include/linux/memory.h| 25 +++-- include/linux/memory_hotplug.h| 17 +++- include/linux/memremap.h | 18 +--- mm/Kconfig| 3 + mm/memory_hotplug.c | 94 +-- 14 files changed, 151 insertions(+), 89 deletions(-) -- 2.41.0
Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu
On Thu, Jul 6, 2023 at 3:09 PM Christophe Leroy wrote: > > > > Le 21/11/2022 à 04:51, Zhouyi Zhou a écrit : > > During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to > > offline tick_do_timer_cpu, the operation will fail because in > > function tick_nohz_cpu_down: > > ``` > > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) > >return -EBUSY; > > ``` > > Above bug was first discovered in torture tests performed in PPC VM > > of Open Source Lab of Oregon State University, and reproducable in RISC-V > > and X86-64 (with additional kernel commandline cpu0_hotplug). > > > > In this patch, we avoid offline tick_do_timer_cpu by distribute > > the offlining cpu among remaining cpus. > > > > Signed-off-by: Zhouyi Zhou > > --- > > include/linux/tick.h| 1 + > > kernel/time/tick-common.c | 1 + > > kernel/time/tick-internal.h | 1 - > > kernel/torture.c| 10 ++ > > 4 files changed, 12 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/tick.h b/include/linux/tick.h > > index bfd571f18cfd..23cc0b205853 100644 > > --- a/include/linux/tick.h > > +++ b/include/linux/tick.h > > @@ -14,6 +14,7 @@ > > #include > > > > #ifdef CONFIG_GENERIC_CLOCKEVENTS > > +extern int tick_do_timer_cpu __read_mostly; > > extern void __init tick_init(void); > > /* Should be core only, but ARM BL switcher requires it */ > > extern void tick_suspend_local(void); > > diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c > > index 46789356f856..87b9b9afa320 100644 > > --- a/kernel/time/tick-common.c > > +++ b/kernel/time/tick-common.c > > @@ -48,6 +48,7 @@ ktime_t tick_next_period; > >*procedure also covers cpu hotplug. 
> >*/ > > int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT; > > +EXPORT_SYMBOL_GPL(tick_do_timer_cpu); > > #ifdef CONFIG_NO_HZ_FULL > > /* > >* tick_do_timer_boot_cpu indicates the boot CPU temporarily owns > > diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h > > index 649f2b48e8f0..8953dca10fdd 100644 > > --- a/kernel/time/tick-internal.h > > +++ b/kernel/time/tick-internal.h > > @@ -15,7 +15,6 @@ > > > > DECLARE_PER_CPU(struct tick_device, tick_cpu_device); > > extern ktime_t tick_next_period; > > -extern int tick_do_timer_cpu __read_mostly; > > > > extern void tick_setup_periodic(struct clock_event_device *dev, int > > broadcast); > > extern void tick_handle_periodic(struct clock_event_device *dev); > > diff --git a/kernel/torture.c b/kernel/torture.c > > index 789aeb0e1159..bccbdd33dda2 100644 > > --- a/kernel/torture.c > > +++ b/kernel/torture.c > > @@ -33,6 +33,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -358,7 +359,16 @@ torture_onoff(void *arg) > > schedule_timeout_interruptible(HZ / 10); > > continue; > > } > > +#ifdef CONFIG_NO_HZ_FULL > > + /* do not offline tick do timer cpu */ > > + if (tick_nohz_full_running) { > > Can you use fonction tick_nohz_full_enabled() instead and avoid the #ifdef ? Thank Christophe for your wonderful advice, I will follow your advice next time I prepare a patch. For this false positive report, 58d766824264 ("tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem") has beaten me in mainline. Thanks again Zhouyi > > > + cpu = (torture_random() >> 4) % maxcpu; > > + if (cpu >= tick_do_timer_cpu) > > + cpu = (cpu + 1) % (maxcpu + 1); > > + } else > > +#else > > cpu = (torture_random() >> 4) % (maxcpu + 1); > > +#endif > > if (!torture_offline(cpu, > >_offline_attempts, > > _offline_successes, > >_offline, _offline, > > _offline))
Re: [PATCH] arch/powerpc: Remove unnecessary endian conversion code in XICS
On 30/6/23 3:56 pm, Gautam Menghani wrote: Remove an unnecessary piece of code that does an endianness conversion but does not use the result. The following warning was reported by Clang's static analyzer: arch/powerpc/sysdev/xics/ics-opal.c:114:2: warning: Value stored to 'server' is never read [deadcode.DeadStores] server = be16_to_cpu(oserver); As the result of endianness conversion is never used, delete the line and fix the warning. Signed-off-by: Gautam Menghani 'server' was used as a parameter to opal_get_xive() in commit 5c7c1e9444d8 ("powerpc/powernv: Add OPAL ICS backend") when it was introduced. 'server' was also used in an error message for the call to opal_get_xive(). 'server' was always later set by a call to ics_opal_mangle_server() before being used. Commit bf8e0f891a32 ("powerpc/powernv: Fix endian issues in OPAL ICS backend") used a new variable 'oserver' as the parameter to opal_get_xive() instead of 'server' for endian correctness. It also removed 'server' from the error message for the call to opal_get_xive(). It was commit bf8e0f891a32 that added the unnecessary conversion and never used the result. Reviewed-by: Jordan Niethe --- arch/powerpc/sysdev/xics/ics-opal.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/sysdev/xics/ics-opal.c b/arch/powerpc/sysdev/xics/ics-opal.c index 6cfbb4fac7fb..5fe73dabab79 100644 --- a/arch/powerpc/sysdev/xics/ics-opal.c +++ b/arch/powerpc/sysdev/xics/ics-opal.c @@ -111,7 +111,6 @@ static int ics_opal_set_affinity(struct irq_data *d, __func__, d->irq, hw_irq, rc); return -1; } - server = be16_to_cpu(oserver); wanted_server = xics_get_irq_server(d->irq, cpumask, 1); if (wanted_server < 0) {
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
Hi Benjamin, On Thu, Jul 6, 2023 at 8:14 AM Benjamin Gray wrote: > On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote: > > Le 05/07/2023 à 02:30, Benjamin Gray a écrit : > > > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on > > > struct regmap_config, which is guarded behind CONFIG_REGMAP. > > > > > > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select > > > REGMAP") > > > exposed this by disabling the default pick unless KUNIT_ALL_TESTS > > > is > > > set, causing the ppc64be allnoconfig build to fail. > > > > > > Signed-off-by: Benjamin Gray > > > --- > > > drivers/rtc/Kconfig | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig > > > index ffca9a8bb878..7455ebd189fe 100644 > > > --- a/drivers/rtc/Kconfig > > > +++ b/drivers/rtc/Kconfig > > > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722 > > > > > > config RTC_DRV_DS1307 > > > tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, > > > EPSON RX-8025, ISL12057" > > > + select REGMAP > > > > As far as I can see, REGMAP defaults to Y when REGMAP_I2C is > > selected. > > Can you explain more in details why you have to select it explicitely > > ? > > If there is something wrong with the logic, then the logic should be > > fixed instead of just adding a selection of REGMAP for that > > particular > > RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or > > RTC_DRV_ABEOZ9 > > might have the exact same problem. > > Right, yeah, I don't want to assert this patch is the correct solution, > sending it was more to offer a fix and allow discussion if it should be > resolved some other way (so thanks for replying, I appreciate it). > > In terms of why I made this patch, the way I see it, if a config option > requires another config option be set, then "selects" is the natural > way of phrasing this dependency. "default" on the REGMAP side seems > weird. If it's a default, does that mean it can be overridden? 
But > RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The > build breaks without it. > > But maybe KConfig works differently to my assumptions. Maybe the > referenced patch that causes the build failure is actually incorrect > (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not > supposed to depend on REGMAP like KUnit does? Thanks for CCing me! Looks like I made a really silly mistake here: my patch not only allows the user to enable REGMAP manually (for the test), but also to disable it manually, regardless if there are users or not :-( I think the proper fix is to replace the "default y if ..." by "select REGMAP" for all users. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
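The change Geert proposes swaps a soft default for a hard dependency. Roughly, as an illustrative (abridged, not the final patch) Kconfig fragment:

```kconfig
# Before: REGMAP merely *defaults* on when a bus helper is selected, so
# a user or randconfig can still end up with REGMAP off while a driver
# that needs it is on, breaking the build:
config REGMAP
	default y if (REGMAP_I2C || REGMAP_SPI)
	bool

# After: each user states the hard dependency directly, so Kconfig
# always enables REGMAP whenever any such driver is enabled:
config RTC_DRV_DS1307
	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, ..."
	select REGMAP
	select REGMAP_I2C
```

The difference is that `default` is only a suggestion the user can override, while `select` forces the symbol on unconditionally, which matches a compile-time requirement.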
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote: > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang wrote: > > > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > > > From: David Stevens > > > > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > > > memory into the guest that is backed by un-refcounted struct pages - for > > > example, higher order non-compound pages allocated by the amdgpu driver > > > via ttm_pool_alloc_page. > > > > I guess you mean the tail pages of the higher order non-compound pages? > > And as to the head page, it is said to be set to one coincidentally[*], > > and shall not be considered as refcounted. IIUC, refcount of this head > > page will be increased and decreased soon in hva_to_pfn_remapped(), so > > this may not be a problem(?). But treating this head page differently, > > as a refcounted one(e.g., to set the A/D flags), is weired. > > > > Or maybe I missed some context, e.g., can the head page be allocted to > > guest at all? > > Yes, this is to allow mapping the tail pages of higher order > non-compound pages - I should have been more precise in my wording. > The head pages can already be mapped into the guest. > > Treating the head and tail pages would require changing how KVM > behaves in a situation it supports today (rather than just adding > support for an unsupported situation). Currently, without this series, > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the > guest. When that happens, KVM sets the A/D flags. I'm not sure whether > that's actually valid behavior, nor do I know whether anyone actually > cares about it. But it's what KVM does today, and I would shy away > from modifying that behavior without good reason. I know the A/D status of the refcounted, VM_PFNMAP|VM_IO backed pages will be recorded. And I have no idea if this is a necessary requirement either. 
But it feels awkward to see the head and the tail ones of non-compound pages be treated inconsistently. After all, the head page just happens to have its refcount being 1, it is not a real refcounted page. So I would suggest mentioning such different behavior in the commit message at least. :) > > > > > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, > > > struct kvm_mmu *mmu, > > > */ > > > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page > > > *sp, int i) > > > { > > > - bool host_writable; > > > + bool host_writable, is_refcounted; > > > gpa_t first_pte_gpa; > > > u64 *sptep, spte; > > > struct kvm_memory_slot *slot; > > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, > > > struct kvm_mmu_page *sp, int > > > sptep = >spt[i]; > > > spte = *sptep; > > > host_writable = spte & shadow_host_writable_mask; > > > + // TODO: is this correct? > > > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED; > > > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); > > > make_spte(vcpu, sp, slot, pte_access, gfn, > > > spte_to_pfn(spte), spte, true, false, > > > - host_writable, ); > > > + host_writable, is_refcounted, ); > > > > Could we restrict that a non-refcounted page shall not be used as shadow > > page? > > I'm not very familiar with the shadow mmu, so my response might not > make sense. But do you mean not allowing non-refcoutned pages as the > guest page tables shadowed by a kvm_mmu_page? It would probably be > possible to do that, and I doubt anyone would care about the > restriction. But as far as I can tell, the guest page table is only > accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted > pages just fine. Sorry, my brain just got baked... Pls just ignore this question :) B.R. Yu
Re: [PATCH v3] powerpc/fsl_rio: add missing of_node_put() in fsl_rio_setup()
Le 24/11/2022 à 15:33, Yang Yingliang a écrit : > The of node returned by of_find_compatible_node() with refcount > decremented, of_node_put() need be called after using it to avoid > refcount leak. Is that patch still required ? If yes, please rebase and resend. Thanks Christophe > > Fixes: abc3aeae3aaa ("fsl-rio: Add two ports and rapidio message units > support") > Signed-off-by: Yang Yingliang > --- > v2 -> v3: >drop of_node_put() in loop for_each_child_of_node(), it's not needed. > > v1 -> v2: >Add fix tag. > > v1 patch link: > > https://lore.kernel.org/lkml/20220615032137.1878219-1-yangyingli...@huawei.com/ > --- > arch/powerpc/sysdev/fsl_rio.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/powerpc/sysdev/fsl_rio.c b/arch/powerpc/sysdev/fsl_rio.c > index c8f044d62fe2..1ad962d3060e 100644 > --- a/arch/powerpc/sysdev/fsl_rio.c > +++ b/arch/powerpc/sysdev/fsl_rio.c > @@ -550,12 +550,13 @@ int fsl_rio_setup(struct platform_device *dev) > if (!dt_range) { > pr_err("%pOF: unable to find 'reg' property\n", > np); > rc = -ENOMEM; > goto err_pw; > } > + of_node_put(np); > range_start = of_read_number(dt_range, aw); > dbell->dbell_regs = (struct rio_dbell_regs *)(rmu_regs_win + > (u32)range_start); > > /*set up port write node*/ > np = of_find_compatible_node(NULL, NULL, "fsl,srio-port-write-unit"); > @@ -578,12 +579,13 @@ int fsl_rio_setup(struct platform_device *dev) > if (!dt_range) { > pr_err("%pOF: unable to find 'reg' property\n", > np); > rc = -ENOMEM; > goto err; > } > + of_node_put(np); > range_start = of_read_number(dt_range, aw); > pw->pw_regs = (struct rio_pw_regs *)(rmu_regs_win + (u32)range_start); > > /*set up ports node*/ > for_each_child_of_node(dev->dev.of_node, np) { > port_index = of_get_property(np, "cell-index", NULL); > @@ -760,12 +762,13 @@ int fsl_rio_setup(struct platform_device *dev) > err_pw: > kfree(dbell); > dbell = NULL; > err_dbell: > iounmap(rmu_regs_win); > rmu_regs_win = NULL; > + of_node_put(np); > err_rmu: > 
kfree(ops); > err_ops: > iounmap(rio_regs_win); > rio_regs_win = NULL; > err_rio_regs:
Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu
Le 21/11/2022 à 04:51, Zhouyi Zhou a écrit : > During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to > offline tick_do_timer_cpu, the operation will fail because in > function tick_nohz_cpu_down: > ``` > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) >return -EBUSY; > ``` > Above bug was first discovered in torture tests performed in PPC VM > of Open Source Lab of Oregon State University, and reproducable in RISC-V > and X86-64 (with additional kernel commandline cpu0_hotplug). > > In this patch, we avoid offline tick_do_timer_cpu by distribute > the offlining cpu among remaining cpus. > > Signed-off-by: Zhouyi Zhou > --- > include/linux/tick.h| 1 + > kernel/time/tick-common.c | 1 + > kernel/time/tick-internal.h | 1 - > kernel/torture.c| 10 ++ > 4 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/include/linux/tick.h b/include/linux/tick.h > index bfd571f18cfd..23cc0b205853 100644 > --- a/include/linux/tick.h > +++ b/include/linux/tick.h > @@ -14,6 +14,7 @@ > #include > > #ifdef CONFIG_GENERIC_CLOCKEVENTS > +extern int tick_do_timer_cpu __read_mostly; > extern void __init tick_init(void); > /* Should be core only, but ARM BL switcher requires it */ > extern void tick_suspend_local(void); > diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c > index 46789356f856..87b9b9afa320 100644 > --- a/kernel/time/tick-common.c > +++ b/kernel/time/tick-common.c > @@ -48,6 +48,7 @@ ktime_t tick_next_period; >*procedure also covers cpu hotplug. 
>    */ > int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT; > +EXPORT_SYMBOL_GPL(tick_do_timer_cpu); > #ifdef CONFIG_NO_HZ_FULL > /* >* tick_do_timer_boot_cpu indicates the boot CPU temporarily owns > diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h > index 649f2b48e8f0..8953dca10fdd 100644 > --- a/kernel/time/tick-internal.h > +++ b/kernel/time/tick-internal.h > @@ -15,7 +15,6 @@ > > DECLARE_PER_CPU(struct tick_device, tick_cpu_device); > extern ktime_t tick_next_period; > -extern int tick_do_timer_cpu __read_mostly; > > extern void tick_setup_periodic(struct clock_event_device *dev, int > broadcast); > extern void tick_handle_periodic(struct clock_event_device *dev); > diff --git a/kernel/torture.c b/kernel/torture.c > index 789aeb0e1159..bccbdd33dda2 100644 > --- a/kernel/torture.c > +++ b/kernel/torture.c > @@ -33,6 +33,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -358,7 +359,16 @@ torture_onoff(void *arg) > schedule_timeout_interruptible(HZ / 10); > continue; > } > +#ifdef CONFIG_NO_HZ_FULL > + /* do not offline tick do timer cpu */ > + if (tick_nohz_full_running) { Can you use function tick_nohz_full_enabled() instead and avoid the #ifdef ? > + cpu = (torture_random() >> 4) % maxcpu; > + if (cpu >= tick_do_timer_cpu) > + cpu = (cpu + 1) % (maxcpu + 1); > + } else > +#else > cpu = (torture_random() >> 4) % (maxcpu + 1); > +#endif > if (!torture_offline(cpu, >_offline_attempts, _offline_successes, >_offline, _offline, _offline))
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang wrote: > > On Tue, 4 Jul 2023 16:50:48 +0900 > David Stevens wrote: > > > From: David Stevens > > > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > > callers to resolve a gfn when the associated pfn has a valid struct page > > that isn't being actively refcounted (e.g. tail pages of non-compound > > higher order pages). For a caller to safely omit FOLL_GET, all usages of > > the returned pfn must be guarded by a mmu notifier. > > > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > > is set when the returned pfn has an associated struct page with a valid > > refcount. Callers that don't pass FOLL_GET should remember this value > > and use it to avoid places like kvm_is_ad_tracked_page that assume a > > non-zero refcount. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 10 ++ > > virt/kvm/kvm_main.c | 67 +--- > > virt/kvm/pfncache.c | 2 +- > > 3 files changed, 47 insertions(+), 32 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index ef2763c2b12e..a45308c7d2d9 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +void kvm_set_page_accessed(struct page *page); > > +void kvm_set_page_dirty(struct page *page); > > + > > struct kvm_follow_pfn { > > const struct kvm_memory_slot *slot; > > gfn_t gfn; > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > > bool atomic; > > /* Allow a read fault to create a writeable mapping. */ > > bool allow_write_mapping; > > + /* > > + * Usage of the returned pfn will be guared by a mmu notifier. Must > ^guarded > > + * be true if FOLL_GET is not set. 
> > +	 */
> > +	bool guarded_by_mmu_notifier;
>
> It seems no one sets the guraded_by_mmu_notifier in this patch. Is
> guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the
> caller of __kvm_follow_pfn()?

Yes, this is the case.

> If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> temporary solution. I don't think it is a good idea to play tricks with
> a temporary solution, more like we are abusing the toleration.

I'm not sure I understand what you're getting at. This series never
calls gup without FOLL_GET.

This series aims to provide kvm_follow_pfn as a unified API on top of
gup+follow_pte. Since one of the major clients of this API uses an mmu
notifier, it makes sense to support returning a pfn without taking a
reference. And we indeed need to do that for certain types of memory.

> Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to
> indicate a tail page?

What do you mean by to indicate a tail page? Do you mean to indicate
that the returned pfn refers to non-refcounted page? That's specified
by is_refcounted_page.

-David
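The rule David confirms here — guarded_by_mmu_notifier is set by the caller and must be true whenever FOLL_GET is omitted — is exactly the kind of invariant a sanity check can enforce. The sketch below is a standalone model of that invariant, not KVM code; the struct, flag value, and function name are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define FOLL_GET 0x01  /* illustrative value, not the kernel's gup flag */

/* Minimal stand-in for the kvm_follow_pfn struct under discussion. */
struct follow_pfn_args {
	unsigned int flags;
	bool guarded_by_mmu_notifier;
};

/*
 * The invariant from the patch comment: a caller may omit FOLL_GET only
 * when every use of the returned pfn is guarded by an mmu notifier.
 */
static bool follow_pfn_args_valid(const struct follow_pfn_args *a)
{
	return (a->flags & FOLL_GET) || a->guarded_by_mmu_notifier;
}
```

In the real series this would correspond to a WARN_ON-style check at the top of the pfn resolution path, rejecting callers that neither take a reference nor promise notifier protection.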
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
Le 06/07/2023 à 08:14, Benjamin Gray a écrit :
> On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote:
>>
>>
>> Le 05/07/2023 à 02:30, Benjamin Gray a écrit :
>>> The drivers/rtc/rtc-ds1307.c driver has a direct dependency on
>>> struct regmap_config, which is guarded behind CONFIG_REGMAP.
>>>
>>> Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP")
>>> exposed this by disabling the default pick unless KUNIT_ALL_TESTS is
>>> set, causing the ppc64be allnoconfig build to fail.
>>>
>>> Signed-off-by: Benjamin Gray
>>> ---
>>>  drivers/rtc/Kconfig | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
>>> index ffca9a8bb878..7455ebd189fe 100644
>>> --- a/drivers/rtc/Kconfig
>>> +++ b/drivers/rtc/Kconfig
>>> @@ -246,6 +246,7 @@ config RTC_DRV_AS3722
>>>
>>>  config RTC_DRV_DS1307
>>>  	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, ISL12057"
>>> +	select REGMAP
>>
>> As far as I can see, REGMAP defaults to Y when REGMAP_I2C is selected.
>> Can you explain more in details why you have to select it explicitely ?
>> If there is something wrong with the logic, then the logic should be
>> fixed instead of just adding a selection of REGMAP for that particular
>> RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or RTC_DRV_ABEOZ9
>> might have the exact same problem.
>
> Right, yeah, I don't want to assert this patch is the correct solution,
> sending it was more to offer a fix and allow discussion if it should be
> resolved some other way (so thanks for replying, I appreciate it).
>
> In terms of why I made this patch, the way I see it, if a config option
> requires another config option be set, then "selects" is the natural
> way of phrasing this dependency. "default" on the REGMAP side seems
> weird. If it's a default, does that mean it can be overridden? But
> RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The
> build breaks without it.
I meant "why doesn't it work as is", and/or "why didn't you fix what
doesn't work".

Well, until commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not
select REGMAP") it was not user-selectable, so it couldn't be
overridden. After commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should
not select REGMAP") it is overridable, so we have an additional
problem.

Does RTC_DRV_DS1307 require REGMAP, or does RTC_DRV_DS1307 require
REGMAP_I2C which in turn requires REGMAP ?

I think that huge default like for REGMAP should be replaced by
individual selection of REGMAP from each of those config items. For
example, REGMAP_I2C should select REGMAP; then I guess it would also
solve your issue, wouldn't it ?

> But maybe KConfig works differently to my assumptions. Maybe the
> referenced patch that causes the build failure is actually incorrect
> (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not
> supposed to depend on REGMAP like KUnit does?
>
> As for other drivers having the same problem, quite possibly, yes. But
> the PPC configs don't seem to build those drivers, so I'd prefer to
> leave it to anyone who does build them to report the error. I wasn't
> planning to audit all the drivers, I was just trying to fix the PPC
> configs I build and test.

Well ok, but that's probably the indication of a deeper problem which
deserves a more generic fix.
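Christophe's suggested generic fix — have each transport-specific regmap option select the core REGMAP symbol instead of relying on a broad `default y` expression on REGMAP itself — would look roughly like this. This is a hedged sketch of the shape of the change, not the actual Kconfig in the tree:

```kconfig
# Hypothetical sketch: each transport selects the core symbol, so any
# driver that selects REGMAP_I2C automatically gets REGMAP as well.
config REGMAP
	tristate

config REGMAP_I2C
	tristate
	select REGMAP
	depends on I2C
```

With this shape, RTC_DRV_DS1307 would only need `select REGMAP_I2C`, and the `select REGMAP` added by the patch under discussion would become unnecessary.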
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote:
>
>
> Le 05/07/2023 à 02:30, Benjamin Gray a écrit :
> > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on
> > struct regmap_config, which is guarded behind CONFIG_REGMAP.
> >
> > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP")
> > exposed this by disabling the default pick unless KUNIT_ALL_TESTS is
> > set, causing the ppc64be allnoconfig build to fail.
> >
> > Signed-off-by: Benjamin Gray
> > ---
> >  drivers/rtc/Kconfig | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
> > index ffca9a8bb878..7455ebd189fe 100644
> > --- a/drivers/rtc/Kconfig
> > +++ b/drivers/rtc/Kconfig
> > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722
> >
> >  config RTC_DRV_DS1307
> >  	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, ISL12057"
> > +	select REGMAP
>
> As far as I can see, REGMAP defaults to Y when REGMAP_I2C is selected.
> Can you explain more in details why you have to select it explicitely ?
> If there is something wrong with the logic, then the logic should be
> fixed instead of just adding a selection of REGMAP for that particular
> RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or RTC_DRV_ABEOZ9
> might have the exact same problem.

Right, yeah, I don't want to assert this patch is the correct solution,
sending it was more to offer a fix and allow discussion if it should be
resolved some other way (so thanks for replying, I appreciate it).

In terms of why I made this patch, the way I see it, if a config option
requires another config option be set, then "selects" is the natural
way of phrasing this dependency. "default" on the REGMAP side seems
weird. If it's a default, does that mean it can be overridden? But
RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The
build breaks without it.

But maybe KConfig works differently to my assumptions.
Maybe the referenced patch that causes the build failure is actually
incorrect (CC Geert). I spoke with Joel Stanley (CC) and he indicated
you're not supposed to depend on REGMAP like KUnit does?

As for other drivers having the same problem, quite possibly, yes. But
the PPC configs don't seem to build those drivers, so I'd prefer to
leave it to anyone who does build them to report the error. I wasn't
planning to audit all the drivers, I was just trying to fix the PPC
configs I build and test.
[next-20230705] kernel BUG mm/memcontrol.c:3715! (ltp/madvise06)
While running LTP tests (madvise06) on an IBM Power9 LPAR booted with
6.4.0-next-20230705, the following crash is seen:

Injecting memory failure for pfn 0x3f79 at process virtual address 0x7fff9b74
Memory failure: 0x3f79: recovery action for clean LRU page: Recovered
madvise06 (133636): drop_caches: 3
[ cut here ]
kernel BUG at mm/memcontrol.c:3715!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
Modules linked in: brd overlay exfat vfat fat xfs loop sctp ip6_udp_tunnel udp_tunnel dm_mod nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding ip_set tls rfkill nf_tables libcrc32c nfnetlink sunrpc pseries_rng vmx_crypto ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth fuse [last unloaded: init_module(O)]
CPU: 10 PID: 133636 Comm: madvise06 Tainted: G O 6.4.0-next-20230705 #1
Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.80 (VL950_131) hv:phyp pSeries
NIP: c054ea88 LR: c028b2a8 CTR: c054e8d0
REGS: c0029dd7b890 TRAP: 0700 Tainted: G O (6.4.0-next-20230705)
MSR: 80029033 CR: 28008288 XER:
CFAR: c054e904 IRQMASK: 0
GPR00: c028b2a8 c0029dd7bb30 c1431600 c002bc978000
GPR04: c2b3b288 00010192 0001
GPR08: c000f9abb180 0002 c002bc978580
GPR12: c054e8d0 c0001ec53f00
GPR16:
GPR20: c0001b2e6578 00400cc0 7fff f000
GPR24: c0029dd7bd30 c0029dd7bd58 c0001b2e6568
GPR28: c0029dd7bde0 0001 0001 c0001b2e6540
NIP [c054ea88] mem_cgroup_read_u64+0x1b8/0x1d0
LR [c028b2a8] cgroup_seqfile_show+0xb8/0x160
Call Trace:
[c0029dd7bb50] [c028b2a8] cgroup_seqfile_show+0xb8/0x160
[c0029dd7bbc0] [c0673ba4] kernfs_seq_show+0x44/0x60
[c0029dd7bbe0] [c05c4238] seq_read_iter+0x238/0x620
[c0029dd7bcb0] [c0675064] kernfs_fop_read_iter+0x1d4/0x2c0
[c0029dd7bd00] [c057fbac] vfs_read+0x26c/0x350
[c0029dd7bdc0] [c058077c] ksys_read+0x7c/0x140
[c0029dd7be10] [c0036900]
system_call_exception+0x140/0x350
[c0029dd7be50] [c000d6a0] system_call_common+0x160/0x2e4
--- interrupt: c00 at 0x7fff9eb41484
NIP: 7fff9eb41484 LR: 10008540 CTR:
REGS: c0029dd7be80 TRAP: 0c00 Tainted: G O (6.4.0-next-20230705)
MSR: 8280f033 CR: 28002282 XER:
IRQMASK: 0
GPR00: 0003 7fffc33de7d0 7fff9ec27300 0013
GPR04: 7fffc33e0aa0 1fff 0013
GPR08: 7fffc33e0aa0
GPR12: 7fff9ecca3a0
GPR16: 10035520 10035b90 100347a8
GPR20: 1002fb68 10063900 2000 1002fb68
GPR24: 004c 1002fa78 7fffc33e0aa0
GPR28: 0013 1fff 1fff
NIP [7fff9eb41484] 0x7fff9eb41484
LR [10008540] 0x10008540
--- interrupt: c00
Code: 7fa34800 409effc4 7c0802a6 3861 f8010030 4bfffdfd e8010030 786383e4 7c0803a6 4b6c 7c0802a6 f8010030 <0fe0> 7c0802a6 f8010030 0fe0
---[ end trace ]---
pstore: backend (nvram) writing error (-1)
Kernel panic - not syncing: Fatal exception
Rebooting in 10 seconds..

Git bisect points to the following patch:

commit 29bf1eb7d2abbdfc24c4ef7acf7a51b72dc43d2b
memcg: drop kmem.limit_in_bytes

Does the testcase madvise06 need an update?

90  tst_res(TINFO, "\tCached: %ld Kb",
91          SAFE_READ_MEMINFO("Cached:") - init_cached);
92
93  print_cgmem("memory.current");
94  print_cgmem("memory.swap.current");
95  print_cgmem("memory.kmem.usage_in_bytes");  <<== this line
96  }

If I comment out line 95 from the testcase, it completes successfully.

- Sachin
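If the answer is that madvise06 needs updating, one option is to make the read of memory.kmem.usage_in_bytes conditional on the file existing, since the knob was removed by the bisected commit. The sketch below is illustrative only — it is not LTP's actual print_cgmem() helper, and the function name is made up:

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Illustrative sketch (not LTP's print_cgmem()): read a cgroup file
 * only if it exists, so the test keeps working on kernels where
 * memory.kmem.usage_in_bytes has been removed. Returns the value,
 * or -1 if the file is absent or unreadable.
 */
static long read_cgmem_optional(const char *path)
{
	FILE *f;
	long val = -1;

	if (access(path, R_OK) != 0)
		return -1;  /* knob removed on this kernel: skip, don't crash */

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

A test updated this way would report the value where available and skip it on kernels at or after the commit that dropped kmem.limit_in_bytes, instead of tripping the BUG via the legacy read path.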
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Wed, Jul 5, 2023 at 8:56 PM Yu Zhang wrote:
>
> On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote:
> > From: David Stevens
> >
> > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > callers to resolve a gfn when the associated pfn has a valid struct page
> > that isn't being actively refcounted (e.g. tail pages of non-compound
> > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > the returned pfn must be guarded by a mmu notifier.
> >
> > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > is set when the returned pfn has an associated struct page with a valid
> > refcount. Callers that don't pass FOLL_GET should remember this value
> > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > non-zero refcount.
> >
> > Signed-off-by: David Stevens
> > ---
> >  include/linux/kvm_host.h | 10 ++
> >  virt/kvm/kvm_main.c      | 67 +---
> >  virt/kvm/pfncache.c      |  2 +-
> >  3 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ef2763c2b12e..a45308c7d2d9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> >  void kvm_release_page_clean(struct page *page);
> >  void kvm_release_page_dirty(struct page *page);
> >
> > +void kvm_set_page_accessed(struct page *page);
> > +void kvm_set_page_dirty(struct page *page);
> > +
> >  struct kvm_follow_pfn {
> >  	const struct kvm_memory_slot *slot;
> >  	gfn_t gfn;
> > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> >  	bool atomic;
> >  	/* Allow a read fault to create a writeable mapping. */
> >  	bool allow_write_mapping;
> > +	/*
> > +	 * Usage of the returned pfn will be guared by a mmu notifier. Must
> > +	 * be true if FOLL_GET is not set.
> > +	 */
> > +	bool guarded_by_mmu_notifier;
>
> And how?
> Any place to check the invalidate seq?

kvm_follow_pfn can't meaningfully validate the seq number, since the
mmu notifier locking is handled by the caller. This is more of a
sanity check that the API is being used properly, as proposed here [1].
I did deviate from the proposal with a bool instead of some type of
integer, since the exact value of mmu_seq wouldn't be useful.

[1] https://lore.kernel.org/all/zgvusf7lmkrnd...@google.com/#t

-David
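For context, the pattern David says is "handled by the caller" is KVM's invalidation-sequence retry: sample the sequence counter, do the (potentially stale) lookup, and retry if an invalidation ran in between. The sketch below is a standalone single-threaded model of that protocol — the names follow KVM's style (mmu_seq, retry), but it is not KVM code:

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone model of KVM's mmu-notifier sequence/retry protocol. */
struct mmu_model {
	unsigned long seq;   /* bumped after every invalidation */
	bool in_progress;    /* an invalidation is currently running */
	long mapping;        /* the "translation" being cached */
};

static void invalidate(struct mmu_model *m, long new_mapping)
{
	m->in_progress = true;
	m->mapping = new_mapping;
	m->in_progress = false;
	m->seq++;            /* published once the invalidation finishes */
}

/* True if a snapshot taken at 'seq' must be thrown away and redone. */
static bool mmu_retry_needed(const struct mmu_model *m, unsigned long seq)
{
	return m->in_progress || m->seq != seq;
}

static long lookup(struct mmu_model *m)
{
	long snapshot;
	unsigned long seq;

	do {
		seq = m->seq;          /* kvm: mmu_seq = kvm->mmu_invalidate_seq */
		snapshot = m->mapping; /* the potentially stale walk */
		/* in the concurrent original, an invalidation may run here */
	} while (mmu_retry_needed(m, seq)); /* kvm: retry under the mmu lock */

	return snapshot;
}
```

The point of the thread is that this check lives in the caller, so __kvm_follow_pfn itself can only assert that the caller has promised to do it (the guarded_by_mmu_notifier bool), not verify the sequence number on its behalf.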