Re: [PATCH v2 08/23] parisc: add pte_unmap() to balance get_ptep()
On 6/8/23 21:18, Hugh Dickins wrote: To keep balance in future, remember to pte_unmap() after a successful get_ptep(). And act as if flush_cache_pages() really needs a map there, to read the pfn before "unmapping", to be sure page table is not removed. Signed-off-by: Hugh Dickins For the parisc parts: Acked-by: Helge Deller # parisc Helge --- arch/parisc/kernel/cache.c | 26 +- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c index ca4a302d4365..501160250bb7 100644 --- a/arch/parisc/kernel/cache.c +++ b/arch/parisc/kernel/cache.c @@ -426,10 +426,15 @@ void flush_dcache_page(struct page *page) offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT; addr = mpnt->vm_start + offset; if (parisc_requires_coherency()) { + bool needs_flush = false; pte_t *ptep; ptep = get_ptep(mpnt->vm_mm, addr); - if (ptep && pte_needs_flush(*ptep)) + if (ptep) { + needs_flush = pte_needs_flush(*ptep); + pte_unmap(ptep); + } + if (needs_flush) flush_user_cache_page(mpnt, addr); } else { /* @@ -561,14 +566,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr); static void flush_cache_page_if_present(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn) { - pte_t *ptep = get_ptep(vma->vm_mm, vmaddr); + bool needs_flush = false; + pte_t *ptep; /* * The pte check is racy and sometimes the flush will trigger * a non-access TLB miss. Hopefully, the page has already been * flushed. */ - if (ptep && pte_needs_flush(*ptep)) + ptep = get_ptep(vma->vm_mm, vmaddr); + if (ptep) { + needs_flush = pte_needs_flush(*ptep); + pte_unmap(ptep); + } + if (needs_flush) flush_cache_page(vma, vmaddr, pfn); } @@ -635,17 +646,22 @@ static void flush_cache_pages(struct vm_area_struct *vma, unsigned long start, u pte_t *ptep; for (addr = start; addr < end; addr += PAGE_SIZE) { + bool needs_flush = false; /* * The vma can contain pages that aren't present. 
Although * the pte search is expensive, we need the pte to find the * page pfn and to check whether the page should be flushed. */ ptep = get_ptep(vma->vm_mm, addr); - if (ptep && pte_needs_flush(*ptep)) { + if (ptep) { + needs_flush = pte_needs_flush(*ptep); + pfn = pte_pfn(*ptep); + pte_unmap(ptep); + } + if (needs_flush) { if (parisc_requires_coherency()) { flush_user_cache_page(vma, addr); } else { - pfn = pte_pfn(*ptep); if (WARN_ON(!pfn_valid(pfn))) return; __flush_cache_page(vma, addr, PFN_PHYS(pfn));
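The shape of the fix can be modeled in plain userspace C: read everything needed from the pte (the flush decision and the pfn) while the mapping is held, call the unmap, and afterwards act only on the saved copies. All names below are illustrative stand-ins for the kernel API, not the real parisc code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for a mapped page table entry. */
struct pte_model {
	uint64_t pfn;
	bool dirty;
	bool mapped;
};

static struct pte_model *get_ptep_model(struct pte_model *p)
{
	p->mapped = true;	/* get_ptep() maps the page table */
	return p;
}

static void pte_unmap_model(struct pte_model *p)
{
	p->mapped = false;	/* after this the table may be removed */
}

/*
 * The pattern from the patch: snapshot pte_needs_flush() and pte_pfn()
 * while the mapping is held, unmap, then use only the saved copies.
 */
static bool snapshot_then_unmap(struct pte_model *p, uint64_t *pfn_out)
{
	bool needs_flush = false;
	struct pte_model *ptep = get_ptep_model(p);

	if (ptep) {
		needs_flush = ptep->dirty;	/* pte_needs_flush(*ptep) */
		*pfn_out = ptep->pfn;		/* pte_pfn(*ptep) */
		pte_unmap_model(ptep);		/* balance get_ptep() */
	}
	return needs_flush;
}
```

Dereferencing ptep after the unmap would be exactly the imbalance the patch removes.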
Re: [RFC PATCH v2 1/1] powerpc: update ppc_save_regs to save current r1 in pt_regs
On 15/06/23 17:40, Nicholas Piggin wrote:
> On Thu Jun 15, 2023 at 7:10 PM AEST, Aditya Gupta wrote:
>> ppc_save_regs() skips one stack frame while saving the CPU register states. Instead of saving current R1, it pulls the previous stack frame pointer.
>> ...
>> So this now saves regs as though it was an interrupt taken in the caller, at the instruction after the call to ppc_save_regs, whereas previously the NIP was there, but R1 came from the caller's caller and that mismatch is what causes gdb's dwarf unwinder to go haywire.
>
> Nice catch, and I think I follow the fix and I think I agree with it. Before the bug was introduced, NIP also came from the grandparent. Which is what xmon presumably wanted, but since then I think maybe it makes more sense to just have the parent caller.
>
> Reviewed-by: Nicholas Piggin
> Fixes: d16a58f8854b1 ("powerpc: Improve ppc_save_regs()")

Thanks for reviewing this, and providing a Fixes tag too.

Thanks
- Aditya
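For readers unfamiliar with the PPC64 ABI detail under discussion: the word at the stack pointer is the back chain, i.e. the caller's r1. A small userspace model (illustrative names, not the real ppc_save_regs) shows why saving *(r1) instead of r1 mismatches the saved NIP:

```c
#include <assert.h>
#include <stdint.h>

/* On PPC64 the word at a stack pointer is the back chain: the caller's r1. */
struct frame {
	struct frame *back_chain;
};

struct saved_regs {
	uintptr_t gpr1;
};

/* Fixed behaviour: record r1 itself, matching the NIP of the same frame. */
static void save_regs_fixed(struct saved_regs *regs, struct frame *r1)
{
	regs->gpr1 = (uintptr_t)r1;
}

/*
 * Buggy behaviour: record *(r1), i.e. the caller's caller's frame, so the
 * saved GPR1 and NIP describe different frames and DWARF unwinding breaks.
 */
static void save_regs_buggy(struct saved_regs *regs, struct frame *r1)
{
	regs->gpr1 = (uintptr_t)r1->back_chain;
}
```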
Re: [PATCH] KVM: ppc64: Enable ring-based dirty memory tracking
Hi Nick/Gavin/Everyone, On 2023-06-08 08:34:48, Kautuk Consul wrote: > - Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL as ppc64 is weakly > ordered. > - Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP because the > kvmppc_xive_native_set_attr is called in the context of an ioctl > syscall and will call kvmppc_xive_native_eq_sync for setting the > KVM_DEV_XIVE_EQ_SYNC attribute which will call mark_dirty_page() > when there isn't a running vcpu. Implemented the > kvm_arch_allow_write_without_running_vcpu to always return true > to allow mark_page_dirty_in_slot to mark the page dirty in the > memslot->dirty_bitmap in this case. > - Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page > offset. > - Implement the kvm_arch_mmu_enable_log_dirty_pt_masked function required > for the generic KVM code to call. > - Add a check to kvmppc_vcpu_run_hv for checking whether the dirty > ring is soft full. > - Implement the kvm_arch_flush_remote_tlbs_memslot function to support > the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT config option. > > On testing with live migration it was found that there is around > 150-180 ms improvment in overall migration time with this patch. > > Signed-off-by: Kautuk Consul > --- > Documentation/virt/kvm/api.rst | 2 +- > arch/powerpc/include/uapi/asm/kvm.h | 2 ++ > arch/powerpc/kvm/Kconfig| 2 ++ > arch/powerpc/kvm/book3s_64_mmu_hv.c | 42 + > arch/powerpc/kvm/book3s_hv.c| 3 +++ > include/linux/kvm_dirty_ring.h | 5 > 6 files changed, 55 insertions(+), 1 deletion(-) > Any review comments on this ? > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index add067793b90..ce1ebc513bae 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -8114,7 +8114,7 @@ regardless of what has actually been exposed through > the CPUID leaf. 
> 8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL > -- > > -:Architectures: x86, arm64 > +:Architectures: x86, arm64, ppc64 > :Parameters: args[0] - size of the dirty log ring > > KVM is capable of tracking dirty memory using ring buffers that are > diff --git a/arch/powerpc/include/uapi/asm/kvm.h > b/arch/powerpc/include/uapi/asm/kvm.h > index 9f18fa090f1f..f722309ed7fb 100644 > --- a/arch/powerpc/include/uapi/asm/kvm.h > +++ b/arch/powerpc/include/uapi/asm/kvm.h > @@ -33,6 +33,8 @@ > /* Not always available, but if it is, this is the correct offset. */ > #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 > > +#define KVM_DIRTY_LOG_PAGE_OFFSET 64 > + > struct kvm_regs { > __u64 pc; > __u64 cr; > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig > index 902611954200..c93354ec3bd5 100644 > --- a/arch/powerpc/kvm/Kconfig > +++ b/arch/powerpc/kvm/Kconfig > @@ -26,6 +26,8 @@ config KVM > select IRQ_BYPASS_MANAGER > select HAVE_KVM_IRQ_BYPASS > select INTERVAL_TREE > + select HAVE_KVM_DIRTY_RING_ACQ_REL > + select NEED_KVM_DIRTY_RING_WITH_BITMAP > > config KVM_BOOK3S_HANDLER > bool > diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c > b/arch/powerpc/kvm/book3s_64_mmu_hv.c > index 7f765d5ad436..c92e8022e017 100644 > --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c > +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c > @@ -2147,3 +2147,45 @@ void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu) > > vcpu->arch.hflags |= BOOK3S_HFLAG_SLB; > } > + > +/* > + * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for > selected > + * dirty pages. > + * > + * It write protects selected pages to enable dirty logging for them. 
> + */
> +void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> +					struct kvm_memory_slot *slot,
> +					gfn_t gfn_offset,
> +					unsigned long mask)
> +{
> +	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
> +	phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT;
> +	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
> +
> +	while (start < end) {
> +		pte_t *ptep;
> +		unsigned int shift;
> +
> +		ptep = find_kvm_secondary_pte(kvm, start, &shift);
> +
> +		*ptep = __pte(pte_val(*ptep) & ~(_PAGE_WRITE));
> +
> +		start += PAGE_SIZE;
> +	}
> +}
> +
> +#ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
> +bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
> +{
> +	return true;
> +}
> +#endif
> +
> +#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
> +void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
> +					const struct kvm_memory_slot *memslot)
> +{
> +	kvm_flush_remote_tlbs(kvm);
> +}
> +#endif
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index
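The start/end computation in kvm_arch_mmu_enable_log_dirty_pt_masked(), which uses __ffs()/__fls() of the dirty mask to bound the walk, can be checked with a small userspace model. GCC builtins stand in for the kernel helpers and a 4 KiB page size is assumed:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages for the model */

/*
 * __ffs(mask) (lowest set bit) and __fls(mask) (highest set bit) bound
 * the walk to the span of GFNs that actually have dirty bits set.
 * mask must be non-zero, as it is in the kernel caller.
 */
static void dirty_mask_range(uint64_t base_gfn, unsigned long mask,
			     uint64_t *start, uint64_t *end)
{
	unsigned int lo = __builtin_ctzl(mask);		/* __ffs(mask) */
	unsigned int hi = 63 - __builtin_clzl(mask);	/* __fls(mask) */

	*start = (base_gfn + lo) << PAGE_SHIFT;
	*end   = (base_gfn + hi + 1) << PAGE_SHIFT;
}
```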
[PATCH v11 0/4] PowerPC: In-kernel handling of CPU/Memory hotplug/online/offline events for kdump kernel
The Problem:
============
After CPU/Memory hot plug/unplug or online/offline events, the kdump kernel often retains outdated system information. This presents a significant challenge when attempting to perform a dump collection using an outdated or stale kdump kernel. In such situations, there are two potential outcomes that pose risks: either the dump collection fails to capture the required data entirely, leading to a failed dump, or the collected dump data is inaccurate, thereby compromising its reliability for analysis and troubleshooting purposes.

Existing solution:
==================
The existing solution to keep the kdump kernel up-to-date involves monitoring CPU/Memory hotplug/online/offline events via a udev rule. This approach triggers a full kdump kernel reload for each hotplug event, ensuring that the kdump kernel is always synchronized with the latest system resource changes.

Shortcomings of existing solution:
==================================
- Leaves a window where a kernel crash might not lead to a successful dump collection.
- Reloading all kexec segments for each hotplug event is inefficient.
- udev rules are prone to races if hotplug events are frequent.

Further information regarding the problems associated with the current solution can be found here:
- https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
- https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-February/240254.html

Proposed Solution:
==================
To address the limitations of the current approach, the proposed solution implements a more targeted update strategy. Instead of performing a full reload of all kexec segments for every CPU/Memory hot plug/unplug or online/offline event, it updates only the relevant kexec segment. After the kexec segments are loaded into the reserved area, a newly introduced hotplug handler updates the specific kexec segment based on the type of hotplug event.
This selective update approach enhances overall efficiency by minimizing unnecessary overhead and significantly reduces the chances of a kernel crash leading to a failed or inaccurate dump collection.

Series Dependencies:
====================
The implementation of the crash hotplug handler on PowerPC is included in this patch series. The generic crash hotplug handler is introduced by the patch series available at https://lore.kernel.org/all/20230612210712.683175-1-eric.devol...@oracle.com/

Git tree for testing:
=====================
The following Git tree incorporates this patch series applied on top of the dependent patch series: https://github.com/sourabhjains/linux/tree/e23-s11-with-kexec-config

In order to enable this feature, it is necessary to disable the udev rule responsible for reloading the kdump service. To do this, add the following two lines at the top of the file "/usr/lib/udev/rules.d/98-kexec.rules" on RHEL:

SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The changes mentioned above ensure that the kdump reload process is skipped for CPU/Memory hot plug/unplug events when the path "/sys/devices/system/[cpu|memory]/crash_hotplug" exists.

Note: only the kexec_file_load syscall will work. For kexec_load, minor changes are required in the kexec tool.

---
Changelog:

v11:
- Rebase to v6.4-rc6
- The patch that introduced CONFIG_CRASH_HOTPLUG for PowerPC has been removed. The config is now part of the common configuration: https://lore.kernel.org/all/87ilbpflsk.fsf@mail.lhotse/

v10:
- Drop the patch that adds the fdt_index attribute to struct kimage_arch; find the fdt segment index when needed.
- Added more details to commit messages.
- Rebased onto 6.3.0-rc5

v9:
- Removed the patch to prepare elfcorehdr crash notes for possible CPUs. The patch is moved to the generic patch series that introduces the generic infrastructure for in-kernel crash update.
- Removed the patch to pass the hotplug action type to the arch crash hotplug handler function. The generic patch series has introduced the hotplug action type in the kimage struct.
- Added more detailed commit messages for better understanding.

v8:
- Restrict fdt_index initialization to machine_kexec_post_load so it works for both kexec_load and kexec_file_load. [3/8] (Laurent Dufour)
- Updated the logic to find the number of offline cores. [6/8]
- Changed the logic to find the elfcore program header to accommodate future memory ranges due to memory hotplug events. [8/8]

v7:
- Added a new config to configure this feature
- Pass hotplug action type to the arch-specific handler

v6:
- Added crash memory hotplug support

v5:
- Replace CONFIG_CRASH_HOTPLUG with CONFIG_HOTPLUG_CPU.
- Move fdt segment identification for the kexec_load case to the load path instead of the crash hotplug handler
- Keep new attribute defined under
[PATCH v11 2/4] powerpc/crash: add crash CPU hotplug support
Introduce a powerpc crash hotplug handler to update the necessary kexec segments in the kernel on CPU/Memory hotplug events. Currently, these updates are done by monitoring CPU/Memory hotplug events in userspace.

In the generic infrastructure, a shared crash hotplug handler is triggered for both CPU and Memory hotplug events. However, this particular patch handles only CPU hotplug events for crash updates. Support for crash updates during memory hotplug events will be introduced in subsequent patches.

The elfcorehdr segment facilitates the exchange of CPU and other relevant dump information between kernels. Ideally, this segment should be regenerated during CPU hotplug events to reflect any changes. However, on powerpc systems, the elfcorehdr is initially constructed with all possible CPUs, rendering it unnecessary to update or recreate it when CPU hotplug events occur.

Additionally, on powerpc, there is another segment, the FDT (Flattened Device Tree), that holds CPU data. During the boot of the kdump kernel, it is crucial for the crashing CPU to be present in the FDT; failure to have the crashing CPU in the FDT would result in a boot failure of the kdump kernel. Therefore, the only action required on powerpc to handle a crash CPU hotplug event is to add the hot-added CPUs to the kdump FDT segment, ensuring that the kdump kernel boots successfully. There is no need to remove the hot-unplugged CPUs from the FDT segment, so no action is taken for a CPU hot removal event.

In order to support an increasing number of CPUs, the FDT is constructed with extra buffer space to ensure it can accommodate all possible CPU nodes. Also, the FDT is not packed to shrink its size once prepared.

The changes introduced here will also work for the kexec_load system call, considering that the kexec tool constructs the FDT segment with extra space to accommodate all possible CPU nodes.
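The extra-buffer idea reduces to simple sizing arithmetic; a sketch follows, where the per-node estimate is an assumed illustrative constant, not a value taken from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-CPU-node size estimate, for illustration only. */
#define FDT_EXTRA_PER_CPU 1024

/*
 * Size the kdump FDT buffer for every possible CPU up front, so nodes
 * for hot-added CPUs can later be written in place; skipping the final
 * pack step keeps this slack in the prepared segment.
 */
static size_t kdump_fdt_bufsize(size_t base_fdt_size,
				unsigned int possible_cpus,
				unsigned int present_cpus)
{
	unsigned int missing = possible_cpus - present_cpus;

	return base_fdt_size + (size_t)missing * FDT_EXTRA_PER_CPU;
}
```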
Since memory crash hotplug support is not there yet, the crash hotplug handler simply warns the user and returns.

The changes related to the in-kernel update of kdump kernel kexec segments on CPU/Memory hotplug and online/offline events are kept under the CRASH_HOTPLUG config, which is enabled by default on the PowerPC platform.

Signed-off-by: Sourabh Jain
---
 arch/powerpc/Kconfig              |  3 ++
 arch/powerpc/include/asm/kexec.h  | 10 +
 arch/powerpc/kexec/core_64.c      | 61 +++
 arch/powerpc/kexec/elf_64.c       | 12 +-
 arch/powerpc/kexec/file_load_64.c | 14 +++
 5 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 36f2fe0cc8a5..87f1b569c682 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -665,6 +665,9 @@ config RELOCATABLE_TEST
 config ARCH_HAS_CRASH_DUMP
 	def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
 
+config ARCH_HAS_CRASH_HOTPLUG
+	def_bool y
+
 config ARCH_SUPPORTS_CRASH_DUMP
 	def_bool y
 	depends on CRASH_DUMP
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 8090ad7d97d9..154759bf0759 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -103,6 +103,16 @@ void kexec_copy_flush(struct kimage *image);
 struct crash_mem;
 int update_cpus_node(void *fdt);
 int get_crash_memory_ranges(struct crash_mem **mem_ranges);
+
+#ifdef CONFIG_CRASH_HOTPLUG
+void arch_crash_handle_hotplug_event(struct kimage *image);
+#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline int crash_hotplug_cpu_support(void) { return 1; }
+#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+#endif
+#endif
 #endif
 
 #if defined(CONFIG_CRASH_DUMP) && defined(CONFIG_PPC_RTAS)
diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 0b292f93a74c..cb38da1c6dbe 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -543,6 +543,67 @@ int update_cpus_node(void *fdt)
 	return ret;
 }
 
+#ifdef CONFIG_CRASH_HOTPLUG
+#undef pr_fmt
+#define pr_fmt(fmt) "crash hp: " fmt
+
+/**
+ * arch_crash_handle_hotplug_event - Handle crash CPU/Memory hotplug events to update the
+ *                                   necessary kexec segments based on the hotplug event.
+ * @image: the active struct kimage
+ *
+ * Update FDT segment to include newly added CPU. No action for CPU remove case.
+ */
+void arch_crash_handle_hotplug_event(struct kimage *image)
+{
+	void *fdt, *ptr;
+	unsigned long mem;
+	int i, fdt_index = -1;
+	unsigned int hp_action = image->hp_action;
+
+	/*
+	 * Since the hot-unplugged CPU is already part of crash FDT,
+	 * no action is needed for CPU remove case.
+	 */
+	if (hp_action == KEXEC_CRASH_HP_REMOVE_CPU)
+		return;
+
+	/* crash update on memory hotplug events is not supported yet */
+	if (hp_action ==
[PATCH v11 4/4] powerpc/crash: add crash memory hotplug support
Extend the PowerPC arch crash hotplug handler to support memory hotplug events. Since the elfcorehdr is used to exchange memory info between the kernels, it needs to be recreated to reflect the changes due to memory hotplug events.

Given the way memory hotplug events are handled on PowerPC and the notifier call chain used in generic code to trigger the arch crash handler, the process to recreate the elfcorehdr differs between the memory add and remove cases. For the memory remove case, the memory change notifier call chain is triggered first and then the memblock regions are updated, whereas for the memory hot add case, the memblock regions are updated before the memory change notifier call chain is invoked.

On PowerPC, the memblock regions list is used to prepare the elfcorehdr. In the case of memory hot remove, the memblock regions are updated after the arch crash hotplug handler is triggered, hence an additional step is taken to ensure that the memory ranges used to prepare the elfcorehdr do not include hot-removed memory.

When memory is hot removed, it is possible that the memory region count increases. So, to accommodate a growing number of memory regions, the elfcorehdr kexec segment is built with additional buffer space.

The changes done here will also work for the kexec_load system call, given that the kexec tool builds the elfcorehdr with additional space to accommodate future memory regions, as is done for the kexec_file_load system call in the kernel.
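The additional-buffer requirement follows from the fact that removing a block from the middle of a memory range splits it in two. A userspace model of the range-removal step (illustrative, not the actual remove_mem_range() implementation) makes that visible:

```c
#include <assert.h>
#include <stdint.h>

struct range {
	uint64_t base;
	uint64_t size;
};

/*
 * Carve [rm_base, rm_base + rm_size) out of a range list. The model
 * assumes the removed block overlaps at most one range, as an LMB does.
 * A hit in the middle splits the range in two, which is why hot remove
 * can *grow* the region count. Returns the new count, or -1 on overflow.
 */
static int remove_range(struct range *r, int n, int max,
			uint64_t rm_base, uint64_t rm_size)
{
	uint64_t rm_end = rm_base + rm_size;

	for (int i = 0; i < n; i++) {
		uint64_t end = r[i].base + r[i].size;

		if (rm_base >= end || rm_end <= r[i].base)
			continue;		/* no overlap */

		if (rm_base <= r[i].base && rm_end >= end) {
			/* whole range removed: close the gap */
			for (int j = i; j < n - 1; j++)
				r[j] = r[j + 1];
			return n - 1;
		}

		if (rm_base > r[i].base && rm_end < end) {
			/* middle removed: split into two ranges */
			if (n == max)
				return -1;
			for (int j = n; j > i + 1; j--)
				r[j] = r[j - 1];
			r[i].size = rm_base - r[i].base;
			r[i + 1].base = rm_end;
			r[i + 1].size = end - rm_end;
			return n + 1;
		}

		if (rm_base <= r[i].base) {	/* trim the head */
			r[i].base = rm_end;
			r[i].size = end - rm_end;
		} else {			/* trim the tail */
			r[i].size = rm_base - r[i].base;
		}
		return n;
	}
	return n;
}
```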
Signed-off-by: Sourabh Jain --- arch/powerpc/include/asm/kexec.h| 6 ++ arch/powerpc/include/asm/kexec_ranges.h | 1 + arch/powerpc/kexec/core_64.c| 77 +- arch/powerpc/kexec/file_load_64.c | 36 ++- arch/powerpc/kexec/ranges.c | 85 + 5 files changed, 201 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index d3ff481aa9f8..10017880571c 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -112,6 +112,12 @@ void arch_crash_handle_hotplug_event(struct kimage *image, void *arg); static inline int crash_hotplug_cpu_support(void) { return 1; } #define crash_hotplug_cpu_support crash_hotplug_cpu_support #endif + +#ifdef CONFIG_MEMORY_HOTPLUG +static inline int crash_hotplug_memory_support(void) { return 1; } +#define crash_hotplug_memory_support crash_hotplug_memory_support +#endif + #endif #endif diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h index f83866a19e87..802abf580cf0 100644 --- a/arch/powerpc/include/asm/kexec_ranges.h +++ b/arch/powerpc/include/asm/kexec_ranges.h @@ -7,6 +7,7 @@ void sort_memory_ranges(struct crash_mem *mrngs, bool merge); struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges); int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); +int remove_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); int add_tce_mem_ranges(struct crash_mem **mem_ranges); int add_initrd_mem_range(struct crash_mem **mem_ranges); #ifdef CONFIG_PPC_64S_HASH_MMU diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c index 4d1c53cc9a90..e5038f2769bb 100644 --- a/arch/powerpc/kexec/core_64.c +++ b/arch/powerpc/kexec/core_64.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -547,6 +548,76 @@ int update_cpus_node(void *fdt) #undef pr_fmt #define pr_fmt(fmt) "crash hp: " fmt +/** + * update_crash_elfcorehdr() - Recreate the elfcorehdr and replace it with 
old
+ *			       elfcorehdr in the kexec segment array.
+ * @image: the active struct kimage
+ * @mn: struct memory_notify data handler
+ */
+static void update_crash_elfcorehdr(struct kimage *image, struct memory_notify *mn)
+{
+	int ret;
+	struct crash_mem *cmem = NULL;
+	struct kexec_segment *ksegment;
+	void *ptr, *mem, *elfbuf = NULL;
+	unsigned long elfsz, memsz, base_addr, size;
+
+	ksegment = &image->segment[image->elfcorehdr_index];
+	mem = (void *) ksegment->mem;
+	memsz = ksegment->memsz;
+
+	ret = get_crash_memory_ranges(&cmem);
+	if (ret) {
+		pr_err("Failed to get crash mem range\n");
+		return;
+	}
+
+	/*
+	 * The hot unplugged memory is part of crash memory ranges,
+	 * remove it here.
+	 */
+	if (image->hp_action == KEXEC_CRASH_HP_REMOVE_MEMORY) {
+		base_addr = PFN_PHYS(mn->start_pfn);
+		size = mn->nr_pages * PAGE_SIZE;
+		ret = remove_mem_range(&cmem, base_addr, size);
+		if (ret) {
+			pr_err("Failed to remove hot-unplugged from crash memory ranges.\n");
+			return;
+		}
+	}
+
+	ret = crash_prepare_elf64_headers(cmem, false, &elfbuf, &elfsz);
+	if (ret) {
+		pr_err("Failed to prepare elf header\n");
+		return;
+	}
+
[PATCH v11 1/4] powerpc/kexec: turn some static helper functions public
Move update_cpus_node and get_crash_memory_ranges functions from kexec/file_load_64.c to kexec/core_64.c to make these functions usable by other kexec components. Later in the series, these functions are utilized to do in-kernel update to kexec segments on CPU/Memory hot plug/unplug or online/offline events for both kexec_load and kexec_file_load syscalls. No functional change intended. Signed-off-by: Sourabh Jain Reviewed-by: Laurent Dufour --- arch/powerpc/include/asm/kexec.h | 6 ++ arch/powerpc/kexec/core_64.c | 166 ++ arch/powerpc/kexec/file_load_64.c | 162 - 3 files changed, 172 insertions(+), 162 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index a1ddba01e7d1..8090ad7d97d9 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -99,6 +99,12 @@ void relocate_new_kernel(unsigned long indirection_page, unsigned long reboot_co void kexec_copy_flush(struct kimage *image); +#ifdef CONFIG_PPC64 +struct crash_mem; +int update_cpus_node(void *fdt); +int get_crash_memory_ranges(struct crash_mem **mem_ranges); +#endif + #if defined(CONFIG_CRASH_DUMP) && defined(CONFIG_PPC_RTAS) void crash_free_reserved_phys_range(unsigned long begin, unsigned long end); #define crash_free_reserved_phys_range crash_free_reserved_phys_range diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c index a79e28c91e2b..0b292f93a74c 100644 --- a/arch/powerpc/kexec/core_64.c +++ b/arch/powerpc/kexec/core_64.c @@ -17,6 +17,8 @@ #include #include #include +#include +#include #include #include @@ -30,6 +32,8 @@ #include #include #include +#include +#include int machine_kexec_prepare(struct kimage *image) { @@ -377,6 +381,168 @@ void default_machine_kexec(struct kimage *image) /* NOTREACHED */ } +/** + * get_crash_memory_ranges - Get crash memory ranges. This list includes + * first/crashing kernel's memory regions that + * would be exported via an elfcore. 
+ * @mem_ranges: Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+int get_crash_memory_ranges(struct crash_mem **mem_ranges)
+{
+	phys_addr_t base, end;
+	struct crash_mem *tmem;
+	u64 i;
+	int ret;
+
+	for_each_mem_range(i, &base, &end) {
+		u64 size = end - base;
+
+		/* Skip backup memory region, which needs a separate entry */
+		if (base == BACKUP_SRC_START) {
+			if (size > BACKUP_SRC_SIZE) {
+				base = BACKUP_SRC_END + 1;
+				size -= BACKUP_SRC_SIZE;
+			} else
+				continue;
+		}
+
+		ret = add_mem_range(mem_ranges, base, size);
+		if (ret)
+			goto out;
+
+		/* Try merging adjacent ranges before reallocation attempt */
+		if ((*mem_ranges)->nr_ranges == (*mem_ranges)->max_nr_ranges)
+			sort_memory_ranges(*mem_ranges, true);
+	}
+
+	/* Reallocate memory ranges if there is no space to split ranges */
+	tmem = *mem_ranges;
+	if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) {
+		tmem = realloc_mem_ranges(mem_ranges);
+		if (!tmem)
+			goto out;
+	}
+
+	/* Exclude crashkernel region */
+	ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end);
+	if (ret)
+		goto out;
+
+	/*
+	 * FIXME: For now, stay in parity with kexec-tools but if RTAS/OPAL
+	 *        regions are exported to save their context at the time of
+	 *        crash, they should actually be backed up just like the
+	 *        first 64K bytes of memory.
+	 */
+	ret = add_rtas_mem_range(mem_ranges);
+	if (ret)
+		goto out;
+
+	ret = add_opal_mem_range(mem_ranges);
+	if (ret)
+		goto out;
+
+	/* create a separate program header for the backup region */
+	ret = add_mem_range(mem_ranges, BACKUP_SRC_START, BACKUP_SRC_SIZE);
+	if (ret)
+		goto out;
+
+	sort_memory_ranges(*mem_ranges, false);
+out:
+	if (ret)
+		pr_err("Failed to setup crash memory ranges\n");
+	return ret;
+}
+
+/**
+ * add_node_props - Reads node properties from device node structure and add
+ *                  them to fdt.
+ * @fdt:         Flattened device tree of the kernel
+ * @node_offset: offset of the node to add a property at
+ * @dn:          device node pointer
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int add_node_props(void *fdt, int node_offset, const struct device_node *dn)
+{
+	int ret = 0;
+	struct property *pp;
[PATCH v11 3/4] crash: forward memory_notify args to arch crash hotplug handler
On PowerPC, the memblock regions list is used to prepare the elfcorehdr. This elfcorehdr describes the memory regions of the running kernel to the kdump kernel. However, a challenge arises because the notifier for the memory hotplug crash handler is triggered before the memblock region update takes place; consequently, a newly prepared elfcorehdr would still retain the outdated memory regions. If the elfcorehdr is generated with these outdated memblock regions, it will contain inaccurate information about the memory regions, which can result in failures or incomplete dump collection when attempting to collect a dump using the outdated elfcorehdr.

This challenge is specific to removing an LMB (Logical Memory Block). It does not apply when memory is added: during memory addition, the memory regions are updated first, and then the memory notify function calls the arch crash hotplug handler to update the elfcorehdr.

The flow diagram below depicts the sequence of actions taken during memory hot removal:

    Initiate memory hot remove
                |
                v
          offline pages
                |
                v
    initiate memory notify call
    chain for MEM_OFFLINE event
                |
                v
    Prepare new elfcorehdr and
      replace it with old one
                |
                v
     update memblock regions

The arch crash hotplug handler function signature is updated to pass an additional argument, the struct memory_notify object, to architectures. The memory_notify object contains the starting PFN (Page Frame Number) and the number of pages in the hot-removed memory. Using this information, the base address and size of the hot-removed memory are calculated and used to avoid adding the hot-removed memory region to the elfcorehdr.
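The conversion from the memory_notify payload to a physical range is small enough to spell out. Below is a userspace model in which PFN_PHYS() and PAGE_SIZE are mimicked with an assumed 4 KiB page size:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12			/* assumed 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Mirror of the two struct memory_notify fields the handler needs. */
struct memory_notify_model {
	unsigned long start_pfn;
	unsigned long nr_pages;
};

/* base = PFN_PHYS(start_pfn), size = nr_pages * PAGE_SIZE */
static void notify_to_range(const struct memory_notify_model *mn,
			    uint64_t *base, uint64_t *size)
{
	*base = (uint64_t)mn->start_pfn << PAGE_SHIFT;
	*size = (uint64_t)mn->nr_pages * PAGE_SIZE;
}
```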
Signed-off-by: Sourabh Jain --- arch/powerpc/include/asm/kexec.h | 2 +- arch/powerpc/kexec/core_64.c | 3 ++- arch/x86/include/asm/kexec.h | 2 +- arch/x86/kernel/crash.c | 5 +++-- include/linux/kexec.h| 2 +- kernel/crash_core.c | 14 +++--- 6 files changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 154759bf0759..d3ff481aa9f8 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -105,7 +105,7 @@ int update_cpus_node(void *fdt); int get_crash_memory_ranges(struct crash_mem **mem_ranges); #ifdef CONFIG_CRASH_HOTPLUG -void arch_crash_handle_hotplug_event(struct kimage *image); +void arch_crash_handle_hotplug_event(struct kimage *image, void *arg); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event #ifdef CONFIG_HOTPLUG_CPU diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c index cb38da1c6dbe..4d1c53cc9a90 100644 --- a/arch/powerpc/kexec/core_64.c +++ b/arch/powerpc/kexec/core_64.c @@ -551,10 +551,11 @@ int update_cpus_node(void *fdt) * arch_crash_handle_hotplug_event - Handle crash CPU/Memory hotplug events to update the * necessary kexec segments based on the hotplug event. * @image: the active struct kimage + * @arg: struct memory_notify handler for memory hotplug case and NULL for CPU hotplug case. * * Update FDT segment to include newly added CPU. No action for CPU remove case. 
*/ -void arch_crash_handle_hotplug_event(struct kimage *image) +void arch_crash_handle_hotplug_event(struct kimage *image, void *arg) { void *fdt, *ptr; unsigned long mem; diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 3be6a98751f0..ca0bc9b12504 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -210,7 +210,7 @@ extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss; extern void kdump_nmi_shootdown_cpus(void); #ifdef CONFIG_CRASH_HOTPLUG -void arch_crash_handle_hotplug_event(struct kimage *image); +void arch_crash_handle_hotplug_event(struct kimage *image, void *arg); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event #ifdef CONFIG_HOTPLUG_CPU diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index 18d2a18d1073..d80a63665436 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -456,11 +456,12 @@ unsigned int arch_crash_get_elfcorehdr_size(void) /** * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes - * @image: a pointer to kexec_crash_image + * @image: the active struct kimage + * @arg: struct memory_notify handler for memory hotplug case and NULL for CPU hotplug case. * * Prepare the new elfcorehdr and replace the existing elfcorehdr. */ -void arch_crash_handle_hotplug_event(struct kimage *image) +void arch_crash_handle_hotplug_event(struct kimage *image, void *arg) { void *elfbuf = NULL, *old_elfcorehdr; unsigned long nr_mem_ranges; diff --git a/include/linux/kexec.h b/include/linux/kexec.h index bb0e614f2a05..b3fcee14fea4 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -502,7
Re: [PATCH] ASoC: imx-audmix: check return value of devm_kasprintf()
On Wed, Jun 14, 2023 at 8:15 PM Claudiu Beznea wrote: > devm_kasprintf() returns a pointer to dynamically allocated memory. > Pointer could be NULL in case allocation fails. Check pointer validity. > Identified with coccinelle (kmerr.cocci script). > > Fixes: b86ef5367761 ("ASoC: fsl: Add Audio Mixer machine driver") > Signed-off-by: Claudiu Beznea > Acked-by: Shengjiu Wang Best regards Wang shengjiu > --- > > Hi, > > This has been addressed using kmerr.cocci script proposed for update > at [1]. > > Thank you, > Claudiu Beznea > > [1] > https://lore.kernel.org/all/20230530074044.1603426-1-claudiu.bez...@microchip.com/ > > sound/soc/fsl/imx-audmix.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/sound/soc/fsl/imx-audmix.c b/sound/soc/fsl/imx-audmix.c > index 2c57fe9d2d08..af06268ee57b 100644 > --- a/sound/soc/fsl/imx-audmix.c > +++ b/sound/soc/fsl/imx-audmix.c > @@ -228,6 +228,8 @@ static int imx_audmix_probe(struct platform_device > *pdev) > > dai_name = devm_kasprintf(>dev, GFP_KERNEL, "%s%s", > fe_name_pref, args.np->full_name > + 1); > + if (!dai_name) > + return -ENOMEM; > > dev_info(pdev->dev.parent, "DAI FE name:%s\n", dai_name); > > @@ -236,6 +238,8 @@ static int imx_audmix_probe(struct platform_device > *pdev) > capture_dai_name = > devm_kasprintf(>dev, GFP_KERNEL, "%s > %s", >dai_name, "CPU-Capture"); > + if (!capture_dai_name) > + return -ENOMEM; > } > > priv->dai[i].cpus = [0]; > @@ -263,6 +267,8 @@ static int imx_audmix_probe(struct platform_device > *pdev) >"AUDMIX-Playback-%d", i); > be_cp = devm_kasprintf(>dev, GFP_KERNEL, >"AUDMIX-Capture-%d", i); > + if (!be_name || !be_pb || !be_cp) > + return -ENOMEM; > > priv->dai[num_dai + i].cpus = [2]; > priv->dai[num_dai + i].codecs = [3]; > @@ -287,6 +293,9 @@ static int imx_audmix_probe(struct platform_device > *pdev) > priv->dapm_routes[i].source = > devm_kasprintf(>dev, GFP_KERNEL, "%s %s", >dai_name, "CPU-Playback"); > + if (!priv->dapm_routes[i].source) > + return -ENOMEM; > + > 
priv->dapm_routes[i].sink = be_pb; > priv->dapm_routes[num_dai + i].source = be_pb; > priv->dapm_routes[num_dai + i].sink = be_cp; > -- > 2.34.1 > >
Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
On Mon, Jun 19, 2023 at 02:43:58AM +0200, Thomas Gleixner wrote: > Kent! Hi Thomas :) > No. I am not. Ok. > Whether that's an internal function or not does not make any difference > at all. Well, at the risk of this discussion going completely off the rails, I have to disagree with you there. External interfaces and high level semantics are more important to get right from the outset; internal implementation details can be cleaned up later, within reason. And the discussion on this patchset has been more focused on those external interfaces, which seems like the right approach to me. > > ... I made the same mistake reviewing Song's patchset... > > Song's series had rough edges, but was way more data structure driven > and palatable than this hackery. I liked that aspect of Song's patchset too, and I'm actually inclined to agree with you that this patchset might get a bit cleaner with more of that, but really, this seems like just quibbling over the calling convention for an internal helper function.
Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
Kent! On Sun, Jun 18 2023 at 19:14, Kent Overstreet wrote: > On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote: > > Thomas, you're confusing an internal interface with external No. I am not. Whether that's an internal function or not does not make any difference at all. Having a function growing _eight_ parameters where _six_ of them are derived from a well defined data structure is a clear sign of design fail. It's not rocket science to do: struct generic_allocation_info { ... }; struct execmem_info { struct generic_allocation_info alloc_info; ... }; struct execmem_param { ... struct execmem_info[NTYPES]; }; and having a function which can either operate on execmem_param and type or on generic_allocation_info itself. It does not matter as long as the data structure which is handed into this internal function either describes it completely or needs a supplementary argument, i.e. flags. Having tons of wrappers which do: a = generic_info.a; b = generic_info.b; n = generic_info.n; internal_func(a, b, ..., n); is just hilarious and, to repeat myself, tasteless and therefore disgusting. That's CS course first semester hackery, but TBH, I can only tell from imagination because I did not take CS courses - maybe that's the problem... Data structure driven design does not work from the usage site down to the internals. It's the other way round: 1) Define a data structure which describes what the internal function needs to know 2) Implement use case specific variants which describe that 3) Hand the use case specific variant to the internal function eventually with some minimal supplementary information. Object oriented basics, right? There is absolutely nothing wrong with having internal_func(generic_info *, modifier); but having internal_func(a, b, ..., n) is fundamentally wrong in the context of an extensible and future proof internal function. Guess what.
Today it's sufficient to have _eight_ arguments and we are happy to have 10 nonsensical wrappers around this internal function. Tomorrow there happens to be a use case which needs another argument, so you end up changing 10 wrappers plus the function declaration and definition in one go, instead of adding one new, naturally 0-initialized member to generic_info and being done with it. Look at the evolution of execmem_alloc() in this very patchset which I pointed out. That very patchset covers _two_ of at least _six_ cases Song and myself identified. It already had _three_ steps of evolution from _one_ to _five_ to _eight_ parameters. C is not like e.g. python where you can "solve" that problem by simply doing: - internal_func(a, b, c): + internal_func(a, b, c, d=None, ..., n=None): But that's not a solution either. That's a horrible workaround even in python once your parameter space gets sufficiently large. The way out in python is to have **kwargs. But that's not an option in C, and not necessarily the best option for python either. Even in python or any other object oriented language you get to the point where you have to rethink your approach, go back to the drawing board and think about data representation. But creating a new interface based on "let's see what we need over time and add parameters as we see fit" is simply wrong to begin with, independent of the programming language. Even if the _eight_ parameters are the end of the range, they are beyond justifiable, because that's way beyond the natural register argument space of any architecture and you are offloading your laziness to wrappers and the compiler to emit pointlessly horrible code. There is a reason why new syscalls which need more than a few parameters are based on 'struct DATA_WHICH_I_NEED_TO_KNOW' and 'flags'. We've got burned on the non-extensibility often enough.
Why would a new internal function have any different requirements, especially as it is neither implemented to the full extent nor a hotpath function? Now you might argue that it _is_ a "hotpath" due to the BPF usage, but then even more so, as any intermediate wrapper which converts from one data representation to another data representation is not going to increase performance, right? > ... I made the same mistake reviewing Song's patchset... Song's series had rough edges, but was way more data structure driven and palatable than this hackery. The fact that you made a mistake while reviewing Song's series has absolutely nothing to do with the above or my previous reply to Mike. Thanks, tglx
Re: [PATCH v2 2/2] powerpc/tpm: Reserve SML log when kexec'ing with kexec_file_load()
On 6/15/23 08:37, Michael Ellerman wrote: The TPM code in prom_init.c creates a small buffer of memory to store the TPM's SML (Stored Measurement Log). It's communicated to Linux via the linux,sml-base/size device tree properties of the TPM node. When kexec'ing that buffer can be overwritten, or when kdump'ing it may not be mapped by the second kernel. The latter can lead to a crash when booting the second kernel such as: tpm_ibmvtpm 7103: CRQ initialization completed BUG: Unable to handle kernel data access on read at 0xc0002ffb Faulting instruction address: 0xc000200a70e0 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc2-00134-g9307ce092f5d #314 Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1200 0xf05 of:SLOF,git-5b4c5a pSeries NIP: c000200a70e0 LR: c000203dd5dc CTR: 0800 REGS: c00024543280 TRAP: 0300 Not tainted (6.2.0-rc2-00134-g9307ce092f5d) MSR: 82009033 CR: 24002280 XER: 0006 CFAR: c000200a70c8 DAR: c0002ffb DSISR: 4000 IRQMASK: 0 ... NIP memcpy_power7+0x400/0x7d0 LR kmemdup+0x5c/0x80 Call Trace: memcpy_power7+0x274/0x7d0 (unreliable) kmemdup+0x5c/0x80 tpm_read_log_of+0xe8/0x1b0 tpm_bios_log_setup+0x60/0x210 tpm_chip_register+0x134/0x320 tpm_ibmvtpm_probe+0x520/0x7d0 vio_bus_probe+0x9c/0x460 really_probe+0x104/0x420 __driver_probe_device+0xb0/0x170 driver_probe_device+0x58/0x180 __driver_attach+0xd8/0x250 bus_for_each_dev+0xb4/0x140 driver_attach+0x34/0x50 bus_add_driver+0x1e8/0x2d0 driver_register+0xb4/0x1c0 __vio_register_driver+0x74/0x9c ibmvtpm_module_init+0x34/0x48 do_one_initcall+0x80/0x320 kernel_init_freeable+0x304/0x3ac kernel_init+0x30/0x1a0 ret_from_kernel_thread+0x5c/0x64 To fix the crash, add the SML region to the usable memory areas for the kdump kernel, so that the second kernel will map the region. 
To avoid corruption of the region, add the region to the reserved memory areas, so that the second kernel does not use the memory for something else. To me the 2nd paragraph and the one below seem to say that in general it does NOT 'avoid corruption of the region.' Note that when loading a kdump kernel with the regular kexec_load() syscall the SML may be overwritten by the kdump kernel, depending on where the SML is in memory in relation to the crashkernel region. That is a separate problem that is not solved by this patch. Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()") Reported-by: Stefan Berger Signed-off-by: Michael Ellerman I agree to the code: Reviewed-by: Stefan Berger
[PATCH] cxl/ocxl: Possible repeated word
Delete repeated word in comment. Signed-off-by: Zhu Mao --- drivers/misc/cxl/native.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c index 50b0c44bb8d7..6957946a6463 100644 --- a/drivers/misc/cxl/native.c +++ b/drivers/misc/cxl/native.c @@ -920,7 +920,7 @@ int cxl_attach_dedicated_process_psl9(struct cxl_context *ctx, u64 wed, u64 amr) * Ideally we should do a wmb() here to make sure the changes to the * PE are visible to the card before we call afu_enable. * On ppc64 though all mmios are preceded by a 'sync' instruction hence - * we dont dont need one here. + * we dont need one here. */ result = cxl_ops->afu_reset(afu);
Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote: > Mike! > > Sorry for being late on this ... > > On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote: > > > > +void *execmem_data_alloc(size_t size) > > +{ > > + unsigned long start = execmem_params.modules.data.start; > > + unsigned long end = execmem_params.modules.data.end; > > + pgprot_t pgprot = execmem_params.modules.data.pgprot; > > + unsigned int align = execmem_params.modules.data.alignment; > > + unsigned long fallback_start = > > execmem_params.modules.data.fallback_start; > > + unsigned long fallback_end = execmem_params.modules.data.fallback_end; > > + bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW; > > While I know for sure that you read up on the discussion I had with Song > about data structures, it seems you completely failed to understand it. > > > + return execmem_alloc(size, start, end, align, pgprot, > > +fallback_start, fallback_end, kasan); > > Having _seven_ intermediate variables to fill _eight_ arguments of a > function instead of handing in @size and a proper struct pointer is > tasteless and disgusting at best. > > Six out of those seven parameters are from: > > execmem_params.module.data > > while the KASAN shadow part is retrieved from > > execmem_params.module.flags > > So what prevents you from having a uniform data structure, which is > extensible and describes _all_ types of allocations? > > Absolutely nothing. The flags part can either be in the type dependent > part or you make the type configs an array as I had suggested originally > and then execmem_alloc() becomes: > > void *execmem_alloc(type, size) > > and > > static inline void *execmem_data_alloc(size_t size) > { > return execmem_alloc(EXECMEM_TYPE_DATA, size); > } > > which gets the type independent parts from @execmem_param.
> > Just read through your own series and watch the evolution of > execmem_alloc(): > > static void *execmem_alloc(size_t size) > > static void *execmem_alloc(size_t size, unsigned long start, > unsigned long end, unsigned int align, > pgprot_t pgprot) > > static void *execmem_alloc(size_t len, unsigned long start, > unsigned long end, unsigned int align, > pgprot_t pgprot, > unsigned long fallback_start, > unsigned long fallback_end, > bool kasan) > > In a month from now this function will have _ten_ parameters and tons of > horrible wrappers which convert an already existing data structure into > individual function arguments. > > Seriously? > > If you want this function to be [ab]used outside of the exec_param > configuration space for whatever nonsensical reasons then this still > can be either: > > void *execmem_alloc(params, type, size) > > static inline void *execmem_data_alloc(size_t size) > { > return execmem_alloc(&execmem_param, EXECMEM_TYPE_DATA, size); > } > > or > > void *execmem_alloc(type_params, size); > > static inline void *execmem_data_alloc(size_t size) > { > return execmem_alloc(&execmem_param.data, size); > } > > which both allow you to provide alternative params, right? > > Coming back to my conversation with Song: > >"Bad programmers worry about the code. Good programmers worry about > data structures and their relationships." Thomas, you're confusing an internal interface with an external one; I made the same mistake reviewing Song's patchset...
Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
Mike! Sorry for being late on this ... On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote: > > +void *execmem_data_alloc(size_t size) > +{ > + unsigned long start = execmem_params.modules.data.start; > + unsigned long end = execmem_params.modules.data.end; > + pgprot_t pgprot = execmem_params.modules.data.pgprot; > + unsigned int align = execmem_params.modules.data.alignment; > + unsigned long fallback_start = > execmem_params.modules.data.fallback_start; > + unsigned long fallback_end = execmem_params.modules.data.fallback_end; > + bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW; While I know for sure that you read up on the discussion I had with Song about data structures, it seems you completely failed to understand it. > + return execmem_alloc(size, start, end, align, pgprot, > + fallback_start, fallback_end, kasan); Having _seven_ intermediate variables to fill _eight_ arguments of a function instead of handing in @size and a proper struct pointer is tasteless and disgusting at best. Six out of those seven parameters are from: execmem_params.module.data while the KASAN shadow part is retrieved from execmem_params.module.flags So what prevents you from having a uniform data structure, which is extensible and describes _all_ types of allocations? Absolutely nothing. The flags part can either be in the type dependent part or you make the type configs an array as I had suggested originally and then execmem_alloc() becomes: void *execmem_alloc(type, size) and static inline void *execmem_data_alloc(size_t size) { return execmem_alloc(EXECMEM_TYPE_DATA, size); } which gets the type independent parts from @execmem_param.
Just read through your own series and watch the evolution of execmem_alloc(): static void *execmem_alloc(size_t size) static void *execmem_alloc(size_t size, unsigned long start, unsigned long end, unsigned int align, pgprot_t pgprot) static void *execmem_alloc(size_t len, unsigned long start, unsigned long end, unsigned int align, pgprot_t pgprot, unsigned long fallback_start, unsigned long fallback_end, bool kasan) In a month from now this function will have _ten_ parameters and tons of horrible wrappers which convert an already existing data structure into individual function arguments. Seriously? If you want this function to be [ab]used outside of the exec_param configuration space for whatever nonsensical reasons then this still can be either: void *execmem_alloc(params, type, size) static inline void *execmem_data_alloc(size_t size) { return execmem_alloc(&execmem_param, EXECMEM_TYPE_DATA, size); } or void *execmem_alloc(type_params, size); static inline void *execmem_data_alloc(size_t size) { return execmem_alloc(&execmem_param.data, size); } which both allow you to provide alternative params, right? Coming back to my conversation with Song: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships." You might want to reread: https://lore.kernel.org/r/87lenuukj0.ffs@tglx and the subsequent thread. The fact that my suggestions had a 'mod_' namespace prefix does not make any of my points moot. Song did an extremely good job in abstracting things out, but you decided to ditch his ground work instead of building on it and keeping the good parts. That's beyond sad. Worse, you went the full 'not invented here' path with an outcome which is even worse than the original hackery from Song which started the above referenced thread. I don't know what caused you to decide that ad hoc hackery is better than proper data structure based design patterns. I actually don't want to know.
As much as my voice counts: NAK-ed-by: Thomas Gleixner Thanks, tglx
Re: [PATCH v2 07/23 replacement] mips: add pte_unmap() to balance pte_offset_map()
On Fri, Jun 16, 2023 at 9:54 PM Yu Zhao wrote: > > On Thu, Jun 15, 2023 at 04:02:43PM -0700, Hugh Dickins wrote: > > To keep balance in future, __update_tlb() remember to pte_unmap() after > > pte_offset_map(). This is an odd case, since the caller has already done > > pte_offset_map_lock(), then mips forgets the address and recalculates it; > > but my two naive attempts to clean that up did more harm than good. > > > > Tested-by: Nathan Chancellor > > Signed-off-by: Hugh Dickins > > FWIW: Tested-by: Yu Zhao > > There is another problem, likely caused by khugepaged, happened multiple > times. But I don't think it's related to your series, just FYI. > > Got mcheck at 81134ef0 > CPU: 3 PID: 36 Comm: khugepaged Not tainted > 6.4.0-rc6-00049-g62d8779610bb-dirty #1 ... > Kernel panic - not syncing: Caught Machine Check exception - caused by > multiple matching entries in the TLB. In case anyone plans to try to fix this - the problem goes back to at least 5.15 stable. My (educated) guess is that nobody complained about it because all the testing is done in QEMU, which does NOT detect conflicting TLBs. This means the verification of the fix would need to be on a real piece of h/w or an updated QEMU. In target/mips/tcg/sysemu/tlb_helper.c: static void r4k_fill_tlb(CPUMIPSState *env, int idx) { r4k_tlb_t *tlb; uint64_t mask = env->CP0_PageMask >> (TARGET_PAGE_BITS + 1); /* XXX: detect conflicting TLBs and raise a MCHECK exception when needed */ ...
Re: kvm/arm64: Spark benchmark
On Fri, Jun 9, 2023 at 7:04 AM Marc Zyngier wrote: > > On Fri, 09 Jun 2023 01:59:35 +0100, > Yu Zhao wrote: > > > > TLDR > > > > Apache Spark spent 12% less time sorting four billion random integers > > twenty times (in ~4 hours) after this patchset [1]. > > Why are the 3 architectures you have considered being evaluated with 3 > different benchmarks? I was hoping people having special interests in different archs might try to reproduce the benchmarks that I didn't report (but did cover) and see what happens. > I am not suspecting you to have cherry-picked > the best results I'm generally very conservative when reporting *synthetic* results. For example, the same memcached benchmark used on powerpc yielded >50% improvement on aarch64, because the default Ubuntu Kconfig uses 64KB base page size for powerpc but 4KB for aarch64. (Before the series, the reclaim (swap) path takes kvm->mmu_lock for *write* on O(nr of all pages to consider); after the series, it becomes O(actual nr of pages to swap), which is <10% given how the benchmark was set up.) Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency Before 639511.40 0.09940 0.04700 0.27100 22.52700 After 974184.60 0.06471 0.04700 0.15900 3.75900 > but I'd really like to see a variety of benchmarks > that exercise this stuff differently. I'd be happy to try other synthetic workloads that people think are relatively representative. Also, I've backported the series and started an A/B experiment involving ~1 million devices (real-world workloads). We should have the preliminary results by the time I post the next version.
Re: kvm/x86: multichase benchmark
On Thu, Jun 8, 2023 at 6:59 PM Yu Zhao wrote: > > TLDR > > Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after > this patchset [1]. > > Hardware > > HOST $ lscpu > Architecture:x86_64 > CPU op-mode(s):32-bit, 64-bit > Address sizes: 43 bits physical, 48 bits virtual > Byte Order:Little Endian > CPU(s): 128 > On-line CPU(s) list: 0-127 > Vendor ID: AuthenticAMD > Model name:AMD Ryzen Threadripper PRO 3995WX 64-Cores > CPU family: 23 > Model: 49 > Thread(s) per core: 2 > Core(s) per socket: 64 > Socket(s): 1 > Stepping:0 > Frequency boost: disabled > CPU max MHz: 4308.3979 > CPU min MHz: 2200. > BogoMIPS:5390.20 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge > mca cmov pat pse36 clflush mmx fxsr sse sse2 > ... > Virtualization features: > Virtualization:AMD-V > Caches (sum of all): > L1d: 2 MiB (64 instances) > L1i: 2 MiB (64 instances) > L2:32 MiB (64 instances) > L3:256 MiB (16 instances) > NUMA: > NUMA node(s): 1 > NUMA node0 CPU(s): 0-127 > Vulnerabilities: > Itlb multihit: Not affected > L1tf: Not affected > Mds: Not affected > Meltdown: Not affected > Mmio stale data: Not affected > Retbleed: Mitigation; untrained return thunk; SMT enabled with > STIBP protection > Spec store bypass: Mitigation; Speculative Store Bypass disabled via > prctl > Spectre v1:Mitigation; usercopy/swapgs barriers and __user > pointer sanitization > Spectre v2:Mitigation; Retpolines, IBPB conditional, STIBP > always-on, RSB filling, PBRSB-eIBRS Not affected > Srbds: Not affected > Tsx async abort: Not affected > > HOST $ numactl -H > available: 1 nodes (0) > node 0 cpus: 0-127 > node 0 size: 257542 MB > node 0 free: 224855 MB > node distances: > node 0 > 0: 10 > > HOST $ cat /sys/class/nvme/nvme0/model > INTEL SSDPF21Q800GB > > HOST $ cat /sys/class/nvme/nvme0/numa_node > 0 > > Software > > HOST $ cat /etc/lsb-release > DISTRIB_ID=Ubuntu > DISTRIB_RELEASE=22.04 > DISTRIB_CODENAME=jammy > DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS" > > HOST $ uname 
-a > Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun 7 22:17:47 UTC 2023 > x86_64 x86_64 x86_64 GNU/Linux > > HOST $ cat /proc/swaps > Filename Type Size Used Priority > /dev/nvme0n1p2 partition 4668383560 -2 > > HOST $ cat /sys/kernel/mm/lru_gen/enabled > 0x000f > > HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled > always madvise [never] > > HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag > always defer defer+madvise madvise [never] > > Procedure > = > HOST $ git clone https://github.com/google/multichase > > HOST $ > HOST $ > > HOST $ cp multichase/multichase ./initrd/bin/ > HOST $ sed -i \ > "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \ I was reminded that I missed one parameter above, i.e., "/^maybe_break top$/i multichase -N -t 2 -m 4g -n 28800; poweroff" \ ^^ > ./initrd/init > > HOST $ > > HOST $ cat run_microvms.sh > memcgs=64 > > run() { > path=/sys/fs/cgroup/memcg$1 > > mkdir $path > echo $BASHPID >$path/cgroup.procs And one line here: echo 4000m >$path/memory.min # or the largest size that doesn't cause OOM kills > qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \ > -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \ > -append "console=ttyS0 loglevel=0" > } > > for ((memcg = 0; memcg < $memcgs; memcg++)); do > run $memcg & > done > > wait > > Results > === > Before [1] After Change > -- > Total samples 6824 7237 +6% > > Notes > = > [1] "mm: rmap: Don't flush TLB after checking PTE young for page > reference" was included so that the comparison is apples to > apples. > https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/
Re: [PATCH 1/1] arch:hexagon/powerpc: use KSYM_NAME_LEN in array size
On Wed, May 31, 2023 at 1:14 AM Kees Cook wrote: > > On Mon, May 29, 2023 at 04:50:45PM +0200, Miguel Ojeda wrote: > > Kees: what is the current stance on `[static N]` parameters? Something like: > > > > const char *kallsyms_lookup(unsigned long addr, > > unsigned long *symbolsize, > > unsigned long *offset, > > - char **modname, char *namebuf); > > + char **modname, char namebuf[static > > KSYM_NAME_LEN]); > > > > makes the compiler complain about cases like these (even if trivial): > > > > arch/powerpc/xmon/xmon.c:1711:10: error: array argument is too small; > > contains 128 elements, callee requires at least 512 > > [-Werror,-Warray-bounds] > > name = kallsyms_lookup(pc, &size, &offset, NULL, tmpstr); > > ^ ~~ > > ./include/linux/kallsyms.h:86:29: note: callee declares array > > parameter as static here > > char **modname, char namebuf[static KSYM_NAME_LEN]); > > ^ ~~ > > Wouldn't that be a good thing? (I.e. complain about the size mismatch?) Yeah, I would say so (i.e. I meant it as a good thing). > > But I only see 2 files in the kernel using `[static N]` (from 2020 and > > 2021). Should something else be used instead (e.g. `__counted_by`), > > even if constexpr-sized? > > Yeah, it seems pretty uncommon. I'd say traditionally arrays aren't > passed too often, rather structs containing them. > > But ultimately, yeah, everything could gain __counted_by and friends in > the future. That would be nice! Cheers, Miguel
Re: [PATCH v2 00/16] Add support for DAX vmemmap optimization for ppc64
> On 16-Jun-2023, at 4:38 PM, Aneesh Kumar K.V > wrote: > > This patch series implements changes required to support DAX vmemmap > optimization for ppc64. The vmemmap optimization is only enabled with radix > MMU > translation and 1GB PUD mapping with 64K page size. The patch series also > split > hugetlb vmemmap optimization as a separate Kconfig variable so that > architectures can enable DAX vmemmap optimization without enabling hugetlb > vmemmap optimization. This should enable architectures like arm64 to enable > DAX > vmemmap optimization while they can't enable hugetlb vmemmap optimization. > More > details of the same are in patch "mm/vmemmap optimization: Split hugetlb and > devdax vmemmap optimization" > > Changes from V1: > * Fix make htmldocs warning > * Fix vmemmap allocation bugs with different alignment values. > * Correctly check for section validity before we free vmemmap area Thanks for the updated series. The previously reported WARN_ON is not seen with this series. I also ran a few regression tests against this series and did not see any problems. Based on the test results Tested-by: Sachin Sant <sach...@linux.ibm.com> -Sachin
Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote: > On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > > > module_alloc() is used everywhere as a means to allocate memory for code. > > > > Beside being semantically wrong, this unnecessarily ties all subsystems > > that need to allocate code, such as ftrace, kprobes and BPF to modules > > and puts the burden of code allocation to the modules code. > > > > Several architectures override module_alloc() because of various > > constraints where the executable memory can be located and this causes > > additional obstacles for improvements of code allocation. > > > > Start splitting code allocation from modules by introducing > > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs. > > > > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for > > module_alloc() and execmem_free() and jit_free() are replacements of > > module_memfree() to allow updating all call sites to use the new APIs. > > > > The intended semantics for the new allocation APIs: > > > > * execmem_text_alloc() should be used to allocate memory that must reside > > close to the kernel image, like loadable kernel modules and generated > > code that is restricted by relative addressing. > > > > * jit_text_alloc() should be used to allocate memory for generated code > > when there are no restrictions for the code placement. For > > architectures that require that any code is within certain distance > > from the kernel image, jit_text_alloc() will be essentially aliased to > > execmem_text_alloc(). > > > > Is there anything in this series to help users do the appropriate > synchronization when they actually populate the allocated memory with > code? See here, for example: This series only factors out the executable allocations from modules and puts them in a central place. Anything else would go on top after this lands.
> https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab...@app.fastmail.com/T/#u -- Sincerely yours, Mike.