[PATCH v2 0/3] x86/snp: Add kexec support
From: Ashish Kalra The patchset adds bits and pieces to get kexec (and crashkernel) work on SNP guest. v2: - address zeroing of unaccepted memory table mappings at all page table levels adding phys_pte_init(), phys_pud_init() and phys_p4d_init(). - include skip efi_arch_mem_reserve() in case of kexec as part of this patch set. - rename last_address_shd_kexec to a more appropriate kexec_last_address_to_make_private. - remove duplicate code shared with TDX and use common interfaces defined for SNP and TDX for kexec/kdump. - remove set_pte_enc() dependency on pg_level_to_pfn() and make the function simpler. - rename unshare_pte() to make_pte_private(). - clarify and make the comment for using kexec_last_address_to_make_private more understandable. - general cleanup. Ashish Kalra (3): efi/x86: skip efi_arch_mem_reserve() in case of kexec. x86/mm: Do not zap page table entries mapping unaccepted memory table during kdump. x86/snp: Convert shared memory back to private on kexec arch/x86/include/asm/probe_roms.h | 1 + arch/x86/include/asm/sev.h| 4 + arch/x86/kernel/probe_roms.c | 16 +++ arch/x86/kernel/sev.c | 169 ++ arch/x86/mm/init_64.c | 16 ++- arch/x86/mm/mem_encrypt_amd.c | 3 + arch/x86/platform/efi/quirks.c| 10 ++ 7 files changed, 215 insertions(+), 4 deletions(-) -- 2.34.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 1/3] efi/x86: skip efi_arch_mem_reserve() in case of kexec.
From: Ashish Kalra For kexec use case, need to use and stick to the EFI memmap passed from the first kernel via boot-params/setup data, hence, skip efi_arch_mem_reserve() during kexec. Additionally during SNP guest kexec testing discovered that EFI memmap is corrupted during chained kexec. kexec_enter_virtual_mode() during late init will remap the efi_memmap physical pages allocated in efi_arch_mem_reserve() via memboot & then subsequently cause random EFI memmap corruption once memblock is freed/teared-down. Signed-off-by: Ashish Kalra --- arch/x86/platform/efi/quirks.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c index f0cc00032751..d4562d074371 100644 --- a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@ -258,6 +258,16 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size) int num_entries; void *new; + /* +* For kexec use case, we need to use the EFI memmap passed from the first +* kernel via setup data, so we need to skip this. +* Additionally kexec_enter_virtual_mode() during late init will remap +* the efi_memmap physical pages allocated here via memboot & then +* subsequently cause random EFI memmap corruption once memblock is freed. +*/ + if (efi_setup) + return; + if (efi_mem_desc_lookup(addr, &md) || md.type != EFI_BOOT_SERVICES_DATA) { pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr); -- 2.34.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 2/3] x86/mm: Do not zap page table entries mapping unaccepted memory table during kdump.
From: Ashish Kalra During crashkernel boot only pre-allocated crash memory is presented as E820_TYPE_RAM. This can cause page table entries mapping unaccepted memory table to be zapped during phys_pte_init(), phys_pmd_init(), phys_pud_init() and phys_p4d_init() as SNP/TDX guest use E820_TYPE_ACPI to store the unaccepted memory table and pass it between the kernels on kexec/kdump. E820_TYPE_ACPI covers not only ACPI data, but also EFI tables and might be required by kernel to function properly. The problem was discovered during debugging kdump for SNP guest. The unaccepted memory table stored with E820_TYPE_ACPI and passed between the kernels on kdump was getting zapped as the PMD entry mapping this is above the E820_TYPE_RAM range for the reserved crashkernel memory. Signed-off-by: Ashish Kalra --- arch/x86/mm/init_64.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index a0dffaca6d2b..cc294a9e9fd7 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -469,7 +469,9 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, unsigned long paddr_end, !e820__mapped_any(paddr & PAGE_MASK, paddr_next, E820_TYPE_RAM) && !e820__mapped_any(paddr & PAGE_MASK, paddr_next, -E820_TYPE_RESERVED_KERN)) +E820_TYPE_RESERVED_KERN) && + !e820__mapped_any(paddr & PAGE_MASK, paddr_next, +E820_TYPE_ACPI)) set_pte_init(pte, __pte(0), init); continue; } @@ -524,7 +526,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end, !e820__mapped_any(paddr & PMD_MASK, paddr_next, E820_TYPE_RAM) && !e820__mapped_any(paddr & PMD_MASK, paddr_next, -E820_TYPE_RESERVED_KERN)) +E820_TYPE_RESERVED_KERN) && + !e820__mapped_any(paddr & PMD_MASK, paddr_next, +E820_TYPE_ACPI)) set_pmd_init(pmd, __pmd(0), init); continue; } @@ -611,7 +615,9 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end, !e820__mapped_any(paddr & PUD_MASK, paddr_next, E820_TYPE_RAM) && !e820__mapped_any(paddr & PUD_MASK, paddr_next, -E820_TYPE_RESERVED_KERN)) +E820_TYPE_RESERVED_KERN) && + !e820__mapped_any(paddr & PUD_MASK, paddr_next, +E820_TYPE_ACPI)) set_pud_init(pud, __pud(0), init); continue; } @@ -698,7 +704,9 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end, !e820__mapped_any(paddr & P4D_MASK, paddr_next, E820_TYPE_RAM) && !e820__mapped_any(paddr & P4D_MASK, paddr_next, -E820_TYPE_RESERVED_KERN)) +E820_TYPE_RESERVED_KERN) && + !e820__mapped_any(paddr & P4D_MASK, paddr_next, +E820_TYPE_ACPI)) set_p4d_init(p4d, __p4d(0), init); continue; } -- 2.34.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 3/3] x86/snp: Convert shared memory back to private on kexec
From: Ashish Kalra SNP guests allocate shared buffers to perform I/O. It is done by allocating pages normally from the buddy allocator and converting them to shared with set_memory_decrypted(). The second kernel has no idea what memory is converted this way. It only sees E820_TYPE_RAM. Accessing shared memory via private mapping will cause unrecoverable RMP page-faults. On kexec walk direct mapping and convert all shared memory back to private. It makes all RAM private again and second kernel may use it normally. Additionally for SNP guests convert all bss decrypted section pages back to private and switch back ROM regions to shared so that their revalidation does not fail during kexec kernel boot. The conversion occurs in two steps: stopping new conversions and unsharing all memory. In the case of normal kexec, the stopping of conversions takes place while scheduling is still functioning. This allows for waiting until any ongoing conversions are finished. The second step is carried out when all CPUs except one are inactive and interrupts are disabled. This prevents any conflicts with code that may access shared memory. Signed-off-by: Ashish Kalra --- arch/x86/include/asm/probe_roms.h | 1 + arch/x86/include/asm/sev.h| 4 + arch/x86/kernel/probe_roms.c | 16 +++ arch/x86/kernel/sev.c | 169 ++ arch/x86/mm/mem_encrypt_amd.c | 3 + 5 files changed, 193 insertions(+) diff --git a/arch/x86/include/asm/probe_roms.h b/arch/x86/include/asm/probe_roms.h index 1c7f3815bbd6..d50b67dbff33 100644 --- a/arch/x86/include/asm/probe_roms.h +++ b/arch/x86/include/asm/probe_roms.h @@ -6,4 +6,5 @@ struct pci_dev; extern void __iomem *pci_map_biosrom(struct pci_dev *pdev); extern void pci_unmap_biosrom(void __iomem *rom); extern size_t pci_biosrom_size(struct pci_dev *pdev); +extern void snp_kexec_unprep_rom_memory(void); #endif diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h index d7b27cb34c2b..867518b9bcad 100644 --- a/arch/x86/include/asm/sev.h +++ b/arch/x86/include/asm/sev.h @@ -229,6 +229,8 @@ void snp_accept_memory(phys_addr_t start, phys_addr_t end); u64 snp_get_unsupported_features(u64 status); u64 sev_get_status(void); void kdump_sev_callback(void); +void snp_kexec_unshare_mem(void); +void snp_kexec_stop_conversion(bool crash); #else static inline void sev_es_ist_enter(struct pt_regs *regs) { } static inline void sev_es_ist_exit(void) { } @@ -258,6 +260,8 @@ static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { } static inline u64 snp_get_unsupported_features(u64 status) { return 0; } static inline u64 sev_get_status(void) { return 0; } static inline void kdump_sev_callback(void) { } +void snp_kexec_unshare_mem(void) {} +static void snp_kexec_stop_conversion(bool crash) {} #endif #ifdef CONFIG_KVM_AMD_SEV diff --git a/arch/x86/kernel/probe_roms.c b/arch/x86/kernel/probe_roms.c index 319fef37d9dc..457f1e5c8d00 100644 --- a/arch/x86/kernel/probe_roms.c +++ b/arch/x86/kernel/probe_roms.c @@ -177,6 +177,22 @@ size_t pci_biosrom_size(struct pci_dev *pdev) } EXPORT_SYMBOL(pci_biosrom_size); +void snp_kexec_unprep_rom_memory(void) +{ + unsigned long vaddr, npages, sz; + + /* +* Switch back ROM regions to shared so that their validation +* does not fail during kexec kernel boot. +*/ + vaddr = (unsigned long)__va(video_rom_resource.start); + sz = (system_rom_resource.end + 1) - video_rom_resource.start; + npages = PAGE_ALIGN(sz) >> PAGE_SHIFT; + + snp_set_memory_shared(vaddr, npages); +} +EXPORT_SYMBOL(snp_kexec_unprep_rom_memory); + #define ROMSIGNATURE 0xaa55 static int __init romsignature(const unsigned char *rom) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 1ef7ae806a01..7443a9620a31 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -40,6 +40,7 @@ #include #include #include +#include #define DR7_RESET_VALUE0x400 @@ -71,6 +72,9 @@ static struct ghcb *boot_ghcb __section(".data"); /* Bitmap of SEV features supported by the hypervisor */ static u64 sev_hv_features __ro_after_init; +/* Last address to be switched to private during kexec */ +static unsigned long kexec_last_addr_to_make_private; + /* #VC handler runtime per-CPU data */ struct sev_es_runtime_data { struct ghcb ghcb_page; @@ -906,6 +910,171 @@ void snp_accept_memory(phys_addr_t start, phys_addr_t end) set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE); } +static bool set_pte_enc(pte_t *kpte, int level, void *va) +{ + pte_t new_pte; + + if (pte_none(*kpte)) + return false; + + /* +* Change the physical page attribute from C=0 to C=1. Flush the +* caches to ensure that data gets accessed with the correct C-bit. +*/ + if (pte_present(*kpte)) + clflush_cache_range(va, page_level_size(level)); + + new_pt
Re: [PATCH] makedumpfile: ppc64: get vmalloc start address from vmcoreinfo
Hi, The commit removing 'vmap_area_list' is now merged in Linux mainline tree. commit: 55c49fee57af99f3c663e69dedc5b85e691bbe50 mm/vmalloc: remove vmap_area_list Any comments on this patch ? Thanks, Aditya Gupta On 24/02/24 00:33, Aditya Gupta wrote: Below error was noticed when running makedumpfile on linux-next kernel crash (linux-next tag next-20240121): ... Checking for memory holes : [100.0 %] | readpage_elf: Attempt to read non-existent page at 0xc. [ 17.551718] kdump.sh[404]: readmem: type_addr: 0, addr:c00c, size:16384 [ 17.551793] kdump.sh[404]: __exclude_unnecessary_pages: Can't read the buffer of struct page. [ 17.551864] kdump.sh[404]: create_2nd_bitmap: Can't exclude unnecessary pages. [ 17.562632] kdump.sh[404]: The kernel version is not supported. [ 17.562708] kdump.sh[404]: The makedumpfile operation may be incomplete. [ 17.562773] kdump.sh[404]: makedumpfile Failed. [ 17.564335] kdump[406]: saving vmcore failed, _exitcode:1 Above error was due to 'vmap_area_list' and 'vmlist' symbols missing from the vmcore. 'vmap_area_list' was removed in the linux kernel with below commit: commit 378eb24a0658dd922b29524e0ce35c6c43f56cba mm/vmalloc: remove vmap_area_list Subsequently the commit also introduced 'VMALLOC_START' in vmcoreinfo to get base address of vmalloc area, instead of depending on 'vmap_area_list' Hence if 'VMALLOC_START' symbol is there in vmcoreinfo: 1. Set vmalloc_start based on 'VMALLOC_START' 2. Don't error if vmap_area_list/vmlist are not defined Reported-by: Sachin Sant Signed-off-by: Aditya Gupta --- arch/ppc64.c | 19 +-- makedumpfile.c | 3 ++- makedumpfile.h | 6 +++--- 3 files changed, 18 insertions(+), 10 deletions(-) diff --git a/arch/ppc64.c b/arch/ppc64.c index 96c357cb0335..bb62e2cd199a 100644 --- a/arch/ppc64.c +++ b/arch/ppc64.c @@ -568,7 +568,9 @@ get_machdep_info_ppc64(void) /* * Get vmalloc_start value from either vmap_area_list or vmlist. */ - if ((SYMBOL(vmap_area_list) != NOT_FOUND_SYMBOL) + if (NUMBER(vmalloc_start) != NOT_FOUND_SYMBOL) { + vmalloc_start = NUMBER(vmalloc_start); + } else if ((SYMBOL(vmap_area_list) != NOT_FOUND_SYMBOL) && (OFFSET(vmap_area.va_start) != NOT_FOUND_STRUCTURE) && (OFFSET(vmap_area.list) != NOT_FOUND_STRUCTURE)) { if (!readmem(VADDR, SYMBOL(vmap_area_list) + OFFSET(list_head.next), @@ -684,11 +686,16 @@ vaddr_to_paddr_ppc64(unsigned long vaddr) if ((SYMBOL(vmap_area_list) == NOT_FOUND_SYMBOL) || (OFFSET(vmap_area.va_start) == NOT_FOUND_STRUCTURE) || (OFFSET(vmap_area.list) == NOT_FOUND_STRUCTURE)) { - if ((SYMBOL(vmlist) == NOT_FOUND_SYMBOL) - || (OFFSET(vm_struct.addr) == NOT_FOUND_STRUCTURE)) { - ERRMSG("Can't get info for vmalloc translation.\n"); - return NOT_PADDR; - } + /* +* Don't depend on vmap_area_list/vmlist if vmalloc_start is set in +* vmcoreinfo, in that case proceed without error +*/ + if (NUMBER(vmalloc_start) == NOT_FOUND_NUMBER) + if ((SYMBOL(vmlist) == NOT_FOUND_SYMBOL) + || (OFFSET(vm_struct.addr) == NOT_FOUND_STRUCTURE)) { + ERRMSG("Can't get info for vmalloc translation.\n"); + return NOT_PADDR; + } } return ppc64_vtop_level4(vaddr); diff --git a/makedumpfile.c b/makedumpfile.c index b004b93fecb7..b6c63fad15f3 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -2978,6 +2978,8 @@ read_vmcoreinfo(void) READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", PAGE_OFFLINE_MAPCOUNT_VALUE); READ_NUMBER("phys_base", phys_base); READ_NUMBER("KERNEL_IMAGE_SIZE", KERNEL_IMAGE_SIZE); + + READ_NUMBER_UNSIGNED("VMALLOC_START", vmalloc_start); #ifdef __aarch64__ READ_NUMBER("VA_BITS", VA_BITS); READ_NUMBER("TCR_EL1_T1SZ", TCR_EL1_T1SZ); @@ -2989,7 +2991,6 @@ read_vmcoreinfo(void) READ_NUMBER("VA_BITS", va_bits); READ_NUMBER_UNSIGNED("phys_ram_base", phys_ram_base); READ_NUMBER_UNSIGNED("PAGE_OFFSET", page_offset); - READ_NUMBER_UNSIGNED("VMALLOC_START", vmalloc_start); READ_NUMBER_UNSIGNED("VMALLOC_END", vmalloc_end); READ_NUMBER_UNSIGNED("VMEMMAP_START", vmemmap_start); READ_NUMBER_UNSIGNED("VMEMMAP_END", vmemmap_end); diff --git a/makedumpfile.h b/makedumpfile.h index 59c83e1d9df3..4021c5af2a34 100644 --- a/makedumpfile.h +++ b/makedumpfile.h @@ -541,8 +541,6 @@ do { \ * The value of dependence on machine */ #define PAGE_OFFSET (info->page_offset) -#define VMALLOC_START (info->vmalloc_start) -#define VMALLOC_END
Re: [PATCH] makedumpfile: ppc64: get vmalloc start address from vmcoreinfo
On 2024/03/18 17:26, Aditya Gupta wrote: > Hi, > The commit removing 'vmap_area_list' is now merged in Linux mainline tree. > commit: 55c49fee57af99f3c663e69dedc5b85e691bbe50 > mm/vmalloc: remove vmap_area_list Applied with this commit id and the fix. https://github.com/makedumpfile/makedumpfile/commit/94241fd2feed059227a243618f2acc6aabf366e8 Thanks, Kazu > > Any comments on this patch ? > > Thanks, > > Aditya Gupta > > On 24/02/24 00:33, Aditya Gupta wrote: >> Below error was noticed when running makedumpfile on linux-next kernel >> crash (linux-next tag next-20240121): >> >> ... >> Checking for memory holes : [100.0 %] | readpage_elf: Attempt to >> read non-existent page at 0xc. >> [ 17.551718] kdump.sh[404]: readmem: type_addr: 0, >> addr:c00c, size:16384 >> [ 17.551793] kdump.sh[404]: __exclude_unnecessary_pages: Can't >> read the buffer of struct page. >> [ 17.551864] kdump.sh[404]: create_2nd_bitmap: Can't exclude >> unnecessary pages. >> [ 17.562632] kdump.sh[404]: The kernel version is not supported. >> [ 17.562708] kdump.sh[404]: The makedumpfile operation may be >> incomplete. >> [ 17.562773] kdump.sh[404]: makedumpfile Failed. >> [ 17.564335] kdump[406]: saving vmcore failed, _exitcode:1 >> >> Above error was due to 'vmap_area_list' and 'vmlist' symbols missing >> from the vmcore. >> >> 'vmap_area_list' was removed in the linux kernel with below commit: >> >> commit 378eb24a0658dd922b29524e0ce35c6c43f56cba >> mm/vmalloc: remove vmap_area_list >> >> Subsequently the commit also introduced 'VMALLOC_START' in vmcoreinfo to >> get base address of vmalloc area, instead of depending on >> 'vmap_area_list' >> >> Hence if 'VMALLOC_START' symbol is there in vmcoreinfo: >> 1. Set vmalloc_start based on 'VMALLOC_START' >> 2. Don't error if vmap_area_list/vmlist are not defined >> >> Reported-by: Sachin Sant >> Signed-off-by: Aditya Gupta >> --- >> arch/ppc64.c | 19 +-- >> makedumpfile.c | 3 ++- >> makedumpfile.h | 6 +++--- >> 3 files changed, 18 insertions(+), 10 deletions(-) >> >> diff --git a/arch/ppc64.c b/arch/ppc64.c >> index 96c357cb0335..bb62e2cd199a 100644 >> --- a/arch/ppc64.c >> +++ b/arch/ppc64.c >> @@ -568,7 +568,9 @@ get_machdep_info_ppc64(void) >> /* >> * Get vmalloc_start value from either vmap_area_list or vmlist. >> */ >> - if ((SYMBOL(vmap_area_list) != NOT_FOUND_SYMBOL) >> + if (NUMBER(vmalloc_start) != NOT_FOUND_SYMBOL) { >> + vmalloc_start = NUMBER(vmalloc_start); >> + } else if ((SYMBOL(vmap_area_list) != NOT_FOUND_SYMBOL) >> && (OFFSET(vmap_area.va_start) != NOT_FOUND_STRUCTURE) >> && (OFFSET(vmap_area.list) != NOT_FOUND_STRUCTURE)) { >> if (!readmem(VADDR, SYMBOL(vmap_area_list) + >> OFFSET(list_head.next), >> @@ -684,11 +686,16 @@ vaddr_to_paddr_ppc64(unsigned long vaddr) >> if ((SYMBOL(vmap_area_list) == NOT_FOUND_SYMBOL) >> || (OFFSET(vmap_area.va_start) == NOT_FOUND_STRUCTURE) >> || (OFFSET(vmap_area.list) == NOT_FOUND_STRUCTURE)) { >> - if ((SYMBOL(vmlist) == NOT_FOUND_SYMBOL) >> - || (OFFSET(vm_struct.addr) == NOT_FOUND_STRUCTURE)) { >> - ERRMSG("Can't get info for vmalloc translation.\n"); >> - return NOT_PADDR; >> - } >> + /* >> + * Don't depend on vmap_area_list/vmlist if vmalloc_start is >> set in >> + * vmcoreinfo, in that case proceed without error >> + */ >> + if (NUMBER(vmalloc_start) == NOT_FOUND_NUMBER) >> + if ((SYMBOL(vmlist) == NOT_FOUND_SYMBOL) >> + || (OFFSET(vm_struct.addr) == NOT_FOUND_STRUCTURE)) { >> + ERRMSG("Can't get info for vmalloc translation.\n"); >> + return NOT_PADDR; >> + } >> } >> return ppc64_vtop_level4(vaddr); >> diff --git a/makedumpfile.c b/makedumpfile.c >> index b004b93fecb7..b6c63fad15f3 100644 >> --- a/makedumpfile.c >> +++ b/makedumpfile.c >> @@ -2978,6 +2978,8 @@ read_vmcoreinfo(void) >> READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", >> PAGE_OFFLINE_MAPCOUNT_VALUE); >> READ_NUMBER("phys_base", phys_base); >> READ_NUMBER("KERNEL_IMAGE_SIZE", KERNEL_IMAGE_SIZE); >> + >> + READ_NUMBER_UNSIGNED("VMALLOC_START", vmalloc_start); >> #ifdef __aarch64__ >> READ_NUMBER("VA_BITS", VA_BITS); >> READ_NUMBER("TCR_EL1_T1SZ", TCR_EL1_T1SZ); >> @@ -2989,7 +2991,6 @@ read_vmcoreinfo(void) >> READ_NUMBER("VA_BITS", va_bits); >> READ_NUMBER_UNSIGNED("phys_ram_base", phys_ram_base); >> READ_NUMBER_UNSIGNED("PAGE_OFFSET", page_offset); >> - READ_NUMBER_UNSIGNED("VMALLOC_START", vmalloc_start); >> READ_NUMBER_UNSIGNED("VMALLOC_END", vmalloc_end); >> READ_NUMBER_UNSIGNED("VMEMMAP_START", vmemmap_start); >> READ_NUMBER_UNSIGNED("VMEMMAP_END", vmemmap_end); >> diff --git a/makedumpfile.h b/maked
Re: [PATCH] makedumpfile: ppc64: get vmalloc start address from vmcoreinfo
On 18/03/24 14:18, HAGIO KAZUHITO(萩尾 一仁) wrote: On 2024/03/18 17:26, Aditya Gupta wrote: Hi, The commit removing 'vmap_area_list' is now merged in Linux mainline tree. commit: 55c49fee57af99f3c663e69dedc5b85e691bbe50 mm/vmalloc: remove vmap_area_list Applied with this commit id and the fix. https://github.com/makedumpfile/makedumpfile/commit/94241fd2feed059227a243618f2acc6aabf366e8 Thanks Kazu. - Aditya Gupta Thanks, Kazu Any comments on this patch ? Thanks, Aditya Gupta On 24/02/24 00:33, Aditya Gupta wrote: Below error was noticed when running makedumpfile on linux-next kernel crash (linux-next tag next-20240121): ... Checking for memory holes : [100.0 %] | readpage_elf: Attempt to read non-existent page at 0xc. [ 17.551718] kdump.sh[404]: readmem: type_addr: 0, addr:c00c, size:16384 [ 17.551793] kdump.sh[404]: __exclude_unnecessary_pages: Can't read the buffer of struct page. [ 17.551864] kdump.sh[404]: create_2nd_bitmap: Can't exclude unnecessary pages. [ 17.562632] kdump.sh[404]: The kernel version is not supported. [ 17.562708] kdump.sh[404]: The makedumpfile operation may be incomplete. [ 17.562773] kdump.sh[404]: makedumpfile Failed. [ 17.564335] kdump[406]: saving vmcore failed, _exitcode:1 Above error was due to 'vmap_area_list' and 'vmlist' symbols missing from the vmcore. 'vmap_area_list' was removed in the linux kernel with below commit: commit 378eb24a0658dd922b29524e0ce35c6c43f56cba mm/vmalloc: remove vmap_area_list Subsequently the commit also introduced 'VMALLOC_START' in vmcoreinfo to get base address of vmalloc area, instead of depending on 'vmap_area_list' Hence if 'VMALLOC_START' symbol is there in vmcoreinfo: 1. Set vmalloc_start based on 'VMALLOC_START' 2. Don't error if vmap_area_list/vmlist are not defined Reported-by: Sachin Sant Signed-off-by: Aditya Gupta --- arch/ppc64.c | 19 +-- makedumpfile.c | 3 ++- makedumpfile.h | 6 +++--- 3 files changed, 18 insertions(+), 10 deletions(-) diff --git a/arch/ppc64.c b/arch/ppc64.c index 96c357cb0335..bb62e2cd199a 100644 --- a/arch/ppc64.c +++ b/arch/ppc64.c @@ -568,7 +568,9 @@ get_machdep_info_ppc64(void) /* * Get vmalloc_start value from either vmap_area_list or vmlist. */ - if ((SYMBOL(vmap_area_list) != NOT_FOUND_SYMBOL) + if (NUMBER(vmalloc_start) != NOT_FOUND_SYMBOL) { + vmalloc_start = NUMBER(vmalloc_start); + } else if ((SYMBOL(vmap_area_list) != NOT_FOUND_SYMBOL) && (OFFSET(vmap_area.va_start) != NOT_FOUND_STRUCTURE) && (OFFSET(vmap_area.list) != NOT_FOUND_STRUCTURE)) { if (!readmem(VADDR, SYMBOL(vmap_area_list) + OFFSET(list_head.next), @@ -684,11 +686,16 @@ vaddr_to_paddr_ppc64(unsigned long vaddr) if ((SYMBOL(vmap_area_list) == NOT_FOUND_SYMBOL) || (OFFSET(vmap_area.va_start) == NOT_FOUND_STRUCTURE) || (OFFSET(vmap_area.list) == NOT_FOUND_STRUCTURE)) { - if ((SYMBOL(vmlist) == NOT_FOUND_SYMBOL) - || (OFFSET(vm_struct.addr) == NOT_FOUND_STRUCTURE)) { - ERRMSG("Can't get info for vmalloc translation.\n"); - return NOT_PADDR; - } + /* + * Don't depend on vmap_area_list/vmlist if vmalloc_start is set in + * vmcoreinfo, in that case proceed without error + */ + if (NUMBER(vmalloc_start) == NOT_FOUND_NUMBER) + if ((SYMBOL(vmlist) == NOT_FOUND_SYMBOL) + || (OFFSET(vm_struct.addr) == NOT_FOUND_STRUCTURE)) { + ERRMSG("Can't get info for vmalloc translation.\n"); + return NOT_PADDR; + } } return ppc64_vtop_level4(vaddr); diff --git a/makedumpfile.c b/makedumpfile.c index b004b93fecb7..b6c63fad15f3 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -2978,6 +2978,8 @@ read_vmcoreinfo(void) READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", PAGE_OFFLINE_MAPCOUNT_VALUE); READ_NUMBER("phys_base", phys_base); READ_NUMBER("KERNEL_IMAGE_SIZE", KERNEL_IMAGE_SIZE); + + READ_NUMBER_UNSIGNED("VMALLOC_START", vmalloc_start); #ifdef __aarch64__ READ_NUMBER("VA_BITS", VA_BITS); READ_NUMBER("TCR_EL1_T1SZ", TCR_EL1_T1SZ); @@ -2989,7 +2991,6 @@ read_vmcoreinfo(void) READ_NUMBER("VA_BITS", va_bits); READ_NUMBER_UNSIGNED("phys_ram_base", phys_ram_base); READ_NUMBER_UNSIGNED("PAGE_OFFSET", page_offset); - READ_NUMBER_UNSIGNED("VMALLOC_START", vmalloc_start); READ_NUMBER_UNSIGNED("VMALLOC_END", vmalloc_end); READ_NUMBER_UNSIGNED("VMEMMAP_START", vmemmap_start); READ_NUMBER_UNSIGNED("VMEMMAP_END", vmemmap_end); diff --git a/makedumpfile.h b/makedumpfile.h index 59c83e1d9df3..4021c5af2a34 100644 --- a/makedumpfile.h +++ b/makedumpfile.h @@ -541,8 +541,6 @@ do { \ * The value of dependence on machine */ #define PAGE_OFFSET (info->p
Question about Address Range Validation in Crash Kernel Allocation
Dear kexec Community Members, I encountered an issue while using kexec-tools on my x86_64 machine. When there is a segment marked as 'reserved' within the memory range allocated for the crash kernel in /proc/iomem,the output appears as follows: 2d4fd058-60efefff : System RAM 2d4fd058-58ff : System RAM 4900-58ff : Crash kernel 53cbd000-53cc : Reserved The crash_memory_range array will encounter incorrect address ranges: CRASH MEMORY RANGES 2d4fd058-48ff (0) 53cbd000-48ff (1) 5900-53cc (0) Read the code, I noticed that the get_crash_memory_ranges() function invokes exclude_region() to handle the splitting of memory regions, but it seems unable to properly handle the scenario described above. The code logic is as follows: ... if (start < mend && end > mstart) { if (start != mstart && end != mend) { /* Split memory region */ crash_memory_range[i].end = start - 1; temp_region.start = end + 1; temp_region.end = mend; temp_region.type = RANGE_RAM; tidx = i+1; } else if (start != mstart) crash_memory_range[i].end = start - 1; else crash_memory_range[i].start = end + 1; } ... If start < mstart < mend < end, resulting in crash_memory_range[i].end becoming less than crash_memory_range[i].start, leading to incorrect address ranges. I would like to know if this behavior is reasonable and whether it is necessary to validate the address ranges for compliance at the end. Thank you for your time and assistance. Chen Haixiang ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH linux-next] bpf: fix warning for crash_kexec
On 3/17/24 11:52 PM, Hari Bathini wrote: Just checking on whether this will go via bpf or mm tree? Sending to bpf-next should be okay. Could you resubmit the patch as CONFIG_CRASH_DUMP probably not available to bpf-next when you initially submitted the patch. On 09/02/24 6:05 pm, Hari Bathini wrote: With [1], CONFIG_KEXEC & !CONFIG_CRASH_DUMP is supported but that led to the below warning: "WARN: resolve_btfids: unresolved symbol crash_kexec" Fix it by using the appropriate #ifdef. [1] https://lore.kernel.org/all/20240124051254.67105-1-...@redhat.com/ Signed-off-by: Hari Bathini --- kernel/bpf/helpers.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 4db1c658254c..e408d1115e26 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2545,7 +2545,7 @@ __bpf_kfunc void bpf_throw(u64 cookie) __bpf_kfunc_end_defs(); BTF_KFUNCS_START(generic_btf_ids) -#ifdef CONFIG_KEXEC_CORE +#ifdef CONFIG_CRASH_DUMP BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE) #endif BTF_ID_FLAGS(func, bpf_obj_new_impl, KF_ACQUIRE | KF_RET_NULL) ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: Question about Address Range Validation in Crash Kernel Allocation
Hi, On 03/18/24 at 12:00pm, chenhaixiang (A) wrote: > Dear kexec Community Members, > > I encountered an issue while using kexec-tools on my x86_64 machine. > When there is a segment marked as 'reserved' within the memory range > allocated for the crash kernel in /proc/iomem,the output appears as follows: > 2d4fd058-60efefff : System RAM > 2d4fd058-58ff : System RAM > 4900-58ff : Crash kernel > 53cbd000-53cc : Reserved What kernel are you using? the version of kernel, and kexec-tools? If you are testing on the latest mainline kernel, you could meet the issue Dave have met and fixed in below patch: [PATCH] x86/kexec: do not update E820 kexec table for setup_data https://lore.kernel.org/all/zez2kos-oozns...@darkstar.users.ipa.redhat.com/T/#u Thanks Baoquan > > The crash_memory_range array will encounter incorrect address ranges: > CRASH MEMORY RANGES > 2d4fd058-48ff (0) > 53cbd000-48ff (1) > 5900-53cc (0) > > Read the code, I noticed that the get_crash_memory_ranges() function invokes > exclude_region() to handle the splitting of memory regions, but it seems > unable to properly handle the scenario described above. > The code logic is as follows: > ... > if (start < mend && end > mstart) { > if (start != mstart && end != mend) { > /* Split memory region */ > crash_memory_range[i].end = start - 1; > temp_region.start = end + 1; > temp_region.end = mend; > temp_region.type = RANGE_RAM; > tidx = i+1; > } else if (start != mstart) > crash_memory_range[i].end = start - 1; > else > crash_memory_range[i].start = end + 1; > } > ... > If start < mstart < mend < end, resulting in crash_memory_range[i].end > becoming less than crash_memory_range[i].start, leading to incorrect address > ranges. > I would like to know if this behavior is reasonable and whether it is > necessary to validate the address ranges for compliance at the end. > > Thank you for your time and assistance. > > Chen Haixiang > > ___ > kexec mailing list > kexec@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 1/3] efi/x86: skip efi_arch_mem_reserve() in case of kexec.
Hi, Added Ard in cc. On 03/18/24 at 07:02am, Ashish Kalra wrote: > From: Ashish Kalra > > For kexec use case, need to use and stick to the EFI memmap passed > from the first kernel via boot-params/setup data, hence, > skip efi_arch_mem_reserve() during kexec. > > Additionally during SNP guest kexec testing discovered that EFI memmap > is corrupted during chained kexec. kexec_enter_virtual_mode() during > late init will remap the efi_memmap physical pages allocated in > efi_arch_mem_reserve() via memboot & then subsequently cause random > EFI memmap corruption once memblock is freed/teared-down. > > Signed-off-by: Ashish Kalra > --- > arch/x86/platform/efi/quirks.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c > index f0cc00032751..d4562d074371 100644 > --- a/arch/x86/platform/efi/quirks.c > +++ b/arch/x86/platform/efi/quirks.c > @@ -258,6 +258,16 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 > size) > int num_entries; > void *new; > > + /* > + * For kexec use case, we need to use the EFI memmap passed from the > first > + * kernel via setup data, so we need to skip this. > + * Additionally kexec_enter_virtual_mode() during late init will remap > + * the efi_memmap physical pages allocated here via memboot & then > + * subsequently cause random EFI memmap corruption once memblock is > freed. Can you elaborate a bit about the corruption, is it reproducible without SNP? > + */ > + if (efi_setup) > + return; > + How about checking the md attribute instead of checking the efi_setup, personally I feel it a bit better, something like below: diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c index f0cc00032751..699332b075bb 100644 --- a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@ -255,15 +255,24 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size) struct efi_memory_map_data data = { 0 }; struct efi_mem_range mr; efi_memory_desc_t md; - int num_entries; + int num_entries, ret; void *new; - if (efi_mem_desc_lookup(addr, &md) || - md.type != EFI_BOOT_SERVICES_DATA) { + ret = efi_mem_desc_lookup(addr, &md); + if (ret) { pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr); return; } + if (md.type != EFI_BOOT_SERVICES_DATA) { + pr_err("Skil reserving non EFI Boot Service Data memory for %pa\n", &addr); + return; + } + + /* Kexec copied the efi memmap from the 1st kernel, thus skip the case. */ + if (md.attribute & EFI_MEMORY_RUNTIME) + return; + if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) { pr_err("Region spans EFI memory descriptors, %pa\n", &addr); return; ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/kexec: do not update E820 kexec table for setup_data
On Tue, 5 Mar 2024 at 09:32, Dave Young wrote: > > crashkernel reservation failed on a Thinkpad t440s laptop recently, > Actually the memblock reservation succeeded, but later insert_resource() > failed. > > Test step: > kexec load -> > kexec reboot -> > check the crashkernel memory > dmesg|grep "crashkernel reserved"; saw reserved suceeeded: > 0xd000 - 0xda00 > grep Crash /proc/iomem: got nothing > > The background story is like below: > Currently E820 code reserves setup_data regions for both the current kernel > and the kexec kernel, and it will also insert them into resources list. > Before the kexec kernel reboot nobody passes the old setup_data, kexec only > passes SETUP_EFI and SETUP_IMA if needed. Thus the old setup data memory > are not used at all. But due to old kernel updated the kexec e820 table > as well so kexec kernel see them as E820_TYPE_RESERVED_KERN regions, later > the old setup_data regions will be inserted into resources list in kexec > kernel by e820__reserve_resources(). > > Note, due to no setup_data passed in for those old regions they are not > early reserved (by function early_reserve_memory), crashkernel memblock > reservation will just regard them as usable memory and it could reserve > reserve crashkernel region overlaps with the old setup_data regions. > > Just like the bug I noticed here, kdump insert_resource failed because > e820__reserve_resources added the overlapped chunks in /proc/iomem already. > > Finally, looking at the code, the old setup_data regions are not used > at all as no setup_data passed in by the kexec boot loader. Although > something like SETUP_PCI etc could be needed, kexec should pass > the info as setup_data so that kexec kernel can take care of them. > This should be taken care of in other separate patches if needed. > > Thus drop the useless buggy code here. > > Signed-off-by: Dave Young > --- > arch/x86/kernel/e820.c | 16 +--- > 1 file changed, 1 insertion(+), 15 deletions(-) > > Index: linux/arch/x86/kernel/e820.c > === > --- linux.orig/arch/x86/kernel/e820.c > +++ linux/arch/x86/kernel/e820.c > @@ -1015,16 +1015,6 @@ void __init e820__reserve_setup_data(voi > pa_next = data->next; > > e820__range_update(pa_data, sizeof(*data)+data->len, > E820_TYPE_RAM, E820_TYPE_RESERVED_KERN); > - > - /* > -* SETUP_EFI and SETUP_IMA are supplied by kexec and do not > need > -* to be reserved. > -*/ > - if (data->type != SETUP_EFI && data->type != SETUP_IMA) > - e820__range_update_kexec(pa_data, > -sizeof(*data) + data->len, > -E820_TYPE_RAM, > E820_TYPE_RESERVED_KERN); > - > if (data->type == SETUP_INDIRECT) { > len += data->len; > early_memunmap(data, sizeof(*data)); > @@ -1036,12 +1026,9 @@ void __init e820__reserve_setup_data(voi > > indirect = (struct setup_indirect *)data->data; > > - if (indirect->type != SETUP_INDIRECT) { > + if (indirect->type != SETUP_INDIRECT) > e820__range_update(indirect->addr, > indirect->len, >E820_TYPE_RAM, > E820_TYPE_RESERVED_KERN); > - e820__range_update_kexec(indirect->addr, > indirect->len, > -E820_TYPE_RAM, > E820_TYPE_RESERVED_KERN); > - } > } > > pa_data = pa_next; > @@ -1049,7 +1036,6 @@ void __init e820__reserve_setup_data(voi > } > > e820__update_table(e820_table); > - e820__update_table(e820_table_kexec); > > pr_info("extended physical RAM map:\n"); > e820__print_table("reserve setup_data"); > Kindly ping for review. Thanks! Dave ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec