Re: [PATCH v2 0/5] Introduce new wrappers to copy user-arrays
On September 11, 2023 6:55:32 PM PDT, Dave Airlie wrote:
>On Tue, 12 Sept 2023 at 11:27, Kees Cook wrote:
>>
>> On September 8, 2023 12:59:39 PM PDT, Philipp Stanner wrote:
>> >Hi!
>> >
>> >David Airlie suggested that we could implement new wrappers around
>> >(v)memdup_user() for duplicating user arrays.
>> >
>> >This small patch series first implements the two new wrapper functions
>> >memdup_array_user() and vmemdup_array_user(). They calculate the
>> >array-sizes safely, i.e., they return an error in case of an overflow.
>> >
>> >It then implements the new wrappers in two components in kernel/ and two
>> >in the drm-subsystem.
>> >
>> >In total, there are 18 files in the kernel that use (v)memdup_user() to
>> >duplicate arrays. My plan is to provide patches for the other 14
>> >successively once this series has been merged.
>> >
>> >
>> >Changes since v1:
>> >- Insert new headers alphabetically ordered
>> >- Remove empty lines in functions' docstrings
>> >- Return -EOVERFLOW instead of -EINVAL from wrapper functions
>> >
>> >
>> >@Andy:
>> >I test-built it for UM on my x86_64. Builds successfully.
>> >A kernel build (localmodconfig) for my Fedora38 @ x86_64 does also boot
>> >fine.
>> >
>> >If there is more I can do to verify the early boot stages are fine,
>> >please let me know!
>> >
>> >P.
>> >
>> >Philipp Stanner (5):
>> >  string.h: add array-wrappers for (v)memdup_user()
>> >  kernel: kexec: copy user-array safely
>> >  kernel: watch_queue: copy user-array safely
>> >  drm_lease.c: copy user-array safely
>> >  drm: vmgfx_surface.c: copy user-array safely
>> >
>> > drivers/gpu/drm/drm_lease.c             |  4 +--
>> > drivers/gpu/drm/vmwgfx/vmwgfx_surface.c |  4 +--
>> > include/linux/string.h                  | 40 +
>> > kernel/kexec.c                          |  2 +-
>> > kernel/watch_queue.c                    |  2 +-
>> > 5 files changed, 46 insertions(+), 6 deletions(-)
>> >
>>
>> Nice. For the series:
>>
>> Reviewed-by: Kees Cook
>
>Hey Kees,
>
>what tree do you think it would be best to land this through? I'm happy
>to send the initial set from a drm branch, but also happy to have it
>land via someone with a better process.

Feel free to take it via drm. Usually string.h doesn't get a lot of
changes (and even then it's normally additive) so conflicts are
rare/easy. :)

-Kees

--
Kees Cook

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 0/5] Introduce new wrappers to copy user-arrays
On Tue, 12 Sept 2023 at 11:27, Kees Cook wrote:
>
> On September 8, 2023 12:59:39 PM PDT, Philipp Stanner wrote:
> >Hi!
> >
> >David Airlie suggested that we could implement new wrappers around
> >(v)memdup_user() for duplicating user arrays.
> >[...]
>
> Nice. For the series:
>
> Reviewed-by: Kees Cook

Hey Kees,

what tree do you think it would be best to land this through? I'm happy
to send the initial set from a drm branch, but also happy to have it
land via someone with a better process.

Dave.
Re: [PATCH v2 0/5] Introduce new wrappers to copy user-arrays
On Fri, 2023-09-08 at 21:59 +0200, Philipp Stanner wrote:
> Hi!
>
> David Airlie suggested that we could implement new wrappers around
> (v)memdup_user() for duplicating user arrays.
> [...]

Series, and in particular the vmwgfx changes, look good to me.

Reviewed-by: Zack Rusin
Re: [PATCH v2 0/5] Introduce new wrappers to copy user-arrays
On September 8, 2023 12:59:39 PM PDT, Philipp Stanner wrote:
>Hi!
>
>David Airlie suggested that we could implement new wrappers around
>(v)memdup_user() for duplicating user arrays.
>
>This small patch series first implements the two new wrapper functions
>memdup_array_user() and vmemdup_array_user(). They calculate the
>array-sizes safely, i.e., they return an error in case of an overflow.
>[...]

Nice. For the series:

Reviewed-by: Kees Cook

--
Kees Cook
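For readers following the thread, the idea behind the wrappers in the cover letter (compute the array size with an overflow check, fail with -EOVERFLOW instead of copying a truncated size) can be sketched in userspace C. This is only an illustration under stated assumptions: memdup_array() is a hypothetical name, it uses malloc()/memcpy() rather than the kernel's memdup_user(), it reports errors through errno rather than an error pointer, and it relies on the GCC/Clang builtin __builtin_mul_overflow(). It is not the code from the series.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of an overflow-checked array copy:
 * duplicate n elements of `size` bytes each, refusing any n * size
 * product that does not fit in size_t.
 *
 * Returns a malloc()ed copy, or NULL with errno set to EOVERFLOW
 * (multiplication overflowed) or ENOMEM (allocation failed).
 */
void *memdup_array(const void *src, size_t n, size_t size)
{
	size_t nbytes;
	void *dst;

	/* GCC/Clang builtin: returns true if n * size overflowed. */
	if (__builtin_mul_overflow(n, size, &nbytes)) {
		errno = EOVERFLOW;
		return NULL;
	}

	/* Allocate at least one byte so an empty array still yields a pointer. */
	dst = malloc(nbytes ? nbytes : 1);
	if (!dst) {
		errno = ENOMEM;
		return NULL;
	}

	if (nbytes)
		memcpy(dst, src, nbytes);
	return dst;
}
```

The point of the check is that a caller passing an attacker-controlled element count can no longer end up with a small allocation followed by out-of-bounds accesses; the call fails before any memory is copied.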
Re: [PATCH v2 0/2] x86/kexec: UKI Support
Add Philipp to CC as he is also investigating UKI

On 09/11/23 at 07:25am, Jan Hendrik Farr wrote:
> Hello,
>
> this patch (v2) implements UKI support for kexec_file_load. It will require
> support in the kexec-tools userspace utility. For testing purposes the
> following can be used: https://github.com/Cydox/kexec-test/
>
> Creating UKIs for testing can be done with ukify (included in systemd),
> sbctl, and mkinitcpio, etc.

This is awesome work, Jan, thanks.

By the way, could you provide detailed steps about how to test this
patchset so that people interested can give it a shot?

>
> There has been discussion on this topic in an issue on GitHub that is linked
> below for reference.
>
> Changes for v2:
> - .cmdline section is now optional
> - moving pefile_parse_binary is now in a separate commit for clarity
> - parse_pefile.c is now in /lib instead of arch/x86/kernel (not sure if
>   this is the best location, but it definitely shouldn't have been in an
>   architecture specific location)
> - parse_pefile.h is now in include/linux instead of architecture
>   specific location
> - if initrd or cmdline is manually supplied EPERM is returned instead of
>   being silently ignored
> - formatting tweaks
>
>
> Some links:
> - Related discussion: https://github.com/systemd/systemd/issues/28538
> - Documentation of UKIs:
>   https://uapi-group.org/specifications/specs/unified_kernel_image/
>
> Jan Hendrik Farr (2):
>   move pefile_parse_binary to its own file
>   x86/kexec: UKI support
>
>  arch/x86/include/asm/kexec-uki.h       |   7 ++
>  arch/x86/kernel/Makefile               |   1 +
>  arch/x86/kernel/kexec-uki.c            | 126 +
>  arch/x86/kernel/machine_kexec_64.c     |   2 +
>  crypto/asymmetric_keys/mscode_parser.c |   2 +-
>  crypto/asymmetric_keys/verify_pefile.c | 110 +++--
>  crypto/asymmetric_keys/verify_pefile.h |  16
>  include/linux/parse_pefile.h           |  32 +++
>  lib/Makefile                           |   3 +
>  lib/parse_pefile.c                     | 109 +
>  10 files changed, 292 insertions(+), 116 deletions(-)
>  create mode 100644 arch/x86/include/asm/kexec-uki.h
>  create mode 100644 arch/x86/kernel/kexec-uki.c
>  create mode 100644 include/linux/parse_pefile.h
>  create mode 100644 lib/parse_pefile.c
>
> --
> 2.40.1
>
Re: [systemd-devel] [PATCH 0/1] x86/kexec: UKI support
On Mon, Sep 11, 2023 at 7:15 PM Jarkko Sakkinen wrote:
>
> On Sat Sep 9, 2023 at 7:18 PM EEST, Jan Hendrik Farr wrote:
> > Hello,
> >
> > this patch implements UKI support for kexec_file_load. It will require
> > support in the kexec-tools userspace utility. For testing purposes the
> > following can be used:
> > https://github.com/Cydox/kexec-test/
> >
> > There has been discussion on this topic in an issue on GitHub that is
> > linked below for reference.
> >
> >
> > Some links:
> > - Related discussion: https://github.com/systemd/systemd/issues/28538
> > - Documentation of UKIs:
> >   https://uapi-group.org/specifications/specs/unified_kernel_image/
> >
> > Jan Hendrik Farr (1):
> >   x86/kexec: UKI support
> >
> >  arch/x86/include/asm/kexec-uki.h       |   7 ++
> >  arch/x86/include/asm/parse_pefile.h    |  32 +++
> >  arch/x86/kernel/Makefile               |   2 +
> >  arch/x86/kernel/kexec-uki.c            | 113 +
> >  arch/x86/kernel/machine_kexec_64.c     |   2 +
> >  arch/x86/kernel/parse_pefile.c         | 110
> >  crypto/asymmetric_keys/mscode_parser.c |   2 +-
> >  crypto/asymmetric_keys/verify_pefile.c | 110 +++-
> >  crypto/asymmetric_keys/verify_pefile.h |  16
> >  9 files changed, 278 insertions(+), 116 deletions(-)
> >  create mode 100644 arch/x86/include/asm/kexec-uki.h
> >  create mode 100644 arch/x86/include/asm/parse_pefile.h
> >  create mode 100644 arch/x86/kernel/kexec-uki.c
> >  create mode 100644 arch/x86/kernel/parse_pefile.c
> >
> > --
> > 2.40.1
>
> What the heck is UKI?

Unified Kernel Images. More details available here:
https://uapi-group.org/specifications/specs/unified_kernel_image/

It's a way of creating initramfs-style images as fully generic,
reproducible images that can be built server-side. It is a requirement
for creating locked down Linux devices for appliances that can be
tamper-resistant too.

--
真実はいつも一つ!/ Always, there's only one truth!
Re: [PATCH 0/1] x86/kexec: UKI support
> What the heck is UKI?

UKI (Unified Kernel Image) is the kernel image + initrd + cmdline (+
some other optional stuff) all packaged up together as one EFI
application.

This EFI application can then be launched directly by the UEFI firmware
(or by systemd-boot) without the need for anything else. It's all
self-contained.

One benefit is that this is a convenient way to distribute kernels all
in one file. Another benefit is that the whole combination of kernel
image, initrd, and cmdline can all be signed together, so only that
particular combination can be executed if you are using secure boot.

The format itself is rather simple. It's just a PE file (as required by
the UEFI spec) that contains a small stub application in the .text,
.data, etc. sections that is responsible for invoking the contained
kernel and initrd with the contained cmdline. The kernel image is placed
into a .linux section, the initrd into a .initrd section, and the
cmdline into a .cmdline section in the PE executable.

If we want to kexec a UKI we could obviously just have userspace pick it
apart and kexec it like normal. However, in lockdown mode this will only
work if you sign the kernel image that is contained inside the UKI. The
problem with that is that anybody can then grab that signed kernel and
launch it with any initrd or cmdline.

So this patch makes the kernel do the work instead. The kernel verifies
the signature on the entire UKI and then passes its components on to the
normal kexec bzImage loader.

Useful Links:
UKI format documentation: https://uapi-group.org/specifications/specs/unified_kernel_image/
Arch wiki: https://wiki.archlinux.org/title/Unified_kernel_image
Fedora UKI support: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
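To make the "it's just a PE file with named sections" description concrete, here is a minimal userspace sketch of looking up one section (for example .cmdline) in a PE image held in memory. It assumes only the standard PE/COFF layout (DOS header with the PE offset at 0x3c, 20-byte COFF header, 40-byte section headers); the names pe_find_section() and struct pe_section are hypothetical, and this is not the parser from the patch, which additionally has to verify the signature over the whole file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Location of one section's contents inside the image buffer. */
struct pe_section {
	const uint8_t *data;
	uint32_t size;
};

/* PE fields are little-endian; read them byte by byte to stay portable. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Find the section called `name` (at most 8 characters, e.g. ".cmdline").
 * Returns 0 and fills *out on success, -1 if the image is malformed or
 * the section does not exist.
 */
int pe_find_section(const uint8_t *img, size_t len, const char *name,
		    struct pe_section *out)
{
	char padded[8] = {0};
	size_t name_len = strlen(name);
	size_t pe, table;
	uint16_t nsect, optsz, i;

	if (name_len > sizeof(padded))
		return -1;
	memcpy(padded, name, name_len);

	/* DOS header: "MZ" magic, offset of the PE header at 0x3c. */
	if (len < 0x40 || img[0] != 'M' || img[1] != 'Z')
		return -1;
	pe = rd32(img + 0x3c);
	if (pe > len || len - pe < 24 || memcmp(img + pe, "PE\0\0", 4) != 0)
		return -1;

	nsect = rd16(img + pe + 6);	/* NumberOfSections */
	optsz = rd16(img + pe + 20);	/* SizeOfOptionalHeader */
	table = pe + 24 + optsz;	/* section table follows the optional header */

	for (i = 0; i < nsect; i++) {
		size_t hdr = table + 40 * (size_t)i;
		uint32_t vsize, rawsz, rawoff, size;

		if (hdr > len || len - hdr < 40)
			return -1;
		if (memcmp(img + hdr, padded, 8) != 0)
			continue;

		vsize = rd32(img + hdr + 8);	/* VirtualSize */
		rawsz = rd32(img + hdr + 16);	/* SizeOfRawData (file-aligned) */
		rawoff = rd32(img + hdr + 20);	/* PointerToRawData */
		/* Prefer the unpadded size when it is set and smaller. */
		size = (vsize && vsize < rawsz) ? vsize : rawsz;
		if (rawoff > len || len - rawoff < size)
			return -1;

		out->data = img + rawoff;
		out->size = size;
		return 0;
	}
	return -1;
}
```

A userspace tool could call this once per section of interest and hand the pieces to the regular kexec path; as the message explains, that is exactly the approach that stops being trustworthy in lockdown mode, which is why the patch moves the parsing and verification into the kernel.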
Re: [PATCH 0/1] x86/kexec: UKI support
On Sat Sep 9, 2023 at 7:18 PM EEST, Jan Hendrik Farr wrote:
> Hello,
>
> this patch implements UKI support for kexec_file_load. It will require support
> in the kexec-tools userspace utility.
> [...]

What the heck is UKI?

BR, Jarkko
Re: kexec reboot failed due to commit 75d090fd167ac
On 9/11/23 10:53, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 10:33:01AM -0500, Tom Lendacky wrote:
> > On 9/11/23 09:57, Kirill A. Shutemov wrote:
> > > [...]
> > > Also I'm not sure why do we need pagefault handler there. Looks like it
> > > just masking problems. I think everything has to be mapped explicitly.
> > >
> > > Any comments?
> >
> > There was a similar related issue around the cc_info blob that is captured
> > here: https://lore.kernel.org/lkml/20230601072043.24439-1-l...@redhat.com/
> >
> > Personally, I'm a fan of mapping the EFI tables that will be passed to the
> > kexec/kdump kernel. To me, that seems to more closely match the valid
> > mappings for the tables when control is transferred to the OS from UEFI on
> > the initial boot.
>
> I don't see how it would help if initialize_identity_maps() resets
> pagetables. See 'else' case of 'if (p4d_offset...).

Ok, I see what you mean now.

Thanks,
Tom
Re: [PATCH v2] x86/purgatory: Remove LTO flags
On Mon, Sep 11, 2023 at 9:00 AM Nick Desaulniers wrote:
>
> On Fri, Sep 8, 2023 at 4:13 PM Song Liu wrote:
> > [...]
> > Fixes: 8652d44f466a ("kexec: support purgatories with .text.hot sections")
>
> Dunno that this fixes tag is precise. I think perhaps
>
> Fixes: b33fff07e3e3 ("x86, build: allow LTO to be selected")

Thanks for the correction!

>
> would be more precise.
>
> [...]
> > +# When LTO is enabled, llvm emits many text sections, which is not supported
> > +# by kexec. Remove -flto=* flags.
>
> -flto* in LLVM implies -ffunction-sections, which creates a .text.<name>
> section per function definition to facilitate -Wl,--gc-sections.
>
> Overall the question here is "do we really need to optimize kexec?"
>
> If the answer is yes, then this patch AND 8652d44f466a are both
> pessimizing kexec (though having it work at all is strictly better
> than not at all). The best fix IMO would be to provide a linker
> script for this purgatory image that rejoins the text sections back
> into one .text. For example:
>
> commit eff8728fe698 ("vmlinux.lds.h: Add PGO and AutoFDO input sections")
>
> I assume people do care about the time to kexec, hence the raison
> d'etre for projects like kpatch.

AFAICT, optimizations like LTO and PGO can give a few % of improvement,
which is probably not important for kexec. The benefit is on the order
of seconds (or less?).

The benefit of kpatch is that we can keep the workload running while
fixing the kernel bug. Based on our experience at Meta, it may take
hours to gracefully shut down the application to run kexec. In this
case, a few seconds of improvement (via LTO/PGO purgatory) doesn't save
us much.

Thanks,
Song

>
> I'm fine to sign off on this approach if we don't really care, or want
> to care more later, but we can do better here.
>
> [...]
>
> --
> Thanks,
> ~Nick Desaulniers
Re: [PATCH v2] x86/purgatory: Remove LTO flags
On Fri, Sep 8, 2023 at 4:13 PM Song Liu wrote:
>
> With LTO enabled, ld.lld generates multiple .text sections for
> purgatory.ro:
>
> $ readelf -S purgatory.ro | grep " .text"
>   [ 1] .text             PROGBITS  0040
>   [ 7] .text.purgatory   PROGBITS  20e0
>   [ 9] .text.warn        PROGBITS  21c0
>   [13] .text.sha256_upda PROGBITS  22f0
>   [15] .text.sha224_upda PROGBITS  2be0
>   [17] .text.sha256_fina PROGBITS  2bf0
>   [19] .text.sha224_fina PROGBITS  2cc0
>
> This cause WARNING from kexec_purgatory_setup_sechdrs():
>
> WARNING: CPU: 26 PID: 110894 at kernel/kexec_file.c:919
> kexec_load_purgatory+0x37f/0x390
>
> Fix this by disabling LTO for purgatory.

Thanks for the v2!

>
> Fixes: 8652d44f466a ("kexec: support purgatories with .text.hot sections")

Dunno that this fixes tag is precise. I think perhaps

Fixes: b33fff07e3e3 ("x86, build: allow LTO to be selected")

would be more precise.

> Cc: Ricardo Ribalda
> Cc: Sami Tolvanen
> Cc: kexec@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: x...@kernel.org
> Cc: l...@lists.linux.dev
> Signed-off-by: Song Liu
>
> ---
> AFAICT, x86 is the only arch that supports LTO and purgatory.
>
> Changes in v2:
> 1. Use CC_FLAGS_LTO instead of hardcode -flto. (Nick Desaulniers)
> ---
>  arch/x86/purgatory/Makefile | 4
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> index c2a29be35c01..08aa0f25f12a 100644
> --- a/arch/x86/purgatory/Makefile
> +++ b/arch/x86/purgatory/Makefile
> @@ -19,6 +19,10 @@ CFLAGS_sha256.o := -D__DISABLE_EXPORTS -D__NO_FORTIFY
>  # optimization flags.
>  KBUILD_CFLAGS := $(filter-out -fprofile-sample-use=% -fprofile-use=%,$(KBUILD_CFLAGS))
>
> +# When LTO is enabled, llvm emits many text sections, which is not supported
> +# by kexec. Remove -flto=* flags.

-flto* in LLVM implies -ffunction-sections, which creates a .text.<name>
section per function definition to facilitate -Wl,--gc-sections.

Overall the question here is "do we really need to optimize kexec?"

If the answer is yes, then this patch AND 8652d44f466a are both
pessimizing kexec (though having it work at all is strictly better
than not at all). The best fix IMO would be to provide a linker
script for this purgatory image that rejoins the text sections back
into one .text. For example:

commit eff8728fe698 ("vmlinux.lds.h: Add PGO and AutoFDO input sections")

I assume people do care about the time to kexec, hence the raison
d'etre for projects like kpatch.

I'm fine to sign off on this approach if we don't really care, or want
to care more later, but we can do better here.

> +KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO),$(KBUILD_CFLAGS))
> +
>  # When linking purgatory.ro with -r unresolved symbols are not checked,
>  # also link a purgatory.chk binary without -r to check for unresolved symbols.
>  PURGATORY_LDFLAGS := -e purgatory_start -z nodefaultlib
> --
> 2.34.1
>

--
Thanks,
~Nick Desaulniers
Re: kexec reboot failed due to commit 75d090fd167ac
On Mon, Sep 11, 2023 at 10:33:01AM -0500, Tom Lendacky wrote:
> On 9/11/23 09:57, Kirill A. Shutemov wrote:
> > [...]
> > Also I'm not sure why do we need pagefault handler there. Looks like it
> > just masking problems. I think everything has to be mapped explicitly.
> >
> > Any comments?
>
> There was a similar related issue around the cc_info blob that is captured
> here: https://lore.kernel.org/lkml/20230601072043.24439-1-l...@redhat.com/
>
> Personally, I'm a fan of mapping the EFI tables that will be passed to the
> kexec/kdump kernel. To me, that seems to more closely match the valid
> mappings for the tables when control is transferred to the OS from UEFI on
> the initial boot.

I don't see how it would help if initialize_identity_maps() resets
pagetables. See 'else' case of 'if (p4d_offset...).

--
Kiryl Shutsemau / Kirill A. Shutemov
Re: kexec reboot failed due to commit 75d090fd167ac
On 9/11/23 09:57, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 10:56:36PM +0800, Dave Young wrote:
> > [...]
>
> I don't know what the right fix here. We can increase the constants to be
> enough to cover existing cases, but it is very fragile. I am not sure I
> saw all users. Some of them could silently handled with pagefault handler
> in some setups. And it is hard to catch new users during code review.
>
> Also I'm not sure why do we need pagefault handler there. Looks like it
> just masking problems. I think everything has to be mapped explicitly.
>
> Any comments?

There was a similar related issue around the cc_info blob that is
captured here:
https://lore.kernel.org/lkml/20230601072043.24439-1-l...@redhat.com/

Personally, I'm a fan of mapping the EFI tables that will be passed to
the kexec/kdump kernel. To me, that seems to more closely match the
valid mappings for the tables when control is transferred to the OS from
UEFI on the initial boot.

Thanks,
Tom
Re: kexec reboot failed due to commit 75d090fd167ac
On Mon, Sep 11, 2023 at 10:56:36PM +0800, Dave Young wrote: > > early console in extract_kernel > > input_data: 0x00807eb433a8 > > input_len: 0x00d26271 > > output: 0x00807b00 > > output_len: 0x04800c10 > > kernel_total_size: 0x03e28000 > > needed_size: 0x04a0 > > trampoline_32bit: 0x0009d000 > > > > Decompressing Linux... out of pgt_buf in > > arch/x86/boot/compressed/ident_map_64.c!? > > pages->pgt_buf_offset: 0x6000 > > pages->pgt_buf_size: 0x6000 > > > > > > Error: kernel_ident_mapping_init() failed > > > > It crashes on #PF due to stbl->nr_tables dereference in > > efi_get_conf_table() called from init_unaccepted_memory(). > > > > I don't see anything special about stbl location: 0x775d6018. > > > > One other bit of information: disabling 5-level paging also helps the > > issue. > > > > I will debug further. The problem is not limited to unaccepted memory, it also triggers if we reach efi_get_rsdp_addr() in the same setup. I think we have several problems here. - 6 pages for !RANDOMIZE_BASE is only enough for kernel, cmdline, boot_data and setup_data if we assume that they are in different 1G regions and do not cross the 1G boundaries. 4-level paging: 1 for PGD, 1 for PUD, 4 for PMD tables. Looks like we never map EFI/ACPI memory explicitly. It might work if kernel/cmdline/... are in single 1G and we have spare pages to handle page faults. - No spare memory to handle mapping for cc_info and cc_info->cpuid_phys; - I didn't increase BOOT_INIT_PGT_SIZE when added 5-level paging support. And if start pagetables from scratch ('else' case of 'if (p4d_offset...)) we run out of memory. I believe similar logic would apply for BOOT_PGT_SIZE for RANDOMIZE_BASE=y case. I don't know what the right fix here. We can increase the constants to be enough to cover existing cases, but it is very fragile. I am not sure I saw all users. Some of them could silently handled with pagefault handler in some setups. And it is hard to catch new users during code review. 
Also, I'm not sure why we need the page fault handler there. It looks like
it is just masking problems. I think everything has to be mapped explicitly.

Any comments?

--
Kiryl Shutsemau / Kirill A. Shutemov

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
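The page-table budget argued in the list above can be checked with a small userspace model. This is only a sketch under stated assumptions: x86-64 geometry with 4K tables and 2M leaf mappings, so that one PMD table covers 1G, one PUD table covers 512G and, with 5-level paging, one P4D table covers 256T. The struct and function names are made up for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one identity-mapped region. */
struct region {
	uint64_t start;	/* physical start address */
	uint64_t size;	/* length in bytes, must be non-zero */
};

/* Index of the first 2^shift-sized chunk touched by a region. */
static uint64_t first_chunk(const struct region *r, unsigned int shift)
{
	return r->start >> shift;
}

/* Index of the last 2^shift-sized chunk touched by a region. */
static uint64_t last_chunk(const struct region *r, unsigned int shift)
{
	return (r->start + r->size - 1) >> shift;
}

/* Count distinct 2^shift-sized chunks touched by regions[0..n). */
static size_t distinct_chunks(const struct region *regions, size_t n,
			      unsigned int shift)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		assert(regions[i].size > 0);
		for (uint64_t c = first_chunk(&regions[i], shift);
		     c <= last_chunk(&regions[i], shift); c++) {
			int seen = 0;

			/* Skip chunks already counted for an earlier region. */
			for (size_t k = 0; k < i && !seen; k++)
				seen = c >= first_chunk(&regions[k], shift) &&
				       c <= last_chunk(&regions[k], shift);
			if (!seen)
				count++;
		}
	}
	return count;
}

/* Worst-case number of 4K page-table pages needed to map all regions. */
size_t pgt_pages_needed(const struct region *regions, size_t n, int five_level)
{
	size_t pages = 1;	/* the single top-level table (PGD) */

	if (five_level)
		pages += distinct_chunks(regions, n, 48);	/* P4D tables */
	pages += distinct_chunks(regions, n, 39);		/* PUD tables */
	pages += distinct_chunks(regions, n, 30);		/* PMD tables */
	return pages;
}
```

In this model, four small regions that sit in four different 1G ranges need 6 pages with 4-level paging and 7 with 5-level paging, and a region that straddles a 1G boundary needs a second PMD table. Any extra region in a 1G range that is not yet covered, for example an EFI table, would cost at least one more page.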
Re: [PATCH 3/3] /dev/mem: Do not map unaccepted memory
On 9/11/23 01:09, David Hildenbrand wrote:
> So, making unaccepted memory similarly depend on "!DEVMEM ||
> STRICT_DEVMEM" does not sound too far off ...

Yeah, considering all of the invasive work folks want to do to "harden"
the kernel for TDX, doing that ^ is just about the best
bang-for-your-buck "hardening" that you can get.
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On 11.09.23 12:05, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 11:50:31AM +0200, David Hildenbrand wrote:
> > On 11.09.23 11:27, Kirill A. Shutemov wrote:
> > > On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:
> > > > On 11.09.23 10:41, Kirill A. Shutemov wrote:
> > > > > On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> > > > > > On 06.09.23 09:39, Adrian Hunter wrote:
> > > > > > > Support for unaccepted memory was added recently, refer commit
> > > > > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > > > > > a virtual machine may need to accept memory before it can be used.
> > > > > > >
> > > > > > > Do not map unaccepted memory because it can cause the guest to fail.
> > > > > > >
> > > > > > > For /proc/vmcore, which is read-only, this means a read or mmap of
> > > > > > > unaccepted memory will return zeros.
> > > > > >
> > > > > > Does a second (kdump) kernel that exposes /proc/vmcore reliably get
> > > > > > access to the information whether memory of the first kernel is
> > > > > > unaccepted (IOW, not its memory, but the memory of the first kernel
> > > > > > it is supposed to expose via /proc/vmcore)?
> > > > >
> > > > > There are few patches in my queue to few related issue, but generally,
> > > > > yes, the information is available to the target kernel via EFI
> > > > > configuration table.
> > > >
> > > > I assume that table provided by the first kernel, and not read directly
> > > > from HW, correct?
> > >
> > > The table is constructed by the EFI stub in the first kernel based on EFI
> > > memory map.
> >
> > Okay, should work then once that's done by the first kernel.
> >
> > Maybe include this patch in your series?
>
> Can do. But the other two patches are not related to kexec. Hm.

Yes, the others can go in separately. But this here really needs other
kexec/kdump changes.

--
Cheers,

David / dhildenb
[PATCHv6 3/3] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid
paca_ptrs should be large enough to hold the boot_cpuid, hence, its lower boundary is set to the bigger one between boot_cpuid+1 and nr_cpus. On the other hand, some kernel component: -1. the timer assumes cpu0 online since the timer_list->flags subfield 'TIMER_CPUMASK' is zero if not initialized to a proper present cpu. -2. power9_idle_stop() assumes the primary thread's paca is allocated. Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the boot cpu is not cpu0. Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Mahesh Salgaonkar Cc: Wen Xiong Cc: Baoquan He Cc: Ming Lei Cc: kexec@lists.infradead.org To: linuxppc-...@lists.ozlabs.org --- arch/powerpc/kernel/paca.c | 10 ++ arch/powerpc/kernel/prom.c | 9 ++--- 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index cda4e00b67c1..91e2401de1bd 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -242,9 +242,10 @@ static int __initdata paca_struct_size; void __init allocate_paca_ptrs(void) { - paca_nr_cpu_ids = nr_cpu_ids; + int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids; - paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids; + paca_nr_cpu_ids = n; + paca_ptrs_size = sizeof(struct paca_struct *) * n; paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES); if (!paca_ptrs) panic("Failed to allocate %d bytes for paca pointers\n", @@ -287,13 +288,14 @@ void __init allocate_paca(int cpu) void __init free_unused_pacas(void) { int new_ptrs_size; + int n = (boot_cpuid + 1) > nr_cpu_ids ? 
(boot_cpuid + 1) : nr_cpu_ids; - new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids; + new_ptrs_size = sizeof(struct paca_struct *) * n; if (new_ptrs_size < paca_ptrs_size) memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size, paca_ptrs_size - new_ptrs_size); - paca_nr_cpu_ids = nr_cpu_ids; + paca_nr_cpu_ids = n; paca_ptrs_size = new_ptrs_size; #ifdef CONFIG_PPC_64S_HASH_MMU diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index cb3f3e040455..28441edbc42d 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -362,9 +362,12 @@ static int __init early_init_dt_scan_cpus(unsigned long node, */ boot_cpuid = i; found = true; - /* This works around the hole in paca_ptrs[]. */ - if (nr_cpu_ids < nthreads) - set_nr_cpu_ids(nthreads); + /* +* Ideally, nr_cpus=1 can be achieved if each kernel +* component does not assume cpu0 is onlined. +*/ + if (boot_cpuid != 0 && nr_cpu_ids < 2) + set_nr_cpu_ids(2); } #ifdef CONFIG_SMP /* logical cpu id is always 0 on UP kernels */ -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
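The sizing rule that this patch repeats in allocate_paca_ptrs() and free_unused_pacas(), together with the nr_cpu_ids adjustment in prom.c, can be modelled in a few lines. The following is a minimal userspace sketch, not kernel code: the helper names are hypothetical, and plain unsigned integers stand in for the kernel's boot_cpuid and nr_cpu_ids.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of paca_ptrs[] slots: the array must be indexable by boot_cpuid
 * and must also hold nr_cpu_ids entries, so take the larger of the two.
 */
static unsigned int paca_ptrs_slots(unsigned int boot_cpuid,
				    unsigned int nr_cpu_ids)
{
	unsigned int need = boot_cpuid + 1;

	assert(nr_cpu_ids >= 1);
	return need > nr_cpu_ids ? need : nr_cpu_ids;
}

/* Size in bytes of the pointer array for the slot count above. */
static size_t paca_ptrs_bytes(unsigned int boot_cpuid, unsigned int nr_cpu_ids)
{
	return sizeof(void *) * paca_ptrs_slots(boot_cpuid, nr_cpu_ids);
}

/*
 * Model of the prom.c hunk: when the boot cpu is not cpu0, nr_cpu_ids is
 * raised to at least two so that cpu0 can be onlined as well.
 */
static unsigned int adjust_nr_cpu_ids(unsigned int boot_cpuid,
				      unsigned int nr_cpu_ids)
{
	if (boot_cpuid != 0 && nr_cpu_ids < 2)
		return 2;
	return nr_cpu_ids;
}
```

Factoring the expression into one helper, as done here, would also avoid computing the same maximum in two places; whether that is worth a helper in the real patch is a matter of taste.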
[PATCHv6 0/3] enable nr_cpus for powerpc
Since my last v4 [1], the code has undergone great changes. The paca[]
array has been reorganized and is now indexed by paca_ptrs[], which
dramatically decreases the memory consumption even if there are many
non-present cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. So I keep [1/3], which
rotate-shifts the cpu's sequence number in the device tree to obtain
the logical cpu id.

Patches [2/3] and [3/3] make it possible to reduce nr_cpus to two or
fewer.

[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org

v5 -> v6:
  Assign nr_cpu_ids through set_nr_cpu_ids() to handle the case where
  nr_cpu_ids is configured as a constant.

Pingfan Liu (3):
  powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/kernel/paca.c         |  10 +--
 arch/powerpc/kernel/prom.c         |  28 +---
 arch/powerpc/kernel/setup-common.c | 106 -
 3 files changed, 113 insertions(+), 31 deletions(-)

--
2.31.1
[PATCHv6 2/3] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
If the boot_cpuid is not smaller than nr_cpus, it requires extra effort
to ensure the boot_cpu is in cpu_present_mask. This can be achieved by
reserving the last slot for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
next patch.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index a07af8de6674..58a988c64dd2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -456,8 +456,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int shift = 0, cpu = 0;
-	int j, nthreads = 1;
+	int terminate, shift = 0, cpu = 0;
+	int j, bt_thread = 0, nthreads = 1;
 	int len;
 	struct interrupt_server_node *intserv_node, *n;
 	struct list_head *bt_node, head;
@@ -520,6 +520,7 @@ void __init smp_setup_cpu_maps(void)
 			for (j = 0 ; j < nthreads; j++) {
 				if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
 					bt_node = &intserv_node->node;
+					bt_thread = j;
 					found_boot_cpu = true;
 					/*
 					 * Record the round-shift between dt
@@ -539,11 +540,21 @@ void __init smp_setup_cpu_maps(void)
 	/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
 	list_add_tail(&head, bt_node);
 	pr_info("the round shift between dt seq and the cpu logic number: %d\n", shift);
+	terminate = nr_cpu_ids;
 	list_for_each_entry(intserv_node, &head, node) {
 
+		j = 0;
+		/* Choose a start point to cover the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			/*
+			 * The processor core puts assumption on the thread id,
+			 * not to breach the assumption.
+			 */
+			terminate = nr_cpu_ids - 1;
+		}
 		avail = intserv_node->avail;
 		nthreads = intserv_node->len / sizeof(int);
-		for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+		for (; j < nthreads && cpu < terminate; j++) {
 			set_cpu_present(cpu, avail);
 			set_cpu_possible(cpu, true);
 			cpu_to_phys_id[cpu] = be32_to_cpu(intserv_node->intserv[j]);
@@ -551,6 +562,14 @@ void __init smp_setup_cpu_maps(void)
 			    j, cpu, be32_to_cpu(intserv[j]));
 			cpu++;
 		}
+		/* Online the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			set_cpu_present(bt_thread, avail);
+			set_cpu_possible(bt_thread, true);
+			cpu_to_phys_id[bt_thread] = be32_to_cpu(intserv_node->intserv[bt_thread]);
+			DBG("thread %d -> cpu %d (hard id %d)\n",
+			    bt_thread, bt_thread, be32_to_cpu(intserv[bt_thread]));
+		}
 	}
 
 	list_for_each_entry_safe(intserv_node, n, &head, node) {
--
2.31.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
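The effect of the 'terminate' logic in the diff above is easier to see in a userspace model. The sketch below covers only the boot core (the first core on the list after the rotation) and replaces set_cpu_present() with a plain bool array. The function name and the MODEL_MAX_CPUS limit are hypothetical, and the model is my reading of the diff, not something the patch author confirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_MAX_CPUS 64	/* hypothetical bound for this model only */

/*
 * Mark the logical cpus that the loop in the diff would make present for
 * the boot core. Returns how many cpus end up present.
 */
unsigned int model_boot_core(unsigned int nr_cpu_ids, unsigned int nthreads,
			     unsigned int bt_thread,
			     bool present[MODEL_MAX_CPUS])
{
	unsigned int terminate = nr_cpu_ids;
	unsigned int cpu = 0, count = 0;
	bool boot_out_of_range;

	assert(nr_cpu_ids >= 1 && nthreads <= MODEL_MAX_CPUS);
	assert(bt_thread < nthreads);
	memset(present, 0, MODEL_MAX_CPUS * sizeof(present[0]));

	/* Same test as the diff: the boot thread id does not fit in nr_cpu_ids. */
	boot_out_of_range = nr_cpu_ids - 1 < bt_thread;
	if (boot_out_of_range)
		terminate = nr_cpu_ids - 1;	/* keep one slot for the boot cpu */

	for (unsigned int j = 0; j < nthreads && cpu < terminate; j++) {
		present[cpu++] = true;
		count++;
	}

	/* The boot cpu keeps its thread id as its logical id. */
	if (boot_out_of_range) {
		present[bt_thread] = true;
		count++;
	}
	return count;
}
```

With nr_cpu_ids = 2 and the boot cpu being thread 5 of an 8-thread core, the model marks cpu0 and cpu5 present, which is why the following patch has to make paca_ptrs[] large enough to be indexed by boot_cpuid.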
[PATCHv6 1/3] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
*** Idea *** For kexec -p, the boot cpu can be not the cpu0, this may waste plenty of room when of allocating memory for paca_ptrs[]. However, in theory, there is no requirement to assign cpu's logical id as its present sequence in the device tree. But there is something like cpu_first_thread_sibling(), which makes assumption on the mapping inside a core. Hence partially loosening the mapping, i.e. unbind the mapping of core while keep the mapping inside a core. *** Implement *** At this early stage, there are plenty of memory to utilize. Hence, this patch allocates interim memory to link the cpu info on a list, then reorder cpus by changing the list head. As a result, there is a rotate shift between the sequence number in dt and the cpu logical number. *** Result *** After this patch, a boot-cpu's logical id will always be mapped into the range [0,threads_per_core). Besides this, at this phase, all threads in the boot core are forced to be onlined. This restriction will be lifted in a later patch with extra effort. 
Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Mahesh Salgaonkar Cc: Wen Xiong Cc: Baoquan He Cc: Ming Lei Cc: kexec@lists.infradead.org To: linuxppc-...@lists.ozlabs.org --- arch/powerpc/kernel/prom.c | 25 + arch/powerpc/kernel/setup-common.c | 87 +++--- 2 files changed, 85 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index 0b5878c3125b..cb3f3e040455 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -76,7 +76,9 @@ u64 ppc64_rma_size; unsigned int boot_cpu_node_count __ro_after_init; #endif static phys_addr_t first_memblock_size; +#ifdef CONFIG_SMP static int __initdata boot_cpu_count; +#endif static int __init early_parse_mem(char *p) { @@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node, const __be32 *intserv; int i, nthreads; int len; - int found = -1; - int found_thread = 0; + bool found = false; /* We are scanning "cpu" nodes only */ if (type == NULL || strcmp(type, "cpu") != 0) @@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node, for (i = 0; i < nthreads; i++) { if (be32_to_cpu(intserv[i]) == fdt_boot_cpuid_phys(initial_boot_params)) { - found = boot_cpu_count; - found_thread = i; + /* +* always map the boot-cpu logical id into the +* range of [0, thread_per_core) +*/ + boot_cpuid = i; + found = true; + /* This works around the hole in paca_ptrs[]. 
*/ + if (nr_cpu_ids < nthreads) + set_nr_cpu_ids(nthreads); } #ifdef CONFIG_SMP /* logical cpu id is always 0 on UP kernels */ @@ -365,15 +373,14 @@ static int __init early_init_dt_scan_cpus(unsigned long node, } /* Not the boot CPU */ - if (found < 0) + if (!found) return 0; - DBG("boot cpu: logical %d physical %d\n", found, - be32_to_cpu(intserv[found_thread])); - boot_cpuid = found; + DBG("boot cpu: logical %d physical %d\n", boot_cpuid, + be32_to_cpu(intserv[boot_cpuid])); if (IS_ENABLED(CONFIG_PPC64)) - boot_cpu_hwid = be32_to_cpu(intserv[found_thread]); + boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]); /* * PAPR defines "logical" PVR values for cpus that diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index d2a446216444..a07af8de6674 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include #include @@ -427,6 +428,13 @@ static void __init cpu_init_thread_core_maps(int tpc) u32 *cpu_to_phys_id = NULL; +struct interrupt_server_node { + struct list_head node; + boolavail; + int len; + __be32 *intserv; +}; + /** * setup_cpu_maps - initialize the following cpu maps: * cpu_possible_mask @@ -448,11 +456,16 @@ u32 *cpu_to_phys_id = NULL; void __init smp_setup_cpu_maps(void) { struct device_node *dn; - int cpu = 0; - int nthreads = 1; + int shift = 0, cpu = 0; + int j, nthreads = 1; + int len; + struct interrupt_server_node *intserv_node, *n; + struct list_head *bt_node, head; + bool avail, found_boot_cpu = false; DBG("smp_setup_cpu_maps()\n"); + INIT_LIST_HEAD(&head); cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32), __alignof_
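The rotate shift described in the commit message of this patch can be written as a small mapping function. This is a sketch under assumptions that the patch itself does not make: every core has the same number of threads, and the device tree lists cores contiguously. The function and parameter names are hypothetical.

```c
#include <assert.h>

/*
 * Map a cpu's sequence number in the device tree to a logical cpu id such
 * that the boot core becomes logical core 0 while the thread order inside
 * each core is preserved.
 */
unsigned int dt_seq_to_logical(unsigned int dt_seq, unsigned int boot_dt_seq,
			       unsigned int threads_per_core,
			       unsigned int nr_cores)
{
	unsigned int core, thread, boot_core, rotated;

	assert(threads_per_core > 0 && nr_cores > 0);
	assert(dt_seq < threads_per_core * nr_cores);
	assert(boot_dt_seq < threads_per_core * nr_cores);

	core = dt_seq / threads_per_core;
	thread = dt_seq % threads_per_core;
	boot_core = boot_dt_seq / threads_per_core;

	/* Rotate the core index so that the boot core lands on index 0. */
	rotated = (core + nr_cores - boot_core) % nr_cores;

	return rotated * threads_per_core + thread;
}
```

Because only the core index is rotated, the boot cpu always ends up in the range [0, threads_per_core), and helpers that derive the first sibling from a logical id by rounding down to a multiple of threads_per_core keep working.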
[PATCH V2 2/2] proc/kcore: Do not try to access unaccepted memory
Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Do not try to access unaccepted memory because it can cause the guest to
fail.

For /proc/kcore, which is read-only and does not support mmap, this means a
read of unaccepted memory will return zeros.

Signed-off-by: Adrian Hunter
---
 fs/proc/kcore.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Changes in V2:

	Change patch subject and commit message
	Do not open code pfn_is_unaccepted_memory()

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 23fc24d16b31..6422e569b080 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -546,7 +546,8 @@ static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 			 * and explicitly excluded physical ranges.
 			 */
 			if (!page || PageOffline(page) ||
-			    is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
+			    is_page_hwpoison(page) || !pfn_is_ram(pfn) ||
+			    pfn_is_unaccepted_memory(pfn)) {
 				if (iov_iter_zero(tsz, iter) != tsz) {
 					ret = -EFAULT;
 					goto out;
--
2.34.1
[PATCH V2 0/2] Do not try to access unaccepted memory
Hi

Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Plug a few gaps where RAM is exposed without checking if it is
unaccepted memory.


Changes in V2:

      efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory
          Change patch subject and commit message
          Use vmcore_cb->.pfn_is_ram() instead of changing vmcore.c

      proc/kcore: Do not try to access unaccepted memory
          Change patch subject and commit message
          Do not open code pfn_is_unaccepted_memory()

      /dev/mem: Do not map unaccepted memory
          Patch dropped because it is not required


Adrian Hunter (2):
      efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory
      proc/kcore: Do not try to access unaccepted memory

 drivers/firmware/efi/unaccepted_memory.c | 20 ++++++++++++++++++++
 fs/proc/kcore.c                          |  3 ++-
 include/linux/mm.h                       |  7 +++++++
 3 files changed, 29 insertions(+), 1 deletion(-)


Regards
Adrian
[PATCH V2 1/2] efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory
Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Do not let /proc/vmcore try to access unaccepted memory because it can
cause the guest to fail.

For /proc/vmcore, which is read-only, this means a read or mmap of
unaccepted memory will return zeros.

Signed-off-by: Adrian Hunter
---
 drivers/firmware/efi/unaccepted_memory.c | 20 ++++++++++++++++++++
 include/linux/mm.h                       |  7 +++++++
 2 files changed, 27 insertions(+)

Changes in V2:

	Change patch subject and commit message
	Use vmcore_cb->.pfn_is_ram() instead of changing vmcore.c

diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 853f7dc3c21d..79ba576b22e3 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -3,6 +3,7 @@
 #include
 #include
 #include
+#include
 #include
 
 /* Protects unaccepted memory bitmap */
@@ -145,3 +146,22 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 
 	return ret;
 }
+
+#ifdef CONFIG_PROC_VMCORE
+static bool unaccepted_memory_vmcore_pfn_is_ram(struct vmcore_cb *cb,
+						unsigned long pfn)
+{
+	return !pfn_is_unaccepted_memory(pfn);
+}
+
+static struct vmcore_cb vmcore_cb = {
+	.pfn_is_ram = unaccepted_memory_vmcore_pfn_is_ram,
+};
+
+static int __init unaccepted_memory_init_kdump(void)
+{
+	register_vmcore_cb(&vmcore_cb);
+	return 0;
+}
+core_initcall(unaccepted_memory_init_kdump);
+#endif /* CONFIG_PROC_VMCORE */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..86511150f1d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4062,4 +4062,11 @@ static inline void accept_memory(phys_addr_t start, phys_addr_t end)
 
 #endif
 
+static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
+{
+	phys_addr_t paddr = pfn << PAGE_SHIFT;
+
+	return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
+}
+
 #endif /* _LINUX_MM_H */
--
2.34.1
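To illustrate what pfn_is_unaccepted_memory() boils down to, here is a userspace model of the pfn to physical range conversion and of a bitmap lookup. It is a sketch only. The struct layout, the unit size and the rule that addresses outside the bitmap count as accepted are assumptions of this model, not a description of the kernel's unaccepted memory table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* Hypothetical stand-in for the unaccepted memory table. */
struct unaccepted_model {
	uint64_t phys_base;	/* physical address covered by bit 0 */
	uint64_t unit_size;	/* bytes of memory per bitmap bit */
	uint64_t nr_units;	/* number of valid bits in the bitmap */
	const uint8_t *bitmap;	/* bit set: unit has not been accepted yet */
};

/* True if any unit overlapping [start, end) is still unaccepted. */
static bool model_range_contains_unaccepted(const struct unaccepted_model *m,
					    uint64_t start, uint64_t end)
{
	uint64_t limit = m->phys_base + m->nr_units * m->unit_size;

	assert(m->unit_size > 0 && start < end);

	/* Model assumption: anything outside the bitmap is accepted. */
	if (end <= m->phys_base || start >= limit)
		return false;
	if (start < m->phys_base)
		start = m->phys_base;
	if (end > limit)
		end = limit;

	for (uint64_t u = (start - m->phys_base) / m->unit_size;
	     u <= (end - 1 - m->phys_base) / m->unit_size; u++) {
		if (m->bitmap[u / 8] & (1u << (u % 8)))
			return true;
	}
	return false;
}

/* Same shape as the helper added to mm.h: one page starting at pfn. */
static bool model_pfn_is_unaccepted(const struct unaccepted_model *m,
				    uint64_t pfn)
{
	uint64_t paddr = pfn << MODEL_PAGE_SHIFT;

	return model_range_contains_unaccepted(m, paddr,
					       paddr + MODEL_PAGE_SIZE);
}
```

A reader such as /proc/vmcore or /proc/kcore would then return zeros for every page where the check is true instead of touching the memory.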
[PATCH 1/2] zboot: enable arm64 kexec_load for zboot image
kexec_file_load support of zboot kernel image decompressed the vmlinuz, so in kexec_load code just load the kernel with reading the decompressed kernel fd into a new buffer and use it directly. Signed-off-by: Dave Young --- include/kexec-pe-zboot.h | 3 ++- kexec/arch/arm64/kexec-vmlinuz-arm64.c | 20 ++-- kexec/kexec-pe-zboot.c | 4 +++- kexec/kexec.c | 2 +- kexec/kexec.h | 1 + 5 files changed, 25 insertions(+), 5 deletions(-) diff --git a/include/kexec-pe-zboot.h b/include/kexec-pe-zboot.h index e2e0448a81f2..374916cbe883 100644 --- a/include/kexec-pe-zboot.h +++ b/include/kexec-pe-zboot.h @@ -11,5 +11,6 @@ struct linux_pe_zboot_header { uint32_t compress_type; }; -int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd); +int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd, + off_t *kernel_size); #endif diff --git a/kexec/arch/arm64/kexec-vmlinuz-arm64.c b/kexec/arch/arm64/kexec-vmlinuz-arm64.c index c0ee47c8f50a..8f378d8fa6d0 100644 --- a/kexec/arch/arm64/kexec-vmlinuz-arm64.c +++ b/kexec/arch/arm64/kexec-vmlinuz-arm64.c @@ -34,6 +34,7 @@ #include "arch/options.h" static int kernel_fd = -1; +static off_t decompressed_size; /* Returns: * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format. 
@@ -72,7 +73,7 @@ int pez_arm64_probe(const char *kernel_buf, off_t kernel_size) return -1; } - ret = pez_prepare(buf, buf_sz, &kernel_fd); + ret = pez_prepare(buf, buf_sz, &kernel_fd, &decompressed_size); if (!ret) { /* validate the arm64 specific header */ @@ -98,8 +99,23 @@ bad_header: int pez_arm64_load(int argc, char **argv, const char *buf, off_t len, struct kexec_info *info) { + char *kbuf; + info->kernel_fd = kernel_fd; - return image_arm64_load(argc, argv, buf, len, info); + if (kernel_fd > 0 && decompressed_size > 0) { + off_t nread; + + kbuf = slurp_fd(kernel_fd, NULL, decompressed_size, &nread); + if (!kbuf || nread != decompressed_size) { + dbgprintf("%s: failed.\n", __func__); + return -1; + } + } else { + dbgprintf("%s: wrong file descriptor.\n", __func__); + return -1; + } + + return image_arm64_load(argc, argv, kbuf, decompressed_size, info); } void pez_arm64_usage(void) diff --git a/kexec/kexec-pe-zboot.c b/kexec/kexec-pe-zboot.c index 2f2e052b76c5..3abd17d9fe59 100644 --- a/kexec/kexec-pe-zboot.c +++ b/kexec/kexec-pe-zboot.c @@ -37,7 +37,8 @@ * * crude_buf: the content, which is read from the kernel file without any processing */ -int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd) +int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd, + off_t *kernel_size) { int ret = -1; int fd = 0; @@ -110,6 +111,7 @@ int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd) goto fail_bad_header; } + *kernel_size = decompressed_size; dbgprintf("%s: done\n", __func__); ret = 0; diff --git a/kexec/kexec.c b/kexec/kexec.c index c3b182e254e0..1edbd349c86d 100644 --- a/kexec/kexec.c +++ b/kexec/kexec.c @@ -489,7 +489,7 @@ static int add_backup_segments(struct kexec_info *info, return 0; } -static char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread) +char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread) { char *buf; off_t progress; diff --git a/kexec/kexec.h b/kexec/kexec.h index 
ed3b499a80f2..093338969c57 100644 --- a/kexec/kexec.h +++ b/kexec/kexec.h @@ -267,6 +267,7 @@ extern void die(const char *fmt, ...) __attribute__ ((format (printf, 1, 2))); extern void *xmalloc(size_t size); extern void *xrealloc(void *ptr, size_t size); +extern char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread); extern char *slurp_file(const char *filename, off_t *r_size); extern char *slurp_file_mmap(const char *filename, off_t *r_size); extern char *slurp_file_len(const char *filename, off_t size, off_t *nread); -- 2.37.2 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
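The patch relies on reading exactly decompressed_size bytes back from the descriptor that pez_prepare() produced. The body of slurp_fd() is not part of this thread, so the following is not that function. It is a minimal standalone sketch of the same idea using only POSIX read(), and the name read_fd_exact() is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to 'size' bytes from 'fd' into a freshly allocated buffer.
 * On success the buffer is returned and *nread holds the number of bytes
 * actually read, which is smaller than 'size' if EOF came early.
 * On error NULL is returned. The caller frees the buffer.
 */
char *read_fd_exact(int fd, off_t size, off_t *nread)
{
	char *buf;
	off_t done = 0;

	assert(nread != NULL);
	*nread = 0;
	if (fd < 0 || size <= 0)
		return NULL;

	buf = malloc((size_t)size);
	if (!buf)
		return NULL;

	while (done < size) {
		ssize_t n = read(fd, buf + done, (size_t)(size - done));

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			free(buf);
			return NULL;
		}
		if (n == 0)
			break;			/* EOF before 'size' bytes */
		done += n;
	}

	*nread = done;
	return buf;
}
```

A caller in the style of pez_arm64_load() would compare *nread with the expected size and free the buffer before returning an error if the two differ.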
[PATCH 2/2] zboot: add loongarch kexec_load support
From: "dyo...@redhat.com" Copy arm64 code and change for loongarch so that the kexec -c can load a zboot image. Note: probe zboot image first otherwise the pei-loongarch file type will be used. Signed-off-by: Dave Young --- kexec/arch/loongarch/Makefile | 1 + kexec/arch/loongarch/image-header.h| 1 + kexec/arch/loongarch/kexec-loongarch.c | 1 + kexec/arch/loongarch/kexec-loongarch.h | 4 + kexec/arch/loongarch/kexec-pez-loongarch.c | 88 ++ 5 files changed, 95 insertions(+) create mode 100644 kexec/arch/loongarch/kexec-pez-loongarch.c diff --git a/kexec/arch/loongarch/Makefile b/kexec/arch/loongarch/Makefile index 3b33b9693287..cee7e569a2a2 100644 --- a/kexec/arch/loongarch/Makefile +++ b/kexec/arch/loongarch/Makefile @@ -6,6 +6,7 @@ loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-elf-loongarch.c loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-pei-loongarch.c loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-elf-rel-loongarch.c loongarch_KEXEC_SRCS += kexec/arch/loongarch/crashdump-loongarch.c +loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-pez-loongarch.c loongarch_MEM_REGIONS = kexec/mem_regions.c diff --git a/kexec/arch/loongarch/image-header.h b/kexec/arch/loongarch/image-header.h index 3b7576552685..223d81f77d9f 100644 --- a/kexec/arch/loongarch/image-header.h +++ b/kexec/arch/loongarch/image-header.h @@ -33,6 +33,7 @@ struct loongarch_image_header { }; static const uint8_t loongarch_image_pe_sig[2] = {'M', 'Z'}; +static const uint8_t loongarch_pe_machtype[6] = {'P','E', 0x0, 0x0, 0x64, 0x62}; /** * loongarch_header_check_pe_sig - Helper to check the loongarch image header. 
diff --git a/kexec/arch/loongarch/kexec-loongarch.c b/kexec/arch/loongarch/kexec-loongarch.c index f47c99861674..62ff8fd1aeb7 100644 --- a/kexec/arch/loongarch/kexec-loongarch.c +++ b/kexec/arch/loongarch/kexec-loongarch.c @@ -165,6 +165,7 @@ int get_memory_ranges(struct memory_range **range, int *ranges, struct file_type file_type[] = { {"elf-loongarch", elf_loongarch_probe, elf_loongarch_load, elf_loongarch_usage}, + {"pez-loongarch", pez_loongarch_probe, pez_loongarch_load, pez_loongarch_usage}, {"pei-loongarch", pei_loongarch_probe, pei_loongarch_load, pei_loongarch_usage}, }; int file_types = sizeof(file_type) / sizeof(file_type[0]); diff --git a/kexec/arch/loongarch/kexec-loongarch.h b/kexec/arch/loongarch/kexec-loongarch.h index 5120a26fd513..2c7624f2fd3a 100644 --- a/kexec/arch/loongarch/kexec-loongarch.h +++ b/kexec/arch/loongarch/kexec-loongarch.h @@ -27,6 +27,10 @@ int pei_loongarch_probe(const char *buf, off_t len); int pei_loongarch_load(int argc, char **argv, const char *buf, off_t len, struct kexec_info *info); void pei_loongarch_usage(void); +int pez_loongarch_probe(const char *kernel_buf, off_t kernel_size); +int pez_loongarch_load(int argc, char **argv, const char *buf, off_t len, + struct kexec_info *info); +void pez_loongarch_usage(void); int loongarch_process_image_header(const struct loongarch_image_header *h); diff --git a/kexec/arch/loongarch/kexec-pez-loongarch.c b/kexec/arch/loongarch/kexec-pez-loongarch.c new file mode 100644 index ..6d94a405d54a --- /dev/null +++ b/kexec/arch/loongarch/kexec-pez-loongarch.c @@ -0,0 +1,88 @@ +/* + * LoongArch PE compressed Image (vmlinuz, ZBOOT) support. + * Based on arm64 code + */ + +#define _GNU_SOURCE +#include +#include +#include +#include "kexec.h" +#include "kexec-loongarch.h" +#include +#include "arch/options.h" + +static int kernel_fd = -1; +static off_t decompressed_size; + +/* Returns: + * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format. 
+ */ +int pez_loongarch_probe(const char *kernel_buf, off_t kernel_size) +{ + int ret = -1; + const struct loongarch_image_header *h; + char *buf; + off_t buf_sz; + + buf = (char *)kernel_buf; + buf_sz = kernel_size; + if (!buf) + return -1; + h = (const struct loongarch_image_header *)buf; + + dbgprintf("%s: PROBE.\n", __func__); + if (buf_sz < sizeof(struct loongarch_image_header)) { + dbgprintf("%s: Not large enough to be a PE image.\n", __func__); + return -1; + } + if (!loongarch_header_check_pe_sig(h)) { + dbgprintf("%s: Not an PE image.\n", __func__); + return -1; + } + + if (buf_sz < sizeof(struct loongarch_image_header) + h->pe_header) { + dbgprintf("%s: PE image offset larger than image.\n", __func__); + return -1; + } + + if (memcmp(&buf[h->pe_header], + loongarch_pe_machtype, sizeof(loongarch_pe_machtype))) { + dbgprintf("%s: PE header doesn't match machine type.\n", __func__); + return -1; + } + + ret = pez_prepare(buf, buf_sz, &kernel_fd, &decompressed_size); +
Re: [PATCH v2 2/3] vmcore: allow fadump to export vmcore even if is_kdump_kernel() is false
On 09/11/23 at 05:13pm, Michael Ellerman wrote: > Hari Bathini writes: > > Currently, is_kdump_kernel() returns true when elfcorehdr_addr is set. > > While elfcorehdr_addr is set for kexec based kernel dump mechanism, > > alternate dump capturing methods like fadump [1] also set it to export > > the vmcore. Since, is_kdump_kernel() is used to restrict resources in > > crash dump capture kernel and such restrictions are not desirable for > > fadump, allow is_kdump_kernel() to be defined differently for fadump > > case. With that change, include is_fadump_active() check in functions > > is_vmcore_usable() & vmcore_unusable() to be able to export vmcore for > > fadump case too. > ... > > diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h > > index 0f3a656293b0..de8a9fabfb6f 100644 > > --- a/include/linux/crash_dump.h > > +++ b/include/linux/crash_dump.h > > @@ -50,6 +50,7 @@ void vmcore_cleanup(void); > > #define vmcore_elf64_check_arch(x) (elf_check_arch(x) || > > vmcore_elf_check_arch_cross(x)) > > #endif > > > > +#ifndef is_kdump_kernel > > /* > > * is_kdump_kernel() checks whether this kernel is booting after a panic of > > * previous kernel or not. This is determined by checking if previous > > kernel > > @@ -64,6 +65,19 @@ static inline bool is_kdump_kernel(void) > > { > > return elfcorehdr_addr != ELFCORE_ADDR_MAX; > > } > > +#endif > > + > > +#ifndef is_fadump_active > > +/* > > + * If f/w assisted dump capturing mechanism (fadump), instead of kexec > > based > > + * dump capturing mechanism (kdump) is exporting the vmcore, then this > > function > > + * will be defined in arch specific code to return true, when appropriate. > > + */ > > +static inline bool is_fadump_active(void) > > +{ > > + return false; > > +} > > +#endif > > > > /* is_vmcore_usable() checks if the kernel is booting after a panic and > > * the vmcore region is usable. 
> > @@ -75,7 +89,8 @@ static inline bool is_kdump_kernel(void) > > > > static inline int is_vmcore_usable(void) > > { > > - return is_kdump_kernel() && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0; > > + return (is_kdump_kernel() || is_fadump_active()) > > + && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0; > > } > > > > /* vmcore_unusable() marks the vmcore as unusable, > > @@ -84,7 +99,7 @@ static inline int is_vmcore_usable(void) > > > > static inline void vmcore_unusable(void) > > { > > - if (is_kdump_kernel()) > > + if (is_kdump_kernel() || is_fadump_active()) > > elfcorehdr_addr = ELFCORE_ADDR_ERR; > > } > > I think it would be cleaner to decouple is_vmcore_usable() and > vmcore_usable() from is_kdump_kernel(). > > ie, make them operate solely based on the value of elforehdr_addr: > > static inline int is_vmcore_usable(void) > { > elfcorehdr_addr != ELFCORE_ADDR_ERR && \ > elfcorehdr_addr != ELFCORE_ADDR_MAX; Agree. I fell into the blind corner of thinking earlier. Above change is better. > } > > static inline void vmcore_unusable(void) > { > elfcorehdr_addr = ELFCORE_ADDR_ERR; > } > > > Then all we need on powerpc is a way to override is_kdump_kernel(). > > cheers > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
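For reference, the decoupled helpers suggested in the quoted text can be written out as a compilable userspace sketch. The snippet quoted above omits the return statement, which is added here. The two sentinel values are what I believe include/linux/crash_dump.h defines, so treat them as an assumption, and the global variable merely stands in for the kernel's elfcorehdr_addr.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed sentinel values, mirroring include/linux/crash_dump.h. */
#define ELFCORE_ADDR_MAX (-1ULL)	/* no elfcorehdr was passed */
#define ELFCORE_ADDR_ERR (-2ULL)	/* vmcore marked unusable */

static_assert(ELFCORE_ADDR_ERR != ELFCORE_ADDR_MAX,
	      "the two sentinels must be distinguishable");

/* Stand-in for the kernel variable of the same name. */
static unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;

/* Unchanged: a dump kernel is one that was handed an elfcorehdr. */
static inline bool is_kdump_kernel(void)
{
	return elfcorehdr_addr != ELFCORE_ADDR_MAX;
}

/* Depends only on elfcorehdr_addr, not on is_kdump_kernel(). */
static inline bool is_vmcore_usable(void)
{
	return elfcorehdr_addr != ELFCORE_ADDR_ERR &&
	       elfcorehdr_addr != ELFCORE_ADDR_MAX;
}

/* Unconditionally mark the vmcore as unusable. */
static inline void vmcore_unusable(void)
{
	elfcorehdr_addr = ELFCORE_ADDR_ERR;
}
```

One consequence worth checking: with the unconditional store, calling vmcore_unusable() leaves elfcorehdr_addr different from ELFCORE_ADDR_MAX, so the default is_kdump_kernel() returns true afterwards. That is harmless if vmcore_unusable() is only reachable in a dump kernel, which this thread does not establish either way.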
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On Mon, Sep 11, 2023 at 11:50:31AM +0200, David Hildenbrand wrote: > On 11.09.23 11:27, Kirill A. Shutemov wrote: > > On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote: > > > On 11.09.23 10:41, Kirill A. Shutemov wrote: > > > > On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote: > > > > > On 06.09.23 09:39, Adrian Hunter wrote: > > > > > > Support for unaccepted memory was added recently, refer commit > > > > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby > > > > > > a virtual machine may need to accept memory before it can be used. > > > > > > > > > > > > Do not map unaccepted memory because it can cause the guest to fail. > > > > > > > > > > > > For /proc/vmcore, which is read-only, this means a read or mmap of > > > > > > unaccepted memory will return zeros. > > > > > > > > > > Does a second (kdump) kernel that exposes /proc/vmcore reliably get > > > > > access > > > > > to the information whether memory of the first kernel is unaccepted > > > > > (IOW, > > > > > not its memory, but the memory of the first kernel it is supposed to > > > > > expose > > > > > via /proc/vmcore)? > > > > > > > > There are few patches in my queue to few related issue, but generally, > > > > yes, the information is available to the target kernel via EFI > > > > configuration table. > > > > > > I assume that table provided by the first kernel, and not read directly > > > from > > > HW, correct? > > > > The table is constructed by the EFI stub in the first kernel based on EFI > > memory map. > > > > Okay, should work then once that's done by the first kernel. > > Maybe include this patch in your series? Can do. But the other two patches are not related to kexec. Hm. -- Kiryl Shutsemau / Kirill A. Shutemov ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On 11.09.23 11:27, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:
> > On 11.09.23 10:41, Kirill A. Shutemov wrote:
> > > On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> > > > On 06.09.23 09:39, Adrian Hunter wrote:
> > > > > Support for unaccepted memory was added recently, refer to commit
> > > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > > > a virtual machine may need to accept memory before it can be used.
> > > > >
> > > > > Do not map unaccepted memory because it can cause the guest to fail.
> > > > >
> > > > > For /proc/vmcore, which is read-only, this means a read or mmap of
> > > > > unaccepted memory will return zeros.
> > > >
> > > > Does a second (kdump) kernel that exposes /proc/vmcore reliably get
> > > > access to the information whether memory of the first kernel is
> > > > unaccepted (IOW, not its memory, but the memory of the first kernel
> > > > it is supposed to expose via /proc/vmcore)?
> > >
> > > There are a few patches in my queue to fix a few related issues, but
> > > generally, yes, the information is available to the target kernel via
> > > EFI configuration table.
> >
> > I assume that table is provided by the first kernel, and not read
> > directly from HW, correct?
>
> The table is constructed by the EFI stub in the first kernel based on the
> EFI memory map.
>

Okay, should work then once that's done by the first kernel.

Maybe include this patch in your series?

--
Cheers,

David / dhildenb
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:
> On 11.09.23 10:41, Kirill A. Shutemov wrote:
> > On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> > > On 06.09.23 09:39, Adrian Hunter wrote:
> > > > Support for unaccepted memory was added recently, refer to commit
> > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > > a virtual machine may need to accept memory before it can be used.
> > > >
> > > > Do not map unaccepted memory because it can cause the guest to fail.
> > > >
> > > > For /proc/vmcore, which is read-only, this means a read or mmap of
> > > > unaccepted memory will return zeros.
> > >
> > > Does a second (kdump) kernel that exposes /proc/vmcore reliably get
> > > access to the information whether memory of the first kernel is
> > > unaccepted (IOW, not its memory, but the memory of the first kernel
> > > it is supposed to expose via /proc/vmcore)?
> >
> > There are a few patches in my queue to fix a few related issues, but
> > generally, yes, the information is available to the target kernel via
> > EFI configuration table.
>
> I assume that table is provided by the first kernel, and not read directly
> from HW, correct?

The table is constructed by the EFI stub in the first kernel based on the
EFI memory map.

--
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On 11.09.23 10:41, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> > On 06.09.23 09:39, Adrian Hunter wrote:
> > > Support for unaccepted memory was added recently, refer to commit
> > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > a virtual machine may need to accept memory before it can be used.
> > >
> > > Do not map unaccepted memory because it can cause the guest to fail.
> > >
> > > For /proc/vmcore, which is read-only, this means a read or mmap of
> > > unaccepted memory will return zeros.
> >
> > Does a second (kdump) kernel that exposes /proc/vmcore reliably get
> > access to the information whether memory of the first kernel is
> > unaccepted (IOW, not its memory, but the memory of the first kernel
> > it is supposed to expose via /proc/vmcore)?
>
> There are a few patches in my queue to fix a few related issues, but
> generally, yes, the information is available to the target kernel via
> EFI configuration table.

I assume that table is provided by the first kernel, and not read directly
from HW, correct?

--
Cheers,

David / dhildenb
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> On 06.09.23 09:39, Adrian Hunter wrote:
> > Support for unaccepted memory was added recently, refer to commit
> > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > a virtual machine may need to accept memory before it can be used.
> >
> > Do not map unaccepted memory because it can cause the guest to fail.
> >
> > For /proc/vmcore, which is read-only, this means a read or mmap of
> > unaccepted memory will return zeros.
>
> Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
> to the information whether memory of the first kernel is unaccepted (IOW,
> not its memory, but the memory of the first kernel it is supposed to expose
> via /proc/vmcore)?

There are a few patches in my queue to fix a few related issues, but
generally, yes, the information is available to the target kernel via EFI
configuration table.

--
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCH 3/3] /dev/mem: Do not map unaccepted memory
On 07.09.23 16:46, Dave Hansen wrote:
> On 9/7/23 07:25, Kirill A. Shutemov wrote:
> > On Thu, Sep 07, 2023 at 07:15:21AM -0700, Dave Hansen wrote:
> > > On 9/6/23 00:39, Adrian Hunter wrote:
> > > > Support for unaccepted memory was added recently, refer to commit
> > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > > a virtual machine may need to accept memory before it can be used.
> > > >
> > > > Do not map unaccepted memory because it can cause the guest to fail.
> > >
> > > Doesn't /dev/mem already provide a billion ways for someone to shoot
> > > themselves in the foot? TDX seems to have added the 1,000,000,001st.
> > > Is this really worth patching?
> >
> > Is it better to let TD die silently? I don't think so.
>
> First, let's take a look at all of the distro kernels that folks will
> run under TDX. Do they have STRICT_DEVMEM set?

For virtio-mem, we do

config VIRTIO_MEM
	...
	depends on EXCLUSIVE_SYSTEM_RAM

Which in turn:

config EXCLUSIVE_SYSTEM_RAM
	...
	depends on !DEVMEM || STRICT_DEVMEM

Not supported on all archs, but at least on RHEL9 on x86_64 and aarch64.

So, making unaccepted memory similarly depend on "!DEVMEM ||
STRICT_DEVMEM" does not sound too far off ...

--
Cheers,

David / dhildenb
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On 06.09.23 09:39, Adrian Hunter wrote:
> Support for unaccepted memory was added recently, refer to commit
> dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> a virtual machine may need to accept memory before it can be used.
>
> Do not map unaccepted memory because it can cause the guest to fail.
>
> For /proc/vmcore, which is read-only, this means a read or mmap of
> unaccepted memory will return zeros.

Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
to the information whether memory of the first kernel is unaccepted (IOW,
not its memory, but the memory of the first kernel it is supposed to expose
via /proc/vmcore)?

I recall there might be other kdump-related issues for TDX and friends to
solve. In particular, which information the second kernel gets provided by
the first kernel.

So can this patch even be tested reasonably (IOW, get into a kdump kernel in
an environment where the first kernel has unaccepted memory, and verify that
unaccepted memory is handled accordingly? ... while kdump is doing anything
reasonable in such an environment at all?)

--
Cheers,

David / dhildenb
Re: [PATCH v2 2/3] vmcore: allow fadump to export vmcore even if is_kdump_kernel() is false
Hari Bathini writes:
> Currently, is_kdump_kernel() returns true when elfcorehdr_addr is set.
> While elfcorehdr_addr is set for kexec based kernel dump mechanism,
> alternate dump capturing methods like fadump [1] also set it to export
> the vmcore. Since is_kdump_kernel() is used to restrict resources in
> crash dump capture kernel and such restrictions are not desirable for
> fadump, allow is_kdump_kernel() to be defined differently for fadump
> case. With that change, include is_fadump_active() check in functions
> is_vmcore_usable() & vmcore_unusable() to be able to export vmcore for
> fadump case too.
...
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index 0f3a656293b0..de8a9fabfb6f 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -50,6 +50,7 @@ void vmcore_cleanup(void);
>  #define vmcore_elf64_check_arch(x) (elf_check_arch(x) || vmcore_elf_check_arch_cross(x))
>  #endif
>
> +#ifndef is_kdump_kernel
>  /*
>   * is_kdump_kernel() checks whether this kernel is booting after a panic of
>   * previous kernel or not. This is determined by checking if previous kernel
> @@ -64,6 +65,19 @@ static inline bool is_kdump_kernel(void)
>  {
>  	return elfcorehdr_addr != ELFCORE_ADDR_MAX;
>  }
> +#endif
> +
> +#ifndef is_fadump_active
> +/*
> + * If f/w assisted dump capturing mechanism (fadump), instead of kexec based
> + * dump capturing mechanism (kdump) is exporting the vmcore, then this function
> + * will be defined in arch specific code to return true, when appropriate.
> + */
> +static inline bool is_fadump_active(void)
> +{
> +	return false;
> +}
> +#endif
>
>  /* is_vmcore_usable() checks if the kernel is booting after a panic and
>   * the vmcore region is usable.
> @@ -75,7 +89,8 @@ static inline bool is_kdump_kernel(void)
>
>  static inline int is_vmcore_usable(void)
>  {
> -	return is_kdump_kernel() && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0;
> +	return (is_kdump_kernel() || is_fadump_active())
> +		&& elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0;
>  }
>
>  /* vmcore_unusable() marks the vmcore as unusable,
> @@ -84,7 +99,7 @@ static inline int is_vmcore_usable(void)
>
>  static inline void vmcore_unusable(void)
>  {
> -	if (is_kdump_kernel())
> +	if (is_kdump_kernel() || is_fadump_active())
>  		elfcorehdr_addr = ELFCORE_ADDR_ERR;
>  }

I think it would be cleaner to decouple is_vmcore_usable() and
vmcore_unusable() from is_kdump_kernel().

i.e., make them operate solely based on the value of elfcorehdr_addr:

static inline int is_vmcore_usable(void)
{
	return elfcorehdr_addr != ELFCORE_ADDR_ERR &&
		elfcorehdr_addr != ELFCORE_ADDR_MAX;
}

static inline void vmcore_unusable(void)
{
	elfcorehdr_addr = ELFCORE_ADDR_ERR;
}


Then all we need on powerpc is a way to override is_kdump_kernel().

cheers
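The decoupling suggested above can be tried outside the kernel with a small stand-alone C sketch. It is only an illustration under stated assumptions: the two sentinel values are assumed to match ELFCORE_ADDR_MAX and ELFCORE_ADDR_ERR in include/linux/crash_dump.h, and the file-local elfcorehdr_addr variable is a stand-in for the kernel global of the same name. It is not the code that was eventually merged.

```c
#include <stdbool.h>

/* Assumed to mirror include/linux/crash_dump.h. */
#define ELFCORE_ADDR_MAX (-1ULL) /* no elfcorehdr was passed to this kernel */
#define ELFCORE_ADDR_ERR (-2ULL) /* vmcore has been marked unusable */

/* Stand-in for the kernel global; starts out as "no elfcorehdr". */
static unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;

/* Generic definition; an architecture could override this one. */
static inline bool is_kdump_kernel(void)
{
	return elfcorehdr_addr != ELFCORE_ADDR_MAX;
}

/* Usable only if an address was set and not flagged as an error. */
static inline int is_vmcore_usable(void)
{
	return elfcorehdr_addr != ELFCORE_ADDR_ERR &&
		elfcorehdr_addr != ELFCORE_ADDR_MAX;
}

/* Unconditionally mark the vmcore as unusable. */
static inline void vmcore_unusable(void)
{
	elfcorehdr_addr = ELFCORE_ADDR_ERR;
}
```

With this shape neither vmcore helper calls is_kdump_kernel(), so an architecture-specific is_kdump_kernel() would not change whether the vmcore is exported. One thing to note in this sketch: once vmcore_unusable() has run, the generic is_kdump_kernel() above returns true, because ELFCORE_ADDR_ERR differs from ELFCORE_ADDR_MAX.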
Re: kexec reboot failed due to commit 75d090fd167ac
Adding the kexec list in Cc.

On Sat, 9 Sept 2023 at 19:34, Kirill A. Shutemov wrote:
>
> On Fri, Sep 08, 2023 at 06:17:53PM +0200, Ard Biesheuvel wrote:
> > On Fri, Sep 8, 2023 at 5:58 PM Kees Cook wrote:
> > >
> > > On Fri, Sep 08, 2023 at 03:32:33PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Sep 08, 2023 at 02:02:30PM +0800, Aaron Lu wrote:
> > > > > On Thu, Sep 07, 2023 at 04:14:09PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Tue, Aug 29, 2023 at 10:04:51PM +0800, Aaron Lu wrote:
> > > > > > > > Could you show dmesg of the first kernel before kexec?
> > > > > > >
> > > > > > > Attached.
> > > > > > >
> > > > > > > BTW, kexec is invoked like this:
> > > > > > > kver=6.4.0-rc5-9-g75d090fd167a
> > > > > > > kdir=$HOME/kernels/$kver
> > > > > > > sudo kexec -l $kdir/vmlinuz-$kver
> > > > > > > --initrd=$kdir/initramfs-$kver.img
> > > > > > > --append="root=UUID=4381321e-e01e-455a-9d46-5e8c4c5b2d02 ro
> > > > > > > net.ifnames=0 acpi_rsdp=0x728e8014 no_hash_pointers sched_verbose
> > > > > > > selinux=0"
> > > > > >
> > > > > > I don't understand why it happens.
> > > > > >
> > > > > > Could you check if this patch changes anything:
> > > > > >
> > > > > > diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> > > > > > index 94b7abcf624b..172c476ff6f3 100644
> > > > > > --- a/arch/x86/boot/compressed/misc.c
> > > > > > +++ b/arch/x86/boot/compressed/misc.c
> > > > > > @@ -456,10 +456,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
> > > > > >
> > > > > >  	debug_putstr("\nDecompressing Linux... ");
> > > > > >
> > > > > > +#if 0
> > > > > >  	if (init_unaccepted_memory()) {
> > > > > >  		debug_putstr("Accepting memory... ");
> > > > > >  		accept_memory(__pa(output), __pa(output) + needed_size);
> > > > > >  	}
> > > > > > +#endif
> > > > > >
> > > > > >  	__decompress(input_data, input_len, NULL, NULL, output, output_len,
> > > > > >  			NULL, error);
> > > > > > --
> > > > >
> > > > > It solved the problem.
> > > >
> > > > Looks like increasing BOOT_INIT_PGT_SIZE fixes the issue. I don't yet
> > > > understand why and how unaccepted memory is involved. I will look more
> > > > into it.
> > > >
> > > > Enabling CONFIG_RANDOMIZE_BASE also makes the issue go away.
> > >
> > > Is this perhaps just luck? I.e. does it ever break on, say, 1000 boot
> > > attempts? (i.e. maybe some position is bad and KASLR happens to usually
> > > avoid it?)
>
> Yes, it can be luck.
>
> > > > Kees, maybe you have a clue?
> > >
> > > The only thing I can think of is that something isn't being counted
> > > correctly due to the size of code, and it just happens that this commit
> > > makes the code large enough to exceed some set of mappings?
> > >
> > > >
> > > > diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
> > > > index 9191280d9ea3..26ccce41d781 100644
> > > > --- a/arch/x86/include/asm/boot.h
> > > > +++ b/arch/x86/include/asm/boot.h
> > > > @@ -40,7 +40,7 @@
> > > >  #ifdef CONFIG_X86_64
> > > >  # define BOOT_STACK_SIZE 0x4000
> > > >
> > > > -# define BOOT_INIT_PGT_SIZE (6*4096)
> > > > +# define BOOT_INIT_PGT_SIZE (7*4096)
> > >
> > > That's why this might be working, for example? How large is the boot
> > > image before/after the commit, etc?
> > >
> > Not sure why these changes would make a difference here, but choking
> > on accept_memory() on a non-TDX system suggests that
> > init_unaccepted_memory() is poking into unmapped memory before it even
> > decides that the unaccepted memory does not exist.
> >
> > init_unaccepted_memory() has
> >
> > 	ret = efi_get_conf_table(boot_params, &cfg_table_pa,
> > 				 &cfg_table_len);
> > 	if (ret) {
> > 		warn("EFI config table not found.");
> > 		return false;
> > 	}
> >
> > which looks for tuples in an array pointed to by the
> > EFI system table, and if either of those is not mapped, things can be
> > expected to explode.
> >
> > The only odd thing there is that this code is invoked after setting up
> > the 'demand paging' logic in the decompressor.
> >
> > If you haven't yet, could you please retry the kexec boot with
> > earlyprintk=tty?
>
> early console in extract_kernel
> input_data: 0x00807eb433a8
> input_len: 0x00d26271
> output: 0x00807b00
> output_len: 0x04800c10
> kernel_total_size: 0x03e28000
> needed_size: 0x04a0
> trampoline_32bit: 0x0009d000
>
> Decompressing Linux... out of pgt_buf in
> arch/x86/boot/compressed/ident_map_64.c!?
> pages->pgt_buf_offset: 0x6000
> pages->pgt_buf_size: 0x6000
>
>
> Error: kernel_ident_mapping_init() failed
>
> It crashes on #PF due to stbl->nr_tables dereference in
> efi_get_conf_table() called from init_unaccepted_memory().
>
> I don't see anything special about stbl location: 0x775d6018.
>
> One other bit of information: disabling 5-level paging