Re: [PATCH v2 0/5] Introduce new wrappers to copy user-arrays

2023-09-11 Thread Kees Cook
On September 11, 2023 6:55:32 PM PDT, Dave Airlie  wrote:
>On Tue, 12 Sept 2023 at 11:27, Kees Cook  wrote:
>>
>> On September 8, 2023 12:59:39 PM PDT, Philipp Stanner  
>> wrote:
>> >Hi!
>> >
>> >David Airlie suggested that we could implement new wrappers around
>> >(v)memdup_user() for duplicating user arrays.
>> >
>> >This small patch series first implements the two new wrapper functions
>> >memdup_array_user() and vmemdup_array_user(). They calculate the
>> >array-sizes safely, i.e., they return an error in case of an overflow.
>> >
>> >It then implements the new wrappers in two components in kernel/ and two
>> >in the drm-subsystem.
>> >
>> >In total, there are 18 files in the kernel that use (v)memdup_user() to
>> >duplicate arrays. My plan is to provide patches for the other 14
>> >successively once this series has been merged.
>> >
>> >
>> >Changes since v1:
>> >- Insert new headers alphabetically ordered
>> >- Remove empty lines in functions' docstrings
>> >- Return -EOVERFLOW instead of -EINVAL from wrapper functions
>> >
>> >
>> >@Andy:
>> >I test-build it for UM on my x86_64. Builds successfully.
>> >A kernel build (localmodconfig) for my Fedora38 @ x86_64 does also boot
>> >fine.
>> >
>> >If there is more I can do to verify the early boot stages are fine,
>> >please let me know!
>> >
>> >P.
>> >
>> >Philipp Stanner (5):
>> >  string.h: add array-wrappers for (v)memdup_user()
>> >  kernel: kexec: copy user-array safely
>> >  kernel: watch_queue: copy user-array safely
>> >  drm_lease.c: copy user-array safely
>> >  drm: vmgfx_surface.c: copy user-array safely
>> >
>> > drivers/gpu/drm/drm_lease.c |  4 +--
>> > drivers/gpu/drm/vmwgfx/vmwgfx_surface.c |  4 +--
>> > include/linux/string.h  | 40 +
>> > kernel/kexec.c  |  2 +-
>> > kernel/watch_queue.c|  2 +-
>> > 5 files changed, 46 insertions(+), 6 deletions(-)
>> >
>>
>> Nice. For the series:
>>
>> Reviewed-by: Kees Cook 
>
>Hey Kees,
>
>what tree do you think it would be best to land this through? I'm happy
>to send the initial set from a drm branch, but also happy to have it
>land via someone with a better process.

Feel free to take it via drm. Usually string.h doesn't get a lot of changes 
(and even then it's normally additive) so conflicts are rare/easy. :)

-Kees


-- 
Kees Cook

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 0/5] Introduce new wrappers to copy user-arrays

2023-09-11 Thread Dave Airlie
On Tue, 12 Sept 2023 at 11:27, Kees Cook  wrote:
>
> On September 8, 2023 12:59:39 PM PDT, Philipp Stanner  
> wrote:
> >Hi!
> >
> >David Airlie suggested that we could implement new wrappers around
> >(v)memdup_user() for duplicating user arrays.
> >
> >This small patch series first implements the two new wrapper functions
> >memdup_array_user() and vmemdup_array_user(). They calculate the
> >array-sizes safely, i.e., they return an error in case of an overflow.
> >
> >It then implements the new wrappers in two components in kernel/ and two
> >in the drm-subsystem.
> >
> >In total, there are 18 files in the kernel that use (v)memdup_user() to
> >duplicate arrays. My plan is to provide patches for the other 14
> >successively once this series has been merged.
> >
> >
> >Changes since v1:
> >- Insert new headers alphabetically ordered
> >- Remove empty lines in functions' docstrings
> >- Return -EOVERFLOW instead of -EINVAL from wrapper functions
> >
> >
> >@Andy:
> >I test-build it for UM on my x86_64. Builds successfully.
> >A kernel build (localmodconfig) for my Fedora38 @ x86_64 does also boot
> >fine.
> >
> >If there is more I can do to verify the early boot stages are fine,
> >please let me know!
> >
> >P.
> >
> >Philipp Stanner (5):
> >  string.h: add array-wrappers for (v)memdup_user()
> >  kernel: kexec: copy user-array safely
> >  kernel: watch_queue: copy user-array safely
> >  drm_lease.c: copy user-array safely
> >  drm: vmgfx_surface.c: copy user-array safely
> >
> > drivers/gpu/drm/drm_lease.c |  4 +--
> > drivers/gpu/drm/vmwgfx/vmwgfx_surface.c |  4 +--
> > include/linux/string.h  | 40 +
> > kernel/kexec.c  |  2 +-
> > kernel/watch_queue.c|  2 +-
> > 5 files changed, 46 insertions(+), 6 deletions(-)
> >
>
> Nice. For the series:
>
> Reviewed-by: Kees Cook 

Hey Kees,

what tree do you think it would be best to land this through? I'm happy
to send the initial set from a drm branch, but also happy to have it
land via someone with a better process.

Dave.



Re: [PATCH v2 0/5] Introduce new wrappers to copy user-arrays

2023-09-11 Thread Zack Rusin
On Fri, 2023-09-08 at 21:59 +0200, Philipp Stanner wrote:
> Hi!
> 
> David Airlie suggested that we could implement new wrappers around
> (v)memdup_user() for duplicating user arrays.
> 
> This small patch series first implements the two new wrapper functions
> memdup_array_user() and vmemdup_array_user(). They calculate the
> array-sizes safely, i.e., they return an error in case of an overflow.
> 
> It then implements the new wrappers in two components in kernel/ and two
> in the drm-subsystem.
> 
> In total, there are 18 files in the kernel that use (v)memdup_user() to
> duplicate arrays. My plan is to provide patches for the other 14
> successively once this series has been merged.
> 
> 
> Changes since v1:
> - Insert new headers alphabetically ordered
> - Remove empty lines in functions' docstrings
> - Return -EOVERFLOW instead of -EINVAL from wrapper functions
> 
> 
> @Andy:
> I test-build it for UM on my x86_64. Builds successfully.
> A kernel build (localmodconfig) for my Fedora38 @ x86_64 does also boot
> fine.
> 
> If there is more I can do to verify the early boot stages are fine,
> please let me know!
> 
> P.
> 
> Philipp Stanner (5):
>   string.h: add array-wrappers for (v)memdup_user()
>   kernel: kexec: copy user-array safely
>   kernel: watch_queue: copy user-array safely
>   drm_lease.c: copy user-array safely
>   drm: vmgfx_surface.c: copy user-array safely
> 
>  drivers/gpu/drm/drm_lease.c |  4 +--
>  drivers/gpu/drm/vmwgfx/vmwgfx_surface.c |  4 +--
>  include/linux/string.h  | 40 +
>  kernel/kexec.c  |  2 +-
>  kernel/watch_queue.c    |  2 +-
>  5 files changed, 46 insertions(+), 6 deletions(-)
> 

Series, and in particular the vmwgfx changes, look good to me.

Reviewed-by: Zack Rusin 
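
The overflow check described in the cover letter can be illustrated with a
small userspace sketch. This is not the code from the series: dup_array() is
a hypothetical stand-in for memdup_array_user(), malloc()/memcpy() stand in
for the user-copy step, and __builtin_mul_overflow() is the GCC/Clang builtin
(the kernel has its own check_mul_overflow() helper for this purpose). The
point is only that n * elem_size is validated before anything is allocated,
and that an overflow is reported as -EOVERFLOW.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for an array-duplicating wrapper:
 * copies n elements of elem_size bytes each from src.
 * Returns 0 and stores the copy in *out, or a negative errno value.
 */
int dup_array(const void *src, size_t n, size_t elem_size, void **out)
{
	size_t nbytes;
	void *copy;

	/* Reject n * elem_size if it does not fit in size_t. */
	if (__builtin_mul_overflow(n, elem_size, &nbytes))
		return -EOVERFLOW;

	/* malloc(0) may return NULL, so always request at least one byte. */
	copy = malloc(nbytes ? nbytes : 1);
	if (!copy)
		return -ENOMEM;

	if (nbytes)
		memcpy(copy, src, nbytes);
	*out = copy;
	return 0;
}
```

Without the check, a caller passing a huge element count would get a wrapped
(too small) size, and later indexing into the copy would run past the
allocation.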


Re: [PATCH v2 0/5] Introduce new wrappers to copy user-arrays

2023-09-11 Thread Kees Cook
On September 8, 2023 12:59:39 PM PDT, Philipp Stanner  
wrote:
>Hi!
>
>David Airlie suggested that we could implement new wrappers around
>(v)memdup_user() for duplicating user arrays.
>
>This small patch series first implements the two new wrapper functions
>memdup_array_user() and vmemdup_array_user(). They calculate the
>array-sizes safely, i.e., they return an error in case of an overflow.
>
>It then implements the new wrappers in two components in kernel/ and two
>in the drm-subsystem.
>
>In total, there are 18 files in the kernel that use (v)memdup_user() to
>duplicate arrays. My plan is to provide patches for the other 14
>successively once this series has been merged.
>
>
>Changes since v1:
>- Insert new headers alphabetically ordered
>- Remove empty lines in functions' docstrings
>- Return -EOVERFLOW instead of -EINVAL from wrapper functions
>
>
>@Andy:
>I test-build it for UM on my x86_64. Builds successfully.
>A kernel build (localmodconfig) for my Fedora38 @ x86_64 does also boot
>fine.
>
>If there is more I can do to verify the early boot stages are fine,
>please let me know!
>
>P.
>
>Philipp Stanner (5):
>  string.h: add array-wrappers for (v)memdup_user()
>  kernel: kexec: copy user-array safely
>  kernel: watch_queue: copy user-array safely
>  drm_lease.c: copy user-array safely
>  drm: vmgfx_surface.c: copy user-array safely
>
> drivers/gpu/drm/drm_lease.c |  4 +--
> drivers/gpu/drm/vmwgfx/vmwgfx_surface.c |  4 +--
> include/linux/string.h  | 40 +
> kernel/kexec.c  |  2 +-
> kernel/watch_queue.c|  2 +-
> 5 files changed, 46 insertions(+), 6 deletions(-)
>

Nice. For the series:

Reviewed-by: Kees Cook 



-- 
Kees Cook



Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-11 Thread Baoquan He
Add Philipp to CC as he is also investigating UKI

On 09/11/23 at 07:25am, Jan Hendrik Farr wrote:
> Hello,
> 
> this patch (v2) implements UKI support for kexec_file_load. It will require
> support in the kexec-tools userspace utility. For testing purposes the
> following can be used: https://github.com/Cydox/kexec-test/
> 
> Creating UKIs for testing can be done with ukify (included in systemd),
> sbctl, and mkinitcpio, etc.

This is awesome work, Jan, thanks.

By the way, could you provide detailed steps about how to test this
patchset so that people interested can give it a shot?

> 
> There has been discussion on this topic in an issue on GitHub that is linked
> below for reference.
> 
> Changes for v2:
> - .cmdline section is now optional
> - moving pefile_parse_binary is now in a separate commit for clarity
> - parse_pefile.c is now in /lib instead of arch/x86/kernel (not sure if
>   this is the best location, but it definitely shouldn't have been in an
>   architecture specific location)
> - parse_pefile.h is now in include/kernel instead of architecture
>   specific location
> - if initrd or cmdline is manually supplied EPERM is returned instead of
>   being silently ignored
> - formatting tweaks
> 
> 
> Some links:
> - Related discussion: https://github.com/systemd/systemd/issues/28538
> - Documentation of UKIs: 
> https://uapi-group.org/specifications/specs/unified_kernel_image/
> 
> Jan Hendrik Farr (2):
>   move pefile_parse_binary to its own file
>   x86/kexec: UKI support
> 
>  arch/x86/include/asm/kexec-uki.h   |   7 ++
>  arch/x86/kernel/Makefile   |   1 +
>  arch/x86/kernel/kexec-uki.c| 126 +
>  arch/x86/kernel/machine_kexec_64.c |   2 +
>  crypto/asymmetric_keys/mscode_parser.c |   2 +-
>  crypto/asymmetric_keys/verify_pefile.c | 110 +++--
>  crypto/asymmetric_keys/verify_pefile.h |  16 
>  include/linux/parse_pefile.h   |  32 +++
>  lib/Makefile   |   3 +
>  lib/parse_pefile.c | 109 +
>  10 files changed, 292 insertions(+), 116 deletions(-)
>  create mode 100644 arch/x86/include/asm/kexec-uki.h
>  create mode 100644 arch/x86/kernel/kexec-uki.c
>  create mode 100644 include/linux/parse_pefile.h
>  create mode 100644 lib/parse_pefile.c
> 
> -- 
> 2.40.1
> 




Re: [systemd-devel] [PATCH 0/1] x86/kexec: UKI support

2023-09-11 Thread Neal Gompa
On Mon, Sep 11, 2023 at 7:15 PM Jarkko Sakkinen  wrote:
>
> On Sat Sep 9, 2023 at 7:18 PM EEST, Jan Hendrik Farr wrote:
> > Hello,
> >
> > this patch implements UKI support for kexec_file_load. It will require 
> > support
> > in the kexec-tools userspace utility. For testing purposes the following 
> > can be used:
> > https://github.com/Cydox/kexec-test/
> >
> > There has been discussion on this topic in an issue on GitHub that is 
> > linked below
> > for reference.
> >
> >
> > Some links:
> > - Related discussion: https://github.com/systemd/systemd/issues/28538
> > - Documentation of UKIs: 
> > https://uapi-group.org/specifications/specs/unified_kernel_image/
> >
> > Jan Hendrik Farr (1):
> >   x86/kexec: UKI support
> >
> >  arch/x86/include/asm/kexec-uki.h   |   7 ++
> >  arch/x86/include/asm/parse_pefile.h|  32 +++
> >  arch/x86/kernel/Makefile   |   2 +
> >  arch/x86/kernel/kexec-uki.c| 113 +
> >  arch/x86/kernel/machine_kexec_64.c |   2 +
> >  arch/x86/kernel/parse_pefile.c | 110 
> >  crypto/asymmetric_keys/mscode_parser.c |   2 +-
> >  crypto/asymmetric_keys/verify_pefile.c | 110 +++-
> >  crypto/asymmetric_keys/verify_pefile.h |  16 
> >  9 files changed, 278 insertions(+), 116 deletions(-)
> >  create mode 100644 arch/x86/include/asm/kexec-uki.h
> >  create mode 100644 arch/x86/include/asm/parse_pefile.h
> >  create mode 100644 arch/x86/kernel/kexec-uki.c
> >  create mode 100644 arch/x86/kernel/parse_pefile.c
> >
> > --
> > 2.40.1
>
> What the heck is UKI?

Unified Kernel Images. More details available here:
https://uapi-group.org/specifications/specs/unified_kernel_image/

It's a way of creating initramfs-style images as fully generic,
reproducible images that can be built server-side.

It is a requirement for creating locked down Linux devices for
appliances that can be tamper-resistant too.




--
真実はいつも一つ!/ Always, there's only one truth!



Re: [PATCH 0/1] x86/kexec: UKI support

2023-09-11 Thread Jan Hendrik Farr
> What the heck is UKI?

UKI (Unified Kernel Image) is the kernel image + initrd + cmdline (+ some other 
optional stuff) all packaged up together as one EFI application.

This EFI application can then be launched directly by the UEFI without the need 
for any additional stuff (or by systemd-boot). It's all self contained. One 
benefit is that this is a convenient way to distribute kernels all in one file. 
Another benefit is that the whole combination of kernel image, initrd, and 
cmdline can all be signed together so only that particular combination can be 
executed if you are using secure boot.

The format itself is rather simple. It's just a PE file (as required by the 
UEFI spec) that contains a small stub application in the .text, .data, etc 
sections that is responsible for invoking the contained kernel and initrd with 
the contained cmdline. The kernel image is placed into a .kernel section, the 
initrd into a .initrd section, and the cmdline into a .cmdline section in the 
PE executable.

If we want to kexec a UKI we could obviously just have userspace pick it apart 
and kexec it like normal. However in lockdown mode this will only work if you 
sign the kernel image that is contained inside the UKI. The problem with that 
is that anybody can then grab that signed kernel and launch it with any initrd 
or cmdline. So instead this patch makes the kernel do the work instead. The 
kernel verifies the signature on the entire UKI and then passes its components 
on to the normal kexec bzimage loader.

Useful Links:
UKI format documentation: 
https://uapi-group.org/specifications/specs/unified_kernel_image/
Arch wiki: https://wiki.archlinux.org/title/Unified_kernel_image
Fedora UKI support: 
https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
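
To make the PE layout described above a bit more concrete, here is a rough
userspace sketch of how one could locate a named section such as .kernel,
.initrd or .cmdline inside a UKI. It is not the code from the patch:
find_pe_section() and struct pe_section are made-up names for illustration,
and the sketch only walks the COFF section table. It does no signature
verification, which is the part the kernel has to do for lockdown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type: where a section's bytes live in the file. */
struct pe_section {
	uint32_t offset; /* PointerToRawData */
	uint32_t size;   /* SizeOfRawData */
};

/* PE headers are little-endian; read fields byte by byte. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 and fills *out if a section called name exists, -1 otherwise. */
int find_pe_section(const uint8_t *img, size_t len, const char *name,
		    struct pe_section *out)
{
	size_t pe, nsect, opt_size, table, i;

	/* Section names are at most 8 bytes; the DOS header starts with "MZ". */
	if (strlen(name) > 8 || len < 0x40 || img[0] != 'M' || img[1] != 'Z')
		return -1;

	/* e_lfanew at offset 0x3c points at the "PE\0\0" signature. */
	pe = rd32(img + 0x3c);
	if (pe > len || len - pe < 24 || memcmp(img + pe, "PE\0\0", 4) != 0)
		return -1;

	/* 20-byte COFF header follows the 4-byte signature. */
	nsect = rd16(img + pe + 4 + 2);     /* NumberOfSections */
	opt_size = rd16(img + pe + 4 + 16); /* SizeOfOptionalHeader */

	/* The section table starts right after the optional header. */
	table = pe + 24 + opt_size;
	if (table > len || (len - table) / 40 < nsect)
		return -1;

	for (i = 0; i < nsect; i++) {
		const uint8_t *sh = img + table + i * 40;
		uint32_t size, off;

		/* Name is 8 bytes, NUL-padded, not necessarily terminated. */
		if (strncmp((const char *)sh, name, 8) != 0)
			continue;

		size = rd32(sh + 16);
		off = rd32(sh + 20);
		if (off > len || len - off < size)
			return -1; /* section body lies outside the file */

		out->offset = off;
		out->size = size;
		return 0;
	}
	return -1;
}
```

A loader would call this three times, for ".kernel", ".initrd" and
".cmdline", and hand the resulting byte ranges to the regular bzImage path.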



Re: [PATCH 0/1] x86/kexec: UKI support

2023-09-11 Thread Jarkko Sakkinen
On Sat Sep 9, 2023 at 7:18 PM EEST, Jan Hendrik Farr wrote:
> Hello,
>
> this patch implements UKI support for kexec_file_load. It will require support
> in the kexec-tools userspace utility. For testing purposes the following can 
> be used:
> https://github.com/Cydox/kexec-test/
>
> There has been discussion on this topic in an issue on GitHub that is linked 
> below
> for reference.
>
>
> Some links:
> - Related discussion: https://github.com/systemd/systemd/issues/28538
> - Documentation of UKIs: 
> https://uapi-group.org/specifications/specs/unified_kernel_image/
>
> Jan Hendrik Farr (1):
>   x86/kexec: UKI support
>
>  arch/x86/include/asm/kexec-uki.h   |   7 ++
>  arch/x86/include/asm/parse_pefile.h|  32 +++
>  arch/x86/kernel/Makefile   |   2 +
>  arch/x86/kernel/kexec-uki.c| 113 +
>  arch/x86/kernel/machine_kexec_64.c |   2 +
>  arch/x86/kernel/parse_pefile.c | 110 
>  crypto/asymmetric_keys/mscode_parser.c |   2 +-
>  crypto/asymmetric_keys/verify_pefile.c | 110 +++-
>  crypto/asymmetric_keys/verify_pefile.h |  16 
>  9 files changed, 278 insertions(+), 116 deletions(-)
>  create mode 100644 arch/x86/include/asm/kexec-uki.h
>  create mode 100644 arch/x86/include/asm/parse_pefile.h
>  create mode 100644 arch/x86/kernel/kexec-uki.c
>  create mode 100644 arch/x86/kernel/parse_pefile.c
>
> -- 
> 2.40.1

What the heck is UKI?

BR, Jarkko



Re: kexec reboot failed due to commit 75d090fd167ac

2023-09-11 Thread Tom Lendacky

On 9/11/23 10:53, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 10:33:01AM -0500, Tom Lendacky wrote:
> > On 9/11/23 09:57, Kirill A. Shutemov wrote:
> > > On Mon, Sep 11, 2023 at 10:56:36PM +0800, Dave Young wrote:
> > > > > early console in extract_kernel
> > > > > input_data: 0x00807eb433a8
> > > > > input_len: 0x00d26271
> > > > > output: 0x00807b00
> > > > > output_len: 0x04800c10
> > > > > kernel_total_size: 0x03e28000
> > > > > needed_size: 0x04a0
> > > > > trampoline_32bit: 0x0009d000
> > > > >
> > > > > Decompressing Linux... out of pgt_buf in
> > > > > arch/x86/boot/compressed/ident_map_64.c!?
> > > > > pages->pgt_buf_offset: 0x6000
> > > > > pages->pgt_buf_size: 0x6000
> > > > >
> > > > >
> > > > > Error: kernel_ident_mapping_init() failed
> > > > >
> > > > > It crashes on #PF due to stbl->nr_tables dereference in
> > > > > efi_get_conf_table() called from init_unaccepted_memory().
> > > > >
> > > > > I don't see anything special about stbl location: 0x775d6018.
> > > > >
> > > > > One other bit of information: disabling 5-level paging also helps the
> > > > > issue.
> > > > >
> > > > > I will debug further.
> > >
> > > The problem is not limited to unaccepted memory, it also triggers if we
> > > reach efi_get_rsdp_addr() in the same setup.
> > >
> > > I think we have several problems here.
> > >
> > > - 6 pages for !RANDOMIZE_BASE is only enough for kernel, cmdline,
> > >   boot_data and setup_data if we assume that they are in different 1G
> > >   regions and do not cross the 1G boundaries. 4-level paging: 1 for PGD, 1
> > >   for PUD, 4 for PMD tables.
> > >
> > >   Looks like we never map EFI/ACPI memory explicitly.
> > >
> > >   It might work if kernel/cmdline/... are in single 1G and we have
> > >   spare pages to handle page faults.
> > >
> > > - No spare memory to handle mapping for cc_info and cc_info->cpuid_phys;
> > >
> > > - I didn't increase BOOT_INIT_PGT_SIZE when added 5-level paging support.
> > >   And if start pagetables from scratch ('else' case of 'if (p4d_offset...))
> > >   we run out of memory.
> > >
> > > I believe similar logic would apply for BOOT_PGT_SIZE for RANDOMIZE_BASE=y
> > > case.
> > >
> > > I don't know what the right fix here. We can increase the constants to be
> > > enough to cover existing cases, but it is very fragile. I am not sure I
> > > saw all users. Some of them could silently handled with pagefault handler
> > > in some setups. And it is hard to catch new users during code review.
> > >
> > > Also I'm not sure why do we need pagefault handler there. Looks like it
> > > just masking problems. I think everything has to be mapped explicitly.
> > >
> > > Any comments?
> >
> > There was a similar related issue around the cc_info blob that is captured
> > here: https://lore.kernel.org/lkml/20230601072043.24439-1-l...@redhat.com/
> >
> > Personally, I'm a fan of mapping the EFI tables that will be passed to the
> > kexec/kdump kernel. To me, that seems to more closely match the valid
> > mappings for the tables when control is transferred to the OS from UEFI on
> > the initial boot.
>
> I don't see how it would help if initialize_identity_maps() resets
> pagetables. See 'else' case of 'if (p4d_offset...).


Ok, I see what you mean now.

Thanks,
Tom







Re: [PATCH v2] x86/purgatory: Remove LTO flags

2023-09-11 Thread Song Liu
On Mon, Sep 11, 2023 at 9:00 AM Nick Desaulniers
 wrote:
>
> On Fri, Sep 8, 2023 at 4:13 PM Song Liu  wrote:
> >
> > With LTO enabled, ld.lld generates multiple .text sections for
> > purgatory.ro:
> >
> > $ readelf -S purgatory.ro  | grep " .text"
> >   [ 1] .text PROGBITS   0040
> >   [ 7] .text.purgatory   PROGBITS   20e0
> >   [ 9] .text.warnPROGBITS   21c0
> >   [13] .text.sha256_upda PROGBITS   22f0
> >   [15] .text.sha224_upda PROGBITS   2be0
> >   [17] .text.sha256_fina PROGBITS   2bf0
> >   [19] .text.sha224_fina PROGBITS   2cc0
> >
> > This causes a WARNING from kexec_purgatory_setup_sechdrs():
> >
> > WARNING: CPU: 26 PID: 110894 at kernel/kexec_file.c:919
> > kexec_load_purgatory+0x37f/0x390
> >
> > Fix this by disabling LTO for purgatory.
>
> Thanks for the v2!
>
> >
> > Fixes: 8652d44f466a ("kexec: support purgatories with .text.hot sections")
>
> Dunno that this fixes tag is precise.  I think perhaps
>
> Fixes: b33fff07e3e3 ("x86, build: allow LTO to be selected")

Thanks for the correction!

>
> would be more precise.
>
> > Cc: Ricardo Ribalda 
> > Cc: Sami Tolvanen 
> > Cc: kexec@lists.infradead.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: x...@kernel.org
> > Cc: l...@lists.linux.dev
> > Signed-off-by: Song Liu 
> >
> > ---
> > AFAICT, x86 is the only arch that supports LTO and purgatory.
> >
> > Changes in v2:
> > 1. Use CC_FLAGS_LTO instead of hardcode -flto. (Nick Desaulniers)
> > ---
> >  arch/x86/purgatory/Makefile | 4 
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> > index c2a29be35c01..08aa0f25f12a 100644
> > --- a/arch/x86/purgatory/Makefile
> > +++ b/arch/x86/purgatory/Makefile
> > @@ -19,6 +19,10 @@ CFLAGS_sha256.o := -D__DISABLE_EXPORTS -D__NO_FORTIFY
> >  # optimization flags.
> >  KBUILD_CFLAGS := $(filter-out -fprofile-sample-use=% 
> > -fprofile-use=%,$(KBUILD_CFLAGS))
> >
> > +# When LTO is enabled, llvm emits many text sections, which is not 
> > supported
> > +# by kexec. Remove -flto=* flags.
>
> -flto* in LLVM implies -ffunction-sections, which creates a
> .text.<function name> section per function definition to facilitate
> -Wl,--gc-sections.
>
> Overall the question here is "do we really need to optimize kexec?"
>
> If the answer is yes, then this patch AND 8652d44f466a are both
> pessimizing kexec (though having it work at all is strictly better
> than not at all).  The best fix IMO would be to provide a linker
> script for this purgatory image that rejoins the text sections back
> into one .text.  For example:
>
> commit eff8728fe698 ("vmlinux.lds.h: Add PGO and AutoFDO input sections")
>
> I assume people do care about the time to kexec, hence the raison
> d'etre for projects like kpatch.

AFAICT, optimizations like LTO and PGO can give a few % of
improvement, which is probably not important for kexec. The benefit
is on the order of seconds (or less?). The benefit of kpatch is that we
can keep the workload running while fixing the kernel bug. Based on
our experience at Meta, it may take hours to gracefully shut down the
application to run kexec. In this case, a few seconds of improvement
(via LTO/PGO purgatory) doesn't save us much.

Thanks,
Song

>
> I'm fine to sign off on this approach if we don't really care, or want
> to care more later, but we can do better here.
>
> > +KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO),$(KBUILD_CFLAGS))
> > +
> >  # When linking purgatory.ro with -r unresolved symbols are not checked,
> >  # also link a purgatory.chk binary without -r to check for unresolved 
> > symbols.
> >  PURGATORY_LDFLAGS := -e purgatory_start -z nodefaultlib
> > --
> > 2.34.1
> >
>
>
> --
> Thanks,
> ~Nick Desaulniers



Re: [PATCH v2] x86/purgatory: Remove LTO flags

2023-09-11 Thread Nick Desaulniers
On Fri, Sep 8, 2023 at 4:13 PM Song Liu  wrote:
>
> With LTO enabled, ld.lld generates multiple .text sections for
> purgatory.ro:
>
> $ readelf -S purgatory.ro  | grep " .text"
>   [ 1] .text PROGBITS   0040
>   [ 7] .text.purgatory   PROGBITS   20e0
>   [ 9] .text.warnPROGBITS   21c0
>   [13] .text.sha256_upda PROGBITS   22f0
>   [15] .text.sha224_upda PROGBITS   2be0
>   [17] .text.sha256_fina PROGBITS   2bf0
>   [19] .text.sha224_fina PROGBITS   2cc0
>
> This causes a WARNING from kexec_purgatory_setup_sechdrs():
>
> WARNING: CPU: 26 PID: 110894 at kernel/kexec_file.c:919
> kexec_load_purgatory+0x37f/0x390
>
> Fix this by disabling LTO for purgatory.

Thanks for the v2!

>
> Fixes: 8652d44f466a ("kexec: support purgatories with .text.hot sections")

Dunno that this fixes tag is precise.  I think perhaps

Fixes: b33fff07e3e3 ("x86, build: allow LTO to be selected")

would be more precise.

> Cc: Ricardo Ribalda 
> Cc: Sami Tolvanen 
> Cc: kexec@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: x...@kernel.org
> Cc: l...@lists.linux.dev
> Signed-off-by: Song Liu 
>
> ---
> AFAICT, x86 is the only arch that supports LTO and purgatory.
>
> Changes in v2:
> 1. Use CC_FLAGS_LTO instead of hardcode -flto. (Nick Desaulniers)
> ---
>  arch/x86/purgatory/Makefile | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> index c2a29be35c01..08aa0f25f12a 100644
> --- a/arch/x86/purgatory/Makefile
> +++ b/arch/x86/purgatory/Makefile
> @@ -19,6 +19,10 @@ CFLAGS_sha256.o := -D__DISABLE_EXPORTS -D__NO_FORTIFY
>  # optimization flags.
>  KBUILD_CFLAGS := $(filter-out -fprofile-sample-use=% 
> -fprofile-use=%,$(KBUILD_CFLAGS))
>
> +# When LTO is enabled, llvm emits many text sections, which is not supported
> +# by kexec. Remove -flto=* flags.

-flto* in LLVM implies -ffunction-sections, which creates a
.text.<function name> section per function definition to facilitate
-Wl,--gc-sections.

Overall the question here is "do we really need to optimize kexec?"

If the answer is yes, then this patch AND 8652d44f466a are both
pessimizing kexec (though having it work at all is strictly better
than not at all).  The best fix IMO would be to provide a linker
script for this purgatory image that rejoins the text sections back
into one .text.  For example:

commit eff8728fe698 ("vmlinux.lds.h: Add PGO and AutoFDO input sections")

I assume people do care about the time to kexec, hence the raison
d'etre for projects like kpatch.

I'm fine to sign off on this approach if we don't really care, or want
to care more later, but we can do better here.

> +KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO),$(KBUILD_CFLAGS))
> +
>  # When linking purgatory.ro with -r unresolved symbols are not checked,
>  # also link a purgatory.chk binary without -r to check for unresolved 
> symbols.
>  PURGATORY_LDFLAGS := -e purgatory_start -z nodefaultlib
> --
> 2.34.1
>


-- 
Thanks,
~Nick Desaulniers



Re: kexec reboot failed due to commit 75d090fd167ac

2023-09-11 Thread Kirill A. Shutemov
On Mon, Sep 11, 2023 at 10:33:01AM -0500, Tom Lendacky wrote:
> On 9/11/23 09:57, Kirill A. Shutemov wrote:
> > On Mon, Sep 11, 2023 at 10:56:36PM +0800, Dave Young wrote:
> > > > early console in extract_kernel
> > > > input_data: 0x00807eb433a8
> > > > input_len: 0x00d26271
> > > > output: 0x00807b00
> > > > output_len: 0x04800c10
> > > > kernel_total_size: 0x03e28000
> > > > needed_size: 0x04a0
> > > > trampoline_32bit: 0x0009d000
> > > > 
> > > > Decompressing Linux... out of pgt_buf in 
> > > > arch/x86/boot/compressed/ident_map_64.c!?
> > > > pages->pgt_buf_offset: 0x6000
> > > > pages->pgt_buf_size: 0x6000
> > > > 
> > > > 
> > > > Error: kernel_ident_mapping_init() failed
> > > > 
> > > > It crashes on #PF due to stbl->nr_tables dereference in
> > > > efi_get_conf_table() called from init_unaccepted_memory().
> > > > 
> > > > I don't see anything special about stbl location: 0x775d6018.
> > > > 
> > > > One other bit of information: disabling 5-level paging also helps the
> > > > issue.
> > > > 
> > > > I will debug further.
> > 
> > The problem is not limited to unaccepted memory, it also triggers if we
> > reach efi_get_rsdp_addr() in the same setup.
> > 
> > I think we have several problems here.
> > 
> > - 6 pages for !RANDOMIZE_BASE is only enough for kernel, cmdline,
> >boot_data and setup_data if we assume that they are in different 1G
> >regions and do not cross the 1G boundaries. 4-level paging: 1 for PGD, 1
> >for PUD, 4 for PMD tables.
> > 
> >Looks like we never map EFI/ACPI memory explicitly.
> > 
> >It might work if kernel/cmdline/... are in single 1G and we have
> >spare pages to handle page faults.
> > 
> > - No spare memory to handle mapping for cc_info and cc_info->cpuid_phys;
> > 
> > - I didn't increase BOOT_INIT_PGT_SIZE when added 5-level paging support.
> >And if start pagetables from scratch ('else' case of 'if (p4d_offset...))
> >we run out of memory.
> > 
> > I believe similar logic would apply for BOOT_PGT_SIZE for RANDOMIZE_BASE=y
> > case.
> > 
> > I don't know what the right fix here. We can increase the constants to be
> > enough to cover existing cases, but it is very fragile. I am not sure I
> > saw all users. Some of them could silently handled with pagefault handler
> > in some setups. And it is hard to catch new users during code review.
> > 
> > Also I'm not sure why do we need pagefault handler there. Looks like it
> > just masking problems. I think everything has to be mapped explicitly.
> > 
> > Any comments?
> 
> There was a similar related issue around the cc_info blob that is captured
> here: https://lore.kernel.org/lkml/20230601072043.24439-1-l...@redhat.com/
> 
> Personally, I'm a fan of mapping the EFI tables that will be passed to the
> kexec/kdump kernel. To me, that seems to more closely match the valid
> mappings for the tables when control is transferred to the OS from UEFI on
> the initial boot.

I don't see how it would help if initialize_identity_maps() resets
pagetables. See 'else' case of 'if (p4d_offset...).

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
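
The page-table arithmetic quoted above (1 PGD + 1 PUD + 4 PMD = 6 pages) can
be checked with a small userspace sketch. It is a simplified model, not the
code in arch/x86/boot/compressed/: it assumes identity mapping with 2M pages,
so a PMD table covers 1G, a PUD table covers 512G and, with 5-level paging, a
P4D table covers 256T. struct region, count_tables() and pgt_pages_needed()
are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one physical range to identity-map. */
struct region {
	uint64_t start;
	uint64_t len;
};

#define MAX_TABLES 64

/*
 * Count how many distinct tables one level needs: every distinct value of
 * (address >> shift) touched by any region needs its own table page.
 * The result saturates at MAX_TABLES to keep the sketch simple.
 */
static size_t count_tables(const struct region *r, size_t n, unsigned shift)
{
	uint64_t seen[MAX_TABLES];
	size_t nseen = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t first, last;

		if (r[i].len == 0)
			continue;
		first = r[i].start >> shift;
		last = (r[i].start + r[i].len - 1) >> shift;

		for (uint64_t idx = first; idx <= last; idx++) {
			size_t j;

			for (j = 0; j < nseen; j++)
				if (seen[j] == idx)
					break;
			if (j < nseen)
				continue; /* table already counted */
			if (nseen == MAX_TABLES)
				return nseen;
			seen[nseen++] = idx;
		}
	}
	return nseen;
}

/* Total page-table pages when building the mapping from scratch. */
size_t pgt_pages_needed(const struct region *r, size_t n, int five_level)
{
	size_t pages = 1; /* top-level table (PGD) */

	if (five_level)
		pages += count_tables(r, n, 48); /* P4D tables, 256T each */
	pages += count_tables(r, n, 39);         /* PUD tables, 512G each */
	pages += count_tables(r, n, 30);         /* PMD tables, 1G each */
	return pages;
}
```

For four small regions in four different 1G areas below 512G, this model
gives 6 pages with 4-level paging and 7 with 5-level paging, which is
consistent with a 6-page buffer running out once the tables are rebuilt from
scratch in 5-level mode. Any additional range in another 1G area (EFI or
ACPI tables, cc_info) adds at least one more PMD page on top.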



Re: kexec reboot failed due to commit 75d090fd167ac

2023-09-11 Thread Tom Lendacky

On 9/11/23 09:57, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 10:56:36PM +0800, Dave Young wrote:
> > > early console in extract_kernel
> > > input_data: 0x00807eb433a8
> > > input_len: 0x00d26271
> > > output: 0x00807b00
> > > output_len: 0x04800c10
> > > kernel_total_size: 0x03e28000
> > > needed_size: 0x04a0
> > > trampoline_32bit: 0x0009d000
> > >
> > > Decompressing Linux... out of pgt_buf in
> > > arch/x86/boot/compressed/ident_map_64.c!?
> > > pages->pgt_buf_offset: 0x6000
> > > pages->pgt_buf_size: 0x6000
> > >
> > >
> > > Error: kernel_ident_mapping_init() failed
> > >
> > > It crashes on #PF due to stbl->nr_tables dereference in
> > > efi_get_conf_table() called from init_unaccepted_memory().
> > >
> > > I don't see anything special about stbl location: 0x775d6018.
> > >
> > > One other bit of information: disabling 5-level paging also helps the
> > > issue.
> > >
> > > I will debug further.
>
> The problem is not limited to unaccepted memory, it also triggers if we
> reach efi_get_rsdp_addr() in the same setup.
>
> I think we have several problems here.
>
> - 6 pages for !RANDOMIZE_BASE is only enough for kernel, cmdline,
>   boot_data and setup_data if we assume that they are in different 1G
>   regions and do not cross the 1G boundaries. 4-level paging: 1 for PGD, 1
>   for PUD, 4 for PMD tables.
>
>   Looks like we never map EFI/ACPI memory explicitly.
>
>   It might work if kernel/cmdline/... are in single 1G and we have
>   spare pages to handle page faults.
>
> - No spare memory to handle mapping for cc_info and cc_info->cpuid_phys;
>
> - I didn't increase BOOT_INIT_PGT_SIZE when added 5-level paging support.
>   And if start pagetables from scratch ('else' case of 'if (p4d_offset...))
>   we run out of memory.
>
> I believe similar logic would apply for BOOT_PGT_SIZE for RANDOMIZE_BASE=y
> case.
>
> I don't know what the right fix here. We can increase the constants to be
> enough to cover existing cases, but it is very fragile. I am not sure I
> saw all users. Some of them could silently handled with pagefault handler
> in some setups. And it is hard to catch new users during code review.
>
> Also I'm not sure why do we need pagefault handler there. Looks like it
> just masking problems. I think everything has to be mapped explicitly.
>
> Any comments?


There was a similar related issue around the cc_info blob that is captured 
here: https://lore.kernel.org/lkml/20230601072043.24439-1-l...@redhat.com/


Personally, I'm a fan of mapping the EFI tables that will be passed to the 
kexec/kdump kernel. To me, that seems to more closely match the valid 
mappings for the tables when control is transferred to the OS from UEFI on 
the initial boot.


Thanks,
Tom





___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec reboot failed due to commit 75d090fd167ac

2023-09-11 Thread Kirill A. Shutemov
On Mon, Sep 11, 2023 at 10:56:36PM +0800, Dave Young wrote:
> > early console in extract_kernel
> > input_data: 0x00807eb433a8
> > input_len: 0x00d26271
> > output: 0x00807b00
> > output_len: 0x04800c10
> > kernel_total_size: 0x03e28000
> > needed_size: 0x04a0
> > trampoline_32bit: 0x0009d000
> >
> > Decompressing Linux... out of pgt_buf in arch/x86/boot/compressed/ident_map_64.c!?
> > pages->pgt_buf_offset: 0x6000
> > pages->pgt_buf_size: 0x6000
> >
> >
> > Error: kernel_ident_mapping_init() failed
> >
> > It crashes on #PF due to stbl->nr_tables dereference in
> > efi_get_conf_table() called from init_unaccepted_memory().
> >
> > I don't see anything special about stbl location: 0x775d6018.
> >
> > One other bit of information: disabling 5-level paging also helps the
> > issue.
> >
> > I will debug further.

The problem is not limited to unaccepted memory, it also triggers if we
reach efi_get_rsdp_addr() in the same setup.

I think we have several problems here.

- 6 pages for !RANDOMIZE_BASE is only enough for kernel, cmdline,
  boot_data and setup_data if we assume that they are in different 1G
  regions and do not cross the 1G boundaries. 4-level paging: 1 for PGD, 1
  for PUD, 4 for PMD tables.

  Looks like we never map EFI/ACPI memory explicitly.

  It might work if kernel/cmdline/... are in single 1G and we have
  spare pages to handle page faults.

- No spare memory to handle mapping for cc_info and cc_info->cpuid_phys;

- I didn't increase BOOT_INIT_PGT_SIZE when I added 5-level paging support.
  And if we start the page tables from scratch ('else' case of
  'if (p4d_offset...)) we run out of memory.

I believe similar logic would apply for BOOT_PGT_SIZE for RANDOMIZE_BASE=y
case.
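
The page budget in the first bullet can be written out as a small sketch.
It assumes 2M mappings, that every region sits under a single PUD table
(and a single P4D table with 5-level paging), and that each region stays
inside its own 1G range. The helper names are made up for illustration and
are not kernel code.

```c
/* Size of one page-table page on x86-64. */
#define SKETCH_PGT_PAGE_SIZE 0x1000UL

/*
 * Hypothetical helper: count the page-table pages an identity map needs
 * when it is built from scratch with 2M pages.
 * regions_1g is the number of distinct 1G ranges that must be mapped.
 */
unsigned long sketch_ident_map_pages(int paging_levels,
				     unsigned long regions_1g)
{
	unsigned long pages = 1;	/* top-level PGD */

	if (paging_levels == 5)
		pages += 1;		/* extra P4D table */
	pages += 1;			/* one PUD table (assumed shared) */
	pages += regions_1g;		/* one PMD table per 1G range */

	return pages;
}

/* Same budget expressed in bytes, comparable to pgt_buf_size. */
unsigned long sketch_ident_map_bytes(int paging_levels,
				     unsigned long regions_1g)
{
	return sketch_ident_map_pages(paging_levels, regions_1g) *
	       SKETCH_PGT_PAGE_SIZE;
}
```

With four 1G ranges (kernel, cmdline, boot_data, setup_data) 4-level paging
needs exactly the 6 pages (0x6000 bytes) seen in the log, so nothing is left
for EFI/ACPI tables, and 5-level paging already needs one page more.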

I don't know what the right fix is here. We can increase the constants to be
enough to cover existing cases, but it is very fragile. I am not sure I
saw all users. Some of them could be silently handled by the page fault
handler in some setups. And it is hard to catch new users during code review.

Also, I'm not sure why we need the page fault handler there. It looks like it
is just masking problems. I think everything has to be mapped explicitly.

Any comments?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 3/3] /dev/mem: Do not map unaccepted memory

2023-09-11 Thread Dave Hansen
On 9/11/23 01:09, David Hildenbrand wrote:
> So, making unaccepted memory similarly depend on "!DEVMEM ||
> STRICT_DEVMEM" does not sound too far off ...

Yeah, considering all of the invasive work folks want to do to "harden"
the kernel for TDX, doing that ^ is just about the best
bang-for-your-buck "hardening" that you can get.



Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 11.09.23 12:05, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 11:50:31AM +0200, David Hildenbrand wrote:

On 11.09.23 11:27, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:

On 11.09.23 10:41, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:

On 06.09.23 09:39, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer to commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Do not map unaccepted memory because it can cause the guest to fail.

For /proc/vmcore, which is read-only, this means a read or mmap of
unaccepted memory will return zeros.


Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
to the information whether memory of the first kernel is unaccepted (IOW,
not its memory, but the memory of the first kernel it is supposed to expose
via /proc/vmcore)?


There are a few patches in my queue to fix a few related issues, but
generally, yes, the information is available to the target kernel via the
EFI configuration table.


I assume that table is provided by the first kernel, and not read directly
from HW, correct?


The table is constructed by the EFI stub in the first kernel based on EFI
memory map.



Okay, should work then once that's done by the first kernel.

Maybe include this patch in your series?


Can do. But the other two patches are not related to kexec. Hm.


Yes, the others can go in separately. But this here really needs other 
kexec/kdump changes.


--
Cheers,

David / dhildenb




[PATCHv6 3/3] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

2023-09-11 Thread Pingfan Liu
paca_ptrs should be large enough to hold the boot_cpuid, hence, its
lower boundary is set to the bigger one between boot_cpuid+1 and
nr_cpus.

On the other hand, some kernel components assume that cpu0 is usable:
 1. The timer assumes cpu0 is online, since the timer_list->flags subfield
    'TIMER_CPUMASK' is zero if not initialized to a proper present cpu.
 2. power9_idle_stop() assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined if the
boot cpu is not cpu0.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-   paca_nr_cpu_ids = nr_cpu_ids;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   paca_nr_cpu_ids = n;
+   paca_ptrs_size = sizeof(struct paca_struct *) * n;
paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
if (!paca_ptrs)
panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
int new_ptrs_size;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   new_ptrs_size = sizeof(struct paca_struct *) * n;
if (new_ptrs_size < paca_ptrs_size)
memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
   paca_ptrs_size - new_ptrs_size);
 
-   paca_nr_cpu_ids = nr_cpu_ids;
+   paca_nr_cpu_ids = n;
paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index cb3f3e040455..28441edbc42d 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -362,9 +362,12 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 */
boot_cpuid = i;
found = true;
-   /* This works around the hole in paca_ptrs[]. */
-   if (nr_cpu_ids < nthreads)
-   set_nr_cpu_ids(nthreads);
+   /*
+* Ideally, nr_cpus=1 can be achieved if each kernel
+* component does not assume cpu0 is onlined.
+*/
+   if (boot_cpuid != 0 && nr_cpu_ids < 2)
+   set_nr_cpu_ids(2);
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1




[PATCHv6 0/3] enable nr_cpus for powerpc

2023-09-11 Thread Pingfan Liu
Since my last v4 [1], the code has undergone great changes. The paca[]
array has been reorganized and indexed by paca_ptrs[], which
dramatically decreases the memory consumption even if there are many
unpresent cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. So I keep [1/3], which
rotate-shifts the cpu's sequence number in the device tree to obtain the
logical cpu id.

Patch [2-3/3] make efforts to decrease the nr_cpus to be less than or
equal to two.

[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org

v5 -> v6:
  Assign nr_cpu_ids via set_nr_cpu_ids() to handle the case where nr_cpu_ids
  is configured as a constant

Pingfan Liu (3):
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/kernel/paca.c |  10 +--
 arch/powerpc/kernel/prom.c |  28 +---
 arch/powerpc/kernel/setup-common.c | 106 -
 3 files changed, 113 insertions(+), 31 deletions(-)

-- 
2.31.1




[PATCHv6 2/3] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus

2023-09-11 Thread Pingfan Liu
If the boot_cpuid is not smaller than nr_cpus, it requires extra effort to
ensure the boot_cpu is in cpu_present_mask. This can be achieved by
reserving the last slot for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
next patch.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index a07af8de6674..58a988c64dd2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -456,8 +456,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int shift = 0, cpu = 0;
-   int j, nthreads = 1;
+   int terminate, shift = 0, cpu = 0;
+   int j, bt_thread = 0, nthreads = 1;
int len;
struct interrupt_server_node *intserv_node, *n;
struct list_head *bt_node, head;
@@ -520,6 +520,7 @@ void __init smp_setup_cpu_maps(void)
for (j = 0 ; j < nthreads; j++) {
if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
bt_node = &intserv_node->node;
+   bt_thread = j;
found_boot_cpu = true;
/*
 * Record the round-shift between dt
@@ -539,11 +540,21 @@ void __init smp_setup_cpu_maps(void)
/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
list_add_tail(&head, bt_node);
pr_info("the round shift between dt seq and the cpu logic number: %d\n", shift);
+   terminate = nr_cpu_ids;
list_for_each_entry(intserv_node, &head, node) {
 
+   j = 0;
+   /* Choose a start point to cover the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   /*
+* The processor core puts assumption on the thread id,
+* not to breach the assumption.
+*/
+   terminate = nr_cpu_ids - 1;
+   }
avail = intserv_node->avail;
nthreads = intserv_node->len / sizeof(int);
-   for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+   for (; j < nthreads && cpu < terminate; j++) {
set_cpu_present(cpu, avail);
set_cpu_possible(cpu, true);
cpu_to_phys_id[cpu] = be32_to_cpu(intserv_node->intserv[j]);
@@ -551,6 +562,14 @@ void __init smp_setup_cpu_maps(void)
j, cpu, be32_to_cpu(intserv[j]));
cpu++;
}
+   /* Online the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   set_cpu_present(bt_thread, avail);
+   set_cpu_possible(bt_thread, true);
+   cpu_to_phys_id[bt_thread] = be32_to_cpu(intserv_node->intserv[bt_thread]);
+   DBG("thread %d -> cpu %d (hard id %d)\n",
+   bt_thread, bt_thread, be32_to_cpu(intserv[bt_thread]));
+   }
}
 
list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1




[PATCHv6 1/3] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-09-11 Thread Pingfan Liu
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which can waste plenty of
room when allocating memory for paca_ptrs[]. In theory, however, there is
no requirement to assign a cpu's logical id according to its sequence in
the device tree. But helpers such as cpu_first_thread_sibling() make
assumptions about the mapping inside a core. Hence this patch loosens the
mapping only partially: it unbinds the mapping between cores while keeping
the mapping inside a core.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a rotate
shift between the sequence number in dt and the cpu logical number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.
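
The rotate shift described above can be sketched in plain C. This is only
an illustration under stated assumptions: the function names are
hypothetical, every core is assumed to have the same number of threads, and
nr_threads is assumed to be a multiple of threads_per_core. The patch itself
works on a list of device tree nodes rather than on plain integers.

```c
/*
 * Hypothetical helper: the shift is the dt sequence number of the first
 * thread of the boot cpu's core, so whole cores are rotated and the
 * thread order inside a core is kept.
 */
unsigned int sketch_rotate_shift(unsigned int boot_seq,
				 unsigned int threads_per_core)
{
	return (boot_seq / threads_per_core) * threads_per_core;
}

/*
 * Hypothetical helper: map a dt sequence number to a logical cpu id by
 * rotating left by the shift. Requires dt_seq < nr_threads.
 */
unsigned int sketch_logical_id(unsigned int dt_seq, unsigned int boot_seq,
			       unsigned int threads_per_core,
			       unsigned int nr_threads)
{
	unsigned int shift = sketch_rotate_shift(boot_seq, threads_per_core);

	return (dt_seq + nr_threads - shift) % nr_threads;
}
```

For example, with 16 threads, 4 threads per core and the boot cpu at dt
sequence 9, the shift is 8 and the boot cpu gets logical id 1, which lies
in [0, threads_per_core).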

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/kernel/prom.c | 25 +
 arch/powerpc/kernel/setup-common.c | 87 +++---
 2 files changed, 85 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..cb3f3e040455 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
const __be32 *intserv;
int i, nthreads;
int len;
-   int found = -1;
-   int found_thread = 0;
+   bool found = false;
 
/* We are scanning "cpu" nodes only */
if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
for (i = 0; i < nthreads; i++) {
if (be32_to_cpu(intserv[i]) ==
fdt_boot_cpuid_phys(initial_boot_params)) {
-   found = boot_cpu_count;
-   found_thread = i;
+   /*
+* always map the boot-cpu logical id into the
+* range of [0, thread_per_core)
+*/
+   boot_cpuid = i;
+   found = true;
+   /* This works around the hole in paca_ptrs[]. */
+   if (nr_cpu_ids < nthreads)
+   set_nr_cpu_ids(nthreads);
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
@@ -365,15 +373,14 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
}
 
/* Not the boot CPU */
-   if (found < 0)
+   if (!found)
return 0;
 
-   DBG("boot cpu: logical %d physical %d\n", found,
-   be32_to_cpu(intserv[found_thread]));
-   boot_cpuid = found;
+   DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+   be32_to_cpu(intserv[boot_cpuid]));
 
if (IS_ENABLED(CONFIG_PPC64))
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index d2a446216444..a07af8de6674 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -427,6 +428,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+   struct list_head node;
+   boolavail;
+   int len;
+   __be32 *intserv;
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *  cpu_possible_mask
@@ -448,11 +456,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int cpu = 0;
-   int nthreads = 1;
+   int shift = 0, cpu = 0;
+   int j, nthreads = 1;
+   int len;
+   struct interrupt_server_node *intserv_node, *n;
+   struct list_head *bt_node, head;
+   bool avail, found_boot_cpu = false;
 
DBG("smp_setup_cpu_maps()\n");
 
+   INIT_LIST_HEAD(&head);
cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32),
__alignof_

[PATCH V2 2/2] proc/kcore: Do not try to access unaccepted memory

2023-09-11 Thread Adrian Hunter
Support for unaccepted memory was added recently, refer to commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual
machine may need to accept memory before it can be used.

Do not try to access unaccepted memory because it can cause the guest to
fail.

For /proc/kcore, which is read-only and does not support mmap, this means a
read of unaccepted memory will return zeros.

Signed-off-by: Adrian Hunter 
---
 fs/proc/kcore.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


Changes in V2:

  Change patch subject and commit message
  Do not open code pfn_is_unaccepted_memory()


diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 23fc24d16b31..6422e569b080 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -546,7 +546,8 @@ static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 * and explicitly excluded physical ranges.
 */
if (!page || PageOffline(page) ||
-   is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
+   is_page_hwpoison(page) || !pfn_is_ram(pfn) ||
+   pfn_is_unaccepted_memory(pfn)) {
if (iov_iter_zero(tsz, iter) != tsz) {
ret = -EFAULT;
goto out;
-- 
2.34.1




[PATCH V2 0/2] Do not try to access unaccepted memory

2023-09-11 Thread Adrian Hunter
Hi

Support for unaccepted memory was added recently, refer to commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Plug a few gaps where RAM is exposed without checking if it is
unaccepted memory.


Changes in V2:

  efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory
  Change patch subject and commit message
  Use vmcore_cb->.pfn_is_ram() instead of changing vmcore.c

  proc/kcore: Do not try to access unaccepted memory
  Change patch subject and commit message
  Do not open code pfn_is_unaccepted_memory()

  /dev/mem: Do not map unaccepted memory
  Patch dropped because it is not required


Adrian Hunter (2):
  efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory
  proc/kcore: Do not try to access unaccepted memory

 drivers/firmware/efi/unaccepted_memory.c | 20 
 fs/proc/kcore.c  |  3 ++-
 include/linux/mm.h   |  7 +++
 3 files changed, 29 insertions(+), 1 deletion(-)


Regards
Adrian



[PATCH V2 1/2] efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory

2023-09-11 Thread Adrian Hunter
Support for unaccepted memory was added recently, refer to commit dcdfdd40fa82
("mm: Add support for unaccepted memory"), whereby a virtual machine may
need to accept memory before it can be used.

Do not let /proc/vmcore try to access unaccepted memory because it can
cause the guest to fail.

For /proc/vmcore, which is read-only, this means a read or mmap of
unaccepted memory will return zeros.
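
The intended behaviour can be modelled in user space as a rough sketch.
Everything here is hypothetical: the names, the one-bit-per-page bitmap
(the kernel bitmap tracks larger units) and the flat "ram" buffer are
simplifications, and the real read path goes through the vmcore callbacks
rather than a direct memcpy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical bitmap lookup: a set bit means the page is unaccepted. */
bool sketch_pfn_is_unaccepted(const uint8_t *bitmap, unsigned long pfn)
{
	return (bitmap[pfn / 8] & (1u << (pfn % 8))) != 0;
}

/*
 * Copy one page of "ram" into dst, or fill dst with zeros when the page
 * is unaccepted, so the unaccepted page itself is never touched.
 */
void sketch_read_page(const uint8_t *bitmap, const uint8_t *ram,
		      unsigned long pfn, uint8_t *dst)
{
	if (sketch_pfn_is_unaccepted(bitmap, pfn))
		memset(dst, 0, SKETCH_PAGE_SIZE);
	else
		memcpy(dst, ram + ((size_t)pfn << SKETCH_PAGE_SHIFT),
		       SKETCH_PAGE_SIZE);
}
```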

Signed-off-by: Adrian Hunter 
---
 drivers/firmware/efi/unaccepted_memory.c | 20 
 include/linux/mm.h   |  7 +++
 2 files changed, 27 insertions(+)


Changes in V2:

  Change patch subject and commit message
  Use vmcore_cb->.pfn_is_ram() instead of changing vmcore.c


diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 853f7dc3c21d..79ba576b22e3 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Protects unaccepted memory bitmap */
@@ -145,3 +146,22 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 
return ret;
 }
+
+#ifdef CONFIG_PROC_VMCORE
+static bool unaccepted_memory_vmcore_pfn_is_ram(struct vmcore_cb *cb,
+   unsigned long pfn)
+{
+   return !pfn_is_unaccepted_memory(pfn);
+}
+
+static struct vmcore_cb vmcore_cb = {
+   .pfn_is_ram = unaccepted_memory_vmcore_pfn_is_ram,
+};
+
+static int __init unaccepted_memory_init_kdump(void)
+{
+   register_vmcore_cb(&vmcore_cb);
+   return 0;
+}
+core_initcall(unaccepted_memory_init_kdump);
+#endif /* CONFIG_PROC_VMCORE */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..86511150f1d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4062,4 +4062,11 @@ static inline void accept_memory(phys_addr_t start, phys_addr_t end)
 
 #endif
 
+static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
+{
+   phys_addr_t paddr = pfn << PAGE_SHIFT;
+
+   return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
+}
+
 #endif /* _LINUX_MM_H */
-- 
2.34.1




[PATCH 1/2] zboot: enable arm64 kexec_load for zboot image

2023-09-11 Thread Dave Young
The kexec_file_load support for zboot kernel images already decompresses
the vmlinuz, so the kexec_load code simply reads the decompressed kernel fd
into a new buffer and uses that buffer directly.
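
Reading a known number of bytes from a file descriptor into a fresh buffer
can be sketched as below. This is not the kexec-tools slurp_fd()
implementation; sketch_read_exact() is a hypothetical stand-in that only
uses POSIX read() and reports how many bytes it actually got, so the caller
can compare that count with the expected decompressed size.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to `size` bytes from fd into a newly
 * allocated buffer. Returns NULL on allocation or read error. On success
 * *nread holds the byte count, which is smaller than `size` if EOF was
 * reached early.
 */
char *sketch_read_exact(int fd, size_t size, size_t *nread)
{
	char *buf = malloc(size ? size : 1);
	size_t progress = 0;

	if (!buf)
		return NULL;

	while (progress < size) {
		ssize_t r = read(fd, buf + progress, size - progress);

		if (r < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			free(buf);
			return NULL;
		}
		if (r == 0)
			break;			/* EOF before `size` bytes */
		progress += (size_t)r;
	}

	*nread = progress;
	return buf;
}
```

A caller would treat `*nread != size` as a failure, which mirrors the
`nread != decompressed_size` check in the patch below.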

Signed-off-by: Dave Young 
---
 include/kexec-pe-zboot.h   |  3 ++-
 kexec/arch/arm64/kexec-vmlinuz-arm64.c | 20 ++--
 kexec/kexec-pe-zboot.c |  4 +++-
 kexec/kexec.c  |  2 +-
 kexec/kexec.h  |  1 +
 5 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/kexec-pe-zboot.h b/include/kexec-pe-zboot.h
index e2e0448a81f2..374916cbe883 100644
--- a/include/kexec-pe-zboot.h
+++ b/include/kexec-pe-zboot.h
@@ -11,5 +11,6 @@ struct linux_pe_zboot_header {
uint32_t compress_type;
 };
 
-int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd);
+int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd,
+   off_t *kernel_size);
 #endif
diff --git a/kexec/arch/arm64/kexec-vmlinuz-arm64.c b/kexec/arch/arm64/kexec-vmlinuz-arm64.c
index c0ee47c8f50a..8f378d8fa6d0 100644
--- a/kexec/arch/arm64/kexec-vmlinuz-arm64.c
+++ b/kexec/arch/arm64/kexec-vmlinuz-arm64.c
@@ -34,6 +34,7 @@
 #include "arch/options.h"
 
 static int kernel_fd = -1;
+static off_t decompressed_size;
 
 /* Returns:
  * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format.
@@ -72,7 +73,7 @@ int pez_arm64_probe(const char *kernel_buf, off_t kernel_size)
return -1;
}
 
-   ret = pez_prepare(buf, buf_sz, &kernel_fd);
+   ret = pez_prepare(buf, buf_sz, &kernel_fd, &decompressed_size);
 
if (!ret) {
/* validate the arm64 specific header */
@@ -98,8 +99,23 @@ bad_header:
 int pez_arm64_load(int argc, char **argv, const char *buf, off_t len,
struct kexec_info *info)
 {
+   char *kbuf;
+
info->kernel_fd = kernel_fd;
-   return image_arm64_load(argc, argv, buf, len, info);
+   if (kernel_fd > 0 && decompressed_size > 0) {
+   off_t nread;
+
+   kbuf = slurp_fd(kernel_fd, NULL, decompressed_size, &nread);
+   if (!kbuf || nread != decompressed_size) {
+   dbgprintf("%s: failed.\n", __func__);
+   return -1;
+   }
+   } else {
+   dbgprintf("%s: wrong file descriptor.\n", __func__);
+   return -1;
+   }
+
+   return image_arm64_load(argc, argv, kbuf, decompressed_size, info);
 }
 
 void pez_arm64_usage(void)
diff --git a/kexec/kexec-pe-zboot.c b/kexec/kexec-pe-zboot.c
index 2f2e052b76c5..3abd17d9fe59 100644
--- a/kexec/kexec-pe-zboot.c
+++ b/kexec/kexec-pe-zboot.c
@@ -37,7 +37,8 @@
  *
  * crude_buf: the content, which is read from the kernel file without any processing
  */
-int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd)
+int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd,
+   off_t *kernel_size)
 {
int ret = -1;
int fd = 0;
@@ -110,6 +111,7 @@ int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd)
goto fail_bad_header;
}
 
+   *kernel_size = decompressed_size;
dbgprintf("%s: done\n", __func__);
 
ret = 0;
diff --git a/kexec/kexec.c b/kexec/kexec.c
index c3b182e254e0..1edbd349c86d 100644
--- a/kexec/kexec.c
+++ b/kexec/kexec.c
@@ -489,7 +489,7 @@ static int add_backup_segments(struct kexec_info *info,
return 0;
 }
 
-static char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread)
+char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread)
 {
char *buf;
off_t progress;
diff --git a/kexec/kexec.h b/kexec/kexec.h
index ed3b499a80f2..093338969c57 100644
--- a/kexec/kexec.h
+++ b/kexec/kexec.h
@@ -267,6 +267,7 @@ extern void die(const char *fmt, ...)
__attribute__ ((format (printf, 1, 2)));
 extern void *xmalloc(size_t size);
 extern void *xrealloc(void *ptr, size_t size);
+extern char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread);
 extern char *slurp_file(const char *filename, off_t *r_size);
 extern char *slurp_file_mmap(const char *filename, off_t *r_size);
 extern char *slurp_file_len(const char *filename, off_t size, off_t *nread);
-- 
2.37.2




[PATCH 2/2] zboot: add loongarch kexec_load support

2023-09-11 Thread Dave Young
From: "dyo...@redhat.com" 

Copy the arm64 code and adapt it for loongarch so that kexec -c can load
a zboot image.
Note: the zboot image must be probed first; otherwise the pei-loongarch
file type will be used.
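
Why the probe order matters can be shown with a small first-match table.
This is a simplified sketch, not the kexec-tools code: the probe functions
and the header checks ("MZ" at offset 0, "zimg" at offset 4) are
assumptions made for illustration. The point is only that a zboot image
also passes a plain PE check.

```c
#include <stddef.h>
#include <string.h>

struct sketch_file_type {
	const char *name;
	int (*probe)(const char *buf, size_t len);	/* 0 on match */
};

/* Plain PE image: only checks the "MZ" signature. */
static int sketch_probe_pei(const char *buf, size_t len)
{
	return (len >= 2 && buf[0] == 'M' && buf[1] == 'Z') ? 0 : -1;
}

/* zboot image: a PE image that also carries a "zimg" marker. */
static int sketch_probe_pez(const char *buf, size_t len)
{
	if (sketch_probe_pei(buf, len) != 0)
		return -1;
	return (len >= 8 && memcmp(buf + 4, "zimg", 4) == 0) ? 0 : -1;
}

/* Return the name of the first type whose probe accepts the image. */
const char *sketch_first_match(const struct sketch_file_type *types,
			       size_t n, const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (types[i].probe(buf, len) == 0)
			return types[i].name;
	return NULL;
}

/* Order used by the patch: the more specific probe comes first. */
const struct sketch_file_type sketch_pez_first[] = {
	{ "pez", sketch_probe_pez },
	{ "pei", sketch_probe_pei },
};

/* Reversed order: the generic PE probe shadows the zboot probe. */
const struct sketch_file_type sketch_pei_first[] = {
	{ "pei", sketch_probe_pei },
	{ "pez", sketch_probe_pez },
};
```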

Signed-off-by: Dave Young 
---
 kexec/arch/loongarch/Makefile  |  1 +
 kexec/arch/loongarch/image-header.h|  1 +
 kexec/arch/loongarch/kexec-loongarch.c |  1 +
 kexec/arch/loongarch/kexec-loongarch.h |  4 +
 kexec/arch/loongarch/kexec-pez-loongarch.c | 88 ++
 5 files changed, 95 insertions(+)
 create mode 100644 kexec/arch/loongarch/kexec-pez-loongarch.c

diff --git a/kexec/arch/loongarch/Makefile b/kexec/arch/loongarch/Makefile
index 3b33b9693287..cee7e569a2a2 100644
--- a/kexec/arch/loongarch/Makefile
+++ b/kexec/arch/loongarch/Makefile
@@ -6,6 +6,7 @@ loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-elf-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-pei-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-elf-rel-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/crashdump-loongarch.c
+loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-pez-loongarch.c
 
 loongarch_MEM_REGIONS = kexec/mem_regions.c
 
diff --git a/kexec/arch/loongarch/image-header.h b/kexec/arch/loongarch/image-header.h
index 3b7576552685..223d81f77d9f 100644
--- a/kexec/arch/loongarch/image-header.h
+++ b/kexec/arch/loongarch/image-header.h
@@ -33,6 +33,7 @@ struct loongarch_image_header {
 };
 
 static const uint8_t loongarch_image_pe_sig[2] = {'M', 'Z'};
+static const uint8_t loongarch_pe_machtype[6] = {'P','E', 0x0, 0x0, 0x64, 0x62};
 
 /**
  * loongarch_header_check_pe_sig - Helper to check the loongarch image header.
diff --git a/kexec/arch/loongarch/kexec-loongarch.c b/kexec/arch/loongarch/kexec-loongarch.c
index f47c99861674..62ff8fd1aeb7 100644
--- a/kexec/arch/loongarch/kexec-loongarch.c
+++ b/kexec/arch/loongarch/kexec-loongarch.c
@@ -165,6 +165,7 @@ int get_memory_ranges(struct memory_range **range, int *ranges,
 
 struct file_type file_type[] = {
{"elf-loongarch", elf_loongarch_probe, elf_loongarch_load, elf_loongarch_usage},
+   {"pez-loongarch", pez_loongarch_probe, pez_loongarch_load, pez_loongarch_usage},
{"pei-loongarch", pei_loongarch_probe, pei_loongarch_load, pei_loongarch_usage},
 };
 int file_types = sizeof(file_type) / sizeof(file_type[0]);
diff --git a/kexec/arch/loongarch/kexec-loongarch.h b/kexec/arch/loongarch/kexec-loongarch.h
index 5120a26fd513..2c7624f2fd3a 100644
--- a/kexec/arch/loongarch/kexec-loongarch.h
+++ b/kexec/arch/loongarch/kexec-loongarch.h
@@ -27,6 +27,10 @@ int pei_loongarch_probe(const char *buf, off_t len);
 int pei_loongarch_load(int argc, char **argv, const char *buf, off_t len,
struct kexec_info *info);
 void pei_loongarch_usage(void);
+int pez_loongarch_probe(const char *kernel_buf, off_t kernel_size);
+int pez_loongarch_load(int argc, char **argv, const char *buf, off_t len,
+  struct kexec_info *info);
+void pez_loongarch_usage(void);
 
 int loongarch_process_image_header(const struct loongarch_image_header *h);
 
diff --git a/kexec/arch/loongarch/kexec-pez-loongarch.c b/kexec/arch/loongarch/kexec-pez-loongarch.c
new file mode 100644
index ..6d94a405d54a
--- /dev/null
+++ b/kexec/arch/loongarch/kexec-pez-loongarch.c
@@ -0,0 +1,88 @@
+/*
+ * LoongArch PE compressed Image (vmlinuz, ZBOOT) support.
+ * Based on arm64 code
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include "kexec.h"
+#include "kexec-loongarch.h"
+#include 
+#include "arch/options.h"
+
+static int kernel_fd = -1;
+static off_t decompressed_size;
+
+/* Returns:
+ * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format.
+ */
+int pez_loongarch_probe(const char *kernel_buf, off_t kernel_size)
+{
+   int ret = -1;
+   const struct loongarch_image_header *h;
+   char *buf;
+   off_t buf_sz;
+
+   buf = (char *)kernel_buf;
+   buf_sz = kernel_size;
+   if (!buf)
+   return -1;
+   h = (const struct loongarch_image_header *)buf;
+
+   dbgprintf("%s: PROBE.\n", __func__);
+   if (buf_sz < sizeof(struct loongarch_image_header)) {
+   dbgprintf("%s: Not large enough to be a PE image.\n", __func__);
+   return -1;
+   }
+   if (!loongarch_header_check_pe_sig(h)) {
+   dbgprintf("%s: Not an PE image.\n", __func__);
+   return -1;
+   }
+
+   if (buf_sz < sizeof(struct loongarch_image_header) + h->pe_header) {
+   dbgprintf("%s: PE image offset larger than image.\n", __func__);
+   return -1;
+   }
+
+   if (memcmp(&buf[h->pe_header],
+  loongarch_pe_machtype, sizeof(loongarch_pe_machtype))) {
+   dbgprintf("%s: PE header doesn't match machine type.\n", __func__);
+   return -1;
+   }
+
+   ret = pez_prepare(buf, buf_sz, &kernel_fd, &decompressed_size);
+

Re: [PATCH v2 2/3] vmcore: allow fadump to export vmcore even if is_kdump_kernel() is false

2023-09-11 Thread Baoquan He
On 09/11/23 at 05:13pm, Michael Ellerman wrote:
> Hari Bathini  writes:
> > Currently, is_kdump_kernel() returns true when elfcorehdr_addr is set.
> > While elfcorehdr_addr is set for kexec based kernel dump mechanism,
> > alternate dump capturing methods like fadump [1] also set it to export
> > the vmcore. Since, is_kdump_kernel() is used to restrict resources in
> > crash dump capture kernel and such restrictions are not desirable for
> > fadump, allow is_kdump_kernel() to be defined differently for fadump
> > case. With that change, include is_fadump_active() check in functions
> > is_vmcore_usable() & vmcore_unusable() to be able to export vmcore for
> > fadump case too.
> ...
> > diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> > index 0f3a656293b0..de8a9fabfb6f 100644
> > --- a/include/linux/crash_dump.h
> > +++ b/include/linux/crash_dump.h
> > @@ -50,6 +50,7 @@ void vmcore_cleanup(void);
> >  #define vmcore_elf64_check_arch(x) (elf_check_arch(x) || 
> > vmcore_elf_check_arch_cross(x))
> >  #endif
> >  
> > +#ifndef is_kdump_kernel
> >  /*
> >   * is_kdump_kernel() checks whether this kernel is booting after a panic of
> >   * previous kernel or not. This is determined by checking if previous kernel
> > @@ -64,6 +65,19 @@ static inline bool is_kdump_kernel(void)
> >  {
> > return elfcorehdr_addr != ELFCORE_ADDR_MAX;
> >  }
> > +#endif
> > +
> > +#ifndef is_fadump_active
> > +/*
> > + * If f/w assisted dump capturing mechanism (fadump), instead of kexec based
> > + * dump capturing mechanism (kdump) is exporting the vmcore, then this function
> > + * will be defined in arch specific code to return true, when appropriate.
> > + */
> > +static inline bool is_fadump_active(void)
> > +{
> > +   return false;
> > +}
> > +#endif
> >  
> >  /* is_vmcore_usable() checks if the kernel is booting after a panic and
> >   * the vmcore region is usable.
> > @@ -75,7 +89,8 @@ static inline bool is_kdump_kernel(void)
> >  
> >  static inline int is_vmcore_usable(void)
> >  {
> > -   return is_kdump_kernel() && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0;
> > +   return (is_kdump_kernel() || is_fadump_active())
> > +   && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0;
> >  }
> >  
> >  /* vmcore_unusable() marks the vmcore as unusable,
> > @@ -84,7 +99,7 @@ static inline int is_vmcore_usable(void)
> >  
> >  static inline void vmcore_unusable(void)
> >  {
> > -   if (is_kdump_kernel())
> > +   if (is_kdump_kernel() || is_fadump_active())
> > elfcorehdr_addr = ELFCORE_ADDR_ERR;
> >  }
> 
> I think it would be cleaner to decouple is_vmcore_usable() and
> vmcore_unusable() from is_kdump_kernel().
> 
> ie, make them operate solely based on the value of elfcorehdr_addr:
> 
> static inline int is_vmcore_usable(void)
> {
>   return elfcorehdr_addr != ELFCORE_ADDR_ERR &&
>          elfcorehdr_addr != ELFCORE_ADDR_MAX;

Agree. I fell into the blind corner of thinking earlier. Above change
is better.

> }
> 
> static inline void vmcore_unusable(void)
> {
>   elfcorehdr_addr = ELFCORE_ADDR_ERR;
> }
> 
> 
> Then all we need on powerpc is a way to override is_kdump_kernel().
> 
> cheers
> 
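
A minimal sketch of the override idiom the quoted patch relies on (`#ifndef is_kdump_kernel`), assuming an architecture header is included before the generic one. The `fadump_active` flag and the arch-side body are hypothetical and only illustrate the mechanism; plain variables stand in for kernel state.

```c
#include <stdbool.h>

/* Stand-in for the kernel's definition in include/linux/crash_dump.h. */
#define ELFCORE_ADDR_MAX (-1ULL)

/* Plain variables standing in for kernel state (assumption of this sketch). */
static unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
static bool fadump_active;	/* hypothetical flag, not a kernel symbol */

/* --- what an arch header could provide (hypothetical) --- */
static inline bool is_kdump_kernel(void)
{
	/* fadump also sets elfcorehdr_addr, but it is not a kdump kernel. */
	return !fadump_active && elfcorehdr_addr != ELFCORE_ADDR_MAX;
}
/* A same-named macro lets the generic header detect the override. */
#define is_kdump_kernel is_kdump_kernel

/* --- generic header: fallback used only without an arch override --- */
#ifndef is_kdump_kernel
static inline bool is_kdump_kernel(void)
{
	return elfcorehdr_addr != ELFCORE_ADDR_MAX;
}
#endif
```

With the macro defined, the preprocessor skips the generic fallback, so callers pick up the arch version without any change at the call sites.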


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread Kirill A. Shutemov
On Mon, Sep 11, 2023 at 11:50:31AM +0200, David Hildenbrand wrote:
> On 11.09.23 11:27, Kirill A. Shutemov wrote:
> > On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:
> > > On 11.09.23 10:41, Kirill A. Shutemov wrote:
> > > > On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> > > > > On 06.09.23 09:39, Adrian Hunter wrote:
> > > > > > Support for unaccepted memory was added recently, refer commit
> > > > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > > > > a virtual machine may need to accept memory before it can be used.
> > > > > > 
> > > > > > Do not map unaccepted memory because it can cause the guest to fail.
> > > > > > 
> > > > > > For /proc/vmcore, which is read-only, this means a read or mmap of
> > > > > > unaccepted memory will return zeros.
> > > > > 
> > > > > Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
> > > > > to the information whether memory of the first kernel is unaccepted (IOW,
> > > > > not its memory, but the memory of the first kernel it is supposed to expose
> > > > > via /proc/vmcore)?
> > > > 
> > > > There are a few patches in my queue to fix related issues, but generally,
> > > > yes, the information is available to the target kernel via EFI
> > > > configuration table.
> > > 
> > > I assume that table is provided by the first kernel, and not read directly from
> > > HW, correct?
> > 
> > The table is constructed by the EFI stub in the first kernel based on EFI
> > memory map.
> > 
> 
> Okay, should work then once that's done by the first kernel.
> 
> Maybe include this patch in your series?

Can do. But the other two patches are not related to kexec. Hm.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 11.09.23 11:27, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:
> > On 11.09.23 10:41, Kirill A. Shutemov wrote:
> > > On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> > > > On 06.09.23 09:39, Adrian Hunter wrote:
> > > > > Support for unaccepted memory was added recently, refer commit
> > > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > > > a virtual machine may need to accept memory before it can be used.
> > > > >
> > > > > Do not map unaccepted memory because it can cause the guest to fail.
> > > > >
> > > > > For /proc/vmcore, which is read-only, this means a read or mmap of
> > > > > unaccepted memory will return zeros.
> > > >
> > > > Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
> > > > to the information whether memory of the first kernel is unaccepted (IOW,
> > > > not its memory, but the memory of the first kernel it is supposed to expose
> > > > via /proc/vmcore)?
> > >
> > > There are a few patches in my queue to fix related issues, but generally,
> > > yes, the information is available to the target kernel via EFI
> > > configuration table.
> >
> > I assume that table is provided by the first kernel, and not read directly from
> > HW, correct?
>
> The table is constructed by the EFI stub in the first kernel based on EFI
> memory map.

Okay, should work then once that's done by the first kernel.

Maybe include this patch in your series?

--
Cheers,

David / dhildenb




Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread Kirill A. Shutemov
On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:
> On 11.09.23 10:41, Kirill A. Shutemov wrote:
> > On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> > > On 06.09.23 09:39, Adrian Hunter wrote:
> > > > Support for unaccepted memory was added recently, refer commit
> > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > > a virtual machine may need to accept memory before it can be used.
> > > > 
> > > > Do not map unaccepted memory because it can cause the guest to fail.
> > > > 
> > > > For /proc/vmcore, which is read-only, this means a read or mmap of
> > > > unaccepted memory will return zeros.
> > > 
> > > Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
> > > to the information whether memory of the first kernel is unaccepted (IOW,
> > > > not its memory, but the memory of the first kernel it is supposed to expose
> > > > via /proc/vmcore)?
> > > 
> > > There are a few patches in my queue to fix related issues, but generally,
> > > yes, the information is available to the target kernel via EFI
> > > configuration table.
> > 
> > I assume that table is provided by the first kernel, and not read directly from
> > HW, correct?

The table is constructed by the EFI stub in the first kernel based on EFI
memory map.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 11.09.23 10:41, Kirill A. Shutemov wrote:
> On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> > On 06.09.23 09:39, Adrian Hunter wrote:
> > > Support for unaccepted memory was added recently, refer commit
> > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > a virtual machine may need to accept memory before it can be used.
> > >
> > > Do not map unaccepted memory because it can cause the guest to fail.
> > >
> > > For /proc/vmcore, which is read-only, this means a read or mmap of
> > > unaccepted memory will return zeros.
> >
> > Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
> > to the information whether memory of the first kernel is unaccepted (IOW,
> > not its memory, but the memory of the first kernel it is supposed to expose
> > via /proc/vmcore)?
>
> There are a few patches in my queue to fix related issues, but generally,
> yes, the information is available to the target kernel via EFI
> configuration table.

I assume that table is provided by the first kernel, and not read directly
from HW, correct?


--
Cheers,

David / dhildenb




Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread Kirill A. Shutemov
On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:
> On 06.09.23 09:39, Adrian Hunter wrote:
> > Support for unaccepted memory was added recently, refer commit
> > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > a virtual machine may need to accept memory before it can be used.
> > 
> > Do not map unaccepted memory because it can cause the guest to fail.
> > 
> > For /proc/vmcore, which is read-only, this means a read or mmap of
> > unaccepted memory will return zeros.
> 
> Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
> to the information whether memory of the first kernel is unaccepted (IOW,
> not its memory, but the memory of the first kernel it is supposed to expose
> via /proc/vmcore)?

There are a few patches in my queue to fix related issues, but generally,
yes, the information is available to the target kernel via EFI
configuration table.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 3/3] /dev/mem: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 07.09.23 16:46, Dave Hansen wrote:
> On 9/7/23 07:25, Kirill A. Shutemov wrote:
> > On Thu, Sep 07, 2023 at 07:15:21AM -0700, Dave Hansen wrote:
> > > On 9/6/23 00:39, Adrian Hunter wrote:
> > > > Support for unaccepted memory was added recently, refer commit
> > > > dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> > > > a virtual machine may need to accept memory before it can be used.
> > > >
> > > > Do not map unaccepted memory because it can cause the guest to fail.
> > >
> > > Doesn't /dev/mem already provide a billion ways for someone to shoot
> > > themselves in the foot?  TDX seems to have added the 1,000,000,001st.
> > > Is this really worth patching?
> >
> > Is it better to let TD die silently? I don't think so.
>
> First, let's take a look at all of the distro kernels that folks will
> run under TDX.  Do they have STRICT_DEVMEM set?

For virtio-mem, we do

config VIRTIO_MEM
	...
	depends on EXCLUSIVE_SYSTEM_RAM

Which in turn:

config EXCLUSIVE_SYSTEM_RAM
	...
	depends on !DEVMEM || STRICT_DEVMEM

Not supported on all archs, but at least on RHEL9 on x86_64 and aarch64.

So, making unaccepted memory similarly depend on
"!DEVMEM || STRICT_DEVMEM" does not sound too far off ...



--
Cheers,

David / dhildenb




Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 06.09.23 09:39, Adrian Hunter wrote:
> Support for unaccepted memory was added recently, refer commit
> dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
> a virtual machine may need to accept memory before it can be used.
>
> Do not map unaccepted memory because it can cause the guest to fail.
>
> For /proc/vmcore, which is read-only, this means a read or mmap of
> unaccepted memory will return zeros.


Does a second (kdump) kernel that exposes /proc/vmcore reliably get 
access to the information whether memory of the first kernel is 
unaccepted (IOW, not its memory, but the memory of the first kernel it 
is supposed to expose via /proc/vmcore)?


I recall there might be other kdump-related issues for TDX and friends 
to solve. Especially, which information the second kernel gets provided 
by the first kernel.


So can this patch even be tested reasonably (IOW, get into a kdump 
kernel in an environment where the first kernel has unaccepted memory, 
and verify that unaccepted memory is handled accordingly? ... while 
kdump doing anything reasonable in such an environment at all?)


--
Cheers,

David / dhildenb




Re: [PATCH v2 2/3] vmcore: allow fadump to export vmcore even if is_kdump_kernel() is false

2023-09-11 Thread Michael Ellerman
Hari Bathini  writes:
> Currently, is_kdump_kernel() returns true when elfcorehdr_addr is set.
> While elfcorehdr_addr is set for kexec based kernel dump mechanism,
> alternate dump capturing methods like fadump [1] also set it to export
> the vmcore. Since, is_kdump_kernel() is used to restrict resources in
> crash dump capture kernel and such restrictions are not desirable for
> fadump, allow is_kdump_kernel() to be defined differently for fadump
> case. With that change, include is_fadump_active() check in functions
> is_vmcore_usable() & vmcore_unusable() to be able to export vmcore for
> fadump case too.
...
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index 0f3a656293b0..de8a9fabfb6f 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -50,6 +50,7 @@ void vmcore_cleanup(void);
>  #define vmcore_elf64_check_arch(x) (elf_check_arch(x) || vmcore_elf_check_arch_cross(x))
>  #endif
>  
> +#ifndef is_kdump_kernel
>  /*
>   * is_kdump_kernel() checks whether this kernel is booting after a panic of
>   * previous kernel or not. This is determined by checking if previous kernel
> @@ -64,6 +65,19 @@ static inline bool is_kdump_kernel(void)
>  {
>   return elfcorehdr_addr != ELFCORE_ADDR_MAX;
>  }
> +#endif
> +
> +#ifndef is_fadump_active
> +/*
> + * If f/w assisted dump capturing mechanism (fadump), instead of kexec based
> + * dump capturing mechanism (kdump) is exporting the vmcore, then this function
> + * will be defined in arch specific code to return true, when appropriate.
> + */
> +static inline bool is_fadump_active(void)
> +{
> + return false;
> +}
> +#endif
>  
>  /* is_vmcore_usable() checks if the kernel is booting after a panic and
>   * the vmcore region is usable.
> @@ -75,7 +89,8 @@ static inline bool is_kdump_kernel(void)
>  
>  static inline int is_vmcore_usable(void)
>  {
> - return is_kdump_kernel() && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0;
> + return (is_kdump_kernel() || is_fadump_active())
> + && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0;
>  }
>  
>  /* vmcore_unusable() marks the vmcore as unusable,
> @@ -84,7 +99,7 @@ static inline int is_vmcore_usable(void)
>  
>  static inline void vmcore_unusable(void)
>  {
> - if (is_kdump_kernel())
> + if (is_kdump_kernel() || is_fadump_active())
>   elfcorehdr_addr = ELFCORE_ADDR_ERR;
>  }

I think it would be cleaner to decouple is_vmcore_usable() and
vmcore_unusable() from is_kdump_kernel().

ie, make them operate solely based on the value of elfcorehdr_addr:

static inline int is_vmcore_usable(void)
{
	return elfcorehdr_addr != ELFCORE_ADDR_ERR &&
	       elfcorehdr_addr != ELFCORE_ADDR_MAX;
}

static inline void vmcore_unusable(void)
{
	elfcorehdr_addr = ELFCORE_ADDR_ERR;
}


Then all we need on powerpc is a way to override is_kdump_kernel().

cheers
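
The decoupling suggested above can be sketched as a standalone C fragment. This is an illustration under stated assumptions, not the merged kernel code: the two sentinel macros are stand-ins for the definitions in include/linux/crash_dump.h, and a plain file-scope variable stands in for the kernel's elfcorehdr_addr.

```c
#include <stdbool.h>

/* Stand-ins for the sentinels in include/linux/crash_dump.h. */
#define ELFCORE_ADDR_MAX (-1ULL)	/* no ELF core header was passed in */
#define ELFCORE_ADDR_ERR (-2ULL)	/* header was passed but is unusable */

/* Plain variable standing in for kernel state (assumption of this sketch). */
static unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;

/* Usable means: an address was set and nobody marked it as broken. */
static inline int is_vmcore_usable(void)
{
	return elfcorehdr_addr != ELFCORE_ADDR_ERR &&
	       elfcorehdr_addr != ELFCORE_ADDR_MAX;
}

/* Mark the vmcore unusable regardless of which mechanism set the address. */
static inline void vmcore_unusable(void)
{
	elfcorehdr_addr = ELFCORE_ADDR_ERR;
}
```

Neither helper consults is_kdump_kernel() any more, so kdump and fadump are handled identically. One thing to watch in such a design: because vmcore_unusable() now writes unconditionally, calling it in a kernel that never received an ELF core header would leave elfcorehdr_addr different from ELFCORE_ADDR_MAX, so callers would presumably need to stay confined to dump-capture paths.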



Re: kexec reboot failed due to commit 75d090fd167ac

2023-09-11 Thread Dave Young
Add kexec list in cc

On Sat, 9 Sept 2023 at 19:34, Kirill A. Shutemov wrote:
>
> On Fri, Sep 08, 2023 at 06:17:53PM +0200, Ard Biesheuvel wrote:
> > On Fri, Sep 8, 2023 at 5:58 PM Kees Cook  wrote:
> > >
> > > On Fri, Sep 08, 2023 at 03:32:33PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Sep 08, 2023 at 02:02:30PM +0800, Aaron Lu wrote:
> > > > > On Thu, Sep 07, 2023 at 04:14:09PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Tue, Aug 29, 2023 at 10:04:51PM +0800, Aaron Lu wrote:
> > > > > > > > Could you show dmesg of the first kernel before kexec?
> > > > > > >
> > > > > > > Attached.
> > > > > > >
> > > > > > > BTW, kexec is invoked like this:
> > > > > > > kver=6.4.0-rc5-9-g75d090fd167a
> > > > > > > kdir=$HOME/kernels/$kver
> > > > > > > > sudo kexec -l $kdir/vmlinuz-$kver --initrd=$kdir/initramfs-$kver.img --append="root=UUID=4381321e-e01e-455a-9d46-5e8c4c5b2d02 ro net.ifnames=0 acpi_rsdp=0x728e8014 no_hash_pointers sched_verbose selinux=0"
> > > > > >
> > > > > > I don't understand why it happens.
> > > > > >
> > > > > > Could you check if this patch changes anything:
> > > > > >
> > > > > > > diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> > > > > > index 94b7abcf624b..172c476ff6f3 100644
> > > > > > --- a/arch/x86/boot/compressed/misc.c
> > > > > > +++ b/arch/x86/boot/compressed/misc.c
> > > > > > @@ -456,10 +456,12 @@ asmlinkage __visible void 
> > > > > > *extract_kernel(void *rmode, memptr heap,
> > > > > >
> > > > > >   debug_putstr("\nDecompressing Linux... ");
> > > > > >
> > > > > > +#if 0
> > > > > >   if (init_unaccepted_memory()) {
> > > > > >   debug_putstr("Accepting memory... ");
> > > > > >   accept_memory(__pa(output), __pa(output) + needed_size);
> > > > > >   }
> > > > > > +#endif
> > > > > >
> > > > > > >   __decompress(input_data, input_len, NULL, NULL, output, output_len,
> > > > > >   NULL, error);
> > > > > > --
> > > > >
> > > > > It solved the problem.
> > > >
> > > > Looks like increasing BOOT_INIT_PGT_SIZE fixes the issue. I don't yet
> > > > understand why and how unaccepted memory is involved. I will look more
> > > > into it.
> > > >
> > > > Enabling CONFIG_RANDOMIZE_BASE also makes the issue go away.
> > >
> > > > Is this perhaps just luck? I.e. does it ever break on, say, 1000 boot
> > > attempts? (i.e. maybe some position is bad and KASLR happens to usually
> > > avoid it?)
>
> Yes, it can be luck.
>
> > > > Kees, maybe you have a clue?
> > >
> > > The only thing I can think of is that something isn't being counted
> > > correctly due to the size of code, and it just happens that this commit
> > > makes the code large enough to exceed some set of mappings?
> > >
> > > >
> > > > diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
> > > > index 9191280d9ea3..26ccce41d781 100644
> > > > --- a/arch/x86/include/asm/boot.h
> > > > +++ b/arch/x86/include/asm/boot.h
> > > > @@ -40,7 +40,7 @@
> > > >  #ifdef CONFIG_X86_64
> > > >  # define BOOT_STACK_SIZE 0x4000
> > > >
> > > > -# define BOOT_INIT_PGT_SIZE  (6*4096)
> > > > +# define BOOT_INIT_PGT_SIZE  (7*4096)
> > >
> > > That's why this might be working, for example? How large is the boot
> > > image before/after the commit, etc?
> > >
> >
> > Not sure why these changes would make a difference here, but choking
> > on accept_memory() on a non-TDX suggests that init_unaccepted_memory()
> > is poking into unmapped memory before it even decides that the
> > unaccepted memory does not exist.
> >
> > init_unaccepted_memory() has
> >
> > >     ret = efi_get_conf_table(boot_params, &cfg_table_pa, &cfg_table_len);
> > >     if (ret) {
> > >         warn("EFI config table not found.");
> > >         return false;
> > >     }
> >
> > which looks for  tuples in an array pointed to by the
> > EFI system table, and if either of those is not mapped, things can be
> > expected to explode.
> >
> > The only odd thing there is that this code is invoked after setting up
> > the 'demand paging' logic in the decompressor.
> >
> > If you haven't yet, could you please retry the kexec boot with
> > earlyprintk=tty?
>
> early console in extract_kernel
> input_data: 0x00807eb433a8
> input_len: 0x00d26271
> output: 0x00807b00
> output_len: 0x04800c10
> kernel_total_size: 0x03e28000
> needed_size: 0x04a0
> trampoline_32bit: 0x0009d000
>
> Decompressing Linux... out of pgt_buf in arch/x86/boot/compressed/ident_map_64.c!?
> pages->pgt_buf_offset: 0x6000
> pages->pgt_buf_size: 0x6000
>
>
> Error: kernel_ident_mapping_init() failed
>
> It crashes on #PF due to stbl->nr_tables dereference in
> efi_get_conf_table() called from init_unaccepted_memory().
>
> I don't see anything special about stbl location: 0x775d6018.
>
> One other bit of information: disabling 5-level paging