Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-06-06 Thread Ard Biesheuvel
Hello Ross,

On Fri, 31 May 2024 at 03:32, Ross Philipson  wrote:
>
> The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> later AMD SKINIT) to vector to during the late launch. The symbol
> sl_stub_entry is that entry point and its offset into the kernel is
> conveyed to the launching code using the MLE (Measured Launch
> Environment) header in the structure named mle_header. The offset of the
> MLE header is set in the kernel_info. The routine sl_stub contains the
> very early late launch setup code responsible for setting up the basic
> environment to allow the normal kernel startup_32 code to proceed. It is
> also responsible for properly waking and handling the APs on Intel
> platforms. The routine sl_main which runs after entering 64b mode is
> responsible for measuring configuration and module information before
> it is used like the boot params, the kernel command line, the TXT heap,
> an external initramfs, etc.
>
> Signed-off-by: Ross Philipson 
> ---
>  Documentation/arch/x86/boot.rst|  21 +
>  arch/x86/boot/compressed/Makefile  |   3 +-
>  arch/x86/boot/compressed/head_64.S |  30 +
>  arch/x86/boot/compressed/kernel_info.S |  34 ++
>  arch/x86/boot/compressed/sl_main.c | 577 
>  arch/x86/boot/compressed/sl_stub.S | 725 +
>  arch/x86/include/asm/msr-index.h   |   5 +
>  arch/x86/include/uapi/asm/bootparam.h  |   1 +
>  arch/x86/kernel/asm-offsets.c  |  20 +
>  9 files changed, 1415 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/boot/compressed/sl_main.c
>  create mode 100644 arch/x86/boot/compressed/sl_stub.S
>
> diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
> index 4fd492cb4970..295cdf9bcbdb 100644
> --- a/Documentation/arch/x86/boot.rst
> +++ b/Documentation/arch/x86/boot.rst
> @@ -482,6 +482,14 @@ Protocol:  2.00+
> - If 1, KASLR enabled.
> - If 0, KASLR disabled.
>
> +  Bit 2 (kernel internal): SLAUNCH_FLAG
> +
> +   - Used internally by the setup kernel to communicate
> + Secure Launch status to kernel proper.
> +
> +   - If 1, Secure Launch enabled.
> +   - If 0, Secure Launch disabled.
> +
>Bit 5 (write): QUIET_FLAG
>
> - If 0, print early messages.
> @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
>
>This field contains maximal allowed type for setup_data and setup_indirect 
> structs.
>
> +   =
> +Field name:mle_header_offset
> +Offset/size:   0x0010/4
> +   =
> +
> +  This field contains the offset to the Secure Launch Measured Launch 
> Environment
> +  (MLE) header. This offset is used to locate information needed during a 
> secure
> +  late launch using Intel TXT. If the offset is zero, the kernel does not 
> have
> +  Secure Launch capabilities. The MLE entry point is called from TXT on the 
> BSP
> +  following a success measured launch. The specific state of the processors 
> is
> +  outlined in the TXT Software Development Guide, the latest can be found 
> here:
> +  
> https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> +
>

Could we just repaint this field as the offset relative to the start
of kernel_info rather than relative to the start of the image? That
way, there is no need for patch #1, and given that the consumer of
this field accesses it via kernel_info, I wouldn't expect any issues
in applying this offset to obtain the actual address.


>  The Image Checksum
>  ==
> diff --git a/arch/x86/boot/compressed/Makefile 
> b/arch/x86/boot/compressed/Makefile
> index 9189a0e28686..9076a248d4b4 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -118,7 +118,8 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
>  vmlinux-objs-$(CONFIG_EFI_STUB) += 
> $(objtree)/drivers/firmware/efi/libstub/lib.a
>
> -vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> $(obj)/early_sha256.o
> +vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> $(obj)/early_sha256.o \
> +   $(obj)/sl_main.o $(obj)/sl_stub.o
>
>  $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
> $(call if_changed,ld)
> diff --git a/arch/x86/boot/compressed/head_64.S 
> b/arch/x86/boot/compressed/head_64.S
> index 1dcb794c5479..803c9e2e6d85 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -420,6 +420,13 @@ SYM_CODE_START(startup_64)
> pushq   $0
> popfq
>
> +#ifdef CONFIG_SECURE_LAUNCH
> +   /* Ensure the relocation region is coverd by a PMR */

covered

> +   movq%rbx, %rdi
> +   movl$(_bss - startup_32), %esi
> +   callq   sl_check_region
> +#endif
> +
>  /*
>   * Copy the compressed kernel to the end of our buffer
>   * where decompression in place becomes safe.
> @@ -462,6 

Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

2024-06-05 Thread Ard Biesheuvel
On Wed, 5 Jun 2024 at 09:43, Borislav Petkov  wrote:
>
> On Wed, Jun 05, 2024 at 10:53:44AM +0800, Dave Young wrote:
> > It's something good to have but not must for the time being,  also no
> > idea how to save the status across boot, for EFI boot case probably a
> > EFI var can be used;
>
> Yes.
>
> > but how can it be cleared in case of physical boot.  Otherwise
> > probably injecting some kernel parameters, anyway this needs more
> > thinking.
>
> Yeah, this'll need proper analysis whether we can even do that reliably.
>
> We need to increment it only on the kexec reboot paths and clear it on
> the normal reboot paths.
>

I'd argue for the opposite: ideally, the difference between the first
boot and not-the-first-boot should be abstracted away by the
'bootloader' side of kexec as much as possible, so that the tricky
early startup code doesn't have to be riddled with different code
paths depending on !kexec vs kexec.

TDX is a good case in point here: rather than add more conditionals,
I'd urge removing them so the TDX startup code doesn't have to care
about the difference at all. If there is anything special that needs
to be done, it belongs in the kexec implementation of the previous
kernel.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-06-04 Thread Ard Biesheuvel
On Tue, 4 Jun 2024 at 19:34,  wrote:
>
> On 6/4/24 10:27 AM, Ard Biesheuvel wrote:
> > On Tue, 4 Jun 2024 at 19:24,  wrote:
> >>
> >> On 5/31/24 6:33 AM, Ard Biesheuvel wrote:
> >>> On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
> >>>>
> >>>> Hello Ross,
> >>>>
> >>>> On Fri, 31 May 2024 at 03:32, Ross Philipson  
> >>>> wrote:
> >>>>>
> >>>>> The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> >>>>> later AMD SKINIT) to vector to during the late launch. The symbol
> >>>>> sl_stub_entry is that entry point and its offset into the kernel is
> >>>>> conveyed to the launching code using the MLE (Measured Launch
> >>>>> Environment) header in the structure named mle_header. The offset of the
> >>>>> MLE header is set in the kernel_info. The routine sl_stub contains the
> >>>>> very early late launch setup code responsible for setting up the basic
> >>>>> environment to allow the normal kernel startup_32 code to proceed. It is
> >>>>> also responsible for properly waking and handling the APs on Intel
> >>>>> platforms. The routine sl_main which runs after entering 64b mode is
> >>>>> responsible for measuring configuration and module information before
> >>>>> it is used like the boot params, the kernel command line, the TXT heap,
> >>>>> an external initramfs, etc.
> >>>>>
> >>>>> Signed-off-by: Ross Philipson 
> >>>>> ---
> >>>>>Documentation/arch/x86/boot.rst|  21 +
> >>>>>arch/x86/boot/compressed/Makefile  |   3 +-
> >>>>>arch/x86/boot/compressed/head_64.S |  30 +
> >>>>>arch/x86/boot/compressed/kernel_info.S |  34 ++
> >>>>>arch/x86/boot/compressed/sl_main.c | 577 
> >>>>>arch/x86/boot/compressed/sl_stub.S | 725 
> >>>>> +
> >>>>>arch/x86/include/asm/msr-index.h   |   5 +
> >>>>>arch/x86/include/uapi/asm/bootparam.h  |   1 +
> >>>>>arch/x86/kernel/asm-offsets.c  |  20 +
> >>>>>9 files changed, 1415 insertions(+), 1 deletion(-)
> >>>>>create mode 100644 arch/x86/boot/compressed/sl_main.c
> >>>>>create mode 100644 arch/x86/boot/compressed/sl_stub.S
> >>>>>
> >>>>> diff --git a/Documentation/arch/x86/boot.rst 
> >>>>> b/Documentation/arch/x86/boot.rst
> >>>>> index 4fd492cb4970..295cdf9bcbdb 100644
> >>>>> --- a/Documentation/arch/x86/boot.rst
> >>>>> +++ b/Documentation/arch/x86/boot.rst
> >>>>> @@ -482,6 +482,14 @@ Protocol:  2.00+
> >>>>>   - If 1, KASLR enabled.
> >>>>>   - If 0, KASLR disabled.
> >>>>>
> >>>>> +  Bit 2 (kernel internal): SLAUNCH_FLAG
> >>>>> +
> >>>>> +   - Used internally by the setup kernel to communicate
> >>>>> + Secure Launch status to kernel proper.
> >>>>> +
> >>>>> +   - If 1, Secure Launch enabled.
> >>>>> +   - If 0, Secure Launch disabled.
> >>>>> +
> >>>>>  Bit 5 (write): QUIET_FLAG
> >>>>>
> >>>>>   - If 0, print early messages.
> >>>>> @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> >>>>>
> >>>>>  This field contains maximal allowed type for setup_data and 
> >>>>> setup_indirect structs.
> >>>>>
> >>>>> +   =
> >>>>> +Field name:mle_header_offset
> >>>>> +Offset/size:   0x0010/4
> >>>>> +   =
> >>>>> +
> >>>>> +  This field contains the offset to the Secure Launch Measured Launch 
> >>>>> Environment
> >>>>> +  (MLE) header. This offset is used to locate information needed 
> >>>>> during a secure
> >>>>> +  late launch using Intel TXT. If the offset is zero, the kernel does 
> >>>>> not have
> >>>>> +  Secure Launch capabilities. The MLE entry point is called from TX

Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-06-04 Thread Ard Biesheuvel
On Tue, 4 Jun 2024 at 19:24,  wrote:
>
> On 5/31/24 6:33 AM, Ard Biesheuvel wrote:
> > On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
> >>
> >> Hello Ross,
> >>
> >> On Fri, 31 May 2024 at 03:32, Ross Philipson  
> >> wrote:
> >>>
> >>> The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> >>> later AMD SKINIT) to vector to during the late launch. The symbol
> >>> sl_stub_entry is that entry point and its offset into the kernel is
> >>> conveyed to the launching code using the MLE (Measured Launch
> >>> Environment) header in the structure named mle_header. The offset of the
> >>> MLE header is set in the kernel_info. The routine sl_stub contains the
> >>> very early late launch setup code responsible for setting up the basic
> >>> environment to allow the normal kernel startup_32 code to proceed. It is
> >>> also responsible for properly waking and handling the APs on Intel
> >>> platforms. The routine sl_main which runs after entering 64b mode is
> >>> responsible for measuring configuration and module information before
> >>> it is used like the boot params, the kernel command line, the TXT heap,
> >>> an external initramfs, etc.
> >>>
> >>> Signed-off-by: Ross Philipson 
> >>> ---
> >>>   Documentation/arch/x86/boot.rst|  21 +
> >>>   arch/x86/boot/compressed/Makefile  |   3 +-
> >>>   arch/x86/boot/compressed/head_64.S |  30 +
> >>>   arch/x86/boot/compressed/kernel_info.S |  34 ++
> >>>   arch/x86/boot/compressed/sl_main.c | 577 
> >>>   arch/x86/boot/compressed/sl_stub.S | 725 +
> >>>   arch/x86/include/asm/msr-index.h   |   5 +
> >>>   arch/x86/include/uapi/asm/bootparam.h  |   1 +
> >>>   arch/x86/kernel/asm-offsets.c  |  20 +
> >>>   9 files changed, 1415 insertions(+), 1 deletion(-)
> >>>   create mode 100644 arch/x86/boot/compressed/sl_main.c
> >>>   create mode 100644 arch/x86/boot/compressed/sl_stub.S
> >>>
> >>> diff --git a/Documentation/arch/x86/boot.rst 
> >>> b/Documentation/arch/x86/boot.rst
> >>> index 4fd492cb4970..295cdf9bcbdb 100644
> >>> --- a/Documentation/arch/x86/boot.rst
> >>> +++ b/Documentation/arch/x86/boot.rst
> >>> @@ -482,6 +482,14 @@ Protocol:  2.00+
> >>>  - If 1, KASLR enabled.
> >>>  - If 0, KASLR disabled.
> >>>
> >>> +  Bit 2 (kernel internal): SLAUNCH_FLAG
> >>> +
> >>> +   - Used internally by the setup kernel to communicate
> >>> + Secure Launch status to kernel proper.
> >>> +
> >>> +   - If 1, Secure Launch enabled.
> >>> +   - If 0, Secure Launch disabled.
> >>> +
> >>> Bit 5 (write): QUIET_FLAG
> >>>
> >>>  - If 0, print early messages.
> >>> @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> >>>
> >>> This field contains maximal allowed type for setup_data and 
> >>> setup_indirect structs.
> >>>
> >>> +   =
> >>> +Field name:mle_header_offset
> >>> +Offset/size:   0x0010/4
> >>> +   =
> >>> +
> >>> +  This field contains the offset to the Secure Launch Measured Launch 
> >>> Environment
> >>> +  (MLE) header. This offset is used to locate information needed during 
> >>> a secure
> >>> +  late launch using Intel TXT. If the offset is zero, the kernel does 
> >>> not have
> >>> +  Secure Launch capabilities. The MLE entry point is called from TXT on 
> >>> the BSP
> >>> +  following a success measured launch. The specific state of the 
> >>> processors is
> >>> +  outlined in the TXT Software Development Guide, the latest can be 
> >>> found here:
> >>> +  
> >>> https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> >>> +
> >>>
> >>
> >> Could we just repaint this field as the offset relative to the start
> >> of kernel_info rather than relative to the start o

Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 16:04, Ard Biesheuvel  wrote:
>
> On Fri, 31 May 2024 at 15:33, Ard Biesheuvel  wrote:
> >
> > On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
> > >
> > > Hello Ross,
> > >
> > > On Fri, 31 May 2024 at 03:32, Ross Philipson  
> > > wrote:
> > > >
> > > > The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> > > > later AMD SKINIT) to vector to during the late launch. The symbol
> > > > sl_stub_entry is that entry point and its offset into the kernel is
> > > > conveyed to the launching code using the MLE (Measured Launch
> > > > Environment) header in the structure named mle_header. The offset of the
> > > > MLE header is set in the kernel_info. The routine sl_stub contains the
> > > > very early late launch setup code responsible for setting up the basic
> > > > environment to allow the normal kernel startup_32 code to proceed. It is
> > > > also responsible for properly waking and handling the APs on Intel
> > > > platforms. The routine sl_main which runs after entering 64b mode is
> > > > responsible for measuring configuration and module information before
> > > > it is used like the boot params, the kernel command line, the TXT heap,
> > > > an external initramfs, etc.
> > > >
> > > > Signed-off-by: Ross Philipson 
> > > > ---
> > > >  Documentation/arch/x86/boot.rst|  21 +
> > > >  arch/x86/boot/compressed/Makefile  |   3 +-
> > > >  arch/x86/boot/compressed/head_64.S |  30 +
> > > >  arch/x86/boot/compressed/kernel_info.S |  34 ++
> > > >  arch/x86/boot/compressed/sl_main.c | 577 
> > > >  arch/x86/boot/compressed/sl_stub.S | 725 +
> > > >  arch/x86/include/asm/msr-index.h   |   5 +
> > > >  arch/x86/include/uapi/asm/bootparam.h  |   1 +
> > > >  arch/x86/kernel/asm-offsets.c  |  20 +
> > > >  9 files changed, 1415 insertions(+), 1 deletion(-)
> > > >  create mode 100644 arch/x86/boot/compressed/sl_main.c
> > > >  create mode 100644 arch/x86/boot/compressed/sl_stub.S
> > > >
> > > > diff --git a/Documentation/arch/x86/boot.rst 
> > > > b/Documentation/arch/x86/boot.rst
> > > > index 4fd492cb4970..295cdf9bcbdb 100644
> > > > --- a/Documentation/arch/x86/boot.rst
> > > > +++ b/Documentation/arch/x86/boot.rst
> > > > @@ -482,6 +482,14 @@ Protocol:  2.00+
> > > > - If 1, KASLR enabled.
> > > > - If 0, KASLR disabled.
> > > >
> > > > +  Bit 2 (kernel internal): SLAUNCH_FLAG
> > > > +
> > > > +   - Used internally by the setup kernel to communicate
> > > > + Secure Launch status to kernel proper.
> > > > +
> > > > +   - If 1, Secure Launch enabled.
> > > > +   - If 0, Secure Launch disabled.
> > > > +
> > > >Bit 5 (write): QUIET_FLAG
> > > >
> > > > - If 0, print early messages.
> > > > @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> > > >
> > > >This field contains maximal allowed type for setup_data and 
> > > > setup_indirect structs.
> > > >
> > > > +   =
> > > > +Field name:mle_header_offset
> > > > +Offset/size:   0x0010/4
> > > > +   =
> > > > +
> > > > +  This field contains the offset to the Secure Launch Measured Launch 
> > > > Environment
> > > > +  (MLE) header. This offset is used to locate information needed 
> > > > during a secure
> > > > +  late launch using Intel TXT. If the offset is zero, the kernel does 
> > > > not have
> > > > +  Secure Launch capabilities. The MLE entry point is called from TXT 
> > > > on the BSP
> > > > +  following a success measured launch. The specific state of the 
> > > > processors is
> > > > +  outlined in the TXT Software Development Guide, the latest can be 
> > > > found here:
> > > > +  
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> > > > +
> > > >
> > >
> > > Could we just repaint this field as the offset relative to the start
> > &

Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 15:33, Ard Biesheuvel  wrote:
>
> On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
> >
> > Hello Ross,
> >
> > On Fri, 31 May 2024 at 03:32, Ross Philipson  
> > wrote:
> > >
> > > The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> > > later AMD SKINIT) to vector to during the late launch. The symbol
> > > sl_stub_entry is that entry point and its offset into the kernel is
> > > conveyed to the launching code using the MLE (Measured Launch
> > > Environment) header in the structure named mle_header. The offset of the
> > > MLE header is set in the kernel_info. The routine sl_stub contains the
> > > very early late launch setup code responsible for setting up the basic
> > > environment to allow the normal kernel startup_32 code to proceed. It is
> > > also responsible for properly waking and handling the APs on Intel
> > > platforms. The routine sl_main which runs after entering 64b mode is
> > > responsible for measuring configuration and module information before
> > > it is used like the boot params, the kernel command line, the TXT heap,
> > > an external initramfs, etc.
> > >
> > > Signed-off-by: Ross Philipson 
> > > ---
> > >  Documentation/arch/x86/boot.rst|  21 +
> > >  arch/x86/boot/compressed/Makefile  |   3 +-
> > >  arch/x86/boot/compressed/head_64.S |  30 +
> > >  arch/x86/boot/compressed/kernel_info.S |  34 ++
> > >  arch/x86/boot/compressed/sl_main.c | 577 
> > >  arch/x86/boot/compressed/sl_stub.S | 725 +
> > >  arch/x86/include/asm/msr-index.h   |   5 +
> > >  arch/x86/include/uapi/asm/bootparam.h  |   1 +
> > >  arch/x86/kernel/asm-offsets.c  |  20 +
> > >  9 files changed, 1415 insertions(+), 1 deletion(-)
> > >  create mode 100644 arch/x86/boot/compressed/sl_main.c
> > >  create mode 100644 arch/x86/boot/compressed/sl_stub.S
> > >
> > > diff --git a/Documentation/arch/x86/boot.rst 
> > > b/Documentation/arch/x86/boot.rst
> > > index 4fd492cb4970..295cdf9bcbdb 100644
> > > --- a/Documentation/arch/x86/boot.rst
> > > +++ b/Documentation/arch/x86/boot.rst
> > > @@ -482,6 +482,14 @@ Protocol:  2.00+
> > > - If 1, KASLR enabled.
> > > - If 0, KASLR disabled.
> > >
> > > +  Bit 2 (kernel internal): SLAUNCH_FLAG
> > > +
> > > +   - Used internally by the setup kernel to communicate
> > > + Secure Launch status to kernel proper.
> > > +
> > > +   - If 1, Secure Launch enabled.
> > > +   - If 0, Secure Launch disabled.
> > > +
> > >Bit 5 (write): QUIET_FLAG
> > >
> > > - If 0, print early messages.
> > > @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> > >
> > >This field contains maximal allowed type for setup_data and 
> > > setup_indirect structs.
> > >
> > > +   =
> > > +Field name:mle_header_offset
> > > +Offset/size:   0x0010/4
> > > +   =
> > > +
> > > +  This field contains the offset to the Secure Launch Measured Launch 
> > > Environment
> > > +  (MLE) header. This offset is used to locate information needed during 
> > > a secure
> > > +  late launch using Intel TXT. If the offset is zero, the kernel does 
> > > not have
> > > +  Secure Launch capabilities. The MLE entry point is called from TXT on 
> > > the BSP
> > > +  following a success measured launch. The specific state of the 
> > > processors is
> > > +  outlined in the TXT Software Development Guide, the latest can be 
> > > found here:
> > > +  
> > > https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> > > +
> > >
> >
> > Could we just repaint this field as the offset relative to the start
> > of kernel_info rather than relative to the start of the image? That
> > way, there is no need for patch #1, and given that the consumer of
> > this field accesses it via kernel_info, I wouldn't expect any issues
> > in applying this offset to obtain the actual address.
> >
> >
> > >  The Image Checksum
> > >  ==
> > > diff --git a/arch/x86/boot/compressed/Makefile 
> > > b/arch/x86/boot/compressed/Makefile
> &

Re: [PATCH v9 08/19] x86: Secure Launch kernel early boot stub

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 13:00, Ard Biesheuvel  wrote:
>
> Hello Ross,
>
> On Fri, 31 May 2024 at 03:32, Ross Philipson  
> wrote:
> >
> > The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> > later AMD SKINIT) to vector to during the late launch. The symbol
> > sl_stub_entry is that entry point and its offset into the kernel is
> > conveyed to the launching code using the MLE (Measured Launch
> > Environment) header in the structure named mle_header. The offset of the
> > MLE header is set in the kernel_info. The routine sl_stub contains the
> > very early late launch setup code responsible for setting up the basic
> > environment to allow the normal kernel startup_32 code to proceed. It is
> > also responsible for properly waking and handling the APs on Intel
> > platforms. The routine sl_main which runs after entering 64b mode is
> > responsible for measuring configuration and module information before
> > it is used like the boot params, the kernel command line, the TXT heap,
> > an external initramfs, etc.
> >
> > Signed-off-by: Ross Philipson 
> > ---
> >  Documentation/arch/x86/boot.rst|  21 +
> >  arch/x86/boot/compressed/Makefile  |   3 +-
> >  arch/x86/boot/compressed/head_64.S |  30 +
> >  arch/x86/boot/compressed/kernel_info.S |  34 ++
> >  arch/x86/boot/compressed/sl_main.c | 577 
> >  arch/x86/boot/compressed/sl_stub.S | 725 +
> >  arch/x86/include/asm/msr-index.h   |   5 +
> >  arch/x86/include/uapi/asm/bootparam.h  |   1 +
> >  arch/x86/kernel/asm-offsets.c  |  20 +
> >  9 files changed, 1415 insertions(+), 1 deletion(-)
> >  create mode 100644 arch/x86/boot/compressed/sl_main.c
> >  create mode 100644 arch/x86/boot/compressed/sl_stub.S
> >
> > diff --git a/Documentation/arch/x86/boot.rst 
> > b/Documentation/arch/x86/boot.rst
> > index 4fd492cb4970..295cdf9bcbdb 100644
> > --- a/Documentation/arch/x86/boot.rst
> > +++ b/Documentation/arch/x86/boot.rst
> > @@ -482,6 +482,14 @@ Protocol:  2.00+
> > - If 1, KASLR enabled.
> > - If 0, KASLR disabled.
> >
> > +  Bit 2 (kernel internal): SLAUNCH_FLAG
> > +
> > +   - Used internally by the setup kernel to communicate
> > + Secure Launch status to kernel proper.
> > +
> > +   - If 1, Secure Launch enabled.
> > +   - If 0, Secure Launch disabled.
> > +
> >Bit 5 (write): QUIET_FLAG
> >
> > - If 0, print early messages.
> > @@ -1028,6 +1036,19 @@ Offset/size: 0x000c/4
> >
> >This field contains maximal allowed type for setup_data and 
> > setup_indirect structs.
> >
> > +   =
> > +Field name:mle_header_offset
> > +Offset/size:   0x0010/4
> > +   =
> > +
> > +  This field contains the offset to the Secure Launch Measured Launch 
> > Environment
> > +  (MLE) header. This offset is used to locate information needed during a 
> > secure
> > +  late launch using Intel TXT. If the offset is zero, the kernel does not 
> > have
> > +  Secure Launch capabilities. The MLE entry point is called from TXT on 
> > the BSP
> > +  following a success measured launch. The specific state of the 
> > processors is
> > +  outlined in the TXT Software Development Guide, the latest can be found 
> > here:
> > +  
> > https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> > +
> >
>
> Could we just repaint this field as the offset relative to the start
> of kernel_info rather than relative to the start of the image? That
> way, there is no need for patch #1, and given that the consumer of
> this field accesses it via kernel_info, I wouldn't expect any issues
> in applying this offset to obtain the actual address.
>
>
> >  The Image Checksum
> >  ==
> > diff --git a/arch/x86/boot/compressed/Makefile 
> > b/arch/x86/boot/compressed/Makefile
> > index 9189a0e28686..9076a248d4b4 100644
> > --- a/arch/x86/boot/compressed/Makefile
> > +++ b/arch/x86/boot/compressed/Makefile
> > @@ -118,7 +118,8 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
> >  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
> >  vmlinux-objs-$(CONFIG_EFI_STUB) += 
> > $(objtree)/drivers/firmware/efi/libstub/lib.a
> >
> > -vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> > $(obj)/early_sha256.o
> > +vmlinux-objs

Re: [PATCH v9 19/19] x86: EFI stub DRTM launch support for Secure Launch

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 03:32, Ross Philipson  wrote:
>
> This support allows the DRTM launch to be initiated after an EFI stub
> launch of the Linux kernel is done. This is accomplished by providing
> a handler to jump to when a Secure Launch is in progress. This has to be
> called after the EFI stub does Exit Boot Services.
>
> Signed-off-by: Ross Philipson 

Just some minor remarks below. The overall approach in this patch
looks fine now.


> ---
>  drivers/firmware/efi/libstub/x86-stub.c | 98 +
>  1 file changed, 98 insertions(+)
>
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> b/drivers/firmware/efi/libstub/x86-stub.c
> index d5a8182cf2e1..a1143d006202 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -9,6 +9,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #include 
>  #include 
> @@ -830,6 +832,97 @@ static efi_status_t efi_decompress_kernel(unsigned long 
> *kernel_entry)
> return efi_adjust_memory_range_protection(addr, kernel_text_size);
>  }
>
> +#if (IS_ENABLED(CONFIG_SECURE_LAUNCH))

IS_ENABLED() is mostly used for C conditionals, not CPP ones.

It would be nice if this #if could be dropped, and replaced with ... (see below)


> +static bool efi_secure_launch_update_boot_params(struct slr_table *slrt,
> +struct boot_params 
> *boot_params)
> +{
> +   struct slr_entry_intel_info *txt_info;
> +   struct slr_entry_policy *policy;
> +   struct txt_os_mle_data *os_mle;
> +   bool updated = false;
> +   int i;
> +
> +   txt_info = slr_next_entry_by_tag(slrt, NULL, SLR_ENTRY_INTEL_INFO);
> +   if (!txt_info)
> +   return false;
> +
> +   os_mle = txt_os_mle_data_start((void *)txt_info->txt_heap);
> +   if (!os_mle)
> +   return false;
> +
> +   os_mle->boot_params_addr = (u32)(u64)boot_params;
> +

Why is this safe?

> +   policy = slr_next_entry_by_tag(slrt, NULL, SLR_ENTRY_ENTRY_POLICY);
> +   if (!policy)
> +   return false;
> +
> +   for (i = 0; i < policy->nr_entries; i++) {
> +   if (policy->policy_entries[i].entity_type == 
> SLR_ET_BOOT_PARAMS) {
> +   policy->policy_entries[i].entity = (u64)boot_params;
> +   updated = true;
> +   break;
> +   }
> +   }
> +
> +   /*
> +* If this is a PE entry into EFI stub the mocked up boot params will
> +* be missing some of the setup header data needed for the second 
> stage
> +* of the Secure Launch boot.
> +*/
> +   if (image) {
> +   struct setup_header *hdr = (struct setup_header *)((u8 
> *)image->image_base + 0x1f1);

Could we use something other than a bare 0x1f1 constant here? struct
boot_params has a struct setup_header at the correct offset, so with
some casting of offsetof() use, we can make this look a lot more self
explanatory.


> +   u64 cmdline_ptr, hi_val;
> +
> +   boot_params->hdr.setup_sects = hdr->setup_sects;
> +   boot_params->hdr.syssize = hdr->syssize;
> +   boot_params->hdr.version = hdr->version;
> +   boot_params->hdr.loadflags = hdr->loadflags;
> +   boot_params->hdr.kernel_alignment = hdr->kernel_alignment;
> +   boot_params->hdr.min_alignment = hdr->min_alignment;
> +   boot_params->hdr.xloadflags = hdr->xloadflags;
> +   boot_params->hdr.init_size = hdr->init_size;
> +   boot_params->hdr.kernel_info_offset = hdr->kernel_info_offset;
> +   hi_val = boot_params->ext_cmd_line_ptr;

We have efi_set_u64_split() for this.

> +   cmdline_ptr = boot_params->hdr.cmd_line_ptr | hi_val << 32;
> +   boot_params->hdr.cmdline_size = strlen((const char 
> *)cmdline_ptr);;
> +   }
> +
> +   return updated;
> +}
> +
> +static void efi_secure_launch(struct boot_params *boot_params)
> +{
> +   struct slr_entry_dl_info *dlinfo;
> +   efi_guid_t guid = SLR_TABLE_GUID;
> +   dl_handler_func handler_callback;
> +   struct slr_table *slrt;
> +

... a C conditional here, e.g.,

if (!IS_ENABLED(CONFIG_SECURE_LAUNCH))
return;

The difference is that all the code will get compile test coverage
every time, instead of only in configs that enable
CONFIG_SECURE_LAUNCH.

This significantly reduces the risk that your stuff will get broken
inadvertently.

> +   /*
> +* The presence of this table indicated a Secure Launch
> +* is being requested.
> +*/
> +   slrt = (struct slr_table *)get_efi_config_table(guid);
> +   if (!slrt || slrt->magic != SLR_TABLE_MAGIC)
> +   return;
> +
> +   /*
> +* Since the EFI stub library creates its own boot_params on entry, 
> the
> +* SLRT and TXT heap have to be updated with this 

Re: [RFC PATCH 0/9] kexec x86 purgatory cleanup

2024-04-24 Thread Ard Biesheuvel
On Wed, 24 Apr 2024 at 22:04, Eric W. Biederman  wrote:
>
> Ard Biesheuvel  writes:
>
> > From: Ard Biesheuvel 
> >
> > The kexec purgatory is built like a kernel module, i.e., a partially
> > linked ELF object where each section is allocated and placed
> > individually, and all relocations need to be fixed up, even place
> > relative ones.
> >
> > This makes sense for kernel modules, which share the address space with
> > the core kernel, and contain unresolved references that need to be wired
> > up to symbols in other modules or the kernel itself.
> >
> > The purgatory, however, is a fully linked binary without any external
> > references, or any overlap with the kernel's virtual address space. So
> > it makes much more sense to create a fully linked ELF executable that
> > can just be loaded and run anywhere in memory.
>
> It does have external references that are resolved when it is loaded.
>

It doesn't today, and it hasn't for a while, at least since commit

e4160b2e4b02377c67f8ecd05786811598f39acd
x86/purgatory: Fail the build if purgatory.ro has missing symbols

which forces a build failure on unresolved external references, by
doing a full link of the purgatory.

> Further it is at least my impression that non-PIC code is more
> efficient.  PIC typically requires silly things like Global Offset
> Tables that non-PIC code does not.  At first glance this looks like a
> code pessimization.
>

Given that the 64-bit purgatory can be loaded in memory that is not
32-bit addressable, PIC code is essentially a given, since the
large code model is much worse (it uses 64-bit immediates for all
function and variable symbols, and therefore always uses indirect
calls).

Please refer to

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/build=cba786af84a0f9716204e09f518ce3b7ada8555e

for more details. (Getting pulled into that discussion is how I ended
up looking into the purgatory in more detail)

> Now at lot of functionality has been stripped out of purgatory so maybe
> in it's stripped down this make sense, but I want to challenge the
> notion that this is the obvious thing to do.
>

The diffstat speaks for itself - on x86, much of the allocation and
relocation logic can simply be dropped when building the purgatory in
this manner.

> > The purgatory build on x86 has already switched over to position
> > independent codegen, which only leaves a handful of absolute references,
> > which can either be dropped (patch #3) or converted into a RIP-relative
> > one (patch #4). That leaves a purgatory executable that can run at any
> > offset in memory with applying any relocations whatsoever.
>
> I missed that conversation.  Do you happen to have a pointer?  I would
> think the 32bit code is where the PIC would be most costly as the 32bit
> x86 instruction set predates PIC being a common compilation target.
>

See link above. Note that none of this is about 32-bit code - the
purgatory as it exists today never drops out of long mode (and no
32-bit version appears to exist).

> > Some tweaks are needed to deal with the difference between partially
> > (ET_REL) and fully (ET_DYN/ET_EXEC) linked ELF objects, but with those
> > in place, a substantial amount of complicated ELF allocation, placement
> > and patching/relocation code can simply be dropped.
>
> Really?  As I recall it only needed to handle a single allocation type,
> and there were good reasons (at least when I wrote it) to patch symbols.
>
> Again maybe the fact that people have removed 90% of the functionality
> makes this make sense, but that is not obvious at first glance.
>

Again, the patches and the diffstat speak for themselves - the linker
applies all the relocations at build time, and emits all the sections
into a single ELF segment that can be copied into memory and executed
directly (modulo poking values into the global variables for the
sha256 digest and the segment list)

The last patch in the series shows which code we could drop from the
generic kexec_file_load() implementation once other architectures
adopt this scheme.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC PATCH 4/9] x86/purgatory: Avoid absolute reference to GDT

2024-04-24 Thread Ard Biesheuvel
Hi Brian,

Thanks for taking a look.

On Wed, 24 Apr 2024 at 19:39, Brian Gerst  wrote:
>
> On Wed, Apr 24, 2024 at 12:06 PM Ard Biesheuvel  wrote:
> >
> > From: Ard Biesheuvel 
> >
> > The purgatory is almost entirely position independent, without any need
> > for any relocation processing at load time except for the reference to
> > the GDT in the entry code. Generate this reference at runtime instead,
> > to remove the last R_X86_64_64 relocation from this code.
> >
> > While the GDT itself needs to be preserved in memory as long as it is
> > live, the GDT descriptor that is used to program the GDT can be
> > discarded so it can be allocated on the stack.
> >
> > Signed-off-by: Ard Biesheuvel 
> > ---
> >  arch/x86/purgatory/entry64.S | 10 +++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
> > index 9913877b0dbe..888661d9db9c 100644
> > --- a/arch/x86/purgatory/entry64.S
> > +++ b/arch/x86/purgatory/entry64.S
> > @@ -16,7 +16,11 @@
> >
> >  SYM_CODE_START(entry64)
> > /* Setup a gdt that should be preserved */
> > -   lgdt gdt(%rip)
> > +   leaqgdt(%rip), %rax
> > +   pushq   %rax
> > +   pushw   $gdt_end - gdt - 1
> > +   lgdt(%rsp)
> > +   addq$10, %rsp
>
> This misaligns the stack, pushing 16 bytes on the stack but only
> removing 10 (decimal).
>

pushw subtracts 2 from RSP and stores a word, so the total pushed is
8 + 2 = 10 bytes (decimal), not 16.

> >
> > /* load the data segments */
> > movl$0x18, %eax /* data segment */
> > @@ -83,8 +87,8 @@ SYM_DATA_START_LOCAL(gdt)
> >  * 0x08 unused
> >  * so use them as gdt ptr
>
> obsolete comment
>
> >  */
> > -   .word gdt_end - gdt - 1
> > -   .quad gdt
> > +   .word 0
> > +   .quad 0
> > .word 0, 0, 0
>
> This can be condensed down to:
> .quad 0, 0
>

This code and the comment are removed in the next patch.



[RFC PATCH 9/9] kexec: Drop support for partially linked purgatory executables

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

Remove the handling of purgatories that are allocated, loaded and
relocated as individual ELF sections, which requires a lot of
post-processing on the part of the kexec loader. This has been
superseded by the use of fully linked PIE executables, which do not
require such post-processing.

Signed-off-by: Ard Biesheuvel 
---
 kernel/kexec_file.c | 271 +---
 1 file changed, 14 insertions(+), 257 deletions(-)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 6379f8dfc29f..782a1247558c 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -808,228 +808,31 @@ static int kexec_calculate_store_digests(struct kimage *image)
 
 #ifdef CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY
 /*
- * kexec_purgatory_setup_kbuf - prepare buffer to load purgatory.
- * @pi:Purgatory to be loaded.
- * @kbuf:  Buffer to setup.
- *
- * Allocates the memory needed for the buffer. Caller is responsible to free
- * the memory after use.
- *
- * Return: 0 on success, negative errno on error.
- */
-static int kexec_purgatory_setup_kbuf(struct purgatory_info *pi,
- struct kexec_buf *kbuf)
-{
-   const Elf_Shdr *sechdrs;
-   unsigned long bss_align;
-   unsigned long bss_sz;
-   unsigned long align;
-   int i, ret;
-
-   sechdrs = (void *)pi->ehdr + pi->ehdr->e_shoff;
-   kbuf->buf_align = bss_align = 1;
-   kbuf->bufsz = bss_sz = 0;
-
-   for (i = 0; i < pi->ehdr->e_shnum; i++) {
-   if (!(sechdrs[i].sh_flags & SHF_ALLOC))
-   continue;
-
-   align = sechdrs[i].sh_addralign;
-   if (sechdrs[i].sh_type != SHT_NOBITS) {
-   if (kbuf->buf_align < align)
-   kbuf->buf_align = align;
-   kbuf->bufsz = ALIGN(kbuf->bufsz, align);
-   kbuf->bufsz += sechdrs[i].sh_size;
-   } else {
-   if (bss_align < align)
-   bss_align = align;
-   bss_sz = ALIGN(bss_sz, align);
-   bss_sz += sechdrs[i].sh_size;
-   }
-   }
-   kbuf->bufsz = ALIGN(kbuf->bufsz, bss_align);
-   kbuf->memsz = kbuf->bufsz + bss_sz;
-   if (kbuf->buf_align < bss_align)
-   kbuf->buf_align = bss_align;
-
-   kbuf->buffer = vzalloc(kbuf->bufsz);
-   if (!kbuf->buffer)
-   return -ENOMEM;
-   pi->purgatory_buf = kbuf->buffer;
-
-   ret = kexec_add_buffer(kbuf);
-   if (ret)
-   goto out;
-
-   return 0;
-out:
-   vfree(pi->purgatory_buf);
-   pi->purgatory_buf = NULL;
-   return ret;
-}
-
-/*
- * kexec_purgatory_setup_sechdrs - prepares the pi->sechdrs buffer.
- * @pi:Purgatory to be loaded.
- * @kbuf:  Buffer prepared to store purgatory.
- *
- * Allocates the memory needed for the buffer. Caller is responsible to free
- * the memory after use.
- *
- * Return: 0 on success, negative errno on error.
- */
-static int kexec_purgatory_setup_sechdrs(struct purgatory_info *pi,
-struct kexec_buf *kbuf)
-{
-   unsigned long bss_addr;
-   unsigned long offset;
-   size_t sechdrs_size;
-   Elf_Shdr *sechdrs;
-   int i;
-
-   /*
-* The section headers in kexec_purgatory are read-only. In order to
-* have them modifiable make a temporary copy.
-*/
-   sechdrs_size = array_size(sizeof(Elf_Shdr), pi->ehdr->e_shnum);
-   sechdrs = vzalloc(sechdrs_size);
-   if (!sechdrs)
-   return -ENOMEM;
-   memcpy(sechdrs, (void *)pi->ehdr + pi->ehdr->e_shoff, sechdrs_size);
-   pi->sechdrs = sechdrs;
-
-   offset = 0;
-   bss_addr = kbuf->mem + kbuf->bufsz;
-   kbuf->image->start = pi->ehdr->e_entry;
-
-   for (i = 0; i < pi->ehdr->e_shnum; i++) {
-   unsigned long align;
-   void *src, *dst;
-
-   if (!(sechdrs[i].sh_flags & SHF_ALLOC))
-   continue;
-
-   align = sechdrs[i].sh_addralign;
-   if (sechdrs[i].sh_type == SHT_NOBITS) {
-   bss_addr = ALIGN(bss_addr, align);
-   sechdrs[i].sh_addr = bss_addr;
-   bss_addr += sechdrs[i].sh_size;
-   continue;
-   }
-
-   offset = ALIGN(offset, align);
-
-   /*
-* Check if the segment contains the entry point, if so,
-* calculate the value of image->start based on it.
-* If the compiler has produced more than one .text section
-* (Eg: .text.hot), they are generally after the main .text
-* section, and they shall not 

[RFC PATCH 8/9] x86/purgatory: Simplify references to regs array

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

Use a single symbol reference and offset addressing to load the contents
of the register file from memory, instead of using a symbol reference
for each, which results in larger code and more ELF overhead. While at
it, rename the individual labels with an .L prefix so they are omitted
from the ELF symbol table.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/entry64.S | 67 ++--
 1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
index 3d09781d4f9a..56487fb7fa1d 100644
--- a/arch/x86/purgatory/entry64.S
+++ b/arch/x86/purgatory/entry64.S
@@ -37,45 +37,46 @@ SYM_CODE_START(entry64)
 new_cs_exit:
 
/* Load the registers */
-   movqrax(%rip), %rax
-   movqrbx(%rip), %rbx
-   movqrcx(%rip), %rcx
-   movqrdx(%rip), %rdx
-   movqrsi(%rip), %rsi
-   movqrdi(%rip), %rdi
-   movqrbp(%rip), %rbp
-   movqr8(%rip), %r8
-   movqr9(%rip), %r9
-   movqr10(%rip), %r10
-   movqr11(%rip), %r11
-   movqr12(%rip), %r12
-   movqr13(%rip), %r13
-   movqr14(%rip), %r14
-   movqr15(%rip), %r15
+   leaqentry64_regs(%rip), %r15
+   movq0x00(%r15), %rax
+   movq0x08(%r15), %rcx
+   movq0x10(%r15), %rdx
+   movq0x18(%r15), %rbx
+   movq0x20(%r15), %rbp
+   movq0x28(%r15), %rsi
+   movq0x30(%r15), %rdi
+   movq0x38(%r15), %r8
+   movq0x40(%r15), %r9
+   movq0x48(%r15), %r10
+   movq0x50(%r15), %r11
+   movq0x58(%r15), %r12
+   movq0x60(%r15), %r13
+   movq0x68(%r15), %r14
+   movq0x70(%r15), %r15
 
/* Jump to the new code... */
-   jmpq*rip(%rip)
+   jmpq*.Lrip(%rip)
 SYM_CODE_END(entry64)
 
.section ".rodata"
-   .balign 4
+   .balign 8
 SYM_DATA_START(entry64_regs)
-rax:   .quad 0x0
-rcx:   .quad 0x0
-rdx:   .quad 0x0
-rbx:   .quad 0x0
-rbp:   .quad 0x0
-rsi:   .quad 0x0
-rdi:   .quad 0x0
-r8:.quad 0x0
-r9:.quad 0x0
-r10:   .quad 0x0
-r11:   .quad 0x0
-r12:   .quad 0x0
-r13:   .quad 0x0
-r14:   .quad 0x0
-r15:   .quad 0x0
-rip:   .quad 0x0
+.Lrax: .quad   0x0
+.Lrcx: .quad   0x0
+.Lrdx: .quad   0x0
+.Lrbx: .quad   0x0
+.Lrbp: .quad   0x0
+.Lrsi: .quad   0x0
+.Lrdi: .quad   0x0
+.Lr8:  .quad   0x0
+.Lr9:  .quad   0x0
+.Lr10: .quad   0x0
+.Lr11: .quad   0x0
+.Lr12: .quad   0x0
+.Lr13: .quad   0x0
+.Lr14: .quad   0x0
+.Lr15: .quad   0x0
+.Lrip: .quad   0x0
 SYM_DATA_END(entry64_regs)
 
/* GDT */
-- 
2.44.0.769.g3c40516874-goog




[RFC PATCH 7/9] x86/purgatory: Use fully linked PIE ELF executable

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

Now that the generic support is in place, switch to a fully linked PIE
ELF executable for the purgatory, so that it can be loaded as a single,
fully relocated image. This allows a lot of ugly post-processing logic
to simply be dropped.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/include/asm/kexec.h   |   7 --
 arch/x86/kernel/machine_kexec_64.c | 127 
 arch/x86/purgatory/Makefile|  14 +--
 3 files changed, 5 insertions(+), 143 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index ee7b32565e5f..c7cacc2e9dfb 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -191,13 +191,6 @@ void arch_kexec_unprotect_crashkres(void);
 #define arch_kexec_unprotect_crashkres arch_kexec_unprotect_crashkres
 
 #ifdef CONFIG_KEXEC_FILE
-struct purgatory_info;
-int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
-Elf_Shdr *section,
-const Elf_Shdr *relsec,
-const Elf_Shdr *symtab);
-#define arch_kexec_apply_relocations_add arch_kexec_apply_relocations_add
-
 int arch_kimage_file_post_load_cleanup(struct kimage *image);
 #define arch_kimage_file_post_load_cleanup arch_kimage_file_post_load_cleanup
 #endif
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index bc0a5348b4a6..ded924423e50 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -371,133 +371,6 @@ void machine_kexec(struct kimage *image)
 /* arch-dependent functionality related to kexec file-based syscall */
 
 #ifdef CONFIG_KEXEC_FILE
-/*
- * Apply purgatory relocations.
- *
- * @pi:Purgatory to be relocated.
- * @section:   Section relocations applying to.
- * @relsec:Section containing RELAs.
- * @symtabsec: Corresponding symtab.
- *
- * TODO: Some of the code belongs to generic code. Move that in kexec.c.
- */
-int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
-Elf_Shdr *section, const Elf_Shdr *relsec,
-const Elf_Shdr *symtabsec)
-{
-   unsigned int i;
-   Elf64_Rela *rel;
-   Elf64_Sym *sym;
-   void *location;
-   unsigned long address, sec_base, value;
-   const char *strtab, *name, *shstrtab;
-   const Elf_Shdr *sechdrs;
-
-   /* String & section header string table */
-   sechdrs = (void *)pi->ehdr + pi->ehdr->e_shoff;
-   strtab = (char *)pi->ehdr + sechdrs[symtabsec->sh_link].sh_offset;
-   shstrtab = (char *)pi->ehdr + sechdrs[pi->ehdr->e_shstrndx].sh_offset;
-
-   rel = (void *)pi->ehdr + relsec->sh_offset;
-
-   pr_debug("Applying relocate section %s to %u\n",
-shstrtab + relsec->sh_name, relsec->sh_info);
-
-   for (i = 0; i < relsec->sh_size / sizeof(*rel); i++) {
-
-   /*
-* rel[i].r_offset contains byte offset from beginning
-* of section to the storage unit affected.
-*
-* This is location to update. This is temporary buffer
-* where section is currently loaded. This will finally be
-* loaded to a different address later, pointed to by
-* ->sh_addr. kexec takes care of moving it
-*  (kexec_load_segment()).
-*/
-   location = pi->purgatory_buf;
-   location += section->sh_offset;
-   location += rel[i].r_offset;
-
-   /* Final address of the location */
-   address = section->sh_addr + rel[i].r_offset;
-
-   /*
-* rel[i].r_info contains information about symbol table index
-* w.r.t which relocation must be made and type of relocation
-* to apply. ELF64_R_SYM() and ELF64_R_TYPE() macros get
-* these respectively.
-*/
-   sym = (void *)pi->ehdr + symtabsec->sh_offset;
-   sym += ELF64_R_SYM(rel[i].r_info);
-
-   if (sym->st_name)
-   name = strtab + sym->st_name;
-   else
-   name = shstrtab + sechdrs[sym->st_shndx].sh_name;
-
-   pr_debug("Symbol: %s info: %02x shndx: %02x value=%llx size: %llx\n",
-name, sym->st_info, sym->st_shndx, sym->st_value,
-sym->st_size);
-
-   if (sym->st_shndx == SHN_UNDEF) {
-   pr_err("Undefined symbol: %s\n", name);
-   return -ENOEXEC;
-   }
-
-   if (sym->st_shndx == SHN_COMMON) {
-   pr_err("symbol '%s' in common section\n", name);
-   

[RFC PATCH 6/9] kexec: Add support for fully linked purgatory executables

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The purgatory ELF object is typically a partially linked object, which
puts the burden on the kexec loader to lay out the executable in memory.
This involves (among other things) deciding the placement of the
sections in memory, and fixing up all relocations (relative and
absolute ones).

All of this can be greatly simplified by using a fully linked PIE ELF
executable instead, constructed in a way that removes the need for any
relocation processing or layout and allocation of individual sections.

By gathering all allocatable sections into a single PT_LOAD segment, and
relying on RIP-relative references, all relocations will be applied by
the linker, and the segment simply needs to be copied into memory.

So add a linker script and some minimal handling in generic code, which
can be used by architectures to opt into this approach. This will be
wired up for x86 in a subsequent patch.

Signed-off-by: Ard Biesheuvel 
---
 include/asm-generic/purgatory.lds | 34 ++
 kernel/kexec_file.c   | 68 +++-
 2 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/purgatory.lds b/include/asm-generic/purgatory.lds
new file mode 100644
index ..260c457f7608
--- /dev/null
+++ b/include/asm-generic/purgatory.lds
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+PHDRS
+{
+   text PT_LOAD FLAGS(7) FILEHDR PHDRS;
+}
+
+SECTIONS
+{
+   . = SIZEOF_HEADERS;
+
+   .text : {
+   *(.text .rodata* .kexec-purgatory .data*)
+   } :text
+
+   .bss : {
+   *(.bss .dynbss)
+   } :text
+
+   .rela.dyn : {
+   *(.rela.*)
+   }
+
+   .symtab 0 : { *(.symtab) }
+   .strtab 0 : { *(.strtab) }
+   .shstrtab 0 : { *(.shstrtab) }
+
+   /DISCARD/ : {
+   *(.interp .modinfo .dynsym .dynstr .hash .gnu.* .dynamic .comment)
+   *(.got .plt .got.plt .plt.got .note.* .eh_frame .sframe)
+   }
+}
+
+ASSERT(SIZEOF(.rela.dyn) == 0, "Absolute relocations detected");
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index bef2f6f2571b..6379f8dfc29f 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -1010,6 +1010,62 @@ static int kexec_apply_relocations(struct kimage *image)
return 0;
 }
 
+/*
+ * kexec_load_purgatory_pie - Load the position independent purgatory object.
+ * @pi:Purgatory info struct.
+ * @kbuf:  Memory parameters to use.
+ *
+ * Load a purgatory PIE executable. This is a fully linked executable
+ * consisting of a single PT_LOAD segment that does not require any relocation
+ * processing.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+static int kexec_load_purgatory_pie(struct purgatory_info *pi,
+   struct kexec_buf *kbuf)
+{
+   const Elf_Phdr *phdr = (void *)pi->ehdr + pi->ehdr->e_phoff;
+   int ret;
+
+   if (pi->ehdr->e_phnum != 1)
+   return -EINVAL;
+
+   kbuf->bufsz = phdr->p_filesz;
+   kbuf->memsz = phdr->p_memsz;
+   kbuf->buf_align = phdr->p_align;
+
+   kbuf->buffer = vzalloc(kbuf->bufsz);
+   if (!kbuf->buffer)
+   return -ENOMEM;
+
+   ret = kexec_add_buffer(kbuf);
+   if (ret)
+   goto out_free_kbuf;
+
+   kbuf->image->start = kbuf->mem + pi->ehdr->e_entry;
+
+   pi->sechdrs = vcalloc(pi->ehdr->e_shnum, pi->ehdr->e_shentsize);
+   if (!pi->sechdrs)
+   goto out_free_kbuf;
+
+   pi->purgatory_buf = memcpy(kbuf->buffer,
+  (void *)pi->ehdr + phdr->p_offset,
+  kbuf->bufsz);
+
+   memcpy(pi->sechdrs, (void *)pi->ehdr + pi->ehdr->e_shoff,
+  pi->ehdr->e_shnum * pi->ehdr->e_shentsize);
+
+   for (int i = 0; i < pi->ehdr->e_shnum; i++)
+   if (pi->sechdrs[i].sh_flags & SHF_ALLOC)
+   pi->sechdrs[i].sh_addr += kbuf->mem;
+
+   return 0;
+
+out_free_kbuf:
+   vfree(kbuf->buffer);
+   return ret;
+}
+
 /*
  * kexec_load_purgatory - Load and relocate the purgatory object.
  * @image: Image to add the purgatory to.
@@ -1031,6 +1087,9 @@ int kexec_load_purgatory(struct kimage *image, struct kexec_buf *kbuf)
 
pi->ehdr = (const Elf_Ehdr *)kexec_purgatory;
 
+   if (pi->ehdr->e_type != ET_REL)
+   return kexec_load_purgatory_pie(pi, kbuf);
+
ret = kexec_purgatory_setup_kbuf(pi, kbuf);
if (ret)
return ret;
@@ -1087,7 +1146,8 @@ static const Elf_Sym *kexec_purgatory_find_symbol(struct purgatory_info *pi,
 
/* Go through symbols for a match */
for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) {
-   if (ELF_ST_BIND

[RFC PATCH 2/9] x86/purgatory: Simplify stack handling

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The x86 purgatory, which does little more than verify a SHA-256 hash of
the loaded segments, currently uses three different stacks:
- one in .bss that is used to call the purgatory C code
- one in .rodata that is only used to switch to an updated code segment
  descriptor in the GDT
- one in .data, which allows it to be prepopulated from the kexec loader
  in theory, but this is not actually being taken advantage of.

Simplify this, by dropping the latter two stacks, as well as the loader
logic that programs RSP.

Both the stacks in .bss and .data are 4k aligned, but 16 byte alignment
is more than sufficient.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/include/asm/kexec.h  |  1 -
 arch/x86/kernel/kexec-bzimage64.c |  8 
 arch/x86/purgatory/entry64.S  |  8 
 arch/x86/purgatory/setup-x86_64.S |  2 +-
 arch/x86/purgatory/stack.S| 18 --
 5 files changed, 1 insertion(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 91ca9a9ee3a2..ee7b32565e5f 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -163,7 +163,6 @@ struct kexec_entry64_regs {
uint64_t rcx;
uint64_t rdx;
uint64_t rbx;
-   uint64_t rsp;
uint64_t rbp;
uint64_t rsi;
uint64_t rdi;
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index cde167b0ea92..f5bf1b7d01a6 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -400,7 +400,6 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
struct bzimage64_data *ldata;
struct kexec_entry64_regs regs64;
-   void *stack;
unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
struct kexec_buf kbuf = { .image = image, .buf_max = ULONG_MAX,
@@ -550,14 +549,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
regs64.rbx = 0; /* Bootstrap Processor */
regs64.rsi = bootparam_load_addr;
regs64.rip = kernel_load_addr + 0x200;
-   stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
-   if (IS_ERR(stack)) {
-   pr_err("Could not find address of symbol stack_end\n");
-   ret = -EINVAL;
-   goto out_free_params;
-   }
 
-   regs64.rsp = (unsigned long)stack;
ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", ,
 sizeof(regs64), 0);
if (ret)
diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
index 0b4390ce586b..9913877b0dbe 100644
--- a/arch/x86/purgatory/entry64.S
+++ b/arch/x86/purgatory/entry64.S
@@ -26,8 +26,6 @@ SYM_CODE_START(entry64)
movl%eax, %fs
movl%eax, %gs
 
-   /* Setup new stack */
-   leaqstack_init(%rip), %rsp
pushq   $0x10 /* CS */
leaqnew_cs_exit(%rip), %rax
pushq   %rax
@@ -41,7 +39,6 @@ new_cs_exit:
movqrdx(%rip), %rdx
movqrsi(%rip), %rsi
movqrdi(%rip), %rdi
-   movqrsp(%rip), %rsp
movqrbp(%rip), %rbp
movqr8(%rip), %r8
movqr9(%rip), %r9
@@ -63,7 +60,6 @@ rax:  .quad 0x0
 rcx:   .quad 0x0
 rdx:   .quad 0x0
 rbx:   .quad 0x0
-rsp:   .quad 0x0
 rbp:   .quad 0x0
 rsi:   .quad 0x0
 rdi:   .quad 0x0
@@ -97,7 +93,3 @@ SYM_DATA_START_LOCAL(gdt)
/* 0x18 4GB flat data segment */
.word 0x, 0x, 0x9200, 0x00CF
 SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
-
-SYM_DATA_START_LOCAL(stack)
-   .quad   0, 0
-SYM_DATA_END_LABEL(stack, SYM_L_LOCAL, stack_init)
diff --git a/arch/x86/purgatory/setup-x86_64.S b/arch/x86/purgatory/setup-x86_64.S
index 89d9e9e53fcd..2d10ff88851d 100644
--- a/arch/x86/purgatory/setup-x86_64.S
+++ b/arch/x86/purgatory/setup-x86_64.S
@@ -53,7 +53,7 @@ SYM_DATA_START_LOCAL(gdt)
 SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
 
.bss
-   .balign 4096
+   .balign 16
 SYM_DATA_START_LOCAL(lstack)
.skip 4096
 SYM_DATA_END_LABEL(lstack, SYM_L_LOCAL, lstack_end)
diff --git a/arch/x86/purgatory/stack.S b/arch/x86/purgatory/stack.S
deleted file mode 100644
index 1ef507ca50a5..
--- a/arch/x86/purgatory/stack.S
+++ /dev/null
@@ -1,18 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * purgatory:  stack
- *
- * Copyright (C) 2014 Red Hat Inc.
- */
-
-#include 
-
-   /* A stack for the loaded kernel.
-* Separate and in the data section so it can be prepopulated.
-*/
-   .data
-   .balign 4096
-
-SYM_DATA_START(stack)
-   .skip 4096
-SYM_DATA_END_LABEL(stack, SYM_L_GLOBAL, stack_end)
-- 
2.44.0.769.g3c40516874-goog



[RFC PATCH 1/9] x86/purgatory: Drop function entry padding from purgatory

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The purgatory is a completely separate ELF executable carried inside the
kernel as an opaque binary blob. This means that function entry padding
and the associated ELF metadata are not exposed to the branch tracking
and code patching machinery, and can therefore be dropped from the purgatory
binary.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index a18591f6e6d9..2df4a4b70ff5 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -23,6 +23,9 @@ KBUILD_CFLAGS := $(filter-out -fprofile-sample-use=% -fprofile-use=%,$(KBUILD_CF
 # by kexec. Remove -flto=* flags.
 KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO),$(KBUILD_CFLAGS))
 
+# Drop the function entry padding, which is not needed here
+KBUILD_CFLAGS := $(filter-out $(PADDING_CFLAGS),$(KBUILD_CFLAGS))
+
 # When linking purgatory.ro with -r unresolved symbols are not checked,
 # also link a purgatory.chk binary without -r to check for unresolved symbols.
 PURGATORY_LDFLAGS := -e purgatory_start -z nodefaultlib
-- 
2.44.0.769.g3c40516874-goog




[RFC PATCH 5/9] x86/purgatory: Simplify GDT and drop data segment

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

Data segment selectors are ignored in long mode so there is no point in
programming them. So clear them instead. This only leaves the code
segment entry in the GDT, which can be moved up a slot now that the
second slot is no longer used as the GDT descriptor.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/entry64.S | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
index 888661d9db9c..3d09781d4f9a 100644
--- a/arch/x86/purgatory/entry64.S
+++ b/arch/x86/purgatory/entry64.S
@@ -23,14 +23,14 @@ SYM_CODE_START(entry64)
addq$10, %rsp
 
/* load the data segments */
-   movl$0x18, %eax /* data segment */
+   xorl%eax, %eax /* data segment */
movl%eax, %ds
movl%eax, %es
movl%eax, %ss
movl%eax, %fs
movl%eax, %gs
 
-   pushq   $0x10 /* CS */
+   pushq   $0x8 /* CS */
leaqnew_cs_exit(%rip), %rax
pushq   %rax
lretq
@@ -84,16 +84,9 @@ SYM_DATA_END(entry64_regs)
 SYM_DATA_START_LOCAL(gdt)
/*
 * 0x00 unusable segment
-* 0x08 unused
-* so use them as gdt ptr
 */
-   .word 0
.quad 0
-   .word 0, 0, 0
 
-   /* 0x10 4GB flat code segment */
+   /* 0x8 4GB flat code segment */
.word 0x, 0x, 0x9A00, 0x00AF
-
-   /* 0x18 4GB flat data segment */
-   .word 0x, 0x, 0x9200, 0x00CF
 SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
-- 
2.44.0.769.g3c40516874-goog




[RFC PATCH 0/9] kexec x86 purgatory cleanup

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The kexec purgatory is built like a kernel module, i.e., a partially
linked ELF object where each section is allocated and placed
individually, and all relocations need to be fixed up, even place
relative ones.

This makes sense for kernel modules, which share the address space with
the core kernel, and contain unresolved references that need to be wired
up to symbols in other modules or the kernel itself.

The purgatory, however, is a fully linked binary without any external
references, or any overlap with the kernel's virtual address space. So
it makes much more sense to create a fully linked ELF executable that
can just be loaded and run anywhere in memory.

The purgatory build on x86 has already switched over to position
independent codegen, which only leaves a handful of absolute references,
which can either be dropped (patch #3) or converted into a RIP-relative
one (patch #4). That leaves a purgatory executable that can run at any
offset in memory without applying any relocations whatsoever.

Some tweaks are needed to deal with the difference between partially
(ET_REL) and fully (ET_DYN/ET_EXEC) linked ELF objects, but with those
in place, a substantial amount of complicated ELF allocation, placement
and patching/relocation code can simply be dropped.

The last patch in the series removes this code from the generic kexec
implementation, but this can only be done once other architectures apply
the same changes proposed here for x86 (powerpc, s390 and riscv all
implement the purgatory using the shared logic)

Link: https://lore.kernel.org/all/CAKwvOd=3Jrzju++=Ve61=ZdeshxUM=K3-bGMNREnGOQgNw=a...@mail.gmail.com/
Link: https://lore.kernel.org/all/20240418201705.3673200-2-ardb+...@google.com/

Cc: Arnd Bergmann 
Cc: Eric Biederman 
Cc: kexec@lists.infradead.org
Cc: Nathan Chancellor 
Cc: Nick Desaulniers 
Cc: Kees Cook 
Cc: Bill Wendling 
Cc: Justin Stitt 
Cc: Masahiro Yamada 

Ard Biesheuvel (9):
  x86/purgatory: Drop function entry padding from purgatory
  x86/purgatory: Simplify stack handling
  x86/purgatory: Drop pointless GDT switch
  x86/purgatory: Avoid absolute reference to GDT
  x86/purgatory: Simplify GDT and drop data segment
  kexec: Add support for fully linked purgatory executables
  x86/purgatory: Use fully linked PIE ELF executable
  x86/purgatory: Simplify references to regs array
  kexec: Drop support for partially linked purgatory executables

 arch/x86/include/asm/kexec.h   |   8 -
 arch/x86/kernel/kexec-bzimage64.c  |   8 -
 arch/x86/kernel/machine_kexec_64.c | 127 --
 arch/x86/purgatory/Makefile|  17 +-
 arch/x86/purgatory/entry64.S   |  96 
 arch/x86/purgatory/setup-x86_64.S  |  31 +--
 arch/x86/purgatory/stack.S |  18 --
 include/asm-generic/purgatory.lds  |  34 +++
 kernel/kexec_file.c| 255 +++-
 9 files changed, 125 insertions(+), 469 deletions(-)
 delete mode 100644 arch/x86/purgatory/stack.S
 create mode 100644 include/asm-generic/purgatory.lds

-- 
2.44.0.769.g3c40516874-goog




[RFC PATCH 4/9] x86/purgatory: Avoid absolute reference to GDT

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The purgatory is almost entirely position independent, without any need
for any relocation processing at load time except for the reference to
the GDT in the entry code. Generate this reference at runtime instead,
to remove the last R_X86_64_64 relocation from this code.

While the GDT itself needs to be preserved in memory as long as it is
live, the GDT descriptor that is used to program the GDT can be
discarded so it can be allocated on the stack.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/entry64.S | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
index 9913877b0dbe..888661d9db9c 100644
--- a/arch/x86/purgatory/entry64.S
+++ b/arch/x86/purgatory/entry64.S
@@ -16,7 +16,11 @@
 
 SYM_CODE_START(entry64)
/* Setup a gdt that should be preserved */
-   lgdt gdt(%rip)
+   leaqgdt(%rip), %rax
+   pushq   %rax
+   pushw   $gdt_end - gdt - 1
+   lgdt(%rsp)
+   addq$10, %rsp
 
/* load the data segments */
movl$0x18, %eax /* data segment */
@@ -83,8 +87,8 @@ SYM_DATA_START_LOCAL(gdt)
 * 0x08 unused
 * so use them as gdt ptr
 */
-   .word gdt_end - gdt - 1
-   .quad gdt
+   .word 0
+   .quad 0
.word 0, 0, 0
 
/* 0x10 4GB flat code segment */
-- 
2.44.0.769.g3c40516874-goog




[RFC PATCH 3/9] x86/purgatory: Drop pointless GDT switch

2024-04-24 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The x86 purgatory switches to a new GDT twice, and the first time, it
doesn't even bother to switch to the new code segment. Given that data
segment selectors are ignored in long mode, and the fact that the GDT is
reprogrammed again after returning from purgatory(), the first switch is
entirely pointless and can just be dropped altogether.

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/purgatory/setup-x86_64.S | 29 
 1 file changed, 29 deletions(-)

diff --git a/arch/x86/purgatory/setup-x86_64.S 
b/arch/x86/purgatory/setup-x86_64.S
index 2d10ff88851d..f160fc729cbe 100644
--- a/arch/x86/purgatory/setup-x86_64.S
+++ b/arch/x86/purgatory/setup-x86_64.S
@@ -15,17 +15,6 @@
.code64
 
 SYM_CODE_START(purgatory_start)
-   /* Load a gdt so I know what the segment registers are */
-   lgdtgdt(%rip)
-
-   /* load the data segments */
-   movl$0x18, %eax /* data segment */
-   movl%eax, %ds
-   movl%eax, %es
-   movl%eax, %ss
-   movl%eax, %fs
-   movl%eax, %gs
-
/* Setup a stack */
leaqlstack_end(%rip), %rsp
 
@@ -34,24 +23,6 @@ SYM_CODE_START(purgatory_start)
jmp entry64
 SYM_CODE_END(purgatory_start)
 
-   .section ".rodata"
-   .balign 16
-SYM_DATA_START_LOCAL(gdt)
-   /* 0x00 unusable segment
-* 0x08 unused
-* so use them as the gdt ptr
-*/
-   .word   gdt_end - gdt - 1
-   .quad   gdt
-   .word   0, 0, 0
-
-   /* 0x10 4GB flat code segment */
-   .word   0x, 0x, 0x9A00, 0x00AF
-
-   /* 0x18 4GB flat data segment */
-   .word   0x, 0x, 0x9200, 0x00CF
-SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
-
.bss
.balign 16
 SYM_DATA_START_LOCAL(lstack)
-- 
2.44.0.769.g3c40516874-goog




Re: [PATCH v8 14/15] x86: Secure Launch late initcall platform module

2024-02-23 Thread Ard Biesheuvel
On Thu, 22 Feb 2024 at 14:58, Daniel P. Smith
 wrote:
>
> On 2/15/24 03:40, Ard Biesheuvel wrote:
> > On Wed, 14 Feb 2024 at 23:32, Ross Philipson  
> > wrote:
> >>
> >> From: "Daniel P. Smith" 
> >>
> >> The Secure Launch platform module is a late init module. During the
> >> init call, the TPM event log is read and measurements taken in the
> >> early boot stub code are located. These measurements are extended
> >> into the TPM PCRs using the mainline TPM kernel driver.
> >>
> >> The platform module also registers the securityfs nodes to allow
> >> access to TXT register fields on Intel along with the fetching of
> >> and writing events to the late launch TPM log.
> >>
> >> Signed-off-by: Daniel P. Smith 
> >> Signed-off-by: garnetgrimm 
> >> Signed-off-by: Ross Philipson 
> >
> > There is an awful lot of code that executes between the point where
> > the measurements are taken and the point where they are loaded into
> > the PCRs. All of this code could subvert the boot flow and hide this
> > fact, by replacing the actual taken measurement values with the known
> > 'blessed' ones that will unseal the keys and/or phone home to do a
> > successful remote attestation.
>
> To set context, in general the motivation to employ an RTM, Static or
> Dynamic, integrity solution is to enable external platform validation,
> aka attestation. These trust chains are constructed from the principle
> of measure and execute that rely on the presence of a RoT for Storage
> (RTS) and a RoT for Reporting (RTR). Under the TCG architecture adopted
> by x86 vendors and now recently by Arm, those roles are fulfilled by the
> TPM. With this context, let's lay out the assumptive trusts being made here:
>    1. The CPU GETSEC instruction functions correctly
>    2. The IOMMU, and by extension the PMRs, functions correctly
>    3. The ACM authentication process functions correctly
>    4. The ACM functions correctly
>    5. The TPM interactions function correctly
>    6. The TPM functions correctly
>
> With this basis, let's explore your assertion here. The assertion breaks
> down into two scenarios. The first is that the at-rest kernel binary is
> corrupt, unintentionally (bug) or maliciously, either of which does not
> matter for the situation. For the sake of simplicity, corruption of the
> Linux kernel during loading or before the DRTM Event is considered an
> equivalent to corruption of the kernel at-rest. The second is that the
> kernel binary was corrupted in memory at some point after the DRTM event
> occurs.
>
> For both scenarios, the ACM will correctly configure the IOMMU PMRs to
> ensure the kernel can no longer be tampered with in memory. After which,
> the ACM will then accurately measure the kernel (bzImage) and safely
> store the measurement in the TPM.
>
> In the first scenario, the TPM will accurately report the kernel
> measurement in the attestation. The attestation authority will be able
> to detect if an invalid kernel was started and can take whatever
> remediation actions it may employ.
>
> In the second scenario, any attempt to corrupt the binary after the ACM
> has configured the IOMMU PMR will fail.
>
>

This protects the memory image from external masters after the
measurement has been taken.

So any external influences in the time window between taking the
measurements and loading them into the PCRs are out of scope here, I
guess?

Maybe it would help (or if I missed it - apologies) to include a
threat model here. I suppose physical tampering is out of scope?

> > At the very least, this should be documented somewhere. And if at all
> > possible, it should also be documented why this is ok, and to what
> > extent it limits the provided guarantees compared to a true D-RTM boot
> > where the early boot code measures straight into the TPMs before
> > proceeding.
>
> I can add a rendition of the above into the existing section of the
> documentation patch that already discusses separation of the measurement
> from the TPM recording code. As to the limits it incurs on the DRTM
> integrity, as explained above, I submit there are none.
>

Thanks for the elaborate explanation. And yes, please document this
with the changes.



Re: [PATCH v8 06/15] x86: Add early SHA support for Secure Launch early measurements

2024-02-23 Thread Ard Biesheuvel
On Thu, 22 Feb 2024 at 13:30, Andrew Cooper  wrote:
>
> On 22/02/2024 9:34 am, Ard Biesheuvel wrote:
> > On Thu, 22 Feb 2024 at 04:05, Andrew Cooper  
> > wrote:
> >> On 15/02/2024 8:17 am, Ard Biesheuvel wrote:
> >>> On Wed, 14 Feb 2024 at 23:31, Ross Philipson  
> >>> wrote:
> >>>> From: "Daniel P. Smith" 
> >>>>
> >>>> The SHA algorithms are necessary to measure configuration information 
> >>>> into
> >>>> the TPM as early as possible before using the values. This implementation
> >>>> uses the established approach of #including the SHA libraries directly in
> >>>> the code since the compressed kernel is not uncompressed at this point.
> >>>>
> >>>> The SHA code here has its origins in the code from the main kernel:
> >>>>
> >>>> commit c4d5b9ffa31f ("crypto: sha1 - implement base layer for SHA-1")
> >>>>
> >>>> A modified version of this code was introduced to the lib/crypto/sha1.c
> >>>> to bring it in line with the sha256 code and allow it to be pulled into 
> >>>> the
> >>>> setup kernel in the same manner as sha256 is.
> >>>>
> >>>> Signed-off-by: Daniel P. Smith 
> >>>> Signed-off-by: Ross Philipson 
> >>> We have had some discussions about this, and you really need to
> >>> capture the justification in the commit log for introducing new code
> >>> that implements an obsolete and broken hashing algorithm.
> >>>
> >>> SHA-1 is broken and should no longer be used for anything. Introducing
> >>> new support for a highly complex boot security feature, and then
> >>> relying on SHA-1 in the implementation makes this whole effort seem
> >>> almost futile, *unless* you provide some rock solid reasons here why
> >>> this is still safe.
> >>>
> >>> If the upshot would be that some people are stuck with SHA-1 so they
> >>> won't be able to use this feature, then I'm not convinced we should
> >>> obsess over that.
> >> To be absolutely crystal clear here.
> >>
> The choice of hash algorithm(s) is determined by the OEM and the
> >> platform, not by Linux.
> >>
> >> Failing to (at least) cap a PCR in a bank which the OEM/platform left
> >> active is a security vulnerability.  It permits the unsealing of secrets
> >> if an attacker can replay a good set of measurements into an unused bank.
> >>
> >> The only way to get rid of the requirement for SHA-1 here is to lobby
> >> the IHVs/OEMs, or perhaps the TCG, to produce/spec a platform where the
> >> SHA-1 banks can be disabled.  There are no known such platforms in the
> >> market today, to the best of our knowledge.
> >>
> > OK, so mainline Linux does not support secure launch at all today. At
> > this point, we need to decide whether or not tomorrow's mainline Linux
> > will support secure launch with SHA1 or without, right?
>
> I'd argue that's a slightly unfair characterisation.
>

Fair enough. I'm genuinely trying to have a precise understanding of
this, not trying to be dismissive.

> We want tomorrow's mainline to support Secure Launch.  What that entails
> under the hood is largely outside of the control of the end user.
>

So the debate is really whether it makes sense at all to support
Secure Launch on systems that are stuck on an obsolete and broken hash
algorithm. This is not hyperbole: SHA-1 is broken today and once these
changes hit production 1-2 years down the line, the situation will
only have deteriorated. And another 2-3 years later, we will be the
ones chasing obscure bugs on systems that were already obsolete when
this support was added.

So what is the value proposition here? An end user today, who is
mindful enough of security to actively invest the effort to migrate
their system from ordinary measured boot to secure launch, is really
going to do so on a system that only implements SHA-1 support?

> > And the point you are making here is that we need SHA-1 not only to a)
> > support systems that are on TPM 1.2 and support nothing else, but also
> > to b) ensure that crypto agile TPM 2.0 with both SHA-1 and SHA-256
> > enabled can be supported in a safe manner, which would involve
> > measuring some terminating event into the SHA-1 PCRs to ensure they
> > are not left in a dangling state that might allow an adversary to
> > trick the TPM into unsealing a secret that it shouldn't.
>
> Yes.  Also c) because if the end user

Re: [PATCH v8 06/15] x86: Add early SHA support for Secure Launch early measurements

2024-02-22 Thread Ard Biesheuvel
On Thu, 22 Feb 2024 at 04:05, Andrew Cooper  wrote:
>
> On 15/02/2024 8:17 am, Ard Biesheuvel wrote:
> > On Wed, 14 Feb 2024 at 23:31, Ross Philipson  
> > wrote:
> >> From: "Daniel P. Smith" 
> >>
> >> The SHA algorithms are necessary to measure configuration information into
> >> the TPM as early as possible before using the values. This implementation
> >> uses the established approach of #including the SHA libraries directly in
> >> the code since the compressed kernel is not uncompressed at this point.
> >>
> >> The SHA code here has its origins in the code from the main kernel:
> >>
> >> commit c4d5b9ffa31f ("crypto: sha1 - implement base layer for SHA-1")
> >>
> >> A modified version of this code was introduced to the lib/crypto/sha1.c
> >> to bring it in line with the sha256 code and allow it to be pulled into the
> >> setup kernel in the same manner as sha256 is.
> >>
> >> Signed-off-by: Daniel P. Smith 
> >> Signed-off-by: Ross Philipson 
> > We have had some discussions about this, and you really need to
> > capture the justification in the commit log for introducing new code
> > that implements an obsolete and broken hashing algorithm.
> >
> > SHA-1 is broken and should no longer be used for anything. Introducing
> > new support for a highly complex boot security feature, and then
> > relying on SHA-1 in the implementation makes this whole effort seem
> > almost futile, *unless* you provide some rock solid reasons here why
> > this is still safe.
> >
> > If the upshot would be that some people are stuck with SHA-1 so they
> > won't be able to use this feature, then I'm not convinced we should
> > obsess over that.
>
> To be absolutely crystal clear here.
>
> The choice of hash algorithm(s) is determined by the OEM and the
> platform, not by Linux.
>
> Failing to (at least) cap a PCR in a bank which the OEM/platform left
> active is a security vulnerability.  It permits the unsealing of secrets
> if an attacker can replay a good set of measurements into an unused bank.
>
> The only way to get rid of the requirement for SHA-1 here is to lobby
> the IHVs/OEMs, or perhaps the TCG, to produce/spec a platform where the
> SHA-1 banks can be disabled.  There are no known such platforms in the
> market today, to the best of our knowledge.
>

OK, so mainline Linux does not support secure launch at all today. At
this point, we need to decide whether or not tomorrow's mainline Linux
will support secure launch with SHA1 or without, right?

And the point you are making here is that we need SHA-1 not only to a)
support systems that are on TPM 1.2 and support nothing else, but also
to b) ensure that crypto agile TPM 2.0 with both SHA-1 and SHA-256
enabled can be supported in a safe manner, which would involve
measuring some terminating event into the SHA-1 PCRs to ensure they
are not left in a dangling state that might allow an adversary to
trick the TPM into unsealing a secret that it shouldn't.

So can we support b) without a), and if so, does measuring an
arbitrary dummy event into a PCR that is only meant to keep sealed
forever really require a SHA-1 implementation, or could we just use an
arbitrary (not even random) sequence of 160 bits and use that instead?



Re: [PATCH v8 15/15] x86: EFI stub DRTM launch support for Secure Launch

2024-02-21 Thread Ard Biesheuvel
On Wed, 21 Feb 2024 at 21:37, H. Peter Anvin  wrote:
>
> On February 21, 2024 12:17:30 PM PST, ross.philip...@oracle.com wrote:
> >On 2/15/24 1:01 AM, Ard Biesheuvel wrote:
> >> On Wed, 14 Feb 2024 at 23:32, Ross Philipson  
> >> wrote:
> >>>
> >>> This support allows the DRTM launch to be initiated after an EFI stub
> >>> launch of the Linux kernel is done. This is accomplished by providing
> >>> a handler to jump to when a Secure Launch is in progress. This has to be
> >>> called after the EFI stub does Exit Boot Services.
> >>>
> >>> Signed-off-by: Ross Philipson 
> >>> ---
> >>>   drivers/firmware/efi/libstub/x86-stub.c | 55 +
> >>>   1 file changed, 55 insertions(+)
> >>>
> >>> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> >>> b/drivers/firmware/efi/libstub/x86-stub.c
> >>> index 0d510c9a06a4..4df2cf539194 100644
> >>> --- a/drivers/firmware/efi/libstub/x86-stub.c
> >>> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> >>> @@ -9,6 +9,7 @@
> >>>   #include 
> >>>   #include 
> >>>   #include 
> >>> +#include 
> >>>
> >>>   #include 
> >>>   #include 
> >>> @@ -810,6 +811,57 @@ static efi_status_t efi_decompress_kernel(unsigned 
> >>> long *kernel_entry)
> >>>  return EFI_SUCCESS;
> >>>   }
> >>>
> >>> +static void efi_secure_launch(struct boot_params *boot_params)
> >>> +{
> >>> +   struct slr_entry_uefi_config *uefi_config;
> >>> +   struct slr_uefi_cfg_entry *uefi_entry;
> >>> +   struct slr_entry_dl_info *dlinfo;
> >>> +   efi_guid_t guid = SLR_TABLE_GUID;
> >>> +   struct slr_table *slrt;
> >>> +   u64 memmap_hi;
> >>> +   void *table;
> >>> +   u8 buf[64] = {0};
> >>> +
> >>
> >> If you add a flex array to slr_entry_uefi_config as I suggested in
> >> response to the other patch, we could simplify this substantially
> >
> >I feel like there is some reason why we did not use flex arrays. We were 
> >talking and we seem to remember we used to use them and someone asked us to 
> >remove them. We are still looking into it. But if we can go back to them, I 
> >will take all the changes you recommended here.
> >
>
> Linux kernel code doesn't use VLAs because of the limited stack size, and 
> VLAs or alloca() makes stack size tracking impossible. Although this 
> technically speaking runs in a different environment, it is easier to enforce 
> the constraint globally.

Flex array != VLA

VLAs were phased out because of this reason (and VLAISs [VLAs in
structs] were phased out before that because they are a GNU extension
and not supported by Clang)

Today, VLAs are not supported anywhere in the kernel.

Flex arrays are widely used in the kernel. A flex array is a trailing
array of unspecified size in a struct that makes the entire *type*
have a variable size. But that does not make them VLAs (or VLAISs) - a
VLA is a stack allocated *variable* whose size is based on a function
parameter.

Instances of types containing flex arrays can be allocated statically,
or dynamically on the heap. This is common practice in the kernel, and
even supported by instrumentation to help the compiler track the
runtime size and flag overruns. We are even in the process of adding
compiler support to annotate struct members as carrying the number of
elements in an associated flex array, to improve the coverage of the
instrumentation.

I am not asking for a VLA here, only a flex array.



Re: [PATCH v8 15/15] x86: EFI stub DRTM launch support for Secure Launch

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:32, Ross Philipson  wrote:
>
> This support allows the DRTM launch to be initiated after an EFI stub
> launch of the Linux kernel is done. This is accomplished by providing
> a handler to jump to when a Secure Launch is in progress. This has to be
> called after the EFI stub does Exit Boot Services.
>
> Signed-off-by: Ross Philipson 
> ---
>  drivers/firmware/efi/libstub/x86-stub.c | 55 +
>  1 file changed, 55 insertions(+)
>
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> b/drivers/firmware/efi/libstub/x86-stub.c
> index 0d510c9a06a4..4df2cf539194 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -810,6 +811,57 @@ static efi_status_t efi_decompress_kernel(unsigned long 
> *kernel_entry)
> return EFI_SUCCESS;
>  }
>
> +static void efi_secure_launch(struct boot_params *boot_params)
> +{
> +   struct slr_entry_uefi_config *uefi_config;
> +   struct slr_uefi_cfg_entry *uefi_entry;
> +   struct slr_entry_dl_info *dlinfo;
> +   efi_guid_t guid = SLR_TABLE_GUID;
> +   struct slr_table *slrt;
> +   u64 memmap_hi;
> +   void *table;
> +   u8 buf[64] = {0};
> +

If you add a flex array to slr_entry_uefi_config as I suggested in
response to the other patch, we could simplify this substantially

static struct slr_entry_uefi_config cfg = {
.hdr.tag= SLR_ENTRY_UEFI_CONFIG,
.hdr.size   = sizeof(cfg),
.revision   = SLR_UEFI_CONFIG_REVISION,
.nr_entries = 1,
.entries[0] = {
.pcr= 18,
.evt_info = "Measured UEFI memory map",
},
};

cfg.entries[0].cfg  = boot_params->efi_info.efi_memmap |
  (u64)boot_params->efi_info.efi_memmap_hi << 32;
cfg.entries[0].size = boot_params->efi_info.efi_memmap_size;



> +   table = get_efi_config_table(guid);
> +
> +   /*
> +* The presence of this table indicates a Secure Launch
> +* is being requested.
> +*/
> +   if (!table)
> +   return;
> +
> +   slrt = (struct slr_table *)table;
> +
> +   if (slrt->magic != SLR_TABLE_MAGIC)
> +   return;
> +

slrt = (struct slr_table *)get_efi_config_table(guid);
if (!slrt || slrt->magic != SLR_TABLE_MAGIC)
return;

> +   /* Add config information to measure the UEFI memory map */
> +   uefi_config = (struct slr_entry_uefi_config *)buf;
> +   uefi_config->hdr.tag = SLR_ENTRY_UEFI_CONFIG;
> +   uefi_config->hdr.size = sizeof(*uefi_config) + sizeof(*uefi_entry);
> +   uefi_config->revision = SLR_UEFI_CONFIG_REVISION;
> +   uefi_config->nr_entries = 1;
> +   uefi_entry = (struct slr_uefi_cfg_entry *)(buf + 
> sizeof(*uefi_config));
> +   uefi_entry->pcr = 18;
> +   uefi_entry->cfg = boot_params->efi_info.efi_memmap;
> +   memmap_hi = boot_params->efi_info.efi_memmap_hi;
> +   uefi_entry->cfg |= memmap_hi << 32;
> +   uefi_entry->size = boot_params->efi_info.efi_memmap_size;
> +   memcpy(&uefi_entry->evt_info[0], "Measured UEFI memory map",
> +   strlen("Measured UEFI memory map"));
> +

Drop all of this

> +   if (slr_add_entry(slrt, (struct slr_entry_hdr *)uefi_config))

if (slr_add_entry(slrt, &uefi_config.hdr))


> +   return;
> +
> +   /* Jump through DL stub to initiate Secure Launch */
> +   dlinfo = (struct slr_entry_dl_info *)
> +   slr_next_entry_by_tag(slrt, NULL, SLR_ENTRY_DL_INFO);
> +
> +   asm volatile ("jmp *%%rax"
> + : : "a" (dlinfo->dl_handler), "D" 
> (&dlinfo->bl_context));

Fix the prototype and just do

dlinfo->dl_handler(&dlinfo->bl_context);
unreachable();


So in summary, this becomes

static void efi_secure_launch(struct boot_params *boot_params)
{
static struct slr_entry_uefi_config cfg = {
.hdr.tag= SLR_ENTRY_UEFI_CONFIG,
.hdr.size   = sizeof(cfg),
.revision   = SLR_UEFI_CONFIG_REVISION,
.nr_entries = 1,
.entries[0] = {
.pcr= 18,
.evt_info = "Measured UEFI memory map",
},
};
struct slr_entry_dl_info *dlinfo;
efi_guid_t guid = SLR_TABLE_GUID;
struct slr_table *slrt;

/*
 * The presence of this table indicates a Secure Launch
 * is being requested.
 */
slrt = (struct slr_table *)get_efi_config_table(guid);
if (!slrt || slrt->magic != SLR_TABLE_MAGIC)
return;

cfg.entries[0].cfg  = boot_params->efi_info.efi_memmap |
  (u64)boot_params->efi_info.efi_memmap_hi << 32;
cfg.entries[0].size = boot_params->efi_info.efi_memmap_size;

if (slr_add_entry(slrt, &cfg.hdr))
   

Re: [PATCH v8 14/15] x86: Secure Launch late initcall platform module

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:32, Ross Philipson  wrote:
>
> From: "Daniel P. Smith" 
>
> The Secure Launch platform module is a late init module. During the
> init call, the TPM event log is read and measurements taken in the
> early boot stub code are located. These measurements are extended
> into the TPM PCRs using the mainline TPM kernel driver.
>
> The platform module also registers the securityfs nodes to allow
> access to TXT register fields on Intel along with the fetching of
> and writing events to the late launch TPM log.
>
> Signed-off-by: Daniel P. Smith 
> Signed-off-by: garnetgrimm 
> Signed-off-by: Ross Philipson 

There is an awful lot of code that executes between the point where
the measurements are taken and the point where they are loaded into
the PCRs. All of this code could subvert the boot flow and hide this
fact, by replacing the actual taken measurement values with the known
'blessed' ones that will unseal the keys and/or phone home to do a
successful remote attestation.

At the very least, this should be documented somewhere. And if at all
possible, it should also be documented why this is ok, and to what
extent it limits the provided guarantees compared to a true D-RTM boot
where the early boot code measures straight into the TPMs before
proceeding.


> ---
>  arch/x86/kernel/Makefile   |   1 +
>  arch/x86/kernel/slmodule.c | 511 +
>  2 files changed, 512 insertions(+)
>  create mode 100644 arch/x86/kernel/slmodule.c
>
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 5848ea310175..948346ff4595 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -75,6 +75,7 @@ obj-$(CONFIG_IA32_EMULATION)  += tls.o
>  obj-y  += step.o
>  obj-$(CONFIG_INTEL_TXT)+= tboot.o
>  obj-$(CONFIG_SECURE_LAUNCH)+= slaunch.o
> +obj-$(CONFIG_SECURE_LAUNCH)+= slmodule.o
>  obj-$(CONFIG_ISA_DMA_API)  += i8237.o
>  obj-y  += stacktrace.o
>  obj-y  += cpu/
> diff --git a/arch/x86/kernel/slmodule.c b/arch/x86/kernel/slmodule.c
> new file mode 100644
> index ..52269f24902e
> --- /dev/null
> +++ b/arch/x86/kernel/slmodule.c
> @@ -0,0 +1,511 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Secure Launch late validation/setup, securityfs exposure and finalization.
> + *
> + * Copyright (c) 2022 Apertus Solutions, LLC
> + * Copyright (c) 2021 Assured Information Security, Inc.
> + * Copyright (c) 2022, Oracle and/or its affiliates.
> + *
> + * Co-developed-by: Garnet T. Grimm 
> + * Signed-off-by: Garnet T. Grimm 
> + * Signed-off-by: Daniel P. Smith 
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * The macro DECLARE_TXT_PUB_READ_U is used to read values from the TXT
> + * public registers as unsigned values.
> + */
> +#define DECLARE_TXT_PUB_READ_U(size, fmt, msg_size)\
> +static ssize_t txt_pub_read_u##size(unsigned int offset,   \
> +   loff_t *read_offset,\
> +   size_t read_len,\
> +   char __user *buf)   \
> +{  \
> +   char msg_buffer[msg_size];  \
> +   u##size reg_value = 0;  \
> +   void __iomem *txt;  \
> +   \
> +   txt = ioremap(TXT_PUB_CONFIG_REGS_BASE, \
> +   TXT_NR_CONFIG_PAGES * PAGE_SIZE);   \
> +   if (!txt)   \
> +   return -EFAULT; \
> +   memcpy_fromio(&reg_value, txt + offset, sizeof(u##size));   \
> +   iounmap(txt);   \
> +   snprintf(msg_buffer, msg_size, fmt, reg_value); \
> +   return simple_read_from_buffer(buf, read_len, read_offset,  \
> +   &msg_buffer, msg_size); \
> +}
> +
> +DECLARE_TXT_PUB_READ_U(8, "%#04x\n", 6);
> +DECLARE_TXT_PUB_READ_U(32, "%#010x\n", 12);
> +DECLARE_TXT_PUB_READ_U(64, "%#018llx\n", 20);
> +
> +#define DECLARE_TXT_FOPS(reg_name, reg_offset, reg_size)   \
> +static ssize_t txt_##reg_name##_read(struct file *flip,  
>   \
> +   char __user *buf, size_t read_len, loff_t *read_offset) \
> +{  \
> +   return 

Re: [PATCH v8 07/15] x86: Secure Launch kernel early boot stub

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:32, Ross Philipson  wrote:
>
> The Secure Launch (SL) stub provides the entry point for Intel TXT (and
> later AMD SKINIT) to vector to during the late launch. The symbol
> sl_stub_entry is that entry point and its offset into the kernel is
> conveyed to the launching code using the MLE (Measured Launch
> Environment) header in the structure named mle_header. The offset of the
> MLE header is set in the kernel_info. The routine sl_stub contains the
> very early late launch setup code responsible for setting up the basic
> environment to allow the normal kernel startup_32 code to proceed. It is
> also responsible for properly waking and handling the APs on Intel
> platforms. The routine sl_main which runs after entering 64b mode is
> responsible for measuring configuration and module information before
> it is used like the boot params, the kernel command line, the TXT heap,
> an external initramfs, etc.
>
> Signed-off-by: Ross Philipson 
> ---
>  Documentation/arch/x86/boot.rst|  21 +
>  arch/x86/boot/compressed/Makefile  |   3 +-
>  arch/x86/boot/compressed/head_64.S |  34 ++
>  arch/x86/boot/compressed/kernel_info.S |  34 ++
>  arch/x86/boot/compressed/sl_main.c | 582 
>  arch/x86/boot/compressed/sl_stub.S | 705 +
>  arch/x86/include/asm/msr-index.h   |   5 +
>  arch/x86/include/uapi/asm/bootparam.h  |   1 +
>  arch/x86/kernel/asm-offsets.c  |  20 +
>  9 files changed, 1404 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/boot/compressed/sl_main.c
>  create mode 100644 arch/x86/boot/compressed/sl_stub.S
>
> diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
> index c513855a54bb..ce6a51c6d4e7 100644
> --- a/Documentation/arch/x86/boot.rst
> +++ b/Documentation/arch/x86/boot.rst
> @@ -482,6 +482,14 @@ Protocol:  2.00+
> - If 1, KASLR enabled.
> - If 0, KASLR disabled.
>
> +  Bit 2 (kernel internal): SLAUNCH_FLAG
> +
> +   - Used internally by the compressed kernel to communicate

decompressor

> + Secure Launch status to kernel proper.
> +
> +   - If 1, Secure Launch enabled.
> +   - If 0, Secure Launch disabled.
> +
>Bit 5 (write): QUIET_FLAG
>
> - If 0, print early messages.
> @@ -1027,6 +1035,19 @@ Offset/size: 0x000c/4
>
>This field contains maximal allowed type for setup_data and setup_indirect 
> structs.
>
> +   =
> +Field name:mle_header_offset
> +Offset/size:   0x0010/4
> +   =
> +
> +  This field contains the offset to the Secure Launch Measured Launch 
> Environment
> +  (MLE) header. This offset is used to locate information needed during a 
> secure
> +  late launch using Intel TXT. If the offset is zero, the kernel does not 
> have
> +  Secure Launch capabilities. The MLE entry point is called from TXT on the 
> BSP
> following a successful measured launch. The specific state of the processors 
> is
> +  outlined in the TXT Software Development Guide, the latest can be found 
> here:
> +  
> https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software-development-guide.pdf
> +
>
>  The Image Checksum
>  ==
> diff --git a/arch/x86/boot/compressed/Makefile 
> b/arch/x86/boot/compressed/Makefile
> index a1b018eb9801..012f7ca780c3 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -118,7 +118,8 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
>  vmlinux-objs-$(CONFIG_EFI_STUB) += 
> $(objtree)/drivers/firmware/efi/libstub/lib.a
>
> -vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> $(obj)/early_sha256.o
> +vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> $(obj)/early_sha256.o \
> +   $(obj)/sl_main.o $(obj)/sl_stub.o
>
>  $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
> $(call if_changed,ld)
> diff --git a/arch/x86/boot/compressed/head_64.S 
> b/arch/x86/boot/compressed/head_64.S
> index bf4a10a5794f..6fa5bb87195b 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -415,6 +415,17 @@ SYM_CODE_START(startup_64)
> pushq   $0
> popfq
>
> +#ifdef CONFIG_SECURE_LAUNCH
> +   pushq   %rsi
> +

This push and the associated pop are no longer needed.

> +   /* Ensure the relocation region coverd by a PMR */

'is covered'

> +   movq%rbx, %rdi
> +   movl$(_bss - startup_32), %esi
> +   callq   sl_check_region
> +
> +   popq%rsi
> +#endif
> +
>  /*
>   * Copy the compressed kernel to the end of our buffer
>   * where decompression in place becomes safe.
> @@ -457,6 +468,29 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
> shrq$3, %rcx
> rep stosq
>
> +#ifdef CONFIG_SECURE_LAUNCH
> +   /*
> +* Have to do the final early sl stub work in 

Re: [PATCH v8 06/15] x86: Add early SHA support for Secure Launch early measurements

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:31, Ross Philipson  wrote:
>
> From: "Daniel P. Smith" 
>
> The SHA algorithms are necessary to measure configuration information into
> the TPM as early as possible before using the values. This implementation
> uses the established approach of #including the SHA libraries directly in
> the code since the compressed kernel is not uncompressed at this point.
>
> The SHA code here has its origins in the code from the main kernel:
>
> commit c4d5b9ffa31f ("crypto: sha1 - implement base layer for SHA-1")
>
> A modified version of this code was introduced to the lib/crypto/sha1.c
> to bring it in line with the sha256 code and allow it to be pulled into the
> setup kernel in the same manner as sha256 is.
>
> Signed-off-by: Daniel P. Smith 
> Signed-off-by: Ross Philipson 

We have had some discussions about this, and you really need to
capture the justification in the commit log for introducing new code
that implements an obsolete and broken hashing algorithm.

SHA-1 is broken and should no longer be used for anything. Introducing
new support for a highly complex boot security feature, and then
relying on SHA-1 in the implementation makes this whole effort seem
almost futile, *unless* you provide some rock solid reasons here why
this is still safe.

If the upshot would be that some people are stuck with SHA-1 so they
won't be able to use this feature, then I'm not convinced we should
obsess over that.

> ---
>  arch/x86/boot/compressed/Makefile   |  2 +
>  arch/x86/boot/compressed/early_sha1.c   | 12 
>  arch/x86/boot/compressed/early_sha256.c |  6 ++



>  include/crypto/sha1.h   |  1 +
>  lib/crypto/sha1.c   | 81 +

This needs to be a separate patch in any case.


>  5 files changed, 102 insertions(+)
>  create mode 100644 arch/x86/boot/compressed/early_sha1.c
>  create mode 100644 arch/x86/boot/compressed/early_sha256.c
>
> diff --git a/arch/x86/boot/compressed/Makefile 
> b/arch/x86/boot/compressed/Makefile
> index f19c038409aa..a1b018eb9801 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -118,6 +118,8 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
>  vmlinux-objs-$(CONFIG_EFI_STUB) += 
> $(objtree)/drivers/firmware/efi/libstub/lib.a
>
> +vmlinux-objs-$(CONFIG_SECURE_LAUNCH) += $(obj)/early_sha1.o 
> $(obj)/early_sha256.o
> +
>  $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
> $(call if_changed,ld)
>
> diff --git a/arch/x86/boot/compressed/early_sha1.c 
> b/arch/x86/boot/compressed/early_sha1.c
> new file mode 100644
> index ..0c7cf6f8157a
> --- /dev/null
> +++ b/arch/x86/boot/compressed/early_sha1.c
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Apertus Solutions, LLC.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "../../../../lib/crypto/sha1.c"
> diff --git a/arch/x86/boot/compressed/early_sha256.c 
> b/arch/x86/boot/compressed/early_sha256.c
> new file mode 100644
> index ..54930166ffee
> --- /dev/null
> +++ b/arch/x86/boot/compressed/early_sha256.c
> @@ -0,0 +1,6 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Apertus Solutions, LLC
> + */
> +
> +#include "../../../../lib/crypto/sha256.c"
> diff --git a/include/crypto/sha1.h b/include/crypto/sha1.h
> index 044ecea60ac8..d715dd5332e1 100644
> --- a/include/crypto/sha1.h
> +++ b/include/crypto/sha1.h
> @@ -42,5 +42,6 @@ extern int crypto_sha1_finup(struct shash_desc *desc, const 
> u8 *data,
>  #define SHA1_WORKSPACE_WORDS   16
>  void sha1_init(__u32 *buf);
>  void sha1_transform(__u32 *digest, const char *data, __u32 *W);
> +void sha1(const u8 *data, unsigned int len, u8 *out);
>
>  #endif /* _CRYPTO_SHA1_H */
> diff --git a/lib/crypto/sha1.c b/lib/crypto/sha1.c
> index 1aebe7be9401..10152125b338 100644
> --- a/lib/crypto/sha1.c
> +++ b/lib/crypto/sha1.c
> @@ -137,4 +137,85 @@ void sha1_init(__u32 *buf)
>  }
>  EXPORT_SYMBOL(sha1_init);
>
> +static void __sha1_transform(u32 *digest, const char *data)
> +{
> +   u32 ws[SHA1_WORKSPACE_WORDS];
> +
> +   sha1_transform(digest, data, ws);
> +
> +   memzero_explicit(ws, sizeof(ws));
> +}
> +
> +static void sha1_update(struct sha1_state *sctx, const u8 *data, unsigned 
> int len)
> +{
> +   unsigned int partial = sctx->count % SHA1_BLOCK_SIZE;
> +
> +   sctx->count += len;
> +
> +   if (likely((partial + len) >= SHA1_BLOCK_SIZE)) {
> +   int blocks;
> +
> +   if (partial) {
> +   int p = SHA1_BLOCK_SIZE - partial;
> +
> +   memcpy(sctx->buffer + partial, data, p);
> +   data += p;
> +   len -= p;
> +
> +   __sha1_transform(sctx->state, sctx->buffer);
> +   }
> +
> +   blocks = len / SHA1_BLOCK_SIZE;
> +

Re: [PATCH v8 04/15] x86: Secure Launch Resource Table header file

2024-02-15 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:31, Ross Philipson  wrote:
>
> Introduce the Secure Launch Resource Table which forms the formal
> interface between the pre and post launch code.
>
> Signed-off-by: Ross Philipson 
> ---
>  include/linux/slr_table.h | 270 ++
>  1 file changed, 270 insertions(+)
>  create mode 100644 include/linux/slr_table.h
>
> diff --git a/include/linux/slr_table.h b/include/linux/slr_table.h
> new file mode 100644
> index ..42020988233a
> --- /dev/null
> +++ b/include/linux/slr_table.h
> @@ -0,0 +1,270 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Secure Launch Resource Table
> + *
> + * Copyright (c) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifndef _LINUX_SLR_TABLE_H
> +#define _LINUX_SLR_TABLE_H
> +
> +/* Put this in efi.h if it becomes a standard */
> +#define SLR_TABLE_GUID EFI_GUID(0x877a9b2a, 0x0385, 
> 0x45d1, 0xa0, 0x34, 0x9d, 0xac, 0x9c, 0x9e, 0x56, 0x5f)
> +
> +/* SLR table header values */
> +#define SLR_TABLE_MAGIC0x4452544d
> +#define SLR_TABLE_REVISION 1
> +
> +/* Current revisions for the policy and UEFI config */
> +#define SLR_POLICY_REVISION1
> +#define SLR_UEFI_CONFIG_REVISION   1
> +
> +/* SLR defined architectures */
> +#define SLR_INTEL_TXT  1
> +#define SLR_AMD_SKINIT 2
> +
> +/* SLR defined bootloaders */
> +#define SLR_BOOTLOADER_INVALID 0
> +#define SLR_BOOTLOADER_GRUB1
> +
> +/* Log formats */
> +#define SLR_DRTM_TPM12_LOG 1
> +#define SLR_DRTM_TPM20_LOG 2
> +
> +/* DRTM Policy Entry Flags */
> +#define SLR_POLICY_FLAG_MEASURED   0x1
> +#define SLR_POLICY_IMPLICIT_SIZE   0x2
> +
> +/* Array Lengths */
> +#define TPM_EVENT_INFO_LENGTH  32
> +#define TXT_VARIABLE_MTRRS_LENGTH  32
> +
> +/* Tags */
> +#define SLR_ENTRY_INVALID  0x0000
> +#define SLR_ENTRY_DL_INFO  0x0001
> +#define SLR_ENTRY_LOG_INFO 0x0002
> +#define SLR_ENTRY_ENTRY_POLICY 0x0003
> +#define SLR_ENTRY_INTEL_INFO   0x0004
> +#define SLR_ENTRY_AMD_INFO 0x0005
> +#define SLR_ENTRY_ARM_INFO 0x0006
> +#define SLR_ENTRY_UEFI_INFO0x0007
> +#define SLR_ENTRY_UEFI_CONFIG  0x0008
> +#define SLR_ENTRY_END  0xffff
> +
> +/* Entity Types */
> +#define SLR_ET_UNSPECIFIED 0x0000
> +#define SLR_ET_SLRT0x0001
> +#define SLR_ET_BOOT_PARAMS 0x0002
> +#define SLR_ET_SETUP_DATA  0x0003
> +#define SLR_ET_CMDLINE 0x0004
> +#define SLR_ET_UEFI_MEMMAP 0x0005
> +#define SLR_ET_RAMDISK 0x0006
> +#define SLR_ET_TXT_OS2MLE  0x0010
> +#define SLR_ET_UNUSED  0xffff
> +
> +#ifndef __ASSEMBLY__
> +
> +/*
> + * Primary SLR Table Header
> + */
> +struct slr_table {
> +   u32 magic;
> +   u16 revision;
> +   u16 architecture;
> +   u32 size;
> +   u32 max_size;
> +   /* entries[] */
> +} __packed;

Packing this struct has no effect on the layout so better drop the
__packed here. If this table is part of a structure that can appear
misaligned in memory, better to pack the outer struct or deal with it
there in another way.

> +
> +/*
> + * Common SLRT Table Header
> + */
> +struct slr_entry_hdr {
> +   u16 tag;
> +   u16 size;
> +} __packed;

Same here

> +
> +/*
> + * Boot loader context
> + */
> +struct slr_bl_context {
> +   u16 bootloader;
> +   u16 reserved;
> +   u64 context;
> +} __packed;
> +
> +/*
> + * DRTM Dynamic Launch Configuration
> + */
> +struct slr_entry_dl_info {
> +   struct slr_entry_hdr hdr;
> +   struct slr_bl_context bl_context;
> +   u64 dl_handler;

I noticed in the EFI patch that this is actually

void (*dl_handler)(struct slr_bl_context *bl_context);

so better declare it as such.

> +   u64 dce_base;
> +   u32 dce_size;
> +   u64 dlme_entry;
> +} __packed;
> +
> +/*
> + * TPM Log Information
> + */
> +struct slr_entry_log_info {
> +   struct slr_entry_hdr hdr;
> +   u16 format;
> +   u16 reserved;
> +   u64 addr;
> +   u32 size;
> +} __packed;
> +
> +/*
> + * DRTM Measurement Policy
> + */
> +struct slr_entry_policy {
> +   struct slr_entry_hdr hdr;
> +   u16 revision;
> +   u16 nr_entries;
> +   /* policy_entries[] */

Please use a flex array here:

  struct slr_policy_entry policy_entries[];

> +} __packed;
> +
> +/*
> + * DRTM Measurement Entry
> + */
> +struct slr_policy_entry {
> +   u16 pcr;
> +   u16 entity_type;
> +   u16 flags;
> +   u16 reserved;
> +   u64 entity;
> +   u64 size;
> +   char evt_info[TPM_EVENT_INFO_LENGTH];
> +} __packed;
> +
> +/*
> + * Secure Launch defined MTRR saving structures
> + */
> +struct slr_txt_mtrr_pair {
> +   u64 mtrr_physbase;
> +   u64 mtrr_physmask;
> +} __packed;
> +
> +struct slr_txt_mtrr_state {
> +   u64 default_mem_type;
> +   u64 mtrr_vcnt;
> +   struct slr_txt_mtrr_pair mtrr_pair[TXT_VARIABLE_MTRRS_LENGTH];
> +} __packed;
> +
> +/*
> + * Intel TXT Info 

Re: [PATCH v8 03/15] x86: Secure Launch Kconfig

2024-02-14 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:31, Ross Philipson  wrote:
>
> Initial bits to bring in Secure Launch functionality. Add Kconfig
> options for compiling in/out the Secure Launch code.
>
> Signed-off-by: Ross Philipson 
> ---
>  arch/x86/Kconfig | 12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5edec175b9bf..d96d75f6f1a9 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2071,6 +2071,18 @@ config EFI_RUNTIME_MAP
>
>   See also Documentation/ABI/testing/sysfs-firmware-efi-runtime-map.
>
> +config SECURE_LAUNCH
> +   bool "Secure Launch support"
> +   default n

'n' is already the default, so you can drop this line.

> +   depends on X86_64 && X86_X2APIC

This depends on CONFIG_TCG_TPM as well (I got build failures without it)

> +   help
> +  The Secure Launch feature allows a kernel to be loaded
> +  directly through an Intel TXT measured launch. Intel TXT
> +  establishes a Dynamic Root of Trust for Measurement (DRTM)
> +  where the CPU measures the kernel image. This feature then
> +  continues the measurement chain over kernel configuration
> +  information and init images.
> +
>  source "kernel/Kconfig.hz"
>
>  config ARCH_SUPPORTS_KEXEC
> --
> 2.39.3
>
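Folding in both review points (drop the redundant `default n`, add the TCG_TPM dependency), the option would look something like this sketch, with the help text unchanged:

```kconfig
config SECURE_LAUNCH
	bool "Secure Launch support"
	depends on X86_64 && X86_X2APIC && TCG_TPM
	help
	  The Secure Launch feature allows a kernel to be loaded
	  directly through an Intel TXT measured launch. Intel TXT
	  establishes a Dynamic Root of Trust for Measurement (DRTM)
	  where the CPU measures the kernel image. This feature then
	  continues the measurement chain over kernel configuration
	  information and init images.
```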



Re: [PATCH v8 01/15] x86/boot: Place kernel_info at a fixed offset

2024-02-14 Thread Ard Biesheuvel
On Wed, 14 Feb 2024 at 23:31, Ross Philipson  wrote:
>
> From: Arvind Sankar 
>
> There are use cases for storing the offset of a symbol in kernel_info.
> For example, the trenchboot series [0] needs to store the offset of the
> Measured Launch Environment header in kernel_info.
>

Why? Is this information consumed by the bootloader?

I'd like to get away from x86 specific hacks for boot code and boot
images, so I would like to explore if we can avoid kernel_info, or at
least expose it in a generic way. We might just add a 32-bit offset
somewhere in the first 64 bytes of the bootable image: this could
co-exist with EFI bootable images, and can be implemented on arm64,
RISC-V and LoongArch as well.

> Since commit (note: commit ID from tip/master)
>
> commit 527afc212231 ("x86/boot: Check that there are no run-time relocations")
>
> run-time relocations are not allowed in the compressed kernel, so simply
> using the symbol in kernel_info, as
>
> .long   symbol
>
> will cause a linker error because this is not position-independent.
>
> With kernel_info being a separate object file and in a different section
> from startup_32, there is no way to calculate the offset of a symbol
> from the start of the image in a position-independent way.
>
> To enable such use cases, put kernel_info into its own section which is
> placed at a predetermined offset (KERNEL_INFO_OFFSET) via the linker
> script. This will allow calculating the symbol offset in a
> position-independent way, by adding the offset from the start of
> kernel_info to KERNEL_INFO_OFFSET.
>
> Ensure that kernel_info is aligned, and use the SYM_DATA.* macros
> instead of bare labels. This stores the size of the kernel_info
> structure in the ELF symbol table.
>
> Signed-off-by: Arvind Sankar 
> Cc: Ross Philipson 
> Signed-off-by: Ross Philipson 
> ---
>  arch/x86/boot/compressed/kernel_info.S | 19 +++
>  arch/x86/boot/compressed/kernel_info.h | 12 
>  arch/x86/boot/compressed/vmlinux.lds.S |  6 ++
>  3 files changed, 33 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/kernel_info.h
>
> diff --git a/arch/x86/boot/compressed/kernel_info.S 
> b/arch/x86/boot/compressed/kernel_info.S
> index f818ee8fba38..c18f07181dd5 100644
> --- a/arch/x86/boot/compressed/kernel_info.S
> +++ b/arch/x86/boot/compressed/kernel_info.S
> @@ -1,12 +1,23 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>
> +#include 
>  #include 
> +#include "kernel_info.h"
>
> -   .section ".rodata.kernel_info", "a"
> +/*
> + * If a field needs to hold the offset of a symbol from the start
> + * of the image, use the macro below, eg
> + * .long   rva(symbol)
> + * This will avoid creating run-time relocations, which are not
> + * allowed in the compressed kernel.
> + */
> +
> +#define rva(X) (((X) - kernel_info) + KERNEL_INFO_OFFSET)
>
> -   .global kernel_info
> +   .section ".rodata.kernel_info", "a"
>
> -kernel_info:
> +   .balign 16
> +SYM_DATA_START(kernel_info)
> /* Header, Linux top (structure). */
> .ascii  "LToP"
> /* Size. */
> @@ -19,4 +30,4 @@ kernel_info:
>
>  kernel_info_var_len_data:
> /* Empty for time being... */
> -kernel_info_end:
> +SYM_DATA_END_LABEL(kernel_info, SYM_L_LOCAL, kernel_info_end)
> diff --git a/arch/x86/boot/compressed/kernel_info.h 
> b/arch/x86/boot/compressed/kernel_info.h
> new file mode 100644
> index ..c127f84aec63
> --- /dev/null
> +++ b/arch/x86/boot/compressed/kernel_info.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef BOOT_COMPRESSED_KERNEL_INFO_H
> +#define BOOT_COMPRESSED_KERNEL_INFO_H
> +
> +#ifdef CONFIG_X86_64
> +#define KERNEL_INFO_OFFSET 0x500
> +#else /* 32-bit */
> +#define KERNEL_INFO_OFFSET 0x100
> +#endif
> +
> +#endif /* BOOT_COMPRESSED_KERNEL_INFO_H */
> diff --git a/arch/x86/boot/compressed/vmlinux.lds.S 
> b/arch/x86/boot/compressed/vmlinux.lds.S
> index 083ec6d7722a..718c52f3f1e6 100644
> --- a/arch/x86/boot/compressed/vmlinux.lds.S
> +++ b/arch/x86/boot/compressed/vmlinux.lds.S
> @@ -7,6 +7,7 @@ OUTPUT_FORMAT(CONFIG_OUTPUT_FORMAT)
>
>  #include 
>  #include 
> +#include "kernel_info.h"
>
>  #ifdef CONFIG_X86_64
>  OUTPUT_ARCH(i386:x86-64)
> @@ -27,6 +28,11 @@ SECTIONS
> HEAD_TEXT
> _ehead = . ;
> }
> +   .rodata.kernel_info KERNEL_INFO_OFFSET : {
> +   *(.rodata.kernel_info)
> +   }
> +   ASSERT(ABSOLUTE(kernel_info) == KERNEL_INFO_OFFSET, "kernel_info at 
> bad address!")
> +
> .rodata..compressed : {
> *(.rodata..compressed)
> }
> --
> 2.39.3
>



Re: [PATCH 0/2] Sign the Image which is zboot's payload

2023-09-25 Thread Ard Biesheuvel
On Mon, 25 Sept 2023 at 03:01, Pingfan Liu  wrote:
>
> On Fri, Sep 22, 2023 at 1:19 PM Jan Hendrik Farr  wrote:
> >
...
> > I missed some of the earlier discussion about this zboot kexec support.
> > So just let me know if I'm missing something here. You were exploring
> > these two options in getting this supported:
> >
> > 1. Making kexec_file_load do all the work.
> >
> > This option makes the signature verification easy. kexec_file_load
> > checks the signature on the pe file and then extracts it and does the
> > kexec.
> >
> > This is similar to how I'm approaching UKI support in [1].
> >
>
> Yes, that is my original try.
>
> > 2. Extract in userspace and pass decompressed kernel to kexec_file_load
> >
> > This option requires the decompressed kernel to have a valid signature on
> > it. That's why this patch adds the ability to add that signature to the
> > kernel contained inside the zboot image.
> >
>
> You got it.
>
> > This option would not make sense for UKI support as it would not
> > validate the signature with respect to the initrd and cmdline that it
> > contains. Am I correct in thinking that there is no similar issue with
> > zboot images? They don't contain any more information besides the kernel
> > that is intended to be securely signed, right? Do you have a reference
>
> With my second method, the UKI image is unpacked in user space, and the
> kernel image, initrd and cmdline are passed through the kexec_file_load
> interface. If the UKI can carry signatures on the initrd and cmdline, we
> would extend that interface's capability to verify them as well.
>
> > for the zboot image layout somewhere?
> >
>
> Sorry, there is no document for this as far as I know; I understand it
> from reading the code.
> The zboot image, aka vmlinuz.efi, looks like this:
> PE header, which is formed manually in arch/arm64/kernel/head.S
> EFI decompressor, which consists of
> drivers/firmware/efi/libstub/zboot.c and libstub
> Image.gz, which is formed by compressing Image as instructed in Makefile.zboot
>
>

Indeed, this is currently only documented in code. zboot is a PE
executable that decompresses the kernel and boots it, but it also
carries the base and size of the compressed payload in its header,
along with the compression type so non-EFI loaders can run it as well
(QEMU implements this for gzip on arm64)

> > > I hesitate to post this series,
> >
> > I appreciate you sending it, it's helping the discussion along.
> >

Absolutely. RFCs are important because nobody knows how exactly the
code will look until someone takes the time to implement it. So your
work on this is much appreciated, even if we may decide to take
another approach down the road.

> > > [...] since Ard has recommended using an
> > > emulated UEFI boot service to resolve the UKI kexec load problem [1].
> > > since on aarch64, vmlinuz.efi has faced the similar issue at present.
> > > But anyway, I have a crude outline of it and am sending it out for
> > > discussion.
> >
> > The more I'm thinking about it, the more I like Ard's idea. There's now
> > already two different formats trying to be added to kexec that are
> > pretty different from each other, yet they both have the UEFI interface
> > in common. I think if the kernel supported kexec'ing EFI applications
> > that would be a more flexible and forward-looking approach. It's a
>
> Yes, I agree. That method is attractive; I gave it a try when Ard first
> suggested it, but there was no clear boundary on which boot services
> should be implemented for zboot, so I did not continue in that
> direction.
>
> Now, UKI poses another challenge to kexec_file_load, and seems to
> require more than zboot. And it appears that Ard's approach is a
> silver bullet for that issue.
>

Yes, it looks appealing but it will take some time to iterate on ideas
and converge on an implementation.

> > standard that both zboot and UKI as well as all future formats for UEFI
> > platforms will support anyways. So while it's more work right now to
> > implement, I think it'll likely pay off.
> >
> > It is significantly more work than the other options though. So I think
> > before work is started on it, it would be nice to get some type of
> > consensus on these things (not an exhaustive list, please feel free to
> > add to it):
> >
>
> I try to answer part of the questions.
>
> > 1. Is it the right approach? It adds a significant amount of userspace
> > API.
>
> My crude assumption: this new stub would replace the purgatory, though I
> am not sure whether the kexec-tools source tree will accommodate it. It
> can be signed and checked during kexec_file_load.
>
> > 2. What subset of the UEFI spec needs/should to be supported?
> > 3. Can we let runtime services still be handled by the firmware after
> > exiting boot services?
>
> I think the runtime services survive across the kexec process. They are
> provided by the real firmware and are not related to this stub.
>

Yes, this should be possible.

> > 4. How can we debug the stubs that are 

Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-20 Thread Ard Biesheuvel
On Wed, 20 Sept 2023 at 08:40, Dave Young  wrote:
>
> On Wed, 20 Sept 2023 at 15:43, Dave Young  wrote:
> >
> > > > In the end the only benefit this series brings is to extend the
> > > > signature checking to the whole UKI instead of just the kernel image.
> > > > Everything else can also be done in user space. Compared to the
> > > > problems described above this is a very small gain for me.
> > >
> > > Correct. That is the benefit of pulling the UKI apart in the
> > > kernel. However having to sign the kernel inside the UKI defeats
> > > the whole point.
> >
> >
> > Pingfan added the zboot load support in kexec-tools, I know that he is
> > trying to sign the zboot image and the inside kernel twice. So
> > probably there are some common areas which can be discussed.
> > Added Ard and Pingfan in cc.
> > http://lists.infradead.org/pipermail/kexec/2023-August/027674.html
> >
>
> Here is another thread of the initial try in kernel with a few more
> options eg. some fake efi service helpers.
> https://lore.kernel.org/linux-arm-kernel/zbvksis+dfnqa...@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e
>

Currently, UKI's external interface is defined in terms of EFI
services, i.e., it is an executable PE/COFF binary that encapsulates
all the logic that performs the unpacking of the individual sections,
and loads the kernel as a PE/COFF binary as well (i.e., via
LoadImage/StartImage)

As soon as we add support to Linux to unpack a UKI and boot the
encapsulated kernel using a boot protocol other than EFI, we are
painting ourselves into a corner, severely limiting the freedom of the
UKI effort to make changes to the interfaces that were implementation
details up to this point.

It also means that UKI handling in kexec will need to be taught about
every individual architecture again, which is something we are trying
to avoid with EFI support in general. Breaking the abstraction like
this lets the cat out of the bag, and will add yet another variation
of kexec that we will need to support and maintain forever.

So the only way to do this properly and portably is to implement the
minimal set of EFI boot services [0] that Linux actually needs to run
its EFI stub (which is mostly identical to the set that UKI relies on
afaict), and expose them to the kexec image as it is being loaded.
This is not as bad as it sounds - I have some Rust code that could be
used as an inspiration [1] and which could be reused and shared
between architectures.

This would also reduce/remove the need for a purgatory: loading a EFI
binary in this way would run it up to the point were it calls
ExitBootServices(), and the actual kexec would invoke the image as if
it was returning from ExitBootServices().

The only fundamental problem here is the need to allocate large chunks
of physical memory, which would need some kind of CMA support, I
imagine?

Maybe we should do a BoF at LPC to discuss this further?

[0] this is not as bad as it sounds: beyond a protocol database, a
heap allocator and a memory map, there is actually very little needed
to boot Linux via the EFI stub (although UKI needs
LoadImage/StartImage as well)

[1] https://github.com/ardbiesheuvel/efilite

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-08-06 Thread Ard Biesheuvel
On Sat, 5 Aug 2023 at 11:18, Borislav Petkov  wrote:
>
> On Thu, Aug 03, 2023 at 01:11:54PM +0200, Ard Biesheuvel wrote:
> > Sadly, not only 'old' grubs - GRUB mainline only recently added
> > support for booting Linux/x86 via the EFI stub (because I wrote the
> > code for them),
>
> haha.
>
> > but it will still fall back to the previous mode for kernels that are
> > built without EFI stub support, or which are older than ~v5.8 (because
> > their EFI stub does not implement the generic EFI initrd loading
> > mechanism)
>
> The thing is, those SNP kernels pretty much use the EFI boot mechanism.
> I mean, don't take my word for it as I run SNP guests only from time to
> time but that's what everyone uses AFAIK.
>
> > Yeah. what seems to be saving our ass here is that startup_32 maps the
> > first 1G of physical address space 4 times, and x86_64 EFI usually
> > puts firmware tables below 4G. This means the cc blob check doesn't
> > fault, but it may dereference bogus memory traversing the config table
> > array looking for the cc blob GUID. However, the system table field
> > holding the size of the array may also appear as bogus so this may
> > still break in weird ways.
>
> Oh fun.
>

This is not actually true, I misread the code.

The initial mapping is 1:1 for the lower 4G of system memory, so
anything that lives there is accessible before the demand paging stuff
is up and running.

IOW, your change should be sufficient to fix this even when entering
via the 32-bit entry point.



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-08-03 Thread Ard Biesheuvel
On Thu, 3 Aug 2023 at 13:11, Ard Biesheuvel  wrote:
>
> On Wed, 2 Aug 2023 at 17:52, Borislav Petkov  wrote:
> >
> > On Wed, Aug 02, 2023 at 04:55:27PM +0200, Ard Biesheuvel wrote:
> > > ... because now, entering via startup_32 is broken, given that it only
> > > maps the kernel image itself and relies on the #PF handling for
> > > everything else it accesses, including firmware tables.
> > >
> > > AFAICT this also means that entering via startup_32 is broken entirely
> > > for any configuration that enables the cc blob config table check,
> > > regardless of the platform.
> >
> > Lemme brain-dump what Tom and I just talked on IRC.
> >
> > That startup_32 entry path for SNP guests was used with old grubs which
> > used to enter through there and not anymore, reportedly. Which means,
> > that must've worked at some point but Joerg would know. CCed.
> >
>
> Sadly, not only 'old' grubs - GRUB mainline only recently added
> support for booting Linux/x86 via the EFI stub (because I wrote the
> code for them), but it will still fall back to the previous mode for
> kernels that are built without EFI stub support, or which are older
> than ~v5.8 (because their EFI stub does not implement the generic EFI
> initrd loading mechanism)
>
> This fallback still appears to enter via startup_32, even when GRUB
> itself runs in long mode in the context of EFI.
>
> > Newer grubs enter through the 64-bit entry point and thus are fine
> > - otherwise we would be seeing explosions left and right.
> >
>
> Yeah. what seems to be saving our ass here is that startup_32 maps the
> first 1G of physical address space 4 times, and x86_64 EFI usually
> puts firmware tables below 4G. This means the cc blob check doesn't
> fault, but it may dereference bogus memory traversing the config table
> array looking for the cc blob GUID. However, the system table field
> holding the size of the array may also appear as bogus so this may
> still break in weird ways.
>
> > So dependent on what we wanna do, if we kill the 32-bit path, we can
> > kill the 32-bit C-bit verif code. But that's for later and an item on my
> > TODO list.
> >
>
> I don't think we can kill it yet, but it would be nice if we could
> avoid the need to support SNP boot when entering that way.

https://lists.gnu.org/archive/html/grub-devel/2023-08/msg5.html

Coming to your distro any decade now!



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-08-03 Thread Ard Biesheuvel
On Wed, 2 Aug 2023 at 17:52, Borislav Petkov  wrote:
>
> On Wed, Aug 02, 2023 at 04:55:27PM +0200, Ard Biesheuvel wrote:
> > ... because now, entering via startup_32 is broken, given that it only
> > maps the kernel image itself and relies on the #PF handling for
> > everything else it accesses, including firmware tables.
> >
> > AFAICT this also means that entering via startup_32 is broken entirely
> > for any configuration that enables the cc blob config table check,
> > regardless of the platform.
>
> Lemme brain-dump what Tom and I just talked on IRC.
>
> That startup_32 entry path for SNP guests was used with old grubs which
> used to enter through there and not anymore, reportedly. Which means,
> that must've worked at some point but Joerg would know. CCed.
>

Sadly, not only 'old' grubs - GRUB mainline only recently added
support for booting Linux/x86 via the EFI stub (because I wrote the
code for them), but it will still fall back to the previous mode for
kernels that are built without EFI stub support, or which are older
than ~v5.8 (because their EFI stub does not implement the generic EFI
initrd loading mechanism)

This fallback still appears to enter via startup_32, even when GRUB
itself runs in long mode in the context of EFI.

> Newer grubs enter through the 64-bit entry point and thus are fine
> - otherwise we would be seeing explosions left and right.
>

Yeah. what seems to be saving our ass here is that startup_32 maps the
first 1G of physical address space 4 times, and x86_64 EFI usually
puts firmware tables below 4G. This means the cc blob check doesn't
fault, but it may dereference bogus memory traversing the config table
array looking for the cc blob GUID. However, the system table field
holding the size of the array may also appear as bogus so this may
still break in weird ways.

> So dependent on what we wanna do, if we kill the 32-bit path, we can
> kill the 32-bit C-bit verif code. But that's for later and an item on my
> TODO list.
>

I don't think we can kill it yet, but it would be nice if we could
avoid the need to support SNP boot when entering that way.



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-08-02 Thread Ard Biesheuvel
On Wed, 2 Aug 2023 at 15:59, Borislav Petkov  wrote:
>
> On Wed, Aug 02, 2023 at 08:40:36AM -0500, Tom Lendacky wrote:
> > Short of figuring out how to map page accesses earlier through the
> > boot_page_fault IDT routine
>
> And you want to do that because?
>

... because now, entering via startup_32 is broken, given that it only
maps the kernel image itself and relies on the #PF handling for
everything else it accesses, including firmware tables.

AFAICT this also means that entering via startup_32 is broken entirely
for any configuration that enables the cc blob config table check,
regardless of the platform.



Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-07-17 Thread Ard Biesheuvel
On Mon, 17 Jul 2023 at 15:53, Tao Liu  wrote:
>
> Hi Borislav,
>
> On Thu, Jul 13, 2023 at 6:05 PM Borislav Petkov  wrote:
> >
> > On Thu, Jun 01, 2023 at 03:20:44PM +0800, Tao Liu wrote:
> > >  arch/x86/kernel/machine_kexec_64.c | 35 ++
> > >  1 file changed, 31 insertions(+), 4 deletions(-)
> >
> > Ok, pls try this totally untested thing.
> >
> > Thx.
> >
> > ---
> > diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> > index 09dc8c187b3c..fefe27b2af85 100644
> > --- a/arch/x86/boot/compressed/sev.c
> > +++ b/arch/x86/boot/compressed/sev.c
> > @@ -404,13 +404,20 @@ void sev_enable(struct boot_params *bp)
> > if (bp)
> > bp->cc_blob_address = 0;
> >
> > +   /* Check for the SME/SEV support leaf */
> > +   eax = 0x80000000;
> > +   ecx = 0;
> > +   native_cpuid(&eax, &ebx, &ecx, &edx);
> > +   if (eax < 0x8000001f)
> > +   return;
> > +
> > /*
> >  * Setup/preliminary detection of SNP. This will be sanity-checked
> >  * against CPUID/MSR values later.
> >  */
> > snp = snp_init(bp);
> >
> > -   /* Check for the SME/SEV support leaf */
> > +   /* Recheck the SME/SEV support leaf */
> > eax = 0x80000000;
> > ecx = 0;
> > native_cpuid(&eax, &ebx, &ecx, &edx);
> >
> Thanks a lot for the patch above! Sorry for the late response. I have
> compiled and tested it locally against 6.5.0-rc1, though it can pass
> the early stage of kexec kernel bootup,

OK, so that proves that the cc_blob table access is the culprit here.
That still means that kexec on SEV is likely to explode in the exact
same way should anyone attempt that.


> however, the kernel occasionally panics later on. The test machine is
> the one with the Intel Atom x6425RE CPU that hit the page-fault issue
> caused by the missing EFI config table.
>

Agree with Boris that this seems entirely unrelated.

> ...snip...
> [   21.360763]  nvme0n1: p1 p2 p3
> [   21.364207] igc 0000:03:00.0: PTM enabled, 4ns granularity
> [   21.421097] pps pps1: new PPS source ptp1
> [   21.425396] igc 0000:03:00.0 (unnamed net_device) (uninitialized): PHC added
> [   21.457005] igc 0000:03:00.0: 4.000 Gb/s available PCIe bandwidth
> (5.0 GT/s PCIe x1 link)
> [   21.465210] igc 0000:03:00.0 eth1: MAC: ...snip...
> [   21.473424] igc 0000:03:00.0 enp3s0: renamed from eth1
> [   21.479446] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [   21.486405] #PF: supervisor read access in kernel mode
> [   21.491519] mmc1: Failed to initialize a non-removable card
> [   21.491538] #PF: error_code(0x0000) - not-present page
> [   21.502229] PGD 0 P4D 0
> [   21.504773] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [   21.509133] CPU: 3 PID: 402 Comm: systemd-udevd Not tainted 6.5.0-rc1+ #1
> [   21.515905] Hardware name: ...snip...


Why are you snipping the hardware name?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-07-13 Thread Ard Biesheuvel
On Fri, 7 Jul 2023 at 19:12, Borislav Petkov  wrote:
>
> On Fri, Jul 07, 2023 at 10:25:15AM -0500, Michael Roth wrote:
> > ...
> > It would be unfortunate if we finally abandoned this path because of the
> > issue being hit here though. I think the patch posted here is the proper
> > resolution to the issue being hit, and I'm hoping at this point we've
> > identified all the similar cases where EFI/setup_data-related structures
> > were missing explicit mappings. But if we still think it's too much of a
> > liability to access the EFI config table outside of SEV-enabled guests,
> > then I can work on re-implementing things based on the above logic.
>
> Replying here to Tom's note too...
>
> So, I like the idea of rechecking CPUID. Yes, let's do the sev_status
> check. As a result, we either fail the guest - no problem - or we boot
> and we recheck. Thus, we don't run AMD code on !AMD machines, if the HV
> is not a lying bastard.
>
> Now, if we've gotten a valid setup_data SETUP_EFI entry with a valid
> pointer to an EFI config table, then that should happen in the generic
> path - initialize_identity_maps(), for example - like you've done in
> b57feed2cc26 - not in the kexec code because kexec *happens* to need it.
>
> We want to access the EFI config table? Sure, by all means, but make
> that generic for all code.
>

OK, so in summary, what seems to be happening here is that the SEV
init code in the decompressor looks for the cc blob table before the
on-demand mapping code is up, which normally ensures that any RAM
address is accessible even if it hasn't been mapped explicitly.

This is why the fix happens to work: the code only maps the array of
(guid, phys_addr) tuples that describes the list of configuration
tables that have been provided by the firmware. The actual
configuration tables themselves could be anywhere in physical memory,
and without prior knowledge of a particular GUID value, there is no
way to know the size of the table, and so they cannot be mapped
upfront like this. However, the cc blob table does not exist on this
machine, and so whether the EFI config tables themselves are mapped or
not is irrelevant.

But it does mean the fix is incomplete, and certainly does not belong
in generic kexec code. If anything, we should be fixing the
decompressor code to defer the cc blob table check until after the
demand mapping code is up.

If this is problematic, we might instead disable SEV for kexec, and
rely on the fact that SEV firmware enters with a complete 1:1 map (as
we seem to be doing currently). If kexec for SEV is needed at some
point, we can re-enable it by having it provide a mapping for the
config table array and the cc blob table explicitly.



Re: [PATCH v6 06/14] x86: Add early SHA support for Secure Launch early measurements

2023-05-12 Thread Ard Biesheuvel
On Fri, 12 May 2023 at 13:28, Matthew Garrett  wrote:
>
> On Fri, May 12, 2023 at 01:18:45PM +0200, Ard Biesheuvel wrote:
> > On Fri, 12 May 2023 at 13:04, Matthew Garrett  wrote:
> > >
> > > On Tue, May 09, 2023 at 06:21:44PM -0700, Eric Biggers wrote:
> > >
> > > > SHA-1 is insecure.  Why are you still using SHA-1?  Don't TPMs support 
> > > > SHA-2
> > > > now?
> > >
> > > TXT is supported on some TPM 1.2 systems as well. TPM 2 systems are also
> > > at the whim of the firmware in terms of whether the SHA-2 banks are
> > > enabled. But even if the SHA-2 banks are enabled, if you suddenly stop
> > > extending the SHA-1 banks, a malicious actor can later turn up and
> > > extend whatever they want into them and present a SHA-1-only
> > > attestation. Ideally whatever is handling that attestation should know
> > > whether or not to expect an attestation with SHA-2, but the easiest way
> > > to maintain security is to always extend all banks.
> > >
> >
> > Wouldn't it make more sense to measure some terminating event into the
> > SHA-1 banks instead?
>
> Unless we assert that SHA-1 events are unsupported, it seems a bit odd
> to force a policy on people who have both banks enabled. People with
> mixed fleets are potentially going to be dealing with SHA-1 measurements
> for a while yet, and while there's obviously a security benefit in using
> SHA-2 instead it'd be irritating to have to maintain two attestation
> policies.

I understand why that matters from an operational perspective.

However, we are dealing with brand new code being proposed for Linux
mainline, and so this is our only chance to push back on this, as
otherwise, we will have to maintain it for a very long time.

IOW, D-RTM does not exist today in Linux, and it is up to us to define
what it will look like. From that perspective, it is downright
preposterous to even consider supporting SHA-1, given that SHA-1 by
itself gives none of the guarantees that D-RTM aims to provide. If
reducing your TCB is important enough to warrant switching to this
implementation of D-RTM, surely you can upgrade your attestation
policies as well.



Re: [PATCH v6 06/14] x86: Add early SHA support for Secure Launch early measurements

2023-05-12 Thread Ard Biesheuvel
On Fri, 12 May 2023 at 13:04, Matthew Garrett  wrote:
>
> On Tue, May 09, 2023 at 06:21:44PM -0700, Eric Biggers wrote:
>
> > SHA-1 is insecure.  Why are you still using SHA-1?  Don't TPMs support SHA-2
> > now?
>
> TXT is supported on some TPM 1.2 systems as well. TPM 2 systems are also
> at the whim of the firmware in terms of whether the SHA-2 banks are
> enabled. But even if the SHA-2 banks are enabled, if you suddenly stop
> extending the SHA-1 banks, a malicious actor can later turn up and
> extend whatever they want into them and present a SHA-1-only
> attestation. Ideally whatever is handling that attestation should know
> whether or not to expect an attestation with SHA-2, but the easiest way
> to maintain security is to always extend all banks.
>

Wouldn't it make more sense to measure some terminating event into the
SHA-1 banks instead?



Re: [PATCH 0/4] Support kexec'ing PEs containing compressed kernels

2023-05-04 Thread Ard Biesheuvel
On Thu, 4 May 2023 at 18:41, Jeremy Linton  wrote:
>
> The linux ZBOOT option creates PEs that contain compressed kernel images
> which are self decompressed on execution by UEFI.
>
> This set adds support for this image format to kexec by decompressing the
> contained kernel image to a temp file, then handing the resulting image
> off to the existing "Image" load routine to pass to the kexec syscall.
>
> There is also an additional patch which cleans up some errors noticed
> in the existing zImage support as well.
>
> Jeremy Linton (4):
>   arm64: Cleanup _probe() return values
>   arm64: Add ZBOOT PE containing compressed image support
>   arm64: Hook up the ZBOOT support as vmlinuz
>   arm64: Fix some issues with zImage _probe()
>

Thanks a lot for taking care of this!

This all looks good to me. The only comment I have is that EFI zboot
itself is generic, even though arm64 is the only arch that distros are
building it for at the moment. So it is not unlikely that some of this
code will end up needing to be shared.

Acked-by: Ard Biesheuvel 


>  kexec/arch/arm64/Makefile  |   3 +-
>  kexec/arch/arm64/image-header.h|  11 ++
>  kexec/arch/arm64/kexec-arm64.c |   7 +
>  kexec/arch/arm64/kexec-arm64.h |   3 +
>  kexec/arch/arm64/kexec-elf-arm64.c |   1 +
>  kexec/arch/arm64/kexec-vmlinuz-arm64.c | 172 +
>  kexec/arch/arm64/kexec-zImage-arm64.c  |  13 +-
>  kexec/kexec.c  |  11 +-
>  8 files changed, 201 insertions(+), 20 deletions(-)
>  create mode 100644 kexec/arch/arm64/kexec-vmlinuz-arm64.c
>
> --
> 2.40.0
>



Re: [PATCH 0/6] arm64: make kexec_file able to load zboot image

2023-03-06 Thread Ard Biesheuvel
(cc Mark)

Hello Pingfan,

Thanks for working on this.

On Mon, 6 Mar 2023 at 04:03, Pingfan Liu  wrote:
>
> After introducing zboot image, kexec_file can not load and jump to the
> new style image. Hence it demands a method to load the new kernel.
>
> The crux of the problem lies in when and how to decompress the Image.gz.
> There are three possible courses to take: -1. in user space, but hard to
> achieve due to the signature verification inside the kernel.

That depends. The EFI zboot image encapsulates another PE/COFF image,
which could be signed as well.

So there are at least three other options here:
- sign the encapsulated image with the same key as the zboot image
- sign the encapsulated image with a key that is only valid for kexec boot
- sign the encapsulated image with an ephemeral key that is only valid
for a kexec'ing an image that was produced by the same kernel build

>  -2. at the
> boot time, let the efi_zboot_entry() handles it, which means a simulated
> EFI service should be provided to that entry, especially about how to be
> aware of the memory layout.

This is actually an idea I intend to explore: with the EFI runtime
services regions mapped 1:1, it wouldn't be too hard to implement a
minimal environment that can run the zboot image under the previous
kernel up to the point where it call ExitBootServices(), after which
kexec() would take over.

>  -3. in kernel space, during the file load
> of the zboot image. At that point, the kernel masters the whole memory
> information, and easily allocates a suitable memory for the decompressed
> kernel image. (I think this is similar to what grub does today).
>

GRUB just calls LoadImage(), and the decompression code runs in the EFI context.

> The core of this series is [5/6].  [3,6/6] handles the config option.
> The assumption of [3/6] is kexec_file_load is independent of zboot,
> especially it can load kernel images compressed with different
> compression method.  [6/6] is if EFI_ZBOOT, the corresponding
> decompression method should be included.
>
>
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Andrew Morton 
> Cc: Ard Biesheuvel 
> Cc: kexec@lists.infradead.org
> To: linux-arm-ker...@lists.infradead.org
> To: linux-ker...@vger.kernel.org
>
> Pingfan Liu (6):
>   arm64: kexec: Rename kexec_image.c to kexec_raw_image.c
>   lib/decompress: Introduce decompress_method_by_name()
>   arm64: Kconfig: Pick decompressing method for kexec file load
>   lib/decompress: Keep decompress routines based on selection
>   arm64: kexec: Introduce zboot image loader
>   init/Kconfig: Select decompressing method if compressing kernel
>
>  arch/arm64/Kconfig|  59 ++
>  arch/arm64/include/asm/kexec.h|   4 +-
>  arch/arm64/kernel/Makefile|   2 +-
>  .../{kexec_image.c => kexec_raw_image.c}  |   2 +-
>  arch/arm64/kernel/kexec_zboot_image.c | 186 ++
>  arch/arm64/kernel/machine_kexec.c |   1 +
>  arch/arm64/kernel/machine_kexec_file.c|   3 +-
>  include/linux/decompress/generic.h|   2 +
>  include/linux/decompress/mm.h |   9 +-
>  include/linux/zboot.h |  26 +++
>  init/Kconfig  |   7 +
>  lib/Kconfig   |   3 +
>  lib/decompress.c  |  17 +-
>  13 files changed, 314 insertions(+), 7 deletions(-)
>  rename arch/arm64/kernel/{kexec_image.c => kexec_raw_image.c} (98%)
>  create mode 100644 arch/arm64/kernel/kexec_zboot_image.c
>  create mode 100644 include/linux/zboot.h
>
> --
> 2.31.1
>



Re: Bug: kexec on Lenovo ThinkPad T480 disables EFI mode

2022-11-06 Thread Ard Biesheuvel
On Mon, 7 Nov 2022 at 07:55, Dave Young  wrote:
>
> Hi,
>
> On Sat, 5 Nov 2022 at 22:16,  wrote:
> >
> > On 2022-11-05 05:49, Dave Young wrote:
> > > Baoquan, thanks for cc me.
> > >
> > > On Sat, 5 Nov 2022 at 11:10, Baoquan He  wrote:
> > >>
> > >> Add Dave to CC
> > >>
> > >> On 10/28/22 at 01:02pm, n...@tfwno.gf wrote:
> > >> > Greetings,
> > >> >
> > >> > I've been hitting a bug on my Lenovo ThinkPad T480 where kexecing will
> > >> > cause EFI mode (if that's the right term for it) to be unconditionally
> > >> > disabled, even when not using the --noefi option to kexec.
> > >> >
> > >> > What I mean by "EFI mode" being disabled, more than just EFI runtime
> > >> > services, is that basically nothing about the system's EFI is visible
> > >> > post-kexec. Normally you have a message like this in dmesg when the
> > >> > system is booted in EFI mode:
> > >> >
> > >> > [0.00] efi: EFI v2.70 by EDK II
> > >> > [0.00] efi: SMBIOS=0x7f98a000 ACPI=0x7fb7e000 ACPI 
> > >> > 2.0=0x7fb7e014
> > >> > MEMATTR=0x7ec63018
> > >> > (obviously not the real firmware of the machine I'm talking about, but 
> > >> > I
> > >> > can also send that if it would be of any help)
> > >> >
> > >> > No such message pops up in my dmesg as a result of this bug, & this
> > >> > causes some fallout like being unable to find the system's DMI
> > >> > information:
> > >> >
> > >> > <6>[0.00] DMI not present or invalid.
> > >> >
> > >> > The efivarfs module also fails to load with -ENODEV.
> > >> >
> > >> > I've tried also booting with efi=runtime explicitly but it doesn't
> > >> > change anything. The kernel still does not print the name of the EFI
> > >> > firmware, DMI is still missing, & efivarfs still fails to load.
> > >> >
> > >> > I've been using the kexec_load syscall for all these tests, if it's
> > >> > important.
> > >> >
> > >> > Also, to make it very clear, all this only ever happens post-kexec. 
> > >> > When
> > >> > booting straight from UEFI (with the EFI stub), all the aforementioned
> > >> > stuff that fails works perfectly fine (i.e. name of firmware is 
> > >> > printed,
> > >> > DMI is properly found, & efivarfs loads & mounts just fine).
> > >> >
> > >> > This is reproducible with a vanilla 6.1-rc2 kernel. I've been trying to
> > >> > bisect it, but it seems like it goes pretty far back. I've got vanilla
> > >> > mainline kernel builds dating back to 5.17 that have the exact same
> > >> > issue. It might be worth noting that during this testing, I made sure
> > >> > the version of the kernel being kexeced & the kernel kexecing were the
> > >> > same version. It may not have been a problem in older kernels, but that
> > >> > would be difficult to test for me (a pretty important driver for this
> > >> > machine was only merged during v5.17-rc4). So it may not have been a
> > >> > regression & just a hidden problem since time immemorial.
> > >> >
> > >> > I am willing to test any patches I may get to further debug or fix
> > >> > this issue, preferably based on the current state of 
> > >> > torvalds/linux.git.
> > >> > I can build & test kernels quite a few times per day.
> > >> >
> > >> > I can also send any important materials (kernel config, dmesg, firmware
> > >> > information, so on & so forth) on request. I'll also just mention I'm
> > >> > using kexec-tools 2.0.24 upfront, if it matters.
> > >
> > > Can you check the efi runtime in sysfs:
> > > ls /sys/firmware/efi/runtime-map/
> > >
> > > If nothing then maybe you did not enable CONFIG_EFI_RUNTIME_MAP=y, it
> > > is needed for kexec UEFI boot on x86_64.
> >
> > Oh my, it really is that simple.
> >
> > Indeed, enabling this in the pre-kexec kernel fixes it all up. I had
> > blindly disabled it in my quest to downsize the pre-kexec kernel to
> > reduce boot time (it only runs a bootloader). In hindsight, the firmware
> > drivers section is not really a good section to tweak on a whim.
> >
> > I'm terribly sorry to have taken your time to "fix" this "bug". But I
> > must ask, is there any reason why this is a visible config option, or at
> > least not gated behind CONFIG_EXPERT? drivers/firmware/efi/runtime-map.c
> > is pretty tiny, & considering it depends on CONFIG_KEXEC_CORE, one
> > probably wants to have kexec work properly if they can even enable it.
>
> Glad to know it works with the .config tweaking. I can not recall any
> reason for that though.
>
> Since it sits in the efi code path, let's see how Ard thinks about
> your proposal.
>

I don't understand why EFI_RUNTIME_MAP should depend on KEXEC_CORE at
all: it is documented as a feature that can be enabled for debugging
as well, and kexec does not work as expected without it.

Should we just change it like this perhaps?

--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -28,8 +28,8 @@ config EFI_VARS_PSTORE_DEFAULT_DISABLE

 config EFI_RUNTIME_MAP
bool "Export efi runtime maps to sysfs"
-   depends on X86 && EFI && KEXEC_CORE
-   default y

Re: [PATCH 1/2] arm64, kdump: enforce to take 4G as the crashkernel low memory end

2022-09-06 Thread Ard Biesheuvel
On Mon, 5 Sept 2022 at 14:08, Baoquan He  wrote:
>
> On 09/05/22 at 01:28pm, Mike Rapoport wrote:
> > On Thu, Sep 01, 2022 at 08:25:54PM +0800, Baoquan He wrote:
> > > On 09/01/22 at 10:24am, Mike Rapoport wrote:
> > >
> > > max_zone_phys() only handles cases when CONFIG_ZONE_DMA/DMA32 enabled,
> > > the disabledCONFIG_ZONE_DMA/DMA32 case is not included. I can change
> > > it like:
> > >
> > > static phys_addr_t __init crash_addr_low_max(void)
> > > {
> > > phys_addr_t low_mem_mask = U32_MAX;
> > > phys_addr_t phys_start = memblock_start_of_DRAM();
> > >
> > > if ((!IS_ENABLED(CONFIG_ZONE_DMA) && 
> > > !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > >  (phys_start > U32_MAX))
> > > low_mem_mask = PHYS_ADDR_MAX;
> > >
> > > return low_mem_mask + 1;
> > > }
> > >
> > > or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
> > > as you suggested. Which one do you like better?
> > >
> > > static phys_addr_t __init crash_addr_low_max(void)
> > > {
> > > if (!IS_ENABLED(CONFIG_ZONE_DMA) && 
> > > !IS_ENABLED(CONFIG_ZONE_DMA32))
> > > return PHYS_ADDR_MAX + 1;
> > >
> > > return max_zone_phys(32);
> > > }
> >
> > I like the second variant better.
>
> Sure, will change to use the 2nd one . Thanks.
>

While I appreciate the effort that has gone into solving this problem,
I don't think there is any consensus that an elaborate fix is required
to ensure that the crash kernel can be unmapped from the linear map at
all cost. In fact, I personally think we shouldn't bother, and IIRC,
Will made a remark along the same lines back when the Huawei engineers
were still driving this effort.

So perhaps we could align on that before doing yet another version of this?



Re: [PATCH 1/3] memblock: define functions to set the usable memory range

2022-01-29 Thread Ard Biesheuvel
On Mon, 24 Jan 2022 at 22:05, Frank van der Linden  wrote:
>
> Meanwhile, it seems that this issue was already addressed in:
>
> https://lore.kernel.org/all/20211215021348.8766-1-kernelf...@gmail.com/
>
> ..which has now been pulled in, and sent to stable@ for 5.15. I
> somehow missed that message, and sent my change in a few weeks
> later.
>
> The fix to just reserve the ranges does seem a bit cleaner overall,
> but this will do fine.
>

Works for me.



Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-10-17 Thread Ard Biesheuvel
On Thu, 7 Oct 2021 at 09:23, Andy Shevchenko  wrote:
>
> On Thu, Oct 7, 2021 at 10:20 AM Ard Biesheuvel  wrote:
> > On Wed, 6 Oct 2021 at 18:28, Andy Shevchenko  
> > wrote:
> > > On Mon, Jun 14, 2021 at 08:27:36PM +0300, Andy Shevchenko wrote:
> > > > On Mon, Jun 14, 2021 at 08:07:33PM +0300, Andy Shevchenko wrote:
>
> ...
>
> > > > Double checked, confirmed that it's NOT working.
> > >
> > > Any news here?
> > >
> > > Shall I resend my series?
> >
> > As I said before:
> >
> > """
> > I would still prefer to get to the bottom of this before papering over
> > it with command line options. If the memory gets corrupted by the
> > first kernel, maybe we are not preserving it correctly in the first
> > kernel.
> > """
>
> And I can't agree more, but above I asked about news, implying if
> there is anything to test?
> The issue is still there and it becomes a bit annoying to see my hack
> patches in every tree I have been using.
>

If nobody can be bothered to properly diagnose this, how important is
it, really?



Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-10-07 Thread Ard Biesheuvel
On Wed, 6 Oct 2021 at 18:28, Andy Shevchenko  wrote:
>
> On Mon, Jun 14, 2021 at 08:27:36PM +0300, Andy Shevchenko wrote:
> > On Mon, Jun 14, 2021 at 08:07:33PM +0300, Andy Shevchenko wrote:
> > > On Mon, Jun 14, 2021 at 06:38:30PM +0300, Andy Shevchenko wrote:
> > > > On Sat, Jun 12, 2021 at 12:40:57PM +0800, Dave Young wrote:
> > > > > > Probably it is doable to have kexec on 32bit efi working
> > > > > > without runtime service support, that means no need the trick of 
> > > > > > fixed
> > > > > > mapping.
> > > > > >
> > > > > > If I can restore my vm to boot 32bit efi on this weekend then I may 
> > > > > > provide some draft
> > > > > > patches for test.
> > > > >
> > > > > Unfortunately I failed to setup a 32bit efi guest,  here are some
> > > > > untested draft patches, please have a try.
> > > >
> > > > Thanks for the patches.
> > > >
> > > > As previously, I have reverted my hacks and applied your patches (also I
> > > > dropped patches from previous mail against kernel and kexec-tools) for 
> > > > both
> > > > kernel and user space on first and second environments.
> > > >
> > > > It does NOT solve the issue.
> > > >
> > > > If there is no idea pops up soon, I'm going to resend my series that
> > > > workarounds the issue.
> > >
> > > Hold on, I may have made a mistake during testing. Let me retest this.
> >
> > Double checked, confirmed that it's NOT working.
>
> Any news here?
>
> Shall I resend my series?
>

As I said before:

"""
I would still prefer to get to the bottom of this before papering over
it with command line options. If the memory gets corrupted by the
first kernel, maybe we are not preserving it correctly in the first
kernel.
"""



Re: [PATCH v6] ARM: uncompress: Parse "linux, usable-memory-range" DT property

2021-09-22 Thread Ard Biesheuvel
On Wed, 15 Sept 2021 at 15:20, Geert Uytterhoeven
 wrote:
>
> Add support for parsing the "linux,usable-memory-range" DT property.
> This property is used to describe the usable memory reserved for the
> crash dump kernel, and thus makes the memory reservation explicit.
> If present, Linux no longer needs to mask the program counter, and rely
> on the "mem=" kernel parameter to obtain the start and size of usable
> memory.
>
> For backwards compatibility, the traditional method to derive the start
> of memory is still used if "linux,usable-memory-range" is absent.
>
> Signed-off-by: Geert Uytterhoeven 

Acked-by: Ard Biesheuvel 

> ---
> KernelVersion: v5.15-rc1
> ---
> The corresponding patch for kexec-tools is "[PATCH] arm: kdump: Add DT
> properties to crash dump kernel's DTB", which is still valid:
> https://lore.kernel.org/r/20200902154129.6358-1-geert+rene...@glider.be/
>
> v6:
>   - All dependencies are in v5.15-rc1,
>
> v5:
>   - Remove the addition of "linux,elfcorehdr" and
> "linux,usable-memory-range" handling to arch/arm/mm/init.c,
>
> v4:
>   - Remove references to architectures in chosen.txt, to avoid having to
> change this again when more architectures copy kdump support,
>   - Remove the architecture-specific code for parsing
> "linux,usable-memory-range" and "linux,elfcorehdr", as the FDT core
> code now takes care of this,
>   - Move chosen.txt change to patch changing the FDT core,
>   - Use IS_ENABLED(CONFIG_CRASH_DUMP) instead of #ifdef,
>
> v3:
>   - Rebase on top of accepted solution for DTB memory information
> handling, which is part of v5.12-rc1,
>
> v2:
>   - Rebase on top of reworked DTB memory information handling.
> ---
>  .../arm/boot/compressed/fdt_check_mem_start.c | 48 ---
>  1 file changed, 42 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/boot/compressed/fdt_check_mem_start.c 
> b/arch/arm/boot/compressed/fdt_check_mem_start.c
> index 62450d824c3ca180..9291a2661bdfe57f 100644
> --- a/arch/arm/boot/compressed/fdt_check_mem_start.c
> +++ b/arch/arm/boot/compressed/fdt_check_mem_start.c
> @@ -55,16 +55,17 @@ static uint64_t get_val(const fdt32_t *cells, uint32_t 
> ncells)
>   * DTB, and, if out-of-range, replace it by the real start address.
>   * To preserve backwards compatibility (systems reserving a block of memory
>   * at the start of physical memory, kdump, ...), the traditional method is
> - * always used if it yields a valid address.
> + * used if it yields a valid address, unless the "linux,usable-memory-range"
> + * property is present.
>   *
>   * Return value: start address of physical memory to use
>   */
>  uint32_t fdt_check_mem_start(uint32_t mem_start, const void *fdt)
>  {
> -   uint32_t addr_cells, size_cells, base;
> +   uint32_t addr_cells, size_cells, usable_base, base;
> uint32_t fdt_mem_start = 0x;
> -   const fdt32_t *reg, *endp;
> -   uint64_t size, end;
> +   const fdt32_t *usable, *reg, *endp;
> +   uint64_t size, usable_end, end;
> const char *type;
> int offset, len;
>
> @@ -80,6 +81,27 @@ uint32_t fdt_check_mem_start(uint32_t mem_start, const 
> void *fdt)
> if (addr_cells > 2 || size_cells > 2)
> return mem_start;
>
> +   /*
> +* Usable memory in case of a crash dump kernel
> +* This property describes a limitation: memory within this range is
> +* only valid when also described through another mechanism
> +*/
> +   usable = get_prop(fdt, "/chosen", "linux,usable-memory-range",
> + (addr_cells + size_cells) * sizeof(fdt32_t));
> +   if (usable) {
> +   size = get_val(usable + addr_cells, size_cells);
> +   if (!size)
> +   return mem_start;
> +
> +   if (addr_cells > 1 && fdt32_ld(usable)) {
> +   /* Outside 32-bit address space */
> +   return mem_start;
> +   }
> +
> +   usable_base = fdt32_ld(usable + addr_cells - 1);
> +   usable_end = usable_base + size;
> +   }
> +
> /* Walk all memory nodes and regions */
> for (offset = fdt_next_node(fdt, -1, NULL); offset >= 0;
>  offset = fdt_next_node(fdt, offset, NULL)) {
> @@ -107,7 +129,20 @@ uint32_t fdt_check_mem_start(uint32_t mem_start, const 
> void *fdt)
>
> base = fdt32_ld(reg + addr_cells - 1);
> end = ba

Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-07-19 Thread Ard Biesheuvel
On Mon, 14 Jun 2021 at 19:27, Andy Shevchenko  wrote:
>
> On Mon, Jun 14, 2021 at 08:07:33PM +0300, Andy Shevchenko wrote:
> > On Mon, Jun 14, 2021 at 06:38:30PM +0300, Andy Shevchenko wrote:
> > > On Sat, Jun 12, 2021 at 12:40:57PM +0800, Dave Young wrote:
> > > > > Probably it is doable to have kexec on 32bit efi working
> > > > > without runtime service support, that means no need the trick of fixed
> > > > > mapping.
> > > > >
> > > > > If I can restore my vm to boot 32bit efi on this weekend then I may 
> > > > > provide some draft
> > > > > patches for test.
> > > >
> > > > Unfortunately I failed to setup a 32bit efi guest,  here are some
> > > > untested draft patches, please have a try.
> > >
> > > Thanks for the patches.
> > >
> > > As previously, I have reverted my hacks and applied your patches (also I
> > > dropped patches from previous mail against kernel and kexec-tools) for 
> > > both
> > > kernel and user space on first and second environments.
> > >
> > > It does NOT solve the issue.
> > >
> > > If there is no idea pops up soon, I'm going to resend my series that
> > > workarounds the issue.
> >
> > Hold on, I may have made a mistake during testing. Let me retest this.
>
> Double checked, confirmed that it's NOT working.
>

Apologies for chiming in so late - in my defence, I was on vacation :-)

So if I understand the thread correctly, the Surface 3 provides a
SMBIOS entry point (not SMBIOS3), and it does not get picked up by the
second kernel, right?

I would still prefer to get to the bottom of this before papering over
it with command line options. If the memory gets corrupted by the
first kernel, maybe we are not preserving it correctly in the first
kernel.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 1/5] arm64: kexec_file: Forbid non-crash kernels

2021-05-31 Thread Ard Biesheuvel
On Mon, 31 May 2021 at 11:57, Marc Zyngier  wrote:
>
> It has been reported that kexec_file doesn't really work on arm64.
> It completely ignores any of the existing reservations, which results
> in the secondary kernel being loaded where the GICv3 LPI tables live,
> or even corrupting the ACPI tables.
>
> Since only crash kernels are immune to this as they use a reserved
> memory region, disable the non-crash kernel use case. Further
> patches will try and restore the functionality.
>
> Reported-by: Moritz Fischer 
> Signed-off-by: Marc Zyngier 
> Cc: sta...@vger.kernel.org # 5.10

Acked-by: Ard Biesheuvel 

... but do we really only need this in 5.10 and not earlier?

> ---
>  arch/arm64/kernel/kexec_image.c | 20 
>  1 file changed, 20 insertions(+)
>
> diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
> index 9ec34690e255..acf9cd251307 100644
> --- a/arch/arm64/kernel/kexec_image.c
> +++ b/arch/arm64/kernel/kexec_image.c
> @@ -145,3 +145,23 @@ const struct kexec_file_ops kexec_image_ops = {
> .verify_sig = image_verify_sig,
>  #endif
>  };
> +
> +/**
> + * arch_kexec_locate_mem_hole - Find free memory to place the segments.
> + * @kbuf:   Parameters for the memory search.
> + *
> + * On success, kbuf->mem will have the start address of the memory region 
> found.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
> +{
> +   /*
> +* For the time being, kexec_file_load isn't reliable except
> +* for crash kernel. Say sorry to the user.
> +*/
> +   if (kbuf->image->type != KEXEC_TYPE_CRASH)
> +   return -EADDRNOTAVAIL;
> +
> +   return kexec_locate_mem_hole(kbuf);
> +}
> --
> 2.30.2
>



Re: [PATCH v2 0/5] arm64: Make kexec_file_load honor iomem reservations

2021-05-31 Thread Ard Biesheuvel
On Mon, 31 May 2021 at 11:57, Marc Zyngier  wrote:
>
> This series is a complete departure from the approach I initially sent
> almost a month ago[0]. Instead of trying to teach EFI, ACPI and other
> subsystem to use memblock, I've decided to stick with the iomem
> resource tree and use that exclusively for arm64.
>
> This means that my current approach is (despite what I initially
> replied to both Dave and Catalin) to provide an arm64-specific
> implementation of arch_kexec_locate_mem_hole() which walks the
> resource tree and excludes ranges of RAM that have been registered for
> any odd purpose. This is exactly what the userspace implementation
> does, and I don't really see a good reason to diverge from it.
>
> Again, this allows my Synquacer board to reliably use kexec_file_load
> with as little as 256M, something that would always fail before as it
> would overwrite most of the reserved tables.
>
> Although this series still targets 5.14, the initial patch is a
> -stable candidate, and disables non-kdump uses of kexec_file_load. I
> have limited it to 5.10, as earlier kernels will require a different,
> probably more invasive approach.
>
> Catalin, Ard: although this series has changed a bit compared to v1,
> I've kept your AB/RB tags. Should anything seem odd, please let me
> know and I'll drop them.
>

Fine with me.

> Thanks,
>
> M.
>
> * From v1 [1]:
>   - Move the overlap exclusion into find_next_iomem_res()
>   - Handle child resource not overlapping with parent
>   - Provide walk_system_ram_excluding_child_res() as a top level
> walker
>   - Simplify arch-specific code
>   - Add initial patch disabling non-crash kernels
>
> [0] https://lore.kernel.org/r/20210429133533.1750721-1-...@kernel.org
> [1] https://lore.kernel.org/r/20210526190531.62751-1-...@kernel.org
>
> Marc Zyngier (5):
>   arm64: kexec_file: Forbid non-crash kernels
>   kexec_file: Make locate_mem_hole_callback global
>   kernel/resource: Allow find_next_iomem_res() to exclude overlapping
> child resources
>   kernel/resource: Introduce walk_system_ram_excluding_child_res()
>   arm64: kexec_image: Restore full kexec functionality
>
>  arch/arm64/kernel/kexec_image.c | 39 
>  include/linux/ioport.h  |  3 ++
>  include/linux/kexec.h   |  1 +
>  kernel/kexec_file.c |  6 +--
>  kernel/resource.c   | 82 +
>  5 files changed, 119 insertions(+), 12 deletions(-)
>
> --
> 2.30.2
>



Re: [PATCH 0/4] arm64: Make kexec_file_load honor iomem reservations

2021-05-31 Thread Ard Biesheuvel
On Thu, 27 May 2021 at 19:39, Catalin Marinas  wrote:
>
> On Wed, May 26, 2021 at 08:05:27PM +0100, Marc Zyngier wrote:
> > This series is a complete departure from the approach I initially sent
> > almost a month ago[1]. Instead of trying to teach EFI, ACPI and other
> > subsystem to use memblock, I've decided to stick with the iomem
> > resource tree and use that exclusively for arm64.
> >
> > This means that my current approach is (despite what I initially
> > replied to both Dave and Catalin) to provide an arm64-specific
> > implementation of arch_kexec_locate_mem_hole() which walks the
> > resource tree and excludes ranges of RAM that have been registered for
> > any odd purpose. This is exactly what the userspace implementation
> > does, and I don't really see a good reason to diverge from it.
> >
> > Again, this allows my Synquacer board to reliably use kexec_file_load
> > with as little as 256M, something that would always fail before as it
> > would overwrite most of the reserved tables.
> >
> > Obviously, this is now at least 5.14 material. Given how broken
> > kexec_file_load is for non-crash kernels on arm64 at the moment,
> > should we at least disable it in 5.13 and all previous stable kernels?
>
> I think it makes sense to disable it in the current and earlier kernels.
>

Ack to that

> For this series:
>
> Acked-by: Catalin Marinas 

and likewise for the series

Reviewed-by: Ard Biesheuvel 



Re: [PATCH] efi/x86: Revert struct layout change to fix kexec boot regression

2020-04-10 Thread Ard Biesheuvel
On Fri, 10 Apr 2020 at 16:34, Borislav Petkov  wrote:
>
> On Fri, Apr 10, 2020 at 04:22:49PM +0200, Ard Biesheuvel wrote:
> > > BTW, a fixes tag is good to have..
> >
> > I usually omit those for patches that fix bugs that were introduced in
> > the current cycle.
>
> A valid use case for having the Fixes: tag anyway are the backporting
> kernels gangs which might pick up the first patch for whatever reason
> and would probably be thankful if they find the second one, i.e., the
> fix for the first one, through grepping or other, automated means.
>

Fair point.



Re: [PATCH] efi/x86: Revert struct layout change to fix kexec boot regression

2020-04-10 Thread Ard Biesheuvel
On Fri, 10 Apr 2020 at 16:02, Dave Young  wrote:
>
> On 04/10/20 at 09:56pm, Dave Young wrote:
> > On 04/10/20 at 09:43am, Ard Biesheuvel wrote:
> > > Commit
> > >
> > >   0a67361dcdaa29 ("efi/x86: Remove runtime table address from kexec EFI 
> > > setup data")
> > >
> > > removed the code that retrieves the non-remapped UEFI runtime services
> > > pointer from the data structure provided by kexec, as it was never really
> > > needed on the kexec boot path: mapping the runtime services table at its
> > > non-remapped address is only needed when calling SetVirtualAddressMap(),
> > > which never happens during a kexec boot in the first place.
> > >
> > > However, dropping the 'runtime' member from struct efi_setup_data was a
> > > mistake. That struct is shared ABI between the kernel and the kexec 
> > > tooling
> > > for x86, and so we cannot simply change its layout. So let's put back the
> > > removed field, but call it 'unused' to reflect the fact that we never look
> > > at its contents. While at it, add a comment to remind our future selves
> > > that the layout is external ABI.
> > >
> > > Reported-by: Theodore Ts'o 
> > > Tested-by: Theodore Ts'o 
> > > Signed-off-by: Ard Biesheuvel 
> > > ---
> > >
> > > Ingo, Thomas, Boris: I sent out my efi-urgent pull request just yesterday,
> > > so please take this directly into tip:efi/urgent - no need to wait for the
> > > next batch.
> > >
> > >  arch/x86/include/asm/efi.h | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> > > index 781170d36f50..96044c8d8600 100644
> > > --- a/arch/x86/include/asm/efi.h
> > > +++ b/arch/x86/include/asm/efi.h
> > > @@ -178,8 +178,10 @@ extern void efi_free_boot_services(void);
> > >  extern pgd_t * __init efi_uv1_memmap_phys_prolog(void);
> > >  extern void __init efi_uv1_memmap_phys_epilog(pgd_t *save_pgd);
> > >
> > > +/* kexec external ABI */
> > >  struct efi_setup_data {
> > > u64 fw_vendor;
> > > +   u64 unused;
> > > u64 tables;
> > > u64 smbios;
> > > u64 reserved[8];
> > > --
> > > 2.17.1
> > >
> >
> > Ah, replied too quick in another mail.  I just cced kexec list again.
> >
> > Thanks for the fix:
> >
> > Reviewed-by: Dave Young 
> >
>

Thanks Dave

> BTW, a fixes tag is good to have..
>

I usually omit those for patches that fix bugs that were introduced in
the current cycle.



Re: [PATCH v1 2/2] firmware: dmi_scan: Pass dmi_entry_point to kexec'ed kernel

2020-01-20 Thread Ard Biesheuvel
On Mon, 20 Jan 2020 at 23:31, Andy Shevchenko  wrote:
>
> On Mon, Jan 20, 2020 at 9:28 PM Eric W. Biederman  
> wrote:
> > Andy Shevchenko  writes:
> > > On Sat, Dec 17, 2016 at 06:57:21PM +0800, Dave Young wrote:
> > >> Ccing efi people.
> > >>
> > >> On 12/16/16 at 02:33pm, Jean Delvare wrote:
> > >> > On Fri, 16 Dec 2016 14:18:58 +0200, Andy Shevchenko wrote:
> > >> > > On Fri, 2016-12-16 at 10:32 +0800, Dave Young wrote:
> > >> > > > On 12/15/16 at 12:28pm, Jean Delvare wrote:
> > >> > > > > I am no kexec expert but this confuses me. Shouldn't the second
> > >> > > > > kernel have access to the EFI systab as the first kernel does? It
> > >> > > > > includes many more pointers than just ACPI and DMI tables, and it
> > >> > > > > would seem inconvenient to have to pass all these addresses
> > >> > > > > individually explicitly.
> > >> > > >
> > >> > > > Yes, in modern linux kernel, kexec has the support for EFI, I 
> > >> > > > think it
> > >> > > > should work naturally at least in x86_64.
> > >> > >
> > >> > > Thanks for this good news!
> > >> > >
> > >> > > Unfortunately Intel Galileo is 32-bit platform.
> > >> >
> > >> > If it was done for X86_64 then maybe it can be generalized to X86?
> > >>
> > >> For X86_64, we have a new way for efi runtime memory mapping, in i386
> > >> code it still use old ioremap way. It is impossible to use same way as
> > >> the X86_64 since the virtual address space is limited.
> > >>
> > >> But maybe for 32bit, kexec kernel can run in physical mode, but I'm not
> > >> sure, I would suggest Andy to do a test first with efi=noruntime for
> > >> kexec 2nd kernel.
> > >
> > > Guys, it was quite a long no hear from you. As I told you the proposed 
> > > work
> > > around didn't help. Today I found that Microsoft Surface 3 also affected
> > > by this.
> > >
> > > Can we apply these patches for now until you will find better
> > > solution?
> >
> > Not a chance.  The patches don't apply to any kernel in the git history.
> >
> > Which may be part of your problem.  You are or at least were running
> > with code that has not been merged upstream.
>
> It's done against linux-next.
> Applied cleanly. (Not the version in this more-than-a-year-old series
> of course; that's why I said I can resend.)
>
> > > P.S. I may resend them rebased on recent vanilla.
> >
> > Second.  I looked at your test results and they don't directly make
> > sense.  dmidecode bypasses the kernel completely or it did last time
> > I looked so I don't know why you would be using that to test if
> > something in the kernel is working.
> >
> > However dmidecode failing suggests that the actual problem is something
> > in the first kernel is stomping the dmi tables.
>
> See below.
>
> > Adding a command line option won't fix stomped tables.
>
> It provides a mechanism, which seems to be absent, to the second
> kernel to know where to look for SMBIOS tables.
>
> > So what I would suggest is:
> > a) Verify that dmidecode works before kexec.
>
> Yes, it does.
>
> > b) Test to see if dmidecode works after kexec.
>
> No, it doesn't.
>
> > c) Once (a) shows that dmidecode works and (b) shows that dmidecode
> >fails figure out what is stomping your dmi tables during or before
> >kexec and that is what should get fixed.
>
> The problem here, as I see it, is that the EFI and kexec protocols are
> not friendly to each other.
> I'm not an expert in either. That's why I'm asking for possible
> solutions. And this needs to be done in kernel to allow drivers to
> work.
>
> Does the
>
> commit 4996c02306a25def1d352ec8e8f48895bbc7dea9
> Author: Takao Indoh 
> Date:   Thu Jul 14 18:05:21 2011 -0400
>
> ACPI: introduce "acpi_rsdp=" parameter for kdump
>
> description shed a light on this?
>
> > Now using a non-efi method of dmi detection relies on the
> > tables being between 0xF and 0x1. AKA the last 64K
> > of the first 1MiB of memory.  You might check to see if your
> > dmi tables are in that address range.
>
> # dmidecode --no-sysfs
> # dmidecode 3.2
> Scanning /dev/mem for entry point.
> # No SMBIOS nor DMI entry point found, sorry.
>
> === with patch applied ===
> # dmidecode
> ...
> Release Date: 03/10/2015
> ...
>
> >
> > Otherwise I suspect the good solution is to give efi it's own page
> > tables in the kernel and switch to it whenever efi functions are called.
> >
>
> > But on 32bit the Linux kernel has historically been just fine directly
> > accessing the hardware, and ignoring efi and all of the other BIOS's.
>
> It seems not only for 32-bit Linux kernel anymore. MS Surface 3 runs
> 64-bit code.
>
> > So if that doesn't work on Intel Galileo that is probably a firmware
> > problem.
>
> It's not only about Galileo anymore.
>

Looking at the x86 kexec EFI code, it seems that it has special
handling for the legacy SMBIOS table address, but not for the SMBIOS3
table address, which was introduced to accommodate SMBIOS tables
living in memory that is not 32-bit addressable.

Could anyone check 

Re: [PATCH v4 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-09 Thread Ard Biesheuvel
On Wed, 8 Jan 2020 at 22:53, Dan Williams  wrote:
>
> On Tue, Jan 7, 2020 at 9:52 AM Ard Biesheuvel  
> wrote:
> >
> > On Tue, 7 Jan 2020 at 06:19, Dave Young  wrote:
> > >
> > > On 01/06/20 at 08:16pm, Dan Williams wrote:
> > > > On Mon, Jan 6, 2020 at 8:04 PM Dave Young  wrote:
> > > > >
> > > > > On 01/06/20 at 04:40pm, Dan Williams wrote:
> > > > > > Dave noticed that when specifying multiple efi_fake_mem= entries 
> > > > > > only
> > > > > > the last entry was successfully being reflected in the efi memory 
> > > > > > map.
> > > > > > This is due to the fact that the efi_memmap_insert() is being called
> > > > > > multiple times, but on successive invocations the insertion should 
> > > > > > be
> > > > > > applied to the last new memmap rather than the original map at
> > > > > > efi_fake_memmap() entry.
> > > > > >
> > > > > > Rework efi_fake_memmap() to install the new memory map after each
> > > > > > efi_fake_mem= entry is parsed.
> > > > > >
> > > > > > This also fixes an issue in efi_fake_memmap() that caused it to 
> > > > > > litter
> > > > > > empty entries into the end of the efi memory map. An empty entry 
> > > > > > causes
> > > > > > efi_memmap_insert() to attempt more memmap splits / copies than
> > > > > > efi_memmap_split_count() accounted for when sizing the new map. When
> > > > > > that happens efi_memmap_insert() may overrun its allocation, and if 
> > > > > > you
> > > > > > are lucky will spill over to an unmapped page leading to crash
> > > > > > signature like the following rather than silent corruption:
> > > > > >
> > > > > > BUG: unable to handle page fault for address: ff281000
> > > > > > [..]
> > > > > > RIP: 0010:efi_memmap_insert+0x11d/0x191
> > > > > > [..]
> > > > > > Call Trace:
> > > > > >  ? bgrt_init+0xbe/0xbe
> > > > > >  ? efi_arch_mem_reserve+0x1cb/0x228
> > > > > >  ? acpi_parse_bgrt+0xa/0xd
> > > > > >  ? acpi_table_parse+0x86/0xb8
> > > > > >  ? acpi_boot_init+0x494/0x4e3
> > > > > >  ? acpi_parse_x2apic+0x87/0x87
> > > > > >  ? setup_acpi_sci+0xa2/0xa2
> > > > > >  ? setup_arch+0x8db/0x9e1
> > > > > >  ? start_kernel+0x6a/0x547
> > > > > >  ? secondary_startup_64+0xb6/0xc0
> > > > > >
> > > > > > Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
> > > > > > services data to fix kexec breakage" is listed in Fixes: since it
> > > > > > introduces more occurrences where efi_memmap_insert() is invoked 
> > > > > > after
> > > > > > an efi_fake_mem= configuration has been parsed. Previously the side
> > > > > > effects of vestigial empty entries were benign, but with commit
> > > > > > af1648984828 that follow-on efi_memmap_insert() invocation triggers
> > > > > > efi_memmap_insert() overruns.
> > > > > >
> > > > > > Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
> > > > > > Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot 
> > > > > > services...")
> > > > >
> > > > > A nitpick for the Fixes flags, as I replied in the thread below:
> > > > > https://lore.kernel.org/linux-efi/CAPcyv4jLxqPaB22Ao9oV31Gm=b0+phty+uz33snex4qchou...@mail.gmail.com/T/#m2bb2dd00f7715c9c19ccc48efef0fcd5fdb626e7
> > > > >
> > > > > I reproduced two other panics without the patches applied, so this 
> > > > > issue
> > > > > is not caused by either of the commits, maybe just drop the Fixes.
> > > >
> > > > Just the "Fixes: af1648984828", right? No objection from me. I'll let
> > > > Ingo say if he needs a resend for that.
> > > >
> > > > The "Fixes: 0f96a99dab36" is valid because the original implementation
> > > > failed to handle the multiple argument case from the beginning.
> > >
> > > Agreed, thanks!
> > >
> >
> 

Re: [PATCH v4 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-07 Thread Ard Biesheuvel
On Tue, 7 Jan 2020 at 06:19, Dave Young  wrote:
>
> On 01/06/20 at 08:16pm, Dan Williams wrote:
> > On Mon, Jan 6, 2020 at 8:04 PM Dave Young  wrote:
> > >
> > > On 01/06/20 at 04:40pm, Dan Williams wrote:
> > > > Dave noticed that when specifying multiple efi_fake_mem= entries only
> > > > the last entry was successfully being reflected in the efi memory map.
> > > > This is due to the fact that the efi_memmap_insert() is being called
> > > > multiple times, but on successive invocations the insertion should be
> > > > applied to the last new memmap rather than the original map at
> > > > efi_fake_memmap() entry.
> > > >
> > > > Rework efi_fake_memmap() to install the new memory map after each
> > > > efi_fake_mem= entry is parsed.
> > > >
> > > > This also fixes an issue in efi_fake_memmap() that caused it to litter
> > > > empty entries into the end of the efi memory map. An empty entry causes
> > > > efi_memmap_insert() to attempt more memmap splits / copies than
> > > > efi_memmap_split_count() accounted for when sizing the new map. When
> > > > that happens efi_memmap_insert() may overrun its allocation, and if you
> > > > are lucky will spill over to an unmapped page leading to crash
> > > > signature like the following rather than silent corruption:
> > > >
> > > > BUG: unable to handle page fault for address: ff281000
> > > > [..]
> > > > RIP: 0010:efi_memmap_insert+0x11d/0x191
> > > > [..]
> > > > Call Trace:
> > > >  ? bgrt_init+0xbe/0xbe
> > > >  ? efi_arch_mem_reserve+0x1cb/0x228
> > > >  ? acpi_parse_bgrt+0xa/0xd
> > > >  ? acpi_table_parse+0x86/0xb8
> > > >  ? acpi_boot_init+0x494/0x4e3
> > > >  ? acpi_parse_x2apic+0x87/0x87
> > > >  ? setup_acpi_sci+0xa2/0xa2
> > > >  ? setup_arch+0x8db/0x9e1
> > > >  ? start_kernel+0x6a/0x547
> > > >  ? secondary_startup_64+0xb6/0xc0
> > > >
> > > > Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
> > > > services data to fix kexec breakage" is listed in Fixes: since it
> > > > introduces more occurrences where efi_memmap_insert() is invoked after
> > > > an efi_fake_mem= configuration has been parsed. Previously the side
> > > > effects of vestigial empty entries were benign, but with commit
> > > > af1648984828 that follow-on efi_memmap_insert() invocation triggers
> > > > efi_memmap_insert() overruns.
> > > >
> > > > Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
> > > > Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot 
> > > > services...")
> > >
> > > A nitpick for the Fixes flags, as I replied in the thread below:
> > > https://lore.kernel.org/linux-efi/CAPcyv4jLxqPaB22Ao9oV31Gm=b0+phty+uz33snex4qchou...@mail.gmail.com/T/#m2bb2dd00f7715c9c19ccc48efef0fcd5fdb626e7
> > >
> > > I reproduced two other panics without the patches applied, so this issue
> > > is not caused by either of the commits, maybe just drop the Fixes.
> >
> > Just the "Fixes: af1648984828", right? No objection from me. I'll let
> > Ingo say if he needs a resend for that.
> >
> > The "Fixes: 0f96a99dab36" is valid because the original implementation
> > failed to handle the multiple argument case from the beginning.
>
> Agreed, thanks!
>

I'll queue this but without the fixes tags. The -stable maintainers
are far too trigger happy IMHO, and this really needs careful review
before being backported. efi_fake_mem is a debug feature anyway, so I
don't see an urgent need to get this fixed retroactively in older
kernels.



Re: [PATCH v4 3/4] efi: Fix efi_memmap_alloc() leaks

2020-01-07 Thread Ard Biesheuvel
On Tue, 7 Jan 2020 at 06:18, Dave Young  wrote:
>
> On 01/06/20 at 08:24pm, Dan Williams wrote:
> > On Mon, Jan 6, 2020 at 7:58 PM Dave Young  wrote:
> > >
> > > On 01/06/20 at 04:40pm, Dan Williams wrote:
> > > > With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> > > > updated and replaced multiple times. When that happens a previous
> > > > dynamically allocated efi memory map can be garbage collected. Use the
> > > > new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> > > > allocated memory map is being replaced.
> > > >
> > > > Debug statements in efi_memmap_free() reveal:
> > > >
> > > >  efi: __efi_memmap_free:37: phys: 0x23ffdd580 size: 2688 flags: 0x2
> > > >  efi: __efi_memmap_free:37: phys: 0x9db00 size: 2640 flags: 0x2
> > > >  efi: __efi_memmap_free:37: phys: 0x9e580 size: 2640 flags: 0x2
> > > >
> > > > ...a savings of 7968 bytes on a qemu boot with 2 entries specified to
> > > > efi_fake_mem=.
> > > >
> > > > Cc: Taku Izumi 
> > > > Cc: Ard Biesheuvel 
> > > > Signed-off-by: Dan Williams 
> > > > ---
> > > >  drivers/firmware/efi/memmap.c |   24 
> > > >  1 file changed, 24 insertions(+)
> > > >
> > > > diff --git a/drivers/firmware/efi/memmap.c 
> > > > b/drivers/firmware/efi/memmap.c
> > > > index 04dfa56b994b..bffa320d2f9a 100644
> > > > --- a/drivers/firmware/efi/memmap.c
> > > > +++ b/drivers/firmware/efi/memmap.c
> > > > @@ -29,6 +29,28 @@ static phys_addr_t __init 
> > > > __efi_memmap_alloc_late(unsigned long size)
> > > >   return PFN_PHYS(page_to_pfn(p));
> > > >  }
> > > >
> > > > +static void __init __efi_memmap_free(u64 phys, unsigned long size, 
> > > > unsigned long flags)
> > > > +{
> > > > + if (flags & EFI_MEMMAP_MEMBLOCK) {
> > > > + if (slab_is_available())
> > > > + memblock_free_late(phys, size);
> > > > + else
> > > > + memblock_free(phys, size);
> > > > + } else if (flags & EFI_MEMMAP_SLAB) {
> > > > + struct page *p = pfn_to_page(PHYS_PFN(phys));
> > > > + unsigned int order = get_order(size);
> > > > +
> > > > + free_pages((unsigned long) page_address(p), order);
> > > > + }
> > > > +}
> > > > +
> > > > +static void __init efi_memmap_free(void)
> > > > +{
> > > > + __efi_memmap_free(efi.memmap.phys_map,
> > > > + efi.memmap.desc_size * efi.memmap.nr_map,
> > > > + efi.memmap.flags);
> > > > +}
> > > > +
> > > >  /**
> > > >   * efi_memmap_alloc - Allocate memory for the EFI memory map
> > > >   * @num_entries: Number of entries in the allocated map.
> > > > @@ -100,6 +122,8 @@ static int __init __efi_memmap_init(struct 
> > > > efi_memory_map_data *data)
> > > >   return -ENOMEM;
> > > >   }
> > > >
> > > > + efi_memmap_free();
> > > > +
> > >
> > > This seems still not safe,  see below function:
> > > arch/x86/platform/efi/efi.c:
> > > static void __init efi_clean_memmap(void)
> > > It use same memmap for both old and new, and filter out those invalid
> > > ranges in place, if the memory is freed then ..
> >
> > In the efi_clean_memmap() case flags are 0, so efi_memmap_free() is a nop.
> >
> > Would you feel better with an explicit?
> >
> > WARN_ON(efi.memmap.phys_map == data->phys_map && (data->flags &
> > (EFI_MEMMAP_SLAB | EFI_MEMMAP_MEMBLOCK))
> >
> > ...not sure it's worth it.
>
> Ah, yes, sorry I did not see the flags, although it is not very obvious.
> Maybe add some code comment for efi_mem_alloc and efi_mem_init.
>
> Let's defer the suggestion to Ard.
>

A one line comment to remind our future selves of this discussion
would probably be helpful, but beyond that, I don't think we need to
do much here.



Re: [PATCH v3 2/4] efi: Add tracking for dynamically allocated memmaps

2020-01-02 Thread Ard Biesheuvel
Hi Dan,

Thanks for taking the time to really fix this properly.

Comments/questions below.

On Thu, 2 Jan 2020 at 05:29, Dan Williams  wrote:
>
> In preparation for fixing efi_memmap_alloc() leaks, add support for
> recording whether the memmap was dynamically allocated from slab,
> memblock, or is the original physical memmap provided by the platform.
>
> Cc: Taku Izumi 
> Cc: Ard Biesheuvel 
> Signed-off-by: Dan Williams 
> ---
>  arch/x86/platform/efi/efi.c |2 +-
>  arch/x86/platform/efi/quirks.c  |   11 ++-
>  drivers/firmware/efi/fake_mem.c |5 +++--
>  drivers/firmware/efi/memmap.c   |   16 ++--
>  include/linux/efi.h |8 ++--
>  5 files changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> index 38d44f36d5ed..7086afbb84fd 100644
> --- a/arch/x86/platform/efi/efi.c
> +++ b/arch/x86/platform/efi/efi.c
> @@ -333,7 +333,7 @@ static void __init efi_clean_memmap(void)
> u64 size = efi.memmap.nr_map - n_removal;
>
> pr_warn("Removing %d invalid memory map entries.\n", 
> n_removal);
> -   efi_memmap_install(efi.memmap.phys_map, size);
> +   efi_memmap_install(efi.memmap.phys_map, size, 0);
> }
>  }
>
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index f8f0220b6a66..4a71c790f9c3 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -244,6 +244,7 @@ EXPORT_SYMBOL_GPL(efi_query_variable_store);
>  void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
>  {
> phys_addr_t new_phys, new_size;
> +   unsigned long flags = 0;
> struct efi_mem_range mr;
> efi_memory_desc_t md;
> int num_entries;
> @@ -272,8 +273,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
> num_entries += efi.memmap.nr_map;
>
> new_size = efi.memmap.desc_size * num_entries;
> -
> -   new_phys = efi_memmap_alloc(num_entries);
> +   new_phys = efi_memmap_alloc(num_entries, );
> if (!new_phys) {
> pr_err("Could not allocate boot services memmap\n");
> return;
> @@ -288,7 +288,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
> efi_memmap_insert(, new, );
> early_memunmap(new, new_size);
>
> -   efi_memmap_install(new_phys, num_entries);
> +   efi_memmap_install(new_phys, num_entries, flags);
> e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> e820__update_table(e820_table);
>  }
> @@ -408,6 +408,7 @@ static void __init efi_unmap_pages(efi_memory_desc_t *md)
>  void __init efi_free_boot_services(void)
>  {
> phys_addr_t new_phys, new_size;
> +   unsigned long flags = 0;
> efi_memory_desc_t *md;
> int num_entries = 0;
> void *new, *new_md;
> @@ -463,7 +464,7 @@ void __init efi_free_boot_services(void)
> return;
>
> new_size = efi.memmap.desc_size * num_entries;
> -   new_phys = efi_memmap_alloc(num_entries);
> +   new_phys = efi_memmap_alloc(num_entries, );
> if (!new_phys) {
> pr_err("Failed to allocate new EFI memmap\n");
> return;
> @@ -493,7 +494,7 @@ void __init efi_free_boot_services(void)
>
> memunmap(new);
>
> -   if (efi_memmap_install(new_phys, num_entries)) {
> +   if (efi_memmap_install(new_phys, num_entries, flags)) {
> pr_err("Could not install new EFI memmap\n");
> return;
> }
> diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
> index bb9fc70d0cfa..7e53e5520548 100644
> --- a/drivers/firmware/efi/fake_mem.c
> +++ b/drivers/firmware/efi/fake_mem.c
> @@ -39,6 +39,7 @@ void __init efi_fake_memmap(void)
> int new_nr_map = efi.memmap.nr_map;
> efi_memory_desc_t *md;
> phys_addr_t new_memmap_phy;
> +   unsigned long flags = 0;
> void *new_memmap;
> int i;
>
> @@ -55,7 +56,7 @@ void __init efi_fake_memmap(void)
> }
>
> /* allocate memory for new EFI memmap */
> -   new_memmap_phy = efi_memmap_alloc(new_nr_map);
> +   new_memmap_phy = efi_memmap_alloc(new_nr_map, );
> if (!new_memmap_phy)
> return;
>
> @@ -73,7 +74,7 @@ void __init efi_fake_memmap(void)
> /* swap into new EFI memmap */
> early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
>
> -   efi_memmap_install(new_memmap_phy, new_nr_map

Re: [PATCH] efi/memreserve: register reservations as 'reserved' in /proc/iomem

2019-12-05 Thread Ard Biesheuvel
On Wed, 4 Dec 2019 at 20:13, Bhupesh SHARMA  wrote:
>
> Hello Masa,
>
> (+Cc Simon)
>
> On Thu, Dec 5, 2019 at 12:27 AM Masayoshi Mizuma  
> wrote:
> >
> > On Wed, Dec 04, 2019 at 06:17:59PM +, James Morse wrote:
> > > Hi Masa,
> > >
> > > On 04/12/2019 17:17, Masayoshi Mizuma wrote:
> > > > Thank you for sending the patch, but unfortunately it doesn't fix
> > > > the issue...
> > > >
> > > > After applied your patch, the LPI tables are marked as reserved in
> > > > /proc/iomem like as:
> > > >
> > > > 8030-a1fd : System RAM
> > > >   8048-8134 : Kernel code
> > > >   8135-817b : reserved
> > > >   817c-82ac : Kernel data
> > > >   830f-830f : reserved # Property table
> > > >   8348-83480fff : reserved # Pending table
> > > >   8349-8349 : reserved # Pending table
> > > >
> > > > However, kexec tries to allocate memory from System RAM; it ignores
> > > > the reserved regions within System RAM.
> > >
> > > > I'm not sure why kexec doesn't honor the reserved regions in System RAM, however,
> > >
> > > Hmm, we added these to fix a problem with the UEFI memory map, and more 
> > > recently ACPI
> > > tables being overwritten by kexec.
> > >
> > > Which version of kexec-tools are you using? Could you try:
> > > https://git.linaro.org/people/takahiro.akashi/kexec-tools.git/commit/?h=arm64/resv_mem
> >
> > Thanks a lot! It worked and the issue is gone with Ard's patch and
> > the linaro kexec (arm64/resv_mem branch).
> >
> > Ard, please feel free to add:
> >
> > Tested-by: Masayoshi Mizuma 
>
> Same results at my side, so:
> Tested-and-Reviewed-by: Bhupesh Sharma 
>

Thank you all. I'll get this queued as a fix with cc:stable for v5.4


> > >
> > > > if the kexec behavior is right, the LPI tables should not belong to
> > > > System RAM.
> > >
> > > > Like so:
> > > >
> > > > 8030-830e : System RAM
> > > >   8048-8134 : Kernel code
> > > >   8135-817b : reserved
> > > >   817c-82ac : Kernel data
> > > > 830f-830f : reserved # Property table
> > > > 8348-83480fff : reserved # Pending table
> > > > 8349-8349 : reserved # Pending table
> > > > 834a-a1fd : System RAM
> > > >
> > > > I don't have ideas to separate LPI tables from System RAM... so I tried
> > > > to add a new file to inform the LPI tables to userspace.
> > >
> > > This is how 'nomap' memory appears, we carve it out of System RAM. A side 
> > > effect of this
> > > is kdump can't touch it, as you've told it this isn't memory.
> > >
> > > As these tables are memory, mapped by the linear map, I think Ard's patch 
> > > is the right
> > > thing to do ... I suspect your kexec-tools doesn't have those patches 
> > > from Akashi to make
> > > it honour all second level entries.
> >
> > I used the kexec on the top of master branch:
> > git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
> >
> > Should we use the linaro kexec for aarch64 machine?
> > Or will the arm64/resv_mem branch be merged to the kexec on
> > git.kernel.org...?
>
> Glad that Ard's patch fixes the issue for you.
> Regarding Akashi's patch, I think it was sent to upstream kexec-tools
> some time ago (see [0]) but it seems not integrated in upstream
> kexec-tools (now I noticed my Tested-by email for the same got bounced
> off due to some gmail msmtp setting issues at my end - sorry for
> that). I have added Simon in Cc list.
>
> Hi Simon,
>
> Can you please help pick [0] in upstream kexec-tools with Tested-by
> from Masa and myself? Thanks a lot for your help.
>
> [0]. http://lists.infradead.org/pipermail/kexec/2019-January/022201.html
>
> Thanks,
> Bhupesh

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] efi/memreserve: register reservations as 'reserved' in /proc/iomem

2019-12-04 Thread Ard Biesheuvel
Memory regions that are reserved using efi_mem_reserve_persistent()
are recorded in a special EFI config table which survives kexec,
allowing the incoming kernel to honour them as well. However,
such reservations are not visible in /proc/iomem, and so the kexec
tools that load the incoming kernel and its initrd into memory may
overwrite these reserved regions before the incoming kernel has a
chance to reserve them from further use.

So add these reservations to /proc/iomem as they are created. Note
that reservations that are inherited from a previous kernel are
memblock_reserve()'d early on, so they are already visible in
/proc/iomem.

Cc: Masayoshi Mizuma 
Cc: d.hatay...@fujitsu.com
Cc: kexec@lists.infradead.org
Signed-off-by: Ard Biesheuvel 
---
 drivers/firmware/efi/efi.c | 29 ++--
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d101f072c8f8..fcd82dde23c8 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -979,6 +979,24 @@ static int __init efi_memreserve_map_root(void)
return 0;
 }
 
+static int efi_mem_reserve_iomem(phys_addr_t addr, u64 size)
+{
+   struct resource *res, *parent;
+
+   res = kzalloc(sizeof(struct resource), GFP_ATOMIC);
+   if (!res)
+   return -ENOMEM;
+
+   res->name   = "reserved";
+   res->flags  = IORESOURCE_MEM;
+   res->start  = addr;
+   res->end= addr + size - 1;
+
+   /* we expect a conflict with a 'System RAM' region */
+   parent = request_resource_conflict(&iomem_resource, res);
+   return parent ? request_resource(parent, res) : 0;
+}
+
 int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
 {
struct linux_efi_memreserve *rsv;
@@ -1001,9 +1019,8 @@ int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
if (index < rsv->size) {
rsv->entry[index].base = addr;
rsv->entry[index].size = size;
-
memunmap(rsv);
-   return 0;
+   return efi_mem_reserve_iomem(addr, size);
}
memunmap(rsv);
}
@@ -1013,6 +1030,12 @@ int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
if (!rsv)
return -ENOMEM;
 
+   rc = efi_mem_reserve_iomem(__pa(rsv), SZ_4K);
+   if (rc) {
+   free_page(rsv);
+   return rc;
+   }
+
/*
 * The memremap() call above assumes that a linux_efi_memreserve entry
 * never crosses a page boundary, so let's ensure that this remains true
@@ -1029,7 +1052,7 @@ int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
efi_memreserve_root->next = __pa(rsv);
spin_unlock(&efi_mem_reserve_persistent_lock);
 
-   return 0;
+   return efi_mem_reserve_iomem(addr, size);
 }
 
 static int __init efi_memreserve_root_init(void)
-- 
2.17.1




Re: [PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-04 Thread Ard Biesheuvel
On Wed, 4 Dec 2019 at 10:14, Ingo Molnar  wrote:
>
>
> * Dave Young  wrote:
>
> > On 12/04/19 at 03:52pm, Dave Young wrote:
> > > Michael Weiser reported he got below error during a kexec rebooting:
> > > esrt: Unsupported ESRT version 2904149718861218184.
> > >
> > > The ESRT memory stays in EFI boot services data, and it was reserved
> > > in kernel via efi_mem_reserve().  The initial purpose of the reservation
> > > is to reuse the EFI boot services data across kexec reboot. For example
> > > the BGRT image data and some ESRT memory like Michael reported.
> > >
> > > But although the memory is reserved it is not updated in X86 e820 table.
> > > And kexec_file_load iterate system ram in io resource list to find places
> > > for kernel, initramfs and other stuff. In Michael's case the kexec loaded
> > > initramfs overwritten the ESRT memory and then the failure happened.
> >
> > s/overwritten/overwrote :)  If a repost is needed, please let me know..
> >
> > >
> > > Since kexec_file_load depends on the e820 to be updated, just fix this
> > > by updating the reserved EFI boot services memory as reserved type in 
> > > e820.
> > >
> > > Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
> > > bypassed in the reservation code path because they are assumed as 
> > > reserved.
> > > But the reservation is still needed for multiple kexec reboot.
> > > And it is the only possible case we come here thus just drop the code
> > > chunk then everything works without side effects.
> > >
> > > On my machine the ESRT memory sits in an EFI runtime data range, it does
> > > not trigger the problem, but I successfully tested with BGRT instead.
> > > both kexec_load and kexec_file_load work and kdump works as well.
> > >
> > > Signed-off-by: Dave Young 
>
>
> So I edited this to:
>
>  From: Dave Young 
>
>  Michael Weiser reported he got this error during a kexec rebooting:
>
>esrt: Unsupported ESRT version 2904149718861218184.
>
>  The ESRT memory stays in EFI boot services data, and it was reserved
>  in kernel via efi_mem_reserve().  The initial purpose of the reservation
>  is to reuse the EFI boot services data across kexec reboot. For example
>  the BGRT image data and some ESRT memory like Michael reported.
>
>  But although the memory is reserved it is not updated in the X86 E820 table,
>  and kexec_file_load() iterates system RAM in the IO resource list to find 
> places
>  for kernel, initramfs and other stuff. In Michael's case the kexec loaded
>  initramfs overwrote the ESRT memory and then the failure happened.
>
>  Since kexec_file_load() depends on the E820 table being updated, just fix 
> this
>  by updating the reserved EFI boot services memory as reserved type in E820.
>
>  Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
>  bypassed in the reservation code path because they are assumed as reserved.
>
>  But the reservation is still needed for multiple kexec reboots,
>  and it is the only possible case we come here thus just drop the code
>  chunk, then everything works without side effects.
>
>  On my machine the ESRT memory sits in an EFI runtime data range, it does
>  not trigger the problem, but I successfully tested with BGRT instead.
>  both kexec_load() and kexec_file_load() work and kdump works as well.
>

Acked-by: Ard Biesheuvel 



Re: [PATCH v2 0/2] efi: arm64: Introduce /proc/efi/memreserve to tell the persistent pages

2019-12-04 Thread Ard Biesheuvel
On Tue, 3 Dec 2019 at 20:14, Masayoshi Mizuma  wrote:
>
> From: Masayoshi Mizuma 
>
> kexec reboot sometimes fails in the early boot sequence on aarch64 machines.
> That is because kexec overwrites the LPI property tables and pending
> tables with the initrd.
>
> To avoid the overwrite, introduce /proc/efi/memreserve to tell kexec
> about the tables' region so that kexec can avoid that memory region
> when locating the initrd.
>
> kexec also needs a patch to handle /proc/efi/memreserve. I'm preparing
> the patch for kexec.
>
> Changelog
> v2: - Change memreserve file location from sysfs to procfs.
>   memreserve may exceed the PAGE_SIZE in case efi_memreserve_root
>   has a lot of entries. So we cannot use sysfs_kf_seq_show().
>   Use seq_printf() in procfs instead.
>
> Masayoshi Mizuma (2):
>   efi: add /proc/efi directory
>   efi: arm64: Introduce /proc/efi/memreserve to tell the persistent
> pages
>

Apologies for the tardy response.

Adding /proc/efi is really out of the question. *If* we add any
special files to expose this information, it should be under sysfs.

However, this is still only a partial solution, since it only solves
the problem for userspace based kexec, and we need something for
kexec_file_load() as well.

The fundamental issue here is that /proc/iomem apparently lacks the
entries that describe these regions as 'reserved', so we should try to
address that instead.



Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Ard Biesheuvel
On Mon, 2 Dec 2019 at 09:05, Dave Young  wrote:
>
> Add more cc
> On 12/02/19 at 04:58pm, Dave Young wrote:
> > On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > > Hello Dave,
> > >
> > > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > >
> > > > > > Fundamentally when deciding where to place a new kernel kexec 
> > > > > > (either
> > > > > > user space or the in kernel kexec_file implementation) needs to be 
> > > > > > able
> > > > > > to ask the question which memory ares are reserved.
> > > [...]
> > > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > > /proc/iomem?
> > > > >
> > > > > My guess is that the focus was that some EFI structures need to be 
> > > > > kept
> > > > > around accross the life cycle of *one* running kernel and
> > > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > > or Matt would know.
> > > > Can you check your un-reserved memory, if your memory falls into EFI
> > > > BOOT* then in X86 you can use something like below if it is not covered:
> > >
> > > > void __init efi_esrt_init(void)
> > > > {
> > > > ...
> > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > >   efi_mem_reserve(esrt_data, esrt_data_size);
> > > > ...
> > > > }
> > >
> > > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > > the esrt module reports at boot:
> > >
> > > [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> > > 0x74dd2fd0.
> > >
> > > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > > code you quote reserve it using memblock_reserve() shown by
> > > memblock=debug:
> > >
> > > [0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] 
> > > efi_mem_reserve+0x1d/0x2b
> > >
> > > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > > as shown by efi=debug:
> > >
> > > [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> > > [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> > > [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> > >
> > > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > > not report this area as being free'd while the surrounding ones are:
> > >
> > > [0.178369] __memblock_free_late: 
> > > [0x74dd3000-0x75becfff] efi_free_boot_services+0x126/0x1f8
> > > [0.178658] __memblock_free_late: 
> > > [0x6d635000-0x74dd1fff] efi_free_boot_services+0x126/0x1f8
> > >
> > > The esrt area does not show up in /proc/iomem though:
> > >
> > > 0010-763f5fff : System RAM
> > >   6200-62a00d80 : Kernel code
> > >   62c0-62f15fff : Kernel rodata
> > >   6300-630ea8bf : Kernel data
> > >   63fed000-641f : Kernel bss
> > >   6500-6aff : Crash kernel
> > >
> > > And thus kexec loads the new kernel right over that area as shown when
> > > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x7300
> > > and 0x7300+0x24be000 = 0x754be000):
> > >
> > > [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> > > bufsz=0x5000 mem=0x98000 memsz=0x6000
> > > [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> > > bufsz=0x1240 mem=0x96000 memsz=0x2000
> > > [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> > > bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> > >
> > > ... because it looks for any memory hole large enough in iomem resources
> > > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > > excluded from on my system.
> > >
> > > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > > the area with efi.memmap and installs it using efi_memmap_install().
> > > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > > *not* reserve that memory in any way.
> > >
> > > > Unfortunately I noticed there are different requirements/ways for
> > > > different types of "reserved" memory.  But that is another topic..
> > >
> > > I tried to reserve the area with something like this:
> > >
> > > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > > index 4de244683a7e..b86a5df027a2 100644
> > > --- a/arch/x86/platform/efi/quirks.c
> > > +++ b/arch/x86/platform/efi/quirks.c
> > > @@ -249,6 

Re: [PATCH] do not clean dummy variable in kexec path

2019-09-25 Thread Ard Biesheuvel
On Tue, 17 Sep 2019 at 19:52, Matthew Garrett  wrote:
>
> On Fri, Sep 13, 2019 at 2:18 AM Ard Biesheuvel
>  wrote:
>
> > > > - Remove the cleanup from the kexec path -- the cleanup logic from [4],
> > > >   even if justified for the cold boot path, should have never modified
> > > >   the kexec path.
> > >
> > > I agree that there's no benefit in it being called in the kexec path.
> >
> > Can I take that as an ack?
>
> An ack of this hunk.

Given that the patch in question has only one hunk, I'll take this as
an ack of the entire patch, and queue it as a fix.



Re: [PATCH] do not clean dummy variable in kexec path

2019-09-13 Thread Ard Biesheuvel
On Tue, 13 Aug 2019 at 22:14, Matthew Garrett  wrote:
>
> On Tue, Aug 13, 2019 at 4:28 AM Laszlo Ersek  wrote:
> > (I verified yesterday, using the edk2 source code, that there is no
> > varstore reclaim after ExitBootServices(), indeed.)
>
> Some implementations do reclaim at runtime, in which case the
> create/delete dance will permit variable creation.
>
> > (a) Attempting to delete the dummy variable in efi_enter_virtual_mode().
>
> To be clear, the dummy variable should never actually come into
> existence - we explicitly attempt to create a variable that's bigger
> than the available space, so the expectation is that it will always
> fail. However, should it somehow end up being created, there's a race
> between the creation and the deletion and so there's a (small) risk
> that the variable actually ends up there. The cleanup in
> enter_virtual_mode() is just there to ensure that anything that did
> end up being created on a previous boot is deleted - the expectation
> is that it'll be a noop.
>
> > (b) The following part, in efi_query_variable_store():
> >
> > +   /*
> > +* The runtime code may now have triggered a garbage 
> > collection
> > +* run, so check the variable info again
> > +*/
> >
> > Let me start with (b). That code is essentially dead, I would say, based
> > on the information that had already been captured in the commit message
> > of [1]. Reclaim would never happen after ExitBootServices(). (I assume
> > efi_query_variable_store() is only invoked after ExitBootServices(),
> > i.e., from kernel space proper -- sorry if that's a wrong assumption.)
>
> It's dead code on Tiano, but not on at least one vendor implementation.
>
> > Considering (a): what justified the attempt to delete the dummy variable
> > in efi_enter_virtual_mode(), in commit [4]? Was that meant as a
> > fail-safe just so we don't leave a dummy variable lying around?
>
> Yes.
>
> > So even if we consider the "clean DUMMY object" hunk from [4] a
> > justified fail-safe for the normal boot path, it doesn't apply to the
> > kexec path -- the cold-booted primary kernel will have gone through
> > those motions already, will it not?
> >
> > Therefore, we should do two things:
> >
> > - Remove the cleanup from the kexec path -- the cleanup logic from [4],
> >   even if justified for the cold boot path, should have never modified
> >   the kexec path.
>
> I agree that there's no benefit in it being called in the kexec path.

Can I take that as an ack?



Re: [PATCH] do not clean dummy variable in kexec path

2019-08-05 Thread Ard Biesheuvel
On Mon, 5 Aug 2019 at 11:36, Dave Young  wrote:
>
> kexec reboot fails randomly in a UEFI-based KVM guest.  The firmware
> just resets while calling efi_delete_dummy_variable().  Unfortunately
> I don't know how to debug the firmware; it is possibly a potential
> problem on real hardware as well, although nobody has reproduced it.
>
> The intention of efi_delete_dummy_variable is to trigger garbage collection
> when entering virtual mode.  But SetVirtualAddressMap can only run once
> for each physical reboot, thus kexec_enter_virtual_mode is not necessarily
> a good place to clean the dummy object.
>

I would argue that this means it is not a good place to *create* the
dummy variable, and if we don't create it, we don't have to delete it
either.

> Drop efi_delete_dummy_variable so that kexec reboot can work.
>

Creating it and not deleting it is bad, so please try and see if we
can omit the creation on this code path instead.


> Signed-off-by: Dave Young 
> ---
>  arch/x86/platform/efi/efi.c |3 ---
>  1 file changed, 3 deletions(-)
>
> --- linux-x86.orig/arch/x86/platform/efi/efi.c
> +++ linux-x86/arch/x86/platform/efi/efi.c
> @@ -894,9 +894,6 @@ static void __init kexec_enter_virtual_m
>
> if (efi_enabled(EFI_OLD_MEMMAP) && (__supported_pte_mask & _PAGE_NX))
> runtime_code_page_mkexec();
> -
> -   /* clean DUMMY object */
> -   efi_delete_dummy_variable();
>  #endif
>  }
>


Re: [PATCH v2] arm64: invalidate TLB just before turning MMU on

2018-12-13 Thread Ard Biesheuvel
On Fri, 14 Dec 2018 at 05:08, Qian Cai  wrote:
> Also tried to move the local TLB flush part around a bit inside
> __cpu_setup(); although it did complete kdump at times, it did trigger
> a "Synchronous Exception" in EFI after a cold reboot fairly often, with
> seemingly no way to recover remotely without reinstalling the OS.

This doesn't make any sense to me. If the system gets into a weird
state out of cold reboot, how could this code be the culprit? Please
check your firmware, and try to reproduce the issue on a system that
doesn't have such defects.



Re: [Bug Report] kdump crashes after latest EFI memblock changes on arm64 machines with large number of CPUs

2018-11-06 Thread Ard Biesheuvel
On 6 November 2018 at 02:30, Will Deacon  wrote:
> On Fri, Nov 02, 2018 at 02:44:10AM +0530, Bhupesh Sharma wrote:
>> With the latest EFI changes for memblock reservation across kdump
>> kernel from Ard (Commit 71e0940d52e107748b270213a01d3b1546657d74
>> ["efi: honour memory reservations passed via a linux specific config
>> table"]), we hit a panic while trying to boot the kdump kernel on
>> machines which have large number of CPUs.
>>
>> I have an arm64 board which has 224 CPUs:
>> # lscpu
>> <..snip..>
>> CPU(s):  224
>> On-line CPU(s) list: 0-223
>> <..snip..>
>>
>> Here are the crash logs in the kdump kernel on this machine:
>>
>> [0.00] Unable to handle kernel paging request at virtual
>> address 80003ffe
>> val)nt EL), IL ata abort info:
>> [0.or: Oops: 96inted 4.18.0+ #3
>> [0.00] pstate: 20400089 (nzCv daIf +PAN -UAO)
>> [0.00] pc : __memcpy+0x110/0x180
>> [0.00] lr : memblock_double_array+0x240/0x348
>> [0.00] sp : 092efc80 x28: bffe
>> [0.00] x27: 1800 x26: 09d59000
>> [0.00] x25: 80003ffe x24: 
>> [0.00] x23: 0001 x22: 09d594e8
>> [0.00] x21: 09d594f4 x20: 093c7268
>> [0.00] x19: 0c00 x18: 0010
>> [0.00] x17:  x16: 
>> [0.00] x15: 3: 000fc18d x12: 0008
>> [0.00] x11: 0018 x10: ddab9e18
>> [0.00] x9 : 0008 x8 : 02c1
>> [0.00] x7 : 91b9 x6 : 80003ffe
>> [0.00] x5 : 0001 x4 : 
>> [0.00] x3 :  x2 : 0b80
>> [0.00] x1 : 09d59540 x0 : 80003ffe
>> [0.00] Process swapper)
>> [0.00] Call trace:
>> [0.00]  __memcpy+0x110/0x180
>> [0.00]  memblock_add_range+0x134/0x2e8
>> [0.00]  memblock_reserve+0x70/0xb8
>> [0.00]  memblock_alloc_base_nid+0x6c/0x88
>> [0.00]  __memblock_alloc_base+0x3c/0x4c
>> [0.00]  memblock_alloc_base+0x28/0x4c
>> [0.00]  memblock_alloc+0x2c/0x38
>> [0.00]  early_pgtable_alloc+0x20/0xb0
>
> Hmm, so this seems to be the crux of the issue: early_pgtable_alloc() relies
> on memblock to allocate page-table memory, but this can be called before the
> linear mapping is up and running (or even as part of creating the linear
> mapping itself!) so the use of __va in memblock_double_array() actually
> returns an unmapped address.
>

OK, so this means we are calling memblock_allow_resize() too early in any case

> So I guess we either need to implement early_pgtable_alloc() some other way
> (how?) or get memblock_double_array() to use a fixmap if it's called too
> early (yuck). Alternatively, would it be possible to postpone processing of
> the EFI mem_reserve entries until after we've created the linear mapping?
>

We could move this until after paging_init(), I suppose. I'll cook something up.

Bhupesh: any comments?



Re: [Bug Report] kdump crashes after latest EFI memblock changes on arm64 machines with large number of CPUs

2018-11-05 Thread Ard Biesheuvel
(+ Marc)

On 1 November 2018 at 22:14, Bhupesh Sharma  wrote:
> Hi,
>
> With the latest EFI changes for memblock reservation across kdump
> kernel from Ard (Commit 71e0940d52e107748b270213a01d3b1546657d74
> ["efi: honour memory reservations passed via a linux specific config
> table"]), we hit a panic while trying to boot the kdump kernel on
> machines which have large number of CPUs.
>

Just for my understanding: why do you boot all 224 CPUs when running
the crash kernel?

I'm not saying we shouldn't fix the underlying issue, I'm just curious.

> I have an arm64 board which has 224 CPUs:
> # lscpu
> <..snip..>
> CPU(s):  224
> On-line CPU(s) list: 0-223
> <..snip..>
>
> Here are the crash logs in the kdump kernel on this machine:
>
> [0.00] Unable to handle kernel paging request at virtual
> address 80003ffe
> val)nt EL), IL ata abort info:
> [0.or: Oops: 96inted 4.18.0+ #3
> [0.00] pstate: 20400089 (nzCv daIf +PAN -UAO)
> [0.00] pc : __memcpy+0x110/0x180
> [0.00] lr : memblock_double_array+0x240/0x348
> [0.00] sp : 092efc80 x28: bffe
> [0.00] x27: 1800 x26: 09d59000
> [0.00] x25: 80003ffe x24: 
> [0.00] x23: 0001 x22: 09d594e8
> [0.00] x21: 09d594f4 x20: 093c7268
> [0.00] x19: 0c00 x18: 0010
> [0.00] x17:  x16: 
> [0.00] x15: 3: 000fc18d x12: 0008
> [0.00] x11: 0018 x10: ddab9e18
> [0.00] x9 : 0008 x8 : 02c1
> [0.00] x7 : 91b9 x6 : 80003ffe
> [0.00] x5 : 0001 x4 : 
> [0.00] x3 :  x2 : 0b80
> [0.00] x1 : 09d59540 x0 : 80003ffe
> [0.00] Process swapper)
> [0.00] Call trace:
> [0.00]  __memcpy+0x110/0x180
> [0.00]  memblock_add_range+0x134/0x2e8
> [0.00]  memblock_reserve+0x70/0xb8
> [0.00]  memblock_alloc_base_nid+0x6c/0x88
> [0.00]  __memblock_alloc_base+0x3c/0x4c
> [0.00]  memblock_alloc_base+0x28/0x4c
> [0.00]  memblock_alloc+0x2c/0x38
> [0.00]  early_pgtable_alloc+0x20/0xb0
> [0.00]  paging_init+0x28/0x7f8
> [   0.00]  start_kernel+0x78/0x4cc
> [0.00] Code: a8c12027 a8c12829 a8c1302b a8c1382d (a88120c7)
> [0.00] random: get_random_bytes called from
> print_oops_end_marker+0x30/0x58 with crng_init=0
> [0.00] ---[ end trace  ]---
> [0.00] Kernel panic - not syncing: Fatal exception
> [0.00] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> Adding more debug logs via 'memblock=debug' being passed to the kdump
> kernel, (and adding a few more prints to 'mm/memblock.c'), I can see
> that the panic happens while trying to resize array inside
> 'memblock_double_array' (which doubles the size of the memblock
> regions array):
>
> [0.00] Reserving 13KB of memory at 0xbfff for elfcorehdr
> [0.00] memblock_reserve:
> [0xbfff-0xbfff]
> memblock_alloc_base_nid+0x6c/0x88
> [0.00] memblock: use_slab is 0, new_area_start=bfff,
> new_area_size=1
> [0.00] memblock: use_slab is 0, addr=0, new_area_size=1
> [0.00] memblock: addr=bffe, __va(addr)=80003ffe
> [0.0 [0xbffe-0xbffe17ff]
> [0.00] Unable to handle kernel paging request at virtual
> address 80003ffe
>
> which indicates that after Ard's patch the memblocks being reserved
> across kdump swell up on systems which have large number of CPUs and
> hence 'memblock_double_array' is called up in early kdump boot code to
> double the size of the memblock regions array.
>
> To confirm the above, I reduced the number of SMP CPUs available to
> the kernel on this system, by specifying 'nr_cpus=46' in the kernel
> bootargs for the primary kernel. As expected this makes the kdump
> kernel boot successfully and also save the crash dump properly.
>
> I saw another arm64 kdump user report this issue to me privately, so I
> am sending this to a wider audience, so that kdump users are aware
> that this is a known issue.
>
> I am working on a RFC patch which seems to fix the issue on my board
> and will try to send it out for wider review in coming days after some
> more checks at my end.
>
> Any advices on the same are also welcome :)
>
> Thanks,
> Bhupesh



Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-21 Thread Ard Biesheuvel
On 9 August 2018 at 11:13, Dave Young  wrote:
> On 08/09/18 at 09:33am, Mike Galbraith wrote:
>> On Thu, 2018-08-09 at 12:21 +0800, Dave Young wrote:
>> > Hi Mike,
>> >
>> > Thanks for the patch!
>> > On 08/08/18 at 04:03pm, Mike Galbraith wrote:
>> > > When booting with efi=noruntime, we call efi_runtime_map_copy() while
>> > > loading the kdump kernel, and trip over a NULL efi.memmap.map.  Avoid
>> > > that and a useless allocation when the only mapping we can use (1:1)
>> > > is not available.
>> >
>> > At first glance, efi_get_runtime_map_size should return 0 in case
>> > noruntime.
>>
>> What efi does internally at unmap time is to leave everything except
>> efi.mmap.map untouched, setting it to NULL and turning off EFI_MEMMAP,
>> rendering efi.mmap.map accessors useless/unsafe without first checking
>> EFI_MEMMAP.
>
> Probably the x86 efi_init should reset nr_map to zero in case runtime is
> disabled.  But let's see how Ard thinks about this and cc linux-efi.
>
> As for efi_get_runtime_map_size, it was introduced for x86 kexec use.
> for copying runtime maps,  so I think it is reasonable this function
> return zero in case no runtime.
>

I don't see the patch in the context so I cannot comment in great detail.

In any case, it is better to decouple EFI_MEMMAP from EFI_RUNTIME
dependencies. On x86, one may imply the other, but this is not
generally the case.

That means that efi_get_runtime_map_size() should probably check the
EFI_RUNTIME flag, and return 0 if it is cleared. Perhaps there are
other places where EFI_MEMMAP flag checks are missing, but I consider
that a separate issue.



Re: [PATCH v12 16/16] arm64: kexec_file: add kaslr support

2018-07-27 Thread Ard Biesheuvel
On 27 July 2018 at 11:22, James Morse  wrote:
> Hi Akashi,
>
>
> On 07/27/2018 09:31 AM, AKASHI Takahiro wrote:
>
> On Thu, Jul 26, 2018 at 02:40:49PM +0100, James Morse wrote:
>
> On 24/07/18 07:57, AKASHI Takahiro wrote:
>
> Adding "kaslr-seed" to dtb enables triggering kaslr, or kernel virtual
> address randomization, at secondary kernel boot.
>
> Hmm, there are three things that get moved by CONFIG_RANDOMIZE_BASE. The
> kernel
> physical placement when booted via the EFIstub, the kernel-text VAs and the
> location of memory in the linear-map region. Adding the kaslr-seed only does
> the
> last two.
>
> Yes, but I think that Mark and I agreed that "kaslr" meant
> "virtual" randomisation, not including "physical" randomisation.
>
> Okay, I'll update my terminology!
>
>
> This means the physical placement of the new kernel is predictable from
> /proc/iomem ... but this also tells you the physical placement of the
> current
> kernel, so I don't think this is a problem.
>
>
> We always do this as it will have no harm on a kaslr-incapable kernel.
>
> We don't have any "switch" to turn off this feature directly, but still
> can suppress it by passing "nokaslr" as a kernel boot argument.
>
> diff --git a/arch/arm64/kernel/machine_kexec_file.c
> b/arch/arm64/kernel/machine_kexec_file.c
> index 7356da5a53d5..47a4fbd0dc34 100644
> --- a/arch/arm64/kernel/machine_kexec_file.c
> +++ b/arch/arm64/kernel/machine_kexec_file.c
> @@ -158,6 +160,12 @@ static int setup_dtb(struct kimage *image,
>
> Don't you need to reserve some space in the area you vmalloc()d for the DT?
>
> No, I don't think so.
> All the data to be loaded are temporarily saved in kexec buffers,
> which will eventually be copied to target locations in machine_kexec
> (arm64_relocate_new_kernel, which, unlike its name, will handle
> not only kernel but also other data as well).
>
>
> I think we're speaking at cross purposes. Don't you need:
>
> | buf_size += fdt_prop_len("kaslr-seed", sizeof(u64));
>
>
> You can't assume the existing DTB had a kaslr-seed property, and the
> difference may take us over a PAGE_SIZE boundary.
>
>
>
>
> + /* add kaslr-seed */
> + get_random_bytes(, sizeof(value));
>
> What happens if the crng isn't ready?
>
> It looks like this will print a warning that these random-bytes aren't
> really up
> to standard, but the new kernel doesn't know this happened.
>
> crng_ready() isn't exposed, all we could do now is
> wait_for_random_bytes(), but that may wait forever because we do this
> unconditionally.
>
> I'd prefer to leave this feature until we can check crng_ready(), and skip
> adding a dodgy seed if it's not ready. This avoids polluting the
> next-kernel's entropy pool.
>
> OK. I would try to follow the same way as Bhupesh's userspace patch
> does for kaslr-seed:
> http://lists.infradead.org/pipermail/kexec/2018-April/020564.html
>
>
> (I really don't understand this 'copying code from user-space' that happens
> with kexec_file_load)
>
>
>   if (not found kaslr-seed in 1st kernel's dtb)
>  don't care; go ahead
>
>
> Don't bother. As you say in the commit message, it's harmless if the new
> kernel doesn't support it.
> Always having this would let you use kexec_file_load as a bootloader that
> can get the crng to provide decent entropy even if the platform
> bootloader can't.
>
>
>   else
>  if (current kaslr-seed != 0)
> error
>
>
> Don't bother. If this happens it's a bug in another part of the kernel that
> doesn't affect this one. We aren't second-guessing the file-system when we
> read the kernel-fd, let's keep this simple.
>
>  if (crng_ready()) ; FIXME, it's a local macro
> get_random_bytes(non-blocking)
> set new kaslr-seed
>  else
> error
>
> error? Something like pr_warn_once().
>
> I thought the kaslr-seed was added to the entropy pool, but now I look again
> I see it's a separate EFI table. So the new kernel will add the same entropy
> ... that doesn't sound clever. (I can't see where it's zeroed or
> re-initialised)
>

We do have a hook for that: grep for update_efi_random_seed()

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3.1 0/4] arm64: kexec,kdump: fix boot failures on acpi-only system

2018-07-12 Thread Ard Biesheuvel
On 13 July 2018 at 02:34, AKASHI Takahiro  wrote:
> On Thu, Jul 12, 2018 at 05:49:19PM +0100, Will Deacon wrote:
>> Hi Akashi,
>>
>> On Tue, Jul 10, 2018 at 08:42:25AM +0900, AKASHI Takahiro wrote:
>> > This patch series is a set of bug fixes to address kexec/kdump
>> > failures which are sometimes observed on ACPI-only system and reported
>> > in LAK-ML before.
>>
>> I tried picking this up, along with Ard's fixup, but I'm seeing a build
>> failure for allmodconfig:
>>
>> arch/arm64/kernel/acpi.o: In function `__acpi_get_mem_attribute':
>> acpi.c:(.text+0x60): undefined reference to `efi_mem_attributes'
>>
>> I didn't investigate further. Please can you fix this?
>
> Because CONFIG_ACPI is on and CONFIG_EFI is off.
>
> This can happen in allmodconfig as CONFIG_EFI depends on
> !CONFIG_CPU_BIG_ENDIAN, which is actually on in this case.
>

Allowing both CONFIG_ACPI and CONFIG_CPU_BIG_ENDIAN to be configured
makes no sense at all. Things will surely break if you start using BE
memory accesses while parsing ACPI tables.

Allowing CONFIG_ACPI without CONFIG_EFI makes no sense either, since
on arm64, the only way to find the ACPI tables is through a UEFI
configuration table.

> Looking at __acpi_get_mem_attribute(), since there is no information
> available on memory attributes, what we can do at best is
>   * return PAGE_KERNEL (= cacheable) for mapped memory,
>   * return DEVICE_nGnRnE (= non-cacheable) otherwise
> (See a hunk to be applied on top of my patch#4.)
>
> I think that, after applying, acpi_os_ioremap() would work almost
> in the same way as the original before my patchset given that the
> NOMAP memblock attribute is used only under CONFIG_EFI for now.
>
> Make sense?
>

Let's keep your code as is but fix the Kconfig dependencies instead.
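A minimal sketch of the dependency fix being suggested here, expressed as a Kconfig fragment. This is illustrative only: the real ACPI symbol lives in drivers/acpi/Kconfig with per-architecture plumbing, and the final form of the fix may look different.

```kconfig
# Hypothetical fragment: make ACPI require EFI (the only way to find
# the tables on arm64) and rule out big-endian kernels, which cannot
# parse the little-endian ACPI tables anyway.
config ACPI
	bool "ACPI (Advanced Configuration and Power Interface) Support"
	depends on EFI && !CPU_BIG_ENDIAN
```

With such a dependency in place, allmodconfig could no longer produce the CONFIG_ACPI=y / CONFIG_EFI=n combination that caused the `efi_mem_attributes` link failure above.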



Re: [PATCH v3.1 2/4] efi/arm: preserve early mapping of UEFI memory map longer for BGRT

2018-07-12 Thread Ard Biesheuvel
On 12 July 2018 at 15:32, Will Deacon  wrote:
> On Tue, Jul 10, 2018 at 08:39:16PM +0200, Ard Biesheuvel wrote:
>> On 10 July 2018 at 19:57, James Morse  wrote:
>> > Hi Ard,
>> >
>> > On 10/07/18 00:42, AKASHI Takahiro wrote:
>> >> From: Ard Biesheuvel 
>> >>
>> >> The BGRT code validates the contents of the table against the UEFI
>> >> memory map, and so it expects it to be mapped when the code runs.
>> >>
>> >> On ARM, this is currently not the case, since we tear down the early
>> >> mapping after efi_init() completes, and only create the permanent
>> >> mapping in arm_enable_runtime_services(), which executes as an early
>> >> initcall, but still leaves a window where the UEFI memory map is not
>> >> mapped.
>> >>
>> >> So move the call to efi_memmap_unmap() from efi_init() to
>> >> arm_enable_runtime_services().
>> >
>> > I don't have a machine that generates a BGRT, but I can see that
>> > efi_mem_type() call in efi_bgrt_init() would cause the same problems
>> > we have with kexec and acpi.
>> >
>>
>> I'm not sure I follow. The BGRT table only contains natively aligned
>> fields, so the alignment faults should not occur when accessing this
>> table after kexec. The issue addressed by this patch is that
>> efi_mem_type() bails when called while EFI_MEMMAP is cleared.
>>
>> >
>> >> diff --git a/drivers/firmware/efi/arm-init.c 
>> >> b/drivers/firmware/efi/arm-init.c
>> >> index b5214c143fee..388a929baf95 100644
>> >> --- a/drivers/firmware/efi/arm-init.c
>> >> +++ b/drivers/firmware/efi/arm-init.c
>> >> @@ -259,7 +259,6 @@ void __init efi_init(void)
>> >>
>> >>   reserve_regions();
>> >>   efi_esrt_init();
>> >> - efi_memmap_unmap();
>> >>
>> >>   memblock_reserve(params.mmap & PAGE_MASK,
>> >>PAGE_ALIGN(params.mmap_size +
>> >> diff --git a/drivers/firmware/efi/arm-runtime.c 
>> >> b/drivers/firmware/efi/arm-runtime.c
>> >> index 5889cbea60b8..59a8c0ec94d5 100644
>> >> --- a/drivers/firmware/efi/arm-runtime.c
>> >> +++ b/drivers/firmware/efi/arm-runtime.c
>> >> @@ -115,6 +115,8 @@ static int __init arm_enable_runtime_services(void)
>> >>   return 0;
>> >>   }
>> >>
>> >> + efi_memmap_unmap();
>> >
>> > This can get called twice if uefi_init() fails after setting the
>> > EFI_BOOT flag, but this can only happen if the system table signature
>> > is wrong (or we're out of memory really early).
>> >
>>
>> I guess we should check the EFI_MEMMAP attribute here as well then.
>
> Do you plan to spin a new version of this patch?
>

Either that or fold in the hunk below.


--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -110,7 +110,7 @@ static int __init arm_enable_runtime_services(void)
 {
u64 mapsize;

-   if (!efi_enabled(EFI_BOOT)) {
+   if (!efi_enabled(EFI_BOOT) || !efi_enabled(EFI_MEMMAP)) {
pr_info("EFI services will not be available.\n");
return 0;
}



Re: [PATCH v5 7/8] ima: based on policy warn about loading firmware (pre-allocated buffer)

2018-07-11 Thread Ard Biesheuvel
On 10 July 2018 at 21:19, Bjorn Andersson  wrote:
> On Mon 09 Jul 23:56 PDT 2018, Ard Biesheuvel wrote:
>
>> On 10 July 2018 at 08:51, Ard Biesheuvel  wrote:
>> > On 9 July 2018 at 21:41, Mimi Zohar  wrote:
>> >> On Mon, 2018-07-02 at 17:30 +0200, Ard Biesheuvel wrote:
>> >>> On 2 July 2018 at 16:38, Mimi Zohar  wrote:
> [..]
>> > So to summarize again: in my opinion, using a single buffer is not a
>> > problem as long as the validation completes before the DMA map is
>> > performed. This will provide the expected guarantees on systems with
>> > IOMMUs, and will not complicate matters on systems where there is no
>> > point in obsessing about this anyway given that devices can access all
>> > of memory whenever they want to.
>> >
>> > As for the Qualcomm case: dma_alloc_coherent() is not needed here but
>> > simply ends up being used because it was already wired up in the
>> > qualcomm specific secure world API, which amounts to doing syscalls
>> > into a higher privilege level than the one the kernel itself runs at.
>
> As I said before, the dma_alloc_coherent() referred to in this
> discussion holds parameters for the Trustzone call, i.e. it will hold
> the address to the buffer that the firmware was loaded into - it won't
> hold any data that comes from the actual firmware.
>

Ah yes, I forgot that detail. Thanks for reminding me.

>> > So again, reasoning about whether the secure world will look at your
>> > data before you checked the sig is rather pointless, and adding
>> > special cases to the IMA api to cater for this use case seems like a
>> > waste of engineering and review effort to me.
>
> Forgive me if I'm missing something in the implementation here, but
> aren't the IMA checks done before request_firmware*() returns?
>

The issue under discussion is whether calling request_firmware() to
load firmware into a buffer that may be readable by the device while
the IMA checks are in progress constitutes a security hazard.

>> > If we have to do
>> > something to tie up this loose end, let's try switching it to the
>> > streaming DMA api instead.
>> >
>>
>> Forgot to mention: the Qualcomm case is about passing data to the CPU
>> running at another privilege level, so IOMMU vs !IOMMU is not a factor
>> here.
>
> Furthermore, all scenarios we've looked at so far are completely
> sequential, so if the firmware request fails we won't invoke the
> Trustzone operation that would consume the memory or we won't turn on
> the power to the CPU that would execute the firmware.
>
>
> Tbh the only case I can think of where there would be a "race condition"
> here is if we have a device that is polling the last byte of a
> predefined firmware memory area for the firmware loader to read some
> specific data into it. In cases where the firmware request is followed
> by some explicit signalling to the device (or a power on sequence) I'm
> unable to see the issue discussed here.
>

I agree. But the latter part is platform specific, and so it requires
some degree of trust in the driver author on the part of the IMA
routines that request_firmware() is called at an appropriate time.

The point I am trying to make in this thread is that there are cases
where it makes no sense for the kernel to reason about these things,
given that higher privilege levels such as the TrustZone secure world
own the kernel's execution context entirely already, and given that
masters that are not behind an IOMMU can read and write all of memory
all of the time anyway.

The bottom line is that reality does not respect the layering that IMA
assumes, and so the only meaningful way to treat some of the use cases
is simply to ignore them entirely. So we should still perform all the
checks, but we will have to live with the limited utility of doing so
in some scenarios (and not print nasty warnings to the kernel log for
such cases)

-- 
Ard.



Re: [PATCH v3.1 2/4] efi/arm: preserve early mapping of UEFI memory map longer for BGRT

2018-07-10 Thread Ard Biesheuvel
On 10 July 2018 at 19:57, James Morse  wrote:
> Hi Ard,
>
> On 10/07/18 00:42, AKASHI Takahiro wrote:
>> From: Ard Biesheuvel 
>>
>> The BGRT code validates the contents of the table against the UEFI
>> memory map, and so it expects it to be mapped when the code runs.
>>
>> On ARM, this is currently not the case, since we tear down the early
>> mapping after efi_init() completes, and only create the permanent
>> mapping in arm_enable_runtime_services(), which executes as an early
>> initcall, but still leaves a window where the UEFI memory map is not
>> mapped.
>>
>> So move the call to efi_memmap_unmap() from efi_init() to
>> arm_enable_runtime_services().
>
> I don't have a machine that generates a BGRT, but I can see that
> efi_mem_type() call in efi_bgrt_init() would cause the same problems
> we have with kexec and acpi.
>

I'm not sure I follow. The BGRT table only contains natively aligned
fields, so the alignment faults should not occur when accessing this
table after kexec. The issue addressed by this patch is that
efi_mem_type() bails when called while EFI_MEMMAP is cleared.

>
>> diff --git a/drivers/firmware/efi/arm-init.c 
>> b/drivers/firmware/efi/arm-init.c
>> index b5214c143fee..388a929baf95 100644
>> --- a/drivers/firmware/efi/arm-init.c
>> +++ b/drivers/firmware/efi/arm-init.c
>> @@ -259,7 +259,6 @@ void __init efi_init(void)
>>
>>   reserve_regions();
>>   efi_esrt_init();
>> - efi_memmap_unmap();
>>
>>   memblock_reserve(params.mmap & PAGE_MASK,
>>PAGE_ALIGN(params.mmap_size +
>> diff --git a/drivers/firmware/efi/arm-runtime.c 
>> b/drivers/firmware/efi/arm-runtime.c
>> index 5889cbea60b8..59a8c0ec94d5 100644
>> --- a/drivers/firmware/efi/arm-runtime.c
>> +++ b/drivers/firmware/efi/arm-runtime.c
>> @@ -115,6 +115,8 @@ static int __init arm_enable_runtime_services(void)
>>   return 0;
>>   }
>>
>> + efi_memmap_unmap();
>
> This can get called twice if uefi_init() fails after setting the EFI_BOOT
> flag, but this can only happen if the system table signature is wrong
> (or we're out of memory really early).
>

I guess we should check the EFI_MEMMAP attribute here as well then.

> I think this is harmless as we end up passing NULL to early_memunmap() which
> WARN()s and returns as it's outside the fixmap range. It's just more noise on
> systems with a corrupt efi system table.
>
> Acked-by: James Morse 
>

Thanks James



Re: [PATCH v5 7/8] ima: based on policy warn about loading firmware (pre-allocated buffer)

2018-07-10 Thread Ard Biesheuvel
On 10 July 2018 at 08:51, Ard Biesheuvel  wrote:
> On 9 July 2018 at 21:41, Mimi Zohar  wrote:
>> On Mon, 2018-07-02 at 17:30 +0200, Ard Biesheuvel wrote:
>>> On 2 July 2018 at 16:38, Mimi Zohar  wrote:
>>> > Some systems are memory constrained but they need to load very large
>>> > firmwares.  The firmware subsystem allows drivers to request this
>>> > firmware be loaded from the filesystem, but this requires that the
>>> > entire firmware be loaded into kernel memory first before it's provided
>>> > to the driver.  This can lead to a situation where we map the firmware
>>> > twice, once to load the firmware into kernel memory and once to copy the
>>> > firmware into the final resting place.
>>> >
>>> > To resolve this problem, commit a098ecd2fa7d ("firmware: support loading
>>> > into a pre-allocated buffer") introduced request_firmware_into_buf() API
>>> > that allows drivers to request firmware be loaded directly into a
>>> > pre-allocated buffer. (Based on the mailing list discussions, calling
>>> > dma_alloc_coherent() is unnecessary and confusing.)
>>> >
>>> > (Very broken/buggy) devices using pre-allocated memory run the risk of
>>> > the firmware being accessible to the device prior to the completion of
>>> > IMA's signature verification.  For the time being, this patch emits a
>>> > warning, but does not prevent the loading of the firmware.
>>> >
>>>
>>> As I attempted to explain in the exchange with Luis, this has nothing
>>> to do with broken or buggy devices, but is simply the reality we have
>>> to deal with on platforms that lack IOMMUs.
>>
>>> Even if you load into one buffer, carry out the signature verification
>>> and *only then* copy it to another buffer, a bus master could
>>> potentially read it from the first buffer as well. Mapping for DMA
>>> does *not* mean 'making the memory readable by the device' unless
>>> IOMMUs are being used. Otherwise, a bus master can read it from the
>>> first buffer, or even patch the code that performs the security check
>>> in the first place. For such platforms, copying the data around to
>>> prevent the device from reading it is simply pointless, as well as any
>>> other mitigation in software to protect yourself from misbehaving bus
>>> masters.
>>
>> Thank you for taking the time to explain this again.
>>
>>> So issuing a warning in this particular case is rather arbitrary. On
>>> these platforms, all bus masters can read (and modify) all of your
>>> memory all of the time, and as long as the firmware loader code takes
>>> care not to provide the DMA address to the device until after the
>>> verification is complete, it really has done all it reasonably can in
>>> the environment that it is expected to operate in.
>>
>> So for the non-IOMMU system case, differentiating between
>> pre-allocated buffers vs. using two buffers doesn't make sense.
>>
>>>
>>> (The use of dma_alloc_coherent() is a bit of a red herring here, as it
>>> incorporates the DMA map operation. However, DMA map is a no-op on
>>> systems with cache coherent 1:1 DMA [iow, all PCs and most arm64
>>> platforms unless they have IOMMUs], and so there is not much
>>> difference between memory allocated with kmalloc() or with
>>> dma_alloc_coherent() in terms of whether the device can access it
>>> freely)
>>
>> What about systems with an IOMMU?
>>
>
> On systems with an IOMMU, performing the DMA map will create an entry
> in the IOMMU page tables for the physical region associated with the
> buffer, making the region accessible to the device. For platforms in
> this category, using dma_alloc_coherent() for allocating a buffer to
> pass firmware to the device does open a window where the device could
> theoretically access this data while the validation is still in
> progress.
>
> Note that the device still needs to be informed about the address of
> the buffer: just calling dma_alloc_coherent() will not allow the
> device to find the firmware image in its memory space, and arbitrary
> DMA accesses performed by the device will trigger faults that are
> reported to the OS. So the window between DMA map (or
> dma_alloc_coherent()) and the device specific command to pass the DMA
> buffer address to the device is not inherently unsafe IMO, but I do
> understand the need to cover this scenario.
>
> As I pointed out before, using coherent DMA buffers to perform
> streaming DMA is generally a bad idea, since they may be allocated
> from a finite pool, and may use uncached mappings, making the access
> slower than necessary (while streaming DMA can use any kmalloc'ed
> buffer and will just flush the contents of the caches to main memory
> when the DMA map is performed).

Re: [PATCH v5 7/8] ima: based on policy warn about loading firmware (pre-allocated buffer)

2018-07-10 Thread Ard Biesheuvel
On 9 July 2018 at 21:41, Mimi Zohar  wrote:
> On Mon, 2018-07-02 at 17:30 +0200, Ard Biesheuvel wrote:
>> On 2 July 2018 at 16:38, Mimi Zohar  wrote:
>> > Some systems are memory constrained but they need to load very large
>> > firmwares.  The firmware subsystem allows drivers to request this
>> > firmware be loaded from the filesystem, but this requires that the
>> > entire firmware be loaded into kernel memory first before it's provided
>> > to the driver.  This can lead to a situation where we map the firmware
>> > twice, once to load the firmware into kernel memory and once to copy the
>> > firmware into the final resting place.
>> >
>> > To resolve this problem, commit a098ecd2fa7d ("firmware: support loading
>> > into a pre-allocated buffer") introduced request_firmware_into_buf() API
>> > that allows drivers to request firmware be loaded directly into a
>> > pre-allocated buffer. (Based on the mailing list discussions, calling
>> > dma_alloc_coherent() is unnecessary and confusing.)
>> >
>> > (Very broken/buggy) devices using pre-allocated memory run the risk of
>> > the firmware being accessible to the device prior to the completion of
>> > IMA's signature verification.  For the time being, this patch emits a
>> > warning, but does not prevent the loading of the firmware.
>> >
>>
>> As I attempted to explain in the exchange with Luis, this has nothing
>> to do with broken or buggy devices, but is simply the reality we have
>> to deal with on platforms that lack IOMMUs.
>
>> Even if you load into one buffer, carry out the signature verification
>> and *only then* copy it to another buffer, a bus master could
>> potentially read it from the first buffer as well. Mapping for DMA
>> does *not* mean 'making the memory readable by the device' unless
>> IOMMUs are being used. Otherwise, a bus master can read it from the
>> first buffer, or even patch the code that performs the security check
>> in the first place. For such platforms, copying the data around to
>> prevent the device from reading it is simply pointless, as well as any
>> other mitigation in software to protect yourself from misbehaving bus
>> masters.
>
> Thank you for taking the time to explain this again.
>
>> So issuing a warning in this particular case is rather arbitrary. On
>> these platforms, all bus masters can read (and modify) all of your
>> memory all of the time, and as long as the firmware loader code takes
>> care not to provide the DMA address to the device until after the
>> verification is complete, it really has done all it reasonably can in
>> the environment that it is expected to operate in.
>
> So for the non-IOMMU system case, differentiating between
> pre-allocated buffers vs. using two buffers doesn't make sense.
>
>>
>> (The use of dma_alloc_coherent() is a bit of a red herring here, as it
>> incorporates the DMA map operation. However, DMA map is a no-op on
>> systems with cache coherent 1:1 DMA [iow, all PCs and most arm64
>> platforms unless they have IOMMUs], and so there is not much
>> difference between memory allocated with kmalloc() or with
>> dma_alloc_coherent() in terms of whether the device can access it
>> freely)
>
> What about systems with an IOMMU?
>

On systems with an IOMMU, performing the DMA map will create an entry
in the IOMMU page tables for the physical region associated with the
buffer, making the region accessible to the device. For platforms in
this category, using dma_alloc_coherent() for allocating a buffer to
pass firmware to the device does open a window where the device could
theoretically access this data while the validation is still in
progress.

Note that the device still needs to be informed about the address of
the buffer: just calling dma_alloc_coherent() will not allow the
device to find the firmware image in its memory space, and arbitrary
DMA accesses performed by the device will trigger faults that are
reported to the OS. So the window between DMA map (or
dma_alloc_coherent()) and the device specific command to pass the DMA
buffer address to the device is not inherently unsafe IMO, but I do
understand the need to cover this scenario.

As I pointed out before, using coherent DMA buffers to perform
streaming DMA is generally a bad idea, since they may be allocated
from a finite pool, and may use uncached mappings, making the access
slower than necessary (while streaming DMA can use any kmalloc'ed
buffer and will just flush the contents of the caches to main memory
when the DMA map is performed).

So to summarize again: in my opinion, using a single buffer is not a
problem as long as the validation completes before the DMA map is
performed. This will provide the expected guarantees on systems with
IOMMUs, and will not complicate matters on systems where there is no
point in obsessing about this anyway given that devices can access all
of memory whenever they want to.

Re: [PATCH v2 3/4] efi/arm: map UEFI memory map earlier on boot

2018-07-05 Thread Ard Biesheuvel
On 5 July 2018 at 18:48, Will Deacon  wrote:
> On Thu, Jul 05, 2018 at 12:02:15PM +0100, James Morse wrote:
>> On 05/07/18 10:43, AKASHI Takahiro wrote:
>> > On Wed, Jul 04, 2018 at 08:49:32PM +0200, Ard Biesheuvel wrote:
>> >> On 4 July 2018 at 19:06, Will Deacon  wrote:
>> >>> On Tue, Jun 19, 2018 at 03:44:23PM +0900, AKASHI Takahiro wrote:
>> >>>> Since arm_enter_runtime_services() was modified to always create a 
>> >>>> virtual
>> >>>> mapping of UEFI memory map in the previous patch, it is now renamed to
>> >>>> efi_enter_virtual_mode() and called earlier before acpi_load_tables()
>> >>>> in acpi_early_init().
>> >>>>
>> >>>> This will allow us to use UEFI memory map in acpi_os_ioremap() to create
>> >>>> mappings of ACPI tables using memory attributes described in UEFI memory
>> >>>> map.
>>
>> >>> Hmm, this is ugly as hell. Is there nothing else we can piggy-back off?
>> >>> It's also fairly jarring that, on x86, efi_enter_virtual_mode() is called
>> >>> a few lines later, *after* acpi_early_init() has been called.
>>
>> >> Currently, there is a gap where we have already torn down the early
>> >> mapping and haven't created the definitive mapping of the UEFI memory
>> >> map. There are other reasons why this is an issue, and I recently
>> >> proposed [0] myself to address one of them
>>
>> >> Akashi-san, could you please confirm whether the patch below would be
>> >> sufficient for you? Apologies for going back and forth on this, but I
>> >> agree with Will that we should try to avoid warts like the one above
>> >> in generic code.
>> >>
>> >> [0] https://marc.info/?l=linux-efi&m=152930773507524&w=2
>> >
>> > I think that this patch will also work.
>> > Please drop my patch#2 and #3 if you want to pick up my patchset, Will.
>>
>> Patch 2 is what changes arm_enable_runtime_services() to map the efi
>> memory map before bailing out due to efi=noruntime.
>>
>> Without it, 'efi=noruntime' means no-acpi-tables.
>
> So it sounds like we want patch 2. Akashi, given that this series is only
> four patches, please can you send out a v3 with the stuff that should be
> reviewed and merged? Otherwise, there's a real risk we end up with breakage
> that goes unnoticed initially.
>

Yes, we want patches #1, #2 and #4, and this one can be replaced with
my patch above. Everything can be taken via the arm64 tree as far as I
am concerned.



Re: [PATCH v2 1/4] arm64: export memblock_reserve()d regions via /proc/iomem

2018-07-05 Thread Ard Biesheuvel
On 19 June 2018 at 08:44, AKASHI Takahiro  wrote:
> From: James Morse 
>
> There has been some confusion around what is necessary to prevent kexec
> overwriting important memory regions. memblock: reserve, or nomap?
> Only memblock nomap regions are reported via /proc/iomem, kexec's
> user-space doesn't know about memblock_reserve()d regions.
>
> Until commit f56ab9a5b73ca ("efi/arm: Don't mark ACPI reclaim memory
> as MEMBLOCK_NOMAP") the ACPI tables were nomap, now they are reserved
> and thus possible for kexec to overwrite with the new kernel or initrd.
> But this was always broken, as the UEFI memory map is also reserved
> and not marked as nomap.
>
> Exporting both nomap and reserved memblock types is a nuisance as
> they live in different memblock structures which we can't walk at
> the same time.
>
> Take a second walk over memblock.reserved and add new 'reserved'
> subnodes for the memblock_reserved() regions that aren't already
> described by the existing code. (e.g. Kernel Code)
>
> We use reserve_region_with_split() to find the gaps in existing named
> regions. This handles the gap between 'kernel code' and 'kernel data'
> which is memblock_reserve()d, but already partially described by
> request_standard_resources(). e.g.:
> | 80000000-dfffffff : System RAM
> |   80080000-80ffffff : Kernel code
> |   81000000-8158ffff : reserved
> |   81590000-8237efff : Kernel data
> |   a0000000-dfffffff : Crash kernel
> | e00f0000-f949ffff : System RAM
>
> reserve_region_with_split needs kzalloc() which isn't available when
> request_standard_resources() is called, so use an initcall instead.
>
> Reported-by: Bhupesh Sharma 
> Reported-by: Tyler Baicar 
> Suggested-by: Akashi Takahiro 
> Signed-off-by: James Morse 
> Fixes: d28f6df1305a ("arm64/kexec: Add core kexec support")
> CC: Ard Biesheuvel 
> CC: Mark Rutland 

Reviewed-by: Ard Biesheuvel 

> ---
>  arch/arm64/kernel/setup.c | 38 ++
>  1 file changed, 38 insertions(+)
>
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 30ad2f085d1f..5b4fac434c84 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -241,6 +241,44 @@ static void __init request_standard_resources(void)
> }
>  }
>
> +static int __init reserve_memblock_reserved_regions(void)
> +{
> +   phys_addr_t start, end, roundup_end = 0;
> +   struct resource *mem, *res;
> +   u64 i;
> +
> +   for_each_reserved_mem_region(i, &start, &end) {
> +   if (end <= roundup_end)
> +   continue; /* done already */
> +
> +   start = __pfn_to_phys(PFN_DOWN(start));
> +   end = __pfn_to_phys(PFN_UP(end)) - 1;
> +   roundup_end = end;
> +
> +   res = kzalloc(sizeof(*res), GFP_ATOMIC);
> +   if (WARN_ON(!res))
> +   return -ENOMEM;
> +   res->start = start;
> +   res->end = end;
> +   res->name  = "reserved";
> +   res->flags = IORESOURCE_MEM;
> +
> +   mem = request_resource_conflict(&iomem_resource, res);
> +   /*
> +* We expected memblock_reserve() regions to conflict with
> +* memory created by request_standard_resources().
> +*/
> +   if (WARN_ON_ONCE(!mem))
> +   continue;
> +   kfree(res);
> +
> +   reserve_region_with_split(mem, start, end, "reserved");
> +   }
> +
> +   return 0;
> +}
> +arch_initcall(reserve_memblock_reserved_regions);
> +
>  u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
>
>  void __init setup_arch(char **cmdline_p)
> --
> 2.17.0
>



Re: [PATCH v2 4/4] arm64: acpi: fix alignment fault in accessing ACPI

2018-07-05 Thread Ard Biesheuvel
On 19 June 2018 at 08:44, AKASHI Takahiro  wrote:
> This is a fix against the issue that crash dump kernel may hang up
> during booting, which can happen on any ACPI-based system with "ACPI
> Reclaim Memory."
>
> (kernel messages after panic kicked off kdump)
>(snip...)
> Bye!
>(snip...)
> ACPI: Core revision 20170728
> pud=2e7d0003, *pmd=2e7c0003, *pte=00e839710707
> Internal error: Oops: 9621 [#1] SMP
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc6 #1
> task: 08d05180 task.stack: 08cc
> PC is at acpi_ns_lookup+0x25c/0x3c0
> LR is at acpi_ds_load1_begin_op+0xa4/0x294
>(snip...)
> Process swapper/0 (pid: 0, stack limit = 0x08cc)
> Call trace:
>(snip...)
> [] acpi_ns_lookup+0x25c/0x3c0
> [] acpi_ds_load1_begin_op+0xa4/0x294
> [] acpi_ps_build_named_op+0xc4/0x198
> [] acpi_ps_create_op+0x14c/0x270
> [] acpi_ps_parse_loop+0x188/0x5c8
> [] acpi_ps_parse_aml+0xb0/0x2b8
> [] acpi_ns_one_complete_parse+0x144/0x184
> [] acpi_ns_parse_table+0x48/0x68
> [] acpi_ns_load_table+0x4c/0xdc
> [] acpi_tb_load_namespace+0xe4/0x264
> [] acpi_load_tables+0x48/0xc0
> [] acpi_early_init+0x9c/0xd0
> [] start_kernel+0x3b4/0x43c
> Code: b9008fb9 2a000318 36380054 32190318 (b94002c0)
> ---[ end trace c46ed37f9651c58e ]---
> Kernel panic - not syncing: Fatal exception
> Rebooting in 10 seconds..
>
> (diagnosis)
> * This fault is a data abort, alignment fault (ESR=0x9621)
>   during reading out ACPI table.
> * Initial ACPI tables are normally stored in system ram and marked as
>   "ACPI Reclaim memory" by the firmware.
> * After the commit f56ab9a5b73c ("efi/arm: Don't mark ACPI reclaim
>   memory as MEMBLOCK_NOMAP"), those regions are differently handled
>   as they are "memblock-reserved", without NOMAP bit.
> * So they are now excluded from device tree's "usable-memory-range"
>   which kexec-tools determines based on a current view of /proc/iomem.
> * When crash dump kernel boots up, it tries to accesses ACPI tables by
>   mapping them with ioremap(), not ioremap_cache(), in acpi_os_ioremap()
>   since they are no longer part of mapped system ram.
> * Given that ACPI accessor/helper functions are compiled in without
>   unaligned access support (ACPI_MISALIGNMENT_NOT_SUPPORTED),
>   any unaligned access to ACPI tables can cause a fatal panic.
>
> With this patch, acpi_os_ioremap() always honors memory attribute
> information provided by the firmware (EFI) and retaining cacheability
> allows the kernel safe access to ACPI tables.
>
> Signed-off-by: AKASHI Takahiro 
> Suggested-by: James Morse 
> Suggested-by: Ard Biesheuvel 
> Reported-by and Tested-by: Bhupesh Sharma 

Reviewed-by: Ard Biesheuvel 

> ---
>  arch/arm64/include/asm/acpi.h | 23 ---
>  arch/arm64/kernel/acpi.c  | 11 +++
>  2 files changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h
> index 0db62a4cbce2..68bc18cb2b85 100644
> --- a/arch/arm64/include/asm/acpi.h
> +++ b/arch/arm64/include/asm/acpi.h
> @@ -12,10 +12,12 @@
>  #ifndef _ASM_ACPI_H
>  #define _ASM_ACPI_H
>
> +#include 
>  #include 
>  #include 
>
>  #include 
> +#include 
>  #include 
>  #include 
>
> @@ -29,18 +31,22 @@
>
>  /* Basic configuration for ACPI */
>  #ifdef CONFIG_ACPI
> +pgprot_t __acpi_get_mem_attribute(phys_addr_t addr);
> +
>  /* ACPI table mapping after acpi_permanent_mmap is set */
>  static inline void __iomem *acpi_os_ioremap(acpi_physical_address phys,
> acpi_size size)
>  {
> +   /* For normal memory we already have a cacheable mapping. */
> +   if (memblock_is_map_memory(phys))
> +   return (void __iomem *)__phys_to_virt(phys);
> +
> /*
> -* EFI's reserve_regions() call adds memory with the WB attribute
> -* to memblock via early_init_dt_add_memory_arch().
> +* We should still honor the memory's attribute here because
> +* crash dump kernel possibly excludes some ACPI (reclaim)
> +* regions from memblock list.
>  */
> -   if (!memblock_is_memory(phys))
> -   return ioremap(phys, size);
> -
> -   return ioremap_cache(phys, size);
> +   return __ioremap(phys, size, __acpi_get_mem_attribute(phys));
>  }
>

Re: [PATCH v2 2/4] efi/arm: map UEFI memory map even w/o runtime services enabled

2018-07-05 Thread Ard Biesheuvel
On 19 June 2018 at 08:44, AKASHI Takahiro  wrote:
> Under the current implementation, UEFI memory map will be mapped and made
> available in virtual mappings only if runtime services are enabled.
> But in a later patch, we want to use UEFI memory map in acpi_os_ioremap()
> to create mappings of ACPI tables using memory attributes described in
> UEFI memory map.
>
> So, as a first step, arm_enter_runtime_services() will be modified
> so that UEFI memory map will be always accessible.
>
> See a relevant commit:
> arm64: acpi: fix alignment fault in accessing ACPI tables
>
> Signed-off-by: AKASHI Takahiro 
> Cc: Ard Biesheuvel 

Reviewed-by: Ard Biesheuvel 

This may be taken via the arm64 tree.

> ---
>  drivers/firmware/efi/arm-runtime.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/firmware/efi/arm-runtime.c 
> b/drivers/firmware/efi/arm-runtime.c
> index 5889cbea60b8..30ac5c82051e 100644
> --- a/drivers/firmware/efi/arm-runtime.c
> +++ b/drivers/firmware/efi/arm-runtime.c
> @@ -115,6 +115,13 @@ static int __init arm_enable_runtime_services(void)
> return 0;
> }
>
> +   mapsize = efi.memmap.desc_size * efi.memmap.nr_map;
> +
> +   if (efi_memmap_init_late(efi.memmap.phys_map, mapsize)) {
> +   pr_err("Failed to remap EFI memory map\n");
> +   return 0;
> +   }
> +
> if (efi_runtime_disabled()) {
> pr_info("EFI runtime services will be disabled.\n");
> return 0;
> @@ -127,13 +134,6 @@ static int __init arm_enable_runtime_services(void)
>
> pr_info("Remapping and enabling EFI services.\n");
>
> -   mapsize = efi.memmap.desc_size * efi.memmap.nr_map;
> -
> -   if (efi_memmap_init_late(efi.memmap.phys_map, mapsize)) {
> -   pr_err("Failed to remap EFI memory map\n");
> -   return -ENOMEM;
> -   }
> -
> if (!efi_virtmap_init()) {
> pr_err("UEFI virtual mapping missing or invalid -- runtime 
> services will not be available\n");
> return -ENOMEM;
> --
> 2.17.0
>

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 3/4] efi/arm: map UEFI memory map earlier on boot

2018-07-04 Thread Ard Biesheuvel
On 4 July 2018 at 19:06, Will Deacon  wrote:
> Hi all,
>
> [Ard -- please can you look at the EFI parts of this patch]
>
> On Tue, Jun 19, 2018 at 03:44:23PM +0900, AKASHI Takahiro wrote:
>> Since arm_enter_runtime_services() was modified to always create a virtual
>> mapping of UEFI memory map in the previous patch, it is now renamed to
>> efi_enter_virtual_mode() and called earlier before acpi_load_tables()
>> in acpi_early_init().
>>
>> This will allow us to use UEFI memory map in acpi_os_ioremap() to create
>> mappings of ACPI tables using memory attributes described in UEFI memory
>> map.
>>
>> See a relevant commit:
>> arm64: acpi: fix alignment fault in accessing ACPI tables
>>
>> Signed-off-by: AKASHI Takahiro 
>> Cc: Ard Biesheuvel 
>> Cc: Andrew Morton 
>> ---
>>  drivers/firmware/efi/arm-runtime.c | 15 ++-
>>  init/main.c|  3 +++
>>  2 files changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/firmware/efi/arm-runtime.c 
>> b/drivers/firmware/efi/arm-runtime.c
>> index 30ac5c82051e..566ef0a9edb5 100644
>> --- a/drivers/firmware/efi/arm-runtime.c
>> +++ b/drivers/firmware/efi/arm-runtime.c
>> @@ -106,46 +106,43 @@ static bool __init efi_virtmap_init(void)
>>   * non-early mapping of the UEFI system table and virtual mappings for all
>>   * EFI_MEMORY_RUNTIME regions.
>>   */
>> -static int __init arm_enable_runtime_services(void)
>> +void __init efi_enter_virtual_mode(void)
>>  {
>>   u64 mapsize;
>>
>>   if (!efi_enabled(EFI_BOOT)) {
>>   pr_info("EFI services will not be available.\n");
>> - return 0;
>> + return;
>>   }
>>
>>   mapsize = efi.memmap.desc_size * efi.memmap.nr_map;
>>
>>   if (efi_memmap_init_late(efi.memmap.phys_map, mapsize)) {
>>   pr_err("Failed to remap EFI memory map\n");
>> - return 0;
>> + return;
>>   }
>>
>>   if (efi_runtime_disabled()) {
>>   pr_info("EFI runtime services will be disabled.\n");
>> - return 0;
>> + return;
>>   }
>>
>>   if (efi_enabled(EFI_RUNTIME_SERVICES)) {
>>   pr_info("EFI runtime services access via paravirt.\n");
>> - return 0;
>> + return;
>>   }
>>
>>   pr_info("Remapping and enabling EFI services.\n");
>>
>>   if (!efi_virtmap_init()) {
>>   pr_err("UEFI virtual mapping missing or invalid -- runtime 
>> services will not be available\n");
>> - return -ENOMEM;
>> + return;
>>   }
>>
>>   /* Set up runtime services function pointers */
>>   efi_native_runtime_setup();
>>   set_bit(EFI_RUNTIME_SERVICES, &efi.flags);
>> -
>> - return 0;
>>  }
>> -early_initcall(arm_enable_runtime_services);
>>
>>  void efi_virtmap_load(void)
>>  {
>> diff --git a/init/main.c b/init/main.c
>> index 3b4ada11ed52..532fc0d02353 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -694,6 +694,9 @@ asmlinkage __visible void __init start_kernel(void)
>>   debug_objects_mem_init();
>>   setup_per_cpu_pageset();
>>   numa_policy_init();
>> + if (IS_ENABLED(CONFIG_EFI) &&
>> + (IS_ENABLED(CONFIG_ARM64) || IS_ENABLED(CONFIG_ARM)))
>> + efi_enter_virtual_mode();
>
> Hmm, this is ugly as hell. Is there nothing else we can piggy-back off?
> It's also fairly jarring that, on x86, efi_enter_virtual_mode() is called
> a few lines later, *after* acpi_early_init() has been called.
>

Currently, there is a gap where we have already torn down the early
mapping and haven't created the definitive mapping of the UEFI memory
map. There are other reasons why this is an issue, and I recently
proposed a patch [0] myself to address one of them (and I had not
remembered this particular series, or the fact that I actually
suggested this approach, IIRC).

Akashi-san, could you please confirm whether the patch below would be
sufficient for you? Apologies for going back and forth on this, but I
agree with Will that we should try to avoid warts like the one above
in generic code.

[0] https://marc.info/?l=linux-efi&m=152930773507524&w=2

> The rest of the series looks fine to me, but I'm not comfortable taking
> changes like this via the arm64 tree.
>
> Will



Re: [PATCH v5 7/8] ima: based on policy warn about loading firmware (pre-allocated buffer)

2018-07-02 Thread Ard Biesheuvel
On 2 July 2018 at 16:38, Mimi Zohar  wrote:
> Some systems are memory constrained but they need to load very large
> firmwares.  The firmware subsystem allows drivers to request this
> firmware be loaded from the filesystem, but this requires that the
> entire firmware be loaded into kernel memory first before it's provided
> to the driver.  This can lead to a situation where we map the firmware
> twice, once to load the firmware into kernel memory and once to copy the
> firmware into the final resting place.
>
> To resolve this problem, commit a098ecd2fa7d ("firmware: support loading
> into a pre-allocated buffer") introduced request_firmware_into_buf() API
> that allows drivers to request firmware be loaded directly into a
> pre-allocated buffer. (Based on the mailing list discussions, calling
> dma_alloc_coherent() is unnecessary and confusing.)
>
> (Very broken/buggy) devices using pre-allocated memory run the risk of
> the firmware being accessible to the device prior to the completion of
> IMA's signature verification.  For the time being, this patch emits a
> warning, but does not prevent the loading of the firmware.
>

As I attempted to explain in the exchange with Luis, this has nothing
to do with broken or buggy devices, but is simply the reality we have
to deal with on platforms that lack IOMMUs.

Even if you load into one buffer, carry out the signature verification
and *only then* copy it to another buffer, a bus master could
potentially read it from the first buffer as well. Mapping for DMA
does *not* mean 'making the memory readable by the device' unless
IOMMUs are being used. Otherwise, a bus master can read it from the
first buffer, or even patch the code that performs the security check
in the first place. For such platforms, copying the data around to
prevent the device from reading it is simply pointless, as well as any
other mitigation in software to protect yourself from misbehaving bus
masters.

So issuing a warning in this particular case is rather arbitrary. On
these platforms, all bus masters can read (and modify) all of your
memory all of the time, and as long as the firmware loader code takes
care not to provide the DMA address to the device until after the
verification is complete, it really has done all it reasonably can in
the environment that it is expected to operate in.

(The use of dma_alloc_coherent() is a bit of a red herring here, as it
incorporates the DMA map operation. However, DMA map is a no-op on
systems with cache coherent 1:1 DMA [iow, all PCs and most arm64
platforms unless they have IOMMUs], and so there is not much
difference between memory allocated with kmalloc() or with
dma_alloc_coherent() in terms of whether the device can access it
freely)






> Signed-off-by: Mimi Zohar 
> Cc: Luis R. Rodriguez 
> Cc: David Howells 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Cc: Stephen Boyd 
> Cc: Bjorn Andersson 
>
> ---
> Changelog v5:
> - Instead of preventing loading firmware from a pre-allocate buffer,
> emit a warning.
>
>  security/integrity/ima/ima_main.c | 25 -
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/security/integrity/ima/ima_main.c 
> b/security/integrity/ima/ima_main.c
> index e467664965e7..7da123d980ea 100644
> --- a/security/integrity/ima/ima_main.c
> +++ b/security/integrity/ima/ima_main.c
> @@ -416,6 +416,15 @@ void ima_post_path_mknod(struct dentry *dentry)
> iint->flags |= IMA_NEW_FILE;
>  }
>
> +static int read_idmap[READING_MAX_ID] = {
> +   [READING_FIRMWARE] = FIRMWARE_CHECK,
> +   [READING_FIRMWARE_PREALLOC_BUFFER] = FIRMWARE_CHECK,
> +   [READING_MODULE] = MODULE_CHECK,
> +   [READING_KEXEC_IMAGE] = KEXEC_KERNEL_CHECK,
> +   [READING_KEXEC_INITRAMFS] = KEXEC_INITRAMFS_CHECK,
> +   [READING_POLICY] = POLICY_CHECK
> +};
> +
>  /**
>   * ima_read_file - pre-measure/appraise hook decision based on policy
>   * @file: pointer to the file to be measured/appraised/audit
> @@ -439,18 +448,16 @@ int ima_read_file(struct file *file, enum 
> kernel_read_file_id read_id)
> }
> return 0;   /* We rely on module signature checking */
> }
> +
> +   if (read_id == READING_FIRMWARE_PREALLOC_BUFFER) {
> +   if ((ima_appraise & IMA_APPRAISE_FIRMWARE) &&
> +   (ima_appraise & IMA_APPRAISE_ENFORCE)) {
> +   pr_warn("device might be able to access firmware 
> prior to signature verification completion.\n");
> +   }
> +   }
> return 0;
>  }
>
> -static int read_idmap[READING_MAX_ID] = {
> -   [READING_FIRMWARE] = FIRMWARE_CHECK,
> -   [READING_FIRMWARE_PREALLOC_BUFFER] = FIRMWARE_CHECK,
> -   [READING_MODULE] = MODULE_CHECK,
> -   [READING_KEXEC_IMAGE] = KEXEC_KERNEL_CHECK,
> -   [READING_KEXEC_INITRAMFS] = KEXEC_INITRAMFS_CHECK,
> -   [READING_POLICY] = POLICY_CHECK
> -};
> -
>  /**
>   * ima_post_read_file - in memory 

Re: [PATCH] arm64/mm: Introduce a variable to hold base address of linear region

2018-06-12 Thread Ard Biesheuvel
On 12 June 2018 at 08:36, Bhupesh Sharma  wrote:
> The start of the linear region map on a KASLR enabled ARM64 machine -
> which supports a compatible EFI firmware (with EFI_RNG_PROTOCOL
> support), is no longer correctly represented by the PAGE_OFFSET macro,
> since it is defined as:
>
> (UL(1) << (VA_BITS - 1)) + 1)
>

PAGE_OFFSET is the VA of the start of the linear map. The linear map
can be sparsely populated with actual memory, regardless of whether
KASLR is in effect or not. The only difference in the presence of
KASLR is that there may be such a hole at the beginning, but that does
not mean the linear map has moved, or that the value of PAGE_OFFSET is
now wrong.

> So taking an example of a platform with VA_BITS=48, this gives a static
> value of:
> PAGE_OFFSET = 0x8000
>
> However, for the KASLR case, we use the 'memstart_offset_seed'
> to randomize the linear region - since 'memstart_addr' indicates the
> start of physical RAM, we randomize the same on the basis
> of the 'memstart_offset_seed' value.
>
> As the PAGE_OFFSET value is used presently by several user space
> tools (for e.g. makedumpfile and crash tools) to determine the start
> of linear region and hence to read addresses (like PT_NOTE fields) from
> '/proc/kcore' for the non-KASLR boot cases, so it would be better to
> use 'memblock_start_of_DRAM()' value (converted to virtual) as
> the start of linear region for the KASLR cases and default to
> the PAGE_OFFSET value for non-KASLR cases to indicate the start of
> linear region.
>

Userland code that assumes that the linear map cannot have a hole at
the beginning should be fixed.

> I tested this on my qualcomm (which supports EFI_RNG_PROTOCOL)
> and apm mustang (which does not support EFI_RNG_PROTOCOL) arm64 boards
> and was able to use a modified user space utility (like kexec-tools and
> makedumpfile) to determine the start of linear region correctly for
> both the KASLR and non-KASLR boot cases.
>

Can you explain the nature of the changes to the userland code?

> Cc: Ard Biesheuvel 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: AKASHI Takahiro 
> Cc: James Morse 
> Signed-off-by: Bhupesh Sharma 
> ---
>  arch/arm64/include/asm/memory.h | 3 +++
>  arch/arm64/kernel/arm64ksyms.c  | 1 +
>  arch/arm64/mm/init.c| 3 +++
>  3 files changed, 7 insertions(+)
>
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 49d99214f43c..bfd0915ecaf8 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -178,6 +178,9 @@ extern s64  memstart_addr;
>  /* PHYS_OFFSET - the physical address of the start of memory. */
>  #define PHYS_OFFSET({ VM_BUG_ON(memstart_addr & 1); 
> memstart_addr; })
>
> +/* the virtual base of the linear region. */
> +extern s64 linear_reg_start_addr;
> +
>  /* the virtual base of the kernel image (minus TEXT_OFFSET) */
>  extern u64 kimage_vaddr;
>
> diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
> index d894a20b70b2..a92238ea45ff 100644
> --- a/arch/arm64/kernel/arm64ksyms.c
> +++ b/arch/arm64/kernel/arm64ksyms.c
> @@ -42,6 +42,7 @@ EXPORT_SYMBOL(__arch_copy_in_user);
>
> /* physical memory */
>  EXPORT_SYMBOL(memstart_addr);
> +EXPORT_SYMBOL(linear_reg_start_addr);
>
> /* string / mem functions */
>  EXPORT_SYMBOL(strchr);
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 325cfb3b858a..29447adb0eef 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -60,6 +60,7 @@
>   * that cannot be mistaken for a real physical address.
>   */
>  s64 memstart_addr __ro_after_init = -1;
> +s64 linear_reg_start_addr __ro_after_init = PAGE_OFFSET;
>  phys_addr_t arm64_dma_phys_limit __ro_after_init;
>
>  #ifdef CONFIG_BLK_DEV_INITRD
> @@ -452,6 +453,8 @@ void __init arm64_memblock_init(void)
> }
> }
>
> +   linear_reg_start_addr = __phys_to_virt(memblock_start_of_DRAM());
> +
> /*
>  * Register the kernel text, kernel data, initrd, and initial
>  * pagetables with memblock.
> --
> 2.7.4
>



Re: [RFC PATCH v4 7/8] ima: based on policy prevent loading firmware (pre-allocated buffer)

2018-06-06 Thread Ard Biesheuvel
On 6 June 2018 at 00:37, Kees Cook  wrote:
> On Fri, Jun 1, 2018 at 12:25 PM, Luis R. Rodriguez  wrote:
>> On Fri, Jun 01, 2018 at 09:15:45PM +0200, Luis R. Rodriguez wrote:
>>> On Tue, May 29, 2018 at 02:01:59PM -0400, Mimi Zohar wrote:
>>> > Some systems are memory constrained but they need to load very large
>>> > firmwares.  The firmware subsystem allows drivers to request this
>>> > firmware be loaded from the filesystem, but this requires that the
>>> > entire firmware be loaded into kernel memory first before it's provided
>>> > to the driver.  This can lead to a situation where we map the firmware
>>> > twice, once to load the firmware into kernel memory and once to copy the
>>> > firmware into the final resting place.
>>> >
>>> > To resolve this problem, commit a098ecd2fa7d ("firmware: support loading
>>> > into a pre-allocated buffer") introduced request_firmware_into_buf() API
>>> > that allows drivers to request firmware be loaded directly into a
>>> > pre-allocated buffer.  The QCOM_MDT_LOADER calls dma_alloc_coherent() to
>>> > allocate this buffer.  According to Documentation/DMA-API.txt,
>>> >
>>> >  Consistent memory is memory for which a write by either the
>>> >  device or the processor can immediately be read by the processor
>>> >  or device without having to worry about caching effects.  (You
>>> >  may however need to make sure to flush the processor's write
>>> >  buffers before telling devices to read that memory.)
>>> >
>>> > Devices using pre-allocated DMA memory run the risk of the firmware
>>> > being accessible by the device prior to the kernel's firmware signature
>>> > verification has completed.
>>>
>>> Indeed. And since its DMA memory we have *no idea* what can happen in
>>> terms of consumption of this firmware from hardware, when it would start
>>> consuming it in particular.
>>>
>>> If the device has its own hardware firmware verification mechanism, this
>>> is completely obscure to us, but it may nevertheless satisfy certain
>>> security policies.
>>>
>>> The problem here lies in the conflicting security policies: the kernel
>>> wants to not give away the firmware until verification is complete, but
>>> there is currently no way for platforms to suggest they trust the
>>> hardware won't do something stupid.
>>> This becomes an issue since the semantics of the firmware API's preallocated
>>> buffer do not currently allow the kernel to inform LSMs whether a buffer
>>> is DMA memory or not, nor is there a way for certain platforms to say
>>> that such use is fine for specific devices.
>>>
>>> Given a pointer can we determine if a piece of memory is DMA or not?
>>
>> FWIW
>>
>> Vlastimil suggests page_zone() or virt_to_page() may be able to.
>
> I don't see a PAGEFLAG for DMA, but I do see ZONE_DMA for
> page_zone()... So maybe something like
>
> struct page *page;
>
> page = virt_to_page(address);
> if (!page)
>fail closed...
> if (page_zone(page) == ZONE_DMA)
> handle dma case...
> else
> non-dma
>
> But I've CCed Laura and Rik, who I always lean on when I have these
> kinds of page questions...
>

That is not going to help. In general, DMA can access any memory in
the system (unless a IOMMU is actively preventing that).

The streaming DMA API allows you to map()/unmap() arbitrary pieces of
memory for DMA, regardless of how they were allocated. (Some drivers
were even doing DMA from the stack at some point, but this broke with
vmapped stacks, so most of these cases have been fixed.)
firmware to a device does not require a coherent (as opposed to
streaming) mapping for DMA, and so it is perfectly reasonable for a
driver to use the streaming API to map the firmware image (wherever it
is in memory) and upload it from there.

However, the DMA API does impose some ordering. Mapping memory for DMA
gives you a DMA address (which may be different from the physical
address [depending on the platform]), and this DMA address is what
gets programmed into the device, not the virtual or physical address.
That means you can be reasonably confident that the device will not be
able to consume what is in this memory before it has been mapped for
DMA. Also, the DMA API explicitly forbids touching memory mapped for
streaming DMA: the device owns it at this point, and so the CPU should
refrain from accessing it.

So the question is, why is QCOM_MDT_LOADER using a coherent DMA
mapping? That does not make any sense purely for moving firmware into
the device, and it is indeed a security hazard if we are trying to
perform a signature check before the device is cleared for reading it.

Note that qcom_scm_pas_init_image() is documented as

/*
 * During the scm call memory protection will be enabled for the meta
 * data blob, so make sure it's physically contiguous, 4K aligned and
 * non-cachable to avoid XPU violations.
 */

and dma_alloc_coherent() happens to give them that. Whether the DMA
mapping is actually used is a different matter.

Re: [PATCH v3 0/2] kexec-tools: arm64: Enable D-cache in purgatory

2018-04-04 Thread Ard Biesheuvel
On 4 April 2018 at 15:28, James Morse  wrote:
> Hi Kostiantyn,
>
> On 04/04/18 13:45, Kostiantyn Iarmak wrote:
>> From: Pratyush Anand 
>>> Date: Fri, Jun 2, 2017 at 5:42 PM
>>> Subject: Re: [PATCH v3 0/2] kexec-tools: arm64: Enable D-cache in purgatory
>>> To: James Morse 
>>> Cc: mark.rutl...@arm.com, b...@redhat.com, kexec@lists.infradead.org,
>>> ho...@verge.net.au, dyo...@redhat.com,
>>> linux-arm-ker...@lists.infradead.org
>>>
>>> On Friday 02 June 2017 01:53 PM, James Morse wrote:
 On 23/05/17 06:02, Pratyush Anand wrote:
> It takes more than 2 minutes to verify SHA in purgatory when the vmlinuz image
> is around 13MB and the initramfs is around 30MB. It takes more than 20 seconds
> even when we have -O2 optimization enabled. However, if the dcache is enabled
> during purgatory execution, then it takes just a second in SHA
> verification.
>
> Therefore, these patches add support for enabling the dcache during
> purgatory execution.
>
 I'm still not convinced we need this. Moving the SHA verification to happen
 before the dcache+mmu are disabled would also solve the delay problem,
>>>
>>> Humm..I am not sure, if we can do that.
>
>>> When we leave kernel (and enter into purgatory), icache+dcache+mmu are
>>> already disabled. I think, that would be possible when we will be in a
>>> position to use in-kernel purgatory.
>>>
 and we
 can print an error message or fail the syscall.

 For kexec we don't expect memory corruption, what are we testing for?
 I can see the use for kdump, but the kdump-kernel is unmapped so the kernel
 can't accidentally write over it.

 (we discussed all this last time, but it fizzled-out. If you and the
   kexec-tools maintainer think its necessary, fine by me!)
>
>>> Yes, there had already been discussion and MAINTAINERs have
>>> discouraged none-purgatory implementation.
>>>
 I have some comments on making this code easier to maintain..

>>> Thanks.
>>>
>>> I have implemented your review comments and have archived the code in
>>>
>>> https://github.com/pratyushanand/kexec-tools.git : purgatory-enable-dcache
>>>
>>> I will be posting the next version only when someone complains about
>>> ARM64 kdump behavior that it is not as fast as x86.
>
>> On our ARM64-based platform we have very long main kernel-secondary kernel
>> switch time.
>>
>> This patch set fixes the issue (we are using 4.4 kernel and 2.0.13
>> kexec-tools version): we get a ~25x speedup. With this patch the secondary
>> kernel boots in ~3 seconds, while on kexec-tools 2.0.13-2.0.16 without this
>> patch the switch takes about 75 seconds.
>
> This is slow because it's generating a checksum of the kernel without the
> benefit of the caches. This series generated page tables so that it could
> enable the MMU and caches. But the purgatory code also needs to be as
> simple as possible, because it's practically impossible to debug.
>
> The purgatory code does this checksum-ing because it's worried the panic()
> was caused by the kernel corrupting memory, and that the corruption may
> have affected the kdump kernel too.
>

If this is the only reason, there is no need to use a strong
cryptographic hash, and we should be able to recover some performance
by switching to CRC32 instead, preferably using the special arm64
instructions (if implemented).

But I agree that skipping the checksum calculation altogether is
probably the best approach here.


> This can't happen on arm64 as we unmap kdump's crash region, so not even the
> kernel can accidentally write to it. 98d2e1539b84 ("arm64: kdump: protect
> crash dump kernel memory") has all the details.
>
> (we also needed to do this to avoid the risk of mismatched memory attributes
> if kdump boots and some CPUs are still stuck in the old kernel)
>
>
>> When do you plan merge this patch?
>
> We ended up with the check-summing code because it's the default behaviour of
> kexec-tools on other architectures.
>
> One alternative is to rip it out for arm64. Untested:
> %<
> diff --git a/purgatory/arch/arm64/Makefile b/purgatory/arch/arm64/Makefile
> index 636abea..f10c148 100644
> --- a/purgatory/arch/arm64/Makefile
> +++ b/purgatory/arch/arm64/Makefile
> @@ -7,7 +7,8 @@ arm64_PURGATORY_EXTRA_CFLAGS = \
> -Werror-implicit-function-declaration \
> -Wdeclaration-after-statement \
> -Werror=implicit-int \
> -   -Werror=strict-prototypes
> +   -Werror=strict-prototypes \
> +   -DNO_SHA_IN_PURGATORY
>
>  arm64_PURGATORY_SRCS += \
> purgatory/arch/arm64/entry.S \
> diff --git a/purgatory/purgatory.c b/purgatory/purgatory.c
> index 3bbcc09..44e792a 100644
> --- a/purgatory/purgatory.c
> +++ b/purgatory/purgatory.c
> @@ -9,6 +9,8 @@
>  struct sha256_region sha256_regions[SHA256_REGIONS] = {};
>  sha256_digest_t sha256_digest = { };
>
> 

Re: [RFC] arm64: extra entries in /proc/iomem for kexec

2018-03-15 Thread Ard Biesheuvel
On 15 March 2018 at 04:41, AKASHI Takahiro <takahiro.aka...@linaro.org> wrote:
> On Wed, Mar 14, 2018 at 08:39:23AM +0000, Ard Biesheuvel wrote:
>> On 14 March 2018 at 08:29, AKASHI Takahiro <takahiro.aka...@linaro.org> 
>> wrote:
>> > In the last couples of months, there were some problems reported [1],[2]
>> > around arm64 kexec/kdump. Where those phenomenon look different,
>> > the root cause would be that kexec/kdump doesn't take into account
>> > crucial "reserved" regions of system memory and unintentionally corrupts
>> > them.
>> >
>> > Given that kexec-tools looks for all the information by seeking the file,
>> > /proc/iomem, the first step to address said problems is to expand this 
>> > file's
>> > format so that it will have enough information about system memory and
>> > its usage.
>> >
>> > Attached is my experimental code: With this patch applied, /proc/iomem sees
>> > something like the below:
>> >
>> > (format A)
>> > 4000-5871 : System RAM
>> >   4008-40f1 : Kernel code
>> >   4104-411e8fff : Kernel data
>> >   5440-583f : Crash kernel
>> >   5859-585e : EFI Resources
>> >   5870-5871 : EFI Resources
>> > 5872-58b5 : System RAM
>> >   5872-58b5 : EFI Resources
>> > 58b6-5be3 : System RAM
>> >   58b61018-58b61947 : EFI Memory Map
>> >   59a7b118-59a7b667 : EFI Configuration Tables
>> > 5be4-5bec : System RAM  <== (A-1)
>> >   5be4-5bec : EFI Resources
>> > 5bed-5bed : System RAM
>> > 5bee-5bff : System RAM
>> >   5bee-5bff : EFI Resources
>> > 5c00-5fff : System RAM
>> > 80-ff : PCI Bus :00
>> >
>> > Meanwhile, the workaround I suggested in [3] gave us a simpler view:
>> >
>> > (format B)
>> > 4000-5871 : System RAM
>> >   4008-40f1 : Kernel code
>> >   4104-411e9fff : Kernel data
>> >   5440-583f : Crash kernel
>> >   5859-585e : reserved
>> >   5870-5871 : reserved
>> > 5872-58b5 : reserved
>> > 58b6-5be3 : System RAM
>> >   58b61000-58b61fff : reserved
>> >   59a7b318-59a7b867 : reserved
>> > 5be4-5bec : reserved<== (B-1)
>> > 5bed-5bed : System RAM
>> > 5bee-5bff : reserved
>> > 5c00-5fff : System RAM
>> >   5ec0-5edf : reserved
>> > 80-ff : PCI Bus :00
>> >
>> > Here all the regions to be protected are named just "reserved" whether
>> > they are NOMAP regions or simply-memblock_reserve'd. They are not very
>> > useful for anything but kexec/kdump which knows what they mean.
>> >
>> > Alternatively, we may want to give them more specific names, based on
>> > related efi memory map descriptors and else, that will characterize
>> > their contents:
>> >
>> > (format C)
>> > 4000-5871 : System RAM
>> >   4008-40f1 : Kernel code
>> >   4104-411e9fff : Kernel data
>> >   5440-583f : Crash kernel
>> >   5859-585e : ACPI Reclaim Memory
>> >   5870-5871 : ACPI Reclaim Memory
>> > 5872-58b5 : System RAM
>> >   5872-5878 : Runtime Data
>> >   5879-587d : Runtime Code
>> >   587e-5882 : Runtime Data
>> >   5883-5887 : Runtime Code
>> >   5888-588c : Runtime Data
>> >   588d-5891 : Runtime Code
>> >   5892-5896 : Runtime Data
>> >   5897-589b : Runtime Code
>> >   589c-58a5 : Runtime Data
>> >   58a6-58ab : Runtime Code
>> >   58ac-58b0 : Runtime Data
>> >   58b1-58b5 : Runtime Code
>> > 58b6-5be3 : System RAM
>> >   58b61000-58b61fff : EFI Memory Map
>> >   59a7b118-59a7b667 : EFI Memory Attributes Table
>> > 5be4-5bec : System RAM
>> >   5be4-5bec : Runtime Code
>> > 5bed-5bed : System RAM
>> > 5bee-5bff : System RAM
>> >   5bee-5bff : Runtime Data
>> > 5c00-5fff : System RAM
>> > 80-ff : PCI Bus :00
>> >
>> > I once created a patch for this format, but it looks quite noisy and
>> > names a

Re: [RFC] arm64: extra entries in /proc/iomem for kexec

2018-03-14 Thread Ard Biesheuvel
On 14 March 2018 at 08:29, AKASHI Takahiro  wrote:
> In the last couples of months, there were some problems reported [1],[2]
> around arm64 kexec/kdump. Where those phenomenon look different,
> the root cause would be that kexec/kdump doesn't take into account
> crucial "reserved" regions of system memory and unintentionally corrupts
> them.
>
> Given that kexec-tools looks for all the information by seeking the file,
> /proc/iomem, the first step to address said problems is to expand this file's
> format so that it will have enough information about system memory and
> its usage.
>
> Attached is my experimental code: With this patch applied, /proc/iomem sees
> something like the below:
>
> (format A)
> 4000-5871 : System RAM
>   4008-40f1 : Kernel code
>   4104-411e8fff : Kernel data
>   5440-583f : Crash kernel
>   5859-585e : EFI Resources
>   5870-5871 : EFI Resources
> 5872-58b5 : System RAM
>   5872-58b5 : EFI Resources
> 58b6-5be3 : System RAM
>   58b61018-58b61947 : EFI Memory Map
>   59a7b118-59a7b667 : EFI Configuration Tables
> 5be4-5bec : System RAM  <== (A-1)
>   5be4-5bec : EFI Resources
> 5bed-5bed : System RAM
> 5bee-5bff : System RAM
>   5bee-5bff : EFI Resources
> 5c00-5fff : System RAM
> 80-ff : PCI Bus :00
>
> Meanwhile, the workaround I suggested in [3] gave us a simpler view:
>
> (format B)
> 4000-5871 : System RAM
>   4008-40f1 : Kernel code
>   4104-411e9fff : Kernel data
>   5440-583f : Crash kernel
>   5859-585e : reserved
>   5870-5871 : reserved
> 5872-58b5 : reserved
> 58b6-5be3 : System RAM
>   58b61000-58b61fff : reserved
>   59a7b318-59a7b867 : reserved
> 5be4-5bec : reserved<== (B-1)
> 5bed-5bed : System RAM
> 5bee-5bff : reserved
> 5c00-5fff : System RAM
>   5ec0-5edf : reserved
> 80-ff : PCI Bus :00
>
> Here all the regions to be protected are named just "reserved" whether
> they are NOMAP regions or simply-memblock_reserve'd. They are not very
> useful for anything but kexec/kdump which knows what they mean.
>
> Alternatively, we may want to give them more specific names, based on
> the related EFI memory map descriptors and the like, to characterize
> their contents:
>
> (format C)
> 4000-5871 : System RAM
>   4008-40f1 : Kernel code
>   4104-411e9fff : Kernel data
>   5440-583f : Crash kernel
>   5859-585e : ACPI Reclaim Memory
>   5870-5871 : ACPI Reclaim Memory
> 5872-58b5 : System RAM
>   5872-5878 : Runtime Data
>   5879-587d : Runtime Code
>   587e-5882 : Runtime Data
>   5883-5887 : Runtime Code
>   5888-588c : Runtime Data
>   588d-5891 : Runtime Code
>   5892-5896 : Runtime Data
>   5897-589b : Runtime Code
>   589c-58a5 : Runtime Data
>   58a6-58ab : Runtime Code
>   58ac-58b0 : Runtime Data
>   58b1-58b5 : Runtime Code
> 58b6-5be3 : System RAM
>   58b61000-58b61fff : EFI Memory Map
>   59a7b118-59a7b667 : EFI Memory Attributes Table
> 5be4-5bec : System RAM
>   5be4-5bec : Runtime Code
> 5bed-5bed : System RAM
> 5bee-5bff : System RAM
>   5bee-5bff : Runtime Data
> 5c00-5fff : System RAM
> 80-ff : PCI Bus :00
>
> I once created a patch for this format, but it looks quite noisy, and
> the names are a mixture of memory attributes (ACPI Reclaim Memory,
> Conventional Memory, Persistent Memory, etc.) vs.
> functions/usages ([Loader|Boot Service|Runtime] Code/Data).
> (As a matter of fact, (C-1) consists of various ACPI tables.)
> Anyhow, they seem not so useful for most other applications.
>
> Those observations lead to format A, where neighbouring entries with
> the same attributes are squeezed into a single entry under a simple
> name.
>
>
> So my questions here are:
>
> 1. Which format, A, B, or C, is the most appropriate for the moment?
>or any other suggestions?
>

I think some variant of B should be sufficient. The only meaningful
distinction between these reserved regions at a general level is
whether they are NOMAP or not, so perhaps we can incorporate that.

As for identifying things like EFI configuration tables: this is a
moving target, and we also define our own config tables for the TPM
log, screeninfo on ARM etc. Also, for EFI memory types, you can boot
with efi=debug and look at the entire memory map. So I think adding
all that information may be overkill.

> Currently, there is an inconsistent view between (A) and the mainline's:
> see (A-1) and (B-1). If this really matters, I can fix it.
> Kexec-tools can easily be modified to accept both formats, though.
>
>
> 2. How should we 

Re: [Query] ARM64 kaslr support - randomness, seeding and kdump

2018-03-12 Thread Ard Biesheuvel
On 12 March 2018 at 20:14, Bhupesh Sharma  wrote:
> Hi Ard,
>
> I remember we had a discussion on this topic some time ago, but I was
> working on enabling KASLR support on arm64 boards internally and
> wanted to check your opinion on the following points (especially to
> understand if there are any changes in the opinions of the ARM
> maintainers now):
>
> A. Randomness / Seeding for arm64 kaslr:
>
> - Currently the arm64 kernel requires a bootloader to provide entropy,
> by passing a
>  random u64 value in '/chosen/kaslr-seed' at kernel entry (please see [1])
>
> - On platforms which support UEFI firmware, it's the responsibility of
> the UEFI firmware to implement EFI_RNG_PROTOCOL to supply the
> '/chosen/kaslr-seed' property.
>
> - I was wondering if there is any possibility of supporting random seed
> generation in the EFI stub alone, like x86 does, rather than relying on
> UEFI firmware providing EFI_RNG_PROTOCOL - e.g. by using a randomness
> source like the boot time, or more proper entropy sources like the
> arm64 system timer (see [2] for the x86 example).
>
> - I guess that the main problem is that most arm64 UEFI firmware
> vendors still do not support EFI_RNG_PROTOCOL out of the box. We can
> use the ChaosKey (see [3]) EFI driver and use this USB key as the
> source of entropy on the arm64 systems for EFI firmwares which do not
> provide a EFI_RNG_PROTOCOL implementation, but it might not be very
> feasible to connect it to all boards in a production environment.
>

The problem is that arm64 does not have an architected means of
obtaining entropy, and we shouldn't rely on hacks to get pseudo
entropy.

Note that EFI_RNG_PROTOCOL is not only used for KASLR, it is also used
to seed the kernel entropy pool if the firmware provides an
implementation of the protocol.

Any UEFI system that can boot off USB should be able to use the
ChaosKey as well, but the best approach is obviously to implement
EFI_RNG_PROTOCOL natively if the platform has an entropy source
available.

If a platform vendor wants to hack something up based on the timer or
the performance counter, they are free to do so. But that doesn't mean
we should implement anything along those lines in the kernel.

> B. Regarding the arm64 kaslr support in kdump (I have Cc'ed AKASHI and
> kexec list in this thread as well for their inputs), but currently we
> don't seem to have a way to support kaslr in arm64 kdump kernel:
>
> - The '/chosen/kaslr-seed' property is zeroed out in the primary kernel
> (to avoid leaking out a randomness secret), but how should it then be
> handed over to the kdump kernel?
> - We pass the dtb over to the kdump kernel for arm64 kdump, but the
> '/chosen/kaslr-seed' property would already have been zeroed out by the
> primary kernel, so the secondary kernel ends up running in a *nokaslr*
> environment (see [4] for example).
>

What would be the point of randomizing the placement of the kdump
kernel? And don't say 'because x86 does it', because that is not a
good reason.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] arm64: kdump: retain reserved memory regions

2018-01-29 Thread Ard Biesheuvel
On 29 January 2018 at 08:12, AKASHI Takahiro  wrote:
> James,
>
> On Fri, Jan 19, 2018 at 11:39:58AM +, James Morse wrote:
>> Hi Akashi,
>>
>> On 11/01/18 11:38, AKASHI Takahiro wrote:
>> > On Wed, Jan 10, 2018 at 11:26:55AM +, James Morse wrote:
>> >> On 10/01/18 10:09, AKASHI Takahiro wrote:
>> >>> This is a fix for an issue where the crash dump kernel may hang up
>> >>> during boot, which can happen on any ACPI-based system with "ACPI
>> >>> Reclaim Memory."
>>
>> >>> (diagnosis)
>> >>> * This fault is a data abort, alignment fault (ESR=0x9621)
>> >>>   during reading out ACPI table.
>> >>> * Initial ACPI tables are normally stored in system ram and marked as
>> >>>   "ACPI Reclaim memory" by the firmware.
>> >>> * After the commit f56ab9a5b73c ("efi/arm: Don't mark ACPI reclaim
>> >>>   memory as MEMBLOCK_NOMAP"), those regions' attribute were changed
>> >>>   removing NOMAP bit and they are instead "memblock-reserved".
>> >>> * When the crash dump kernel boots up, it tries to access ACPI tables by
>> >>>   ioremap'ing them (through acpi_os_ioremap()).
>> >>> * Since those regions are not included in device tree's
>> >>>   "usable-memory-range" and so not recognized as part of crash dump
>> >>>   kernel's system ram, ioremap() will create a non-cacheable mapping 
>> >>> here.
>> >>
>> >> Ugh, because acpi_os_ioremap() looks at the efi memory map through the 
>> >> prism of
>> >> what we pulled into memblock, which is different during kdump.
>> >>
>> >> Is an alternative to teach acpi_os_ioremap() to ask
>> >> efi_mem_attributes() directly for the attributes to use?
>> >> (e.g. arch_apei_get_mem_attribute())
>> >
>> > I didn't think of this approach.
>> > Do you mean a change like the patch below?
>>
>> Yes. Aha, you can pretty much re-use the helper directly.
>>
>> It was just a suggestion, removing the extra abstraction that is causing the 
>> bug
>> could be cleaner ...
>>
>> > (I'm still debugging this code since the kernel fails to boot.)
>>
>> ... but might be too fragile.
>>
>> There are points during boot when the EFI memory map isn't mapped.
>
> Right, this was a problem for my patch.
> Attached is the revised and workable one.
> Alternatively, efi_memmap_init_late() may be called in acpi_early_init()
> or even in acpi_os_ioremap(), but either way it looks a bit odd.
>

Akashi-san,

efi_memmap_init_late() is currently being called from
arm_enable_runtime_services(), which is an early initcall. If that is
too late for acpi_early_init(), we could perhaps move the call
forward, i.e., sth like

-8<
diff --git a/drivers/firmware/efi/arm-runtime.c
b/drivers/firmware/efi/arm-runtime.c
index 6f60d659b323..e835d3b20af6 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -117,7 +117,7 @@ static bool __init efi_virtmap_init(void)
  * non-early mapping of the UEFI system table and virtual mappings for all
  * EFI_MEMORY_RUNTIME regions.
  */
-static int __init arm_enable_runtime_services(void)
+void __init efi_enter_virtual_mode(void)
 {
u64 mapsize;

@@ -156,7 +156,6 @@ static int __init arm_enable_runtime_services(void)

return 0;
 }
-early_initcall(arm_enable_runtime_services);

 void efi_virtmap_load(void)
 {
diff --git a/init/main.c b/init/main.c
index a8100b954839..2d0927768e2d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -674,6 +674,9 @@ asmlinkage __visible void __init start_kernel(void)
debug_objects_mem_init();
setup_per_cpu_pageset();
numa_policy_init();
+   if (IS_ENABLED(CONFIG_EFI) &&
+   (IS_ENABLED(CONFIG_ARM64) || IS_ENABLED(CONFIG_ARM)))
+   efi_enter_virtual_mode();
acpi_early_init();
if (late_time_init)
late_time_init();
-8<

would be reasonable imo. Also, I think it is justifiable to make ACPI
depend on UEFI on arm64, which is notably different from x86.

(I know 'efi_enter_virtual_mode' is not entirely accurate here, given
that we call SetVirtualAddressMap from the UEFI stub on ARM, but it is
still close enough, given that one could argue that EFI is not in
'virtual mode' until the mappings are in place)



> ===8<===
> From c88f4c8106ba7a918c835b1cdf538b1d21019863 Mon Sep 17 00:00:00 2001
> From: AKASHI Takahiro 
> Date: Mon, 29 Jan 2018 15:07:43 +0900
> Subject: [PATCH] arm64: kdump: make acpi_os_ioremap() more generic
>
> ---
>  arch/arm64/include/asm/acpi.h | 23 ---
>  arch/arm64/kernel/acpi.c  |  7 ++-
>  init/main.c   |  4 
>  3 files changed, 22 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h
> index 32f465a80e4e..d53c95f4e1a9 100644
> --- a/arch/arm64/include/asm/acpi.h
> +++ b/arch/arm64/include/asm/acpi.h
> @@ -12,10 +12,12 @@
>  #ifndef _ASM_ACPI_H
>  #define _ASM_ACPI_H
>
> +#include 
>  #include 
>  #include 
>
>  #include 

Re: [PATCH] arm64: kdump: retain reserved memory regions

2018-01-10 Thread Ard Biesheuvel
On 10 January 2018 at 10:09, AKASHI Takahiro  wrote:
> This is a fix for an issue where the crash dump kernel may hang up
> during boot, which can happen on any ACPI-based system with "ACPI
> Reclaim Memory."
>
> 
> Bye!
>(snip...)
> ACPI: Core revision 20170728
> pud=2e7d0003, *pmd=2e7c0003, *pte=00e839710707
> Internal error: Oops: 9621 [#1] SMP
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc6 #1
> task: 08d05180 task.stack: 08cc
> PC is at acpi_ns_lookup+0x25c/0x3c0
> LR is at acpi_ds_load1_begin_op+0xa4/0x294
>(snip...)
> Process swapper/0 (pid: 0, stack limit = 0x08cc)
> Call trace:
>(snip...)
> [] acpi_ns_lookup+0x25c/0x3c0
> [] acpi_ds_load1_begin_op+0xa4/0x294
> [] acpi_ps_build_named_op+0xc4/0x198
> [] acpi_ps_create_op+0x14c/0x270
> [] acpi_ps_parse_loop+0x188/0x5c8
> [] acpi_ps_parse_aml+0xb0/0x2b8
> [] acpi_ns_one_complete_parse+0x144/0x184
> [] acpi_ns_parse_table+0x48/0x68
> [] acpi_ns_load_table+0x4c/0xdc
> [] acpi_tb_load_namespace+0xe4/0x264
> [] acpi_load_tables+0x48/0xc0
> [] acpi_early_init+0x9c/0xd0
> [] start_kernel+0x3b4/0x43c
> Code: b9008fb9 2a000318 36380054 32190318 (b94002c0)
> ---[ end trace c46ed37f9651c58e ]---
> Kernel panic - not syncing: Fatal exception
> Rebooting in 10 seconds..
>
> (diagnosis)
> * This fault is a data abort, alignment fault (ESR=0x9621)
>   during reading out ACPI table.
> * Initial ACPI tables are normally stored in system ram and marked as
>   "ACPI Reclaim memory" by the firmware.
> * After commit f56ab9a5b73c ("efi/arm: Don't mark ACPI reclaim
>   memory as MEMBLOCK_NOMAP"), those regions' attributes were changed,
>   removing the NOMAP bit, and they are instead "memblock-reserved".
> * When the crash dump kernel boots up, it tries to access ACPI tables by
>   ioremap'ing them (through acpi_os_ioremap()).
> * Since those regions are not included in device tree's
>   "usable-memory-range" and so not recognized as part of crash dump
>   kernel's system ram, ioremap() will create a non-cacheable mapping here.
> * ACPI accessor/helper functions are compiled in without unaligned access
>   support (ACPI_MISALIGNMENT_NOT_SUPPORTED), eventually ending up in a
>   fatal panic when accessing ACPI tables.
>
> With this patch, all the reserved memory regions, as well as NOMAP-
> attributed ones which are presumably ACPI runtime code and data, are set
> to be retained in system ram even if they are outside of usable memory
> range specified by device tree blob. Accordingly, ACPI tables are mapped
> as cacheable and can be safely accessed without causing unaligned access
> faults.
>
> Reported-by: Bhupesh Sharma 
> Signed-off-by: AKASHI Takahiro 
> ---
>  arch/arm64/mm/init.c | 16 ++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 2d5a443b205c..e4a8b64a09b1 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -352,11 +352,23 @@ static void __init fdt_enforce_memory_region(void)
> struct memblock_region reg = {
> .size = 0,
> };
> +   u64 idx;
> +   phys_addr_t start, end;
>
> of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
>
> -   if (reg.size)
> -   memblock_cap_memory_range(reg.base, reg.size);

Given that memblock_cap_memory_range() was introduced by you for
kdump, is there any way to handle it there?
If not, should we remove it?

> +   if (reg.size) {
> +retry:
> +   /* exclude usable & !reserved memory */
> +   for_each_free_mem_range(idx, NUMA_NO_NODE, MEMBLOCK_NONE,
> +   &start, &end, NULL) {
> +   memblock_remove(start, end - start);
> +   goto retry;
> +   }
> +
> +   /* add back fdt's usable memory */
> +   memblock_add(reg.base, reg.size);
> +   }
>  }
>
>  void __init arm64_memblock_init(void)
> --
> 2.15.1
>



Re: [PATCH v5 03/10] kexec_file: factor out arch_kexec_kernel_*() from x86, powerpc

2017-10-11 Thread Ard Biesheuvel
Hello Takahiro-san,

> On 11 Oct 2017, at 06:07, AKASHI Takahiro  wrote:
> 
> On Tue, Oct 10, 2017 at 12:02:01PM +0100, Julien Thierry wrote:
> 
> [snip]
> 
>>> --- a/kernel/kexec_file.c
>>> +++ b/kernel/kexec_file.c
>>> @@ -26,30 +26,79 @@
>>> #include 
>>> #include "kexec_internal.h"
>>> 
>>> +const __weak struct kexec_file_ops * const kexec_file_loaders[] = {NULL};
>>> +
>>> static int kexec_calculate_store_digests(struct kimage *image);
>>> 
>>> +int _kexec_kernel_image_probe(struct kimage *image, void *buf,
>>> +  unsigned long buf_len)
>>> +{
>>> + const struct kexec_file_ops *fops;
>>> + int ret = -ENOEXEC;
>>> +
>>> + for (fops = kexec_file_loaders[0]; fops && fops->probe; ++fops) {
>> 
>> Hmm, that's not gonna work (and I see that what I said in the previous
>> patch was not 100% correct either).
> 
> Can you elaborate this a bit more?
> 
> I'm sure that, with my code, any member of fops, cannot be changed;
> "const struct kexec_file_ops *fops" means that fops is a pointer to
> "constant sturct kexec_file_ops," while "struct kexec_file_ops *
> const kexec_file_loaders[]" means that kexec_file_loaders is a "constant
> array" of pointers to "constant struct kexec_file_ops."
> 

No, you need 2x const for that, i.e.,

const struct kexec_file_ops * const kexec_file_loaders[]

otherwise, the pointed-to objects may still be modified. 





> Thanks,
> -Takahiro AKASHI
> 
> 
>> 'fops' should be of type 'const struct kexec_file_ops **', and the loop
>> should be:
>> 
>> for (fops = &kexec_file_loaders[0]; *fops && (*fops)->probe; ++fops)
>> 
>> With some additional dereferences in the body of the loop.
>> 
>> Unless you prefer the previous state of the loop (with i and the break
>> inside), but I still think this looks better.
>> 
>> Cheers,
>> 
>> --
>> Julien Thierry

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 08/14] arm64: kexec_file: create purgatory

2017-08-24 Thread Ard Biesheuvel
On 24 August 2017 at 09:18, AKASHI Takahiro  wrote:
> This is a basic purgatory, or a kind of glue code between the two
> kernels, for arm64. We will later add a feature of verifying a digest
> against loaded memory segments.
>
> arch_kexec_apply_relocations_add() is responsible for re-linking any
> relative symbols in the purgatory. Please note that the purgatory is not
> an executable, but a non-linked archive of binaries, so relative symbols
> contained in it must be resolved at kexec load time.

This sounds fragile to me. What is the reason we cannot let the linker
deal with this, similar to, e.g., how the VDSO gets linked?

Otherwise, couldn't we reuse the module loader to get these objects
relocated in memory? I'm sure there are differences that would require
some changes there, but implementing all of this again sounds like
overkill to me.


> Although arm64_kernel_start and arm64_dtb_addr are the only such global
> variables now, arch_kexec_apply_relocations_add() can handle various
> other types of relocations.
>
> Signed-off-by: AKASHI Takahiro 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> ---
>  arch/arm64/Makefile|   1 +
>  arch/arm64/kernel/Makefile |   1 +
>  arch/arm64/kernel/machine_kexec_file.c | 199 
> +
>  arch/arm64/purgatory/Makefile  |  24 
>  arch/arm64/purgatory/entry.S   |  28 +
>  5 files changed, 253 insertions(+)
>  create mode 100644 arch/arm64/kernel/machine_kexec_file.c
>  create mode 100644 arch/arm64/purgatory/Makefile
>  create mode 100644 arch/arm64/purgatory/entry.S
>
> diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
> index 9b41f1e3b1a0..429f60728c0a 100644
> --- a/arch/arm64/Makefile
> +++ b/arch/arm64/Makefile
> @@ -105,6 +105,7 @@ core-$(CONFIG_XEN) += arch/arm64/xen/
>  core-$(CONFIG_CRYPTO) += arch/arm64/crypto/
>  libs-y := arch/arm64/lib/ $(libs-y)
>  core-$(CONFIG_EFI_STUB) += $(objtree)/drivers/firmware/efi/libstub/lib.a
> +core-$(CONFIG_KEXEC_FILE) += arch/arm64/purgatory/
>
>  # Default target when executing plain make
>  boot   := arch/arm64/boot
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index f2b4e816b6de..16e9f56b536a 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -50,6 +50,7 @@ arm64-obj-$(CONFIG_RANDOMIZE_BASE)+= kaslr.o
>  arm64-obj-$(CONFIG_HIBERNATION)+= hibernate.o hibernate-asm.o
>  arm64-obj-$(CONFIG_KEXEC)  += machine_kexec.o relocate_kernel.o  
>   \
>cpu-reset.o
> +arm64-obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o
>  arm64-obj-$(CONFIG_ARM64_RELOC_TEST)   += arm64-reloc-test.o
>  arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o
>  arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> diff --git a/arch/arm64/kernel/machine_kexec_file.c 
> b/arch/arm64/kernel/machine_kexec_file.c
> new file mode 100644
> index ..183f7776d6dd
> --- /dev/null
> +++ b/arch/arm64/kernel/machine_kexec_file.c
> @@ -0,0 +1,199 @@
> +/*
> + * kexec_file for arm64
> + *
> + * Copyright (C) 2017 Linaro Limited
> + * Author: AKASHI Takahiro 
> + *
> + * Most code is derived from arm64 port of kexec-tools
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#define pr_fmt(fmt) "kexec_file: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * Apply purgatory relocations.
> + *
> + * ehdr: Pointer to elf headers
> + * sechdrs: Pointer to section headers.
> + * relsec: section index of SHT_RELA section.
> + *
> + * Note:
> + * Currently R_AARCH64_ABS64, R_AARCH64_LD_PREL_LO19 and R_AARCH64_CALL26
> + * are the only types to be generated from purgatory code.
> + * If we add more functionalities, other types may also be used.
> + */
> +int arch_kexec_apply_relocations_add(const Elf64_Ehdr *ehdr,
> +Elf64_Shdr *sechdrs, unsigned int relsec)
> +{
> +   Elf64_Rela *rel;
> +   Elf64_Shdr *section, *symtabsec;
> +   Elf64_Sym *sym;
> +   const char *strtab, *name, *shstrtab;
> +   unsigned long address, sec_base, value;
> +   void *location;
> +   u64 *loc64;
> +   u32 *loc32, imm;
> +   unsigned int i;
> +
> +   /*
> +* ->sh_offset has been modified to keep the pointer to section
> +* contents in memory
> +*/
> +   rel = (void *)sechdrs[relsec].sh_offset;
> +
> +   /* Section to which relocations apply */
> +   section = &sechdrs[sechdrs[relsec].sh_info];
> +
> +   pr_debug("reloc: Applying relocate section %u to %u\n", relsec,
> +sechdrs[relsec].sh_info);

Re: [PATCH 09/14] arm64: kexec_file: add sha256 digest check in purgatory

2017-08-24 Thread Ard Biesheuvel
On 24 August 2017 at 09:18, AKASHI Takahiro <takahiro.aka...@linaro.org> wrote:
> Most of the sha256 code is based on crypto/sha256-glue.c, particularly
> using the non-neon version.
>
> Please note that we won't be able to re-use lib/mem*.S for purgatory
> because unaligned memory access is not allowed in purgatory where mmu
> is turned off.
>
> Since purgatory is not linked with the rest of the kernel, care must be
> taken to select an appropriate set of compiler options in order to
> prevent undefined symbol references from being generated.
>
> Signed-off-by: AKASHI Takahiro <takahiro.aka...@linaro.org>
> Cc: Catalin Marinas <catalin.mari...@arm.com>
> Cc: Will Deacon <will.dea...@arm.com>
> Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
> ---
>  arch/arm64/crypto/sha256-core.S_shipped |  2 +
>  arch/arm64/purgatory/Makefile   | 21 -
>  arch/arm64/purgatory/entry.S| 13 ++
>  arch/arm64/purgatory/purgatory.c| 20 +
>  arch/arm64/purgatory/sha256-core.S  |  1 +
>  arch/arm64/purgatory/sha256.c   | 79 
> +
>  arch/arm64/purgatory/sha256.h   |  1 +
>  arch/arm64/purgatory/string.c   | 32 +
>  arch/arm64/purgatory/string.h   |  5 +++
>  9 files changed, 173 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/purgatory/purgatory.c
>  create mode 100644 arch/arm64/purgatory/sha256-core.S
>  create mode 100644 arch/arm64/purgatory/sha256.c
>  create mode 100644 arch/arm64/purgatory/sha256.h
>  create mode 100644 arch/arm64/purgatory/string.c
>  create mode 100644 arch/arm64/purgatory/string.h
>
> diff --git a/arch/arm64/crypto/sha256-core.S_shipped 
> b/arch/arm64/crypto/sha256-core.S_shipped
> index 3ce82cc860bc..9ce7419c9152 100644
> --- a/arch/arm64/crypto/sha256-core.S_shipped
> +++ b/arch/arm64/crypto/sha256-core.S_shipped
> @@ -1210,6 +1210,7 @@ sha256_block_armv8:
> ret
>  .size  sha256_block_armv8,.-sha256_block_armv8
>  #endif
> +#ifndef __PURGATORY__
>  #ifdef __KERNEL__
>  .globl sha256_block_neon
>  #endif
> @@ -2056,6 +2057,7 @@ sha256_block_neon:
> add sp,sp,#16*4+16
> ret
>  .size  sha256_block_neon,.-sha256_block_neon
> +#endif
>  #ifndef__KERNEL__
>  .comm  OPENSSL_armcap_P,4,4
>  #endif

Could you please try to find another way to address this?
sha256-core.S_shipped is generated code from the accompanying Perl
script, and that script is kept in sync with upstream OpenSSL. Also,
the performance delta relative to the generic code is not /that/
spectacular, so we may simply use that instead.


> diff --git a/arch/arm64/purgatory/Makefile b/arch/arm64/purgatory/Makefile
> index c2127a2cbd51..d9b38be31e0a 100644
> --- a/arch/arm64/purgatory/Makefile
> +++ b/arch/arm64/purgatory/Makefile
> @@ -1,14 +1,33 @@
>  OBJECT_FILES_NON_STANDARD := y
>
> -purgatory-y := entry.o
> +purgatory-y := entry.o purgatory.o sha256.o sha256-core.o string.o
>
>  targets += $(purgatory-y)
>  PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
>
> +# Purgatory is expected to be ET_REL, not an executable
>  LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined \
> -nostdlib -z nodefaultlib
> +
>  targets += purgatory.ro
>
> +GCOV_PROFILE   := n
> +KASAN_SANITIZE := n
> +KCOV_INSTRUMENT:= n
> +
> +# Some kernel configurations may generate additional code containing
> +# undefined symbols, like _mcount for ftrace and __stack_chk_guard
> +# for stack-protector. Those should be removed from purgatory.
> +
> +CFLAGS_REMOVE_purgatory.o = -pg
> +CFLAGS_REMOVE_sha256.o = -pg
> +CFLAGS_REMOVE_string.o = -pg
> +
> +NO_PROTECTOR := $(call cc-option, -fno-stack-protector)
> +KBUILD_CFLAGS += $(NO_PROTECTOR)
> +
> +KBUILD_AFLAGS += -D__PURGATORY__
> +
>  $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
> $(call if_changed,ld)
>
> diff --git a/arch/arm64/purgatory/entry.S b/arch/arm64/purgatory/entry.S
> index bc4e6b3bf8a1..74d028b838bd 100644
> --- a/arch/arm64/purgatory/entry.S
> +++ b/arch/arm64/purgatory/entry.S
> @@ -6,6 +6,11 @@
>  .text
>
>  ENTRY(purgatory_start)
> +   adr x19, .Lstack
> +   mov sp, x19
> +
> +   bl  purgatory
> +
> /* Start new image. */
> ldr x17, arm64_kernel_entry
> ldr x0, arm64_dtb_addr
> @@ -15,6 +20,14 @@ ENTRY(purgatory_start)
> br  x17
>  END(purgatory_start)
>
> +.ltorg
> +
> +.align 4
> +   .rept   256
> +   .quad   0
> +   .endr
> +.Lstack:
> +
>  .data
>
>  .align 3
> diff --git a/arch/

Re: [PATCH 03/14] resource: add walk_system_ram_res_rev()

2017-08-24 Thread Ard Biesheuvel
On 24 August 2017 at 09:18, AKASHI Takahiro  wrote:
> This function, being a variant of walk_system_ram_res() introduced in
> commit 8c86e70acead ("resource: provide new functions to walk through
> resources"), walks through a list of all the System RAM resources
> in reverse order, i.e., from higher to lower.
>
> It will be used in kexec_file implementation on arm64.
>
> Signed-off-by: AKASHI Takahiro 
> Cc: Vivek Goyal 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> ---
>  include/linux/ioport.h |  3 +++
>  kernel/resource.c  | 48 
>  2 files changed, 51 insertions(+)
>
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 6230064d7f95..9a212266299f 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -271,6 +271,9 @@ extern int
>  walk_system_ram_res(u64 start, u64 end, void *arg,
> int (*func)(u64, u64, void *));
>  extern int
> +walk_system_ram_res_rev(u64 start, u64 end, void *arg,
> +   int (*func)(u64, u64, void *));
> +extern int
>  walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 
> end,
> void *arg, int (*func)(u64, u64, void *));
>
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 9b5f04404152..1d6d734c75ac 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>
> @@ -469,6 +470,53 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
> return ret;
>  }
>
> +int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
> +   int (*func)(u64, u64, void *))
> +{
> +   struct resource res, *rams;
> +   u64 orig_end;
> +   int count, i;
> +   int ret = -1;
> +
> +   count = 16; /* initial */
> +again:
> +   /* create a list */
> +   rams = vmalloc(sizeof(struct resource) * count);
> +   if (!rams)
> +   return ret;
> +
> +   res.start = start;
> +   res.end = end;
> +   res.flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> +   orig_end = res.end;
> +   i = 0;
> +   while ((res.start < res.end) &&
> +   (!find_next_iomem_res(&res, IORES_DESC_NONE, true))) {
> +   if (i >= count) {
> +   /* unlikely but */
> +   vfree(rams);
> +   count += 16;

If the count is likely to be < 16, why are we using vmalloc() here?

> +   goto again;
> +   }
> +
> +   rams[i].start = res.start;
> +   rams[i++].end = res.end;
> +
> +   res.start = res.end + 1;
> +   res.end = orig_end;
> +   }
> +
> +   /* go reverse */
> +   for (i--; i >= 0; i--) {
> +   ret = (*func)(rams[i].start, rams[i].end, arg);
> +   if (ret)
> +   break;
> +   }
> +
> +   vfree(rams);
> +   return ret;
> +}
> +
>  #if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
>
>  /*
> --
> 2.14.1
>



Re: [PATCH 02/14] include: pe.h: remove message[] from mz header definition

2017-08-24 Thread Ard Biesheuvel
On 24 August 2017 at 09:17, AKASHI Takahiro <takahiro.aka...@linaro.org> wrote:
> The message[] field will no longer be part of the mz header definition.
>
> This change is crucial for enabling kexec_file_load on arm64 because
> arm64's "Image" binary, as in PE format, doesn't have any data for it and
> accordingly the following check in pefile_parse_binary() will fail:
>
> chkaddr(cursor, mz->peaddr, sizeof(*pe));
>
> Signed-off-by: AKASHI Takahiro <takahiro.aka...@linaro.org>
> Cc: David Howells <dhowe...@redhat.com>
> Cc: Vivek Goyal <vgo...@redhat.com>
> Cc: Herbert Xu <herb...@gondor.apana.org.au>
> Cc: David S. Miller <da...@davemloft.net>
> ---
>  include/linux/pe.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/pe.h b/include/linux/pe.h
> index 143ce75be5f0..3482b18a48b5 100644
> --- a/include/linux/pe.h
> +++ b/include/linux/pe.h
> @@ -166,7 +166,7 @@ struct mz_hdr {
> uint16_t oem_info;  /* oem specific */
> uint16_t reserved1[10]; /* reserved */
> uint32_t peaddr;/* address of pe header */
> -   char message[64];   /* message to print */
> +   char message[]; /* message to print */
>  };
>
>  struct mz_reloc {

Reviewed-by: Ard Biesheuvel <ard.biesheu...@linaro.org>



Re: [PATCH v3 0/2] kexec-tools: arm64: Enable D-cache in purgatory

2017-06-02 Thread Ard Biesheuvel
On 2 June 2017 at 11:36, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> On 2 June 2017 at 11:15, Bhupesh SHARMA <bhupesh.li...@gmail.com> wrote:
>> Hi Ard, James
>>
>> On Fri, Jun 2, 2017 at 3:25 PM, Ard Biesheuvel
>> <ard.biesheu...@linaro.org> wrote:
>>> On 2 June 2017 at 08:23, James Morse <james.mo...@arm.com> wrote:
>>>> Hi Pratyush,
>>>>
>>>> On 23/05/17 06:02, Pratyush Anand wrote:
>>>>> It takes more than 2 minutes to verify the SHA in purgatory when the
>>>>> vmlinuz image is around 13MB and the initramfs is around 30MB. It
>>>>> takes more than 20 seconds even with -O2 optimization enabled.
>>>>> However, if the dcache is enabled during purgatory execution, SHA
>>>>> verification takes just a second.
>>>>>
>>>>> Therefore, these patches add support for enabling the dcache during
>>>>> purgatory execution.
>>>>
>>>> I'm still not convinced we need this. Moving the SHA verification to happen
>>>> before the dcache+mmu are disabled would also solve the delay problem, and 
>>>> we
>>>> can print an error message or fail the syscall.
>>>>
>>>> For kexec we don't expect memory corruption, what are we testing for?
>>>
>>> This is a very good question. SHA-256 is quite a heavy hammer if all
>>> you need is CRC style error detection. Note that SHA-256 uses 256
>>> bytes of round keys, which means that in the absence of a cache, each
>>> 64 byte chunk of data processed involves (re)reading 320 bytes from
>>> DRAM. That also means you could write a SHA-256 implementation for
>>> AArch64 that keeps the round keys in NEON registers instead, and it
>>> would probably be a lot faster.
>>
>> AFAICR the sha-256 implementation was proposed for booting a signed
>> kexec/kdump kernel, to keep kexec from violating UEFI secure boot
>> restrictions (see [1]).
>>
>> As Matthew Garrett rightly noted (see [2]), Secure Boot, if enabled, is
>> explicitly designed to stop you booting modified kernels unless you've
>> added your own keys.
>>
>> But if you boot a signed Linux distribution with kexec enabled without
>> using the SHA like feature in the purgatory (like, say, Ubuntu) then
>> you're able to boot a modified Windows kernel that will still believe
>> it was booted securely.
>>
>> So, a CRC couldn't possibly fulfil the functionality we are trying to
>> achieve with SHA-256 in the purgatory.
>>
>
> OK. But it appears that kexec_file_load() generates the hashes, and
> the purgatory just double-checks them? That means there is wiggle room
> in terms of hash implementation, even though a non-cryptographic hash
> may be out of the question.
>
>> However, having seen the power of using the inbuilt CRC instructions
>> from the ARM64 ISA on a SoC which supports it, I can vouch that the
>> native ISA implementations are much faster than other approaches.
>>
>> However, a SHA-256 implementation (as you rightly noted) that employs
>> NEON registers can be faster; however, I remember some SoC vendors
>> disabling co-processor extensions in their ARM implementations in the
>> past, so I am not sure we can assume that NEON extensions are
>> available in all ARMv8 implementations by default.
>>
>
> Alternatively, a SHA-256 implementation that uses movz/movk sequences
> instead of ldr instructions to load the round constants would already
> be 5x faster, given that we don't need page tables to enable the
> I-cache.

Actually, looking at the C code and the objdump of the kernel's
sha256_generic driver, it is likely that it is already doing this, and
none of the points I made actually make a lot of sense ...

Pratyush: I assume you are already enabling the I-cache in the purgatory?


