Re: [PATCH v7 1/4] x86: kdump: move reserve_crashkernel_low() into crash_core.c

2019-12-28 Thread Dave Young
On 12/27/19 at 07:04pm, Chen Zhou wrote:
> Hi Dave
> 
> On 2019/12/27 13:54, Dave Young wrote:
> > Hi,
> > On 12/23/19 at 11:23pm, Chen Zhou wrote:
> >> In preparation for supporting reserve_crashkernel_low in arm64 as
> >> x86_64 does, move reserve_crashkernel_low() into kernel/crash_core.c.
> >>
> >> Note, in arm64, we reserve low memory if and only if crashkernel=X,low
> >> is specified. Different with x86_64, don't set low memory automatically.
> > 
> > Do you have any reason for the difference?  I'd expect we have same
> > logic if possible and remove some of the ifdefs.
> 
> In x86_64, if we reserve crashkernel above 4G, then we call 
> reserve_crashkernel_low()
> to reserve low memory.
> 
> In arm64, to simplify, we call reserve_crashkernel_low() at the beginning of 
> reserve_crashkernel()
> and then relax the arm64_dma32_phys_limit if reserve_crashkernel_low() 
> allocated something.
> In this case, if reserve crashkernel below 4G there will be 256M low memory 
> set automatically
> and this needs extra considerations.

Sorry that I did not read the old thread details and thought that is
arch dependent.  But rethink about that, it would be better that we can
have same semantic about crashkernel parameters across arches.  If we
make them different then it causes confusion, especially for
distributions.

OTOH, I thought if we reserve high memory then the low memory should be
needed.  There might be some exceptions, but I do not know the exact
one, can we make the behavior same, and special case those systems which
do not need low memory reservation.

> 
> previous discusses:
>   https://lkml.org/lkml/2019/6/5/670
>   https://lkml.org/lkml/2019/6/13/229

Another concern from James:
"
With both crashk_low_res and crashk_res, we end up with two entries in 
/proc/iomem called
"Crash kernel". Because its sorted by address, and kexec-tools stops searching 
when it
find "Crash kernel", you are always going to get the kernel placed in the lower 
portion.
"

The kexec-tools code is iterating all "Crash kernel" ranges and add them
in an array.  In X86 code, it uses the higher range to locate memory.

> 
> > 
> >>
> >> Reported-by: kbuild test robot 
> >> Signed-off-by: Chen Zhou 
> >> ---
> >>  arch/x86/kernel/setup.c| 62 -
> >>  include/linux/crash_core.h |  3 ++
> >>  include/linux/kexec.h  |  2 --
> >>  kernel/crash_core.c| 87 
> >> ++
> >>  kernel/kexec_core.c| 17 -
> >>  5 files changed, 96 insertions(+), 75 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> >> index cedfe20..5f38942 100644
> >> --- a/arch/x86/kernel/setup.c
> >> +++ b/arch/x86/kernel/setup.c
> >> @@ -486,59 +486,6 @@ static void __init 
> >> memblock_x86_reserve_range_setup_data(void)
> >>  # define CRASH_ADDR_HIGH_MAX  SZ_64T
> >>  #endif
> >>  
> >> -static int __init reserve_crashkernel_low(void)
> >> -{
> >> -#ifdef CONFIG_X86_64
> >> -  unsigned long long base, low_base = 0, low_size = 0;
> >> -  unsigned long total_low_mem;
> >> -  int ret;
> >> -
> >> -  total_low_mem = memblock_mem_size(1UL << (32 - PAGE_SHIFT));
> >> -
> >> -  /* crashkernel=Y,low */
> >> -  ret = parse_crashkernel_low(boot_command_line, total_low_mem, 
> >> &low_size, &base);
> >> -  if (ret) {
> >> -  /*
> >> -   * two parts from kernel/dma/swiotlb.c:
> >> -   * -swiotlb size: user-specified with swiotlb= or default.
> >> -   *
> >> -   * -swiotlb overflow buffer: now hardcoded to 32k. We round it
> >> -   * to 8M for other buffers that may need to stay low too. Also
> >> -   * make sure we allocate enough extra low memory so that we
> >> -   * don't run out of DMA buffers for 32-bit devices.
> >> -   */
> >> -  low_size = max(swiotlb_size_or_default() + (8UL << 20), 256UL 
> >> << 20);
> >> -  } else {
> >> -  /* passed with crashkernel=0,low ? */
> >> -  if (!low_size)
> >> -  return 0;
> >> -  }
> >> -
> >> -  low_base = memblock_find_in_range(0, 1ULL << 32, low_size, CRASH_ALIGN);
> >> -  if (!low_base) {
> >> -  pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
> >> smaller size.\n",
> >> - (unsigned long)(low_size >> 20));
> >> -  return -ENOMEM;
> >> -  }
> >> -
> >> -  ret = memblock_reserve(low_base, low_size);
> >> -  if (ret) {
> >> -  pr_err("%s: Error reserving crashkernel low memblock.\n", 
> >> __func__);
> >> -  return ret;
> >> -  }
> >> -
> >> -  pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (System 
> >> low RAM: %ldMB)\n",
> >> -  (unsigned long)(low_size >> 20),
> >> -  (unsigned long)(low_base >> 20),
> >> -  (unsigned long)(total_low_mem >> 20));
> >> -
> >> -  crashk_low_res.start = low_base;
> >> -  crashk_low_res.end   = low_base + low_size - 1;
> >> -  insert_resour

Re: [PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-28 Thread Dan Williams
On Tue, Dec 3, 2019 at 11:53 PM Dave Young  wrote:
>
> Michael Weiser reported he got below error during a kexec rebooting:
> esrt: Unsupported ESRT version 2904149718861218184.
>
> The ESRT memory stays in EFI boot services data, and it was reserved
> in kernel via efi_mem_reserve().  The initial purpose of the reservation
> is to reuse the EFI boot services data across kexec reboot. For example
> the BGRT image data and some ESRT memory like Michael reported.
>
> But although the memory is reserved it is not updated in X86 e820 table.
> And kexec_file_load iterate system ram in io resource list to find places
> for kernel, initramfs and other stuff. In Michael's case the kexec loaded
> initramfs overwritten the ESRT memory and then the failure happened.
>
> Since kexec_file_load depends on the e820 to be updated, just fix this
> by updating the reserved EFI boot services memory as reserved type in e820.
>
> Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
> bypassed in the reservation code path because they are assumed as reserved.
> But the reservation is still needed for multiple kexec reboot.
> And it is the only possible case we come here thus just drop the code
> chunk then everything works without side effects.
>
> On my machine the ESRT memory sits in an EFI runtime data range, it does
> not trigger the problem, but I successfully tested with BGRT instead.
> both kexec_load and kexec_file_load work and kdump works as well.
>
> Signed-off-by: Dave Young 
> ---
>  arch/x86/platform/efi/quirks.c |6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> +++ linux-x86/arch/x86/platform/efi/quirks.c
> @@ -260,10 +260,6 @@ void __init efi_arch_mem_reserve(phys_ad
> return;
> }
>
> -   /* No need to reserve regions that will never be freed. */
> -   if (md.attribute & EFI_MEMORY_RUNTIME)
> -   return;
> -
> size += addr % EFI_PAGE_SIZE;
> size = round_up(size, EFI_PAGE_SIZE);
> addr = round_down(addr, EFI_PAGE_SIZE);
> @@ -293,6 +289,8 @@ void __init efi_arch_mem_reserve(phys_ad
> early_memunmap(new, new_size);
>
> efi_memmap_install(new_phys, num_entries);
> +   e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> +   e820__update_table(e820_table);
>  }
>
>  /*
>

Bisect says this change (commit af1648984828) is triggering a
regression, likely not urgent, in my testing of the new efi_fake_mem=
facility to allow memory to be marked "soft reserved" via the kernel
command line (commit 199c84717612 x86/efi: Add efi_fake_mem support
for EFI_MEMORY_SP). The following command line triggers the crash
signature below:

efi_fake_mem=4G@9G:0x4,4G@13G:0x4

However, this command line works ok:

efi_fake_mem=8G@9G:0x4

So, something about multiple efi_fake_mem statements interacts badly
with this change. Nothing obvious occurs to me at the moment, I'll
keep debugging, but wanted to highlight this in the meantime in case
someone else sees a deeper issue or the root cause.

BUG: unable to handle page fault for address: ff281000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 188615067 P4D 188615067 PUD 188617067 PMD 188e4d067 PTE 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0+ #154
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:efi_memmap_insert+0xed/0x14b
Code: 48 89 48 18 49 39 d9 76 67 49 39 d1 73 62 4c 89 c9 48 2b 48 08
4c 89 c6 48 c1 e9 0c 48 89 48 18 49 8b 4a 28 48 01 c8 48 89 c7  a4
49 39 d3 73 2c 4c 89 48 08 4c 29 da 4c 89 c6 4c 89 68 18 48
RSP: :b7603d70 EFLAGS: 00010086
RAX: ff280ff0 RBX:  RCX: 0020
RDX:  RSI: ff200fe0 RDI: ff281000
RBP: bea2d000 R08: ff200fd0 R09: bea06000
R10: b77e1718 R11: bea2cfff R12: 800f
R13: 0027 R14: 415fa001 R15: 0ab0
FS:  () GS:b7c31000() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ff281000 CR3: 00018861 CR4: 000606b0
Call Trace:
 ? efi_arch_mem_reserve+0x149/0x1a6
 ? bgrt_init+0xbe/0xbe
 ? bgrt_init+0xbe/0xbe
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0
Modules linked in:
CR2: ff281000
random: get_random_bytes called from print_oops_end_marker+0x26/0x40
with crng_init=0
---[ end trace 2acc14aa0990ee9d ]---
RIP: 0010:efi_memmap_insert+0xed/0x14b
Code: 48 89 48 18 49 39 d9 76 67 49 39 d1 73 62 4c 89 c9 48 2b 48 08
4c 89 c6 48 c1 e9 0c 48 89 48 18 49 8b 4a 28 48 01 c8 48 89 c7  a4
49 39 d3 73 2c 4c 89 48 08 4c 29 da 4c

Re: [PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-28 Thread Dan Williams
On Sat, Dec 28, 2019 at 12:54 PM Dan Williams
 wrote:
>
> On Tue, Dec 3, 2019 at 11:53 PM Dave Young  wrote:
> >
> > Michael Weiser reported he got below error during a kexec rebooting:
> > esrt: Unsupported ESRT version 2904149718861218184.
> >
> > The ESRT memory stays in EFI boot services data, and it was reserved
> > in kernel via efi_mem_reserve().  The initial purpose of the reservation
> > is to reuse the EFI boot services data across kexec reboot. For example
> > the BGRT image data and some ESRT memory like Michael reported.
> >
> > But although the memory is reserved it is not updated in X86 e820 table.
> > And kexec_file_load iterate system ram in io resource list to find places
> > for kernel, initramfs and other stuff. In Michael's case the kexec loaded
> > initramfs overwritten the ESRT memory and then the failure happened.
> >
> > Since kexec_file_load depends on the e820 to be updated, just fix this
> > by updating the reserved EFI boot services memory as reserved type in e820.
> >
> > Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
> > bypassed in the reservation code path because they are assumed as reserved.
> > But the reservation is still needed for multiple kexec reboot.
> > And it is the only possible case we come here thus just drop the code
> > chunk then everything works without side effects.
> >
> > On my machine the ESRT memory sits in an EFI runtime data range, it does
> > not trigger the problem, but I successfully tested with BGRT instead.
> > both kexec_load and kexec_file_load work and kdump works as well.
> >
> > Signed-off-by: Dave Young 
> > ---
> >  arch/x86/platform/efi/quirks.c |6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> > +++ linux-x86/arch/x86/platform/efi/quirks.c
> > @@ -260,10 +260,6 @@ void __init efi_arch_mem_reserve(phys_ad
> > return;
> > }
> >
> > -   /* No need to reserve regions that will never be freed. */
> > -   if (md.attribute & EFI_MEMORY_RUNTIME)
> > -   return;
> > -
> > size += addr % EFI_PAGE_SIZE;
> > size = round_up(size, EFI_PAGE_SIZE);
> > addr = round_down(addr, EFI_PAGE_SIZE);
> > @@ -293,6 +289,8 @@ void __init efi_arch_mem_reserve(phys_ad
> > early_memunmap(new, new_size);
> >
> > efi_memmap_install(new_phys, num_entries);
> > +   e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > +   e820__update_table(e820_table);
> >  }
> >
> >  /*
> >
>
> Bisect says this change (commit af1648984828) is triggering a
> regression, likely not urgent, in my testing of the new efi_fake_mem=
> facility to allow memory to be marked "soft reserved" via the kernel
> command line (commit 199c84717612 x86/efi: Add efi_fake_mem support
> for EFI_MEMORY_SP). The following command line triggers the crash
> signature below:
>
> efi_fake_mem=4G@9G:0x4,4G@13G:0x4
>
> However, this command line works ok:
>
> efi_fake_mem=8G@9G:0x4
>
> So, something about multiple efi_fake_mem statements interacts badly
> with this change. Nothing obvious occurs to me at the moment, I'll
> keep debugging, but wanted to highlight this in the meantime in case
> someone else sees a deeper issue or the root cause.

Still looking, but this failure does not seem to be specific to the
"soft reservation" changes. Any update to the efi memmap that pushes
it over a page boundary triggers this failure. I.e. I can fix the
problem by over-allocating the efi memmap and then page aligning the
result. __early_ioremap "should" be handling this case, but it appears
something else is messing this up.

>
> BUG: unable to handle page fault for address: ff281000
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 188615067 P4D 188615067 PUD 188617067 PMD 188e4d067 PTE 0
> Oops: 0002 [#1] SMP PTI
> CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0+ #154
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
> RIP: 0010:efi_memmap_insert+0xed/0x14b
> Code: 48 89 48 18 49 39 d9 76 67 49 39 d1 73 62 4c 89 c9 48 2b 48 08
> 4c 89 c6 48 c1 e9 0c 48 89 48 18 49 8b 4a 28 48 01 c8 48 89 c7  a4
> 49 39 d3 73 2c 4c 89 48 08 4c 29 da 4c 89 c6 4c 89 68 18 48
> RSP: :b7603d70 EFLAGS: 00010086
> RAX: ff280ff0 RBX:  RCX: 0020
> RDX:  RSI: ff200fe0 RDI: ff281000
> RBP: bea2d000 R08: ff200fd0 R09: bea06000
> R10: b77e1718 R11: bea2cfff R12: 800f
> R13: 0027 R14: 415fa001 R15: 0ab0
> FS:  () GS:b7c31000() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: ff281000 CR3: 00018861 CR4: 000606b0
> Call Trace:
>  ? efi_arch_mem_reserve+0x149/0x1a6
>  ? bgrt_init+0xbe/0xbe
>  ? bg