Re: [PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-28 Thread Dan Williams
On Tue, Dec 3, 2019 at 11:53 PM Dave Young  wrote:
>
> Michael Weiser reported he got below error during a kexec rebooting:
> esrt: Unsupported ESRT version 2904149718861218184.
>
> The ESRT memory stays in EFI boot services data, and it was reserved
> in kernel via efi_mem_reserve().  The initial purpose of the reservation
> is to reuse the EFI boot services data across kexec reboot. For example
> the BGRT image data and some ESRT memory like Michael reported.
>
> But although the memory is reserved it is not updated in X86 e820 table.
> And kexec_file_load iterate system ram in io resource list to find places
> for kernel, initramfs and other stuff. In Michael's case the kexec loaded
> initramfs overwritten the ESRT memory and then the failure happened.
>
> Since kexec_file_load depends on the e820 to be updated, just fix this
> by updating the reserved EFI boot services memory as reserved type in e820.
>
> Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
> bypassed in the reservation code path because they are assumed as reserved.
> But the reservation is still needed for multiple kexec reboot.
> And it is the only possible case we come here thus just drop the code
> chunk then everything works without side effects.
>
> On my machine the ESRT memory sits in an EFI runtime data range, it does
> not trigger the problem, but I successfully tested with BGRT instead.
> both kexec_load and kexec_file_load work and kdump works as well.
>
> Signed-off-by: Dave Young 
> ---
>  arch/x86/platform/efi/quirks.c |6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> +++ linux-x86/arch/x86/platform/efi/quirks.c
> @@ -260,10 +260,6 @@ void __init efi_arch_mem_reserve(phys_ad
> return;
> }
>
> -   /* No need to reserve regions that will never be freed. */
> -   if (md.attribute & EFI_MEMORY_RUNTIME)
> -   return;
> -
> size += addr % EFI_PAGE_SIZE;
> size = round_up(size, EFI_PAGE_SIZE);
> addr = round_down(addr, EFI_PAGE_SIZE);
> @@ -293,6 +289,8 @@ void __init efi_arch_mem_reserve(phys_ad
> early_memunmap(new, new_size);
>
> efi_memmap_install(new_phys, num_entries);
> +   e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> +   e820__update_table(e820_table);
>  }
>
>  /*
>

Bisect says this change (commit af1648984828) is triggering a
regression, likely not urgent, in my testing of the new efi_fake_mem=
facility to allow memory to be marked "soft reserved" via the kernel
command line (commit 199c84717612 x86/efi: Add efi_fake_mem support
for EFI_MEMORY_SP). The following command line triggers the crash
signature below:

efi_fake_mem=4G@9G:0x4,4G@13G:0x4

However, this command line works ok:

efi_fake_mem=8G@9G:0x4

So, something about multiple efi_fake_mem statements interacts badly
with this change. Nothing obvious occurs to me at the moment, I'll
keep debugging, but wanted to highlight this in the meantime in case
someone else sees a deeper issue or the root cause.

BUG: unable to handle page fault for address: ff281000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 188615067 P4D 188615067 PUD 188617067 PMD 188e4d067 PTE 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0+ #154
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:efi_memmap_insert+0xed/0x14b
Code: 48 89 48 18 49 39 d9 76 67 49 39 d1 73 62 4c 89 c9 48 2b 48 08
4c 89 c6 48 c1 e9 0c 48 89 48 18 49 8b 4a 28 48 01 c8 48 89 c7  a4
49 39 d3 73 2c 4c 89 48 08 4c 29 da 4c 89 c6 4c 89 68 18 48
RSP: :b7603d70 EFLAGS: 00010086
RAX: ff280ff0 RBX:  RCX: 0020
RDX:  RSI: ff200fe0 RDI: ff281000
RBP: bea2d000 R08: ff200fd0 R09: bea06000
R10: b77e1718 R11: bea2cfff R12: 800f
R13: 0027 R14: 415fa001 R15: 0ab0
FS:  () GS:b7c31000() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ff281000 CR3: 00018861 CR4: 000606b0
Call Trace:
 ? efi_arch_mem_reserve+0x149/0x1a6
 ? bgrt_init+0xbe/0xbe
 ? bgrt_init+0xbe/0xbe
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0
Modules linked in:
CR2: ff281000
random: get_random_bytes called from print_oops_end_marker+0x26/0x40
with crng_init=0
---[ end trace 2acc14aa0990ee9d ]---
RIP: 0010:efi_memmap_insert+0xed/0x14b
Code: 48 89 48 18 49 39 d9 76 67 49 39 d1 73 62 4c 89 c9 48 2b 48 08
4c 89 c6 48 c1 e9 0c 48 89 48 18 49 8b 4a 28 48 01 c8 48 89 c7  a4
49 39 d3 73 2c 4c 89 48 08 4c 29 da 4c

Re: [PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-28 Thread Dan Williams
On Sat, Dec 28, 2019 at 12:54 PM Dan Williams
 wrote:
>
> On Tue, Dec 3, 2019 at 11:53 PM Dave Young  wrote:
> >
> > Michael Weiser reported he got below error during a kexec rebooting:
> > esrt: Unsupported ESRT version 2904149718861218184.
> >
> > The ESRT memory stays in EFI boot services data, and it was reserved
> > in kernel via efi_mem_reserve().  The initial purpose of the reservation
> > is to reuse the EFI boot services data across kexec reboot. For example
> > the BGRT image data and some ESRT memory like Michael reported.
> >
> > But although the memory is reserved it is not updated in X86 e820 table.
> > And kexec_file_load iterate system ram in io resource list to find places
> > for kernel, initramfs and other stuff. In Michael's case the kexec loaded
> > initramfs overwritten the ESRT memory and then the failure happened.
> >
> > Since kexec_file_load depends on the e820 to be updated, just fix this
> > by updating the reserved EFI boot services memory as reserved type in e820.
> >
> > Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
> > bypassed in the reservation code path because they are assumed as reserved.
> > But the reservation is still needed for multiple kexec reboot.
> > And it is the only possible case we come here thus just drop the code
> > chunk then everything works without side effects.
> >
> > On my machine the ESRT memory sits in an EFI runtime data range, it does
> > not trigger the problem, but I successfully tested with BGRT instead.
> > both kexec_load and kexec_file_load work and kdump works as well.
> >
> > Signed-off-by: Dave Young 
> > ---
> >  arch/x86/platform/efi/quirks.c |6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> > +++ linux-x86/arch/x86/platform/efi/quirks.c
> > @@ -260,10 +260,6 @@ void __init efi_arch_mem_reserve(phys_ad
> > return;
> > }
> >
> > -   /* No need to reserve regions that will never be freed. */
> > -   if (md.attribute & EFI_MEMORY_RUNTIME)
> > -   return;
> > -
> > size += addr % EFI_PAGE_SIZE;
> > size = round_up(size, EFI_PAGE_SIZE);
> > addr = round_down(addr, EFI_PAGE_SIZE);
> > @@ -293,6 +289,8 @@ void __init efi_arch_mem_reserve(phys_ad
> > early_memunmap(new, new_size);
> >
> > efi_memmap_install(new_phys, num_entries);
> > +   e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > +   e820__update_table(e820_table);
> >  }
> >
> >  /*
> >
>
> Bisect says this change (commit af1648984828) is triggering a
> regression, likely not urgent, in my testing of the new efi_fake_mem=
> facility to allow memory to be marked "soft reserved" via the kernel
> command line (commit 199c84717612 x86/efi: Add efi_fake_mem support
> for EFI_MEMORY_SP). The following command line triggers the crash
> signature below:
>
> efi_fake_mem=4G@9G:0x4,4G@13G:0x4
>
> However, this command line works ok:
>
> efi_fake_mem=8G@9G:0x4
>
> So, something about multiple efi_fake_mem statements interacts badly
> with this change. Nothing obvious occurs to me at the moment, I'll
> keep debugging, but wanted to highlight this in the meantime in case
> someone else sees a deeper issue or the root cause.

Still looking, but this failure does not seem to be specific to the
"soft reservation" changes. Any update to the efi memmap that pushes
it over a page boundary triggers this failure. I.e. I can fix the
problem by over-allocating the efi memmap and then page aligning the
result. __early_ioremap "should" be handling this case, but it appears
something else is messing this up.

>
> BUG: unable to handle page fault for address: ff281000
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 188615067 P4D 188615067 PUD 188617067 PMD 188e4d067 PTE 0
> Oops: 0002 [#1] SMP PTI
> CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0+ #154
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
> RIP: 0010:efi_memmap_insert+0xed/0x14b
> Code: 48 89 48 18 49 39 d9 76 67 49 39 d1 73 62 4c 89 c9 48 2b 48 08
> 4c 89 c6 48 c1 e9 0c 48 89 48 18 49 8b 4a 28 48 01 c8 48 89 c7  a4
> 49 39 d3 73 2c 4c 89 48 08 4c 29 da 4c 89 c6 4c 89 68 18 48
> RSP: :b7603d70 EFLAGS: 00010086
> RAX: ff280ff0 RBX:  RCX: 0020
> RDX:  RSI: ff200fe0 RDI: ff281000

Re: [PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-30 Thread Dan Williams
On Sat, Dec 28, 2019 at 10:13 PM Dan Williams  wrote:
>
> On Sat, Dec 28, 2019 at 12:54 PM Dan Williams
>  wrote:
> >
> > On Tue, Dec 3, 2019 at 11:53 PM Dave Young  wrote:
> > >
> > > Michael Weiser reported he got below error during a kexec rebooting:
> > > esrt: Unsupported ESRT version 2904149718861218184.
> > >
> > > The ESRT memory stays in EFI boot services data, and it was reserved
> > > in kernel via efi_mem_reserve().  The initial purpose of the reservation
> > > is to reuse the EFI boot services data across kexec reboot. For example
> > > the BGRT image data and some ESRT memory like Michael reported.
> > >
> > > But although the memory is reserved it is not updated in X86 e820 table.
> > > And kexec_file_load iterate system ram in io resource list to find places
> > > for kernel, initramfs and other stuff. In Michael's case the kexec loaded
> > > initramfs overwritten the ESRT memory and then the failure happened.
> > >
> > > Since kexec_file_load depends on the e820 to be updated, just fix this
> > > by updating the reserved EFI boot services memory as reserved type in 
> > > e820.
> > >
> > > Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
> > > bypassed in the reservation code path because they are assumed as 
> > > reserved.
> > > But the reservation is still needed for multiple kexec reboot.
> > > And it is the only possible case we come here thus just drop the code
> > > chunk then everything works without side effects.
> > >
> > > On my machine the ESRT memory sits in an EFI runtime data range, it does
> > > not trigger the problem, but I successfully tested with BGRT instead.
> > > both kexec_load and kexec_file_load work and kdump works as well.
> > >
> > > Signed-off-by: Dave Young 
> > > ---
> > >  arch/x86/platform/efi/quirks.c |6 ++
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > >
> > > --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> > > +++ linux-x86/arch/x86/platform/efi/quirks.c
> > > @@ -260,10 +260,6 @@ void __init efi_arch_mem_reserve(phys_ad
> > > return;
> > > }
> > >
> > > -   /* No need to reserve regions that will never be freed. */
> > > -   if (md.attribute & EFI_MEMORY_RUNTIME)
> > > -   return;
> > > -
> > > size += addr % EFI_PAGE_SIZE;
> > > size = round_up(size, EFI_PAGE_SIZE);
> > > addr = round_down(addr, EFI_PAGE_SIZE);
> > > @@ -293,6 +289,8 @@ void __init efi_arch_mem_reserve(phys_ad
> > > early_memunmap(new, new_size);
> > >
> > > efi_memmap_install(new_phys, num_entries);
> > > +   e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > > +   e820__update_table(e820_table);
> > >  }
> > >
> > >  /*
> > >
> >
> > Bisect says this change (commit af1648984828) is triggering a
> > regression, likely not urgent, in my testing of the new efi_fake_mem=
> > facility to allow memory to be marked "soft reserved" via the kernel
> > command line (commit 199c84717612 x86/efi: Add efi_fake_mem support
> > for EFI_MEMORY_SP). The following command line triggers the crash
> > signature below:
> >
> > efi_fake_mem=4G@9G:0x4,4G@13G:0x4
> >
> > However, this command line works ok:
> >
> > efi_fake_mem=8G@9G:0x4
> >
> > So, something about multiple efi_fake_mem statements interacts badly
> > with this change. Nothing obvious occurs to me at the moment, I'll
> > keep debugging, but wanted to highlight this in the meantime in case
> > someone else sees a deeper issue or the root cause.
>
> Still looking, but this failure does not seem to be specific to the
> "soft reservation" changes. Any update to the efi memmap that pushes
> it over a page boundary triggers this failure. I.e. I can fix the
> problem by over-allocating the efi memmap and then page aligning the
> result. __early_ioremap "should" be handling this case, but it appears
> something else is messing this up.

Found it. Neither this patch nor the soft reservation changes are at
fault, they are just helping to trigger a long standing bug in
efi_fake_memmap(). Its usage of efi_memmap_split_count() can over
count the number of splits needed for new entries. Consider the case
of 2 contiguous fake entries intersecting the end of a single entry.
The first call to efi_memmap_split_count() determines the resulting
split will be (old1, new1, old2), the second call determines (old1,
new2). The result is 2 splits when only 1 is needed to get a result of
(old1, new1, new2) and the new map ends up with an empty entry.
efi_memmap_install() interprets an empty entry as start = 0 end =
0x and attempts an extra split / copy past the end of
the new map.

I'll send a patch to fix up efi_fake_memmap().

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] efi: Fix handling of multiple contiguous efi_fake_mem= entries

2019-12-30 Thread Dan Williams
A recent test of efi_fake_mem=4G@9G:0x4,4G@13G:0x4 crashed with
a signature of:

BUG: unable to handle page fault for address: ff281000
[..]
RIP: 0010:efi_memmap_insert+0x11d/0x191
[..]
Call Trace:
 ? bgrt_init+0xbe/0xbe
 ? efi_arch_mem_reserve+0x1cb/0x228
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0

efi_memmap_insert() is attempting to insert entries past the end of the
new map. This condition is setup by efi_fake_mem() leaking empty entries
to the end of memory map, and then efi_arch_mem_reserve() trips over the
bad entry when attempting an additional efi_memmap_insert(). The empty
entry causes efi_memmap_insert() to attempt more memmap splits / copies
than efi_memmap_split_count() accounted for when sizing the new map.

Update efi_fake_memmap() to cleanup lagging empty entries.

This change is related to commit af1648984828 "x86/efi: Update e820 with
reserved EFI boot services data to fix kexec breakage" since that
introduces more occurrences where efi_memmap_insert() is invoked after
an efi_fake_mem= configuration has been parsed. Previously the side
effects of vestigial empty entries were benign, but with commit
af1648984828 that follow-on efi_memmap_insert() invocation triggers the
above crash signature.

Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot services...")
Cc: Taku Izumi 
Cc: Michael Weiser 
Cc: Dave Young 
Cc: Ard Biesheuvel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/fake_mem.c |   22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index bb9fc70d0cfa..6df51ba93ae8 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -67,13 +67,33 @@ void __init efi_fake_memmap(void)
return;
}
 
+   memset(new_memmap, 0, efi.memmap.desc_size * new_nr_map);
for (i = 0; i < nr_fake_mem; i++)
efi_memmap_insert(&efi.memmap, new_memmap, &efi_fake_mems[i]);
 
+   /*
+* efi_memmap_split_count() may over count the number of
+* required splits in the case when contiguous fake entries are
+* specified. Check that all new_nr_map entries were consumed.
+*/
+   for (i = new_nr_map; i > 0; i--) {
+   efi_memory_desc_t *md;
+   u64 start, end;
+
+   md = new_memmap + efi.memmap.desc_size * (new_nr_map - i - 1);
+   end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
+   start = md->phys_addr;
+
+   if (start == 0 && end + 1 == 0)
+   continue;
+   break;
+   }
+
/* swap into new EFI memmap */
early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
 
-   efi_memmap_install(new_memmap_phy, new_nr_map);
+   /* install only the valid entries */
+   efi_memmap_install(new_memmap_phy, i);
 
/* print new EFI memmap */
efi_print_memmap();


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-30 Thread Dan Williams
On Mon, Dec 30, 2019 at 1:42 AM Dan Williams  wrote:
[..]
> I'll send a patch to fix up efi_fake_memmap().

For others following this thread, that patch is here:

http://lore.kernel.org/r/157773590338.4153451.5898675419563883883.st...@dwillia2-desk3.amr.corp.intel.com

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] efi: Fix handling of multiple contiguous efi_fake_mem= entries

2019-12-31 Thread Dan Williams
On Mon, Dec 30, 2019 at 5:46 PM Dave Young  wrote:
>
> Hi Dan,
> On 12/30/19 at 11:58am, Dan Williams wrote:
> > A recent test of efi_fake_mem=4G@9G:0x4,4G@13G:0x4 crashed with
> > a signature of:
> >
> > BUG: unable to handle page fault for address: ff281000
> > [..]
> > RIP: 0010:efi_memmap_insert+0x11d/0x191
> > [..]
> > Call Trace:
> >  ? bgrt_init+0xbe/0xbe
> >  ? efi_arch_mem_reserve+0x1cb/0x228
> >  ? acpi_parse_bgrt+0xa/0xd
> >  ? acpi_table_parse+0x86/0xb8
> >  ? acpi_boot_init+0x494/0x4e3
> >  ? acpi_parse_x2apic+0x87/0x87
> >  ? setup_acpi_sci+0xa2/0xa2
> >  ? setup_arch+0x8db/0x9e1
> >  ? start_kernel+0x6a/0x547
> >  ? secondary_startup_64+0xb6/0xc0
> >
> > efi_memmap_insert() is attempting to insert entries past the end of the
> > new map. This condition is setup by efi_fake_mem() leaking empty entries
> > to the end of memory map, and then efi_arch_mem_reserve() trips over the
> > bad entry when attempting an additional efi_memmap_insert(). The empty
> > entry causes efi_memmap_insert() to attempt more memmap splits / copies
> > than efi_memmap_split_count() accounted for when sizing the new map.
> >
> > Update efi_fake_memmap() to cleanup lagging empty entries.
> >
> > This change is related to commit af1648984828 "x86/efi: Update e820 with
> > reserved EFI boot services data to fix kexec breakage" since that
> > introduces more occurrences where efi_memmap_insert() is invoked after
> > an efi_fake_mem= configuration has been parsed. Previously the side
> > effects of vestigial empty entries were benign, but with commit
> > af1648984828 that follow-on efi_memmap_insert() invocation triggers the
> > above crash signature.
> >
> > Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
> > Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot 
> > services...")
> > Cc: Taku Izumi 
> > Cc: Michael Weiser 
> > Cc: Dave Young 
> > Cc: Ard Biesheuvel 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/firmware/efi/fake_mem.c |   22 +-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/firmware/efi/fake_mem.c 
> > b/drivers/firmware/efi/fake_mem.c
> > index bb9fc70d0cfa..6df51ba93ae8 100644
> > --- a/drivers/firmware/efi/fake_mem.c
> > +++ b/drivers/firmware/efi/fake_mem.c
> > @@ -67,13 +67,33 @@ void __init efi_fake_memmap(void)
> >   return;
> >   }
> >
> > + memset(new_memmap, 0, efi.memmap.desc_size * new_nr_map);
> >   for (i = 0; i < nr_fake_mem; i++)
> >   efi_memmap_insert(&efi.memmap, new_memmap, &efi_fake_mems[i]);
> >
> > + /*
> > +  * efi_memmap_split_count() may over count the number of
> > +  * required splits in the case when contiguous fake entries are
> > +  * specified. Check that all new_nr_map entries were consumed.
> > +  */
> > + for (i = new_nr_map; i > 0; i--) {
> > + efi_memory_desc_t *md;
> > + u64 start, end;
> > +
> > + md = new_memmap + efi.memmap.desc_size * (new_nr_map - i - 1);
> > + end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
> > + start = md->phys_addr;
> > +
> > + if (start == 0 && end + 1 == 0)
> > + continue;
> > + break;
> > + }
> > +
> >   /* swap into new EFI memmap */
> >   early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
> >
> > - efi_memmap_install(new_memmap_phy, new_nr_map);
> > + /* install only the valid entries */
> > + efi_memmap_install(new_memmap_phy, i);
> >
> >   /* print new EFI memmap */
> >   efi_print_memmap();
> >
>
> Although kernel bootup works with this patch, it still does not fix the
> issue I noticed, you can see:
> [root@localhost ~]# cat /proc/cmdline
> BOOT_IMAGE=/bzImage root=/dev/vda3 ro audit=0 selinux=0 crashkernel=160M 
> efi=debug console=ttyS0 console=tty0 3 
> efi_fake_mem=200M@5G:0x4,300M@5600M:0x4 earlyprintk=serial
> [root@localhost ~]# dmesg|grep fake_mem
> [0.00] Command line: BOOT_IMAGE=/bzImage root=/dev/vda3 ro audit=0 
> selinux=0 crashkernel=160M efi=debug console=ttyS0 console=tty0 3 
> efi_fake_mem=200M@5G:0x4,300M@5600M

[PATCH v2 1/4] efi: Add a flags parameter to efi_memory_map

2019-12-31 Thread Dan Williams
In preparation for garbage collecting dynamically allocated efi memory
maps, where the allocation method of memblock vs slab needs to be
recalled, convert the existing 'late' flag into a 'flags' bitmask.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/memmap.c |   24 +---
 include/linux/efi.h   |3 ++-
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 38b686c67b17..813674ef9000 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -52,7 +52,7 @@ phys_addr_t __init efi_memmap_alloc(unsigned int num_entries)
 /**
  * __efi_memmap_init - Common code for mapping the EFI memory map
  * @data: EFI memory map data
- * @late: Use early or late mapping function?
+ * @flags: Use early or late mapping function?
  *
  * This function takes care of figuring out which function to use to
  * map the EFI memory map in efi.memmap based on how far into the boot
@@ -66,9 +66,9 @@ phys_addr_t __init efi_memmap_alloc(unsigned int num_entries)
  * Returns zero on success, a negative error code on failure.
  */
 static int __init
-__efi_memmap_init(struct efi_memory_map_data *data, bool late)
+__efi_memmap_init(struct efi_memory_map_data *data, unsigned long flags)
 {
-   struct efi_memory_map map;
+   struct efi_memory_map map = { 0 };
phys_addr_t phys_map;
 
if (efi_enabled(EFI_PARAVIRT))
@@ -76,7 +76,7 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool late)
 
phys_map = data->phys_map;
 
-   if (late)
+   if (flags & EFI_MEMMAP_LATE)
map.map = memremap(phys_map, data->size, MEMREMAP_WB);
else
map.map = early_memremap(phys_map, data->size);
@@ -92,7 +92,7 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool late)
 
map.desc_version = data->desc_version;
map.desc_size = data->desc_size;
-   map.late = late;
+   map.flags |= flags;
 
set_bit(EFI_MEMMAP, &efi.flags);
 
@@ -111,9 +111,9 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool 
late)
 int __init efi_memmap_init_early(struct efi_memory_map_data *data)
 {
/* Cannot go backwards */
-   WARN_ON(efi.memmap.late);
+   WARN_ON(efi.memmap.flags & EFI_MEMMAP_LATE);
 
-   return __efi_memmap_init(data, false);
+   return __efi_memmap_init(data, 0);
 }
 
 void __init efi_memmap_unmap(void)
@@ -121,7 +121,7 @@ void __init efi_memmap_unmap(void)
if (!efi_enabled(EFI_MEMMAP))
return;
 
-   if (!efi.memmap.late) {
+   if (!(efi.memmap.flags & EFI_MEMMAP_LATE)) {
unsigned long size;
 
size = efi.memmap.desc_size * efi.memmap.nr_map;
@@ -168,7 +168,7 @@ int __init efi_memmap_init_late(phys_addr_t addr, unsigned 
long size)
WARN_ON(efi.memmap.map);
 
/* Were we already called? */
-   WARN_ON(efi.memmap.late);
+   WARN_ON(efi.memmap.flags & EFI_MEMMAP_LATE);
 
/*
 * It makes no sense to allow callers to register different
@@ -178,7 +178,7 @@ int __init efi_memmap_init_late(phys_addr_t addr, unsigned 
long size)
data.desc_version = efi.memmap.desc_version;
data.desc_size = efi.memmap.desc_size;
 
-   return __efi_memmap_init(&data, true);
+   return __efi_memmap_init(&data, EFI_MEMMAP_LATE);
 }
 
 /**
@@ -195,6 +195,7 @@ int __init efi_memmap_init_late(phys_addr_t addr, unsigned 
long size)
 int __init efi_memmap_install(phys_addr_t addr, unsigned int nr_map)
 {
struct efi_memory_map_data data;
+   unsigned long flags;
 
efi_memmap_unmap();
 
@@ -202,8 +203,9 @@ int __init efi_memmap_install(phys_addr_t addr, unsigned 
int nr_map)
data.size = efi.memmap.desc_size * nr_map;
data.desc_version = efi.memmap.desc_version;
data.desc_size = efi.memmap.desc_size;
+   flags = efi.memmap.flags & EFI_MEMMAP_LATE;
 
-   return __efi_memmap_init(&data, efi.memmap.late);
+   return __efi_memmap_init(&data, flags);
 }
 
 /**
diff --git a/include/linux/efi.h b/include/linux/efi.h
index aa54586db7a5..b8e930f5ff77 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -795,7 +795,8 @@ struct efi_memory_map {
int nr_map;
unsigned long desc_version;
unsigned long desc_size;
-   bool late;
+#define EFI_MEMMAP_LATE (1UL << 0)
+   unsigned long flags;
 };
 
 struct efi_mem_range {


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 2/4] efi: Add tracking for dynamically allocated memmaps

2019-12-31 Thread Dan Williams
In preparation for fixing efi_memmap_alloc() leaks, add support for
recording whether the memmap was dynamically allocated from slab,
memblock, or is the original physical memmap provided by the platform.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 arch/x86/platform/efi/efi.c |2 +-
 arch/x86/platform/efi/quirks.c  |   11 ++-
 drivers/firmware/efi/fake_mem.c |5 +++--
 drivers/firmware/efi/memmap.c   |   16 ++--
 include/linux/efi.h |8 ++--
 5 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 38d44f36d5ed..7086afbb84fd 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -333,7 +333,7 @@ static void __init efi_clean_memmap(void)
u64 size = efi.memmap.nr_map - n_removal;
 
pr_warn("Removing %d invalid memory map entries.\n", n_removal);
-   efi_memmap_install(efi.memmap.phys_map, size);
+   efi_memmap_install(efi.memmap.phys_map, size, 0);
}
 }
 
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f8f0220b6a66..4a71c790f9c3 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -244,6 +244,7 @@ EXPORT_SYMBOL_GPL(efi_query_variable_store);
 void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 {
phys_addr_t new_phys, new_size;
+   unsigned long flags = 0;
struct efi_mem_range mr;
efi_memory_desc_t md;
int num_entries;
@@ -272,8 +273,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
num_entries += efi.memmap.nr_map;
 
new_size = efi.memmap.desc_size * num_entries;
-
-   new_phys = efi_memmap_alloc(num_entries);
+   new_phys = efi_memmap_alloc(num_entries, &flags);
if (!new_phys) {
pr_err("Could not allocate boot services memmap\n");
return;
@@ -288,7 +288,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
efi_memmap_insert(&efi.memmap, new, &mr);
early_memunmap(new, new_size);
 
-   efi_memmap_install(new_phys, num_entries);
+   efi_memmap_install(new_phys, num_entries, flags);
e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
e820__update_table(e820_table);
 }
@@ -408,6 +408,7 @@ static void __init efi_unmap_pages(efi_memory_desc_t *md)
 void __init efi_free_boot_services(void)
 {
phys_addr_t new_phys, new_size;
+   unsigned long flags = 0;
efi_memory_desc_t *md;
int num_entries = 0;
void *new, *new_md;
@@ -463,7 +464,7 @@ void __init efi_free_boot_services(void)
return;
 
new_size = efi.memmap.desc_size * num_entries;
-   new_phys = efi_memmap_alloc(num_entries);
+   new_phys = efi_memmap_alloc(num_entries, &flags);
if (!new_phys) {
pr_err("Failed to allocate new EFI memmap\n");
return;
@@ -493,7 +494,7 @@ void __init efi_free_boot_services(void)
 
memunmap(new);
 
-   if (efi_memmap_install(new_phys, num_entries)) {
+   if (efi_memmap_install(new_phys, num_entries, flags)) {
pr_err("Could not install new EFI memmap\n");
return;
}
diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index bb9fc70d0cfa..7e53e5520548 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -39,6 +39,7 @@ void __init efi_fake_memmap(void)
int new_nr_map = efi.memmap.nr_map;
efi_memory_desc_t *md;
phys_addr_t new_memmap_phy;
+   unsigned long flags = 0;
void *new_memmap;
int i;
 
@@ -55,7 +56,7 @@ void __init efi_fake_memmap(void)
}
 
/* allocate memory for new EFI memmap */
-   new_memmap_phy = efi_memmap_alloc(new_nr_map);
+   new_memmap_phy = efi_memmap_alloc(new_nr_map, &flags);
if (!new_memmap_phy)
return;
 
@@ -73,7 +74,7 @@ void __init efi_fake_memmap(void)
/* swap into new EFI memmap */
early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
 
-   efi_memmap_install(new_memmap_phy, new_nr_map);
+   efi_memmap_install(new_memmap_phy, new_nr_map, flags);
 
/* print new EFI memmap */
efi_print_memmap();
diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 813674ef9000..2b81ee6858a9 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -32,6 +32,7 @@ static phys_addr_t __init __efi_memmap_alloc_late(unsigned 
long size)
 /**
  * efi_memmap_alloc - Allocate memory for the EFI memory map
  * @num_entries: Number of entries in the allocated map.
+ * @flags: Late map, memblock alloc, slab alloc flags
  *
  * Depending on whether mm_init() has already been invoked or 

[PATCH v2 3/4] efi: Fix efi_memmap_alloc() leaks

2019-12-31 Thread Dan Williams
With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
updated and replaced multiple times. When that happens a previous
dynamically allocated efi memory map can be garbage collected. Use the
new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
allocated memory map is being replaced.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/memmap.c |   24 
 1 file changed, 24 insertions(+)

diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 2b81ee6858a9..188ab3cd5c52 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -29,6 +29,28 @@ static phys_addr_t __init __efi_memmap_alloc_late(unsigned 
long size)
return PFN_PHYS(page_to_pfn(p));
 }
 
+static void __init __efi_memmap_free(u64 phys, unsigned long size, unsigned 
long flags)
+{
+   if (WARN_ON(slab_is_available() && (flags & EFI_MEMMAP_MEMBLOCK)))
+   return;
+
+   if (flags & EFI_MEMMAP_MEMBLOCK) {
+   memblock_free(phys, size);
+   } else if (flags & EFI_MEMMAP_SLAB) {
+   struct page *p = pfn_to_page(PHYS_PFN(phys));
+   unsigned int order = get_order(size);
+
+   free_pages((unsigned long) page_address(p), order);
+   }
+}
+
+static void __init efi_memmap_free(void)
+{
+   __efi_memmap_free(efi.memmap.phys_map,
+   efi.memmap.desc_size * efi.memmap.nr_map,
+   efi.memmap.flags);
+}
+
 /**
  * efi_memmap_alloc - Allocate memory for the EFI memory map
  * @num_entries: Number of entries in the allocated map.
@@ -209,6 +231,8 @@ int __init efi_memmap_install(phys_addr_t addr, unsigned 
int nr_map,
data.desc_size = efi.memmap.desc_size;
flags |= efi.memmap.flags & EFI_MEMMAP_LATE;
 
+   efi_memmap_free();
+
return __efi_memmap_init(&data, flags);
 }
 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 0/4] efi: Fix handling of multiple efi_fake_mem= entries

2019-12-31 Thread Dan Williams
Changes since v1 [1]:
- Add support for garbage collecting idle efi_memmap_alloc() allocations

- As Dave noticed the same problem of calling efi_memmap_split_count()
  multiple times with the old map also applies to efi_memmap_insert().
  Update efi_fake_memmap() to invoke efi_memmap_install() after every
  efi_memmap_insert() call.

[1]: 
https://lore.kernel.org/r/157773590338.4153451.5898675419563883883.st...@dwillia2-desk3.amr.corp.intel.com/

---

Copied from patch4:

Dave noticed that when specifying multiple efi_fake_mem= entries only
the last entry was successfully being reflected in the efi memory map.
This is due to the fact that the efi_memmap_insert() is being called
multiple times, but on successive invocations the insertion should be
applied to the last new memmap rather than the original map at
efi_fake_memmap() entry.

Rework efi_fake_memmap() to install the new memory map after each
efi_fake_mem= entry is parsed.

This also fixes an issue in efi_fake_memmap() that caused it to litter
emtpy entries into the end of the efi memory map. The empty entry causes
efi_memmap_insert() to attempt more memmap splits / copies than
efi_memmap_split_count() accounted for when sizing the new map.

BUG: unable to handle page fault for address: ff281000
[..]
RIP: 0010:efi_memmap_insert+0x11d/0x191
[..]
Call Trace:
 ? bgrt_init+0xbe/0xbe
 ? efi_arch_mem_reserve+0x1cb/0x228
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0

Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
services data to fix kexec breakage" is listed in Fixes: since it
introduces more occurrences where efi_memmap_insert() is invoked after
an efi_fake_mem= configuration has been parsed. Previously the side
effects of vestigial empty entries were benign, but with commit
af1648984828 that follow-on efi_memmap_insert() invocation triggers the
above crash signature.

---

Dan Williams (4):
  efi: Add a flags parameter to efi_memory_map
  efi: Add tracking for dynamically allocated memmaps
  efi: Fix efi_memmap_alloc() leaks
  efi: Fix handling of multiple efi_fake_mem= entries


 arch/x86/platform/efi/efi.c |2 +
 arch/x86/platform/efi/quirks.c  |   11 ---
 drivers/firmware/efi/fake_mem.c |   37 +
 drivers/firmware/efi/memmap.c   |   58 ++-
 include/linux/efi.h |   13 +++--
 5 files changed, 81 insertions(+), 40 deletions(-)

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2019-12-31 Thread Dan Williams
Dave noticed that when specifying multiple efi_fake_mem= entries only
the last entry was successfully being reflected in the efi memory map.
This is due to the fact that the efi_memmap_insert() is being called
multiple times, but on successive invocations the insertion should be
applied to the last new memmap rather than the original map at
efi_fake_memmap() entry.

Rework efi_fake_memmap() to install the new memory map after each
efi_fake_mem= entry is parsed.

This also fixes an issue in efi_fake_memmap() that caused it to litter
emtpy entries into the end of the efi memory map. The empty entry causes
efi_memmap_insert() to attempt more memmap splits / copies than
efi_memmap_split_count() accounted for when sizing the new map.

BUG: unable to handle page fault for address: ff281000
[..]
RIP: 0010:efi_memmap_insert+0x11d/0x191
[..]
Call Trace:
 ? bgrt_init+0xbe/0xbe
 ? efi_arch_mem_reserve+0x1cb/0x228
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0

Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
services data to fix kexec breakage" is listed in Fixes: since it
introduces more occurrences where efi_memmap_insert() is invoked after
an efi_fake_mem= configuration has been parsed. Previously the side
effects of vestigial empty entries were benign, but with commit
af1648984828 that follow-on efi_memmap_insert() invocation triggers the
above crash signature.

Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot services...")
Link: 
https://lore.kernel.org/r/20191231014630.ga24...@dhcp-128-65.nay.redhat.com
Reported-by: Dave Young 
Cc: Taku Izumi 
Cc: Michael Weiser 
Cc: Ard Biesheuvel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/fake_mem.c |   32 +---
 drivers/firmware/efi/memmap.c   |2 +-
 include/linux/efi.h |2 ++
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 7e53e5520548..68d752d8af21 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -34,26 +34,17 @@ static int __init cmp_fake_mem(const void *x1, const void 
*x2)
return 0;
 }
 
-void __init efi_fake_memmap(void)
+static void __init efi_fake_range(struct efi_mem_range *efi_range)
 {
int new_nr_map = efi.memmap.nr_map;
efi_memory_desc_t *md;
phys_addr_t new_memmap_phy;
unsigned long flags = 0;
void *new_memmap;
-   int i;
-
-   if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
-   return;
 
/* count up the number of EFI memory descriptor */
-   for (i = 0; i < nr_fake_mem; i++) {
-   for_each_efi_memory_desc(md) {
-   struct range *r = &efi_fake_mems[i].range;
-
-   new_nr_map += efi_memmap_split_count(md, r);
-   }
-   }
+   for_each_efi_memory_desc(md)
+   new_nr_map += efi_memmap_split_count(md, &efi_range->range);
 
/* allocate memory for new EFI memmap */
new_memmap_phy = efi_memmap_alloc(new_nr_map, &flags);
@@ -64,17 +55,28 @@ void __init efi_fake_memmap(void)
new_memmap = early_memremap(new_memmap_phy,
efi.memmap.desc_size * new_nr_map);
if (!new_memmap) {
-   memblock_free(new_memmap_phy, efi.memmap.desc_size * 
new_nr_map);
+   __efi_memmap_free(new_memmap_phy,
+   efi.memmap.desc_size * new_nr_map, flags);
return;
}
 
-   for (i = 0; i < nr_fake_mem; i++)
-   efi_memmap_insert(&efi.memmap, new_memmap, &efi_fake_mems[i]);
+   efi_memmap_insert(&efi.memmap, new_memmap, efi_range);
 
/* swap into new EFI memmap */
early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
 
efi_memmap_install(new_memmap_phy, new_nr_map, flags);
+}
+
+void __init efi_fake_memmap(void)
+{
+   int i;
+
+   if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
+   return;
+
+   for (i = 0; i < nr_fake_mem; i++)
+   efi_fake_range(&efi_fake_mems[i]);
 
/* print new EFI memmap */
efi_print_memmap();
diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 188ab3cd5c52..de66c2a0e8f8 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -29,7 +29,7 @@ static phys_addr_t __init __efi_memmap_alloc_late(unsigned 
long size)
return PFN_PHYS(page_to_pfn(p));
 }
 
-static void __init __

Re: [PATCH] efi: Fix handling of multiple contiguous efi_fake_mem= entries

2019-12-31 Thread Dan Williams
On Tue, Dec 31, 2019 at 1:11 PM Dan Williams  wrote:
> Perhaps a prettier way to do this is to push the handling of each
> efi_fake_mem entry into a subroutine. However, I notice when a memmap
> allocated by efi_memmap_alloc() is replaced by another dynamically
> allocated memmap the previous one isn't released. I have a series that
> fixes that up as well.

Available here:
http://lore.kernel.org/r/157782985777.367056.14741265874314204783.st...@dwillia2-desk3.amr.corp.intel.com

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 3/4] efi: Fix efi_memmap_alloc() leaks

2019-12-31 Thread Dan Williams
On Tue, Dec 31, 2019 at 7:35 PM Dave Young  wrote:
>
> Hi Dan,
> On 12/31/19 at 02:04pm, Dan Williams wrote:
> > With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> > updated and replaced multiple times. When that happens a previous
> > dynamically allocated efi memory map can be garbage collected. Use the
> > new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> > allocated memory map is being replaced.
> >
> > Cc: Taku Izumi 
> > Cc: Ard Biesheuvel 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/firmware/efi/memmap.c |   24 
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
> > index 2b81ee6858a9..188ab3cd5c52 100644
> > --- a/drivers/firmware/efi/memmap.c
> > +++ b/drivers/firmware/efi/memmap.c
> > @@ -29,6 +29,28 @@ static phys_addr_t __init 
> > __efi_memmap_alloc_late(unsigned long size)
> >   return PFN_PHYS(page_to_pfn(p));
> >  }
> >
> > +static void __init __efi_memmap_free(u64 phys, unsigned long size, 
> > unsigned long flags)
> > +{
> > + if (WARN_ON(slab_is_available() && (flags & EFI_MEMMAP_MEMBLOCK)))
> > + return;
> > +
> > + if (flags & EFI_MEMMAP_MEMBLOCK) {
> > + memblock_free(phys, size);
> > + } else if (flags & EFI_MEMMAP_SLAB) {
> > + struct page *p = pfn_to_page(PHYS_PFN(phys));
> > + unsigned int order = get_order(size);
> > +
> > + free_pages((unsigned long) page_address(p), order);
> > + }
> > +}
> > +
> > +static void __init efi_memmap_free(void)
> > +{
> > + __efi_memmap_free(efi.memmap.phys_map,
> > + efi.memmap.desc_size * efi.memmap.nr_map,
> > + efi.memmap.flags);
> > +}
> > +
> >  /**
> >   * efi_memmap_alloc - Allocate memory for the EFI memory map
> >   * @num_entries: Number of entries in the allocated map.
> > @@ -209,6 +231,8 @@ int __init efi_memmap_install(phys_addr_t addr, 
> > unsigned int nr_map,
> >   data.desc_size = efi.memmap.desc_size;
> >   flags |= efi.memmap.flags & EFI_MEMMAP_LATE;
> >
> > + efi_memmap_free();
> > +
> >   return __efi_memmap_init(&data, flags);
>
> Hmm, only free the memmap in case __efi_memmap_init succeeded..

Ah true, that is a hastily chosen placement. Probably better in
__efi_memmap_init() after we're committed to the new map.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2019-12-31 Thread Dan Williams
On Tue, Dec 31, 2019 at 8:52 PM Dave Young  wrote:
>
> Hi Dan,
> On 12/31/19 at 02:04pm, Dan Williams wrote:
> > Dave noticed that when specifying multiple efi_fake_mem= entries only
> > the last entry was successfully being reflected in the efi memory map.
> > This is due to the fact that the efi_memmap_insert() is being called
> > multiple times, but on successive invocations the insertion should be
> > applied to the last new memmap rather than the original map at
> > efi_fake_memmap() entry.
> >
> > Rework efi_fake_memmap() to install the new memory map after each
> > efi_fake_mem= entry is parsed.
> >
> > This also fixes an issue in efi_fake_memmap() that caused it to litter
> > emtpy entries into the end of the efi memory map. The empty entry causes
> > efi_memmap_insert() to attempt more memmap splits / copies than
> > efi_memmap_split_count() accounted for when sizing the new map.
> >
> > BUG: unable to handle page fault for address: ff281000
> > [..]
> > RIP: 0010:efi_memmap_insert+0x11d/0x191
> > [..]
> > Call Trace:
> >  ? bgrt_init+0xbe/0xbe
> >  ? efi_arch_mem_reserve+0x1cb/0x228
> >  ? acpi_parse_bgrt+0xa/0xd
> >  ? acpi_table_parse+0x86/0xb8
> >  ? acpi_boot_init+0x494/0x4e3
> >  ? acpi_parse_x2apic+0x87/0x87
> >  ? setup_acpi_sci+0xa2/0xa2
> >  ? setup_arch+0x8db/0x9e1
> >  ? start_kernel+0x6a/0x547
> >  ? secondary_startup_64+0xb6/0xc0
> >
> > Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
> > services data to fix kexec breakage" is listed in Fixes: since it
> > introduces more occurrences where efi_memmap_insert() is invoked after
> > an efi_fake_mem= configuration has been parsed. Previously the side
> > effects of vestigial empty entries were benign, but with commit
> > af1648984828 that follow-on efi_memmap_insert() invocation triggers the
> > above crash signature.
> >
> > Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
> > Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot 
> > services...")
> > Link: 
> > https://lore.kernel.org/r/20191231014630.ga24...@dhcp-128-65.nay.redhat.com
> > Reported-by: Dave Young 
> > Cc: Taku Izumi 
> > Cc: Michael Weiser 
> > Cc: Ard Biesheuvel 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/firmware/efi/fake_mem.c |   32 +---
> >  drivers/firmware/efi/memmap.c   |2 +-
> >  include/linux/efi.h |2 ++
> >  3 files changed, 20 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/firmware/efi/fake_mem.c 
> > b/drivers/firmware/efi/fake_mem.c
> > index 7e53e5520548..68d752d8af21 100644
> > --- a/drivers/firmware/efi/fake_mem.c
> > +++ b/drivers/firmware/efi/fake_mem.c
> > @@ -34,26 +34,17 @@ static int __init cmp_fake_mem(const void *x1, const 
> > void *x2)
> >   return 0;
> >  }
> >
> > -void __init efi_fake_memmap(void)
> > +static void __init efi_fake_range(struct efi_mem_range *efi_range)
> >  {
> >   int new_nr_map = efi.memmap.nr_map;
> >   efi_memory_desc_t *md;
> >   phys_addr_t new_memmap_phy;
> >   unsigned long flags = 0;
> >   void *new_memmap;
> > - int i;
> > -
> > - if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
> > - return;
> >
> >   /* count up the number of EFI memory descriptor */
> > - for (i = 0; i < nr_fake_mem; i++) {
> > - for_each_efi_memory_desc(md) {
> > - struct range *r = &efi_fake_mems[i].range;
> > -
> > - new_nr_map += efi_memmap_split_count(md, r);
> > - }
> > - }
> > + for_each_efi_memory_desc(md)
> > + new_nr_map += efi_memmap_split_count(md, &efi_range->range);
>
> I have another concern here :(
>
> THe efi_memmap_split_count mean to only split for a specific md, and you
> can see arch/x86/platform/efi/quirks.c about the use:
> if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
> pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
> return;
> }
>
> Any memory region to be inserted but spans different md will be
> rejected.  So the memmap insert logic seems does not support the
> spanned ranges.  I did not find a case two contiguous same type r

Re: [PATCH v2 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-01 Thread Dan Williams
On Tue, Dec 31, 2019 at 10:21 PM Dave Young  wrote:
>
> > > Does kexec preserve iomem? I.e. as long as the initial translation of
> > > efi entries to e820, and resulting resource tree, is preserved by
> > > successive kexec cycles then I think we're ok.
> >
> > It will not preserve them automatically, but that can be fixed if
> > needed.
> >
> > There are two places:
> > 1. the in kernel loader, we can do similar with below commit (for Soft
> > Reseved instead):
> > commit 980621daf368f2b9aa69c7ea01baa654edb7577b
> > Author: Lianbo Jiang 
> > Date:   Tue Apr 23 09:30:07 2019 +0800
> >
> > x86/crash: Add e820 reserved ranges to kdump kernel's e820 table
>
> Oops, that is for kdump only, for kexec, should update the kexec e820
> table.  But before doing that we need first to see if this is necessary.

We can cross that bridge later, but I expect it will eventually be
necessary. The soft-reservation facility will become more prevalent as
more platforms ship with DRAM differentiated memory ranges, like
high-bandwidth-memory, and the system needs to reserve it from general
kernel allocations. See commit 262b45ae3ab4 "x86/efi: EFI soft
reservation to E820 enumeration" and commit fe3e5e65c06e "efi:
Enumerate EFI_MEMORY_SP" for more details.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v3 3/4] efi: Fix efi_memmap_alloc() leaks

2020-01-01 Thread Dan Williams
With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
updated and replaced multiple times. When that happens a previous
dynamically allocated efi memory map can be garbage collected. Use the
new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
allocated memory map is being replaced.

Debug statements in efi_memmap_free() reveal:

 efi: __efi_memmap_free:37: phys: 0x23ffdd580 size: 2688 flags: 0x2
 efi: __efi_memmap_free:37: phys: 0x9db00 size: 2640 flags: 0x2
 efi: __efi_memmap_free:37: phys: 0x9e580 size: 2640 flags: 0x2

...a savings of 7968 bytes on a qemu boot with 2 entries specified to
efi_fake_mem=.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/memmap.c |   24 
 1 file changed, 24 insertions(+)

diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 2b81ee6858a9..46c8b4056cc1 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -29,6 +29,28 @@ static phys_addr_t __init __efi_memmap_alloc_late(unsigned 
long size)
return PFN_PHYS(page_to_pfn(p));
 }
 
+static void __init __efi_memmap_free(u64 phys, unsigned long size, unsigned 
long flags)
+{
+   if (flags & EFI_MEMMAP_MEMBLOCK) {
+   if (slab_is_available())
+   memblock_free_late(phys, size);
+   else
+   memblock_free(phys, size);
+   } else if (flags & EFI_MEMMAP_SLAB) {
+   struct page *p = pfn_to_page(PHYS_PFN(phys));
+   unsigned int order = get_order(size);
+
+   free_pages((unsigned long) page_address(p), order);
+   }
+}
+
+static void __init efi_memmap_free(void)
+{
+   __efi_memmap_free(efi.memmap.phys_map,
+   efi.memmap.desc_size * efi.memmap.nr_map,
+   efi.memmap.flags);
+}
+
 /**
  * efi_memmap_alloc - Allocate memory for the EFI memory map
  * @num_entries: Number of entries in the allocated map.
@@ -90,6 +112,8 @@ __efi_memmap_init(struct efi_memory_map_data *data, unsigned 
long flags)
return -ENOMEM;
}
 
+   efi_memmap_free();
+
map.phys_map = data->phys_map;
map.nr_map = data->size / data->desc_size;
map.map_end = map.map + data->size;


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v3 1/4] efi: Add a flags parameter to efi_memory_map

2020-01-01 Thread Dan Williams
In preparation for garbage collecting dynamically allocated efi memory
maps, where the allocation method of memblock vs slab needs to be
recalled, convert the existing 'late' flag into a 'flags' bitmask.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/memmap.c |   24 +---
 include/linux/efi.h   |3 ++-
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 38b686c67b17..813674ef9000 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -52,7 +52,7 @@ phys_addr_t __init efi_memmap_alloc(unsigned int num_entries)
 /**
  * __efi_memmap_init - Common code for mapping the EFI memory map
  * @data: EFI memory map data
- * @late: Use early or late mapping function?
+ * @flags: Use early or late mapping function?
  *
  * This function takes care of figuring out which function to use to
  * map the EFI memory map in efi.memmap based on how far into the boot
@@ -66,9 +66,9 @@ phys_addr_t __init efi_memmap_alloc(unsigned int num_entries)
  * Returns zero on success, a negative error code on failure.
  */
 static int __init
-__efi_memmap_init(struct efi_memory_map_data *data, bool late)
+__efi_memmap_init(struct efi_memory_map_data *data, unsigned long flags)
 {
-   struct efi_memory_map map;
+   struct efi_memory_map map = { 0 };
phys_addr_t phys_map;
 
if (efi_enabled(EFI_PARAVIRT))
@@ -76,7 +76,7 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool late)
 
phys_map = data->phys_map;
 
-   if (late)
+   if (flags & EFI_MEMMAP_LATE)
map.map = memremap(phys_map, data->size, MEMREMAP_WB);
else
map.map = early_memremap(phys_map, data->size);
@@ -92,7 +92,7 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool late)
 
map.desc_version = data->desc_version;
map.desc_size = data->desc_size;
-   map.late = late;
+   map.flags |= flags;
 
set_bit(EFI_MEMMAP, &efi.flags);
 
@@ -111,9 +111,9 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool 
late)
 int __init efi_memmap_init_early(struct efi_memory_map_data *data)
 {
/* Cannot go backwards */
-   WARN_ON(efi.memmap.late);
+   WARN_ON(efi.memmap.flags & EFI_MEMMAP_LATE);
 
-   return __efi_memmap_init(data, false);
+   return __efi_memmap_init(data, 0);
 }
 
 void __init efi_memmap_unmap(void)
@@ -121,7 +121,7 @@ void __init efi_memmap_unmap(void)
if (!efi_enabled(EFI_MEMMAP))
return;
 
-   if (!efi.memmap.late) {
+   if (!(efi.memmap.flags & EFI_MEMMAP_LATE)) {
unsigned long size;
 
size = efi.memmap.desc_size * efi.memmap.nr_map;
@@ -168,7 +168,7 @@ int __init efi_memmap_init_late(phys_addr_t addr, unsigned 
long size)
WARN_ON(efi.memmap.map);
 
/* Were we already called? */
-   WARN_ON(efi.memmap.late);
+   WARN_ON(efi.memmap.flags & EFI_MEMMAP_LATE);
 
/*
 * It makes no sense to allow callers to register different
@@ -178,7 +178,7 @@ int __init efi_memmap_init_late(phys_addr_t addr, unsigned 
long size)
data.desc_version = efi.memmap.desc_version;
data.desc_size = efi.memmap.desc_size;
 
-   return __efi_memmap_init(&data, true);
+   return __efi_memmap_init(&data, EFI_MEMMAP_LATE);
 }
 
 /**
@@ -195,6 +195,7 @@ int __init efi_memmap_init_late(phys_addr_t addr, unsigned 
long size)
 int __init efi_memmap_install(phys_addr_t addr, unsigned int nr_map)
 {
struct efi_memory_map_data data;
+   unsigned long flags;
 
efi_memmap_unmap();
 
@@ -202,8 +203,9 @@ int __init efi_memmap_install(phys_addr_t addr, unsigned 
int nr_map)
data.size = efi.memmap.desc_size * nr_map;
data.desc_version = efi.memmap.desc_version;
data.desc_size = efi.memmap.desc_size;
+   flags = efi.memmap.flags & EFI_MEMMAP_LATE;
 
-   return __efi_memmap_init(&data, efi.memmap.late);
+   return __efi_memmap_init(&data, flags);
 }
 
 /**
diff --git a/include/linux/efi.h b/include/linux/efi.h
index aa54586db7a5..b8e930f5ff77 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -795,7 +795,8 @@ struct efi_memory_map {
int nr_map;
unsigned long desc_version;
unsigned long desc_size;
-   bool late;
+#define EFI_MEMMAP_LATE (1UL << 0)
+   unsigned long flags;
 };
 
 struct efi_mem_range {


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v3 2/4] efi: Add tracking for dynamically allocated memmaps

2020-01-01 Thread Dan Williams
In preparation for fixing efi_memmap_alloc() leaks, add support for
recording whether the memmap was dynamically allocated from slab,
memblock, or is the original physical memmap provided by the platform.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 arch/x86/platform/efi/efi.c |2 +-
 arch/x86/platform/efi/quirks.c  |   11 ++-
 drivers/firmware/efi/fake_mem.c |5 +++--
 drivers/firmware/efi/memmap.c   |   16 ++--
 include/linux/efi.h |8 ++--
 5 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 38d44f36d5ed..7086afbb84fd 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -333,7 +333,7 @@ static void __init efi_clean_memmap(void)
u64 size = efi.memmap.nr_map - n_removal;
 
pr_warn("Removing %d invalid memory map entries.\n", n_removal);
-   efi_memmap_install(efi.memmap.phys_map, size);
+   efi_memmap_install(efi.memmap.phys_map, size, 0);
}
 }
 
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f8f0220b6a66..4a71c790f9c3 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -244,6 +244,7 @@ EXPORT_SYMBOL_GPL(efi_query_variable_store);
 void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 {
phys_addr_t new_phys, new_size;
+   unsigned long flags = 0;
struct efi_mem_range mr;
efi_memory_desc_t md;
int num_entries;
@@ -272,8 +273,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
num_entries += efi.memmap.nr_map;
 
new_size = efi.memmap.desc_size * num_entries;
-
-   new_phys = efi_memmap_alloc(num_entries);
+   new_phys = efi_memmap_alloc(num_entries, &flags);
if (!new_phys) {
pr_err("Could not allocate boot services memmap\n");
return;
@@ -288,7 +288,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
efi_memmap_insert(&efi.memmap, new, &mr);
early_memunmap(new, new_size);
 
-   efi_memmap_install(new_phys, num_entries);
+   efi_memmap_install(new_phys, num_entries, flags);
e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
e820__update_table(e820_table);
 }
@@ -408,6 +408,7 @@ static void __init efi_unmap_pages(efi_memory_desc_t *md)
 void __init efi_free_boot_services(void)
 {
phys_addr_t new_phys, new_size;
+   unsigned long flags = 0;
efi_memory_desc_t *md;
int num_entries = 0;
void *new, *new_md;
@@ -463,7 +464,7 @@ void __init efi_free_boot_services(void)
return;
 
new_size = efi.memmap.desc_size * num_entries;
-   new_phys = efi_memmap_alloc(num_entries);
+   new_phys = efi_memmap_alloc(num_entries, &flags);
if (!new_phys) {
pr_err("Failed to allocate new EFI memmap\n");
return;
@@ -493,7 +494,7 @@ void __init efi_free_boot_services(void)
 
memunmap(new);
 
-   if (efi_memmap_install(new_phys, num_entries)) {
+   if (efi_memmap_install(new_phys, num_entries, flags)) {
pr_err("Could not install new EFI memmap\n");
return;
}
diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index bb9fc70d0cfa..7e53e5520548 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -39,6 +39,7 @@ void __init efi_fake_memmap(void)
int new_nr_map = efi.memmap.nr_map;
efi_memory_desc_t *md;
phys_addr_t new_memmap_phy;
+   unsigned long flags = 0;
void *new_memmap;
int i;
 
@@ -55,7 +56,7 @@ void __init efi_fake_memmap(void)
}
 
/* allocate memory for new EFI memmap */
-   new_memmap_phy = efi_memmap_alloc(new_nr_map);
+   new_memmap_phy = efi_memmap_alloc(new_nr_map, &flags);
if (!new_memmap_phy)
return;
 
@@ -73,7 +74,7 @@ void __init efi_fake_memmap(void)
/* swap into new EFI memmap */
early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
 
-   efi_memmap_install(new_memmap_phy, new_nr_map);
+   efi_memmap_install(new_memmap_phy, new_nr_map, flags);
 
/* print new EFI memmap */
efi_print_memmap();
diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 813674ef9000..2b81ee6858a9 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -32,6 +32,7 @@ static phys_addr_t __init __efi_memmap_alloc_late(unsigned 
long size)
 /**
  * efi_memmap_alloc - Allocate memory for the EFI memory map
  * @num_entries: Number of entries in the allocated map.
+ * @flags: Late map, memblock alloc, slab alloc flags
  *
  * Depending on whether mm_init() has already been invoked or 

[PATCH v3 0/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-01 Thread Dan Williams
Changes since v2 [1]:
- Move the efi_memmap_free() until efi_memmap_install() is committed to
  installing the new map. (Dave).

- Handle the case of a memblock allocated memmap being freed after the slab
  allocator is up. Just use memblock_free_late() for that case rather
  than warn. (Prompted by Dave's feedback on how many successful
  efi_memmap_free() calls occur during a boot).

- Not changed was anything additional related to Dave's concern about
  efi_fake_mem= being applied to overlapping entries. I tested
  "efi_fake_mem=4G@9G:0x4,4G@12G:0x4" which triggers the second
  entry to overwrite the first as well as another entry. The result is
  reasonable and functional for what is otherwise garbage input:

  efi: mem53: [Conventional Memory|   |  |SP|  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x24000-0x2] (3072MB)
  efi: mem54: [Conventional Memory|   |  |SP|  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x3-0x33fff] (1024MB)
  efi: mem55: [Conventional Memory|   |  |SP|  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x34000-0x3] (3072MB)
  efi: mem56: [Conventional Memory|   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] 
range=[0x4-0x43fff] (1024MB)

  # cat /proc/iomem  | grep Sof
  24000-3 : Soft Reserved

[1]: 
http://lore.kernel.org/r/157782985777.367056.14741265874314204783.st...@dwillia2-desk3.amr.corp.intel.com

---

While testing an upcoming patchset to enhance the "soft reservation"
implementation it started crashing when rebased on v5.5-rc3. This
uncovered a few bugs in the efi_fake_mem= handling and
efi_memmap_alloc() leaks.

---

Copied from patch4:

Dave noticed that when specifying multiple efi_fake_mem= entries only
the last entry was successfully being reflected in the efi memory map.
This is due to the fact that the efi_memmap_insert() is being called
multiple times, but on successive invocations the insertion should be
applied to the last new memmap rather than the original map at
efi_fake_memmap() entry.

Rework efi_fake_memmap() to install the new memory map after each
efi_fake_mem= entry is parsed.

This also fixes an issue in efi_fake_memmap() that caused it to litter
emtpy entries into the end of the efi memory map. The empty entry causes
efi_memmap_insert() to attempt more memmap splits / copies than
efi_memmap_split_count() accounted for when sizing the new map.

BUG: unable to handle page fault for address: ff281000
[..]
RIP: 0010:efi_memmap_insert+0x11d/0x191
[..]
Call Trace:
 ? bgrt_init+0xbe/0xbe
 ? efi_arch_mem_reserve+0x1cb/0x228
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0

Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
services data to fix kexec breakage" is listed in Fixes: since it
introduces more occurrences where efi_memmap_insert() is invoked after
an efi_fake_mem= configuration has been parsed. Previously the side
effects of vestigial empty entries were benign, but with commit
af1648984828 that follow-on efi_memmap_insert() invocation triggers the
above crash signature.

---

Dan Williams (4):
  efi: Add a flags parameter to efi_memory_map
  efi: Add tracking for dynamically allocated memmaps
  efi: Fix efi_memmap_alloc() leaks
  efi: Fix handling of multiple efi_fake_mem= entries


 arch/x86/platform/efi/efi.c |2 +
 arch/x86/platform/efi/quirks.c  |   11 ---
 drivers/firmware/efi/fake_mem.c |   37 +
 drivers/firmware/efi/memmap.c   |   58 ++-
 include/linux/efi.h |   13 +++--
 5 files changed, 81 insertions(+), 40 deletions(-)

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v3 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-01 Thread Dan Williams
Dave noticed that when specifying multiple efi_fake_mem= entries only
the last entry was successfully being reflected in the efi memory map.
This is due to the fact that the efi_memmap_insert() is being called
multiple times, but on successive invocations the insertion should be
applied to the last new memmap rather than the original map at
efi_fake_memmap() entry.

Rework efi_fake_memmap() to install the new memory map after each
efi_fake_mem= entry is parsed.

This also fixes an issue in efi_fake_memmap() that caused it to litter
emtpy entries into the end of the efi memory map. The empty entry causes
efi_memmap_insert() to attempt more memmap splits / copies than
efi_memmap_split_count() accounted for when sizing the new map.

BUG: unable to handle page fault for address: ff281000
[..]
RIP: 0010:efi_memmap_insert+0x11d/0x191
[..]
Call Trace:
 ? bgrt_init+0xbe/0xbe
 ? efi_arch_mem_reserve+0x1cb/0x228
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0

Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
services data to fix kexec breakage" is listed in Fixes: since it
introduces more occurrences where efi_memmap_insert() is invoked after
an efi_fake_mem= configuration has been parsed. Previously the side
effects of vestigial empty entries were benign, but with commit
af1648984828 that follow-on efi_memmap_insert() invocation triggers the
above crash signature.

Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot services...")
Link: 
https://lore.kernel.org/r/20191231014630.ga24...@dhcp-128-65.nay.redhat.com
Reported-by: Dave Young 
Cc: Taku Izumi 
Cc: Michael Weiser 
Cc: Ard Biesheuvel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/fake_mem.c |   32 +---
 drivers/firmware/efi/memmap.c   |2 +-
 include/linux/efi.h |2 ++
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 7e53e5520548..68d752d8af21 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -34,26 +34,17 @@ static int __init cmp_fake_mem(const void *x1, const void 
*x2)
return 0;
 }
 
-void __init efi_fake_memmap(void)
+static void __init efi_fake_range(struct efi_mem_range *efi_range)
 {
int new_nr_map = efi.memmap.nr_map;
efi_memory_desc_t *md;
phys_addr_t new_memmap_phy;
unsigned long flags = 0;
void *new_memmap;
-   int i;
-
-   if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
-   return;
 
/* count up the number of EFI memory descriptor */
-   for (i = 0; i < nr_fake_mem; i++) {
-   for_each_efi_memory_desc(md) {
-   struct range *r = &efi_fake_mems[i].range;
-
-   new_nr_map += efi_memmap_split_count(md, r);
-   }
-   }
+   for_each_efi_memory_desc(md)
+   new_nr_map += efi_memmap_split_count(md, &efi_range->range);
 
/* allocate memory for new EFI memmap */
new_memmap_phy = efi_memmap_alloc(new_nr_map, &flags);
@@ -64,17 +55,28 @@ void __init efi_fake_memmap(void)
new_memmap = early_memremap(new_memmap_phy,
efi.memmap.desc_size * new_nr_map);
if (!new_memmap) {
-   memblock_free(new_memmap_phy, efi.memmap.desc_size * 
new_nr_map);
+   __efi_memmap_free(new_memmap_phy,
+   efi.memmap.desc_size * new_nr_map, flags);
return;
}
 
-   for (i = 0; i < nr_fake_mem; i++)
-   efi_memmap_insert(&efi.memmap, new_memmap, &efi_fake_mems[i]);
+   efi_memmap_insert(&efi.memmap, new_memmap, efi_range);
 
/* swap into new EFI memmap */
early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
 
efi_memmap_install(new_memmap_phy, new_nr_map, flags);
+}
+
+void __init efi_fake_memmap(void)
+{
+   int i;
+
+   if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
+   return;
+
+   for (i = 0; i < nr_fake_mem; i++)
+   efi_fake_range(&efi_fake_mems[i]);
 
/* print new EFI memmap */
efi_print_memmap();
diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 46c8b4056cc1..157b7776caf5 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -29,7 +29,7 @@ static phys_addr_t __init __efi_memmap_alloc_late(unsigned 
long size)
return PFN_PHYS(page_to_pfn(p));
 }
 
-static void __init __

Re: [PATCH v3 2/4] efi: Add tracking for dynamically allocated memmaps

2020-01-06 Thread Dan Williams
On Thu, Jan 2, 2020 at 1:02 AM Ard Biesheuvel  wrote:
>
> Hi Dan,
>
> Thanks for taking the time to really fix this properly.
>
> Comments/questions below.
>
> On Thu, 2 Jan 2020 at 05:29, Dan Williams  wrote:
> >
> > In preparation for fixing efi_memmap_alloc() leaks, add support for
> > recording whether the memmap was dynamically allocated from slab,
> > memblock, or is the original physical memmap provided by the platform.
> >
> > Cc: Taku Izumi 
> > Cc: Ard Biesheuvel 
> > Signed-off-by: Dan Williams 
> > ---
> >  arch/x86/platform/efi/efi.c |2 +-
> >  arch/x86/platform/efi/quirks.c  |   11 ++-
> >  drivers/firmware/efi/fake_mem.c |5 +++--
> >  drivers/firmware/efi/memmap.c   |   16 ++--
> >  include/linux/efi.h |8 ++--
> >  5 files changed, 26 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> > index 38d44f36d5ed..7086afbb84fd 100644
> > --- a/arch/x86/platform/efi/efi.c
> > +++ b/arch/x86/platform/efi/efi.c
> > @@ -333,7 +333,7 @@ static void __init efi_clean_memmap(void)
> > u64 size = efi.memmap.nr_map - n_removal;
> >
> > pr_warn("Removing %d invalid memory map entries.\n", 
> > n_removal);
> > -   efi_memmap_install(efi.memmap.phys_map, size);
> > +   efi_memmap_install(efi.memmap.phys_map, size, 0);
> > }
> >  }
> >
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index f8f0220b6a66..4a71c790f9c3 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -244,6 +244,7 @@ EXPORT_SYMBOL_GPL(efi_query_variable_store);
> >  void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> >  {
> > phys_addr_t new_phys, new_size;
> > +   unsigned long flags = 0;
> > struct efi_mem_range mr;
> > efi_memory_desc_t md;
> > int num_entries;
> > @@ -272,8 +273,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> > size)
> > num_entries += efi.memmap.nr_map;
> >
> > new_size = efi.memmap.desc_size * num_entries;
> > -
> > -   new_phys = efi_memmap_alloc(num_entries);
> > +   new_phys = efi_memmap_alloc(num_entries, &flags);
> > if (!new_phys) {
> > pr_err("Could not allocate boot services memmap\n");
> > return;
> > @@ -288,7 +288,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> > size)
> > efi_memmap_insert(&efi.memmap, new, &mr);
> > early_memunmap(new, new_size);
> >
> > -   efi_memmap_install(new_phys, num_entries);
> > +   efi_memmap_install(new_phys, num_entries, flags);
> > e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > e820__update_table(e820_table);
> >  }
> > @@ -408,6 +408,7 @@ static void __init efi_unmap_pages(efi_memory_desc_t 
> > *md)
> >  void __init efi_free_boot_services(void)
> >  {
> > phys_addr_t new_phys, new_size;
> > +   unsigned long flags = 0;
> > efi_memory_desc_t *md;
> > int num_entries = 0;
> > void *new, *new_md;
> > @@ -463,7 +464,7 @@ void __init efi_free_boot_services(void)
> > return;
> >
> > new_size = efi.memmap.desc_size * num_entries;
> > -   new_phys = efi_memmap_alloc(num_entries);
> > +   new_phys = efi_memmap_alloc(num_entries, &flags);
> > if (!new_phys) {
> > pr_err("Failed to allocate new EFI memmap\n");
> > return;
> > @@ -493,7 +494,7 @@ void __init efi_free_boot_services(void)
> >
> > memunmap(new);
> >
> > -   if (efi_memmap_install(new_phys, num_entries)) {
> > +   if (efi_memmap_install(new_phys, num_entries, flags)) {
> > pr_err("Could not install new EFI memmap\n");
> > return;
> > }
> > diff --git a/drivers/firmware/efi/fake_mem.c 
> > b/drivers/firmware/efi/fake_mem.c
> > index bb9fc70d0cfa..7e53e5520548 100644
> > --- a/drivers/firmware/efi/fake_mem.c
> > +++ b/drivers/firmware/efi/fake_mem.c
> > @@ -39,6 +39,7 @@ void __init efi_fake_memmap(void)
> > int new_nr_map = efi.memmap.nr_map;
> > efi_memory_desc_t *md;
> > phys_addr_t new_memmap_phy;
> > +   unsigned lo

[PATCH v4 3/4] efi: Fix efi_memmap_alloc() leaks

2020-01-06 Thread Dan Williams
With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
updated and replaced multiple times. When that happens a previous
dynamically allocated efi memory map can be garbage collected. Use the
new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
allocated memory map is being replaced.

Debug statements in efi_memmap_free() reveal:

 efi: __efi_memmap_free:37: phys: 0x23ffdd580 size: 2688 flags: 0x2
 efi: __efi_memmap_free:37: phys: 0x9db00 size: 2640 flags: 0x2
 efi: __efi_memmap_free:37: phys: 0x9e580 size: 2640 flags: 0x2

...a savings of 7968 bytes on a qemu boot with 2 entries specified to
efi_fake_mem=.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/memmap.c |   24 
 1 file changed, 24 insertions(+)

diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 04dfa56b994b..bffa320d2f9a 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -29,6 +29,28 @@ static phys_addr_t __init __efi_memmap_alloc_late(unsigned 
long size)
return PFN_PHYS(page_to_pfn(p));
 }
 
+static void __init __efi_memmap_free(u64 phys, unsigned long size, unsigned 
long flags)
+{
+   if (flags & EFI_MEMMAP_MEMBLOCK) {
+   if (slab_is_available())
+   memblock_free_late(phys, size);
+   else
+   memblock_free(phys, size);
+   } else if (flags & EFI_MEMMAP_SLAB) {
+   struct page *p = pfn_to_page(PHYS_PFN(phys));
+   unsigned int order = get_order(size);
+
+   free_pages((unsigned long) page_address(p), order);
+   }
+}
+
+static void __init efi_memmap_free(void)
+{
+   __efi_memmap_free(efi.memmap.phys_map,
+   efi.memmap.desc_size * efi.memmap.nr_map,
+   efi.memmap.flags);
+}
+
 /**
  * efi_memmap_alloc - Allocate memory for the EFI memory map
  * @num_entries: Number of entries in the allocated map.
@@ -100,6 +122,8 @@ static int __init __efi_memmap_init(struct 
efi_memory_map_data *data)
return -ENOMEM;
}
 
+   efi_memmap_free();
+
map.phys_map = data->phys_map;
map.nr_map = data->size / data->desc_size;
map.map_end = map.map + data->size;


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v4 1/4] efi: Add a flags parameter to efi_memory_map

2020-01-06 Thread Dan Williams
In preparation for garbage collecting dynamically allocated efi memory
maps, where the allocation method of memblock vs slab needs to be
recalled, convert the existing 'late' flag into a 'flags' bitmask.

Arrange for the flag to be passed via 'struct efi_memory_map_data'. This
structure grows additional flags in follow-on changes.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/memmap.c |   31 +--
 include/linux/efi.h   |4 +++-
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 38b686c67b17..4f98eb12c172 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -52,21 +52,20 @@ phys_addr_t __init efi_memmap_alloc(unsigned int 
num_entries)
 /**
  * __efi_memmap_init - Common code for mapping the EFI memory map
  * @data: EFI memory map data
- * @late: Use early or late mapping function?
  *
  * This function takes care of figuring out which function to use to
  * map the EFI memory map in efi.memmap based on how far into the boot
  * we are.
  *
- * During bootup @late should be %false since we only have access to
- * the early_memremap*() functions as the vmalloc space isn't setup.
- * Once the kernel is fully booted we can fallback to the more robust
- * memremap*() API.
+ * During bootup EFI_MEMMAP_LATE in data->flags should be clear since we
+ * only have access to the early_memremap*() functions as the vmalloc
+ * space isn't setup.  Once the kernel is fully booted we can fallback
+ * to the more robust memremap*() API.
  *
  * Returns zero on success, a negative error code on failure.
  */
 static int __init
-__efi_memmap_init(struct efi_memory_map_data *data, bool late)
+__efi_memmap_init(struct efi_memory_map_data *data)
 {
struct efi_memory_map map;
phys_addr_t phys_map;
@@ -76,7 +75,7 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool late)
 
phys_map = data->phys_map;
 
-   if (late)
+   if (data->flags & EFI_MEMMAP_LATE)
map.map = memremap(phys_map, data->size, MEMREMAP_WB);
else
map.map = early_memremap(phys_map, data->size);
@@ -92,7 +91,7 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool late)
 
map.desc_version = data->desc_version;
map.desc_size = data->desc_size;
-   map.late = late;
+   map.flags = data->flags;
 
set_bit(EFI_MEMMAP, &efi.flags);
 
@@ -111,9 +110,10 @@ __efi_memmap_init(struct efi_memory_map_data *data, bool 
late)
 int __init efi_memmap_init_early(struct efi_memory_map_data *data)
 {
/* Cannot go backwards */
-   WARN_ON(efi.memmap.late);
+   WARN_ON(efi.memmap.flags & EFI_MEMMAP_LATE);
 
-   return __efi_memmap_init(data, false);
+   data->flags = 0;
+   return __efi_memmap_init(data);
 }
 
 void __init efi_memmap_unmap(void)
@@ -121,7 +121,7 @@ void __init efi_memmap_unmap(void)
if (!efi_enabled(EFI_MEMMAP))
return;
 
-   if (!efi.memmap.late) {
+   if (!(efi.memmap.flags & EFI_MEMMAP_LATE)) {
unsigned long size;
 
size = efi.memmap.desc_size * efi.memmap.nr_map;
@@ -162,13 +162,14 @@ int __init efi_memmap_init_late(phys_addr_t addr, 
unsigned long size)
struct efi_memory_map_data data = {
.phys_map = addr,
.size = size,
+   .flags = EFI_MEMMAP_LATE,
};
 
/* Did we forget to unmap the early EFI memmap? */
WARN_ON(efi.memmap.map);
 
/* Were we already called? */
-   WARN_ON(efi.memmap.late);
+   WARN_ON(efi.memmap.flags & EFI_MEMMAP_LATE);
 
/*
 * It makes no sense to allow callers to register different
@@ -178,7 +179,7 @@ int __init efi_memmap_init_late(phys_addr_t addr, unsigned 
long size)
data.desc_version = efi.memmap.desc_version;
data.desc_size = efi.memmap.desc_size;
 
-   return __efi_memmap_init(&data, true);
+   return __efi_memmap_init(&data);
 }
 
 /**
@@ -195,6 +196,7 @@ int __init efi_memmap_init_late(phys_addr_t addr, unsigned 
long size)
 int __init efi_memmap_install(phys_addr_t addr, unsigned int nr_map)
 {
struct efi_memory_map_data data;
+   unsigned long flags;
 
efi_memmap_unmap();
 
@@ -202,8 +204,9 @@ int __init efi_memmap_install(phys_addr_t addr, unsigned 
int nr_map)
data.size = efi.memmap.desc_size * nr_map;
data.desc_version = efi.memmap.desc_version;
data.desc_size = efi.memmap.desc_size;
+   data.flags = efi.memmap.flags & EFI_MEMMAP_LATE;
 
-   return __efi_memmap_init(&data, efi.memmap.late);
+   return __efi_memmap_init(&data);
 }
 
 /**
diff --git a/include/linux/efi.h b/include/linux/efi.h
index aa54586db7a5..e7e6c64469a7 100644
--- a/include/l

[PATCH v4 2/4] efi: Add tracking for dynamically allocated memmaps

2020-01-06 Thread Dan Williams
In preparation for fixing efi_memmap_alloc() leaks, add support for
recording whether the memmap was dynamically allocated from slab,
memblock, or is the original physical memmap provided by the platform.

Given this tracking is established in efi_memmap_alloc() and needs to be
carried to efi_memmap_install(), use 'struct efi_memory_map_data' to
convey the flags.

Some small cleanups result from this reorganization, specifically the
removal of local variables for 'phys' and 'size' that are already
tracked in @data.

Cc: Taku Izumi 
Cc: Ard Biesheuvel 
Signed-off-by: Dan Williams 
---
 arch/x86/platform/efi/efi.c |   10 +++-
 arch/x86/platform/efi/quirks.c  |   23 +++
 drivers/firmware/efi/fake_mem.c |   14 +---
 drivers/firmware/efi/memmap.c   |   47 ++-
 include/linux/efi.h |   11 ++---
 5 files changed, 56 insertions(+), 49 deletions(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 38d44f36d5ed..0bd45a988501 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -330,10 +330,16 @@ static void __init efi_clean_memmap(void)
}
 
if (n_removal > 0) {
-   u64 size = efi.memmap.nr_map - n_removal;
+   struct efi_memory_map_data data = {
+   .phys_map = efi.memmap.phys_map,
+   .desc_version = efi.memmap.desc_version,
+   .desc_size = efi.memmap.desc_size,
+   .size = data.desc_size * (efi.memmap.nr_map - 
n_removal),
+   .flags = 0,
+   };
 
pr_warn("Removing %d invalid memory map entries.\n", n_removal);
-   efi_memmap_install(efi.memmap.phys_map, size);
+   efi_memmap_install(&data);
}
 }
 
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f8f0220b6a66..a4f6c7176f24 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -243,7 +243,7 @@ EXPORT_SYMBOL_GPL(efi_query_variable_store);
  */
 void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 {
-   phys_addr_t new_phys, new_size;
+   struct efi_memory_map_data data = { 0 };
struct efi_mem_range mr;
efi_memory_desc_t md;
int num_entries;
@@ -271,24 +271,21 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
size)
num_entries = efi_memmap_split_count(&md, &mr.range);
num_entries += efi.memmap.nr_map;
 
-   new_size = efi.memmap.desc_size * num_entries;
-
-   new_phys = efi_memmap_alloc(num_entries);
-   if (!new_phys) {
+   if (efi_memmap_alloc(num_entries, &data) != 0) {
pr_err("Could not allocate boot services memmap\n");
return;
}
 
-   new = early_memremap(new_phys, new_size);
+   new = early_memremap(data.phys_map, data.size);
if (!new) {
pr_err("Failed to map new boot services memmap\n");
return;
}
 
efi_memmap_insert(&efi.memmap, new, &mr);
-   early_memunmap(new, new_size);
+   early_memunmap(new, data.size);
 
-   efi_memmap_install(new_phys, num_entries);
+   efi_memmap_install(&data);
e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
e820__update_table(e820_table);
 }
@@ -407,7 +404,7 @@ static void __init efi_unmap_pages(efi_memory_desc_t *md)
 
 void __init efi_free_boot_services(void)
 {
-   phys_addr_t new_phys, new_size;
+   struct efi_memory_map_data data = { 0 };
efi_memory_desc_t *md;
int num_entries = 0;
void *new, *new_md;
@@ -462,14 +459,12 @@ void __init efi_free_boot_services(void)
if (!num_entries)
return;
 
-   new_size = efi.memmap.desc_size * num_entries;
-   new_phys = efi_memmap_alloc(num_entries);
-   if (!new_phys) {
+   if (efi_memmap_alloc(num_entries, &data) != 0) {
pr_err("Failed to allocate new EFI memmap\n");
return;
}
 
-   new = memremap(new_phys, new_size, MEMREMAP_WB);
+   new = memremap(data.phys_map, data.size, MEMREMAP_WB);
if (!new) {
pr_err("Failed to map new EFI memmap\n");
return;
@@ -493,7 +488,7 @@ void __init efi_free_boot_services(void)
 
memunmap(new);
 
-   if (efi_memmap_install(new_phys, num_entries)) {
+   if (efi_memmap_install(&data) != 0) {
pr_err("Could not install new EFI memmap\n");
return;
}
diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index bb9fc70d0cfa..a8d20568d532 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -36,9 +36,9 @@ static int __init cmp_fake_mem(con

[PATCH v4 0/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-06 Thread Dan Williams
Changes since v3 [1]:
- Rather than pass a reference to a new flags argument, pass a common
  data structure ('struct efi_memory_map_data'), between
  efi_memmap_alloc() and efi_memmap_install(). (Ard)

- Arrange for EFI_MEMMAP_SLAB to be clear if EFI_MEMMAP_MEMBLOCK was set
  and vice versa (Ard).

[1]: 
http://lore.kernel.org/r/157793839827.977550.7845382457971215205.st...@dwillia2-desk3.amr.corp.intel.com
---

While testing an upcoming patchset to enhance the "soft reservation"
implementation it started crashing when rebased on v5.5-rc3. This
uncovered a few bugs in the efi_fake_mem= handling and
efi_memmap_alloc() leaks.

---

Copied from patch4:

Dave noticed that when specifying multiple efi_fake_mem= entries only
the last entry was successfully being reflected in the efi memory map.
This is due to the fact that the efi_memmap_insert() is being called
multiple times, but on successive invocations the insertion should be
applied to the last new memmap rather than the original map at
efi_fake_memmap() entry.

Rework efi_fake_memmap() to install the new memory map after each
efi_fake_mem= entry is parsed.

This also fixes an issue in efi_fake_memmap() that caused it to litter
emtpy entries into the end of the efi memory map. An empty entry causes
efi_memmap_insert() to attempt more memmap splits / copies than
efi_memmap_split_count() accounted for when sizing the new map. When
that happens efi_memmap_insert() may overrun its allocation, and if you
are lucky will spill over to an unmapped page leading to crash
signature like the following rather than silent corruption:

BUG: unable to handle page fault for address: ff281000
[..]
RIP: 0010:efi_memmap_insert+0x11d/0x191
[..]
Call Trace:
 ? bgrt_init+0xbe/0xbe
 ? efi_arch_mem_reserve+0x1cb/0x228
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0

Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
services data to fix kexec breakage" is listed in Fixes: since it
introduces more occurrences where efi_memmap_insert() is invoked after
an efi_fake_mem= configuration has been parsed. Previously the side
effects of vestigial empty entries were benign, but with commit
af1648984828 that follow-on efi_memmap_insert() invocation triggers
efi_memmap_insert() overruns.

---

Dan Williams (4):
  efi: Add a flags parameter to efi_memory_map
  efi: Add tracking for dynamically allocated memmaps
  efi: Fix efi_memmap_alloc() leaks
  efi: Fix handling of multiple efi_fake_mem= entries


 arch/x86/platform/efi/efi.c |   10 +++-
 arch/x86/platform/efi/quirks.c  |   23 --
 drivers/firmware/efi/fake_mem.c |   43 +-
 drivers/firmware/efi/memmap.c   |   94 ++-
 include/linux/efi.h |   17 +--
 5 files changed, 113 insertions(+), 74 deletions(-)

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v4 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-06 Thread Dan Williams
Dave noticed that when specifying multiple efi_fake_mem= entries only
the last entry was successfully being reflected in the efi memory map.
This is due to the fact that the efi_memmap_insert() is being called
multiple times, but on successive invocations the insertion should be
applied to the last new memmap rather than the original map at
efi_fake_memmap() entry.

Rework efi_fake_memmap() to install the new memory map after each
efi_fake_mem= entry is parsed.

This also fixes an issue in efi_fake_memmap() that caused it to litter
emtpy entries into the end of the efi memory map. An empty entry causes
efi_memmap_insert() to attempt more memmap splits / copies than
efi_memmap_split_count() accounted for when sizing the new map. When
that happens efi_memmap_insert() may overrun its allocation, and if you
are lucky will spill over to an unmapped page leading to crash
signature like the following rather than silent corruption:

BUG: unable to handle page fault for address: ff281000
[..]
RIP: 0010:efi_memmap_insert+0x11d/0x191
[..]
Call Trace:
 ? bgrt_init+0xbe/0xbe
 ? efi_arch_mem_reserve+0x1cb/0x228
 ? acpi_parse_bgrt+0xa/0xd
 ? acpi_table_parse+0x86/0xb8
 ? acpi_boot_init+0x494/0x4e3
 ? acpi_parse_x2apic+0x87/0x87
 ? setup_acpi_sci+0xa2/0xa2
 ? setup_arch+0x8db/0x9e1
 ? start_kernel+0x6a/0x547
 ? secondary_startup_64+0xb6/0xc0

Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
services data to fix kexec breakage" is listed in Fixes: since it
introduces more occurrences where efi_memmap_insert() is invoked after
an efi_fake_mem= configuration has been parsed. Previously the side
effects of vestigial empty entries were benign, but with commit
af1648984828 that follow-on efi_memmap_insert() invocation triggers
efi_memmap_insert() overruns.

Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot services...")
Link: 
https://lore.kernel.org/r/20191231014630.ga24...@dhcp-128-65.nay.redhat.com
Reported-by: Dave Young 
Cc: Taku Izumi 
Cc: Michael Weiser 
Cc: Ard Biesheuvel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Signed-off-by: Dan Williams 
---
 drivers/firmware/efi/fake_mem.c |   31 ---
 drivers/firmware/efi/memmap.c   |2 +-
 include/linux/efi.h |2 ++
 3 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index a8d20568d532..6e0f34a38171 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -34,25 +34,16 @@ static int __init cmp_fake_mem(const void *x1, const void 
*x2)
return 0;
 }
 
-void __init efi_fake_memmap(void)
+static void __init efi_fake_range(struct efi_mem_range *efi_range)
 {
struct efi_memory_map_data data = { 0 };
int new_nr_map = efi.memmap.nr_map;
efi_memory_desc_t *md;
void *new_memmap;
-   int i;
-
-   if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
-   return;
 
/* count up the number of EFI memory descriptor */
-   for (i = 0; i < nr_fake_mem; i++) {
-   for_each_efi_memory_desc(md) {
-   struct range *r = &efi_fake_mems[i].range;
-
-   new_nr_map += efi_memmap_split_count(md, r);
-   }
-   }
+   for_each_efi_memory_desc(md)
+   new_nr_map += efi_memmap_split_count(md, &efi_range->range);
 
/* allocate memory for new EFI memmap */
if (efi_memmap_alloc(new_nr_map, &data) != 0)
@@ -61,17 +52,27 @@ void __init efi_fake_memmap(void)
/* create new EFI memmap */
new_memmap = early_memremap(data.phys_map, data.size);
if (!new_memmap) {
-   memblock_free(data.phys_map, data.size);
+   __efi_memmap_free(data.phys_map, data.size, data.flags);
return;
}
 
-   for (i = 0; i < nr_fake_mem; i++)
-   efi_memmap_insert(&efi.memmap, new_memmap, &efi_fake_mems[i]);
+   efi_memmap_insert(&efi.memmap, new_memmap, efi_range);
 
/* swap into new EFI memmap */
early_memunmap(new_memmap, data.size);
 
efi_memmap_install(&data);
+}
+
+void __init efi_fake_memmap(void)
+{
+   int i;
+
+   if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
+   return;
+
+   for (i = 0; i < nr_fake_mem; i++)
+   efi_fake_range(&efi_fake_mems[i]);
 
/* print new EFI memmap */
efi_print_memmap();
diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index bffa320d2f9a..1b6a4aa78a09 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -29,7 +29,7 @@ static phys_addr_t __init __efi_memmap_alloc_late(unsigned 
long size)
return PFN_PHYS(page_to_pfn(p));
 }
 
-static voi

Re: [PATCH v4 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-06 Thread Dan Williams
On Mon, Jan 6, 2020 at 8:04 PM Dave Young  wrote:
>
> On 01/06/20 at 04:40pm, Dan Williams wrote:
> > Dave noticed that when specifying multiple efi_fake_mem= entries only
> > the last entry was successfully being reflected in the efi memory map.
> > This is due to the fact that the efi_memmap_insert() is being called
> > multiple times, but on successive invocations the insertion should be
> > applied to the last new memmap rather than the original map at
> > efi_fake_memmap() entry.
> >
> > Rework efi_fake_memmap() to install the new memory map after each
> > efi_fake_mem= entry is parsed.
> >
> > This also fixes an issue in efi_fake_memmap() that caused it to litter
> > emtpy entries into the end of the efi memory map. An empty entry causes
> > efi_memmap_insert() to attempt more memmap splits / copies than
> > efi_memmap_split_count() accounted for when sizing the new map. When
> > that happens efi_memmap_insert() may overrun its allocation, and if you
> > are lucky will spill over to an unmapped page leading to crash
> > signature like the following rather than silent corruption:
> >
> > BUG: unable to handle page fault for address: ff281000
> > [..]
> > RIP: 0010:efi_memmap_insert+0x11d/0x191
> > [..]
> > Call Trace:
> >  ? bgrt_init+0xbe/0xbe
> >  ? efi_arch_mem_reserve+0x1cb/0x228
> >  ? acpi_parse_bgrt+0xa/0xd
> >  ? acpi_table_parse+0x86/0xb8
> >  ? acpi_boot_init+0x494/0x4e3
> >  ? acpi_parse_x2apic+0x87/0x87
> >  ? setup_acpi_sci+0xa2/0xa2
> >  ? setup_arch+0x8db/0x9e1
> >  ? start_kernel+0x6a/0x547
> >  ? secondary_startup_64+0xb6/0xc0
> >
> > Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
> > services data to fix kexec breakage" is listed in Fixes: since it
> > introduces more occurrences where efi_memmap_insert() is invoked after
> > an efi_fake_mem= configuration has been parsed. Previously the side
> > effects of vestigial empty entries were benign, but with commit
> > af1648984828 that follow-on efi_memmap_insert() invocation triggers
> > efi_memmap_insert() overruns.
> >
> > Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
> > Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot 
> > services...")
>
> A nitpick for the Fixes flags, as I replied in the thread below:
> https://lore.kernel.org/linux-efi/CAPcyv4jLxqPaB22Ao9oV31Gm=b0+phty+uz33snex4qchou...@mail.gmail.com/T/#m2bb2dd00f7715c9c19ccc48efef0fcd5fdb626e7
>
> I reproduced two other panics without the patches applied, so this issue
> is not caused by either of the commits, maybe just drop the Fixes.

Just the "Fixes: af1648984828", right? No objection from me. I'll let
Ingo say if he needs a resend for that.

The "Fixes: 0f96a99dab36" is valid because the original implementation
failed to handle the multiple argument case from the beginning.

>
> > Link: 
> > https://lore.kernel.org/r/20191231014630.ga24...@dhcp-128-65.nay.redhat.com
> > Reported-by: Dave Young 
> > Cc: Taku Izumi 
> > Cc: Michael Weiser 
> > Cc: Ard Biesheuvel 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/firmware/efi/fake_mem.c |   31 ---
> >  drivers/firmware/efi/memmap.c   |2 +-
> >  include/linux/efi.h |2 ++
> >  3 files changed, 19 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/firmware/efi/fake_mem.c 
> > b/drivers/firmware/efi/fake_mem.c
> > index a8d20568d532..6e0f34a38171 100644
> > --- a/drivers/firmware/efi/fake_mem.c
> > +++ b/drivers/firmware/efi/fake_mem.c
> > @@ -34,25 +34,16 @@ static int __init cmp_fake_mem(const void *x1, const 
> > void *x2)
> >   return 0;
> >  }
> >
> > -void __init efi_fake_memmap(void)
> > +static void __init efi_fake_range(struct efi_mem_range *efi_range)
> >  {
> >   struct efi_memory_map_data data = { 0 };
> >   int new_nr_map = efi.memmap.nr_map;
> >   efi_memory_desc_t *md;
> >   void *new_memmap;
> > - int i;
> > -
> > - if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem)
> > - return;
> >
> >   /* count up the number of EFI memory descriptor */
> > - for (i = 0; i < nr_fake_mem; i++) {
> > - for_each_efi_memory_desc(md) {
> > - struct range *r = &efi_fake_mems[i].range;
> > -
> > - new_nr_map += efi_memmap_split_count(md, r);
> > - }
> > - }
> > + for_each_efi_memory_desc(md)
> > + new_nr_map += efi_memmap_split_count(md, &efi_range->range);
>
> For this part, although I still have some concerns, but since I'm not
> 100% clear about it, maybe just leave it as you do, and see if it is
> good to Ard.

Absent a specific failure case I didn't see anything to change here.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v4 3/4] efi: Fix efi_memmap_alloc() leaks

2020-01-06 Thread Dan Williams
On Mon, Jan 6, 2020 at 7:58 PM Dave Young  wrote:
>
> On 01/06/20 at 04:40pm, Dan Williams wrote:
> > With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> > updated and replaced multiple times. When that happens a previous
> > dynamically allocated efi memory map can be garbage collected. Use the
> > new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> > allocated memory map is being replaced.
> >
> > Debug statements in efi_memmap_free() reveal:
> >
> >  efi: __efi_memmap_free:37: phys: 0x23ffdd580 size: 2688 flags: 0x2
> >  efi: __efi_memmap_free:37: phys: 0x9db00 size: 2640 flags: 0x2
> >  efi: __efi_memmap_free:37: phys: 0x9e580 size: 2640 flags: 0x2
> >
> > ...a savings of 7968 bytes on a qemu boot with 2 entries specified to
> > efi_fake_mem=.
> >
> > Cc: Taku Izumi 
> > Cc: Ard Biesheuvel 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/firmware/efi/memmap.c |   24 
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
> > index 04dfa56b994b..bffa320d2f9a 100644
> > --- a/drivers/firmware/efi/memmap.c
> > +++ b/drivers/firmware/efi/memmap.c
> > @@ -29,6 +29,28 @@ static phys_addr_t __init 
> > __efi_memmap_alloc_late(unsigned long size)
> >   return PFN_PHYS(page_to_pfn(p));
> >  }
> >
> > +static void __init __efi_memmap_free(u64 phys, unsigned long size, 
> > unsigned long flags)
> > +{
> > + if (flags & EFI_MEMMAP_MEMBLOCK) {
> > + if (slab_is_available())
> > + memblock_free_late(phys, size);
> > + else
> > + memblock_free(phys, size);
> > + } else if (flags & EFI_MEMMAP_SLAB) {
> > + struct page *p = pfn_to_page(PHYS_PFN(phys));
> > + unsigned int order = get_order(size);
> > +
> > + free_pages((unsigned long) page_address(p), order);
> > + }
> > +}
> > +
> > +static void __init efi_memmap_free(void)
> > +{
> > + __efi_memmap_free(efi.memmap.phys_map,
> > + efi.memmap.desc_size * efi.memmap.nr_map,
> > + efi.memmap.flags);
> > +}
> > +
> >  /**
> >   * efi_memmap_alloc - Allocate memory for the EFI memory map
> >   * @num_entries: Number of entries in the allocated map.
> > @@ -100,6 +122,8 @@ static int __init __efi_memmap_init(struct 
> > efi_memory_map_data *data)
> >   return -ENOMEM;
> >   }
> >
> > + efi_memmap_free();
> > +
>
> This seems still not safe,  see below function:
> arch/x86/platform/efi/efi.c:
> static void __init efi_clean_memmap(void)
> It use same memmap for both old and new, and filter out those invalid
> ranges in place, if the memory is freed then ..

In the efi_clean_memmap() case flags are 0, so efi_memmap_free() is a nop.

Would you feel better with an explicit?

WARN_ON(efi.memmap.phys_map == data->phys_map && (data->flags &
(EFI_MEMMAP_SLAB | EFI_MEMMAP_MEMBLOCK))

...not sure it's worth it.

>
> >   map.phys_map = data->phys_map;
> >   map.nr_map = data->size / data->desc_size;
> >   map.map_end = map.map + data->size;
> >

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v4 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-08 Thread Dan Williams
On Tue, Jan 7, 2020 at 9:52 AM Ard Biesheuvel  wrote:
>
> On Tue, 7 Jan 2020 at 06:19, Dave Young  wrote:
> >
> > On 01/06/20 at 08:16pm, Dan Williams wrote:
> > > On Mon, Jan 6, 2020 at 8:04 PM Dave Young  wrote:
> > > >
> > > > On 01/06/20 at 04:40pm, Dan Williams wrote:
> > > > > Dave noticed that when specifying multiple efi_fake_mem= entries only
> > > > > the last entry was successfully being reflected in the efi memory map.
> > > > > This is due to the fact that the efi_memmap_insert() is being called
> > > > > multiple times, but on successive invocations the insertion should be
> > > > > applied to the last new memmap rather than the original map at
> > > > > efi_fake_memmap() entry.
> > > > >
> > > > > Rework efi_fake_memmap() to install the new memory map after each
> > > > > efi_fake_mem= entry is parsed.
> > > > >
> > > > > This also fixes an issue in efi_fake_memmap() that caused it to litter
> > > > > emtpy entries into the end of the efi memory map. An empty entry 
> > > > > causes
> > > > > efi_memmap_insert() to attempt more memmap splits / copies than
> > > > > efi_memmap_split_count() accounted for when sizing the new map. When
> > > > > that happens efi_memmap_insert() may overrun its allocation, and if 
> > > > > you
> > > > > are lucky will spill over to an unmapped page leading to crash
> > > > > signature like the following rather than silent corruption:
> > > > >
> > > > > BUG: unable to handle page fault for address: ff281000
> > > > > [..]
> > > > > RIP: 0010:efi_memmap_insert+0x11d/0x191
> > > > > [..]
> > > > > Call Trace:
> > > > >  ? bgrt_init+0xbe/0xbe
> > > > >  ? efi_arch_mem_reserve+0x1cb/0x228
> > > > >  ? acpi_parse_bgrt+0xa/0xd
> > > > >  ? acpi_table_parse+0x86/0xb8
> > > > >  ? acpi_boot_init+0x494/0x4e3
> > > > >  ? acpi_parse_x2apic+0x87/0x87
> > > > >  ? setup_acpi_sci+0xa2/0xa2
> > > > >  ? setup_arch+0x8db/0x9e1
> > > > >  ? start_kernel+0x6a/0x547
> > > > >  ? secondary_startup_64+0xb6/0xc0
> > > > >
> > > > > Commit af1648984828 "x86/efi: Update e820 with reserved EFI boot
> > > > > services data to fix kexec breakage" is listed in Fixes: since it
> > > > > introduces more occurrences where efi_memmap_insert() is invoked after
> > > > > an efi_fake_mem= configuration has been parsed. Previously the side
> > > > > effects of vestigial empty entries were benign, but with commit
> > > > > af1648984828 that follow-on efi_memmap_insert() invocation triggers
> > > > > efi_memmap_insert() overruns.
> > > > >
> > > > > Fixes: 0f96a99dab36 ("efi: Add 'efi_fake_mem' boot option")
> > > > > Fixes: af1648984828 ("x86/efi: Update e820 with reserved EFI boot 
> > > > > services...")
> > > >
> > > > A nitpick for the Fixes flags, as I replied in the thread below:
> > > > https://lore.kernel.org/linux-efi/CAPcyv4jLxqPaB22Ao9oV31Gm=b0+phty+uz33snex4qchou...@mail.gmail.com/T/#m2bb2dd00f7715c9c19ccc48efef0fcd5fdb626e7
> > > >
> > > > I reproduced two other panics without the patches applied, so this issue
> > > > is not caused by either of the commits, maybe just drop the Fixes.
> > >
> > > Just the "Fixes: af1648984828", right? No objection from me. I'll let
> > > Ingo say if he needs a resend for that.
> > >
> > > The "Fixes: 0f96a99dab36" is valid because the original implementation
> > > failed to handle the multiple argument case from the beginning.
> >
> > Agreed, thanks!
> >
>
> I'll queue this but without the fixes tags. The -stable maintainers
> are far too trigger happy IMHO, and this really needs careful review
> before being backported. efi_fake_mem is a debug feature anyway, so I
> don't see an urgent need to get this fixed retroactively in older
> kernels.

I'm fine to drop the fixes tags.

However, I do want to point out my driving motive for digging in on
efi_fake_mem=nn@ss:0x4, is that it is a better interface for
diverting memory ranges to device-dax than the current standard bearer
memmap=nn!ss. The main benefit is that the kernel only considers the
attribute when it is applied to EFI_CONVENTIONAL_MEMORY. This fixes a
long standing unsolvable issue of people picking busted memmap=nn!ss
settings that clobber platform memory ranges, or vector off into
nothing.

So yes, efi_fake_mem is a debug feature, but if the popularity of
memmap=nn!ss is any clue I expect efi_fake_mem=nn@ss:0x4 will be a
useful tool going forward for late enabling, or repairing platform
"soft reservation" declarations.

I'll respin the series with those tags dropped and add the comment you
recommended about the cases when efi_memmap_free() is expected to be a
nop. Holler if there's anything else, but that's all I had on my list
to fix up.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v4 4/4] efi: Fix handling of multiple efi_fake_mem= entries

2020-01-09 Thread Dan Williams
On Thu, Jan 9, 2020 at 1:36 AM Ard Biesheuvel  wrote:
[..]
> If it's just for the comment, I can just slap that on, as I already
> queued the patches with the fixes tags dropped.

That would be great. Thanks Ard!

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] mm/sparse: reset section's mem_map when fully deactivated

2020-01-16 Thread Dan Williams
On Thu, Jan 16, 2020 at 10:23 PM Pingfan Liu  wrote:
>
> On Thu, Jan 16, 2020 at 3:50 PM Michal Hocko  wrote:
> >
> > On Thu 16-01-20 11:01:08, Pingfan Liu wrote:
> > > When fully deactivated, it is meaningless to keep the value of a section's
> > > mem_map. And its mem_map will be reassigned during re-added.
> > >
> > > Beside this, it breaks the user space tool "makedumpfile", which makes
> > > assumption that a hot-removed section having mem_map as NULL.
> >
> > We used to do that before ba72b4c8cf60 ("mm/sparsemem: support
> > sub-section hotplug"). Dan was this an intentional change?
> I do not know the purpose of this. But the change just leave section
> start pfn in fully deactivated section_mem_map, and not used any more.

Not intentional, IIRC at the time I had convinced myself that the
value would always be translated by sparse_decode_mem_map(), so I
thought it could be hiding NULL de-references.  I don't see any harm
in the patch.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


RE: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.

2022-11-30 Thread Dan Williams
lizhij...@fujitsu.com wrote:
> Hi folks,
> 
> I'm going to make crash coredump support pmem region. So
> I have modified kexec-tools to add pmem region to PT_LOAD of vmcore.
> 
> But it failed at makedumpfile, log are as following:
> 
> In my environment, i found the last 512 pages in pmem region will cause the 
> error.
> 
> qemu commandline:
>   -object 
> memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/root/qemu-dax.img,share=yes,size=4267704320,align=2097152
> -device nvdimm,node=0,label-size=4194304,memdev=memnvdimm0,id=nvdimm0,slot=0
> 
> ndctl info:
> [root@rdma-server ~]# ndctl list
> [
>{
>  "dev":"namespace0.0",
>  "mode":"devdax",
>  "map":"dev",
>  "size":4127195136,
>  "uuid":"f6fc1e86-ac5b-48d8-9cda-4888a33158f9",
>  "chardev":"dax0.0",
>  "align":4096
>}
> ]
> [root@rdma-server ~]# ndctl list -iRD
> {
>"dimms":[
>  {
>"dev":"nmem0",
>"id":"8680-56341200",
>"handle":1,
>"phys_id":0
>  }
>],
>"regions":[
>  {
>"dev":"region0",
>"size":4263510016,
>"align":16777216,
>"available_size":0,
>"max_available_extent":0,
>"type":"pmem",
>"iset_id":10248187106440278,
>"mappings":[
>  {
>"dimm":"nmem0",
>"offset":0,
>"length":4263510016,
>"position":0
>  }
>],
>"persistence_domain":"unknown"
>  }
>]
> }
> 
> iomem info:
> [root@rdma-server ~]# cat /proc/iomem  | grep Persi
> 14000-23e1f : Persistent Memory
> 
> makedumpfile info:
> [   57.229110] kdump.sh[240]: mem_map[  71] ea0008e0   238000 
>   23e200
> 
> 
> Firstly, i wonder that
> 1) makedumpfile read the whole range of iomem(same with the PT_LOAD of pmem)
> 2) 1st kernel side only setup mem_map(vmemmap) for this namespace, which size 
> is 512 pages smaller than iomem for some reasons.
> 3) Since there is an align in nvdimm region(16MiB in above), i also guess the 
> maximum size of the pmem can used by user should
> be ALIGN(iomem, 10MiB), after this alignment, the last 512 pages will be 
> dropped. then kernel only setups vmemmap for this
> range. but i didn't see any code doing such things in kernel side.
> 
> So if you guy know the reasons, please let me know :), any hint/feedback is 
> very welcome.

This is due to the region alignment.

2522afb86a8c libnvdimm/region: Introduce an 'align' attribute

If you want to use the full capacity it would be something like this
(untested, and may destroy any data currently on the namespace):

ndctl destroy-namespace namespace0.0
echo $((2<<20)) > /sys/bus/nd/devices/region0/align
ndctl create-namespace -m dax -a 4k -M mem



RE: [RFC][nvdimm][crash] pmem memmap dump support

2023-03-16 Thread Dan Williams
lizhij...@fujitsu.com wrote:
[..]
> Case D: unsupported && need your input To support this situation, the
> makedumpfile needs to know the location of metadata for each pmem
> namespace and the address and size of metadata in the pmem [start,
> end)

My first reaction is that you should copy what the ndctl utility does
when it needs to manipulate or interrogate the metadata space.

For example, see namespace_rw_infoblock():

https://github.com/pmem/ndctl/blob/main/ndctl/namespace.c#L2022

That facility uses the force_raw attribute
("/sys/bus/nd/devices/namespaceX.Y/force_raw") to arrange for the
namespace to initalize without considering any pre-existing metdata
*and* without overwriting it. In that mode makedumpfile can walk the
namespaces and retrieve the metadata written by the previous kernel.

The module to block to allow makedumpfile to access the namespace in raw
mode is the nd_pmem module, or if it is builtin the
nd_pmem_driver_init() initcall.



Re: [RFC][nvdimm][crash] pmem memmap dump support

2023-03-17 Thread Dan Williams
lizhij...@fujitsu.com wrote:
> 
> 
> On 17/03/2023 14:12, Dan Williams wrote:
> > lizhij...@fujitsu.com wrote:
> > [..]
> >> Case D: unsupported && need your input To support this situation, the
> >> makedumpfile needs to know the location of metadata for each pmem
> >> namespace and the address and size of metadata in the pmem [start,
> >> end)
> > 
> > My first reaction is that you should copy what the ndctl utility does
> > when it needs to manipulate or interrogate the metadata space.
> > 
> > For example, see namespace_rw_infoblock():> 
> > https://github.com/pmem/ndctl/blob/main/ndctl/namespace.c#L2022
> > 
> > That facility uses the force_raw attribute
> > ("/sys/bus/nd/devices/namespaceX.Y/force_raw") to arrange for the
> > namespace to initalize without considering any pre-existing metdata
> > *and* without overwriting it. In that mode makedumpfile can walk the
> > namespaces and retrieve the metadata written by the previous kernel.
> 
> For the dumping application(makedumpfile or cp), it will/should reads
> /proc/vmcore to construct the dumpfile, So makedumpfile need to know
> the *address* and *size/end* of metadata in the view of 1st kernel
> address space.

Another option, instead of passing the metadata layout into the crash
kernel, is to just parse the infoblock and calculate teh boundaries of
userdata and metadata.

> I haven't known much about namespace_rw_infoblock() , so it is also an
> option if we can know such information from it.
> 
> My current WIP propose is to export a list linking all pmem namespaces
> to vmcore, with this, the kdump kernel don't need to rely on the pmem
> driver.

Seems like more work to avoid using the pmem driver as new information
passing infrastructure needs to be built vs reusing what is already
there.



RE: [RFC PATCH v2 0/3] pmem memmap dump support

2023-04-28 Thread Dan Williams
Li Zhijian wrote:
> Hello folks,
> 
> About 2 months ago, we posted our first RFC[3] and received your kindly 
> feedback. Thank you :)
> Now, I'm back with the code.
> 
> Currently, this RFC has already implemented to supported case D*. And the 
> case A&B is disabled
> deliberately in makedumpfile. It includes changes in 3 source code as below:

I think the reason this patchkit is difficult to follow is that it
spends a lot of time describing a chosen solution, but not enough time
describing the problem and the tradeoffs.

For example why is updating /proc/vmcore with pmem metadata the chosen
solution? Why not leave the kernel out of it and have makedumpfile
tooling aware of how to parse persistent memory namespace info-blocks
and retrieve that dump itself? This is what I proposed here:

http://lore.kernel.org/r/641484f7ef780_a52e2...@dwillia2-mobl3.amr.corp.intel.com.notmuch

...but never got an answer, or I missed the answer.



Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-15 Thread Dan Williams
On Fri, Oct 9, 2020 at 12:52 PM  wrote:
>
> From: Ira Weiny 
>
> The kmap() calls in this FS are localized to a single thread.  To avoid
> the over head of global PKRS updates use the new kmap_thread() call.
>
> Cc: Nicolas Pitre 
> Signed-off-by: Ira Weiny 
> ---
>  fs/cramfs/inode.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> index 912308600d39..003c014a42ed 100644
> --- a/fs/cramfs/inode.c
> +++ b/fs/cramfs/inode.c
> @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block *sb, 
> unsigned int offset,
> struct page *page = pages[i];
>
> if (page) {
> -   memcpy(data, kmap(page), PAGE_SIZE);
> -   kunmap(page);
> +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> +   kunmap_thread(page);

Why does this need a sleepable kmap? This looks like a textbook
kmap_atomic() use case.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-15 Thread Dan Williams
On Tue, Oct 13, 2020 at 12:37 PM Matthew Wilcox  wrote:
>
> On Tue, Oct 13, 2020 at 11:44:29AM -0700, Dan Williams wrote:
> > On Fri, Oct 9, 2020 at 12:52 PM  wrote:
> > >
> > > From: Ira Weiny 
> > >
> > > The kmap() calls in this FS are localized to a single thread.  To avoid
> > > the over head of global PKRS updates use the new kmap_thread() call.
> > >
> > > Cc: Nicolas Pitre 
> > > Signed-off-by: Ira Weiny 
> > > ---
> > >  fs/cramfs/inode.c | 10 +-
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > > index 912308600d39..003c014a42ed 100644
> > > --- a/fs/cramfs/inode.c
> > > +++ b/fs/cramfs/inode.c
> > > @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block 
> > > *sb, unsigned int offset,
> > > struct page *page = pages[i];
> > >
> > > if (page) {
> > > -   memcpy(data, kmap(page), PAGE_SIZE);
> > > -   kunmap(page);
> > > +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> > > +   kunmap_thread(page);
> >
> > Why does this need a sleepable kmap? This looks like a textbook
> > kmap_atomic() use case.
>
> There's a lot of code of this form.  Could we perhaps have:
>
> static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned 
> int size)
> {
> char *vto = kmap_atomic(to);
>
> memcpy(vto, vfrom, size);
> kunmap_atomic(vto);
> }
>
> in linux/highmem.h ?

Nice, yes, that could also replace the local ones in lib/iov_iter.c
(memcpy_{to,from}_page())

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v1 2/3] kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources

2021-03-22 Thread Dan Williams
On Mon, Mar 22, 2021 at 9:02 AM David Hildenbrand  wrote:
>
> It used to be true that we can have system RAM only on the first level
> in the resourc tree. However, this is no longer holds for driver-managed
> system RAM (i.e., dax/kmem and virtio-mem).
>
> The function walk_mem_res() only consideres the first level and is
> used in arch/x86/mm/ioremap.c:__ioremap_check_mem() only. We currently
> fail to identify System RAM added by dax/kmem and virtio-mem as
> "IORES_MAP_SYSTEM_RAM", for example, allowing for remapping of such
> "normal RAM" in __ioremap_caller().
>
> Let's find all busy IORESOURCE_MEM resources, making the function
> behave similar to walk_system_ram_res().

Looks good,

Reviewed-by: Dan Williams 

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v1 3/3] kernel/resource: remove first_lvl / siblings_only logic

2021-03-22 Thread Dan Williams
On Mon, Mar 22, 2021 at 9:03 AM David Hildenbrand  wrote:
>
> All IORESOURCE_SYSTEM_RAM and IORESOURCE_MEM now properly consider the
> whole resource tree, not just the first level. Let's drop the unused
> first_lvl / siblings_only logic.
>
> All functions properly search the whole tree, so remove documentation
> that indicates that some functions behave differently.

Looks good, and the staging of the potential regressions as standalone
lead-in patches makes sense.

Reviewed-by: Dan Williams 

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v1 1/3] kernel/resource: make walk_system_ram_res() find all busy IORESOURCE_SYSTEM_RAM resources

2021-03-22 Thread Dan Williams
On Mon, Mar 22, 2021 at 9:02 AM David Hildenbrand  wrote:
>
> It used to be true that we can have busy system RAM only on the first level
> in the resourc tree. However, this is no longer holds for driver-managed
> system RAM (i.e., added via dax/kmem and virtio-mem), which gets added on
> lower levels.
>
> We have two users of walk_system_ram_res(), which currently only
> consideres the first level:
> a) kernel/kexec_file.c:kexec_walk_resources() -- We properly skip
>IORESOURCE_SYSRAM_DRIVER_MANAGED resources via
>locate_mem_hole_callback(), so even after this change, we won't be
>placing kexec images onto dax/kmem and virtio-mem added memory. No
>change.
> b) arch/x86/kernel/crash.c:fill_up_crash_elf_data() -- we're currently
>not adding relevant ranges to the crash elf info, resulting in them
>not getting dumped via kdump.
>
> This change fixes loading a crashkernel via kexec_file_load() and including
> dax/kmem and virtio-mem added System RAM in the crashdump on x86-64. Note
> that e.g,, arm64 relies on memblock data and, therefore, always considers
> all added System RAM already.
>
> Let's find all busy IORESOURCE_SYSTEM_RAM resources, making the function
> behave like walk_system_ram_range().
>
> Cc: Andrew Morton 
> Cc: Greg Kroah-Hartman 
> Cc: Dan Williams 
> Cc: Daniel Vetter 
> Cc: Andy Shevchenko 
> Cc: Mauro Carvalho Chehab 
> Cc: Signed-off-by: David Hildenbrand 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: Vivek Goyal 
> Cc: Dave Hansen 
> Cc: Keith Busch 
> Cc: Michal Hocko 
> Cc: Qian Cai 
> Cc: Oscar Salvador 
> Cc: Eric Biederman 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: Tom Lendacky 
> Cc: Brijesh Singh 
> Cc: x...@kernel.org
> Cc: kexec@lists.infradead.org
> Signed-off-by: David Hildenbrand 
> ---
>  kernel/resource.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 627e61b0c124..4efd6e912279 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -457,7 +457,7 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
>  {
> unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
>
> -   return __walk_iomem_res_desc(start, end, flags, IORES_DESC_NONE, true,
> +   return __walk_iomem_res_desc(start, end, flags, IORES_DESC_NONE, 
> false,
>  arg, func);

Looks good,

Reviewed-by: Dan Williams 

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3] lockdown,selinux: fix wrong subject in some SELinux lockdown checks

2021-06-18 Thread Dan Williams
On Wed, Jun 16, 2021 at 1:51 AM Ondrej Mosnacek  wrote:
>
> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> lockdown") added an implementation of the locked_down LSM hook to
> SELinux, with the aim to restrict which domains are allowed to perform
> operations that would breach lockdown.
>
> However, in several places the security_locked_down() hook is called in
> situations where the current task isn't doing any action that would
> directly breach lockdown, leading to SELinux checks that are basically
> bogus.
>
> To fix this, add an explicit struct cred pointer argument to
> security_lockdown() and define NULL as a special value to pass instead
> of current_cred() in such situations. LSMs that take the subject
> credentials into account can then fall back to some default or ignore
> such calls altogether. In the SELinux lockdown hook implementation, use
> SECINITSID_KERNEL in case the cred argument is NULL.
>
> Most of the callers are updated to pass current_cred() as the cred
> pointer, thus maintaining the same behavior. The following callers are
> modified to pass NULL as the cred pointer instead:
> 1. arch/powerpc/xmon/xmon.c
>  Seems to be some interactive debugging facility. It appears that
>  the lockdown hook is called from interrupt context here, so it
>  should be more appropriate to request a global lockdown decision.
> 2. fs/tracefs/inode.c:tracefs_create_file()
>  Here the call is used to prevent creating new tracefs entries when
>  the kernel is locked down. Assumes that locking down is one-way -
>  i.e. if the hook returns non-zero once, it will never return zero
>  again, thus no point in creating these files. Also, the hook is
>  often called by a module's init function when it is loaded by
>  userspace, where it doesn't make much sense to do a check against
>  the current task's creds, since the task itself doesn't actually
>  use the tracing functionality (i.e. doesn't breach lockdown), just
>  indirectly makes some new tracepoints available to whoever is
>  authorized to use them.
> 3. net/xfrm/xfrm_user.c:copy_to_user_*()
>  Here a cryptographic secret is redacted based on the value returned
>  from the hook. There are two possible actions that may lead here:
>  a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> task context is relevant, since the dumped data is sent back to
> the current task.
>  b) When adding/deleting/updating an SA via XFRM_MSG_xxxSA, the
> dumped SA is broadcasted to tasks subscribed to XFRM events -
> here the current task context is not relevant as it doesn't
> represent the tasks that could potentially see the secret.
>  It doesn't seem worth it to try to keep using the current task's
>  context in the a) case, since the eventual data leak can be
>  circumvented anyway via b), plus there is no way for the task to
>  indicate that it doesn't care about the actual key value, so the
>  check could generate a lot of "false alert" denials with SELinux.
>  Thus, let's pass NULL instead of current_cred() here faute de
>  mieux.
>
> Improvements-suggested-by: Casey Schaufler 
> Improvements-suggested-by: Paul Moore 
> Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> Signed-off-by: Ondrej Mosnacek 
[..]
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 2acc6173da36..c1747b6555c7 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -568,7 +568,7 @@ static bool cxl_mem_raw_command_allowed(u16 opcode)
> if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS))
> return false;
>
> -   if (security_locked_down(LOCKDOWN_NONE))
> +   if (security_locked_down(current_cred(), LOCKDOWN_NONE))

Acked-by: Dan Williams 

...however that usage looks wrong. The expectation is that if kernel
integrity protections are enabled then raw command access should be
disabled. So I think that should be equivalent to LOCKDOWN_PCI_ACCESS
in terms of the command capabilities to filter.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3] lockdown,selinux: fix wrong subject in some SELinux lockdown checks

2021-08-31 Thread Dan Williams
On Tue, Aug 31, 2021 at 6:53 AM Paul Moore  wrote:
>
> On Tue, Aug 31, 2021 at 5:09 AM Ondrej Mosnacek  wrote:
> > On Sat, Jun 19, 2021 at 12:18 AM Dan Williams  
> > wrote:
> > > On Wed, Jun 16, 2021 at 1:51 AM Ondrej Mosnacek  
> > > wrote:
>
> ...
>
> > > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > > index 2acc6173da36..c1747b6555c7 100644
> > > > --- a/drivers/cxl/mem.c
> > > > +++ b/drivers/cxl/mem.c
> > > > @@ -568,7 +568,7 @@ static bool cxl_mem_raw_command_allowed(u16 opcode)
> > > > if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS))
> > > > return false;
> > > >
> > > > -   if (security_locked_down(LOCKDOWN_NONE))
> > > > +   if (security_locked_down(current_cred(), LOCKDOWN_NONE))
> > >
> > > Acked-by: Dan Williams 
> > >
> > > ...however that usage looks wrong. The expectation is that if kernel
> > > integrity protections are enabled then raw command access should be
> > > disabled. So I think that should be equivalent to LOCKDOWN_PCI_ACCESS
> > > in terms of the command capabilities to filter.
> >
> > Yes, the LOCKDOWN_NONE seems wrong here... but it's a pre-existing bug
> > and I didn't want to go down yet another rabbit hole trying to fix it.
> > I'll look at this again once this patch is settled - it may indeed be
> > as simple as replacing LOCKDOWN_NONE with LOCKDOWN_PCI_ACCESS.
>
> At this point you should be well aware of my distaste for merging
> patches that have known bugs in them.  Yes, this is a pre-existing
> condition, but it seems well within the scope of this work to address
> it as well.
>
> This isn't something that is going to get merged while the merge
> window is open, so at the very least you've got almost two weeks to
> sort this out - please do that.

Yes, apologies, I should have sent the fix shortly after noticing the
problem. I'll get the CXL bug fix out of the way so Ondrej can move
this along.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec reboot broken with ioatdma?

2009-12-16 Thread Dan Williams

Roland Dreier wrote:

 > from a kexec point of view I believe that the preferred option is the
 > former - shutdown the device so it can be initialised using standard paths
 > in the second kernel.

OK... however I'm not suggesting a separate kexec initialization path,
simply adding a reset of the device in the standard initialization.
This would be fairly normal for other types of device; for example, the
BIOS may have left a NIC in an undefined state due to network boot.  Of
course BIOS is unlikely to use an IOAT DMA engine but the principle of
limiting assumptions about platform state still stands I think.


I agree that is more robust if the init path copes with hardware 
arriving in an unknown state.  I'll look into adding a channel reset in 
the init path (something that should probably have been there since the 
beginning).



From a quick look, it seems tricky to get a clean shutdown of IOAT stuff
since there doesn't seem to be a clean ordering that makes sure the
ioatdma stuff is shutdown after everything using it.


The engines may be in use by multiple subsytems (net, raid) so 
coordinating shutdown ordering would indeed be a pain.


--
Dan


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec reboot broken with ioatdma?

2009-12-18 Thread Dan Williams
Subject: ioat2,3: put channel hardware in known state at init

Put the ioat2 and ioat3 state machines in the halted state with all
errors cleared.

The ioat1 init path is not disturbed for stability, there are no
reported ioat1 initialization issues.

Reported-by: Roland Dreier 
Not-Yet-Signed-off-by: Dan Williams 
---
On Wed, 2009-12-16 at 16:23 -0700, Roland Dreier wrote: 
> > I agree that is more robust if the init path copes with hardware
>  > arriving in an unknown state.  I'll look into adding a channel reset
>  > in the init path (something that should probably have been there since
>  > the beginning).
> 
> Great, let me know and I'll try it on my system with kexec.
> 

I was able to recover stuck channels with this patch, let me know if it
resolves the kexec issue.

Thanks,
Dan

 drivers/dma/ioat/dma.c   |2 +
 drivers/dma/ioat/dma.h   |   18 +++
 drivers/dma/ioat/dma_v2.c|   69 --
 drivers/dma/ioat/dma_v2.h|2 +
 drivers/dma/ioat/dma_v3.c|   54 -
 drivers/dma/ioat/registers.h |1 +
 6 files changed, 114 insertions(+), 32 deletions(-)


diff --git a/drivers/dma/ioat/dma.c b/drivers/dma/ioat/dma.c
index c524d36..dcc4ab7 100644
--- a/drivers/dma/ioat/dma.c
+++ b/drivers/dma/ioat/dma.c
@@ -1032,7 +1032,7 @@ int __devinit ioat_probe(struct ioatdma_device *device)
dma->dev = &pdev->dev;
 
if (!dma->chancnt) {
-   dev_err(dev, "zero channels detected\n");
+   dev_err(dev, "channel enumeration error\n");
goto err_setup_interrupts;
}
 
diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h
index 45edde9..bbc3e78 100644
--- a/drivers/dma/ioat/dma.h
+++ b/drivers/dma/ioat/dma.h
@@ -60,6 +60,7 @@
  * @dca: direct cache access context
  * @intr_quirk: interrupt setup quirk (for ioat_v1 devices)
  * @enumerate_channels: hw version specific channel enumeration
+ * @reset_hw: hw version specific channel (re)initialization
  * @cleanup_tasklet: select between the v2 and v3 cleanup routines
  * @timer_fn: select between the v2 and v3 timer watchdog routines
  * @self_test: hardware version specific self test for each supported op type
@@ -78,6 +79,7 @@ struct ioatdma_device {
struct dca_provider *dca;
void (*intr_quirk)(struct ioatdma_device *device);
int (*enumerate_channels)(struct ioatdma_device *device);
+   int (*reset_hw)(struct ioat_chan_common *chan);
void (*cleanup_tasklet)(unsigned long data);
void (*timer_fn)(unsigned long data);
int (*self_test)(struct ioatdma_device *device);
@@ -264,6 +266,22 @@ static inline void ioat_suspend(struct ioat_chan_common 
*chan)
writeb(IOAT_CHANCMD_SUSPEND, chan->reg_base + IOAT_CHANCMD_OFFSET(ver));
 }
 
+static inline void ioat_reset(struct ioat_chan_common *chan)
+{
+   u8 ver = chan->device->version;
+
+   writeb(IOAT_CHANCMD_RESET, chan->reg_base + IOAT_CHANCMD_OFFSET(ver));
+}
+
+static inline bool ioat_reset_pending(struct ioat_chan_common *chan)
+{
+   u8 ver = chan->device->version;
+   u8 cmd;
+
+   cmd = readb(chan->reg_base + IOAT_CHANCMD_OFFSET(ver));
+   return (cmd & IOAT_CHANCMD_RESET) == IOAT_CHANCMD_RESET;
+}
+
 static inline void ioat_set_chainaddr(struct ioat_dma_chan *ioat, u64 addr)
 {
struct ioat_chan_common *chan = &ioat->base;
diff --git a/drivers/dma/ioat/dma_v2.c b/drivers/dma/ioat/dma_v2.c
index 8f1f7f0..5f7a500 100644
--- a/drivers/dma/ioat/dma_v2.c
+++ b/drivers/dma/ioat/dma_v2.c
@@ -239,20 +239,50 @@ void __ioat2_restart_chan(struct ioat2_dma_chan *ioat)
__ioat2_start_null_desc(ioat);
 }
 
-static void ioat2_restart_channel(struct ioat2_dma_chan *ioat)
+int ioat2_quiesce(struct ioat_chan_common *chan, unsigned long tmo)
 {
-   struct ioat_chan_common *chan = &ioat->base;
-   unsigned long phys_complete;
+   unsigned long end = jiffies + tmo;
+   int err = 0;
u32 status;
 
status = ioat_chansts(chan);
if (is_ioat_active(status) || is_ioat_idle(status))
ioat_suspend(chan);
while (is_ioat_active(status) || is_ioat_idle(status)) {
+   if (end && time_after(jiffies, end)) {
+   err = -ETIMEDOUT;
+   break;
+   }
status = ioat_chansts(chan);
cpu_relax();
}
 
+   return err;
+}
+
+int ioat2_reset_sync(struct ioat_chan_common *chan, unsigned long tmo)
+{
+   unsigned long end = jiffies + tmo;
+   int err = 0;
+
+   ioat_reset(chan);
+   while (ioat_reset_pending(chan)) {
+   if (end && time_after(jiffies, end)) {
+   err = -ETIMEDOUT;
+   break;
+   }
+   cpu_relax();
+   }

Re: kexec reboot broken with ioatdma?

2009-12-18 Thread Dan Williams
On Fri, 2009-12-18 at 15:20 -0700, Roland Dreier wrote:
> > Subject: ioat2,3: put channel hardware in known state at init
>  > 
>  > Put the ioat2 and ioat3 state machines in the halted state with all
>  > errors cleared.
> 
> Yes, with this applied I was able to kexec reboot from a kernel with
> ioatdma loaded and have ioatdma work in the new kernel.  So:
> 
> Tested-by: Roland Dreier 

What was that, all of 8 minutes to turn that test around! :-) Nice.

Thanks,
Dan



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec