Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-13 Thread Baoquan He
On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> On Mon, Apr 12, 2021 at 2:52 AM Baoquan He  wrote:
> >
> > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> > >
> > >
> > > > On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> > > >
> > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > > >> Why don't we do this unconditionally? At the very best we gain half a 
> > > >> megabyte of memory (except the trampoline, which has to live there, 
> > > >> but it is only a few kilobytes.)
> > > >
> > > > This is a great suggestion, thanks. I think we can fix it in this way to
> > > > make code simpler. Then the specific caring of real mode in
> > > > efi_free_boot_services() can be removed too.
> > > >
> > >
> > > This whole situation makes me think that the code is buggy before and 
> > > buggy after.
> > >
> > > The issue here (I think) is that various pieces of code want to reserve 
> > > specific pieces of otherwise-available low memory for their own nefarious 
> > > uses. I don’t know *why* crash kernel needs this, but that doesn’t matter 
> > > too much.
> >
> > Kdump kernel also need go through real mode code path during bootup. It
> > is not different than normal kernel except that it skips the firmware
> > resetting. So kdump kernel needs low 1M as system RAM just as normal
> > kernel does. Here we reserve the whole low 1M with memblock_reserve()
> > to avoid any later kernel or driver data reside in this area. Otherwise,
> > we need dump the content of this area to vmcore. As we know, when crash
> > happened, the old memory of 1st kernel should be untouched until vmcore
> > dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
> > In the past, we used a back up region to copy out the low 1M area, and
> > map the back up region into the low 1M area in vmcore elf file. In
> > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> > option is specified"), we changed to lock the whole low 1M to avoid
> > writting any kernel data into, like this we can skip this area when
> > dumping vmcore.
> >
> > Above is why we try to memblock reserve the whole low 1M. We don't want
> > to use it, just don't want anyone to use it in 1st kernel.
> >
> > >
> > > I propose that the right solution is to give low-memory-reserving code 
> > > paths two chances to do what they need: once at the very beginning and 
> > > once after EFI boot services are freed.
> > >
> > > Alternatively, just reserve *all* otherwise unused sub 1M memory up 
> > > front, then release it right after releasing boot services, and then 
> > > invoke the special cases exactly once.
> >
> > I am not sure if I got both suggested ways clearly. They look a little
> > complicated in our case. As I explained at above, we want the whole low
> > 1M locked up, not one piece or some pieces of it.
> 
> My second suggestion is probably the better one.  Here it is, concretely:
> 
> The early (pre-free_efi_boot_services) code just reserves all
> available sub-1M memory unconditionally, but it specially marks it as
> reserved-but-available-later.  We stop allocating the trampoline page
> at this stage.
> 
> In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> stick it in the pile of reserved memory created in the early step.
> This may involve splitting a block, kind of like the current
> trampoline late allocation works.
> 
> Then, *after* free_efi_boot_services(), we run a single block of code
> that lets everything that wants sub-1M code claim some.  This means
> that the trampoline gets allocated and, if crashkernel wants to claim
> everything else, it can.  After that, everything still unclaimed gets
> freed.

void __init setup_arch(char **cmdline_p)
{
...
efi_reserve_boot_services();
e820__memblock_alloc_reserved_mpc_new();
#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
setup_bios_corruption_check();
#endif
reserve_real_mode();
  

trim_platform_memory_ranges();
trim_low_memory_range();
...
}

After efi_reserve_boot_services(), there are several function calling to
require memory reservation under low 1M.


asmlinkage __visible void __init __no_sanitize_address start_kernel(void)   
  
{
...
setup_arch(_line);
...
mm_init();
--> mem_init();
 -->memblock_free_all();

...
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
-->efi_free_boot_services();
-->memblock_free_late();
#endif
...
}

So from the code flow, we can see that buddy allocator is built in
mm_init() which puts all memory from memblock.memory excluding
memblock.reserved into buddy. And much later, we call
efi_free_boot_services() to release those 

Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-12 Thread Andy Lutomirski
On Mon, Apr 12, 2021 at 2:52 AM Baoquan He  wrote:
>
> On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> >
> >
> > > On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> > >
> > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > >> Why don't we do this unconditionally? At the very best we gain half a 
> > >> megabyte of memory (except the trampoline, which has to live there, but 
> > >> it is only a few kilobytes.)
> > >
> > > This is a great suggestion, thanks. I think we can fix it in this way to
> > > make code simpler. Then the specific caring of real mode in
> > > efi_free_boot_services() can be removed too.
> > >
> >
> > This whole situation makes me think that the code is buggy before and buggy 
> > after.
> >
> > The issue here (I think) is that various pieces of code want to reserve 
> > specific pieces of otherwise-available low memory for their own nefarious 
> > uses. I don’t know *why* crash kernel needs this, but that doesn’t matter 
> > too much.
>
> Kdump kernel also need go through real mode code path during bootup. It
> is not different than normal kernel except that it skips the firmware
> resetting. So kdump kernel needs low 1M as system RAM just as normal
> kernel does. Here we reserve the whole low 1M with memblock_reserve()
> to avoid any later kernel or driver data reside in this area. Otherwise,
> we need dump the content of this area to vmcore. As we know, when crash
> happened, the old memory of 1st kernel should be untouched until vmcore
> dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
> In the past, we used a back up region to copy out the low 1M area, and
> map the back up region into the low 1M area in vmcore elf file. In
> 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> option is specified"), we changed to lock the whole low 1M to avoid
> writting any kernel data into, like this we can skip this area when
> dumping vmcore.
>
> Above is why we try to memblock reserve the whole low 1M. We don't want
> to use it, just don't want anyone to use it in 1st kernel.
>
> >
> > I propose that the right solution is to give low-memory-reserving code 
> > paths two chances to do what they need: once at the very beginning and once 
> > after EFI boot services are freed.
> >
> > Alternatively, just reserve *all* otherwise unused sub 1M memory up front, 
> > then release it right after releasing boot services, and then invoke the 
> > special cases exactly once.
>
> I am not sure if I got both suggested ways clearly. They look a little
> complicated in our case. As I explained at above, we want the whole low
> 1M locked up, not one piece or some pieces of it.

My second suggestion is probably the better one.  Here it is, concretely:

The early (pre-free_efi_boot_services) code just reserves all
available sub-1M memory unconditionally, but it specially marks it as
reserved-but-available-later.  We stop allocating the trampoline page
at this stage.

In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
stick it in the pile of reserved memory created in the early step.
This may involve splitting a block, kind of like the current
trampoline late allocation works.

Then, *after* free_efi_boot_services(), we run a single block of code
that lets everything that wants sub-1M code claim some.  This means
that the trampoline gets allocated and, if crashkernel wants to claim
everything else, it can.  After that, everything still unclaimed gets
freed.

Does that make sense?

--Andy


Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-12 Thread lijiang
Thank you for the comment, H. Peter, Andy and Baoquan.

在 2021年04月12日 17:52, Baoquan He 写道:
> On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
>>
>>
>>> On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
>>>
>>> On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
 Why don't we do this unconditionally? At the very best we gain half a 
 megabyte of memory (except the trampoline, which has to live there, but it 
 is only a few kilobytes.)
>>>
>>> This is a great suggestion, thanks. I think we can fix it in this way to
>>> make code simpler. Then the specific caring of real mode in
>>> efi_free_boot_services() can be removed too.
>>>
>>
>> This whole situation makes me think that the code is buggy before and buggy 
>> after.
>>
>> The issue here (I think) is that various pieces of code want to reserve 
>> specific pieces of otherwise-available low memory for their own nefarious 
>> uses. I don’t know *why* crash kernel needs this, but that doesn’t matter 
>> too much.
> 
> Kdump kernel also need go through real mode code path during bootup. It
> is not different than normal kernel except that it skips the firmware
> resetting. So kdump kernel needs low 1M as system RAM just as normal
> kernel does. Here we reserve the whole low 1M with memblock_reserve()
> to avoid any later kernel or driver data reside in this area. Otherwise,
> we need dump the content of this area to vmcore. As we know, when crash
> happened, the old memory of 1st kernel should be untouched until vmcore
> dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
> In the past, we used a back up region to copy out the low 1M area, and
> map the back up region into the low 1M area in vmcore elf file. In
> 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> option is specified"), we changed to lock the whole low 1M to avoid
> writting any kernel data into, like this we can skip this area when
> dumping vmcore.
> 
> Above is why we try to memblock reserve the whole low 1M. We don't want
> to use it, just don't want anyone to use it in 1st kernel.
> 
>>
>> I propose that the right solution is to give low-memory-reserving code paths 
>> two chances to do what they need: once at the very beginning and once after 
>> EFI boot services are freed.
>>
>> Alternatively, just reserve *all* otherwise unused sub 1M memory up front, 
>> then release it right after releasing boot services, and then invoke the 
>> special cases exactly once.
> 

After EFI boot services are freed, I'm worried that it's a bit late. All sub-1M 
memory regions need to be reserved early as soon as possible.

> I am not sure if I got both suggested ways clearly. They look a little
> complicated in our case. As I explained at above, we want the whole low
> 1M locked up, not one piece or some pieces of it.
> 
>>
>> In either case, the result is that the crashkernel mess gets unified with 
>> the trampoline mess.  One way the result is called twice and needs to be 
>> more careful, and the other way it’s called only once.
>>

That may still have a chance to allocate memory from sub-1M regions at some 
point, because EFI boot services will be freed after EFI enters virtual mode, 
it looks late.

>> Just skipping freeing boot services seems wrong.  It doesn’t unmap boot 
>> services, and skipping that is incorrect, I think. And it seems to result in 
>> a bogus memory map in which the system thinks that some crashkernel memory 
>> is EFI memory instead.
> 
> I like hpa's thought to lock the whole low 1M unconditionally since only
> a few KB except of trampoline area is there. Rethinking about it, doing
> it in can_free_region() may be risky because efi memory region could
> cross the 1M boundary, e.g [640K, 100M] with type of
> EFI_BOOT_SERVICES_CODE|EFI_BOOT_SERVICES_DATA, it could cause loss of memory.

Theoretically, yes. But so far I haven't seen the situation of crossing the 1M 
boundary.

Thanks.
Lianbo

> Just a wild guess, not very sure if the 1M boundary corssing can really
> happen. efi_reserve_boot_services() won't split regions.
> 
> If moving efi_reserve_boot_services() after reserve_real_mode() is not
> accepted, maybe we can call efi_mem_reserve(0, 1M) just as
> efi_esrt_init() has done.
> 



Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-12 Thread Baoquan He
On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> 
> 
> > On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> > 
> > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> >> Why don't we do this unconditionally? At the very best we gain half a 
> >> megabyte of memory (except the trampoline, which has to live there, but it 
> >> is only a few kilobytes.)
> > 
> > This is a great suggestion, thanks. I think we can fix it in this way to
> > make code simpler. Then the specific caring of real mode in
> > efi_free_boot_services() can be removed too.
> > 
> 
> This whole situation makes me think that the code is buggy before and buggy 
> after.
> 
> The issue here (I think) is that various pieces of code want to reserve 
> specific pieces of otherwise-available low memory for their own nefarious 
> uses. I don’t know *why* crash kernel needs this, but that doesn’t matter too 
> much.

Kdump kernel also need go through real mode code path during bootup. It
is not different than normal kernel except that it skips the firmware
resetting. So kdump kernel needs low 1M as system RAM just as normal
kernel does. Here we reserve the whole low 1M with memblock_reserve()
to avoid any later kernel or driver data reside in this area. Otherwise,
we need dump the content of this area to vmcore. As we know, when crash
happened, the old memory of 1st kernel should be untouched until vmcore
dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
In the past, we used a back up region to copy out the low 1M area, and
map the back up region into the low 1M area in vmcore elf file. In
6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
option is specified"), we changed to lock the whole low 1M to avoid
writting any kernel data into, like this we can skip this area when
dumping vmcore.

Above is why we try to memblock reserve the whole low 1M. We don't want
to use it, just don't want anyone to use it in 1st kernel.

> 
> I propose that the right solution is to give low-memory-reserving code paths 
> two chances to do what they need: once at the very beginning and once after 
> EFI boot services are freed.
> 
> Alternatively, just reserve *all* otherwise unused sub 1M memory up front, 
> then release it right after releasing boot services, and then invoke the 
> special cases exactly once.

I am not sure if I got both suggested ways clearly. They look a little
complicated in our case. As I explained at above, we want the whole low
1M locked up, not one piece or some pieces of it.

> 
> In either case, the result is that the crashkernel mess gets unified with the 
> trampoline mess.  One way the result is called twice and needs to be more 
> careful, and the other way it’s called only once.
> 
> Just skipping freeing boot services seems wrong.  It doesn’t unmap boot 
> services, and skipping that is incorrect, I think. And it seems to result in 
> a bogus memory map in which the system thinks that some crashkernel memory is 
> EFI memory instead.

I like hpa's thought to lock the whole low 1M unconditionally since only
a few KB except of trampoline area is there. Rethinking about it, doing
it in can_free_region() may be risky because efi memory region could
cross the 1M boundary, e.g [640K, 100M] with type of
EFI_BOOT_SERVICES_CODE|EFI_BOOT_SERVICES_DATA, it could cause loss of memory.
Just a wild guess, not very sure if the 1M boundary corssing can really
happen. efi_reserve_boot_services() won't split regions.

If moving efi_reserve_boot_services() after reserve_real_mode() is not
accepted, maybe we can call efi_mem_reserve(0, 1M) just as
efi_esrt_init() has done.



Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-11 Thread Andy Lutomirski



> On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> 
> On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
>> Why don't we do this unconditionally? At the very best we gain half a 
>> megabyte of memory (except the trampoline, which has to live there, but it 
>> is only a few kilobytes.)
> 
> This is a great suggestion, thanks. I think we can fix it in this way to
> make code simpler. Then the specific caring of real mode in
> efi_free_boot_services() can be removed too.
> 

This whole situation makes me think that the code is buggy before and buggy 
after.

The issue here (I think) is that various pieces of code want to reserve 
specific pieces of otherwise-available low memory for their own nefarious uses. 
I don’t know *why* crash kernel needs this, but that doesn’t matter too much.

I propose that the right solution is to give low-memory-reserving code paths 
two chances to do what they need: once at the very beginning and once after EFI 
boot services are freed.

Alternatively, just reserve *all* otherwise unused sub 1M memory up front, then 
release it right after releasing boot services, and then invoke the special 
cases exactly once.

In either case, the result is that the crashkernel mess gets unified with the 
trampoline mess.  One way the result is called twice and needs to be more 
careful, and the other way it’s called only once.

Just skipping freeing boot services seems wrong.  It doesn’t unmap boot 
services, and skipping that is incorrect, I think. And it seems to result in a 
bogus memory map in which the system thinks that some crashkernel memory is EFI 
memory instead.

Let’s please just fix the problem instead of papering over it with more hacks.

> Thanks
> Baoquan
> 


Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-11 Thread Baoquan He
On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> Why don't we do this unconditionally? At the very best we gain half a 
> megabyte of memory (except the trampoline, which has to live there, but it is 
> only a few kilobytes.)

This is a great suggestion, thanks. I think we can fix it in this way to
make code simpler. Then the specific caring of real mode in
efi_free_boot_services() can be removed too.

Thanks
Baoquan



Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-09 Thread lijiang
Hi, Baoquan
Thank you for the comment.
在 2021年04月09日 20:44, Baoquan He 写道:
> On 04/07/21 at 10:03pm, Lianbo Jiang wrote:
>> Some sub-1MB memory regions may be reserved by EFI boot services, and the
>> memory regions will be released later in the efi_free_boot_services().
>>
>> Currently, always reserve all sub-1MB memory regions when the crashkernel
>> option is specified, but unfortunately EFI boot services may have already
>> reserved some sub-1MB memory regions before the crash_reserve_low_1M() is
>> called, which makes that the crash_reserve_low_1M() only own the
>> remaining sub-1MB memory regions, not all sub-1MB memory regions, because,
>> subsequently EFI boot services will free its own sub-1MB memory regions.
>> Eventually, DMA will be able to allocate memory from the sub-1MB area and
>> cause the following error:
>>
> 
> So this patch is fixing a problem found in crash utility. We ever met
> the similar issue, later fixed by always reserving low 1M in commit
> 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> option is specified"). Seems the commit is not fixing it completely.
> 
Maybe I should add the "Fixes: 6f599d84231f" in front of 'Signed-off-by' as 
below:

Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel 
option is specified")

>> crash> kmem -s |grep invalid
>> kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
>> 9403c0067300
>> kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
>> 9403c0067300
>> crash> vtop 9403c0067300
>> VIRTUAL   PHYSICAL
>> 9403c0067300  67300   --->The physical address falls into this range 
>> [0x00063000-0x0008efff]
>>
>> kernel debugging log:
>> ...
>> [0.008927] memblock_reserve: [0x0001-0x00013fff] 
>> efi_reserve_boot_services+0x85/0xd0
>> [0.008930] memblock_reserve: [0x00063000-0x0008efff] 
>> efi_reserve_boot_services+0x85/0xd0
>> ...
>> [0.009425] memblock_reserve: [0x-0x000f] 
>> crash_reserve_low_1M+0x2c/0x49
>> ...
>> [0.010586] Zone ranges:
>> [0.010587]   DMA  [mem 0x1000-0x00ff]
>> [0.010589]   DMA32[mem 0x0100-0x]
>> [0.010591]   Normal   [mem 0x0001-0x000c7fff]
>> [0.010593]   Device   empty
>> ...
>> [8.814894] __memblock_free_late: [0x00063000-0x0008efff] 
>> efi_free_boot_services+0x14b/0x23b
>> [8.815793] __memblock_free_late: [0x0001-0x00013fff] 
>> efi_free_boot_services+0x14b/0x23b
> 
> 
> In commit 6f599d84231fd27, we call crash_reserve_low_1M() to lock the
> whole low 1M area if crashkernel is specified in kernel cmdline.
> But earlier efi_reserve_boot_services() invokation will break the
> intention of the whole low 1M reserving. In efi_reserve_boot_services(),
> if any memory under low 1M hasn't been reserved, it will call
> memblock_reserve() to reserve it and leave it to
> efi_free_boot_services() to free.
> 

Good understanding.

> Hi Lianbo,
> 
> Please correct me if I am wrong or anything is missed. IIUC, can we move
> efi_reserve_boot_services() after reserve_real_mode() to fix this bug?

What do you think about the following changes?

patch [1]:

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 5ecd69a48393..c343de3178ec 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1064,12 +1064,6 @@ void __init setup_arch(char **cmdline_p)
efi_esrt_init();
efi_mokvar_table_init();
 
-   /*
-* The EFI specification says that boot service code won't be
-* called after ExitBootServices(). This is, in fact, a lie.
-*/
-   efi_reserve_boot_services();
-
/* preallocate 4k for mptable mpc */
e820__memblock_alloc_reserved_mpc_new();
 
@@ -1087,6 +1081,12 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();
 
+   /*
+* The EFI specification says that boot service code won't be
+* called after ExitBootServices(). This is, in fact, a lie.
+*/
+   efi_reserve_boot_services();
+
init_mem_mapping();
 
idt_setup_early_pf();

> Or move reserve_real_mode() before efi_reserve_boot_services() since
> those real mode regions are all under 1M? Assume efi boot code/data

Or patch [2]

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 5ecd69a48393..ceec5af0dfab 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1058,6 +1058,7 @@ void __init setup_arch(char **cmdline_p)
sev_setup_arch();
 
reserve_bios_regions();
+   reserve_real_mode();
 
efi_fake_memmap();
efi_find_mirror();
@@ -1082,8 +1083,6 @@ void __init setup_arch(char **cmdline_p)
(max_pfn_mapped< won't rely on low 1M area any more at this moment.
> 

Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-09 Thread Baoquan He
On 04/07/21 at 10:03pm, Lianbo Jiang wrote:
> Some sub-1MB memory regions may be reserved by EFI boot services, and the
> memory regions will be released later in the efi_free_boot_services().
> 
> Currently, always reserve all sub-1MB memory regions when the crashkernel
> option is specified, but unfortunately EFI boot services may have already
> reserved some sub-1MB memory regions before the crash_reserve_low_1M() is
> called, which makes that the crash_reserve_low_1M() only own the
> remaining sub-1MB memory regions, not all sub-1MB memory regions, because,
> subsequently EFI boot services will free its own sub-1MB memory regions.
> Eventually, DMA will be able to allocate memory from the sub-1MB area and
> cause the following error:
> 

So this patch is fixing a problem found in crash utility. We ever met
the similar issue, later fixed by always reserving low 1M in commit
6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
option is specified"). Seems the commit is not fixing it completely.

> crash> kmem -s |grep invalid
> kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
> 9403c0067300
> kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
> 9403c0067300
> crash> vtop 9403c0067300
> VIRTUAL   PHYSICAL
> 9403c0067300  67300   --->The physical address falls into this range 
> [0x00063000-0x0008efff]
> 
> kernel debugging log:
> ...
> [0.008927] memblock_reserve: [0x0001-0x00013fff] 
> efi_reserve_boot_services+0x85/0xd0
> [0.008930] memblock_reserve: [0x00063000-0x0008efff] 
> efi_reserve_boot_services+0x85/0xd0
> ...
> [0.009425] memblock_reserve: [0x-0x000f] 
> crash_reserve_low_1M+0x2c/0x49
> ...
> [0.010586] Zone ranges:
> [0.010587]   DMA  [mem 0x1000-0x00ff]
> [0.010589]   DMA32[mem 0x0100-0x]
> [0.010591]   Normal   [mem 0x0001-0x000c7fff]
> [0.010593]   Device   empty
> ...
> [8.814894] __memblock_free_late: [0x00063000-0x0008efff] 
> efi_free_boot_services+0x14b/0x23b
> [8.815793] __memblock_free_late: [0x0001-0x00013fff] 
> efi_free_boot_services+0x14b/0x23b


In commit 6f599d84231fd27, we call crash_reserve_low_1M() to lock the
whole low 1M area if crashkernel is specified in kernel cmdline.
But earlier efi_reserve_boot_services() invokation will break the
intention of the whole low 1M reserving. In efi_reserve_boot_services(),
if any memory under low 1M hasn't been reserved, it will call
memblock_reserve() to reserve it and leave it to
efi_free_boot_services() to free.

Hi Lianbo,

Please correct me if I am wrong or anything is missed. IIUC, can we move
efi_reserve_boot_services() after reserve_real_mode() to fix this bug?
Or move reserve_real_mode() before efi_reserve_boot_services() since
those real mode regions are all under 1M? Assume efi boot code/data
won't rely on low 1M area any more at this moment.

Thanks
Baoquan

> 
> Do not release sub-1MB memory regions even though they are reserved by
> EFI boot services, so that always reserve all sub-1MB memory regions when
> the crashkernel option is specified.
> 
> Signed-off-by: Lianbo Jiang 
> ---
>  arch/x86/platform/efi/quirks.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 67d93a243c35..637f932c4fd4 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define EFI_MIN_RESERVE 5120
>  
> @@ -303,6 +304,19 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
>   */
>  static __init bool can_free_region(u64 start, u64 size)
>  {
> + /*
> +  * Some sub-1MB memory regions may be reserved by EFI boot
> +  * services, and these memory regions will be released later
> +  * in the efi_free_boot_services().
> +  *
> +  * Do not release sub-1MB memory regions even though they are
> +  * reserved by EFI boot services, because, always reserve all
> +  * sub-1MB memory when the crashkernel option is specified.
> +  */
> + if (cmdline_find_option(boot_command_line, "crashkernel", NULL, 0) > 0
> + && (start + size < (1<<20)))
> + return false;
> +
>   if (start + size > __pa_symbol(_text) && start <= __pa_symbol(_end))
>   return false;
>  
> -- 
> 2.17.1
> 



[PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-07 Thread Lianbo Jiang
Some sub-1MB memory regions may be reserved by EFI boot services, and the
memory regions will be released later in the efi_free_boot_services().

Currently, always reserve all sub-1MB memory regions when the crashkernel
option is specified, but unfortunately EFI boot services may have already
reserved some sub-1MB memory regions before the crash_reserve_low_1M() is
called, which makes that the crash_reserve_low_1M() only own the
remaining sub-1MB memory regions, not all sub-1MB memory regions, because,
subsequently EFI boot services will free its own sub-1MB memory regions.
Eventually, DMA will be able to allocate memory from the sub-1MB area and
cause the following error:

crash> kmem -s |grep invalid
kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
9403c0067300
kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
9403c0067300
crash> vtop 9403c0067300
VIRTUAL   PHYSICAL
9403c0067300  67300   --->The physical address falls into this range 
[0x00063000-0x0008efff]

kernel debugging log:
...
[0.008927] memblock_reserve: [0x0001-0x00013fff] 
efi_reserve_boot_services+0x85/0xd0
[0.008930] memblock_reserve: [0x00063000-0x0008efff] 
efi_reserve_boot_services+0x85/0xd0
...
[0.009425] memblock_reserve: [0x-0x000f] 
crash_reserve_low_1M+0x2c/0x49
...
[0.010586] Zone ranges:
[0.010587]   DMA  [mem 0x1000-0x00ff]
[0.010589]   DMA32[mem 0x0100-0x]
[0.010591]   Normal   [mem 0x0001-0x000c7fff]
[0.010593]   Device   empty
...
[8.814894] __memblock_free_late: [0x00063000-0x0008efff] 
efi_free_boot_services+0x14b/0x23b
[8.815793] __memblock_free_late: [0x0001-0x00013fff] 
efi_free_boot_services+0x14b/0x23b

Do not release sub-1MB memory regions even though they are reserved by
EFI boot services, so that always reserve all sub-1MB memory regions when
the crashkernel option is specified.

Signed-off-by: Lianbo Jiang 
---
 arch/x86/platform/efi/quirks.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 67d93a243c35..637f932c4fd4 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define EFI_MIN_RESERVE 5120
 
@@ -303,6 +304,19 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
size)
  */
 static __init bool can_free_region(u64 start, u64 size)
 {
+   /*
+* Some sub-1MB memory regions may be reserved by EFI boot
+* services, and these memory regions will be released later
+* in the efi_free_boot_services().
+*
+* Do not release sub-1MB memory regions even though they are
+* reserved by EFI boot services, because, always reserve all
+* sub-1MB memory when the crashkernel option is specified.
+*/
+   if (cmdline_find_option(boot_command_line, "crashkernel", NULL, 0) > 0
+   && (start + size < (1<<20)))
+   return false;
+
if (start + size > __pa_symbol(_text) && start <= __pa_symbol(_end))
return false;
 
-- 
2.17.1