Thank you for the comment, H. Peter, Andy and Baoquan. 在 2021年04月12日 17:52, Baoquan He 写道: > On 04/11/21 at 06:49pm, Andy Lutomirski wrote: >> >> >>> On Apr 11, 2021, at 6:14 PM, Baoquan He <[email protected]> wrote: >>> >>> On 04/09/21 at 07:59pm, H. Peter Anvin wrote: >>>> Why don't we do this unconditionally? At the very best we gain half a >>>> megabyte of memory (except the trampoline, which has to live there, but it >>>> is only a few kilobytes.) >>> >>> This is a great suggestion, thanks. I think we can fix it in this way to >>> make code simpler. Then the specific caring of real mode in >>> efi_free_boot_services() can be removed too. >>> >> >> This whole situation makes me think that the code is buggy before and buggy >> after. >> >> The issue here (I think) is that various pieces of code want to reserve >> specific pieces of otherwise-available low memory for their own nefarious >> uses. I don’t know *why* crash kernel needs this, but that doesn’t matter >> too much. > > Kdump kernel also need go through real mode code path during bootup. It > is not different than normal kernel except that it skips the firmware > resetting. So kdump kernel needs low 1M as system RAM just as normal > kernel does. Here we reserve the whole low 1M with memblock_reserve() > to avoid any later kernel or driver data reside in this area. Otherwise, > we need dump the content of this area to vmcore. As we know, when crash > happened, the old memory of 1st kernel should be untouched until vmcore > dumping read out its content. Meanwhile, kdump kernel need reuse low 1M. > In the past, we used a back up region to copy out the low 1M area, and > map the back up region into the low 1M area in vmcore elf file. In > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel > option is specified"), we changed to lock the whole low 1M to avoid > writting any kernel data into, like this we can skip this area when > dumping vmcore. > > Above is why we try to memblock reserve the whole low 1M. We don't want > to use it, just don't want anyone to use it in 1st kernel. > >> >> I propose that the right solution is to give low-memory-reserving code paths >> two chances to do what they need: once at the very beginning and once after >> EFI boot services are freed. >> >> Alternatively, just reserve *all* otherwise unused sub 1M memory up front, >> then release it right after releasing boot services, and then invoke the >> special cases exactly once. >
After EFI boot services are freed, I'm worried that it's a bit late. All sub-1M memory regions need to be reserved early as soon as possible. > I am not sure if I got both suggested ways clearly. They look a little > complicated in our case. As I explained at above, we want the whole low > 1M locked up, not one piece or some pieces of it. > >> >> In either case, the result is that the crashkernel mess gets unified with >> the trampoline mess. One way the result is called twice and needs to be >> more careful, and the other way it’s called only once. >> That may still have a chance to allocate memory from sub-1M regions at some point, because EFI boot services will be freed after EFI enters virtual mode, it looks late. >> Just skipping freeing boot services seems wrong. It doesn’t unmap boot >> services, and skipping that is incorrect, I think. And it seems to result in >> a bogus memory map in which the system thinks that some crashkernel memory >> is EFI memory instead. > > I like hpa's thought to lock the whole low 1M unconditionally since only > a few KB except of trampoline area is there. Rethinking about it, doing > it in can_free_region() may be risky because efi memory region could > cross the 1M boundary, e.g [640K, 100M] with type of > EFI_BOOT_SERVICES_CODE|EFI_BOOT_SERVICES_DATA, it could cause loss of memory. Theoretically, yes. But so far I haven't seen the situation of crossing the 1M boundary. Thanks. Lianbo > Just a wild guess, not very sure if the 1M boundary corssing can really > happen. efi_reserve_boot_services() won't split regions. > > If moving efi_reserve_boot_services() after reserve_real_mode() is not > accepted, maybe we can call efi_mem_reserve(0, 1M) just as > efi_esrt_init() has done. >

