On Fri, Jun 23, 2017 at 10:42:10AM +0200, Ingo Molnar wrote:
>
> * Chen Yu <yu.c.c...@intel.com> wrote:
>
> > Hi Ingo,
> > On Thu, Jun 22, 2017 at 11:40:30AM +0200, Ingo Molnar wrote:
> > >
> > > * Chen Yu <yu.c.c...@intel.com> wrote:
> > >
> > > > Currently we try to have e820_table_firmware to represent the
> > > > original firmware memory layout passed to us by the bootloader,
> > > > however it is not the case, the e820_table_firmware might still
> > > > be modified by linux:
> > > > 1. During bootup, the efi boot stub might allocate memory via
> > > >    efi service for the PCI device information structure, then
> > > >    later e820_reserve_setup_data() reserved these dynamically
> > > >    allocated structures (AKA setup_data) in e820_table_firmware
> > > >    accordingly.
> > > > 2. The kexec might also modify the e820_table_firmware.
> > >
> > > Hm, so why does the EFI code modify e820_table_firmware - why doesn't
> > > it modify e820_table?
> > >
> > Both the e820_table and e820_table_firmware will be updated in
> > e820__reserve_setup_data():
> > Changing the PCI device information structures from E820_TYPE_RAM
> > to E820_TYPE_RESERVED_KERN.
> > > I.e. what is the point of having 3 different versions of the
> > > memory layout table?
> > My original thought was that we should not record the modification
> > from the efi boot stub into the e820_table_firmware and we are done.
> > But after checking the code, I realized that if we do so the
> > kexec might have a potential problem.
> >
> > The e820_table_firmware was introduced mainly for kexec and
> > was used to pass the original memory layout to the second
> > kernel:
> >
> > commit 5dfcf14d5b28174f94cbe9b4fb35d415db61c64a
> > Author: Bernhard Walle <bwa...@suse.de>
> > Date:   Fri Jun 27 13:12:55 2008 +0200
> >
> >     x86: use FIRMWARE_MEMMAP on x86/E820
> >
> > Besides, the second kernel will not re-enter the efi boot stub
> > code and it will reuse the PCI device information structure created
> > by the first kernel, which is stored in the E820_TYPE_RESERVED_KERN
> > region. So these PCI device information structures will not be
> > modified by the second kernel, as kexec will only pass the E820_TYPE_RAM
> > to the second kernel, thus the latter could leverage ioremap to access
> > the PCI information.
> >
> > So the problem is, if we do not record the PCI information in
> > the e820_table_firmware, the PCI information will be kept as
> > type E820_TYPE_RAM, and all the E820_TYPE_RAM type regions will
> > be passed to the second kernel and might be allocated for ordinary
> > use in the second kernel, as a result the second kernel might not
> > get valid PCI information (it might be overwritten by others). So
> > currently we try to introduce a new e820_table_ori to represent
> > the original one provided by the BIOS (mainly for hibernation
> > memory layout md5 checking).
>
> So there's 3 versions we need:
>
>  - the original 'firmware' table as-is - for MD5 check and other potential
>    purposes
>
>  - some intermediate version of the table for kexec: what is the exact definition
>    of that table, what changes from the real table does it _not_ want?
>
Some boot options such as 'mem=' are not wanted by kexec, because
the kexec wants to let the second kernel see the whole memory layout
passed by the bootloader. I think this is why e820_table_firmware was
introduced.
>  - the 'real' table
>
> all the naming should reflect that. I.e. instead of some nonsensical "_ori"
> postfix, that is really the _firmware table. If kexec needs a separate one then
> name it _kexec and copy it at the right stage.
>
> Ok?
>
Ok. I'm sending V2 of this patch. I tried not to break the old behavior
and split the patch into three, so that the logic might look clearer.
> Thanks,
>
> 	Ingo
Thanks,
	Yu
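
For illustration only (this is not code from the patch): a minimal user-space
sketch in C of the three-table split discussed above. Every type and function
name here is hypothetical and none of them are kernel APIs; the addresses are
made up, and the 'mem=' handling is a simplified stand-in that only trims RAM
above a limit. The point it tries to show is the ordering of the copies: the
firmware copy is taken before any change, the kexec copy after the setup_data
reservation but before 'mem=', and only the working table sees 'mem='.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model; these are NOT the kernel's e820 types. */
enum region_type { REGION_RAM = 1, REGION_RESERVED_KERN = 2 };

struct region { uint64_t start, size; enum region_type type; };

#define MAX_REGIONS 16
struct table { int nr; struct region r[MAX_REGIONS]; };

static void table_add(struct table *t, uint64_t start, uint64_t size,
                      enum region_type type)
{
    assert(t->nr < MAX_REGIONS);
    t->r[t->nr++] = (struct region){ start, size, type };
}

/* Retype [start, start + size); the range must lie inside a single entry,
 * which is split into up to three pieces. Other entries are copied as-is. */
static void table_retype(struct table *t, uint64_t start, uint64_t size,
                         enum region_type type)
{
    struct table out = { 0 };
    uint64_t end = start + size;

    for (int i = 0; i < t->nr; i++) {
        struct region e = t->r[i];
        uint64_t e_end = e.start + e.size;

        if (start < e.start || end > e_end) {
            table_add(&out, e.start, e.size, e.type);
            continue;
        }
        if (start > e.start)
            table_add(&out, e.start, start - e.start, e.type);
        table_add(&out, start, size, type);
        if (end < e_end)
            table_add(&out, end, e_end - end, e.type);
    }
    *t = out;
}

/* Simplified stand-in for 'mem=': drop or truncate RAM above 'limit'. */
static void table_limit_ram(struct table *t, uint64_t limit)
{
    struct table out = { 0 };

    for (int i = 0; i < t->nr; i++) {
        struct region e = t->r[i];

        if (e.type == REGION_RAM) {
            if (e.start >= limit)
                continue;
            if (e.start + e.size > limit)
                e.size = limit - e.start;
        }
        table_add(&out, e.start, e.size, e.type);
    }
    *t = out;
}

/* Sum of the sizes of all RAM entries. */
static uint64_t table_ram_bytes(const struct table *t)
{
    uint64_t sum = 0;

    for (int i = 0; i < t->nr; i++)
        if (t->r[i].type == REGION_RAM)
            sum += t->r[i].size;
    return sum;
}

/* Build the three tables, copying at the stage each consumer needs. */
static void demo_three_tables(struct table *firmware, struct table *kexec,
                              struct table *real)
{
    struct table work = { 0 };

    /* Made-up layout from the bootloader: RAM from 1 MiB to 1 GiB. */
    table_add(&work, 0x100000ULL, 0x40000000ULL - 0x100000ULL, REGION_RAM);
    *firmware = work;   /* untouched copy, e.g. for the MD5 check */

    /* Made-up setup_data range, reserved so nobody reuses it as RAM. */
    table_retype(&work, 0x20000000ULL, 0x1000ULL, REGION_RESERVED_KERN);
    *kexec = work;      /* has the reservation, but no 'mem=' limit */

    table_limit_ram(&work, 0x30000000ULL);  /* e.g. mem=768M */
    *real = work;       /* what the running kernel actually uses */
}
```

Calling demo_three_tables() leaves the firmware table with its single RAM
entry, the kexec table with that entry split around the reserved range, and
the real table additionally trimmed at the 'mem=' limit.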