Re: [PATCH v2 00/14] Split crash out from kexec and clean up related config items

2024-02-20 Thread Hari Bathini

Hi Baoquan,

On 04/02/24 8:56 am, Baoquan He wrote:

Hope Hari and Pingfan can help have a look, see if
it's doable. Now, I make it either have both kexec and crash enabled, or
disable both of them altogether.


Sure. I will take a closer look...

Thanks a lot. Please feel free to post patches to make that happen, or I can
do it with your support and suggestions.


I tested your changes and, on top of them, came up with the changes below to
get things working for powerpc:


https://lore.kernel.org/all/20240213113150.1148276-1-hbath...@linux.ibm.com/

Please take a look.

Thanks
Hari

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv7 07/16] x86/mm: Return correct level from lookup_address() if pte is none

2024-02-20 Thread Baoquan He
On 02/20/24 at 02:36pm, Kirill A. Shutemov wrote:
> On Tue, Feb 20, 2024 at 06:25:43PM +0800, Baoquan He wrote:
> > > I am not sure what part of the comment you see doesn't reflect the
> > > behaviour. From my PoV, the changed code matches the comment closer than
> > > the original.
> > 
> > Oh, I didn't make it clear. I mean update the code comment for
> > lookup_address(), and add code comment for lookup_address_in_pgd() to
> > mention the level thing. Maybe it's a chance to do that.
> > 
> > ===1>
> > /*
> >  * Lookup the page table entry for a virtual address. Return a pointer
> >  * to the entry and the level of the mapping.
> >  *
> >  * Note: We return pud and pmd either when the entry is marked large
> >~~~ seems we return p4d too
> >  * or when the present bit is not set. Otherwise we would return a
> >  * pointer to a nonexisting mapping.
> >   ~~~ NULL?
> >  */  
> > pte_t *lookup_address(unsigned long address, unsigned int *level)
> > {
> > return lookup_address_in_pgd(pgd_offset_k(address), address, level);
> > }
> > EXPORT_SYMBOL_GPL(lookup_address);
> > ===
> > 
> > ===2>
> > /*
> >  * Lookup the page table entry for a virtual address in a specific pgd.
> >  * Return a pointer to the entry and the level of the mapping.
> >~~ also could return NULL if it's none entry. And do we need to
> > mention the level thing?
> >  */
> > pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
> >  unsigned int *level)
> > ...
> > }
> > 
> 
> What about this fixup:

Some nitpicks.
> 
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 3612e3167147..425ff6e192e6 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -657,7 +657,8 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
>  
>  /*
>   * Lookup the page table entry for a virtual address in a specific pgd.
> - * Return a pointer to the entry and the level of the mapping.
> + * Return a pointer to the entry and the level of the mapping (or NULL if
> + * the entry is none) and level of the entry.
   ^ this right parenthesis may need to be moved to the end.


===
 * Return a pointer to the entry and the level of the mapping (or NULL if
 * the entry is none and level of the entry).
===
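
For what it's worth, one possible merged wording that keeps the parenthesis
balanced and avoids saying "level" twice (an editorial suggestion, not from
either patch):

===
/*
 * Lookup the page table entry for a virtual address in a specific pgd.
 * Return a pointer to the entry (or NULL if the entry is none) and the
 * level of the entry.
 */
===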

>   */
>  pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>unsigned int *level)
> @@ -704,9 +705,8 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>   * Lookup the page table entry for a virtual address. Return a pointer
>   * to the entry and the level of the mapping.
>   *
> - * Note: We return pud and pmd either when the entry is marked large
> - * or when the present bit is not set. Otherwise we would return a
> - * pointer to a nonexisting mapping.
> + * Note: the function returns p4d, pud and pmd either when the entry is marked
                                          ^ s/and/or/
> + * large or when the present bit is not set. Otherwise it returns NULL.
>   */
>  pte_t *lookup_address(unsigned long address, unsigned int *level)
>  {
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
> 




Re: [PATCH v5 0/8] ima: kexec: measure events between kexec load and execute

2024-02-20 Thread Mimi Zohar
On Wed, 2024-02-14 at 07:38 -0800, Tushar Sugandhi wrote:
> The current Kernel behavior is IMA measurements snapshot is taken at
> kexec 'load' and not at kexec 'execute'.  IMA log is then carried
> over to the new Kernel after kexec 'execute'.

'Kernel' should not be capitalized since it isn't a proper name.  'Linux' would
be capitalized (e.g. The Linux kernel).

-> "The IMA measurement list is copied at kexec 'load', not kexec 'execute',
before being carried over to the new kexec'ed kernel."

Mimi

> 
> New events can be measured during/after the IMA log snapshot at kexec 
> 'load' and before the system boots to the new Kernel.  In this scenario,
> the TPM PCRs are extended with these events, but they are not carried
> over to the new Kernel after kexec soft reboot since the snapshot is
> already taken.  This results in mismatch between TPM PCR quotes and the
> actual IMA measurements list after kexec soft reboot, which in turn
> results in remote attestation failure.
> 
> To solve this problem - 
>  - allocate the necessary buffer at kexec 'load' time,
>  - populate the buffer with the IMA measurements at kexec 'execute' time, 
>  - and measure two new IMA events 'kexec_load' and 'kexec_execute' as
>critical data to help detect missing events after kexec soft reboot.
> 
> The solution details include:
>  - refactoring the existing code to allocate a buffer to hold IMA
>measurements at kexec 'load', and dump the measurements at kexec
>'execute'
> 
>  - IMA functionality to suspend and resume measurements as needed during
>buffer copy at kexec 'execute',
> 
>  - kexec functionality for mapping the segments from the current Kernel
>to the subsequent one, 
> 
>  - necessary changes to the kexec_file_load syscall, enabling it to call
>the ima functions,
> 
>  - registering a reboot notifier which gets called during kexec 
>'execute',
> 
>  - introducing a new Kconfig option to configure the extra memory to be
>allocated for passing IMA log from the current Kernel to the next,
>
>  - introducing two new events to be measured by IMA during kexec, to
>help diagnose if the IMA log was copied fully or partially, from the
>current Kernel to the next,
> 
>  - excluding IMA segment while calculating and storing digest in function
>kexec_calculate_store_digests(), since IMA segment can be modified
>after the digest is computed during kexec 'load'.  This will ensure
>that the segment is not added to the 'purgatory_sha_regions', and thus
>not verified by verify_sha256_digest().
> 
> The changes proposed in this series ensure the integrity of the IMA
> measurements is preserved across kexec soft reboots, thus significantly
> improving the security of the Kernel post kexec soft reboots.
> 
> There were previous attempts to fix this issue [1], [2], [3].  But they
> were not merged into the mainline Kernel.
> 
> We took inspiration from the past work [1] and [2] while working on this
> patch series.
> 
> V4 of this series is available here[6] for reference.
> 





Re: [PATCH v5 1/8] ima: define and call ima_alloc_kexec_file_buf

2024-02-20 Thread Mimi Zohar
On Wed, 2024-02-14 at 07:38 -0800, Tushar Sugandhi wrote:
> Carrying the IMA measurement list across kexec requires allocating a
> buffer and copying the measurement records.  Separate allocating the
> buffer and copying the measurement records into separate functions in
> order to allocate the buffer at kexec 'load' and copy the measurements
> at kexec 'execute'.
> 
> This patch includes the following changes:
>  - Refactor ima_dump_measurement_list() to move the memory allocation
>to a separate function ima_alloc_kexec_file_buf() which allocates
>buffer of size 'kexec_segment_size' at kexec 'load'.
>  - Make the local variable ima_kexec_file in ima_dump_measurement_list()
>as local static to the file, so that it can be accessed from 
>ima_alloc_kexec_file_buf().
>  - Make necessary changes to the function ima_add_kexec_buffer() to call
>the above two functions.
> 
> Suggested-by: Stefan Berger 
> Signed-off-by: Tushar Sugandhi 
> ---
>  security/integrity/ima/ima_kexec.c | 107 +
>  1 file changed, 78 insertions(+), 29 deletions(-)
> 
> diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
> index dadc1d138118..a9cb5e882e2e 100644
> --- a/security/integrity/ima/ima_kexec.c
> +++ b/security/integrity/ima/ima_kexec.c
> @@ -15,62 +15,98 @@
>  #include "ima.h"
>  
>  #ifdef CONFIG_IMA_KEXEC
> +static struct seq_file ima_kexec_file;
> +
> +static void ima_reset_kexec_file(struct seq_file *sf)
> +{
> + sf->buf = NULL;
> + sf->size = 0;
> + sf->read_pos = 0;
> + sf->count = 0;
> +}
> +
> +static void ima_free_kexec_file_buf(struct seq_file *sf)
> +{
> + vfree(sf->buf);
> + ima_reset_kexec_file(sf);
> +}
> +
> +static int ima_alloc_kexec_file_buf(size_t segment_size)
> +{
> + /*
> +  * kexec 'load' may be called multiple times.
> +  * Free and realloc the buffer only if the segment_size is
> +  * changed from the previous kexec 'load' call.
> +  */
> + if (ima_kexec_file.buf &&
> + ima_kexec_file.size == segment_size &&
> + ima_kexec_file.read_pos == 0 &&
> + ima_kexec_file.count == sizeof(struct ima_kexec_hdr))
> + return 0;

ima_kexec_file.size will be initialized to zero.  No need to test anything other
than the segment size.

> +
> + ima_free_kexec_file_buf(&ima_kexec_file);

The first time ima_free_kexec_file_buf() is called the memory has not been
allocated.  Test that the memory has been allocated before freeing it.  Since
there's no other ima_free_kexec_file_buf() caller, there's no need to clear the
other fields as they will be set below.
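
Taken together, the two review comments above would reduce the function to
something like this untested sketch (an illustration, not a patch):

===
static int ima_alloc_kexec_file_buf(size_t segment_size)
{
	/*
	 * kexec 'load' may be called multiple times. Reuse the buffer
	 * when the segment size is unchanged; free it only when it was
	 * actually allocated before.
	 */
	if (ima_kexec_file.buf) {
		if (ima_kexec_file.size == segment_size)
			return 0;
		ima_free_kexec_file_buf(&ima_kexec_file);
	}

	ima_kexec_file.buf = vmalloc(segment_size);
	if (!ima_kexec_file.buf)
		return -ENOMEM;

	ima_kexec_file.size = segment_size;
	ima_kexec_file.count = sizeof(struct ima_kexec_hdr);	/* reserved space */

	return 0;
}
===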

> +
> + /* segment size can't change between kexec load and execute */
> + ima_kexec_file.buf = vmalloc(segment_size);
> + if (!ima_kexec_file.buf)
> + return -ENOMEM;
> +
> + ima_kexec_file.size = segment_size;
> + ima_kexec_file.read_pos = 0;

static variables are initialized to zero.  No need to set it here.

> + ima_kexec_file.count = sizeof(struct ima_kexec_hdr);	/* reserved space */
> +
> + return 0;
> +}
> +
>  static int ima_dump_measurement_list(unsigned long *buffer_size, void **buffer,
>unsigned long segment_size)
>  {
>   struct ima_queue_entry *qe;
> - struct seq_file file;
>   struct ima_kexec_hdr khdr;
> - int ret = 0;
>  
> - /* segment size can't change between kexec load and execute */
> - file.buf = vmalloc(segment_size);
> - if (!file.buf) {
> - ret = -ENOMEM;
> - goto out;
> + if (!ima_kexec_file.buf) {
> + *buffer_size = 0;
> + *buffer = NULL;
> + pr_err("%s: Kexec file buf not allocated\n", __func__);
> + return -EINVAL;
>   }
>  
> - file.size = segment_size;
> - file.read_pos = 0;
> - file.count = sizeof(khdr);  /* reserved space */
> -
>   memset(&khdr, 0, sizeof(khdr));
>   khdr.version = 1;
> +
> + /* Copy as many IMA measurements list records as possible */
>   list_for_each_entry_rcu(qe, &ima_measurements, later) {
> - if (file.count < file.size) {
> + if (ima_kexec_file.count < ima_kexec_file.size) {

With an extra half page memory allocated, there should be enough memory for a
few extra records as a result of the kexec load.  Subsequent patches are going
to change this assumption.  This test doesn't ensure there is enough memory for
the entire measurement record.  Call get_binary_runtime_size() to get the record
size to ensure there is enough memory for the entire record.
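
An untested sketch of that suggestion, assuming get_binary_runtime_size() is
made callable from ima_kexec.c:

===
	list_for_each_entry_rcu(qe, &ima_measurements, later) {
		/* emit a record only if the whole binary record fits */
		if (ima_kexec_file.count + get_binary_runtime_size(qe->entry) <
		    ima_kexec_file.size) {
			khdr.count++;
			ima_measurements_show(&ima_kexec_file, qe);
		} else {
			break;
		}
	}
===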

>   khdr.count++;
> - ima_measurements_show(&file, qe);
> + ima_measurements_show(&ima_kexec_file, qe);
>   } else {
> - ret = -EINVAL;
> + pr_err("%s: IMA log file is too big for Kexec buf\n",
> +__func__);

This message and the one above 

Re: [PATCH 1/2] x86/mm: Do not zap PMD entry mapping unaccepted memory table during kdump.

2024-02-20 Thread Kalra, Ashish

Hi Kirill,

On 2/20/2024 6:42 AM, Kirill A. Shutemov wrote:

On Tue, Feb 20, 2024 at 01:18:29AM +, Ashish Kalra wrote:

From: Ashish Kalra 

During crashkernel boot only pre-allocated crash memory is presented as
E820_TYPE_RAM. This can cause PMD entry mapping unaccepted memory table
to be zapped during phys_pmd_init() as SNP/TDX guest use E820_TYPE_ACPI
to store the unaccepted memory table and pass it between the kernels on
kexec/kdump.

E820_TYPE_ACPI covers not only ACPI data, but also EFI tables and might
be required by kernel to function properly.

The problem was discovered during debugging kdump for SNP guest. The
unaccepted memory table stored with E820_TYPE_ACPI and passed between
the kernels on kdump was getting zapped as the PMD entry mapping this
is above the E820_TYPE_RAM range for the reserved crashkernel memory.

Signed-off-by: Ashish Kalra 
---
  arch/x86/mm/init_64.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a0dffaca6d2b..207c6e0c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -524,7 +524,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
!e820__mapped_any(paddr & PMD_MASK, paddr_next,
 E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PMD_MASK, paddr_next,
-E820_TYPE_RESERVED_KERN))
+E820_TYPE_RESERVED_KERN) &&
+   !e820__mapped_any(paddr & PMD_MASK, paddr_next,
+E820_TYPE_ACPI))
set_pmd_init(pmd, __pmd(0), init);
continue;

Why do you single out phys_pmd_init()? I think it has to be addressed for
all page table levels as we do for E820_TYPE_RAM and E820_TYPE_RESERVED_KERN.


I believe I only discovered the issue with PMD entries (phys_pmd_init()) 
because of the crashkernel reserved memory size and the E820_TYPE_ACPI 
physical memory range mapping on my test system, but you are right, this 
fix needs to be done for all page table levels and I will also add the 
fix in phys_pte_init(), phys_pud_init() and phys_p4d_init().
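
For reference, an untested sketch of what the PUD-level counterpart in 
phys_pud_init() would look like, mirroring the PMD hunk above (the PTE and 
P4D levels would follow the same pattern):

===
	if (!after_bootmem &&
	    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
			      E820_TYPE_RAM) &&
	    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
			      E820_TYPE_RESERVED_KERN) &&
	    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
			      E820_TYPE_ACPI))
		set_pud_init(pud, __pud(0), init);
===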


Thanks, Ashish




Re: [PATCH linux-next 1/3] kexec/kdump: make struct crash_mem available without CONFIG_CRASH_DUMP

2024-02-20 Thread Baoquan He
On 02/13/24 at 05:01pm, Hari Bathini wrote:
> struct crash_mem defined under include/linux/crash_core.h represents
> a list of memory ranges. While it is used to represent memory ranges

From its name, it's not only representing memory ranges, it's
representing crash memory ranges. Apart from this, the whole series looks
good to me. Thanks for the effort.

> for kdump kernel, it can also be used for other kind of memory ranges.
> In fact, KEXEC_FILE_LOAD syscall in powerpc uses this structure to
> represent reserved memory ranges and exclude memory ranges needed to
> find the right memory regions to load kexec kernel. So, make the
> definition of crash_mem structure available for !CONFIG_CRASH_DUMP
> case too.
> 
> Signed-off-by: Hari Bathini 
> ---
>  include/linux/crash_core.h | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index 23270b16e1db..d33352c2e386 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -8,6 +8,12 @@
>  
>  struct kimage;
>  
> +struct crash_mem {
> + unsigned int max_nr_ranges;
> + unsigned int nr_ranges;
> + struct range ranges[] __counted_by(max_nr_ranges);
> +};
> +
>  #ifdef CONFIG_CRASH_DUMP
>  
>  int crash_shrink_memory(unsigned long new_size);
> @@ -51,12 +57,6 @@ static inline unsigned int crash_get_elfcorehdr_size(void) { return 0; }
>  /* Alignment required for elf header segment */
>  #define ELF_CORE_HEADER_ALIGN   4096
>  
> -struct crash_mem {
> - unsigned int max_nr_ranges;
> - unsigned int nr_ranges;
> - struct range ranges[] __counted_by(max_nr_ranges);
> -};
> -
>  extern int crash_exclude_mem_range(struct crash_mem *mem,
>  unsigned long long mstart,
>  unsigned long long mend);
> -- 
> 2.43.0
> 




Re: [PATCH 1/2] x86/mm: Do not zap PMD entry mapping unaccepted memory table during kdump.

2024-02-20 Thread Kirill A. Shutemov
On Tue, Feb 20, 2024 at 01:18:29AM +, Ashish Kalra wrote:
> From: Ashish Kalra 
> 
> During crashkernel boot only pre-allocated crash memory is presented as
> E820_TYPE_RAM. This can cause PMD entry mapping unaccepted memory table
> to be zapped during phys_pmd_init() as SNP/TDX guest use E820_TYPE_ACPI
> to store the unaccepted memory table and pass it between the kernels on
> kexec/kdump.
> 
> E820_TYPE_ACPI covers not only ACPI data, but also EFI tables and might
> be required by kernel to function properly.
> 
> The problem was discovered during debugging kdump for SNP guest. The
> unaccepted memory table stored with E820_TYPE_ACPI and passed between
> the kernels on kdump was getting zapped as the PMD entry mapping this
> is above the E820_TYPE_RAM range for the reserved crashkernel memory.
> 
> Signed-off-by: Ashish Kalra 
> ---
>  arch/x86/mm/init_64.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index a0dffaca6d2b..207c6e0c 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -524,7 +524,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
>   !e820__mapped_any(paddr & PMD_MASK, paddr_next,
>E820_TYPE_RAM) &&
>   !e820__mapped_any(paddr & PMD_MASK, paddr_next,
> -  E820_TYPE_RESERVED_KERN))
> +  E820_TYPE_RESERVED_KERN) &&
> + !e820__mapped_any(paddr & PMD_MASK, paddr_next,
> +  E820_TYPE_ACPI))
>   set_pmd_init(pmd, __pmd(0), init);
>   continue;

Why do you single out phys_pmd_init()? I think it has to be addressed for
all page table levels as we do for E820_TYPE_RAM and E820_TYPE_RESERVED_KERN.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCHv7 07/16] x86/mm: Return correct level from lookup_address() if pte is none

2024-02-20 Thread Kirill A. Shutemov
On Tue, Feb 20, 2024 at 06:25:43PM +0800, Baoquan He wrote:
> > I am not sure what part of the comment you see doesn't reflect the
> > behaviour. From my PoV, the changed code matches the comment closer than
> > the original.
> 
> Oh, I didn't make it clear. I mean update the code comment for
> lookup_address(), and add code comment for lookup_address_in_pgd() to
> mention the level thing. Maybe it's a chance to do that.
> 
> ===1>
> /*
>  * Lookup the page table entry for a virtual address. Return a pointer
>  * to the entry and the level of the mapping.
>  *
>  * Note: We return pud and pmd either when the entry is marked large
>~~~ seems we return p4d too
>  * or when the present bit is not set. Otherwise we would return a
>  * pointer to a nonexisting mapping.
>   ~~~ NULL?
>  */  
> pte_t *lookup_address(unsigned long address, unsigned int *level)
> {
> return lookup_address_in_pgd(pgd_offset_k(address), address, level);
> }
> EXPORT_SYMBOL_GPL(lookup_address);
> ===
> 
> ===2>
> /*
>  * Lookup the page table entry for a virtual address in a specific pgd.
>  * Return a pointer to the entry and the level of the mapping.
>~~ also could return NULL if it's none entry. And do we need to
> mention the level thing?
>  */
> pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>  unsigned int *level)
> ...
> }
> 

What about this fixup:

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 3612e3167147..425ff6e192e6 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -657,7 +657,8 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
 
 /*
  * Lookup the page table entry for a virtual address in a specific pgd.
- * Return a pointer to the entry and the level of the mapping.
+ * Return a pointer to the entry and the level of the mapping (or NULL if
+ * the entry is none) and level of the entry.
  */
 pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 unsigned int *level)
@@ -704,9 +705,8 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
  * Lookup the page table entry for a virtual address. Return a pointer
  * to the entry and the level of the mapping.
  *
- * Note: We return pud and pmd either when the entry is marked large
- * or when the present bit is not set. Otherwise we would return a
- * pointer to a nonexisting mapping.
+ * Note: the function returns p4d, pud and pmd either when the entry is marked
+ * large or when the present bit is not set. Otherwise it returns NULL.
  */
 pte_t *lookup_address(unsigned long address, unsigned int *level)
 {
-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v3 09/17] x86: Add KHO support

2024-02-20 Thread Mike Rapoport
Hi Alex,

On Wed, Jan 17, 2024 at 02:46:56PM +, Alexander Graf wrote:
> We now have all bits in place to support KHO kexecs. This patch adds
> awareness of KHO in the kexec file as well as boot path for x86 and
> adds the respective kconfig option to the architecture so that it can
> use KHO successfully.
> 
> In addition, it enlightens the decompression code with KHO so that its
> KASLR location finder only considers memory regions that are not already
> occupied by KHO memory.
> 
> Signed-off-by: Alexander Graf 
> 
> ---
> 
> v1 -> v2:
> 
>   - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
>   - s/kho_reserve_mem/kho_reserve_previous_mem/g
>   - s/kho_reserve/kho_reserve_scratch/g
> ---
>  arch/x86/Kconfig  |  3 ++
>  arch/x86/boot/compressed/kaslr.c  | 55 +++
>  arch/x86/include/uapi/asm/bootparam.h | 15 +++-
>  arch/x86/kernel/e820.c|  9 +
>  arch/x86/kernel/kexec-bzimage64.c | 39 +++
>  arch/x86/kernel/setup.c   | 46 ++
>  arch/x86/mm/init_32.c |  7 
>  arch/x86/mm/init_64.c |  7 
>  8 files changed, 180 insertions(+), 1 deletion(-)

...

> @@ -987,8 +1013,26 @@ void __init setup_arch(char **cmdline_p)
>   cleanup_highmap();
>  
>   memblock_set_current_limit(ISA_END_ADDRESS);
> +
>   e820__memblock_setup();
>  
> + /*
> +  * We can resize memblocks at this point, let's dump all KHO
> +  * reservations in and switch from scratch-only to normal allocations
> +  */
> + kho_reserve_previous_mem();
> +
> + /* Allocations now skip scratch mem, return low 1M to the pool */
> + if (is_kho_boot()) {
> + u64 i;
> + phys_addr_t base, end;
> +
> + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
> +  MEMBLOCK_SCRATCH, &base, &end, NULL)
> + if (end <= ISA_END_ADDRESS)
> + memblock_clear_scratch(base, end - base);
> + }

You had to mark lower 16M as MEMBLOCK_SCRATCH because at this point the
mapping of the physical memory is not ready yet and page tables only cover
lower 16M and the memory mapped in kexec::init_pgtable(). Hence the call
for memblock_set_current_limit(ISA_END_ADDRESS) slightly above, which
essentially makes scratch mem reserved by KHO unusable for allocations.

I'd suggest to move kho_reserve_previous_mem() earlier, probably even right
next to kho_populate().
kho_populate() already does memblock_add(scratch) and at that point it's
the only physical memory that memblock knows of, so if it'll have to
allocate, the allocations will end up there.

Also, there are no kernel allocations before e820__memblock_setup(), so the
only memory that might need to be allocated is for memblock_double_array()
and that will be discarded later anyway.

With this, it seems that MEMBLOCK_SCRATCH is not needed, as the scratch
memory is anyway the only usable memory up to e820__memblock_setup().

>   /*
>* Needs to run after memblock setup because it needs the physical
>* memory size.
> @@ -1104,6 +1148,8 @@ void __init setup_arch(char **cmdline_p)
>*/
>   arch_reserve_crashkernel();
>  
> + kho_reserve_scratch();
> +
>   memblock_find_dma_reserve();
>  
>   if (!early_xdbc_setup_hardware())
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index b63403d7179d..6c3810afed04 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -738,6 +739,12 @@ void __init mem_init(void)
>   after_bootmem = 1;
>   x86_init.hyper.init_after_bootmem();
>  
> + /*
> +  * Now that all KHO pages are marked as reserved, let's flip them back
> +  * to normal pages with accurate refcount.
> +  */
> + kho_populate_refcount();

This should go to mm_core_init(), there's nothing architecture specific
there.

> +
>   /*
>* Check boundaries twice: Some fundamental inconsistencies can
>* be detected at build time already.

-- 
Sincerely yours,
Mike.



Re: [PATCHv7 07/16] x86/mm: Return correct level from lookup_address() if pte is none

2024-02-20 Thread Baoquan He
On 02/19/24 at 03:52pm, Kirill A. Shutemov wrote:
> On Mon, Feb 19, 2024 at 01:12:32PM +0800, Baoquan He wrote:
> > On 02/12/24 at 12:44pm, Kirill A. Shutemov wrote:
> > > lookup_address() only returns correct page table level for the entry if
> > > the entry is not none.
> > > 
> > > Make the helper to always return correct 'level'. It allows to implement
> > > iterator over kernel page tables using lookup_address().
> > > 
> > > Add one more entry into enum pg_level to indicate size of VA covered by
> > > one PGD entry in 5-level paging mode.
> > > 
> > > Signed-off-by: Kirill A. Shutemov 
> > > Reviewed-by: Rick Edgecombe 
> > > ---
> > >  arch/x86/include/asm/pgtable_types.h | 1 +
> > >  arch/x86/mm/pat/set_memory.c | 8 
> > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/pgtable_types.h 
> > > b/arch/x86/include/asm/pgtable_types.h
> > > index 0b748ee16b3d..3f648ffdfbe5 100644
> > > --- a/arch/x86/include/asm/pgtable_types.h
> > > +++ b/arch/x86/include/asm/pgtable_types.h
> > > @@ -548,6 +548,7 @@ enum pg_level {
> > >   PG_LEVEL_2M,
> > >   PG_LEVEL_1G,
> > >   PG_LEVEL_512G,
> > > + PG_LEVEL_256T,
> > >   PG_LEVEL_NUM
> > >  };
> > >  
> > > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > > index f92da8c9a86d..3612e3167147 100644
> > > --- a/arch/x86/mm/pat/set_memory.c
> > > +++ b/arch/x86/mm/pat/set_memory.c
> > > @@ -666,32 +666,32 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned 
> > > long address,
> > 
> > LGTM,
> > 
> > Reviewed-by: Baoquan He 
> > 
> > By the way, we may need to update the code comments above the functions
> > lookup_address_in_pgd() and lookup_address() since they don't
> > reflect their latest behaviour.
> 
> I am not sure what part of the comment you see doesn't reflect the
> behaviour. From my PoV, the changed code matches the comment closer than
> the original.

Oh, I didn't make it clear. I mean update the code comment for
lookup_address(), and add code comment for lookup_address_in_pgd() to
mention the level thing. Maybe it's a chance to do that.

===1>
/*
 * Lookup the page table entry for a virtual address. Return a pointer
 * to the entry and the level of the mapping.
 *
 * Note: We return pud and pmd either when the entry is marked large
   ~~~ seems we return p4d too
 * or when the present bit is not set. Otherwise we would return a
 * pointer to a nonexisting mapping.
  ~~~ NULL?
 */  
pte_t *lookup_address(unsigned long address, unsigned int *level)
{
return lookup_address_in_pgd(pgd_offset_k(address), address, level);
}
EXPORT_SYMBOL_GPL(lookup_address);
===

===2>
/*
 * Lookup the page table entry for a virtual address in a specific pgd.
 * Return a pointer to the entry and the level of the mapping.
   ~~ also could return NULL if it's none entry. And do we need to
mention the level thing?
 */
pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 unsigned int *level)
...
}

