Re: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

2024-01-08 Thread Gowans, James
+ akpm

Hi Eric and Andrew,
Just checking in on this patch.
I'd be keen to get the fix merged if you're okay with it, or to hear any
feedback.

Also still keen for input from the driver maintainers on CC on whether
they support, or have objections to, their shutdown hooks being invoked
on kexec.

JG

On Mon, 2023-12-18 at 14:41 +0200, James Gowans wrote:
> Hi Eric,
> 
> On Wed, 2023-12-13 at 10:39 -0600, Eric W. Biederman wrote:
> > 
> > James Gowans  writes:
> > 
> > > syscore_shutdown() runs driver and module callbacks to get the system
> > > into a state where it can be correctly shut down. In commit
> > > 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
> > > syscore_shutdown() was removed from kernel_restart_prepare() and hence
> > > got (incorrectly?) removed from the kexec flow. This was innocuous until
> > > commit 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
> > > changed the way that KVM registered its shutdown callbacks, switching from
> > > reboot notifiers to syscore_ops.shutdown. As syscore_shutdown() is
> > > missing from kexec, KVM's shutdown hook is not run and virtualisation is
> > > left enabled on the boot CPU, which results in triple faults when
> > > switching to the new kernel on Intel x86 VT-x with VMXE enabled.
> > > 
> > > Fix this by adding syscore_shutdown() to the kexec sequence. In terms of
> > > where to add it, it is being added after migrating the kexec task to the
> > > boot CPU, but before APs are shut down. It is not totally clear if this
> > > is the best place: in commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
> > > it is stated that "syscore_ops operations should be carried with one
> > > CPU on-line and interrupts disabled." APs are only offlined later in
> > > machine_shutdown(), so this syscore_shutdown() is being run while APs
> > > are still online. This seems to be the correct place as it matches where
> > > syscore_shutdown() is run in the reboot and halt flows - they also run
> > > it before APs are shut down. The assumption is that the commit message
> > > in commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
> > > is no longer valid.
> > > 
> > > KVM has been discussed here as it is what broke loudly by not having
> > > syscore_shutdown() in kexec, but this change impacts more than just KVM;
> > > all drivers/modules which register a syscore_ops.shutdown callback will
> > > now be invoked in the kexec flow. Looking at some of them, like x86 MCE,
> > > it is probably more correct to also shut these down during kexec.
> > > Maintainers of all drivers which use syscore_ops.shutdown are added on
> > > CC for visibility. They are:
> > > 
> > > arch/powerpc/platforms/cell/spu_base.c  .shutdown = spu_shutdown,
> > > arch/x86/kernel/cpu/mce/core.c           .shutdown = mce_syscore_shutdown,
> > > arch/x86/kernel/i8259.c                  .shutdown = i8259A_shutdown,
> > > drivers/irqchip/irq-i8259.c              .shutdown = i8259A_shutdown,
> > > drivers/irqchip/irq-sun6i-r.c            .shutdown = sun6i_r_intc_shutdown,
> > > drivers/leds/trigger/ledtrig-cpu.c       .shutdown = ledtrig_cpu_syscore_shutdown,
> > > drivers/power/reset/sc27xx-poweroff.c    .shutdown = sc27xx_poweroff_shutdown,
> > > kernel/irq/generic-chip.c                .shutdown = irq_gc_shutdown,
> > > virt/kvm/kvm_main.c                      .shutdown = kvm_shutdown,
> > > 
> > > This has been tested by doing a kexec on x86_64 and aarch64.
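For context, a driver opts into this path by registering a struct syscore_ops
with a .shutdown callback. A minimal sketch of the registration pattern
(illustrative "mydrv" names, not KVM's actual code):

        #include <linux/syscore_ops.h>

        /* Quiesce hardware so the next kernel starts from a clean state. */
        static void mydrv_syscore_shutdown(void)
        {
                /* e.g. disable DMA, mask interrupts, leave VMX operation */
        }

        static struct syscore_ops mydrv_syscore_ops = {
                .shutdown = mydrv_syscore_shutdown,
        };

        static int __init mydrv_init(void)
        {
                register_syscore_ops(&mydrv_syscore_ops);
                return 0;
        }
        late_initcall(mydrv_init);

With syscore_shutdown() added to kernel_kexec() as below, every such
.shutdown hook now runs on the reboot CPU before the jump into the new
kernel.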
> > 
> > From the 10,000 foot perspective:
> > Acked-by: "Eric W. Biederman" 
> 
> Thanks for the ACK!
> What's the next step to get this into the kexec tree?
> 
> JG
> 
> > 
> > 
> > Eric
> > 
> > > Fixes: 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
> > > 
> > > Signed-off-by: James Gowans 
> > > Cc: Eric Biederman 
> > > Cc: Paolo Bonzini 
> > > Cc: Sean Christopherson 
> > > Cc: Marc Zyngier 
> > > Cc: Arnd Bergmann 
> > > Cc: Tony Luck 
> > > Cc: Borislav Petkov 
> > > Cc: Thomas Gleixner 
> > > Cc: Ingo Molnar 
> > > Cc: Chen-Yu Tsai 
> > > Cc: Jernej Skrabec 
> > > Cc: Samuel Holland 
> > > Cc: Pavel Machek 
> > > Cc: Sebastian Reichel 
> > > Cc: Orson Zhai 
> > > Cc: Alexander Graf 
> > > Cc: Jan H. Schoenherr 
> > > ---
> > >  kernel/kexec_core.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > > index be5642a4ec49..b926c4db8a91 100644
> > > --- a/kernel/kexec_core.c
> > > +++ b/kernel/kexec_core.c
> > > @@ -1254,6 +1254,7 @@ int kernel_kexec(void)
> > >   kexec_in_progress = true;
> > >   kernel_restart_prepare("kexec reboot");
> > >   migrate_to_reboot_cpu();
> > > + syscore_shutdown();
> > > 
> > >   /*
> > >  * migrate_to_reboot_cpu() disables CPU hotplug assuming that
> 


Re: [PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2024-01-08 Thread Hari Bathini



On 09/01/24 9:57 am, Hari Bathini wrote:

Hi Michael,



Sorry, Michael.
I am just about getting back to work and I spoke too soon.
You already seem to have posted a set with the approach you had in mind:

  https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=388350

Thanks
Hari


I am fine with either approach. I was trying to address your concerns
in my way. Looking for your inputs on how to go about this now.

On 29/11/23 7:00 am, Pingfan Liu wrote:

Hi Hari,


On Mon, Nov 27, 2023 at 12:30 PM Hari Bathini wrote:


Hi Pingfan, Michael,

On 17/10/23 4:03 pm, Hari Bathini wrote:



On 17/10/23 7:58 am, Pingfan Liu wrote:

*** Idea ***
For kexec -p, the boot cpu may not be cpu0, and this causes a problem
when allocating memory for paca_ptrs[]. However, in theory, there is no
requirement to assign a cpu's logical id according to its position in
the device tree. But helpers such as cpu_first_thread_sibling() make
assumptions about the mapping inside a core. Hence the mapping is only
partially loosened, i.e. the mapping of cores is unbound while the
mapping inside a core is kept.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders cpus by changing the list head. As a result, there is a rotate
shift between the sequence number in the device tree and the cpu
logical number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into
the range [0, threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org


Thanks for working on this, Pingfan.
Looks good to me.

Acked-by: Hari Bathini 



On second thoughts, we are probably better off with no impact for the
bootcpu < nr_cpu_ids case, changing only two cores' logical numbering
otherwise. Something like the below (please share your thoughts):



I am afraid that it may not be as ideal as it looks, considering the
following factors:
-1. For the case of 'bootcpu < nr_cpu_ids', a crash can happen on any
cpu in the system, which seriously undermines the protection intended
here (under the most optimistic scenario, there is a 50% chance of
success).

-2. For the re-ordering of logical numbering, IMHO, if there is concern
that re-ordering will break something, partial re-ordering cannot avoid
that either.  We ought to spot the probable hazards so as to ease those
worries.


Thanks,

Pingfan


diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index ec82f5bda908..78a8312aa8c4 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -357,6 +359,25 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 				    fdt_boot_cpuid_phys(initial_boot_params)) {
 					found = boot_cpu_count;
 					found_thread = i;
+					/*
+					 * Map the boot-cpu logical id into the range
+					 * of [0, threads_per_core) if it can't be
+					 * accommodated within nr_cpu_ids.
+					 */
+					if (i != boot_cpu_count && boot_cpu_count >= nr_cpu_ids) {
+						boot_cpuid = i;
+						DBG("Logical CPU number for boot CPU changed from %d to %d\n",
+						    boot_cpu_count, i);
+					} else {
+						boot_cpuid = boot_cpu_count;
+					}
+
+					/* Ensure the boot thread is accounted for in nr_cpu_ids */
+					if (boot_cpuid >= nr_cpu_ids) {
+						set_nr_cpu_ids(boot_cpuid + 1);
+						DBG("Adjusted nr_cpu_ids to %u, to include boot CPU.\n",
+						    nr_cpu_ids);
+					}
 				}
 #ifdef CONFIG_SMP
 			/* logical cpu id is always 0 on UP kernels */
@@ -368,9 +389,8 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	if (found < 0)
 		return 0;
 
-	DBG("boot cpu: logical %d physical %d\n", found,
+	DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
 	    be32_to_cpu(intserv[found_thread]));
-	boot_cpuid = found;
 
 	boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);

diff --git a/arch

Re: [PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2024-01-08 Thread Hari Bathini

Hi Michael,

I am fine with either approach. I was trying to address your concerns
in my way. Looking for your inputs on how to go about this now.

On 29/11/23 7:00 am, Pingfan Liu wrote:

Hi Hari,


On Mon, Nov 27, 2023 at 12:30 PM Hari Bathini  wrote:


Hi Pingfan, Michael,

On 17/10/23 4:03 pm, Hari Bathini wrote:



On 17/10/23 7:58 am, Pingfan Liu wrote:

*** Idea ***
For kexec -p, the boot cpu may not be cpu0, and this causes a problem
when allocating memory for paca_ptrs[]. However, in theory, there is no
requirement to assign a cpu's logical id according to its position in
the device tree. But helpers such as cpu_first_thread_sibling() make
assumptions about the mapping inside a core. Hence the mapping is only
partially loosened, i.e. the mapping of cores is unbound while the
mapping inside a core is kept.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders cpus by changing the list head. As a result, there is a rotate
shift between the sequence number in the device tree and the cpu
logical number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into
the range [0, threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.
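To illustrate the rotate shift described above, a hedged, self-contained
sketch (the helper name is hypothetical, not the patch's actual code):

        /*
         * Rotate device-tree sequence numbers so that the boot core's
         * threads land on logical ids [0, threads_per_core). Assumes the
         * device tree lists the threads of a core contiguously.
         */
        static int dt_seq_to_logical(int dt_seq, int boot_seq,
                                     int nr_cpus, int threads_per_core)
        {
                int boot_core_first = boot_seq - (boot_seq % threads_per_core);

                return (dt_seq - boot_core_first + nr_cpus) % nr_cpus;
        }

For boot_seq = 10, threads_per_core = 8 and nr_cpus = 32, the boot cpu maps
to logical id 2, safely inside [0, 8), while all other cpus keep their
relative order.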

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: kexec@lists.infradead.org
To: linuxppc-...@lists.ozlabs.org


Thanks for working on this, Pingfan.
Looks good to me.

Acked-by: Hari Bathini 



On second thoughts, we are probably better off with no impact for the
bootcpu < nr_cpu_ids case, changing only two cores' logical numbering
otherwise. Something like the below (please share your thoughts):



I am afraid that it may not be as ideal as it looks, considering the
following factors:
-1. For the case of 'bootcpu < nr_cpu_ids', a crash can happen on any
cpu in the system, which seriously undermines the protection intended
here (under the most optimistic scenario, there is a 50% chance of
success).

-2. For the re-ordering of logical numbering, IMHO, if there is concern
that re-ordering will break something, partial re-ordering cannot avoid
that either.  We ought to spot the probable hazards so as to ease those
worries.


Thanks,

Pingfan


diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index ec82f5bda908..78a8312aa8c4 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -357,6 +359,25 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 				    fdt_boot_cpuid_phys(initial_boot_params)) {
 					found = boot_cpu_count;
 					found_thread = i;
+					/*
+					 * Map the boot-cpu logical id into the range
+					 * of [0, threads_per_core) if it can't be
+					 * accommodated within nr_cpu_ids.
+					 */
+					if (i != boot_cpu_count && boot_cpu_count >= nr_cpu_ids) {
+						boot_cpuid = i;
+						DBG("Logical CPU number for boot CPU changed from %d to %d\n",
+						    boot_cpu_count, i);
+					} else {
+						boot_cpuid = boot_cpu_count;
+					}
+
+					/* Ensure the boot thread is accounted for in nr_cpu_ids */
+					if (boot_cpuid >= nr_cpu_ids) {
+						set_nr_cpu_ids(boot_cpuid + 1);
+						DBG("Adjusted nr_cpu_ids to %u, to include boot CPU.\n",
+						    nr_cpu_ids);
+					}
 				}
 #ifdef CONFIG_SMP
 			/* logical cpu id is always 0 on UP kernels */
@@ -368,9 +389,8 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	if (found < 0)
 		return 0;
 
-	DBG("boot cpu: logical %d physical %d\n", found,
+	DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
 	    be32_to_cpu(intserv[found_thread]));
-	boot_cpuid = found;
 
 	boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index b7b733474b60..f7179525c774 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -409,6 +409,12 @@ static void __init cpu_init_thread_core_maps(int tpc)

   u32 *cpu_to_phys_id = NULL;

Re: [PATCH 5/5] crash: clean up CRASH_DUMP

2024-01-08 Thread Baoquan He
On 01/07/24 at 09:19pm, kernel test robot wrote:
> Hi Baoquan,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v6.7-rc8]
> [cannot apply to powerpc/next powerpc/fixes tip/x86/core arm64/for-next/core next-20240105]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url: https://github.com/intel-lab-lkp/linux/commits/Baoquan-He/kexec_core-move-kdump-related-codes-from-crash_core-c-to-kexec_core-c/20240105-223735
> base: linus/master
> patch link: https://lore.kernel.org/r/20240105103305.557273-6-bhe%40redhat.com
> patch subject: [PATCH 5/5] crash: clean up CRASH_DUMP
> :: branch date: 2 days ago
> :: commit date: 2 days ago
> config: x86_64-randconfig-122-20240106 (https://download.01.org/0day-ci/archive/20240107/202401071326.52yn9ftd-...@intel.com/config)
> compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240107/202401071326.52yn9ftd-...@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: https://lore.kernel.org/r/202401071326.52yn9ftd-...@intel.com/

Thanks for reporting.

I have reproduced these linking errors and will consider how to rearrange
the code changes to fix them. Splitting kdump out could be more
complicated than I thought.
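For reference, one common way such undefined-symbol link errors (the symbols
quoted below) get resolved when carving code out behind a config option is
to provide inline stubs for the disabled case. A hedged sketch of the idiom,
not necessarily the fix chosen for this series:

        /* in a shared header, e.g. include/linux/crash_core.h */
        #ifdef CONFIG_CRASH_DUMP
        void crash_save_vmcoreinfo(void);
        void crash_save_cpu(struct pt_regs *regs, int cpu);
        #else
        static inline void crash_save_vmcoreinfo(void) { }
        static inline void crash_save_cpu(struct pt_regs *regs, int cpu) { }
        #endif

Variables such as crashk_res cannot be stubbed this way; their references
typically need IS_ENABLED(CONFIG_CRASH_DUMP) guards at the call sites
instead.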

> 
> All errors (new ones prefixed by >>):
> 
> >> ld.lld: error: undefined symbol: crashk_res
> >>> referenced by initramfs.c:638 (init/initramfs.c:638)
> >>>   init/initramfs.o:(kexec_free_initrd) in archive vmlinux.a
> >>> referenced by initramfs.c:638 (init/initramfs.c:638)
> >>>   init/initramfs.o:(kexec_free_initrd) in archive vmlinux.a
> >>> referenced by initramfs.c:0 (init/initramfs.c:0)
> >>>   init/initramfs.o:(kexec_free_initrd) in archive vmlinux.a
> >>> referenced 77 more times
> --
> >> ld.lld: error: undefined symbol: parse_crashkernel
> >>> referenced by setup.c:479 (arch/x86/kernel/setup.c:479)
> >>>   arch/x86/kernel/setup.o:(arch_reserve_crashkernel) in archive vmlinux.a
> --
> >> ld.lld: error: undefined symbol: crashk_low_res
> >>> referenced by machine_kexec_64.c:539 (arch/x86/kernel/machine_kexec_64.c:539)
> >>>   arch/x86/kernel/machine_kexec_64.o:(kexec_mark_crashkres) in archive vmlinux.a
> >>> referenced by machine_kexec_64.c:539 (arch/x86/kernel/machine_kexec_64.c:539)
> >>>   arch/x86/kernel/machine_kexec_64.o:(kexec_mark_crashkres) in archive vmlinux.a
> >>> referenced by machine_kexec_64.c:539 (arch/x86/kernel/machine_kexec_64.c:539)
> >>>   arch/x86/kernel/machine_kexec_64.o:(kexec_mark_crashkres) in archive vmlinux.a
> >>> referenced 36 more times
> --
> >> ld.lld: error: undefined symbol: crash_update_vmcoreinfo_safecopy
> >>> referenced by kexec_core.c:522 (kernel/kexec_core.c:522)
> >>>   kernel/kexec_core.o:(kimage_crash_copy_vmcoreinfo) in archive vmlinux.a
> >>> referenced by kexec_core.c:610 (kernel/kexec_core.c:610)
> >>>   kernel/kexec_core.o:(kimage_free) in archive vmlinux.a
> --
> >> ld.lld: error: undefined symbol: crash_save_vmcoreinfo
> >>> referenced by kexec_core.c:1053 (kernel/kexec_core.c:1053)
> >>>   kernel/kexec_core.o:(__crash_kexec) in archive vmlinux.a
> --
> >> ld.lld: error: undefined symbol: paddr_vmcoreinfo_note
> >>> referenced by kexec_core.c:1148 (kernel/kexec_core.c:1148)
> >>>   kernel/kexec_core.o:(crash_prepare_elf64_headers) in archive vmlinux.a
> --
> >> ld.lld: error: undefined symbol: append_elf_note
> >>> referenced by kexec_core.c:1390 (kernel/kexec_core.c:1390)
> >>>   kernel/kexec_core.o:(crash_save_cpu) in archive vmlinux.a
> --
> >> ld.lld: error: undefined symbol: final_note
> >>> referenced by kexec_core.c:1392 (kernel/kexec_core.c:1392)
> >>>   kernel/kexec_core.o:(crash_save_cpu) in archive vmlinux.a
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
> 




Re: [PATCH v4] x86/kexec: fix potential cmem->ranges out of bounds

2024-01-08 Thread Baoquan He
On 01/08/24 at 09:06pm, fuqiang wang wrote:
> In memmap_exclude_ranges(), the elfheader will be excluded from crashk_res.
> In the current x86 architecture code, the elfheader is always allocated
> at crashk_res.start, so it seems there won't be a new split range. But
> that depends on where the elfheader is placed within crashk_res. To avoid
> a potential out-of-bounds access in the future, add an extra slot.
> 
> A similar issue exists in fill_up_crash_elf_data(). The range to be
> excluded is [0, 1M]; its start (0) is special and will not appear in the
> middle of an existing cmem->ranges[] entry. But in case the low 1M is
> changed in the future, add an extra slot there too.
> 
> Previously discussed links:
> [1] https://lore.kernel.org/kexec/ZXk2oBf%2FT1Ul6o0c@MiWiFi-R3L-srv/
> [2] https://lore.kernel.org/kexec/273284e8-7680-4f5f-8065-c5d780987...@easystack.cn/
> [3] https://lore.kernel.org/kexec/ZYQ6O%2F57sHAPxTHm@MiWiFi-R3L-srv/
> 
> Signed-off-by: fuqiang wang 
> ---
>  arch/x86/kernel/crash.c | 21 +++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index b6b044356f1b..d21592ad8952 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -149,8 +149,18 @@ static struct crash_mem *fill_up_crash_elf_data(void)
> 	/*
> 	 * Exclusion of the crash region and/or crashk_low_res may cause
> 	 * another range split, so add two extra slots here.
> +	 *
> +	 * Exclusion of the low 1M may not cause another range split, because
> +	 * the excluded range is [0, 1M] and a new region is only split off
> +	 * when both the start and end parameters fall strictly inside an
> +	 * existing region in cmem, i.e. equal to neither its start nor its
> +	 * end. Obviously, the start of [0, 1M] cannot meet this condition.
> +	 *
> +	 * But in case the low 1M is changed in the future (e.g. to
> +	 * [start, 1M]), add an extra slot anyway.
> 	 */
> -	nr_ranges += 2;
> +	nr_ranges += 3;
>   cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
>   if (!cmem)
>   return NULL;
> @@ -282,9 +292,16 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>   struct crash_memmap_data cmd;
>   struct crash_mem *cmem;
>  
> - cmem = vzalloc(struct_size(cmem, ranges, 1));
> +	/*
> +	 * In the current x86 architecture code, the elfheader is always
> +	 * allocated at crashk_res.start, but that depends on where the
> +	 * elfheader is placed within crashk_res. To avoid a potential
> +	 * out-of-bounds access in the future, add an extra slot.
> +	 */
> + cmem = vzalloc(struct_size(cmem, ranges, 2));
>   if (!cmem)
>   return -ENOMEM;
> + cmem->max_nr_ranges = 2;

LGTM, thx

Acked-by: Baoquan He 

>  
>   memset(&cmd, 0, sizeof(struct crash_memmap_data));
>   cmd.params = params;
> -- 
> 2.42.0
> 




Re: [PATCHv5.1 14/16] x86/smp: Add smp_ops.stop_this_cpu() callback

2024-01-08 Thread Huang, Kai


> > > @@ -835,6 +835,13 @@ void __noreturn stop_this_cpu(void *dummy)
> > >*/
> > >   cpumask_clear_cpu(cpu, &cpus_stop_mask);
> > >  
> > > +#ifdef CONFIG_SMP
> > > + if (smp_ops.stop_this_cpu) {
> > > + smp_ops.stop_this_cpu();
> > > + unreachable();
> > > + }
> > > +#endif
> > 
> > If I read correctly this will result in stop_this_cpu() having different
> > behaviour for SMP and !SMP builds for a TDX guest.  For example, AFAICT
> > machine_halt() also calls stop_this_cpu() on the local cpu after it stops
> > other cpus.  So for the local cpu, an SMP build will call into the BIOS's
> > reset vector but a !SMP build will call native_halt().
> 
> It doesn't make a difference in practice: both halting and giving the CPU to
> the BIOS are unrecoverable operations. Both are equally acceptable for
> machine_halt().
> 

OK fair enough. :-)


Re: [RFC V2] IMA Log Snapshotting Design Proposal

2024-01-08 Thread Paul Moore
On Mon, Jan 8, 2024 at 6:48 AM Mimi Zohar  wrote:
> On Sun, 2024-01-07 at 21:58 -0500, Paul Moore wrote:
> > On Sun, Jan 7, 2024 at 7:59 AM Mimi Zohar  wrote:
> > > On Sat, 2024-01-06 at 18:27 -0500, Paul Moore wrote:
> > > > On Tue, Nov 28, 2023 at 9:07 PM Mimi Zohar  wrote:
> > > > > On Tue, 2023-11-28 at 20:06 -0500, Paul Moore wrote:
> > > > > > On Tue, Nov 28, 2023 at 7:09 AM Mimi Zohar  
> > > > > > wrote:
> > > > > > > On Mon, 2023-11-27 at 17:16 -0500, Paul Moore wrote:
> > > > > > > > On Mon, Nov 27, 2023 at 12:08 PM Mimi Zohar 
> > > > > > > >  wrote:
> > > > > > > > > On Wed, 2023-11-22 at 09:22 -0500, Paul Moore wrote:
> > > >
> > > > ...
> > > >
> > > > > > > > > Before defining a new critical-data record, we need to decide 
> > > > > > > > > whether
> > > > > > > > > it is really necessary or if it is redundant.  If we define a 
> > > > > > > > > new
> > > > > > > > > "critical-data" record, can it be defined such that it 
> > > > > > > > > doesn't require
> > > > > > > > > pausing extending the measurement list?  For example, a new 
> > > > > > > > > simple
> > > > > > > > > visual critical-data record could contain the number of 
> > > > > > > > > records (e.g.
> > > > > > > > > /ima/runtime_measurements_count) up to that point.
> > > > > > > >
> > > > > > > > What if the snapshot_aggregate was a hash of the measurement log
> > > > > > > > starting with either the boot_aggregate or the latest
> > > > > > > > snapshot_aggregate and ending on the record before the new
> > > > > > > > snapshot_aggregate?  The performance impact at snapshot time 
> > > > > > > > should be
> > > > > > > > minimal as the hash can be incrementally updated as new records 
> > > > > > > > are
> > > > > > > > added to the measurement list.  While the hash wouldn't capture 
> > > > > > > > the
> > > > > > > > TPM state, it would allow some crude verification when 
> > > > > > > > reassembling
> > > > > > > > the log.  If one could bear the cost of a TPM signing 
> > > > > > > > operation, the
> > > > > > > > log digest could be signed by the TPM.
> > > > > > >
> > > > > > > Other critical data is calculated, before calling
> > > > > > > ima_measure_critical_data(), which adds the record to the 
> > > > > > > measurement
> > > > > > > list and extends the TPM PCR.
> > > > > > >
> > > > > > > Signing the hash shouldn't be an issue if it behaves like other
> > > > > > > critical data.
> > > > > > >
> > > > > > > In addition to the hash, consider including other information in 
> > > > > > > the
> > > > > > > new critical data record (e.g. total number of measurement 
> > > > > > > records, the
> > > > > > > number of measurements included in the hash, the number of times 
> > > > > > > the
> > > > > > > measurement list was trimmed, etc).
> > > > > >
> > > > > > It would be nice if you could provide an explicit list of what you
> > > > > > would want hashed into a snapshot_aggregate record; the above is
> > > > > > close, but it is still a little hand-wavy.  I'm just trying to 
> > > > > > reduce
> > > > > > the back-n-forth :)
> > > > >
> > > > > What is being defined here is the first IMA critical-data record, 
> > > > > which
> > > > > really requires some thought.
> > > >
> > > > My thinking has always been that taking a hash of the current
> > > > measurement log up to the snapshot point would be a nice
> > > > snapshot_aggregate measurement, but I'm not heavily invested in that.
> > > > To me it is more important that we find something we can all agree on,
> > > > perhaps reluctantly, so we can move forward with a solution.
> > > >
> > > > > For ease of review, this new critical-
> > > > > data record should be a separate patch set from trimming the
> > > > > measurement list.
> > > >
> > > > I see the two as linked, but if you prefer them as separate then so be
> > > > it.  Once again, the important part is to move forward with a
> > > > solution, I'm not overly bothered if it arrives in multiple pieces
> > > > instead of one.
> > >
> > > Trimming the IMA measurement list could be used in conjunction with the
> > > new IMA critical data record or independently.  Both options should be
> > > supported.
> > >
> > > 1. trim N number of records from the head of the in kernel IMA measurement list
> > > 2. intermittently include the new IMA critical data record based on some trigger
> > > 3. trim the measurement list up to the (first/last/Nth) IMA critical data record
> > >
> > > Since the two features could be used independently of each other, there is no
> > > reason to upstream them as a single patch set.  It just makes it harder to
> > > review.
> >
> > I don't see much point in recording a snapshot aggregate if you aren't
> > doing a snapshot, but it's not harmful in any way, so sure, go for it.
> > Like I said earlier, as long as the functionality is there, I don't
> > think anyone cares too much how it gets into the kernel (although
> > Tushar and Sush should comment from the perspective).
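The incrementally updated log digest discussed in this exchange could be
built on the kernel's crypto shash API. A hedged illustration (function and
variable names invented for this note, not IMA's actual code):

        #include <crypto/hash.h>

        static struct crypto_shash *tfm;  /* e.g. crypto_alloc_shash("sha256", 0, 0) */
        static struct shash_desc *desc;   /* allocated once and crypto_shash_init()ed */

        /* Called as each new measurement record is appended to the list. */
        static int snapshot_digest_update(const u8 *record, unsigned int len)
        {
                return crypto_shash_update(desc, record, len);
        }

        /* Called at snapshot time to emit the snapshot_aggregate digest. */
        static int snapshot_digest_final(u8 *out)
        {
                return crypto_shash_final(desc, out);
        }

This keeps the per-snapshot cost small, as the discussion suggests: the
expensive hashing work is amortised over normal list appends.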

Re: [PATCHv5 15/16] x86/mm: Introduce kernel_ident_mapping_free()

2024-01-08 Thread kirill.shute...@linux.intel.com
On Mon, Jan 08, 2024 at 01:13:18PM +, Huang, Kai wrote:
> On Mon, 2024-01-08 at 13:17 +0300, kirill.shute...@linux.intel.com wrote:
> > On Mon, Jan 08, 2024 at 03:30:21AM +, Huang, Kai wrote:
> > > On Mon, 2024-01-08 at 03:13 +, Huang, Kai wrote:
> > > > On Sat, 2023-12-23 at 02:52 +0300, Kirill A. Shutemov wrote:
> > > > > The helper complements kernel_ident_mapping_init(): it frees the
> > > > > identity mapping that was previously allocated. It will be used in the
> > > > > error path to free a partially allocated mapping or if the mapping is
> > > > > no longer needed.
> > > > > 
> > > > > The caller provides a struct x86_mapping_info with the free_pgt_page()
> > > > > callback hooked up and the pgd_t to free.
> > > > > 
> > > > > Signed-off-by: Kirill A. Shutemov 
> > > > > ---
> > > > >  arch/x86/include/asm/init.h |  3 ++
> > > > >  arch/x86/mm/ident_map.c     | 73 +
> > > > >  2 files changed, 76 insertions(+)
> > > > > 
> > > > > diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
> > > > > index cc9ccf61b6bd..14d72727d7ee 100644
> > > > > --- a/arch/x86/include/asm/init.h
> > > > > +++ b/arch/x86/include/asm/init.h
> > > > > @@ -6,6 +6,7 @@
> > > > >  
> > > > >  struct x86_mapping_info {
> > > > >  	void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
> > > > > +	void (*free_pgt_page)(void *, void *); /* free buf for page table */
> > > > >  	void *context;			 /* context for alloc_pgt_page */
> > > > >  	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
> > > > >  	unsigned long offset;		 /* ident mapping offset */
> > > > > @@ -16,4 +17,6 @@ struct x86_mapping_info {
> > > > >  int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
> > > > >  			      unsigned long pstart, unsigned long pend);
> > > > >  
> > > > > +void kernel_ident_mapping_free(struct x86_mapping_info *info, pgd_t *pgd);
> > > > 
> > > > Maybe a range-based free function could provide more flexibility (e.g., you
> > > > could call it directly to clean up inside kernel_ident_mapping_init() when
> > > > something goes wrong), but I guess this is sufficient for the current use
> > > > case (and perhaps the majority of use cases).
> > > > 
> > > > Reviewed-by: Kai Huang 
> > > 
> > > Another argument for a range-based free function is that, theoretically, you
> > > can build the identity mapping table using different x86_mapping_info on
> > > different ranges, so it makes less sense to use one 'struct x86_mapping_info
> > > *info' to free the entire page table, albeit in this implementation only the
> > > 'free_pgt_page()' callback is used.
> > 
> > The interface can be changed if the need for such behaviour arises.
> > This kind of future-proofing is rarely helpful.
> > 
> 
> Do you want to just pass the 'free_pgt_page' function pointer to
> kernel_ident_mapping_free(), instead of 'struct x86_mapping_info *info'?  As
> mentioned above, conceptually the page table can be built from multiple
> x86_mapping_info structures for multiple ranges.

I don't think we have such cases in the kernel. Let's not overcomplicate
things. I see value in keeping the interface symmetric.

We can always change things according to needs.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
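For readers following the thread: a caller's callback pair for struct
x86_mapping_info might look like the sketch below, assuming page-table pages
come straight from the page allocator (callers such as kexec supply their
own allocators; the names here are illustrative):

        static void *alloc_pgt_page(void *context)
        {
                /* one zeroed page per page-table level */
                return (void *)get_zeroed_page(GFP_KERNEL);
        }

        static void free_pgt_page(void *pgt, void *context)
        {
                free_page((unsigned long)pgt);
        }

        static struct x86_mapping_info info = {
                .alloc_pgt_page = alloc_pgt_page,
                .free_pgt_page  = free_pgt_page,
                .page_flag      = __PAGE_KERNEL_LARGE_EXEC,
        };

kernel_ident_mapping_free(&info, pgd) then walks the table and returns every
page-table page through the .free_pgt_page callback.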



Re: [PATCHv5 15/16] x86/mm: Introduce kernel_ident_mapping_free()

2024-01-08 Thread Huang, Kai
On Mon, 2024-01-08 at 13:17 +0300, kirill.shute...@linux.intel.com wrote:
> On Mon, Jan 08, 2024 at 03:30:21AM +, Huang, Kai wrote:
> > On Mon, 2024-01-08 at 03:13 +, Huang, Kai wrote:
> > > On Sat, 2023-12-23 at 02:52 +0300, Kirill A. Shutemov wrote:
> > > > The helper complements kernel_ident_mapping_init(): it frees the
> > > > identity mapping that was previously allocated. It will be used in the
> > > > error path to free a partially allocated mapping or if the mapping is no
> > > > longer needed.
> > > > 
> > > > The caller provides a struct x86_mapping_info with the free_pgt_page()
> > > > callback hooked up and the pgd_t to free.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov 
> > > > ---
> > > >  arch/x86/include/asm/init.h |  3 ++
> > > >  arch/x86/mm/ident_map.c | 73 +
> > > >  2 files changed, 76 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
> > > > index cc9ccf61b6bd..14d72727d7ee 100644
> > > > --- a/arch/x86/include/asm/init.h
> > > > +++ b/arch/x86/include/asm/init.h
> > > > @@ -6,6 +6,7 @@
> > > >  
> > > >  struct x86_mapping_info {
> > > > 	void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
> > > > +	void (*free_pgt_page)(void *, void *); /* free buf for page table */
> > > > 	void *context;			 /* context for alloc_pgt_page */
> > > > 	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
> > > > 	unsigned long offset;		 /* ident mapping offset */
> > > > @@ -16,4 +17,6 @@ struct x86_mapping_info {
> > > >  int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
> > > >  			      unsigned long pstart, unsigned long pend);
> > > >  
> > > > +void kernel_ident_mapping_free(struct x86_mapping_info *info, pgd_t *pgd);
> > > 
> > > Maybe a range-based free function could provide more flexibility (e.g., you
> > > could call it directly to clean up inside kernel_ident_mapping_init() when
> > > something goes wrong), but I guess this is sufficient for the current use
> > > case (and perhaps the majority of use cases).
> > > 
> > > Reviewed-by: Kai Huang 
> > > 
> > 
> > Another argument for a range-based free function is that, theoretically, you
> > can build the identity mapping table using different x86_mapping_info on
> > different ranges, so it makes less sense to use one 'struct x86_mapping_info
> > *info' to free the entire page table, albeit in this implementation only the
> > 'free_pgt_page()' callback is used.
> 
> The interface can be changed if the need for such behaviour arises.
> This kind of future-proofing is rarely helpful.
> 

Do you want to just pass the 'free_pgt_page' function pointer to
kernel_ident_mapping_free(), instead of 'struct x86_mapping_info *info'?  As
mentioned above, conceptually the page table can be built from multiple
x86_mapping_info structures for multiple ranges.


[PATCH v4] x86/kexec: fix potential cmem->ranges out of bounds

2024-01-08 Thread fuqiang wang
In memmap_exclude_ranges(), the elfheader will be excluded from crashk_res.
In the current x86 architecture code, the elfheader is always allocated
at crashk_res.start, so it seems there won't be a new split range. But
that depends on where the elfheader is placed within crashk_res. To avoid
a potential out-of-bounds access in the future, add an extra slot.

A similar issue exists in fill_up_crash_elf_data(). The range to be
excluded is [0, 1M]; its start (0) is special and will not appear in the
middle of an existing cmem->ranges[] entry. But in case the low 1M is
changed in the future, add an extra slot there too.

Previously discussed links:
[1] https://lore.kernel.org/kexec/ZXk2oBf%2FT1Ul6o0c@MiWiFi-R3L-srv/
[2] https://lore.kernel.org/kexec/273284e8-7680-4f5f-8065-c5d780987...@easystack.cn/
[3] https://lore.kernel.org/kexec/ZYQ6O%2F57sHAPxTHm@MiWiFi-R3L-srv/

Signed-off-by: fuqiang wang 
---
 arch/x86/kernel/crash.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index b6b044356f1b..d21592ad8952 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -149,8 +149,18 @@ static struct crash_mem *fill_up_crash_elf_data(void)
	/*
	 * Exclusion of the crash region and/or crashk_low_res may cause
	 * another range split, so add two extra slots here.
+	 *
+	 * Exclusion of the low 1M may not cause another range split, because
+	 * the excluded range is [0, 1M] and a new region is only split off
+	 * when both the start and end parameters fall strictly inside an
+	 * existing region in cmem, i.e. equal to neither its start nor its
+	 * end. Obviously, the start of [0, 1M] cannot meet this condition.
+	 *
+	 * But in case the low 1M is changed in the future (e.g. to
+	 * [start, 1M]), add an extra slot anyway.
	 */
-   nr_ranges += 2;
+   nr_ranges += 3;
cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
if (!cmem)
return NULL;
@@ -282,9 +292,16 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
struct crash_memmap_data cmd;
struct crash_mem *cmem;
 
-   cmem = vzalloc(struct_size(cmem, ranges, 1));
+	/*
+	 * In the current x86 architecture code, the elfheader is always
+	 * allocated at crashk_res.start, but that depends on where the
+	 * elfheader is placed within crashk_res. To avoid a potential
+	 * out-of-bounds access in the future, add an extra slot.
+	 */
+   cmem = vzalloc(struct_size(cmem, ranges, 2));
if (!cmem)
return -ENOMEM;
+   cmem->max_nr_ranges = 2;
 
memset(&cmd, 0, sizeof(struct crash_memmap_data));
cmd.params = params;
-- 
2.42.0
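The splitting rule the comments above rely on can be shown in a standalone
sketch, simplified from the logic of crash_exclude_mem_range() (illustrative
helper, not the kernel code):

        /*
         * Excluding [start, end] from an existing range [rs, re] yields two
         * ranges only when the excluded span sits strictly inside it:
         * rs < start && end < re. For [0, 1M], start == 0 can never sit
         * strictly inside a range, which is why no split occurs today.
         */
        static int exclude_splits_range(unsigned long long rs, unsigned long long re,
                                        unsigned long long start, unsigned long long end)
        {
                return rs < start && end < re;
        }

For example, exclude_splits_range(0, 640ULL << 20, 0, (1 << 20) - 1) returns
0, while excluding an elfheader allocated somewhere in the middle of
crashk_res would return 1 and consume the extra slot added above.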




Re: [RFC V2] IMA Log Snapshotting Design Proposal

2024-01-08 Thread Mimi Zohar
On Sun, 2024-01-07 at 21:58 -0500, Paul Moore wrote:
> On Sun, Jan 7, 2024 at 7:59 AM Mimi Zohar  wrote:
> > On Sat, 2024-01-06 at 18:27 -0500, Paul Moore wrote:
> > > On Tue, Nov 28, 2023 at 9:07 PM Mimi Zohar  wrote:
> > > > On Tue, 2023-11-28 at 20:06 -0500, Paul Moore wrote:
> > > > > On Tue, Nov 28, 2023 at 7:09 AM Mimi Zohar  
> > > > > wrote:
> > > > > > On Mon, 2023-11-27 at 17:16 -0500, Paul Moore wrote:
> > > > > > > On Mon, Nov 27, 2023 at 12:08 PM Mimi Zohar  
> > > > > > > wrote:
> > > > > > > > On Wed, 2023-11-22 at 09:22 -0500, Paul Moore wrote:
> > >
> > > ...
> > >
> > > > > > > > Before defining a new critical-data record, we need to decide 
> > > > > > > > whether
> > > > > > > > it is really necessary or if it is redundant.  If we define a 
> > > > > > > > new
> > > > > > > > "critical-data" record, can it be defined such that it doesn't 
> > > > > > > > require
> > > > > > > > pausing extending the measurement list?  For example, a new 
> > > > > > > > simple
> > > > > > > > visual critical-data record could contain the number of records 
> > > > > > > > (e.g.
> > > > > > > > /ima/runtime_measurements_count) up to that point.
> > > > > > >
> > > > > > > What if the snapshot_aggregate was a hash of the measurement log
> > > > > > > starting with either the boot_aggregate or the latest
> > > > > > > snapshot_aggregate and ending on the record before the new
> > > > > > > snapshot_aggregate?  The performance impact at snapshot time 
> > > > > > > should be
> > > > > > > minimal as the hash can be incrementally updated as new records 
> > > > > > > are
> > > > > > > added to the measurement list.  While the hash wouldn't capture 
> > > > > > > the
> > > > > > > TPM state, it would allow some crude verification when 
> > > > > > > reassembling
> > > > > > > the log.  If one could bear the cost of a TPM signing operation, 
> > > > > > > the
> > > > > > > log digest could be signed by the TPM.
> > > > > >
> > > > > > Other critical data is calculated, before calling
> > > > > > ima_measure_critical_data(), which adds the record to the 
> > > > > > measurement
> > > > > > list and extends the TPM PCR.
> > > > > >
> > > > > > Signing the hash shouldn't be an issue if it behaves like other
> > > > > > critical data.
> > > > > >
> > > > > > In addition to the hash, consider including other information in the
> > > > > > new critical data record (e.g. total number of measurement records, 
> > > > > > the
> > > > > > number of measurements included in the hash, the number of times the
> > > > > > measurement list was trimmed, etc).
> > > > >
> > > > > It would be nice if you could provide an explicit list of what you
> > > > > would want hashed into a snapshot_aggregate record; the above is
> > > > > close, but it is still a little hand-wavy.  I'm just trying to reduce
> > > > > the back-n-forth :)
> > > >
> > > > What is being defined here is the first IMA critical-data record, which
> > > > really requires some thought.
> > >
> > > My thinking has always been that taking a hash of the current
> > > measurement log up to the snapshot point would be a nice
> > > snapshot_aggregate measurement, but I'm not heavily invested in that.
> > > To me it is more important that we find something we can all agree on,
> > > perhaps reluctantly, so we can move forward with a solution.
> > >
> > > > For ease of review, this new critical-
> > > > data record should be a separate patch set from trimming the
> > > > measurement list.
> > >
> > > I see the two as linked, but if you prefer them as separate then so be
> > > it.  Once again, the important part is to move forward with a
> > > solution, I'm not overly bothered if it arrives in multiple pieces
> > > instead of one.
> >
> > Trimming the IMA measurement list could be used in conjunction with the new 
> > IMA
> > critical data record or independently.  Both options should be supported.
> >
> > 1. trim N number of records from the head of the in kernel IMA measurement 
> > list
> > 2. intermittently include the new IMA critical data record based on some 
> > trigger
> > 3. trim the measurement list up to the (first/last/Nth) IMA critical data 
> > record
> >
> > Since the two features could be used independently of each other, there is 
> > no
> > reason to upstream them as a single patch set.  It just makes it harder to
> > review.
> 
> I don't see much point in recording a snapshot aggregate if you aren't
> doing a snapshot, but it's not harmful in any way, so sure, go for it.
> Like I said earlier, as long as the functionality is there, I don't
> think anyone cares too much how it gets into the kernel (although
> Tushar and Sush should comment from the perspective).

Paul, there are two features: 
- trimming the measurement list
- defining and including an IMA critical data record

The original design doc combined these two features, making them an "atomic"
operation, and referred to it as a snapshot.  At the time the term "snapshot"
was an appropriate term 

Re: [PATCHv5 15/16] x86/mm: Introduce kernel_ident_mapping_free()

2024-01-08 Thread kirill.shute...@linux.intel.com
On Mon, Jan 08, 2024 at 03:30:21AM +, Huang, Kai wrote:
> On Mon, 2024-01-08 at 03:13 +, Huang, Kai wrote:
> > On Sat, 2023-12-23 at 02:52 +0300, Kirill A. Shutemov wrote:
> > > The helper complements kernel_ident_mapping_init(): it frees the
> > > identity mapping that was previously allocated. It will be used in the
> > > error path to free a partially allocated mapping or if the mapping is no
> > > longer needed.
> > > 
> > > The caller provides a struct x86_mapping_info with the free_pgt_page()
> > > callback hooked up and the pgd_t to free.
> > > 
> > > Signed-off-by: Kirill A. Shutemov 
> > > ---
> > >  arch/x86/include/asm/init.h |  3 ++
> > >  arch/x86/mm/ident_map.c | 73 +
> > >  2 files changed, 76 insertions(+)
> > > 
> > > diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
> > > index cc9ccf61b6bd..14d72727d7ee 100644
> > > --- a/arch/x86/include/asm/init.h
> > > +++ b/arch/x86/include/asm/init.h
> > > @@ -6,6 +6,7 @@
> > >  
> > >  struct x86_mapping_info {
> > >   void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
> > > + void (*free_pgt_page)(void *, void *); /* free buf for page table */
> > >   void *context;   /* context for alloc_pgt_page */
> > >   unsigned long page_flag; /* page flag for PMD or PUD entry */
> > >   unsigned long offset;/* ident mapping offset */
> > > @@ -16,4 +17,6 @@ struct x86_mapping_info {
> > >  int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
> > >  			      unsigned long pstart, unsigned long pend);
> > >  
> > > +void kernel_ident_mapping_free(struct x86_mapping_info *info, pgd_t *pgd);
> > 
> > Maybe a range-based free function could provide more flexibility (e.g., you
> > could call it directly to clean up inside kernel_ident_mapping_init() when
> > something goes wrong), but I guess this is sufficient for the current use
> > case (and perhaps the majority of use cases).
> > 
> > Reviewed-by: Kai Huang 
> > 
> 
> Another argument for a range-based free function is that, theoretically, you
> can build the identity mapping table using different x86_mapping_info on
> different ranges, so it makes less sense to use one 'struct x86_mapping_info
> *info' to free the entire page table, albeit in this implementation only the
> 'free_pgt_page()' callback is used.

The interface can be changed if the need for such behaviour arises.
This kind of future-proofing is rarely helpful.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCHv5.1 14/16] x86/smp: Add smp_ops.stop_this_cpu() callback

2024-01-08 Thread kirill.shute...@linux.intel.com
On Mon, Jan 08, 2024 at 03:04:31AM +, Huang, Kai wrote:
> On Mon, 2023-12-25 at 11:05 +0300, Kirill A. Shutemov wrote:
> > If the helper is defined, it is called instead of halt() to stop the CPU
> > at the end of stop_this_cpu() and on crash CPU shutdown.
> > 
> > ACPI MADT will use it to hand over the CPU to BIOS in order to be able
> > to wake it up again after kexec.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> > 
> >  v5.1:
> >- Fix build for !SMP;
> > 
> > ---
> >  arch/x86/include/asm/smp.h |  1 +
> >  arch/x86/kernel/process.c  |  7 +++
> >  arch/x86/kernel/reboot.c   | 12 
> >  3 files changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> > index 4fab2ed454f3..390d53fd34f9 100644
> > --- a/arch/x86/include/asm/smp.h
> > +++ b/arch/x86/include/asm/smp.h
> > @@ -38,6 +38,7 @@ struct smp_ops {
> > int (*cpu_disable)(void);
> > void (*cpu_die)(unsigned int cpu);
> > void (*play_dead)(void);
> > +   void (*stop_this_cpu)(void);
> >  
> > void (*send_call_func_ipi)(const struct cpumask *mask);
> > void (*send_call_func_single_ipi)(int cpu);
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index b6f4e8399fca..ea4c812c7bf3 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -835,6 +835,13 @@ void __noreturn stop_this_cpu(void *dummy)
> >  */
> > cpumask_clear_cpu(cpu, &cpus_stop_mask);
> >  
> > +#ifdef CONFIG_SMP
> > +   if (smp_ops.stop_this_cpu) {
> > +   smp_ops.stop_this_cpu();
> > +   unreachable();
> > +   }
> > +#endif
> 
> If I read correctly this will result in stop_this_cpu() having different
> behaviour for SMP and !SMP builds for a TDX guest.  For example, AFAICT
> machine_halt() also calls stop_this_cpu() on the local cpu after it stops
> other cpus.  So for the local cpu, an SMP build will call into the BIOS's
> reset vector but a !SMP build will call native_halt().

It doesn't make a difference in practice: both halting and giving the CPU to
the BIOS are unrecoverable operations. Both are equally acceptable for
machine_halt().

> > +
> > for (;;) {
> > /*
> >  * Use native_halt() so that memory contents don't change
> > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> > index 16dde83df49a..738b3e810196 100644
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -881,10 +881,14 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
> > cpu_emergency_disable_virtualization();
> >  
> > atomic_dec(&waiting_for_crash_ipi);
> > -   /* Assume hlt works */
> > -   halt();
> > -   for (;;)
> > -   cpu_relax();
> > +
> > +   if (smp_ops.stop_this_cpu) {
> > +   smp_ops.stop_this_cpu();
> 
> Could you explain why unreachable() is called in stop_this_cpu() but not here?

The compiler previously complained about stop_this_cpu() when it had halt()
in the 'else' case, because the function is declared __noreturn. I left
unreachable() after reworking it without the 'else' to document the
behaviour.

> > +   } else {
> > +   halt();
> > +   for (;;)
> > +   cpu_relax();
> > +   }
> 
> Similar to stop_this_cpu(), if you also call unreachable() here, then I think
> you can remove the 'else' and directly call the halt() + cpu_relax() loop.

It doesn't make much difference to me. But okay, I will rework it the same
way in the next version.
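To make the __noreturn point above concrete, a minimal sketch of the
constraint (hypothetical names, not the kernel code):

        extern int have_bios_handoff;
        extern void hand_cpu_to_bios(void);  /* never returns, but not annotated */

        void __attribute__((__noreturn__)) stop(void)
        {
                if (have_bios_handoff) {
                        hand_cpu_to_bios();
                        /*
                         * In the kernel, unreachable() expands to
                         * __builtin_unreachable(); it documents to the
                         * compiler and to readers that control never
                         * returns from the handoff above.
                         */
                        __builtin_unreachable();
                }

                for (;;)
                        asm volatile("hlt");
        }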

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
