Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-02-16 Thread David Woodhouse
On Tue, 2021-02-16 at 15:10 +0000, David Woodhouse wrote:
> Actually it breaks before that, in rcu_cpu_starting(). A spinlock
> around that, an atomic_t to let the APs do their TSC sync one at a time
> (both in the above tree now), and I have a 75% saving on CPU bringup
> time for my 28-thread Haswell:

Here's a dual-socket Skylake box (EC2 c5.metal)... before:

[1.449015] smp: Bringing up secondary CPUs ...
[1.449358] x86: Booting SMP configuration:
[1.449578]  node  #0, CPUs:   #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
[1.514986]  node  #1, CPUs:   #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[1.644978]  node  #0, CPUs:   #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71
[1.711970]  node  #1, CPUs:   #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95
[1.781438] Brought CPUs online in 1063391744 cycles
[1.782289] smp: Brought up 2 nodes, 96 CPUs
[1.782515] smpboot: Max logical packages: 2
[1.782738] smpboot: Total of 96 processors activated (576364.89 BogoMIPS)

... and after:

[1.445661] smp: Bringing up secondary CPUs ...
[1.446004] x86: Booting SMP configuration:
[1.446047]  node  #0, CPUs:   #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
[1.451048]  node  #1, CPUs:   #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[1.455046]  node  #0, CPUs:   #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71
[1.462050]  node  #1, CPUs:   #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95
[1.465983] Brought 96 CPUs to x86/cpu:kick in 68802688 cycles
[0.094170] smpboot: CPU 26 Converting physical 1 to logical package 2
[1.468078] Brought 96 CPUs to x86/cpu:wait-init in 5178300 cycles
[1.476112] Brought CPUs online in 23479546 cycles
[1.476298] smp: Brought up 2 nodes, 96 CPUs
[1.476520] smpboot: Max logical packages: 2
[1.476743] smpboot: Total of 96 processors activated (576000.00 BogoMIPS)

So the CPU bringup went from 334ms to 31ms on that one.

I might try that without spewing over 1KiB to the serial port at 115200
baud during the proceedings, and see if it makes a bigger difference :)
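(For scale: at 115200 baud with 8N1 framing the UART moves ~11520
bytes/s, so 1KiB of console output accounts for roughly 1024/11520
≈ 89ms of wall clock all by itself.)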






Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-02-16 Thread David Woodhouse
On Tue, 2021-02-16 at 13:53 +0000, David Woodhouse wrote:
> I threw it into my tree at
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel
>
> It seems to work fairly nicely. The parallel SIPI seems to win about
> a third of the bringup time on my 28-thread Haswell box. This is at the
> penultimate commit of the above branch:
> 
> [0.307590] smp: Bringing up secondary CPUs ...
> [0.307826] x86: Booting SMP configuration:
> [0.307830]  node  #0, CPUs:   #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14
> [0.376677] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
> [0.377177]  #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
> [0.402323] Brought CPUs online in 246691584 cycles
> [0.402323] smp: Brought up 1 node, 28 CPUs
> 
> ... and this is the tip of the branch:
> 
> [0.308332] smp: Bringing up secondary CPUs ... 
> [0.308569] x86: Booting SMP configuration:
> [0.308572]  node  #0, CPUs:   #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
> [0.321120] Brought 28 CPUs to x86/cpu:kick in 34828752 cycles
> [0.33] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
> [0.368749] Brought CPUs online in 124913032 cycles
> [0.368749] smp: Brought up 1 node, 28 CPUs
> [0.368749] smpboot: Max logical packages: 1
> [0.368749] smpboot: Total of 28 processors activated (145259.85 BogoMIPS)
> 
> There's more to be gained here if we can fix up the next stage. Right
> now if I set every CPU's bit in cpu_initialized_mask to allow them to
> proceed from wait_for_master_cpu() through to the end of cpu_init() and
> onwards through start_secondary(), they all end up hitting
> check_tsc_sync_target() in parallel and it goes horridly wrong.

Actually it breaks before that, in rcu_cpu_starting(). A spinlock
around that, an atomic_t to let the APs do their TSC sync one at a time
(both in the above tree now), and I have a 75% saving on CPU bringup
time for my 28-thread Haswell:

[0.307341] smp: Bringing up secondary CPUs ...
[0.307576] x86: Booting SMP configuration:
[0.307579]  node  #0, CPUs:   #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
[0.320100] Brought 28 CPUs to x86/cpu:kick in 34645984 cycles
[0.325032] Brought 28 CPUs to x86/cpu:wait-init in 12865752 cycles
[0.326902] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[0.328739] Brought CPUs online in 11702224 cycles
[0.328739] smp: Brought up 1 node, 28 CPUs
[0.328739] smpboot: Max logical packages: 1
[0.328739] smpboot: Total of 28 processors activated (145261.81 BogoMIPS)
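For the record, the shape of those two changes is roughly this; a
sketch with illustrative names, not the literal commits in the branch:

static DEFINE_RAW_SPINLOCK(rcu_startup_lock);
static atomic_t tsc_sync_cpu = ATOMIC_INIT(-1);

/* AP side, in start_secondary(): APs now arrive concurrently, so
 * serialise the RCU startup notification, then take turns on the
 * TSC sync. */
static void ap_bringup_sync(void)
{
	raw_spin_lock(&rcu_startup_lock);
	rcu_cpu_starting(smp_processor_id());
	raw_spin_unlock(&rcu_startup_lock);

	while (atomic_read(&tsc_sync_cpu) != smp_processor_id())
		cpu_relax();
	check_tsc_sync_target();
}

/* BSP side, once per AP being brought online */
static void bsp_release_ap_for_tsc_sync(int cpu)
{
	atomic_set(&tsc_sync_cpu, cpu);
	check_tsc_sync_source(cpu);
}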





Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-02-16 Thread David Woodhouse
On Fri, 2021-02-12 at 18:30 +0100, Thomas Gleixner wrote:
> On Fri, Feb 12 2021 at 01:29, Thomas Gleixner wrote:
> > On Thu, Feb 11 2021 at 22:58, David Woodhouse wrote:
> > I have no problem with making that jump based. It does not matter at
> > all. But you can't use the idle task stack before the CR3 switch in
> > secondary_startup_64 - unless I'm missing something.
> > 
> > And there's more than 'call verify_cpu' on the way which uses stack. Let
> > me stare more tomorrow with brain awake.
> 
> The below boots on top of mainline. It probably breaks i386 and XEN and
> whatever :)

Nice. As discussed on IRC, you'll need more than 0xFFFF for APIC IDs
but I think you can probably get away with 0xFFFFF for physical APIC ID
since nothing more than that can fit in 32 bits of *logical* X2APIC ID.
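(The constraint falls out of how cluster mode derives the logical ID
from the physical one; a quick sketch of the arithmetic:)

/*
 * x2APIC cluster mode: the 32-bit logical ID is a 16-bit cluster
 * number in the upper half and a one-hot position in the lower half.
 */
static u32 x2apic_logical_id(u32 phys_apicid)
{
	u32 cluster = phys_apicid >> 4;		/* 16 CPUs per cluster */
	u32 pos = 1U << (phys_apicid & 0xf);

	/* cluster must fit in 16 bits, hence phys_apicid <= 0xFFFFF */
	return (cluster << 16) | pos;
}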

> I didn't come around to play with your patches yet and won't be able to
> do so before next week.

I threw it into my tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel

It seems to work fairly nicely. The parallel SIPI seems to win about
a third of the bringup time on my 28-thread Haswell box. This is at the
penultimate commit of the above branch:

[0.307590] smp: Bringing up secondary CPUs ...
[0.307826] x86: Booting SMP configuration:
[0.307830]  node  #0, CPUs:   #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14
[0.376677] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[0.377177]  #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
[0.402323] Brought CPUs online in 246691584 cycles
[0.402323] smp: Brought up 1 node, 28 CPUs

... and this is the tip of the branch:

[0.308332] smp: Bringing up secondary CPUs ... 
[0.308569] x86: Booting SMP configuration:
[0.308572]  node  #0, CPUs:   #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
[0.321120] Brought 28 CPUs to x86/cpu:kick in 34828752 cycles
[0.33] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[0.368749] Brought CPUs online in 124913032 cycles
[0.368749] smp: Brought up 1 node, 28 CPUs
[0.368749] smpboot: Max logical packages: 1
[0.368749] smpboot: Total of 28 processors activated (145259.85 BogoMIPS)

There's more to be gained here if we can fix up the next stage. Right
now if I set every CPU's bit in cpu_initialized_mask to allow them to
proceed from wait_for_master_cpu() through to the end of cpu_init() and
onwards through start_secondary(), they all end up hitting
check_tsc_sync_target() in parallel and it goes horridly wrong.

A simple answer might be to have another synchronisation point right
before the TSC sync, to force them to do it one at a time but *without*
having to wait for the rest of the AP bringup in series.

Other answers include inventing a magical parallel TSC sync algorithm
that isn't 1-to-1 between the BSP and each AP. Or observing that we're
running in KVM or after kexec, and we *know* they're in sync anyway and
bypassing the whole bloody nonsense.
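(The last option is already half-there: the target side of the sync
bails out early when the TSC is marked trustworthy, e.g. by
tsc=reliable on the command line or the X86_FEATURE_TSC_RELIABLE
synthetic flag that some hypervisors set. A sketch of the shape, not
the full function:)

void check_tsc_sync_target(void)
{
	if (tsc_clocksource_reliable)
		return;		/* skip the 1-to-1 dance entirely */

	/* ... otherwise the usual lockstep compare loop with the BSP ... */
}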

Recall, my initial debug patch timing the four stages of the bringup,
which I've left in my branch above for visibility, was showing this
kind of output with the original serial bringup...

> [0.285312] CPU#10 up in 192950, 952898, 60014786, 28 (61160662)
> [0.288311] CPU#11 up in 181092, 962704, 60010432, 30 (61154258)
> [0.291312] CPU#12 up in 386080, 970416, 60013956, 28 (61370480)
> [0.294311] CPU#13 up in 372782, 964506, 60010564, 28 (61347880)
> [0.297312] CPU#14 up in 389602, 976280, 60013046, 28 (61378956)
> [0.300312] CPU#15 up in 213132, 968148, 60012138, 28 (61193446)

So that's almost a million cycles per CPU of do_wait_cpu_callin()
before we get to the TSC sync (which is insanely expensive on KVM).
It's something like three times the time taken by the SIPI+wait in the
first phase, which is what we've parallelised so far (at a gain of
about 33%).



> ---
>  arch/x86/include/asm/realmode.h  |3 +
>  arch/x86/include/asm/smp.h   |9 +++-
>  arch/x86/kernel/acpi/sleep.c |1 
>  arch/x86/kernel/apic/apic.c  |2 
>  arch/x86/kernel/head_64.S            |   71 +++
>  arch/x86/kernel/smpboot.c|   19 -
>  arch/x86/realmode/init.c |3 +
>  arch/x86/realmode/rm/trampoline_64.S |   14 ++
>  kernel/smpboot.c |2 
>  9 files changed, 119 insertions(+), 5 deletions(-)
> 
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -51,6 +51,7 @@ struct trampoline_header {
>   u64 efer;
>   u32 cr4;
>   u32 

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-02-01 Thread David Woodhouse
On Thu, 2021-01-21 at 15:55 +0100, Thomas Gleixner wrote:
> > Here's the hack we're testing with, for reference. It's kind of ugly
> > but you can see where it's going. Note that the CMOS mangling for the
> > warm reset vector is going to need to be lifted out of the per-cpu
> > loop, and done *once* at startup and torn down once in smp_cpus_done.
> > Except that it also needs to be done before/after a hotplug cpu up;
> > we'll have to come back to that but we've just shifted it to
> > native_smp_cpus_done() for testing for now.
> 
> Right. It's at least a start.

Here's what we have now.

I've refcounted the warm reset vector thing which should fix the
hotplug case, although I need to check it gets torn down in the error
cases correctly.

With the X2APIC global variable thing fixed, the new states can be
immediately before CPUHP_BRINGUP_CPU as we originally wanted. I've
fixed up the bringup_nonboot_cpus() loop to bring an appropriate number
of CPUs to those "CPUHP_BP_PARALLEL_DYN" dynamic parallel pre-bringup
states in parallel.

We spent a while vacillating about how to add the new states, because
of the existing special-case hackery in bringup_cpu() for the
CPUHP_BRINGUP_CPU case.

The irq_lock_sparse() and the idle_thread_get() from there might
actually be needed in *earlier* states for platforms which do parallel
bringup so do we add similar wrappers in kernel/cpu.c for *all* of
the pre-bringup states, having hard-coded them? Then let the arch
provide a config symbol for whether it really wants them or not? That
seemed kind of horrid, so I went for the simple option of just letting
the arch register the CPUHP_BP_PARALLEL_DYN states the normal way with
its own functions to be called directly, and the loop in
bringup_nonboot_cpus() can then operate directly on whether they exist
in the state table or not, for which there is precedent already.
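(Concretely, something along these lines on the x86 side; the state
constant and the callback are placeholder names for whatever the
series ends up calling them:)

static int __init parallel_bringup_init(void)
{
	int ret = cpuhp_setup_state_nocalls(CPUHP_BP_PARALLEL_DYN,
					    "x86/cpu:kick",
					    native_cpu_kick, NULL);

	/* dynamic states return the allocated slot number on success */
	return ret < 0 ? ret : 0;
}
early_initcall(parallel_bringup_init);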

That means I needed to export idle_thread_get() for the pre-bringup
state functions to use too. I'll also want to add the irq_lock_sparse()
into my final patch but frankly, that's the least of my worries about
that patch right now.

It's also fairly much a no-brainer to split up the x86
native_cpu_up() into the four separate phases that I had got separate
timings for previously. We can do that just as a "cleanup" with no
functional change.

So I'm relatively happy at least that far, as preparatory work...

David Woodhouse (6):
  x86/apic/x2apic: Fix parallel handling of cluster_mask
  cpu/hotplug: Add dynamic states before CPUHP_BRINGUP_CPU for parallel 
bringup
  x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
  x86/smpboot: Split up native_cpu_up into separate phases
  cpu/hotplug: Move idle_thread_get() to <linux/smpboot.h>

 arch/x86/kernel/apic/x2apic_cluster.c |  82 +
 arch/x86/kernel/smpboot.c             | 159 ++--
 include/linux/cpuhotplug.h            |   2 +
 include/linux/smpboot.h               |   7 +++
 kernel/cpu.c                          |  27 +-
 kernel/smpboot.h                      |   2 -
 6 files changed, 180 insertions(+), 99 deletions(-)

That's the generic part mostly done, and the fun part is where we turn
back to x86 and actually try to split out those four phases of
native_cpu_up() to happen in parallel.

We store initial_stack and initial_gs for "the" AP that is coming up,
in global variables. It turns out that the APs don't like all sharing
the same stack as they come up in parallel, and weird behaviour ensues.

I think the only thing the AP has that can disambiguate it from other
APs is its CPUID, which it can get in its full 32-bit glory from
CPUID.0BH:EDX (and I think we can say we'll do parallel bringup *only*
if that leaf exists on the boot CPU).

So the trampoline code would need to find the logical CPU# and thus the
idle thread stack and per-cpu data with a lookup based on its APICID.
Perhaps just by trawling the various per-cpu data until it finds one
with the right apicid, much like default_cpu_present_to_apicid() does.
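(A sketch of that lookup in C, assuming the cpuid_to_apicid[] array
that default_cpu_present_to_apicid() already consults; the real thing
has to happen in the trampoline, so this is illustration only:)

/* AP side: the full 32-bit x2APIC ID from CPUID leaf 0xB */
static u32 get_my_apicid(void)
{
	u32 eax, ebx, ecx, edx;

	cpuid_count(0x0b, 0, &eax, &ebx, &ecx, &edx);
	return edx;
}

/* Trawl for the logical CPU number that owns that APIC ID */
static int find_cpu_by_apicid(u32 apicid)
{
	int cpu;

	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
		if (cpuid_to_apicid[cpu] == (int)apicid)
			return cpu;
	return -1;
}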

Oh, and ideally it needs to do this without using a real-mode stack,
because they won't like sharing that *either*.

(Actually they don't seem to mind in practice right now because the
only thing they all use it for is a 'call verify_cpu' and they all
place the *same* return address at the same place on the stack, but it
would be horrid to rely on that on *purpose* :)

So we'll continue to work on that in order to enable the parallel
bringup on x86, unless anyone has any cleverer ideas.

After that we'll get to the TSC sync, which is also not working in
parallel.




Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-21 Thread David Woodhouse
On Thu, 2021-01-21 at 15:42 +0000, David Woodhouse wrote:
> [2.289283] BUG: kernel NULL pointer dereference, address: 
> [2.289283] #PF: supervisor write access in kernel mode
> [2.289283] #PF: error_code(0x0002) - not-present page
> [2.289283] PGD 0 P4D 0 
> [2.289283] Oops: 0002 [#1] SMP PTI
> [2.289283] CPU: 32 PID: 0 Comm: swapper/32 Not tainted 5.10.0+ #745
> [2.289283] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.14.0-1.fc33 04/01/2014
> [2.289283] RIP: 0010:init_x2apic_ldr+0xa0/0xb0


OK... in alloc_clustermask() for each CPU we were preallocating a
cluster_mask and storing it in the global cluster_hotplug_mask.

Then later for each CPU we were taking the preallocated cluster_mask
and setting cluster_hotplug_mask to NULL.

That doesn't parallelise well :)

So... ditch the global variable, let alloc_clustermask() install the
appropriate cluster_mask *directly* into the target CPU's per_cpu data
before it's running. And since we have to calculate the logical APIC ID
for the cluster ID, we might as well set x86_cpu_to_logical_apicid at
the same time.

Now all that init_x2apic_ldr() actually *does* on the target CPU is set
that CPU's bit in the pre-existing cluster_mask.

To reduce the number of loops over all (present or online) CPUs, I've
made it set the per_cpu cluster_mask for *all* CPUs in the cluster in
one pass at boot time. I think the case for later hotplug is also sane;
will have to test that.

But it passes that qemu boot test it was failing earlier, at least...
 
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index b0889c48a2ac..74bb4cae8b5b 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -18,7 +18,6 @@ struct cluster_mask {
 static DEFINE_PER_CPU(u32, x86_cpu_to_logical_apicid);
 static DEFINE_PER_CPU(cpumask_var_t, ipi_mask);
 static DEFINE_PER_CPU(struct cluster_mask *, cluster_masks);
-static struct cluster_mask *cluster_hotplug_mask;
 
 static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 {
@@ -98,54 +97,61 @@ static u32 x2apic_calc_apicid(unsigned int cpu)
 static void init_x2apic_ldr(void)
 {
struct cluster_mask *cmsk = this_cpu_read(cluster_masks);
-   u32 cluster, apicid = apic_read(APIC_LDR);
-   unsigned int cpu;
 
-   this_cpu_write(x86_cpu_to_logical_apicid, apicid);
+   BUG_ON(!cmsk);
 
-   if (cmsk)
-   goto update;
-
-   cluster = apicid >> 16;
-   for_each_online_cpu(cpu) {
-   cmsk = per_cpu(cluster_masks, cpu);
-   /* Matching cluster found. Link and update it. */
-   if (cmsk && cmsk->clusterid == cluster)
-   goto update;
-   }
-   cmsk = cluster_hotplug_mask;
-   cmsk->clusterid = cluster;
-   cluster_hotplug_mask = NULL;
-update:
-   this_cpu_write(cluster_masks, cmsk);
cpumask_set_cpu(smp_processor_id(), &cmsk->mask);
 }
 
-static int alloc_clustermask(unsigned int cpu, int node)
+static int alloc_clustermask(unsigned int cpu, u32 cluster, int node)
 {
+   struct cluster_mask *cmsk = NULL;
+   u32 apicid;
+
if (per_cpu(cluster_masks, cpu))
return 0;
-   /*
-* If a hotplug spare mask exists, check whether it's on the right
-* node. If not, free it and allocate a new one.
+
+   /* For the hotplug case, don't always allocate a new one */
+   for_each_online_cpu(cpu) {
+   apicid = apic->cpu_present_to_apicid(cpu);
+   if (apicid != BAD_APICID && apicid >> 4 == cluster) {
+   cmsk = per_cpu(cluster_masks, cpu);
+   if (cmsk)
+   break;
+   }
+   }
+   if (!cmsk)
+   cmsk = kzalloc_node(sizeof(*cmsk), GFP_KERNEL, node);
+   if (!cmsk)
+   return -ENOMEM;
+
+   cmsk->node = node;
+   cmsk->clusterid = cluster;
+
+/*
+* As an optimisation during boot, set the cluster_mask for *all*
+* present CPUs at once, which will include 'cpu'.
 */
-   if (cluster_hotplug_mask) {
-   if (cluster_hotplug_mask->node == node)
-   return 0;
-   kfree(cluster_hotplug_mask);
+   if (system_state < SYSTEM_RUNNING) {
+   for_each_present_cpu(cpu) {
+   u32 apicid = apic->cpu_present_to_apicid(cpu);
+   if (apicid != BAD_APICID && apicid >> 4 == cluster)
+   per_cpu(cluster_masks, cpu) = cmsk;
+   }
}
 
-   cluster_hotplug_mask = kzalloc_node(sizeof(*cluster_hotplug_mask),
-   GFP_KERNEL, node);
-   if (!cluster_hotplug_mask)
-   return -ENOMEM;
-   cluster_hotplug_mask->node = node;
return 0;
 }
 
 static int x2apic_prepare_cpu(unsigned 

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-21 Thread David Woodhouse
On Thu, 2021-01-21 at 15:55 +0100, Thomas Gleixner wrote:
> > Testing on real hardware has been more interesting and less useful so
> > far. We started with the CPUHP_BRINGUP_KICK_CPU state being
> > *immediately* before CPUHP_BRINGUP_CPU. On my 28-thread Haswell box,
> > that didn't come up at all even without actually *doing* anything in
> > the pre-bringup phase. Merely bringing all the AP threads up through
> > the various CPUHP_PREPARE_foo stages before actually bringing them
> > online, was enough to break it. I have no serial port on this box so we
> > haven't yet worked out why; I've resorted to putting the
> > CPUHP_BRINGUP_KICK_CPU state before CPUHP_WORKQUEUE_PREP instead.
> 
> Hrm.

Aha, I managed to reproduce in qemu. It's CPUHP_X2APIC_PREPARE, which
is only used in x2apic *cluster* mode not physical mode. So I actually
need to give the guest an IOMMU with IRQ remapping before I see it.


$ git diff
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index bc56287a1ed1..f503e66b4718 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -92,6 +92,7 @@ enum cpuhp_state {
CPUHP_MIPS_SOC_PREPARE,
CPUHP_BP_PREPARE_DYN,
CPUHP_BP_PREPARE_DYN_END= CPUHP_BP_PREPARE_DYN + 20,
+   CPUHP_BRINGUP_WAKE_CPU,
CPUHP_BRINGUP_CPU,
CPUHP_AP_IDLE_DEAD,
CPUHP_AP_OFFLINE,
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 2b8d7a5db383..6c6f2986bfdb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1336,6 +1336,12 @@ void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
unsigned int cpu;
 
+   for_each_present_cpu(cpu) {
+   if (num_online_cpus() >= setup_max_cpus)
+   break;
+   if (!cpu_online(cpu))
+   cpu_up(cpu, CPUHP_BRINGUP_WAKE_CPU);
+   }
for_each_present_cpu(cpu) {
if (num_online_cpus() >= setup_max_cpus)
break;
$ qemu-system-x86_64 -kernel arch/x86/boot/bzImage -append "console=ttyS0 trace_event=cpuhp tp_printk" -display none -serial mon:stdio -m 2G -M q35,accel=kvm,kernel-irqchip=split -device intel-iommu,intremap=on -smp 40
...
[0.349968] smp: Bringing up secondary CPUs ...
[0.350281] cpuhp_enter: cpu: 0001 target:  42 step:   1 (smpboot_create_threads)
[0.351421] cpuhp_exit:  cpu: 0001  state:   1 step:   1 ret: 0
[0.352074] cpuhp_enter: cpu: 0001 target:  42 step:   2 (perf_event_init_cpu)
[0.352276] cpuhp_exit:  cpu: 0001  state:   2 step:   2 ret: 0
[0.353273] cpuhp_enter: cpu: 0001 target:  42 step:  37 (workqueue_prepare_cpu)
[0.354377] cpuhp_exit:  cpu: 0001  state:  37 step:  37 ret: 0
[0.355273] cpuhp_enter: cpu: 0001 target:  42 step:  39 (hrtimers_prepare_cpu)
[0.356271] cpuhp_exit:  cpu: 0001  state:  39 step:  39 ret: 0
[0.356937] cpuhp_enter: cpu: 0001 target:  42 step:  41 (x2apic_prepare_cpu)
[0.357277] cpuhp_exit:  cpu: 0001  state:  41 step:  41 ret: 0
[0.358278] cpuhp_enter: cpu: 0002 target:  42 step:   1 (smpboot_create_threads)
...
[0.614278] cpuhp_enter: cpu: 0032 target:  42 step:   1 (smpboot_create_threads)
[0.615610] cpuhp_exit:  cpu: 0032  state:   1 step:   1 ret: 0
[0.616274] cpuhp_enter: cpu: 0032 target:  42 step:   2 (perf_event_init_cpu)
[0.617271] cpuhp_exit:  cpu: 0032  state:   2 step:   2 ret: 0
[0.618272] cpuhp_enter: cpu: 0032 target:  42 step:  37 (workqueue_prepare_cpu)
[0.619388] cpuhp_exit:  cpu: 0032  state:  37 step:  37 ret: 0
[0.620273] cpuhp_enter: cpu: 0032 target:  42 step:  39 (hrtimers_prepare_cpu)
[0.621270] cpuhp_exit:  cpu: 0032  state:  39 step:  39 ret: 0
[0.622009] cpuhp_enter: cpu: 0032 target:  42 step:  41 (x2apic_prepare_cpu)
[0.622275] cpuhp_exit:  cpu: 0032  state:  41 step:  41 ret: 0
...
[0.684272] cpuhp_enter: cpu: 0039 target:  42 step:  41 (x2apic_prepare_cpu)
[0.685277] cpuhp_exit:  cpu: 0039  state:  41 step:  41 ret: 0
[0.685979] cpuhp_enter: cpu: 0001 target: 217 step:  43 (smpcfd_prepare_cpu)
[0.686283] cpuhp_exit:  cpu: 0001  state:  43 step:  43 ret: 0
[0.687274] cpuhp_enter: cpu: 0001 target: 217 step:  44 (relay_prepare_cpu)
[0.688274] cpuhp_exit:  cpu: 0001  state:  44 step:  44 ret: 0
[0.689274] cpuhp_enter: cpu: 0001 target: 217 step:  47 (rcutree_prepare_cpu)
[0.690271] cpuhp_exit:  cpu: 0001  state:  47 step:  47 ret: 0
[0.690982] cpuhp_multi_enter: cpu: 0001 target: 217 step:  59 (trace_rb_cpu_prepare)
[0.691281] cpuhp_exit:  cpu: 0001  state:  59 step:  59 ret: 0
[0.692272] cpuhp_multi_enter: cpu: 0001 target: 217 step:  59 (trace_rb_cpu_prepare)
[0.694640] cpuhp_exit:  cpu: 0001  state:  59 step:  59 ret: 0
[0.695272] cpuhp_multi_enter: cpu: 0001 target: 217 step:  59 (trace_rb_cpu_prepare)
[0.696280] cpuhp_exit:  cpu: 0001  state:  59 step:  59 ret: 0
[0.697279] cpuhp_enter: cpu: 0001 target: 217 step:  65 

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-21 Thread Thomas Gleixner
David,

On Tue, Jan 19 2021 at 12:12, David Woodhouse wrote:
> On Tue, 2020-12-15 at 22:20 +0100, Thomas Gleixner wrote:
> We've been playing with this a little. There's a proof-of-concept hack
> below; don't look too hard because it's only really for figuring out
> the timing etc.
>
> Basically we ripped out the 'wait' parts of the x86 do_boot_cpu() into
> a separate function do_wait_cpu(). There are four phases to the wait.
>
>  • Wait for the AP to turn up in cpu_initialized_mask, set its bit in
>cpu_callout_mask to allow it to run the AP thread.
>  • Wait for it to finish init and show up in cpu_callin_mask.
>  • check_tsc_sync_source()
>  • Wait for cpu_online(cpu)
>
> There's an EARLY_INIT macro which controls whether the early bringup
> call actually *does* anything, or whether it's left until bringup_cpu()
> as the current code does. It allows a simple comparison of the two.
>
> First we tested under qemu (on a Skylake EC2 c5.metal instance). The
> do_boot_cpu() actually sending the IPIs took ~300k cycles itself.
> Without EARLY_INIT we see timing for the four wait phases along the
> lines of:
>
> [0.285312] CPU#10 up in 192950, 952898, 60014786, 28 (61160662)
> [0.288311] CPU#11 up in 181092, 962704, 60010432, 30 (61154258)
> [0.291312] CPU#12 up in 386080, 970416, 60013956, 28 (61370480)
> [0.294311] CPU#13 up in 372782, 964506, 60010564, 28 (61347880)
> [0.297312] CPU#14 up in 389602, 976280, 60013046, 28 (61378956)
> [0.300312] CPU#15 up in 213132, 968148, 60012138, 28 (61193446)
>
> If we define EARLY_INIT then that first phase of waiting for the CPU
> add itself is fairly much instantaneous, which is precisely what we
> were hoping for. We also seem to save about 300k cycles on the AP
> bringup too. It's just that it *all* pales into insignificance with
> whatever it's doing to synchronise the TSC for 60M cycles.

Yes, that's annoying, but it can be avoided. The host could tell the
guest that the TSC is perfectly synced.

> [0.338829] CPU#10 up in 600, 689054, 60025522, 28 (60715204)
> [0.341829] CPU#11 up in 610, 635346, 60019390, 28 (60655374)
> [0.343829] CPU#12 up in 632, 619352, 60020728, 28 (60640740)
> [0.346829] CPU#13 up in 602, 514234, 60025402, 26 (60540264)
> [0.348830] CPU#14 up in 608, 621058, 60025952, 26 (60647644)
> [0.351829] CPU#15 up in 600, 624690, 60021526, 410 (60647226)
>
> Testing on real hardware has been more interesting and less useful so
> far. We started with the CPUHP_BRINGUP_KICK_CPU state being
> *immediately* before CPUHP_BRINGUP_CPU. On my 28-thread Haswell box,
> that didn't come up at all even without actually *doing* anything in
> the pre-bringup phase. Merely bringing all the AP threads up through
> the various CPUHP_PREPARE_foo stages before actually bringing them
> online, was enough to break it. I have no serial port on this box so we
> haven't yet worked out why; I've resorted to putting the
> CPUHP_BRINGUP_KICK_CPU state before CPUHP_WORKQUEUE_PREP instead.

Hrm.

> That lets it boot without the EARLY_INIT at least (so it's basically a
> no-op), and I get these timings. Looks like there's 3-4M cycles to be
> had by the parallel SIPI/INIT, but the *first* thread of each core is
> also taking another 8M cycles and it might be worth doing *those* in
> parallel too. And Thomas I think that waiting for the AP to bring
> itself up is the part you meant was pointlessly differently
> reimplemented across architectures? So the way forward there might be
> to offer a generic CPUHP state for that, for architectures to plug into
> and ditch their own tracking.

Yes. The whole wait for alive and callin and online can be generic.

> When I enabled EARLY_INIT it didn't boot; I need to hook up some box
> with a serial port to make more meaningful progress there, but figured
> it was worth sharing the findings so far.
>
> Here's the hack we're testing with, for reference. It's kind of ugly
> but you can see where it's going. Note that the CMOS mangling for the
> warm reset vector is going to need to be lifted out of the per-cpu
> loop, and done *once* at startup and torn down once in smp_cpus_done.
> Except that it also needs to be done before/after a hotplug cpu up;
> we'll have to come back to that but we've just shifted it to
> native_smp_cpus_done() for testing for now.

Right. It's at least a start.

Thanks,

tglx




Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-19 Thread David Woodhouse
On Tue, 2020-12-15 at 22:20 +0100, Thomas Gleixner wrote:
> Since the rewrite of the CPU hotplug infrastructure to a state machine
> it's pretty obvious that the bringup of APs can be changed from the fully
> serialized:
> 
>  for_each_present_cpu(cpu) {
> if (!cpu_online(cpu))
>cpu_up(cpu, CPUHP_ONLINE);
>  }
> 
> to
> 
>  for_each_present_cpu(cpu) {
> if (!cpu_online(cpu))
>cpu_up(cpu, CPUHP_KICK_CPU);
>  }
> 
>  for_each_present_cpu(cpu) {
> if (!cpu_active(cpu))
>cpu_up(cpu, CPUHP_ONLINE);
>  }
> 
> The CPUHP_KICK_CPU state does not exist today, but it's just the logical
> consequence of the state machine. It's basically splitting __cpu_up()
> into:
> 
> __cpu_kick()
> {
> prepare();
> arch_kick_remote_cpu(); -> Send IPI/NMI, Firmware call .
> }
> 
> __cpu_wait_online()
> {
> wait_until_cpu_online();
> do_further_stuff();
> }
> 
> There is some more to it than just blindly splitting it up at the
> architecture level.

We've been playing with this a little. There's a proof-of-concept hack
below; don't look too hard because it's only really for figuring out
the timing etc.

Basically we ripped out the 'wait' parts of the x86 do_boot_cpu() into
a separate function do_wait_cpu(). There are four phases to the wait.

 • Wait for the AP to turn up in cpu_initialized_mask, set its bit in
   cpu_callout_mask to allow it to run the AP thread.
 • Wait for it to finish init and show up in cpu_callin_mask.
 • check_tsc_sync_source()
 • Wait for cpu_online(cpu)
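(In outline, the split-out wait looks like this; a sketch of the four
phases rather than the hack itself:)

static void do_wait_cpu(unsigned int cpu)
{
	/* 1: AP shows up, release it to run the rest of its init */
	while (!cpumask_test_cpu(cpu, cpu_initialized_mask))
		cpu_relax();
	cpumask_set_cpu(cpu, cpu_callout_mask);

	/* 2: AP finishes init and checks in */
	while (!cpumask_test_cpu(cpu, cpu_callin_mask))
		cpu_relax();

	/* 3: pairs with check_tsc_sync_target() on the AP */
	check_tsc_sync_source(cpu);

	/* 4: AP marks itself online in start_secondary() */
	while (!cpu_online(cpu))
		cpu_relax();
}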

There's an EARLY_INIT macro which controls whether the early bringup
call actually *does* anything, or whether it's left until bringup_cpu()
as the current code does. It allows a simple comparison of the two.

First we tested under qemu (on a Skylake EC2 c5.metal instance). The
do_boot_cpu() actually sending the IPIs took ~300k cycles itself.
Without EARLY_INIT we see timing for the four wait phases along the
lines of:

[0.285312] CPU#10 up in 192950, 952898, 60014786, 28 (61160662)
[0.288311] CPU#11 up in 181092, 962704, 60010432, 30 (61154258)
[0.291312] CPU#12 up in 386080, 970416, 60013956, 28 (61370480)
[0.294311] CPU#13 up in 372782, 964506, 60010564, 28 (61347880)
[0.297312] CPU#14 up in 389602, 976280, 60013046, 28 (61378956)
[0.300312] CPU#15 up in 213132, 968148, 60012138, 28 (61193446)

If we define EARLY_INIT then that first phase of waiting for the CPU
add itself is fairly much instantaneous, which is precisely what we
were hoping for. We also seem to save about 300k cycles on the AP
bringup too. It's just that it *all* pales into insignificance with
whatever it's doing to synchronise the TSC for 60M cycles.

[0.338829] CPU#10 up in 600, 689054, 60025522, 28 (60715204)
[0.341829] CPU#11 up in 610, 635346, 60019390, 28 (60655374)
[0.343829] CPU#12 up in 632, 619352, 60020728, 28 (60640740)
[0.346829] CPU#13 up in 602, 514234, 60025402, 26 (60540264)
[0.348830] CPU#14 up in 608, 621058, 60025952, 26 (60647644)
[0.351829] CPU#15 up in 600, 624690, 60021526, 410 (60647226)

Testing on real hardware has been more interesting and less useful so
far. We started with the CPUHP_BRINGUP_KICK_CPU state being
*immediately* before CPUHP_BRINGUP_CPU. On my 28-thread Haswell box,
that didn't come up at all even without actually *doing* anything in
the pre-bringup phase. Merely bringing all the AP threads up through
the various CPUHP_PREPARE_foo stages before actually bringing them
online, was enough to break it. I have no serial port on this box so we
haven't yet worked out why; I've resorted to putting the
CPUHP_BRINGUP_KICK_CPU state before CPUHP_WORKQUEUE_PREP instead.

That lets it boot without the EARLY_INIT at least (so it's basically a
no-op), and I get these timings. Looks like there's 3-4M cycles to be
had by the parallel SIPI/INIT, but the *first* thread of each core is
also taking another 8M cycles and it might be worth doing *those* in
parallel too. And Thomas I think that waiting for the AP to bring
itself up is the part you meant was pointlessly differently
reimplemented across architectures? So the way forward there might be
to offer a generic CPUHP state for that, for architectures to plug into
and ditch their own tracking.

[0.311581] CPU#1 up in 4057008, 8820492, 1828, 808 (12880136)
[0.316802] CPU#2 up in 3885080, 8738092, 1792, 904 (12625868)
[0.321674] CPU#3 up in 3468276, 8244880, 1724, 860 (11715740)
[0.326609] CPU#4 up in 3565216, 8357876, 1808, 984 (11925884)
[0.331565] CPU#5 up in 3566916, 8367340, 1836, 708 (11936800)
[0.336446] 

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-07 Thread David Woodhouse
On Wed, 2020-12-16 at 16:31 +0100, Thomas Gleixner wrote:
> But obviously the C-state in which the APs are waiting is not really
> relevant, as you demonstrated that the cost is due to INIT/SIPI even
> with spinwait, which is what I suspected.
> 
> OTOH, the advantage of INIT/SIPI is that the AP comes up in a well known
> state.

And once we parallelise the bringup we basically only incur the latency
of *one* INIT/SIPI instead of multiplying it by the number of CPUs, so
it isn't clear that there's any *disadvantage* to it. It's certainly a
lot simpler.

I think we should definitely start by implementing the parallel bringup
as you described it, and then see if there's still a problem left to be
solved.

We were working on a SIPI-avoiding patch set which is similar to the
above, which Johanna had just about got working the night before this
one was posted. But it looks like we should go back to the drawing
board anyway instead of bothering to compare the details of the two.




Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2020-12-17 Thread shenkai (D)

On 2020/12/16 23:31, Thomas Gleixner wrote:

> OTOH, the advantage of INIT/SIPI is that the AP comes up in a well known
> state.

We can set APs to a known state explicitly like the BSP will do in the
kexec case (what we also tried to do in the patch). Maybe it is not a
big problem?

Best regards

Kai



Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2020-12-16 Thread Thomas Gleixner
Kai,

On Wed, Dec 16 2020 at 22:18, shenkai wrote:
> After some tests, the conclusion that time cost is from deep C-state 
> turns out to be wrong
>
> Sorry for that.

No problem.

> In kexec case, first let APs spinwait like what I did  in that patch,
> but wake APs up by sending apic INIT and SIPI  interrupts as normal
> procedure instead of writing to some address and there is no
> acceleration (time cost is still 210ms).

Ok.

> So can we say that the main time cost is from apic INIT and SIPI
> interrupts and the handling of them instead of deep C-state?

That's a fair conclusion.

> I didn't test with play_dead() because in kexec case, one new kernel
> will be started and APs can't be woken up by normal interrupts like in
> hibernate case for the irq vectors are gone with the old kernel.
>
> Or maybe I didn't get the point correctly?

Not exactly, but your experiment answered the question already.

My point was that the regular kexec unplugs the APs which then end up in
play_dead() and trying to use the deepest C-state via mwait(). So if the
overhead would be related to getting them out of a deep C-state then
forcing that play_dead() to use the HLT instruction or the most shallow
C-state with mwait() would have brought an improvement, right?
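(Concretely, the experiment amounts to something like this in
arch/x86/kernel/smpboot.c, reusing the helpers that already live
there; a sketch:)

void native_play_dead(void)
{
	play_dead_common();
	tboot_shutdown(TB_SHUTDOWN_WFS);

	/* Test variant: skip mwait_play_dead() so the AP parks in HLT
	 * (or use mwait with eax=0 for the most shallow C-state). */
	hlt_play_dead();
}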

But obviously the C-state in which the APs are waiting is not really
relevant, as you demonstrated that the cost is due to INIT/SIPI even
with spinwait, which is what I suspected.

OTOH, the advantage of INIT/SIPI is that the AP comes up in a well known
state.

Thanks,

tglx




Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2020-12-16 Thread shenkai (D)

On 2020/12/16 18:12, Thomas Gleixner wrote:

> Kai,
>
> On Wed, Dec 16 2020 at 16:45, shenkai wrote:
>> On 2020/12/16 5:20, Thomas Gleixner wrote:
>> Thanks for your and Andy's precious comments. I would like to take a
>> try on reconstructing this patch to make it more decent and generic.
>
>>> It would be interesting to see the numbers just with play_dead() using
>>> hlt() or mwait(eax=0, 0) for the kexec case and no other change at all.
>
> Can you please as a first step look into this and check if the time
> changes?
>
> Thanks,
>
>  tglx

After some tests, the conclusion that time cost is from deep C-state
turns out to be wrong. Sorry for that.

Here is what I do:

In kexec case, first let APs spinwait like what I did in that patch,
but wake APs up by sending apic INIT and SIPI interrupts as normal
procedure instead of writing to some address, and there is no
acceleration (time cost is still 210ms).

So can we say that the main time cost is from apic INIT and SIPI
interrupts and the handling of them instead of deep C-state?

I didn't test with play_dead() because in kexec case, one new kernel
will be started and APs can't be woken up by normal interrupts like in
the hibernate case, for the irq vectors are gone with the old kernel.

Or maybe I didn't get the point correctly?


Best regards

Kai





Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2020-12-16 Thread Thomas Gleixner
Kai,

On Wed, Dec 16 2020 at 16:45, shenkai wrote:
> On 2020/12/16 5:20, Thomas Gleixner wrote:
>>
>>
> Thanks for your and Andy's precious comments. I would like to take a try on
>
> reconstructing this patch to make it more decent and generic.

>> It would be interesting to see the numbers just with play_dead() using
>> hlt() or mwait(eax=0, 0) for the kexec case and no other change at all.

Can you please as a first step look into this and check if the time
changes?

Thanks,

tglx


Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2020-12-16 Thread shenkai (D)

On 2020/12/16 5:20, Thomas Gleixner wrote:

On Tue, Dec 15 2020 at 08:31, Andy Lutomirski wrote:

On Tue, Dec 15, 2020 at 6:46 AM shenkai (D)  wrote:

From: shenkai 
Date: Tue, 15 Dec 2020 01:58:06 +0000
Subject: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

In kexec reboot on x86 machine, APs will be halted and then woken up
by the apic INIT and SIPI interrupt. Here we can let APs spin instead
of being halted and boot APs by writing to a specific address. In this way
we can accelerate smp_init procedure for we don't need to pull APs up
from a deep C-state.

This is meaningful in many situations where users are sensitive to reboot
time cost.

I like the concept.

No. This is the wrong thing to do. We are not optimizing for _one_
special case.

We can optimize it for all operations where all the non boot CPUs have
to be brought up, be it cold boot, hibernation resume or kexec.

Aside of that this is not a magic X86 special problem. Pretty much all
architectures have the same issue and it can be solved very simply,
which has been discussed before and I outlined the solution years ago,
but nobody sat down and actually made it work.

Since the rewrite of the CPU hotplug infrastructure to a state machine
it's pretty obvious that the bringup of APs can be changed from the fully
serialized:

  for_each_present_cpu(cpu) {
if (!cpu_online(cpu))
cpu_up(cpu, CPUHP_ONLINE);
  }

to

  for_each_present_cpu(cpu) {
if (!cpu_online(cpu))
cpu_up(cpu, CPUHP_KICK_CPU);
  }

  for_each_present_cpu(cpu) {
if (!cpu_active(cpu))
cpu_up(cpu, CPUHP_ONLINE);
  }

The CPUHP_KICK_CPU state does not exist today, but it's just the logical
consequence of the state machine. It's basically splitting __cpu_up()
into:

__cpu_kick()
{
 prepare();
 arch_kick_remote_cpu(); -> Send IPI/NMI, Firmware call .
}
 
__cpu_wait_online()
{
 wait_until_cpu_online();
 do_further_stuff();
}

There is some more to it than just blindly splitting it up at the
architecture level.

All __cpu_up() implementations across arch/ have a lot of needlessly
duplicated and pointlessly differently implemented code which can move
completely into the core.

So actually we want to split this further up:

CPUHP_PREPARE_CPU_UP:   Generic preparation step where all
                        the magic cruft which is duplicated
                        across architectures goes to

CPUHP_KICK_CPU:         Architecture specific prepare and kick

CPUHP_WAIT_ONLINE:      Generic wait function for CPU coming
                        online: wait_for_completion_timeout()
                        which releases the upcoming CPU and
                        invokes an optional arch_sync_cpu_up()
                        function which finalizes the bringup.
and on the AP side:

CPU comes up, does all the low level setup, sets online, calls
complete() and then spinwaits for release.

Once the control CPU comes out of the completion it releases the
spinwait.

That works for all bringup situations and not only for kexec and the
simple trick is that by the time the last CPU has been kicked in the
first step, the first kicked CPU is already spinwaiting for release.

By the time the first kicked CPU has completed the process, i.e. reached
the active state, then the next CPU is spinwaiting and so on.

If you look at the provided time saving:

Mainline:  210ms
Patched:    80ms
----------------
Delta      130ms

i.e. it takes ~ 1.8ms to kick and wait for the AP to come up and ~ 1.1ms
per CPU for the whole bringup. It does not completely add up, but it has
a clear benefit for everything.

Also the changelog says that the delay is related to CPUs in deep
C-states. If CPUs are brought down for kexec then it's trivial enough to
limit the C-states or just not use mwait() at all.

It would be interesting to see the numbers just with play_dead() using
hlt() or mwait(eax=0, 0) for the kexec case and no other change at all.

Thanks,

 tglx


Thanks for your and Andy's precious comments. I would like to take a try on

reconstructing this patch to make it more decent and generic.


Thanks again

Kai


Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2020-12-15 Thread Thomas Gleixner
On Tue, Dec 15 2020 at 08:31, Andy Lutomirski wrote:
> On Tue, Dec 15, 2020 at 6:46 AM shenkai (D)  wrote:
>>
>> From: shenkai 
>> Date: Tue, 15 Dec 2020 01:58:06 +0000
>> Subject: [PATCH] use x86 cpu park to speedup smp_init in kexec situation
>>
>> In kexec reboot on x86 machine, APs will be halted and then woken up
>> by the apic INIT and SIPI interrupt. Here we can let APs spin instead
>> of being halted and boot APs by writing to a specific address. In this way
>> we can accelerate smp_init procedure for we don't need to pull APs up
>> from a deep C-state.
>>
>> This is meaningful in many situations where users are sensitive to reboot
>> time cost.
>
> I like the concept.

No. This is the wrong thing to do. We are not optimizing for _one_
special case.

We can optimize it for all operations where all the non boot CPUs have
to be brought up, be it cold boot, hibernation resume or kexec.

Aside of that this is not a magic X86 special problem. Pretty much all
architectures have the same issue and it can be solved very simply,
which has been discussed before and I outlined the solution years ago,
but nobody sat down and actually made it work.

Since the rewrite of the CPU hotplug infrastructure to a state machine
it's pretty obvious that the bringup of APs can be changed from the fully
serialized:

 for_each_present_cpu(cpu) {
if (!cpu_online(cpu))
   cpu_up(cpu, CPUHP_ONLINE);
 }

to

 for_each_present_cpu(cpu) {
if (!cpu_online(cpu))
   cpu_up(cpu, CPUHP_KICK_CPU);
 }

 for_each_present_cpu(cpu) {
if (!cpu_active(cpu))
   cpu_up(cpu, CPUHP_ONLINE);
 }

The CPUHP_KICK_CPU state does not exist today, but it's just the logical
consequence of the state machine. It's basically splitting __cpu_up()
into:

__cpu_kick()
{
prepare();
arch_kick_remote_cpu(); -> Send IPI/NMI, Firmware call .
}

__cpu_wait_online()
{
wait_until_cpu_online();
do_further_stuff();
}

There is some more to it than just blindly splitting it up at the
architecture level.

All __cpu_up() implementations across arch/ have a lot of needlessly
duplicated and pointlessly differently implemented code which can move
completely into the core.

So actually we want to split this further up:

   CPUHP_PREPARE_CPU_UP:  Generic preparation step where all
                          the magic cruft which is duplicated
                          across architectures goes to

   CPUHP_KICK_CPU:        Architecture specific prepare and kick

   CPUHP_WAIT_ONLINE:     Generic wait function for CPU coming
                          online: wait_for_completion_timeout()
                          which releases the upcoming CPU and
                          invokes an optional arch_sync_cpu_up()
                          function which finalizes the bringup.
and on the AP side:

   CPU comes up, does all the low level setup, sets online, calls
   complete() and then spinwaits for release.

Once the control CPU comes out of the completion it releases the
spinwait.

That works for all bringup situations and not only for kexec and the
simple trick is that by the time the last CPU has been kicked in the
first step, the first kicked CPU is already spinwaiting for release.

By the time the first kicked CPU has completed the process, i.e. reached
the active state, then the next CPU is spinwaiting and so on.

If you look at the provided time saving:

   Mainline:  210ms
   Patched:    80ms
   ----------------
   Delta      130ms

i.e. it takes ~ 1.8ms to kick and wait for the AP to come up and ~ 1.1ms
per CPU for the whole bringup. It does not completely add up, but it has
a clear benefit for everything.
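(Arithmetic check: 130ms saved across the 71 APs of the 72-core box
is ~1.8ms per AP, and the remaining 80ms is ~1.1ms per AP.)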

Also the changelog says that the delay is related to CPUs in deep
C-states. If CPUs are brought down for kexec then it's trivial enough to
limit the C-states or just not use mwait() at all.

It would be interesting to see the numbers just with play_dead() using
hlt() or mwait(eax=0, 0) for the kexec case and no other change at all.

Thanks,

tglx






Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2020-12-15 Thread Andy Lutomirski
On Tue, Dec 15, 2020 at 6:46 AM shenkai (D)  wrote:
>
> From: shenkai 
> Date: Tue, 15 Dec 2020 01:58:06 +0000
> Subject: [PATCH] use x86 cpu park to speedup smp_init in kexec situation
>
> In kexec reboot on x86 machine, APs will be halted and then woken up
> by the apic INIT and SIPI interrupt. Here we can let APs spin instead
> of being halted and boot APs by writing to a specific address. In this way
> we can accelerate smp_init procedure for we don't need to pull APs up
> from a deep C-state.
>
> This is meaningful in many situations where users are sensitive to reboot
> time cost.

I like the concept.

>
> On 72-core x86(Intel Xeon Gold 6154 CPU @ 3.00GHz) machine with
> Linux-5.10.0, time cost of smp_init is 210ms without cpu park while 80ms
> with it, and the difference is more significant when there are more
> cores.
>
> Implementation is basicly as follow:
> 1. reserve some space for cpu park code and data
> 2. when APs get reboot IPI and cpu_park is enabled, they will turn MMU
> off and execute cpu park code where APs will spin and wait on an
> address.(APs will reuse the pgtable which is used by BSP in kexec
> procedure to turn MMU off)
> 3. after BSP reboot successfully, it will wake up APs by writing an entry
> address to the spin-read address so that APs can jmp to normal bootup
> procedure.
> 4. The hyperthreaded twin processor of BSP can be specified so that the
> twin processor can halt as normal procedure and will not compete with
> BSP during booting.
>
> Signed-off-by: shenkai 
> Co-Developed-by: Luo Longjun 
> ---
>   arch/x86/Kconfig | 12 
>   arch/x86/include/asm/kexec.h | 43 ++
>   arch/x86/include/asm/realmode.h  |  3 +
>   arch/x86/kernel/Makefile |  1 +
>   arch/x86/kernel/cpu-park.S   | 87 +++
>   arch/x86/kernel/machine_kexec_64.c   | 16 +
>   arch/x86/kernel/process.c| 13 
>   arch/x86/kernel/reboot.c |  6 ++
>   arch/x86/kernel/relocate_kernel_64.S | 51 
>   arch/x86/kernel/setup.c  | 67 +
>   arch/x86/kernel/smp.c| 89 
>   arch/x86/kernel/smpboot.c| 14 +
>   arch/x86/realmode/rm/header.S|  3 +
>   arch/x86/realmode/rm/trampoline_64.S |  5 ++
>   kernel/smp.c |  6 ++
>   15 files changed, 416 insertions(+)
>   create mode 100644 arch/x86/kernel/cpu-park.S
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index fbf26e0f7a6a..8dedd0e62eb2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2406,6 +2406,18 @@ config MODIFY_LDT_SYSCALL
> surface.  Disabling it removes the modify_ldt(2) system call.
>
> Saying 'N' here may make sense for embedded or server kernels.
> +config X86_CPU_PARK
> +bool "Support CPU PARK on kexec"
> +default n
> +depends on SMP
> +depends on KEXEC_CORE
> +help
> +  This enables support for the CPU PARK feature, in
> +  order to save time bringing CPUs down and back up.
> +  CPU park is a state used across kexec: APs spin in
> +  a loop instead of dying, and jump out of the loop
> +  to the new kernel's entry point in smp_init.

Why does this need to be configurable?

>
>   source "kernel/livepatch/Kconfig"
>
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 6802c59e8252..3801df240f5f 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -197,6 +197,49 @@ typedef void crash_vmclear_fn(void);
>   extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
>   extern void kdump_nmi_shootdown_cpus(void);
>
> +#ifdef CONFIG_X86_CPU_PARK
> +#define PARK_MAGIC 0x7061726b
> +#define PARK_SECTION_SIZE PAGE_ALIGN(sizeof(struct cpu_park_section) + 2 * PAGE_SIZE - 1)
> +extern unsigned long park_start;
> +extern unsigned long park_len;
> +extern bool cpu_park_enable;
> +extern int boot_core_sibling;
> +extern void cpu_park(unsigned int cpu);
> +extern void __do_cpu_park(unsigned long exit);
> +extern int write_park_exit(unsigned int cpu);
> +extern unsigned long park_cpu(unsigned long pgd_addr,
> +unsigned long park_code_addr,
> +unsigned long exit_addr);
> +extern int install_cpu_park(void);
> +extern int uninstall_cpu_park(void);
> +
> +struct cpu_park_section {
> +struct {
> +unsigned long exit;/* entry address to exit loop */
> +unsigned long magic;/* magic indicates park state */

No magic please.  If you actually need a flag, give it a descriptive
name and document what the values are.  But I think you could get away
with just a single field per cpu:

u32 jump_address;

0 means keep spinning.

But I think you could do even better -- just have a single field that
you write to wake all parked CPUs.  You'll also want some indication
of how many CPUs are parked so that the new kernel can reliably tell