David,

On Tue, Jan 19 2021 at 12:12, David Woodhouse wrote:
> On Tue, 2020-12-15 at 22:20 +0100, Thomas Gleixner wrote:
> We've been playing with this a little. There's a proof-of-concept hack
> below; don't look too hard because it's only really for figuring out
> the timing etc.
>
> Basically we ripped out the 'wait' parts of the x86 do_boot_cpu() into
> a separate function, do_wait_cpu(). There are four phases to the wait
> (sketched below):
>
>  • Wait for the AP to turn up in cpu_initialized_mask, then set its bit
>    in cpu_callout_mask to allow it to run the AP thread.
>  • Wait for it to finish init and show up in cpu_callin_mask.
>  • check_tsc_sync_source()
>  • Wait for cpu_online(cpu)
>
> There's an EARLY_INIT macro which controls whether the early bringup
> call actually *does* anything, or whether it's left until bringup_cpu()
> as the current code does. It allows a simple comparison of the two.
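
For anyone skimming: condensed from the hack below, the split looks
roughly like this (do_wait_cpu() is the new function from David's
patch; timeouts and error handling omitted):

	static int do_wait_cpu(unsigned int cpu)
	{
		/* 1) Wait for the AP to mark itself initialized, then
		 *    set its callout bit so it may run the AP thread. */
		while (!cpumask_test_cpu(cpu, cpu_initialized_mask))
			cpu_relax();
		cpumask_set_cpu(cpu, cpu_callout_mask);

		/* 2) Wait for the AP to finish init and call in. */
		while (!cpumask_test_cpu(cpu, cpu_callin_mask))
			cpu_relax();

		/* 3) Check the AP's TSC against the boot CPU. */
		check_tsc_sync_source(cpu);

		/* 4) Wait for the AP to mark itself online. */
		while (!cpu_online(cpu))
			cpu_relax();

		return 0;
	}

With EARLY_INIT defined, the INIT/SIPI kick moves into the new
pre-bringup state and only do_wait_cpu() remains in bringup_cpu();
without it, both halves stay where they are today.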
>
> First we tested under qemu (on a Skylake EC2 c5.metal instance). The
> do_boot_cpu() call itself, actually sending the IPIs, took ~300k cycles.
> Without EARLY_INIT we see timing for the four wait phases along the
> lines of:
>
> [    0.285312] CPU#10 up in     192950,    952898,  60014786,        28 (  61160662)
> [    0.288311] CPU#11 up in     181092,    962704,  60010432,        30 (  61154258)
> [    0.291312] CPU#12 up in     386080,    970416,  60013956,        28 (  61370480)
> [    0.294311] CPU#13 up in     372782,    964506,  60010564,        28 (  61347880)
> [    0.297312] CPU#14 up in     389602,    976280,  60013046,        28 (  61378956)
> [    0.300312] CPU#15 up in     213132,    968148,  60012138,        28 (  61193446)
>
> If we define EARLY_INIT then that first phase of waiting for the CPU
> to add itself is pretty much instantaneous, which is precisely what we
> were hoping for. We also seem to save about 300k cycles on the AP
> bringup itself. It's just that it *all* pales into insignificance
> compared with whatever it's doing to synchronise the TSC for 60M cycles.

Yes, that's annoying, but it can be avoided. The host could tell the
guest that the TSC is perfectly synced.
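
The guest-side gate for that already exists; from memory, the check in
arch/x86/kernel/tsc_sync.c amounts to something like this (details
vary by kernel version):

	void check_tsc_sync_source(int cpu)
	{
		/*
		 * tsc_clocksource_reliable is set by tsc=reliable on the
		 * command line or when the hypervisor/platform detection
		 * code sets X86_FEATURE_TSC_RELIABLE. Either skips the
		 * whole 60M-cycle dance.
		 */
		if (unsynchronized_tsc() || tsc_clocksource_reliable)
			return;

		/* ... otherwise do the full source/target sync check ... */
	}

So teaching the hypervisor to advertise a reliable TSC when the host
TSC is stable should make that column disappear.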

> [    0.338829] CPU#10 up in        600,    689054,  60025522,        28 (  60715204)
> [    0.341829] CPU#11 up in        610,    635346,  60019390,        28 (  60655374)
> [    0.343829] CPU#12 up in        632,    619352,  60020728,        28 (  60640740)
> [    0.346829] CPU#13 up in        602,    514234,  60025402,        26 (  60540264)
> [    0.348830] CPU#14 up in        608,    621058,  60025952,        26 (  60647644)
> [    0.351829] CPU#15 up in        600,    624690,  60021526,       410 (  60647226)
>
> Testing on real hardware has been more interesting and less useful so
> far. We started with the CPUHP_BRINGUP_KICK_CPU state being
> *immediately* before CPUHP_BRINGUP_CPU. On my 28-thread Haswell box,
> that didn't come up at all even without actually *doing* anything in
> the pre-bringup phase. Merely bringing all the AP threads up through
> the various CPUHP_PREPARE_foo stages before actually bringing them
> online was enough to break it. I have no serial port on this box so we
> haven't yet worked out why; I've resorted to putting the
> CPUHP_BRINGUP_KICK_CPU state before CPUHP_WORKQUEUE_PREP instead.

Hrm.
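
For reference, the placement you ended up with would land roughly here
(CPUHP_BRINGUP_KICK_CPU is from your patch, not mainline; neighbouring
states abridged):

	enum cpuhp_state {
		CPUHP_OFFLINE = 0,
		/* ... early PREPARE states ... */
		CPUHP_BRINGUP_KICK_CPU,	/* boots when placed here */
		CPUHP_WORKQUEUE_PREP,
		/* ... remaining PREPARE states; placing the kick state
		 * immediately before CPUHP_BRINGUP_CPU broke boot ... */
		CPUHP_BRINGUP_CPU,
	};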

> That lets it boot without the EARLY_INIT at least (so it's basically a
> no-op), and I get these timings. Looks like there's 3-4M cycles to be
> had by the parallel SIPI/INIT, but the *first* thread of each core is
> also taking another 8M cycles and it might be worth doing *those* in
> parallel too. And Thomas I think that waiting for the AP to bring
> itself up is the part you meant was pointlessly differently
> reimplemented across architectures? So the way forward there might be
> to offer a generic CPUHP state for that, for architectures to plug into
> and ditch their own tracking.

Yes. The whole wait for alive and callin and online can be generic.
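
Something along these lines, i.e. the arch provides only the "has the
AP reached this milestone" part and the state machine owns the loop.
Entirely hypothetical sketch; none of these generic hooks exist yet:

	/* Generic bringup-wait, replacing the per-arch copies. */
	static int cpuhp_bringup_wait(unsigned int cpu)
	{
		int ret;

		/*
		 * Arch hook: wait for alive + callin. On x86 that's the
		 * cpu_initialized_mask/cpu_callin_mask dance; other
		 * architectures have their private equivalents today.
		 */
		ret = arch_cpuhp_wait_for_callin(cpu);
		if (ret)
			return ret;

		/* Generic part: wait for the AP to set itself online. */
		while (!cpu_online(cpu))
			cpu_relax();
		return 0;
	}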

> When I enabled EARLY_INIT it didn't boot; I need to hook up some box
> with a serial port to make more meaningful progress there, but figured
> it was worth sharing the findings so far.
>
> Here's the hack we're testing with, for reference. It's kind of ugly
> but you can see where it's going. Note that the CMOS mangling for the
> warm reset vector is going to need to be lifted out of the per-cpu
> loop, and done *once* at startup and torn down once in smp_cpus_done.
> Except that it also needs to be done before/after a hotplug cpu up;
> we'll have to come back to that, but for now we've just shifted it to
> native_smp_cpus_done() for testing.

Right. It's at least a start.
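
Once lifted out, the CMOS part should shrink to something like this,
using the existing helpers in smpboot.c (sketch; the hotplug case still
needs the setup/restore pair around each cpu-up until it's reworked):

	/* Once, before kicking any AP: */
	smpboot_setup_warm_reset_vector(real_mode_header->trampoline_start);

	/* ... per-CPU INIT/SIPI loop, no CMOS access ... */

	/* Once, after all APs have called in: */
	smpboot_restore_warm_reset_vector();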

Thanks,

        tglx

