Hi,

On 26.01.2024 19:30, Julien Grall wrote:
Hi,

I am not too familiary with the x86 boot code. But I will give a try to review :).

On 14/11/2023 17:49, Krystian Hebel wrote:
This is made as first step of making parallel AP bring-up possible. It
should be enough for pre-C code.

Signed-off-by: Krystian Hebel <krystian.he...@3mdeb.com>
---
  xen/arch/x86/boot/trampoline.S | 20 ++++++++++++++++++++
  xen/arch/x86/boot/x86_64.S     | 28 +++++++++++++++++++++++++++-
  xen/arch/x86/setup.c           |  7 +++++++
  3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/boot/trampoline.S b/xen/arch/x86/boot/trampoline.S
index b8ab0ffdcbb0..ec254125016d 100644
--- a/xen/arch/x86/boot/trampoline.S
+++ b/xen/arch/x86/boot/trampoline.S
@@ -72,6 +72,26 @@ trampoline_protmode_entry:
          mov     $X86_CR4_PAE,%ecx
          mov     %ecx,%cr4
  +        /*
+         * Get APIC ID while we're in non-paged mode. Start by checking if
+         * x2APIC is enabled.
+         */
+        mov     $MSR_APIC_BASE, %ecx
+        rdmsr
+        and     $APIC_BASE_EXTD, %eax
+        jnz     .Lx2apic
+
+        /* Not x2APIC, read from MMIO */
+        mov     0xfee00020, %esp
+        shr     $24, %esp
+        jmp     1f
+
+.Lx2apic:
+        mov     $(MSR_X2APIC_FIRST + (0x20 >> 4)), %ecx
+        rdmsr
+        mov     %eax, %esp
+1:
+
          /* Load pagetable base register. */
          mov     $sym_offs(idle_pg_table),%eax
          add     bootsym_rel(trampoline_xen_phys_start,4,%eax)
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 04bb62ae8680..b85b47b5c1a0 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -15,7 +15,33 @@ ENTRY(__high_start)
          mov     $XEN_MINIMAL_CR4,%rcx
          mov     %rcx,%cr4
  -        mov     stack_start(%rip),%rsp
+        test    %ebx,%ebx
+        cmovz   stack_start(%rip), %rsp
+        jz      .L_stack_set
+
+        /* APs only: get stack base from APIC ID saved in %esp. */
+        mov     $-1, %rax
+        lea     x86_cpu_to_apicid(%rip), %rcx
I would consider to move this patch after #2 and #3, so the logic is not modified again. This would help the review.
I agree, maybe even after #4 after that patch is modified according to other comments.

+1:
+        add     $1, %rax
+        cmp     $NR_CPUS, %eax
+        jb      2f
I think we can get rid of this jump by reworking the loop so %eax is tested as the end of the loop. But this is boot code, so it is possibly not worth it. I will leave the x86 maintainers commenting.
Not sure if I understood "end of the loop" correctly, but if I did, it would result in out-of-bounds read. Anyway, this is changed by further patches which I still have to reorder, I'll check if final form can be improved.

+        hlt
+2:
+        cmp     %esp, (%rcx, %rax, 4)
+        jne     1b
+
+        /* %eax is now Xen CPU index. */
+        lea     stack_base(%rip), %rcx
+        mov     (%rcx, %rax, 8), %rsp
+
+        test    %rsp,%rsp
+        jnz     1f
+        hlt
+1:
NIT: Can you use 3? This makes the code easier to read and less prone to error (you have two very close 1).
Ack

+        add     $(STACK_SIZE - CPUINFO_sizeof), %rsp
+
+.L_stack_set:
            /* Reset EFLAGS (subsumes CLI and CLD). */
          pushq   $0
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index a3d3f797bb1e..1285969901e0 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1951,6 +1951,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
       */
      if ( !pv_shim )
      {
+        /* Separate loop to make parallel AP bringup possible. */

The loop split seems to be unrelated to this patch. Actually, I was expecting that only the assembly code would be modified.
Fair point, I originally left it here so the code can be bisected if needed, but code changed significantly since then. In current form it should be safe to do this in the last commit.

          for_each_present_cpu ( i )
          {
              /* Set up cpu_to_node[]. */
@@ -1958,6 +1959,12 @@ void __init noreturn __start_xen(unsigned long mbi_p)
              /* Set up node_to_cpumask based on cpu_to_node[]. */
              numa_add_cpu(i);
  +            if ( stack_base[i] == NULL )
+                stack_base[i] = cpu_alloc_stack(i);

I don't quite understand this change at least in the context of this patch. AFAICT the stack will be currently allocated in cpu_smpboot_callback() which is called while the CPU is prepared. So you should not need this allocation right now.

Looking at the rest of the series, it seems you allocate the stack earlier so you start the CPU bring-up earlier. But they will still be held in assembly code until cpu_up() is called.

So effectively, part of the C part of the CPUs bring-up is still serialized. Did I understand the logic correctly?

If so, I would suggest to clarify it in the series because this wasn't obvious to me (I was expecting start_secondary() would also run in parallell).

start_secondary() is started in parallel, CPUs are not held in assembly. Calling C (almost) always requires stack, and most of this series comes to making stack available for all APs at once, just so APs can loop early in start_secondary().

You are correct, most of C part is serialized. I tried to make it parallel as well but quickly gave up. Some of the notifiers callbacks are resistant against any attempts at parallelization, and this set of patches already gave satisfactory improvements in boot time (and was enough to go through peculiar SMP bring-up used by Intel TXT dynamic launch, which is the reason why I had to do it in the first place).


Regarding the change in setup.c, I think it would make more sense in patch #9.

Cheers,

Best regards,

--
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com


Reply via email to