Re: [PATCH v3 1/2] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-10-04 Thread Alejandro Vallejo
Hi,

On Thu Oct 3, 2024 at 8:38 PM BST, Andrew Cooper wrote:
> On 13/08/2024 3:21 pm, Alejandro Vallejo wrote:
> > @@ -299,44 +299,14 @@ void save_fpu_enable(void)
> >  /* Initialize FPU's context save area */
> >  int vcpu_init_fpu(struct vcpu *v)
> >  {
> > -int rc;
> > -
> >  v->arch.fully_eager_fpu = opt_eager_fpu;
> > -
> > -if ( (rc = xstate_alloc_save_area(v)) != 0 )
> > -return rc;
> > -
> > -if ( v->arch.xsave_area )
> > -v->arch.fpu_ctxt = &v->arch.xsave_area->fpu_sse;
> > -else
> > -{
> > -BUILD_BUG_ON(__alignof(v->arch.xsave_area->fpu_sse) < 16);
> > -v->arch.fpu_ctxt = _xzalloc(sizeof(v->arch.xsave_area->fpu_sse),
> > -__alignof(v->arch.xsave_area->fpu_sse));
> > -if ( v->arch.fpu_ctxt )
> > -{
> > -fpusse_t *fpu_sse = v->arch.fpu_ctxt;
> > -
> > -fpu_sse->fcw = FCW_DEFAULT;
> > -fpu_sse->mxcsr = MXCSR_DEFAULT;
> > -}
> > -else
> > -rc = -ENOMEM;
>
> This looks wonky.  It's not, because xstate_alloc_save_area() contains
> the same logic for setting up FCW/MXCSR.
>
> It would be helpful to note this in the commit message.  Something about
> deduplicating the setup alongside deduplicating the pointer.
>

Sure

> > diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
> > index bca3258d69ac..3da60af2a44a 100644
> > --- a/xen/arch/x86/include/asm/domain.h
> > +++ b/xen/arch/x86/include/asm/domain.h
> > @@ -592,11 +592,11 @@ struct pv_vcpu
> >  struct arch_vcpu
> >  {
> >  /*
> > - * guest context (mirroring struct vcpu_guest_context) common
> > - * between pv and hvm guests
> > + * Guest context common between PV and HVM guests. Includes general purpose
> > + * registers, segment registers and other parts of the exception frame.
> > + *
> > + * It doesn't contain FPU state, as that lives in xsave_area instead.
> >   */
>
> This new comment isn't really correct either.  arch_vcpu contains the
> PV/HVM union, so it's not only things which are common between the two.

It's about cpu_user_regs though, not arch_vcpu?

>
> I'd either leave it alone, or delete it entirely.  It doesn't serve much
> purpose IMO, and it is going to bitrot very quickly (FRED alone will
> change two of the state groups you mention).
>

I'm happy getting rid of it because it's actively confusing in its current
form. That said, I can't possibly believe there's not a single simple
description of cpu_user_regs that everyone can agree on.

> > -
> > -void  *fpu_ctxt;
> >  struct cpu_user_regs user_regs;
> >  
> >  /* Debug registers. */
> > diff --git a/xen/arch/x86/x86_emulate/blk.c b/xen/arch/x86/x86_emulate/blk.c
> > index e790f4f90056..28b54f26fe29 100644
> > --- a/xen/arch/x86/x86_emulate/blk.c
> > +++ b/xen/arch/x86/x86_emulate/blk.c
> > @@ -11,7 +11,8 @@
> >  !defined(X86EMUL_NO_SIMD)
> >  # ifdef __XEN__
> >  #  include 
> > -#  define FXSAVE_AREA current->arch.fpu_ctxt
> > +#  define FXSAVE_AREA ((struct x86_fxsr *) \
> > +   (void *)&current->arch.xsave_area->fpu_sse)
>
> This isn't a like-for-like replacement.
>
> Previously FXSAVE_AREA's type was void *.  I'd leave the expression as just
>
>     (void *)&current->arch.xsave_area->fpu_sse
>
> because struct x86_fxsr is not the only type needing to be used here in
> due course.   (There are 8 variations of data layout for older
> instructions.)
>

Sure

> >  # else
> >  #  define FXSAVE_AREA get_fpu_save_area()
> >  # endif
> > diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
> > index 5c4144d55e89..850ee31bd18c 100644
> > --- a/xen/arch/x86/xstate.c
> > +++ b/xen/arch/x86/xstate.c
> > @@ -507,9 +507,16 @@ int xstate_alloc_save_area(struct vcpu *v)
> >  unsigned int size;
> >  
> >  if ( !cpu_has_xsave )
> > -return 0;
> > -
> > -if ( !is_idle_vcpu(v) || !cpu_has_xsavec )
> > +{
> > +/*
> > + * This is bigger than FXSAVE_SIZE by 64 bytes, but it helps treating
> > + * the FPU state uniformly as an XSAVE buffer even if XSAVE is not
> > + * available in the host. Note the alignment restrictions of the XSAVE
> > + * area are stricter than those of the FXSAVE area.
> > + */
>
> Can I suggest the following?
>
> "On non-XSAVE systems, we allocate an XSTATE buffer for simplicity. 
> XSTATE is backwards compatible to FXSAVE, and only one cacheline larger."
>
> It's rather more concise.
>
> ~Andrew

Sure.
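
For the record, the size relationship both wordings refer to: the XSAVE area
begins with the 512-byte FXSAVE legacy region, followed by the 64-byte XSAVE
header, hence "one cacheline larger". In Xen's constants (from memory, so
double-check the header):

    #define FXSAVE_SIZE           512
    #define XSAVE_HDR_SIZE         64
    /* Minimum XSAVE area: 512 + 64 = 576 bytes */
    #define XSTATE_AREA_MIN_SIZE  (FXSAVE_SIZE + XSAVE_HDR_SIZE)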

Cheers,
Alejandro



Re: [PATCH v3 2/2] x86/fpu: Split fpu_setup_fpu() in three

2024-10-03 Thread Alejandro Vallejo
Hi,

On Tue Aug 13, 2024 at 5:33 PM BST, Alejandro Vallejo wrote:
> On Tue Aug 13, 2024 at 3:32 PM BST, Jan Beulich wrote:
> > On 13.08.2024 16:21, Alejandro Vallejo wrote:
> > > It was trying to do too many things at once and there was no clear way of
> > > defining what it was meant to do. This commit splits the function in 
> > > three.
> > > 
> > >   1. A function to return the FPU to power-on reset values.
> > >   2. A function to return the FPU to default values.
> > >   3. An x87/SSE state loader (equivalent to the old function when it took a data pointer).
> > > 
> > > While at it, make sure the abridged tag is consistent with the manuals and
> > > starts as 0xFF.
> > > 
> > > Signed-off-by: Alejandro Vallejo 
> >
> > Reviewed-by: Jan Beulich 
> >
> > > ---
> > > v3:
> > >   * Adjust commit message, as the split is now in 3.
> > >   * Remove bulky comment, as the rationale for it turned out to be
> > > unsubstantiated. I can't find proof in xen-devel of the stream
> > > operating the way I claimed, and at that point having the comment
> > > at all is pointless
> >
> > So you deliberately removed the comment altogether, not just point 3 of it?
> >
> > Jan
>
> Yes. The other two cases can be deduced pretty trivially from the conditional,
> I reckon. I commented them more heavily in order to properly introduce (3),
> but seeing how it was all a midsummer dream I might as well reduce clutter.
>
> I got as far as the original implementation of XSAVE in Xen and it seems to
> have been tested against many combinations of src and dst, none of which was
> that fictitious "xsave enabled + xsave context missing". I suspect the
> xsave_enabled(v) was merely avoiding writing to the XSAVE buffer just for
> efficiency (however minor an effect it might have had). I was just reverse
> engineering it wrong.
>
> Which reminds me. Thanks for mentioning that, because it was really just
> guesswork on my part.
>
> Cheers,
> Alejandro

Playing around with the FPU I noticed this patch wasn't committed. Did it fall
through the cracks, or is there a specific reason?

Cheers,
Alejandro



[PATCH v6 09/11] xen/x86: Derive topologically correct x2APIC IDs from the policy

2024-10-01 Thread Alejandro Vallejo
Implements the helper for mapping vcpu_id to x2apic_id given a valid
topology in a policy. The algorithm is written with the intention of
extending it to leaf 0x1f and extended leaf 0x26 in the future.

Toolstack doesn't set leaf 0xb and the HVM default policy has it
cleared, so the leaf is not implemented. In that case, the new helper
just returns the legacy mapping.

Signed-off-by: Alejandro Vallejo 
---
 tools/tests/cpu-policy/test-cpu-policy.c | 68 +
 xen/include/xen/lib/x86/cpu-policy.h | 11 
 xen/lib/x86/policy.c | 76 
 3 files changed, 155 insertions(+)
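
For orientation, the mapping the new helper implements (as exercised by the
tests below) can be sketched as follows. This is an illustrative
reconstruction, not the patch's code; the helper names are hypothetical, and
the shifts are assumed to be the CPUID leaf 0xb id_shift values, i.e.
ceil(log2(n)) of each level's size:

    /* ceil(log2(n)): smallest shift wide enough to hold n IDs */
    static unsigned int order(unsigned int n)
    {
        unsigned int o = 0;

        while ( (1u << o) < n )
            o++;
        return o;
    }

    static uint32_t x2apic_id_sketch(uint32_t vcpu_id,
                                     uint32_t threads_per_core,
                                     uint32_t cores_per_pkg)
    {
        uint32_t thread_shift = order(threads_per_core);
        uint32_t core_shift = thread_shift + order(cores_per_pkg);

        return (vcpu_id % threads_per_core) |
               (((vcpu_id / threads_per_core) % cores_per_pkg)
                << thread_shift) |
               ((vcpu_id / (threads_per_core * cores_per_pkg))
                << core_shift);
    }

e.g. vCPU 35 with 3 threads/core and 8 cores/pkg yields
(35 % 3) | (((35 / 3) % 8) << 2) | ((35 / 24) << 5), matching the test
vector below.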

diff --git a/tools/tests/cpu-policy/test-cpu-policy.c b/tools/tests/cpu-policy/test-cpu-policy.c
index 870f7ecee0e5..ae7fc46a47d2 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -785,6 +785,73 @@ static void test_topo_from_parts(void)
 }
 }
 
+static void test_x2apic_id_from_vcpu_id_success(void)
+{
+static const struct test {
+unsigned int vcpu_id;
+unsigned int threads_per_core;
+unsigned int cores_per_pkg;
+uint32_t x2apic_id;
+uint8_t x86_vendor;
+} tests[] = {
+{
+.vcpu_id = 3, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 1 << 2,
+},
+{
+.vcpu_id = 6, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 2 << 2,
+},
+{
+.vcpu_id = 24, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 1 << 5,
+},
+{
+.vcpu_id = 35, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = (35 % 3) | (((35 / 3) % 8) << 2) | ((35 / 24) << 5),
+},
+{
+.vcpu_id = 96, .threads_per_core = 7, .cores_per_pkg = 3,
+.x2apic_id = (96 % 7) | (((96 / 7) % 3) << 3) | ((96 / 21) << 5),
+},
+};
+
+const uint8_t vendors[] = {
+X86_VENDOR_INTEL,
+X86_VENDOR_AMD,
+X86_VENDOR_CENTAUR,
+X86_VENDOR_SHANGHAI,
+X86_VENDOR_HYGON,
+};
+
+printf("Testing x2apic id from vcpu id success:\n");
+
+/* Perform the test run on every vendor we know about */
+for ( size_t i = 0; i < ARRAY_SIZE(vendors); ++i )
+{
+for ( size_t j = 0; j < ARRAY_SIZE(tests); ++j )
+{
+struct cpu_policy policy = { .x86_vendor = vendors[i] };
+const struct test *t = &tests[j];
+uint32_t x2apic_id;
+int rc = x86_topo_from_parts(&policy, t->threads_per_core,
+ t->cores_per_pkg);
+
+if ( rc ) {
+fail("FAIL[%d] - 'x86_topo_from_parts() failed", rc);
+continue;
+}
+
+x2apic_id = x86_x2apic_id_from_vcpu_id(&policy, t->vcpu_id);
+if ( x2apic_id != t->x2apic_id )
+fail("FAIL - '%s cpu%u %u t/c %u c/p'. bad x2apic_id: 
expected=%u actual=%u\n",
+ x86_cpuid_vendor_to_str(policy.x86_vendor),
+ t->vcpu_id, t->threads_per_core, t->cores_per_pkg,
+ t->x2apic_id, x2apic_id);
+}
+}
+}
+
 int main(int argc, char **argv)
 {
 printf("CPU Policy unit tests\n");
@@ -803,6 +870,7 @@ int main(int argc, char **argv)
 test_is_compatible_failure();
 
 test_topo_from_parts();
+test_x2apic_id_from_vcpu_id_success();
 
 if ( nr_failures )
 printf("Done: %u failures\n", nr_failures);
diff --git a/xen/include/xen/lib/x86/cpu-policy.h b/xen/include/xen/lib/x86/cpu-policy.h
index 116b305a1d7f..6fe19490d290 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -542,6 +542,17 @@ int x86_cpu_policies_are_compatible(const struct cpu_policy *host,
 const struct cpu_policy *guest,
 struct cpu_policy_errors *err);
 
+/**
+ * Calculates the x2APIC ID of a vCPU given a CPU policy
+ *
+ * If the policy lacks leaf 0xb, it falls back to the legacy mapping apic_id=cpu*2.
+ *
+ * @param p  CPU policy of the domain.
+ * @param id vCPU ID of the vCPU.
+ * @returns x2APIC ID of the vCPU.
+ */
+uint32_t x86_x2apic_id_from_vcpu_id(const struct cpu_policy *p, uint32_t id);
+
 /**
  * Synthesise topology information in `p` given high-level constraints
  *
diff --git a/xen/lib/x86/policy.c b/xen/lib/x86/policy.c
index 16b09a427841..6dd9a2900ad7 100644
--- a/xen/lib/x86/policy.c
+++ b/xen/lib/x86/policy.c
@@ -2,6 +2,82 @@
 
 #include 
 
+static uint32_t parts_per_higher_scoped_level(const struct cpu_policy *p,
+  size_t lvl)
+{
+/*
+ * `nr_logical` reported by Intel is the number of THREADS co

[PATCH v6 05/11] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-10-01 Thread Alejandro Vallejo
Make it so the APs expose their own APIC IDs in a LUT. We can use that
LUT to populate the MADT, decoupling the algorithm that relates CPU IDs
and APIC IDs from hvmloader.

While at it, also remove ap_callin, as writing the APIC ID may serve
the same purpose.

Signed-off-by: Alejandro Vallejo 
---
 tools/firmware/hvmloader/config.h   |  5 ++-
 tools/firmware/hvmloader/hvmloader.c|  6 +--
 tools/firmware/hvmloader/mp_tables.c|  4 +-
 tools/firmware/hvmloader/smp.c  | 54 -
 tools/firmware/hvmloader/util.c |  2 +-
 tools/include/xen-tools/common-macros.h |  5 +++
 6 files changed, 60 insertions(+), 16 deletions(-)
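
The underlying handshake (the BSP-side wait loop is truncated in the hunk
below) is, roughly: each AP publishes its own APIC ID into the LUT, and the
BSP spins until the slot turns non-zero. A minimal sketch of the idea, with
the BSP side reconstructed by assumption (cpu_relax() stands in for whatever
pause primitive hvmloader actually uses):

    /* AP side: publish own APIC ID; doubles as the "called in" signal. */
    ACCESS_ONCE(CPU_TO_X2APICID[cpu]) = read_apic_id();

    /* BSP side (assumed): wait until the AP has come online. */
    while ( !ACCESS_ONCE(CPU_TO_X2APICID[cpu]) )
        cpu_relax();

This works because no AP can legitimately have APIC ID 0; that ID belongs to
the BSP, which never goes through this path.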

diff --git a/tools/firmware/hvmloader/config.h b/tools/firmware/hvmloader/config.h
index cd716bf39245..17666a1fdb01 100644
--- a/tools/firmware/hvmloader/config.h
+++ b/tools/firmware/hvmloader/config.h
@@ -4,6 +4,8 @@
 #include 
 #include 
 
+#include 
+
 enum virtual_vga { VGA_none, VGA_std, VGA_cirrus, VGA_pt };
 extern enum virtual_vga virtual_vga;
 
@@ -48,8 +50,9 @@ extern uint8_t ioapic_version;
 
 #define IOAPIC_ID   0x01
 
+extern uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
+
 #define LAPIC_BASE_ADDRESS  0xfee0
-#define LAPIC_ID(vcpu_id)   ((vcpu_id) * 2)
 
 #define PCI_ISA_DEVFN   0x08/* dev 1, fn 0 */
 #define PCI_ISA_IRQ_MASK0x0c20U /* ISA IRQs 5,10,11 are PCI connected */
diff --git a/tools/firmware/hvmloader/hvmloader.c b/tools/firmware/hvmloader/hvmloader.c
index f8af88fabf24..0ff190ff4ec0 100644
--- a/tools/firmware/hvmloader/hvmloader.c
+++ b/tools/firmware/hvmloader/hvmloader.c
@@ -224,7 +224,7 @@ static void apic_setup(void)
 
 /* 8259A ExtInts are delivered through IOAPIC pin 0 (Virtual Wire Mode). */
 ioapic_write(0x10, APIC_DM_EXTINT);
-ioapic_write(0x11, SET_APIC_ID(LAPIC_ID(0)));
+ioapic_write(0x11, SET_APIC_ID(CPU_TO_X2APICID[0]));
 }
 
 struct bios_info {
@@ -341,11 +341,11 @@ int main(void)
 
 printf("CPU speed is %u MHz\n", get_cpu_mhz());
 
+smp_initialise();
+
 apic_setup();
 pci_setup();
 
-smp_initialise();
-
 perform_tests();
 
 if ( bios->bios_info_setup )
diff --git a/tools/firmware/hvmloader/mp_tables.c b/tools/firmware/hvmloader/mp_tables.c
index 77d3010406d0..494f5bb3d813 100644
--- a/tools/firmware/hvmloader/mp_tables.c
+++ b/tools/firmware/hvmloader/mp_tables.c
@@ -198,8 +198,10 @@ static void fill_mp_config_table(struct mp_config_table *mpct, int length)
 /* fills in an MP processor entry for VCPU 'vcpu_id' */
 static void fill_mp_proc_entry(struct mp_proc_entry *mppe, int vcpu_id)
 {
+ASSERT(CPU_TO_X2APICID[vcpu_id] < 0xFF);
+
 mppe->type = ENTRY_TYPE_PROCESSOR;
-mppe->lapic_id = LAPIC_ID(vcpu_id);
+mppe->lapic_id = CPU_TO_X2APICID[vcpu_id];
 mppe->lapic_version = 0x11;
 mppe->cpu_flags = CPU_FLAG_ENABLED;
 if ( vcpu_id == 0 )
diff --git a/tools/firmware/hvmloader/smp.c b/tools/firmware/hvmloader/smp.c
index 1b940cefd071..b0d4da111904 100644
--- a/tools/firmware/hvmloader/smp.c
+++ b/tools/firmware/hvmloader/smp.c
@@ -29,7 +29,34 @@
 
 #include 
 
-static int ap_callin;
+/**
+ * Lookup table of x2APIC IDs.
+ *
+ * Each entry is populated by its respective CPU as it comes online. This is
+ * required for generating the MADT with minimal assumptions about ID
+ * relationships.
+ */
+uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
+
+/** Tristate about x2apic being supported. -1=unknown */
+static int has_x2apic = -1;
+
+static uint32_t read_apic_id(void)
+{
+uint32_t apic_id;
+
+if ( has_x2apic )
+cpuid(0xb, NULL, NULL, NULL, &apic_id);
+else
+{
+cpuid(1, NULL, &apic_id, NULL, NULL);
+apic_id >>= 24;
+}
+
+/* Never called by cpu0, so should never return 0 */
+ASSERT(apic_id);
+
+return apic_id;
+}
 
 static void cpu_setup(unsigned int cpu)
 {
@@ -37,13 +64,17 @@ static void cpu_setup(unsigned int cpu)
 cacheattr_init();
 printf("done.\n");
 
-if ( !cpu ) /* Used on the BSP too */
+/* The BSP exits early because its APIC ID is known to be zero */
+if ( !cpu )
 return;
 
 wmb();
-ap_callin = 1;
+ACCESS_ONCE(CPU_TO_X2APICID[cpu]) = read_apic_id();
 
-/* After this point, the BSP will shut us down. */
+/*
+ * After this point the BSP will shut us down. A write to
+ * CPU_TO_X2APICID[cpu] signals the BSP to bring down `cpu`.
+ */
 
 for ( ;; )
 asm volatile ( "hlt" );
@@ -54,10 +85,6 @@ static void boot_cpu(unsigned int cpu)
 static uint8_t ap_stack[PAGE_SIZE] __attribute__ ((aligned (16)));
 static struct vcpu_hvm_context ap;
 
-/* Initialise shared variables. */
-ap_callin = 0;
-wmb();
-
 /* Wake up the secondary processor */
 ap = (struct vcpu_hvm_context) {
 .mode = VCPU_HVM_MODE_32B,
@@ -90,10 +117,11 @@ static void boot_cpu(unsigned int cpu)
 BUG();
 
 /*
- * Wait f

[PATCH v6 06/11] tools/libacpi: Use LUT of APIC IDs rather than function pointer

2024-10-01 Thread Alejandro Vallejo
Refactors libacpi so that a single LUT is the authoritative source of
truth for the CPU to APIC ID mappings. This has a knock-on effect of
reducing complexity in future patches, as the same LUT can be used for
configuring the APICs and configuring the ACPI tables for PVH.

No functional change intended, because the same mappings are preserved.

Signed-off-by: Alejandro Vallejo 
---
 tools/firmware/hvmloader/util.c   | 7 +--
 tools/include/xenguest.h  | 5 +
 tools/libacpi/build.c | 6 +++---
 tools/libacpi/libacpi.h   | 2 +-
 tools/libs/light/libxl_dom.c  | 5 +
 tools/libs/light/libxl_x86_acpi.c | 7 +--
 6 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 7e1e105d79dd..4a6303bbae8c 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -825,11 +825,6 @@ static void acpi_mem_free(struct acpi_ctxt *ctxt,
 /* ACPI builder currently doesn't free memory so this is just a stub */
 }
 
-static uint32_t acpi_lapic_id(unsigned cpu)
-{
-return CPU_TO_X2APICID[cpu];
-}
-
 void hvmloader_acpi_build_tables(struct acpi_config *config,
  unsigned int physical)
 {
@@ -859,7 +854,7 @@ void hvmloader_acpi_build_tables(struct acpi_config *config,
 }
 
 config->lapic_base_address = LAPIC_BASE_ADDRESS;
-config->lapic_id = acpi_lapic_id;
+config->cpu_to_apicid = CPU_TO_X2APICID;
 config->ioapic_base_address = IOAPIC_BASE_ADDRESS;
 config->ioapic_id = IOAPIC_ID;
 config->pci_isa_irq_mask = PCI_ISA_IRQ_MASK; 
diff --git a/tools/include/xenguest.h b/tools/include/xenguest.h
index e01f494b772a..aa50b78dfb89 100644
--- a/tools/include/xenguest.h
+++ b/tools/include/xenguest.h
@@ -22,6 +22,8 @@
 #ifndef XENGUEST_H
 #define XENGUEST_H
 
+#include "xen/hvm/hvm_info_table.h"
+
 #define XC_NUMA_NO_NODE   (~0U)
 
 #define XCFLAGS_LIVE  (1 << 0)
@@ -236,6 +238,9 @@ struct xc_dom_image {
 #if defined(__i386__) || defined(__x86_64__)
 struct e820entry *e820;
 unsigned int e820_entries;
+
+/* LUT mapping cpu id to (x2)APIC ID */
+uint32_t cpu_to_apicid[HVM_MAX_VCPUS];
 #endif
 
 xen_pfn_t vuart_gfn;
diff --git a/tools/libacpi/build.c b/tools/libacpi/build.c
index 2f29863db154..2ad1d461a2ec 100644
--- a/tools/libacpi/build.c
+++ b/tools/libacpi/build.c
@@ -74,7 +74,7 @@ static struct acpi_20_madt *construct_madt(struct acpi_ctxt *ctxt,
 const struct hvm_info_table   *hvminfo = config->hvminfo;
 int i, sz;
 
-if ( config->lapic_id == NULL )
+if ( config->cpu_to_apicid == NULL )
 return NULL;
 
 sz  = sizeof(struct acpi_20_madt);
@@ -148,7 +148,7 @@ static struct acpi_20_madt *construct_madt(struct acpi_ctxt *ctxt,
 lapic->length  = sizeof(*lapic);
 /* Processor ID must match processor-object IDs in the DSDT. */
 lapic->acpi_processor_id = i;
-lapic->apic_id = config->lapic_id(i);
+lapic->apic_id = config->cpu_to_apicid[i];
 lapic->flags = (test_bit(i, hvminfo->vcpu_online)
 ? ACPI_LOCAL_APIC_ENABLED : 0);
 lapic++;
@@ -236,7 +236,7 @@ static struct acpi_20_srat *construct_srat(struct acpi_ctxt *ctxt,
 processor->type = ACPI_PROCESSOR_AFFINITY;
 processor->length   = sizeof(*processor);
 processor->domain   = config->numa.vcpu_to_vnode[i];
-processor->apic_id  = config->lapic_id(i);
+processor->apic_id  = config->cpu_to_apicid[i];
 processor->flags= ACPI_LOCAL_APIC_AFFIN_ENABLED;
 processor++;
 }
diff --git a/tools/libacpi/libacpi.h b/tools/libacpi/libacpi.h
index deda39e5dbc4..8b010212448c 100644
--- a/tools/libacpi/libacpi.h
+++ b/tools/libacpi/libacpi.h
@@ -84,7 +84,7 @@ struct acpi_config {
 unsigned long rsdp;
 
 /* x86-specific parameters */
-uint32_t (*lapic_id)(unsigned cpu);
+uint32_t *cpu_to_apicid; /* LUT mapping cpu id to (x2)APIC ID */
 uint32_t lapic_base_address;
 uint32_t ioapic_base_address;
 uint16_t pci_isa_irq_mask;
diff --git a/tools/libs/light/libxl_dom.c b/tools/libs/light/libxl_dom.c
index 94fef374014e..7f9c6eaa8b24 100644
--- a/tools/libs/light/libxl_dom.c
+++ b/tools/libs/light/libxl_dom.c
@@ -1082,6 +1082,11 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
 
 dom->container_type = XC_DOM_HVM_CONTAINER;
 
+#if defined(__i386__) || defined(__x86_64__)
+for ( uint32_t i = 0; i < info->max_vcpus; i++ )
+dom->cpu_to_apicid[i] = 2 * i; /* TODO: Replace by topo calculation */
+#endif
+
 /* The params from the configuration file are in Mb, which are then
  * multiplied by 1 Kb. This was then divided off when calling
  * the old xc_hvm_build_target_mem() which then turned them to bytes.
diff --git a/tools/libs/light/libxl

[PATCH v6 08/11] xen/lib: Add topology generator for x86

2024-10-01 Thread Alejandro Vallejo
Add a helper to populate topology leaves in the cpu policy from
threads/core and cores/package counts. It's unit-tested in
test-cpu-policy.c, but it's not connected to the rest of the code yet.

Adds the ASSERT() macro to xen/lib/x86/private.h, as it was missing.

Signed-off-by: Alejandro Vallejo 
---
 tools/tests/cpu-policy/test-cpu-policy.c | 133 +++
 xen/include/xen/lib/x86/cpu-policy.h |  16 +++
 xen/lib/x86/policy.c |  88 +++
 xen/lib/x86/private.h|   4 +
 4 files changed, 241 insertions(+)
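
As a rough sketch of what the generator computes (semantics inferred from the
test cases below, not taken from the patch itself): subleaf 0 describes
threads with id_shift = ceil(log2(threads_per_core)); subleaf 1 describes
cores with an id_shift covering both levels; nr_logical counts threads per
core at level 0, and either cores per package (AMD) or all threads in the
package (Intel) at level 1. In code, with hypothetical names, and order()
being the ceil(log2(n)) helper sketched earlier in this digest:

    /* Illustrative only; field and helper names are hypothetical. */
    struct leaf_0xb { uint32_t nr_logical, id_shift; };

    static void topo_sketch(struct leaf_0xb topo[2], bool amd,
                            uint32_t threads_per_core,
                            uint32_t cores_per_pkg)
    {
        topo[0].nr_logical = threads_per_core;
        topo[0].id_shift = order(threads_per_core);

        /*
         * Intel's nr_logical at the core level counts every logical CPU
         * in the package; AMD's counts cores.
         */
        topo[1].nr_logical = amd ? cores_per_pkg
                                 : threads_per_core * cores_per_pkg;
        topo[1].id_shift = topo[0].id_shift + order(cores_per_pkg);
    }

e.g. 7 threads/core and 5 cores/pkg gives id_shifts of 3 and 6, with
nr_logical = {7, 5} on AMD and {7, 35} on Intel, as in the tests below.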

diff --git a/tools/tests/cpu-policy/test-cpu-policy.c b/tools/tests/cpu-policy/test-cpu-policy.c
index 9216010b1c5d..870f7ecee0e5 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -654,6 +654,137 @@ static void test_is_compatible_failure(void)
 }
 }
 
+static void test_topo_from_parts(void)
+{
+static const struct test {
+unsigned int threads_per_core;
+unsigned int cores_per_pkg;
+struct cpu_policy policy;
+} tests[] = {
+{
+.threads_per_core = 3, .cores_per_pkg = 1,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 3, .level = 0, .type = 1, .id_shift = 2, },
+{ .nr_logical = 1, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 1, .cores_per_pkg = 3,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 1, .level = 0, .type = 1, .id_shift = 0, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 7, .cores_per_pkg = 5,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 7, .level = 0, .type = 1, .id_shift = 3, },
+{ .nr_logical = 5, .level = 1, .type = 2, .id_shift = 6, },
+},
+},
+},
+{
+.threads_per_core = 2, .cores_per_pkg = 128,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 2, .level = 0, .type = 1, .id_shift = 1, },
+{ .nr_logical = 128, .level = 1, .type = 2,
+  .id_shift = 8, },
+},
+},
+},
+{
+.threads_per_core = 3, .cores_per_pkg = 1,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 3, .level = 0, .type = 1, .id_shift = 2, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 1, .cores_per_pkg = 3,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 1, .level = 0, .type = 1, .id_shift = 0, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 7, .cores_per_pkg = 5,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 7, .level = 0, .type = 1, .id_shift = 3, },
+{ .nr_logical = 35, .level = 1, .type = 2, .id_shift = 6, },
+},
+},
+},
+{
+.threads_per_core = 2, .cores_per_pkg = 128,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 2, .level = 0, .type = 1, .id_shift = 1, },
+{ .nr_logical = 256, .level = 1, .type = 2,
+  .id_shift = 8, },
+},
+},
+},
+};
+
+printf("Testing topology synthesis from parts:\n");
+
+for ( size_t i = 0; i < ARRAY_SIZE(tests); ++i )
+{
+const struct test *t = &tests[i];
+struct cpu_policy actual = { .x86_vendor = t->policy.x86_vendor };
+int rc = x86_topo_from_parts(&actual, t->threads_per_core,
+ t->cores_per_pkg);
+
+if ( rc || memcmp(&actual.topo, &t->policy.topo, sizeof(actual.topo)) )
+{
+#define TOPO(n, f)  t->policy.topo.subleaf[(n)].f, actual.topo.subleaf[(n)].f
+fail("FAIL[%d] - '%s %u t/c, %u c/p'\n",
+ rc,
+ x86_cpuid_vendor_to_str(t->policy.x86_vendor),

[PATCH v6 04/11] xen/x86: Add supporting code for uploading LAPIC contexts during domain create

2024-10-01 Thread Alejandro Vallejo
If toolstack were to upload LAPIC contexts as part of domain creation it
would encounter a problem where the architectural state does not reflect
the APIC ID in the hidden state. This patch ensures updates to the
hidden state trigger an update in the architectural registers so the
APIC ID in both is consistent.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/hvm/vlapic.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 02570f9dd63a..a8183c3023da 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1640,7 +1640,27 @@ static int cf_check lapic_load_hidden(struct domain *d, hvm_domain_context_t *h)
 
 s->loaded.hw = 1;
 if ( s->loaded.regs )
+{
+/*
+ * We already processed architectural regs in lapic_load_regs(), so
+ * this must be a migration. Fix up inconsistencies from any older Xen.
+ */
 lapic_load_fixup(s);
+}
+else
+{
+/*
+ * We haven't seen architectural regs so this could be a migration or a
+ * plain domain create. In the domain create case it's fine to modify
+ * the architectural state to align it to the APIC ID that was just
+ * uploaded and in the migrate case it doesn't matter because the
+ * architectural state will be replaced by the LAPIC_REGS ctx later on.
+ */
+if ( vlapic_x2apic_mode(s) )
+set_x2apic_id(s);
+else
+vlapic_set_reg(s, APIC_ID, SET_xAPIC_ID(s->hw.x2apic_id));
+}
 
 hvm_update_vlapic_mode(v);
 
-- 
2.46.0




[PATCH v6 00/11] x86: Expose consistent topology to guests

2024-10-01 Thread Alejandro Vallejo
 x2APIC ID of a given CPU and assuming
the guest will assume 2 threads per core.

This series involves bringing x2APIC IDs into the migration stream, so
older guests keep operating as they used to and enhancing Xen+toolstack so
new guests get topology information consistent with their x2APIC IDs. As a
side effect of this, x2APIC IDs are now packed and don't have (unless under
a pathological case) gaps.

Further work ought to allow combining these topology configurations with
gang-scheduling of guest hyperthreads onto affine physical hyperthreads.
For the time being it purposefully keeps the configuration of "1 socket" +
"1 thread per core" + "1 core per vCPU".

Patch 1: Includes x2APIC IDs in the migration stream. This allows Xen to
 reconstruct the right x2APIC IDs on migrated-in guests, and
 future-proofs itself in the face of x2APIC ID derivation changes.
Patch 2: Minor refactor to expose xc_cpu_policy in libxl
Patch 3: Refactors xen/lib/x86 to work on non-Xen freestanding environments
 (e.g: hvmloader)
Patch 4: Remove old assumptions about vcpu_id<->apic_id relationship in 
hvmloader
Patch 5: Add logic to derive x2APIC IDs given a CPU policy and vCPU IDs
Patch 6: Includes a simple topology generator for toolstack so new guests
 have topologically consistent information in CPUID

===
v6:

Alejandro Vallejo (11):
  lib/x86: Relax checks about policy compatibility
  x86/vlapic: Move lapic migration checks to the check hooks
  xen/x86: Add initial x2APIC ID to the per-vLAPIC save area
  xen/x86: Add supporting code for uploading LAPIC contexts during
domain create
  tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves
  tools/libacpi: Use LUT of APIC IDs rather than function pointer
  tools/libguest: Always set vCPU context in vcpu_hvm()
  xen/lib: Add topology generator for x86
  xen/x86: Derive topologically correct x2APIC IDs from the policy
  tools/libguest: Set distinct x2APIC IDs for each vCPU
  xen/x86: Synthesise domain topologies

 tools/firmware/hvmloader/config.h|   5 +-
 tools/firmware/hvmloader/hvmloader.c |   6 +-
 tools/firmware/hvmloader/mp_tables.c |   4 +-
 tools/firmware/hvmloader/smp.c   |  54 --
 tools/firmware/hvmloader/util.c  |   7 +-
 tools/include/xen-tools/common-macros.h  |   5 +
 tools/include/xenguest.h |   8 +
 tools/libacpi/build.c|   6 +-
 tools/libacpi/libacpi.h  |   2 +-
 tools/libs/guest/xg_cpuid_x86.c  |  29 +++-
 tools/libs/guest/xg_dom_x86.c|  93 ++
 tools/libs/light/libxl_dom.c |  25 +++
 tools/libs/light/libxl_x86_acpi.c|   7 +-
 tools/tests/cpu-policy/test-cpu-policy.c | 207 ++-
 xen/arch/x86/cpu-policy.c|   9 +-
 xen/arch/x86/cpuid.c |  14 +-
 xen/arch/x86/hvm/vlapic.c| 126 ++
 xen/arch/x86/include/asm/hvm/vlapic.h|   1 +
 xen/include/public/arch-x86/hvm/save.h   |   2 +
 xen/include/xen/lib/x86/cpu-policy.h |  27 +++
 xen/lib/x86/policy.c | 175 ++-
 xen/lib/x86/private.h|   4 +
 22 files changed, 704 insertions(+), 112 deletions(-)

-- 
2.46.0




[PATCH v6 07/11] tools/libguest: Always set vCPU context in vcpu_hvm()

2024-10-01 Thread Alejandro Vallejo
This is currently used by PVH to set MTRRs, and will be used by a later
patch to set APIC state. Unconditionally send the hypercall, and gate
overriding the MTRR values so the result remains functionally equivalent.

While at it, add a missing "goto out" to what was the error condition
in the loop.

In principle this patch shouldn't affect functionality. An extra record
(the MTRR) is sent to the hypervisor per vCPU on HVM, but these records
are identical to those retrieved in the first place so there's no
expected functional change.

Signed-off-by: Alejandro Vallejo 
---
 tools/libs/guest/xg_dom_x86.c | 84 ++-
 1 file changed, 44 insertions(+), 40 deletions(-)

diff --git a/tools/libs/guest/xg_dom_x86.c b/tools/libs/guest/xg_dom_x86.c
index cba01384ae75..c98229317db7 100644
--- a/tools/libs/guest/xg_dom_x86.c
+++ b/tools/libs/guest/xg_dom_x86.c
@@ -989,6 +989,7 @@ const static void *hvm_get_save_record(const void *ctx, unsigned int type,
 
 static int vcpu_hvm(struct xc_dom_image *dom)
 {
+/* Initialises the BSP */
 struct {
 struct hvm_save_descriptor header_d;
 HVM_SAVE_TYPE(HEADER) header;
@@ -997,6 +998,18 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 struct hvm_save_descriptor end_d;
 HVM_SAVE_TYPE(END) end;
 } bsp_ctx;
+/* Initialises APICs and MTRRs of every vCPU */
+struct {
+struct hvm_save_descriptor header_d;
+HVM_SAVE_TYPE(HEADER) header;
+struct hvm_save_descriptor mtrr_d;
+HVM_SAVE_TYPE(MTRR) mtrr;
+struct hvm_save_descriptor end_d;
+HVM_SAVE_TYPE(END) end;
+} vcpu_ctx;
+/* Context from full_ctx */
+const HVM_SAVE_TYPE(MTRR) *mtrr_record;
+/* Raw context as taken from Xen */
 uint8_t *full_ctx = NULL;
 int rc;
 
@@ -1083,51 +1096,42 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 bsp_ctx.end_d.instance = 0;
 bsp_ctx.end_d.length = HVM_SAVE_LENGTH(END);
 
-/* TODO: maybe this should be a firmware option instead? */
-if ( !dom->device_model )
+/* TODO: maybe setting MTRRs should be a firmware option instead? */
+mtrr_record = hvm_get_save_record(full_ctx, HVM_SAVE_CODE(MTRR), 0);
+
+if ( !mtrr_record )
 {
-struct {
-struct hvm_save_descriptor header_d;
-HVM_SAVE_TYPE(HEADER) header;
-struct hvm_save_descriptor mtrr_d;
-HVM_SAVE_TYPE(MTRR) mtrr;
-struct hvm_save_descriptor end_d;
-HVM_SAVE_TYPE(END) end;
-} mtrr = {
-.header_d = bsp_ctx.header_d,
-.header = bsp_ctx.header,
-.mtrr_d.typecode = HVM_SAVE_CODE(MTRR),
-.mtrr_d.length = HVM_SAVE_LENGTH(MTRR),
-.end_d = bsp_ctx.end_d,
-.end = bsp_ctx.end,
-};
-const HVM_SAVE_TYPE(MTRR) *mtrr_record =
-hvm_get_save_record(full_ctx, HVM_SAVE_CODE(MTRR), 0);
-unsigned int i;
-
-if ( !mtrr_record )
-{
-xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
- "%s: unable to get MTRR save record", __func__);
-goto out;
-}
+xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+ "%s: unable to get MTRR save record", __func__);
+goto out;
+}
 
-memcpy(&mtrr.mtrr, mtrr_record, sizeof(mtrr.mtrr));
+vcpu_ctx.header_d = bsp_ctx.header_d;
+vcpu_ctx.header = bsp_ctx.header;
+vcpu_ctx.mtrr_d.typecode = HVM_SAVE_CODE(MTRR);
+vcpu_ctx.mtrr_d.length = HVM_SAVE_LENGTH(MTRR);
+vcpu_ctx.mtrr = *mtrr_record;
+vcpu_ctx.end_d = bsp_ctx.end_d;
+vcpu_ctx.end = bsp_ctx.end;
 
-/*
- * Enable MTRR, set default type to WB.
- * TODO: add MMIO areas as UC when passthrough is supported.
- */
-mtrr.mtrr.msr_mtrr_def_type = MTRR_TYPE_WRBACK | MTRR_DEF_TYPE_ENABLE;
+/*
+ * Enable MTRR, set default type to WB.
+ * TODO: add MMIO areas as UC when passthrough is supported in PVH
+ */
+if ( !dom->device_model )
+vcpu_ctx.mtrr.msr_mtrr_def_type = MTRR_TYPE_WRBACK | MTRR_DEF_TYPE_ENABLE;
+
+for ( unsigned int i = 0; i < dom->max_vcpus; i++ )
+{
+vcpu_ctx.mtrr_d.instance = i;
 
-for ( i = 0; i < dom->max_vcpus; i++ )
+rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
+  (uint8_t *)&vcpu_ctx, sizeof(vcpu_ctx));
+if ( rc != 0 )
 {
-mtrr.mtrr_d.instance = i;
-rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
-  (uint8_t *)&mtrr, sizeof(mtrr));
-if ( rc != 0 )
-xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
- "%s: SETHVMCONTEXT failed (rc=%d)", __func__, rc);
+xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+ "%s: SETHVMCONTEXT failed (rc=%d)", __func__, rc);
+goto out;
 }
 }
 
-- 
2.46.0




[PATCH v6 03/11] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-10-01 Thread Alejandro Vallejo
This sends the initial x2APIC ID on the migration stream, allowing
further changes to topology and APIC ID assignment without breaking
existing hosts. Given the vlapic data is zero-extended on restore, fix up
migrations from hosts without the field by setting it to the old
convention if zero.

The hardcoded mapping x2apic_id=2*vcpu_id is kept for the time being,
but it's meant to be overridden by toolstack in a later patch with
appropriate values.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/cpuid.c   | 14 +-
 xen/arch/x86/hvm/vlapic.c  | 22 --
 xen/arch/x86/include/asm/hvm/vlapic.h  |  1 +
 xen/include/public/arch-x86/hvm/save.h |  2 ++
 4 files changed, 28 insertions(+), 11 deletions(-)
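
For reference, the x2apic_ldr_from_id() used below follows the architectural
x2APIC logical ID derivation (cluster in the upper 16 bits, one-hot thread
position in the lower 16). A sketch of that relation, to the best of my
recollection of the SDM:

    static uint32_t x2apic_ldr_from_id(uint32_t id)
    {
        /* LDR = (id[31:4] << 16) | (1 << id[3:0]) */
        return ((id & ~0xf) << 12) | (1u << (id & 0xf));
    }

so, e.g., x2APIC ID 0x23 maps to LDR 0x00020008.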

diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 2a777436ee27..dcbdeabadce9 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -138,10 +138,9 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
 const struct cpu_user_regs *regs;
 
 case 0x1:
-/* TODO: Rework topology logic. */
 res->b &= 0x00ffu;
 if ( is_hvm_domain(d) )
-res->b |= (v->vcpu_id * 2) << 24;
+res->b |= vlapic_x2apic_id(vcpu_vlapic(v)) << 24;
 
 /* TODO: Rework vPMU control in terms of toolstack choices. */
 if ( vpmu_available(v) &&
@@ -311,18 +310,15 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
 
 case 0xb:
 /*
- * In principle, this leaf is Intel-only.  In practice, it is tightly
- * coupled with x2apic, and we offer an x2apic-capable APIC emulation
- * to guests on AMD hardware as well.
- *
- * TODO: Rework topology logic.
+ * Don't expose topology information to PV guests. Exposed on HVM
+ * along with x2APIC because they are tightly coupled.
  */
-if ( p->basic.x2apic )
+if ( is_hvm_domain(d) && p->basic.x2apic )
 {
 *(uint8_t *)&res->c = subleaf;
 
 /* Fix the x2APIC identifier. */
-res->d = v->vcpu_id * 2;
+res->d = vlapic_x2apic_id(vcpu_vlapic(v));
 }
 break;
 
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 101902cff889..02570f9dd63a 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1090,7 +1090,7 @@ static uint32_t x2apic_ldr_from_id(uint32_t id)
 static void set_x2apic_id(struct vlapic *vlapic)
 {
 const struct vcpu *v = vlapic_vcpu(vlapic);
-uint32_t apic_id = v->vcpu_id * 2;
+uint32_t apic_id = vlapic->hw.x2apic_id;
 uint32_t apic_ldr = x2apic_ldr_from_id(apic_id);
 
 /*
@@ -1470,7 +1470,7 @@ void vlapic_reset(struct vlapic *vlapic)
 if ( v->vcpu_id == 0 )
 vlapic->hw.apic_base_msr |= APIC_BASE_BSP;
 
-vlapic_set_reg(vlapic, APIC_ID, (v->vcpu_id * 2) << 24);
+vlapic_set_reg(vlapic, APIC_ID, SET_xAPIC_ID(vlapic->hw.x2apic_id));
 vlapic_do_init(vlapic);
 }
 
@@ -1538,6 +1538,16 @@ static void lapic_load_fixup(struct vlapic *vlapic)
 const struct vcpu *v = vlapic_vcpu(vlapic);
 uint32_t good_ldr = x2apic_ldr_from_id(vlapic->loaded.id);
 
+/*
+ * When loading a record without hw.x2apic_id in the save stream, calculate
+ * it using the traditional "vcpu_id * 2" relation. There's an implicit
+ * assumption that vCPU0 always has x2APIC ID 0, which is true for the old
+ * relation, and still holds under the new x2APIC generation algorithm.
+ * While that case goes through the conditional it's benign because it
+ * still maps to zero.
+ */
+if ( !vlapic->hw.x2apic_id )
+vlapic->hw.x2apic_id = v->vcpu_id * 2;
+
 /* Skip fixups on xAPIC mode, or if the x2APIC LDR is already correct */
 if ( !vlapic_x2apic_mode(vlapic) ||
  (vlapic->loaded.ldr == good_ldr) )
@@ -1606,6 +1616,13 @@ static int cf_check lapic_check_hidden(const struct 
domain *d,
  APIC_BASE_EXTD )
 return -EINVAL;
 
+/*
+ * Fail migrations from newer versions of Xen where
+ * rsvd_zero is interpreted as something else.
+ */
+if ( s.rsvd_zero )
+return -EINVAL;
+
 return 0;
 }
 
@@ -1687,6 +1704,7 @@ int vlapic_init(struct vcpu *v)
 }
 
 vlapic->pt.source = PTSRC_lapic;
+vlapic->hw.x2apic_id = 2 * v->vcpu_id;
 
 vlapic->regs_page = alloc_domheap_page(v->domain, MEMF_no_owner);
 if ( !vlapic->regs_page )
diff --git a/xen/arch/x86/include/asm/hvm/vlapic.h b/xen/arch/x86/include/asm/hvm/vlapic.h
index 2c4ff94ae7a8..85c4a236b9f6 100644
--- a/xen/arch/x86/include/asm/hvm/vlapic.h
+++ b/xen/arch/x86/include/asm/hvm/vlapic.h
@@ -44,6 +44,7 @@
 #define vlapic_xapic_mode(vlapic)   \
 (!vlapic_hw_disabled(vlapic) && \
  !((vlapic)->hw.a

[PATCH v6 02/11] x86/vlapic: Move lapic migration checks to the check hooks

2024-10-01 Thread Alejandro Vallejo
While doing this, factor out checks common to architectural and hidden
state.

Signed-off-by: Alejandro Vallejo 
Reviewed-by: Roger Pau Monné 
--
Last reviewed in the topology series v3. Fell through the cracks.

  https://lore.kernel.org/xen-devel/ZlhP11Vvk6c1Ix36@macbook/
---
 xen/arch/x86/hvm/vlapic.c | 84 ++-
 1 file changed, 56 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 992355e511cd..101902cff889 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1571,60 +1571,88 @@ static void lapic_load_fixup(struct vlapic *vlapic)
v, vlapic->loaded.id, vlapic->loaded.ldr, good_ldr);
 }
 
-static int cf_check lapic_load_hidden(struct domain *d, hvm_domain_context_t *h)
-{
-unsigned int vcpuid = hvm_load_instance(h);
-struct vcpu *v;
-struct vlapic *s;
 
+static int lapic_check_common(const struct domain *d, unsigned int vcpuid)
+{
 if ( !has_vlapic(d) )
 return -ENODEV;
 
 /* Which vlapic to load? */
-if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+if ( !domain_vcpu(d, vcpuid) )
 {
-dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no apic%u\n",
+dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no vCPU %u\n",
 d->domain_id, vcpuid);
 return -EINVAL;
 }
-s = vcpu_vlapic(v);
+
+return 0;
+}
+
+static int cf_check lapic_check_hidden(const struct domain *d,
+   hvm_domain_context_t *h)
+{
+unsigned int vcpuid = hvm_load_instance(h);
+struct hvm_hw_lapic s;
+int rc = lapic_check_common(d, vcpuid);
+
+if ( rc )
+return rc;
+
+if ( hvm_load_entry_zeroextend(LAPIC, h, &s) != 0 )
+return -ENODATA;
+
+/* EN=0 with EXTD=1 is illegal */
+if ( (s.apic_base_msr & (APIC_BASE_ENABLE | APIC_BASE_EXTD)) ==
+ APIC_BASE_EXTD )
+return -EINVAL;
+
+return 0;
+}
+
+static int cf_check lapic_load_hidden(struct domain *d, hvm_domain_context_t *h)
+{
+unsigned int vcpuid = hvm_load_instance(h);
+struct vcpu *v = d->vcpu[vcpuid];
+struct vlapic *s = vcpu_vlapic(v);
 
 if ( hvm_load_entry_zeroextend(LAPIC, h, &s->hw) != 0 )
+{
+ASSERT_UNREACHABLE();
 return -EINVAL;
+}
 
 s->loaded.hw = 1;
 if ( s->loaded.regs )
 lapic_load_fixup(s);
 
-if ( !(s->hw.apic_base_msr & APIC_BASE_ENABLE) &&
- unlikely(vlapic_x2apic_mode(s)) )
-return -EINVAL;
-
 hvm_update_vlapic_mode(v);
 
 return 0;
 }
 
-static int cf_check lapic_load_regs(struct domain *d, hvm_domain_context_t *h)
+static int cf_check lapic_check_regs(const struct domain *d,
+ hvm_domain_context_t *h)
 {
 unsigned int vcpuid = hvm_load_instance(h);
-struct vcpu *v;
-struct vlapic *s;
+int rc;
 
-if ( !has_vlapic(d) )
-return -ENODEV;
+if ( (rc = lapic_check_common(d, vcpuid)) )
+return rc;
 
-/* Which vlapic to load? */
-if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
-{
-dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no apic%u\n",
-d->domain_id, vcpuid);
-return -EINVAL;
-}
-s = vcpu_vlapic(v);
+if ( !hvm_get_entry(LAPIC_REGS, h) )
+return -ENODATA;
+
+return 0;
+}
+
+static int cf_check lapic_load_regs(struct domain *d, hvm_domain_context_t *h)
+{
+unsigned int vcpuid = hvm_load_instance(h);
+struct vcpu *v = d->vcpu[vcpuid];
+struct vlapic *s = vcpu_vlapic(v);
 
 if ( hvm_load_entry(LAPIC_REGS, h, s->regs) != 0 )
-return -EINVAL;
+ASSERT_UNREACHABLE();
 
 s->loaded.id = vlapic_get_reg(s, APIC_ID);
 s->loaded.ldr = vlapic_get_reg(s, APIC_LDR);
@@ -1641,9 +1669,9 @@ static int cf_check lapic_load_regs(struct domain *d, hvm_domain_context_t *h)
 return 0;
 }
 
-HVM_REGISTER_SAVE_RESTORE(LAPIC, lapic_save_hidden, NULL,
+HVM_REGISTER_SAVE_RESTORE(LAPIC, lapic_save_hidden, lapic_check_hidden,
   lapic_load_hidden, 1, HVMSR_PER_VCPU);
-HVM_REGISTER_SAVE_RESTORE(LAPIC_REGS, lapic_save_regs, NULL,
+HVM_REGISTER_SAVE_RESTORE(LAPIC_REGS, lapic_save_regs, lapic_check_regs,
   lapic_load_regs, 1, HVMSR_PER_VCPU);
 
 int vlapic_init(struct vcpu *v)
-- 
2.46.0




[PATCH v6 01/11] lib/x86: Relax checks about policy compatibility

2024-10-01 Thread Alejandro Vallejo
Allow a guest policy to have up to leaf 0xb even if the host doesn't.
Otherwise it's not possible to show leaf 0xb to guests for which we're
emulating an x2APIC on old AMD machines.

No externally visible changes though because toolstack doesn't yet
populate that leaf.

Signed-off-by: Alejandro Vallejo 
---
 tools/tests/cpu-policy/test-cpu-policy.c |  6 +-
 xen/lib/x86/policy.c | 11 ++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/tools/tests/cpu-policy/test-cpu-policy.c b/tools/tests/cpu-policy/test-cpu-policy.c
index 301df2c00285..9216010b1c5d 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -586,6 +586,10 @@ static void test_is_compatible_success(void)
 .platform_info.cpuid_faulting = true,
 },
 },
+{
+.name = "Host missing leaf 0xb, Guest wanted",
+.guest.basic.max_leaf = 0xb,
+},
 };
 struct cpu_policy_errors no_errors = INIT_CPU_POLICY_ERRORS;
 
@@ -614,7 +618,7 @@ static void test_is_compatible_failure(void)
 } tests[] = {
 {
 .name = "Host basic.max_leaf out of range",
-.guest.basic.max_leaf = 1,
+.guest.basic.max_leaf = 0xc,
 .e = { 0, -1, -1 },
 },
 {
diff --git a/xen/lib/x86/policy.c b/xen/lib/x86/policy.c
index f033d22785be..63bc96451d2c 100644
--- a/xen/lib/x86/policy.c
+++ b/xen/lib/x86/policy.c
@@ -15,7 +15,16 @@ int x86_cpu_policies_are_compatible(const struct cpu_policy *host,
 #define FAIL_MSR(m) \
 do { e.msr = (m); goto out; } while ( 0 )
 
-if ( guest->basic.max_leaf > host->basic.max_leaf )
+/*
+ * Old AMD hardware doesn't expose topology information in leaf 0xb. We
+ * want to emulate that leaf with credible information because it must be
+ * present on systems in which we emulate the x2APIC.
+ *
+ * For that reason, allow the max basic guest leaf to be larger than the
+ * host's, up to 0xb.
+ */
+if ( guest->basic.max_leaf > 0xb &&
+ guest->basic.max_leaf > host->basic.max_leaf )
 FAIL_CPUID(0, NA);
 
 if ( guest->feat.max_subleaf > host->feat.max_subleaf )
-- 
2.46.0




[PATCH v6 10/11] tools/libguest: Set distinct x2APIC IDs for each vCPU

2024-10-01 Thread Alejandro Vallejo
Have toolstack populate the new x2APIC ID in the LAPIC save record with
the proper IDs intended for each vCPU.

Signed-off-by: Alejandro Vallejo 
---
v6:
  * Rely on the new LUT in xc_dom_image rather than recalculating APIC IDs.
---
 tools/libs/guest/xg_dom_x86.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/tools/libs/guest/xg_dom_x86.c b/tools/libs/guest/xg_dom_x86.c
index c98229317db7..38486140ed15 100644
--- a/tools/libs/guest/xg_dom_x86.c
+++ b/tools/libs/guest/xg_dom_x86.c
@@ -1004,11 +1004,14 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 HVM_SAVE_TYPE(HEADER) header;
 struct hvm_save_descriptor mtrr_d;
 HVM_SAVE_TYPE(MTRR) mtrr;
+struct hvm_save_descriptor lapic_d;
+HVM_SAVE_TYPE(LAPIC) lapic;
 struct hvm_save_descriptor end_d;
 HVM_SAVE_TYPE(END) end;
 } vcpu_ctx;
-/* Context from full_ctx */
+/* Contexts from full_ctx */
 const HVM_SAVE_TYPE(MTRR) *mtrr_record;
+const HVM_SAVE_TYPE(LAPIC) *lapic_record;
 /* Raw context as taken from Xen */
 uint8_t *full_ctx = NULL;
 int rc;
@@ -1111,6 +1114,8 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 vcpu_ctx.mtrr_d.typecode = HVM_SAVE_CODE(MTRR);
 vcpu_ctx.mtrr_d.length = HVM_SAVE_LENGTH(MTRR);
 vcpu_ctx.mtrr = *mtrr_record;
+vcpu_ctx.lapic_d.typecode = HVM_SAVE_CODE(LAPIC);
+vcpu_ctx.lapic_d.length = HVM_SAVE_LENGTH(LAPIC);
 vcpu_ctx.end_d = bsp_ctx.end_d;
 vcpu_ctx.end = bsp_ctx.end;
 
@@ -1125,6 +1130,18 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 {
 vcpu_ctx.mtrr_d.instance = i;
 
+lapic_record = hvm_get_save_record(full_ctx, HVM_SAVE_CODE(LAPIC), i);
+if ( !lapic_record )
+{
+xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+ "%s: unable to get LAPIC[%d] save record", __func__, 
i);
+goto out;
+}
+
+vcpu_ctx.lapic = *lapic_record;
+vcpu_ctx.lapic.x2apic_id = dom->cpu_to_apicid[i];
+vcpu_ctx.lapic_d.instance = i;
+
 rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
   (uint8_t *)&vcpu_ctx, sizeof(vcpu_ctx));
 if ( rc != 0 )
-- 
2.46.0




[PATCH v6 11/11] tools/x86: Synthesise domain topologies

2024-10-01 Thread Alejandro Vallejo
Expose sensible topologies in leaf 0xb. At the moment it synthesises
non-HT systems, in line with the previous code intent.

Leaf 0xb in the host policy is no longer zapped and the guest {max,def}
policies have their topology leaves zapped instead. The intent is for
toolstack to populate them. There's no current use for the topology
information in the host policy, but it does no harm.

Signed-off-by: Alejandro Vallejo 
---
v6:
  * Assign accurate APIC IDs to xc_dom_img->cpu_to_apicid
* New field in v6. Allows ACPI generation to be correct on PVH too.
---
 tools/include/xenguest.h|  3 +++
 tools/libs/guest/xg_cpuid_x86.c | 29 -
 tools/libs/light/libxl_dom.c| 22 +-
 xen/arch/x86/cpu-policy.c   |  9 ++---
 4 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/tools/include/xenguest.h b/tools/include/xenguest.h
index aa50b78dfb89..dcabf219b9cb 100644
--- a/tools/include/xenguest.h
+++ b/tools/include/xenguest.h
@@ -831,6 +831,9 @@ int xc_set_domain_cpu_policy(xc_interface *xch, uint32_t domid,
 
 uint32_t xc_get_cpu_featureset_size(void);
 
+/* Returns the APIC ID of the `cpu`-th CPU according to `policy` */
+uint32_t xc_cpu_to_apicid(const xc_cpu_policy_t *policy, unsigned int cpu);
+
 enum xc_static_cpu_featuremask {
 XC_FEATUREMASK_KNOWN,
 XC_FEATUREMASK_SPECIAL,
diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
index 4453178100ad..c591f8732a1a 100644
--- a/tools/libs/guest/xg_cpuid_x86.c
+++ b/tools/libs/guest/xg_cpuid_x86.c
@@ -725,8 +725,16 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t domid, bool restore,
 p->policy.basic.htt   = test_bit(X86_FEATURE_HTT, host_featureset);
p->policy.extd.cmp_legacy = test_bit(X86_FEATURE_CMP_LEGACY, host_featureset);
 }
-else
+else if ( restore )
 {
+/*
+ * Reconstruct the topology exposed on Xen <= 4.13. It makes very 
little
+ * sense, but it's what those guests saw so it's set in stone now.
+ *
+ * Guests from Xen 4.14 onwards carry their own CPUID leaves in the
+ * migration stream so they don't need special treatment.
+ */
+
 /*
  * Topology for HVM guests is entirely controlled by Xen.  For now, we
  * hardcode APIC_ID = vcpu_id * 2 to give the illusion of no SMT.
@@ -782,6 +790,20 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t domid, bool restore,
 break;
 }
 }
+else
+{
+/* TODO: Expose the ability to choose a custom topology for HVM/PVH */
+unsigned int threads_per_core = 1;
+unsigned int cores_per_pkg = di.max_vcpu_id + 1;
+
+rc = x86_topo_from_parts(&p->policy, threads_per_core, cores_per_pkg);
+if ( rc )
+{
+ERROR("Failed to generate topology: rc=%d t/c=%u c/p=%u",
+  rc, threads_per_core, cores_per_pkg);
+goto out;
+}
+}
 
 nr_leaves = ARRAY_SIZE(p->leaves);
 rc = x86_cpuid_copy_to_buffer(&p->policy, p->leaves, &nr_leaves);
@@ -1028,3 +1050,8 @@ bool xc_cpu_policy_is_compatible(xc_interface *xch, xc_cpu_policy_t *host,
 
 return false;
 }
+
+uint32_t xc_cpu_to_apicid(const xc_cpu_policy_t *policy, unsigned int cpu)
+{
+return x86_x2apic_id_from_vcpu_id(&policy->policy, cpu);
+}
diff --git a/tools/libs/light/libxl_dom.c b/tools/libs/light/libxl_dom.c
index 7f9c6eaa8b24..8dbfc5b7df61 100644
--- a/tools/libs/light/libxl_dom.c
+++ b/tools/libs/light/libxl_dom.c
@@ -1063,6 +1063,9 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
 libxl_domain_build_info *const info = &d_config->b_info;
 struct xc_dom_image *dom = NULL;
 bool device_model = info->type == LIBXL_DOMAIN_TYPE_HVM ? true : false;
+#if defined(__i386__) || defined(__x86_64__)
+struct xc_cpu_policy *policy = NULL;
+#endif
 
 xc_dom_loginit(ctx->xch);
 
@@ -1083,8 +1086,22 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
 dom->container_type = XC_DOM_HVM_CONTAINER;
 
 #if defined(__i386__) || defined(__x86_64__)
+policy = xc_cpu_policy_init();
+if (!policy) {
+LOGE(ERROR, "xc_cpu_policy_init failed d%u", domid);
+rc = ERROR_NOMEM;
+goto out;
+}
+
+rc = xc_cpu_policy_get_domain(ctx->xch, domid, policy);
+if (rc != 0) {
+LOGE(ERROR, "xc_cpu_policy_get_domain failed d%u", domid);
+rc = ERROR_FAIL;
+goto out;
+}
+
 for ( uint32_t i = 0; i < info->max_vcpus; i++ )
-dom->cpu_to_apicid[i] = 2 * i; /* TODO: Replace by topo calculation */
+dom->cpu_to_apicid[i] = xc_cpu_to_apicid(policy, i);
 #endif
 
 /* The params from the configuration file are in Mb, which are then
@@ -1214,6 +1231,9 @@ int libxl__build_hvm(libxl__gc *gc, uint32_

Re: [PATCH v3] x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

2024-09-27 Thread Alejandro Vallejo
On Fri Sep 27, 2024 at 3:39 PM BST, Alejandro Vallejo wrote:
> Would whoever planned to commit this mind replacing the commit msg on commit?
> Otherwise I'll just resend.

Actually, I've just re-sent it to avoid confusions. Apologies.

Cheers,
Alejandro



[PATCH v3-resend] x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

2024-09-27 Thread Alejandro Vallejo
Hitting a page fault clobbers %cr2, so if a page fault is handled while
handling a previous page fault then %cr2 will hold the address of the
latter fault rather than the former. In particular, if a debug key
handler happens to trigger during #PF and before %cr2 is read, and that
handler itself encounters a #PF, then %cr2 will be corrupt for the outer #PF
handler.

This patch makes the page fault path delay re-enabling IRQs until %cr2
has been read in order to ensure it stays consistent.

A similar argument holds in additional cases, but they happen to be safe:
* %dr6 inside #DB: Safe because IST exceptions don't re-enable IRQs.
* MSR_XFD_ERR inside #NM: Safe because AMX isn't used in #NM handler.

While in the area, remove the redundant q suffix to a movq in entry.S and
add a space after the comma.

Fixes: a4cd20a19073 ("[XEN] 'd' key dumps both host and guest state.")
Signed-off-by: Alejandro Vallejo 
Acked-by: Roger Pau Monné 
---
v3:
  * s/dispatch_handlers/dispatch_exceptions/
  * Updated commit message, spelling out the state of #DB and #NM, and
state an existing race with debug keys.
---
 xen/arch/x86/traps.c|  8 
 xen/arch/x86/x86_64/entry.S | 20 
 2 files changed, 24 insertions(+), 4 deletions(-)
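
To spell out the race being closed (an illustrative interleaving, not taken
from the patch):

    outer #PF handler                  interrupt / debug-key path
    -----------------                  --------------------------
    CPU delivers #PF, %cr2 = A
    sti          <- old code re-enabled IRQs here
                                       IRQ fires; debug-key handler runs
                                       handler touches unmapped memory
                                       nested #PF sets %cr2 = B and returns
    addr = read_cr2();  /* reads B, not A */

With this patch, read_cr2() runs before IRQs are re-enabled, so no nested
fault can intervene.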

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 708136f62558..a9c2c607eb08 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1600,6 +1600,14 @@ void asmlinkage do_page_fault(struct cpu_user_regs *regs)
 
 addr = read_cr2();
 
+/*
+ * Don't re-enable interrupts if we were running an IRQ-off region when
+ * we hit the page fault, or we'll break that code.
+ */
+ASSERT(!local_irq_is_enabled());
+if ( regs->flags & X86_EFLAGS_IF )
+local_irq_enable();
+
 /* fixup_page_fault() might change regs->error_code, so cache it here. */
 error_code = regs->error_code;
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index b8482de8ee5b..9b0cdb76408b 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -844,9 +844,9 @@ handle_exception_saved:
 #elif !defined(CONFIG_PV)
 ASSERT_CONTEXT_IS_XEN
 #endif /* CONFIG_PV */
-sti
-1:  movq  %rsp,%rdi
-movzbl UREGS_entry_vector(%rsp),%eax
+.Ldispatch_exceptions:
+mov   %rsp, %rdi
+movzbl UREGS_entry_vector(%rsp), %eax
 #ifdef CONFIG_PERF_COUNTERS
 lea   per_cpu__perfcounters(%rip), %rcx
 add   STACK_CPUINFO_FIELD(per_cpu_offset)(%r14), %rcx
@@ -866,7 +866,19 @@ handle_exception_saved:
 jmp   .L_exn_dispatch_done;\
 .L_ ## vec ## _done:
 
+/*
+ * IRQs kept off to derisk being hit by a nested interrupt before
+ * reading %cr2. Otherwise a page fault in the nested interrupt handler
+ * would corrupt %cr2.
+ */
 DISPATCH(X86_EXC_PF, do_page_fault)
+
+/* Only re-enable IRQs if they were active before taking the fault */
+testb $X86_EFLAGS_IF >> 8, UREGS_eflags + 1(%rsp)
+jz    1f
+sti
+1:
+
 DISPATCH(X86_EXC_GP, do_general_protection)
 DISPATCH(X86_EXC_UD, do_invalid_op)
 DISPATCH(X86_EXC_NM, do_device_not_available)
@@ -911,7 +923,7 @@ exception_with_ints_disabled:
 movq  %rsp,%rdi
 call  search_pre_exception_table
 testq %rax,%rax # no fixup code for faulting EIP?
-jz    1b
+jz    .Ldispatch_exceptions
 movq  %rax,UREGS_rip(%rsp)  # fixup regular stack
 
 #ifdef CONFIG_XEN_SHSTK
-- 
2.46.0




Re: [PATCH v3] x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

2024-09-27 Thread Alejandro Vallejo
On Fri Sep 27, 2024 at 3:23 PM BST, Alejandro Vallejo wrote:
> Hitting a page fault clobbers %cr2, so if a page fault is handled while
> handling a previous page fault then %cr2 will hold the address of the
> latter fault rather than the former. This patch makes the page fault
> path delay re-enabling IRQs until %cr2 has been read in order to ensure
> it stays consistent.
>
> A similar argument holds in additional cases, but they happen to be safe:
>
>   * %dr6 inside #DB: Safe because IST exceptions don't re-enable IRQs.
>   * MSR_XFD_ERR inside #NM: Safe because AMX isn't used in #NM handler.
>
> While in the area, remove redundant q suffix to a movq in entry.S and
> add space after the comma.
>
> Fixes: a4cd20a19073 ("[XEN] 'd' key dumps both host and guest state.")
> Signed-off-by: Alejandro Vallejo 
> Acked-by: Roger Pau Monné 
> ---
> v3:
>   * s/dispatch_handlers/dispatch_exceptions/
>   * Updated commit message, spelling out the state of #DB and #NM, and
> state an existing race with debug keys.

Bah, I didn't refresh the patch with the latest commit message. It was meant to
be:

  x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

  Hitting a page fault clobbers %cr2, so if a page fault is handled while
  handling a previous page fault then %cr2 will hold the address of the latter
  fault rather than the former. In particular, if a debug key handler happens
  to trigger during #PF and before %cr2 is read, and that handler itself
  encounters a #PF, then %cr2 will be corrupt for the outer #PF handler.

  This patch makes the page fault path delay re-enabling IRQs until %cr2 has
  been read in order to ensure it stays consistent.

  A similar argument holds in additional cases, but they happen to be safe:

* %dr6 inside #DB: Safe because IST exceptions don't re-enable IRQs.
* MSR_XFD_ERR inside #NM: Safe because AMX isn't used in #NM handler.

  While in the area, remove redundant q suffix to a movq in entry.S and
  add space after the comma.

  Fixes: a4cd20a19073 ("[XEN] 'd' key dumps both host and guest state.")
  Signed-off-by: Alejandro Vallejo 
  Acked-by: Roger Pau Monné 

Would whoever planned to commit this mind replacing the commit msg on commit?
Otherwise I'll just resend.

Cheers,
Alejandro



[PATCH v3] x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

2024-09-27 Thread Alejandro Vallejo
Hitting a page fault clobbers %cr2, so if a page fault is handled while
handling a previous page fault then %cr2 will hold the address of the
latter fault rather than the former. This patch makes the page fault
path delay re-enabling IRQs until %cr2 has been read in order to ensure
it stays consistent.

A similar argument holds in additional cases, but they happen to be safe:

  * %dr6 inside #DB: Safe because IST exceptions don't re-enable IRQs.
  * MSR_XFD_ERR inside #NM: Safe because AMX isn't used in #NM handler.

While in the area, remove redundant q suffix to a movq in entry.S and
add space after the comma.

Fixes: a4cd20a19073 ("[XEN] 'd' key dumps both host and guest state.")
Signed-off-by: Alejandro Vallejo 
Acked-by: Roger Pau Monné 
---
v3:
  * s/dispatch_handlers/dispatch_exceptions/
  * Updated commit message, spelling out the state of #DB and #NM, and
state an existing race with debug keys.
---
 xen/arch/x86/traps.c|  8 
 xen/arch/x86/x86_64/entry.S | 20 
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 708136f62558..a9c2c607eb08 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1600,6 +1600,14 @@ void asmlinkage do_page_fault(struct cpu_user_regs *regs)
 
 addr = read_cr2();
 
+/*
+ * Don't re-enable interrupts if we were running an IRQ-off region when
+ * we hit the page fault, or we'll break that code.
+ */
+ASSERT(!local_irq_is_enabled());
+if ( regs->flags & X86_EFLAGS_IF )
+local_irq_enable();
+
 /* fixup_page_fault() might change regs->error_code, so cache it here. */
 error_code = regs->error_code;
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index b8482de8ee5b..9b0cdb76408b 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -844,9 +844,9 @@ handle_exception_saved:
 #elif !defined(CONFIG_PV)
 ASSERT_CONTEXT_IS_XEN
 #endif /* CONFIG_PV */
-sti
-1:  movq  %rsp,%rdi
-movzbl UREGS_entry_vector(%rsp),%eax
+.Ldispatch_exceptions:
+mov   %rsp, %rdi
+movzbl UREGS_entry_vector(%rsp), %eax
 #ifdef CONFIG_PERF_COUNTERS
 lea   per_cpu__perfcounters(%rip), %rcx
 add   STACK_CPUINFO_FIELD(per_cpu_offset)(%r14), %rcx
@@ -866,7 +866,19 @@ handle_exception_saved:
 jmp   .L_exn_dispatch_done;\
 .L_ ## vec ## _done:
 
+/*
+ * IRQs kept off to derisk being hit by a nested interrupt before
+ * reading %cr2. Otherwise a page fault in the nested interrupt handler
+ * would corrupt %cr2.
+ */
 DISPATCH(X86_EXC_PF, do_page_fault)
+
+/* Only re-enable IRQs if they were active before taking the fault */
+testb $X86_EFLAGS_IF >> 8, UREGS_eflags + 1(%rsp)
+jz    1f
+sti
+1:
+
 DISPATCH(X86_EXC_GP, do_general_protection)
 DISPATCH(X86_EXC_UD, do_invalid_op)
 DISPATCH(X86_EXC_NM, do_device_not_available)
@@ -911,7 +923,7 @@ exception_with_ints_disabled:
 movq  %rsp,%rdi
 call  search_pre_exception_table
 testq %rax,%rax # no fixup code for faulting EIP?
-jz    1b
+jz    .Ldispatch_exceptions
 movq  %rax,UREGS_rip(%rsp)  # fixup regular stack
 
 #ifdef CONFIG_XEN_SHSTK
-- 
2.46.0




Re: [PATCH v2] x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

2024-09-27 Thread Alejandro Vallejo
On Wed Sep 25, 2024 at 7:35 AM BST, Jan Beulich wrote:
> On 24.09.2024 20:36, Andrew Cooper wrote:
> > On 23/09/2024 2:03 pm, Jan Beulich wrote:
> >> On 23.09.2024 12:14, Alejandro Vallejo wrote:
> >>> On Fri Sep 20, 2024 at 3:12 PM BST, Roger Pau Monné wrote:
> >>>> On Wed, Sep 18, 2024 at 02:05:54PM +0100, Alejandro Vallejo wrote:
> >>>>> Moves sti directly after the cr2 read and immediately after the #PF
> >>>>> handler.
> >>>> I think you need to add some context about why this is needed, iow:
> avoid corrupting %cr2 if a nested #PF happens.
> >>> I can send a v3 with:
> >>>
> >>> ```
> >>>   Hitting a page fault clobbers %cr2, so if a page fault is handled while
> >>>   handling a previous page fault then %cr2 will hold the address of the latter
> >>>   fault rather than the former. This patch makes the page fault path delay
> >>>   re-enabling IRQs until %cr2 has been read in order to ensure it stays
> >>>   consistent.
> >> And under what conditions would we experience #PF while already processing
> >> an earlier #PF? If an interrupt kicks in, that's not supposed to by raising
> >> any #PF itself. Which isn't to say that the change isn't worthwhile to 
> >> make,
> >> but it would be nice if it was explicit whether there are active issues, or
> >> whether this is merely to be on the safe side going forward.
> > 
> > My understanding is that this came from code inspection, not an active
> > issue.
> > 
> > The same is true for %dr6 and #DB, and MSR_XFD_ERR and #NM.
> > 
> > I think we can safely agree to veto the use of AMX in the #NM handler,
> > and IST exceptions don't re-enable interrupts[1], so #PF is the only
> > problem case.
> > 
> > Debug keys happen off the back of plain IRQs, and we can get #PF when
> > interrogating guest stacks.
>
> Hmm, yes, this looks like a case that is actively being fixed here. Wants
> mentioning, likely wants a respective Fixes: tag, and then also wants
> backporting.

Sure.

>
> >  Also, I'm far from certain we're safe to
> > spurious #PF's from updating Xen mappings, so I think there are a bunch
> > of risky corner cases that we might be exposed to.
>
> Spurious #PF are possible, but __page_fault_type() explicitly excludes
> the in_irq() case.
>
> Jan

Cheers,
Alejandro



Re: [PATCH v2] x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

2024-09-27 Thread Alejandro Vallejo
Hi,

On Tue Sep 24, 2024 at 7:36 PM BST, Andrew Cooper wrote:
> On 23/09/2024 2:03 pm, Jan Beulich wrote:
> > On 23.09.2024 12:14, Alejandro Vallejo wrote:
> >> On Fri Sep 20, 2024 at 3:12 PM BST, Roger Pau Monné wrote:
> >>> On Wed, Sep 18, 2024 at 02:05:54PM +0100, Alejandro Vallejo wrote:
> >>>> Moves sti directly after the cr2 read and immediately after the #PF
> >>>> handler.
> >>> I think you need to add some context about why this is needed, iow:
> >>> avoid corrupting %cr2 if a nested #PF happens.
> >> I can send a v3 with:
> >>
> >> ```
> >>   Hitting a page fault clobbers %cr2, so if a page fault is handled while
> >>   handling a previous page fault then %cr2 will hold the address of the latter
> >>   fault rather than the former. This patch makes the page fault path delay
> >>   re-enabling IRQs until %cr2 has been read in order to ensure it stays
> >>   consistent.
> > And under what conditions would we experience #PF while already processing
> > an earlier #PF? If an interrupt kicks in, that's not supposed to be raising
> > any #PF itself. Which isn't to say that the change isn't worthwhile to make,
> > but it would be nice if it was explicit whether there are active issues, or
> > whether this is merely to be on the safe side going forward.
>
> My understanding is that this came from code inspection, not an active
> issue.

That's right. I merely eyeballed it while going through the interrupt dispatch
sequence. This is not a bugfix as much as simply being cautious.

>
> The same is true for %dr6 and #DB, and MSR_XFD_ERR and #NM.
>
> I think we can safely agree to veto the use of AMX in the #NM handler,

Agree.

> and IST exceptions don't re-enable interrupts[1], so #PF is the only
> problem case.
>
> Debug keys happen off the back of plain IRQs, and we can get #PF when

Could you elaborate here on debug keys? Not sure I understand what you mean.

> interrogating guest stacks.  Also, I'm far from certain we're safe to
> spurious #PF's from updating Xen mappings, so I think there are a bunch
> of risky corner cases that we might be exposed to.
>
> And I really need to find some time to get FRED working...
>
> ~Andrew
>
> [1] We do re-enable interrupts in order to IPI cpu0 for "clean"
> shutdown, and this explodes in our faces if kexec isn't active and we
> crashed in the middle of context switch.  We really need to not need
> irqs-on in order to shut down.

Why do we need them currently in that path? Regardless, shutdowns would be the
response to aborts (#MC or #DF) rather than #DB, right?

Cheers,
Alejandro



Re: [PATCH v2] x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

2024-09-23 Thread Alejandro Vallejo
On Fri Sep 20, 2024 at 3:12 PM BST, Roger Pau Monné wrote:
> On Wed, Sep 18, 2024 at 02:05:54PM +0100, Alejandro Vallejo wrote:
> > Moves sti directly after the cr2 read and immediately after the #PF
> > handler.
>
> I think you need to add some context about why this is needed, iow:
> avoid corrupting %cr2 if a nested #PF happens.

I can send a v3 with:

```
  Hitting a page fault clobbers %cr2, so if a page fault is handled while
  handling a previous page fault then %cr2 will hold the address of the latter
  fault rather than the former. This patch makes the page fault path delay
  re-enabling IRQs until %cr2 has been read in order to ensure it stays
  consistent.

  Furthermore, the patch preserves the invariant of "IRQs are only re-enabled
  if they were enabled in the interrupted context" in order to not break
  IRQs-off faulting contexts.
```

>
> > While in the area, remove the redundant q suffix from a movq in entry.S
> > 
> > Signed-off-by: Alejandro Vallejo 
>
> Acked-by: Roger Pau Monné 

Thanks

>
> One nit below.
>
> > ---
> > Got lost alongside other patches. Here's the promised v2.
> > 
> > pipeline: 
> > https://gitlab.com/xen-project/people/agvallejo/xen/-/pipelines/1458699639
> > v1: 
> > https://lore.kernel.org/xen-devel/20240911145823.12066-1-alejandro.vall...@cloud.com/
> > 
> > v2:
> >   * (cosmetic), add whitespace after comma
> >   * Added ASSERT(local_irq_is_enabled()) to do_page_fault()
> >   * Only re-enable interrupts if they were enabled in the interrupted
> > context.
> > ---
> >  xen/arch/x86/traps.c        |  8 ++++++++
> >  xen/arch/x86/x86_64/entry.S | 20 ++++++++++++++++----
> >  2 files changed, 24 insertions(+), 4 deletions(-)
> > 
> > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> > index 708136f62558..a9c2c607eb08 100644
> > --- a/xen/arch/x86/traps.c
> > +++ b/xen/arch/x86/traps.c
> > @@ -1600,6 +1600,14 @@ void asmlinkage do_page_fault(struct cpu_user_regs *regs)
> >  
> >  addr = read_cr2();
> >  
> > +/*
> > + * Don't re-enable interrupts if we were running an IRQ-off region when
> > + * we hit the page fault, or we'll break that code.
> > + */
> > +ASSERT(!local_irq_is_enabled());
> > +if ( regs->flags & X86_EFLAGS_IF )
> > +local_irq_enable();
> > +
> >  /* fixup_page_fault() might change regs->error_code, so cache it here. */
> >  error_code = regs->error_code;
> >  
> > diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> > index b8482de8ee5b..218e5ea85efb 100644
> > --- a/xen/arch/x86/x86_64/entry.S
> > +++ b/xen/arch/x86/x86_64/entry.S
> > @@ -844,9 +844,9 @@ handle_exception_saved:
> >  #elif !defined(CONFIG_PV)
> >  ASSERT_CONTEXT_IS_XEN
> >  #endif /* CONFIG_PV */
> > -sti
> > -1:  movq  %rsp,%rdi
> > -movzbl UREGS_entry_vector(%rsp),%eax
> > +.Ldispatch_handlers:
>
> Maybe 'dispatch_exception', since it's only exceptions that are
> handled here? dispatch_handlers seems a bit too generic, but no strong
> opinion.

Sure, anything would be better than "1:"

>
> Thanks, Roger.

Cheers,
Alejandro



[PATCH v2] x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

2024-09-18 Thread Alejandro Vallejo
Moves sti directly after the cr2 read and immediately after the #PF
handler.

While in the area, remove the redundant q suffix from a movq in entry.S

Signed-off-by: Alejandro Vallejo 
---
Got lost alongside other patches. Here's the promised v2.

pipeline: 
https://gitlab.com/xen-project/people/agvallejo/xen/-/pipelines/1458699639
v1: 
https://lore.kernel.org/xen-devel/20240911145823.12066-1-alejandro.vall...@cloud.com/

v2:
  * (cosmetic), add whitespace after comma
  * Added ASSERT(local_irq_is_enabled()) to do_page_fault()
  * Only re-enable interrupts if they were enabled in the interrupted
context.
---
 xen/arch/x86/traps.c        |  8 ++++++++
 xen/arch/x86/x86_64/entry.S | 20 ++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 708136f62558..a9c2c607eb08 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1600,6 +1600,14 @@ void asmlinkage do_page_fault(struct cpu_user_regs *regs)
 
 addr = read_cr2();
 
+/*
+ * Don't re-enable interrupts if we were running an IRQ-off region when
+ * we hit the page fault, or we'll break that code.
+ */
+ASSERT(!local_irq_is_enabled());
+if ( regs->flags & X86_EFLAGS_IF )
+local_irq_enable();
+
 /* fixup_page_fault() might change regs->error_code, so cache it here. */
 error_code = regs->error_code;
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index b8482de8ee5b..218e5ea85efb 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -844,9 +844,9 @@ handle_exception_saved:
 #elif !defined(CONFIG_PV)
 ASSERT_CONTEXT_IS_XEN
 #endif /* CONFIG_PV */
-sti
-1:  movq  %rsp,%rdi
-movzbl UREGS_entry_vector(%rsp),%eax
+.Ldispatch_handlers:
+mov   %rsp, %rdi
+movzbl UREGS_entry_vector(%rsp), %eax
 #ifdef CONFIG_PERF_COUNTERS
 lea   per_cpu__perfcounters(%rip), %rcx
 add   STACK_CPUINFO_FIELD(per_cpu_offset)(%r14), %rcx
@@ -866,7 +866,19 @@ handle_exception_saved:
 jmp   .L_exn_dispatch_done;\
 .L_ ## vec ## _done:
 
+/*
+ * IRQs kept off to derisk being hit by a nested interrupt before
+ * reading %cr2. Otherwise a page fault in the nested interrupt handler
+ * would corrupt %cr2.
+ */
 DISPATCH(X86_EXC_PF, do_page_fault)
+
+/* Only re-enable IRQs if they were active before taking the fault */
+testb $X86_EFLAGS_IF >> 8, UREGS_eflags + 1(%rsp)
+jz    1f
+sti
+1:
+
 DISPATCH(X86_EXC_GP, do_general_protection)
 DISPATCH(X86_EXC_UD, do_invalid_op)
 DISPATCH(X86_EXC_NM, do_device_not_available)
@@ -911,7 +923,7 @@ exception_with_ints_disabled:
 movq  %rsp,%rdi
 call  search_pre_exception_table
 testq %rax,%rax # no fixup code for faulting EIP?
-jz    1b
+jz    .Ldispatch_handlers
 movq  %rax,UREGS_rip(%rsp)  # fixup regular stack
 
 #ifdef CONFIG_XEN_SHSTK
-- 
2.46.0




Re: [PATCH] x86/traps: Re-enable IRQs after reading cr2 in the #PF handler

2024-09-12 Thread Alejandro Vallejo
On Thu Sep 12, 2024 at 10:49 AM BST, Andrew Cooper wrote:
> On 11/09/2024 3:58 pm, Alejandro Vallejo wrote:
> > diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> > index b8482de8ee..ef803f6288 100644
> > --- a/xen/arch/x86/x86_64/entry.S
> > +++ b/xen/arch/x86/x86_64/entry.S
> > @@ -844,8 +844,7 @@ handle_exception_saved:
> >  #elif !defined(CONFIG_PV)
> >  ASSERT_CONTEXT_IS_XEN
> >  #endif /* CONFIG_PV */
> > -sti
> > -1:  movq  %rsp,%rdi
> > +1:  mov   %rsp,%rdi
> >  movzbl UREGS_entry_vector(%rsp),%eax
> >  #ifdef CONFIG_PERF_COUNTERS
> >  lea   per_cpu__perfcounters(%rip), %rcx
>
> I'm afraid this isn't correctly.  The STI is only on one of two paths to
> the dispatch logic.
>
> Right now, you're re-enabling interrupts even if #PF hits an irqs-off
> region in Xen.
>
> You must not enabled IRQs if going via the exception_with_ints_disabled
> path, which is the user of that 1: label immediately after STI.
>
> ~Andrew

Well, darn. That's a well-hidden Waldo.

I'll send a v2 with conditional enables on C and assembly, and a change of that
label from "1" to ".Lfoo" to clearly imply the control flow might take a
backflip from several miles down the file.

Cheers,
Alejandro



[PATCH] x86/traps: Re-enable IRQs after reading cr2 in the #PF handler

2024-09-11 Thread Alejandro Vallejo
Moves sti directly after the cr2 read and immediately after the #PF
handler.

While in the area, remove the redundant q suffix from a movq in entry.S

Signed-off-by: Alejandro Vallejo 
---
I don't think this is a bug as much as an accident about to happen. Even if
there's no cases at the moment in which the IRQ handler may page fault, that
might change in the future.

Note: I haven't tested it extensively beyond running it on GitLab.

pipeline:
https://gitlab.com/xen-project/people/agvallejo/xen/-/pipelines/1449182525

---
 xen/arch/x86/traps.c        |  2 ++
 xen/arch/x86/x86_64/entry.S | 11 +++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 708136f625..1c04c03d9f 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1600,6 +1600,8 @@ void asmlinkage do_page_fault(struct cpu_user_regs *regs)
 
 addr = read_cr2();
 
+local_irq_enable();
+
 /* fixup_page_fault() might change regs->error_code, so cache it here. */
 error_code = regs->error_code;
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index b8482de8ee..ef803f6288 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -844,8 +844,7 @@ handle_exception_saved:
 #elif !defined(CONFIG_PV)
 ASSERT_CONTEXT_IS_XEN
 #endif /* CONFIG_PV */
-sti
-1:  movq  %rsp,%rdi
+1:  mov   %rsp,%rdi
 movzbl UREGS_entry_vector(%rsp),%eax
 #ifdef CONFIG_PERF_COUNTERS
 lea   per_cpu__perfcounters(%rip), %rcx
@@ -866,7 +865,15 @@ handle_exception_saved:
 jmp   .L_exn_dispatch_done;\
 .L_ ## vec ## _done:
 
+/*
+ * IRQs kept off to derisk being hit by a nested interrupt before
+ * reading %cr2. Otherwise a page fault in the nested interrupt handler
+ * would corrupt %cr2.
+ */
 DISPATCH(X86_EXC_PF, do_page_fault)
+
+sti
+
 DISPATCH(X86_EXC_GP, do_general_protection)
 DISPATCH(X86_EXC_UD, do_invalid_op)
 DISPATCH(X86_EXC_NM, do_device_not_available)
-- 
2.46.0




Re: [PATCH 1/3] x86/acpi: Drop acpi_video_flags and use bootsym(video_flags) directly

2024-09-05 Thread Alejandro Vallejo
On Thu Sep 5, 2024 at 2:06 PM BST, Andrew Cooper wrote:
> This removes a level of indirection, as well as removing a somewhat misleading
> name; the variable is really "S3 video quirks".

nit: Would it be beneficial to rename video_flags to s3_video_flags?

>
> More importantly however it makes it very clear that, right now, parsing the
> cmdline and quirks depends on having already placed the trampoline; a
> dependency which is going to be gnarly to untangle.
>
> That said, fixing the quirk is easy.  The Toshiba Satellite 4030CDT has an
> Intel Celeron 300MHz CPU (Pentium 2 era) from 1998 when MMX was the headline
> feature, sporting 64M of RAM.  Being a 32-bit processor, it hasn't been able
> to run Xen for about a decade now, so drop the quirk entirely.
>
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Frediano Ziglio 
> CC: Alejandro Vallejo 
> ---
>  xen/arch/x86/acpi/power.c   |  2 +-
>  xen/arch/x86/dmi_scan.c | 12 ------------
>  xen/arch/x86/include/asm/acpi.h |  1 -
>  3 files changed, 1 insertion(+), 14 deletions(-)

Always nice to see old hacks disappear.

  Reviewed-by: Alejandro Vallejo 

Cheers,
Alejandro



Re: [PATCH v4 01/44] x86/boot: move x86 boot module counting into a new boot_info struct

2024-09-02 Thread Alejandro Vallejo
I haven't read the entire series yet, but here's my .02 so far

On Fri Aug 30, 2024 at 10:46 PM BST, Daniel P. Smith wrote:
> From: Christopher Clark 
>
> An initial step towards a non-multiboot internal representation of boot
> modules for common code, starting with x86 setup and converting the fields
> that are accessed for the startup calculations.
>
> Introduce a new header, <asm/bootinfo.h>, and populate it with a new
> boot_info structure initially containing a count of the number of boot
> modules.
>
> No functional change intended.
>
> Signed-off-by: Christopher Clark 
> Signed-off-by: Daniel P. Smith 
> ---
>  xen/arch/x86/include/asm/bootinfo.h | 25 +++++++++++++++++++++++++
>  xen/arch/x86/setup.c                | 58 ++++++++++++++++++++++++++++++++++------------------------
>  2 files changed, 59 insertions(+), 24 deletions(-)
>  create mode 100644 xen/arch/x86/include/asm/bootinfo.h
>
> diff --git a/xen/arch/x86/include/asm/bootinfo.h b/xen/arch/x86/include/asm/bootinfo.h
> new file mode 100644
> index ..e850f80d26a7
> --- /dev/null
> +++ b/xen/arch/x86/include/asm/bootinfo.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (c) 2024 Christopher Clark 
> + * Copyright (c) 2024 Apertus Solutions, LLC
> + * Author: Daniel P. Smith 
> + */
> +
> +#ifndef __XEN_X86_BOOTINFO_H__
> +#define __XEN_X86_BOOTINFO_H__
> +

This struct would benefit from a comment stating what it's for and how it's
meant to be used. At a glance it seems like it's meant to be serve as a
boot-protocol agnostic representation of boot-parameters, used as a generic
means of information handover. Which would imply multiboot_info is parsed onto
it when booting from multiboot and is synthesised from scratch in other cases
(e.g: direct EFI?).

> +struct boot_info {
> +unsigned int nr_mods;

It's imo better to treat this as an ABI. That would allow using this layer as a
boot protocol in itself (which I'm guessing is the objective? I haven't gotten
that far in the series). If so, this would need to be a fixed-width uintN_t.

Same with other fields in follow-up patches.
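
A hypothetical sketch combining both points (the comment wording and the
fixed-width type are illustrative suggestions, not code from the series):

```
#include <stdint.h>

/*
 * struct boot_info - boot-protocol-agnostic hand-over of boot parameters.
 *
 * Filled in from multiboot_info when booting via Multiboot, or synthesised
 * from scratch on other entry paths (e.g. direct EFI), so that common setup
 * code never has to inspect protocol-specific structures. Fixed-width
 * fields keep the layout usable as a stable ABI.
 */
struct boot_info {
    uint32_t nr_mods;   /* Number of boot modules handed over. */
};
```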

> +};
> +
> +#endif
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index eee20bb1753c..dd94ee2e736b 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -32,6 +32,7 @@
>  #include 
>  #endif
>  #include 
> > +#include <asm/bootinfo.h>
>  #include 
>  #include 
>  #include 
> @@ -276,7 +277,16 @@ static int __init cf_check parse_acpi_param(const char 
> *s)
>  custom_param("acpi", parse_acpi_param);
>  
>  static const module_t *__initdata initial_images;
> -static unsigned int __initdata nr_initial_images;
> +static struct boot_info __initdata *boot_info;
> +
> +static void __init multiboot_to_bootinfo(multiboot_info_t *mbi)

If this function returned boot_info instead and the caller made the
assignment then it would be possible to unit-test/fuzz it.

It also fits a bit more nicely with the usual implications of that function
name pattern, I think.

> +{
> +static struct boot_info __initdata info;
> +
> +info.nr_mods = mbi->mods_count;

Shouldn't this be gated on MBI_MODULES being set?

   info.nr_mods = (mbi->flags & MBI_MODULES) ? mbi->mods_count : 0;
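
Putting the two together (return the pointer for testability, gate on
MBI_MODULES), a possible shape, purely as a sketch:

```
static struct boot_info *__init multiboot_to_bootinfo(multiboot_info_t *mbi)
{
    static struct boot_info __initdata info;

    /* mods_count is only meaningful when the loader set MBI_MODULES. */
    info.nr_mods = (mbi->flags & MBI_MODULES) ? mbi->mods_count : 0;

    return &info;
}

/* Caller side: boot_info = multiboot_to_bootinfo(mbi); */
```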

> +
> +boot_info = &info;
> +}
>  
>  unsigned long __init initial_images_nrpages(nodeid_t node)
>  {
> @@ -285,7 +295,7 @@ unsigned long __init initial_images_nrpages(nodeid_t node)
>  unsigned long nr;
>  unsigned int i;
>  
> -for ( nr = i = 0; i < nr_initial_images; ++i )
> +for ( nr = i = 0; i < boot_info->nr_mods; ++i )
>  {
>  unsigned long start = initial_images[i].mod_start;
>  unsigned long end = start + PFN_UP(initial_images[i].mod_end);
> @@ -301,7 +311,7 @@ void __init discard_initial_images(void)
>  {
>  unsigned int i;
>  
> -for ( i = 0; i < nr_initial_images; ++i )
> +for ( i = 0; i < boot_info->nr_mods; ++i )
>  {
>  uint64_t start = (uint64_t)initial_images[i].mod_start << PAGE_SHIFT;
>  
> @@ -309,7 +319,7 @@ void __init discard_initial_images(void)
> start + PAGE_ALIGN(initial_images[i].mod_end));
>  }
>  
> -nr_initial_images = 0;
> +boot_info->nr_mods = 0;

Out of curiosity, why is this required?

>  initial_images = NULL;
>  }
>  
> @@ -1034,9 +1044,10 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> long mbi_p)
>  mod = __va(mbi->mods_addr);
>  }
>  
> +multiboot_to_bootinfo(mbi);
> +
>  loader = (mbi->flags & MBI_LOADERNAME) ? __va(mbi->boot_loader_name)
> : "unknown";
> -

Stray newline removal?

>  /* Parse the command-line options. */
>  if ( mbi->flags & MBI_CMDLINE )
>  cmdline = cmdline_cook(__va(mbi->cmdline), loader);
> @@ -1141,18 +1152,18 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> long mb

Re: [XEN PATCH v2 3/3] libxl: Update the documentation of libxl_xen_console_read_line()

2024-09-02 Thread Alejandro Vallejo
On Fri Aug 23, 2024 at 6:05 PM BST, Javi Merino wrote:
> Despite its name, libxl_xen_console_read_line() does not read a line,
> it fills the buffer with as many characters as fit.  Update the
> documentation to reflect the real behaviour of the function.  Rename
> line_r to avoid confusion since it is a pointer to an array of
> characters.
>
> Signed-off-by: Javi Merino 
> ---
>  tools/libs/light/libxl_console.c | 22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/tools/libs/light/libxl_console.c b/tools/libs/light/libxl_console.c
> index f42f6a51ee6f..652897e4ef6d 100644
> --- a/tools/libs/light/libxl_console.c
> +++ b/tools/libs/light/libxl_console.c
> @@ -789,17 +789,19 @@ libxl_xen_console_reader *
>  return cr;
>  }
>  
> -/* return values:  *line_r
> - *   1  success, whole line obtained from buffer  non-0
> - *   0  no more lines available right now   0
> - *   negative   error code ERROR_*  0
> - * On success *line_r is updated to point to a nul-terminated
> +/* Copy part of the console ring into a buffer
> + *
> + * Return values:
> + *   1: Success, the buffer obtained from the console ring an

Seems like this line in the comment is incomplete?

> + *   0: No more lines available right now
> + *   -ERROR_* on error
> + *
> + * On success, *line_r is updated to point to a nul-terminated
>   * string which is valid until the next call on the same console
> - * reader.  The libxl caller may overwrite parts of the string
> - * if it wishes. */
> + * reader. */

Cheers,
Alejandro



Re: [PATCH v2] x86/cpufeatures: Add new cpuid features in SPR to featureset

2024-09-02 Thread Alejandro Vallejo
On Wed Aug 21, 2024 at 5:07 PM BST, Jan Beulich wrote:
> On 21.08.2024 17:34, Matthew Barnes wrote:
> > Upon running `xen-cpuid -v` on a host machine with Sapphire Rapids
> > within Dom0, there exist unrecognised features.
> > 
> > This patch adds these features as macros to the CPU featureset,
> > disabled by default.
> > 
> > Signed-off-by: Matthew Barnes 
>
> I don't strictly mind the patch in this shape, but ...
>
> > @@ -276,10 +283,13 @@ XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
> >  XEN_CPUFEATURE(SERIALIZE, 9*32+14) /*A  SERIALIZE insn */
> >  XEN_CPUFEATURE(HYBRID,9*32+15) /*   Heterogeneous platform */
> >  XEN_CPUFEATURE(TSXLDTRK,  9*32+16) /*a  TSX load tracking suspend/resume insns */
> > +XEN_CPUFEATURE(PCONFIG,   9*32+18) /*   PCONFIG insn */
> >  XEN_CPUFEATURE(ARCH_LBR,  9*32+19) /*   Architectural Last Branch Record */
> > +XEN_CPUFEATURE(AMX_BF16,  9*32+22) /*   Tile computational operations on bfloat16 numbers */
> >  XEN_CPUFEATURE(AVX512_FP16,   9*32+23) /*A  AVX512 FP16 instructions */
> >  XEN_CPUFEATURE(AMX_TILE,  9*32+24) /*   AMX Tile architecture */
> > +XEN_CPUFEATURE(AMX_INT8,  9*32+25) /*   Tile computational operations on 8-bit integers */
> >  XEN_CPUFEATURE(IBRSB, 9*32+26) /*A  IBRS and IBPB support (used by Intel) */
> >  XEN_CPUFEATURE(STIBP, 9*32+27) /*A  STIBP */
> >  XEN_CPUFEATURE(L1D_FLUSH, 9*32+28) /*S  MSR_FLUSH_CMD and L1D flush. */
> >  XEN_CPUFEATURE(STIBP, 9*32+27) /*A  STIBP */
> >  XEN_CPUFEATURE(L1D_FLUSH, 9*32+28) /*S  MSR_FLUSH_CMD and L1D flush. */
>
> ... having had a respective (more complete) patch pending for years I really
> wonder if it shouldn't be that one to be taken. While it would need adjustment
> to go ahead of other stuff (as posted in v3), I don't think it has any true
> dependency on earlier patches in the AMX series. IOW I could re-post v4
> standalone, and then we'd have a more complete view on AMX as well as proper
> dependencies in place.
>
> Thoughts?
>
> Jan

Oh! I had no idea you already posted patches to enable AMX. Is this the one?

https://lore.kernel.org/xen-devel/322de6db-e01f-0b57-5777-5d94a13c4...@suse.com/

Cheers,
Alejandro



Re: [PATCH 15/22] x86/idle: allow using a per-pCPU L4

2024-08-21 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 9cfcf0dc63f3..b62c4311da6c 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -555,6 +555,7 @@ void arch_vcpu_regs_init(struct vcpu *v)
>  int arch_vcpu_create(struct vcpu *v)
>  {
>  struct domain *d = v->domain;
> +root_pgentry_t *pgt = NULL;
>  int rc;
>  
>  v->arch.flags = TF_kernel_mode;
> @@ -589,7 +590,23 @@ int arch_vcpu_create(struct vcpu *v)
>  else
>  {
>  /* Idle domain */
> -v->arch.cr3 = __pa(idle_pg_table);
> +if ( (opt_asi_pv || opt_asi_hvm) && v->vcpu_id )
> +{
> +pgt = alloc_xenheap_page();
> +
> +/*
> + * For the idle vCPU 0 (the BSP idle vCPU) use idle_pg_table
> + * directly, there's no need to create yet another copy.
> + */

Shouldn't this comment be in the else branch instead? Or reworded to refer to
non-0 vCPUs.

> +rc = -ENOMEM;

While it's true rc is overridden later, I feel uneasy leaving it with -ENOMEM
after the check. Could we have it immediately before "goto fail"?
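
I.e. something along these lines (sketch of the suggested error path):

```
pgt = alloc_xenheap_page();
if ( !pgt )
{
    rc = -ENOMEM;
    goto fail;
}

copy_page(pgt, idle_pg_table);
v->arch.cr3 = __pa(pgt);
```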

> +if ( !pgt )
> +goto fail;
> +
> +copy_page(pgt, idle_pg_table);
> +v->arch.cr3 = __pa(pgt);
> +}
> +else
> +v->arch.cr3 = __pa(idle_pg_table);
>  rc = 0;
>  v->arch.msrs = ZERO_BLOCK_PTR; /* Catch stray misuses */
>  }
> @@ -611,6 +628,7 @@ int arch_vcpu_create(struct vcpu *v)
>  vcpu_destroy_fpu(v);
>  xfree(v->arch.msrs);
>  v->arch.msrs = NULL;
> +free_xenheap_page(pgt);
>  
>  return rc;
>  }

I guess the idle domain has a forever lifetime and its vCPUs are kept around
forever too, right? Otherwise we'd need extra logic in the vcpu_destroy()
to free the page table copies should they exist too.

Cheers,
Alejandro



Re: [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode

2024-08-19 Thread Alejandro Vallejo
On Fri Aug 16, 2024 at 7:02 PM BST, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> > Instead of allocating a monitor table for each vCPU when running in HVM HAP
> > mode, use a per-pCPU monitor table, which gets the per-domain slot updated on
> > guest context switch.
> >
> > This limits the amount of memory used for HVM HAP monitor tables to the amount
> > of active pCPUs, rather than to the number of vCPUs.  It also simplifies vCPU
> > allocation and teardown, since the monitor table handling is removed from
> > there.
> >
> > Note the switch to using a per-CPU monitor table is done regardless of whether
>
> s/per-CPU/per-pCPU/
>
> > Address Space Isolation is enabled or not.  Partly for the memory usage
> > reduction, and also because it allows to simplify the VM tear down path by not
> > having to cleanup the per-vCPU monitor tables.
> >
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Note the monitor table is not made static because uses outside of the file
> > where it's defined will be added by further patches.
> > ---
> >  xen/arch/x86/hvm/hvm.c | 60 
> >  xen/arch/x86/hvm/svm/svm.c |  5 ++
> >  xen/arch/x86/hvm/vmx/vmcs.c|  1 +
> >  xen/arch/x86/hvm/vmx/vmx.c |  4 ++
> >  xen/arch/x86/include/asm/hap.h |  1 -
> >  xen/arch/x86/include/asm/hvm/hvm.h |  8 
> >  xen/arch/x86/mm.c  |  8 
> >  xen/arch/x86/mm/hap/hap.c  | 75 --
> >  xen/arch/x86/mm/paging.c   |  4 +-
> >  9 files changed, 87 insertions(+), 79 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index 7f4b627b1f5f..3f771bc65677 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -104,6 +104,54 @@ static const char __initconst warning_hvm_fep[] =
> >  static bool __initdata opt_altp2m_enabled;
> >  boolean_param("altp2m", opt_altp2m_enabled);
> >  
> > +DEFINE_PER_CPU(root_pgentry_t *, monitor_pgt);
> > +
> > +static int allocate_cpu_monitor_table(unsigned int cpu)
>
> To avoid ambiguity, could we call these *_pcpu_*() instead?
>
> > +{
> > +root_pgentry_t *pgt = alloc_xenheap_page();
> > +
> > +if ( !pgt )
> > +return -ENOMEM;
> > +
> > +clear_page(pgt);
> > +
> > +init_xen_l4_slots(pgt, _mfn(virt_to_mfn(pgt)), INVALID_MFN, NULL,
> > +  false, true, false);
> > +
> > +ASSERT(!per_cpu(monitor_pgt, cpu));
> > +per_cpu(monitor_pgt, cpu) = pgt;
> > +
> > +return 0;
> > +}
> > +
> > +static void free_cpu_monitor_table(unsigned int cpu)
> > +{
> > +root_pgentry_t *pgt = per_cpu(monitor_pgt, cpu);
> > +
> > +if ( !pgt )
> > +return;
> > +
> > +per_cpu(monitor_pgt, cpu) = NULL;
> > +free_xenheap_page(pgt);
> > +}
> > +
> > +void hvm_set_cpu_monitor_table(struct vcpu *v)
> > +{
> > +root_pgentry_t *pgt = this_cpu(monitor_pgt);
> > +
> > +ASSERT(pgt);
> > +
> > +setup_perdomain_slot(v, pgt);
>
> Why not modify them as part of write_ptbase() instead? As it stands, it appears
> to be modifying the PTEs of what may very well be our current PT, which makes
> the perdomain slot be in a $DEITY-knows-what state until the next flush
> (presumably the write to cr3 in write_ptbase()?; assuming no PCIDs).
>
> Setting the slot up right before the cr3 change should reduce the potential for
> misuse.
>
> > +
> > +make_cr3(v, _mfn(virt_to_mfn(pgt)));
> > +}
> > +
> > +void hvm_clear_cpu_monitor_table(struct vcpu *v)
> > +{
> > +/* Poison %cr3, it will be updated when the vCPU is scheduled. */
> > +make_cr3(v, INVALID_MFN);
>
> I think this would benefit from more exposition in the comment. If I'm getting
> this right, after descheduling this vCPU we can't assume it'll be rescheduled
> on the same pCPU, and if it's not it'll end up using a different monitor table.
> This poison value is meant to highlight forgetting to set cr3 in the
> "ctxt_switch_to()" path. 
>
> All of that can be deduced from what you wrote and sufficient headscratching
> but seeing how this is invoked from the context switch path it's not incredibly
> clear whether you meant the perdomain slot would be updated by the next vCPU or
> what I stated in the previous paragraph.

Re: [PATCH 16/22] x86/mm: introduce a per-CPU L3 table for the per-domain slot

2024-08-16 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 4:22 PM BST, Roger Pau Monne wrote:
> So far L4 slot 260 has always been per-domain, in other words: all vCPUs of a
> domain share the same L3 entry.  Currently only 3 slots are used in that L3
> table, which leaves plenty of room.
>
> Introduce a per-CPU L3 that's used when the domain has Address Space Isolation
> enabled.  Such per-CPU L3 gets currently populated using the same L3 entries
> present on the per-domain L3 (d->arch.perdomain_l3_pg).
>
> No functional change expected, as the per-CPU L3 is always a copy of the
> contents of d->arch.perdomain_l3_pg.
>
> Note that all the per-domain L3 entries are populated at domain create, and
> hence there's no need to sync the state of the per-CPU L3 as the domain won't
> yet be running when the L3 is modified.
>
> Signed-off-by: Roger Pau Monné 

Still scratching my head with the details on this, but in general I'm utterly
confused whenever I read per-CPU in the series because it's not obvious which
CPU (p or v) I should be thinking about. A general change that would help a lot
is to replace every instance of per-CPU with per-vCPU or per-pCPU as needed.

Cheers,
Alejandro



Re: [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode

2024-08-16 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> Instead of allocating a monitor table for each vCPU when running in HVM HAP
> mode, use a per-pCPU monitor table, which gets the per-domain slot updated on
> guest context switch.
>
> This limits the amount of memory used for HVM HAP monitor tables to the amount
> of active pCPUs, rather than to the number of vCPUs.  It also simplifies vCPU
> allocation and teardown, since the monitor table handling is removed from
> there.
>
> Note the switch to using a per-CPU monitor table is done regardless of whether

s/per-CPU/per-pCPU/

> Address Space Isolation is enabled or not.  Partly for the memory usage
> reduction, and also because it allows to simplify the VM tear down path by not
> having to cleanup the per-vCPU monitor tables.
>
> Signed-off-by: Roger Pau Monné 
> ---
> Note the monitor table is not made static because uses outside of the file
> where it's defined will be added by further patches.
> ---
>  xen/arch/x86/hvm/hvm.c | 60 
>  xen/arch/x86/hvm/svm/svm.c |  5 ++
>  xen/arch/x86/hvm/vmx/vmcs.c|  1 +
>  xen/arch/x86/hvm/vmx/vmx.c |  4 ++
>  xen/arch/x86/include/asm/hap.h |  1 -
>  xen/arch/x86/include/asm/hvm/hvm.h |  8 
>  xen/arch/x86/mm.c  |  8 
>  xen/arch/x86/mm/hap/hap.c  | 75 --
>  xen/arch/x86/mm/paging.c   |  4 +-
>  9 files changed, 87 insertions(+), 79 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 7f4b627b1f5f..3f771bc65677 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -104,6 +104,54 @@ static const char __initconst warning_hvm_fep[] =
>  static bool __initdata opt_altp2m_enabled;
>  boolean_param("altp2m", opt_altp2m_enabled);
>  
> +DEFINE_PER_CPU(root_pgentry_t *, monitor_pgt);
> +
> +static int allocate_cpu_monitor_table(unsigned int cpu)

To avoid ambiguity, could we call these *_pcpu_*() instead?

> +{
> +root_pgentry_t *pgt = alloc_xenheap_page();
> +
> +if ( !pgt )
> +return -ENOMEM;
> +
> +clear_page(pgt);
> +
> +init_xen_l4_slots(pgt, _mfn(virt_to_mfn(pgt)), INVALID_MFN, NULL,
> +  false, true, false);
> +
> +ASSERT(!per_cpu(monitor_pgt, cpu));
> +per_cpu(monitor_pgt, cpu) = pgt;
> +
> +return 0;
> +}
> +
> +static void free_cpu_monitor_table(unsigned int cpu)
> +{
> +root_pgentry_t *pgt = per_cpu(monitor_pgt, cpu);
> +
> +if ( !pgt )
> +return;
> +
> +per_cpu(monitor_pgt, cpu) = NULL;
> +free_xenheap_page(pgt);
> +}
> +
> +void hvm_set_cpu_monitor_table(struct vcpu *v)
> +{
> +root_pgentry_t *pgt = this_cpu(monitor_pgt);
> +
> +ASSERT(pgt);
> +
> +setup_perdomain_slot(v, pgt);

Why not modify them as part of write_ptbase() instead? As it stands, it appears
to be modifying the PTEs of what may very well be our current PT, which makes
the perdomain slot be in a $DEITY-knows-what state until the next flush
(presumably the write to cr3 in write_ptbase()?; assuming no PCIDs).

Setting the slot up right before the cr3 change should reduce the potential for
misuse.
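
As a sketch of what I mean (function names as in this series; the placement
inside write_ptbase() and the elided cr4 handling are the hypothetical part):

```
void write_ptbase(struct vcpu *v)
{
    unsigned long new_cr4 = mmu_cr4_features; /* stand-in for the existing
                                                 cr4 computation */

    /*
     * Refresh the per-domain slot in this pCPU's monitor table right
     * before the cr3 write, so the updated PTE is never left pending
     * while the table is live; the cr3 load below is what flushes it
     * into view.
     */
    setup_perdomain_slot(v, this_cpu(monitor_pgt));

    switch_cr3_cr4(v->arch.cr3, new_cr4);
}
```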

> +
> +make_cr3(v, _mfn(virt_to_mfn(pgt)));
> +}
> +
> +void hvm_clear_cpu_monitor_table(struct vcpu *v)
> +{
> +/* Poison %cr3, it will be updated when the vCPU is scheduled. */
> +make_cr3(v, INVALID_MFN);

I think this would benefit from more exposition in the comment. If I'm getting
this right, after descheduling this vCPU we can't assume it'll be rescheduled
on the same pCPU, and if it's not it'll end up using a different monitor table.
This poison value is meant to highlight forgetting to set cr3 in the
"ctxt_switch_to()" path. 

All of that can be deduced from what you wrote and sufficient headscratching
but seeing how this is invoked from the context switch path it's not incredibly
clear whether you meant the perdomain slot would be updated by the next vCPU or
what I stated in the previous paragraph.

Assuming it is as I mentioned, maybe hvm_forget_cpu_monitor_table() would
convey what it does better? i.e: the vCPU forgets/unbinds the monitor table
from its internal state.
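
If it helps, the shape I had in mind (rename and comment both illustrative):

```
/*
 * Called when the vCPU is descheduled. It may be rescheduled on a
 * different pCPU, where it must pick up that pCPU's monitor table, so
 * unbind the current one by poisoning %cr3. Any path that forgets to
 * call hvm_set_cpu_monitor_table() on the way back in then fails
 * loudly instead of silently reusing a stale monitor table.
 */
void hvm_forget_cpu_monitor_table(struct vcpu *v)
{
    make_cr3(v, INVALID_MFN);
}
```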

Cheers,
Alejandro



Re: [XEN PATCH v2 1/5] x86/Kconfig: introduce CENTAUR, HYGON & SHANGHAI config options

2024-08-16 Thread Alejandro Vallejo
On Fri Aug 16, 2024 at 12:10 PM BST, Sergiy Kibrik wrote:
> These options aim to represent what's currently supported by Xen, and later
> to allow tuning for specific platform(s) only.
>
> HYGON and SHANGHAI options depend on AMD and INTEL as there're build
> dependencies on support code for AMD and Intel CPUs respectively.
>
> Signed-off-by: Sergiy Kibrik 
> CC: Alejandro Vallejo 
> CC: Jan Beulich 
> ---
>  xen/arch/x86/Kconfig.cpu  | 29 +
>  xen/arch/x86/cpu/Makefile |  6 +++---
>  xen/arch/x86/cpu/common.c |  6 ++
>  3 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/xen/arch/x86/Kconfig.cpu b/xen/arch/x86/Kconfig.cpu
> index 5fb18db1aa..ac8f41d464 100644
> --- a/xen/arch/x86/Kconfig.cpu
> +++ b/xen/arch/x86/Kconfig.cpu
> @@ -10,6 +10,25 @@ config AMD
> May be turned off in builds targetting other vendors.  Otherwise,
> must be enabled for Xen to work suitably on AMD platforms.
>  
> +config CENTAUR
> + bool "Support Centaur CPUs"
> + default y
> + help
> +   Detection, tunings and quirks for VIA platforms.
> +
> +   May be turned off in builds targeting other vendors. Otherwise, must
> +  be enabled for Xen to work suitably on VIA platforms.
> +
> +config HYGON
> + bool "Support Hygon CPUs"
> + depends on AMD
> + default y
> + help
> +   Detection, tunings and quirks for Hygon platforms.
> +
> +   May be turned off in builds targeting other vendors. Otherwise, must
> +  be enabled for Xen to work suitably on Hygon platforms.
> +
>  config INTEL
>   bool "Support Intel CPUs"
>   default y
> @@ -19,4 +38,14 @@ config INTEL
> May be turned off in builds targetting other vendors.  Otherwise,
> must be enabled for Xen to work suitably on Intel platforms.
>  
> +config SHANGHAI
> + bool "Support Shanghai CPUs"
> + depends on INTEL
> + default y
> + help
> +   Detection, tunings and quirks for Zhaoxin platforms.
> +
> +   May be turned off in builds targeting other vendors. Otherwise, must
> +  be enabled for Xen to work suitably on Zhaoxin platforms.
> +
>  endmenu
> diff --git a/xen/arch/x86/cpu/Makefile b/xen/arch/x86/cpu/Makefile
> index eafce5f204..80739d0256 100644
> --- a/xen/arch/x86/cpu/Makefile
> +++ b/xen/arch/x86/cpu/Makefile
> @@ -3,13 +3,13 @@ obj-y += microcode/
>  obj-y += mtrr/
>  
>  obj-y += amd.o
> -obj-y += centaur.o
> +obj-$(CONFIG_CENTAUR) += centaur.o
>  obj-y += common.o
> -obj-y += hygon.o
> +obj-$(CONFIG_HYGON) += hygon.o
>  obj-y += intel.o
>  obj-y += intel_cacheinfo.o
>  obj-y += mwait-idle.o
> -obj-y += shanghai.o
> +obj-$(CONFIG_SHANGHAI) += shanghai.o
>  obj-y += vpmu.o
>  obj-$(CONFIG_AMD) += vpmu_amd.o
>  obj-$(CONFIG_INTEL) += vpmu_intel.o
> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
> index ff4cd22897..dcc2753212 100644
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -339,9 +339,15 @@ void __init early_cpu_init(bool verbose)
>   case X86_VENDOR_INTEL:intel_unlock_cpuid_leaves(c);
> actual_cpu = intel_cpu_dev;break;
>   case X86_VENDOR_AMD:  actual_cpu = amd_cpu_dev;  break;
> +#ifdef CONFIG_CENTAUR
>   case X86_VENDOR_CENTAUR:  actual_cpu = centaur_cpu_dev;  break;
> +#endif
> +#ifdef CONFIG_SHANGHAI
>   case X86_VENDOR_SHANGHAI: actual_cpu = shanghai_cpu_dev; break;
> +#endif
> +#ifdef CONFIG_HYGON
>   case X86_VENDOR_HYGON:actual_cpu = hygon_cpu_dev;break;
> +#endif
>   default:
>   actual_cpu = default_cpu;
>   if (!verbose)

Reviewed-by: Alejandro Vallejo 

Cheers,
Alejandro



Re: [PATCH v3 2/2] x86/fpu: Split fpu_setup_fpu() in three

2024-08-13 Thread Alejandro Vallejo
On Tue Aug 13, 2024 at 3:32 PM BST, Jan Beulich wrote:
> On 13.08.2024 16:21, Alejandro Vallejo wrote:
> > It was trying to do too many things at once and there was no clear way of
> > defining what it was meant to do. This commit splits the function in three.
> > 
> >   1. A function to return the FPU to power-on reset values.
> >   2. A function to return the FPU to default values.
> >   3. An x87/SSE state loader (equivalent to the old function when it took a data
> >  pointer).
> > 
> > While at it, make sure the abridged tag is consistent with the manuals and
> > start as 0xFF.
> > 
> > Signed-off-by: Alejandro Vallejo 
>
> Reviewed-by: Jan Beulich 
>
> > ---
> > v3:
> >   * Adjust commit message, as the split is now in 3.
> >   * Remove bulky comment, as the rationale for it turned out to be
> > unsubstantiated. I can't find proof in xen-devel of the stream
> > operating the way I claimed, and at that point having the comment
> > at all is pointless
>
> So you deliberately removed the comment altogether, not just point 3 of it?
>
> Jan

Yes. The other two cases can be deduced pretty trivially from the conditional,
I reckon. I commented them more heavily in order to properly introduce (3), but
seeing how it was all a midsummer dream might as well reduce clutter.

I got as far as the original implementation of XSAVE in Xen and it seems to
have been tested against many combinations of src and dst, none of which was
that fictitious "xsave enabled + xsave context missing". I suspect the
xsave_enabled(v) was merely avoiding writing to the XSAVE buffer just for
efficiency (however minor an effect it might have had). I just reverse-engineered
it wrong.

Which reminds me. Thanks for mentioning that, because it was really just
guesswork on my part.

Cheers,
Alejandro



[PATCH v3 2/2] x86/fpu: Split fpu_setup_fpu() in three

2024-08-13 Thread Alejandro Vallejo
It was trying to do too many things at once and there was no clear way of
defining what it was meant to do. This commit splits the function in three.

  1. A function to return the FPU to power-on reset values.
  2. A function to return the FPU to default values.
  3. An x87/SSE state loader (equivalent to the old function when it took a data
 pointer).

While at it, make sure the abridged tag is consistent with the manuals and
start as 0xFF.

Signed-off-by: Alejandro Vallejo 
---
v3:
  * Adjust commit message, as the split is now in 3.
  * Remove bulky comment, as the rationale for it turned out to be
unsubstantiated. I can't find proof in xen-devel of the stream
operating the way I claimed, and at that point having the comment
at all is pointless

I suspect the rationale for xsave_vcpu(v) was merely to skip writing the XSAVE
header when it would be rewritten later on. Whatever it might be the current
logic does the right thing and is several orders of magnitude clearer about its
objective and its intent.

---
 xen/arch/x86/domain.c |  7 ++--
 xen/arch/x86/hvm/hvm.c| 12 +++
 xen/arch/x86/i387.c   | 60 +++
 xen/arch/x86/include/asm/i387.h   | 28 ---
 xen/arch/x86/include/asm/xstate.h |  1 +
 5 files changed, 62 insertions(+), 46 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index d977ec71ca20..5af9e3e7a8b4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1186,9 +1186,10 @@ int arch_set_info_guest(
  is_pv_64bit_domain(d) )
 v->arch.flags &= ~TF_kernel_mode;
 
-vcpu_setup_fpu(v, v->arch.xsave_area,
-   flags & VGCF_I387_VALID ? &c.nat->fpu_ctxt : NULL,
-   FCW_DEFAULT);
+if ( flags & VGCF_I387_VALID )
+vcpu_setup_fpu(v, &c.nat->fpu_ctxt);
+else
+vcpu_default_fpu(v);
 
 if ( !compat )
 {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 76bbb645b77a..95d66e68a849 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1165,10 +1165,10 @@ static int cf_check hvm_load_cpu_ctxt(struct domain *d, 
hvm_domain_context_t *h)
 seg.attr = ctxt.ldtr_arbytes;
 hvm_set_segment_register(v, x86_seg_ldtr, &seg);
 
-/* Cover xsave-absent save file restoration on xsave-capable host. */
-vcpu_setup_fpu(v, xsave_enabled(v) ? NULL : v->arch.xsave_area,
-   ctxt.flags & XEN_X86_FPU_INITIALISED ? ctxt.fpu_regs : NULL,
-   FCW_RESET);
+if ( ctxt.flags & XEN_X86_FPU_INITIALISED )
+vcpu_setup_fpu(v, &ctxt.fpu_regs);
+else
+vcpu_reset_fpu(v);
 
 v->arch.user_regs.rax = ctxt.rax;
 v->arch.user_regs.rbx = ctxt.rbx;
@@ -4008,9 +4008,7 @@ void hvm_vcpu_reset_state(struct vcpu *v, uint16_t cs, uint16_t ip)
 v->arch.guest_table = pagetable_null();
 }
 
-if ( v->arch.xsave_area )
-v->arch.xsave_area->xsave_hdr.xstate_bv = 0;
-vcpu_setup_fpu(v, v->arch.xsave_area, NULL, FCW_RESET);
+vcpu_reset_fpu(v);
 
 arch_vcpu_regs_init(v);
 v->arch.user_regs.rip = ip;
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index fbb9d3584a3d..f7a9dcd162ba 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -303,41 +303,37 @@ int vcpu_init_fpu(struct vcpu *v)
 return xstate_alloc_save_area(v);
 }
 
-void vcpu_setup_fpu(struct vcpu *v, struct xsave_struct *xsave_area,
-const void *data, unsigned int fcw_default)
+void vcpu_reset_fpu(struct vcpu *v)
 {
-fpusse_t *fpu_sse = &v->arch.xsave_area->fpu_sse;
-
-ASSERT(!xsave_area || xsave_area == v->arch.xsave_area);
-
-v->fpu_initialised = !!data;
+v->fpu_initialised = false;
+*v->arch.xsave_area = (struct xsave_struct) {
+.fpu_sse = {
+.mxcsr = MXCSR_DEFAULT,
+.fcw = FCW_RESET,
+.ftw = FTW_RESET,
+},
+.xsave_hdr.xstate_bv = X86_XCR0_X87,
+};
+}
 
-if ( data )
-{
-memcpy(fpu_sse, data, sizeof(*fpu_sse));
-if ( xsave_area )
-xsave_area->xsave_hdr.xstate_bv = XSTATE_FP_SSE;
-}
-else if ( xsave_area && fcw_default == FCW_DEFAULT )
-{
-xsave_area->xsave_hdr.xstate_bv = 0;
-fpu_sse->mxcsr = MXCSR_DEFAULT;
-}
-else
-{
-memset(fpu_sse, 0, sizeof(*fpu_sse));
-fpu_sse->fcw = fcw_default;
-fpu_sse->mxcsr = MXCSR_DEFAULT;
-if ( v->arch.xsave_area )
-{
-v->arch.xsave_area->xsave_hdr.xstate_bv &= ~XSTATE_FP_SSE;
-if ( fcw_default != FCW_DEFAULT )
-v->arch.xsave_area->xsave_hdr.xstate_bv |= X86_XCR0_X87;
-}
-}
+void vcpu_default_fpu(struct vcpu *v)
+{
+v->fpu_initialised = false;
+*v->arch.xsave_area = (struct xsa

[PATCH v3 0/2] x86: FPU handling cleanup

2024-08-13 Thread Alejandro Vallejo
v2: 
https://lore.kernel.org/xen-devel/20240808134150.29927-1-alejandro.vall...@cloud.com/T/#t
v2 -> v3: Cosmetic changes and wiped big comment about missing data in the
  migration stream. Details in each patch.

v1: 
https://lore.kernel.org/xen-devel/cover.1720538832.git.alejandro.vall...@cloud.com/T/#t
v1 -> v2: v1/patch1 and v1/patch2 are already in staging.

=== Original cover letter =
I want to eventually reach a position in which the FPU state can be allocated
from the domheap and hidden via the same core mechanism proposed in Elias'
directmap removal series. Doing so is complicated by the presence of 2 aliased
pointers (v->arch.fpu_ctxt and v->arch.xsave_area) and the rather complicated
semantics of vcpu_setup_fpu(). This series tries to simplify the code so moving
to a "map/modify/unmap" model is more tractable.

Patches 1 and 2 are trivial refactors.

Patch 3 unifies FPU state so an XSAVE area is allocated per vCPU regardless of
the host supporting it or not. The rationale is that the memory savings are
negligible and not worth the extra complexity.

Patch 4 is a non-trivial split of the vcpu_setup_fpu() into 2 separate
functions. One to override x87/SSE state, and another to set a reset state.
=======

Alejandro Vallejo (2):
  x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu
  x86/fpu: Split fpu_setup_fpu() in three

 xen/arch/x86/domain.c |   7 +-
 xen/arch/x86/domctl.c |   6 +-
 xen/arch/x86/hvm/emulate.c|   4 +-
 xen/arch/x86/hvm/hvm.c|  18 +++---
 xen/arch/x86/i387.c   | 103 ++
 xen/arch/x86/include/asm/domain.h |   8 +--
 xen/arch/x86/include/asm/i387.h   |  28 ++--
 xen/arch/x86/include/asm/xstate.h |   1 +
 xen/arch/x86/x86_emulate/blk.c|   3 +-
 xen/arch/x86/xstate.c |  13 +++-
 10 files changed, 95 insertions(+), 96 deletions(-)

-- 
2.45.2




[PATCH v3 1/2] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-08-13 Thread Alejandro Vallejo
fpu_ctxt is either a pointer to the legacy x87/SSE save area (used by FXSAVE) or
a pointer aliased with xsave_area that points to its fpu_sse subfield. Such
subfield is at the base and is identical in size and layout to the legacy
buffer.

This patch merges the 2 pointers in the arch_vcpu into a single XSAVE area. In
the very rare case in which the host doesn't support XSAVE all we're doing is
wasting a tiny amount of memory and trading those for a lot more simplicity in
the code.

Signed-off-by: Alejandro Vallejo 
---
v3:
  * Reverse memcpy() and BUILD_BUG_ON()
  * Use sizeof(arg) rather than sizeof(type)
---
 xen/arch/x86/domctl.c |  6 -
 xen/arch/x86/hvm/emulate.c|  4 +--
 xen/arch/x86/hvm/hvm.c|  6 -
 xen/arch/x86/i387.c   | 45 +--
 xen/arch/x86/include/asm/domain.h |  8 +++---
 xen/arch/x86/x86_emulate/blk.c|  3 ++-
 xen/arch/x86/xstate.c | 13 ++---
 7 files changed, 34 insertions(+), 51 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 68b5b46d1a83..63fbceac0911 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1344,7 +1344,11 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
 #define c(fld) (c.nat->fld)
 #endif
 
-memcpy(&c.nat->fpu_ctxt, v->arch.fpu_ctxt, sizeof(c.nat->fpu_ctxt));
+BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) !=
+ sizeof(v->arch.xsave_area->fpu_sse));
+memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
+   sizeof(c.nat->fpu_ctxt));
+
 if ( is_pv_domain(d) )
 c(flags = v->arch.pv.vgc_flags & ~(VGCF_i387_valid|VGCF_in_kernel));
 else
diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index feb4792cc567..03020542c3ba 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2363,7 +2363,7 @@ static int cf_check hvmemul_get_fpu(
 alternative_vcall(hvm_funcs.fpu_dirty_intercept);
 else if ( type == X86EMUL_FPU_fpu )
 {
-const fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
 
 /*
  * Latch current register state so that we can back out changes
@@ -2403,7 +2403,7 @@ static void cf_check hvmemul_put_fpu(
 
 if ( aux )
 {
-fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
 bool dval = aux->dval;
 int mode = hvm_guest_x86_mode(curr);
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index f49e29faf753..76bbb645b77a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -916,7 +916,11 @@ static int cf_check hvm_save_cpu_ctxt(struct vcpu *v, hvm_domain_context_t *h)
 
 if ( v->fpu_initialised )
 {
-memcpy(ctxt.fpu_regs, v->arch.fpu_ctxt, sizeof(ctxt.fpu_regs));
+BUILD_BUG_ON(sizeof(ctxt.fpu_regs) !=
+ sizeof(v->arch.xsave_area->fpu_sse));
+memcpy(ctxt.fpu_regs, &v->arch.xsave_area->fpu_sse,
+   sizeof(ctxt.fpu_regs));
+
 ctxt.flags = XEN_X86_FPU_INITIALISED;
 }
 
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 134e0bece519..fbb9d3584a3d 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -39,7 +39,7 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
 /* Restore x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxrstor(struct vcpu *v)
 {
-const fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 
 /*
  * Some CPUs don't save/restore FDP/FIP/FOP unless an exception
@@ -151,7 +151,7 @@ static inline void fpu_xsave(struct vcpu *v)
 /* Save x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxsave(struct vcpu *v)
 {
-fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
 
 if ( fip_width != 4 )
@@ -212,7 +212,7 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool need_stts)
  * above) we also need to restore full state, to prevent subsequently
  * saving state belonging to another vCPU.
  */
-if ( v->arch.fully_eager_fpu || (v->arch.xsave_area && xstate_all(v)) )
+if ( v->arch.fully_eager_fpu || xstate_all(v) )
 {
 if ( cpu_has_xsave )
 fpu_xrstor(v, XSTATE_ALL);
@@ -299,44 +299,14 @@ void save_fpu_enable(void)
 /* Initialize FPU's context save area */
 int vcpu_init_fpu(struct vcpu *v)
 {
-int rc;
-
 v->arch.fully_eager_fpu = opt_eager_fpu;
-
-if ( (rc = xstate_alloc_save_area(v)) != 0 )
-return rc;
-
-if ( v->arch.xsave_area )
-v->arch.fpu_ctxt = &

Re: [PATCH v2 2/2] x86/fpu: Split fpu_setup_fpu() in two

2024-08-13 Thread Alejandro Vallejo
On Mon Aug 12, 2024 at 4:23 PM BST, Jan Beulich wrote:
> On 08.08.2024 15:41, Alejandro Vallejo wrote:
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -1164,10 +1164,25 @@ static int cf_check hvm_load_cpu_ctxt(struct domain *d, hvm_domain_context_t *h)
> >  seg.attr = ctxt.ldtr_arbytes;
> >  hvm_set_segment_register(v, x86_seg_ldtr, &seg);
> >  
> > -/* Cover xsave-absent save file restoration on xsave-capable host. */
> > -vcpu_setup_fpu(v, xsave_enabled(v) ? NULL : v->arch.xsave_area,
> > -   ctxt.flags & XEN_X86_FPU_INITIALISED ? ctxt.fpu_regs : NULL,
> > -   FCW_RESET);
> > +/*
> > + * On Xen 4.1 and later the FPU state is restored on later HVM context in
> > + * the migrate stream, so what we're doing here is initialising the FPU
> > + * state for guests from even older versions of Xen.
> > + *
> > + * In particular:
> > + *   1. If there's an XSAVE context later in the stream what we do here for
> > + *  the FPU doesn't matter because it'll be overridden later.
> > + *   2. If there isn't and the guest didn't use extended states it's still
> > + *  fine because we have all the information we need here.
> > + *   3. If there isn't and the guest DID use extended states (could've
> > + *  happened prior to Xen 4.1) then we're in a pickle because we have
> > + *  to make up non-existing state. For this case we initialise the FPU
> > + *  as using x87/SSE only because the rest of the state is gone.
>
> Was this really possible to happen? Guests wouldn't have been able to
> turn on CR4.OSXSAVE, would they?
>
> Jan

You may be right, but my reading of the comment and the code was that
xsave_enabled(v) might be set and the XSAVE hvm context might be missing in the
stream. The archives didn't shed a lot more light than what the code already
gives away.

Otherwise it would've been far simpler to unconditionally pass
v->arch.xsave_area to the second parameter and let the xsave area be
overridden by the follow-up HVM context with its actual state.

If my understanding is wrong, I'm happy to remove (3), as I don't think it
affects the code anyway. I thought however that it was a relevant data point
to leave a paper trail for.

Cheers,
Alejandro



Re: [PATCH v2 1/2] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-08-13 Thread Alejandro Vallejo
On Mon Aug 12, 2024 at 4:16 PM BST, Jan Beulich wrote:
> On 08.08.2024 15:41, Alejandro Vallejo wrote:
> > --- a/xen/arch/x86/domctl.c
> > +++ b/xen/arch/x86/domctl.c
> > @@ -1344,7 +1344,10 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
> >  #define c(fld) (c.nat->fld)
> >  #endif
> >  
> > -memcpy(&c.nat->fpu_ctxt, v->arch.fpu_ctxt, sizeof(c.nat->fpu_ctxt));
> > +memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
> > +   sizeof(c.nat->fpu_ctxt));
> > +BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) != sizeof(fpusse_t));
>
> While it may seem unlikely that it would change going forward, I think
> that such build-time checks should make no implications at all. I.e.
> here the right side ought to be sizeof(v->arch.xsave_area->fpu_sse)
> even if that's longer.

Sounds sensible.

>
> Personally I also think that BUILD_BUG_ON(), just like BUG_ON(), would
> better live ahead of the construct they're for.
>
> Same again in at least one more place.
>
> Jan

Ack, sure.
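
For concreteness, a hedged sketch of the adjusted shape (check ahead of the
construct, sized against the source field, as Jan asks for above):

    BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) !=
                 sizeof(v->arch.xsave_area->fpu_sse));
    memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
           sizeof(c.nat->fpu_ctxt));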

Cheers,
Alejandro



Re: [PATCH] x86: slightly simplify MB2/EFI "magic" check

2024-08-12 Thread Alejandro Vallejo
Hi,

On Mon Aug 12, 2024 at 3:43 PM BST, Jan Beulich wrote:
> On 12.08.2024 16:34, Alejandro Vallejo wrote:
> > On Thu Aug 8, 2024 at 9:49 AM BST, Jan Beulich wrote:
> >> A few dozen lines down from here we repeatedly use a pattern involving
> >> just a single (conditional) branch. Do so also when checking for the
> >> boot loader magic value.
> >>
> >> Signed-off-by: Jan Beulich 
> >> ---
> >> I further question the placement of the clearing of vga_text_buffer,
> >> just out of context: Shouldn't that be placed with the increments of
> >> efi_platform and skip_realmode? Or else is the terminology in comments
> >> ("on EFI platforms") wrong in one of the two places? In the end, if we
> >> are entered at __efi64_mb2_start but the magic doesn't match, we simply
> >> don't know what environment we're in. There may or may not be a VGA
> >> console at the default address, so we may as well (try to) write to it
> >> (just like we do when entered at start).
> > 
> > It's fair to assume we're in 64bits, and in that situation it's also fair to
> > assume the text console is long gone (pun intended).
>
> How is being in 64-bit mode correlated with there being a VGA console?
> (I question "fair to assume" here anyway. We're on a path where we know
> something's wonky.)

The only way in which you could plausibly have a text-mode console in 64bits is
if you booted from BIOS and didn't set a VESA mode, so it boils down to what
failure modes you want to consider. For "anything goes" you're right that we
can't even be sure of being in 64bit (or 32bit) mode, but that's too draconian
an assumption to try to uphold, imo. I think that while details in the boot
protocol might be incorrect (like the magic tag), broad strokes (like being in
long mode and having a UEFI runtime) must still hold. Trying to use the serial
is fine (worst-case scenario it doesn't work), but trying to use a framebuffer
you're not sure about may well triple-fault your machine prematurely,
and then debugging it is a pain even with an emulator.

>
> >> --- a/xen/arch/x86/boot/head.S
> >> +++ b/xen/arch/x86/boot/head.S
> >> @@ -233,13 +233,11 @@ __efi64_mb2_start:
> >>  
> >>  /* Check for Multiboot2 bootloader. */
> >>  cmp $MULTIBOOT2_BOOTLOADER_MAGIC,%eax
> >> -je  .Lefi_multiboot2_proto
> >>  
> >>  /* Jump to .Lnot_multiboot after switching CPU to x86_32 mode. */
> >>  lea .Lnot_multiboot(%rip), %r15
> > 
> > I don't think there's much benefit to this, but it would read more 
> > naturally if
> > lea was before cmp. Then cmp would be next to its (new) associated jne.
>
> You did look at the pattern though that I'm referring to in the description?
> Knowing that generally paring the CMP/TEST with the Jcc, I would have
> switched things around. Yet I wanted to make things as similar as possible,
> in the hope that be(com)ing consistent would make it easiest to get such a
> minor change in.
>
> Jan

I'm not sure about the pattern you mention. Seems like a standard set of
doX+check+cond_jump finished in an unconditional jump. All of them pretty
normal.

Regardless, I'm not arguing against this. While I happen to find it easier to
mentally parse it in its current form, we do save a jump instruction after your
change. It's just that it'd be easier to follow with the mentioned reversal of
lea and cmp.

Cheers,
Alejandro



Re: [PATCH] x86: slightly simplify MB2/EFI "magic" check

2024-08-12 Thread Alejandro Vallejo
Hi,

On Thu Aug 8, 2024 at 9:49 AM BST, Jan Beulich wrote:
> A few dozen lines down from here we repeatedly use a pattern involving
> just a single (conditional) branch. Do so also when checking for the
> boot loader magic value.
>
> Signed-off-by: Jan Beulich 
> ---
> I further question the placement of the clearing of vga_text_buffer,
> just out of context: Shouldn't that be placed with the increments of
> efi_platform and skip_realmode? Or else is the terminology in comments
> ("on EFI platforms") wrong in one of the two places? In the end, if we
> are entered at __efi64_mb2_start but the magic doesn't match, we simply
> don't know what environment we're in. There may or may not be a VGA
> console at the default address, so we may as well (try to) write to it
> (just like we do when entered at start).

It's fair to assume we're in 64bits, and in that situation it's also fair to
assume the text console is long gone (pun intended). Seeing how this would be a
boot protocol bug, I think the most reasonable thing to do is to leave a poison
value in RAX and then hang (deadbeef..., badcafe..., take a pick). That would
point the poor sod debugging this to the right part of the code.

Though this is a largely theoretical issue that I can't see happening in
practice.

>
> --- a/xen/arch/x86/boot/head.S
> +++ b/xen/arch/x86/boot/head.S
> @@ -233,13 +233,11 @@ __efi64_mb2_start:
>  
>  /* Check for Multiboot2 bootloader. */
>  cmp $MULTIBOOT2_BOOTLOADER_MAGIC,%eax
> -je  .Lefi_multiboot2_proto
>  
>  /* Jump to .Lnot_multiboot after switching CPU to x86_32 mode. */
>  lea .Lnot_multiboot(%rip), %r15

I don't think there's much benefit to this, but it would read more naturally if
lea was before cmp. Then cmp would be next to its (new) associated jne.

> -jmp x86_32_switch
> +jne x86_32_switch
>  
> -.Lefi_multiboot2_proto:
>  /* Zero EFI SystemTable, EFI ImageHandle addresses and cmdline. */
>  xor %esi,%esi
>  xor %edi,%edi

Cheers,
Alejandro



Re: [RFC PATCH] xen: Remove -Wdeclaration-after-statement

2024-08-12 Thread Alejandro Vallejo
On Fri Aug 9, 2024 at 8:25 PM BST, Stefano Stabellini wrote:
> Adding Roberto
>
> Does MISRA have a view on this? I seem to remember this is discouraged?
>

I'd be surprised if MISRA didn't promote declaring close to first use to avoid
use-before-init, but you very clearly have a lot more exposure to it than I do.

I'm quite curious about what its preference is and the rationale for it.

>
> On Fri, 9 Aug 2024, Alejandro Vallejo wrote:
> > This warning only makes sense when developing using a compiler with C99
> > support on a codebase meant to be built with C89 compilers too, and
> > that's no longer the case (nor should it be, as it's been 25 years since
> > C99 came out already).
> > 
> > Signed-off-by: Alejandro Vallejo 
> > ---
> > Yes, I'm opening this can of worms. I'd like to hear other people's
> > thoughts on this and whether this is something MISRA has views on. If
> > there's an ulterior non-obvious reason besides stylistic preference I
> > think it should be documented somewhere, but I haven't seen such an
> > explanation.
> > 
> > IMO, the presence of this warning causes several undesirable effects:
> > 
> >   1. Small functions are hampered by the preclusion of check+declare
> >  patterns that improve readability via concision. e.g: Consider a
> >  silly example like:
> > 
> >      /* with warning */                /* without warning */
> >      void foo(uint8_t *p)              void foo(uint8_t *p)
> >      {                                 {
> >          uint8_t  tmp1;                    if ( !p )
> >          uint16_t tmp2;                        return;
> >          uint32_t tmp3;
> >                                            uint8_t  tmp1 = OFFSET1 + *p;
> >          if ( !p )                         uint16_t tmp2 = OFFSET2 + *p;
> >              return;                       uint32_t tmp3 = OFFSET3 + *p;
> > 
> >          tmp1 = OFFSET1 + *p;              /* Lots of uses of `tmpX` */
> >          tmp2 = OFFSET2 + *p;          }
> >          tmp3 = OFFSET3 + *p;
> > 
> >          /* Lots of uses of tmpX */
> >      }
> > 
> >   2. Promotes scope-creep. On small functions it doesn't matter much,
> >  but on bigger ones to prevent declaring halfway through the body
> >  needlessly increases variable scope to the full scope in which they
> >  are defined rather than the subscope of point-of-declaration to
> >  end-of-current-scope. In cases in which they can be trivially
> >  defined at that point, it also means they can be trivially misused
> >  before they are meant to. i.e: On the example in (1) assume the
> >  conditional in "with warning" is actually a large switch statement.
> > 
> >   3. It facilitates a disconnect between time-of-declaration and
> >  time-of-use that lead very easily to "use-before-init" bugs.
> >  While a modern compiler can alleviate the most egregious cases of
> >  this, there's cases it simply cannot cover. A conditional
> >  initialization on anything with external linkage combined with a
> >  conditional use on something else with external linkage will squash
> >  the warning of using an uninitialised variable. Things are worse
> >  where the variable in question is preinitialised to something
> >  credible (e.g: a pointer to NULL), as then that can be misused
> >  between its declaration and its original point of intended use.
> > 
> > So... thoughts? yay or nay?
>
> In my opinion, there are some instances where mixing declarations and
> statements would enhance the code, but these are uncommon. Given the
> choice between:
>
> 1) declarations always first
> 2) declarations always mixed with statements
>
> I would choose 1).

FWIW, so would I under those constraints. But let me at least try to persuade
you. There are at least two more arguments to weigh:

  1. It wasn't that long ago that we had to resort to GNU extensions to work
 around this restriction. It's mildly annoying having to play games with
 compiler extensions because we're purposely restricting our use of the
 language.

 See the clang codegen workaround:
https://lore.kernel.org/xen-devel/d2zm0d609toq.2gqqwr1qal...@cloud.com/

  2. There's an existing divide between toolstack and hypervisor. Toolstack
     already allows this kind of mixing, and it's hard not to due to external
 dependencies. While style doesn't have to ma

Re: [XEN PATCH v1 1/2] x86/intel: optional build of intel.c

2024-08-12 Thread Alejandro Vallejo
On Mon Aug 12, 2024 at 10:58 AM BST, Jan Beulich wrote:
> On 12.08.2024 11:40, Sergiy Kibrik wrote:
> > 09.08.24 13:36, Alejandro Vallejo:
> >> On Fri Aug 9, 2024 at 11:09 AM BST, Sergiy Kibrik wrote:
> >>> --- a/xen/arch/x86/cpu/Makefile
> >>> +++ b/xen/arch/x86/cpu/Makefile
> >>> @@ -6,10 +6,10 @@ obj-y += amd.o
> >>>   obj-y += centaur.o
> >>>   obj-y += common.o
> >>>   obj-y += hygon.o
> >>> -obj-y += intel.o
> >>> -obj-y += intel_cacheinfo.o
> >>> +obj-$(CONFIG_INTEL) += intel.o
> >>> +obj-$(CONFIG_INTEL) += intel_cacheinfo.o
> >>>   obj-y += mwait-idle.o
> >>> -obj-y += shanghai.o
> >>> +obj-$(CONFIG_INTEL) += shanghai.o
> >>
> >> Why pick this one too? It's based on VIA IP, aiui.
> > 
> > shanghai.c and intel.c both use the init_intel_cacheinfo() routine, so 
> > there's a build dependency on Intel code.

My point is that the use of Intel functions on Shanghai and not Centaur is
accidental. If shanghai goes under Intel, so should Centaur (imo).

>
> Yet Shanghai isn't as directly a clone of Intel CPUs as Hygon ones are
> for AMD. So at the very least you want to justify your choice in the
> description. After all there's also the alternative of having a separate
> SHANGHAI Kconfig setting, which would merely have "select INTEL" or
> "depends on INTEL".
>
> Jan

That's one option; another is for the Kconfig options to explicitly state which
vendors they apply to. I'd be fine with either. It's less fine for CONFIG_INTEL
to cover a VIA derivative and not the other.

Cheers,
Alejandro



[RFC PATCH] xen: Remove -Wdeclaration-after-statement

2024-08-09 Thread Alejandro Vallejo
This warning only makes sense when developing using a compiler with C99
support on a codebase meant to be built with C89 compilers too, and
that's no longer the case (nor should it be, as it's been 25 years since
C99 came out already).

Signed-off-by: Alejandro Vallejo 
---
Yes, I'm opening this can of worms. I'd like to hear other people's
thoughts on this and whether this is something MISRA has views on. If
there's an ulterior non-obvious reason besides stylistic preference I
think it should be documented somewhere, but I haven't seen such an
explanation.

IMO, the presence of this warning causes several undesirable effects:

  1. Small functions are hampered by the preclusion of check+declare
 patterns that improve readability via concision. e.g: Consider a
 silly example like:

     /* with warning */                /* without warning */
     void foo(uint8_t *p)              void foo(uint8_t *p)
     {                                 {
         uint8_t  tmp1;                    if ( !p )
         uint16_t tmp2;                        return;
         uint32_t tmp3;
                                           uint8_t  tmp1 = OFFSET1 + *p;
         if ( !p )                         uint16_t tmp2 = OFFSET2 + *p;
             return;                       uint32_t tmp3 = OFFSET3 + *p;

         tmp1 = OFFSET1 + *p;              /* Lots of uses of `tmpX` */
         tmp2 = OFFSET2 + *p;          }
         tmp3 = OFFSET3 + *p;

         /* Lots of uses of tmpX */
     }

  2. Promotes scope-creep. On small functions it doesn't matter much,
     but on bigger ones preventing declarations halfway through the body
 needlessly increases variable scope to the full scope in which they
 are defined rather than the subscope of point-of-declaration to
 end-of-current-scope. In cases in which they can be trivially
 defined at that point, it also means they can be trivially misused
 before they are meant to. i.e: On the example in (1) assume the
 conditional in "with warning" is actually a large switch statement.

  3. It facilitates a disconnect between time-of-declaration and
     time-of-use that leads very easily to "use-before-init" bugs.
     While a modern compiler can alleviate the most egregious cases of
     this, there are cases it simply cannot cover. A conditional
     initialization on anything with external linkage combined with a
     conditional use on something else with external linkage will squash
     the warning of using an uninitialised variable. Things are worse
     where the variable in question is preinitialised to something
     credible (e.g: a pointer to NULL), as then that can be misused
     between its declaration and its original point of intended use
     (see the sketch below).
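
As a hedged illustration of (3), not from the tree; cond_a(), cond_b() and
lookup() are hypothetical functions with external linkage:

    #include <stdbool.h>
    #include <stdio.h>

    bool cond_a(void);      /* defined in another TU */
    bool cond_b(void);      /* defined in another TU */
    int *lookup(void);

    void demo(void)
    {
        int *p = NULL;      /* "credible" pre-initialisation */

        if ( cond_a() )
            p = lookup();   /* conditional initialisation */

        /* ...a long stretch of unrelated code... */

        if ( cond_b() )         /* compiler can't correlate this with cond_a() */
            printf("%d\n", *p); /* latent NULL deref; -Wuninitialized is silent */
    }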

So... thoughts? yay or nay?
---
 xen/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/Makefile b/xen/Makefile
index 2e1a925c8417..288b7ac8bb2d 100644
--- a/xen/Makefile
+++ b/xen/Makefile
@@ -394,7 +394,7 @@ CFLAGS-$(CONFIG_CC_SPLIT_SECTIONS) += -ffunction-sections 
-fdata-sections
 
 CFLAGS += -nostdinc -fno-builtin -fno-common
 CFLAGS += -Werror -Wredundant-decls -Wwrite-strings -Wno-pointer-arith
-CFLAGS += -Wdeclaration-after-statement -Wuninitialized
+CFLAGS += -Wuninitialized
 $(call cc-option-add,CFLAGS,CC,-Wvla)
 $(call cc-option-add,CFLAGS,CC,-Wflex-array-member-not-at-end)
 $(call cc-option-add,CFLAGS,CC,-Winit-self)
-- 
2.45.2




Re: [XEN PATCH v1 1/2] x86/intel: optional build of intel.c

2024-08-09 Thread Alejandro Vallejo
On Fri Aug 9, 2024 at 11:09 AM BST, Sergiy Kibrik wrote:
> With specific config option INTEL in place and most of the code that depends
> on intel.c now can be optionally enabled/disabled it's now possible to put
> the whole intel.c under INTEL option as well. This will allow for a Xen build
> without Intel CPU support.
>
> Signed-off-by: Sergiy Kibrik 
> ---
>  xen/arch/x86/cpu/Makefile| 6 +++---
>  xen/arch/x86/cpu/common.c| 4 +++-
>  xen/arch/x86/include/asm/processor.h | 7 ---
>  3 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/xen/arch/x86/cpu/Makefile b/xen/arch/x86/cpu/Makefile
> index eafce5f204..020c86bda3 100644
> --- a/xen/arch/x86/cpu/Makefile
> +++ b/xen/arch/x86/cpu/Makefile
> @@ -6,10 +6,10 @@ obj-y += amd.o
>  obj-y += centaur.o
>  obj-y += common.o
>  obj-y += hygon.o
> -obj-y += intel.o
> -obj-y += intel_cacheinfo.o
> +obj-$(CONFIG_INTEL) += intel.o
> +obj-$(CONFIG_INTEL) += intel_cacheinfo.o
>  obj-y += mwait-idle.o
> -obj-y += shanghai.o
> +obj-$(CONFIG_INTEL) += shanghai.o

Why pick this one too? It's based on VIA IP, aiui.

>  obj-y += vpmu.o
>  obj-$(CONFIG_AMD) += vpmu_amd.o
>  obj-$(CONFIG_INTEL) += vpmu_intel.o
> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
> index ff4cd22897..50ce13f81c 100644
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -336,11 +336,13 @@ void __init early_cpu_init(bool verbose)
>  
>   c->x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
>   switch (c->x86_vendor) {
> +#ifdef CONFIG_INTEL
>   case X86_VENDOR_INTEL:intel_unlock_cpuid_leaves(c);
> actual_cpu = intel_cpu_dev;break;
> + case X86_VENDOR_SHANGHAI: actual_cpu = shanghai_cpu_dev; break;
> +#endif
>   case X86_VENDOR_AMD:  actual_cpu = amd_cpu_dev;  break;
>   case X86_VENDOR_CENTAUR:  actual_cpu = centaur_cpu_dev;  break;
> - case X86_VENDOR_SHANGHAI: actual_cpu = shanghai_cpu_dev; break;
>   case X86_VENDOR_HYGON:actual_cpu = hygon_cpu_dev;break;
>   default:
>   actual_cpu = default_cpu;
> diff --git a/xen/arch/x86/include/asm/processor.h 
> b/xen/arch/x86/include/asm/processor.h
> index 66463f6a6d..a88d45252b 100644
> --- a/xen/arch/x86/include/asm/processor.h
> +++ b/xen/arch/x86/include/asm/processor.h
> @@ -507,15 +507,16 @@ static inline uint8_t get_cpu_family(uint32_t raw, 
> uint8_t *model,
>  extern int8_t opt_tsx;
>  extern bool rtm_disabled;
>  void tsx_init(void);
> +void update_mcu_opt_ctrl(void);
> +void set_in_mcu_opt_ctrl(uint32_t mask, uint32_t val);
>  #else
>  #define opt_tsx  0 /* explicitly indicate TSX is off */
>  #define rtm_disabled false /* RTM was not force-disabled */
>  static inline void tsx_init(void) {}
> +static inline void update_mcu_opt_ctrl(void) {}
> +static inline void set_in_mcu_opt_ctrl(uint32_t mask, uint32_t val) {}
>  #endif
>  
> -void update_mcu_opt_ctrl(void);
> -void set_in_mcu_opt_ctrl(uint32_t mask, uint32_t val);
> -
>  enum ap_boot_method {
>  AP_BOOT_NORMAL,
>  AP_BOOT_SKINIT,


Cheers,
Alejandro



Re: xen | Failed pipeline for staging | 08aacc39

2024-08-08 Thread Alejandro Vallejo
On Thu Aug 8, 2024 at 2:02 PM BST, Jan Beulich wrote:
> On 08.08.2024 14:43, GitLab wrote:
> > 
> > 
> > Pipeline #1405649318 has failed!
> > 
> > Project: xen ( https://gitlab.com/xen-project/hardware/xen )
> > Branch: staging ( 
> > https://gitlab.com/xen-project/hardware/xen/-/commits/staging )
> > 
> > Commit: 08aacc39 ( 
> > https://gitlab.com/xen-project/hardware/xen/-/commit/08aacc392d86d4c7dbebdb5e664060ae2af72057
> >  )
> > Commit Message: x86/emul: Fix misaligned IO breakpoint behaviou...
> > Commit Author: Matthew Barnes
> > Committed by: Jan Beulich ( https://gitlab.com/jbeulich )
> > 
> > 
> > Pipeline #1405649318 ( 
> > https://gitlab.com/xen-project/hardware/xen/-/pipelines/1405649318 ) 
> > triggered by Jan Beulich ( https://gitlab.com/jbeulich )
> > had 4 failed jobs.
> > 
> > Job #7535428747 ( 
> > https://gitlab.com/xen-project/hardware/xen/-/jobs/7535428747/raw )
> > 
> > Stage: build
> > Name: qemu-system-aarch64-6.0.0-arm64-export
> > Job #7535428873 ( 
> > https://gitlab.com/xen-project/hardware/xen/-/jobs/7535428873/raw )
> > 
> > Stage: build
> > Name: alpine-3.18-gcc-debug-arm64-static-shared-mem
> > Job #7535428869 ( 
> > https://gitlab.com/xen-project/hardware/xen/-/jobs/7535428869/raw )
> > 
> > Stage: build
> > Name: alpine-3.18-gcc-debug-arm64-staticmem
> > Job #7535429434 ( 
> > https://gitlab.com/xen-project/hardware/xen/-/jobs/7535429434/raw )
> > 
> > Stage: test
> > Name: qemu-smoke-dom0less-arm32-gcc
>
> All Arm failures when the three commits under test only touch x86 code.
> How can that be? And Stefano, note how this would needlessly have blocked
> a merge request, if we were already using that model you're proposing to
> switch to.
>
> Jan

I'd argue it the other way around. It would (may?) have prevented reaching that
situation in the first place.

Cheers,
Alejandro



Re: [PATCH v5 01/10] tools/hvmloader: Fix non-deterministic cpuid()

2024-08-08 Thread Alejandro Vallejo
On Thu Aug 8, 2024 at 3:10 PM BST, Jan Beulich wrote:
> On 08.08.2024 15:42, Alejandro Vallejo wrote:
> > hvmloader's cpuid() implementation deviates from Xen's in that the value
> > passed on ecx is unspecified. This means that when used on leaves that
> > implement subleaves it's unspecified which one you get; though it's more
> > than likely an invalid one.
> > 
> > Import Xen's implementation so there are no surprises.
> > 
> > Fixes: 318ac791f9f9 ("Add utilities needed for SMBIOS generation to
> > hvmloader")
> > Signed-off-by: Alejandro Vallejo 
>
> Reviewed-by: Jan Beulich 
>
> Minor remark: A Fixes: tag wants to go all on a single line.

Noted for next time.

>
> > --- a/tools/firmware/hvmloader/util.c
> > +++ b/tools/firmware/hvmloader/util.c
> > @@ -267,15 +267,6 @@ memcmp(const void *s1, const void *s2, unsigned n)
> >  return 0;
> >  }
> >  
> > -void
> > -cpuid(uint32_t idx, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t 
> > *edx)
> > -{
> > -asm volatile (
> > -"cpuid"
> > -: "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
> > -: "0" (idx) );
>
> Compared to the original ...
>
> > --- a/tools/firmware/hvmloader/util.h
> > +++ b/tools/firmware/hvmloader/util.h
> > @@ -184,9 +184,30 @@ int uart_exists(uint16_t uart_base);
> >  int lpt_exists(uint16_t lpt_base);
> >  int hpet_exists(unsigned long hpet_base);
> >  
> > -/* Do cpuid instruction, with operation 'idx' */
> > -void cpuid(uint32_t idx, uint32_t *eax, uint32_t *ebx,
> > -   uint32_t *ecx, uint32_t *edx);
> > +/* Some CPUID calls want 'count' to be placed in ecx */
> > +static inline void cpuid_count(
> > +uint32_t leaf,
> > +uint32_t subleaf,
> > +uint32_t *eax,
> > +uint32_t *ebx,
> > +uint32_t *ecx,
> > +uint32_t *edx)
> > +{
> > +asm volatile ( "cpuid"
> > +  : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
> > +  : "a" (leaf), "c" (subleaf) );
>
> ... you alter indentation, without it becoming clear why you do so. Imo
> there are only two ways of indenting this which are conforming to our
> style - either as it was (secondary lines indented by one more level,
> i.e. 4 more spaces) or
>
> asm volatile ( "cpuid"
>: "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
>: "a" (leaf), "c" (subleaf) );
>
> I guess I'll take the liberty and adjust while committing.
>
> Jan

Sure, I don't mind about that.

As for the indentation difference, the inline assembly is taken quasi-verbatim
from arch/x86/include/asm/processor.h. That one happens to have this
indentation.

Cheers,
Alejandro



[PATCH v5 08/10] xen/x86: Derive topologically correct x2APIC IDs from the policy

2024-08-08 Thread Alejandro Vallejo
Implements the helper for mapping vcpu_id to x2apic_id given a valid
topology in a policy. The algorithm is written with the intention of
extending it to leaf 0x1f and extended leaf 0x26 in the future.

Toolstack doesn't set leaf 0xb and the HVM default policy has it
cleared, so the leaf is not implemented. In that case, the new helper
just returns the legacy mapping.
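
As a hedged sketch (not the code in xen/lib/x86/policy.c below, just the
arithmetic the test vectors encode; smt_shift/core_shift stand in for the
id_shift values of leaf 0xb's SMT and core levels):

    static uint32_t vcpu_to_x2apic_id(uint32_t vcpu_id,
                                      uint32_t threads_per_core,
                                      uint32_t cores_per_pkg,
                                      uint32_t smt_shift,
                                      uint32_t core_shift)
    {
        uint32_t thread = vcpu_id % threads_per_core;
        uint32_t core   = (vcpu_id / threads_per_core) % cores_per_pkg;
        uint32_t pkg    = vcpu_id / (threads_per_core * cores_per_pkg);

        return thread | (core << smt_shift) | (pkg << core_shift);
    }

e.g. vCPU 35 with 3 threads/core and 8 cores/pkg yields
(35 % 3) | (((35 / 3) % 8) << 2) | ((35 / 24) << 5), matching the test
vectors below.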

Signed-off-by: Alejandro Vallejo 
---
v5:
  * No change
---
 tools/tests/cpu-policy/test-cpu-policy.c | 68 +
 xen/include/xen/lib/x86/cpu-policy.h | 11 
 xen/lib/x86/policy.c | 76 
 3 files changed, 155 insertions(+)

diff --git a/tools/tests/cpu-policy/test-cpu-policy.c 
b/tools/tests/cpu-policy/test-cpu-policy.c
index 849d7cebaa7c..e5f9b8f7ee39 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -781,6 +781,73 @@ static void test_topo_from_parts(void)
 }
 }
 
+static void test_x2apic_id_from_vcpu_id_success(void)
+{
+static const struct test {
+unsigned int vcpu_id;
+unsigned int threads_per_core;
+unsigned int cores_per_pkg;
+uint32_t x2apic_id;
+uint8_t x86_vendor;
+} tests[] = {
+{
+.vcpu_id = 3, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 1 << 2,
+},
+{
+.vcpu_id = 6, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 2 << 2,
+},
+{
+.vcpu_id = 24, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = 1 << 5,
+},
+{
+.vcpu_id = 35, .threads_per_core = 3, .cores_per_pkg = 8,
+.x2apic_id = (35 % 3) | (((35 / 3) % 8) << 2) | ((35 / 24) << 5),
+},
+{
+.vcpu_id = 96, .threads_per_core = 7, .cores_per_pkg = 3,
+.x2apic_id = (96 % 7) | (((96 / 7) % 3) << 3) | ((96 / 21) << 5),
+},
+};
+
+const uint8_t vendors[] = {
+X86_VENDOR_INTEL,
+X86_VENDOR_AMD,
+X86_VENDOR_CENTAUR,
+X86_VENDOR_SHANGHAI,
+X86_VENDOR_HYGON,
+};
+
+printf("Testing x2apic id from vcpu id success:\n");
+
+/* Perform the test run on every vendor we know about */
+for ( size_t i = 0; i < ARRAY_SIZE(vendors); ++i )
+{
+for ( size_t j = 0; j < ARRAY_SIZE(tests); ++j )
+{
+struct cpu_policy policy = { .x86_vendor = vendors[i] };
+const struct test *t = &tests[j];
+uint32_t x2apic_id;
+int rc = x86_topo_from_parts(&policy, t->threads_per_core,
+ t->cores_per_pkg);
+
+if ( rc ) {
+fail("FAIL[%d] - 'x86_topo_from_parts() failed", rc);
+continue;
+}
+
+x2apic_id = x86_x2apic_id_from_vcpu_id(&policy, t->vcpu_id);
+if ( x2apic_id != t->x2apic_id )
+fail("FAIL - '%s cpu%u %u t/c %u c/p'. bad x2apic_id: 
expected=%u actual=%u\n",
+ x86_cpuid_vendor_to_str(policy.x86_vendor),
+ t->vcpu_id, t->threads_per_core, t->cores_per_pkg,
+ t->x2apic_id, x2apic_id);
+}
+}
+}
+
 int main(int argc, char **argv)
 {
 printf("CPU Policy unit tests\n");
@@ -799,6 +866,7 @@ int main(int argc, char **argv)
 test_is_compatible_failure();
 
 test_topo_from_parts();
+test_x2apic_id_from_vcpu_id_success();
 
 if ( nr_failures )
 printf("Done: %u failures\n", nr_failures);
diff --git a/xen/include/xen/lib/x86/cpu-policy.h 
b/xen/include/xen/lib/x86/cpu-policy.h
index 116b305a1d7f..6fe19490d290 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -542,6 +542,17 @@ int x86_cpu_policies_are_compatible(const struct 
cpu_policy *host,
 const struct cpu_policy *guest,
 struct cpu_policy_errors *err);
 
+/**
+ * Calculates the x2APIC ID of a vCPU given a CPU policy
+ *
+ * If the policy lacks leaf 0xb, falls back to the legacy mapping of apic_id=cpu*2
+ *
+ * @param p  CPU policy of the domain.
+ * @param id vCPU ID of the vCPU.
+ * @returns x2APIC ID of the vCPU.
+ */
+uint32_t x86_x2apic_id_from_vcpu_id(const struct cpu_policy *p, uint32_t id);
+
 /**
  * Synthesise topology information in `p` given high-level constraints
  *
diff --git a/xen/lib/x86/policy.c b/xen/lib/x86/policy.c
index 72b67b44a893..c52b7192559a 100644
--- a/xen/lib/x86/policy.c
+++ b/xen/lib/x86/policy.c
@@ -2,6 +2,82 @@
 
 #include 
 
+static uint32_t parts_per_higher_scoped_level(const struct cpu_policy *p,
+  size_t lvl)
+{
+/*
+ * `nr_logical` reported by Intel is the n

[PATCH v5 06/10] tools/libguest: Always set vCPU context in vcpu_hvm()

2024-08-08 Thread Alejandro Vallejo
Currently used by PVH to set MTRRs; a later patch will use it to set
APIC state. Unconditionally send the hypercall, and gate overriding the
MTRR so it remains functionally equivalent.

While at it, add a missing "goto out" to what was the error condition
in the loop.

In principle this patch shouldn't affect functionality. An extra record
(the MTRR) is sent to the hypervisor per vCPU on HVM, but these records
are identical to those retrieved in the first place so there's no
expected functional change.

Signed-off-by: Alejandro Vallejo 
---
v5:
  * Ensure MTRRs are only overriden in PVH, and left as-is on HVM
---
 tools/libs/guest/xg_dom_x86.c | 84 ++-
 1 file changed, 44 insertions(+), 40 deletions(-)

diff --git a/tools/libs/guest/xg_dom_x86.c b/tools/libs/guest/xg_dom_x86.c
index cba01384ae75..fafe7acb7e91 100644
--- a/tools/libs/guest/xg_dom_x86.c
+++ b/tools/libs/guest/xg_dom_x86.c
@@ -989,6 +989,7 @@ const static void *hvm_get_save_record(const void *ctx, 
unsigned int type,
 
 static int vcpu_hvm(struct xc_dom_image *dom)
 {
+/* Initialises the BSP */
 struct {
 struct hvm_save_descriptor header_d;
 HVM_SAVE_TYPE(HEADER) header;
@@ -997,6 +998,18 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 struct hvm_save_descriptor end_d;
 HVM_SAVE_TYPE(END) end;
 } bsp_ctx;
+/* Initialises APICs and MTRRs of every vCPU */
+struct {
+struct hvm_save_descriptor header_d;
+HVM_SAVE_TYPE(HEADER) header;
+struct hvm_save_descriptor mtrr_d;
+HVM_SAVE_TYPE(MTRR) mtrr;
+struct hvm_save_descriptor end_d;
+HVM_SAVE_TYPE(END) end;
+} vcpu_ctx;
+/* Context from full_ctx */
+const HVM_SAVE_TYPE(MTRR) *mtrr_record;
+/* Raw context as taken from Xen */
 uint8_t *full_ctx = NULL;
 int rc;
 
@@ -1083,51 +1096,42 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 bsp_ctx.end_d.instance = 0;
 bsp_ctx.end_d.length = HVM_SAVE_LENGTH(END);
 
-/* TODO: maybe this should be a firmware option instead? */
-if ( !dom->device_model )
+/* TODO: maybe setting MTRRs should be a firmware option instead? */
+mtrr_record = hvm_get_save_record(full_ctx, HVM_SAVE_CODE(MTRR), 0);
+
+if ( !mtrr_record )
 {
-struct {
-struct hvm_save_descriptor header_d;
-HVM_SAVE_TYPE(HEADER) header;
-struct hvm_save_descriptor mtrr_d;
-HVM_SAVE_TYPE(MTRR) mtrr;
-struct hvm_save_descriptor end_d;
-HVM_SAVE_TYPE(END) end;
-} mtrr = {
-.header_d = bsp_ctx.header_d,
-.header = bsp_ctx.header,
-.mtrr_d.typecode = HVM_SAVE_CODE(MTRR),
-.mtrr_d.length = HVM_SAVE_LENGTH(MTRR),
-.end_d = bsp_ctx.end_d,
-.end = bsp_ctx.end,
-};
-const HVM_SAVE_TYPE(MTRR) *mtrr_record =
-hvm_get_save_record(full_ctx, HVM_SAVE_CODE(MTRR), 0);
-unsigned int i;
-
-if ( !mtrr_record )
-{
-xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
- "%s: unable to get MTRR save record", __func__);
-goto out;
-}
+xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+ "%s: unable to get MTRR save record", __func__);
+goto out;
+}
 
-memcpy(&mtrr.mtrr, mtrr_record, sizeof(mtrr.mtrr));
+vcpu_ctx.header_d = bsp_ctx.header_d;
+vcpu_ctx.header = bsp_ctx.header;
+vcpu_ctx.mtrr_d.typecode = HVM_SAVE_CODE(MTRR);
+vcpu_ctx.mtrr_d.length = HVM_SAVE_LENGTH(MTRR);
+vcpu_ctx.mtrr = *mtrr_record;
+vcpu_ctx.end_d = bsp_ctx.end_d;
+vcpu_ctx.end = bsp_ctx.end;
 
-/*
- * Enable MTRR, set default type to WB.
- * TODO: add MMIO areas as UC when passthrough is supported.
- */
-mtrr.mtrr.msr_mtrr_def_type = MTRR_TYPE_WRBACK | MTRR_DEF_TYPE_ENABLE;
+/*
+ * Enable MTRR, set default type to WB.
+ * TODO: add MMIO areas as UC when passthrough is supported in PVH
+ */
+if ( !dom->device_model )
+vcpu_ctx.mtrr.msr_mtrr_def_type = MTRR_TYPE_WRBACK | MTRR_DEF_TYPE_ENABLE;
 
-for ( i = 0; i < dom->max_vcpus; i++ )
+for ( unsigned int i = 0; i < dom->max_vcpus; i++ )
+{
+vcpu_ctx.mtrr_d.instance = i;
+
+rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
+  (uint8_t *)&vcpu_ctx, sizeof(vcpu_ctx));
+if ( rc != 0 )
 {
-mtrr.mtrr_d.instance = i;
-rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
-  (uint8_t *)&mtrr, sizeof(mtrr));
-if ( rc != 0 )
-xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
- "%s: SETHVMCONTEXT failed (rc=%d)", __func__, rc);
+   

[PATCH v5 07/10] xen/lib: Add topology generator for x86

2024-08-08 Thread Alejandro Vallejo
Add a helper to populate topology leaves in the cpu policy from
threads/core and cores/package counts. It's unit-tested in
test-cpu-policy.c, but it's not connected to the rest of the code yet.

Adds the ASSERT() macro to xen/lib/x86/private.h, as it was missing.
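
A minimal usage sketch (illustrative; signatures as introduced by this
patch, error handling elided):

    struct cpu_policy p = { .x86_vendor = X86_VENDOR_AMD };
    int rc = x86_topo_from_parts(&p, /* threads_per_core */ 2,
                                 /* cores_per_pkg */ 8);

    /* On success, p.topo.subleaf[0]/[1] describe the SMT and core levels. */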

Signed-off-by: Alejandro Vallejo 
---
v5:
  * No change
---
 tools/tests/cpu-policy/test-cpu-policy.c | 133 +++
 xen/include/xen/lib/x86/cpu-policy.h |  16 +++
 xen/lib/x86/policy.c |  88 +++
 xen/lib/x86/private.h|   4 +
 4 files changed, 241 insertions(+)

diff --git a/tools/tests/cpu-policy/test-cpu-policy.c 
b/tools/tests/cpu-policy/test-cpu-policy.c
index 301df2c00285..849d7cebaa7c 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -650,6 +650,137 @@ static void test_is_compatible_failure(void)
 }
 }
 
+static void test_topo_from_parts(void)
+{
+static const struct test {
+unsigned int threads_per_core;
+unsigned int cores_per_pkg;
+struct cpu_policy policy;
+} tests[] = {
+{
+.threads_per_core = 3, .cores_per_pkg = 1,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 3, .level = 0, .type = 1, .id_shift = 2, },
+{ .nr_logical = 1, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 1, .cores_per_pkg = 3,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 1, .level = 0, .type = 1, .id_shift = 0, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 7, .cores_per_pkg = 5,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 7, .level = 0, .type = 1, .id_shift = 3, },
+{ .nr_logical = 5, .level = 1, .type = 2, .id_shift = 6, },
+},
+},
+},
+{
+.threads_per_core = 2, .cores_per_pkg = 128,
+.policy = {
+.x86_vendor = X86_VENDOR_AMD,
+.topo.subleaf = {
+{ .nr_logical = 2, .level = 0, .type = 1, .id_shift = 1, },
+{ .nr_logical = 128, .level = 1, .type = 2,
+  .id_shift = 8, },
+},
+},
+},
+{
+.threads_per_core = 3, .cores_per_pkg = 1,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 3, .level = 0, .type = 1, .id_shift = 2, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 1, .cores_per_pkg = 3,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 1, .level = 0, .type = 1, .id_shift = 0, },
+{ .nr_logical = 3, .level = 1, .type = 2, .id_shift = 2, },
+},
+},
+},
+{
+.threads_per_core = 7, .cores_per_pkg = 5,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 7, .level = 0, .type = 1, .id_shift = 3, },
+{ .nr_logical = 35, .level = 1, .type = 2, .id_shift = 6, },
+},
+},
+},
+{
+.threads_per_core = 2, .cores_per_pkg = 128,
+.policy = {
+.x86_vendor = X86_VENDOR_INTEL,
+.topo.subleaf = {
+{ .nr_logical = 2, .level = 0, .type = 1, .id_shift = 1, },
+{ .nr_logical = 256, .level = 1, .type = 2,
+  .id_shift = 8, },
+},
+},
+},
+};
+
+printf("Testing topology synthesis from parts:\n");
+
+for ( size_t i = 0; i < ARRAY_SIZE(tests); ++i )
+{
+const struct test *t = &tests[i];
+struct cpu_policy actual = { .x86_vendor = t->policy.x86_vendor };
+int rc = x86_topo_from_parts(&actual, t->threads_per_core,
+ t->cores_per_pkg);
+
+if ( rc || memcmp(&actual.topo, &t->policy.topo, sizeof(actual.topo)) )
+{
+#define TOPO(n, f)  t->policy.topo.subleaf[(n)].f, actual.topo.subleaf[(n)].f
+fail("FAIL[%d] - '%s %u t/c, %u c/p'\n",
+ rc,
+ x86_cpuid_vendor_to_str(t->

[PATCH v5 03/10] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-08-08 Thread Alejandro Vallejo
This allows the initial x2APIC ID to be sent in the migration stream,
enabling further changes to topology and APIC ID assignment without
breaking existing hosts. Given that the vlapic data is zero-extended on
restore, fix up migrations from hosts without the field by setting it to
the old convention if zero.

The hardcoded mapping x2apic_id=2*vcpu_id is kept for the time being,
but it's meant to be overriden by toolstack on a later patch with
appropriate values.

Signed-off-by: Alejandro Vallejo 
---
v5:
  * No change
---
 xen/arch/x86/cpuid.c   | 14 +-
 xen/arch/x86/hvm/vlapic.c  | 22 --
 xen/arch/x86/include/asm/hvm/vlapic.h  |  1 +
 xen/include/public/arch-x86/hvm/save.h |  2 ++
 4 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 2a777436ee27..dcbdeabadce9 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -138,10 +138,9 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
 const struct cpu_user_regs *regs;
 
 case 0x1:
-/* TODO: Rework topology logic. */
 res->b &= 0x00ffu;
 if ( is_hvm_domain(d) )
-res->b |= (v->vcpu_id * 2) << 24;
+res->b |= vlapic_x2apic_id(vcpu_vlapic(v)) << 24;
 
 /* TODO: Rework vPMU control in terms of toolstack choices. */
 if ( vpmu_available(v) &&
@@ -311,18 +310,15 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
 
 case 0xb:
 /*
- * In principle, this leaf is Intel-only.  In practice, it is tightly
- * coupled with x2apic, and we offer an x2apic-capable APIC emulation
- * to guests on AMD hardware as well.
- *
- * TODO: Rework topology logic.
+ * Don't expose topology information to PV guests. Exposed on HVM
+ * along with x2APIC because they are tightly coupled.
  */
-if ( p->basic.x2apic )
+if ( is_hvm_domain(d) && p->basic.x2apic )
 {
 *(uint8_t *)&res->c = subleaf;
 
 /* Fix the x2APIC identifier. */
-res->d = v->vcpu_id * 2;
+res->d = vlapic_x2apic_id(vcpu_vlapic(v));
 }
 break;
 
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 521b98988be9..0e0699fc8279 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1073,7 +1073,7 @@ static uint32_t x2apic_ldr_from_id(uint32_t id)
 static void set_x2apic_id(struct vlapic *vlapic)
 {
 const struct vcpu *v = vlapic_vcpu(vlapic);
-uint32_t apic_id = v->vcpu_id * 2;
+uint32_t apic_id = vlapic->hw.x2apic_id;
 uint32_t apic_ldr = x2apic_ldr_from_id(apic_id);
 
 /*
@@ -1453,7 +1453,7 @@ void vlapic_reset(struct vlapic *vlapic)
 if ( v->vcpu_id == 0 )
 vlapic->hw.apic_base_msr |= APIC_BASE_BSP;
 
-vlapic_set_reg(vlapic, APIC_ID, (v->vcpu_id * 2) << 24);
+vlapic_set_reg(vlapic, APIC_ID, SET_xAPIC_ID(vlapic->hw.x2apic_id));
 vlapic_do_init(vlapic);
 }
 
@@ -1521,6 +1521,16 @@ static void lapic_load_fixup(struct vlapic *vlapic)
 const struct vcpu *v = vlapic_vcpu(vlapic);
 uint32_t good_ldr = x2apic_ldr_from_id(vlapic->loaded.id);
 
+/*
+ * Loading record without hw.x2apic_id in the save stream, calculate using
+ * the traditional "vcpu_id * 2" relation. There's an implicit assumption
+ * that vCPU0 always has x2APIC0, which is true for the old relation, and
+ * still holds under the new x2APIC generation algorithm. While that case
+ * goes through the conditional it's benign because it still maps to zero.
+ */
+if ( !vlapic->hw.x2apic_id )
+vlapic->hw.x2apic_id = v->vcpu_id * 2;
+
 /* Skip fixups on xAPIC mode, or if the x2APIC LDR is already correct */
 if ( !vlapic_x2apic_mode(vlapic) ||
  (vlapic->loaded.ldr == good_ldr) )
@@ -1589,6 +1599,13 @@ static int cf_check lapic_check_hidden(const struct 
domain *d,
  APIC_BASE_EXTD )
 return -EINVAL;
 
+/*
+ * Fail migrations from newer versions of Xen where
+ * rsvd_zero is interpreted as something else.
+ */
+if ( s.rsvd_zero )
+return -EINVAL;
+
 return 0;
 }
 
@@ -1667,6 +1684,7 @@ int vlapic_init(struct vcpu *v)
 }
 
 vlapic->pt.source = PTSRC_lapic;
+vlapic->hw.x2apic_id = 2 * v->vcpu_id;
 
 vlapic->regs_page = alloc_domheap_page(v->domain, MEMF_no_owner);
 if ( !vlapic->regs_page )
diff --git a/xen/arch/x86/include/asm/hvm/vlapic.h 
b/xen/arch/x86/include/asm/hvm/vlapic.h
index 2c4ff94ae7a8..85c4a236b9f6 100644
--- a/xen/arch/x86/include/asm/hvm/vlapic.h
+++ b/xen/arch/x86/include/asm/hvm/vlapic.h
@@ -44,6 +44,7 @@
 #define vlapic_xapic_mode(vlapic)   \
 (!vlapic_hw_disabled(vlapic) && \

[PATCH v5 04/10] xen/x86: Add supporting code for uploading LAPIC contexts during domain create

2024-08-08 Thread Alejandro Vallejo
If toolstack were to upload LAPIC contexts as part of domain creation it
would encounter a problem where the architectural state does not reflect
the APIC ID in the hidden state. This patch ensures updates to the
hidden state trigger an update in the architectural registers so the
APIC ID in both is consistent.

Signed-off-by: Alejandro Vallejo 
---
We could also let toolstack synthesise architectural registers, but
that would be adding logic on how architectural state operates to
software that really shouldn't care. I could be persuaded to do it the
other way, but I think it's going to be messier.

v5:
  * No change
---
 xen/arch/x86/hvm/vlapic.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 0e0699fc8279..3fa839087fe0 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1620,7 +1620,27 @@ static int cf_check lapic_load_hidden(struct domain *d, 
hvm_domain_context_t *h)
 
 s->loaded.hw = 1;
 if ( s->loaded.regs )
+{
+/*
+ * We already processed architectural regs in lapic_load_regs(), so
+ * this must be a migration. Fix up inconsistencies from any older Xen.
+ */
 lapic_load_fixup(s);
+}
+else
+{
+/*
+ * We haven't seen architectural regs so this could be a migration or a
+ * plain domain create. In the domain create case it's fine to modify
+ * the architectural state to align it to the APIC ID that was just
+ * uploaded and in the migrate case it doesn't matter because the
+ * architectural state will be replaced by the LAPIC_REGS ctx later on.
+ */
+if ( vlapic_x2apic_mode(s) )
+set_x2apic_id(s);
+else
+vlapic_set_reg(s, APIC_ID, SET_xAPIC_ID(s->hw.x2apic_id));
+}
 
 hvm_update_vlapic_mode(v);
 
-- 
2.45.2




[PATCH v5 05/10] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-08-08 Thread Alejandro Vallejo
Make it so the APs expose their own APIC IDs in a LUT. We can use that
LUT to populate the MADT, decoupling the algorithm that relates CPU IDs
and APIC IDs from hvmloader.

While at it, also remove ap_callin, as writing the APIC ID may serve the
same purpose.
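
As a hedged sketch of the consumer side (LAPIC_ID() as redefined in
config.h below; madt_lapic is illustrative, not the actual ACPI builder):

    /* Illustrative fragment: each MADT LAPIC entry takes the ID the AP
     * reported about itself. */
    unsigned int i;

    for ( i = 0; i < hvm_info->nr_vcpus; i++ )
        madt_lapic[i].apic_id = LAPIC_ID(i);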

Signed-off-by: Alejandro Vallejo 
---
v5:
  * No change
---
 tools/firmware/hvmloader/config.h   |  6 ++-
 tools/firmware/hvmloader/hvmloader.c|  4 +-
 tools/firmware/hvmloader/smp.c  | 54 -
 tools/include/xen-tools/common-macros.h |  5 +++
 4 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/tools/firmware/hvmloader/config.h 
b/tools/firmware/hvmloader/config.h
index cd716bf39245..213ac1f28e17 100644
--- a/tools/firmware/hvmloader/config.h
+++ b/tools/firmware/hvmloader/config.h
@@ -4,6 +4,8 @@
 #include 
 #include 
 
+#include 
+
 enum virtual_vga { VGA_none, VGA_std, VGA_cirrus, VGA_pt };
 extern enum virtual_vga virtual_vga;
 
@@ -48,8 +50,10 @@ extern uint8_t ioapic_version;
 
 #define IOAPIC_ID   0x01
 
+extern uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
+
 #define LAPIC_BASE_ADDRESS  0xfee0
-#define LAPIC_ID(vcpu_id)   ((vcpu_id) * 2)
+#define LAPIC_ID(vcpu_id)   (CPU_TO_X2APICID[(vcpu_id)])
 
 #define PCI_ISA_DEVFN   0x08/* dev 1, fn 0 */
 #define PCI_ISA_IRQ_MASK0x0c20U /* ISA IRQs 5,10,11 are PCI connected */
diff --git a/tools/firmware/hvmloader/hvmloader.c 
b/tools/firmware/hvmloader/hvmloader.c
index f8af88fabf24..5c02e8fc226a 100644
--- a/tools/firmware/hvmloader/hvmloader.c
+++ b/tools/firmware/hvmloader/hvmloader.c
@@ -341,11 +341,11 @@ int main(void)
 
 printf("CPU speed is %u MHz\n", get_cpu_mhz());
 
+smp_initialise();
+
 apic_setup();
 pci_setup();
 
-smp_initialise();
-
 perform_tests();
 
 if ( bios->bios_info_setup )
diff --git a/tools/firmware/hvmloader/smp.c b/tools/firmware/hvmloader/smp.c
index 1b940cefd071..b0d4da111904 100644
--- a/tools/firmware/hvmloader/smp.c
+++ b/tools/firmware/hvmloader/smp.c
@@ -29,7 +29,34 @@
 
 #include 
 
-static int ap_callin;
+/**
+ * Lookup table of x2APIC IDs.
+ *
+ * Each entry is populated by its respective CPU as it comes online. This is
+ * required for generating the MADT with minimal assumptions about ID
+ * relationships.
+ */
+uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
+
+/** Tristate about x2apic being supported. -1=unknown */
+static int has_x2apic = -1;
+
+static uint32_t read_apic_id(void)
+{
+uint32_t apic_id;
+
+if ( has_x2apic )
+cpuid(0xb, NULL, NULL, NULL, &apic_id);
+else
+{
+cpuid(1, NULL, &apic_id, NULL, NULL);
+apic_id >>= 24;
+}
+
+/* Never called by cpu0, so should never return 0 */
+ASSERT(apic_id);
+
+return apic_id;
+}
 
 static void cpu_setup(unsigned int cpu)
 {
@@ -37,13 +64,17 @@ static void cpu_setup(unsigned int cpu)
 cacheattr_init();
 printf("done.\n");
 
-if ( !cpu ) /* Used on the BSP too */
+/* The BSP exits early because its APIC ID is known to be zero */
+if ( !cpu )
 return;
 
 wmb();
-ap_callin = 1;
+ACCESS_ONCE(CPU_TO_X2APICID[cpu]) = read_apic_id();
 
-/* After this point, the BSP will shut us down. */
+/*
+ * After this point the BSP will shut us down. A write to
+ * CPU_TO_X2APICID[cpu] signals the BSP to bring down `cpu`.
+ */
 
 for ( ;; )
 asm volatile ( "hlt" );
@@ -54,10 +85,6 @@ static void boot_cpu(unsigned int cpu)
 static uint8_t ap_stack[PAGE_SIZE] __attribute__ ((aligned (16)));
 static struct vcpu_hvm_context ap;
 
-/* Initialise shared variables. */
-ap_callin = 0;
-wmb();
-
 /* Wake up the secondary processor */
 ap = (struct vcpu_hvm_context) {
 .mode = VCPU_HVM_MODE_32B,
@@ -90,10 +117,11 @@ static void boot_cpu(unsigned int cpu)
 BUG();
 
 /*
- * Wait for the secondary processor to complete initialisation.
+ * Wait for the secondary processor to complete initialisation,
+ * which is signaled by its x2APIC ID being written to the LUT.
  * Do not touch shared resources meanwhile.
  */
-while ( !ap_callin )
+while ( !ACCESS_ONCE(CPU_TO_X2APICID[cpu]) )
 cpu_relax();
 
 /* Take the secondary processor offline. */
@@ -104,6 +132,12 @@ static void boot_cpu(unsigned int cpu)
 void smp_initialise(void)
 {
 unsigned int i, nr_cpus = hvm_info->nr_vcpus;
+uint32_t ecx;
+
+cpuid(1, NULL, NULL, &ecx, NULL);
+has_x2apic = (ecx >> 21) & 1;
+if ( has_x2apic )
+printf("x2APIC supported\n");
 
 printf("Multiprocessor initialisation:\n");
 cpu_setup(0);
diff --git a/tools/include/xen-tools/common-macros.h 
b/tools/include/xen-tools/common-macros.h
index 60912225cb7a..336c6309d96e 100644
--- a/tools/include/xen-tools/common-macros.h
+++ b/tools/include/xen-tools/common-macros.h
@@ -108,4 +108,9 @@
 #define get_unal

[PATCH v5 10/10] xen/x86: Synthesise domain topologies

2024-08-08 Thread Alejandro Vallejo
Expose sensible topologies in leaf 0xb. At the moment it synthesises
non-HT systems, in line with the previous code's intent.

Leaf 0xb in the host policy is no longer zapped and the guest {max,def}
policies have their topology leaves zapped instead. The intent is for
toolstack to populate them. There's no current use for the topology
information in the host policy, but it does no harm.

Signed-off-by: Alejandro Vallejo 
---
v5:
  * No change
---
 tools/libs/guest/xg_cpuid_x86.c | 24 +++-
 xen/arch/x86/cpu-policy.c   |  9 ++---
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
index 4453178100ad..6062dcab01ce 100644
--- a/tools/libs/guest/xg_cpuid_x86.c
+++ b/tools/libs/guest/xg_cpuid_x86.c
@@ -725,8 +725,16 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t 
domid, bool restore,
 p->policy.basic.htt   = test_bit(X86_FEATURE_HTT, host_featureset);
 p->policy.extd.cmp_legacy = test_bit(X86_FEATURE_CMP_LEGACY, 
host_featureset);
 }
-else
+else if ( restore )
 {
+/*
+ * Reconstruct the topology exposed on Xen <= 4.13. It makes very little
+ * sense, but it's what those guests saw so it's set in stone now.
+ *
+ * Guests from Xen 4.14 onwards carry their own CPUID leaves in the
+ * migration stream so they don't need special treatment.
+ */
+
 /*
  * Topology for HVM guests is entirely controlled by Xen.  For now, we
  * hardcode APIC_ID = vcpu_id * 2 to give the illusion of no SMT.
@@ -782,6 +790,20 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t 
domid, bool restore,
 break;
 }
 }
+else
+{
+/* TODO: Expose the ability to choose a custom topology for HVM/PVH */
+unsigned int threads_per_core = 1;
+unsigned int cores_per_pkg = di.max_vcpu_id + 1;
+
+rc = x86_topo_from_parts(&p->policy, threads_per_core, cores_per_pkg);
+if ( rc )
+{
+ERROR("Failed to generate topology: rc=%d t/c=%u c/p=%u",
+  rc, threads_per_core, cores_per_pkg);
+goto out;
+}
+}
 
 nr_leaves = ARRAY_SIZE(p->leaves);
 rc = x86_cpuid_copy_to_buffer(&p->policy, p->leaves, &nr_leaves);
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 304dc20cfab8..55a95f6e164c 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -263,9 +263,6 @@ static void recalculate_misc(struct cpu_policy *p)
 
 p->basic.raw[0x8] = EMPTY_LEAF;
 
-/* TODO: Rework topology logic. */
-memset(p->topo.raw, 0, sizeof(p->topo.raw));
-
 p->basic.raw[0xc] = EMPTY_LEAF;
 
 p->extd.e1d &= ~CPUID_COMMON_1D_FEATURES;
@@ -613,6 +610,9 @@ static void __init calculate_pv_max_policy(void)
 recalculate_xstate(p);
 
 p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
+
+/* Wipe host topology. Populated by toolstack */
+memset(p->topo.raw, 0, sizeof(p->topo.raw));
 }
 
 static void __init calculate_pv_def_policy(void)
@@ -776,6 +776,9 @@ static void __init calculate_hvm_max_policy(void)
 
 /* It's always possible to emulate CPUID faulting for HVM guests */
 p->platform_info.cpuid_faulting = true;
+
+/* Wipe host topology. Populated by toolstack */
+memset(p->topo.raw, 0, sizeof(p->topo.raw));
 }
 
 static void __init calculate_hvm_def_policy(void)
-- 
2.45.2




[PATCH v5 02/10] x86/vlapic: Move lapic migration checks to the check hooks

2024-08-08 Thread Alejandro Vallejo
While doing this, factor out checks common to architectural and hidden
state.

Signed-off-by: Alejandro Vallejo 
Reviewed-by: Roger Pau Monné 
---
v5:
  * No change
---
 xen/arch/x86/hvm/vlapic.c | 81 +--
 1 file changed, 53 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 2ec95942713e..521b98988be9 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1554,60 +1554,85 @@ static void lapic_load_fixup(struct vlapic *vlapic)
v, vlapic->loaded.id, vlapic->loaded.ldr, good_ldr);
 }
 
-static int cf_check lapic_load_hidden(struct domain *d, hvm_domain_context_t 
*h)
-{
-unsigned int vcpuid = hvm_load_instance(h);
-struct vcpu *v;
-struct vlapic *s;
 
+static int lapic_check_common(const struct domain *d, unsigned int vcpuid)
+{
 if ( !has_vlapic(d) )
 return -ENODEV;
 
 /* Which vlapic to load? */
-if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+if ( !domain_vcpu(d, vcpuid) )
 {
 dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no apic%u\n",
 d->domain_id, vcpuid);
 return -EINVAL;
 }
-s = vcpu_vlapic(v);
 
-if ( hvm_load_entry_zeroextend(LAPIC, h, &s->hw) != 0 )
+return 0;
+}
+
+static int cf_check lapic_check_hidden(const struct domain *d,
+   hvm_domain_context_t *h)
+{
+unsigned int vcpuid = hvm_load_instance(h);
+struct hvm_hw_lapic s;
+int rc;
+
+if ( (rc = lapic_check_common(d, vcpuid)) )
+return rc;
+
+if ( hvm_load_entry_zeroextend(LAPIC, h, &s) != 0 )
+return -ENODATA;
+
+/* EN=0 with EXTD=1 is illegal */
+if ( (s.apic_base_msr & (APIC_BASE_ENABLE | APIC_BASE_EXTD)) ==
+ APIC_BASE_EXTD )
 return -EINVAL;
 
+return 0;
+}
+
+static int cf_check lapic_load_hidden(struct domain *d, hvm_domain_context_t 
*h)
+{
+unsigned int vcpuid = hvm_load_instance(h);
+struct vcpu *v = d->vcpu[vcpuid];
+struct vlapic *s = vcpu_vlapic(v);
+
+if ( hvm_load_entry_zeroextend(LAPIC, h, &s->hw) != 0 )
+ASSERT_UNREACHABLE();
+
 s->loaded.hw = 1;
 if ( s->loaded.regs )
 lapic_load_fixup(s);
 
-if ( !(s->hw.apic_base_msr & APIC_BASE_ENABLE) &&
- unlikely(vlapic_x2apic_mode(s)) )
-return -EINVAL;
-
 hvm_update_vlapic_mode(v);
 
 return 0;
 }
 
-static int cf_check lapic_load_regs(struct domain *d, hvm_domain_context_t *h)
+static int cf_check lapic_check_regs(const struct domain *d,
+ hvm_domain_context_t *h)
 {
 unsigned int vcpuid = hvm_load_instance(h);
-struct vcpu *v;
-struct vlapic *s;
+int rc;
 
-if ( !has_vlapic(d) )
-return -ENODEV;
+if ( (rc = lapic_check_common(d, vcpuid)) )
+return rc;
 
-/* Which vlapic to load? */
-if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
-{
-dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no apic%u\n",
-d->domain_id, vcpuid);
-return -EINVAL;
-}
-s = vcpu_vlapic(v);
+if ( !hvm_get_entry(LAPIC_REGS, h) )
+return -ENODATA;
+
+return 0;
+}
+
+static int cf_check lapic_load_regs(struct domain *d, hvm_domain_context_t *h)
+{
+unsigned int vcpuid = hvm_load_instance(h);
+struct vcpu *v = d->vcpu[vcpuid];
+struct vlapic *s = vcpu_vlapic(v);
 
 if ( hvm_load_entry(LAPIC_REGS, h, s->regs) != 0 )
-return -EINVAL;
+ASSERT_UNREACHABLE();
 
 s->loaded.id = vlapic_get_reg(s, APIC_ID);
 s->loaded.ldr = vlapic_get_reg(s, APIC_LDR);
@@ -1624,9 +1649,9 @@ static int cf_check lapic_load_regs(struct domain *d, 
hvm_domain_context_t *h)
 return 0;
 }
 
-HVM_REGISTER_SAVE_RESTORE(LAPIC, lapic_save_hidden, NULL,
+HVM_REGISTER_SAVE_RESTORE(LAPIC, lapic_save_hidden, lapic_check_hidden,
   lapic_load_hidden, 1, HVMSR_PER_VCPU);
-HVM_REGISTER_SAVE_RESTORE(LAPIC_REGS, lapic_save_regs, NULL,
+HVM_REGISTER_SAVE_RESTORE(LAPIC_REGS, lapic_save_regs, lapic_check_regs,
   lapic_load_regs, 1, HVMSR_PER_VCPU);
 
 int vlapic_init(struct vcpu *v)
-- 
2.45.2




[PATCH v5 01/10] tools/hvmloader: Fix non-deterministic cpuid()

2024-08-08 Thread Alejandro Vallejo
hvmloader's cpuid() implementation deviates from Xen's in that the value
passed on ecx is unspecified. This means that when used on leaves that
implement subleaves it's unspecified which one you get; though it's more
than likely an invalid one.

Import Xen's implementation so there are no surprises.
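
A short usage sketch (hedged) with the helpers introduced below; leaf 0xb
is subleaf-indexed, so ecx must be set explicitly:

    uint32_t eax, ebx, ecx, edx;

    cpuid_count(0xb, 0, &eax, &ebx, &ecx, &edx); /* SMT level */
    cpuid_count(0xb, 1, &eax, &ebx, &ecx, &edx); /* core level */
    cpuid(1, &eax, &ebx, &ecx, &edx);            /* plain leaf, subleaf 0 */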

Fixes: 318ac791f9f9 ("Add utilities needed for SMBIOS generation to
hvmloader")
Signed-off-by: Alejandro Vallejo 
---
v5:
  * Added Fixes tag
  * Cosmetic changes to static inline, as proposed by Andrew
---
 tools/firmware/hvmloader/util.c |  9 -
 tools/firmware/hvmloader/util.h | 27 ---
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index c34f077b38e3..d3b3f9038e64 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -267,15 +267,6 @@ memcmp(const void *s1, const void *s2, unsigned n)
 return 0;
 }
 
-void
-cpuid(uint32_t idx, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
-{
-asm volatile (
-"cpuid"
-: "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
-: "0" (idx) );
-}
-
 static const char hex_digits[] = "0123456789abcdef";
 
 /* Write a two-character hex representation of 'byte' to digits[].
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index deb823a892ef..e53a36476b70 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -184,9 +184,30 @@ int uart_exists(uint16_t uart_base);
 int lpt_exists(uint16_t lpt_base);
 int hpet_exists(unsigned long hpet_base);
 
-/* Do cpuid instruction, with operation 'idx' */
-void cpuid(uint32_t idx, uint32_t *eax, uint32_t *ebx,
-   uint32_t *ecx, uint32_t *edx);
+/* Some CPUID calls want 'count' to be placed in ecx */
+static inline void cpuid_count(
+uint32_t leaf,
+uint32_t subleaf,
+uint32_t *eax,
+uint32_t *ebx,
+uint32_t *ecx,
+uint32_t *edx)
+{
+asm volatile ( "cpuid"
+  : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
+  : "a" (leaf), "c" (subleaf) );
+}
+
+/* Generic CPUID function (subleaf 0) */
+static inline void cpuid(
+uint32_t leaf,
+uint32_t *eax,
+uint32_t *ebx,
+uint32_t *ecx,
+uint32_t *edx)
+{
+cpuid_count(leaf, 0, eax, ebx, ecx, edx);
+}
 
 /* Read the TSC register. */
 static inline uint64_t rdtsc(void)
-- 
2.45.2




[PATCH v5 09/10] tools/libguest: Set distinct x2APIC IDs for each vCPU

2024-08-08 Thread Alejandro Vallejo
Have toolstack populate the new x2APIC ID in the LAPIC save record with
the proper IDs intended for each vCPU.

Signed-off-by: Alejandro Vallejo 
---
v5:
  * No change
---
 tools/libs/guest/xg_dom_x86.c | 38 ++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/tools/libs/guest/xg_dom_x86.c b/tools/libs/guest/xg_dom_x86.c
index fafe7acb7e91..a3e4e2052128 100644
--- a/tools/libs/guest/xg_dom_x86.c
+++ b/tools/libs/guest/xg_dom_x86.c
@@ -1004,19 +1004,40 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 HVM_SAVE_TYPE(HEADER) header;
 struct hvm_save_descriptor mtrr_d;
 HVM_SAVE_TYPE(MTRR) mtrr;
+struct hvm_save_descriptor lapic_d;
+HVM_SAVE_TYPE(LAPIC) lapic;
 struct hvm_save_descriptor end_d;
 HVM_SAVE_TYPE(END) end;
 } vcpu_ctx;
-/* Context from full_ctx */
+/* Contexts from full_ctx */
 const HVM_SAVE_TYPE(MTRR) *mtrr_record;
+const HVM_SAVE_TYPE(LAPIC) *lapic_record;
 /* Raw context as taken from Xen */
 uint8_t *full_ctx = NULL;
+xc_cpu_policy_t *policy = xc_cpu_policy_init();
 int rc;
 
 DOMPRINTF_CALLED(dom->xch);
 
 assert(dom->max_vcpus);
 
+/*
+ * Fetch the CPU policy of this domain. We need it to determine the APIC
+ * ID of each vCPU in a manner consistent with the exported topology.
+ *
+ * TODO: It's silly to query a policy we have ourselves created. It should
+ *   instead be part of xc_dom_image
+ */
+
+rc = xc_cpu_policy_get_domain(dom->xch, dom->guest_domid, policy);
+if ( rc != 0 )
+{
+xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+ "%s: unable to fetch cpu policy for dom%u (rc=%d)",
+ __func__, dom->guest_domid, rc);
+goto out;
+}
+
 /*
  * Get the full HVM context in order to have the header, it is not
  * possible to get the header with getcontext_partial, and crafting one
@@ -,6 +1132,8 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 vcpu_ctx.mtrr_d.typecode = HVM_SAVE_CODE(MTRR);
 vcpu_ctx.mtrr_d.length = HVM_SAVE_LENGTH(MTRR);
 vcpu_ctx.mtrr = *mtrr_record;
+vcpu_ctx.lapic_d.typecode = HVM_SAVE_CODE(LAPIC);
+vcpu_ctx.lapic_d.length = HVM_SAVE_LENGTH(LAPIC);
 vcpu_ctx.end_d = bsp_ctx.end_d;
 vcpu_ctx.end = bsp_ctx.end;
 
@@ -1125,6 +1148,18 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 {
 vcpu_ctx.mtrr_d.instance = i;
 
+lapic_record = hvm_get_save_record(full_ctx, HVM_SAVE_CODE(LAPIC), i);
+if ( !lapic_record )
+{
+xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+ "%s: unable to get LAPIC[%d] save record", __func__, 
i);
+goto out;
+}
+
+vcpu_ctx.lapic = *lapic_record;
+vcpu_ctx.lapic.x2apic_id = x86_x2apic_id_from_vcpu_id(&policy->policy, i);
+vcpu_ctx.lapic_d.instance = i;
+
 rc = xc_domain_hvm_setcontext(dom->xch, dom->guest_domid,
   (uint8_t *)&vcpu_ctx, sizeof(vcpu_ctx));
 if ( rc != 0 )
@@ -1147,6 +1182,7 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 
  out:
 free(full_ctx);
+xc_cpu_policy_destroy(policy);
 return rc;
 }
 
-- 
2.45.2




[PATCH v5 00/10] x86: Expose consistent topology to guests

2024-08-08 Thread Alejandro Vallejo
v4 -> v5:

Largely unchanged and resent for review after the 4.20 dev cycle started.

  * Addressed Andrew's nits in v4/patch1
  * Addressed Jan's concern with MTRR overrides in v4/patch6 by keeping the
same MTRR data in the vCPU contexts for HVM domain creations.

v4: 
https://lore.kernel.org/xen-devel/cover.1719416329.git.alejandro.vall...@cloud.com/
v3 -> v4:

  * Fixed cpuid() bug in hvmloader, causing UB in v3
  * Fixed a bogus assert in hvmloader, also causing a crash in v3
  * Used HVM contexts rather than sync'd algos between Xen and toolstack in
    order to initialise the per-vCPU LAPIC state.
  * Formatting adjustments.

v3: 
https://lore.kernel.org/xen-devel/cover.1716976271.git.alejandro.vall...@cloud.com/
v2 -> v3:

  (v2/patch2 and v2/patch4 are already committed)

  * Moved the vlapic check hook addition to v3/patch1
    * And created a check hook for the architectural state too for consistency.
  * Fixed migrations from Xen <= 4.13 by reconstructing the previous topology.
  * Correctly set the APIC ID after a policy change when vlapic is already in
    x2APIC mode.
  * Removed bogus assumption introduced in v1 and v2 in hvmloader about which
    8-bit APIC IDs represent IDs > 254 (it's "id % 0xff", not "min(id, 0xff)").
    * Used an x2apic flag check instead.
  * Various formatting adjustments.

v2: 
https://lore.kernel.org/xen-devel/cover.1715102098.git.alejandro.vall...@cloud.com/
v1 -> v2:

  * v1/patch 4 replaced by a different strategy (see patches 4 and 5 in v2):
    * Have hvmloader populate MADT with the real APIC IDs as read by the APs
      themselves rather than giving it knowledge on how to derive them.
  * Removed patches 2 and 3 in v1, as no longer relevant.
  * Split v1/patch6 in two parts ((a) creating the generator and (b) plugging it
    in) and use the generator in the unit tests of the vcpuid->apicid mapping
    function. Becomes patches 6 and 8 in v2.

  Patch 1: Same as v1/patch1.
  Patch 2: Header dependency cleanup in preparation for patch3.
  Patch 3: Adds vlapic_hidden check for the newly introduced reserved area.
  Patch 4: [hvmloader] Replaces INIT+SIPI+SIPI sequences with hypercalls.
  Patch 5: [hvmloader] Retrieve the per-CPU APIC IDs from the APs themselves.
  Patch 6: Split from v1/patch6.
  Patch 7: Logically matching v1/patch5, but using v2/patch6 for testing.
  Patch 8: Split from v1/patch6.

v1: 
https://lore.kernel.org/xen-devel/20240109153834.4192-1-alejandro.vall...@cloud.com/
=== Original cover letter ===

Current topology handling is close to non-existent. As things stand, APIC
IDs are allocated through the apic_id=vcpu_id*2 relation without giving any
hints to the OS on how to parse the x2APIC ID of a given CPU, and assuming
the guest will infer 2 threads per core.
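
For concreteness, a minimal sketch of the legacy relation described above
(illustrative only; this is the derivation the series replaces, not the new
one):

```c
/* Old scheme: even x2APIC IDs only, with the odd IDs as gaps the guest is
 * expected to interpret as the second thread of each core. */
static uint32_t legacy_x2apic_id(uint32_t vcpu_id)
{
    return 2 * vcpu_id;   /* vCPU 0 -> 0, vCPU 1 -> 2, vCPU 2 -> 4, ... */
}
```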

This series involves bringing x2APIC IDs into the migration stream (so
older guests keep operating as they used to) and enhancing Xen+toolstack so
new guests get topology information consistent with their x2APIC IDs. As a
side effect of this, x2APIC IDs are now packed and don't have gaps (unless
under a pathological case).

Further work ought to allow combining these topology configurations with
gang-scheduling of guest hyperthreads onto affine physical hyperthreads.
For the time being it purposefully keeps the configuration of "1 socket" +
"1 thread per core" + "1 core per vCPU".

Patch 1: Includes x2APIC IDs in the migration stream. This allows Xen to
 reconstruct the right x2APIC IDs on migrated-in guests, and
 future-proofs itself in the face of x2APIC ID derivation changes.
Patch 2: Minor refactor to expose xc_cpu_policy in libxl
Patch 3: Refactors xen/lib/x86 to work on non-Xen freestanding environments
 (e.g. hvmloader)
Patch 4: Remove old assumptions about the vcpu_id<->apic_id relationship in
         hvmloader
Patch 5: Add logic to derive x2APIC IDs given a CPU policy and vCPU IDs
Patch 6: Includes a simple topology generator for toolstack so new guests
 have topologically consistent information in CPUID

Alejandro Vallejo (10):
  tools/hvmloader: Fix non-deterministic cpuid()
  x86/vlapic: Move lapic migration checks to the check hooks
  xen/x86: Add initial x2APIC ID to the per-vLAPIC save area
  xen/x86: Add supporting code for uploading LAPIC contexts during
domain create
  tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves
  tools/libguest: Always set vCPU context in vcpu_hvm()
  xen/lib: Add topology generator for x86
  xen/x86: Derive topologically correct x2APIC IDs from the policy
  tools/libguest: Set distinct x2APIC IDs for each vCPU
  xen/x86: Synthesise domain topologies

 tools/firmware/hvmloader/config.h|   6 +-
 tools/firmware/hvmloader/hvmloader.c |   4 +-
 tools/firmware/hvmloader/smp.c   |  54 --
 tools/firmware/hvmloader/util.c  |   9 -
 tools/firmware/hvmloader/util.h

[PATCH v2 1/2] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-08-08 Thread Alejandro Vallejo
fpu_ctxt is either a pointer to the legacy x87/SSE save area (used by FXSAVE) or
a pointer aliased with xsave_area that points to its fpu_sse subfield. That
subfield is at the base and is identical in size and layout to the legacy
buffer.

This patch merges the 2 pointers in the arch_vcpu into a single XSAVE area. In
the very rare case in which the host doesn't support XSAVE, all we're doing is
wasting a tiny amount of memory in exchange for a lot more simplicity in
the code.
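
For illustration (assertions mine, not part of the patch), the property the
merge relies on can be stated as compile-time checks:

```c
/* fpu_sse aliases the legacy FXSAVE image exactly: it sits at the base of
 * the XSAVE area and spans the 512-byte x87/SSE save image. */
BUILD_BUG_ON(offsetof(struct xsave_struct, fpu_sse) != 0);
BUILD_BUG_ON(sizeof(((struct xsave_struct *)0)->fpu_sse) != 512);
```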

Signed-off-by: Alejandro Vallejo 
---
v2:

  * Added BUILD_BUG_ON(sizeof(x) != sizeof(fpusse_t)) on forceful casts
involving fpusse_t.
  * Reworded comment on top of vcpu_arch->user_regs
  * Added missing whitespace in x86_emulate/blk.c
---
 xen/arch/x86/domctl.c |  5 +++-
 xen/arch/x86/hvm/emulate.c|  4 +--
 xen/arch/x86/hvm/hvm.c|  5 +++-
 xen/arch/x86/i387.c   | 45 +--
 xen/arch/x86/include/asm/domain.h |  8 +++---
 xen/arch/x86/x86_emulate/blk.c|  3 ++-
 xen/arch/x86/xstate.c | 13 ++---
 7 files changed, 32 insertions(+), 51 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 68b5b46d1a83..bceff6be0ff3 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1344,7 +1344,10 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
 #define c(fld) (c.nat->fld)
 #endif
 
-memcpy(&c.nat->fpu_ctxt, v->arch.fpu_ctxt, sizeof(c.nat->fpu_ctxt));
+memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
+   sizeof(c.nat->fpu_ctxt));
+BUILD_BUG_ON(sizeof(c.nat->fpu_ctxt) != sizeof(fpusse_t));
+
 if ( is_pv_domain(d) )
 c(flags = v->arch.pv.vgc_flags & ~(VGCF_i387_valid|VGCF_in_kernel));
 else
diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index feb4792cc567..03020542c3ba 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2363,7 +2363,7 @@ static int cf_check hvmemul_get_fpu(
 alternative_vcall(hvm_funcs.fpu_dirty_intercept);
 else if ( type == X86EMUL_FPU_fpu )
 {
-const fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
 
 /*
  * Latch current register state so that we can back out changes
@@ -2403,7 +2403,7 @@ static void cf_check hvmemul_put_fpu(
 
 if ( aux )
 {
-fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
 bool dval = aux->dval;
 int mode = hvm_guest_x86_mode(curr);
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index f49e29faf753..6607dba562a4 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -916,7 +916,10 @@ static int cf_check hvm_save_cpu_ctxt(struct vcpu *v, hvm_domain_context_t *h)
 
 if ( v->fpu_initialised )
 {
-memcpy(ctxt.fpu_regs, v->arch.fpu_ctxt, sizeof(ctxt.fpu_regs));
+memcpy(ctxt.fpu_regs, &v->arch.xsave_area->fpu_sse,
+   sizeof(ctxt.fpu_regs));
+BUILD_BUG_ON(sizeof(ctxt.fpu_regs) != sizeof(fpusse_t));
+
 ctxt.flags = XEN_X86_FPU_INITIALISED;
 }
 
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 134e0bece519..fbb9d3584a3d 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -39,7 +39,7 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
 /* Restore x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxrstor(struct vcpu *v)
 {
-const fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 
 /*
  * Some CPUs don't save/restore FDP/FIP/FOP unless an exception
@@ -151,7 +151,7 @@ static inline void fpu_xsave(struct vcpu *v)
 /* Save x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxsave(struct vcpu *v)
 {
-fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
 
 if ( fip_width != 4 )
@@ -212,7 +212,7 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool need_stts)
  * above) we also need to restore full state, to prevent subsequently
  * saving state belonging to another vCPU.
  */
-if ( v->arch.fully_eager_fpu || (v->arch.xsave_area && xstate_all(v)) )
+if ( v->arch.fully_eager_fpu || xstate_all(v) )
 {
 if ( cpu_has_xsave )
 fpu_xrstor(v, XSTATE_ALL);
@@ -299,44 +299,14 @@ void save_fpu_enable(void)
 /* Initialize FPU's context save area */
 int vcpu_init_fpu(struct vcpu *v)
 {
-int rc;
-
 v->arch.fully_eager_fpu = opt_eager_fpu;
-
-if ( (rc = xstate_alloc_save_area(v)) != 0 )
-return rc;
-
-if ( v->arch.xsave_area )
-v->arch.fpu_ctxt = &v->arc

[PATCH v2 0/2] x86: FPU handling cleanup

2024-08-08 Thread Alejandro Vallejo
v1: 
https://lore.kernel.org/xen-devel/cover.1720538832.git.alejandro.vall...@cloud.com/T/#t
v1 -> v2: v1/patch1 and v1/patch2 are already in staging.

=== Original cover letter ===
I want to eventually reach a position in which the FPU state can be allocated
from the domheap and hidden via the same core mechanism proposed in Elias'
directmap removal series. Doing so is complicated by the presence of 2 aliased
pointers (v->arch.fpu_ctxt and v->arch.xsave_area) and the rather convoluted
semantics of vcpu_setup_fpu(). This series tries to simplify the code so moving
to a "map/modify/unmap" model is more tractable.

Patches 1 and 2 are trivial refactors.

Patch 3 unifies FPU state so an XSAVE area is allocated per vCPU regardless
of whether the host supports XSAVE or not. The rationale is that the memory
savings are negligible and not worth the extra complexity.

Patch 4 is a non-trivial split of the vcpu_setup_fpu() into 2 separate
functions. One to override x87/SSE state, and another to set a reset state.
=======

Alejandro Vallejo (2):
  x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu
  x86/fpu: Split fpu_setup_fpu() in two

 xen/arch/x86/domain.c |   7 +-
 xen/arch/x86/domctl.c |   5 +-
 xen/arch/x86/hvm/emulate.c|   4 +-
 xen/arch/x86/hvm/hvm.c|  32 +++---
 xen/arch/x86/i387.c   | 103 ++
 xen/arch/x86/include/asm/domain.h |   8 +--
 xen/arch/x86/include/asm/i387.h   |  28 ++--
 xen/arch/x86/include/asm/xstate.h |   1 +
 xen/arch/x86/x86_emulate/blk.c|   3 +-
 xen/arch/x86/xstate.c |  13 +++-
 10 files changed, 108 insertions(+), 96 deletions(-)

-- 
2.45.2




[PATCH v2 2/2] x86/fpu: Split fpu_setup_fpu() in two

2024-08-08 Thread Alejandro Vallejo
It was trying to do too many things at once and there was no clear way of
defining what it was meant to do. This commit splits the function in two.

  1. A reset function, parameterized by the FCW value. FCW_RESET means to reset
 the state to power-on reset values, while FCW_DEFAULT means to reset to the
 default values present during vCPU creation.
  2. A x87/SSE state loader (equivalent to the old function when it took a data
 pointer).

While at it, make sure the abridged tag is consistent with the manuals and
starts as 0xFF.
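
In sketch form, the resulting call patterns (as they appear in the diff
below) are:

```c
vcpu_reset_fpu(v);                  /* power-on reset values (FCW_RESET)    */
vcpu_default_fpu(v);                /* vCPU-creation defaults (FCW_DEFAULT)  */
vcpu_setup_fpu(v, &ctxt.fpu_regs);  /* load a given x87/SSE state image      */
```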

Signed-off-by: Alejandro Vallejo 
---
v2:

  * Reworded comment about pre-Xen 4.1 migrations.
  * Reset abridged FTW to -1 (tag=0xFF), as per the manuals.
  * Undo const cast-away.
  * Split vcpu_reset_fpu() into vcpu_default_fpu()
* vcpu_init_fpu() already exists.
  * Removed backticks from comments
---
 xen/arch/x86/domain.c |  7 ++--
 xen/arch/x86/hvm/hvm.c| 27 ++
 xen/arch/x86/i387.c   | 60 +++
 xen/arch/x86/include/asm/i387.h   | 28 ---
 xen/arch/x86/include/asm/xstate.h |  1 +
 5 files changed, 77 insertions(+), 46 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index d977ec71ca20..5af9e3e7a8b4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1186,9 +1186,10 @@ int arch_set_info_guest(
  is_pv_64bit_domain(d) )
 v->arch.flags &= ~TF_kernel_mode;
 
-vcpu_setup_fpu(v, v->arch.xsave_area,
-   flags & VGCF_I387_VALID ? &c.nat->fpu_ctxt : NULL,
-   FCW_DEFAULT);
+if ( flags & VGCF_I387_VALID )
+vcpu_setup_fpu(v, &c.nat->fpu_ctxt);
+else
+vcpu_default_fpu(v);
 
 if ( !compat )
 {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 6607dba562a4..83cb21884ce6 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1164,10 +1164,25 @@ static int cf_check hvm_load_cpu_ctxt(struct domain *d, hvm_domain_context_t *h)
 seg.attr = ctxt.ldtr_arbytes;
 hvm_set_segment_register(v, x86_seg_ldtr, &seg);
 
-/* Cover xsave-absent save file restoration on xsave-capable host. */
-vcpu_setup_fpu(v, xsave_enabled(v) ? NULL : v->arch.xsave_area,
-   ctxt.flags & XEN_X86_FPU_INITIALISED ? ctxt.fpu_regs : NULL,
-   FCW_RESET);
+/*
+ * On Xen 4.1 and later the FPU state is restored by a later HVM context
+ * record in the migrate stream, so what we're doing here is initialising
+ * the FPU state for guests from even older versions of Xen.
+ *
+ * In particular:
+ *   1. If there's an XSAVE context later in the stream what we do here for
+ *  the FPU doesn't matter because it'll be overridden later.
+ *   2. If there isn't and the guest didn't use extended states it's still
+ *  fine because we have all the information we need here.
+ *   3. If there isn't and the guest DID use extended states (could've
+ *  happened prior to Xen 4.1) then we're in a pickle because we have
+ *  to make up non-existing state. For this case we initialise the FPU
+ *  as using x87/SSE only because the rest of the state is gone.
+ */
+if ( ctxt.flags & XEN_X86_FPU_INITIALISED )
+vcpu_setup_fpu(v, &ctxt.fpu_regs);
+else
+vcpu_reset_fpu(v);
 
 v->arch.user_regs.rax = ctxt.rax;
 v->arch.user_regs.rbx = ctxt.rbx;
@@ -4007,9 +4022,7 @@ void hvm_vcpu_reset_state(struct vcpu *v, uint16_t cs, uint16_t ip)
 v->arch.guest_table = pagetable_null();
 }
 
-if ( v->arch.xsave_area )
-v->arch.xsave_area->xsave_hdr.xstate_bv = 0;
-vcpu_setup_fpu(v, v->arch.xsave_area, NULL, FCW_RESET);
+vcpu_reset_fpu(v);
 
 arch_vcpu_regs_init(v);
 v->arch.user_regs.rip = ip;
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index fbb9d3584a3d..af5ae805998a 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -303,41 +303,37 @@ int vcpu_init_fpu(struct vcpu *v)
 return xstate_alloc_save_area(v);
 }
 
-void vcpu_setup_fpu(struct vcpu *v, struct xsave_struct *xsave_area,
-const void *data, unsigned int fcw_default)
+void vcpu_reset_fpu(struct vcpu *v)
 {
-fpusse_t *fpu_sse = &v->arch.xsave_area->fpu_sse;
-
-ASSERT(!xsave_area || xsave_area == v->arch.xsave_area);
-
-v->fpu_initialised = !!data;
+v->fpu_initialised = false;
+*v->arch.xsave_area = (struct xsave_struct) {
+.fpu_sse = {
+.mxcsr = MXCSR_DEFAULT,
+.fcw = FCW_RESET,
+.ftw = FTW_RESET,
+},
+.xsave_hdr.xstate_bv = X86_XCR0_X87,
+};
+}
 
-if ( data )
-{
-memcpy(fpu_sse, data, sizeof(*fpu_sse));
-if ( xsave_area )
-xsave_area->xsave_hdr.xstate_bv =

Re: [PATCH for-4.20 3/4] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-08-07 Thread Alejandro Vallejo
Hi. Sorry, I've been busy and couldn't get back to this sooner.

On Fri Jul 19, 2024 at 10:14 AM BST, Jan Beulich wrote:
> On 18.07.2024 18:54, Alejandro Vallejo wrote:
> > On Thu Jul 18, 2024 at 12:49 PM BST, Jan Beulich wrote:
> >> On 09.07.2024 17:52, Alejandro Vallejo wrote:
> >>> --- a/xen/arch/x86/include/asm/domain.h
> >>> +++ b/xen/arch/x86/include/asm/domain.h
> >>> @@ -591,12 +591,7 @@ struct pv_vcpu
> >>>  
> >>>  struct arch_vcpu
> >>>  {
> >>> -/*
> >>> - * guest context (mirroring struct vcpu_guest_context) common
> >>> - * between pv and hvm guests
> >>> - */
> >>> -
> >>> -void  *fpu_ctxt;
> >>> +/* Fixed point registers */
> >>>  struct cpu_user_regs user_regs;
> >>
> >> Not exactly, no. Selector registers are there as well for example, which
> >> I wouldn't consider "fixed point" ones. I wonder why the existing comment
> >> cannot simply be kept, perhaps extended to mention that fpu_ctxt now lives
> >> elsewhere.
> > 
> > Would you prefer "general purpose registers"? It's not quite that either, 
> > but
> > it's arguably closer. I can part with the comment altogether but I'd rather
> > leave a token amount of information to say "non-FPU register state" (but not
> > that, because that would be a terrible description). 
> > 
> > I'd rather update it to something that better reflects reality, as I found 
> > it
> > quite misleading when reading through. I initially thought it may have been
> > related to struct layout (as in C-style single-level inheritance), but as it
> > turns out it's merely establishing a vague relationship between arch_vcpu 
> > and
> > vcpu_guest_context. I can believe once upon a time the relationship was 
> > closer
> > than it is now, but with the guest context missing AVX state, MSR state and
> > other bits and pieces I thought it better to avoid such confusions for 
> > future
> > navigators down the line so limit its description to the line below.
>
> As said, I'd prefer if you amended the existing comment. Properly describing
> what's in cpu_user_regs isn't quite as easy in only very few words. Neither
> "fixed point register" nor "general purpose registers" really covers it. And
> I'd really like to avoid having potentially confusing comments.

Sure.

>
> >>> --- a/xen/arch/x86/xstate.c
> >>> +++ b/xen/arch/x86/xstate.c
> >>> @@ -507,9 +507,16 @@ int xstate_alloc_save_area(struct vcpu *v)
> >>>  unsigned int size;
> >>>  
> >>>  if ( !cpu_has_xsave )
> >>> -return 0;
> >>> -
> >>> -if ( !is_idle_vcpu(v) || !cpu_has_xsavec )
> >>> +{
> >>> +/*
> >>> + * This is bigger than FXSAVE_SIZE by 64 bytes, but it helps treating
> >>> + * the FPU state uniformly as an XSAVE buffer even if XSAVE is not
> >>> + * available in the host. Note the alignment restrictions of the XSAVE
> >>> + * area are stricter than those of the FXSAVE area.
> >>> + */
> >>> +size = XSTATE_AREA_MIN_SIZE;
> >>
> >> What exactly would break if just (a little over) 512 bytes worth were 
> >> allocated
> >> when there's no XSAVE? If it was exactly 512, something like xstate_all() 
> >> would
> >> need to apply a little more care, I guess. Yet for that having just 
> >> always-zero
> >> xstate_bv and xcomp_bv there would already suffice (e.g. using
> >> offsetof(..., xsave_hdr.reserved) here, to cover further fields gaining 
> >> meaning
> >> down the road). Remember that due to xmalloc() overhead and the 
> >> 64-byte-aligned
> >> requirement, you can only have 6 of them in a page the way you do it, when 
> >> the
> >> alternative way 7 would fit (if I got my math right).
> > 
> > I'm slightly confused.
> > 
> > XSTATE_AREA_MIN_SIZE is already 512 + 64 to account for the XSAVE header,
> > including its reserved fields. Did you mean something else?
>
> No, I didn't. I've in fact commented on it precisely because it is the value
> you name. That's larger than necessary, and when suitably shrunk - as said -
> one more of these structures could fit in a page (assumin

[PATCH 3/5] x86: Set xen_phys_start and trampoline_xen_phys_start earlier

2024-08-07 Thread Alejandro Vallejo
There is no reason to wait: if the Xen image is loaded by EFI (the
non-multiboot EFI path) these are set in efi_arch_load_addr_check(), but
not in the multiboot EFI code path.
This change makes the 2 code paths more similar and allows
these variables to be used earlier if needed.

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/boot/head.S | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/boot/head.S b/xen/arch/x86/boot/head.S
index 296f76146a..5b82221038 100644
--- a/xen/arch/x86/boot/head.S
+++ b/xen/arch/x86/boot/head.S
@@ -259,6 +259,11 @@ __efi64_mb2_start:
 jmp x86_32_switch
 
 .Lefi_multiboot2_proto:
+/* Save Xen image load base address for later use. */
+lea __image_base__(%rip),%rsi
+movq%rsi, xen_phys_start(%rip)
+movl%esi, trampoline_xen_phys_start(%rip)
+
 /* Zero EFI SystemTable, EFI ImageHandle addresses and cmdline. */
 xor %esi,%esi
 xor %edi,%edi
@@ -605,10 +610,6 @@ trampoline_setup:
  * Called on legacy BIOS and EFI platforms.
  */
 
-/* Save Xen image load base address for later use. */
-mov %esi, sym_esi(xen_phys_start)
-mov %esi, sym_esi(trampoline_xen_phys_start)
-
 /* Get bottom-most low-memory stack address. */
 mov sym_esi(trampoline_phys), %ecx
 add $TRAMPOLINE_SPACE,%ecx
-- 
2.45.2




[PATCH 2/5] x86: Fix early output messages in case of EFI

2024-08-07 Thread Alejandro Vallejo
If the code is loaded by EFI, the loader will relocate the image
under 4GB. This causes offsets in x86 code generated by
sym_offs(SYMBOL) to be relocated too (basically they won't be
offsets from the image base). In order to get the real offset, the
formula "sym_offs(SYMBOL) - sym_offs(__image_base__)" is
used instead.
Also, in some cases the %esi register (which should point to the
__image_base__ address) is not set, so compute it in all cases.
The code was tested by forcing failures in it.

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/boot/head.S | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/boot/head.S b/xen/arch/x86/boot/head.S
index f027ff45fd..296f76146a 100644
--- a/xen/arch/x86/boot/head.S
+++ b/xen/arch/x86/boot/head.S
@@ -188,8 +188,27 @@ early_error: /* Here to improve the disassembly. */
 xor %edi,%edi   # No VGA text buffer
 jmp .Lprint_err
 .Lget_vtb:
-mov sym_esi(vga_text_buffer), %edi
+mov $sym_offs(vga_text_buffer), %edi
 .Lprint_err:
+mov $sym_offs(__image_base__), %ebx
+
+/* compute base, relocation or not */
+call1f
+1:
+pop %esi
+subl$sym_offs(1b), %esi
+addl%ebx, %esi
+
+/* adjust offset and load */
+test%edi, %edi
+jz  1f
+subl%ebx, %edi
+movl(%edi,%esi,1), %edi
+1:
+
+/* adjust message offset */
+subl%ebx, %ecx
+
 add %ecx, %esi # Add string offset to relocation base.
 # NOTE: No further use of sym_esi() till the end of the "function"!
 1:
-- 
2.45.2




[PATCH 4/5] x86: Force proper gdt_boot_base setting

2024-08-07 Thread Alejandro Vallejo
Instead of relocating the value in place, compute it entirely and
write it.
During EFI boots, values generated by sym_offs(SYMBOL) are potentially
relocated, causing the stored value to be corrupted.
For PVH and BIOS the change isn't strictly necessary, but it keeps
the code consistent.

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/boot/head.S | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/boot/head.S b/xen/arch/x86/boot/head.S
index 5b82221038..abfa3d82f7 100644
--- a/xen/arch/x86/boot/head.S
+++ b/xen/arch/x86/boot/head.S
@@ -132,8 +132,7 @@ multiboot2_header:
 gdt_boot_descr:
 .word   .Ltrampoline_gdt_end - trampoline_gdt - 1
 gdt_boot_base:
-.long   sym_offs(trampoline_gdt)
-.long   0 /* Needed for 64-bit lgdt */
+.quad   0 /* Needed for 64-bit lgdt */
 
 vga_text_buffer:
 .long   0xb8000
@@ -392,15 +391,16 @@ __efi64_mb2_start:
 x86_32_switch:
 mov %r15,%rdi
 
-/* Store Xen image load base address in place accessible for 32-bit code. */
-lea __image_base__(%rip),%esi
-
 cli
 
 /* Initialize GDTR. */
-add %esi,gdt_boot_base(%rip)
+lea trampoline_gdt(%rip),%esi
+movl%esi,gdt_boot_base(%rip)
 lgdtgdt_boot_descr(%rip)
 
+/* Store Xen image load base address in place accessible for 32-bit code. */
+lea __image_base__(%rip),%esi
+
 /* Reload code selector. */
 pushq   $BOOT_CS32
 lea cs32_switch(%rip),%edx
@@ -458,7 +458,8 @@ __pvh_start:
 movb$-1, sym_esi(opt_console_xen)
 
 /* Prepare gdt and segments */
-add %esi, sym_esi(gdt_boot_base)
+lea sym_esi(trampoline_gdt), %ecx
+movl%ecx, sym_esi(gdt_boot_base)
 lgdtsym_esi(gdt_boot_descr)
 
 mov $BOOT_DS, %ecx
@@ -562,7 +563,8 @@ trampoline_bios_setup:
  *
  * Initialize GDTR and basic data segments.
  */
-add %esi,sym_esi(gdt_boot_base)
+lea sym_esi(trampoline_gdt), %ecx
+movl%ecx, sym_esi(gdt_boot_base)
 lgdtsym_esi(gdt_boot_descr)
 
 mov $BOOT_DS,%ecx
-- 
2.45.2




[PATCH 0/5] Improve support for EFI multiboot loading

2024-08-07 Thread Alejandro Vallejo
--
(This series is work from Frediano. He's having some issues with his dev
environment and asked me to push it to xen-devel on his behalf)
--

While testing this feature in preparation for the UEFI CA memory mitigation
requirements, I found some issues causing loading to fail, plus other
minor issues.
Details are in the series' commit messages.

Frediano Ziglio (5):
  x86: Put trampoline in .init.data section
  x86: Fix early output messages in case of EFI
  x86: Set xen_phys_start and trampoline_xen_phys_start earlier
  x86: Force proper gdt_boot_base setting
  x86: Rollback relocation in case of EFI multiboot

 xen/arch/x86/boot/head.S  | 81 ++-
 xen/arch/x86/boot/reloc.c | 63 +-
 2 files changed, 125 insertions(+), 19 deletions(-)

-- 
2.45.2




[PATCH 1/5] x86: Put trampoline in .init.data section

2024-08-07 Thread Alejandro Vallejo
This change allows the trampoline to be put in a separate, non-executable
section. The trampoline contains a mix of code and data (data which
is modified from C code during early start, so it must be writable).
This is in preparation for the W^X patch, in order to satisfy UEFI CA
memory mitigation requirements.
At the moment .init.text and .init.data in EFI mode are put together,
so they will end up in the same final section as before this patch.

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/boot/head.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/xen/arch/x86/boot/head.S b/xen/arch/x86/boot/head.S
index d8ac0f0494..f027ff45fd 100644
--- a/xen/arch/x86/boot/head.S
+++ b/xen/arch/x86/boot/head.S
@@ -870,6 +870,8 @@ cmdline_parse_early:
 reloc:
 .incbin "reloc.bin"
 
+.section .init.data, "aw", @progbits
+.align 4
 ENTRY(trampoline_start)
 #include "trampoline.S"
 ENTRY(trampoline_end)
-- 
2.45.2




[PATCH 5/5] x86: Rollback relocation in case of EFI multiboot

2024-08-07 Thread Alejandro Vallejo
In the EFI non-multiboot case, rolling back the relocation is done in
efi_arch_post_exit_boot(), called by efi_start(); however, this is
not done in the multiboot code path.
Do it for this path too, to make it work correctly.

Signed-off-by: Frediano Ziglio 
---
 xen/arch/x86/boot/head.S  | 29 +++---
 xen/arch/x86/boot/reloc.c | 63 ++-
 2 files changed, 87 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/boot/head.S b/xen/arch/x86/boot/head.S
index abfa3d82f7..75ac74a589 100644
--- a/xen/arch/x86/boot/head.S
+++ b/xen/arch/x86/boot/head.S
@@ -352,6 +352,7 @@ __efi64_mb2_start:
 and $~15,%rsp
 
 /* Save Multiboot2 magic on the stack. */
+shlq$32, %rax
 push%rax
 
 /* Save EFI ImageHandle on the stack. */
@@ -382,11 +383,24 @@ __efi64_mb2_start:
 /* Just pop an item from the stack. */
 pop %rax
 
-/* Restore Multiboot2 magic. */
-pop %rax
+/* Prepare stack for relocation call */
+subq$16, %rsp
+lea l2_bootmap(%rip), %ecx
+movl%ecx, 16(%rsp)
+lea l3_bootmap(%rip), %ecx
+movl%ecx, 12(%rsp)
+lea __base_relocs_end(%rip), %ecx
+movl%ecx, 8(%rsp)
+lea __base_relocs_start(%rip), %ecx
+movl%ecx, 4(%rsp)
+lea __image_base__(%rip),%rsi
+movl%esi, (%rsp)
+movabsq $__XEN_VIRT_START, %rcx
+subq%rsi, %rcx
+push%rcx
 
-/* Jump to trampoline_setup after switching CPU to x86_32 mode. */
-lea trampoline_setup(%rip),%r15
+/* Jump to trampoline_efi_setup after switching CPU to x86_32 mode. */
+lea trampoline_efi_setup(%rip),%r15
 
 x86_32_switch:
 mov %r15,%rdi
@@ -557,6 +571,12 @@ __start:
 and $~(MULTIBOOT2_TAG_ALIGN-1),%ecx
 jmp .Lmb2_tsize
 
+trampoline_efi_setup:
+movb$1, %al
+callreloc
+pop %eax
+jmp trampoline_setup
+
 trampoline_bios_setup:
 /*
  * Called on legacy BIOS platforms only.
@@ -627,6 +647,7 @@ trampoline_setup:
 push%ecx/* Bottom-most low-memory stack address. */
 push%ebx/* Multiboot / PVH information address. */
 push%eax/* Magic number. */
+movb$0, %al
 callreloc
 #ifdef CONFIG_PVH_GUEST
 cmpb$0, sym_esi(pvh_boot)
diff --git a/xen/arch/x86/boot/reloc.c b/xen/arch/x86/boot/reloc.c
index 4033557481..3aa97a99d0 100644
--- a/xen/arch/x86/boot/reloc.c
+++ b/xen/arch/x86/boot/reloc.c
@@ -23,7 +23,9 @@ asm (
 ".text \n"
 ".globl _start \n"
 "_start:   \n"
-"jmp  reloc\n"
+"cmpb $0, %al  \n"
+"je   reloc\n"
+"jmp  reloc_pe_back\n"
 );
 
 #include "defs.h"
@@ -375,6 +377,65 @@ void *__stdcall reloc(uint32_t magic, uint32_t in, uint32_t trampoline,
 }
 }
 
+struct pe_base_relocs {
+u32 rva;
+u32 size;
+u16 entries[];
+};
+
+#define PE_BASE_RELOC_ABS  0
+#define PE_BASE_RELOC_HIGHLOW  3
+#define PE_BASE_RELOC_DIR64   10
+
+void __stdcall reloc_pe_back(long long delta,
+ uint32_t xen_phys_start,
+ const struct pe_base_relocs *__base_relocs_start,
+ const struct pe_base_relocs *__base_relocs_end,
+ char *l3_bootmap, char *l2_bootmap)
+{
+const struct pe_base_relocs *base_relocs;
+
+for ( base_relocs = __base_relocs_start; base_relocs < __base_relocs_end; )
+{
+unsigned int i = 0, n;
+
+n = (base_relocs->size - sizeof(*base_relocs)) /
+sizeof(*base_relocs->entries);
+
+/*
+ * Relevant l{2,3}_bootmap entries get initialized explicitly in
+ * efi_arch_memory_setup(), so we must not apply relocations there.
+ * l2_directmap's first slot, otoh, should be handled normally, as
+ * efi_arch_memory_setup() won't touch it (xen_phys_start should
+ * never be zero).
+ */
+if ( xen_phys_start + base_relocs->rva == (unsigned long)l3_bootmap ||
+ xen_phys_start + base_relocs->rva == (unsigned long)l2_bootmap )
+i = n;
+
+for ( ; i < n; ++i )
+{
+unsigned long addr = xen_phys_start + base_relocs->rva +
+ (base_relocs->entries[i] & 0xfff);
+
+switch ( base_relocs->entries[i] >> 12 )
+{
+case PE_BASE_RELOC_ABS:
+break;
+case PE_BASE_RELOC_HIGHLOW:
+if ( delta )
+*(u32 *)addr += delta;
+break;
+case PE_BASE_RELOC_DIR64:
+if ( delta )
+   

Re: [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function

2024-07-29 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> In preparation for the function being called from contexts where no domain is
> present.
>
> No functional change intended.
>
> Signed-off-by: Roger Pau Monné 
> ---
>  xen/arch/x86/include/asm/mm.h  |  4 +++-
>  xen/arch/x86/mm.c  | 24 +---
>  xen/arch/x86/mm/hap/hap.c  |  3 ++-
>  xen/arch/x86/mm/shadow/hvm.c   |  3 ++-
>  xen/arch/x86/mm/shadow/multi.c |  7 +--
>  xen/arch/x86/pv/dom0_build.c   |  3 ++-
>  xen/arch/x86/pv/domain.c   |  3 ++-
>  7 files changed, 29 insertions(+), 18 deletions(-)
>
> diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> index b3853ae734fa..076e7009dc99 100644
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -375,7 +375,9 @@ int devalidate_page(struct page_info *page, unsigned long 
> type,
>  
>  void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d);
>  void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
> -   const struct domain *d, mfn_t sl4mfn, bool ro_mpt);
> +   mfn_t sl4mfn, const struct page_info *perdomain_l3,
> +   bool ro_mpt, bool maybe_compat, bool short_directmap);
> +

The comment currently in the .c file should probably be here instead, and
updated for the new arguments. That said, I'm skeptical that 3 booleans are
desirable. They induce a lot of complexity at the call sites (which of the 8
forms of init_xen_l4_slots() do I need here?) and a lot of cognitive overload.

I can't propose a solution because I'm still wrapping my head around how the
layout (esp. compat layout) fits together. Maybe the booleans can be mapped to
an enum? It would also help interpret the call sites, as they'd no longer be a
sequence of contextless booleans, but readable identifiers.
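
For instance (a sketch only; the names here are invented):

```c
/* Replace the run of booleans with named flags so each call site
 * documents itself. */
enum init_l4_flags {
    INIT_L4_RO_MPT          = 1U << 0,
    INIT_L4_MAYBE_COMPAT    = 1U << 1,
    INIT_L4_SHORT_DIRECTMAP = 1U << 2,
};

void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn, mfn_t sl4mfn,
                       const struct page_info *perdomain_l3,
                       unsigned int flags);
```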

>  bool fill_ro_mpt(mfn_t mfn);
>  void zap_ro_mpt(mfn_t mfn);
>  
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index a792a300a866..c01b6712143e 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1645,14 +1645,9 @@ static int promote_l3_table(struct page_info *page)
>   * extended directmap.
>   */
>  void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
> -   const struct domain *d, mfn_t sl4mfn, bool ro_mpt)
> +   mfn_t sl4mfn, const struct page_info *perdomain_l3,
> +   bool ro_mpt, bool maybe_compat, bool short_directmap)
>  {
> -/*
> - * PV vcpus need a shortened directmap.  HVM and Idle vcpus get the full
> - * directmap.
> - */
> -bool short_directmap = !paging_mode_external(d);
> -
>  /* Slot 256: RO M2P (if applicable). */
>  l4t[l4_table_offset(RO_MPT_VIRT_START)] =
>  ro_mpt ? idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)]
> @@ -1673,13 +1668,14 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
>  l4e_from_mfn(sl4mfn, __PAGE_HYPERVISOR_RW);
>  
>  /* Slot 260: Per-domain mappings. */
> -l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
> -l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
> +if ( perdomain_l3 )
> +l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
> +l4e_from_page(perdomain_l3, __PAGE_HYPERVISOR_RW);
>  
>  /* Slot 4: Per-domain mappings mirror. */
>  BUILD_BUG_ON(IS_ENABLED(CONFIG_PV32) &&
>   !l4_table_offset(PERDOMAIN_ALT_VIRT_START));
> -if ( !is_pv_64bit_domain(d) )
> +if ( perdomain_l3 && maybe_compat )
>  l4t[l4_table_offset(PERDOMAIN_ALT_VIRT_START)] =
>  l4t[l4_table_offset(PERDOMAIN_VIRT_START)];
>  
> @@ -1710,6 +1706,10 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
>  else
>  #endif
>  {
> +/*
> + * PV vcpus need a shortened directmap.  HVM and Idle vcpus get the full
> + * directmap.
> + */
>  unsigned int slots = (short_directmap
>? ROOT_PAGETABLE_PV_XEN_SLOTS
>: ROOT_PAGETABLE_XEN_SLOTS);
> @@ -1830,7 +1830,9 @@ static int promote_l4_table(struct page_info *page)
>  if ( !rc )
>  {
>  init_xen_l4_slots(pl4e, l4mfn,
> -  d, INVALID_MFN, VM_ASSIST(d, m2p_strict));
> +  INVALID_MFN, d->arch.perdomain_l3_pg,
> +  VM_ASSIST(d, m2p_strict), !is_pv_64bit_domain(d),
> +  true);
>  atomic_inc(&d->arch.pv.nr_l4_pages);
>  }
>  unmap_domain_page(pl4e);
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index d2011fde2462..c8514ca0e917 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -402,7 +402,8 @@ static mfn_t hap_make_monitor_table(struct vcpu *v)
>  m4mfn = page_to_mfn(pg);
>  l4e = map_domain_page(m4mfn);
>  
> -init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false);
> +init_xen_l4_slots(l4

Re: [PATCH v2] x86/altcall: further refine clang workaround

2024-07-29 Thread Alejandro Vallejo
On Mon Jul 29, 2024 at 11:47 AM BST, Jan Beulich wrote:
> On 29.07.2024 12:30, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/include/asm/alternative.h
> > +++ b/xen/arch/x86/include/asm/alternative.h
> > @@ -183,13 +183,13 @@ extern void alternative_branches(void);
> >   * https://github.com/llvm/llvm-project/issues/12579
> >   * https://github.com/llvm/llvm-project/issues/82598
> >   */
> > -#define ALT_CALL_ARG(arg, n)\
> > -register union {\
> > -typeof(arg) e[sizeof(long) / sizeof(arg)];  \
> > -unsigned long r;\
> > -} a ## n ## _ asm ( ALT_CALL_arg ## n ) = { \
> > -.e[0] = ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); })\
> > -}
> > +#define ALT_CALL_ARG(arg, n) \
> > + register unsigned long a ## n ## _ asm ( ALT_CALL_arg ## n ) = ({   \
> > + unsigned long tmp = 0;  \
> > + *(typeof(arg) *)&tmp = (arg);   \
> > + BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); \
>
> With this, even more so than before, I think the type of tmp would better
> be void * (or the BUILD_BUG_ON() be made use unsigned long, yet I consider
> that less desirable). As a nit, I also don't think the backslashes need
> moving out by one position. Finally I'm afraid you're leaving stale the
> comment ahead of this construct.
>
> Jan

I wouldn't be thrilled to create a temp variable of yet another type that is
potentially neither "typeof(arg)" nor "unsigned long". No need to create a
3-body problem where 2 bodies are enough. Adjusting BUILD_BUG_ON() to use the
same temp type seems sensible, but I don't mind very much.

With the stale comment adjusted:

  Reviewed-by: Alejandro Vallejo 

Cheers,
Alejandro



Re: [PATCH] x86/altcall: further refine clang workaround

2024-07-26 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 5:25 PM BST, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:18 PM BST, Roger Pau Monné wrote:
> > On Fri, Jul 26, 2024 at 03:25:08PM +0100, Alejandro Vallejo wrote:
> > > On Fri Jul 26, 2024 at 3:17 PM BST, Alejandro Vallejo wrote:
> > > > On Fri Jul 26, 2024 at 9:05 AM BST, Jan Beulich wrote:
> > > > > On 26.07.2024 09:52, Roger Pau Monné wrote:
> > > > > > On Fri, Jul 26, 2024 at 09:36:15AM +0200, Jan Beulich wrote:
> > > > > >> On 26.07.2024 09:31, Roger Pau Monné wrote:
> > > > > >>> On Thu, Jul 25, 2024 at 05:00:22PM +0200, Jan Beulich wrote:
> > > > > >>>> On 25.07.2024 16:54, Roger Pau Monné wrote:
> > > > > >>>>> On Thu, Jul 25, 2024 at 03:18:29PM +0200, Jan Beulich wrote:
> > > > > >>>>>> On 25.07.2024 12:56, Roger Pau Monne wrote:
> > > > > >>>>>>> --- a/xen/arch/x86/include/asm/alternative.h
> > > > > >>>>>>> +++ b/xen/arch/x86/include/asm/alternative.h
> > > > > >>>>>>> @@ -184,11 +184,11 @@ extern void alternative_branches(void);
> > > > > >>>>>>>   * https://github.com/llvm/llvm-project/issues/82598
> > > > > >>>>>>>   */
> > > > > >>>>>>>  #define ALT_CALL_ARG(arg, n) 
> > > > > >>>>>>>\
> > > > > >>>>>>> -register union { 
> > > > > >>>>>>>\
> > > > > >>>>>>> -typeof(arg) e[sizeof(long) / sizeof(arg)];   
> > > > > >>>>>>>\
> > > > > >>>>>>> -unsigned long r; 
> > > > > >>>>>>>\
> > > > > >>>>>>> +register struct {
> > > > > >>>>>>>\
> > > > > >>>>>>> +typeof(arg) e;   
> > > > > >>>>>>>\
> > > > > >>>>>>> +char pad[sizeof(void *) - sizeof(arg)];  
> > > > > >>>>>>>\
> > > > > >>>>>>
> > > > > >>>>>> One thing that occurred to me only after our discussion, and I 
> > > > > >>>>>> then forgot
> > > > > >>>>>> to mention this before you would send a patch: What if 
> > > > > >>>>>> sizeof(void *) ==
> > > > > >>>>>> sizeof(arg)? Zero-sized arrays are explicitly something we're 
> > > > > >>>>>> trying to
> > > > > >>>>>> get rid of.
> > > > > >>>>>
> > > > > >>>>> I wondered about this, but I though it was only [] that we were 
> > > > > >>>>> trying
> > > > > >>>>> to get rid of, not [0].
> > > > > >>>>
> > > > > >>>> Sadly (here) it's actually the other way around, aiui.
> > > > > >>>
> > > > > >>> The only other option I have in mind is using an oversized array 
> > > > > >>> on
> > > > > >>> the union, like:
> > > > > >>>
> > > > > >>> #define ALT_CALL_ARG(arg, n)  
> > > > > >>>   \
> > > > > >>> union {   
> > > > > >>>   \
> > > > > >>> typeof(arg) e[(sizeof(long) + sizeof(arg) - 1) / 
> > > > > >>> sizeof(arg)];  \
> > > > > >>> unsigned long r;  
> > > > > >>>   \
> > > > > >>> } a ## n ## __  = {   
> > > > > >>>   \
> > > > > >>> .e[0] = ({ BUILD_BUG_ON(sizeof(arg)

Re: [PATCH] x86/viridian: Clarify some viridian logging strings

2024-07-26 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 4:11 PM BST, Paul Durrant wrote:
> On 26/07/2024 15:52, Alejandro Vallejo wrote:
> > It's sadistically misleading to show an error without letters and expect
> > the dmesg reader to understand it's in hex.
>
> That depends on who's doing the reading.
>
> > The patch adds a 0x prefix
> > to all hex numbers that don't already have it.
> > 
> > On the one instance in which a boolean is printed as an integer, print
> > it as a decimal integer instead so it's 0/1 in the common case and not
> > misleading if it's ever not just that due to a bug.
> > 
> > While at it, rename VIRIDIAN CRASH to VIRIDIAN GUEST_CRASH. Every member
> > of a support team that looks at the message systematically believes
> > "viridian" crashed,
>
> ... which suggests they need educating as to what 'viridian' is (or was).
>

Can't argue with you there. But if a minor cosmetic tweak to a dmesg string
clarifies a matter without further explanation, it's imo a net positive change.

> > which is absolutely not what goes on. It's the guest
> > asking the hypervisor for a sudden shutdown because it crashed, and
> > stating why.
> > 
> > Signed-off-by: Alejandro Vallejo 
> > ---
> > Still going through its Gitlab pipeline
> > 
> > ---
> >   xen/arch/x86/hvm/viridian/synic.c| 2 +-
> >   xen/arch/x86/hvm/viridian/viridian.c | 9 +
> >   2 files changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/viridian/synic.c b/xen/arch/x86/hvm/viridian/synic.c
> > index 3375e55e95ca..c3dc573b003d 100644
> > --- a/xen/arch/x86/hvm/viridian/synic.c
> > +++ b/xen/arch/x86/hvm/viridian/synic.c
> > @@ -172,7 +172,7 @@ int viridian_synic_wrmsr(struct vcpu *v, uint32_t idx, uint64_t val)
> >   vector = new.vector;
> >   vv->vector_to_sintx[vector] = sintx;
> >   
> > -printk(XENLOG_G_INFO "%pv: VIRIDIAN SINT%u: vector: %x\n", v, sintx,
> > +printk(XENLOG_G_INFO "%pv: VIRIDIAN SINT%u: vector: %#x\n", v, sintx,
> >  vector);
> >   
> >   *vs = new;
> > diff --git a/xen/arch/x86/hvm/viridian/viridian.c b/xen/arch/x86/hvm/viridian/viridian.c
> > index 0496c52ed5a2..21480d9ee700 100644
> > --- a/xen/arch/x86/hvm/viridian/viridian.c
> > +++ b/xen/arch/x86/hvm/viridian/viridian.c
> > @@ -253,7 +253,7 @@ static void dump_guest_os_id(const struct domain *d)
> >   goi = &d->arch.hvm.viridian->guest_os_id;
> >   
> >   printk(XENLOG_G_INFO
> > -   "d%d: VIRIDIAN GUEST_OS_ID: vendor: %x os: %x major: %x minor: 
> > %x sp: %x build: %x\n",
> > +   "d%d: VIRIDIAN GUEST_OS_ID: vendor: %#x os: %#x major: %#x 
> > minor: %#x sp: %#x build: %#x\n",
> >  d->domain_id, goi->vendor, goi->os, goi->major, goi->minor,
> >  goi->service_pack, goi->build_number);
> >   }
> > @@ -264,7 +264,7 @@ static void dump_hypercall(const struct domain *d)
> >   
> >   hg = &d->arch.hvm.viridian->hypercall_gpa;
> >   
> > -printk(XENLOG_G_INFO "d%d: VIRIDIAN HYPERCALL: enabled: %x pfn: %lx\n",
> > +printk(XENLOG_G_INFO "d%d: VIRIDIAN HYPERCALL: enabled: %u pfn: 
> > %#lx\n",
> >  d->domain_id,
> >  hg->enabled, (unsigned long)hg->pfn);
> >   }
> > @@ -372,7 +372,8 @@ int guest_wrmsr_viridian(struct vcpu *v, uint32_t idx, uint64_t val)
> >   d->shutdown_code = SHUTDOWN_crash;
> >   spin_unlock(&d->shutdown_lock);
> >   
> > -gprintk(XENLOG_WARNING, "VIRIDIAN CRASH: %lx %lx %lx %lx %lx\n",
> > +gprintk(XENLOG_WARNING,
> > +"VIRIDIAN GUEST_CRASH: %#lx %#lx %#lx %#lx %#lx\n",
>
> Honestly this change should be unnecessary, but since this is all 
> cosmetic...
>
> Reviewed-by: Paul Durrant 
>

Thanks

> >   vv->crash_param[0], vv->crash_param[1], 
> > vv->crash_param[2],
> >   vv->crash_param[3], vv->crash_param[4]);
> >   break;
> > @@ -1056,7 +1057,7 @@ void viridian_dump_guest_page(const struct vcpu *v, const char *name,
> >   if ( !vp->msr.enabled )
> >   return;
> >   
> > -printk(XENLOG_G_INFO "%pv: VIRIDIAN %s: pfn: %lx\n",
> > +printk(XENLOG_G_INFO "%pv: VIRIDIAN %s: pfn: %#lx\n",
> >  v, name, (unsigned long)vp->msr.pfn);
> >   }
> >   

Cheers,
Alejandro



Re: [PATCH] x86/altcall: further refine clang workaround

2024-07-26 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 4:18 PM BST, Roger Pau Monné wrote:
> On Fri, Jul 26, 2024 at 03:25:08PM +0100, Alejandro Vallejo wrote:
> > On Fri Jul 26, 2024 at 3:17 PM BST, Alejandro Vallejo wrote:
> > > On Fri Jul 26, 2024 at 9:05 AM BST, Jan Beulich wrote:
> > > > On 26.07.2024 09:52, Roger Pau Monné wrote:
> > > > > On Fri, Jul 26, 2024 at 09:36:15AM +0200, Jan Beulich wrote:
> > > > >> On 26.07.2024 09:31, Roger Pau Monné wrote:
> > > > >>> On Thu, Jul 25, 2024 at 05:00:22PM +0200, Jan Beulich wrote:
> > > > >>>> On 25.07.2024 16:54, Roger Pau Monné wrote:
> > > > >>>>> On Thu, Jul 25, 2024 at 03:18:29PM +0200, Jan Beulich wrote:
> > > > >>>>>> On 25.07.2024 12:56, Roger Pau Monne wrote:
> > > > >>>>>>> --- a/xen/arch/x86/include/asm/alternative.h
> > > > >>>>>>> +++ b/xen/arch/x86/include/asm/alternative.h
> > > > >>>>>>> @@ -184,11 +184,11 @@ extern void alternative_branches(void);
> > > > >>>>>>>   * https://github.com/llvm/llvm-project/issues/82598
> > > > >>>>>>>   */
> > > > >>>>>>>  #define ALT_CALL_ARG(arg, n)   
> > > > >>>>>>>  \
> > > > >>>>>>> -register union {   
> > > > >>>>>>>  \
> > > > >>>>>>> -typeof(arg) e[sizeof(long) / sizeof(arg)]; 
> > > > >>>>>>>  \
> > > > >>>>>>> -unsigned long r;   
> > > > >>>>>>>  \
> > > > >>>>>>> +register struct {  
> > > > >>>>>>>  \
> > > > >>>>>>> +typeof(arg) e; 
> > > > >>>>>>>  \
> > > > >>>>>>> +char pad[sizeof(void *) - sizeof(arg)];
> > > > >>>>>>>  \
> > > > >>>>>>
> > > > >>>>>> One thing that occurred to me only after our discussion, and I 
> > > > >>>>>> then forgot
> > > > >>>>>> to mention this before you would send a patch: What if 
> > > > >>>>>> sizeof(void *) ==
> > > > >>>>>> sizeof(arg)? Zero-sized arrays are explicitly something we're 
> > > > >>>>>> trying to
> > > > >>>>>> get rid of.
> > > > >>>>>
> > > > >>>>> I wondered about this, but I though it was only [] that we were 
> > > > >>>>> trying
> > > > >>>>> to get rid of, not [0].
> > > > >>>>
> > > > >>>> Sadly (here) it's actually the other way around, aiui.
> > > > >>>
> > > > >>> The only other option I have in mind is using an oversized array on
> > > > >>> the union, like:
> > > > >>>
> > > > >>> #define ALT_CALL_ARG(arg, n)
> > > > >>> \
> > > > >>> union { 
> > > > >>> \
> > > > >>> typeof(arg) e[(sizeof(long) + sizeof(arg) - 1) / 
> > > > >>> sizeof(arg)];  \
> > > > >>> unsigned long r;
> > > > >>> \
> > > > >>> } a ## n ## __  = { 
> > > > >>> \
> > > > >>> .e[0] = ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); 
> > > > >>> (arg); })\
> > > > >>> };  
> > > > >>> \
> > > > >>> register unsigned long a ## n ## _ asm ( ALT_CALL_arg ## n ) =  
> > > > >>> \
> > > > >>> a ## n ## __.r
> > > > >>
> > > > >&

[PATCH] x86/viridian: Clarify some viridian logging strings

2024-07-26 Thread Alejandro Vallejo
It's sadistically misleading to show an error without letters and expect
the dmesg reader to understand it's in hex. The patch adds a 0x prefix
to all hex numbers that don't already have it.

On the one instance in which a boolean is printed as an integer, print
it as a decimal integer instead so it's 0/1 in the common case and not
misleading if it's ever not just that due to a bug.

While at it, rename VIRIDIAN CRASH to VIRIDIAN GUEST_CRASH. Every member
of a support team that looks at the message systematically believes
"viridian" crashed, which is absolutely not what goes on. It's the guest
asking the hypervisor for a sudden shutdown because it crashed, and
stating why.
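
A tiny worked example of the ambiguity (illustration mine, plain C):

```c
#include <stdio.h>

int main(void)
{
    unsigned long pfn = 0x1234;   /* hex rendering contains no letters */

    printf("pfn: %lx\n", pfn);    /* prints "pfn: 1234" - reads as decimal */
    printf("pfn: %#lx\n", pfn);   /* prints "pfn: 0x1234" - unambiguous */

    return 0;
}
```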

Signed-off-by: Alejandro Vallejo 
---
Still going through its Gitlab pipeline

---
 xen/arch/x86/hvm/viridian/synic.c| 2 +-
 xen/arch/x86/hvm/viridian/viridian.c | 9 +
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/viridian/synic.c b/xen/arch/x86/hvm/viridian/synic.c
index 3375e55e95ca..c3dc573b003d 100644
--- a/xen/arch/x86/hvm/viridian/synic.c
+++ b/xen/arch/x86/hvm/viridian/synic.c
@@ -172,7 +172,7 @@ int viridian_synic_wrmsr(struct vcpu *v, uint32_t idx, uint64_t val)
 vector = new.vector;
 vv->vector_to_sintx[vector] = sintx;
 
-printk(XENLOG_G_INFO "%pv: VIRIDIAN SINT%u: vector: %x\n", v, sintx,
+printk(XENLOG_G_INFO "%pv: VIRIDIAN SINT%u: vector: %#x\n", v, sintx,
vector);
 
 *vs = new;
diff --git a/xen/arch/x86/hvm/viridian/viridian.c b/xen/arch/x86/hvm/viridian/viridian.c
index 0496c52ed5a2..21480d9ee700 100644
--- a/xen/arch/x86/hvm/viridian/viridian.c
+++ b/xen/arch/x86/hvm/viridian/viridian.c
@@ -253,7 +253,7 @@ static void dump_guest_os_id(const struct domain *d)
 goi = &d->arch.hvm.viridian->guest_os_id;
 
 printk(XENLOG_G_INFO
-   "d%d: VIRIDIAN GUEST_OS_ID: vendor: %x os: %x major: %x minor: %x 
sp: %x build: %x\n",
+   "d%d: VIRIDIAN GUEST_OS_ID: vendor: %#x os: %#x major: %#x minor: 
%#x sp: %#x build: %#x\n",
d->domain_id, goi->vendor, goi->os, goi->major, goi->minor,
goi->service_pack, goi->build_number);
 }
@@ -264,7 +264,7 @@ static void dump_hypercall(const struct domain *d)
 
 hg = &d->arch.hvm.viridian->hypercall_gpa;
 
-printk(XENLOG_G_INFO "d%d: VIRIDIAN HYPERCALL: enabled: %x pfn: %lx\n",
+printk(XENLOG_G_INFO "d%d: VIRIDIAN HYPERCALL: enabled: %u pfn: %#lx\n",
d->domain_id,
hg->enabled, (unsigned long)hg->pfn);
 }
@@ -372,7 +372,8 @@ int guest_wrmsr_viridian(struct vcpu *v, uint32_t idx, uint64_t val)
 d->shutdown_code = SHUTDOWN_crash;
 spin_unlock(&d->shutdown_lock);
 
-gprintk(XENLOG_WARNING, "VIRIDIAN CRASH: %lx %lx %lx %lx %lx\n",
+gprintk(XENLOG_WARNING,
+"VIRIDIAN GUEST_CRASH: %#lx %#lx %#lx %#lx %#lx\n",
 vv->crash_param[0], vv->crash_param[1], vv->crash_param[2],
 vv->crash_param[3], vv->crash_param[4]);
 break;
@@ -1056,7 +1057,7 @@ void viridian_dump_guest_page(const struct vcpu *v, const char *name,
 if ( !vp->msr.enabled )
 return;
 
-printk(XENLOG_G_INFO "%pv: VIRIDIAN %s: pfn: %lx\n",
+printk(XENLOG_G_INFO "%pv: VIRIDIAN %s: pfn: %#lx\n",
v, name, (unsigned long)vp->msr.pfn);
 }
 
-- 
2.45.2




Re: [PATCH] x86/altcall: further refine clang workaround

2024-07-26 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 3:17 PM BST, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 9:05 AM BST, Jan Beulich wrote:
> > On 26.07.2024 09:52, Roger Pau Monné wrote:
> > > On Fri, Jul 26, 2024 at 09:36:15AM +0200, Jan Beulich wrote:
> > >> On 26.07.2024 09:31, Roger Pau Monné wrote:
> > >>> On Thu, Jul 25, 2024 at 05:00:22PM +0200, Jan Beulich wrote:
> > >>>> On 25.07.2024 16:54, Roger Pau Monné wrote:
> > >>>>> On Thu, Jul 25, 2024 at 03:18:29PM +0200, Jan Beulich wrote:
> > >>>>>> On 25.07.2024 12:56, Roger Pau Monne wrote:
> > >>>>>>> --- a/xen/arch/x86/include/asm/alternative.h
> > >>>>>>> +++ b/xen/arch/x86/include/asm/alternative.h
> > >>>>>>> @@ -184,11 +184,11 @@ extern void alternative_branches(void);
> > >>>>>>>   * https://github.com/llvm/llvm-project/issues/82598
> > >>>>>>>   */
> > >>>>>>>  #define ALT_CALL_ARG(arg, n)   
> > >>>>>>>  \
> > >>>>>>> -register union {   
> > >>>>>>>  \
> > >>>>>>> -typeof(arg) e[sizeof(long) / sizeof(arg)]; 
> > >>>>>>>  \
> > >>>>>>> -unsigned long r;   
> > >>>>>>>  \
> > >>>>>>> +register struct {  
> > >>>>>>>  \
> > >>>>>>> +typeof(arg) e; 
> > >>>>>>>  \
> > >>>>>>> +char pad[sizeof(void *) - sizeof(arg)];
> > >>>>>>>  \
> > >>>>>>
> > >>>>>> One thing that occurred to me only after our discussion, and I then 
> > >>>>>> forgot
> > >>>>>> to mention this before you would send a patch: What if sizeof(void 
> > >>>>>> *) ==
> > >>>>>> sizeof(arg)? Zero-sized arrays are explicitly something we're trying 
> > >>>>>> to
> > >>>>>> get rid of.
> > >>>>>
> > >>>>> I wondered about this, but I though it was only [] that we were trying
> > >>>>> to get rid of, not [0].
> > >>>>
> > >>>> Sadly (here) it's actually the other way around, aiui.
> > >>>
> > >>> The only other option I have in mind is using an oversized array on
> > >>> the union, like:
> > >>>
> > >>> #define ALT_CALL_ARG(arg, n)
> > >>> \
> > >>> union { 
> > >>> \
> > >>> typeof(arg) e[(sizeof(long) + sizeof(arg) - 1) / sizeof(arg)];  
> > >>> \
> > >>> unsigned long r;
> > >>> \
> > >>> } a ## n ## __  = { 
> > >>> \
> > >>> .e[0] = ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); 
> > >>> })\
> > >>> };  
> > >>> \
> > >>> register unsigned long a ## n ## _ asm ( ALT_CALL_arg ## n ) =  
> > >>> \
> > >>> a ## n ## __.r
> > >>
> > >> Yet that's likely awful code-gen wise?
> > > 
> > > Seems OK: https://godbolt.org/z/nsdo5Gs8W
> >
> > In which case why not go this route. If the compiler is doing fine with
> > that, maybe the array dimension expression could be further simplified,
> > accepting yet more over-sizing? Like "sizeof(void *) / sizeof (arg) + 1"
> > or even simply "sizeof(void *)"? Suitably commented of course ...
> >
> > >> For the time being, can we perhaps
> > >> just tighten the BUILD_BUG_ON(), as iirc Alejandro had suggested?
> > > 
> > > My main concern with tightening the BUILD_BUG_ON() is that then I
> > > would also like to do so for the GCC one, so that build fails
> > > uniformly.
> >
> > If we were to take that route, then yes, probably should constrain both
> > (with a suitable comment on the gcc one).
> >
> > Jan
>
> Yet another way would be to have an intermediate `long` to cast onto. 
> Compilers
> will optimise away the copy. It ignores the different-type aliasing rules in
> the C spec, so there's an assumption that we have -fno-strict-aliasing. But I
> believe we do? Otherwise it should pretty much work on anything.
>
> ```
>   #define ALT_CALL_ARG(arg, n)  \
>   unsigned long __tmp = 0;  \
>   *(typeof(arg) *)&__tmp =  \
>   ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); })  \
>   register unsigned long a ## n ## _ asm ( ALT_CALL_arg ## n ) = __tmp; \
> ```
>
> fwiw, clang18 emits identical code compared with the previous godbolt link.
>
> Link: https://godbolt.org/z/facd1M9xa
>
> Cheers,
> Alejandro

Bah. s/b/__tmp/ in line 15. Same output though, so the point still stands.

Cheers,
Alejandro



Re: [PATCH] x86/altcall: further refine clang workaround

2024-07-26 Thread Alejandro Vallejo
On Fri Jul 26, 2024 at 9:05 AM BST, Jan Beulich wrote:
> On 26.07.2024 09:52, Roger Pau Monné wrote:
> > On Fri, Jul 26, 2024 at 09:36:15AM +0200, Jan Beulich wrote:
> >> On 26.07.2024 09:31, Roger Pau Monné wrote:
> >>> On Thu, Jul 25, 2024 at 05:00:22PM +0200, Jan Beulich wrote:
>  On 25.07.2024 16:54, Roger Pau Monné wrote:
> > On Thu, Jul 25, 2024 at 03:18:29PM +0200, Jan Beulich wrote:
> >> On 25.07.2024 12:56, Roger Pau Monne wrote:
> >>> --- a/xen/arch/x86/include/asm/alternative.h
> >>> +++ b/xen/arch/x86/include/asm/alternative.h
> >>> @@ -184,11 +184,11 @@ extern void alternative_branches(void);
> >>>   * https://github.com/llvm/llvm-project/issues/82598
> >>>   */
> >>>  #define ALT_CALL_ARG(arg, n) 
> >>>\
> >>> -register union { 
> >>>\
> >>> -typeof(arg) e[sizeof(long) / sizeof(arg)];   
> >>>\
> >>> -unsigned long r; 
> >>>\
> >>> +register struct {
> >>>\
> >>> +typeof(arg) e;   
> >>>\
> >>> +char pad[sizeof(void *) - sizeof(arg)];  
> >>>\
> >>
> >> One thing that occurred to me only after our discussion, and I then 
> >> forgot
> >> to mention this before you would send a patch: What if sizeof(void *) 
> >> ==
> >> sizeof(arg)? Zero-sized arrays are explicitly something we're trying to
> >> get rid of.
> >
> > I wondered about this, but I though it was only [] that we were trying
> > to get rid of, not [0].
> 
>  Sadly (here) it's actually the other way around, aiui.
> >>>
> >>> The only other option I have in mind is using an oversized array on
> >>> the union, like:
> >>>
> >>> #define ALT_CALL_ARG(arg, n)\
> >>> union { \
> >>> typeof(arg) e[(sizeof(long) + sizeof(arg) - 1) / sizeof(arg)];  \
> >>> unsigned long r;\
> >>> } a ## n ## __  = { \
> >>> .e[0] = ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); })\
> >>> };  \
> >>> register unsigned long a ## n ## _ asm ( ALT_CALL_arg ## n ) =  \
> >>> a ## n ## __.r
> >>
> >> Yet that's likely awful code-gen wise?
> > 
> > Seems OK: https://godbolt.org/z/nsdo5Gs8W
>
> In which case why not go this route. If the compiler is doing fine with
> that, maybe the array dimension expression could be further simplified,
> accepting yet more over-sizing? Like "sizeof(void *) / sizeof (arg) + 1"
> or even simply "sizeof(void *)"? Suitably commented of course ...
>
> >> For the time being, can we perhaps
> >> just tighten the BUILD_BUG_ON(), as iirc Alejandro had suggested?
> > 
> > My main concern with tightening the BUILD_BUG_ON() is that then I
> > would also like to do so for the GCC one, so that build fails
> > uniformly.
>
> If we were to take that route, then yes, probably should constrain both
> (with a suitable comment on the gcc one).
>
> Jan

Yet another way would be to have an intermediate `long` to cast onto. Compilers
will optimise away the copy. It ignores the different-type aliasing rules in
the C spec, so there's an assumption that we have -fno-strict-aliasing. But I
believe we do? Otherwise it should pretty much work on anything.

```
  #define ALT_CALL_ARG(arg, n)  \
  unsigned long __tmp = 0;  \
  *(typeof(arg) *)&__tmp =  \
  ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); });  \
  register unsigned long a ## n ## _ asm ( ALT_CALL_arg ## n ) = __tmp; \
```

fwiw, clang18 emits identical code compared with the previous godbolt link.

Link: https://godbolt.org/z/facd1M9xa

Cheers,
Alejandro



Re: [RFC XEN PATCH v2] x86/cpuid: Expose max_vcpus field in HVM hypervisor leaf

2024-07-24 Thread Alejandro Vallejo
On Wed Jul 24, 2024 at 2:01 PM BST, Jan Beulich wrote:
> On 24.07.2024 14:51, Matthew Barnes wrote:
> > On Wed, Jul 24, 2024 at 07:42:19AM +0200, Jan Beulich wrote:
> >> (re-adding xen-devel@)
> >>
> >> On 23.07.2024 14:57, Matthew Barnes wrote:
> >>> On Mon, Jul 22, 2024 at 01:37:11PM +0200, Jan Beulich wrote:
>  On 19.07.2024 16:21, Matthew Barnes wrote:
> > Currently, OVMF is hard-coded to set up a maximum of 64 vCPUs on
> > startup.
> >
> > There are efforts to support a maximum of 128 vCPUs, which would involve
> > bumping the OVMF constant from 64 to 128.
> >
> > However, it would be more future-proof for OVMF to access the maximum
> > number of vCPUs for a domain and set itself up appropriately at
> > run-time.
> >
> > GitLab ticket: https://gitlab.com/xen-project/xen/-/issues/191
> >
> > For OVMF to access the maximum vCPU count, this patch has Xen expose
> > the maximum vCPU ID via cpuid on the HVM hypervisor leaf in edx.
> >
> > Signed-off-by: Matthew Barnes 
> > ---
> > Changes in v2:
> > - Tweak value from "maximum vcpu count" to "maximum vcpu id"
> > - Reword commit message to avoid "have to" wording
> > - Fix vpcus -> vcpus typo
> > ---
> 
>  Yet still HVM-only?
> >>>
> >>> This field is only used when the guest is HVM, so I decided it should
> >>> only be present to HVM guests.
> >>>
> >>> If not, where else would you suggest to put this field?
> >>
> >> In a presently unused leaf? Or one of the unused registers of leaf x01
> >> (with the gating flag in leaf x02 ECX)?
> > 
> > I could establish leaf x06 as a 'domain info' leaf for both HVM and PV,
> > have EAX as a features bitmap, and EBX as the max_vcpu_id field.
> > 
> > Is this satisfactory?
>
> Hmm. Personally I think that all new leaves would better permit for multiple
> sub-leaves. Hence EAX is already unavailable. Additionally I'm told that
> there are internal discussions (supposed to be) going on at your end, which
> makes me wonder whether the above is the outcome of those discussions (in
> particular having at least tentative buy-off by Andrew).
>
> For the particular data to expose here, I would prefer the indicated re-use
> of an existing leaf. I haven't seen counter-arguments to that so far.
>
> Jan

I originally recommended Matt expose it on the HVM leaf, for semantic
cohesion with the other domain-related data and because it's strictly just
needed for HVM, at least for the time being.

It is true though that it's not HVM-specific and could go elsewhere. There's a
fiction of choice, but not so much in practice, I think. Re-using leaf 1 would
overload it semantically, as it's already used for version reporting (just like
other architectural CPUID groups). Leaf 2 could be an option, but it's somewhat
annoying because it leaves (pun intended) no room for expansion. A potential
new leaf 6 would indeed need to ensure only subleaf 0 is implemented (as do
leaves 4 and 5), but otherwise should be pretty harmless.
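
To make the shape concrete, here's roughly how a guest would consume such a
leaf. This is a sketch under assumptions not settled here: the Xen leaves
sitting at base 0x40000000 (they can slide in 0x100 increments), the new leaf
being 6, and max_vcpu_id living in EBX.

```c
#include <stdint.h>
#include <cpuid.h>  /* GCC/clang helper for the CPUID instruction */

static uint32_t xen_max_vcpu_id(void)
{
    uint32_t eax, ebx, ecx, edx;

    /* 0x40000000 base assumed; real code must scan for the Xen signature. */
    __cpuid_count(0x40000006, 0, eax, ebx, ecx, edx);

    return ebx; /* Hypothetical: maximum vCPU ID for this domain. */
}
```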

Andrew might very well have wildly different views.

Cheers,
Alejandro



Re: [PATCH 1/2] x86/efi: Simplify efi_arch_cpu() a little

2024-07-24 Thread Alejandro Vallejo
On Wed Jul 24, 2024 at 6:42 AM BST, Jan Beulich wrote:
> On 23.07.2024 15:47, Alejandro Vallejo wrote:
> > On Mon Jul 22, 2024 at 11:18 AM BST, Andrew Cooper wrote:
> >> +if ( (eax >> 16) != 0x8000 || eax < 0x80000000U )
> >> +blexit(L"In 64bit mode, but no extended CPUID leaves?!?");
> > 
> > I'm not sure about the condition even for the old code. If eax had 
> > 0x90000000
> > (because new convention appeared 10y in the future), then there would be
> > extended leaves but we would be needlessly bailing out. Why not simply check
> > that eax < 0x80000001 in here?
>
> eax = 0x90000000 is in leaf group 0x9000, not in the extended leaf group
> (0x8000). The splitting into groups may not be written down very well,
> but you can see the pattern in e.g. groups 0x8086 and 0xc000 also being
> used (by non-Intel non-AMD hardware), without those really being extended
> leaves in the sense that 0x8000 are.
>
> Jan

The code is checking for a number specifically in the extended group, but
that's the output of leaf 0x80000000, which is defined to be just that.

AMD: "The value returned in EAX provides the largest extended function number
  supported by the processor"

Intel: "Maximum Input Value for Extended Function CPUID Information."

Unless there are quirks I don't know about (I admit it's not unlikely), I just
don't see why this condition needs to be anything other than a check that the
maximum function number covers the leaves we read further ahead.

If the number happens to start with 0x8000, that'd be fine; but there's no
reason to bail out if it was 0x8001. And even if there was, the exit message is
misleading, as it claims there are no extended CPUID leaves when in reality an
unexpected max-extended-leaf was read off the base extended leaf.

Not that it matters a whole lot in practice because that's going to be within
range. But it feels like a needless complication of the check.
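
Concretely, what I have in mind is something like this (a sketch reusing the
patch's surroundings, nothing more):

```c
eax = cpuid_eax(0x80000000U);

/* Bail only if the max extended leaf doesn't cover what we read below. */
if ( eax < 0x80000001U )
    blexit(L"Extended CPUID leaf 0x80000001 not available");
```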

Regardless, as I said it's more of a comment on the previous code than it is
about this mechanical transformation.

Cheers,
Alejandro



Re: [PATCH for-4.19] x86/altcall: fix clang code-gen when using altcall in loop constructs

2024-07-23 Thread Alejandro Vallejo
On Tue Jul 23, 2024 at 5:09 PM BST, Roger Pau Monné wrote:
> On Tue, Jul 23, 2024 at 04:37:12PM +0100, Alejandro Vallejo wrote:
> > On Tue Jul 23, 2024 at 10:31 AM BST, Roger Pau Monne wrote:
> > > Clang will generate machine code that only resets the low 8 bits of %rdi
> > > between loop calls, leaving the rest of the register possibly containing
> > > garbage from the use of %rdi inside the called function.  Note also that 
> > > clang
> > > doesn't truncate the input parameters at the callee, thus breaking the 
> > > psABI.
> > >
> > > Fix this by turning the `e` element in the anonymous union into an array 
> > > that
> > > consumes the same space as an unsigned long, as this forces clang to 
> > > reset the
> > > whole %rdi register instead of just the low 8 bits.
> > >
> > > Fixes: 2ce562b2a413 ('x86/altcall: use a union as register type for 
> > > function parameters on clang')
> > > Suggested-by: Jan Beulich 
> > > Signed-off-by: Roger Pau Monné 
> > > ---
> > > Adding Oleksii as to whether this could be considered for 4.19: it's 
> > > strictly
> > > limited to clang builds, plus will need to be backported anyway.
> > > ---
> > >  xen/arch/x86/include/asm/alternative.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/xen/arch/x86/include/asm/alternative.h 
> > > b/xen/arch/x86/include/asm/alternative.h
> > > index 0d3697f1de49..e63b45927643 100644
> > > --- a/xen/arch/x86/include/asm/alternative.h
> > > +++ b/xen/arch/x86/include/asm/alternative.h
> > > @@ -185,10 +185,10 @@ extern void alternative_branches(void);
> > >   */
> > >  #define ALT_CALL_ARG(arg, n)\
> > >  register union {\
> > > -typeof(arg) e;  \
> > > +typeof(arg) e[sizeof(long) / sizeof(arg)];  \
> > >  unsigned long r;\
> > >  } a ## n ## _ asm ( ALT_CALL_arg ## n ) = { \
> > > -.e = ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); })   \
> > > +.e[0] = ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); })\
> > >  }
> > >  #else
> > >  #define ALT_CALL_ARG(arg, n) \
> > 
> > Don't we want BUILD_BUG_ON(sizeof(long) % sizeof(arg) == 0) instead?
>
> I think you meant BUILD_BUG_ON(sizeof(long) % sizeof(arg) != 0)?

Bah, yes. I wrote it as a COMPILE_ASSERT().

>
> > Otherwise
> > odd sizes will cause the wrong union size to prevail, and while I can't see
> > today how those might come to happen there's Murphy's law.
>
> The overall union size would still be fine, because it has the
> unsigned long element, it's just that the array won't cover all the
> space assigned to the long member?

I explained myself poorly. If the current BUILD_BUG_ON() stays as-is that's
right, but...

>
> IOW if sizeof(arg) == 7, then we would define an array with only 1
> element, which won't make the size of the union change, but won't
> cover the same space that's used by the long member.

... I thought the point of the patch was to cover the full union with the
array, and not just a subset. My proposed alternative merely tries to ensure
the argument's size always evenly divides sizeof(long), so the array is always
a perfect match.

Though admittedly, it wouldn't be rare for this to be enough to work around the
bug.
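
For clarity, this is the shape I was suggesting (a sketch; note the
divisibility check also subsumes the existing size check, since any size
dividing sizeof(long) can't exceed it):

```c
#define ALT_CALL_ARG(arg, n)                                              \
    register union {                                                      \
        typeof(arg) e[sizeof(long) / sizeof(arg)];                        \
        unsigned long r;                                                  \
    } a ## n ## _ asm ( ALT_CALL_arg ## n ) = {                           \
        .e[0] = ({                                                        \
            BUILD_BUG_ON(sizeof(long) % sizeof(arg) != 0);                \
            (arg);                                                        \
        })                                                                \
    }
```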

>
> However it's not possible for sizeof(arg) > 8 due to the existing
> BUILD_BUG_ON(), so the union can never be bigger than a long.
>
> Thanks, Roger.

Cheers,
Alejandro



Re: [PATCH for-4.19] x86/altcall: fix clang code-gen when using altcall in loop constructs

2024-07-23 Thread Alejandro Vallejo
On Tue Jul 23, 2024 at 10:31 AM BST, Roger Pau Monne wrote:
> Yet another clang code generation issue when using altcalls.
>
> The issue this time is with using loop constructs around alternative_{,v}call
> instances using parameter types smaller than the register size.
>
> Given the following example code:
>
> static void bar(bool b)
> {
> unsigned int i;
>
> for ( i = 0; i < 10; i++ )
> {
> int ret_;
> register union {
> bool e;
> unsigned long r;
> } di asm("rdi") = { .e = b };
> register unsigned long si asm("rsi");
> register unsigned long dx asm("rdx");
> register unsigned long cx asm("rcx");
> register unsigned long r8 asm("r8");
> register unsigned long r9 asm("r9");
> register unsigned long r10 asm("r10");
> register unsigned long r11 asm("r11");
>
> asm volatile ( "call %c[addr]"
>: "+r" (di), "=r" (si), "=r" (dx),
>  "=r" (cx), "=r" (r8), "=r" (r9),
>  "=r" (r10), "=r" (r11), "=a" (ret_)
>: [addr] "i" (&(func)), "g" (func)
>: "memory" );
> }
> }
>
> See: https://godbolt.org/z/qvxMGd84q
>
> Clang will generate machine code that only resets the low 8 bits of %rdi
> between loop calls, leaving the rest of the register possibly containing
> garbage from the use of %rdi inside the called function.  Note also that clang
> doesn't truncate the input parameters at the callee, thus breaking the psABI.
>
> Fix this by turning the `e` element in the anonymous union into an array that
> consumes the same space as an unsigned long, as this forces clang to reset the
> whole %rdi register instead of just the low 8 bits.
>
> Fixes: 2ce562b2a413 ('x86/altcall: use a union as register type for function 
> parameters on clang')
> Suggested-by: Jan Beulich 
> Signed-off-by: Roger Pau Monné 
> ---
> Adding Oleksii as to whether this could be considered for 4.19: it's strictly
> limited to clang builds, plus will need to be backported anyway.
> ---
>  xen/arch/x86/include/asm/alternative.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/include/asm/alternative.h 
> b/xen/arch/x86/include/asm/alternative.h
> index 0d3697f1de49..e63b45927643 100644
> --- a/xen/arch/x86/include/asm/alternative.h
> +++ b/xen/arch/x86/include/asm/alternative.h
> @@ -185,10 +185,10 @@ extern void alternative_branches(void);
>   */
>  #define ALT_CALL_ARG(arg, n)\
>  register union {\
> -typeof(arg) e;  \
> +typeof(arg) e[sizeof(long) / sizeof(arg)];  \
>  unsigned long r;\
>  } a ## n ## _ asm ( ALT_CALL_arg ## n ) = { \
> -.e = ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); })   \
> +.e[0] = ({ BUILD_BUG_ON(sizeof(arg) > sizeof(void *)); (arg); })\
>  }
>  #else
>  #define ALT_CALL_ARG(arg, n) \

Don't we want BUILD_BUG_ON(sizeof(long) % sizeof(arg) == 0) instead? Otherwise
odd sizes will cause the wrong union size to prevail, and while I can't see
today how those might come to happen there's Murphy's law.

Cheers,
Alejandro



Re: [PATCH 2/2] x86/efi: Unlock NX if necessary

2024-07-23 Thread Alejandro Vallejo
Well, damn. At least it was found rather quickly.

On Mon Jul 22, 2024 at 11:18 AM BST, Andrew Cooper wrote:
> EFI systems can run with NX disabled, as has been discovered on a Broadwell
> Supermicro X10SRM-TF system.
>
> Prior to commit fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot
> path"), the logic to unlock NX was common to all boot paths, but that commit
> moved it out of the native-EFI booth path.

I suspect you meant boot rather than booth.

>
> Have the EFI path attempt to unlock NX, rather than just blindly refusing to
> boot when CONFIG_REQUIRE_NX is active.
>
> Fixes: fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot path")
> Link: https://xcp-ng.org/forum/post/80520
> Reported-by: Gene Bright 
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Daniel P. Smith 
> CC: Marek Marczykowski-Górecki 
> CC: Alejandro Vallejo 
> CC: Gene Bright 
>
> Note.  Entirely speculative coding, based only on the forum report.
> ---
>  xen/arch/x86/efi/efi-boot.h | 33 ++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
> index 4e4be7174751..158350aa14e4 100644
> --- a/xen/arch/x86/efi/efi-boot.h
> +++ b/xen/arch/x86/efi/efi-boot.h
> @@ -736,13 +736,33 @@ static void __init efi_arch_handle_module(const struct 
> file *file,
>  efi_bs->FreePool(ptr);
>  }
>  
> +static bool __init intel_unlock_nx(void)
> +{
> +uint64_t val, disable;
> +
> +rdmsrl(MSR_IA32_MISC_ENABLE, val);
> +
> +disable = val & MSR_IA32_MISC_ENABLE_XD_DISABLE;
> +
> +if ( !disable )
> +return false;
> +
> +wrmsrl(MSR_IA32_MISC_ENABLE, val & ~disable);
> +trampoline_misc_enable_off |= disable;
> +
> +return true;
> +}

Do we want to "#ifdef CONFIG_INTEL" the contents?
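
i.e. something along these lines (just a sketch of what I mean):

```c
static bool __init intel_unlock_nx(void)
{
#ifdef CONFIG_INTEL
    uint64_t val, disable;

    rdmsrl(MSR_IA32_MISC_ENABLE, val);

    disable = val & MSR_IA32_MISC_ENABLE_XD_DISABLE;

    if ( !disable )
        return false;

    wrmsrl(MSR_IA32_MISC_ENABLE, val & ~disable);
    trampoline_misc_enable_off |= disable;

    return true;
#else
    /* The caller's vendor check can never match on !INTEL builds. */
    return false;
#endif
}
```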

> +
>  static void __init efi_arch_cpu(void)
>  {
> -uint32_t eax;
> +uint32_t eax, ebx, ecx, edx;
>  uint32_t *caps = boot_cpu_data.x86_capability;
>  
>  boot_tsc_stamp = rdtsc();
>  
> +cpuid(0, &eax, &ebx, &ecx, &edx);
> +boot_cpu_data.x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
> +
>  caps[FEATURESET_1c] = cpuid_ecx(1);
>  
>  eax = cpuid_eax(0x80000000U);
> @@ -752,10 +772,17 @@ static void __init efi_arch_cpu(void)
>  caps[FEATURESET_e1d] = cpuid_edx(0x80000001U);
>  
>  /*
> - * This check purposefully doesn't use cpu_has_nx because
> + * These checks purposefully doesn't use cpu_has_nx because
>   * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> - * with CONFIG_REQUIRE_NX
> + * with CONFIG_REQUIRE_NX.
> + *
> + * If NX isn't available, it might be hidden.  Try to reactivate it.
>   */
> +if ( !boot_cpu_has(X86_FEATURE_NX) &&
> + boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> + intel_unlock_nx() )
> +caps[FEATURESET_e1d] = cpuid_edx(0x80000001U);
> +
>  if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
>   !boot_cpu_has(X86_FEATURE_NX) )
>  blexit(L"This build of Xen requires NX support");

Cheers,
Alejandro



Re: [PATCH 1/2] x86/efi: Simplify efi_arch_cpu() a little

2024-07-23 Thread Alejandro Vallejo
On Mon Jul 22, 2024 at 11:18 AM BST, Andrew Cooper wrote:
> Make the "no extended leaves" case fatal and remove one level of indentation.
> Defer the max-leaf aquisition until it is first used.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Daniel P. Smith 
> CC: Marek Marczykowski-Górecki 
> CC: Alejandro Vallejo 
> CC: Gene Bright 
> ---
>  xen/arch/x86/efi/efi-boot.h | 31 ---
>  1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
> index f282358435f1..4e4be7174751 100644
> --- a/xen/arch/x86/efi/efi-boot.h
> +++ b/xen/arch/x86/efi/efi-boot.h
> @@ -738,29 +738,30 @@ static void __init efi_arch_handle_module(const struct 
> file *file,
>  
>  static void __init efi_arch_cpu(void)
>  {
> -uint32_t eax = cpuid_eax(0x80000000U);
> +uint32_t eax;
>  uint32_t *caps = boot_cpu_data.x86_capability;
>  
>  boot_tsc_stamp = rdtsc();
>  
>  caps[FEATURESET_1c] = cpuid_ecx(1);
>  
> -if ( (eax >> 16) == 0x8000 && eax > 0x80000000U )
> -{
> -caps[FEATURESET_e1d] = cpuid_edx(0x80000001U);
> +eax = cpuid_eax(0x80000000U);

Why this movement?

> +if ( (eax >> 16) != 0x8000 || eax < 0x80000000U )
> +blexit(L"In 64bit mode, but no extended CPUID leaves?!?");

I'm not sure about the condition even for the old code. If eax had 0x90000000
(because new convention appeared 10y in the future), then there would be
extended leaves but we would be needlessly bailing out. Why not simply check
that eax < 0x80000001 in here?

>  
> -/*
> - * This check purposefully doesn't use cpu_has_nx because
> - * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> - * with CONFIG_REQUIRE_NX
> - */
> -if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
> - !boot_cpu_has(X86_FEATURE_NX) )
> -blexit(L"This build of Xen requires NX support");
> +caps[FEATURESET_e1d] = cpuid_edx(0x80000001U);
>  
> -if ( cpu_has_nx )
> -trampoline_efer |= EFER_NXE;
> -}
> +/*
> + * This check purposefully doesn't use cpu_has_nx because
> + * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> + * with CONFIG_REQUIRE_NX
> + */
> +if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
> + !boot_cpu_has(X86_FEATURE_NX) )
> +blexit(L"This build of Xen requires NX support");
> +
> +if ( cpu_has_nx )
> +trampoline_efer |= EFER_NXE;
>  }
>  
>  static void __init efi_arch_blexit(void)

Cheers,
Alejandro



Re: [PATCH for-4.19?] x86/IOMMU: Move allocation in iommu_identity_mapping

2024-07-18 Thread Alejandro Vallejo
On Wed Jul 17, 2024 at 4:51 PM BST, Teddy Astie wrote:
> If for some reason, xmalloc fails after having mapped the
> reserved regions, a error is reported, but the regions are
> actually mapped in p2m.
>
> Move the allocation before trying to map the regions, in
> case the allocation fails, no mapping is actually done
> which could allows this operation to be retried with the
> same regions without failing due to already existing mappings.
>
> Fixes: c0e19d7c6c ("IOMMU: generalize VT-d's tracking of mapped RMRR regions")
> Signed-off-by: Teddy Astie 
> ---
>  xen/drivers/passthrough/x86/iommu.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/xen/drivers/passthrough/x86/iommu.c 
> b/xen/drivers/passthrough/x86/iommu.c
> index cc0062b027..b6bc6d71cb 100644
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -267,18 +267,22 @@ int iommu_identity_mapping(struct domain *d, 
> p2m_access_t p2ma,
>  if ( p2ma == p2m_access_x )
>  return -ENOENT;
>  
> +map = xmalloc(struct identity_map);
> +if ( !map )
> +return -ENOMEM;
> +
>  while ( base_pfn < end_pfn )
>  {
>  int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);
>  
>  if ( err )
> +{
> +xfree(map);
>  return err;
> +}
>  base_pfn++;
>  }
>  
> -map = xmalloc(struct identity_map);
> -if ( !map )
> -return -ENOMEM;
>  map->base = base;
>  map->end = end;
>  map->access = p2ma;

That covers the case where xmalloc() fails, but what about the case where
set_identity_p2m_entry() fails for a middle pfn (e.g. due to ENOMEM)? The
mappings established up to that point would be left in place.
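
Something along these lines is what I'd expect (a sketch, assuming
clear_identity_p2m_entry() is the right teardown primitive and that base_pfn
started out as PFN_DOWN(base)):

```c
while ( base_pfn < end_pfn )
{
    int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);

    if ( err )
    {
        /* Unwind the mappings established so far before bailing. */
        while ( base_pfn-- > PFN_DOWN(base) )
            clear_identity_p2m_entry(d, base_pfn);
        xfree(map);
        return err;
    }
    base_pfn++;
}
```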

Cheers,
Alejandro



Re: [PATCH v2 2/2] Add scripts/oss-fuzz/build.sh

2024-07-18 Thread Alejandro Vallejo
On Tue Jun 25, 2024 at 11:47 PM BST, Tamas K Lengyel wrote:
> The build integration script for oss-fuzz targets. Future fuzzing targets can
> be added to this script and those targets will be automatically picked up by
> oss-fuzz without having to open separate PRs on the oss-fuzz repo.
>
> Signed-off-by: Tamas K Lengyel 
> ---
>  scripts/oss-fuzz/build.sh | 23 +++
>  1 file changed, 23 insertions(+)
>  create mode 100755 scripts/oss-fuzz/build.sh
>
> diff --git a/scripts/oss-fuzz/build.sh b/scripts/oss-fuzz/build.sh
> new file mode 100755
> index 00..2cfd72adf1
> --- /dev/null
> +++ b/scripts/oss-fuzz/build.sh
> @@ -0,0 +1,23 @@
> +#!/bin/bash -eu

The shebang probably wants to be "/usr/bin/env bash" to account for systems
that don't have bash specifically there.

With that "-eu" would need to move down a line to be "set -eu"
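
i.e.:

```sh
#!/usr/bin/env bash
# Equivalent to the previous `bash -eu`, without hardcoding bash's location.
set -eu
```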

> +# SPDX-License-Identifier: Apache-2.0
> +# Copyright 2024 Google LLC
> +#
> +# Licensed under the Apache License, Version 2.0 (the "License");
> +# you may not use this file except in compliance with the License.
> +# You may obtain a copy of the License at
> +#
> +#  http://www.apache.org/licenses/LICENSE-2.0
> +#
> +# Unless required by applicable law or agreed to in writing, software
> +# distributed under the License is distributed on an "AS IS" BASIS,
> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> +# See the License for the specific language governing permissions and
> +# limitations under the License.
> +#
> +
> +
> +cd xen
> +./configure --disable-stubdom --disable-pvshim --disable-docs --disable-xen
> +make clang=y -C tools/include
> +make clang=y -C tools/fuzz/x86_instruction_emulator libfuzzer-harness
> +cp tools/fuzz/x86_instruction_emulator/libfuzzer-harness 
> $OUT/x86_instruction_emulator

Cheers,
Alejandro



Re: [PATCH for-4.20 4/4] x86/fpu: Split fpu_setup_fpu() in two

2024-07-18 Thread Alejandro Vallejo
On Thu Jul 18, 2024 at 1:19 PM BST, Jan Beulich wrote:
> On 09.07.2024 17:52, Alejandro Vallejo wrote:
> > It's doing too many things at once and there's no clear way of defining what
> > it's meant to do. This patch splits the function in two.
> > 
> >   1. A reset function, parameterized by the FCW value. FCW_RESET means to 
> > reset
> >  the state to power-on reset values, while FCW_DEFAULT means to reset 
> > to the
> >  default values present during vCPU creation.
> >   2. A x87/SSE state loader (equivalent to the old function when it took a 
> > data
> >  pointer).
> > 
> > Signed-off-by: Alejandro Vallejo 
> > ---
> > I'm still not sure what the old function tries to do. The state we start 
> > vCPUs
> > in is _similar_ to the after-finit, but it's not quite (`ftw` is not -1). I 
> > went
> > for the "let's not deviate too much from previous behaviour", but maybe we 
> > did
> > intend for vCPUs to start as if `finit` had just been executed?
>
> A relevant aspect here may be that what FXSR and XSAVE area have is only an
> abridged form of the tag word, being only 8 bits in size. 0x00 there is
> equivalent to FTW=0xffff (all st() empty). That's not quite correct for
> the reset case indeed, where FTW=0x5555 (i.e. all st() zero, requiring

I missed the tag being abridged. That makes a lot of sense, thanks.

> the abridged form to hold 0xff instead). While no-one has reported issues
> there so far, I think it wouldn't be inappropriate to correct this.

Ack, I'll add it on v2.
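
My guess at the shape of that change, against the reset path introduced in
patch 4 (the ternary is a sketch, not the final v2; the 0xff encoding follows
from the abridged tag word meaning "all registers non-empty"):

```c
*v->arch.xsave_area = (struct xsave_struct) {
    .fpu_sse = {
        .mxcsr = MXCSR_DEFAULT,
        .fcw = fcw,
        /*
         * Power-on reset leaves all st() zero, i.e. non-empty, which the
         * abridged tag word encodes as 0xff.
         */
        .ftw = fcw == FCW_RESET ? 0xff : 0,
    },
};
```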

>
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -1162,10 +1162,17 @@ static int cf_check hvm_load_cpu_ctxt(struct domain 
> > *d, hvm_domain_context_t *h)
> >  seg.attr = ctxt.ldtr_arbytes;
> >  hvm_set_segment_register(v, x86_seg_ldtr, &seg);
> >  
> > -/* Cover xsave-absent save file restoration on xsave-capable host. */
> > -vcpu_setup_fpu(v, xsave_enabled(v) ? NULL : v->arch.xsave_area,
> > -   ctxt.flags & XEN_X86_FPU_INITIALISED ? ctxt.fpu_regs : 
> > NULL,
> > -   FCW_RESET);
> > +/*
> > + * On Xen 4.1 and later the FPU state is restored on a later HVM 
> > context, so
> > + * what we're doing here is initialising the FPU state for guests from 
> > even
> > + * older versions of Xen. In general such guests only use legacy 
> > x87/SSE
> > + * state, and if they did use XSAVE then our best-effort strategy is 
> > to make
> > + * an XSAVE header for x87 and SSE hoping that's good enough.
> > + */
> > +if ( ctxt.flags & XEN_X86_FPU_INITIALISED )
> > +vcpu_setup_fpu(v, &ctxt.fpu_regs);
> > +else
> > +vcpu_reset_fpu(v, FCW_RESET);
>
> I'm struggling with the use of "later" in the comment. What exactly is that
> meant to express? Fundamentally the XSAVE data is fully backwards compatible
> with the FXSR one, I think, so the mentioning of "best-effort" isn't quite
> clear to me either.

I meant that the XSAVE state (including FPU/SSE state) is passed not in the HVM
context struct being processed _here_, but in another one that will arrive
later in the stream. There are 3 interesting cases regarding extended states:

  1. If there is an XSAVE context later in the stream, what we do here for the
     FPU doesn't matter because it'll be overridden later. That's fine.
  2. If there isn't and the guest didn't use extended states, it's still fine
     because we have all the information we need here.
  3. If there isn't but the guest DID use extended states (could've happened
     prior to Xen 4.1), then we're in a pickle because we have to make up
     non-existing state. This is what I meant by best effort.

Seeing how you got confused the comment probably needs to be rewritten to
better reflect this.

>
> > --- a/xen/arch/x86/i387.c
> > +++ b/xen/arch/x86/i387.c
> > @@ -310,41 +310,25 @@ int vcpu_init_fpu(struct vcpu *v)
> >  return xstate_alloc_save_area(v);
> >  }
> >  
> > -void vcpu_setup_fpu(struct vcpu *v, struct xsave_struct *xsave_area,
> > -const void *data, unsigned int fcw_default)
> > +void vcpu_reset_fpu(struct vcpu *v, uint16_t fcw)
> >  {
> > -fpusse_t *fpu_sse = &v->arch.xsave_area->fpu_sse;
> > -
> > -ASSERT(!xsave_area || xsave_area == v->arch.xsave_area);
> > -
> > -v->fpu_initialised = !!data;
> > -
> > -if ( data )
> >

Re: [PATCH for-4.20 3/4] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-07-18 Thread Alejandro Vallejo
On Thu Jul 18, 2024 at 12:49 PM BST, Jan Beulich wrote:
> On 09.07.2024 17:52, Alejandro Vallejo wrote:
> > --- a/xen/arch/x86/domctl.c
> > +++ b/xen/arch/x86/domctl.c
> > @@ -1343,7 +1343,8 @@ void arch_get_info_guest(struct vcpu *v, 
> > vcpu_guest_context_u c)
> >  #define c(fld) (c.nat->fld)
> >  #endif
> >  
> > -memcpy(&c.nat->fpu_ctxt, v->arch.fpu_ctxt, sizeof(c.nat->fpu_ctxt));
> > +memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
> > +   sizeof(c.nat->fpu_ctxt));
>
> Now that the middle argument has proper type, maybe take the opportunity
> and add BUILD_BUG_ON(sizeof(...) == sizeof(...))? (Also in e.g.
> hvm_save_cpu_ctxt() then.)

Sure.

>
> > --- a/xen/arch/x86/include/asm/domain.h
> > +++ b/xen/arch/x86/include/asm/domain.h
> > @@ -591,12 +591,7 @@ struct pv_vcpu
> >  
> >  struct arch_vcpu
> >  {
> > -/*
> > - * guest context (mirroring struct vcpu_guest_context) common
> > - * between pv and hvm guests
> > - */
> > -
> > -void  *fpu_ctxt;
> > +/* Fixed point registers */
> >  struct cpu_user_regs user_regs;
>
> Not exactly, no. Selector registers are there as well for example, which
> I wouldn't consider "fixed point" ones. I wonder why the existing comment
> cannot simply be kept, perhaps extended to mention that fpu_ctxt now lives
> elsewhere.

Would you prefer "general purpose registers"? It's not quite that either, but
it's arguably closer. I can part with the comment altogether but I'd rather
leave a token amount of information to say "non-FPU register state" (but not
that, because that would be a terrible description). 

I'd rather update it to something that better reflects reality, as I found it
quite misleading when reading through. I initially thought it may have been
related to struct layout (as in C-style single-level inheritance), but as it
turns out it's merely establishing a vague relationship between arch_vcpu and
vcpu_guest_context. I can believe once upon a time the relationship was closer
than it is now, but with the guest context missing AVX state, MSR state and
other bits and pieces, I thought it better to avoid such confusion for future
navigators down the line, so I limited its description to the line below.

>
> > --- a/xen/arch/x86/x86_emulate/blk.c
> > +++ b/xen/arch/x86/x86_emulate/blk.c
> > @@ -11,7 +11,8 @@
> >  !defined(X86EMUL_NO_SIMD)
> >  # ifdef __XEN__
> >  #  include 
> > -#  define FXSAVE_AREA current->arch.fpu_ctxt
> > +#  define FXSAVE_AREA ((struct x86_fxsr *) \
> > +   (void*)¤t->arch.xsave_area->fpu_sse)
>
> Nit: Blank missing after before *.

Heh, took me a while looking at x86_fxsr to realise you meant the void pointer.

Ack.

>
> > --- a/xen/arch/x86/xstate.c
> > +++ b/xen/arch/x86/xstate.c
> > @@ -507,9 +507,16 @@ int xstate_alloc_save_area(struct vcpu *v)
> >  unsigned int size;
> >  
> >  if ( !cpu_has_xsave )
> > -return 0;
> > -
> > -if ( !is_idle_vcpu(v) || !cpu_has_xsavec )
> > +{
> > +/*
> > + * This is bigger than FXSAVE_SIZE by 64 bytes, but it helps 
> > treating
> > + * the FPU state uniformly as an XSAVE buffer even if XSAVE is not
> > + * available in the host. Note the alignment restriction of the 
> > XSAVE
> > + * area are stricter than those of the FXSAVE area.
> > + */
> > +size = XSTATE_AREA_MIN_SIZE;
>
> What exactly would break if just (a little over) 512 bytes worth were 
> allocated
> when there's no XSAVE? If it was exactly 512, something like xstate_all() 
> would
> need to apply a little more care, I guess. Yet for that having just 
> always-zero
> xstate_bv and xcomp_bv there would already suffice (e.g. using
> offsetof(..., xsave_hdr.reserved) here, to cover further fields gaining 
> meaning
> down the road). Remember that due to xmalloc() overhead and the 
> 64-byte-aligned
> requirement, you can only have 6 of them in a page the way you do it, when the
> alternative way 7 would fit (if I got my math right).
>
> Jan

I'm slightly confused.

XSTATE_AREA_MIN_SIZE is already 512 + 64 to account for the XSAVE header,
including its reserved fields. Did you mean something else?

#define XSAVE_HDR_SIZE    64
#define XSAVE_SSE_OFFSET  160
#define XSTATE_YMM_SIZE   256
#define FXSAVE_SIZE   512
#define XSAVE_HDR_OFFSET  FXSAVE_SIZE
#define XSTATE_AREA_MIN_SIZE  (FXSAVE_SIZE + XSAVE_HDR_SIZE)

Part of the rationale is to simplify other bits of code that are currently
conditionalized on v->arch.xsave_area being NULL. And for that the full xsave
header must be present (even if unused because !cpu_has_xsave).

Do you mean something else?
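
For reference, my reading of the allocation math (a sketch, assuming a 16-byte
xmalloc() header plus padding up to the 64-byte alignment):

```c
/*
 * XSTATE_AREA_MIN_SIZE = FXSAVE_SIZE + XSAVE_HDR_SIZE = 512 + 64 = 576
 *
 *   576 + header, rounded up to 64 -> 640 bytes/object -> 6 per 4k page
 *   512 + header, rounded up to 64 -> 576 bytes/object -> 7 per 4k page
 */
```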

Cheers,
Alejandro



Re: [PATCH for-4.20 2/4] x86/fpu: Create a typedef for the x87/SSE area inside "struct xsave_struct"

2024-07-18 Thread Alejandro Vallejo
On Thu Jul 18, 2024 at 12:23 PM BST, Jan Beulich wrote:
> On 09.07.2024 17:52, Alejandro Vallejo wrote:
> > Making the union non-anonymous causes a lot of headaches,
>
> Maybe better "would cause", as that's not what you're doing here?

Yes, sounds better.

>
> > because a lot of code
> > relies on it being so, but it's possible to make a typedef of the anonymous
> > union so all callsites currently relying on typeof() can stop doing so 
> > directly.
> > 
> > This commit creates a `fpusse_t` typedef to the anonymous union at the head 
> > of
> > the XSAVE area and uses it instead of typeof().
> > 
> > No functional change.
> > 
> > Signed-off-by: Alejandro Vallejo 
>
> Acked-by: Jan Beulich 

Thanks

Alejandro



Re: [PATCH V3 (resend) 04/19] x86: Lift mapcache variable to the arch level

2024-07-17 Thread Alejandro Vallejo
On Tue Jul 16, 2024 at 6:06 PM BST, Alejandro Vallejo wrote:
> On Mon May 13, 2024 at 2:40 PM BST, Elias El Yandouzi wrote:
> > From: Wei Liu 
> >
> > It is going to be needed by HVM and idle domain as well, because without
> > the direct map, both need a mapcache to map pages.
> >
> > This commit lifts the mapcache variable up and initialise it a bit earlier
> > for PV and HVM domains.
> >
> > Signed-off-by: Wei Liu 
> > Signed-off-by: Wei Wang 
> > Signed-off-by: Hongyan Xia 
> > Signed-off-by: Julien Grall 
> >
> > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > index 20e83cf38b..507d704f16 100644
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -851,6 +851,8 @@ int arch_domain_create(struct domain *d,
> >  
> >  psr_domain_init(d);
> >  
> > +mapcache_domain_init(d);
> > +
>
> I think this is missing free_perdomain_mappings() in the error case. (error
> handling is already committed).
>
> Can't the callee jump to a "fail" label and do free_perdomain_mappings()
> internally?
>
> Cheers,
> Alejandro

Bah, ignore this. They are freed in the "fail" label at the end.

Cheers,
Alejandro



Re: [PATCH V3 (resend) 04/19] x86: Lift mapcache variable to the arch level

2024-07-16 Thread Alejandro Vallejo
On Mon May 13, 2024 at 2:40 PM BST, Elias El Yandouzi wrote:
> From: Wei Liu 
>
> It is going to be needed by HVM and idle domain as well, because without
> the direct map, both need a mapcache to map pages.
>
> This commit lifts the mapcache variable up and initialise it a bit earlier
> for PV and HVM domains.
>
> Signed-off-by: Wei Liu 
> Signed-off-by: Wei Wang 
> Signed-off-by: Hongyan Xia 
> Signed-off-by: Julien Grall 
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 20e83cf38b..507d704f16 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -851,6 +851,8 @@ int arch_domain_create(struct domain *d,
>  
>  psr_domain_init(d);
>  
> +mapcache_domain_init(d);
> +

I think this is missing free_perdomain_mappings() in the error case. (error
handling is already committed).

Can't the callee jump to a "fail" label and do free_perdomain_mappings()
internally?

Cheers,
Alejandro



Re: [RFC XEN PATCH v3 5/5] xen/public: Introduce PV-IOMMU hypercall interface

2024-07-11 Thread Alejandro Vallejo
On Thu Jul 11, 2024 at 3:04 PM BST, Teddy Astie wrote:
> Introduce a new pv interface to manage the underlying IOMMU and manage 
> contexts
> and devices. This interface allows creation of new contexts from Dom0 and
> addition of IOMMU mappings using guest PoV.
>
> This interface doesn't allow creation of mapping to other domains.
>
> Signed-off-by Teddy Astie 
> ---
> Changed in V2:
> * formatting
>
> Changed in V3:
> * prevent IOMMU operations on dying contexts
> ---
>  xen/common/Makefile   |   1 +
>  xen/common/pv-iommu.c | 328 ++
>  xen/include/hypercall-defs.c  |   6 +
>  xen/include/public/pv-iommu.h | 114 
>  xen/include/public/xen.h  |   1 +
>  5 files changed, 450 insertions(+)
>  create mode 100644 xen/common/pv-iommu.c
>  create mode 100644 xen/include/public/pv-iommu.h
>
> diff --git a/xen/common/Makefile b/xen/common/Makefile
> index f12a474d40..52ada89888 100644
> --- a/xen/common/Makefile
> +++ b/xen/common/Makefile
> @@ -58,6 +58,7 @@ obj-y += wait.o
>  obj-bin-y += warning.init.o
>  obj-$(CONFIG_XENOPROF) += xenoprof.o
>  obj-y += xmalloc_tlsf.o
> +obj-y += pv-iommu.o
>  
>  obj-bin-$(CONFIG_X86) += $(foreach n,decompress bunzip2 unxz unlzma lzo 
> unlzo unlz4 unzstd earlycpio,$(n).init.o)
>  
> diff --git a/xen/common/pv-iommu.c b/xen/common/pv-iommu.c
> new file mode 100644
> index 00..a94c0f1e1a
> --- /dev/null
> +++ b/xen/common/pv-iommu.c
> @@ -0,0 +1,328 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * xen/common/pv_iommu.c
> + *
> + * PV-IOMMU hypercall interface.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define PVIOMMU_PREFIX "[PV-IOMMU] "
> +
> +#define PVIOMMU_MAX_PAGES 256 /* Move to Kconfig ? */

It probably wants to be a cmdline argument, I think.
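
e.g. something like this (a sketch; the parameter name is made up, and it
assumes Xen's usual integer_param() plumbing):

```c
static unsigned int __read_mostly pviommu_max_pages = 256;
integer_param("pv-iommu-max-pages", pviommu_max_pages);
```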

> +
> +/* Allowed masks for each sub-operation */
> +#define ALLOC_OP_FLAGS_MASK (0)
> +#define FREE_OP_FLAGS_MASK (IOMMU_TEARDOWN_REATTACH_DEFAULT)
> +
> +static int get_paged_frame(struct domain *d, gfn_t gfn, mfn_t *mfn,
> +   struct page_info **page, int readonly)
> +{
> +p2m_type_t p2mt;
> +
> +*page = get_page_from_gfn(d, gfn_x(gfn), &p2mt,
> + (readonly) ? P2M_ALLOC : P2M_UNSHARE);
> +
> +if ( !(*page) )
> +{
> +*mfn = INVALID_MFN;
> +if ( p2m_is_shared(p2mt) )
> +return -EINVAL;
> +if ( p2m_is_paging(p2mt) )
> +{
> +p2m_mem_paging_populate(d, gfn);
> +return -EIO;
> +}
> +
> +return -EPERM;

This is ambiguous with the other usage of EPERM.

> +}
> +
> +*mfn = page_to_mfn(*page);
> +
> +return 0;
> +}
> +
> +static int can_use_iommu_check(struct domain *d)
> +{
> +if ( !iommu_enabled )
> +{
> +printk(PVIOMMU_PREFIX "IOMMU is not enabled\n");
> +return 0;
> +}
> +
> +if ( !is_hardware_domain(d) )
> +{
> +printk(PVIOMMU_PREFIX "Non-hardware domain\n");
> +return 0;
> +}
> +
> +if ( !is_iommu_enabled(d) )
> +{
> +printk(PVIOMMU_PREFIX "IOMMU disabled for this domain\n");
> +return 0;
> +}
> +
> +return 1;
> +}
> +
> +static long query_cap_op(struct pv_iommu_op *op, struct domain *d)
> +{
> +op->cap.max_ctx_no = d->iommu.other_contexts.count;
> +op->cap.max_nr_pages = PVIOMMU_MAX_PAGES;
> +op->cap.max_iova_addr = (1LLU << 39) - 1; /* TODO: hardcoded 39-bits */
> +
> +return 0;
> +}
> +
> +static long alloc_context_op(struct pv_iommu_op *op, struct domain *d)
> +{
> +u16 ctx_no = 0;
> +int status = 0;
> +
> +status = iommu_context_alloc(d, &ctx_no, op->flags & 
> ALLOC_OP_FLAGS_MASK);
> +
> +if (status < 0)
> +return status;
> +
> +printk("Created context %hu\n", ctx_no);
> +
> +op->ctx_no = ctx_no;
> +return 0;
> +}
> +
> +static long free_context_op(struct pv_iommu_op *op, struct domain *d)
> +{
> +return iommu_context_free(d, op->ctx_no,
> +  IOMMU_TEARDOWN_PREEMPT | (op->flags & 
> FREE_OP_FLAGS_MASK));
> +}
> +
> +static long reattach_device_op(struct pv_iommu_op *op, struct domain *d)
> +{
> +struct physdev_pci_device dev = op->reattach_device.dev;
> +device_t *pdev;
> +
> +pdev = pci_get_pdev(d, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
> +
> +if ( !pdev )
> +return -ENOENT;
> +
> +return iommu_reattach_context(d, d, pdev, op->ctx_no);
> +}
> +
> +static long map_pages_op(struct pv_iommu_op *op, struct domain *d)
> +{
> +int ret = 0, flush_ret;
> +struct page_info *page = NULL;
> +mfn_t mfn;
> +unsigned int flags;
> +unsigned int flush_flags = 0;
> +size_t i = 0;
> +
> +if ( op->map_pages.nr_pages > PVIOMMU_MAX_PAGES )
> +return -E2BIG;
> +
> +if ( !iommu_check_context(d, op->ctx_no) )
> +return -EINVAL;
> +
> +//printk("Mapping gfn:%lx-%lx to df

Re: [RFC XEN PATCH v3 1/5] docs/designs: Add a design document for PV-IOMMU

2024-07-11 Thread Alejandro Vallejo
Disclaimer: I haven't looked at the code yet.

On Thu Jul 11, 2024 at 3:04 PM BST, Teddy Astie wrote:
> Some operating systems want to use IOMMU to implement various features (e.g
> VFIO) or DMA protection.
> This patch introduce a proposal for IOMMU paravirtualization for Dom0.
>
> Signed-off-by Teddy Astie 
> ---
>  docs/designs/pv-iommu.md | 105 +++
>  1 file changed, 105 insertions(+)
>  create mode 100644 docs/designs/pv-iommu.md
>
> diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
> new file mode 100644
> index 00..c01062a3ad
> --- /dev/null
> +++ b/docs/designs/pv-iommu.md
> @@ -0,0 +1,105 @@
> +# IOMMU paravirtualization for Dom0
> +
> +Status: Experimental
> +
> +# Background
> +
> +By default, Xen only uses the IOMMU for itself, either to make device adress
> +space coherent with guest adress space (x86 HVM/PVH) or to prevent devices
> +from doing DMA outside it's expected memory regions including the hypervisor
> +(x86 PV).

"By default...": Do you mean "currently"?

> +
> +A limitation is that guests (especially privildged ones) may want to use
> +IOMMU hardware in order to implement features such as DMA protection and
> +VFIO [1] as IOMMU functionality is not available outside of the hypervisor
> +currently.

s/privildged/privileged/

> +
> +[1] VFIO - "Virtual Function I/O" - 
> https://www.kernel.org/doc/html/latest/driver-api/vfio.html
> +
> +# Design
> +
> +The operating system may want to have access to various IOMMU features such 
> as
> +context management and DMA remapping. We can create a new hypercall that 
> allows
> +the guest to have access to a new paravirtualized IOMMU interface.
> +
> +This feature is only meant to be available for the Dom0, as DomU have some
> +emulated devices that can't be managed on Xen side and are not hardware, we
> +can't rely on the hardware IOMMU to enforce DMA remapping.

Is that the reason though? While it's true we can't mix emulated and real
devices under the same emulated PCI bus covered by an IOMMU, nothing prevents us
from stating "the IOMMU(s) configured via PV-IOMMU cover from busN to busM".

AFAIK, that already happens on systems with several IOMMUs, where they might
cover partially disjoint sets of devices. But I admit I'm no expert on this.

I can definitely see a lot of interesting use cases for a PV-IOMMU interface
exposed to domUs (it'd be a subset of that of dom0, obviously); that'd
allow them to use the IOMMU without resorting to 2-stage translation, which has
terrible IOTLB miss costs.

> +
> +This interface is exposed under the `iommu_op` hypercall.
> +
> +In addition, Xen domains are modified in order to allow existence of several
> +IOMMU context including a default one that implement default behavior (e.g
> +hardware assisted paging) and can't be modified by guest. DomU cannot have
> +contexts, and therefore act as if they only have the default domain.
> +
> +Each IOMMU context within a Xen domain is identified using a domain-specific
> +context number that is used in the Xen IOMMU subsystem and the hypercall
> +interface.
> +
> +The number of IOMMU context a domain can use is predetermined at domain 
> creation
> +and is configurable through `dom0-iommu=nb-ctx=N` xen cmdline.

nit: I think it's more typical within Xen to see "nr" rather than "nb"

> +
> +# IOMMU operations
> +
> +## Alloc context
> +
> +Create a new IOMMU context for the guest and return the context number to the
> +guest.
> +Fail if the IOMMU context limit of the guest is reached.

or -ENOMEM, I guess.

I'm guessing from this that dom0 takes care of the contexts for guests? Or are these
contexts for use within dom0 exclusively?

> +
> +A flag can be specified to create a identity mapping.
> +
> +## Free context
> +
> +Destroy a IOMMU context created previously.
> +It is not possible to free the default context.
> +
> +Reattach context devices to default context if specified by the guest.
> +
> +Fail if there is a device in the context and reattach-to-default flag is not
> +specified.
> +
> +## Reattach device
> +
> +Reattach a device to another IOMMU context (including the default one).
> +The target IOMMU context number must be valid and the context allocated.
> +
> +The guest needs to specify a PCI SBDF of a device he has access to.
> +
> +## Map/unmap page
> +
> +Map/unmap a page on a context.
> +The guest needs to specify a gfn and target dfn to map.

And an "order", I hope; to enable superpages and hugepages without having to
find out after the fact that the mappings are in fact mergeable and the leaf PTs
can go away.
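
Shape-wise, something like this (entirely hypothetical field names, just to
illustrate what an order-aware map op could look like):

```c
struct pv_iommu_map_pages {
    uint64_t gfn;       /* First guest frame to map. */
    uint64_t dfn;       /* First device frame to map it at. */
    uint32_t nr_pages;  /* Number of 2^order blocks to map. */
    uint32_t order;     /* 0 = 4k, 9 = 2M superpage, 18 = 1G, ... */
};
```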

> +
> +Refuse to create the mapping if one already exist for the same dfn.
> +
> +## Lookup page
> +
> +Get the gfn mapped by a specific dfn.
> +
> +# Implementation considerations
> +
> +## Hypercall batching
> +
> +In order to prevent unneeded hypercalls and IOMMU flushing, it is advisable 
> to
> +be able to batch some critical IOMMU operations (e.g map/unmap multiple 
> pages).

See above fo

Re: [PATCH for-4.20 v2] automation: Use a different ImageBuilder repository URL

2024-07-10 Thread Alejandro Vallejo
On Wed Jul 10, 2024 at 10:37 AM BST, Michal Orzel wrote:
> Switch to using https://gitlab.com/xen-project/imagebuilder.git which
> should be considered official ImageBuilder repo.
>
> Take the opportunity to truncate the git history when cloning using
> --depth 1.
>
> Signed-off-by: Michal Orzel 
> Reviewed-by: Stefano Stabellini 
> ---
> Changes in v2:
>  - truncate history when cloning
> ---
>  automation/scripts/qemu-smoke-dom0-arm32.sh   | 2 +-
>  automation/scripts/qemu-smoke-dom0-arm64.sh   | 2 +-
>  automation/scripts/qemu-smoke-dom0less-arm32.sh   | 2 +-
>  automation/scripts/qemu-smoke-dom0less-arm64.sh   | 2 +-
>  automation/scripts/qemu-xtf-dom0less-arm64.sh | 2 +-
>  automation/scripts/xilinx-smoke-dom0less-arm64.sh | 2 +-
>  6 files changed, 6 insertions(+), 6 deletions(-)

lgtm,

Reviewed-by: Alejandro Vallejo 



[PATCH for-4.20 0/4] x86: FPU handling cleanup

2024-07-09 Thread Alejandro Vallejo
I want to eventually reach a position in which the FPU state can be allocated
from the domheap and hidden via the same core mechanism proposed in Elias'
directmap removal series. Doing so is complicated by the presence of 2 aliased
pointers (v->arch.fpu_ctxt and v->arch.xsave_area) and the rather complicated
semantics of vcpu_setup_fpu(). This series tries to simplify the code so moving
to a "map/modify/unmap" model is more tractable.

Patches 1 and 2 are trivial refactors.

Patch 3 unifies FPU state so an XSAVE area is allocated per vCPU regardless of
whether the host supports XSAVE. The rationale is that the memory savings are
negligible and not worth the extra complexity.

Patch 4 is a non-trivial split of vcpu_setup_fpu() into 2 separate
functions. One to override x87/SSE state, and another to set a reset state.

Alejandro Vallejo (4):
  x86/xstate: Use compression check helper in xstate_all()
  x86/fpu: Create a typedef for the x87/SSE area inside "struct
xsave_struct"
  x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu
  x86/fpu: Split fpu_setup_fpu() in two

 xen/arch/x86/domain.c |  7 ++-
 xen/arch/x86/domctl.c |  3 +-
 xen/arch/x86/hvm/emulate.c|  5 +-
 xen/arch/x86/hvm/hvm.c| 22 +---
 xen/arch/x86/i387.c   | 93 ---
 xen/arch/x86/include/asm/domain.h |  7 +--
 xen/arch/x86/include/asm/i387.h   | 27 +++--
 xen/arch/x86/include/asm/xstate.h | 17 +++---
 xen/arch/x86/x86_emulate/blk.c|  3 +-
 xen/arch/x86/xstate.c | 15 +++--
 10 files changed, 90 insertions(+), 109 deletions(-)

-- 
2.34.1




[PATCH for-4.20 2/4] x86/fpu: Create a typedef for the x87/SSE area inside "struct xsave_struct"

2024-07-09 Thread Alejandro Vallejo
Making the union non-anonymous causes a lot of headaches, because a lot of code
relies on it being so, but it's possible to make a typedef of the anonymous
union so all callsites currently relying on typeof() can stop doing so directly.

This commit creates a `fpusse_t` typedef to the anonymous union at the head of
the XSAVE area and uses it instead of typeof().

No functional change.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/hvm/emulate.c| 5 ++---
 xen/arch/x86/i387.c   | 8 
 xen/arch/x86/include/asm/xstate.h | 2 ++
 xen/arch/x86/xstate.c | 2 +-
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 02e378365b40..65ee70ce67db 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2364,8 +2364,7 @@ static int cf_check hvmemul_get_fpu(
 alternative_vcall(hvm_funcs.fpu_dirty_intercept);
 else if ( type == X86EMUL_FPU_fpu )
 {
-const typeof(curr->arch.xsave_area->fpu_sse) *fpu_ctxt =
-curr->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
 
 /*
  * Latch current register state so that we can back out changes
@@ -2405,7 +2404,7 @@ static void cf_check hvmemul_put_fpu(
 
 if ( aux )
 {
-typeof(curr->arch.xsave_area->fpu_sse) *fpu_ctxt = curr->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
 bool dval = aux->dval;
 int mode = hvm_guest_x86_mode(curr);
 
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index fcdee10a6e69..89804435b659 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -39,7 +39,7 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
 /* Restore x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxrstor(struct vcpu *v)
 {
-const typeof(v->arch.xsave_area->fpu_sse) *fpu_ctxt = v->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
 
 /*
  * Some CPUs don't save/restore FDP/FIP/FOP unless an exception
@@ -152,7 +152,7 @@ static inline void fpu_xsave(struct vcpu *v)
 /* Save x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxsave(struct vcpu *v)
 {
-typeof(v->arch.xsave_area->fpu_sse) *fpu_ctxt = v->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
 
 if ( fip_width != 4 )
@@ -322,7 +322,7 @@ int vcpu_init_fpu(struct vcpu *v)
 __alignof(v->arch.xsave_area->fpu_sse));
 if ( v->arch.fpu_ctxt )
 {
-typeof(v->arch.xsave_area->fpu_sse) *fpu_sse = v->arch.fpu_ctxt;
+fpusse_t *fpu_sse = v->arch.fpu_ctxt;
 
 fpu_sse->fcw = FCW_DEFAULT;
 fpu_sse->mxcsr = MXCSR_DEFAULT;
@@ -343,7 +343,7 @@ void vcpu_setup_fpu(struct vcpu *v, struct xsave_struct 
*xsave_area,
  * accesses through both pointers alias one another, and the shorter form
  * is used here.
  */
-typeof(xsave_area->fpu_sse) *fpu_sse = v->arch.fpu_ctxt;
+fpusse_t *fpu_sse = v->arch.fpu_ctxt;
 
 ASSERT(!xsave_area || xsave_area == v->arch.xsave_area);
 
diff --git a/xen/arch/x86/include/asm/xstate.h 
b/xen/arch/x86/include/asm/xstate.h
index f0eeb13b87a4..ebeb2a3dcaf9 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -82,6 +82,8 @@ struct __attribute__((aligned (64))) xsave_struct
 char data[]; /* Variable layout states */
 };
 
+typedef typeof(((struct xsave_struct){}).fpu_sse) fpusse_t;
+
 struct xstate_bndcsr {
 uint64_t bndcfgu;
 uint64_t bndstatus;
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 68cdd8fcf021..5c4144d55e89 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -846,7 +846,7 @@ void xstate_init(struct cpuinfo_x86 *c)
 
 if ( bsp )
 {
-static typeof(current->arch.xsave_area->fpu_sse) __initdata ctxt;
+static fpusse_t __initdata ctxt;
 
 asm ( "fxsave %0" : "=m" (ctxt) );
 if ( ctxt.mxcsr_mask )
-- 
2.34.1




[PATCH for-4.20 3/4] x86/fpu: Combine fpu_ctxt and xsave_area in arch_vcpu

2024-07-09 Thread Alejandro Vallejo
fpu_ctxt is either a pointer to the legacy x87/SSE save area (used by FXSAVE) or
a pointer aliased with xsave_area that points to its fpu_sse subfield. Such
subfield is at the base and is identical in size and layout to the legacy
buffer.

This patch merges the 2 pointers in the arch_vcpu into a single XSAVE area. In
the very rare case in which the host doesn't support XSAVE all we're doing is
wasting a tiny amount of memory and trading those for a lot more simplicity in
the code.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/domctl.c |  3 ++-
 xen/arch/x86/hvm/emulate.c|  4 +--
 xen/arch/x86/hvm/hvm.c|  3 ++-
 xen/arch/x86/i387.c   | 45 +--
 xen/arch/x86/include/asm/domain.h |  7 +
 xen/arch/x86/x86_emulate/blk.c|  3 ++-
 xen/arch/x86/xstate.c | 13 ++---
 7 files changed, 25 insertions(+), 53 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 9190e11faaa3..7b04b584c540 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1343,7 +1343,8 @@ void arch_get_info_guest(struct vcpu *v, 
vcpu_guest_context_u c)
 #define c(fld) (c.nat->fld)
 #endif
 
-memcpy(&c.nat->fpu_ctxt, v->arch.fpu_ctxt, sizeof(c.nat->fpu_ctxt));
+memcpy(&c.nat->fpu_ctxt, &v->arch.xsave_area->fpu_sse,
+   sizeof(c.nat->fpu_ctxt));
 if ( is_pv_domain(d) )
 c(flags = v->arch.pv.vgc_flags & ~(VGCF_i387_valid|VGCF_in_kernel));
 else
diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 65ee70ce67db..72a8136a9bbf 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2364,7 +2364,7 @@ static int cf_check hvmemul_get_fpu(
 alternative_vcall(hvm_funcs.fpu_dirty_intercept);
 else if ( type == X86EMUL_FPU_fpu )
 {
-const fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
 
 /*
  * Latch current register state so that we can back out changes
@@ -2404,7 +2404,7 @@ static void cf_check hvmemul_put_fpu(
 
 if ( aux )
 {
-fpusse_t *fpu_ctxt = curr->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = &curr->arch.xsave_area->fpu_sse;
 bool dval = aux->dval;
 int mode = hvm_guest_x86_mode(curr);
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 7f4b627b1f5f..09b1426ee314 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -916,7 +916,8 @@ static int cf_check hvm_save_cpu_ctxt(struct vcpu *v, 
hvm_domain_context_t *h)
 
 if ( v->fpu_initialised )
 {
-memcpy(ctxt.fpu_regs, v->arch.fpu_ctxt, sizeof(ctxt.fpu_regs));
+memcpy(ctxt.fpu_regs, &v->arch.xsave_area->fpu_sse,
+   sizeof(ctxt.fpu_regs));
 ctxt.flags = XEN_X86_FPU_INITIALISED;
 }
 
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index 89804435b659..a964b84757ec 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -39,7 +39,7 @@ static inline void fpu_xrstor(struct vcpu *v, uint64_t mask)
 /* Restore x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxrstor(struct vcpu *v)
 {
-const fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
+const fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 
 /*
  * Some CPUs don't save/restore FDP/FIP/FOP unless an exception
@@ -152,7 +152,7 @@ static inline void fpu_xsave(struct vcpu *v)
 /* Save x87 FPU, MMX, SSE and SSE2 state */
 static inline void fpu_fxsave(struct vcpu *v)
 {
-fpusse_t *fpu_ctxt = v->arch.fpu_ctxt;
+fpusse_t *fpu_ctxt = &v->arch.xsave_area->fpu_sse;
 unsigned int fip_width = v->domain->arch.x87_fip_width;
 
 if ( fip_width != 4 )
@@ -219,7 +219,7 @@ void vcpu_restore_fpu_nonlazy(struct vcpu *v, bool 
need_stts)
  * above) we also need to restore full state, to prevent subsequently
  * saving state belonging to another vCPU.
  */
-if ( v->arch.fully_eager_fpu || (v->arch.xsave_area && xstate_all(v)) )
+if ( v->arch.fully_eager_fpu || xstate_all(v) )
 {
 if ( cpu_has_xsave )
 fpu_xrstor(v, XSTATE_ALL);
@@ -306,44 +306,14 @@ void save_fpu_enable(void)
 /* Initialize FPU's context save area */
 int vcpu_init_fpu(struct vcpu *v)
 {
-int rc;
-
 v->arch.fully_eager_fpu = opt_eager_fpu;
-
-if ( (rc = xstate_alloc_save_area(v)) != 0 )
-return rc;
-
-if ( v->arch.xsave_area )
-v->arch.fpu_ctxt = &v->arch.xsave_area->fpu_sse;
-else
-{
-BUILD_BUG_ON(__alignof(v->arch.xsave_area->fpu_sse) < 16);
-v->arch.fpu_ctxt = _xzalloc(sizeof(v->arch.xsave_area->fpu_sse),
-__alignof(v->arch.xsave_area->fpu_sse));
-if ( v->arch.fpu_ctxt )
-{
-  

[PATCH for-4.20 4/4] x86/fpu: Split fpu_setup_fpu() in two

2024-07-09 Thread Alejandro Vallejo
It's doing too many things at once and there's no clear way of defining what
it's meant to do. This patch splits the function in two.

  1. A reset function, parameterized by the FCW value. FCW_RESET means to reset
 the state to power-on reset values, while FCW_DEFAULT means to reset to the
 default values present during vCPU creation.
  2. A x87/SSE state loader (equivalent to the old function when it took a data
 pointer).

Signed-off-by: Alejandro Vallejo 
---
I'm still not sure what the old function tries to do. The state we start vCPUs
in is _similar_ to the after-finit, but it's not quite (`ftw` is not -1). I went
for the "let's not deviate too much from previous behaviour", but maybe we did
intend for vCPUs to start as if `finit` had just been executed?
---
 xen/arch/x86/domain.c   |  7 +++--
 xen/arch/x86/hvm/hvm.c  | 19 -
 xen/arch/x86/i387.c | 50 +++--
 xen/arch/x86/include/asm/i387.h | 27 +++---
 4 files changed, 56 insertions(+), 47 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ccadfe0c9e70..245899cc792f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1198,9 +1198,10 @@ int arch_set_info_guest(
  is_pv_64bit_domain(d) )
 v->arch.flags &= ~TF_kernel_mode;
 
-vcpu_setup_fpu(v, v->arch.xsave_area,
-   flags & VGCF_I387_VALID ? &c.nat->fpu_ctxt : NULL,
-   FCW_DEFAULT);
+if ( flags & VGCF_I387_VALID )
+vcpu_setup_fpu(v, &c.nat->fpu_ctxt);
+else
+vcpu_reset_fpu(v, FCW_DEFAULT);
 
 if ( !compat )
 {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 09b1426ee314..bedbd2a0b888 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1162,10 +1162,17 @@ static int cf_check hvm_load_cpu_ctxt(struct domain *d, 
hvm_domain_context_t *h)
 seg.attr = ctxt.ldtr_arbytes;
 hvm_set_segment_register(v, x86_seg_ldtr, &seg);
 
-/* Cover xsave-absent save file restoration on xsave-capable host. */
-vcpu_setup_fpu(v, xsave_enabled(v) ? NULL : v->arch.xsave_area,
-   ctxt.flags & XEN_X86_FPU_INITIALISED ? ctxt.fpu_regs : NULL,
-   FCW_RESET);
+/*
+ * On Xen 4.1 and later the FPU state is restored on a later HVM context, 
so
+ * what we're doing here is initialising the FPU state for guests from even
+ * older versions of Xen. In general such guests only use legacy x87/SSE
+ * state, and if they did use XSAVE then our best-effort strategy is to 
make
+ * an XSAVE header for x87 and SSE hoping that's good enough.
+ */
+if ( ctxt.flags & XEN_X86_FPU_INITIALISED )
+vcpu_setup_fpu(v, &ctxt.fpu_regs);
+else
+vcpu_reset_fpu(v, FCW_RESET);
 
 v->arch.user_regs.rax = ctxt.rax;
 v->arch.user_regs.rbx = ctxt.rbx;
@@ -4005,9 +4012,7 @@ void hvm_vcpu_reset_state(struct vcpu *v, uint16_t cs, uint16_t ip)
 v->arch.guest_table = pagetable_null();
 }
 
-if ( v->arch.xsave_area )
-v->arch.xsave_area->xsave_hdr.xstate_bv = 0;
-vcpu_setup_fpu(v, v->arch.xsave_area, NULL, FCW_RESET);
+vcpu_reset_fpu(v, FCW_RESET);
 
 arch_vcpu_regs_init(v);
 v->arch.user_regs.rip = ip;
diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index a964b84757ec..7851f1b3f6e4 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -310,41 +310,25 @@ int vcpu_init_fpu(struct vcpu *v)
 return xstate_alloc_save_area(v);
 }
 
-void vcpu_setup_fpu(struct vcpu *v, struct xsave_struct *xsave_area,
-const void *data, unsigned int fcw_default)
+void vcpu_reset_fpu(struct vcpu *v, uint16_t fcw)
 {
-fpusse_t *fpu_sse = &v->arch.xsave_area->fpu_sse;
-
-ASSERT(!xsave_area || xsave_area == v->arch.xsave_area);
-
-v->fpu_initialised = !!data;
-
-if ( data )
-{
-memcpy(fpu_sse, data, sizeof(*fpu_sse));
-if ( xsave_area )
-xsave_area->xsave_hdr.xstate_bv = XSTATE_FP_SSE;
-}
-else if ( xsave_area && fcw_default == FCW_DEFAULT )
-{
-xsave_area->xsave_hdr.xstate_bv = 0;
-fpu_sse->mxcsr = MXCSR_DEFAULT;
-}
-else
-{
-memset(fpu_sse, 0, sizeof(*fpu_sse));
-fpu_sse->fcw = fcw_default;
-fpu_sse->mxcsr = MXCSR_DEFAULT;
-if ( v->arch.xsave_area )
-{
-v->arch.xsave_area->xsave_hdr.xstate_bv &= ~XSTATE_FP_SSE;
-if ( fcw_default != FCW_DEFAULT )
-v->arch.xsave_area->xsave_hdr.xstate_bv |= X86_XCR0_X87;
-}
-}
+v->fpu_initialised = false;
+*v->arch.xsave_area = (struct xsave_struct) {
+.fpu_sse = {
+.mxcsr = MXCSR_DEFAULT,
+.fcw = fcw,
+},
+   

[PATCH for-4.20 1/4] x86/xstate: Use compression check helper in xstate_all()

2024-07-09 Thread Alejandro Vallejo
Minor refactor to make xstate_all() use a helper rather than poking directly
into the XSAVE header.

No functional change
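
For anyone wondering why the predicate exists at all, a minimal sketch (not Xen
code; the constant mirrors Xen's XSTATE_COMPACTION_ENABLED, i.e. the
architectural XCOMP_BV[63] bit):

#include <stdbool.h>
#include <stdint.h>

/* Architecturally, bit 63 of XCOMP_BV flags the compacted format. */
#define XSTATE_COMPACTION_ENABLED (1ULL << 63)

struct xsave_hdr { uint64_t xstate_bv, xcomp_bv; };

/*
 * In the standard format every component sits at a fixed architectural
 * offset; in the compacted (XSAVEC/XSAVES) format the enabled components
 * are packed back to back, so offset computations must branch on this.
 */
static inline bool area_compressed(const struct xsave_hdr *hdr)
{
    return hdr->xcomp_bv & XSTATE_COMPACTION_ENABLED;
}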

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/include/asm/xstate.h | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/include/asm/xstate.h b/xen/arch/x86/include/asm/xstate.h
index f4a8e5f814a0..f0eeb13b87a4 100644
--- a/xen/arch/x86/include/asm/xstate.h
+++ b/xen/arch/x86/include/asm/xstate.h
@@ -122,6 +122,12 @@ static inline uint64_t xgetbv(unsigned int index)
 return lo | ((uint64_t)hi << 32);
 }
 
+static inline bool __nonnull(1)
+xsave_area_compressed(const struct xsave_struct *xsave_area)
+{
+return xsave_area->xsave_hdr.xcomp_bv & XSTATE_COMPACTION_ENABLED;
+}
+
 static inline bool xstate_all(const struct vcpu *v)
 {
 /*
@@ -129,15 +135,8 @@ static inline bool xstate_all(const struct vcpu *v)
  * (in the legacy region of xsave area) are fixed, so saving
  * XSTATE_FP_SSE will not cause overwriting problem with XSAVES/XSAVEC.
  */
-return (v->arch.xsave_area->xsave_hdr.xcomp_bv &
-XSTATE_COMPACTION_ENABLED) &&
+return xsave_area_compressed(v->arch.xsave_area) &&
(v->arch.xcr0_accum & XSTATE_LAZY & ~XSTATE_FP_SSE);
 }
 
-static inline bool __nonnull(1)
-xsave_area_compressed(const struct xsave_struct *xsave_area)
-{
-return xsave_area->xsave_hdr.xcomp_bv & XSTATE_COMPACTION_ENABLED;
-}
-
 #endif /* __ASM_XSTATE_H */
-- 
2.34.1




Re: [PATCH for-4.20] automation: Use a different ImageBuilder repository URL

2024-07-09 Thread Alejandro Vallejo
On Tue Jul 9, 2024 at 1:21 PM BST, Michal Orzel wrote:
> Switch to using https://gitlab.com/xen-project/imagebuilder.git which
> should be considered official ImageBuilder repo.
>
> Signed-off-by: Michal Orzel 
> ---
>  automation/scripts/qemu-smoke-dom0-arm32.sh   | 2 +-
>  automation/scripts/qemu-smoke-dom0-arm64.sh   | 2 +-
>  automation/scripts/qemu-smoke-dom0less-arm32.sh   | 2 +-
>  automation/scripts/qemu-smoke-dom0less-arm64.sh   | 2 +-
>  automation/scripts/qemu-xtf-dom0less-arm64.sh | 2 +-
>  automation/scripts/xilinx-smoke-dom0less-arm64.sh | 2 +-
>  6 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/automation/scripts/qemu-smoke-dom0-arm32.sh b/automation/scripts/qemu-smoke-dom0-arm32.sh
> index d91648905669..5b62e3f691f1 100755
> --- a/automation/scripts/qemu-smoke-dom0-arm32.sh
> +++ b/automation/scripts/qemu-smoke-dom0-arm32.sh
> @@ -73,7 +73,7 @@ UBOOT_SOURCE="boot.source"
>  UBOOT_SCRIPT="boot.scr"' > config
>  
>  rm -rf imagebuilder
> -git clone https://gitlab.com/ViryaOS/imagebuilder
> +git clone https://gitlab.com/xen-project/imagebuilder.git

For this clone and all others:

You probably want "git clone --depth 1 <url>" to pull just the tip of the repo
and not its full history.

>  bash imagebuilder/scripts/uboot-script-gen -t tftp -d . -c config
>  
>  rm -f ${serial_log}
> diff --git a/automation/scripts/qemu-smoke-dom0-arm64.sh b/automation/scripts/qemu-smoke-dom0-arm64.sh
> index e0bb37af3610..ca59bdec1b2b 100755
> --- a/automation/scripts/qemu-smoke-dom0-arm64.sh
> +++ b/automation/scripts/qemu-smoke-dom0-arm64.sh
> @@ -87,7 +87,7 @@ LOAD_CMD="tftpb"
>  UBOOT_SOURCE="boot.source"
>  UBOOT_SCRIPT="boot.scr"' > binaries/config
>  rm -rf imagebuilder
> -git clone https://gitlab.com/ViryaOS/imagebuilder
> +git clone https://gitlab.com/xen-project/imagebuilder.git
>  bash imagebuilder/scripts/uboot-script-gen -t tftp -d binaries/ -c binaries/config
>  
>  
> diff --git a/automation/scripts/qemu-smoke-dom0less-arm32.sh b/automation/scripts/qemu-smoke-dom0less-arm32.sh
> index 1e2b939aadf7..11804cbd729f 100755
> --- a/automation/scripts/qemu-smoke-dom0less-arm32.sh
> +++ b/automation/scripts/qemu-smoke-dom0less-arm32.sh
> @@ -125,7 +125,7 @@ if [[ "${test_variant}" == "without-dom0" ]]; then
>  fi
>  
>  rm -rf imagebuilder
> -git clone https://gitlab.com/ViryaOS/imagebuilder
> +git clone https://gitlab.com/xen-project/imagebuilder.git
>  bash imagebuilder/scripts/uboot-script-gen -t tftp -d . -c config
>  
>  # Run the test
> diff --git a/automation/scripts/qemu-smoke-dom0less-arm64.sh b/automation/scripts/qemu-smoke-dom0less-arm64.sh
> index 292c38a56147..4b548d1f8e54 100755
> --- a/automation/scripts/qemu-smoke-dom0less-arm64.sh
> +++ b/automation/scripts/qemu-smoke-dom0less-arm64.sh
> @@ -198,7 +198,7 @@ NUM_CPUPOOLS=1' >> binaries/config
>  fi
>  
>  rm -rf imagebuilder
> -git clone https://gitlab.com/ViryaOS/imagebuilder
> +git clone https://gitlab.com/xen-project/imagebuilder.git
>  bash imagebuilder/scripts/uboot-script-gen -t tftp -d binaries/ -c binaries/config
>  
>  
> diff --git a/automation/scripts/qemu-xtf-dom0less-arm64.sh b/automation/scripts/qemu-xtf-dom0less-arm64.sh
> index a667e0412c92..59f926d35fb9 100755
> --- a/automation/scripts/qemu-xtf-dom0less-arm64.sh
> +++ b/automation/scripts/qemu-xtf-dom0less-arm64.sh
> @@ -45,7 +45,7 @@ UBOOT_SOURCE="boot.source"
>  UBOOT_SCRIPT="boot.scr"' > binaries/config
>  
>  rm -rf imagebuilder
> -git clone https://gitlab.com/ViryaOS/imagebuilder
> +git clone https://gitlab.com/xen-project/imagebuilder.git
>  bash imagebuilder/scripts/uboot-script-gen -t tftp -d binaries/ -c binaries/config
>  
>  # Run the test
> diff --git a/automation/scripts/xilinx-smoke-dom0less-arm64.sh b/automation/scripts/xilinx-smoke-dom0less-arm64.sh
> index 4a071c6ef148..e3f7648d5031 100755
> --- a/automation/scripts/xilinx-smoke-dom0less-arm64.sh
> +++ b/automation/scripts/xilinx-smoke-dom0less-arm64.sh
> @@ -122,7 +122,7 @@ if [[ "${test_variant}" == "gem-passthrough" ]]; then
>  fi
>  
>  rm -rf imagebuilder
> -git clone https://gitlab.com/ViryaOS/imagebuilder
> +git clone https://gitlab.com/xen-project/imagebuilder.git
>  bash imagebuilder/scripts/uboot-script-gen -t tftp -d $TFTP/ -c $TFTP/config
>  
>  # restart the board

Cheers,
Alejandro



Re: [XEN PATCH v3 09/12] x86/mm: add defensive return

2024-07-09 Thread Alejandro Vallejo
On Mon Jul 1, 2024 at 9:57 AM BST, Jan Beulich wrote:
> On 26.06.2024 11:28, Federico Serafini wrote:
> > Add defensive return statement at the end of an unreachable
> > default case. Other than improve safety, this meets the requirements
> > to deviate a violation of MISRA C Rule 16.3: "An unconditional `break'
> > statement shall terminate every switch-clause".
> > 
> > Signed-off-by: Federico Serafini 
>
> Tentatively
> Reviewed-by: Jan Beulich 
>
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -916,6 +916,7 @@ get_page_from_l1e(
> >  return 0;
> >  default:
> >  ASSERT_UNREACHABLE();
> > +return -EPERM;
> >  }
> >  }
> >  else if ( l1f & _PAGE_RW )
>
> I don't like the use of -EPERM here very much, but I understand that there's
> no really suitable errno value. I wonder though whether something far more
> "exotic" wouldn't be better in such a case, say -EBADMSG or -EADDRNOTAVAIL.
> Just to mention it: -EPERM is what failed XSM checks would typically yield,
> so from that perspective alone even switching to -EACCES might be a little
> bit better.
>

fwiw: EACCES, being typically used for interface version mismatches, would
confuse me a lot.
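
For readers following along, the two shapes under discussion, condensed into a
self-contained sketch (ASSERT_UNREACHABLE() and BUG() are real Xen macros,
stubbed here so the snippet stands alone):

#include <errno.h>

/* Stand-ins so this compiles outside the hypervisor tree. */
#define ASSERT_UNREACHABLE() ((void)0)       /* assert(0) in debug builds */
#define BUG()                __builtin_trap()

static int classify(unsigned int type)
{
    switch ( type )
    {
    case 0:
        return 0;
    default:
        /* Option A (the patch): assert in debug builds, and return a
         * defensive error code in release builds.  Which errno value to
         * use is the open question above. */
        ASSERT_UNREACHABLE();
        return -EPERM;
        /* Option B (also floated): BUG() here instead, stopping hard in
         * all builds rather than limping on with inconsistent state. */
    }
}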

> I further wonder whether, with the assertion catching an issue with the
> implementation, we shouldn't consider using BUG() here instead. Input from
> in particular the other x86 maintainers appreciated.
>
> Jan

Cheers,
Alejandro



Re: [RFC XEN PATCH] x86/cpuid: Expose max_vcpus field in HVM hypervisor leaf

2024-07-09 Thread Alejandro Vallejo
I'll pitch in, seeing as I created the GitLab ticket.

On Tue Jul 9, 2024 at 7:40 AM BST, Jan Beulich wrote:
> On 08.07.2024 17:42, Matthew Barnes wrote:
> > Currently, OVMF is hard-coded to set up a maximum of 64 vCPUs on
> > startup.
> > 
> > There are efforts to support a maximum of 128 vCPUs, which would involve
> > bumping the OVMF constant from 64 to 128.
> > 
> > However, it would be more future-proof for OVMF to access the maximum
> > number of vCPUs for a domain and set itself up appropriately at
> > run-time.
> > 
> > For OVMF to access the maximum vCPU count, Xen will have to expose this
> > property via cpuid.
>
> Why "have to"? The information is available from xenstore, isn't it?

That would create an avoidable dependency between OVMF and xenstore, precluding
xenstoreless UEFI-enabled domUs.

>
> > This patch exposes the max_vcpus field via cpuid on the HVM hypervisor
> > leaf in edx.
>
> If exposing via CPUID, why only for HVM?
>
> > --- a/xen/include/public/arch-x86/cpuid.h
> > +++ b/xen/include/public/arch-x86/cpuid.h
> > @@ -87,6 +87,7 @@
> >   * Sub-leaf 0: EAX: Features
> >   * Sub-leaf 0: EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
> >   * Sub-leaf 0: ECX: domain id (iff EAX has XEN_HVM_CPUID_DOMID_PRESENT flag)
> > + * Sub-leaf 0: EDX: max vcpus (iff EAX has XEN_HVM_CPUID_MAX_VCPUS_PRESENT flag)
> >   */
>
> Unlike EBX and ECX, the proposed value for EDX cannot be zero. I'm therefore
> not entirely convinced that we need a qualifying flag. Things would be
> different if the field was "highest possible vCPU ID", which certainly would
> be the better approach if the field wasn't occupying the entire register.
> Even with it being 32 bits, I'd still suggest switching its meaning this way.
>
> Jan

Using max_vcpu_id instead of max_vcpus is also fine, but the flag is important:
without it, it's impossible to retroactively change the meaning of EDX (i.e. to
stop advertising this datum, or to repurpose EDX altogether).

We could also reserve only the lower 16 bits of EDX rather than the whole
register, but we have plenty of sub-leaves for growth, so I'm not sure it's
worth it.
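
To make the consumer side concrete, a sketch of how firmware could probe the
proposed field. The leaf scan and the EBX/ECX meanings are the existing
documented ABI; XEN_HVM_CPUID_MAX_VCPUS_PRESENT and its bit position are
hypothetical, tracking this RFC rather than anything committed:

#include <cpuid.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: the RFC's presence flag; the bit position is a guess. */
#define XEN_HVM_CPUID_MAX_VCPUS_PRESENT (1u << 6)

static bool xen_max_vcpus(uint32_t *max_vcpus)
{
    uint32_t base, eax, ebx, ecx, edx;

    /* Hypervisor leaves live at 0x40000000 + N*0x100; scan for Xen. */
    for ( base = 0x40000000; base < 0x40010000; base += 0x100 )
    {
        char sig[13] = {0};

        __cpuid(base, eax, ebx, ecx, edx);
        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);

        if ( strcmp(sig, "XenVMMXenVMM") != 0 || eax < base + 4 )
            continue;

        /* HVM sub-leaf 0: EAX = features, EBX = vcpu id, ECX = domid,
         * and (per this proposal) EDX = max vcpus when advertised. */
        __cpuid_count(base + 4, 0, eax, ebx, ecx, edx);
        if ( eax & XEN_HVM_CPUID_MAX_VCPUS_PRESENT )
        {
            *max_vcpus = edx;
            return true;
        }
        return false;
    }
    return false;
}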

Cheers,
Alejandro


