[PATCH] lib/strtoul: fix MISRA R10.2 violation

2024-05-13 Thread Stefano Stabellini
Fix last violation of R10.2 by casting the result of toupper to plain
char. Note that we don't want to change toupper itself as it is a legacy
interface and it would cause more issues.

Signed-off-by: Stefano Stabellini 
---
I believe this is the last R10.2 violation

diff --git a/xen/lib/strtoul.c b/xen/lib/strtoul.c
index a378fe735e..345dcf9d8c 100644
--- a/xen/lib/strtoul.c
+++ b/xen/lib/strtoul.c
@@ -38,7 +38,7 @@ unsigned long simple_strtoul(
 
 while ( isxdigit(*cp) &&
 (value = isdigit(*cp) ? *cp - '0'
-  : toupper(*cp) - 'A' + 10) < base )
+  : (char)toupper(*cp) - 'A' + 10) < base )
 {
 result = result * base + value;
 cp++;



[linux-linus test] 185990: tolerable FAIL - PUSHED

2024-05-13 Thread osstest service owner
flight 185990 linux-linus real [real]
flight 185991 linux-linus real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/185990/
http://logs.test-lab.xenproject.org/osstest/logs/185991/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-armhf-armhf-libvirt 10 host-ping-check-xen fail pass in 185991-retest
 test-armhf-armhf-xl-credit1   8 xen-bootfail pass in 185991-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-check fail in 185991 like 
185986
 test-armhf-armhf-libvirt15 migrate-support-check fail in 185991 never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check fail in 185991 never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check fail in 185991 never 
pass
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185986
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185986
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185986
 test-armhf-armhf-xl   8 xen-boot fail  like 185986
 test-armhf-armhf-libvirt-vhd  8 xen-boot fail  like 185986
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185986
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185986
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-qcow214 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-qcow215 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-raw  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-raw  15 saverestore-support-checkfail   never pass

version targeted for testing:
 linux6d1346f1bcbf2724dee8af013cdab9f7b581435b
baseline version:
 linuxa38297e3fb012ddfa7ce0321a7e5a8daeb1872b6

Last test of basis   185986  2024-05-12 23:41:44 Z1 days
Testing same since   185990  2024-05-13 16:43:50 Z0 days1 attempts


People who touched revisions under test:
  "Jason J. Herne" 
  Abel Vesa 
  Adam Ford 
  Akhil R 
  Alain Volmat 
  Alexander Egorenkov 
  Alexander Gordeev 
  Alexander Stein 
  Alexander Stein  # TQMa9352LA/CA
  Alexandre Torgue 
  Alexey Minnekhanov 
  Alice Guo 
  Alim Akhtar 
  Anand Moon 
  Andre Przywara 
  Andreas Kemnade 
  Andrejs Cainikovs 
  Andrew Davis 
  André Draszik 
  Andy Yan 
  AngeloGioacchino Del Regno 
  Anton Bambura 
  Aren Moynihan 
  Arnd Bergmann 
  Arthur Demchenkov 
  

Re: [PATCH] x86/cpufreq: Rename cpuid variable/parameters to cpu

2024-05-13 Thread Stefano Stabellini
On Sat, 11 May 2024, Andrew Cooper wrote:
> Various functions have a parameter or local variable called cpuid, but this
> triggers a MISRA R5.3 violation because we also have a function called cpuid()
> which wraps the real CPUID instruction.
> 
> In all these cases, it's a Xen cpu index, which is far more commonly named
> just cpu in our code.
> 
> While adjusting these, fix a couple of other issues:
> 
>  * cpufreq_cpu_init() is on the end of a hypercall (with in-memory parameters,
>even), making EFAULT the wrong error to use.  Use EOPNOTSUPP instead.
> 
>  * check_est_cpu() is wrong to tie EIST to just Intel, and nowhere else using
>EIST makes this restriction.  Just check the feature itself, which is more
>succinctly done after being folded into its single caller.
> 
>  * In powernow_cpufreq_update(), replace an opencoded cpu_online().
> 
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Stefano Stabellini 
> CC: Nicola Vetrini 
> CC: consult...@bugseng.com 
> 
> cpu needs to stay signed for now in set_px_pminfo(), because of get_cpu_id().
> This can be cleaned up by making better use of BAD_APICID, but that's a much
> more involved change.
> ---
>  xen/arch/x86/acpi/cpu_idle.c  | 14 
>  xen/arch/x86/acpi/cpufreq/cpufreq.c   | 24 +++--
>  xen/arch/x86/acpi/cpufreq/hwp.c   |  6 ++--
>  xen/arch/x86/acpi/cpufreq/powernow.c  |  6 ++--
>  xen/drivers/cpufreq/cpufreq.c | 18 +-
>  xen/drivers/cpufreq/utility.c | 43 +++
>  xen/include/acpi/cpufreq/cpufreq.h|  6 ++--
>  xen/include/acpi/cpufreq/processor_perf.h |  8 ++---
>  xen/include/xen/pmstat.h  |  6 ++--
>  9 files changed, 57 insertions(+), 74 deletions(-)
> 
> diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
> index cfce4cc0408f..c8db1aa9913a 100644
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -1498,14 +1498,14 @@ static void amd_cpuidle_init(struct 
> acpi_processor_power *power)
>  vendor_override = -1;
>  }
>  
> -uint32_t pmstat_get_cx_nr(uint32_t cpuid)
> +uint32_t pmstat_get_cx_nr(unsigned int cpu)
>  {
> -return processor_powers[cpuid] ? processor_powers[cpuid]->count : 0;
> +return processor_powers[cpu] ? processor_powers[cpu]->count : 0;
>  }
>  
> -int pmstat_get_cx_stat(uint32_t cpuid, struct pm_cx_stat *stat)
> +int pmstat_get_cx_stat(unsigned int cpu, struct pm_cx_stat *stat)
>  {
> -struct acpi_processor_power *power = processor_powers[cpuid];
> +struct acpi_processor_power *power = processor_powers[cpu];
>  uint64_t idle_usage = 0, idle_res = 0;
>  uint64_t last_state_update_tick, current_stime, current_tick;
>  uint64_t usage[ACPI_PROCESSOR_MAX_POWER] = { 0 };
> @@ -1522,7 +1522,7 @@ int pmstat_get_cx_stat(uint32_t cpuid, struct 
> pm_cx_stat *stat)
>  return 0;
>  }
>  
> -stat->idle_time = get_cpu_idle_time(cpuid);
> +stat->idle_time = get_cpu_idle_time(cpu);
>  nr = min(stat->nr, power->count);
>  
>  /* mimic the stat when detail info hasn't been registered by dom0 */
> @@ -1572,7 +1572,7 @@ int pmstat_get_cx_stat(uint32_t cpuid, struct 
> pm_cx_stat *stat)
>  idle_res += res[i];
>  }
>  
> -get_hw_residencies(cpuid, _res);
> +get_hw_residencies(cpu, _res);
>  
>  #define PUT_xC(what, n) do { \
>  if ( stat->nr_##what >= n && \
> @@ -1613,7 +1613,7 @@ int pmstat_get_cx_stat(uint32_t cpuid, struct 
> pm_cx_stat *stat)
>  return 0;
>  }
>  
> -int pmstat_reset_cx_stat(uint32_t cpuid)
> +int pmstat_reset_cx_stat(unsigned int cpu)
>  {
>  return 0;
>  }
> diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c 
> b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> index 2b6ef99678ae..a341f2f02063 100644
> --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> @@ -55,17 +55,6 @@ struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
>  static bool __read_mostly acpi_pstate_strict;
>  boolean_param("acpi_pstate_strict", acpi_pstate_strict);
>  
> -static int check_est_cpu(unsigned int cpuid)
> -{
> -struct cpuinfo_x86 *cpu = _data[cpuid];
> -
> -if (cpu->x86_vendor != X86_VENDOR_INTEL ||
> -!cpu_has(cpu, X86_FEATURE_EIST))
> -return 0;
> -
> -return 1;
> -}
> -
>  static unsigned extract_io(u32 value, struct acpi_cpufreq_data *data)
>  {
>  struct processor_performance *perf;
> @@ -530,7 +519,7 @@ static int cf_check acpi_cpufreq_cpu_init(struct 
> cpufreq_policy *policy)
>  if (cpufreq_verbose)
>  printk("xen_pminfo: @acpi_cpufreq_cpu_init,"
> "HARDWARE addr space\n");
> -if (!check_est_cpu(cpu)) {
> +if (!cpu_has(c, X86_FEATURE_EIST)) {
>  result = -ENODEV;
>  goto err_unreg;
>  }
> @@ -690,15 +679,12 @@ static int __init cf_check 
> cpufreq_driver_late_init(void)
>  }
> 

Re: [PATCH] fix Rule 10.2 violation

2024-05-13 Thread Stefano Stabellini
On Mon, 13 May 2024, Julien Grall wrote:
> Hi Stefano,
> 
> title: Is this the only violation we have in Xen? If so, then please add the
> subsystem in the title.

The only remaining violations are about the use of the "toupper" macro.
Bugseng is recommending to add a cast to fix those or deviate toupper.


> On 11/05/2024 00:37, Stefano Stabellini wrote:
> > Change opt_conswitch to char to fix a violation of Rule 10.2.
> > 
> > Signed-off-by: Stefano Stabellini 
> > 
> > diff --git a/xen/drivers/char/console.c b/xen/drivers/char/console.c
> > index 2c363d9c1d..3a3a97bcbe 100644
> > --- a/xen/drivers/char/console.c
> > +++ b/xen/drivers/char/console.c
> > @@ -49,7 +49,7 @@ string_param("console", opt_console);
> >   /* Char 1: CTRL+ is used to switch console input between Xen and
> > DOM0 */
> >   /* Char 2: If this character is 'x', then do not auto-switch to DOM0 when
> > it */
> >   /* boots. Any other value, or omitting the char, enables
> > auto-switch */
> > -static unsigned char __read_mostly opt_conswitch[3] = "a";
> > +static char __read_mostly opt_conswitch[3] = "a";
> 
> Looking at the rest of the code, we have:
> 
> #define switch_code (opt_conswitch[0] - 'a' + 1)
> 
> Can you confirm whether this is not somehow adding a new violation?

No, this patch is to fix a violation exactly there.



Re: [RFC PATCH 2/2] xen/arm: Rework dt_unreserved_regions to avoid recursion

2024-05-13 Thread Julien Grall

Hi Luca,

On 25/04/2024 14:11, Luca Fancellu wrote:

The function dt_unreserved_regions is currently using recursion
to compute the non overlapping ranges of a passed region against
the reserved memory banks, in the spirit of removing the recursion
to improve safety and also improve the scalability of the function,
rework its code to use an iterative algorithm with the same result.

The function was taking an additional parameter 'first', but given
the rework and given that the function was always initially called
with this parameter as zero, remove the parameter and update the
codebase to reflect the change.


In general, I am in favor to remove the recursion. I have some 
questions/remarks below though.




Signed-off-by: Luca Fancellu 
---
  xen/arch/arm/include/asm/setup.h|   8 +-
  xen/arch/arm/include/asm/static-shmem.h |   7 +-
  xen/arch/arm/kernel.c   |   2 +-
  xen/arch/arm/setup.c| 133 
  4 files changed, 96 insertions(+), 54 deletions(-)

diff --git a/xen/arch/arm/include/asm/setup.h b/xen/arch/arm/include/asm/setup.h
index fc6967f9a435..24519b9ed969 100644
--- a/xen/arch/arm/include/asm/setup.h
+++ b/xen/arch/arm/include/asm/setup.h
@@ -9,7 +9,12 @@
  #define MAX_FDT_SIZE SZ_2M
  
  #define NR_MEM_BANKS 256

+
+#ifdef CONFIG_STATIC_SHM
  #define NR_SHMEM_BANKS 32
+#else
+#define NR_SHMEM_BANKS 0
+#endif
  
  #define MAX_MODULES 32 /* Current maximum useful modules */
  
@@ -215,8 +220,7 @@ void create_dom0(void);
  
  void discard_initial_modules(void);

  void fw_unreserved_regions(paddr_t s, paddr_t e,
-   void (*cb)(paddr_t ps, paddr_t pe),
-   unsigned int first);
+   void (*cb)(paddr_t ps, paddr_t pe));
  
  size_t boot_fdt_info(const void *fdt, paddr_t paddr);

  const char *boot_fdt_cmdline(const void *fdt);
diff --git a/xen/arch/arm/include/asm/static-shmem.h 
b/xen/arch/arm/include/asm/static-shmem.h
index 3b6569e5703f..1b7c7ea0e17d 100644
--- a/xen/arch/arm/include/asm/static-shmem.h
+++ b/xen/arch/arm/include/asm/static-shmem.h
@@ -7,11 +7,11 @@
  #include 
  #include 
  
-#ifdef CONFIG_STATIC_SHM

-
  /* Worst case /memory node reg element: (addrcells + sizecells) */
  #define DT_MEM_NODE_REG_RANGE_SIZE ((NR_MEM_BANKS + NR_SHMEM_BANKS) * 4)
  
+#ifdef CONFIG_STATIC_SHM

+
  int make_resv_memory_node(const struct kernel_info *kinfo, int addrcells,
int sizecells);
  
@@ -47,9 +47,6 @@ void shm_mem_node_fill_reg_range(const struct kernel_info *kinfo, __be32 *reg,
  
  #else /* !CONFIG_STATIC_SHM */
  
-/* Worst case /memory node reg element: (addrcells + sizecells) */

-#define DT_MEM_NODE_REG_RANGE_SIZE (NR_MEM_BANKS * 4)
-


I don't understand how this is related to the rest patch. Can you clarify?


  static inline int make_resv_memory_node(const struct kernel_info *kinfo,
  int addrcells, int sizecells)
  {
diff --git a/xen/arch/arm/kernel.c b/xen/arch/arm/kernel.c
index 674388fa11a2..ecbeec518754 100644
--- a/xen/arch/arm/kernel.c
+++ b/xen/arch/arm/kernel.c
@@ -247,7 +247,7 @@ static __init int kernel_decompress(struct bootmodule *mod, 
uint32_t offset)
   * Free the original kernel, update the pointers to the
   * decompressed kernel
   */
-fw_unreserved_regions(addr, addr + size, init_domheap_pages, 0);
+fw_unreserved_regions(addr, addr + size, init_domheap_pages);
  
  return 0;

  }
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index c4e5c19b11d6..d737fe56e539 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -204,55 +204,97 @@ static void __init processor_id(void)
  }
  
  static void __init dt_unreserved_regions(paddr_t s, paddr_t e,

- void (*cb)(paddr_t ps, paddr_t pe),
- unsigned int first)
+ void (*cb)(paddr_t ps, paddr_t pe))
  {
-const struct membanks *reserved_mem = bootinfo_get_reserved_mem();
-#ifdef CONFIG_STATIC_SHM
-const struct membanks *shmem = bootinfo_get_shmem();
-unsigned int offset;
-#endif
-unsigned int i;
-
  /*
- * i is the current bootmodule we are evaluating across all possible
- * kinds.
+ * The worst case scenario is to have N reserved region ovelapping the
+ * passed one, so having N+1 regions in the stack
   */
-for ( i = first; i < reserved_mem->nr_banks; i++ )
-{
-paddr_t r_s = reserved_mem->bank[i].start;
-paddr_t r_e = r_s + reserved_mem->bank[i].size;
-
-if ( s < r_e && r_s < e )
-{
-dt_unreserved_regions(r_e, e, cb, i + 1);
-dt_unreserved_regions(s, r_s, cb, i + 1);
-return;
-}
-}
-
+struct {
+paddr_t s;
+paddr_t e;
+unsigned int dict;
+unsigned int bank;
+} stack[NR_MEM_BANKS + NR_SHMEM_BANKS + 1];



Re: [RFC PATCH 1/2] xen/arm: Add DT reserve map regions to bootinfo.reserved_mem

2024-05-13 Thread Julien Grall

Hi Luca,

On 25/04/2024 14:11, Luca Fancellu wrote:

Currently the code is listing device tree reserve map regions
as reserved memory for Xen, but they are not added into
bootinfo.reserved_mem and they are fetched in multiple places
using the same code sequence, causing duplication. Fix this
by adding them to the bootinfo.reserved_mem at early stage.


Do we have enough space in bootinfo.reserved_mem for them?

The rest of the changes LGTM.

Cheers,

--
Julien Grall



Re: [PATCH] fix Rule 10.2 violation

2024-05-13 Thread Julien Grall

Hi Stefano,

title: Is this the only violation we have in Xen? If so, then please add 
the subsystem in the title.


On 11/05/2024 00:37, Stefano Stabellini wrote:

Change opt_conswitch to char to fix a violation of Rule 10.2.

Signed-off-by: Stefano Stabellini 

diff --git a/xen/drivers/char/console.c b/xen/drivers/char/console.c
index 2c363d9c1d..3a3a97bcbe 100644
--- a/xen/drivers/char/console.c
+++ b/xen/drivers/char/console.c
@@ -49,7 +49,7 @@ string_param("console", opt_console);
  /* Char 1: CTRL+ is used to switch console input between Xen and DOM0 
*/
  /* Char 2: If this character is 'x', then do not auto-switch to DOM0 when it 
*/
  /* boots. Any other value, or omitting the char, enables auto-switch 
*/
-static unsigned char __read_mostly opt_conswitch[3] = "a";
+static char __read_mostly opt_conswitch[3] = "a";


Looking at the rest of the code, we have:

#define switch_code (opt_conswitch[0] - 'a' + 1)

Can you confirm whether this is not somehow adding a new violation?

Cheers,

--
Julien Grall



Re: Serious AMD-Vi(?) issue

2024-05-13 Thread Elliott Mitchell
On Mon, May 13, 2024 at 10:44:59AM +0200, Roger Pau Monné wrote:
> On Fri, May 10, 2024 at 09:09:54PM -0700, Elliott Mitchell wrote:
> > On Thu, Apr 18, 2024 at 09:33:31PM -0700, Elliott Mitchell wrote:
> > > 
> > > I suspect this is a case of there is some step which is missing from
> > > Xen's IOMMU handling.  Perhaps something which Linux does during an early
> > > DMA setup stage, but the current Xen implementation does lazily?
> > > Alternatively some flag setting or missing step?
> > > 
> > > I should be able to do another test approach in a few weeks, but I would
> > > love if something could be found sooner.
> > 
> > Turned out to be disturbingly easy to get the first entry when it
> > happened.  Didn't even need `dbench`, it simply showed once the OS was
> > fully loaded.  I did get some additional data points.
> > 
> > Appears this requires an AMD IOMMUv2.  A test system with known
> > functioning AMD IOMMUv1 didn't display the issue at all.
> > 
> > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr fffdf800 flags 
> > 0x8 I
> 
> I would expect the address field to contain more information about the
> fault, but I'm not finding any information on the AMD-Vi specification
> apart from that it contains the DVA, which makes no sense when the
> fault is caused by an interrupt.
> 
> > (XEN) :bb:dd.f root @ 83b5f5 (3 levels) dfn=fffdf8000
> > (XEN)   L3[1f7] = 0 np
> 
> Attempting to print the page table walk for an Interrupt remapping
> fault is useless, we should likely avoid that when the I flag is set.

> > I find it surprising this required "iommu=debug" to get this level of
> > detail.  This amount of output seems more appropriate for "verbose".
> 
> "verbose" should also print this information.

Mostly I've noticed Xen's dmesg seems a bit sparse at default settings.
Confirming IOMMU was recognized and operational had been a challenge.  On
the flip side this does mean less potentially sensitive data gets in.

> > I strongly prefer to provide snippets.  There is a fair bit of output,
> > I'm unsure which portion is most pertinent.
> 
> I've already voiced my concern that I think what yo uare doing is not
> fair.  We are debugging this out of interest, and hence you refusing
> to provide all information just hampers our ability to debug, and
> makes us spend more time than required just thinking what snippets we
> need to ask for.
> 
> I will ask again, what's there in the Xen or the Linux dmesgs that you
> are so worried about leaking? Please provide an specific example.

I cannot point to specific data in Xen's dmesg which is known to be
sensitive.  On the flip side all the addresses could readily function as
a subliminal channel.

Might only be kernels from certain vendors, but hardware serial numbers
frequently make it into Linux's dmesg.  All the data coming from ACPI
tables could readily hide something.  Worse, data which seems harmless
now might later turn out to reveal things.

The usual approach is everyone has PGP keys and logs are kept private
on request.

> Why do you mask the device SBDF in the above snippet?  I would really
> like to understand what's so privacy relevant in a PCI SBDF number.

I doubt it reveals much.  Simply seems unlikely to help debugging and
therefore I prefer to mask it.  One more Xen dmesg line:

(XEN) AMD-Vi: Setup I/O page table: device id = 0xbbdd, type = 0x1, root table 
= 0xADDRADDR, domain = 0, paging mode = 3

> Does booting with `iommu=no-intremap` lead to any issues being
> reported?

I'll try that next time I restart the system.


Another viable approach.  I imagine one or more of the Xen developers
have computers with AMD processors.  I could send a pair of SATA devices
which are known to exhibit the behavior to someone.

The known reproductions have featured ASUS motherboards.  I doubt this is
a requirement, but if one of the main developers has such a system that
is a better target.  I also note these are plugged into motherboard SATA
ports.  It is possible add-on card SATA ports might not exhibit the
behavior.

Then you may discover not much log data is being provided simply due to
not much log data being generated.


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| ehem+sig...@m5p.com  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





Re: [PATCH 4.5/8] tools/hvmloader: Further simplify SMP setup

2024-05-13 Thread Alejandro Vallejo
On 09/05/2024 18:50, Andrew Cooper wrote:
> Now that we're using hypercalls to start APs, we can replace the 'ap_cpuid'
> global with a regular function parameter.  This requires telling the compiler
> that we'd like the parameter in a register rather than on the stack.
> 
> While adjusting, rename to cpu_setup().  It's always been used on the BSP,
> making the name ap_start() specifically misleading.
> 
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Alejandro Vallejo 
> 
> This is a trick I found for XTF, not that I've completed the SMP support yet.
> 
> I realise it's cheating slightly WRT 4.19, but it came out of the middle of a
> series targetted for 4.19.
> ---
>  tools/firmware/hvmloader/smp.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/firmware/hvmloader/smp.c b/tools/firmware/hvmloader/smp.c
> index 6ebf0b60faab..5d46eee1c5f4 100644
> --- a/tools/firmware/hvmloader/smp.c
> +++ b/tools/firmware/hvmloader/smp.c
> @@ -29,15 +29,15 @@
>  
>  #include 
>  
> -static int ap_callin, ap_cpuid;
> +static int ap_callin;
>  
> -static void ap_start(void)
> +static void __attribute__((regparm(1))) cpu_setup(unsigned int cpu)

I like it, but I'm not a fan of compiler attributes when there's sane
alternatives. We could pre-push the argument onto ap_stack to achieve
the same thing. As in, add a -4 offset to esp, and write "cpu" there.

  *(uint32_t*)ap.cpu_regs.x86_32.esp) = cpu;

That said, this is a solution as good as any other and it's definitely
better than a global, so take it or leave it.

With or without the proposed alternative...

Reviewed-by: Alejandro Vallejo 

>  {
> -printf(" - CPU%d ... ", ap_cpuid);
> +printf(" - CPU%d ... ", cpu);
>  cacheattr_init();
>  printf("done.\n");
>  
> -if ( !ap_cpuid ) /* Used on the BSP too */
> +if ( !cpu ) /* Used on the BSP too */
>  return;
>  
>  wmb();
> @@ -55,7 +55,6 @@ static void boot_cpu(unsigned int cpu)
>  static struct vcpu_hvm_context ap;
>  
>  /* Initialise shared variables. */
> -ap_cpuid = cpu;
>  ap_callin = 0;
>  wmb();
>  
> @@ -63,9 +62,11 @@ static void boot_cpu(unsigned int cpu)
>  ap = (struct vcpu_hvm_context) {
>  .mode = VCPU_HVM_MODE_32B,
>  .cpu_regs.x86_32 = {
> -.eip = (unsigned long)ap_start,
> +.eip = (unsigned long)cpu_setup,
>  .esp = (unsigned long)ap_stack + ARRAY_SIZE(ap_stack),
>  
> +.eax = cpu,
> +
>  /* Protected Mode, no paging. */
>  .cr0 = X86_CR0_PE,
>  
> @@ -105,7 +106,7 @@ void smp_initialise(void)
>  unsigned int i, nr_cpus = hvm_info->nr_vcpus;
>  
>  printf("Multiprocessor initialisation:\n");
> -ap_start();
> +cpu_setup(0);
>  for ( i = 1; i < nr_cpus; i++ )
>  boot_cpu(i);
>  }
> 
> base-commit: 53959cb8309919fc2f157a1c99e0af2ce280cb84




Re: [PATCH V3 (resend) 03/19] x86/pv: Rewrite how building PV dom0 handles domheap mappings

2024-05-13 Thread Roger Pau Monné
On Mon, May 13, 2024 at 01:40:30PM +, Elias El Yandouzi wrote:
> From: Hongyan Xia 
> 
> Building a PV dom0 is allocating from the domheap but uses it like the
> xenheap. Use the pages as they should be.
> 
> Signed-off-by: Hongyan Xia 
> Signed-off-by: Julien Grall 
> Signed-off-by: Elias El Yandouzi 
> 
> 
> Changes in V3:
> * Fold following patch 'x86/pv: Map L4 page table for shim domain'
> 
> Changes in V2:
> * Clarify the commit message
> * Break the patch in two parts
> 
> Changes since Hongyan's version:
> * Rebase
> * Remove spurious newline
> 
> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> index 807296c280..ac910b438a 100644
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -382,6 +382,10 @@ int __init dom0_construct_pv(struct domain *d,
>  l3_pgentry_t *l3tab = NULL, *l3start = NULL;
>  l2_pgentry_t *l2tab = NULL, *l2start = NULL;
>  l1_pgentry_t *l1tab = NULL, *l1start = NULL;
> +mfn_t l4start_mfn = INVALID_MFN;
> +mfn_t l3start_mfn = INVALID_MFN;
> +mfn_t l2start_mfn = INVALID_MFN;
> +mfn_t l1start_mfn = INVALID_MFN;
>  
>  /*
>   * This fully describes the memory layout of the initial domain. All
> @@ -710,22 +714,32 @@ int __init dom0_construct_pv(struct domain *d,
>  v->arch.pv.event_callback_cs= FLAT_COMPAT_KERNEL_CS;
>  }
>  
> +#define UNMAP_MAP_AND_ADVANCE(mfn_var, virt_var, maddr) \
> +do {\
> +unmap_domain_page(virt_var);\
> +mfn_var = maddr_to_mfn(maddr);  \
> +maddr += PAGE_SIZE; \
> +virt_var = map_domain_page(mfn_var);\

FWIW, I would do the advance after the map, so that the order matches
the name of the function.

> +} while ( false )
> +
>  if ( !compat )
>  {
>  maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l4_page_table;
> -l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
> +UNMAP_MAP_AND_ADVANCE(l4start_mfn, l4start, mpt_alloc);
> +l4tab = l4start;

You could even make the macro return virt_var, and so use it like:

l4tab = l4start = UNMAP_MAP_AND_ADVANCE(l4start_mfn, mpt_alloc);

?

Anyway, no strong opinion.

Thanks, Roger.



Re: [PATCH V3 (resend) 02/19] x86/pv: Domheap pages should be mapped while relocating initrd

2024-05-13 Thread Roger Pau Monné
On Mon, May 13, 2024 at 01:40:29PM +, Elias El Yandouzi wrote:
> From: Wei Liu 
> 
> Xen shouldn't use domheap page as if they were xenheap pages. Map and
> unmap pages accordingly.
> 
> Signed-off-by: Wei Liu 
> Signed-off-by: Wei Wang 
> Signed-off-by: Julien Grall 
> Signed-off-by: Elias El Yandouzi 
> 
> 
> Changes in V3:
> * Rename commit title
> * Rework the for loop copying the pages
> 
> Changes in V2:
> * Get rid of mfn_to_virt
> * Don't open code copy_domain_page()
> 
> Changes since Hongyan's version:
> * Add missing newline after the variable declaration
> 
> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> index d8043fa58a..807296c280 100644
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -618,18 +618,24 @@ int __init dom0_construct_pv(struct domain *d,
>  if ( d->arch.physaddr_bitsize &&
>   ((mfn + count - 1) >> (d->arch.physaddr_bitsize - PAGE_SHIFT)) )
>  {
> +unsigned int nr_pages = 1UL << order;

Shouldn't you better initialize nr_pages to 'count' instead of 'order'
here?

Also note how 'order' at this point is not yet initialized to the
'count' based value, so I'm not sure from where that 'order' is coming
from.

In v2 you had:

+unsigned long nr_pages;
+
 order = get_order_from_pages(count);
 page = alloc_domheap_pages(d, order, MEMF_no_scrub);
 if ( !page )
 panic("Not enough RAM for domain 0 initrd\n");
+
+nr_pages = 1UL << order;

nr_pages was derived from the 'order' value based on 'count'.  As said
above, I think you want to use just 'count' here, which is the rounded
up value of pages needed by initrd_len.

Thanks, Roger.



Re: [PATCH V3 01/19] x86: Create per-domain mapping of guest_root_pt

2024-05-13 Thread Roger Pau Monné
On Mon, May 13, 2024 at 11:10:59AM +, Elias El Yandouzi wrote:
> From: Hongyan Xia 
> 
> Create a per-domain mapping of PV guest_root_pt as direct map is being
> removed.
> 
> Note that we do not map and unmap root_pgt for now since it is still a
> xenheap page.

I'm afraid this needs more context, at least for me to properly
understand.

I think I've figured out what create_perdomain_mapping() is supposed
to do, and I have to admit the interface is very awkward.

Anyway, attempted to provide some feedback.

> 
> Signed-off-by: Hongyan Xia 
> Signed-off-by: Julien Grall 
> Signed-off-by: Elias El Yandouzi 
> 
> 
> Changes in V3:
> * Rename SHADOW_ROOT
> * Haven't addressed the potentially over-allocation issue as I don't 
> get it
> 
> Changes in V2:
> * Rework the shadow perdomain mapping solution in the follow-up 
> patches
> 
> Changes since Hongyan's version:
> * Remove the final dot in the commit title
> 
> diff --git a/xen/arch/x86/include/asm/config.h 
> b/xen/arch/x86/include/asm/config.h
> index ab7288cb36..5d710384df 100644
> --- a/xen/arch/x86/include/asm/config.h
> +++ b/xen/arch/x86/include/asm/config.h
> @@ -203,7 +203,7 @@ extern unsigned char boot_edid_info[128];
>  /* Slot 260: per-domain mappings (including map cache). */
>  #define PERDOMAIN_VIRT_START(PML4_ADDR(260))
>  #define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
> -#define PERDOMAIN_SLOTS 3
> +#define PERDOMAIN_SLOTS 4
>  #define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
>   (PERDOMAIN_SLOT_MBYTES << 20))
>  /* Slot 4: mirror of per-domain mappings (for compat xlat area accesses). */
> @@ -317,6 +317,14 @@ extern unsigned long xen_phys_start;
>  #define ARG_XLAT_START(v)\
>  (ARG_XLAT_VIRT_START + ((v)->vcpu_id << ARG_XLAT_VA_SHIFT))
>  
> +/* pv_root_pt mapping area. The fourth per-domain-mapping sub-area */
> +#define PV_ROOT_PT_MAPPING_VIRT_START   PERDOMAIN_VIRT_SLOT(3)
> +#define PV_ROOT_PT_MAPPING_ENTRIES  MAX_VIRT_CPUS
> +
> +/* The address of a particular VCPU's PV_ROOT_PT */
> +#define PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v) \
> +(PV_ROOT_PT_MAPPING_VIRT_START + ((v)->vcpu_id * PAGE_SIZE))

I know we are not there yet, but I wonder if we need to start having
some non-shared per-cpu mapping area in the page-tables.  Right now
this is shared between all the vCPUs, as it's per-domain space
(instead of per-vCPU).

> +
>  #define ELFSIZE 64
>  
>  #define ARCH_CRASH_SAVE_VMCOREINFO
> diff --git a/xen/arch/x86/include/asm/domain.h 
> b/xen/arch/x86/include/asm/domain.h
> index f5daeb182b..8a97530607 100644
> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -272,6 +272,7 @@ struct time_scale {
>  struct pv_domain
>  {
>  l1_pgentry_t **gdt_ldt_l1tab;
> +l1_pgentry_t **root_pt_l1tab;
>  
>  atomic_t nr_l4_pages;
>  
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index d968bbbc73..efdf20f775 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -505,6 +505,13 @@ void share_xen_page_with_guest(struct page_info *page, 
> struct domain *d,
>  nrspin_unlock(>page_alloc_lock);
>  }
>  
> +#define pv_root_pt_idx(v) \
> +((v)->vcpu_id >> PAGETABLE_ORDER)
> +
> +#define pv_root_pt_pte(v) \
> +((v)->domain->arch.pv.root_pt_l1tab[pv_root_pt_idx(v)] + \
> + ((v)->vcpu_id & (L1_PAGETABLE_ENTRIES - 1)))
> +
>  void make_cr3(struct vcpu *v, mfn_t mfn)
>  {
>  struct domain *d = v->domain;
> @@ -524,6 +531,13 @@ void write_ptbase(struct vcpu *v)
>  
>  if ( is_pv_vcpu(v) && v->domain->arch.pv.xpti )
>  {
> +mfn_t guest_root_pt = _mfn(MASK_EXTR(v->arch.cr3, PAGE_MASK));
> +l1_pgentry_t *pte = pv_root_pt_pte(v);
> +
> +ASSERT(v == current);
> +
> +l1e_write(pte, l1e_from_mfn(guest_root_pt, __PAGE_HYPERVISOR_RO));
> +
>  cpu_info->root_pgt_changed = true;
>  cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
>  if ( new_cr4 & X86_CR4_PCIDE )
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index 2a445bb17b..1b025986f7 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -288,6 +288,21 @@ static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
>1U << GDT_LDT_VCPU_SHIFT);
>  }
>  
> +static int pv_create_root_pt_l1tab(struct vcpu *v)
> +{
> +return create_perdomain_mapping(v->domain,
> +PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v),
> +1, v->domain->arch.pv.root_pt_l1tab,
> +NULL);
> +}
> +
> +static void pv_destroy_root_pt_l1tab(struct vcpu *v)

The two 'v' parameters could be const here.

> +
> +{
> +destroy_perdomain_mapping(v->domain,
> +  PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v), 1);
> +}
> +
>  void pv_vcpu_destroy(struct vcpu *v)
>  {

[PATCH V3 (resend) 17/19] xen/arm64: mm: Use per-pCPU page-tables

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

At the moment, on Arm64, every pCPU is sharing the same page-tables.

In a follow-up patch, we will allow the possibility to remove the
direct map and therefore it will be necessary to have a mapcache.

While we have plenty of spare virtual address space to reserve part
for each pCPU, it means that temporary mappings (e.g. guest memory)
could be accessible by every pCPU.

In order to increase our security posture, it would be better if
those mappings are only accessible by the pCPU doing the temporary
mapping.

In addition to that, a per-pCPU page-tables opens the way to have
per-domain mapping area.

Arm32 is already using per-pCPU page-tables so most of the code
can be re-used. Arm64 doesn't yet have support for the mapcache,
so a stub is provided (moved to its own header asm/domain_page.h).

Take the opportunity to fix a typo in a comment that is modified.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changelog since v1:
* Rebase
* Fix typoes

diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index 293acb67e0..2ec1ffe1dc 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -76,6 +76,7 @@ static void __init prepare_runtime_identity_mapping(void)
 paddr_t id_addr = virt_to_maddr(_start);
 lpae_t pte;
 DECLARE_OFFSETS(id_offsets, id_addr);
+lpae_t *root = this_cpu(xen_pgtable);
 
 if ( id_offsets[0] >= IDENTITY_MAPPING_AREA_NR_L0 )
 panic("Cannot handle ID mapping above %uTB\n",
@@ -86,7 +87,7 @@ static void __init prepare_runtime_identity_mapping(void)
 pte.pt.table = 1;
 pte.pt.xn = 0;
 
-write_pte(_pgtable[id_offsets[0]], pte);
+write_pte([id_offsets[0]], pte);
 
 /* Link second ID table */
 pte = pte_of_xenaddr((vaddr_t)xen_second_id);
diff --git a/xen/arch/arm/domain_page.c b/xen/arch/arm/domain_page.c
index 3a43601623..ac2a6d0332 100644
--- a/xen/arch/arm/domain_page.c
+++ b/xen/arch/arm/domain_page.c
@@ -3,6 +3,8 @@
 #include 
 #include 
 
+#include 
+
 /* Override macros from asm/page.h to make them work with mfn_t */
 #undef virt_to_mfn
 #define virt_to_mfn(va) _mfn(__virt_to_mfn(va))
diff --git a/xen/arch/arm/include/asm/arm32/mm.h 
b/xen/arch/arm/include/asm/arm32/mm.h
index 856f2dbec4..87a315db01 100644
--- a/xen/arch/arm/include/asm/arm32/mm.h
+++ b/xen/arch/arm/include/asm/arm32/mm.h
@@ -1,12 +1,6 @@
 #ifndef __ARM_ARM32_MM_H__
 #define __ARM_ARM32_MM_H__
 
-#include 
-
-#include 
-
-DECLARE_PER_CPU(lpae_t *, xen_pgtable);
-
 /*
  * Only a limited amount of RAM, called xenheap, is always mapped on ARM32.
  * For convenience always return false.
@@ -16,8 +10,6 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, 
unsigned long nr)
 return false;
 }
 
-bool init_domheap_mappings(unsigned int cpu);
-
 static inline void arch_setup_page_tables(void)
 {
 }
diff --git a/xen/arch/arm/include/asm/domain_page.h 
b/xen/arch/arm/include/asm/domain_page.h
new file mode 100644
index 00..e9f52685e2
--- /dev/null
+++ b/xen/arch/arm/include/asm/domain_page.h
@@ -0,0 +1,13 @@
+#ifndef __ASM_ARM_DOMAIN_PAGE_H__
+#define __ASM_ARM_DOMAIN_PAGE_H__
+
+#ifdef CONFIG_ARCH_MAP_DOMAIN_PAGE
+bool init_domheap_mappings(unsigned int cpu);
+#else
+static inline bool init_domheap_mappings(unsigned int cpu)
+{
+return true;
+}
+#endif
+
+#endif /* __ASM_ARM_DOMAIN_PAGE_H__ */
diff --git a/xen/arch/arm/include/asm/mm.h b/xen/arch/arm/include/asm/mm.h
index 2bca3f9e87..60e0122cba 100644
--- a/xen/arch/arm/include/asm/mm.h
+++ b/xen/arch/arm/include/asm/mm.h
@@ -2,6 +2,9 @@
 #define __ARCH_ARM_MM__
 
 #include 
+#include 
+
+#include 
 #include 
 #include 
 #include 
diff --git a/xen/arch/arm/include/asm/mmu/mm.h 
b/xen/arch/arm/include/asm/mmu/mm.h
index c5e03a66bf..c03c3a51e4 100644
--- a/xen/arch/arm/include/asm/mmu/mm.h
+++ b/xen/arch/arm/include/asm/mmu/mm.h
@@ -2,6 +2,8 @@
 #ifndef __ARM_MMU_MM_H__
 #define __ARM_MMU_MM_H__
 
+DECLARE_PER_CPU(lpae_t *, xen_pgtable);
+
 /* Non-boot CPUs use this to find the correct pagetables. */
 extern uint64_t init_ttbr;
 
diff --git a/xen/arch/arm/mmu/pt.c b/xen/arch/arm/mmu/pt.c
index da28d669e7..1ed1a53ab1 100644
--- a/xen/arch/arm/mmu/pt.c
+++ b/xen/arch/arm/mmu/pt.c
@@ -607,9 +607,9 @@ static int xen_pt_update(unsigned long virt,
 unsigned long left = nr_mfns;
 
 /*
- * For arm32, page-tables are different on each CPUs. Yet, they share
- * some common mappings. It is assumed that only common mappings
- * will be modified with this function.
+ * Page-tables are different on each CPU. Yet, they share some common
+ * mappings. It is assumed that only common mappings will be modified
+ * with this function.
  *
  * XXX: Add a check.
  */
diff --git a/xen/arch/arm/mmu/setup.c b/xen/arch/arm/mmu/setup.c
index f4bb424c3c..7b981456e6 100644
--- a/xen/arch/arm/mmu/setup.c
+++ b/xen/arch/arm/mmu/setup.c
@@ -28,17 +28,15 @@
  * PCPUs.
  */
 
-#ifdef 

[PATCH V3 (resend) 14/19] Rename mfn_to_virt() calls

2024-05-13 Thread Elias El Yandouzi
Until directmap gets completely removed, we'd still need to
keep some calls to mfn_to_virt() for xenheap pages or when the
directmap is enabled.

Rename the macro to mfn_to_directmap_virt() to flag them and
prevent further use of mfn_to_virt().

Signed-off-by: Elias El Yandouzi 

diff --git a/xen/arch/arm/include/asm/mm.h b/xen/arch/arm/include/asm/mm.h
index 48538b5337..2bca3f9e87 100644
--- a/xen/arch/arm/include/asm/mm.h
+++ b/xen/arch/arm/include/asm/mm.h
@@ -336,6 +336,7 @@ static inline uint64_t gvirt_to_maddr(vaddr_t va, paddr_t 
*pa,
  */
 #define virt_to_mfn(va) __virt_to_mfn(va)
 #define mfn_to_virt(mfn)__mfn_to_virt(mfn)
+#define mfn_to_directmap_virt(mfn) mfn_to_virt(mfn)
 
 /* Convert between Xen-heap virtual addresses and page-info structures. */
 static inline struct page_info *virt_to_page(const void *v)
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 89caefc8a2..62d6fee0f4 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -81,14 +81,14 @@ void *map_domain_page(mfn_t mfn)
 
 #ifdef NDEBUG
 if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
-return mfn_to_virt(mfn_x(mfn));
+return mfn_to_directmap_virt(mfn_x(mfn));
 #endif
 
 v = mapcache_current_vcpu();
 if ( !v || !v->domain->arch.mapcache.inuse )
 {
 if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
-return mfn_to_virt(mfn_x(mfn));
+return mfn_to_directmap_virt(mfn_x(mfn));
 else
 {
 BUG_ON(system_state >= SYS_STATE_smp_boot);
@@ -324,7 +324,7 @@ void *map_domain_page_global(mfn_t mfn)
 
 #ifdef NDEBUG
 if ( arch_mfn_in_directmap(mfn_x(mfn, 1)) )
-return mfn_to_virt(mfn_x(mfn));
+return mfn_to_directmap_virt(mfn_x(mfn));
 #endif
 
 return vmap(, 1);
diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index b0cb96c3bc..d1482ae2f7 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -439,7 +439,7 @@ static int __init pvh_populate_p2m(struct domain *d)
  d->arch.e820[i].addr + d->arch.e820[i].size);
 enum hvm_translation_result res =
  hvm_copy_to_guest_phys(mfn_to_maddr(_mfn(addr)),
-mfn_to_virt(addr),
+mfn_to_directmap_virt(addr),
 end - d->arch.e820[i].addr,
 v);
 
@@ -725,7 +725,7 @@ static int __init pvh_load_kernel(struct domain *d, const 
module_t *image,
 
 if ( initrd != NULL )
 {
-rc = hvm_copy_to_guest_phys(last_addr, mfn_to_virt(initrd->mod_start),
+rc = hvm_copy_to_guest_phys(last_addr, 
mfn_to_directmap_virt(initrd->mod_start),
 initrd_len, v);
 if ( rc )
 {
diff --git a/xen/arch/x86/include/asm/page.h b/xen/arch/x86/include/asm/page.h
index 350d1fb110..c6891b52d4 100644
--- a/xen/arch/x86/include/asm/page.h
+++ b/xen/arch/x86/include/asm/page.h
@@ -268,7 +268,7 @@ void copy_page_sse2(void *to, const void *from);
  */
 #define mfn_valid(mfn)  __mfn_valid(mfn_x(mfn))
 #define virt_to_mfn(va) __virt_to_mfn(va)
-#define mfn_to_virt(mfn)__mfn_to_virt(mfn)
+#define mfn_to_directmap_virt(mfn)__mfn_to_virt(mfn)
 #define virt_to_maddr(va)   __virt_to_maddr((unsigned long)(va))
 #define maddr_to_virt(ma)   __maddr_to_virt((unsigned long)(ma))
 #define maddr_to_page(ma)   __maddr_to_page(ma)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index efdf20f775..337363cf17 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -318,8 +318,8 @@ void __init arch_init_memory(void)
 iostart_pfn = max_t(unsigned long, pfn, 1UL << (20 - PAGE_SHIFT));
 ioend_pfn = min(rstart_pfn, 16UL << (20 - PAGE_SHIFT));
 if ( iostart_pfn < ioend_pfn )
-destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
- (unsigned long)mfn_to_virt(ioend_pfn));
+destroy_xen_mappings((unsigned 
long)mfn_to_directmap_virt(iostart_pfn),
+ (unsigned 
long)mfn_to_directmap_virt(ioend_pfn));
 
 /* Mark as I/O up to next RAM region. */
 for ( ; pfn < rstart_pfn; pfn++ )
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 919347d8c2..e0671ab3c3 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -399,7 +399,7 @@ void *__init bootstrap_map(const module_t *mod)
 void *ret;
 
 if ( system_state != SYS_STATE_early_boot )
-return mod ? mfn_to_virt(mod->mod_start) : NULL;
+return mod ? mfn_to_directmap_virt(mod->mod_start) : NULL;
 
 if ( !mod )
 {
@@ -1708,7 +1708,7 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 {
 set_pdx_range(mod[i].mod_start,
   mod[i].mod_start + PFN_UP(mod[i].mod_end));
-

[PATCH V3 (resend) 18/19] xen/arm64: Implement a mapcache for arm64

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

At the moment, on arm64, map_domain_page() is implemented using
virt_to_mfn(). Therefore it is relying on the directmap.

In a follow-up patch, we will allow the admin to remove the directmap.
Therefore we want to implement a mapcache.

Thanksfully there is already one for arm32. So select ARCH_ARM_DOMAIN_PAGE
and add the necessary boiler plate to support 64-bit:
- The page-table start at level 0, so we need to allocate the level
  1 page-table
- map_domain_page() should check if the page is in the directmap. If
  yes, then use virt_to_mfn() to limit the performance impact
  when the directmap is still enabled (this will be selectable
  on the command line).

Take the opportunity to replace first_table_offset(...) with offsets[...].

Note that, so far, arch_mfns_in_directmap() always return true on
arm64. So the mapcache is not yet used. This will change in a
follow-up patch.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



There are a few TODOs:
- It is becoming more critical to fix the mapcache
  implementation (this is not compliant with the Arm Arm)
- Evaluate the performance

diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
index 21d03d9f44..0462960fc7 100644
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -1,7 +1,6 @@
 config ARM_32
def_bool y
depends on "$(ARCH)" = "arm32"
-   select ARCH_MAP_DOMAIN_PAGE
 
 config ARM_64
def_bool y
diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index 2ec1ffe1dc..826864d25d 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -237,6 +238,14 @@ void __init setup_mm(void)
 setup_frametable_mappings(ram_start, ram_end);
 max_page = PFN_DOWN(ram_end);
 
+/*
+ * The allocators may need to use map_domain_page() (such as for
+ * scrubbing pages). So we need to prepare the domheap area first.
+ */
+if ( !init_domheap_mappings(smp_processor_id()) )
+panic("CPU%u: Unable to prepare the domheap page-tables\n",
+  smp_processor_id());
+
 init_staticmem_pages();
 init_sharedmem_pages();
 }
diff --git a/xen/arch/arm/domain_page.c b/xen/arch/arm/domain_page.c
index ac2a6d0332..0f6ba48892 100644
--- a/xen/arch/arm/domain_page.c
+++ b/xen/arch/arm/domain_page.c
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
+#include 
 #include 
 #include 
 #include 
@@ -8,6 +9,8 @@
 /* Override macros from asm/page.h to make them work with mfn_t */
 #undef virt_to_mfn
 #define virt_to_mfn(va) _mfn(__virt_to_mfn(va))
+#undef mfn_to_virt
+#define mfn_to_virt(va) __mfn_to_virt(mfn_x(mfn))
 
 /* cpu0's domheap page tables */
 static DEFINE_PAGE_TABLES(cpu0_dommap, DOMHEAP_SECOND_PAGES);
@@ -31,13 +34,30 @@ bool init_domheap_mappings(unsigned int cpu)
 {
 unsigned int order = get_order_from_pages(DOMHEAP_SECOND_PAGES);
 lpae_t *root = per_cpu(xen_pgtable, cpu);
+lpae_t *first;
 unsigned int i, first_idx;
 lpae_t *domheap;
 mfn_t mfn;
 
+/* Convenience aliases */
+DECLARE_OFFSETS(offsets, DOMHEAP_VIRT_START);
+
 ASSERT(root);
 ASSERT(!per_cpu(xen_dommap, cpu));
 
+/*
+ * On Arm64, the root is at level 0. Therefore we need an extra step
+ * to allocate the first level page-table.
+ */
+#ifdef CONFIG_ARM_64
+if ( create_xen_table([offsets[0]]) )
+return false;
+
+first = xen_map_table(lpae_get_mfn(root[offsets[0]]));
+#else
+first = root;
+#endif
+
 /*
  * The domheap for cpu0 is initialized before the heap is initialized.
  * So we need to use pre-allocated pages.
@@ -58,16 +78,20 @@ bool init_domheap_mappings(unsigned int cpu)
  * domheap mapping pages.
  */
 mfn = virt_to_mfn(domheap);
-first_idx = first_table_offset(DOMHEAP_VIRT_START);
+first_idx = offsets[1];
 for ( i = 0; i < DOMHEAP_SECOND_PAGES; i++ )
 {
 lpae_t pte = mfn_to_xen_entry(mfn_add(mfn, i), MT_NORMAL);
 pte.pt.table = 1;
-write_pte([first_idx + i], pte);
+write_pte([first_idx + i], pte);
 }
 
 per_cpu(xen_dommap, cpu) = domheap;
 
+#ifdef CONFIG_ARM_64
+xen_unmap_table(first);
+#endif
+
 return true;
 }
 
@@ -91,6 +115,10 @@ void *map_domain_page(mfn_t mfn)
 lpae_t pte;
 int i, slot;
 
+/* Bypass the mapcache if the page is in the directmap */
+if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
+return mfn_to_virt(mfn);
+
 local_irq_save(flags);
 
 /* The map is laid out as an open-addressed hash table where each
@@ -153,13 +181,25 @@ void *map_domain_page(mfn_t mfn)
 /* Release a mapping taken with map_domain_page() */
 void unmap_domain_page(const void *ptr)
 {
+unsigned long va = (unsigned long)ptr;
 unsigned long flags;
 lpae_t *map = this_cpu(xen_dommap);
-int slot = ((unsigned long)ptr - 

[PATCH V3 (resend) 16/19] xen/arm32: mm: Rename 'first' to 'root' in init_secondary_pagetables()

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

The arm32 version of init_secondary_pagetables() will soon be re-used
for arm64 as well where the root table starts at level 0 rather than level 1.

So rename 'first' to 'root'.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changelog in v2:
* Rebase
* Fix typo

diff --git a/xen/arch/arm/mmu/smpboot.c b/xen/arch/arm/mmu/smpboot.c
index 4ffc8254a4..e29b6f34f2 100644
--- a/xen/arch/arm/mmu/smpboot.c
+++ b/xen/arch/arm/mmu/smpboot.c
@@ -109,32 +109,32 @@ int prepare_secondary_mm(int cpu)
 #else
 int prepare_secondary_mm(int cpu)
 {
-lpae_t *first;
+lpae_t *root = alloc_xenheap_page();
 
 first = alloc_xenheap_page(); /* root == first level on 32-bit 3-level 
trie */
 
-if ( !first )
+if ( !root )
 {
-printk("CPU%u: Unable to allocate the first page-table\n", cpu);
+printk("CPU%u: Unable to allocate the root page-table\n", cpu);
 return -ENOMEM;
 }
 
 /* Initialise root pagetable from root of boot tables */
-memcpy(first, per_cpu(xen_pgtable, 0), PAGE_SIZE);
-per_cpu(xen_pgtable, cpu) = first;
+memcpy(root, per_cpu(xen_pgtable, 0), PAGE_SIZE);
+per_cpu(xen_pgtable, cpu) = root;
 
 if ( !init_domheap_mappings(cpu) )
 {
 printk("CPU%u: Unable to prepare the domheap page-tables\n", cpu);
 per_cpu(xen_pgtable, cpu) = NULL;
-free_xenheap_page(first);
+free_xenheap_page(root);
 return -ENOMEM;
 }
 
 clear_boot_pagetables();
 
 /* Set init_ttbr for this CPU coming up */
-set_init_ttbr(first);
+set_init_ttbr(root);
 
 return 0;
 }
-- 
2.40.1




[PATCH V3 (resend) 19/19] xen/arm64: Allow the admin to enable/disable the directmap

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

Implement the same command line option as x86 to enable/disable the
directmap. By default this is kept enabled.

Also modify setup_directmap_mappings() to populate the L0 entries
related to the directmap area.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changes in v2:
* Rely on the Kconfig option to enable Secret Hiding on Arm64
* Use generic helper instead of arch_has_directmap()

diff --git a/docs/misc/xen-command-line.pandoc 
b/docs/misc/xen-command-line.pandoc
index 743d343ffa..cccd5e4282 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -799,7 +799,7 @@ that enabling this option cannot guarantee anything beyond 
what underlying
 hardware guarantees (with, where available and known to Xen, respective
 tweaks applied).
 
-### directmap (x86)
+### directmap (arm64, x86)
 > `= `
 
 > Default: `true`
diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
index 0462960fc7..1cb495e334 100644
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM_64
depends on !ARM_32
select 64BIT
select HAS_FAST_MULTIPLY
+   select HAS_SECRET_HIDING
 
 config ARM
def_bool y
diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index 826864d25d..81115cce51 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -158,16 +158,27 @@ void __init switch_ttbr(uint64_t ttbr)
 update_identity_mapping(false);
 }
 
-/* Map the region in the directmap area. */
+/*
+ * This either populate a valid fdirect map, or allocates empty L1 tables
+ * and creates the L0 entries for the given region in the direct map
+ * depending on has_directmap().
+ *
+ * When directmap=no, we still need to populate empty L1 tables in the
+ * directmap region. The reason is that the root page-table (i.e. L0)
+ * is per-CPU and secondary CPUs will initialize their root page-table
+ * based on the pCPU0 one. So L0 entries will be shared if they are
+ * pre-populated. We also rely on the fact that L1 tables are never
+ * freed.
+ */
 static void __init setup_directmap_mappings(unsigned long base_mfn,
 unsigned long nr_mfns)
 {
+unsigned long mfn_gb = base_mfn & ~((FIRST_SIZE >> PAGE_SHIFT) - 1);
 int rc;
 
 /* First call sets the directmap physical and virtual offset. */
 if ( mfn_eq(directmap_mfn_start, INVALID_MFN) )
 {
-unsigned long mfn_gb = base_mfn & ~((FIRST_SIZE >> PAGE_SHIFT) - 1);
 
 directmap_mfn_start = _mfn(base_mfn);
 directmap_base_pdx = mfn_to_pdx(_mfn(base_mfn));
@@ -188,6 +199,24 @@ static void __init setup_directmap_mappings(unsigned long 
base_mfn,
 panic("cannot add directmap mapping at %lx below heap start %lx\n",
   base_mfn, mfn_x(directmap_mfn_start));
 
+if ( !has_directmap() )
+{
+vaddr_t vaddr = (vaddr_t)__mfn_to_virt(base_mfn);
+lpae_t *root = this_cpu(xen_pgtable);
+unsigned int i, slot;
+
+slot = first_table_offset(vaddr);
+nr_mfns += base_mfn - mfn_gb;
+for ( i = 0; i < nr_mfns; i += BIT(XEN_PT_LEVEL_ORDER(0), UL), slot++ )
+{
+lpae_t *entry = [slot];
+
+if ( !lpae_is_valid(*entry) && !create_xen_table(entry) )
+panic("Unable to populate zeroeth slot %u\n", slot);
+}
+return;
+}
+
 rc = map_pages_to_xen((vaddr_t)__mfn_to_virt(base_mfn),
   _mfn(base_mfn), nr_mfns,
   PAGE_HYPERVISOR_RW | _PAGE_BLOCK);
diff --git a/xen/arch/arm/include/asm/arm64/mm.h 
b/xen/arch/arm/include/asm/arm64/mm.h
index e0bd23a6ed..5888f29159 100644
--- a/xen/arch/arm/include/asm/arm64/mm.h
+++ b/xen/arch/arm/include/asm/arm64/mm.h
@@ -3,13 +3,10 @@
 
 extern DEFINE_PAGE_TABLE(xen_pgtable);
 
-/*
- * On ARM64, all the RAM is currently direct mapped in Xen.
- * Hence return always true.
- */
+/* On Arm64, the user can chose whether all the RAM is directmap. */
 static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
 {
-return true;
+return has_directmap();
 }
 
 void arch_setup_page_tables(void);
diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index def939172c..0f3ffab6ba 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index d15987d6ea..6b06e2f4f5 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -778,6 +778,7 @@ void asmlinkage __init start_xen(unsigned long 
boot_phys_offset,
 cmdline_parse(cmdline);
 
 setup_mm();
+printk("Booting with directmap %s\n", has_directmap() ? "on" : "off");
 
 vm_init();
 
-- 
2.40.1




[PATCH V3 (resend) 15/19] Rename maddr_to_virt() calls

2024-05-13 Thread Elias El Yandouzi
Until directmap gets completely removed, we'd still need to
keep some calls to mmaddr_to_virt() for xenheap pages or when the
directmap is enabled.

Rename the macro to maddr_to_directmap_virt() to flag them and
prevent further use of maddr_to_virt().

Signed-off-by: Elias El Yandouzi 

diff --git a/xen/arch/x86/dmi_scan.c b/xen/arch/x86/dmi_scan.c
index 81f80c053a..ac016f3a04 100644
--- a/xen/arch/x86/dmi_scan.c
+++ b/xen/arch/x86/dmi_scan.c
@@ -277,7 +277,7 @@ const char *__init dmi_get_table(paddr_t *base, u32 *len)
return "SMBIOS";
}
} else {
-   char __iomem *p = maddr_to_virt(0xF), *q;
+   char __iomem *p = maddr_to_directmap_virt(0xF), *q;
union {
struct dmi_eps dmi;
struct smbios3_eps smbios3;
@@ -364,7 +364,7 @@ static int __init dmi_iterate(void (*decode)(const struct 
dmi_header *))
dmi.size = 0;
smbios3.length = 0;
 
-   p = maddr_to_virt(0xF);
+   p = maddr_to_directmap_virt(0xF);
for (q = p; q < p + 0x1; q += 16) {
if (!dmi.size) {
memcpy_fromio(, q, sizeof(dmi));
diff --git a/xen/arch/x86/include/asm/mach-default/bios_ebda.h 
b/xen/arch/x86/include/asm/mach-default/bios_ebda.h
index 42de6b2a5b..8cfe53d1f2 100644
--- a/xen/arch/x86/include/asm/mach-default/bios_ebda.h
+++ b/xen/arch/x86/include/asm/mach-default/bios_ebda.h
@@ -7,7 +7,7 @@
  */
 static inline unsigned int get_bios_ebda(void)
 {
-   unsigned int address = *(unsigned short *)maddr_to_virt(0x40E);
+   unsigned int address = *(unsigned short 
*)maddr_to_directmap_virt(0x40E);
address <<= 4;
return address; /* 0 means none */
 }
diff --git a/xen/arch/x86/include/asm/page.h b/xen/arch/x86/include/asm/page.h
index c6891b52d4..bf7bf08ba4 100644
--- a/xen/arch/x86/include/asm/page.h
+++ b/xen/arch/x86/include/asm/page.h
@@ -240,11 +240,11 @@ void copy_page_sse2(void *to, const void *from);
 
 /* Convert between Xen-heap virtual addresses and machine addresses. */
 #define __pa(x) (virt_to_maddr(x))
-#define __va(x) (maddr_to_virt(x))
+#define __va(x) (maddr_to_directmap_virt(x))
 
 /* Convert between Xen-heap virtual addresses and machine frame numbers. */
 #define __virt_to_mfn(va)   (virt_to_maddr(va) >> PAGE_SHIFT)
-#define __mfn_to_virt(mfn)  (maddr_to_virt((paddr_t)(mfn) << PAGE_SHIFT))
+#define __mfn_to_virt(mfn)  (maddr_to_directmap_virt((paddr_t)(mfn) << 
PAGE_SHIFT))
 
 /* Convert between machine frame numbers and page-info structures. */
 #define mfn_to_page(mfn)(frame_table + mfn_to_pdx(mfn))
@@ -270,7 +270,7 @@ void copy_page_sse2(void *to, const void *from);
 #define virt_to_mfn(va) __virt_to_mfn(va)
 #define mfn_to_directmap_virt(mfn)__mfn_to_virt(mfn)
 #define virt_to_maddr(va)   __virt_to_maddr((unsigned long)(va))
-#define maddr_to_virt(ma)   __maddr_to_virt((unsigned long)(ma))
+#define maddr_to_directmap_virt(ma)   __maddr_to_directmap_virt((unsigned 
long)(ma))
 #define maddr_to_page(ma)   __maddr_to_page(ma)
 #define page_to_maddr(pg)   __page_to_maddr(pg)
 #define virt_to_page(va)__virt_to_page(va)
diff --git a/xen/arch/x86/include/asm/x86_64/page.h 
b/xen/arch/x86/include/asm/x86_64/page.h
index 19ca64d792..a95ebc088f 100644
--- a/xen/arch/x86/include/asm/x86_64/page.h
+++ b/xen/arch/x86/include/asm/x86_64/page.h
@@ -48,7 +48,7 @@ static inline unsigned long __virt_to_maddr(unsigned long va)
 return xen_phys_start + va - XEN_VIRT_START;
 }
 
-static inline void *__maddr_to_virt(unsigned long ma)
+static inline void *__maddr_to_directmap_virt(unsigned long ma)
 {
 /* Offset in the direct map, accounting for pdx compression */
 unsigned long va_offset = maddr_to_directmapoff(ma);
diff --git a/xen/arch/x86/mpparse.c b/xen/arch/x86/mpparse.c
index d8ccab2449..69181b0abe 100644
--- a/xen/arch/x86/mpparse.c
+++ b/xen/arch/x86/mpparse.c
@@ -664,7 +664,7 @@ void __init get_smp_config (void)
 
 static int __init smp_scan_config (unsigned long base, unsigned long length)
 {
-   unsigned int *bp = maddr_to_virt(base);
+   unsigned int *bp = maddr_to_directmap_virt(base);
struct intel_mp_floating *mpf;
 
Dprintk("Scan SMP from %p for %ld bytes.\n", bp,length);
diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
index 39aed5845d..1b02e2b6d5 100644
--- a/xen/common/efi/boot.c
+++ b/xen/common/efi/boot.c
@@ -1764,7 +1764,7 @@ void __init efi_init_memory(void)
 if ( map_pages_to_xen((unsigned 
long)mfn_to_directmap_virt(smfn),
 _mfn(smfn), emfn - smfn, prot) == 0 )
 desc->VirtualStart =
-(unsigned long)maddr_to_virt(desc->PhysicalStart);
+(unsigned 
long)maddr_to_directmap_virt(desc->PhysicalStart);
 else
 printk(XENLOG_ERR "Could not 

[PATCH V3 (resend) 13/19] x86/setup: Do not create valid mappings when directmap=no

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

Create empty mappings in the second e820 pass. Also, destroy existing
direct map mappings created in the first pass.

To make xenheap pages visible in guests, it is necessary to create empty
L3 tables in the direct map even when directmap=no, since guest cr3s
copy idle domain's L4 entries, which means they will share mappings in
the direct map if we pre-populate idle domain's L4 entries and L3
tables. A helper is introduced for this.

Also, after the direct map is actually gone, we need to stop updating
the direct map in update_xen_mappings().

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index f26c9799e4..919347d8c2 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -978,6 +978,57 @@ static struct domain *__init create_dom0(const module_t 
*image,
 /* How much of the directmap is prebuilt at compile time. */
 #define PREBUILT_MAP_LIMIT (1 << L2_PAGETABLE_SHIFT)
 
+/*
+ * This either populates a valid direct map, or allocates empty L3 tables and
+ * creates the L4 entries for virtual address between [start, end) in the
+ * direct map depending on has_directmap();
+ *
+ * When directmap=no, we still need to populate empty L3 tables in the
+ * direct map region. The reason is that on-demand xenheap mappings are
+ * created in the idle domain's page table but must be seen by
+ * everyone. Since all domains share the direct map L4 entries, they
+ * will share xenheap mappings if we pre-populate the L4 entries and L3
+ * tables in the direct map region for all RAM. We also rely on the fact
+ * that L3 tables are never freed.
+ */
+static void __init populate_directmap(uint64_t pstart, uint64_t pend,
+  unsigned int flags)
+{
+unsigned long vstart = (unsigned long)__va(pstart);
+unsigned long vend = (unsigned long)__va(pend);
+
+if ( pstart >= pend )
+return;
+
+BUG_ON(vstart < DIRECTMAP_VIRT_START);
+BUG_ON(vend > DIRECTMAP_VIRT_END);
+
+if ( has_directmap() )
+/* Populate valid direct map. */
+BUG_ON(map_pages_to_xen(vstart, maddr_to_mfn(pstart),
+PFN_DOWN(pend - pstart), flags));
+else
+{
+/* Create empty L3 tables. */
+unsigned long vaddr = vstart & ~((1UL << L4_PAGETABLE_SHIFT) - 1);
+
+for ( ; vaddr < vend; vaddr += (1UL << L4_PAGETABLE_SHIFT) )
+{
+l4_pgentry_t *pl4e = _pg_table[l4_table_offset(vaddr)];
+
+if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
+{
+mfn_t mfn = alloc_boot_pages(1, 1);
+void *v = map_domain_page(mfn);
+
+clear_page(v);
+UNMAP_DOMAIN_PAGE(v);
+l4e_write(pl4e, l4e_from_mfn(mfn, __PAGE_HYPERVISOR));
+}
+}
+}
+}
+
 void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
 {
 const char *memmap_type = NULL, *loader, *cmdline = "";
@@ -1601,8 +1652,17 @@ void asmlinkage __init noreturn __start_xen(unsigned 
long mbi_p)
 map_e = min_t(uint64_t, e,
   ARRAY_SIZE(l2_directmap) << L2_PAGETABLE_SHIFT);
 
-/* Pass mapped memory to allocator /before/ creating new mappings. */
+/*
+ * Pass mapped memory to allocator /before/ creating new mappings.
+ * The direct map for the bottom 4GiB has been populated in the first
+ * e820 pass. In the second pass, we make sure those existing mappings
+ * are destroyed when directmap=no.
+ */
 init_boot_pages(s, min(map_s, e));
+if ( !has_directmap() )
+destroy_xen_mappings((unsigned long)__va(s),
+ (unsigned long)__va(min(map_s, e)));
+
 s = map_s;
 if ( s < map_e )
 {
@@ -1610,6 +1670,9 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 map_s = (s + mask) & ~mask;
 map_e &= ~mask;
 init_boot_pages(map_s, map_e);
+if ( !has_directmap() )
+destroy_xen_mappings((unsigned long)__va(map_s),
+ (unsigned long)__va(map_e));
 }
 
 if ( map_s > map_e )
@@ -1623,8 +1686,7 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 
 if ( map_e < end )
 {
-map_pages_to_xen((unsigned long)__va(map_e), 
maddr_to_mfn(map_e),
- PFN_DOWN(end - map_e), PAGE_HYPERVISOR);
+populate_directmap(map_e, end, PAGE_HYPERVISOR);
 init_boot_pages(map_e, end);
 map_e = end;
 }
@@ -1633,13 +1695,11 @@ void asmlinkage __init noreturn __start_xen(unsigned 
long mbi_p)
 {
 /* This range must not be passed to the boot allocator and
  * must also not be mapped with _PAGE_GLOBAL. */
-  

[PATCH V3 (resend) 10/19] xen/page_alloc: Add a path for xenheap when there is no direct map

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

When there is not an always-mapped direct map, xenheap allocations need
to be mapped and unmapped on-demand.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



I have left the call to map_pages_to_xen() and destroy_xen_mappings()
in the split heap for now. I am not entirely convinced this is necessary
because in that setup only the xenheap would be always mapped and
this doesn't contain any guest memory (aside the grant-table).
So map/unmapping for every allocation seems unnecessary.

Changes in v2:
* Fix remaining wrong indentation in alloc_xenheap_pages()

Changes since Hongyan's version:
* Rebase
* Fix indentation in alloc_xenheap_pages()
* Fix build for arm32

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 9b7e4721cd..dfb2c05322 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -2242,6 +2242,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe)
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
 {
 struct page_info *pg;
+void *ret;
 
 ASSERT_ALLOC_CONTEXT();
 
@@ -2250,17 +2251,36 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 if ( unlikely(pg == NULL) )
 return NULL;
 
+ret = page_to_virt(pg);
+
+if ( !has_directmap() &&
+ map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
+  PAGE_HYPERVISOR) )
+{
+/* Failed to map xenheap pages. */
+free_heap_pages(pg, order, false);
+return NULL;
+}
+
 return page_to_virt(pg);
 }
 
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
+unsigned long va = (unsigned long)v & PAGE_MASK;
+
 ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
 
+if ( !has_directmap() &&
+ destroy_xen_mappings(va, va + (1UL << (order + PAGE_SHIFT))) )
+dprintk(XENLOG_WARNING,
+"Error while destroying xenheap mappings at %p, order %u\n",
+v, order);
+
 free_heap_pages(virt_to_page(v), order, false);
 }
 
@@ -2284,6 +2304,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 {
 struct page_info *pg;
 unsigned int i;
+void *ret;
 
 ASSERT_ALLOC_CONTEXT();
 
@@ -2296,16 +2317,28 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 if ( unlikely(pg == NULL) )
 return NULL;
 
+ret = page_to_virt(pg);
+
+if ( !has_directmap() &&
+ map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
+  PAGE_HYPERVISOR) )
+{
+/* Failed to map xenheap pages. */
+free_domheap_pages(pg, order);
+return NULL;
+}
+
 for ( i = 0; i < (1u << order); i++ )
 pg[i].count_info |= PGC_xen_heap;
 
-return page_to_virt(pg);
+return ret;
 }
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
 struct page_info *pg;
 unsigned int i;
+unsigned long va = (unsigned long)v & PAGE_MASK;
 
 ASSERT_ALLOC_CONTEXT();
 
@@ -2317,6 +2350,12 @@ void free_xenheap_pages(void *v, unsigned int order)
 for ( i = 0; i < (1u << order); i++ )
 pg[i].count_info &= ~PGC_xen_heap;
 
+if ( !has_directmap() &&
+ destroy_xen_mappings(va, va + (1UL << (order + PAGE_SHIFT))) )
+dprintk(XENLOG_WARNING,
+"Error while destroying xenheap mappings at %p, order %u\n",
+v, order);
+
 free_heap_pages(pg, order, true);
 }
 
-- 
2.40.1




[PATCH V3 (resend) 08/19] xen/x86: Add build assertion for fixmap entries

2024-05-13 Thread Elias El Yandouzi
The early fixed addresses must all fit into the static L1 table.
Introduce a build assertion to this end.

Signed-off-by: Elias El Yandouzi 



 Changes in v2:
 * New patch

diff --git a/xen/arch/x86/include/asm/fixmap.h 
b/xen/arch/x86/include/asm/fixmap.h
index a7ac365fc6..904bee0480 100644
--- a/xen/arch/x86/include/asm/fixmap.h
+++ b/xen/arch/x86/include/asm/fixmap.h
@@ -77,6 +77,11 @@ enum fixed_addresses {
 #define FIXADDR_SIZE  (__end_of_fixed_addresses << PAGE_SHIFT)
 #define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
 
+static inline void fixaddr_build_assertion(void)
+{
+BUILD_BUG_ON(FIX_PMAP_END > L1_PAGETABLE_ENTRIES - 1);
+}
+
 extern void __set_fixmap(
 enum fixed_addresses idx, unsigned long mfn, unsigned long flags);
 
-- 
2.40.1




[PATCH V3 (resend) 12/19] x86/setup: vmap heap nodes when they are outside the direct map

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

When we do not have a direct map, archs_mfn_in_direct_map() will always
return false, thus init_node_heap() will allocate xenheap pages from an
existing node for the metadata of a new node. This means that the
metadata of a new node is in a different node, slowing down heap
allocation.

Since we now have early vmap, vmap the metadata locally in the new node.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changes in v2:
* vmap_contig_pages() was renamed to vmap_contig()
* Fix indentation and coding style

Changes from Hongyan's version:
* arch_mfn_in_direct_map() was renamed to
  arch_mfns_in_direct_map()
* Use vmap_contig_pages() rather than __vmap(...).
* Add missing include (xen/vmap.h) so it compiles on Arm

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index dfb2c05322..3c0909f333 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -136,6 +136,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -605,22 +606,44 @@ static unsigned long init_node_heap(int node, unsigned 
long mfn,
 needed = 0;
 }
 else if ( *use_tail && nr >= needed &&
-  arch_mfns_in_directmap(mfn + nr - needed, needed) &&
   (!xenheap_bits ||
-   !((mfn + nr - 1) >> (xenheap_bits - PAGE_SHIFT))) )
+  !((mfn + nr - 1) >> (xenheap_bits - PAGE_SHIFT))) )
 {
-_heap[node] = mfn_to_virt(mfn + nr - needed);
-avail[node] = mfn_to_virt(mfn + nr - 1) +
-  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
+if ( arch_mfns_in_directmap(mfn + nr - needed, needed) )
+{
+_heap[node] = mfn_to_virt(mfn + nr - needed);
+avail[node] = mfn_to_virt(mfn + nr - 1) +
+  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
+}
+else
+{
+mfn_t needed_start = _mfn(mfn + nr - needed);
+
+_heap[node] = vmap_contig(needed_start, needed);
+BUG_ON(!_heap[node]);
+avail[node] = (void *)(_heap[node]) + (needed << PAGE_SHIFT) -
+  sizeof(**avail) * NR_ZONES;
+}
 }
 else if ( nr >= needed &&
-  arch_mfns_in_directmap(mfn, needed) &&
   (!xenheap_bits ||
-   !((mfn + needed - 1) >> (xenheap_bits - PAGE_SHIFT))) )
+  !((mfn + needed - 1) >> (xenheap_bits - PAGE_SHIFT))) )
 {
-_heap[node] = mfn_to_virt(mfn);
-avail[node] = mfn_to_virt(mfn + needed - 1) +
-  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
+if ( arch_mfns_in_directmap(mfn, needed) )
+{
+_heap[node] = mfn_to_virt(mfn);
+avail[node] = mfn_to_virt(mfn + needed - 1) +
+  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
+}
+else
+{
+mfn_t needed_start = _mfn(mfn);
+
+_heap[node] = vmap_contig(needed_start, needed);
+BUG_ON(!_heap[node]);
+avail[node] = (void *)(_heap[node]) + (needed << PAGE_SHIFT) -
+  sizeof(**avail) * NR_ZONES;
+}
 *use_tail = false;
 }
 else if ( get_order_from_bytes(sizeof(**_heap)) ==
-- 
2.40.1




[PATCH V3 (resend) 01/19] x86: Create per-domain mapping of guest_root_pt

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

Create a per-domain mapping of PV guest_root_pt as direct map is being
removed.

Note that we do not map and unmap root_pgt for now since it is still a
xenheap page.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 


Changes in V3:
* Rename SHADOW_ROOT
* Haven't addressed the potentially over-allocation issue as I don't 
get it

Changes in V2:
* Rework the shadow perdomain mapping solution in the follow-up patches

Changes since Hongyan's version:
* Remove the final dot in the commit title

diff --git a/xen/arch/x86/include/asm/config.h 
b/xen/arch/x86/include/asm/config.h
index ab7288cb36..5d710384df 100644
--- a/xen/arch/x86/include/asm/config.h
+++ b/xen/arch/x86/include/asm/config.h
@@ -203,7 +203,7 @@ extern unsigned char boot_edid_info[128];
 /* Slot 260: per-domain mappings (including map cache). */
 #define PERDOMAIN_VIRT_START(PML4_ADDR(260))
 #define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
-#define PERDOMAIN_SLOTS 3
+#define PERDOMAIN_SLOTS 4
 #define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
  (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 4: mirror of per-domain mappings (for compat xlat area accesses). */
@@ -317,6 +317,14 @@ extern unsigned long xen_phys_start;
 #define ARG_XLAT_START(v)\
 (ARG_XLAT_VIRT_START + ((v)->vcpu_id << ARG_XLAT_VA_SHIFT))
 
+/* pv_root_pt mapping area. The fourth per-domain-mapping sub-area */
+#define PV_ROOT_PT_MAPPING_VIRT_START   PERDOMAIN_VIRT_SLOT(3)
+#define PV_ROOT_PT_MAPPING_ENTRIES  MAX_VIRT_CPUS
+
+/* The address of a particular VCPU's PV_ROOT_PT */
+#define PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v) \
+(PV_ROOT_PT_MAPPING_VIRT_START + ((v)->vcpu_id * PAGE_SIZE))
+
 #define ELFSIZE 64
 
 #define ARCH_CRASH_SAVE_VMCOREINFO
diff --git a/xen/arch/x86/include/asm/domain.h 
b/xen/arch/x86/include/asm/domain.h
index f5daeb182b..8a97530607 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -272,6 +272,7 @@ struct time_scale {
 struct pv_domain
 {
 l1_pgentry_t **gdt_ldt_l1tab;
+l1_pgentry_t **root_pt_l1tab;
 
 atomic_t nr_l4_pages;
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d968bbbc73..efdf20f775 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,6 +505,13 @@ void share_xen_page_with_guest(struct page_info *page, 
struct domain *d,
 nrspin_unlock(>page_alloc_lock);
 }
 
+#define pv_root_pt_idx(v) \
+((v)->vcpu_id >> PAGETABLE_ORDER)
+
+#define pv_root_pt_pte(v) \
+((v)->domain->arch.pv.root_pt_l1tab[pv_root_pt_idx(v)] + \
+ ((v)->vcpu_id & (L1_PAGETABLE_ENTRIES - 1)))
+
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
 struct domain *d = v->domain;
@@ -524,6 +531,13 @@ void write_ptbase(struct vcpu *v)
 
 if ( is_pv_vcpu(v) && v->domain->arch.pv.xpti )
 {
+mfn_t guest_root_pt = _mfn(MASK_EXTR(v->arch.cr3, PAGE_MASK));
+l1_pgentry_t *pte = pv_root_pt_pte(v);
+
+ASSERT(v == current);
+
+l1e_write(pte, l1e_from_mfn(guest_root_pt, __PAGE_HYPERVISOR_RO));
+
 cpu_info->root_pgt_changed = true;
 cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
 if ( new_cr4 & X86_CR4_PCIDE )
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2a445bb17b..1b025986f7 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -288,6 +288,21 @@ static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
   1U << GDT_LDT_VCPU_SHIFT);
 }
 
+static int pv_create_root_pt_l1tab(struct vcpu *v)
+{
+return create_perdomain_mapping(v->domain,
+PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v),
+1, v->domain->arch.pv.root_pt_l1tab,
+NULL);
+}
+
+static void pv_destroy_root_pt_l1tab(struct vcpu *v)
+
+{
+destroy_perdomain_mapping(v->domain,
+  PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v), 1);
+}
+
 void pv_vcpu_destroy(struct vcpu *v)
 {
 if ( is_pv_32bit_vcpu(v) )
@@ -297,6 +312,7 @@ void pv_vcpu_destroy(struct vcpu *v)
 }
 
 pv_destroy_gdt_ldt_l1tab(v);
+pv_destroy_root_pt_l1tab(v);
 XFREE(v->arch.pv.trap_ctxt);
 }
 
@@ -311,6 +327,13 @@ int pv_vcpu_initialise(struct vcpu *v)
 if ( rc )
 return rc;
 
+if ( v->domain->arch.pv.xpti )
+{
+rc = pv_create_root_pt_l1tab(v);
+if ( rc )
+goto done;
+}
+
 BUILD_BUG_ON(X86_NR_VECTORS * sizeof(*v->arch.pv.trap_ctxt) >
  PAGE_SIZE);
 v->arch.pv.trap_ctxt = xzalloc_array(struct trap_info, X86_NR_VECTORS);
@@ -346,10 +369,12 @@ void pv_domain_destroy(struct domain *d)
 
 destroy_perdomain_mapping(d, GDT_LDT_VIRT_START,
   GDT_LDT_MBYTES << (20 - PAGE_SHIFT));
+destroy_perdomain_mapping(d, 

[PATCH V3 (resend) 11/19] x86/setup: Leave early boot slightly earlier

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

When we do not have a direct map, memory for metadata of heap nodes in
init_node_heap() is allocated from xenheap, which needs to be mapped and
unmapped on demand. However, we cannot just take memory from the boot
allocator to create the PTEs while we are passing memory to the heap
allocator.

To solve this race, we leave early boot slightly sooner so that Xen PTE
pages are allocated from the heap instead of the boot allocator. We can
do this because the metadata for the 1st node is statically allocated,
and by the time we need memory to create mappings for the 2nd node, we
already have enough memory in the heap allocator in the 1st node.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index bd6b1184f5..f26c9799e4 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1751,6 +1751,22 @@ void asmlinkage __init noreturn __start_xen(unsigned 
long mbi_p)
 
 numa_initmem_init(0, raw_max_page);
 
+/*
+ * When we do not have a direct map, memory for metadata of heap nodes in
+ * init_node_heap() is allocated from xenheap, which needs to be mapped and
+ * unmapped on demand. However, we cannot just take memory from the boot
+ * allocator to create the PTEs while we are passing memory to the heap
+ * allocator during end_boot_allocator().
+ *
+ * To solve this race, we need to leave early boot before
+ * end_boot_allocator() so that Xen PTE pages are allocated from the heap
+ * instead of the boot allocator. We can do this because the metadata for
+ * the 1st node is statically allocated, and by the time we need memory to
+ * create mappings for the 2nd node, we already have enough memory in the
+ * heap allocator in the 1st node.
+ */
+system_state = SYS_STATE_boot;
+
 if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
 {
 unsigned long lo = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
@@ -1782,8 +1798,6 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 else
 end_boot_allocator();
 
-system_state = SYS_STATE_boot;
-
 bsp_stack = cpu_alloc_stack(0);
 if ( !bsp_stack )
 panic("No memory for BSP stack\n");
-- 
2.40.1




[PATCH V3 (resend) 06/19] x86: Add a boot option to enable and disable the direct map

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

Also add a helper function to retrieve it. Change arch_mfns_in_direct_map
to check this option before returning.

This is added as a Kconfig option as well as a boot command line option.
While being generic, the Kconfig option is only usable for x86 at the moment.

Note that there remains some users of the directmap at this point. The option
is introduced now as it will be needed in follow-up patches.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changes in V2:
* Introduce a Kconfig option
* Reword the commit message
* Make opt_directmap and helper generic

Changes since Hongyan's version:
* Reword the commit message
* opt_directmap is only modified during boot so mark it as
  __ro_after_init

diff --git a/docs/misc/xen-command-line.pandoc 
b/docs/misc/xen-command-line.pandoc
index e760f3266e..743d343ffa 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -799,6 +799,18 @@ that enabling this option cannot guarantee anything beyond 
what underlying
 hardware guarantees (with, where available and known to Xen, respective
 tweaks applied).
 
+### directmap (x86)
+> `= `
+
+> Default: `true`
+
+Enable or disable the directmap region in Xen.
+
+By default, Xen creates the directmap region which maps physical memory
+in that region. Setting this to no will sparsely populate the directmap,
+blocking exploits that leak secrets via speculative memory access in the
+directmap.
+
 ### dma_bits
 > `= `
 
diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 7e03e4bc55..b4ec0e582e 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -28,6 +28,7 @@ config X86
select HAS_PCI_MSI
select HAS_PIRQ
select HAS_SCHED_GRANULARITY
+   select HAS_SECRET_HIDING
select HAS_UBSAN
select HAS_VPCI if HVM
select NEEDS_LIBELF
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 98b66edaca..54d835f156 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -622,11 +622,17 @@ void write_32bit_pse_identmap(uint32_t *l2);
 /*
  * x86 maps part of physical memory via the directmap region.
  * Return whether the range of MFN falls in the directmap region.
+ *
+ * When boot command line sets directmap=no, the directmap will mostly be empty
+ * so this will always return false.
  */
 static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
 {
 unsigned long eva = min(DIRECTMAP_VIRT_END, HYPERVISOR_VIRT_END);
 
+if ( !has_directmap() )
+return false;
+
 return (mfn + nr) <= (virt_to_mfn(eva - 1) + 1);
 }
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index f84e1cd79c..bd6b1184f5 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1517,6 +1517,8 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 if ( highmem_start )
 xenheap_max_mfn(PFN_DOWN(highmem_start - 1));
 
+printk("Booting with directmap %s\n", has_directmap() ? "on" : "off");
+
 /*
  * Walk every RAM region and map it in its entirety (on x86/64, at least)
  * and notify it to the boot allocator.
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 565ceda741..856604068c 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -80,12 +80,29 @@ config HAS_PMAP
 config HAS_SCHED_GRANULARITY
bool
 
+config HAS_SECRET_HIDING
+   bool
+
 config HAS_UBSAN
bool
 
 config MEM_ACCESS_ALWAYS_ON
bool
 
+config SECRET_HIDING
+bool "Secret hiding"
+depends on HAS_SECRET_HIDING
+help
+   The directmap contains mapping for most of the RAM which makes 
domain
+   memory easily accessible. While making the performance better, 
it also makes
+   the hypervisor more vulnerable to speculation attacks.
+
+   Enabling this feature will allow the user to decide whether the 
memory
+   is always mapped at boot or mapped only on demand (see the 
command line
+   option "directmap").
+
+   If unsure, say N.
+
 config MEM_ACCESS
def_bool MEM_ACCESS_ALWAYS_ON
prompt "Memory Access and VM events" if !MEM_ACCESS_ALWAYS_ON
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 7c1bdfc046..9b7e4721cd 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -174,6 +174,11 @@ paddr_t __ro_after_init mem_hotplug;
 static char __initdata opt_badpage[100] = "";
 string_param("badpage", opt_badpage);
 
+bool __ro_after_init opt_directmap = true;
+#ifdef CONFIG_HAS_SECRET_HIDING
+boolean_param("directmap", opt_directmap);
+#endif
+
 /*
  * no-bootscrub -> Free pages are not zeroed during boot.
  */
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 7561297a75..9d4f1f2d0d 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -167,6 

[PATCH V3 (resend) 05/19] x86/mapcache: Initialise the mapcache for the idle domain

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

In order to use the mapcache in the idle domain, we also have to
populate its page tables in the PERDOMAIN region, and we need to move
mapcache_domain_init() earlier in arch_domain_create().

Note, commit 'x86: lift mapcache variable to the arch level' has
initialised the mapcache for HVM domains. With this patch, PV, HVM,
idle domains now all initialise the mapcache.

Signed-off-by: Wei Wang 
Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changes in V2:
  * Free resources if mapcache initialisation fails
  * Remove `is_idle_domain()` check from `create_perdomain_mappings()`

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 507d704f16..3303bdb53e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -758,9 +758,16 @@ int arch_domain_create(struct domain *d,
 
 spin_lock_init(>arch.e820_lock);
 
+if ( (rc = mapcache_domain_init(d)) != 0)
+{
+free_perdomain_mappings(d);
+return rc;
+}
+
 /* Minimal initialisation for the idle domain. */
 if ( unlikely(is_idle_domain(d)) )
 {
+struct page_info *pg = d->arch.perdomain_l3_pg;
 static const struct arch_csw idle_csw = {
 .from = paravirt_ctxt_switch_from,
 .to   = paravirt_ctxt_switch_to,
@@ -771,6 +778,9 @@ int arch_domain_create(struct domain *d,
 
 d->arch.cpu_policy = ZERO_BLOCK_PTR; /* Catch stray misuses. */
 
+idle_pg_table[l4_table_offset(PERDOMAIN_VIRT_START)] =
+l4e_from_page(pg, __PAGE_HYPERVISOR_RW);
+
 return 0;
 }
 
@@ -851,8 +861,6 @@ int arch_domain_create(struct domain *d,
 
 psr_domain_init(d);
 
-mapcache_domain_init(d);
-
 if ( is_hvm_domain(d) )
 {
 if ( (rc = hvm_domain_initialise(d, config)) != 0 )
-- 
2.40.1




[PATCH V3 (resend) 07/19] xen/x86: Add support for the PMAP

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

PMAP will be used in a follow-up patch to bootstrap map domain
page infrastructure -- we need some way to map pages to setup the
mapcache without a direct map.

The functions pmap_{map, unmap} open code {set, clear}_fixmap to break
the loop.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



The PMAP infrastructure was upstream separately for Arm since
Hongyan sent the secret-free hypervisor series. So this is a new
patch to plumb the feature on x86.

Changes in v2:
* Declare PMAP entries earlier in fixed_addresses
* Reword the commit message

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index b4ec0e582e..56feb0c564 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
select HAS_PCI
select HAS_PCI_MSI
select HAS_PIRQ
+   select HAS_PMAP
select HAS_SCHED_GRANULARITY
select HAS_SECRET_HIDING
select HAS_UBSAN
diff --git a/xen/arch/x86/include/asm/fixmap.h 
b/xen/arch/x86/include/asm/fixmap.h
index 516ec3fa6c..a7ac365fc6 100644
--- a/xen/arch/x86/include/asm/fixmap.h
+++ b/xen/arch/x86/include/asm/fixmap.h
@@ -21,6 +21,8 @@
 
 #include 
 #include 
+#include 
+
 #include 
 #include 
 #include 
@@ -53,6 +55,8 @@ enum fixed_addresses {
 FIX_PV_CONSOLE,
 FIX_XEN_SHARED_INFO,
 #endif /* CONFIG_XEN_GUEST */
+FIX_PMAP_BEGIN,
+FIX_PMAP_END = FIX_PMAP_BEGIN + NUM_FIX_PMAP,
 /* Everything else should go further down. */
 FIX_APIC_BASE,
 FIX_IO_APIC_BASE_0,
diff --git a/xen/arch/x86/include/asm/pmap.h b/xen/arch/x86/include/asm/pmap.h
new file mode 100644
index 00..62746e191d
--- /dev/null
+++ b/xen/arch/x86/include/asm/pmap.h
@@ -0,0 +1,25 @@
+#ifndef __ASM_PMAP_H__
+#define __ASM_PMAP_H__
+
+#include 
+
+static inline void arch_pmap_map(unsigned int slot, mfn_t mfn)
+{
+unsigned long linear = (unsigned long)fix_to_virt(slot);
+l1_pgentry_t *pl1e = _fixmap[l1_table_offset(linear)];
+
+ASSERT(!(l1e_get_flags(*pl1e) & _PAGE_PRESENT));
+
+l1e_write_atomic(pl1e, l1e_from_mfn(mfn, PAGE_HYPERVISOR));
+}
+
+static inline void arch_pmap_unmap(unsigned int slot)
+{
+unsigned long linear = (unsigned long)fix_to_virt(slot);
+l1_pgentry_t *pl1e = _fixmap[l1_table_offset(linear)];
+
+l1e_write_atomic(pl1e, l1e_empty());
+flush_tlb_one_local(linear);
+}
+
+#endif /* __ASM_PMAP_H__ */
-- 
2.40.1




[PATCH V3 (resend) 04/19] x86: Lift mapcache variable to the arch level

2024-05-13 Thread Elias El Yandouzi
From: Wei Liu 

It is going to be needed by HVM and idle domain as well, because without
the direct map, both need a mapcache to map pages.

This commit lifts the mapcache variable up and initialise it a bit earlier
for PV and HVM domains.

Signed-off-by: Wei Liu 
Signed-off-by: Wei Wang 
Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 20e83cf38b..507d704f16 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -851,6 +851,8 @@ int arch_domain_create(struct domain *d,
 
 psr_domain_init(d);
 
+mapcache_domain_init(d);
+
 if ( is_hvm_domain(d) )
 {
 if ( (rc = hvm_domain_initialise(d, config)) != 0 )
@@ -858,8 +860,6 @@ int arch_domain_create(struct domain *d,
 }
 else if ( is_pv_domain(d) )
 {
-mapcache_domain_init(d);
-
 if ( (rc = pv_domain_initialise(d)) != 0 )
 goto fail;
 }
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304f..55e337aaf7 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -82,11 +82,11 @@ void *map_domain_page(mfn_t mfn)
 #endif
 
 v = mapcache_current_vcpu();
-if ( !v || !is_pv_vcpu(v) )
+if ( !v )
 return mfn_to_virt(mfn_x(mfn));
 
-dcache = >domain->arch.pv.mapcache;
-vcache = >arch.pv.mapcache;
+dcache = >domain->arch.mapcache;
+vcache = >arch.mapcache;
 if ( !dcache->inuse )
 return mfn_to_virt(mfn_x(mfn));
 
@@ -187,14 +187,14 @@ void unmap_domain_page(const void *ptr)
 ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
 v = mapcache_current_vcpu();
-ASSERT(v && is_pv_vcpu(v));
+ASSERT(v);
 
-dcache = >domain->arch.pv.mapcache;
+dcache = >domain->arch.mapcache;
 ASSERT(dcache->inuse);
 
 idx = PFN_DOWN(va - MAPCACHE_VIRT_START);
 mfn = l1e_get_pfn(MAPCACHE_L1ENT(idx));
-hashent = >arch.pv.mapcache.hash[MAPHASH_HASHFN(mfn)];
+hashent = >arch.mapcache.hash[MAPHASH_HASHFN(mfn)];
 
 local_irq_save(flags);
 
@@ -233,11 +233,9 @@ void unmap_domain_page(const void *ptr)
 
 int mapcache_domain_init(struct domain *d)
 {
-struct mapcache_domain *dcache = >arch.pv.mapcache;
+struct mapcache_domain *dcache = >arch.mapcache;
 unsigned int bitmap_pages;
 
-ASSERT(is_pv_domain(d));
-
 #ifdef NDEBUG
 if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
 return 0;
@@ -261,12 +259,12 @@ int mapcache_domain_init(struct domain *d)
 int mapcache_vcpu_init(struct vcpu *v)
 {
 struct domain *d = v->domain;
-struct mapcache_domain *dcache = >arch.pv.mapcache;
+struct mapcache_domain *dcache = >arch.mapcache;
 unsigned long i;
 unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
 unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
 
-if ( !is_pv_vcpu(v) || !dcache->inuse )
+if ( !dcache->inuse )
 return 0;
 
 if ( ents > dcache->entries )
@@ -293,7 +291,7 @@ int mapcache_vcpu_init(struct vcpu *v)
 BUILD_BUG_ON(MAPHASHENT_NOTINUSE < MAPCACHE_ENTRIES);
 for ( i = 0; i < MAPHASH_ENTRIES; i++ )
 {
-struct vcpu_maphash_entry *hashent = >arch.pv.mapcache.hash[i];
+struct vcpu_maphash_entry *hashent = >arch.mapcache.hash[i];
 
 hashent->mfn = ~0UL; /* never valid to map */
 hashent->idx = MAPHASHENT_NOTINUSE;
diff --git a/xen/arch/x86/include/asm/domain.h 
b/xen/arch/x86/include/asm/domain.h
index 8a97530607..7f0480d7a7 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -285,9 +285,6 @@ struct pv_domain
 /* Mitigate L1TF with shadow/crashing? */
 bool check_l1tf;
 
-/* map_domain_page() mapping cache. */
-struct mapcache_domain mapcache;
-
 struct cpuidmasks *cpuidmasks;
 };
 
@@ -326,6 +323,9 @@ struct arch_domain
 
 uint8_t scf; /* See SCF_DOM_MASK */
 
+/* map_domain_page() mapping cache. */
+struct mapcache_domain mapcache;
+
 union {
 struct pv_domain pv;
 struct hvm_domain hvm;
@@ -516,9 +516,6 @@ struct arch_domain
 
 struct pv_vcpu
 {
-/* map_domain_page() mapping cache. */
-struct mapcache_vcpu mapcache;
-
 unsigned int vgc_flags;
 
 struct trap_info *trap_ctxt;
@@ -618,6 +615,9 @@ struct arch_vcpu
 #define async_exception_state(t) async_exception_state[(t)-1]
 uint8_t async_exception_mask;
 
+/* map_domain_page() mapping cache. */
+struct mapcache_vcpu mapcache;
+
 /* Virtual Machine Extensions */
 union {
 struct pv_vcpu pv;
-- 
2.40.1




[PATCH V3 (resend) 09/19] x86/domain_page: Remove the fast paths when mfn is not in the directmap

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

When mfn is not in direct map, never use mfn_to_virt for any mappings.

We replace mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) with
arch_mfns_in_direct_map(mfn, 1) because these two are equivalent. The
extra comparison in arch_mfns_in_direct_map() looks different but because
DIRECTMAP_VIRT_END is always higher, it does not make any difference.

Lastly, domain_page_map_to_mfn() needs to gain to a special case for
the PMAP.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 



Changes since Hongyan's version:
* arch_mfn_in_direct_map() was renamed to arch_mfns_in_directmap()
* add a special case for the PMAP in domain_page_map_to_mfn()

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 55e337aaf7..89caefc8a2 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -14,8 +14,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 
 static DEFINE_PER_CPU(struct vcpu *, override);
@@ -35,10 +37,11 @@ static inline struct vcpu *mapcache_current_vcpu(void)
 /*
  * When using efi runtime page tables, we have the equivalent of the idle
  * domain's page tables but current may point at another domain's VCPU.
- * Return NULL as though current is not properly set up yet.
+ * Return the idle domains's vcpu on that core because the efi per-domain
+ * region (where the mapcache is) is in-sync with the idle domain.
  */
 if ( efi_rs_using_pgtables() )
-return NULL;
+return idle_vcpu[smp_processor_id()];
 
 /*
  * If guest_table is NULL, and we are running a paravirtualised guest,
@@ -77,18 +80,24 @@ void *map_domain_page(mfn_t mfn)
 struct vcpu_maphash_entry *hashent;
 
 #ifdef NDEBUG
-if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
 return mfn_to_virt(mfn_x(mfn));
 #endif
 
 v = mapcache_current_vcpu();
-if ( !v )
-return mfn_to_virt(mfn_x(mfn));
+if ( !v || !v->domain->arch.mapcache.inuse )
+{
+if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
+return mfn_to_virt(mfn_x(mfn));
+else
+{
+BUG_ON(system_state >= SYS_STATE_smp_boot);
+return pmap_map(mfn);
+}
+}
 
 dcache = >domain->arch.mapcache;
 vcache = >arch.mapcache;
-if ( !dcache->inuse )
-return mfn_to_virt(mfn_x(mfn));
 
 perfc_incr(map_domain_page_count);
 
@@ -184,6 +193,12 @@ void unmap_domain_page(const void *ptr)
 if ( !va || va >= DIRECTMAP_VIRT_START )
 return;
 
+if ( va >= FIXADDR_START && va < FIXADDR_TOP )
+{
+pmap_unmap((void *)ptr);
+return;
+}
+
 ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
 v = mapcache_current_vcpu();
@@ -237,7 +252,7 @@ int mapcache_domain_init(struct domain *d)
 unsigned int bitmap_pages;
 
 #ifdef NDEBUG
-if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+if ( !mem_hotplug && arch_mfn_in_directmap(0, max_page) )
 return 0;
 #endif
 
@@ -308,7 +323,7 @@ void *map_domain_page_global(mfn_t mfn)
 local_irq_is_enabled()));
 
 #ifdef NDEBUG
-if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+if ( arch_mfn_in_directmap(mfn_x(mfn, 1)) )
 return mfn_to_virt(mfn_x(mfn));
 #endif
 
@@ -335,6 +350,23 @@ mfn_t domain_page_map_to_mfn(const void *ptr)
 if ( va >= DIRECTMAP_VIRT_START )
 return _mfn(virt_to_mfn(ptr));
 
+/*
+ * The fixmap is stealing the top-end of the VMAP. So the check for
+ * the PMAP *must* happen first.
+ *
+ * Also, the fixmap translate a slot to an address backwards. The
+ * logic will rely on it to avoid any complexity. So check at
+ * compile time this will always hold.
+*/
+BUILD_BUG_ON(fix_to_virt(FIX_PMAP_BEGIN) < fix_to_virt(FIX_PMAP_END));
+
+if ( ((unsigned long)fix_to_virt(FIX_PMAP_END) <= va) &&
+ ((va & PAGE_MASK) <= (unsigned long)fix_to_virt(FIX_PMAP_BEGIN)) )
+{
+BUG_ON(system_state >= SYS_STATE_smp_boot);
+return l1e_get_mfn(l1_fixmap[l1_table_offset(va)]);
+}
+
 if ( va >= VMAP_VIRT_START && va < VMAP_VIRT_END )
 return vmap_to_mfn(va);
 
-- 
2.40.1




[PATCH V3 (resend) 03/19] x86/pv: Rewrite how building PV dom0 handles domheap mappings

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

Building a PV dom0 is allocating from the domheap but uses it like the
xenheap. Use the pages as they should be.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 


Changes in V3:
* Fold following patch 'x86/pv: Map L4 page table for shim domain'

Changes in V2:
* Clarify the commit message
* Break the patch in two parts

Changes since Hongyan's version:
* Rebase
* Remove spurious newline

diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 807296c280..ac910b438a 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -382,6 +382,10 @@ int __init dom0_construct_pv(struct domain *d,
 l3_pgentry_t *l3tab = NULL, *l3start = NULL;
 l2_pgentry_t *l2tab = NULL, *l2start = NULL;
 l1_pgentry_t *l1tab = NULL, *l1start = NULL;
+mfn_t l4start_mfn = INVALID_MFN;
+mfn_t l3start_mfn = INVALID_MFN;
+mfn_t l2start_mfn = INVALID_MFN;
+mfn_t l1start_mfn = INVALID_MFN;
 
 /*
  * This fully describes the memory layout of the initial domain. All
@@ -710,22 +714,32 @@ int __init dom0_construct_pv(struct domain *d,
 v->arch.pv.event_callback_cs= FLAT_COMPAT_KERNEL_CS;
 }
 
+#define UNMAP_MAP_AND_ADVANCE(mfn_var, virt_var, maddr) \
+do {\
+unmap_domain_page(virt_var);\
+mfn_var = maddr_to_mfn(maddr);  \
+maddr += PAGE_SIZE; \
+virt_var = map_domain_page(mfn_var);\
+} while ( false )
+
 if ( !compat )
 {
 maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l4_page_table;
-l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l4start_mfn, l4start, mpt_alloc);
+l4tab = l4start;
 clear_page(l4tab);
-init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)),
-  d, INVALID_MFN, true);
-v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
+init_xen_l4_slots(l4tab, l4start_mfn, d, INVALID_MFN, true);
+v->arch.guest_table = pagetable_from_mfn(l4start_mfn);
 }
 else
 {
 /* Monitor table already created by switch_compat(). */
-l4start = l4tab = __va(pagetable_get_paddr(v->arch.guest_table));
+l4start_mfn = pagetable_get_mfn(v->arch.guest_table);
+l4start = l4tab = map_domain_page(l4start_mfn);
 /* See public/xen.h on why the following is needed. */
 maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l3_page_table;
 l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l3start_mfn, l3start, mpt_alloc);
 }
 
 l4tab += l4_table_offset(v_start);
@@ -735,14 +749,16 @@ int __init dom0_construct_pv(struct domain *d,
 if ( !((unsigned long)l1tab & (PAGE_SIZE-1)) )
 {
 maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l1_page_table;
-l1start = l1tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l1start_mfn, l1start, mpt_alloc);
+l1tab = l1start;
 clear_page(l1tab);
 if ( count == 0 )
 l1tab += l1_table_offset(v_start);
 if ( !((unsigned long)l2tab & (PAGE_SIZE-1)) )
 {
 maddr_to_page(mpt_alloc)->u.inuse.type_info = 
PGT_l2_page_table;
-l2start = l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l2start_mfn, l2start, mpt_alloc);
+l2tab = l2start;
 clear_page(l2tab);
 if ( count == 0 )
 l2tab += l2_table_offset(v_start);
@@ -752,19 +768,19 @@ int __init dom0_construct_pv(struct domain *d,
 {
 maddr_to_page(mpt_alloc)->u.inuse.type_info =
 PGT_l3_page_table;
-l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l3start_mfn, l3start, mpt_alloc);
 }
 l3tab = l3start;
 clear_page(l3tab);
 if ( count == 0 )
 l3tab += l3_table_offset(v_start);
-*l4tab = l4e_from_paddr(__pa(l3start), L4_PROT);
+*l4tab = l4e_from_mfn(l3start_mfn, L4_PROT);
 l4tab++;
 }
-*l3tab = l3e_from_paddr(__pa(l2start), L3_PROT);
+*l3tab = l3e_from_mfn(l2start_mfn, L3_PROT);
 l3tab++;
 }
-*l2tab = l2e_from_paddr(__pa(l1start), L2_PROT);
+*l2tab = l2e_from_mfn(l1start_mfn, L2_PROT);
 l2tab++;
 }
 if ( count < initrd_pfn || count >= initrd_pfn + PFN_UP(initrd_len) )
@@ -783,27 

[PATCH V3 (resend) 00/19] Remove the directmap

2024-05-13 Thread Elias El Yandouzi
Hi all,

A few years ago, Wei Liu implemented a PoC to remove the directmap
from Xen. The last version was sent by Hongyan Xia [1].

I will start with thanking both Wei and Hongyan for the initial work
to upstream the feature. A lot of patches already went in and this is
the last few patches missing to effectively enable the feature.

=== What is the directmap? ===

At the moment, on both arm64 and x86, most of the RAM is mapped
in Xen address space. This means that domain memory is easily
accessible in Xen.

=== Why do we want to remove the directmap? ===

(Summarizing my understanding of the previous discussion)

Speculation attacks (like Spectre SP1) rely on loading piece of memory
in the cache. If the region is not mapped then it can't be loaded.

So removing reducing the amount of memory mapped in Xen will also
reduce the surface attack.

=== What's the performance impact? ===

As the guest memory is not always mapped, then the cost of mapping
will increase. I haven't done the numbers with this new version, but
some measurement were provided in the previous version for x86.

=== Improvement possible ===

The known area to improve on x86 are:
   * Mapcache: There was a patch sent by Hongyan:
 
https://lore.kernel.org/xen-devel/4058e92ce21627731c49b588a95809dc0affd83a.1581015491.git.hongy...@amazon.com/
   * EPT: At the moment an guest page-tabel walk requires about 20 map/unmap.
 This will have an very high impact on the performance. We need to decide
 whether keep the EPT always mapped is a problem

The original series didn't have support for Arm64. But as there were
some interest, I have provided a PoC.

There are more extra work for Arm64:
   * The mapcache is quite simple. We would investigate the performance
   * The mapcache should be made compliant to the Arm Arm (this is now
 more critical).
   * We will likely have the same problem as for the EPT.
   * We have no support for merging table to a superpage, neither
 free empty page-tables. (See more below)

=== Implementation ===

The subject is probably a misnomer. The directmap is still present but
the RAM is not mapped by default. Instead, the region will still be used
to map pages allocate via alloc_xenheap_pages().

The advantage is the solution is simple (so IHMO good enough for been
merged as a tech preview). The disadvantage is the page allocator is not
trying to keep all the xenheap pages together. So we may end up to have
an increase of page table usage.

In the longer term, we should consider to remove the direct map
completely and switch to vmap(). The main problem with this approach
is it is frequent to use mfn_to_virt() in the code. So we would need
to cache the mapping (maybe in the struct page_info).

=== Why arm32 is not covered? ===

On Arm32, the domheap and xenheap is always separated. So by design
the guest memory is not mapped by default.

At this stage, it seems unnecessary to have to map/unmap xenheap pages
every time they are allocated.

=== Why not using a separate domheap and xenheap? ===

While a separate xenheap/domheap reduce the page-table usage (all
xenheap pages are contiguous and could be always mapped), it is also
currently less scalable because the split is fixed at boot time (XXX:
Can this be dynamic?).

=== Future of secret-free hypervisor ===

There are some information in an e-mail from Andrew a few years ago:

https://lore.kernel.org/xen-devel/e3219697-0759-39fc-2486-715cdec1c...@citrix.com/

Cheers,

[1] https://lore.kernel.org/xen-devel/cover.1588278317.git.hongy...@amazon.com/

*** BLURB HERE ***

Elias El Yandouzi (3):
  xen/x86: Add build assertion for fixmap entries
  Rename mfn_to_virt() calls
  Rename maddr_to_virt() calls

Hongyan Xia (9):
  x86: Create per-domain mapping of guest_root_pt
  x86/pv: Rewrite how building PV dom0 handles domheap mappings
  x86/mapcache: Initialise the mapcache for the idle domain
  x86: Add a boot option to enable and disable the direct map
  x86/domain_page: Remove the fast paths when mfn is not in the
directmap
  xen/page_alloc: Add a path for xenheap when there is no direct map
  x86/setup: Leave early boot slightly earlier
  x86/setup: vmap heap nodes when they are outside the direct map
  x86/setup: Do not create valid mappings when directmap=no

Julien Grall (5):
  xen/x86: Add support for the PMAP
  xen/arm32: mm: Rename 'first' to 'root' in init_secondary_pagetables()
  xen/arm64: mm: Use per-pCPU page-tables
  xen/arm64: Implement a mapcache for arm64
  xen/arm64: Allow the admin to enable/disable the directmap

Wei Liu (2):
  x86/pv: Domheap pages should be mapped while relocating initrd
  x86: Lift mapcache variable to the arch level

 docs/misc/xen-command-line.pandoc | 12 +++
 xen/arch/arm/Kconfig  |  2 +-
 xen/arch/arm/arm64/mmu/mm.c   | 45 -
 xen/arch/arm/domain_page.c| 50 +-
 xen/arch/arm/include/asm/arm32/mm.h   |  8 --
 

[PATCH V3 (resend) 02/19] x86/pv: Domheap pages should be mapped while relocating initrd

2024-05-13 Thread Elias El Yandouzi
From: Wei Liu 

Xen shouldn't use domheap page as if they were xenheap pages. Map and
unmap pages accordingly.

Signed-off-by: Wei Liu 
Signed-off-by: Wei Wang 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 


Changes in V3:
* Rename commit title
* Rework the for loop copying the pages

Changes in V2:
* Get rid of mfn_to_virt
* Don't open code copy_domain_page()

Changes since Hongyan's version:
* Add missing newline after the variable declaration

diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index d8043fa58a..807296c280 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -618,18 +618,24 @@ int __init dom0_construct_pv(struct domain *d,
 if ( d->arch.physaddr_bitsize &&
  ((mfn + count - 1) >> (d->arch.physaddr_bitsize - PAGE_SHIFT)) )
 {
+unsigned int nr_pages = 1UL << order;
+
 order = get_order_from_pages(count);
 page = alloc_domheap_pages(d, order, MEMF_no_scrub);
 if ( !page )
 panic("Not enough RAM for domain 0 initrd\n");
+
 for ( count = -count; order--; )
 if ( count & (1UL << order) )
 {
 free_domheap_pages(page, order);
 page += 1UL << order;
+nr_pages -= 1UL << order;
 }
-memcpy(page_to_virt(page), mfn_to_virt(initrd->mod_start),
-   initrd_len);
+
+for ( ; nr_pages-- ; page++, mfn++ )
+copy_domain_page(page_to_mfn(page), _mfn(mfn));
+
 mpt_alloc = (paddr_t)initrd->mod_start << PAGE_SHIFT;
 init_domheap_pages(mpt_alloc,
mpt_alloc + PAGE_ALIGN(initrd_len));
-- 
2.40.1




[libvirt test] 185988: tolerable all pass - PUSHED

2024-05-13 Thread osstest service owner
flight 185988 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185988/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185978
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-arm64-arm64-libvirt-qcow2 15 saverestore-support-checkfail never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail   never pass

version targeted for testing:
 libvirt  9e59ba56c8a26156799a556fa79c9654a5d1acd4
baseline version:
 libvirt  d4528bb9dbf21464e68beb9175a38aaf6484536e

Last test of basis   185978  2024-05-11 04:20:38 Z2 days
Testing same since   185988  2024-05-13 04:20:28 Z0 days1 attempts


People who touched revisions under test:
  Dr. David Alan Gilbert 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-arm64-pvopspass
 build-armhf-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm   pass
 test-amd64-amd64-libvirt-xsm pass
 test-arm64-arm64-libvirt-xsm pass
 test-amd64-amd64-libvirt pass
 test-arm64-arm64-libvirt pass
 test-armhf-armhf-libvirt pass
 test-amd64-amd64-libvirt-pairpass
 test-amd64-amd64-libvirt-qcow2   pass
 test-arm64-arm64-libvirt-qcow2   pass
 test-amd64-amd64-libvirt-raw pass
 test-arm64-arm64-libvirt-raw pass
 test-amd64-amd64-libvirt-vhd pass
 test-armhf-armhf-libvirt-vhd pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/libvirt.git
   d4528bb9db..9e59ba56c8  9e59ba56c8a26156799a556fa79c9654a5d1acd4 -> 
xen-tested-master



Re: [PATCH V3 00/19] Remove the directmap

2024-05-13 Thread Roger Pau Monné
You seem to have forgotten to add the maintainers on Cc for the
patches.  Adding them here for reference.

Regards, Roger.

On Mon, May 13, 2024 at 11:10:58AM +, Elias El Yandouzi wrote:
> Hi all,
> 
> A few years ago, Wei Liu implemented a PoC to remove the directmap
> from Xen. The last version was sent by Hongyan Xia [1].
> 
> I will start with thanking both Wei and Hongyan for the initial work
> to upstream the feature. A lot of patches already went in and this is
> the last few patches missing to effectively enable the feature.
> 
> === What is the directmap? ===
> 
> At the moment, on both arm64 and x86, most of the RAM is mapped
> in Xen address space. This means that domain memory is easily
> accessible in Xen.
> 
> === Why do we want to remove the directmap? ===
> 
> (Summarizing my understanding of the previous discussion)
> 
> Speculation attacks (like Spectre SP1) rely on loading piece of memory
> in the cache. If the region is not mapped then it can't be loaded.
> 
> So removing reducing the amount of memory mapped in Xen will also
> reduce the surface attack.
> 
> === What's the performance impact? ===
> 
> As the guest memory is not always mapped, then the cost of mapping
> will increase. I haven't done the numbers with this new version, but
> some measurement were provided in the previous version for x86.
> 
> === Improvement possible ===
> 
> The known area to improve on x86 are:
>* Mapcache: There was a patch sent by Hongyan:
>  
> https://lore.kernel.org/xen-devel/4058e92ce21627731c49b588a95809dc0affd83a.1581015491.git.hongy...@amazon.com/
>* EPT: At the moment an guest page-tabel walk requires about 20 map/unmap.
>  This will have an very high impact on the performance. We need to decide
>  whether keep the EPT always mapped is a problem
> 
> The original series didn't have support for Arm64. But as there were
> some interest, I have provided a PoC.
> 
> There are more extra work for Arm64:
>* The mapcache is quite simple. We would investigate the performance
>* The mapcache should be made compliant to the Arm Arm (this is now
>  more critical).
>* We will likely have the same problem as for the EPT.
>* We have no support for merging table to a superpage, neither
>  free empty page-tables. (See more below)
> 
> === Implementation ===
> 
> The subject is probably a misnomer. The directmap is still present but
> the RAM is not mapped by default. Instead, the region will still be used
> to map pages allocate via alloc_xenheap_pages().
> 
> The advantage is the solution is simple (so IHMO good enough for been
> merged as a tech preview). The disadvantage is the page allocator is not
> trying to keep all the xenheap pages together. So we may end up to have
> an increase of page table usage.
> 
> In the longer term, we should consider to remove the direct map
> completely and switch to vmap(). The main problem with this approach
> is it is frequent to use mfn_to_virt() in the code. So we would need
> to cache the mapping (maybe in the struct page_info).
> 
> === Why arm32 is not covered? ===
> 
> On Arm32, the domheap and xenheap is always separated. So by design
> the guest memory is not mapped by default.
> 
> At this stage, it seems unnecessary to have to map/unmap xenheap pages
> every time they are allocated.
> 
> === Why not using a separate domheap and xenheap? ===
> 
> While a separate xenheap/domheap reduce the page-table usage (all
> xenheap pages are contiguous and could be always mapped), it is also
> currently less scalable because the split is fixed at boot time (XXX:
> Can this be dynamic?).
> 
> === Future of secret-free hypervisor ===
> 
> There are some information in an e-mail from Andrew a few years ago:
> 
> https://lore.kernel.org/xen-devel/e3219697-0759-39fc-2486-715cdec1c...@citrix.com/
> 
> Cheers,
> 
> [1] 
> https://lore.kernel.org/xen-devel/cover.1588278317.git.hongy...@amazon.com/
> 
> *** BLURB HERE ***
> 
> Elias El Yandouzi (3):
>   xen/x86: Add build assertion for fixmap entries
>   Rename mfn_to_virt() calls
>   Rename maddr_to_virt() calls
> 
> Hongyan Xia (9):
>   x86: Create per-domain mapping of guest_root_pt
>   x86/pv: Rewrite how building PV dom0 handles domheap mappings
>   x86/mapcache: Initialise the mapcache for the idle domain
>   x86: Add a boot option to enable and disable the direct map
>   x86/domain_page: Remove the fast paths when mfn is not in the
> directmap
>   xen/page_alloc: Add a path for xenheap when there is no direct map
>   x86/setup: Leave early boot slightly earlier
>   x86/setup: vmap heap nodes when they are outside the direct map
>   x86/setup: Do not create valid mappings when directmap=no
> 
> Julien Grall (5):
>   xen/x86: Add support for the PMAP
>   xen/arm32: mm: Rename 'first' to 'root' in init_secondary_pagetables()
>   xen/arm64: mm: Use per-pCPU page-tables
>   xen/arm64: Implement a mapcache for arm64
>   xen/arm64: Allow the admin to enable/disable the directmap

[xen-unstable test] 185987: tolerable FAIL

2024-05-13 Thread osstest service owner
flight 185987 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185987/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185980
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185983
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185983
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185983
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185983
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185983
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-qcow214 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-qcow215 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-raw  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-raw  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  46aa3031ae89ac1771f4159972edab65710e7349
baseline version:
 xen  46aa3031ae89ac1771f4159972edab65710e7349

Last test of basis   185987  2024-05-13 01:53:43 Z0 days
Testing same since  (not found) 0 attempts

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64-xtf  pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 

[PATCH V3 19/19] xen/arm64: Allow the admin to enable/disable the directmap

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

Implement the same command line option as x86 to enable/disable the
directmap. By default this is kept enabled.

Also modify setup_directmap_mappings() to populate the L0 entries
related to the directmap area.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changes in v2:
* Rely on the Kconfig option to enable Secret Hiding on Arm64
* Use generic helper instead of arch_has_directmap()

diff --git a/docs/misc/xen-command-line.pandoc 
b/docs/misc/xen-command-line.pandoc
index 743d343ffa..cccd5e4282 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -799,7 +799,7 @@ that enabling this option cannot guarantee anything beyond 
what underlying
 hardware guarantees (with, where available and known to Xen, respective
 tweaks applied).
 
-### directmap (x86)
+### directmap (arm64, x86)
 > `= `
 
 > Default: `true`
diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
index 0462960fc7..1cb495e334 100644
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM_64
depends on !ARM_32
select 64BIT
select HAS_FAST_MULTIPLY
+   select HAS_SECRET_HIDING
 
 config ARM
def_bool y
diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index 826864d25d..81115cce51 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -158,16 +158,27 @@ void __init switch_ttbr(uint64_t ttbr)
 update_identity_mapping(false);
 }
 
-/* Map the region in the directmap area. */
+/*
+ * This either populate a valid fdirect map, or allocates empty L1 tables
+ * and creates the L0 entries for the given region in the direct map
+ * depending on has_directmap().
+ *
+ * When directmap=no, we still need to populate empty L1 tables in the
+ * directmap region. The reason is that the root page-table (i.e. L0)
+ * is per-CPU and secondary CPUs will initialize their root page-table
+ * based on the pCPU0 one. So L0 entries will be shared if they are
+ * pre-populated. We also rely on the fact that L1 tables are never
+ * freed.
+ */
 static void __init setup_directmap_mappings(unsigned long base_mfn,
 unsigned long nr_mfns)
 {
+unsigned long mfn_gb = base_mfn & ~((FIRST_SIZE >> PAGE_SHIFT) - 1);
 int rc;
 
 /* First call sets the directmap physical and virtual offset. */
 if ( mfn_eq(directmap_mfn_start, INVALID_MFN) )
 {
-unsigned long mfn_gb = base_mfn & ~((FIRST_SIZE >> PAGE_SHIFT) - 1);
 
 directmap_mfn_start = _mfn(base_mfn);
 directmap_base_pdx = mfn_to_pdx(_mfn(base_mfn));
@@ -188,6 +199,24 @@ static void __init setup_directmap_mappings(unsigned long 
base_mfn,
 panic("cannot add directmap mapping at %lx below heap start %lx\n",
   base_mfn, mfn_x(directmap_mfn_start));
 
+if ( !has_directmap() )
+{
+vaddr_t vaddr = (vaddr_t)__mfn_to_virt(base_mfn);
+lpae_t *root = this_cpu(xen_pgtable);
+unsigned int i, slot;
+
+slot = first_table_offset(vaddr);
+nr_mfns += base_mfn - mfn_gb;
+for ( i = 0; i < nr_mfns; i += BIT(XEN_PT_LEVEL_ORDER(0), UL), slot++ )
+{
+lpae_t *entry = [slot];
+
+if ( !lpae_is_valid(*entry) && !create_xen_table(entry) )
+panic("Unable to populate zeroeth slot %u\n", slot);
+}
+return;
+}
+
 rc = map_pages_to_xen((vaddr_t)__mfn_to_virt(base_mfn),
   _mfn(base_mfn), nr_mfns,
   PAGE_HYPERVISOR_RW | _PAGE_BLOCK);
diff --git a/xen/arch/arm/include/asm/arm64/mm.h 
b/xen/arch/arm/include/asm/arm64/mm.h
index e0bd23a6ed..5888f29159 100644
--- a/xen/arch/arm/include/asm/arm64/mm.h
+++ b/xen/arch/arm/include/asm/arm64/mm.h
@@ -3,13 +3,10 @@
 
 extern DEFINE_PAGE_TABLE(xen_pgtable);
 
-/*
- * On ARM64, all the RAM is currently direct mapped in Xen.
- * Hence return always true.
- */
+/* On Arm64, the user can chose whether all the RAM is directmap. */
 static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
 {
-return true;
+return has_directmap();
 }
 
 void arch_setup_page_tables(void);
diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index def939172c..0f3ffab6ba 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index d15987d6ea..6b06e2f4f5 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -778,6 +778,7 @@ void asmlinkage __init start_xen(unsigned long 
boot_phys_offset,
 cmdline_parse(cmdline);
 
 setup_mm();
+printk("Booting with directmap %s\n", has_directmap() ? "on" : "off");
 
 vm_init();
 
-- 
2.40.1




[PATCH V3 09/19] x86/domain_page: Remove the fast paths when mfn is not in the directmap

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

When mfn is not in direct map, never use mfn_to_virt for any mappings.

We replace mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) with
arch_mfns_in_direct_map(mfn, 1) because these two are equivalent. The
extra comparison in arch_mfns_in_direct_map() looks different but because
DIRECTMAP_VIRT_END is always higher, it does not make any difference.

Lastly, domain_page_map_to_mfn() needs to gain to a special case for
the PMAP.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 



Changes since Hongyan's version:
* arch_mfn_in_direct_map() was renamed to arch_mfns_in_directmap()
* add a special case for the PMAP in domain_page_map_to_mfn()

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 55e337aaf7..89caefc8a2 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -14,8 +14,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 
 static DEFINE_PER_CPU(struct vcpu *, override);
@@ -35,10 +37,11 @@ static inline struct vcpu *mapcache_current_vcpu(void)
 /*
  * When using efi runtime page tables, we have the equivalent of the idle
  * domain's page tables but current may point at another domain's VCPU.
- * Return NULL as though current is not properly set up yet.
+ * Return the idle domains's vcpu on that core because the efi per-domain
+ * region (where the mapcache is) is in-sync with the idle domain.
  */
 if ( efi_rs_using_pgtables() )
-return NULL;
+return idle_vcpu[smp_processor_id()];
 
 /*
  * If guest_table is NULL, and we are running a paravirtualised guest,
@@ -77,18 +80,24 @@ void *map_domain_page(mfn_t mfn)
 struct vcpu_maphash_entry *hashent;
 
 #ifdef NDEBUG
-if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
 return mfn_to_virt(mfn_x(mfn));
 #endif
 
 v = mapcache_current_vcpu();
-if ( !v )
-return mfn_to_virt(mfn_x(mfn));
+if ( !v || !v->domain->arch.mapcache.inuse )
+{
+if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
+return mfn_to_virt(mfn_x(mfn));
+else
+{
+BUG_ON(system_state >= SYS_STATE_smp_boot);
+return pmap_map(mfn);
+}
+}
 
 dcache = >domain->arch.mapcache;
 vcache = >arch.mapcache;
-if ( !dcache->inuse )
-return mfn_to_virt(mfn_x(mfn));
 
 perfc_incr(map_domain_page_count);
 
@@ -184,6 +193,12 @@ void unmap_domain_page(const void *ptr)
 if ( !va || va >= DIRECTMAP_VIRT_START )
 return;
 
+if ( va >= FIXADDR_START && va < FIXADDR_TOP )
+{
+pmap_unmap((void *)ptr);
+return;
+}
+
 ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
 v = mapcache_current_vcpu();
@@ -237,7 +252,7 @@ int mapcache_domain_init(struct domain *d)
 unsigned int bitmap_pages;
 
 #ifdef NDEBUG
-if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+if ( !mem_hotplug && arch_mfn_in_directmap(0, max_page) )
 return 0;
 #endif
 
@@ -308,7 +323,7 @@ void *map_domain_page_global(mfn_t mfn)
 local_irq_is_enabled()));
 
 #ifdef NDEBUG
-if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+if ( arch_mfn_in_directmap(mfn_x(mfn, 1)) )
 return mfn_to_virt(mfn_x(mfn));
 #endif
 
@@ -335,6 +350,23 @@ mfn_t domain_page_map_to_mfn(const void *ptr)
 if ( va >= DIRECTMAP_VIRT_START )
 return _mfn(virt_to_mfn(ptr));
 
+/*
+ * The fixmap is stealing the top-end of the VMAP. So the check for
+ * the PMAP *must* happen first.
+ *
+ * Also, the fixmap translate a slot to an address backwards. The
+ * logic will rely on it to avoid any complexity. So check at
+ * compile time this will always hold.
+*/
+BUILD_BUG_ON(fix_to_virt(FIX_PMAP_BEGIN) < fix_to_virt(FIX_PMAP_END));
+
+if ( ((unsigned long)fix_to_virt(FIX_PMAP_END) <= va) &&
+ ((va & PAGE_MASK) <= (unsigned long)fix_to_virt(FIX_PMAP_BEGIN)) )
+{
+BUG_ON(system_state >= SYS_STATE_smp_boot);
+return l1e_get_mfn(l1_fixmap[l1_table_offset(va)]);
+}
+
 if ( va >= VMAP_VIRT_START && va < VMAP_VIRT_END )
 return vmap_to_mfn(va);
 
-- 
2.40.1




[PATCH V3 08/19] xen/x86: Add build assertion for fixmap entries

2024-05-13 Thread Elias El Yandouzi
The early fixed addresses must all fit into the static L1 table.
Introduce a build assertion to this end.

Signed-off-by: Elias El Yandouzi 



 Changes in v2:
 * New patch

diff --git a/xen/arch/x86/include/asm/fixmap.h 
b/xen/arch/x86/include/asm/fixmap.h
index a7ac365fc6..904bee0480 100644
--- a/xen/arch/x86/include/asm/fixmap.h
+++ b/xen/arch/x86/include/asm/fixmap.h
@@ -77,6 +77,11 @@ enum fixed_addresses {
 #define FIXADDR_SIZE  (__end_of_fixed_addresses << PAGE_SHIFT)
 #define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
 
+static inline void fixaddr_build_assertion(void)
+{
+BUILD_BUG_ON(FIX_PMAP_END > L1_PAGETABLE_ENTRIES - 1);
+}
+
 extern void __set_fixmap(
 enum fixed_addresses idx, unsigned long mfn, unsigned long flags);
 
-- 
2.40.1




[PATCH V3 14/19] Rename mfn_to_virt() calls

2024-05-13 Thread Elias El Yandouzi
Until directmap gets completely removed, we'd still need to
keep some calls to mfn_to_virt() for xenheap pages or when the
directmap is enabled.

Rename the macro to mfn_to_directmap_virt() to flag them and
prevent further use of mfn_to_virt().

Signed-off-by: Elias El Yandouzi 

diff --git a/xen/arch/arm/include/asm/mm.h b/xen/arch/arm/include/asm/mm.h
index 48538b5337..2bca3f9e87 100644
--- a/xen/arch/arm/include/asm/mm.h
+++ b/xen/arch/arm/include/asm/mm.h
@@ -336,6 +336,7 @@ static inline uint64_t gvirt_to_maddr(vaddr_t va, paddr_t 
*pa,
  */
 #define virt_to_mfn(va) __virt_to_mfn(va)
 #define mfn_to_virt(mfn)__mfn_to_virt(mfn)
+#define mfn_to_directmap_virt(mfn) mfn_to_virt(mfn)
 
 /* Convert between Xen-heap virtual addresses and page-info structures. */
 static inline struct page_info *virt_to_page(const void *v)
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 89caefc8a2..62d6fee0f4 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -81,14 +81,14 @@ void *map_domain_page(mfn_t mfn)
 
 #ifdef NDEBUG
 if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
-return mfn_to_virt(mfn_x(mfn));
+return mfn_to_directmap_virt(mfn_x(mfn));
 #endif
 
 v = mapcache_current_vcpu();
 if ( !v || !v->domain->arch.mapcache.inuse )
 {
 if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
-return mfn_to_virt(mfn_x(mfn));
+return mfn_to_directmap_virt(mfn_x(mfn));
 else
 {
 BUG_ON(system_state >= SYS_STATE_smp_boot);
@@ -324,7 +324,7 @@ void *map_domain_page_global(mfn_t mfn)
 
 #ifdef NDEBUG
 if ( arch_mfn_in_directmap(mfn_x(mfn, 1)) )
-return mfn_to_virt(mfn_x(mfn));
+return mfn_to_directmap_virt(mfn_x(mfn));
 #endif
 
 return vmap(, 1);
diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index b0cb96c3bc..d1482ae2f7 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -439,7 +439,7 @@ static int __init pvh_populate_p2m(struct domain *d)
  d->arch.e820[i].addr + d->arch.e820[i].size);
 enum hvm_translation_result res =
  hvm_copy_to_guest_phys(mfn_to_maddr(_mfn(addr)),
-mfn_to_virt(addr),
+mfn_to_directmap_virt(addr),
 end - d->arch.e820[i].addr,
 v);
 
@@ -725,7 +725,7 @@ static int __init pvh_load_kernel(struct domain *d, const 
module_t *image,
 
 if ( initrd != NULL )
 {
-rc = hvm_copy_to_guest_phys(last_addr, mfn_to_virt(initrd->mod_start),
+rc = hvm_copy_to_guest_phys(last_addr, 
mfn_to_directmap_virt(initrd->mod_start),
 initrd_len, v);
 if ( rc )
 {
diff --git a/xen/arch/x86/include/asm/page.h b/xen/arch/x86/include/asm/page.h
index 350d1fb110..c6891b52d4 100644
--- a/xen/arch/x86/include/asm/page.h
+++ b/xen/arch/x86/include/asm/page.h
@@ -268,7 +268,7 @@ void copy_page_sse2(void *to, const void *from);
  */
 #define mfn_valid(mfn)  __mfn_valid(mfn_x(mfn))
 #define virt_to_mfn(va) __virt_to_mfn(va)
-#define mfn_to_virt(mfn)__mfn_to_virt(mfn)
+#define mfn_to_directmap_virt(mfn)__mfn_to_virt(mfn)
 #define virt_to_maddr(va)   __virt_to_maddr((unsigned long)(va))
 #define maddr_to_virt(ma)   __maddr_to_virt((unsigned long)(ma))
 #define maddr_to_page(ma)   __maddr_to_page(ma)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index efdf20f775..337363cf17 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -318,8 +318,8 @@ void __init arch_init_memory(void)
 iostart_pfn = max_t(unsigned long, pfn, 1UL << (20 - PAGE_SHIFT));
 ioend_pfn = min(rstart_pfn, 16UL << (20 - PAGE_SHIFT));
 if ( iostart_pfn < ioend_pfn )
-destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
- (unsigned long)mfn_to_virt(ioend_pfn));
+destroy_xen_mappings((unsigned 
long)mfn_to_directmap_virt(iostart_pfn),
+ (unsigned 
long)mfn_to_directmap_virt(ioend_pfn));
 
 /* Mark as I/O up to next RAM region. */
 for ( ; pfn < rstart_pfn; pfn++ )
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 919347d8c2..e0671ab3c3 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -399,7 +399,7 @@ void *__init bootstrap_map(const module_t *mod)
 void *ret;
 
 if ( system_state != SYS_STATE_early_boot )
-return mod ? mfn_to_virt(mod->mod_start) : NULL;
+return mod ? mfn_to_directmap_virt(mod->mod_start) : NULL;
 
 if ( !mod )
 {
@@ -1708,7 +1708,7 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 {
 set_pdx_range(mod[i].mod_start,
   mod[i].mod_start + PFN_UP(mod[i].mod_end));
-

[PATCH V3 17/19] xen/arm64: mm: Use per-pCPU page-tables

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

At the moment, on Arm64, every pCPU is sharing the same page-tables.

In a follow-up patch, we will allow the possibility to remove the
direct map and therefore it will be necessary to have a mapcache.

While we have plenty of spare virtual address space to reserve part
for each pCPU, it means that temporary mappings (e.g. guest memory)
could be accessible by every pCPU.

In order to increase our security posture, it would be better if
those mappings are only accessible by the pCPU doing the temporary
mapping.

In addition to that, a per-pCPU page-tables opens the way to have
per-domain mapping area.

Arm32 is already using per-pCPU page-tables so most of the code
can be re-used. Arm64 doesn't yet have support for the mapcache,
so a stub is provided (moved to its own header asm/domain_page.h).

Take the opportunity to fix a typo in a comment that is modified.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changelog since v1:
* Rebase
* Fix typoes

diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index 293acb67e0..2ec1ffe1dc 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -76,6 +76,7 @@ static void __init prepare_runtime_identity_mapping(void)
 paddr_t id_addr = virt_to_maddr(_start);
 lpae_t pte;
 DECLARE_OFFSETS(id_offsets, id_addr);
+lpae_t *root = this_cpu(xen_pgtable);
 
 if ( id_offsets[0] >= IDENTITY_MAPPING_AREA_NR_L0 )
 panic("Cannot handle ID mapping above %uTB\n",
@@ -86,7 +87,7 @@ static void __init prepare_runtime_identity_mapping(void)
 pte.pt.table = 1;
 pte.pt.xn = 0;
 
-write_pte(_pgtable[id_offsets[0]], pte);
+write_pte([id_offsets[0]], pte);
 
 /* Link second ID table */
 pte = pte_of_xenaddr((vaddr_t)xen_second_id);
diff --git a/xen/arch/arm/domain_page.c b/xen/arch/arm/domain_page.c
index 3a43601623..ac2a6d0332 100644
--- a/xen/arch/arm/domain_page.c
+++ b/xen/arch/arm/domain_page.c
@@ -3,6 +3,8 @@
 #include 
 #include 
 
+#include 
+
 /* Override macros from asm/page.h to make them work with mfn_t */
 #undef virt_to_mfn
 #define virt_to_mfn(va) _mfn(__virt_to_mfn(va))
diff --git a/xen/arch/arm/include/asm/arm32/mm.h 
b/xen/arch/arm/include/asm/arm32/mm.h
index 856f2dbec4..87a315db01 100644
--- a/xen/arch/arm/include/asm/arm32/mm.h
+++ b/xen/arch/arm/include/asm/arm32/mm.h
@@ -1,12 +1,6 @@
 #ifndef __ARM_ARM32_MM_H__
 #define __ARM_ARM32_MM_H__
 
-#include 
-
-#include 
-
-DECLARE_PER_CPU(lpae_t *, xen_pgtable);
-
 /*
  * Only a limited amount of RAM, called xenheap, is always mapped on ARM32.
  * For convenience always return false.
@@ -16,8 +10,6 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, 
unsigned long nr)
 return false;
 }
 
-bool init_domheap_mappings(unsigned int cpu);
-
 static inline void arch_setup_page_tables(void)
 {
 }
diff --git a/xen/arch/arm/include/asm/domain_page.h 
b/xen/arch/arm/include/asm/domain_page.h
new file mode 100644
index 00..e9f52685e2
--- /dev/null
+++ b/xen/arch/arm/include/asm/domain_page.h
@@ -0,0 +1,13 @@
+#ifndef __ASM_ARM_DOMAIN_PAGE_H__
+#define __ASM_ARM_DOMAIN_PAGE_H__
+
+#ifdef CONFIG_ARCH_MAP_DOMAIN_PAGE
+bool init_domheap_mappings(unsigned int cpu);
+#else
+static inline bool init_domheap_mappings(unsigned int cpu)
+{
+return true;
+}
+#endif
+
+#endif /* __ASM_ARM_DOMAIN_PAGE_H__ */
diff --git a/xen/arch/arm/include/asm/mm.h b/xen/arch/arm/include/asm/mm.h
index 2bca3f9e87..60e0122cba 100644
--- a/xen/arch/arm/include/asm/mm.h
+++ b/xen/arch/arm/include/asm/mm.h
@@ -2,6 +2,9 @@
 #define __ARCH_ARM_MM__
 
 #include 
+#include 
+
+#include 
 #include 
 #include 
 #include 
diff --git a/xen/arch/arm/include/asm/mmu/mm.h 
b/xen/arch/arm/include/asm/mmu/mm.h
index c5e03a66bf..c03c3a51e4 100644
--- a/xen/arch/arm/include/asm/mmu/mm.h
+++ b/xen/arch/arm/include/asm/mmu/mm.h
@@ -2,6 +2,8 @@
 #ifndef __ARM_MMU_MM_H__
 #define __ARM_MMU_MM_H__
 
+DECLARE_PER_CPU(lpae_t *, xen_pgtable);
+
 /* Non-boot CPUs use this to find the correct pagetables. */
 extern uint64_t init_ttbr;
 
diff --git a/xen/arch/arm/mmu/pt.c b/xen/arch/arm/mmu/pt.c
index da28d669e7..1ed1a53ab1 100644
--- a/xen/arch/arm/mmu/pt.c
+++ b/xen/arch/arm/mmu/pt.c
@@ -607,9 +607,9 @@ static int xen_pt_update(unsigned long virt,
 unsigned long left = nr_mfns;
 
 /*
- * For arm32, page-tables are different on each CPUs. Yet, they share
- * some common mappings. It is assumed that only common mappings
- * will be modified with this function.
+ * Page-tables are different on each CPU. Yet, they share some common
+ * mappings. It is assumed that only common mappings will be modified
+ * with this function.
  *
  * XXX: Add a check.
  */
diff --git a/xen/arch/arm/mmu/setup.c b/xen/arch/arm/mmu/setup.c
index f4bb424c3c..7b981456e6 100644
--- a/xen/arch/arm/mmu/setup.c
+++ b/xen/arch/arm/mmu/setup.c
@@ -28,17 +28,15 @@
  * PCPUs.
  */
 
-#ifdef 

[PATCH V3 10/19] xen/page_alloc: Add a path for xenheap when there is no direct map

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

When there is not an always-mapped direct map, xenheap allocations need
to be mapped and unmapped on-demand.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



I have left the call to map_pages_to_xen() and destroy_xen_mappings()
in the split heap for now. I am not entirely convinced this is necessary
because in that setup only the xenheap would be always mapped and
this doesn't contain any guest memory (aside the grant-table).
So map/unmapping for every allocation seems unnecessary.

Changes in v2:
* Fix remaining wrong indentation in alloc_xenheap_pages()

Changes since Hongyan's version:
* Rebase
* Fix indentation in alloc_xenheap_pages()
* Fix build for arm32

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 9b7e4721cd..dfb2c05322 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -2242,6 +2242,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe)
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
 {
 struct page_info *pg;
+void *ret;
 
 ASSERT_ALLOC_CONTEXT();
 
@@ -2250,17 +2251,36 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 if ( unlikely(pg == NULL) )
 return NULL;
 
+ret = page_to_virt(pg);
+
+if ( !has_directmap() &&
+ map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
+  PAGE_HYPERVISOR) )
+{
+/* Failed to map xenheap pages. */
+free_heap_pages(pg, order, false);
+return NULL;
+}
+
 return page_to_virt(pg);
 }
 
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
+unsigned long va = (unsigned long)v & PAGE_MASK;
+
 ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
 
+if ( !has_directmap() &&
+ destroy_xen_mappings(va, va + (1UL << (order + PAGE_SHIFT))) )
+dprintk(XENLOG_WARNING,
+"Error while destroying xenheap mappings at %p, order %u\n",
+v, order);
+
 free_heap_pages(virt_to_page(v), order, false);
 }
 
@@ -2284,6 +2304,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 {
 struct page_info *pg;
 unsigned int i;
+void *ret;
 
 ASSERT_ALLOC_CONTEXT();
 
@@ -2296,16 +2317,28 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 if ( unlikely(pg == NULL) )
 return NULL;
 
+ret = page_to_virt(pg);
+
+if ( !has_directmap() &&
+ map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
+  PAGE_HYPERVISOR) )
+{
+/* Failed to map xenheap pages. */
+free_domheap_pages(pg, order);
+return NULL;
+}
+
 for ( i = 0; i < (1u << order); i++ )
 pg[i].count_info |= PGC_xen_heap;
 
-return page_to_virt(pg);
+return ret;
 }
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
 struct page_info *pg;
 unsigned int i;
+unsigned long va = (unsigned long)v & PAGE_MASK;
 
 ASSERT_ALLOC_CONTEXT();
 
@@ -2317,6 +2350,12 @@ void free_xenheap_pages(void *v, unsigned int order)
 for ( i = 0; i < (1u << order); i++ )
 pg[i].count_info &= ~PGC_xen_heap;
 
+if ( !has_directmap() &&
+ destroy_xen_mappings(va, va + (1UL << (order + PAGE_SHIFT))) )
+dprintk(XENLOG_WARNING,
+"Error while destroying xenheap mappings at %p, order %u\n",
+v, order);
+
 free_heap_pages(pg, order, true);
 }
 
-- 
2.40.1




[PATCH V3 16/19] xen/arm32: mm: Rename 'first' to 'root' in init_secondary_pagetables()

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

The arm32 version of init_secondary_pagetables() will soon be re-used
for arm64 as well where the root table starts at level 0 rather than level 1.

So rename 'first' to 'root'.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changelog in v2:
* Rebase
* Fix typo

diff --git a/xen/arch/arm/mmu/smpboot.c b/xen/arch/arm/mmu/smpboot.c
index 4ffc8254a4..e29b6f34f2 100644
--- a/xen/arch/arm/mmu/smpboot.c
+++ b/xen/arch/arm/mmu/smpboot.c
@@ -109,32 +109,32 @@ int prepare_secondary_mm(int cpu)
 #else
 int prepare_secondary_mm(int cpu)
 {
-lpae_t *first;
+lpae_t *root = alloc_xenheap_page();
 
 first = alloc_xenheap_page(); /* root == first level on 32-bit 3-level 
trie */
 
-if ( !first )
+if ( !root )
 {
-printk("CPU%u: Unable to allocate the first page-table\n", cpu);
+printk("CPU%u: Unable to allocate the root page-table\n", cpu);
 return -ENOMEM;
 }
 
 /* Initialise root pagetable from root of boot tables */
-memcpy(first, per_cpu(xen_pgtable, 0), PAGE_SIZE);
-per_cpu(xen_pgtable, cpu) = first;
+memcpy(root, per_cpu(xen_pgtable, 0), PAGE_SIZE);
+per_cpu(xen_pgtable, cpu) = root;
 
 if ( !init_domheap_mappings(cpu) )
 {
 printk("CPU%u: Unable to prepare the domheap page-tables\n", cpu);
 per_cpu(xen_pgtable, cpu) = NULL;
-free_xenheap_page(first);
+free_xenheap_page(root);
 return -ENOMEM;
 }
 
 clear_boot_pagetables();
 
 /* Set init_ttbr for this CPU coming up */
-set_init_ttbr(first);
+set_init_ttbr(root);
 
 return 0;
 }
-- 
2.40.1




[PATCH V3 13/19] x86/setup: Do not create valid mappings when directmap=no

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

Create empty mappings in the second e820 pass. Also, destroy existing
direct map mappings created in the first pass.

To make xenheap pages visible in guests, it is necessary to create empty
L3 tables in the direct map even when directmap=no, since guest cr3s
copy idle domain's L4 entries, which means they will share mappings in
the direct map if we pre-populate idle domain's L4 entries and L3
tables. A helper is introduced for this.

Also, after the direct map is actually gone, we need to stop updating
the direct map in update_xen_mappings().

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index f26c9799e4..919347d8c2 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -978,6 +978,57 @@ static struct domain *__init create_dom0(const module_t 
*image,
 /* How much of the directmap is prebuilt at compile time. */
 #define PREBUILT_MAP_LIMIT (1 << L2_PAGETABLE_SHIFT)
 
+/*
+ * This either populates a valid direct map, or allocates empty L3 tables and
+ * creates the L4 entries for virtual address between [start, end) in the
+ * direct map depending on has_directmap();
+ *
+ * When directmap=no, we still need to populate empty L3 tables in the
+ * direct map region. The reason is that on-demand xenheap mappings are
+ * created in the idle domain's page table but must be seen by
+ * everyone. Since all domains share the direct map L4 entries, they
+ * will share xenheap mappings if we pre-populate the L4 entries and L3
+ * tables in the direct map region for all RAM. We also rely on the fact
+ * that L3 tables are never freed.
+ */
+static void __init populate_directmap(uint64_t pstart, uint64_t pend,
+  unsigned int flags)
+{
+unsigned long vstart = (unsigned long)__va(pstart);
+unsigned long vend = (unsigned long)__va(pend);
+
+if ( pstart >= pend )
+return;
+
+BUG_ON(vstart < DIRECTMAP_VIRT_START);
+BUG_ON(vend > DIRECTMAP_VIRT_END);
+
+if ( has_directmap() )
+/* Populate valid direct map. */
+BUG_ON(map_pages_to_xen(vstart, maddr_to_mfn(pstart),
+PFN_DOWN(pend - pstart), flags));
+else
+{
+/* Create empty L3 tables. */
+unsigned long vaddr = vstart & ~((1UL << L4_PAGETABLE_SHIFT) - 1);
+
+for ( ; vaddr < vend; vaddr += (1UL << L4_PAGETABLE_SHIFT) )
+{
+l4_pgentry_t *pl4e = _pg_table[l4_table_offset(vaddr)];
+
+if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
+{
+mfn_t mfn = alloc_boot_pages(1, 1);
+void *v = map_domain_page(mfn);
+
+clear_page(v);
+UNMAP_DOMAIN_PAGE(v);
+l4e_write(pl4e, l4e_from_mfn(mfn, __PAGE_HYPERVISOR));
+}
+}
+}
+}
+
 void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
 {
 const char *memmap_type = NULL, *loader, *cmdline = "";
@@ -1601,8 +1652,17 @@ void asmlinkage __init noreturn __start_xen(unsigned 
long mbi_p)
 map_e = min_t(uint64_t, e,
   ARRAY_SIZE(l2_directmap) << L2_PAGETABLE_SHIFT);
 
-/* Pass mapped memory to allocator /before/ creating new mappings. */
+/*
+ * Pass mapped memory to allocator /before/ creating new mappings.
+ * The direct map for the bottom 4GiB has been populated in the first
+ * e820 pass. In the second pass, we make sure those existing mappings
+ * are destroyed when directmap=no.
+ */
 init_boot_pages(s, min(map_s, e));
+if ( !has_directmap() )
+destroy_xen_mappings((unsigned long)__va(s),
+ (unsigned long)__va(min(map_s, e)));
+
 s = map_s;
 if ( s < map_e )
 {
@@ -1610,6 +1670,9 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 map_s = (s + mask) & ~mask;
 map_e &= ~mask;
 init_boot_pages(map_s, map_e);
+if ( !has_directmap() )
+destroy_xen_mappings((unsigned long)__va(map_s),
+ (unsigned long)__va(map_e));
 }
 
 if ( map_s > map_e )
@@ -1623,8 +1686,7 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 
 if ( map_e < end )
 {
-map_pages_to_xen((unsigned long)__va(map_e), 
maddr_to_mfn(map_e),
- PFN_DOWN(end - map_e), PAGE_HYPERVISOR);
+populate_directmap(map_e, end, PAGE_HYPERVISOR);
 init_boot_pages(map_e, end);
 map_e = end;
 }
@@ -1633,13 +1695,11 @@ void asmlinkage __init noreturn __start_xen(unsigned 
long mbi_p)
 {
 /* This range must not be passed to the boot allocator and
  * must also not be mapped with _PAGE_GLOBAL. */
-  

[PATCH V3 18/19] xen/arm64: Implement a mapcache for arm64

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

At the moment, on arm64, map_domain_page() is implemented using
virt_to_mfn(). Therefore it is relying on the directmap.

In a follow-up patch, we will allow the admin to remove the directmap.
Therefore we want to implement a mapcache.

Thanksfully there is already one for arm32. So select ARCH_ARM_DOMAIN_PAGE
and add the necessary boiler plate to support 64-bit:
- The page-table start at level 0, so we need to allocate the level
  1 page-table
- map_domain_page() should check if the page is in the directmap. If
  yes, then use virt_to_mfn() to limit the performance impact
  when the directmap is still enabled (this will be selectable
  on the command line).

Take the opportunity to replace first_table_offset(...) with offsets[...].

Note that, so far, arch_mfns_in_directmap() always return true on
arm64. So the mapcache is not yet used. This will change in a
follow-up patch.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



There are a few TODOs:
- It is becoming more critical to fix the mapcache
  implementation (this is not compliant with the Arm Arm)
- Evaluate the performance

diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
index 21d03d9f44..0462960fc7 100644
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -1,7 +1,6 @@
 config ARM_32
def_bool y
depends on "$(ARCH)" = "arm32"
-   select ARCH_MAP_DOMAIN_PAGE
 
 config ARM_64
def_bool y
diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index 2ec1ffe1dc..826864d25d 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -237,6 +238,14 @@ void __init setup_mm(void)
 setup_frametable_mappings(ram_start, ram_end);
 max_page = PFN_DOWN(ram_end);
 
+/*
+ * The allocators may need to use map_domain_page() (such as for
+ * scrubbing pages). So we need to prepare the domheap area first.
+ */
+if ( !init_domheap_mappings(smp_processor_id()) )
+panic("CPU%u: Unable to prepare the domheap page-tables\n",
+  smp_processor_id());
+
 init_staticmem_pages();
 init_sharedmem_pages();
 }
diff --git a/xen/arch/arm/domain_page.c b/xen/arch/arm/domain_page.c
index ac2a6d0332..0f6ba48892 100644
--- a/xen/arch/arm/domain_page.c
+++ b/xen/arch/arm/domain_page.c
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
+#include 
 #include 
 #include 
 #include 
@@ -8,6 +9,8 @@
 /* Override macros from asm/page.h to make them work with mfn_t */
 #undef virt_to_mfn
 #define virt_to_mfn(va) _mfn(__virt_to_mfn(va))
+#undef mfn_to_virt
+#define mfn_to_virt(va) __mfn_to_virt(mfn_x(mfn))
 
 /* cpu0's domheap page tables */
 static DEFINE_PAGE_TABLES(cpu0_dommap, DOMHEAP_SECOND_PAGES);
@@ -31,13 +34,30 @@ bool init_domheap_mappings(unsigned int cpu)
 {
 unsigned int order = get_order_from_pages(DOMHEAP_SECOND_PAGES);
 lpae_t *root = per_cpu(xen_pgtable, cpu);
+lpae_t *first;
 unsigned int i, first_idx;
 lpae_t *domheap;
 mfn_t mfn;
 
+/* Convenience aliases */
+DECLARE_OFFSETS(offsets, DOMHEAP_VIRT_START);
+
 ASSERT(root);
 ASSERT(!per_cpu(xen_dommap, cpu));
 
+/*
+ * On Arm64, the root is at level 0. Therefore we need an extra step
+ * to allocate the first level page-table.
+ */
+#ifdef CONFIG_ARM_64
+if ( create_xen_table([offsets[0]]) )
+return false;
+
+first = xen_map_table(lpae_get_mfn(root[offsets[0]]));
+#else
+first = root;
+#endif
+
 /*
  * The domheap for cpu0 is initialized before the heap is initialized.
  * So we need to use pre-allocated pages.
@@ -58,16 +78,20 @@ bool init_domheap_mappings(unsigned int cpu)
  * domheap mapping pages.
  */
 mfn = virt_to_mfn(domheap);
-first_idx = first_table_offset(DOMHEAP_VIRT_START);
+first_idx = offsets[1];
 for ( i = 0; i < DOMHEAP_SECOND_PAGES; i++ )
 {
 lpae_t pte = mfn_to_xen_entry(mfn_add(mfn, i), MT_NORMAL);
 pte.pt.table = 1;
-write_pte([first_idx + i], pte);
+write_pte([first_idx + i], pte);
 }
 
 per_cpu(xen_dommap, cpu) = domheap;
 
+#ifdef CONFIG_ARM_64
+xen_unmap_table(first);
+#endif
+
 return true;
 }
 
@@ -91,6 +115,10 @@ void *map_domain_page(mfn_t mfn)
 lpae_t pte;
 int i, slot;
 
+/* Bypass the mapcache if the page is in the directmap */
+if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
+return mfn_to_virt(mfn);
+
 local_irq_save(flags);
 
 /* The map is laid out as an open-addressed hash table where each
@@ -153,13 +181,25 @@ void *map_domain_page(mfn_t mfn)
 /* Release a mapping taken with map_domain_page() */
 void unmap_domain_page(const void *ptr)
 {
+unsigned long va = (unsigned long)ptr;
 unsigned long flags;
 lpae_t *map = this_cpu(xen_dommap);
-int slot = ((unsigned long)ptr - 

[PATCH V3 12/19] x86/setup: vmap heap nodes when they are outside the direct map

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

When we do not have a direct map, archs_mfn_in_direct_map() will always
return false, thus init_node_heap() will allocate xenheap pages from an
existing node for the metadata of a new node. This means that the
metadata of a new node is in a different node, slowing down heap
allocation.

Since we now have early vmap, vmap the metadata locally in the new node.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changes in v2:
* vmap_contig_pages() was renamed to vmap_contig()
* Fix indentation and coding style

Changes from Hongyan's version:
* arch_mfn_in_direct_map() was renamed to
  arch_mfns_in_direct_map()
* Use vmap_contig_pages() rather than __vmap(...).
* Add missing include (xen/vmap.h) so it compiles on Arm

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index dfb2c05322..3c0909f333 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -136,6 +136,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -605,22 +606,44 @@ static unsigned long init_node_heap(int node, unsigned 
long mfn,
 needed = 0;
 }
 else if ( *use_tail && nr >= needed &&
-  arch_mfns_in_directmap(mfn + nr - needed, needed) &&
   (!xenheap_bits ||
-   !((mfn + nr - 1) >> (xenheap_bits - PAGE_SHIFT))) )
+  !((mfn + nr - 1) >> (xenheap_bits - PAGE_SHIFT))) )
 {
-_heap[node] = mfn_to_virt(mfn + nr - needed);
-avail[node] = mfn_to_virt(mfn + nr - 1) +
-  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
+if ( arch_mfns_in_directmap(mfn + nr - needed, needed) )
+{
+_heap[node] = mfn_to_virt(mfn + nr - needed);
+avail[node] = mfn_to_virt(mfn + nr - 1) +
+  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
+}
+else
+{
+mfn_t needed_start = _mfn(mfn + nr - needed);
+
+_heap[node] = vmap_contig(needed_start, needed);
+BUG_ON(!_heap[node]);
+avail[node] = (void *)(_heap[node]) + (needed << PAGE_SHIFT) -
+  sizeof(**avail) * NR_ZONES;
+}
 }
 else if ( nr >= needed &&
-  arch_mfns_in_directmap(mfn, needed) &&
   (!xenheap_bits ||
-   !((mfn + needed - 1) >> (xenheap_bits - PAGE_SHIFT))) )
+  !((mfn + needed - 1) >> (xenheap_bits - PAGE_SHIFT))) )
 {
-_heap[node] = mfn_to_virt(mfn);
-avail[node] = mfn_to_virt(mfn + needed - 1) +
-  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
+if ( arch_mfns_in_directmap(mfn, needed) )
+{
+_heap[node] = mfn_to_virt(mfn);
+avail[node] = mfn_to_virt(mfn + needed - 1) +
+  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
+}
+else
+{
+mfn_t needed_start = _mfn(mfn);
+
+_heap[node] = vmap_contig(needed_start, needed);
+BUG_ON(!_heap[node]);
+avail[node] = (void *)(_heap[node]) + (needed << PAGE_SHIFT) -
+  sizeof(**avail) * NR_ZONES;
+}
 *use_tail = false;
 }
 else if ( get_order_from_bytes(sizeof(**_heap)) ==
-- 
2.40.1




[PATCH V3 15/19] Rename maddr_to_virt() calls

2024-05-13 Thread Elias El Yandouzi
Until directmap gets completely removed, we'd still need to
keep some calls to mmaddr_to_virt() for xenheap pages or when the
directmap is enabled.

Rename the macro to maddr_to_directmap_virt() to flag them and
prevent further use of maddr_to_virt().

Signed-off-by: Elias El Yandouzi 

diff --git a/xen/arch/x86/dmi_scan.c b/xen/arch/x86/dmi_scan.c
index 81f80c053a..ac016f3a04 100644
--- a/xen/arch/x86/dmi_scan.c
+++ b/xen/arch/x86/dmi_scan.c
@@ -277,7 +277,7 @@ const char *__init dmi_get_table(paddr_t *base, u32 *len)
return "SMBIOS";
}
} else {
-   char __iomem *p = maddr_to_virt(0xF), *q;
+   char __iomem *p = maddr_to_directmap_virt(0xF), *q;
union {
struct dmi_eps dmi;
struct smbios3_eps smbios3;
@@ -364,7 +364,7 @@ static int __init dmi_iterate(void (*decode)(const struct 
dmi_header *))
dmi.size = 0;
smbios3.length = 0;
 
-   p = maddr_to_virt(0xF);
+   p = maddr_to_directmap_virt(0xF);
for (q = p; q < p + 0x1; q += 16) {
if (!dmi.size) {
memcpy_fromio(, q, sizeof(dmi));
diff --git a/xen/arch/x86/include/asm/mach-default/bios_ebda.h 
b/xen/arch/x86/include/asm/mach-default/bios_ebda.h
index 42de6b2a5b..8cfe53d1f2 100644
--- a/xen/arch/x86/include/asm/mach-default/bios_ebda.h
+++ b/xen/arch/x86/include/asm/mach-default/bios_ebda.h
@@ -7,7 +7,7 @@
  */
 static inline unsigned int get_bios_ebda(void)
 {
-   unsigned int address = *(unsigned short *)maddr_to_virt(0x40E);
+   unsigned int address = *(unsigned short 
*)maddr_to_directmap_virt(0x40E);
address <<= 4;
return address; /* 0 means none */
 }
diff --git a/xen/arch/x86/include/asm/page.h b/xen/arch/x86/include/asm/page.h
index c6891b52d4..bf7bf08ba4 100644
--- a/xen/arch/x86/include/asm/page.h
+++ b/xen/arch/x86/include/asm/page.h
@@ -240,11 +240,11 @@ void copy_page_sse2(void *to, const void *from);
 
 /* Convert between Xen-heap virtual addresses and machine addresses. */
 #define __pa(x) (virt_to_maddr(x))
-#define __va(x) (maddr_to_virt(x))
+#define __va(x) (maddr_to_directmap_virt(x))
 
 /* Convert between Xen-heap virtual addresses and machine frame numbers. */
 #define __virt_to_mfn(va)   (virt_to_maddr(va) >> PAGE_SHIFT)
-#define __mfn_to_virt(mfn)  (maddr_to_virt((paddr_t)(mfn) << PAGE_SHIFT))
+#define __mfn_to_virt(mfn)  (maddr_to_directmap_virt((paddr_t)(mfn) << 
PAGE_SHIFT))
 
 /* Convert between machine frame numbers and page-info structures. */
 #define mfn_to_page(mfn)(frame_table + mfn_to_pdx(mfn))
@@ -270,7 +270,7 @@ void copy_page_sse2(void *to, const void *from);
 #define virt_to_mfn(va) __virt_to_mfn(va)
 #define mfn_to_directmap_virt(mfn)__mfn_to_virt(mfn)
 #define virt_to_maddr(va)   __virt_to_maddr((unsigned long)(va))
-#define maddr_to_virt(ma)   __maddr_to_virt((unsigned long)(ma))
+#define maddr_to_directmap_virt(ma)   __maddr_to_directmap_virt((unsigned 
long)(ma))
 #define maddr_to_page(ma)   __maddr_to_page(ma)
 #define page_to_maddr(pg)   __page_to_maddr(pg)
 #define virt_to_page(va)__virt_to_page(va)
diff --git a/xen/arch/x86/include/asm/x86_64/page.h 
b/xen/arch/x86/include/asm/x86_64/page.h
index 19ca64d792..a95ebc088f 100644
--- a/xen/arch/x86/include/asm/x86_64/page.h
+++ b/xen/arch/x86/include/asm/x86_64/page.h
@@ -48,7 +48,7 @@ static inline unsigned long __virt_to_maddr(unsigned long va)
 return xen_phys_start + va - XEN_VIRT_START;
 }
 
-static inline void *__maddr_to_virt(unsigned long ma)
+static inline void *__maddr_to_directmap_virt(unsigned long ma)
 {
 /* Offset in the direct map, accounting for pdx compression */
 unsigned long va_offset = maddr_to_directmapoff(ma);
diff --git a/xen/arch/x86/mpparse.c b/xen/arch/x86/mpparse.c
index d8ccab2449..69181b0abe 100644
--- a/xen/arch/x86/mpparse.c
+++ b/xen/arch/x86/mpparse.c
@@ -664,7 +664,7 @@ void __init get_smp_config (void)
 
 static int __init smp_scan_config (unsigned long base, unsigned long length)
 {
-   unsigned int *bp = maddr_to_virt(base);
+   unsigned int *bp = maddr_to_directmap_virt(base);
struct intel_mp_floating *mpf;
 
Dprintk("Scan SMP from %p for %ld bytes.\n", bp,length);
diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
index 39aed5845d..1b02e2b6d5 100644
--- a/xen/common/efi/boot.c
+++ b/xen/common/efi/boot.c
@@ -1764,7 +1764,7 @@ void __init efi_init_memory(void)
 if ( map_pages_to_xen((unsigned 
long)mfn_to_directmap_virt(smfn),
 _mfn(smfn), emfn - smfn, prot) == 0 )
 desc->VirtualStart =
-(unsigned long)maddr_to_virt(desc->PhysicalStart);
+(unsigned 
long)maddr_to_directmap_virt(desc->PhysicalStart);
 else
 printk(XENLOG_ERR "Could not 

[PATCH V3 03/19] x86/pv: Rewrite how building PV dom0 handles domheap mappings

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

Building a PV dom0 is allocating from the domheap but uses it like the
xenheap. Use the pages as they should be.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 


Changes in V3:
* Fold following patch 'x86/pv: Map L4 page table for shim domain'

Changes in V2:
* Clarify the commit message
* Break the patch in two parts

Changes since Hongyan's version:
* Rebase
* Remove spurious newline

diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 807296c280..ac910b438a 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -382,6 +382,10 @@ int __init dom0_construct_pv(struct domain *d,
 l3_pgentry_t *l3tab = NULL, *l3start = NULL;
 l2_pgentry_t *l2tab = NULL, *l2start = NULL;
 l1_pgentry_t *l1tab = NULL, *l1start = NULL;
+mfn_t l4start_mfn = INVALID_MFN;
+mfn_t l3start_mfn = INVALID_MFN;
+mfn_t l2start_mfn = INVALID_MFN;
+mfn_t l1start_mfn = INVALID_MFN;
 
 /*
  * This fully describes the memory layout of the initial domain. All
@@ -710,22 +714,32 @@ int __init dom0_construct_pv(struct domain *d,
 v->arch.pv.event_callback_cs= FLAT_COMPAT_KERNEL_CS;
 }
 
+#define UNMAP_MAP_AND_ADVANCE(mfn_var, virt_var, maddr) \
+do {\
+unmap_domain_page(virt_var);\
+mfn_var = maddr_to_mfn(maddr);  \
+maddr += PAGE_SIZE; \
+virt_var = map_domain_page(mfn_var);\
+} while ( false )
+
 if ( !compat )
 {
 maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l4_page_table;
-l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l4start_mfn, l4start, mpt_alloc);
+l4tab = l4start;
 clear_page(l4tab);
-init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)),
-  d, INVALID_MFN, true);
-v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
+init_xen_l4_slots(l4tab, l4start_mfn, d, INVALID_MFN, true);
+v->arch.guest_table = pagetable_from_mfn(l4start_mfn);
 }
 else
 {
 /* Monitor table already created by switch_compat(). */
-l4start = l4tab = __va(pagetable_get_paddr(v->arch.guest_table));
+l4start_mfn = pagetable_get_mfn(v->arch.guest_table);
+l4start = l4tab = map_domain_page(l4start_mfn);
 /* See public/xen.h on why the following is needed. */
 maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l3_page_table;
 l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l3start_mfn, l3start, mpt_alloc);
 }
 
 l4tab += l4_table_offset(v_start);
@@ -735,14 +749,16 @@ int __init dom0_construct_pv(struct domain *d,
 if ( !((unsigned long)l1tab & (PAGE_SIZE-1)) )
 {
 maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l1_page_table;
-l1start = l1tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l1start_mfn, l1start, mpt_alloc);
+l1tab = l1start;
 clear_page(l1tab);
 if ( count == 0 )
 l1tab += l1_table_offset(v_start);
 if ( !((unsigned long)l2tab & (PAGE_SIZE-1)) )
 {
 maddr_to_page(mpt_alloc)->u.inuse.type_info = 
PGT_l2_page_table;
-l2start = l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l2start_mfn, l2start, mpt_alloc);
+l2tab = l2start;
 clear_page(l2tab);
 if ( count == 0 )
 l2tab += l2_table_offset(v_start);
@@ -752,19 +768,19 @@ int __init dom0_construct_pv(struct domain *d,
 {
 maddr_to_page(mpt_alloc)->u.inuse.type_info =
 PGT_l3_page_table;
-l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l3start_mfn, l3start, mpt_alloc);
 }
 l3tab = l3start;
 clear_page(l3tab);
 if ( count == 0 )
 l3tab += l3_table_offset(v_start);
-*l4tab = l4e_from_paddr(__pa(l3start), L4_PROT);
+*l4tab = l4e_from_mfn(l3start_mfn, L4_PROT);
 l4tab++;
 }
-*l3tab = l3e_from_paddr(__pa(l2start), L3_PROT);
+*l3tab = l3e_from_mfn(l2start_mfn, L3_PROT);
 l3tab++;
 }
-*l2tab = l2e_from_paddr(__pa(l1start), L2_PROT);
+*l2tab = l2e_from_mfn(l1start_mfn, L2_PROT);
 l2tab++;
 }
 if ( count < initrd_pfn || count >= initrd_pfn + PFN_UP(initrd_len) )
@@ -783,27 

[PATCH V3 06/19] x86: Add a boot option to enable and disable the direct map

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

Also add a helper function to retrieve it. Change arch_mfns_in_direct_map
to check this option before returning.

This is added as a Kconfig option as well as a boot command line option.
While being generic, the Kconfig option is only usable for x86 at the moment.

Note that there remains some users of the directmap at this point. The option
is introduced now as it will be needed in follow-up patches.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changes in V2:
* Introduce a Kconfig option
* Reword the commit message
* Make opt_directmap and helper generic

Changes since Hongyan's version:
* Reword the commit message
* opt_directmap is only modified during boot so mark it as
  __ro_after_init

diff --git a/docs/misc/xen-command-line.pandoc 
b/docs/misc/xen-command-line.pandoc
index e760f3266e..743d343ffa 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -799,6 +799,18 @@ that enabling this option cannot guarantee anything beyond 
what underlying
 hardware guarantees (with, where available and known to Xen, respective
 tweaks applied).
 
+### directmap (x86)
+> `= `
+
+> Default: `true`
+
+Enable or disable the directmap region in Xen.
+
+By default, Xen creates the directmap region which maps physical memory
+in that region. Setting this to no will sparsely populate the directmap,
+blocking exploits that leak secrets via speculative memory access in the
+directmap.
+
 ### dma_bits
 > `= `
 
diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 7e03e4bc55..b4ec0e582e 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -28,6 +28,7 @@ config X86
select HAS_PCI_MSI
select HAS_PIRQ
select HAS_SCHED_GRANULARITY
+   select HAS_SECRET_HIDING
select HAS_UBSAN
select HAS_VPCI if HVM
select NEEDS_LIBELF
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 98b66edaca..54d835f156 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -622,11 +622,17 @@ void write_32bit_pse_identmap(uint32_t *l2);
 /*
  * x86 maps part of physical memory via the directmap region.
  * Return whether the range of MFN falls in the directmap region.
+ *
+ * When boot command line sets directmap=no, the directmap will mostly be empty
+ * so this will always return false.
  */
 static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
 {
 unsigned long eva = min(DIRECTMAP_VIRT_END, HYPERVISOR_VIRT_END);
 
+if ( !has_directmap() )
+return false;
+
 return (mfn + nr) <= (virt_to_mfn(eva - 1) + 1);
 }
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index f84e1cd79c..bd6b1184f5 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1517,6 +1517,8 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 if ( highmem_start )
 xenheap_max_mfn(PFN_DOWN(highmem_start - 1));
 
+printk("Booting with directmap %s\n", has_directmap() ? "on" : "off");
+
 /*
  * Walk every RAM region and map it in its entirety (on x86/64, at least)
  * and notify it to the boot allocator.
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 565ceda741..856604068c 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -80,12 +80,29 @@ config HAS_PMAP
 config HAS_SCHED_GRANULARITY
bool
 
+config HAS_SECRET_HIDING
+   bool
+
 config HAS_UBSAN
bool
 
 config MEM_ACCESS_ALWAYS_ON
bool
 
+config SECRET_HIDING
+bool "Secret hiding"
+depends on HAS_SECRET_HIDING
+help
+   The directmap contains mapping for most of the RAM which makes 
domain
+   memory easily accessible. While making the performance better, 
it also makes
+   the hypervisor more vulnerable to speculation attacks.
+
+   Enabling this feature will allow the user to decide whether the 
memory
+   is always mapped at boot or mapped only on demand (see the 
command line
+   option "directmap").
+
+   If unsure, say N.
+
 config MEM_ACCESS
def_bool MEM_ACCESS_ALWAYS_ON
prompt "Memory Access and VM events" if !MEM_ACCESS_ALWAYS_ON
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 7c1bdfc046..9b7e4721cd 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -174,6 +174,11 @@ paddr_t __ro_after_init mem_hotplug;
 static char __initdata opt_badpage[100] = "";
 string_param("badpage", opt_badpage);
 
+bool __ro_after_init opt_directmap = true;
+#ifdef CONFIG_HAS_SECRET_HIDING
+boolean_param("directmap", opt_directmap);
+#endif
+
 /*
  * no-bootscrub -> Free pages are not zeroed during boot.
  */
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 7561297a75..9d4f1f2d0d 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -167,6 

[PATCH V3 05/19] x86/mapcache: Initialise the mapcache for the idle domain

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

In order to use the mapcache in the idle domain, we also have to
populate its page tables in the PERDOMAIN region, and we need to move
mapcache_domain_init() earlier in arch_domain_create().

Note, commit 'x86: lift mapcache variable to the arch level' has
initialised the mapcache for HVM domains. With this patch, PV, HVM,
idle domains now all initialise the mapcache.

Signed-off-by: Wei Wang 
Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



Changes in V2:
  * Free resources if mapcache initialisation fails
  * Remove `is_idle_domain()` check from `create_perdomain_mappings()`

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 507d704f16..3303bdb53e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -758,9 +758,16 @@ int arch_domain_create(struct domain *d,
 
 spin_lock_init(>arch.e820_lock);
 
+if ( (rc = mapcache_domain_init(d)) != 0)
+{
+free_perdomain_mappings(d);
+return rc;
+}
+
 /* Minimal initialisation for the idle domain. */
 if ( unlikely(is_idle_domain(d)) )
 {
+struct page_info *pg = d->arch.perdomain_l3_pg;
 static const struct arch_csw idle_csw = {
 .from = paravirt_ctxt_switch_from,
 .to   = paravirt_ctxt_switch_to,
@@ -771,6 +778,9 @@ int arch_domain_create(struct domain *d,
 
 d->arch.cpu_policy = ZERO_BLOCK_PTR; /* Catch stray misuses. */
 
+idle_pg_table[l4_table_offset(PERDOMAIN_VIRT_START)] =
+l4e_from_page(pg, __PAGE_HYPERVISOR_RW);
+
 return 0;
 }
 
@@ -851,8 +861,6 @@ int arch_domain_create(struct domain *d,
 
 psr_domain_init(d);
 
-mapcache_domain_init(d);
-
 if ( is_hvm_domain(d) )
 {
 if ( (rc = hvm_domain_initialise(d, config)) != 0 )
-- 
2.40.1




[PATCH V3 11/19] x86/setup: Leave early boot slightly earlier

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

When we do not have a direct map, memory for metadata of heap nodes in
init_node_heap() is allocated from xenheap, which needs to be mapped and
unmapped on demand. However, we cannot just take memory from the boot
allocator to create the PTEs while we are passing memory to the heap
allocator.

To solve this race, we leave early boot slightly sooner so that Xen PTE
pages are allocated from the heap instead of the boot allocator. We can
do this because the metadata for the 1st node is statically allocated,
and by the time we need memory to create mappings for the 2nd node, we
already have enough memory in the heap allocator in the 1st node.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index bd6b1184f5..f26c9799e4 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1751,6 +1751,22 @@ void asmlinkage __init noreturn __start_xen(unsigned 
long mbi_p)
 
 numa_initmem_init(0, raw_max_page);
 
+/*
+ * When we do not have a direct map, memory for metadata of heap nodes in
+ * init_node_heap() is allocated from xenheap, which needs to be mapped and
+ * unmapped on demand. However, we cannot just take memory from the boot
+ * allocator to create the PTEs while we are passing memory to the heap
+ * allocator during end_boot_allocator().
+ *
+ * To solve this race, we need to leave early boot before
+ * end_boot_allocator() so that Xen PTE pages are allocated from the heap
+ * instead of the boot allocator. We can do this because the metadata for
+ * the 1st node is statically allocated, and by the time we need memory to
+ * create mappings for the 2nd node, we already have enough memory in the
+ * heap allocator in the 1st node.
+ */
+system_state = SYS_STATE_boot;
+
 if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
 {
 unsigned long lo = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
@@ -1782,8 +1798,6 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 else
 end_boot_allocator();
 
-system_state = SYS_STATE_boot;
-
 bsp_stack = cpu_alloc_stack(0);
 if ( !bsp_stack )
 panic("No memory for BSP stack\n");
-- 
2.40.1




[PATCH V3 04/19] x86: Lift mapcache variable to the arch level

2024-05-13 Thread Elias El Yandouzi
From: Wei Liu 

It is going to be needed by HVM and idle domain as well, because without
the direct map, both need a mapcache to map pages.

This commit lifts the mapcache variable up and initialise it a bit earlier
for PV and HVM domains.

Signed-off-by: Wei Liu 
Signed-off-by: Wei Wang 
Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 20e83cf38b..507d704f16 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -851,6 +851,8 @@ int arch_domain_create(struct domain *d,
 
 psr_domain_init(d);
 
+mapcache_domain_init(d);
+
 if ( is_hvm_domain(d) )
 {
 if ( (rc = hvm_domain_initialise(d, config)) != 0 )
@@ -858,8 +860,6 @@ int arch_domain_create(struct domain *d,
 }
 else if ( is_pv_domain(d) )
 {
-mapcache_domain_init(d);
-
 if ( (rc = pv_domain_initialise(d)) != 0 )
 goto fail;
 }
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304f..55e337aaf7 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -82,11 +82,11 @@ void *map_domain_page(mfn_t mfn)
 #endif
 
 v = mapcache_current_vcpu();
-if ( !v || !is_pv_vcpu(v) )
+if ( !v )
 return mfn_to_virt(mfn_x(mfn));
 
-dcache = >domain->arch.pv.mapcache;
-vcache = >arch.pv.mapcache;
+dcache = >domain->arch.mapcache;
+vcache = >arch.mapcache;
 if ( !dcache->inuse )
 return mfn_to_virt(mfn_x(mfn));
 
@@ -187,14 +187,14 @@ void unmap_domain_page(const void *ptr)
 ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
 v = mapcache_current_vcpu();
-ASSERT(v && is_pv_vcpu(v));
+ASSERT(v);
 
-dcache = >domain->arch.pv.mapcache;
+dcache = >domain->arch.mapcache;
 ASSERT(dcache->inuse);
 
 idx = PFN_DOWN(va - MAPCACHE_VIRT_START);
 mfn = l1e_get_pfn(MAPCACHE_L1ENT(idx));
-hashent = >arch.pv.mapcache.hash[MAPHASH_HASHFN(mfn)];
+hashent = >arch.mapcache.hash[MAPHASH_HASHFN(mfn)];
 
 local_irq_save(flags);
 
@@ -233,11 +233,9 @@ void unmap_domain_page(const void *ptr)
 
 int mapcache_domain_init(struct domain *d)
 {
-struct mapcache_domain *dcache = >arch.pv.mapcache;
+struct mapcache_domain *dcache = >arch.mapcache;
 unsigned int bitmap_pages;
 
-ASSERT(is_pv_domain(d));
-
 #ifdef NDEBUG
 if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
 return 0;
@@ -261,12 +259,12 @@ int mapcache_domain_init(struct domain *d)
 int mapcache_vcpu_init(struct vcpu *v)
 {
 struct domain *d = v->domain;
-struct mapcache_domain *dcache = >arch.pv.mapcache;
+struct mapcache_domain *dcache = >arch.mapcache;
 unsigned long i;
 unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
 unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
 
-if ( !is_pv_vcpu(v) || !dcache->inuse )
+if ( !dcache->inuse )
 return 0;
 
 if ( ents > dcache->entries )
@@ -293,7 +291,7 @@ int mapcache_vcpu_init(struct vcpu *v)
 BUILD_BUG_ON(MAPHASHENT_NOTINUSE < MAPCACHE_ENTRIES);
 for ( i = 0; i < MAPHASH_ENTRIES; i++ )
 {
-struct vcpu_maphash_entry *hashent = >arch.pv.mapcache.hash[i];
+struct vcpu_maphash_entry *hashent = >arch.mapcache.hash[i];
 
 hashent->mfn = ~0UL; /* never valid to map */
 hashent->idx = MAPHASHENT_NOTINUSE;
diff --git a/xen/arch/x86/include/asm/domain.h 
b/xen/arch/x86/include/asm/domain.h
index 8a97530607..7f0480d7a7 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -285,9 +285,6 @@ struct pv_domain
 /* Mitigate L1TF with shadow/crashing? */
 bool check_l1tf;
 
-/* map_domain_page() mapping cache. */
-struct mapcache_domain mapcache;
-
 struct cpuidmasks *cpuidmasks;
 };
 
@@ -326,6 +323,9 @@ struct arch_domain
 
 uint8_t scf; /* See SCF_DOM_MASK */
 
+/* map_domain_page() mapping cache. */
+struct mapcache_domain mapcache;
+
 union {
 struct pv_domain pv;
 struct hvm_domain hvm;
@@ -516,9 +516,6 @@ struct arch_domain
 
 struct pv_vcpu
 {
-/* map_domain_page() mapping cache. */
-struct mapcache_vcpu mapcache;
-
 unsigned int vgc_flags;
 
 struct trap_info *trap_ctxt;
@@ -618,6 +615,9 @@ struct arch_vcpu
 #define async_exception_state(t) async_exception_state[(t)-1]
 uint8_t async_exception_mask;
 
+/* map_domain_page() mapping cache. */
+struct mapcache_vcpu mapcache;
+
 /* Virtual Machine Extensions */
 union {
 struct pv_vcpu pv;
-- 
2.40.1




[PATCH V3 02/19] x86/pv: Domheap pages should be mapped while relocating initrd

2024-05-13 Thread Elias El Yandouzi
From: Wei Liu 

Xen shouldn't use domheap page as if they were xenheap pages. Map and
unmap pages accordingly.

Signed-off-by: Wei Liu 
Signed-off-by: Wei Wang 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 


Changes in V3:
* Rename commit title
* Rework the for loop copying the pages

Changes in V2:
* Get rid of mfn_to_virt
* Don't open code copy_domain_page()

Changes since Hongyan's version:
* Add missing newline after the variable declaration

diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index d8043fa58a..807296c280 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -618,18 +618,24 @@ int __init dom0_construct_pv(struct domain *d,
 if ( d->arch.physaddr_bitsize &&
  ((mfn + count - 1) >> (d->arch.physaddr_bitsize - PAGE_SHIFT)) )
 {
+unsigned int nr_pages = 1UL << order;
+
 order = get_order_from_pages(count);
 page = alloc_domheap_pages(d, order, MEMF_no_scrub);
 if ( !page )
 panic("Not enough RAM for domain 0 initrd\n");
+
 for ( count = -count; order--; )
 if ( count & (1UL << order) )
 {
 free_domheap_pages(page, order);
 page += 1UL << order;
+nr_pages -= 1UL << order;
 }
-memcpy(page_to_virt(page), mfn_to_virt(initrd->mod_start),
-   initrd_len);
+
+for ( ; nr_pages-- ; page++, mfn++ )
+copy_domain_page(page_to_mfn(page), _mfn(mfn));
+
 mpt_alloc = (paddr_t)initrd->mod_start << PAGE_SHIFT;
 init_domheap_pages(mpt_alloc,
mpt_alloc + PAGE_ALIGN(initrd_len));
-- 
2.40.1




[PATCH V3 07/19] xen/x86: Add support for the PMAP

2024-05-13 Thread Elias El Yandouzi
From: Julien Grall 

PMAP will be used in a follow-up patch to bootstrap map domain
page infrastructure -- we need some way to map pages to setup the
mapcache without a direct map.

The functions pmap_{map, unmap} open code {set, clear}_fixmap to break
the loop.

Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 



The PMAP infrastructure was upstream separately for Arm since
Hongyan sent the secret-free hypervisor series. So this is a new
patch to plumb the feature on x86.

Changes in v2:
* Declare PMAP entries earlier in fixed_addresses
* Reword the commit message

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index b4ec0e582e..56feb0c564 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
select HAS_PCI
select HAS_PCI_MSI
select HAS_PIRQ
+   select HAS_PMAP
select HAS_SCHED_GRANULARITY
select HAS_SECRET_HIDING
select HAS_UBSAN
diff --git a/xen/arch/x86/include/asm/fixmap.h 
b/xen/arch/x86/include/asm/fixmap.h
index 516ec3fa6c..a7ac365fc6 100644
--- a/xen/arch/x86/include/asm/fixmap.h
+++ b/xen/arch/x86/include/asm/fixmap.h
@@ -21,6 +21,8 @@
 
 #include 
 #include 
+#include 
+
 #include 
 #include 
 #include 
@@ -53,6 +55,8 @@ enum fixed_addresses {
 FIX_PV_CONSOLE,
 FIX_XEN_SHARED_INFO,
 #endif /* CONFIG_XEN_GUEST */
+FIX_PMAP_BEGIN,
+FIX_PMAP_END = FIX_PMAP_BEGIN + NUM_FIX_PMAP,
 /* Everything else should go further down. */
 FIX_APIC_BASE,
 FIX_IO_APIC_BASE_0,
diff --git a/xen/arch/x86/include/asm/pmap.h b/xen/arch/x86/include/asm/pmap.h
new file mode 100644
index 00..62746e191d
--- /dev/null
+++ b/xen/arch/x86/include/asm/pmap.h
@@ -0,0 +1,25 @@
+#ifndef __ASM_PMAP_H__
+#define __ASM_PMAP_H__
+
+#include 
+
+static inline void arch_pmap_map(unsigned int slot, mfn_t mfn)
+{
+unsigned long linear = (unsigned long)fix_to_virt(slot);
+l1_pgentry_t *pl1e = _fixmap[l1_table_offset(linear)];
+
+ASSERT(!(l1e_get_flags(*pl1e) & _PAGE_PRESENT));
+
+l1e_write_atomic(pl1e, l1e_from_mfn(mfn, PAGE_HYPERVISOR));
+}
+
+static inline void arch_pmap_unmap(unsigned int slot)
+{
+unsigned long linear = (unsigned long)fix_to_virt(slot);
+l1_pgentry_t *pl1e = _fixmap[l1_table_offset(linear)];
+
+l1e_write_atomic(pl1e, l1e_empty());
+flush_tlb_one_local(linear);
+}
+
+#endif /* __ASM_PMAP_H__ */
-- 
2.40.1




[PATCH V3 01/19] x86: Create per-domain mapping of guest_root_pt

2024-05-13 Thread Elias El Yandouzi
From: Hongyan Xia 

Create a per-domain mapping of PV guest_root_pt as direct map is being
removed.

Note that we do not map and unmap root_pgt for now since it is still a
xenheap page.

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 
Signed-off-by: Elias El Yandouzi 


Changes in V3:
* Rename SHADOW_ROOT
* Haven't addressed the potentially over-allocation issue as I don't 
get it

Changes in V2:
* Rework the shadow perdomain mapping solution in the follow-up patches

Changes since Hongyan's version:
* Remove the final dot in the commit title

diff --git a/xen/arch/x86/include/asm/config.h 
b/xen/arch/x86/include/asm/config.h
index ab7288cb36..5d710384df 100644
--- a/xen/arch/x86/include/asm/config.h
+++ b/xen/arch/x86/include/asm/config.h
@@ -203,7 +203,7 @@ extern unsigned char boot_edid_info[128];
 /* Slot 260: per-domain mappings (including map cache). */
 #define PERDOMAIN_VIRT_START(PML4_ADDR(260))
 #define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
-#define PERDOMAIN_SLOTS 3
+#define PERDOMAIN_SLOTS 4
 #define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
  (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 4: mirror of per-domain mappings (for compat xlat area accesses). */
@@ -317,6 +317,14 @@ extern unsigned long xen_phys_start;
 #define ARG_XLAT_START(v)\
 (ARG_XLAT_VIRT_START + ((v)->vcpu_id << ARG_XLAT_VA_SHIFT))
 
+/* pv_root_pt mapping area. The fourth per-domain-mapping sub-area */
+#define PV_ROOT_PT_MAPPING_VIRT_START   PERDOMAIN_VIRT_SLOT(3)
+#define PV_ROOT_PT_MAPPING_ENTRIES  MAX_VIRT_CPUS
+
+/* The address of a particular VCPU's PV_ROOT_PT */
+#define PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v) \
+(PV_ROOT_PT_MAPPING_VIRT_START + ((v)->vcpu_id * PAGE_SIZE))
+
 #define ELFSIZE 64
 
 #define ARCH_CRASH_SAVE_VMCOREINFO
diff --git a/xen/arch/x86/include/asm/domain.h 
b/xen/arch/x86/include/asm/domain.h
index f5daeb182b..8a97530607 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -272,6 +272,7 @@ struct time_scale {
 struct pv_domain
 {
 l1_pgentry_t **gdt_ldt_l1tab;
+l1_pgentry_t **root_pt_l1tab;
 
 atomic_t nr_l4_pages;
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d968bbbc73..efdf20f775 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,6 +505,13 @@ void share_xen_page_with_guest(struct page_info *page, 
struct domain *d,
 nrspin_unlock(>page_alloc_lock);
 }
 
+#define pv_root_pt_idx(v) \
+((v)->vcpu_id >> PAGETABLE_ORDER)
+
+#define pv_root_pt_pte(v) \
+((v)->domain->arch.pv.root_pt_l1tab[pv_root_pt_idx(v)] + \
+ ((v)->vcpu_id & (L1_PAGETABLE_ENTRIES - 1)))
+
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
 struct domain *d = v->domain;
@@ -524,6 +531,13 @@ void write_ptbase(struct vcpu *v)
 
 if ( is_pv_vcpu(v) && v->domain->arch.pv.xpti )
 {
+mfn_t guest_root_pt = _mfn(MASK_EXTR(v->arch.cr3, PAGE_MASK));
+l1_pgentry_t *pte = pv_root_pt_pte(v);
+
+ASSERT(v == current);
+
+l1e_write(pte, l1e_from_mfn(guest_root_pt, __PAGE_HYPERVISOR_RO));
+
 cpu_info->root_pgt_changed = true;
 cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
 if ( new_cr4 & X86_CR4_PCIDE )
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2a445bb17b..1b025986f7 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -288,6 +288,21 @@ static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
   1U << GDT_LDT_VCPU_SHIFT);
 }
 
+static int pv_create_root_pt_l1tab(struct vcpu *v)
+{
+return create_perdomain_mapping(v->domain,
+PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v),
+1, v->domain->arch.pv.root_pt_l1tab,
+NULL);
+}
+
+static void pv_destroy_root_pt_l1tab(struct vcpu *v)
+
+{
+destroy_perdomain_mapping(v->domain,
+  PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v), 1);
+}
+
 void pv_vcpu_destroy(struct vcpu *v)
 {
 if ( is_pv_32bit_vcpu(v) )
@@ -297,6 +312,7 @@ void pv_vcpu_destroy(struct vcpu *v)
 }
 
 pv_destroy_gdt_ldt_l1tab(v);
+pv_destroy_root_pt_l1tab(v);
 XFREE(v->arch.pv.trap_ctxt);
 }
 
@@ -311,6 +327,13 @@ int pv_vcpu_initialise(struct vcpu *v)
 if ( rc )
 return rc;
 
+if ( v->domain->arch.pv.xpti )
+{
+rc = pv_create_root_pt_l1tab(v);
+if ( rc )
+goto done;
+}
+
 BUILD_BUG_ON(X86_NR_VECTORS * sizeof(*v->arch.pv.trap_ctxt) >
  PAGE_SIZE);
 v->arch.pv.trap_ctxt = xzalloc_array(struct trap_info, X86_NR_VECTORS);
@@ -346,10 +369,12 @@ void pv_domain_destroy(struct domain *d)
 
 destroy_perdomain_mapping(d, GDT_LDT_VIRT_START,
   GDT_LDT_MBYTES << (20 - PAGE_SHIFT));
+destroy_perdomain_mapping(d, 

[PATCH V3 00/19] Remove the directmap

2024-05-13 Thread Elias El Yandouzi
Hi all,

A few years ago, Wei Liu implemented a PoC to remove the directmap
from Xen. The last version was sent by Hongyan Xia [1].

I will start with thanking both Wei and Hongyan for the initial work
to upstream the feature. A lot of patches already went in and this is
the last few patches missing to effectively enable the feature.

=== What is the directmap? ===

At the moment, on both arm64 and x86, most of the RAM is mapped
in Xen address space. This means that domain memory is easily
accessible in Xen.

=== Why do we want to remove the directmap? ===

(Summarizing my understanding of the previous discussion)

Speculation attacks (like Spectre SP1) rely on loading piece of memory
in the cache. If the region is not mapped then it can't be loaded.

So removing reducing the amount of memory mapped in Xen will also
reduce the surface attack.

=== What's the performance impact? ===

As the guest memory is not always mapped, then the cost of mapping
will increase. I haven't done the numbers with this new version, but
some measurement were provided in the previous version for x86.

=== Improvement possible ===

The known area to improve on x86 are:
   * Mapcache: There was a patch sent by Hongyan:
 
https://lore.kernel.org/xen-devel/4058e92ce21627731c49b588a95809dc0affd83a.1581015491.git.hongy...@amazon.com/
   * EPT: At the moment an guest page-tabel walk requires about 20 map/unmap.
 This will have an very high impact on the performance. We need to decide
 whether keep the EPT always mapped is a problem

The original series didn't have support for Arm64. But as there were
some interest, I have provided a PoC.

There are more extra work for Arm64:
   * The mapcache is quite simple. We would investigate the performance
   * The mapcache should be made compliant to the Arm Arm (this is now
 more critical).
   * We will likely have the same problem as for the EPT.
   * We have no support for merging table to a superpage, neither
 free empty page-tables. (See more below)

=== Implementation ===

The subject is probably a misnomer. The directmap is still present but
the RAM is not mapped by default. Instead, the region will still be used
to map pages allocate via alloc_xenheap_pages().

The advantage is the solution is simple (so IHMO good enough for been
merged as a tech preview). The disadvantage is the page allocator is not
trying to keep all the xenheap pages together. So we may end up to have
an increase of page table usage.

In the longer term, we should consider to remove the direct map
completely and switch to vmap(). The main problem with this approach
is it is frequent to use mfn_to_virt() in the code. So we would need
to cache the mapping (maybe in the struct page_info).

=== Why arm32 is not covered? ===

On Arm32, the domheap and xenheap is always separated. So by design
the guest memory is not mapped by default.

At this stage, it seems unnecessary to have to map/unmap xenheap pages
every time they are allocated.

=== Why not using a separate domheap and xenheap? ===

While a separate xenheap/domheap reduce the page-table usage (all
xenheap pages are contiguous and could be always mapped), it is also
currently less scalable because the split is fixed at boot time (XXX:
Can this be dynamic?).

=== Future of secret-free hypervisor ===

There are some information in an e-mail from Andrew a few years ago:

https://lore.kernel.org/xen-devel/e3219697-0759-39fc-2486-715cdec1c...@citrix.com/

Cheers,

[1] https://lore.kernel.org/xen-devel/cover.1588278317.git.hongy...@amazon.com/

*** BLURB HERE ***

Elias El Yandouzi (3):
  xen/x86: Add build assertion for fixmap entries
  Rename mfn_to_virt() calls
  Rename maddr_to_virt() calls

Hongyan Xia (9):
  x86: Create per-domain mapping of guest_root_pt
  x86/pv: Rewrite how building PV dom0 handles domheap mappings
  x86/mapcache: Initialise the mapcache for the idle domain
  x86: Add a boot option to enable and disable the direct map
  x86/domain_page: Remove the fast paths when mfn is not in the
directmap
  xen/page_alloc: Add a path for xenheap when there is no direct map
  x86/setup: Leave early boot slightly earlier
  x86/setup: vmap heap nodes when they are outside the direct map
  x86/setup: Do not create valid mappings when directmap=no

Julien Grall (5):
  xen/x86: Add support for the PMAP
  xen/arm32: mm: Rename 'first' to 'root' in init_secondary_pagetables()
  xen/arm64: mm: Use per-pCPU page-tables
  xen/arm64: Implement a mapcache for arm64
  xen/arm64: Allow the admin to enable/disable the directmap

Wei Liu (2):
  x86/pv: Domheap pages should be mapped while relocating initrd
  x86: Lift mapcache variable to the arch level

 docs/misc/xen-command-line.pandoc | 12 +++
 xen/arch/arm/Kconfig  |  2 +-
 xen/arch/arm/arm64/mmu/mm.c   | 45 -
 xen/arch/arm/domain_page.c| 50 +-
 xen/arch/arm/include/asm/arm32/mm.h   |  8 --
 

Re: [PATCH v2 (resend) 13/27] x86: Add a boot option to enable and disable the direct map

2024-05-13 Thread Elias El Yandouzi




On 20/02/2024 11:14, Jan Beulich wrote:

On 16.01.2024 20:25, Elias El Yandouzi wrote:

--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86
select HAS_UBSAN
select HAS_VPCI if HVM
select NEEDS_LIBELF
+   select HAS_SECRET_HIDING


Please respect alphabetic sorting. As to "secret hiding" - personally I
consider this too generic a term. This is about limiting the direct map. Why
not name the option then accordingly?



I think it is a fairly decent name, would you have any suggestion? 
Otherwise I will just stick to it.



--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -620,10 +620,18 @@ void write_32bit_pse_identmap(uint32_t *l2);
  /*
   * x86 maps part of physical memory via the directmap region.
   * Return whether the range of MFN falls in the directmap region.
+ *
+ * When boot command line sets directmap=no, we will not have a direct map at
+ * all so this will always return false.
   */


As with the command line doc, please state the full truth.


  static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
  {
-unsigned long eva = min(DIRECTMAP_VIRT_END, HYPERVISOR_VIRT_END);
+unsigned long eva;
+
+if ( !has_directmap() )
+return false;


Hmm. The sole user of this function is init_node_heap(). Would it perhaps make
sense to simply map the indicated number of pages then? init_node_heap() would
fall back to xmalloc(), so the data will be in what's left of the directmap
anyway.



There will be more users of arch_mfns_in_directmap() in the following 
patches.



+eva = min(DIRECTMAP_VIRT_END, HYPERVISOR_VIRT_END);


Irrespective I don't see a need to replace the initializer by an assignment.


I guess it was to avoid the useless min() computation in case directmap 
is disabled. I can put it back to what it was.





--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -83,6 +83,23 @@ config HAS_UBSAN
  config MEM_ACCESS_ALWAYS_ON
bool
  
+config HAS_SECRET_HIDING

+   bool


This again wants placing suitably among the other HAS_*.


+config SECRET_HIDING
+bool "Secret hiding"
+depends on HAS_SECRET_HIDING
+---help---
+The directmap contains mapping for most of the RAM which makes domain
+memory easily accessible. While making the performance better, it also 
makes
+the hypervisor more vulnerable to speculation attacks.
+
+Enabling this feature will allow the user to decide whether the memory
+is always mapped at boot or mapped only on demand (see the command line
+option "directmap").
+
+If unsure, say N.


Also as an alternative did you consider making this new setting merely
control the default of opt_directmap? Otherwise the variable shouldn't exist
at all when the Kconfig option is off, but rather be #define-d to "true" in
that case.


I am not sure to understand why the option shouldn't exist at all when 
Kconfig option is off.


If SECRET_HIDING option is off, then opt_directmap must be 
unconditionally set to true. If SECRET_HIDING option is on, then 
opt_directmap value depends on the commandline option.


The corresponding wrapper, has_directmap(), will be used in multiple 
location in follow-up patch. I don't really see how you want to do.



--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -165,6 +165,13 @@ extern unsigned long max_page;
  extern unsigned long total_pages;
  extern paddr_t mem_hotplug;
  
+extern bool opt_directmap;

+
+static inline bool has_directmap(void)
+{
+return opt_directmap;
+}


If opt_directmap isn't static, I see little point in having such a wrapper.
If there are reasons, I think they want stating in the description.


I don't think there is a specific reason to be mentioned, if you really 
wish to, I can remove it.



On the whole: Is the placement of this patch in the series an indication
that as of here all directmap uses have gone away? If so, what's the rest of
the series about? Alternatively isn't use of this option still problematic
at this point of the series? Whichever way it is - this wants clarifying in
the description.


This patch is not an indication that all directmap uses have been 
removed. We need to know in follow-up patch whether or not the option is 
enabled and so we have to introduce this patch here.


At this point in the series, the feature is not yet complete.


Elias




[linux-linus test] 185986: tolerable FAIL - PUSHED

2024-05-13 Thread osstest service owner
flight 185986 linux-linus real [real]
flight 185989 linux-linus real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/185986/
http://logs.test-lab.xenproject.org/osstest/logs/185989/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemuu-debianhvm-i386-xsm 12 debian-hvm-install fail pass 
in 185989-retest
 test-armhf-armhf-xl   8 xen-bootfail pass in 185989-retest
 test-armhf-armhf-xl-raw   8 xen-bootfail pass in 185989-retest
 test-armhf-armhf-libvirt-vhd  8 xen-bootfail pass in 185989-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl 15 migrate-support-check fail in 185989 never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail in 185989 never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-check fail in 185989 never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-check fail in 185989 never 
pass
 test-armhf-armhf-xl-raw 14 migrate-support-check fail in 185989 never pass
 test-armhf-armhf-xl-raw 15 saverestore-support-check fail in 185989 never pass
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185977
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185982
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185982
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185982
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185982
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185982
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-qcow214 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-qcow215 saverestore-support-checkfail   never pass

version targeted for testing:
 linuxa38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
baseline version:
 linuxcf87f46fd34d6c19283d9625a7822f20d90b64a4

Last test of basis   185982  2024-05-11 16:13:39 Z1 days
Failing since185984  2024-05-12 16:44:05 Z0 days2 attempts
Testing same since   185986  2024-05-12 23:41:44 Z0 days1 attempts


People who touched revisions under test:
  Borislav Petkov (AMD) 
  Christian Borntraeger 
  

Re: [PATCH v2 (resend) 12/27] x86/mapcache: Initialise the mapcache for the idle domain

2024-05-13 Thread Elias El Yandouzi




On 20/02/2024 10:51, Jan Beulich wrote:

On 16.01.2024 20:25, Elias El Yandouzi wrote:

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -750,9 +750,16 @@ int arch_domain_create(struct domain *d,
  
  spin_lock_init(>arch.e820_lock);
  
+if ( (rc = mapcache_domain_init(d)) != 0)

+{
+free_perdomain_mappings(d);
+return rc;
+}
+
  /* Minimal initialisation for the idle domain. */
  if ( unlikely(is_idle_domain(d)) )
  {
+struct page_info *pg = d->arch.perdomain_l3_pg;
  static const struct arch_csw idle_csw = {
  .from = paravirt_ctxt_switch_from,
  .to   = paravirt_ctxt_switch_to,
@@ -763,6 +770,9 @@ int arch_domain_create(struct domain *d,
  
  d->arch.cpu_policy = ZERO_BLOCK_PTR; /* Catch stray misuses. */
  
+idle_pg_table[l4_table_offset(PERDOMAIN_VIRT_START)] =

+l4e_from_page(pg, __PAGE_HYPERVISOR_RW);
+
  return 0;
  }


Why not add another call to mapcache_domain_init() right here, allowing
a more specific panic() to be invoked in case of failure (compared to
the BUG_ON() upon failure of creation of the idle domain as a whole)?
Then the other mapcache_domain_init() call doesn't need moving a 2nd
time in close succession.



To be honest, I don't really like the idea of having twice the same call 
just for the benefit of having a panic() call in case of failure for the 
idle domain.


If you don't mind, I'd rather keep just a single call to 
mapcache_domain_init() as it is now.


Elias



Re: [XEN PATCH v2 5/5] x86/MCE: optional build of AMD/Intel MCE code

2024-05-13 Thread Sergiy Kibrik

06.05.24 14:32, Jan Beulich:

On 02.05.2024 11:21, Sergiy Kibrik wrote:

Separate Intel/AMD-specific MCE code using CONFIG_{INTEL,AMD} config options.
Now we can avoid build of mcheck code if support for specific platform is
intentionally disabled by configuration.

Add default return value to init_nonfatal_mce_checker() routine -- in case
of a build with both AMD and INTEL options are off (e.g. randconfig).


I'm afraid that, as before, I can't accept this as a justification for the
addition. The addition likely is wanted, but perhaps in a separate up-front
patch and explaining what's wrong when that's missing.


sure, I'll do separate patch for that.




Also global Intel-specific variables lmce_support & cmci_support have to be
redefined if !INTEL, as they get checked in common code.


Them being checked in common code may have different resolution strategies.
The justification here imo is that, right now, both variables are only ever
written by mce_intel.c. As mentioned for vmce_has_lmce(), there's nothing
fundamentally preventing MCG_CAP from having respective bits set on a non-
Intel CPU.



so could these global variables just be moved to common code then? Like 
arch/x86/cpu/mcheck/mce.c ?


  -Sergiy



[PATCH for-4.19] x86/mtrr: avoid system wide rendezvous when setting AP MTRRs

2024-05-13 Thread Roger Pau Monne
There's no point in forcing a system wide update of the MTRRs on all processors
when there are no changes to be propagated.  On AP startup it's only the AP
that needs to write the system wide MTRR values in order to match the rest of
the already online CPUs.

We have occasionally seen the watchdog trigger during `xen-hptool cpu-online`
in one Intel Cascade Lake box with 448 CPUs due to the re-setting of the MTRRs
on all the CPUs in the system.

While there adjust the comment to clarify why the system-wide resetting of the
MTRR registers is not needed for the purposes of mtrr_ap_init().

Signed-off-by: Roger Pau Monné 
---
For consideration for 4.19: it's a bugfix of a rare instance of the watchdog
triggering, but it's also a good performance improvement when performing
cpu-online.

Hopefully runtime changes to MTRR will affect a single MSR at a time, lowering
the chance of the watchdog triggering due to the system-wide resetting of the
range.
---
 xen/arch/x86/cpu/mtrr/main.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/cpu/mtrr/main.c b/xen/arch/x86/cpu/mtrr/main.c
index 90b235f57e68..0a44ebbcb04f 100644
--- a/xen/arch/x86/cpu/mtrr/main.c
+++ b/xen/arch/x86/cpu/mtrr/main.c
@@ -573,14 +573,15 @@ void mtrr_ap_init(void)
if (!mtrr_if || hold_mtrr_updates_on_aps)
return;
/*
-* Ideally we should hold mtrr_mutex here to avoid mtrr entries changed,
-* but this routine will be called in cpu boot time, holding the lock
-* breaks it. This routine is called in two cases: 1.very earily time
-* of software resume, when there absolutely isn't mtrr entry changes;
-* 2.cpu hotadd time. We let mtrr_add/del_page hold cpuhotplug lock to
-* prevent mtrr entry changes
+* hold_mtrr_updates_on_aps takes care of preventing unnecessary MTRR
+* updates when batch starting the CPUs (see
+* mtrr_aps_sync_{begin,end}()).
+*
+* Otherwise just apply the current system wide MTRR values to this AP.
+* Note this doesn't require synchronization with the other CPUs, as
+* there are strictly no modifications of the current MTRR values.
 */
-   set_mtrr(~0U, 0, 0, 0);
+   mtrr_set_all();
 }
 
 /**
-- 
2.44.0




Re: Serious AMD-Vi(?) issue

2024-05-13 Thread Roger Pau Monné
On Fri, May 10, 2024 at 09:09:54PM -0700, Elliott Mitchell wrote:
> On Thu, Apr 18, 2024 at 09:33:31PM -0700, Elliott Mitchell wrote:
> > 
> > I suspect this is a case of there is some step which is missing from
> > Xen's IOMMU handling.  Perhaps something which Linux does during an early
> > DMA setup stage, but the current Xen implementation does lazily?
> > Alternatively some flag setting or missing step?
> > 
> > I should be able to do another test approach in a few weeks, but I would
> > love if something could be found sooner.
> 
> Turned out to be disturbingly easy to get the first entry when it
> happened.  Didn't even need `dbench`, it simply showed once the OS was
> fully loaded.  I did get some additional data points.
> 
> Appears this requires an AMD IOMMUv2.  A test system with known
> functioning AMD IOMMUv1 didn't display the issue at all.
> 
> (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr fffdf800 flags 0x8 I

I would expect the address field to contain more information about the
fault, but I'm not finding any information on the AMD-Vi specification
apart from that it contains the DVA, which makes no sense when the
fault is caused by an interrupt.

> (XEN) :bb:dd.f root @ 83b5f5 (3 levels) dfn=fffdf8000
> (XEN)   L3[1f7] = 0 np

Attempting to print the page table walk for an Interrupt remapping
fault is useless, we should likely avoid that when the I flag is set.

> 
> I find it surprising this required "iommu=debug" to get this level of
> detail.  This amount of output seems more appropriate for "verbose".

"verbose" should also print this information.

> 
> I strongly prefer to provide snippets.  There is a fair bit of output,
> I'm unsure which portion is most pertinent.

I've already voiced my concern that I think what yo uare doing is not
fair.  We are debugging this out of interest, and hence you refusing
to provide all information just hampers our ability to debug, and
makes us spend more time than required just thinking what snippets we
need to ask for.

I will ask again, what's there in the Xen or the Linux dmesgs that you
are so worried about leaking? Please provide an specific example.

Why do you mask the device SBDF in the above snippet?  I would really
like to understand what's so privacy relevant in a PCI SBDF number.

Does booting with `iommu=no-intremap` lead to any issues being
reported?

Regards, Roger.



Re: [XEN PATCH v2 2/5] x86/intel: move vmce_has_lmce() routine to header

2024-05-13 Thread Sergiy Kibrik

06.05.24 14:18, Jan Beulich:

On 02.05.2024 11:14, Sergiy Kibrik wrote:

Moving this function out of mce_intel.c would make it possible to disable
build of Intel MCE code later on, because the function gets called from
common x86 code.

Add internal check for CONFIG_INTEL option, as MCG_LMCE_P bit is currently
specific to Intel CPUs only.

My previously voiced concern regarding this was not addressed. If ...


I misunderstood you comment to v1 patch.
I'll drop checks for CONFIG_INTEL, I see now that we don't really need one.

  -Sergiy



Re: [PATCH 3/7] xen/p2m: put reference for superpage

2024-05-13 Thread Roger Pau Monné
On Fri, May 10, 2024 at 10:37:53PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 09/05/2024 13:58, Roger Pau Monné wrote:
> > On Thu, May 09, 2024 at 01:12:00PM +0100, Julien Grall wrote:
> > > Hi,
> > > 
> > > On 09/05/2024 12:28, Roger Pau Monné wrote:
> > > > OTOH for 1GB given the code here the page could be freed in one go,
> > > > without a chance of preempting the operation.
> > > > 
> > > > Maybe you have to shatter superpages into 4K entries and then remove
> > > > them individually, as to allow for preemption to be possible by
> > > > calling put_page() for each 4K chunk?
> > > This would require to allocate some pages from the P2M pool for the 
> > > tables.
> > > As the pool may be exhausted, it could be problematic when relinquishing 
> > > the
> > > resources.
> > 
> > Indeed, it's not ideal.
> > 
> > > It may be possible to find a way to have memory available by removing 
> > > other
> > > mappings first. But it feels a bit hackish and I would rather prefer if we
> > > avoid allocating any memory when relinquishing.
> > 
> > Maybe it could be helpful to provide a function to put a superpage,
> > that internally calls free_domheap_pages() with the appropriate order
> > so that freeing a superpage only takes a single free_domheap_pages()
> > call.
> 
> Today, free_domheap_page() is only called when the last reference is
> removed. I don't thinkt here is any guarantee that all the references will
> dropped at the same time.

I see, yes, we have no guarantee that removing the superpage entry in
the mapping domain will lead to either the whole superpage freed at
once, or not freed.  The source domain may have shattered the
super-page and hence freeing might need to be done at a smaller
granularity.

> >  That could reduce some of the contention around the heap_lock
> > and d->page_alloc_lock locks.
> 
> From previous experience (when Hongyan and I worked on optimizing
> init_heap_pages() for Live-Update), the lock is actually not the biggest
> problem. The issue is adding the pages back to the heap (which may requiring
> merging). So as long as the pages are not freed contiguously, we may not
> gain anything.

Would it help to defer the merging to the idle context, kind of
similar to what we do with scrubbing?

Thanks, Roger.



Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq

2024-05-13 Thread Jürgen Groß

On 13.05.24 09:47, Chen, Jiqian wrote:

Hi,
On 2024/5/10 17:06, Chen, Jiqian wrote:

Hi,

On 2024/5/10 14:46, Jürgen Groß wrote:

On 19.04.24 05:36, Jiqian Chen wrote:

In PVH dom0, it uses the linux local interrupt mechanism,
when it allocs irq for a gsi, it is dynamic, and follow
the principle of applying first, distributing first. And
the irq number is alloced from small to large, but the
applying gsi number is not, may gsi 38 comes before gsi 28,
it causes the irq number is not equal with the gsi number.
And when passthrough a device, QEMU will use device's gsi
number to do pirq mapping, but the gsi number is got from
file /sys/bus/pci/devices//irq, irq!= gsi, so it will
fail when mapping.
And in current linux codes, there is no method to translate
irq to gsi for userspace.

For above purpose, record the relationship of gsi and irq
when PVH dom0 do acpi_register_gsi_ioapic for devices and
adds a new syscall into privcmd to let userspace can get
that translation when they have a need.

Co-developed-by: Huang Rui 
Signed-off-by: Jiqian Chen 
---
   arch/x86/include/asm/apic.h  |  8 +++
   arch/x86/include/asm/xen/pci.h   |  5 
   arch/x86/kernel/acpi/boot.c  |  2 +-
   arch/x86/pci/xen.c   | 21 +
   drivers/xen/events/events_base.c | 39 
   drivers/xen/privcmd.c    | 19 
   include/uapi/xen/privcmd.h   |  7 ++
   include/xen/events.h |  5 
   8 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 9d159b771dc8..dd4139250895 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -169,6 +169,9 @@ extern bool apic_needs_pit(void);
     extern void apic_send_IPI_allbutself(unsigned int vector);
   +extern int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+    int trigger, int polarity);
+
   #else /* !CONFIG_X86_LOCAL_APIC */
   static inline void lapic_shutdown(void) { }
   #define local_apic_timer_c2_ok    1
@@ -183,6 +186,11 @@ static inline void apic_intr_mode_init(void) { }
   static inline void lapic_assign_system_vectors(void) { }
   static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { }
   static inline bool apic_needs_pit(void) { return true; }
+static inline int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+    int trigger, int polarity)
+{
+    return (int)gsi;
+}
   #endif /* !CONFIG_X86_LOCAL_APIC */
     #ifdef CONFIG_X86_X2APIC
diff --git a/arch/x86/include/asm/xen/pci.h b/arch/x86/include/asm/xen/pci.h
index 9015b888edd6..aa8ded61fc2d 100644
--- a/arch/x86/include/asm/xen/pci.h
+++ b/arch/x86/include/asm/xen/pci.h
@@ -5,6 +5,7 @@
   #if defined(CONFIG_PCI_XEN)
   extern int __init pci_xen_init(void);
   extern int __init pci_xen_hvm_init(void);
+extern int __init pci_xen_pvh_init(void);
   #define pci_xen 1
   #else
   #define pci_xen 0
@@ -13,6 +14,10 @@ static inline int pci_xen_hvm_init(void)
   {
   return -1;
   }
+static inline int pci_xen_pvh_init(void)
+{
+    return -1;
+}
   #endif
   #ifdef CONFIG_XEN_PV_DOM0
   int __init pci_xen_initial_domain(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 85a3ce2a3666..72c73458c083 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -749,7 +749,7 @@ static int acpi_register_gsi_pic(struct device *dev, u32 
gsi,
   }
     #ifdef CONFIG_X86_LOCAL_APIC
-static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
   int trigger, int polarity)
   {
   int irq = gsi;
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 652cd53e77f6..f056ab5c0a06 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -114,6 +114,21 @@ static int acpi_register_gsi_xen_hvm(struct device *dev, 
u32 gsi,
    false /* no mapping of GSI to PIRQ */);
   }
   +static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,
+    int trigger, int polarity)
+{
+    int irq;
+
+    irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
+    if (irq < 0)
+    return irq;
+
+    if (xen_pvh_add_gsi_irq_map(gsi, irq) == -EEXIST)
+    printk(KERN_INFO "Already map the GSI :%u and IRQ: %d\n", gsi, irq);
+
+    return irq;
+}
+
   #ifdef CONFIG_XEN_PV_DOM0
   static int xen_register_gsi(u32 gsi, int triggering, int polarity)
   {
@@ -558,6 +573,12 @@ int __init pci_xen_hvm_init(void)
   return 0;
   }
   +int __init pci_xen_pvh_init(void)
+{
+    __acpi_register_gsi = acpi_register_gsi_xen_pvh;


No support for unregistering the gsi again?

__acpi_unregister_gsi is set in function acpi_set_irq_model_ioapic.
Maybe I need to use a new function to call acpi_unregister_gsi_ioapic and 
remove the mapping of irq and gsi from xen_irq_list_head ?

When I tried to support unregistering the gsi 

Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq

2024-05-13 Thread Chen, Jiqian
Hi,
On 2024/5/10 17:06, Chen, Jiqian wrote:
> Hi,
> 
> On 2024/5/10 14:46, Jürgen Groß wrote:
>> On 19.04.24 05:36, Jiqian Chen wrote:
>>> In PVH dom0, it uses the linux local interrupt mechanism,
>>> when it allocs irq for a gsi, it is dynamic, and follow
>>> the principle of applying first, distributing first. And
>>> the irq number is alloced from small to large, but the
>>> applying gsi number is not, may gsi 38 comes before gsi 28,
>>> it causes the irq number is not equal with the gsi number.
>>> And when passthrough a device, QEMU will use device's gsi
>>> number to do pirq mapping, but the gsi number is got from
>>> file /sys/bus/pci/devices//irq, irq!= gsi, so it will
>>> fail when mapping.
>>> And in current linux codes, there is no method to translate
>>> irq to gsi for userspace.
>>>
>>> For above purpose, record the relationship of gsi and irq
>>> when PVH dom0 do acpi_register_gsi_ioapic for devices and
>>> adds a new syscall into privcmd to let userspace can get
>>> that translation when they have a need.
>>>
>>> Co-developed-by: Huang Rui 
>>> Signed-off-by: Jiqian Chen 
>>> ---
>>>   arch/x86/include/asm/apic.h  |  8 +++
>>>   arch/x86/include/asm/xen/pci.h   |  5 
>>>   arch/x86/kernel/acpi/boot.c  |  2 +-
>>>   arch/x86/pci/xen.c   | 21 +
>>>   drivers/xen/events/events_base.c | 39 
>>>   drivers/xen/privcmd.c    | 19 
>>>   include/uapi/xen/privcmd.h   |  7 ++
>>>   include/xen/events.h |  5 
>>>   8 files changed, 105 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
>>> index 9d159b771dc8..dd4139250895 100644
>>> --- a/arch/x86/include/asm/apic.h
>>> +++ b/arch/x86/include/asm/apic.h
>>> @@ -169,6 +169,9 @@ extern bool apic_needs_pit(void);
>>>     extern void apic_send_IPI_allbutself(unsigned int vector);
>>>   +extern int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
>>> +    int trigger, int polarity);
>>> +
>>>   #else /* !CONFIG_X86_LOCAL_APIC */
>>>   static inline void lapic_shutdown(void) { }
>>>   #define local_apic_timer_c2_ok    1
>>> @@ -183,6 +186,11 @@ static inline void apic_intr_mode_init(void) { }
>>>   static inline void lapic_assign_system_vectors(void) { }
>>>   static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { }
>>>   static inline bool apic_needs_pit(void) { return true; }
>>> +static inline int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
>>> +    int trigger, int polarity)
>>> +{
>>> +    return (int)gsi;
>>> +}
>>>   #endif /* !CONFIG_X86_LOCAL_APIC */
>>>     #ifdef CONFIG_X86_X2APIC
>>> diff --git a/arch/x86/include/asm/xen/pci.h b/arch/x86/include/asm/xen/pci.h
>>> index 9015b888edd6..aa8ded61fc2d 100644
>>> --- a/arch/x86/include/asm/xen/pci.h
>>> +++ b/arch/x86/include/asm/xen/pci.h
>>> @@ -5,6 +5,7 @@
>>>   #if defined(CONFIG_PCI_XEN)
>>>   extern int __init pci_xen_init(void);
>>>   extern int __init pci_xen_hvm_init(void);
>>> +extern int __init pci_xen_pvh_init(void);
>>>   #define pci_xen 1
>>>   #else
>>>   #define pci_xen 0
>>> @@ -13,6 +14,10 @@ static inline int pci_xen_hvm_init(void)
>>>   {
>>>   return -1;
>>>   }
>>> +static inline int pci_xen_pvh_init(void)
>>> +{
>>> +    return -1;
>>> +}
>>>   #endif
>>>   #ifdef CONFIG_XEN_PV_DOM0
>>>   int __init pci_xen_initial_domain(void);
>>> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
>>> index 85a3ce2a3666..72c73458c083 100644
>>> --- a/arch/x86/kernel/acpi/boot.c
>>> +++ b/arch/x86/kernel/acpi/boot.c
>>> @@ -749,7 +749,7 @@ static int acpi_register_gsi_pic(struct device *dev, 
>>> u32 gsi,
>>>   }
>>>     #ifdef CONFIG_X86_LOCAL_APIC
>>> -static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
>>> +int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
>>>   int trigger, int polarity)
>>>   {
>>>   int irq = gsi;
>>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>>> index 652cd53e77f6..f056ab5c0a06 100644
>>> --- a/arch/x86/pci/xen.c
>>> +++ b/arch/x86/pci/xen.c
>>> @@ -114,6 +114,21 @@ static int acpi_register_gsi_xen_hvm(struct device 
>>> *dev, u32 gsi,
>>>    false /* no mapping of GSI to PIRQ */);
>>>   }
>>>   +static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,
>>> +    int trigger, int polarity)
>>> +{
>>> +    int irq;
>>> +
>>> +    irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
>>> +    if (irq < 0)
>>> +    return irq;
>>> +
>>> +    if (xen_pvh_add_gsi_irq_map(gsi, irq) == -EEXIST)
>>> +    printk(KERN_INFO "Already map the GSI :%u and IRQ: %d\n", gsi, 
>>> irq);
>>> +
>>> +    return irq;
>>> +}
>>> +
>>>   #ifdef CONFIG_XEN_PV_DOM0
>>>   static int xen_register_gsi(u32 gsi, int triggering, int polarity)
>>>   {
>>> @@ -558,6 +573,12 @@ int __init pci_xen_hvm_init(void)
>>>   return 0;

[PATCH for-4.19 v3] tools/xen-cpuid: switch to use cpu-policy defined names

2024-05-13 Thread Roger Pau Monne
Like it was done recently for libxl, switch to using the auto-generated feature
names by the processing of cpufeatureset.h, this allows removing the open-coded
feature names, and unifies the feature naming with libxl and the hypervisor.

Introduce a newly auto-generated array that contains the feature names indexed
at featureset bit position, otherwise using the existing INIT_FEATURE_NAMES
would require iterating over the array elements until a match with the expected
bit position is found.

Note that leaf names need to be kept, as the current auto-generated data
doesn't contain the leaf names.

Signed-off-by: Roger Pau Monné 
Reviewed-by: Jan Beulich 
---
Changes since v2:
 - Remove __maybe_unused definition from the emulator harness.

Changes since v1:
 - Modify gen-cpuid.py to generate an array of strings with the feature names.
 - Introduce and use __maybe_unused.
---
 tools/include/xen-tools/common-macros.h |   4 +
 tools/misc/xen-cpuid.c  | 320 +++-
 tools/tests/x86_emulator/x86-emulate.h  |   1 -
 xen/tools/gen-cpuid.py  |  26 ++
 4 files changed, 68 insertions(+), 283 deletions(-)

diff --git a/tools/include/xen-tools/common-macros.h 
b/tools/include/xen-tools/common-macros.h
index 60912225cb7a..560528dbc638 100644
--- a/tools/include/xen-tools/common-macros.h
+++ b/tools/include/xen-tools/common-macros.h
@@ -83,6 +83,10 @@
 #define __packed __attribute__((__packed__))
 #endif
 
+#ifndef __maybe_unused
+# define __maybe_unused __attribute__((__unused__))
+#endif
+
 #define container_of(ptr, type, member) ({  \
 typeof(((type *)0)->member) *mptr__ = (ptr);\
 (type *)((char *)mptr__ - offsetof(type, member));  \
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 8893547bebce..2a1ac0ee8326 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -12,282 +12,33 @@
 
 #include 
 
-static uint32_t nr_features;
-
-static const char *const str_1d[32] =
-{
-[ 0] = "fpu",  [ 1] = "vme",
-[ 2] = "de",   [ 3] = "pse",
-[ 4] = "tsc",  [ 5] = "msr",
-[ 6] = "pae",  [ 7] = "mce",
-[ 8] = "cx8",  [ 9] = "apic",
-/* [10] */ [11] = "sysenter",
-[12] = "mtrr", [13] = "pge",
-[14] = "mca",  [15] = "cmov",
-[16] = "pat",  [17] = "pse36",
-[18] = "psn",  [19] = "clflush",
-/* [20] */ [21] = "ds",
-[22] = "acpi", [23] = "mmx",
-[24] = "fxsr", [25] = "sse",
-[26] = "sse2", [27] = "ss",
-[28] = "htt",  [29] = "tm",
-[30] = "ia64", [31] = "pbe",
-};
-
-static const char *const str_1c[32] =
-{
-[ 0] = "sse3",[ 1] = "pclmulqdq",
-[ 2] = "dtes64",  [ 3] = "monitor",
-[ 4] = "ds-cpl",  [ 5] = "vmx",
-[ 6] = "smx", [ 7] = "est",
-[ 8] = "tm2", [ 9] = "ssse3",
-[10] = "cntx-id", [11] = "sdgb",
-[12] = "fma", [13] = "cx16",
-[14] = "xtpr",[15] = "pdcm",
-/* [16] */[17] = "pcid",
-[18] = "dca", [19] = "sse41",
-[20] = "sse42",   [21] = "x2apic",
-[22] = "movebe",  [23] = "popcnt",
-[24] = "tsc-dl",  [25] = "aesni",
-[26] = "xsave",   [27] = "osxsave",
-[28] = "avx", [29] = "f16c",
-[30] = "rdrnd",   [31] = "hyper",
-};
-
-static const char *const str_e1d[32] =
-{
-[ 0] = "fpu",[ 1] = "vme",
-[ 2] = "de", [ 3] = "pse",
-[ 4] = "tsc",[ 5] = "msr",
-[ 6] = "pae",[ 7] = "mce",
-[ 8] = "cx8",[ 9] = "apic",
-/* [10] */   [11] = "syscall",
-[12] = "mtrr",   [13] = "pge",
-[14] = "mca",[15] = "cmov",
-[16] = "fcmov",  [17] = "pse36",
-/* [18] */   [19] = "mp",
-[20] = "nx", /* [21] */
-[22] = "mmx+",   [23] = "mmx",
-[24] = "fxsr",   [25] = "fxsr+",
-[26] = "pg1g",   [27] = "rdtscp",
-/* [28] */   [29] = "lm",
-[30] = "3dnow+", [31] = "3dnow",
-};
-
-static const char *const str_e1c[32] =
-{
-[ 0] = "lahf-lm",[ 1] = "cmp",
-[ 2] = "svm",[ 3] = "extapic",
-[ 4] = "cr8d",   [ 5] = "lzcnt",
-[ 6] = "sse4a",  [ 7] = "msse",
-[ 8] = "3dnowpf",[ 9] = "osvw",
-[10] = "ibs",[11] = "xop",
-[12] = "skinit", [13] = "wdt",
-/* [14] */   [15] = "lwp",
-[16] = "fma4",   [17] = "tce",
-/* [18] */   [19] = "nodeid",
-/* [20] */   [21] = "tbm",
-[22] = "topoext",[23] = "perfctr-core",
-[24] = "perfctr-nb", /* [25] */
-[26] = "dbx",[27] = "perftsc",
-[28] = "pcx-l2i",[29] = "monitorx",
-[30] = "addr-msk-ext",
-};
-
-static const char *const str_7b0[32] =
-{
-[ 0] = "fsgsbase", [ 1] = "tsc-adj",
-[ 2] = "sgx",  [ 3] = "bmi1",
-[ 4] = "hle",  [ 5] = "avx2",
-[ 6] = "fdp-exn",  [ 7] = "smep",
-[ 8] = "bmi2", [ 9] = "erms",
-[10] = "invpcid",  [11] = "rtm",
-[12] = "pqm",  [13] = "depfpp",
-[14] = "mpx",  [15] = "pqe",
-[16] = "avx512f",  [17] = "avx512dq",
-[18] = "rdseed",   [19] =