[linux-linus test] 165185: regressions - FAIL

2021-09-24 Thread osstest service owner
flight 165185 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165185/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-xl-pvshim 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-dom0pvh-xl-amd 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl-credit1 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl-multivcpu 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-dom0pvh-xl-intel 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl-pvhv2-intel 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl-credit2 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl-shadow 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-libvirt 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-libvirt-xsm 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-pair 26 guest-migrate/src_host/dst_host fail REGR. vs. 152332
 test-amd64-amd64-libvirt-pair 26 guest-migrate/src_host/dst_host fail REGR. vs. 152332
 test-amd64-amd64-xl-pvhv2-amd 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-coresched-amd64-xl 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl-xsm 17 guest-saverestore fail REGR. vs. 152332
 test-arm64-arm64-xl  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-credit1  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-libvirt-xsm 13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-xsm  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-credit2  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-thunderx 13 debian-fixup fail REGR. vs. 152332
 test-armhf-armhf-xl-credit1  14 guest-start  fail REGR. vs. 152332

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-xl-rtds 17 guest-saverestore fail REGR. vs. 152332

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check fail baseline untested
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 152332
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 152332
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 152332
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 152332
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 152332
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 152332
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 152332
 test-arm64-arm64-xl-seattle 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-seattle 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-check fail never pass
 test-armhf-armhf-xl 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-vhd 14 migrate-support-check fail never pass
 test-arm64-arm64-xl-vhd 15 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail never pass
 test-armhf-armhf-x

[PATCH 0/1] x86: centralize default APIC id definition

2021-09-24 Thread Alex Olson
I am interested in making the x86 topology seen by guests more flexible.
This patch keeps the original functionality but allows the APIC identifier
seen by guests for each vCPU to be altered more easily in future revisions.

Since the same mapping of vcpu_id to vlapic id is currently preserved,
the existing adjustments for 'logical processors per package' are left 
unchanged.
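
For context, the "same mapping" is the fixed vcpu_id-to-APIC-ID formula that is
currently duplicated across the toolstack and the hypervisor. A rough sketch of
the status quo this series consolidates (the macro is quoted from the hvmloader
patch below; the helper merely paraphrases the other call sites and is not
verbatim code from any of them):

/* tools/firmware/hvmloader/config.h, removed by this series: */
#define LAPIC_ID(vcpu_id)   ((vcpu_id) * 2)

/* Roughly the same assumption reappears in libxl/libacpi when building the
 * MADT and in the hypervisor's CPUID/vlapic code, along the lines of: */
static inline uint32_t default_apic_id(unsigned int vcpu_id)
{
    /* Two APIC IDs per vCPU, matching the existing "logical processors
     * per package" adjustment mentioned above. */
    return vcpu_id * 2;
}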

Alex Olson (1):
  x86: centralize default APIC id definition

 tools/firmware/hvmloader/Makefile  |  2 +-
 tools/firmware/hvmloader/config.h  |  1 -
 tools/firmware/hvmloader/hvmloader.c   |  3 +-
 tools/firmware/hvmloader/mp_tables.c   |  3 +-
 tools/firmware/hvmloader/smp.c |  3 +-
 tools/firmware/hvmloader/topology.c| 54 ++
 tools/firmware/hvmloader/topology.h| 33 
 tools/firmware/hvmloader/util.c|  6 ++-
 tools/include/xenctrl.h|  6 +++
 tools/libacpi/build.c  |  4 +-
 tools/libacpi/libacpi.h|  3 +-
 tools/libs/ctrl/xc_domain.c| 27 +
 tools/libs/light/libxl_x86_acpi.c  |  9 -
 xen/arch/x86/cpuid.c   | 14 +--
 xen/arch/x86/hvm/hvm.c | 36 -
 xen/arch/x86/hvm/vlapic.c  | 18 ++---
 xen/include/asm-x86/hvm/vlapic.h   |  4 +-
 xen/include/public/arch-x86/hvm/save.h |  1 +
 xen/include/public/hvm/hvm_op.h| 17 
 19 files changed, 222 insertions(+), 22 deletions(-)
 create mode 100644 tools/firmware/hvmloader/topology.c
 create mode 100644 tools/firmware/hvmloader/topology.h

-- 
2.25.1




[PATCH 1/1] x86: centralize default APIC id definition

2021-09-24 Thread Alex Olson
Inspired by an earlier attempt by Chao Gao ,
this revision aims to put the hypervisor in control of x86 APIC identifier
definition instead of hard-coding a formula in multiple places
(libxl, hvmloader, hypervisor).

This is intended as a first step toward exposing/altering CPU topology
seen by guests.

Changes:

- Add field to vlapic for holding default ID (on reset)

- add HVMOP_get_vcpu_topology_id hypercall so libxl (for PVH domains)
  can access APIC ids needed for ACPI table definition prior to domain start.

- For HVM guests, hvmloader now also uses the same hypercall.

- Make CPUID code use vlapic ID instead of hard-coded formula
  for runtime reporting to guests
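
To illustrate the hvmloader side of the last two points, here is a minimal
sketch of what the new helper could look like. The real implementation lives in
the added tools/firmware/hvmloader/topology.c, which is not quoted in this
posting, so the argument structure and field names below are illustrative
assumptions; hypercall_hvm_op() and HVMOP_get_vcpu_topology_id are the pieces
named by the series:

/* Illustrative sketch only -- not the code from this series. */
#include "util.h"
#include "hypercall.h"

uint32_t get_topology_id(unsigned int vcpu_id)
{
    struct {                      /* hypothetical argument layout */
        uint64_t domid;
        uint64_t vcpu;
        uint64_t topology_id;     /* filled in by Xen */
    } arg = {
        .domid = DOMID_SELF,
        .vcpu  = vcpu_id,
    };

    /* Ask the hypervisor which APIC ID it assigned to this vCPU... */
    if ( hypercall_hvm_op(HVMOP_get_vcpu_topology_id, &arg) == 0 )
        return arg.topology_id;

    /* ...and fall back to the historical formula if the call fails. */
    return vcpu_id * 2;
}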

Signed-off-by: Alex Olson 
---
 tools/firmware/hvmloader/Makefile  |  2 +-
 tools/firmware/hvmloader/config.h  |  1 -
 tools/firmware/hvmloader/hvmloader.c   |  3 +-
 tools/firmware/hvmloader/mp_tables.c   |  3 +-
 tools/firmware/hvmloader/smp.c |  3 +-
 tools/firmware/hvmloader/topology.c| 54 ++
 tools/firmware/hvmloader/topology.h| 33 
 tools/firmware/hvmloader/util.c|  6 ++-
 tools/include/xenctrl.h|  6 +++
 tools/libacpi/build.c  |  4 +-
 tools/libacpi/libacpi.h|  3 +-
 tools/libs/ctrl/xc_domain.c| 27 +
 tools/libs/light/libxl_x86_acpi.c  |  9 -
 xen/arch/x86/cpuid.c   | 14 +--
 xen/arch/x86/hvm/hvm.c | 36 -
 xen/arch/x86/hvm/vlapic.c  | 18 ++---
 xen/include/asm-x86/hvm/vlapic.h   |  4 +-
 xen/include/public/arch-x86/hvm/save.h |  1 +
 xen/include/public/hvm/hvm_op.h| 17 
 19 files changed, 222 insertions(+), 22 deletions(-)
 create mode 100644 tools/firmware/hvmloader/topology.c
 create mode 100644 tools/firmware/hvmloader/topology.h

diff --git a/tools/firmware/hvmloader/Makefile 
b/tools/firmware/hvmloader/Makefile
index e980ce7c5f..158f62b4e6 100644
--- a/tools/firmware/hvmloader/Makefile
+++ b/tools/firmware/hvmloader/Makefile
@@ -29,7 +29,7 @@ CFLAGS += $(CFLAGS_xeninclude)
 CFLAGS += -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__
 
 OBJS  = hvmloader.o mp_tables.o util.o smbios.o 
-OBJS += smp.o cacheattr.o xenbus.o vnuma.o
+OBJS += smp.o cacheattr.o xenbus.o vnuma.o topology.o
 OBJS += e820.o pci.o pir.o ctype.o
 OBJS += hvm_param.o
 OBJS += ovmf.o seabios.o
diff --git a/tools/firmware/hvmloader/config.h 
b/tools/firmware/hvmloader/config.h
index 844120bc87..91d73c9086 100644
--- a/tools/firmware/hvmloader/config.h
+++ b/tools/firmware/hvmloader/config.h
@@ -50,7 +50,6 @@ extern uint8_t ioapic_version;
 #define IOAPIC_ID   0x01
 
 #define LAPIC_BASE_ADDRESS  0xfee0
-#define LAPIC_ID(vcpu_id)   ((vcpu_id) * 2)
 
 #define PCI_ISA_DEVFN   0x08/* dev 1, fn 0 */
 #define PCI_ISA_IRQ_MASK0x0c20U /* ISA IRQs 5,10,11 are PCI connected */
diff --git a/tools/firmware/hvmloader/hvmloader.c 
b/tools/firmware/hvmloader/hvmloader.c
index c58841e5b5..250e6779b1 100644
--- a/tools/firmware/hvmloader/hvmloader.c
+++ b/tools/firmware/hvmloader/hvmloader.c
@@ -25,6 +25,7 @@
 #include "pci_regs.h"
 #include "apic_regs.h"
 #include "vnuma.h"
+#include "topology.h"
 #include 
 #include 
 #include 
@@ -225,7 +226,7 @@ static void apic_setup(void)
 
 /* 8259A ExtInts are delivered through IOAPIC pin 0 (Virtual Wire Mode). */
 ioapic_write(0x10, APIC_DM_EXTINT);
-ioapic_write(0x11, SET_APIC_ID(LAPIC_ID(0)));
+ioapic_write(0x11, SET_APIC_ID(get_topology_id(0)));
 }
 
 struct bios_info {
diff --git a/tools/firmware/hvmloader/mp_tables.c 
b/tools/firmware/hvmloader/mp_tables.c
index d207ecbf00..562961ca0b 100644
--- a/tools/firmware/hvmloader/mp_tables.c
+++ b/tools/firmware/hvmloader/mp_tables.c
@@ -29,6 +29,7 @@
 
 #include 
 #include "config.h"
+#include "topology.h"
 
 /* number of non-processor MP table entries */
 #define NR_NONPROC_ENTRIES 18
@@ -199,7 +200,7 @@ static void fill_mp_config_table(struct mp_config_table 
*mpct, int length)
 static void fill_mp_proc_entry(struct mp_proc_entry *mppe, int vcpu_id)
 {
 mppe->type = ENTRY_TYPE_PROCESSOR;
-mppe->lapic_id = LAPIC_ID(vcpu_id);
+mppe->lapic_id = get_topology_id(vcpu_id);
 mppe->lapic_version = 0x11;
 mppe->cpu_flags = CPU_FLAG_ENABLED;
 if ( vcpu_id == 0 )
diff --git a/tools/firmware/hvmloader/smp.c b/tools/firmware/hvmloader/smp.c
index 082b17f138..5091ff1f1f 100644
--- a/tools/firmware/hvmloader/smp.c
+++ b/tools/firmware/hvmloader/smp.c
@@ -22,6 +22,7 @@
 #include "util.h"
 #include "config.h"
 #include "apic_regs.h"
+#include "topology.h"
 
 #define AP_BOOT_EIP 0x1000
 extern char ap_boot_start[], ap_boot_end[];
@@ -86,7 +87,7 @@ static void lapic_wait_ready(void)
 
 static void boot_cpu(unsigned int cpu)
 {
-unsigned int icr2 = SET_APIC_DEST_FIELD(LAPIC_ID(cpu));
+unsigned int icr2 = SET_APIC_DEST_FIELD(get_topology_id(cpu));
 
 /* Initialise shared variables.

Re: [future abi] [RFC PATCH V3] xen/gnttab: Store frame GFN in struct page_info on Arm

2021-09-24 Thread Julien Grall

Hi Roger,

On 24/09/2021 21:10, Roger Pau Monné wrote:

On Fri, Sep 24, 2021 at 07:52:24PM +0500, Julien Grall wrote:

Hi Roger,

On 24/09/2021 13:41, Roger Pau Monné wrote:

On Thu, Sep 23, 2021 at 09:59:26PM +0100, Andrew Cooper wrote:

On 23/09/2021 20:32, Oleksandr Tyshchenko wrote:

Suggested-by: Julien Grall 
Signed-off-by: Oleksandr Tyshchenko 
---
You can find the related discussions at:
https://lore.kernel.org/xen-devel/93d0df14-2c8a-c2e3-8c51-544121901...@xen.org/
https://lore.kernel.org/xen-devel/1628890077-12545-1-git-send-email-olekst...@gmail.com/
https://lore.kernel.org/xen-devel/1631652245-30746-1-git-send-email-olekst...@gmail.com/

! Please note, there is still an unresolved locking question here for which
I failed to find a suitable solution. So, it is still an RFC !


Just FYI, I thought I'd share some of the plans for ABI v2.  Obviously
these plans are future work and don't solve the current problem.

Guests mapping Xen pages is backwards.  There are reasons why it was
used for x86 PV guests, but the entire interface should have been designed
differently for x86 HVM.

In particular, Xen should be mapping guest RAM, rather than the guest
manipulating the 2nd stage tables to map Xen RAM.  Amongst other things,
it's far, far lower overhead.


A much better design is one where the grant table looks like an MMIO
device.  The domain builder decides the ABI (v1 vs v2 - none of this
dynamic switch at runtime nonsense), and picks a block of guest physical
addresses, which are registered with Xen.  This forms the grant table,
status table (v2 only), and holes to map into.
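
(Purely as an illustration of the design described above -- no such interface
exists today and every name below is made up -- the domain builder would pin
down a guest-physical layout along these lines and register it with Xen,
instead of the guest mapping Xen-owned frames itself:)

/* Hypothetical, illustrative only. */
struct gnttab_guest_layout {
    unsigned int abi_version;      /* 1 or 2, fixed by the domain builder */
    uint64_t     gnttab_gfn;       /* guest frames backing the grant table */
    unsigned int gnttab_frames;
    uint64_t     status_gfn;       /* status frames, ABI v2 only */
    unsigned int status_frames;
    uint64_t     map_window_gfn;   /* the "holes" Xen maps granted pages into */
    unsigned int map_window_frames;
};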


I think this could be problematic for identity mapped Arm dom0, as
IIRC in that case grants are mapped so that gfn == mfn in order to
account for the lack of an IOMMU. You could use a bounce buffer, but
that would introduce a big performance penalty.


Or you could find a hole that is outside of the RAM regions. This is not
trivial but not impossible (see [1]).


I am certainly not familiar with the Arm identity map.

If you map them at random areas (so no longer identity mapped), how do
you pass the addresses to the physical devices for DMA operations? I
assume there must be some kind of translation then that converts from
gfn to mfn in order to cope with the lack of an IOMMU, 


For grant mapping, the hypercall will return the machine address in 
dev_bus_addr. Dom0 will keep the conversion dom0 GFN <-> MFN for later 
use in the swiotlb.


For foreign mappings, AFAICT, we expect them to bounce every time. 
But DMA into a foreign mapping should be rarer.



and because
dom0 doesn't know the mfn of the grant reference in order to map it at
the same gfn.


IIRC, we tried an approach where the grant mapping would be direct 
mapped in dom0. However, this was an issue on arm32 because Debian was 
(is?) using short descriptor page tables. This didn't allow dom0 to 
cover all the mappings and therefore some mappings would not be accessible.


--
Julien Grall



Re: [PATCH v2 11/11] xen/arm: Process pending vPCI map/unmap operations

2021-09-24 Thread Stefano Stabellini
+ x86 maintainers

On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> vPCI may map and unmap PCI device memory (BARs) being passed through which
> may take a lot of time. For this those operations may be deferred to be
> performed later, so that they can be safely preempted.
> Run the corresponding vPCI code while switching a vCPU.
> 
> Signed-off-by: Oleksandr Andrushchenko 

From an ARM point of view, I think the code change is OK.

Only one note: it would be good to add to the commit message a short
list of relevant TODOs which in particular affect this patch.

Something like:

Please be aware that there are a few outstanding TODOs affecting this
code path, see xen/drivers/vpci/header.c:map_range and
xen/drivers/vpci/header.c:vpci_process_pending.


> ---
> Since v1:
>  - Moved the check for pending vpci work from the common IOREQ code
>    to hvm_do_resume on x86
>  - Re-worked the code for Arm to ensure we don't miss pending vPCI work
> ---
>  xen/arch/arm/traps.c   | 13 +
>  xen/arch/x86/hvm/hvm.c |  6 ++
>  xen/common/ioreq.c |  9 -
>  3 files changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> index 219ab3c3fbde..b246f51086e3 100644
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -2304,6 +2305,18 @@ static bool check_for_vcpu_work(void)
>  }
>  #endif
>  
> +if ( has_vpci(v->domain) )
> +{
> +bool pending;
> +
> +local_irq_enable();
> +pending = vpci_process_pending(v);
> +local_irq_disable();
> +
> +if ( pending )
> +return true;
> +}
> +
>  if ( likely(!v->arch.need_flush_to_ram) )
>  return false;
>  
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 7b48a1b925bb..d32f5d572941 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -549,6 +549,12 @@ void hvm_do_resume(struct vcpu *v)
>  if ( !vcpu_ioreq_handle_completion(v) )
>  return;
>  
> +if ( has_vpci(v->domain) && vpci_process_pending(v) )
> +{
> +raise_softirq(SCHEDULE_SOFTIRQ);
> +return;
> +}
> +
>  if ( unlikely(v->arch.vm_event) )
>  hvm_vm_event_do_resume(v);
>  
> diff --git a/xen/common/ioreq.c b/xen/common/ioreq.c
> index d732dc045df9..689d256544c8 100644
> --- a/xen/common/ioreq.c
> +++ b/xen/common/ioreq.c
> @@ -25,9 +25,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
> -#include 
>  
>  #include 
>  #include 
> @@ -212,19 +210,12 @@ static bool wait_for_io(struct ioreq_vcpu *sv, ioreq_t 
> *p)
>  
>  bool vcpu_ioreq_handle_completion(struct vcpu *v)
>  {
> -struct domain *d = v->domain;
>  struct vcpu_io *vio = &v->io;
>  struct ioreq_server *s;
>  struct ioreq_vcpu *sv;
>  enum vio_completion completion;
>  bool res = true;
>  
> -if ( has_vpci(d) && vpci_process_pending(v) )
> -{
> -raise_softirq(SCHEDULE_SOFTIRQ);
> -return false;
> -}
> -
>  while ( (sv = get_pending_vcpu(v, &s)) != NULL )
>  if ( !wait_for_io(sv, get_ioreq(s, v)) )
>  return false;
> -- 
> 2.25.1
> 



[xen-unstable test] 165183: regressions - FAIL

2021-09-24 Thread osstest service owner
flight 165183 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165183/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-prev 6 xen-build fail REGR. vs. 164945
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-shadow 20 guest-start/debianhvm.repeat fail REGR. vs. 164945
 test-arm64-arm64-libvirt-raw 17 guest-start/debian.repeat fail REGR. vs. 164945

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-migrupgrade  1 build-check(1)   blocked  n/a
 test-amd64-i386-migrupgrade   1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 164945
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 164945
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 164945
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 164945
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 164945
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check fail like 164945
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 164945
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 164945
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 164945
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 164945
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 164945
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 164945
 test-arm64-arm64-xl-seattle 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-seattle 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-i386-xl-pvshim 14 guest-start fail never pass
 test-amd64-i386-libvirt 15 migrate-support-check fail never pass
 test-amd64-i386-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2 16 saverestore-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-raw 14 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-check fail never pass
 test-arm64-arm64-xl-vhd 14 migrate-support-check fail never pass
 test-arm64-arm64-xl-vhd 15 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfa

Re: [PATCH v2 10/11] xen/arm: Do not map PCI ECAM and MMIO space to Domain-0's p2m

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> PCI host bridges are special devices in terms of implementing PCI
> passthrough. According to [1] the current implementation depends on
> Domain-0 to perform the initialization of the relevant PCI host
> bridge hardware and perform PCI device enumeration. In order to
> achieve that one of the required changes is to not map all the memory
> ranges in map_range_to_domain as we traverse the device tree on startup
> and perform some additional checks if the range needs to be mapped to
> Domain-0.
> 
> The generic PCI host controller device tree binding says [2]:
> - ranges: As described in IEEE Std 1275-1994, but must provide
>   at least a definition of non-prefetchable memory. One
>   or both of prefetchable Memory and IO Space may also
>   be provided.
> 
> - reg   : The Configuration Space base address and size, as accessed
>   from the parent bus.  The base address corresponds to
>   the first bus in the "bus-range" property.  If no
>   "bus-range" is specified, this will be bus 0 (the default).
> 
> From the above, none of the memory ranges from the "ranges" property
> needs to be mapped to Domain-0 at startup as MMIO mapping is going to
> be handled dynamically by vPCI as we assign PCI devices, e.g. each
> device assigned to Domain-0/guest will have its MMIOs mapped/unmapped
> as needed by Xen.
> 
> The "reg" property covers not only the ECAM space, but may also contain
> ranges other than the configuration memory described, for example [3]:
> - reg: Should contain rc_dbi, config registers location and length.
> - reg-names: Must include the following entries:
>"rc_dbi": controller configuration registers;
>"config": PCIe configuration space registers.
> 
> This patch makes it possible to not map all the ranges from the
> "ranges" property and also ECAM from the "reg". All the rest from the
> "reg" property still needs to be mapped to Domain-0, so the PCI
> host bridge remains functional in Domain-0.
> 
> [1] https://lists.xenproject.org/archives/html/xen-devel/2020-07/msg00777.html
> [2] 
> https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/host-generic-pci.txt
> [3] 
> https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/hisilicon-pcie.txt
> 
> Signed-off-by: Oleksandr Andrushchenko 
> 
> ---
> Since v1:
>  - Added better description of why and what needs to be mapped into
>    Domain-0's p2m and what doesn't
>  - Do not do any mappings for PCI devices while traversing the DT
>  - Walk all the bridges and make required mappings in one go
> ---
>  xen/arch/arm/domain_build.c| 38 +++
>  xen/arch/arm/pci/ecam.c| 14 +
>  xen/arch/arm/pci/pci-host-common.c | 48 ++
>  xen/arch/arm/pci/pci-host-zynqmp.c |  1 +
>  xen/include/asm-arm/pci.h  |  9 ++
>  xen/include/asm-arm/setup.h| 13 
>  6 files changed, 111 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> index 83ab0d52cce9..e72c1b881cae 100644
> --- a/xen/arch/arm/domain_build.c
> +++ b/xen/arch/arm/domain_build.c
> @@ -10,7 +10,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -47,12 +46,6 @@ static int __init parse_dom0_mem(const char *s)
>  }
>  custom_param("dom0_mem", parse_dom0_mem);
>  
> -struct map_range_data
> -{
> -struct domain *d;
> -p2m_type_t p2mt;
> -};
> -
>  /* Override macros from asm/page.h to make them work with mfn_t */
>  #undef virt_to_mfn
>  #define virt_to_mfn(va) _mfn(__virt_to_mfn(va))
> @@ -1388,9 +1381,8 @@ static int __init map_dt_irq_to_domain(const struct 
> dt_device_node *dev,
>  return 0;
>  }
>  
> -static int __init map_range_to_domain(const struct dt_device_node *dev,
> -  u64 addr, u64 len,
> -  void *data)
> +int __init map_range_to_domain(const struct dt_device_node *dev,
> +   u64 addr, u64 len, void *data)
>  {
>  struct map_range_data *mr_data = data;
>  struct domain *d = mr_data->d;
> @@ -1417,6 +1409,13 @@ static int __init map_range_to_domain(const struct 
> dt_device_node *dev,
>  }
>  }
>  
> +#ifdef CONFIG_HAS_PCI
> +if ( is_pci_passthrough_enabled() &&
> + (device_get_class(dev) == DEVICE_PCI) &&
> + !mr_data->map_pci_bridge )
> +need_mapping = false;
> +#endif

With the change I suggested below turning map_pci_bridge into
skip_mapping, then this check could go away if we just set need_mapping
as follows:

bool need_mapping = !dt_device_for_passthrough(dev) &&
!mr_data->skip_mapping;
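
(A sketch of how the two suggestions combine, assuming map_pci_bridge is
renamed to skip_mapping; note that the skip_mapping field itself comes from
this series, only d and p2mt exist in the current tree:)

struct map_range_data
{
    struct domain *d;
    p2m_type_t p2mt;
    /* Set by the caller when the range must not be mapped to the domain. */
    bool skip_mapping;
};

int __init map_range_to_domain(const struct dt_device_node *dev,
                               u64 addr, u64 len, void *data)
{
    struct map_range_data *mr_data = data;
    struct domain *d = mr_data->d;
    bool need_mapping = !dt_device_for_passthrough(dev) &&
                        !mr_data->skip_mapping;

    /* ... the rest of the function is unchanged ... */
}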


>  if ( need_mapping )
>  {
>  res = map_regions_p2mt(d,
> @@ -1450,7 +1449,11 @@ static int __init map_device_children(struct domain *d,
>   

Re: [PATCH v2 09/11] xen/arm: Setup MMIO range trap handlers for hardware domain

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> In order for vPCI to work it needs to maintain guest and hardware
> domain's views on the configuration space. For example, BARs and
 ^ of

> COMMAND registers require emulation for guests and the guest view
> on the registers needs to be in sync with the real contents of the
  ^ of

> relevant registers. For that ECAM address space needs to also be
> trapped for the hardware domain, so we need to implement PCI host
> bridge specific callbacks to properly setup MMIO handlers for those
> ranges depending on particular host bridge implementation.
> 
> Signed-off-by: Oleksandr Andrushchenko 

The patch looks pretty good, only a couple of minor comments below


> ---
> Since v1:
>  - Dynamically calculate the number of MMIO handlers required for vPCI
>    and update the total number accordingly
>  - s/clb/cb
>  - Do not introduce a new callback for MMIO handler setup
> ---
>  xen/arch/arm/domain.c  |  2 ++
>  xen/arch/arm/pci/pci-host-common.c | 28 +
>  xen/arch/arm/vpci.c| 33 ++
>  xen/arch/arm/vpci.h|  6 ++
>  xen/include/asm-arm/pci.h  |  7 +++
>  5 files changed, 76 insertions(+)
> 
> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index 854e8fed0393..c7b25bc70439 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -733,6 +733,8 @@ int arch_domain_create(struct domain *d,
>  if ( (rc = domain_vgic_register(d, &count)) != 0 )
>  goto fail;
>  
> +count += domain_vpci_get_num_mmio_handlers(d);
> +
>  if ( (rc = domain_io_init(d, count + MAX_IO_HANDLER)) != 0 )
>  goto fail;
>  
> diff --git a/xen/arch/arm/pci/pci-host-common.c 
> b/xen/arch/arm/pci/pci-host-common.c
> index 1567b6e2956c..155f2a2743af 100644
> --- a/xen/arch/arm/pci/pci-host-common.c
> +++ b/xen/arch/arm/pci/pci-host-common.c
> @@ -300,6 +300,34 @@ struct dt_device_node *pci_find_host_bridge_node(struct 
> device *dev)
>  }
>  return bridge->dt_node;
>  }
> +
> +int pci_host_iterate_bridges(struct domain *d,
> + int (*cb)(struct domain *d,
> +   struct pci_host_bridge *bridge))
> +{
> +struct pci_host_bridge *bridge;
> +int err;
> +
> +list_for_each_entry( bridge, &pci_host_bridges, node )
> +{
> +err = cb(d, bridge);
> +if ( err )
> +return err;
> +}
> +return 0;
> +}
> +
> +int pci_host_get_num_bridges(void)
> +{
> +struct pci_host_bridge *bridge;
> +int count = 0;
> +
> +list_for_each_entry( bridge, &pci_host_bridges, node )
> +count++;
> +
> +return count;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
> index 76c12b92814f..14947e975d69 100644
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -80,17 +80,50 @@ static const struct mmio_handler_ops vpci_mmio_handler = {
>  .write = vpci_mmio_write,
>  };
>  
> +static int vpci_setup_mmio_handler(struct domain *d,
> +   struct pci_host_bridge *bridge)
> +{
> +struct pci_config_window *cfg = bridge->cfg;
> +
> +register_mmio_handler(d, &vpci_mmio_handler,
> +  cfg->phys_addr, cfg->size, NULL);
> +return 0;
> +}
> +
>  int domain_vpci_init(struct domain *d)
>  {
>  if ( !has_vpci(d) )
>  return 0;
>  
> +if ( is_hardware_domain(d) )
> +return pci_host_iterate_bridges(d, vpci_setup_mmio_handler);
> +
> +/* Guest domains use what is programmed in their device tree. */
>  register_mmio_handler(d, &vpci_mmio_handler,
>GUEST_VPCI_ECAM_BASE, GUEST_VPCI_ECAM_SIZE, NULL);
>  
>  return 0;
>  }
>  
> +int domain_vpci_get_num_mmio_handlers(struct domain *d)
> +{
> +int count = 0;
> +
> +if ( is_hardware_domain(d) )
> +/* For each PCI host bridge's configuration space. */
> +count += pci_host_get_num_bridges();

NIT: why += instead of = ?


> +else
> +/*
> + * VPCI_MSIX_MEM_NUM handlers for MSI-X tables per each PCI device
> + * being passed through. Maximum number of supported devices
> + * is 32 as virtual bus topology emulates the devices as embedded
> + * endpoints.
> + * +1 for a single emulated host bridge's configuration space. */
> +count = VPCI_MSIX_MEM_NUM * 32 + 1;
> +
> +return count;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/arm/vpci.h b/xen/arch/arm/vpci.h
> index d8a7b0e3e802..27a2b069abd2 100644
> --- a/xen/arch/arm/vpci.h
> +++ b/xen/arch/arm/vpci.h
> @@ -17,11 +17,17 @@
>  
>  #ifdef CONFIG_HAS_VPCI
>  int domain_vpci_init(struct domain *d);
> +int domain_vpci_get_num_mmio_handlers(struct domain *d);
>  #else
>  static inline int domain_vpci_in

Re: [PATCH v2 08/11] libxl: Only map legacy PCI IRQs if they are supported

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> Arm's PCI passthrough implementation doesn't support legacy interrupts,
> only MSI/MSI-X. This can be the case for other platforms too.
> For that reason, introduce a new CONFIG_PCI_SUPP_LEGACY_IRQ, add it to
> the CFLAGS, and compile the relevant code in the toolstack only if
> applicable.
> 
> Signed-off-by: Oleksandr Andrushchenko 
> Cc: Ian Jackson 
> Cc: Juergen Gross 
>
> ---
> Since v1:
>  - Minimized #ifdef-ery by introducing the pci_supp_legacy_irq function
>    for relevant checks
> ---
>  tools/libs/light/Makefile|  4 
>  tools/libs/light/libxl_pci.c | 13 +
>  2 files changed, 17 insertions(+)
> 
> diff --git a/tools/libs/light/Makefile b/tools/libs/light/Makefile
> index 7d8c51d49242..bd3f6be2a183 100644
> --- a/tools/libs/light/Makefile
> +++ b/tools/libs/light/Makefile
> @@ -46,6 +46,10 @@ CFLAGS += -Wno-format-zero-length -Wmissing-declarations \
>   -Wno-declaration-after-statement -Wformat-nonliteral
>  CFLAGS += -I.
>  
> +ifeq ($(CONFIG_X86),y)
> +CFLAGS += -DCONFIG_PCI_SUPP_LEGACY_IRQ
> +endif

This patch is a lot better than the previous version, thanks!

I think the usage of pci_supp_legacy_irq below is good and we can't do
better than that.

I wonder if there is a better way than the above to export
CONFIG_PCI_SUPP_LEGACY_IRQ. Suggestions?


>  SRCS-$(CONFIG_X86) += libxl_cpuid.c
>  SRCS-$(CONFIG_X86) += libxl_x86.c
>  SRCS-$(CONFIG_X86) += libxl_psr.c
> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> index 59f3686fc85e..4c2d7aeefbb2 100644
> --- a/tools/libs/light/libxl_pci.c
> +++ b/tools/libs/light/libxl_pci.c
> @@ -1364,6 +1364,15 @@ static void pci_add_timeout(libxl__egc *egc, 
> libxl__ev_time *ev,
>  pci_add_dm_done(egc, pas, rc);
>  }
>  
> +static bool pci_supp_legacy_irq(void)
> +{
> +#ifdef CONFIG_PCI_SUPP_LEGACY_IRQ
> +return true;
> +#else
> +return false;
> +#endif
> +}
> +
>  static void pci_add_dm_done(libxl__egc *egc,
>  pci_add_state *pas,
>  int rc)
> @@ -1434,6 +1443,8 @@ static void pci_add_dm_done(libxl__egc *egc,
>  }
>  }
>  fclose(f);
> +if (!pci_supp_legacy_irq())
> +goto out_no_irq;
>  sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain,
>  pci->bus, pci->dev, pci->func);
>  f = fopen(sysfs_path, "r");
> @@ -1983,6 +1994,8 @@ static void do_pci_remove(libxl__egc *egc, 
> pci_remove_state *prs)
>  }
>  fclose(f);
>  skip1:
> +if (!pci_supp_legacy_irq())
> +goto skip_irq;
>  sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain,
> pci->bus, pci->dev, pci->func);
>  f = fopen(sysfs_path, "r");
> -- 
> 2.25.1
> 



Re: [PATCH v2 07/11] libxl: Allow removing PCI devices for all types of domains

2021-09-24 Thread Stefano Stabellini
+x86 maintainers

On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> The PCI device remove path may now be used by PVH on ARM, so the
> assert is no longer valid.
> 
> Signed-off-by: Oleksandr Andrushchenko 
> Cc: Ian Jackson 
> Cc: Juergen Gross 
>
> ---
>  tools/libs/light/libxl_pci.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> index 1a1c2630803b..59f3686fc85e 100644
> --- a/tools/libs/light/libxl_pci.c
> +++ b/tools/libs/light/libxl_pci.c
> @@ -1947,8 +1947,6 @@ static void do_pci_remove(libxl__egc *egc, 
> pci_remove_state *prs)
>  goto out_fail;
>  }
>  } else {
> -assert(type == LIBXL_DOMAIN_TYPE_PV);

This is fine for ARM, but is it OK from an x86 point of view considering
the PVH implications?


>  char *sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/resource", 
> pci->domain,
>   pci->bus, pci->dev, pci->func);
>  FILE *f = fopen(sysfs_path, "r");
> -- 
> 2.25.1
> 



Re: Xen Rust VirtIO demos work breakdown for Project Stratos

2021-09-24 Thread Marek Marczykowski-Górecki
On Fri, Sep 24, 2021 at 05:02:46PM +0100, Alex Bennée wrote:
> Hi,

Hi,

> 2.1 Stable ABI for foreignmemory mapping to non-dom0 ([STR-57])
> ───
> 
>   Currently the foreign memory mapping support only works for dom0 due
>   to reference counting issues. If we are to support backends running in
>   their own domains this will need to get fixed.
> 
>   Estimate: 8w
> 
> 
> [STR-57] 

I'm pretty sure it was discussed before, but I can't find the relevant
(part of the) thread right now: does your model assume the backend (running
outside of dom0) will gain the ability to map (or access in another way)
_arbitrary_ memory pages of a frontend domain? Or worse: of any domain?
That is a significant regression in terms of the security model Xen
provides. It would give the backend domain _a lot more_ control over the
system than it normally has with Xen PV drivers - negating a significant
part of the security benefits of using driver domains.

So, does the above require the frontend to agree (explicitly or implicitly)
to the backend accessing specific pages? Several approaches to that have
been discussed, including using grant tables (as PV drivers do), a vIOMMU(?),
or even a drastically different model with no shared memory at all (Argo).
Can you clarify which (if any) approach your VirtIO-on-Xen work will use?

A more general idea: can we collect info on various VirtIO on Xen
approaches (since there is more than one) in a single place, including:
 - key characteristics, differences
 - who is involved
 - status
 - links to relevant threads, maybe

I'd propose to revive https://wiki.xenproject.org/wiki/Virtio_On_Xen

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v2 06/11] xen/domain: Call pci_release_devices() when releasing domain resources

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Tyshchenko 
> 
> This is the very same thing we already do for DT devices. Moreover, x86
> already calls pci_release_devices().
> 
> Signed-off-by: Oleksandr Tyshchenko 

Reviewed-by: Stefano Stabellini 


> ---
> Since v1:
>  - re-wording in the commit message
> ---
>  xen/arch/arm/domain.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index f7ed130023d5..854e8fed0393 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -985,7 +985,8 @@ static int relinquish_memory(struct domain *d, struct 
> page_list_head *list)
>   * function which may return -ERESTART.
>   */
>  enum {
> -PROG_tee = 1,
> +PROG_pci = 1,
> +PROG_tee,
>  PROG_xen,
>  PROG_page,
>  PROG_mapping,
> @@ -1022,6 +1023,12 @@ int domain_relinquish_resources(struct domain *d)
>  #ifdef CONFIG_IOREQ_SERVER
>  ioreq_server_destroy_all(d);
>  #endif
> +#ifdef CONFIG_HAS_PCI
> +PROGRESS(pci):
> +ret = pci_release_devices(d);
> +if ( ret )
> +return ret;
> +#endif
>  
>  PROGRESS(tee):
>  ret = tee_relinquish_resources(d);
> -- 
> 2.25.1
> 



Re: [PATCH v2 05/11] xen/arm: Mark device as PCI while creating one

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> While adding a PCI device mark it as such, so other frameworks
> can distinguish it form DT devices.
 ^ from

> Signed-off-by: Oleksandr Andrushchenko 

Reviewed-by: Stefano Stabellini 


> ---
> Since v1:
>  - Moved the assignment from iommu_add_device to alloc_pdev
> ---
>  xen/drivers/passthrough/pci.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 633e89ac1311..fc3469bc12dc 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -328,6 +328,9 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, 
> u8 bus, u8 devfn)
>  *((u8*) &pdev->bus) = bus;
>  *((u8*) &pdev->devfn) = devfn;
>  pdev->domain = NULL;
> +#ifdef CONFIG_ARM
> +pci_to_dev(pdev)->type = DEV_PCI;
> +#endif
>  
>  rc = pdev_msi_init(pdev);
>  if ( rc )
> -- 
> 2.25.1
> 



Re: [PATCH v2 04/11] xen/device-tree: Make dt_find_node_by_phandle global

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> Make dt_find_node_by_phandle globally visible, so it can be re-used by
> other frameworks.
> 
> Signed-off-by: Oleksandr Andrushchenko 

Reviewed-by: Stefano Stabellini 


> ---
>  xen/common/device_tree.c  | 2 +-
>  xen/include/xen/device_tree.h | 2 ++
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/common/device_tree.c b/xen/common/device_tree.c
> index ea93da1725f6..4aae281e89bf 100644
> --- a/xen/common/device_tree.c
> +++ b/xen/common/device_tree.c
> @@ -1047,7 +1047,7 @@ int dt_for_each_range(const struct dt_device_node *dev,
>   *
>   * Returns a node pointer.
>   */
> -static struct dt_device_node *dt_find_node_by_phandle(dt_phandle handle)
> +struct dt_device_node *dt_find_node_by_phandle(dt_phandle handle)
>  {
>  struct dt_device_node *np;
>  
> diff --git a/xen/include/xen/device_tree.h b/xen/include/xen/device_tree.h
> index 9069040ef7f7..3334048d3bb5 100644
> --- a/xen/include/xen/device_tree.h
> +++ b/xen/include/xen/device_tree.h
> @@ -850,6 +850,8 @@ int dt_count_phandle_with_args(const struct 
> dt_device_node *np,
>   */
>  int dt_get_pci_domain_nr(struct dt_device_node *node);
>  
> +struct dt_device_node *dt_find_node_by_phandle(dt_phandle handle);
> +
>  #ifdef CONFIG_DEVICE_TREE_DEBUG
>  #define dt_dprintk(fmt, args...)  \
>  printk(XENLOG_DEBUG fmt, ## args)
> -- 
> 2.25.1
> 



Re: [PATCH v2 03/11] xen/arm: Introduce pci_find_host_bridge_node helper

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> Get host bridge node given a PCI device attached to it.
> 
> This helper will be re-used for adding PCI devices by the subsequent
> patches.
> 
> Signed-off-by: Oleksandr Andrushchenko 
> Signed-off-by: Oleksandr Tyshchenko 

Reviewed-by: Stefano Stabellini 


> ---
>  xen/arch/arm/pci/pci-host-common.c | 17 +
>  xen/include/asm-arm/pci.h  |  1 +
>  2 files changed, 18 insertions(+)
> 
> diff --git a/xen/arch/arm/pci/pci-host-common.c 
> b/xen/arch/arm/pci/pci-host-common.c
> index a88f20175ea9..1567b6e2956c 100644
> --- a/xen/arch/arm/pci/pci-host-common.c
> +++ b/xen/arch/arm/pci/pci-host-common.c
> @@ -283,6 +283,23 @@ int pci_get_host_bridge_segment(const struct 
> dt_device_node *node,
>  return -EINVAL;
>  }
>  
> +/*
> + * Get host bridge node given a device attached to it.
> + */
> +struct dt_device_node *pci_find_host_bridge_node(struct device *dev)
> +{
> +struct pci_host_bridge *bridge;
> +struct pci_dev *pdev = dev_to_pci(dev);
> +
> +bridge = pci_find_host_bridge(pdev->seg, pdev->bus);
> +if ( unlikely(!bridge) )
> +{
> +printk(XENLOG_ERR "Unable to find PCI bridge for "PRI_pci"\n",
> +   pdev->seg, pdev->bus, pdev->sbdf.dev, pdev->sbdf.fn);
> +return NULL;
> +}
> +return bridge->dt_node;
> +}
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/include/asm-arm/pci.h b/xen/include/asm-arm/pci.h
> index 9e366ae67e83..5b100556225e 100644
> --- a/xen/include/asm-arm/pci.h
> +++ b/xen/include/asm-arm/pci.h
> @@ -103,6 +103,7 @@ void __iomem *pci_ecam_map_bus(struct pci_host_bridge 
> *bridge,
>  struct pci_host_bridge *pci_find_host_bridge(uint16_t segment, uint8_t bus);
>  int pci_get_host_bridge_segment(const struct dt_device_node *node,
>  uint16_t *segment);
> +struct dt_device_node *pci_find_host_bridge_node(struct device *dev);
>  
>  static always_inline bool is_pci_passthrough_enabled(void)
>  {
> -- 
> 2.25.1
> 



Re: [PATCH v2 02/11] xen/arm: Add new device type for PCI

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> Add new device type (DEV_PCI) to distinguish PCI devices from platform
> DT devices, so some drivers, like IOMMU, can handle PCI devices
> differently.
> 
> Also add a helper which, when given a struct device, returns the
> corresponding struct pci_dev that this device is part of.
> 
> Because of the header cross-dependencies, e.g. we need both
> struct pci_dev and struct arch_pci_dev at the same time, this cannot be
> done with an inline. Macro can be implemented, but looks scary:
> 
>  #define dev_to_pci_dev(dev) container_of((container_of((dev), \
> struct arch_pci_dev, dev), struct pci_dev, arch)
> 
> Signed-off-by: Oleksandr Andrushchenko 

Reviewed-by: Stefano Stabellini 


> ---
> Since v1:
>  - Folded new device type (DEV_PCI) into this patch.
> ---
>  xen/arch/arm/pci/pci.c   | 10 ++
>  xen/include/asm-arm/device.h |  4 ++--
>  xen/include/asm-arm/pci.h|  7 +++
>  3 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/pci/pci.c b/xen/arch/arm/pci/pci.c
> index bb15edbccc90..e0420d0d86c1 100644
> --- a/xen/arch/arm/pci/pci.c
> +++ b/xen/arch/arm/pci/pci.c
> @@ -27,6 +27,16 @@ int arch_pci_clean_pirqs(struct domain *d)
>  return 0;
>  }
>  
> +struct pci_dev *dev_to_pci(struct device *dev)
> +{
> +struct arch_pci_dev *arch_dev;
> +
> +ASSERT(dev->type == DEV_PCI);
> +
> +arch_dev = container_of((dev), struct arch_pci_dev, dev);
> +return container_of(arch_dev, struct pci_dev, arch);
> +}
> +
>  static int __init dt_pci_init(void)
>  {
>  struct dt_device_node *np;
> diff --git a/xen/include/asm-arm/device.h b/xen/include/asm-arm/device.h
> index 64aaa2641b7f..12de217b36b9 100644
> --- a/xen/include/asm-arm/device.h
> +++ b/xen/include/asm-arm/device.h
> @@ -4,6 +4,7 @@
>  enum device_type
>  {
>  DEV_DT,
> +DEV_PCI,
>  };
>  
>  struct dev_archdata {
> @@ -27,8 +28,7 @@ typedef struct device device_t;
>  
>  #include 
>  
> -/* TODO: Correctly implement dev_is_pci when PCI is supported on ARM */
> -#define dev_is_pci(dev) ((void)(dev), 0)
> +#define dev_is_pci(dev) ((dev)->type == DEV_PCI)
>  #define dev_is_dt(dev)  ((dev)->type == DEV_DT)
>  
>  enum device_class
> diff --git a/xen/include/asm-arm/pci.h b/xen/include/asm-arm/pci.h
> index d2728a098a11..9e366ae67e83 100644
> --- a/xen/include/asm-arm/pci.h
> +++ b/xen/include/asm-arm/pci.h
> @@ -27,6 +27,13 @@ struct arch_pci_dev {
>  struct device dev;
>  };
>  
> +/*
> + * Because of the header cross-dependencies, e.g. we need both
> + * struct pci_dev and struct arch_pci_dev at the same time, this cannot be
> + * done with an inline here. Macro can be implemented, but looks scary.
> + */
> +struct pci_dev *dev_to_pci(struct device *dev);
> +
>  /* Arch-specific MSI data for vPCI. */
>  struct vpci_arch_msi {
>  };
> -- 
> 2.25.1
> 



Re: [PATCH v2 01/11] xen/arm: Fix dev_is_dt macro definition

2021-09-24 Thread Stefano Stabellini
On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko 
> 
> This macro is not currently used, but still has an error in it:
> a missing parenthesis. Fix this, so the macro is properly defined.
> 
> Fixes: 6c5d3075d97e ("xen/arm: Introduce a generic way to describe device")
> 
> Signed-off-by: Oleksandr Andrushchenko 

Reviewed-by: Stefano Stabellini 


> ---
> New in v2
> ---
>  xen/include/asm-arm/device.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xen/include/asm-arm/device.h b/xen/include/asm-arm/device.h
> index 582119c31ee0..64aaa2641b7f 100644
> --- a/xen/include/asm-arm/device.h
> +++ b/xen/include/asm-arm/device.h
> @@ -29,7 +29,7 @@ typedef struct device device_t;
>  
>  /* TODO: Correctly implement dev_is_pci when PCI is supported on ARM */
>  #define dev_is_pci(dev) ((void)(dev), 0)
> -#define dev_is_dt(dev)  ((dev->type == DEV_DT)
> +#define dev_is_dt(dev)  ((dev)->type == DEV_DT)
>  
>  enum device_class
>  {
> -- 
> 2.25.1
> 



Re: [PATCH v2 11/17] xen/arm: PCI host bridge discovery within XEN on ARM

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Stefano Stabellini wrote:
> On Fri, 24 Sep 2021, Rahul Singh wrote:
> > Hi Stefano,
> > 
> > > On 23 Sep 2021, at 8:12 pm, Stefano Stabellini  
> > > wrote:
> > > 
> > > On Thu, 23 Sep 2021, Rahul Singh wrote:
> >  +goto err_exit;
> >  +}
> > >>> 
> > >>> This is unnecessary at the moment, right? Can we get rid of ops->init ?
> > >> 
> > >> No, this is required for the N1SDP board. Please check the patch below.
> > >> https://gitlab.com/rahsingh/xen-integration/-/commit/6379ba5764df33d57547087cff4ffc078dc515d5
> > > 
> > > OK
> > > 
> > > 
> >  +int pci_host_common_probe(struct dt_device_node *dev, const void 
> >  *data)
> >  +{
> >  +struct pci_host_bridge *bridge;
> >  +struct pci_config_window *cfg;
> >  +struct pci_ecam_ops *ops;
> >  +const struct dt_device_match *of_id;
> >  +int err;
> >  +
> >  +if ( dt_device_for_passthrough(dev) )
> >  +return 0;
> >  +
> >  +of_id = dt_match_node(dev->dev.of_match_table, dev->dev.of_node);
> >  +ops = (struct pci_ecam_ops *) of_id->data;
> > >>> 
> > >>> Do we really need dt_match_node and dev->dev.of_match_table to get
> > >>> dt_device_match.data?
> > >>> 
> > >> 
> > >>> data is passed as a parameter to pci_host_common_probe, isn't it enough
> > >>> to do:
> > >>> 
> > >>> ops = (struct pci_ecam_ops *) data;
> > >> 
> > >> As of now it is not required, but in the future we might need it if we
> > >> implement other ECAM-supported bridges.
> > >> 
> > >> static const struct dt_device_match gen_pci_dt_match[] = {   
> > >>
> > >>{ .compatible = "pci-host-ecam-generic",  
> > >>   
> > >>  .data =   &pci_generic_ecam_ops },
> > >> 
> > >>{ .compatible = "pci-host-cam-generic",
> > >>  .data = &gen_pci_cfg_cam_bus_ops }, 
> > >> 
> > >>{ },  
> > >>   
> > >> };
> > > 
> > > Even if we add another ECAM-supported bridge, the following:
> > > 
> > > ops = (struct pci_ecam_ops *) data;
> > > 
> > > could still work, right? The probe function will directly receive as
> > > parameter the .data pointer. You shouldn't need the indirection via
> > > dt_match_node?
> > 
> > As per my understanding, the probe function will not get the .data pointer.
> > The probe data argument is NULL in most cases in Xen.
> > Please have a look at the dt_pci_init() -> device_init(...) call flow
> > implementation.
> 
> You are right. Looking at the code, nobody is currently using
> dt_device_match.data and it is clear why: it is not passed to the
> device_desc.init function at all. As it is today, it is basically
> useless.
> 
> And there is only one case where device_init has a non-NULL data
> parameter and it is in xen/drivers/char/arm-uart.c. All the others are
> not even using the data parameter of device_init.
> 
> I think we need to change device_init so that dt_device_match.data can
> be useful. Sorry for the scope-creep but I think we should do the
> following:
> 
> - do not add of_match_table to struct device
> 
> - add one more parameter to device_desc.init:
>   int (*init)(struct dt_device_node *dev, struct device_desc *desc, const 
> void *data);
> 
> - change device_init to call desc->init with the right parameters:
>   desc->init(dev, desc, data);
> 
> This way pci_host_common_probe is just going to get a desc directly as
> parameter. I think it would make a lot more sense from an interface
> perspective. It does require a change in all the DT_DEVICE_START.init
> functions adding a struct device_desc *desc parameter, but it should be
> a mechanical change.
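
(An illustrative sketch of that interface change, based on the current shape of
struct device_desc and device_init() in the Arm device framework; treat the
exact field and parameter names as assumptions rather than a finished patch:)

struct device_desc {
    const char *name;
    enum device_class class;
    const struct dt_device_match *dt_match;
    /*
     * New third parameter: the descriptor itself, so an init function can
     * reach desc->dt_match and therefore dt_device_match.data.
     */
    int (*init)(struct dt_device_node *dev, const struct device_desc *desc,
                const void *data);
};

int __init device_init(struct dt_device_node *dev, enum device_class class,
                       const void *data)
{
    const struct device_desc *desc;

    for ( desc = _sdevice; desc != _edevice; desc++ )
    {
        if ( desc->class != class || !dt_match_node(desc->dt_match, dev) )
            continue;

        ASSERT(desc->init != NULL);
        return desc->init(dev, desc, data);
    }

    return -EBADF;
}

With this, pci_host_common_probe() would receive the descriptor and could look
up its pci_ecam_ops through desc->dt_match instead of carrying an
of_match_table pointer in struct device.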
> 
> Alternatively we could just change device_init to pass
> device_desc.dt_match.data when the data parameter is NULL but it feels
> like a hack.
> 
> 
> What do you think?
 

Another idea that doesn't require a device_desc.init change and also
doesn't require a change to struct device is the following:


diff --git a/xen/arch/arm/pci/pci-host-common.c 
b/xen/arch/arm/pci/pci-host-common.c
index a88f20175e..1aa0ef4c1e 100644
--- a/xen/arch/arm/pci/pci-host-common.c
+++ b/xen/arch/arm/pci/pci-host-common.c
@@ -205,8 +205,7 @@ int pci_host_common_probe(struct dt_device_node *dev, const 
void *data)
 if ( dt_device_for_passthrough(dev) )
 return 0;
 
-of_id = dt_match_node(dev->dev.of_match_table, dev->dev.of_node);
-ops = (struct pci_ecam_ops *) of_id->data;
+ops = (struct pci_ecam_ops *) data;
 
 bridge = pci_alloc_host_bridge();
 if ( !bridge )
diff --git a/xen/arch/arm/pci/pci-host-generic.c 
b/xen/arch/arm/pci/pci-host-generic.c
index 6b3288d6f3..66fb843f49 100644
--- a/xen/arch/arm/pci/pci-host-generic.c
+++ b/xen/arch/arm/pci/pci-host-generic.c
@@ -20,15 +20,19 @@
 #include 
 
 static const struct dt_device_match gen_pci_dt_match[] = {
-{ .compatible = "pci-host-ecam-generic",
-  .data =   &

Re: [PATCH V3 3/3] libxl/arm: Add handling of extended regions for DomU

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko 
> 
> The extended region (safe range) is a region of guest physical
> address space which is unused and could be safely used to create
> grant/foreign mappings instead of wasting real RAM pages from
> the domain memory for establishing these mappings.
> 
> The extended regions are chosen at the domain creation time and
> advertised to it via "reg" property under hypervisor node in
> the guest device-tree. As region 0 is reserved for grant table
> space (always present), the indexes for extended regions are 1...N.
> If extended regions could not be allocated for some reason,
> Xen doesn't fail and behaves as usual, so only inserts region 0.
> 
> Please note the following limitations:
> - The extended region feature is only supported for 64-bit domain
>   currently.
> - The ACPI case is not covered.
> 
> ***
> 
> The algorithm to choose extended regions for a non-direct mapped
> DomU is simpler than the algorithm for a direct mapped
> Dom0. As we have a lot of unused space above 4GB, provide a single
> 2MB-aligned region from the second RAM bank, taking into account
> the maximum supported guest address space size and the amount of
> memory assigned to the guest. The maximum size of the region is 128GB.
> The minimum size is 64MB.
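
(As a worked example of the calculation above, mirroring the variables used in
the diff below and assuming the usual arm64 guest layout with a 3GB RAM bank
below 4GB and a second bank starting at 8GB:)

/*
 * Example: gpaddr_bits = 40 (1TB of guest-physical space), 4GB of guest RAM.
 *
 *   ramsize        = 4GB
 *   bank1size      = ramsize - GUEST_RAM0_SIZE               = 1GB
 *   bank1end_align = GUEST_RAM1_BASE + ALIGN_UP_TO_2MB(1GB)  = 9GB
 *   bank1end_max   = min(1ULL << 40, end of RAM bank 1)      = 1TB
 *
 * The gap between 9GB and 1TB is larger than the 128GB cap, so the guest gets
 * a single extended region at the top of its addressable space:
 *
 *   region_size = GUEST_EXT_REGION_MAX_SIZE                  = 128GB
 *   region_base = bank1end_max - region_size                 = 896GB
 */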
> 
> Suggested-by: Julien Grall 
> Signed-off-by: Oleksandr Tyshchenko 

Reviewed-by: Stefano Stabellini 


> ---
> Changes RFC -> V2:
>- update patch description
>- drop unneeded "extended-region" DT property
>- clear reg array in finalise_ext_region() and add a TODO
> 
> Changes V2 -> V3:
>- update patch description, comments in code
>- only pick up regions with size >= 64MB
>- move the region calculation to make_hypervisor_node() and drop
>  finalise_ext_region()
>- extend the list of arguments for make_hypervisor_node()
>- do not show warning for 32-bit domain
>- change the region alignment from 1GB to 2MB
>- move EXT_REGION_SIZE to public/arch-arm.h
> ---
>  tools/libs/light/libxl_arm.c  | 70 
> +++
>  xen/include/public/arch-arm.h |  3 ++
>  2 files changed, 68 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
> index e3140a6..a67b68e 100644
> --- a/tools/libs/light/libxl_arm.c
> +++ b/tools/libs/light/libxl_arm.c
> @@ -598,9 +598,17 @@ static int make_timer_node(libxl__gc *gc, void *fdt,
>  return 0;
>  }
>  
> +#define ALIGN_UP_TO_2MB(x)   (((x) + MB(2) - 1) & (~(MB(2) - 1)))
> +
>  static int make_hypervisor_node(libxl__gc *gc, void *fdt,
> -const libxl_version_info *vers)
> +const libxl_version_info *vers,
> +const libxl_domain_build_info *b_info,
> +const struct xc_dom_image *dom)
>  {
> +uint64_t region_size = 0, region_base, ramsize, bank1size,
> +bank1end_align, bank1end_max;
> +uint8_t gpaddr_bits;
> +libxl_physinfo physinfo;
>  int res;
>  gic_interrupt intr;
>  
> @@ -615,9 +623,61 @@ static int make_hypervisor_node(libxl__gc *gc, void *fdt,
>"xen,xen");
>  if (res) return res;
>  
> -/* reg 0 is grant table space */
> -res = fdt_property_regs(gc, fdt, GUEST_ROOT_ADDRESS_CELLS, 
> GUEST_ROOT_SIZE_CELLS,
> -1,GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
> +if (strcmp(dom->guest_type, "xen-3.0-aarch64")) {
> +LOG(DEBUG, "The extended regions are only supported for 64-bit guest 
> currently");
> +goto out;
> +}
> +
> +res = libxl_get_physinfo(CTX, &physinfo);
> +assert(!res);
> +
> +gpaddr_bits = physinfo.gpaddr_bits;
> +assert(gpaddr_bits >= 32 && gpaddr_bits <= 48);
> +
> +/*
> + * Try to allocate single 2MB-aligned extended region from the second RAM
> + * bank (above 4GB) taking into the account the maximum supported guest
> + * address space size and the amount of memory assigned to the guest.
> + * As the guest memory layout is not populated yet we cannot rely on
> + * dom->rambank_size[1], so calculate the actual size of the second bank
> + * using "max_memkb" value.
> + */
> +bank1end_max = min(1ULL << gpaddr_bits, GUEST_RAM1_BASE + 
> GUEST_RAM1_SIZE);
> +ramsize = b_info->max_memkb * 1024;
> +if (ramsize <= GUEST_RAM0_SIZE)
> +bank1size = 0;
> +else
> +bank1size = ramsize - GUEST_RAM0_SIZE;
> +bank1end_align = GUEST_RAM1_BASE + ALIGN_UP_TO_2MB(bank1size);
> +
> +if (bank1end_max <= bank1end_align) {
> +LOG(WARN, "The extended region cannot be allocated, not enough 
> space");
> +goto out;
> +}
> +
> +if (bank1end_max - bank1end_align > GUEST_EXT_REGION_MAX_SIZE) {
> +region_base = bank1end_max - GUEST_EXT_REGION_MAX_SIZE;
> +region_size = GUEST_EXT_REGION_MAX_S

Re: [PATCH V3 2/3] xen/arm: Add handling of extended regions for Dom0

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko 
> 
> The extended region (safe range) is a region of guest physical
> address space which is unused and could be safely used to create
> grant/foreign mappings instead of wasting real RAM pages from
> the domain memory for establishing these mappings.
> 
> The extended regions are chosen at domain creation time and
> advertised to the domain via the "reg" property under the hypervisor
> node in the guest device-tree. As region 0 is reserved for grant table
> space (always present), the indexes for extended regions are 1...N.
> If the extended regions cannot be allocated for some reason,
> Xen doesn't fail and behaves as usual, only inserting region 0.
> 
> Please note the following limitations:
> - The extended region feature is only supported for 64-bit domain
>   currently.
> - The ACPI case is not covered.
> 
> ***
> 
> As Dom0 is a direct mapped domain on Arm (i.e. MFN == GFN),
> the algorithm to choose extended regions for it is different
> from the algorithm for a non-direct mapped DomU.
> What is more, the extended regions have to be chosen differently
> depending on whether the IOMMU is enabled or not.
> 
> Provide RAM not assigned to Dom0 if the IOMMU is disabled, or memory
> holes found in the host device-tree otherwise. Make sure that
> extended regions are 2MB-aligned and located within the maximum possible
> addressable physical memory range. The minimum size of an extended
> region is 64MB. The maximum number of extended regions is 128,
> which is an artificial limit to minimize code changes (we reuse
> struct meminfo to describe extended regions, so there is an array
> field with 128 elements).
> 
> It is worth mentioning that the unallocated memory solution (when the IOMMU
> is disabled) only works safely until Dom0 is able to allocate memory
> outside of the original range.
> 
> Also introduce a command line option to globally enable or
> disable support for extended regions for Dom0 (enabled by default).
> 
> Suggested-by: Julien Grall 
> Signed-off-by: Oleksandr Tyshchenko 
> ---
> Please note, we need to decide which approach to use in 
> find_unallocated_memory(),
> you can find details at:
> https://lore.kernel.org/xen-devel/28503e09-44c3-f623-bb8d-8778bb942...@gmail.com/
> 
> Changes RFC -> V2:
>- update patch description
>- drop unneeded "extended-region" DT property
> 
> Changes V2 -> V3:
>- update patch description
>- add comment for "size" calculation in add_ext_regions()
>- clarify "end" calculation in find_unallocated_memory() and
>  find_memory_holes()
>- only pick up regions with size >= 64MB
>- allocate reg dynamically instead of keeping on the stack in
>  make_hypervisor_node()
>- do not show warning for 32-bit domain
>- drop Linux specific limits EXT_REGION_*
>- also cover "ranges" property in find_memory_holes()
>- add command line arg to enable/disable extended region support
> ---
>  docs/misc/xen-command-line.pandoc |   7 +
>  xen/arch/arm/domain_build.c   | 280 
> +-
>  2 files changed, 284 insertions(+), 3 deletions(-)
> 
> diff --git a/docs/misc/xen-command-line.pandoc 
> b/docs/misc/xen-command-line.pandoc
> index 177e656..3bb8eb7 100644
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1081,6 +1081,13 @@ hardware domain is architecture dependent.
>  Note that specifying zero as domU value means zero, while for dom0 it means
>  to use the default.
>  
> +### ext_regions (Arm)
> +> `= `
> +
> +> Default : `true`
> +
> +Flag to globally enable or disable support for extended regions for dom0.

I'd say:

Flag to enable or disable extended regions for Dom0.

Extended regions are ranges of unused address space exposed to Dom0 as
"safe to use" for special memory mappings. Disable if your board device
tree is incomplete.
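
For completeness, with Xen's standard boolean parameter syntax this would be
controlled from the command line with e.g. `ext_regions=0` / `ext_regions=1`
(or the `no-ext_regions` form), assuming the option name from the patch.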


>  ### flask
>  > `= permissive | enforcing | late | disabled`
>  
> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> index d233d63..81997d5 100644
> --- a/xen/arch/arm/domain_build.c
> +++ b/xen/arch/arm/domain_build.c
> @@ -34,6 +34,10 @@
>  static unsigned int __initdata opt_dom0_max_vcpus;
>  integer_param("dom0_max_vcpus", opt_dom0_max_vcpus);
>  
> +/* If true, the extended regions support is enabled for dom0 */
> +static bool __initdata opt_ext_regions = true;
> +boolean_param("ext_regions", opt_ext_regions);
> +
>  static u64 __initdata dom0_mem;
>  static bool __initdata dom0_mem_set;
>  
> @@ -886,6 +890,233 @@ static int __init make_memory_node(const struct domain 
> *d,
>  return res;
>  }
>  
> +static int __init add_ext_regions(unsigned long s, unsigned long e, void 
> *data)
> +{
> +struct meminfo *ext_regions = data;
> +paddr_t start, size;
> +
> +if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
> +return 0;
> +
> +/* Both start and size of the extended region should be 2MB aligned */
> +start =

[qemu-mainline test] 165179: regressions - FAIL

2021-09-24 Thread osstest service owner
flight 165179 qemu-mainline real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165179/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-arm64-arm64-libvirt-raw 17 guest-start/debian.repeat fail REGR. vs. 164950

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-rtds 18 guest-localmigrate fail pass in 165171
 test-amd64-i386-xl-shadow20 guest-localmigrate/x10 fail pass in 165171

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-rtds  20 guest-localmigrate/x10 fail in 165171 like 164950
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 164950
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 164950
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 164950
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 164950
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 164950
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 164950
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 164950
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 164950
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass

version targeted for 

Re: [PATCH v2 11/17] xen/arm: PCI host bridge discovery within XEN on ARM

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Rahul Singh wrote:
> Hi Stefano,
> 
> > On 23 Sep 2021, at 8:12 pm, Stefano Stabellini  
> > wrote:
> > 
> > On Thu, 23 Sep 2021, Rahul Singh wrote:
>  +goto err_exit;
>  +}
> >>> 
> >>> This is unnecessary at the moment, right? Can we get rid of ops->init ?
> >> 
> >> No this is required for N1SDP board. Please check below patch.
> >> https://gitlab.com/rahsingh/xen-integration/-/commit/6379ba5764df33d57547087cff4ffc078dc515d5
> > 
> > OK
> > 
> > 
>  +int pci_host_common_probe(struct dt_device_node *dev, const void *data)
>  +{
>  +struct pci_host_bridge *bridge;
>  +struct pci_config_window *cfg;
>  +struct pci_ecam_ops *ops;
>  +const struct dt_device_match *of_id;
>  +int err;
>  +
>  +if ( dt_device_for_passthrough(dev) )
>  +return 0;
>  +
>  +of_id = dt_match_node(dev->dev.of_match_table, dev->dev.of_node);
>  +ops = (struct pci_ecam_ops *) of_id->data;
> >>> 
> >>> Do we really need dt_match_node and dev->dev.of_match_table to get
> >>> dt_device_match.data?
> >>> 
> >> 
> >>> data is passed as a parameter to pci_host_common_probe, isn't it enough
> >>> to do:
> >>> 
> >>> ops = (struct pci_ecam_ops *) data;
> >> 
> >> As of now not required but in future we might need it if we implement 
> >> other ecam supported bridge
> >> 
> >> static const struct dt_device_match gen_pci_dt_match[] = { 
> >>  
> >>{ .compatible = "pci-host-ecam-generic",
> >> 
> >>  .data =   &pci_generic_ecam_ops },
> >> 
> >>{ .compatible = "pci-host-cam-generic",
> >>  .data = &gen_pci_cfg_cam_bus_ops }, 
> >> 
> >>{ },
> >> 
> >> };
> > 
> > Even if we add another ECAM-supported bridge, the following:
> > 
> > ops = (struct pci_ecam_ops *) data;
> > 
> > could still work, right? The probe function will directly receive as
> > parameter the .data pointer. You shouldn't need the indirection via
> > dt_match_node?
> 
> As per my understanding the probe function will not get the .data pointer.
> The probe data argument is NULL in most of the cases in Xen.
> Please have a look at the dt_pci_init() -> device_init(..) call flow
> implementation.

You are right. Looking at the code, nobody is currently using
dt_device_match.data and it is clear why: it is not passed to the
device_desc.init function at all. As it is today, it is basically
useless.

And there is only one case where device_init has a non-NULL data
parameter and it is in xen/drivers/char/arm-uart.c. All the others are
not even using the data parameter of device_init.

I think we need to change device_init so that dt_device_match.data can
be useful. Sorry for the scope-creep but I think we should do the
following:

- do not add of_match_table to struct device

- add one more parameter to device_desc.init:
  int (*init)(struct dt_device_node *dev, struct device_desc *desc, const void 
*data);

- change device_init to call desc->init with the right parameters:
  desc->init(dev, desc, data);

This way pci_host_common_probe is just going to get a desc directly as
parameter. I think it would make a lot more sense from an interface
perspective. It does require a change in all the DT_DEVICE_START.init
functions adding a struct device_desc *desc parameter, but it should be
a mechanical change.
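
To make this concrete, here is a rough, self-contained sketch of the shape
the proposal could take (hypothetical names and simplified types, not the
current Xen code):

    struct dt_device_node;

    struct dt_device_match {
        const char *compatible;
        const void *data;
    };

    struct device_desc {
        const char *name;
        const struct dt_device_match *dt_match;
        /* proposed: also hand the descriptor to the driver's init hook */
        int (*init)(struct dt_device_node *dev,
                    const struct device_desc *desc, const void *data);
    };

    int device_init(struct dt_device_node *dev,
                    const struct device_desc *desc, const void *data)
    {
        /* call the driver with its own descriptor and the data pointer */
        return desc->init(dev, desc, data);
    }

With something along these lines, pci_host_common_probe() would receive its
desc directly and could pick the matched dt_device_match.data from
desc->dt_match, without adding of_match_table to struct device.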

Alternatively we could just change device_init to pass
device_desc.dt_match.data when the data parameter is NULL but it feels
like a hack.


What do you think?



Re: [PATCH v2 2/2] arm/efi: Use dom0less configuration when using EFI boot

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Luca Fancellu wrote:
> > On 23 Sep 2021, at 17:59, Stefano Stabellini  wrote:
> > 
> > On Thu, 23 Sep 2021, Luca Fancellu wrote:
>  +/*
>  + * Binaries will be translated into bootmodules, the maximum number for 
>  them is
>  + * MAX_MODULES where we should remove a unit for Xen and one for Xen DTB
>  + */
>  +#define MAX_DOM0LESS_MODULES (MAX_MODULES - 2)
>  +static struct file __initdata dom0less_file;
>  +static dom0less_module_name __initdata 
>  dom0less_modules[MAX_DOM0LESS_MODULES];
>  +static unsigned int __initdata dom0less_modules_available =
>  +   MAX_DOM0LESS_MODULES;
>  +static unsigned int __initdata dom0less_modules_idx;
> >>> 
> >>> This is a lot better!
> >>> 
> >>> We don't need both dom0less_modules_idx and dom0less_modules_available.
> >>> You can just do:
> >>> 
> >>> #define dom0less_modules_available (MAX_DOM0LESS_MODULES - 
> >>> dom0less_modules_idx)
> >>> static unsigned int __initdata dom0less_modules_idx;
> >>> 
> >>> But maybe we can even get rid of dom0less_modules_available entirely?
> >>> 
> >>> We can change the check at the beginning of allocate_dom0less_file to:
> >>> 
> >>> if ( dom0less_modules_idx == dom0less_modules_available )
> >>>   blexit
> >>> 
> >>> Would that work?
> >> 
> >> I thought about it but I think they need to stay, because 
> >> dom0less_modules_available is the
> >> upper bound for the additional dom0less modules (it is decremented each 
> >> time a dom0 module
> >> Is added), instead dom0less_modules_idx is the typical index for the array 
> >> of dom0less modules.
> > 
> > [...]
> > 
> > 
>  +/*
>  + * Check if there is any space left for a domU module, the variable
>  + * dom0less_modules_available is updated each time we use 
>  read_file(...)
>  + * successfully.
>  + */
>  +if ( !dom0less_modules_available )
>  +blexit(L"No space left for domU modules");
> >>> 
> >>> This is the check that could be based on dom0less_modules_idx
> >>> 
> >> 
> >> The only way I see to have it based on dom0less_modules_idx will be to 
> >> compare it
> >> to the amount of modules still available, that is not constant because it 
> >> is dependent
> >> on how many dom0 modules are loaded, so still two variables needed.
> >> Don’t know if I’m missing something.
> > 
> > I think I understand where the confusion comes from. I am appending a
> > small patch to show what I had in mind. We are already accounting for
> > Xen and the DTB when declaring MAX_DOM0LESS_MODULES (MAX_MODULES - 2).
> > The other binaries are the Dom0 kernel and ramdisk, however, in my setup
> > they don't trigger a call to handle_dom0less_module_node because they
> > are compatible xen,linux-zimage and xen,linux-initrd.
> > 
> > However, the Dom0 kernel and ramdisk can be also compatible
> > multiboot,kernel and multiboot,ramdisk. If that is the case, then they
> > would indeed trigger a call to handle_dom0less_module_node.
> > 
> > I think that is not a good idea: a function called
> > handle_dom0less_module_node should only be called for dom0less modules
> > (domUs) and not dom0.

I can see that I misread the code yesterday: Dom0 modules don't go
through handle_dom0less_module_node thanks to the xen,domain check in
efi_arch_check_dom0less_boot.


> > But from the memory consumption point of view, it would be better
> > actually to catch dom0 modules too as you intended. In that case we need to:
> > 
> > - add a check for xen,linux-zimage and xen,linux-initrd in
> >  handle_dom0less_module_node also
> > 
> > - rename handle_dom0less_domain_node, handle_dom0less_module_node,
> >  dom0less_file, dom0less_modules, dom0less_modules_idx to something
> >  else more generic
> > 
> > 
> > For instance they could be called:
> > 
> > handle_domain_node
> > handle_module_node
> > module_file
> > modules
> > modules_idx
> > 
> > 
> > 
> > 
> > diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
> > index e2b007ece0..812d0bd607 100644
> > --- a/xen/arch/arm/efi/efi-boot.h
> > +++ b/xen/arch/arm/efi/efi-boot.h
> > @@ -22,8 +22,6 @@ typedef struct {
> > #define MAX_DOM0LESS_MODULES (MAX_MODULES - 2)
> > static struct file __initdata dom0less_file;
> > static dom0less_module_name __initdata 
> > dom0less_modules[MAX_DOM0LESS_MODULES];
> > -static unsigned int __initdata dom0less_modules_available =
> > -   MAX_DOM0LESS_MODULES;
> > static unsigned int __initdata dom0less_modules_idx;
> > 
> > #define ERROR_DOM0LESS_FILE_NOT_FOUND (-1)
> > @@ -592,14 +590,6 @@ static void __init efi_arch_handle_module(const struct 
> > file *file,
> >  * stop here.
> >  */
> > blexit(L"Unknown module type");
> > -
> > -/*
> > - * dom0less_modules_available is decremented here because for each dom0
> > - * file added, there will be an additional bootmodule, so the number
> > - * of d

Re: [PATCH v3 2/2] xen-pciback: allow compiling on other archs than x86

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Oleksandr Andrushchenko wrote:
> On 24.09.21 08:46, Oleksandr Andrushchenko wrote:
> > On 23.09.21 23:00, Stefano Stabellini wrote:
> >> On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
> >>> From: Oleksandr Andrushchenko 
> >>>
> >>> Xen-pciback driver was designed to be built for x86 only. But it
> >>> can also be used by other architectures, e.g. Arm.
> >>> Re-structure the driver in a way that it can be built for other
> >>> platforms as well.
> >>>
> >>> Signed-off-by: Oleksandr Andrushchenko 
> >>> Signed-off-by: Anastasiia Lukianenko 
> >> The patch looks good to me. Only one thing: on ARM32 I get:
> > We do not yet support Xen PCI passthrough for ARM32

Keep in mind that it is possible to run ARM32 guests on an ARM64
hypervisor.


> >> drivers/xen/xen-pciback/conf_space_header.c: In function ‘bar_init’:
> >> drivers/xen/xen-pciback/conf_space_header.c:239:34: warning: right shift 
> >> count >= width of type [-Wshift-count-overflow]
> >>   bar->val = res[pos - 1].start >> 32;
> >> ^~
> >> drivers/xen/xen-pciback/conf_space_header.c:240:49: warning: right shift 
> >> count >= width of type [-Wshift-count-overflow]
> >>   bar->len_val = -resource_size(&res[pos - 1]) >> 32;
> >>
> >>
> >> resource_size_t is defined as phys_addr_t and it can be 32bit on arm32.
> >>
> >>
> >> One fix is to surround:
> >>
> >>if (pos && (res[pos - 1].flags & IORESOURCE_MEM_64)) {
> >>bar->val = res[pos - 1].start >> 32;
> >>bar->len_val = -resource_size(&res[pos - 1]) >> 32;
> >>return bar;
> >>}
> >>
> >> with #ifdef PHYS_ADDR_T_64BIT
> >>
> > This might not be correct. We are dealing here with a 64-bit BAR on a 
> > 32-bit OS.
> >
> > I think that this can still be valid use-case if BAR64.hi == 0. So, not sure
> >
> > we can just skip it with ifdef.
> >
> > Instead, to be on the safe side, we can have:
> >
> > config XEN_PCIDEV_STUB
> >      tristate "Xen PCI-device stub driver"
> >      depends on PCI && ARM64 && XEN
> > e.g. only allow building the "stub" for ARM64 for now.

This is a pretty drastic solution. I would be OK with it but I prefer
the solution below >> 16 >> 16.


> Or... there are couple of places in the kernel where PCI deals with the 32 
> bit shift as:
> 
> drivers/pci/setup-res.c:108:        new = region.start >> 16 >> 16;
> drivers/pci/iov.c:949:        new = region.start >> 16 >> 16;
> 
> commit cf7bee5a0bf270a4eace0be39329d6ac0136cc47
> Date:   Sun Aug 7 13:49:59 *2005* +0400
> 
> [snip]
> 
>      Also make sure to write high bits - use "x >> 16 >> 16" (rather than the
>      simpler ">> 32") to avoid warnings on 32-bit architectures where we're
>      not going to have any high bits.

I think this is the best option


> This might not be(?) immediately correct in case of LPAE though, e.g.
> 
> 64-bit BAR may tolerate 40-bit address in some use-cases?

It is correct for LPAE too, it is just that with LPAE it would be
unnecessary.
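
For illustration, this is roughly how the quoted bar_init() snippet could
look with that idiom applied (a sketch based on the code above, not a
tested patch):

    if (pos && (res[pos - 1].flags & IORESOURCE_MEM_64)) {
        /*
         * Shift in two halves: ">> 16 >> 16" is well defined even when
         * resource_size_t is only 32 bits wide, so the
         * -Wshift-count-overflow warning goes away and the result is
         * simply 0 when there are no high bits to write.
         */
        bar->val = res[pos - 1].start >> 16 >> 16;
        bar->len_val = -resource_size(&res[pos - 1]) >> 16 >> 16;
        return bar;
    }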


RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Wei Chen wrote:
> > -Original Message-
> > From: Stefano Stabellini 
> > Sent: 2021年9月24日 8:26
> > To: Wei Chen 
> > Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> > Bertrand Marquis ; jbeul...@suse.com;
> > andrew.coop...@citrix.com; roger@citrix.com; w...@xen.org
> > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > CC'ing x86 maintainers
> > 
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > One NUMA node may contain several memory blocks. In the current Xen
> > > code, Xen maintains a node memory range for each node to cover
> > > all its memory blocks. But here comes the problem: if, in the gap between
> > > one node's two memory blocks, there are memory blocks that don't
> > > belong to this node (remote memory blocks), this node's memory range
> > > will be expanded to cover these remote memory blocks.
> > >
> > > One node's memory range containing other nodes' memory is obviously
> > > not very reasonable. This means the current NUMA code can only support
> > > nodes with continuous memory blocks. However, on a physical machine, the
> > > addresses of multiple nodes can be interleaved.
> > >
> > > So in this patch, we add code to detect discontinuous memory blocks
> > > for one node. NUMA initialization will fail and error messages
> > > will be printed when Xen detects such a hardware configuration.
> > 
> > At least on ARM, it is not just memory that can be interleaved, but also
> > MMIO regions. For instance:
> > 
> > node0 bank0 0-0x100
> > MMIO 0x100-0x1002000
> > Hole 0x1002000-0x200
> > node0 bank1 0x200-0x300
> > 
> > So I am not familiar with the SRAT format, but I think on ARM the check
> > would look different: we would just look for multiple memory ranges
> > under a device_type = "memory" node of a NUMA node in device tree.
> > 
> > 
> 
> Should I need to include/refine above message to commit log?

Let me ask you a question first.

With the NUMA implementation of this patch series, can we deal with
cases where each node has multiple memory banks, not interleaved?
As an example:

node0: 0x0 - 0x1000
MMIO : 0x1000 - 0x2000
node0: 0x2000 - 0x3000
MMIO : 0x3000 - 0x5000
node1: 0x5000 - 0x6000
MMIO : 0x6000 - 0x8000
node2: 0x8000 - 0x9000


I assume we can deal with this case simply by setting node0 memory to
0x0-0x3000 even if there is actually something else, a device, that
doesn't belong to node0 in between the two node0 banks?

Is it only other nodes' memory interleaved that cause issues? In other
words, only the following is a problematic scenario?

node0: 0x0 - 0x1000
MMIO : 0x1000 - 0x2000
node1: 0x2000 - 0x3000
MMIO : 0x3000 - 0x5000
node0: 0x5000 - 0x6000

Because node1 is in between the two ranges of node0?


I am asking these questions because it is certainly possible to have
multiple memory ranges for each NUMA node in device tree, either by
specifying multiple ranges with a single "reg" property, or by
specifying multiple memory nodes with the same numa-node-id.
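
A small sketch of the kind of check being discussed (illustrative only, not
the code from this series): the problematic case is when the span recorded
for one node ends up covering memory that belongs to a different node.

    #include <stdbool.h>
    #include <stdint.h>

    /* true if the two half-open spans [s1, e1) and [s2, e2) overlap */
    static bool spans_overlap(uint64_t s1, uint64_t e1,
                              uint64_t s2, uint64_t e2)
    {
        return s1 < e2 && s2 < e1;
    }

In the first example above, node0's merged span 0x0-0x3000 only swallows an
MMIO hole, so it does not overlap any other node's memory. In the second
example, node0's merged span 0x0-0x6000 overlaps node1's 0x2000-0x3000, i.e.
the interleaved case being asked about.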

RE: [PATCH 32/37] xen/arm: unified entry to parse all NUMA data from device tree

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Wei Chen wrote:
> > -Original Message-
> > From: Stefano Stabellini 
> > Sent: 2021年9月24日 11:17
> > To: Wei Chen 
> > Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> > Bertrand Marquis 
> > Subject: Re: [PATCH 32/37] xen/arm: unified entry to parse all NUMA data
> > from device tree
> > 
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > In this API, we scan whole device tree to parse CPU node id, memory
> >   ^ function   ^ the whole
> > 
> > > node id and distance-map. Though early_scan_node will invoke has a
> > > handler to process memory nodes. If we want to parse memory node id
> > > in this handler, we have to embeded NUMA parse code in this handler.
> >   ^ embed
> > 
> > > But we still need to scan whole device tree to find CPU NUMA id and
> > > distance-map. In this case, we include memory NUMA id parse in this
> > > API too. Another benefit is that we have a unique entry for device
> >   ^ function
> > 
> > > tree NUMA data parse.
> > 
> > Ah, that's the explanation I was asking for earlier!
> > 
> 
> The question about device_tree_get_meminfo?

Yes, it would be nice to reuse process_memory_node if we can, but I
understand if we cannot.

Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Wei Chen wrote:
> Hi Stefano,
> 
> On 2021/9/24 11:31, Stefano Stabellini wrote:
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > Arm platforms support both ACPI and device tree. We don't
> > > want users to select device tree NUMA or ACPI NUMA manually.
> > > We hope usrs can just enable NUMA for Arm, and device tree
> >^ users
> > 
> > > NUMA and ACPI NUMA can be selected depends on device tree
> > > feature and ACPI feature status automatically. In this case,
> > > these two kinds of NUMA support code can be co-exist in one
> > > Xen binary. Xen can check feature flags to decide using
> > > device tree or ACPI as NUMA based firmware.
> > > 
> > > So in this patch, we introduce a generic option:
> > > CONFIG_ARM_NUMA for user to enable NUMA for Arm.
> >^ users
> > 
> 
> OK
> 
> > > And one CONFIG_DEVICE_TREE_NUMA option for ARM_NUMA
> > > to select when HAS_DEVICE_TREE option is enabled.
> > > Once when ACPI NUMA for Arm is supported, ACPI_NUMA
> > > can be selected here too.
> > > 
> > > Signed-off-by: Wei Chen 
> > > ---
> > >   xen/arch/arm/Kconfig | 11 +++
> > >   1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
> > > index 865ad83a89..ded94ebd37 100644
> > > --- a/xen/arch/arm/Kconfig
> > > +++ b/xen/arch/arm/Kconfig
> > > @@ -34,6 +34,17 @@ config ACPI
> > > Advanced Configuration and Power Interface (ACPI) support for 
> > > Xen is
> > > an alternative to device tree on ARM64.
> > >   + config DEVICE_TREE_NUMA
> > > + def_bool n
> > > + select NUMA
> > > +
> > > +config ARM_NUMA
> > > + bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if
> > > UNSUPPORTED
> > > + select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> > 
> > Should it be: depends on HAS_DEVICE_TREE ?
> > (And eventually depends on HAS_DEVICE_TREE || ACPI)
> > 
> 
> As discussed in the RFC [1], we want to make ARM_NUMA a generic
> option that can be selected by users, and depend on HAS_DEVICE_TREE
> or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.
> 
> If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
> does it become a loop dependency?
> 
> https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00888.html

OK, I am fine with that. I was just trying to catch the case where a
user selects "ARM_NUMA" but actually neither ACPI nor HAS_DEVICE_TREE
are selected so nothing happens. I was trying to make it clear that
ARM_NUMA depends on having at least one between HAS_DEVICE_TREE or ACPI
because otherwise it is not going to work.

That said, I don't think this is important because HAS_DEVICE_TREE
cannot be unselected. So if we cannot find a way to express the
dependency, I think it is fine to keep the patch as is.



RE: [PATCH 23/37] xen/arm: implement node distance helpers for Arm

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Wei Chen wrote:
> > -Original Message-
> > From: Stefano Stabellini 
> > Sent: 2021年9月24日 9:47
> > To: Wei Chen 
> > Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> > Bertrand Marquis 
> > Subject: Re: [PATCH 23/37] xen/arm: implement node distance helpers for
> > Arm
> > 
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > We will parse NUMA node distances from the device tree or ACPI
> > > table. So we need a matrix to record the distances between
> > > any two nodes we parsed. Accordingly, in this patch we provide the
> > > numa_set_distance API for device tree or ACPI table parsers
> > > to set the distance for any two nodes.
> > > When NUMA initialization has failed, __node_distance will return
> > > NUMA_REMOTE_DISTANCE; this helps us avoid rolling back the
> > > distance matrix when NUMA initialization fails.
> > >
> > > Signed-off-by: Wei Chen 
> > > ---
> > >  xen/arch/arm/Makefile  |  1 +
> > >  xen/arch/arm/numa.c| 69 ++
> > >  xen/include/asm-arm/numa.h | 13 +++
> > >  3 files changed, 83 insertions(+)
> > >  create mode 100644 xen/arch/arm/numa.c
> > >
> > > diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> > > index ae4efbf76e..41ca311b6b 100644
> > > --- a/xen/arch/arm/Makefile
> > > +++ b/xen/arch/arm/Makefile
> > > @@ -35,6 +35,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
> > >  obj-y += mem_access.o
> > >  obj-y += mm.o
> > >  obj-y += monitor.o
> > > +obj-$(CONFIG_NUMA) += numa.o
> > >  obj-y += p2m.o
> > >  obj-y += percpu.o
> > >  obj-y += platform.o
> > > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > > new file mode 100644
> > > index 00..3f08870d69
> > > --- /dev/null
> > > +++ b/xen/arch/arm/numa.c
> > > @@ -0,0 +1,69 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Arm Architecture support layer for NUMA.
> > > + *
> > > + * Copyright (C) 2021 Arm Ltd
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * This program is distributed in the hope that it will be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public License
> > > + * along with this program. If not, see .
> > > + *
> > > + */
> > > +#include 
> > > +#include 
> > > +
> > > +static uint8_t __read_mostly
> > > +node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> > > +{ 0 }
> > > +};
> > > +
> > > +void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t
> > distance)
> > > +{
> > > +if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> > > +{
> > > +printk(KERN_WARNING
> > > +   "NUMA: invalid nodes: from=%"PRIu8" to=%"PRIu8"
> > MAX=%"PRIu8"\n",
> > > +   from, to, MAX_NUMNODES);
> > > +return;
> > > +}
> > > +
> > > +/* NUMA defines 0xff as an unreachable node and 0-9 are undefined
> > */
> > > +if ( distance >= NUMA_NO_DISTANCE ||
> > > +(distance >= NUMA_DISTANCE_UDF_MIN &&
> > > + distance <= NUMA_DISTANCE_UDF_MAX) ||
> > > +(from == to && distance != NUMA_LOCAL_DISTANCE) )
> > > +{
> > > +printk(KERN_WARNING
> > > +   "NUMA: invalid distance: from=%"PRIu8" to=%"PRIu8"
> > distance=%"PRIu32"\n",
> > > +   from, to, distance);
> > > +return;
> > > +}
> > > +
> > > +node_distance_map[from][to] = distance;
> > > +}
> > > +
> > > +uint8_t __node_distance(nodeid_t from, nodeid_t to)
> > > +{
> > > +/* When NUMA is off, any distance will be treated as remote. */
> > > +if ( srat_disabled() )
> > 
> > Given that this is ARM specific code and specific to ACPI, I don't think
> > we should have any call to something called "srat_disabled".
> > 
> > I suggest to either rename srat_disabled to numa_distance_disabled.
> > 
> > Other than that, this patch looks OK to me.
> > 
> 
> srat stands for static resource affinity table, I think dtb also can be
> treated as a static resource affinity table. So I keep SRAT in this patch
> and other patches. I have seen your comment in patch#25. Before x86 
> maintainers
> give any feedback, can we still keep srat here?

Jan and I replied in the other thread. I think that in warning messages
"SRAT" should not be mentioned when booting from DT. Ideally functions
names and variables should be renamed too when shared between ACPI and
DT but it is less critical, and it is fine if you don't do that in the
next version.

Re: [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization

2021-09-24 Thread Stefano Stabellini
On Fri, 24 Sep 2021, Jan Beulich wrote:
> On 24.09.2021 04:09, Stefano Stabellini wrote:
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> >> NUMA initialization will parse information from the firmware-provided
> >> static resource affinity table (ACPI SRAT or DTB). bad_srat is a
> >> function that will be used when the initialization code encounters
> >> some unexpected errors.
> >>
> >> In this patch, we introduce the Arm version of bad_srat for the common
> >> NUMA initialization code to invoke.
> >>
> >> Signed-off-by: Wei Chen 
> >> ---
> >>  xen/arch/arm/numa.c | 7 +++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> >> index 3755b01ef4..5209d3de4d 100644
> >> --- a/xen/arch/arm/numa.c
> >> +++ b/xen/arch/arm/numa.c
> >> @@ -18,6 +18,7 @@
> >>   *
> >>   */
> >>  #include 
> >> +#include 
> >>  #include 
> >>  
> >>  static uint8_t __read_mostly
> >> @@ -25,6 +26,12 @@ node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> >>  { 0 }
> >>  };
> >>  
> >> +__init void bad_srat(void)
> >> +{
> >> +printk(KERN_ERR "NUMA: Firmware SRAT table not used.\n");
> >> +fw_numa = -1;
> >> +}
> > 
> > I realize that the series keeps the "srat" terminology everywhere on DT
> > too. I wonder if it is worth replacing srat with something like
> > "numa_distance" everywhere as appropriate. I am adding the x86
> > maintainers for an opinion.
> > 
> > If you guys prefer to keep srat (if nothing else, it is concise), I am
> > also OK with keeping srat although it is not technically accurate.
> 
> I think we want to tell apart both things: Where we truly talk about
> the firmware's SRAT table, keeping that name is fine. But I suppose
> there no "Firmware SRAT table" (as in the log message above) when
> using DT?

No. FYI this is the DT binding:
https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/numa.txt

The interesting bit is the "distance-map"


> If so, at the very least in log messages SRAT shouldn't be
> mentioned. Perhaps even functions serving both an ACPI and a DT
> purpose would better not use "srat" in their names (but I'm not as
> fussed about it there.)

I agree 100% with what you wrote.



[xen-unstable-smoke test] 165181: tolerable all pass - PUSHED

2021-09-24 Thread osstest service owner
flight 165181 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165181/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  dd6c062a7a4abdb662c18af03d1396325969d155
baseline version:
 xen  728998f6f2b7e1420e771236efec65cbf6143b7b

Last test of basis   165180  2021-09-24 10:01:36 Z0 days
Testing same since   165181  2021-09-24 14:03:24 Z0 days1 attempts


People who touched revisions under test:
  Ian Jackson 
  Kevin Stefanov 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   728998f6f2..dd6c062a7a  dd6c062a7a4abdb662c18af03d1396325969d155 -> smoke



[linux-linus test] 165176: regressions - FAIL

2021-09-24 Thread osstest service owner
flight 165176 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165176/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-xl-pvshim   17 guest-saverestorefail REGR. vs. 152332
 test-amd64-amd64-dom0pvh-xl-amd 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl-credit1  17 guest-saverestorefail REGR. vs. 152332
 test-amd64-amd64-xl-multivcpu 17 guest-saverestore   fail REGR. vs. 152332
 test-amd64-amd64-dom0pvh-xl-intel 17 guest-saverestore   fail REGR. vs. 152332
 test-amd64-amd64-xl-pvhv2-intel 17 guest-saverestore fail REGR. vs. 152332
 test-amd64-amd64-xl  17 guest-saverestorefail REGR. vs. 152332
 test-amd64-amd64-xl-credit2  17 guest-saverestorefail REGR. vs. 152332
 test-amd64-amd64-xl-shadow   17 guest-saverestorefail REGR. vs. 152332
 test-amd64-amd64-libvirt 17 guest-saverestorefail REGR. vs. 152332
 test-amd64-amd64-libvirt-xsm 17 guest-saverestorefail REGR. vs. 152332
 test-amd64-amd64-pair 26 guest-migrate/src_host/dst_host fail REGR. vs. 152332
 test-amd64-amd64-libvirt-pair 26 guest-migrate/src_host/dst_host fail REGR. 
vs. 152332
 test-amd64-amd64-xl-pvhv2-amd 17 guest-saverestore   fail REGR. vs. 152332
 test-amd64-coresched-amd64-xl 17 guest-saverestore   fail REGR. vs. 152332
 test-amd64-amd64-xl-xsm  17 guest-saverestorefail REGR. vs. 152332
 test-arm64-arm64-xl  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-credit1  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-libvirt-xsm 13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-xsm  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-credit2  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-thunderx 13 debian-fixup fail REGR. vs. 152332
 test-armhf-armhf-xl-credit1 18 guest-start/debian.repeat fail REGR. vs. 152332

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-xl-rtds 17 guest-saverestorefail REGR. vs. 152332

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check fail baseline 
untested
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 152332
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 152332
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 152332
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 152332
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 152332
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 152332
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 152332
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-l

Re: [PATCH v2 08/12] xen/sched: Clean up trace handling

2021-09-24 Thread Dario Faggioli
On Mon, 2021-09-20 at 18:25 +0100, Andrew Cooper wrote:
> There is no need for bitfields anywhere - use more sensible types. 
> There is
> also no need to cast 'd' to (unsigned char *) before passing it to a
> function
> taking void *.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 
>
Reviewed-by: Dario Faggioli 

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v2 07/12] xen/rt: Clean up trace handling

2021-09-24 Thread Dario Faggioli
On Mon, 2021-09-20 at 18:25 +0100, Andrew Cooper wrote:
> Most uses of bitfields and __packed are unnecessary.  There is also
> no need to
> cast 'd' to (unsigned char *) before passing it to a function taking
> void *.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 
>
Reviewed-by: Dario Faggioli 

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v2 06/12] xen/credit2: Clean up trace handling

2021-09-24 Thread Dario Faggioli
On Mon, 2021-09-20 at 18:25 +0100, Andrew Cooper wrote:
> There is no need for bitfields anywhere - use more sensible types. 
> There is
> also no need to cast 'd' to (unsigned char *) before passing it to a
> function
> taking void *.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 
> Reviewed-by: Jan Beulich 
>
Reviewed-by: Dario Faggioli 

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v2 03/12] xen/credit2: Remove tail padding from TRC_CSCHED2_* records

2021-09-24 Thread Dario Faggioli
On Mon, 2021-09-20 at 18:25 +0100, Andrew Cooper wrote:
> All three of these records have tail padding, leaking stack rubble
> into the
> trace buffer.  Introduce an explicit _pad field and have the compiler
> zero the
> padding automatically.
> 
> Signed-off-by: Andrew Cooper 
> Reviewed-by: Jan Beulich 
>
Reviewed-by: Dario Faggioli 

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v2 02/12] xen/memory: Remove tail padding from TRC_MEM_* records

2021-09-24 Thread Dario Faggioli
On Mon, 2021-09-20 at 18:25 +0100, Andrew Cooper wrote:
> Four TRC_MEM_* records supply custom structures with tail padding,
> leaking
> stack rubble into the trace buffer.  Three of the records were fine
> in 32-bit
> builds of Xen, due to the relaxed alignment of 64-bit integers, but
> POD_SUPERPAGE_SPLITER was broken right from the outset.
> 
> We could pack the datastructures to remove the padding, but
> xentrace_format
> has no way of rendering the upper half of a 16-bit field.  Instead,
> expand all
> 16-bit fields to 32-bit.
> 
> For POD_SUPERPAGE_SPLINTER, introduce an order field as it is
> relevant
> information, and to match DECREASE_RESERVATION, and so it doesn't
> require a
> __packed attribute to drop tail padding.
> 
> Update xenalyze's structures to match, and introduce xentrace_format
> rendering
> which was absent previously.
> 
> Signed-off-by: Andrew Cooper 
> Reviewed-by: Jan Beulich 
>
Reviewed-by: Dario Faggioli 

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)


signature.asc
Description: This is a digitally signed message part


Xen Rust VirtIO demos work breakdown for Project Stratos

2021-09-24 Thread Alex Bennée


Hi,

The following is a breakdown (as best I can figure) of the work needed
to demonstrate VirtIO backends in Rust on the Xen hypervisor. It
requires work across a number of projects but notably core rust and virtio
enabling in the Xen project (building on the work EPAM has already done)
and the start of enabling rust-vmm crate to work with Xen.

The first demo is a fairly simple toy to exercise the direct hypercall
approach for a unikernel backend. On its own it isn't super impressive
but hopefully serves as a proof of concept for the idea of having
backends running in a single exception level where latency will be
important.

The second is a much more ambitious bridge between Xen and vhost-user to
allow for re-use of the existing vhost-user backends with the bridge
acting as a proxy for what would usually be a full VMM in the type-2
hypervisor case. With that in mind the rust-vmm work is only aimed at
doing the device emulation and doesn't address the larger question of
how type-1 hypervisors can be integrated into the rust-vmm hypervisor
model.

A quick note about the estimates. They are exceedingly rough guesses
plucked out of the air and I would be grateful for feedback from the
appropriate domain experts on if I'm being overly optimistic or
pessimistic.

The links to the Stratos JIRA should be at least read accessible to all
although they contain the same information as the attached document
(albeit with nicer PNG renderings of my ASCII art ;-). There is a
Stratos sync-up call next Thursday:

  
https://calendar.google.com/event?action=TEMPLATE&tmeid=MWpidm5lbzM5NjlydnAxdWxvc2s4aGI0ZGpfMjAyMTA5MzBUMTUwMDAwWiBjX2o3bmdpMW84cmxvZmtwZWQ0cjVjaDk4bXZnQGc&tmsrc=c_j7ngi1o8rlofkped4r5ch98mvg%40group.calendar.google.com

and I'm sure there will also be discussion in the various projects
(hence the wide CC list). The Stratos calls are open to anyone who wants
to attend and we welcome feedback from all who are interested.

So on with the work breakdown:

━━━
 STRATOS PLANNING FOR 21 TO 22

  Alex Bennée
━━━


Table of Contents
─

1. Xen Rust Bindings ([STR-51])
.. 1. Upstream an "official" rust crate for Xen ([STR-52])
.. 2. Basic Hypervisor Interactions hypercalls ([STR-53])
.. 3. [#10] Access to XenStore service ([STR-54])
.. 4. VirtIO support hypercalls ([STR-55])
2. Xen Hypervisor Support for Stratos ([STR-56])
.. 1. Stable ABI for foreignmemory mapping to non-dom0 ([STR-57])
.. 2. Tweaks to tooling to launch VirtIO guests
3. rust-vmm support for Xen VirtIO ([STR-59])
.. 1. Make vm-memory Xen aware ([STR-60])
.. 2. Xen IO notification and IRQ injections ([STR-61])
4. Stratos Demos
.. 1. Rust based stubdomain monitor ([STR-62])
.. 2. Xen aware vhost-user master ([STR-63])





1 Xen Rust Bindings ([STR-51])
══

  There exists a [placeholder repository] with the start of a set of
  x86_64 bindings for Xen and a very basic hello world uni-kernel
  example. This forms the basis of the initial Xen Rust work and will be
  available as a [xen-sys crate] via cargo.


[STR-51] 

[placeholder repository] 

[xen-sys crate] 

1.1 Upstream an "official" rust crate for Xen ([STR-52])


  To start with we will want an upstream location for future work to be
  based upon. The intention is that the crate is independent of the version
  of Xen it runs on (above the baseline version chosen). This will
  entail:

  • ☐ agreeing with upstream the name/location for the source
  • ☐ documenting the rules for the "stable" hypercall ABI
  • ☐ establish an internal interface to elide between ioctl mediated
and direct hypercalls
  • ☐ ensure the crate is multi-arch and has feature parity for arm64

  As such we expect the implementation to be standalone, i.e. not
  wrapping the existing Xen libraries for mediation. There should be a
  close (1-to-1) mapping between the interfaces in the crate and the
  eventual hypercall made to the hypervisor.

  Estimate: 4w (elapsed likely longer due to discussion)


[STR-52] 


1.2 Basic Hypervisor Interactions hypercalls ([STR-53])
───

  These are the bare minimum hypercalls implemented as both ioctl and
  direct calls. These allow for a very basic binary to:

  • ☐ console_io - output IO via the Xen console
  • ☐ domctl stub - basic stub for domain control (different API?)
  • ☐ sysctl stub - basic stub for system control (different API?)

  The idea would be this provides enough hypercall interface to query
  the list of domains and output their status via the xen console. There
  is an open question about if the domctl and sysctl hypercalls ar

Re: [future abi] [RFC PATCH V3] xen/gnttab: Store frame GFN in struct page_info on Arm

2021-09-24 Thread Roger Pau Monné
On Fri, Sep 24, 2021 at 07:52:24PM +0500, Julien Grall wrote:
> Hi Roger,
> 
> On 24/09/2021 13:41, Roger Pau Monné wrote:
> > On Thu, Sep 23, 2021 at 09:59:26PM +0100, Andrew Cooper wrote:
> > > On 23/09/2021 20:32, Oleksandr Tyshchenko wrote:
> > > > Suggested-by: Julien Grall 
> > > > Signed-off-by: Oleksandr Tyshchenko 
> > > > ---
> > > > You can find the related discussions at:
> > > > https://lore.kernel.org/xen-devel/93d0df14-2c8a-c2e3-8c51-544121901...@xen.org/
> > > > https://lore.kernel.org/xen-devel/1628890077-12545-1-git-send-email-olekst...@gmail.com/
> > > > https://lore.kernel.org/xen-devel/1631652245-30746-1-git-send-email-olekst...@gmail.com/
> > > > 
> > > > ! Please note, there is still unresolved locking question here for which
> > > > I failed to find a suitable solution. So, it is still an RFC !
> > > 
> > > Just FYI, I thought I'd share some of the plans for ABI v2.  Obviously
> > > these plans are future work and don't solve the current problem.
> > > 
> > > Guests mapping Xen pages is backwards.  There are reasons why it was
> > > used for x86 PV guests, but the entire interface should have been designed
> > > differently for x86 HVM.
> > > 
> > > In particular, Xen should be mapping guest RAM, rather than the guest
> > > manipulating the 2nd stage tables to map Xen RAM.  Amongst other things,
> > > it's far, far lower overhead.
> > > 
> > > 
> > > A much better design is one where the grant table looks like an MMIO
> > > device.  The domain builder decides the ABI (v1 vs v2 - none of this
> > > dynamic switch at runtime nonsense), and picks a block of guest physical
> > > addresses, which are registered with Xen.  This forms the grant table,
> > > status table (v2 only), and holes to map into.
> > 
> > I think this could be problematic for identity mapped Arm dom0, as
> > IIRC in that case grants are mapped so that gfn == mfn in order to
> > account for the lack of an IOMMU. You could use a bounce buffer, but
> > that would introduce a big performance penalty.
> 
> Or you could find a hole that is outside of the RAM regions. This is not
> trivial but not impossible (see [1]).

I'm certainly not familiar with the Arm identity map.

If you map them at random areas (so no longer identity mapped), how do
you pass the addresses to the physical devices for DMA operations? I
assume there must be some kind of translation then that converts from
gfn to mfn in order to cope with the lack of an IOMMU, and because
dom0 doesn't know the mfn of the grant reference in order to map it at
the same gfn.

Thanks, Roger.



[libvirt test] 165177: regressions - FAIL

2021-09-24 Thread osstest service owner
flight 165177 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165177/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 151777
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 151777
 build-arm64-libvirt   6 libvirt-buildfail REGR. vs. 151777
 build-armhf-libvirt   6 libvirt-buildfail REGR. vs. 151777

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a

version targeted for testing:
 libvirt  e0e0bf662896240f2293eb60e23d2570f110fa5a
baseline version:
 libvirt  2c846fa6bcc11929c9fb857a22430fb9945654ad

Last test of basis   151777  2020-07-10 04:19:19 Z  441 days
Failing since151818  2020-07-11 04:18:52 Z  440 days  431 attempts
Testing same since   165177  2021-09-24 04:20:03 Z0 days1 attempts


People who touched revisions under test:
Adolfo Jayme Barrientos 
  Aleksandr Alekseev 
  Aleksei Zakharov 
  Andika Triwidada 
  Andrea Bolognani 
  Balázs Meskó 
  Barrett Schonefeld 
  Bastian Germann 
  Bastien Orivel 
  BiaoXiang Ye 
  Bihong Yu 
  Binfeng Wu 
  Bjoern Walk 
  Boris Fiuczynski 
  Brian Turek 
  Bruno Haible 
  Chris Mayo 
  Christian Borntraeger 
  Christian Ehrhardt 
  Christian Kirbach 
  Christian Schoenebeck 
  Cole Robinson 
  Collin Walling 
  Cornelia Huck 
  Cédric Bosdonnat 
  Côme Borsoi 
  Daniel Henrique Barboza 
  Daniel Letai 
  Daniel P. Berrange 
  Daniel P. Berrangé 
  Didik Supriadi 
  dinglimin 
  Dmytro Linkin 
  Eiichi Tsukata 
  Eric Farman 
  Erik Skultety 
  Fabian Affolter 
  Fabian Freyer 
  Fabiano Fidêncio 
  Fangge Jin 
  Farhan Ali 
  Fedora Weblate Translation 
  gongwei 
  Guoyi Tu
  Göran Uddeborg 
  Halil Pasic 
  Han Han 
  Hao Wang 
  Hela Basa 
  Helmut Grohne 
  Hiroki Narukawa 
  Ian Wienand 
  Jakob Meng 
  Jamie Strandboge 
  Jamie Strandboge 
  Jan Kuparinen 
  jason lee 
  Jean-Baptiste Holcroft 
  Jia Zhou 
  Jianan Gao 
  Jim Fehlig 
  Jin Yan 
  Jinsheng Zhang 
  Jiri Denemark 
  John Ferlan 
  Jonathan Watt 
  Jonathon Jongsma 
  Julio Faracco 
  Justin Gatzen 
  Ján Tomko 
  Kashyap Chamarthy 
  Kevin Locke 
  Kristina Hanicova 
  Laine Stump 
  Laszlo Ersek 
  Lee Yarwood 
  Lei Yang 
  Liao Pingfang 
  Lin Ma 
  Lin Ma 
  Lin Ma 
  Liu Yiding 
  Luke Yue 
  Luyao Zhong 
  Marc Hartmayer 
  Marc-André Lureau 
  Marek Marczykowski-Górecki 
  Markus Schade 
  Martin Kletzander 
  Masayoshi Mizuma 
  Matej Cepl 
  Matt Coleman 
  Matt Coleman 
  Mauro Matteo Cascella 
  Meina Li 
  Michal Privoznik 
  Michał Smyk 
  Milo Casagrande 
  Moshe Levi 
  Muha Aliss 
  Nathan 
  Neal Gompa 
  Nick Chevsky 
  Nick Shyrokovskiy 
  Nickys Music Group 
  Nico Pache 
  Nikolay Shirokovskiy 
  Olaf Hering 
  Olesya Gerasimenko 
  Orion Poplawski 
  Pany 
  Patrick Magauran 
  Paulo de Rezende Pinatti 
  Pavel Hrdina 
  Peng Liang 
  Peter Krempa 
  Pino Toscano 
  Pino Toscano 
  Piotr Drąg 
  Prathamesh Chavan 
  Richard W.M. Jones 
  Ricky Tigg 
  Robin Lee 
  Roman Bogorodskiy 
  Roman Bolshakov 
  Ryan Gahagan 
  Ryan Schmidt 
  Sam Hartman 
  Scott Shambarger 
  Sebastian Mitterle 
  SeongHyun Jo 
  Shalini Chellathurai Saroja 
  Shaojun Yang 
  Shi Lei 
  simmon 
  Simon Chopin 
  Simon Gaiser 
  Simon Rowe 
  Stefan Bader 
  Stefan Berger 
  Stefan Berger 
  Stefan Hajnoczi 
  Stefan Hajnoczi 
  Szymon Scholz 
  Thomas Huth 
  Tim Wiederhake 
  Tomáš Golembiovský 
  Tomáš Janoušek 
  Tuguoyi 
  Victor Toso 
  Ville Skyttä 
  Vinayak Kale 
  Wang Xin 
  WangJian 
  Weblate 
  Wei 

[ovmf test] 165175: all pass - PUSHED

2021-09-24 Thread osstest service owner
flight 165175 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165175/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf d60915b7516c87ec49ad579a1cb8ff9226d85928
baseline version:
 ovmf 7ea7f9c07757b9445c24b23acf4c2e8e60b30b7e

 Last test of basis   165170  2021-09-23 18:41:23 Z    0 days
 Testing same since   165175  2021-09-24 03:46:51 Z    0 days    1 attempts


People who touched revisions under test:
  Zhiguang Liu 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   7ea7f9c077..d60915b751  d60915b7516c87ec49ad579a1cb8ff9226d85928 -> 
xen-tested-master



Re: [PATCH v2.1 14/12] xen: Switch to new TRACE() API

2021-09-24 Thread Dario Faggioli
On Mon, 2021-09-20 at 20:32 +0100, Andrew Cooper wrote:
> (Almost) no functional change.
> 
> irq_move_cleanup_interrupt() changes two smp_processor_id() calls to
> the 'me'
> local variable which manifests as a minor code improvement.  All
> other
> differences in the compiled binary are to do with line numbers
> changing.
> 
> Some conversion notes:
>  * HVMTRACE_LONG_[234]D() and TRACE_2_LONG_[234]D() were latently
> buggy.  They
>    blindly discard extra parameters, but luckily no users are
> impacted.  They
>    are also obfuscated wrappers, depending on exactly one or two
> parameters
>    being TRC_PAR_LONG() to compile successfully.
>  * HVMTRACE_LONG_1D() behaves unlike its named companions, and takes
> exactly
>    one 64bit parameter which it splits manually.  It's one user,
>    vmx_cr_access()'s LMSW path, is gets adjusted to use
> TRACE_PARAM64().
>  * TRACE_?D() and TRACE_2_LONG_*() change to TRACE_TIME() as cycles
> is always.
>
Was this supposed to be "as cycles is always 1", or something like
that? (But maybe it's fine and it is me. I'm no native speaker after
all...)

In any case...

>  * HVMTRACE_ND() is opencoded for VMENTRY/VMEXIT records to include
> cycles.
>    These are converted to TRACE_TIME(), with the old modifier
> parameter
>    expressed as an OR at the callsite.  One callsite,
> svm_vmenter_helper() had
>    a nested tb_init_done check, which is dropped.  (The optimiser
> also spotted
>    this, which is why it doesn't manifest as a binary difference.)
>  * All HVMTRACE_?D() change to TRACE() as cycles is explicitly
> skipped.
> 
> Signed-off-by: Andrew Cooper 
>
Reviewed-by: Dario Faggioli 

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)




Re: [PATCH v2 2/2] arm/efi: Use dom0less configuration when using EFI boot

2021-09-24 Thread Luca Fancellu



> On 24 Sep 2021, at 15:02, Jan Beulich  wrote:
> 
> On 22.09.2021 16:13, Luca Fancellu wrote:
>> +static unsigned int __init allocate_dom0less_file(EFI_FILE_HANDLE 
>> dir_handle,
>> +  const char *name,
>> +  unsigned int name_len)
>> +{
>> +dom0less_module_name* file_name;
>> +union string module_name;
>> +unsigned int ret_idx;
>> +
>> +/*
>> + * Check if there is any space left for a domU module, the variable
>> + * dom0less_modules_available is updated each time we use read_file(...)
>> + * successfully.
>> + */
>> +if ( !dom0less_modules_available )
>> +blexit(L"No space left for domU modules");
>> +
>> +module_name.s = (char*) name;
> 
> Unfortunately there are too many style issues in these Arm additions to
> really enumerate; I'd like to ask that you go through yourself with
> ./CODING_STYLE, surrounding code, and review comments on earlier patches
> of yours in mind. This cast stands out, though: I'm pretty sure you were
> told before that casts are often dangerous and hence should be avoided
> whenever (easily) possible. There was a prior case where union string
> was used in a similar way, not all that long ago. Hence why it now has
> a "const char *" member. (That's still somewhat risky, but imo way
> better than a cast.)

Hi Jan,

Yes, I will use the .cs member; I will also have a better look at the patch
to find the style issues.
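For reference, the suggested change presumably boils down to something like
this (sketch only; "module_name" and the union string ".cs" member are taken
from the discussion above, not from a posted revision):

    module_name.cs = name;  /* use the const char * member instead of a cast */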

> 
>> @@ -1361,12 +1360,21 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE 
>> *SystemTable)
>> efi_bs->FreePages(cfg.addr, PFN_UP(cfg.size));
>> cfg.addr = 0;
>> 
>> -dir_handle->Close(dir_handle);
>> -
>> if ( gop && !base_video )
>> gop_mode = efi_find_gop_mode(gop, cols, rows, depth);
>> }
>> 
>> +/*
>> + * Check if a proper configuration is provided to start Xen:
>> + *  - Dom0 specified (minimum required)
>> + *  - Dom0 and DomU(s) specified
>> + *  - DomU(s) specified
>> + */
>> +if ( !efi_arch_check_dom0less_boot(dir_handle) && !kernel.addr )
>> +blexit(L"No Dom0 kernel image specified.");
>> +
>> +dir_handle->Close(dir_handle);
> 
> So far I was under the impression that handles and alike need closing
> before calling Exit(), to prevent resource leaks. While I will admit
> that likely there are more (pre-existing) affected paths, I think that
> - if that understanding of mine is correct - it would be nice to avoid
> adding yet more instances.

Ok sure, I will close the handle before the blexit.

Cheers,
Luca

> 
> Jan
> 




Re: [PATCH v2 09/12] xen/trace: Minor code cleanup

2021-09-24 Thread Dario Faggioli
On Tue, 2021-09-21 at 13:03 +0200, Jan Beulich wrote:
> On 20.09.2021 19:25, Andrew Cooper wrote:
> > 
> > Signed-off-by: Andrew Cooper 
> 
> Like for v1: Largely
> Reviewed-by: Jan Beulich 
>
Reviewed-by: Dario Faggioli 

> One remark:
> 
> > @@ -717,9 +713,6 @@ void __trace_var(u32 event, bool_t cycles,
> > unsigned int extra,
> >  if ( !cpumask_test_cpu(smp_processor_id(), &tb_cpu_mask) )
> >  return;
> >  
> > -    /* Read tb_init_done /before/ t_bufs. */
> > -    smp_rmb();
> > -
> >  spin_lock_irqsave(&this_cpu(t_lock), flags);
> >  
> >  buf = this_cpu(t_bufs);
> 
> I wonder whether the comment wouldn't be helpful to move down here,
> in of course a slightly edited form (going from /before/ to /after/).
> 
FWIW, I agree with this (but the R-o-b: stands no matter whether it's
done or not).

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)




[PATCH 1/2] include/public: add possible status values to usbif.h

2021-09-24 Thread Juergen Gross
The interface definition of PV USB devices is lacking the specification
of possible values of the status field in a response. Those are
negative errno values as used in Linux, so they might differ in other
OSes. Specify them via appropriate defines.

Signed-off-by: Juergen Gross 
---
 xen/include/public/io/usbif.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/xen/include/public/io/usbif.h b/xen/include/public/io/usbif.h
index c6a58639d6..fbd6f953f8 100644
--- a/xen/include/public/io/usbif.h
+++ b/xen/include/public/io/usbif.h
@@ -221,6 +221,13 @@ struct usbif_urb_response {
uint16_t id; /* request id */
uint16_t start_frame;  /* start frame (ISO) */
int32_t status; /* status (non-ISO) */
+#define USBIF_STATUS_OK0
+#define USBIF_STATUS_NODEV -19
+#define USBIF_STATUS_INVAL -22
+#define USBIF_STATUS_STALL -32
+#define USBIF_STATUS_IOERROR   -71
+#define USBIF_STATUS_BABBLE-75
+#define USBIF_STATUS_SHUTDOWN  -108
int32_t actual_length; /* actual transfer length */
int32_t error_count; /* number of ISO errors */
 };
-- 
2.26.2
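
Since the values mirror Linux errno numbers, a frontend built against Linux
headers could sanity-check the defines at build time. A minimal sketch
(hypothetical check, not part of this patch):

    #include <linux/errno.h>
    #include <linux/bug.h>

    /* Hypothetical: USBIF_STATUS_* are negative Linux errno values. */
    static inline void usbif_status_build_checks(void)
    {
        BUILD_BUG_ON(USBIF_STATUS_NODEV    != -ENODEV);    /* -19 */
        BUILD_BUG_ON(USBIF_STATUS_INVAL    != -EINVAL);    /* -22 */
        BUILD_BUG_ON(USBIF_STATUS_STALL    != -EPIPE);     /* -32 */
        BUILD_BUG_ON(USBIF_STATUS_SHUTDOWN != -ESHUTDOWN); /* -108 */
    }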




[PATCH 2/2] include/public: add better interface description to usbif.h

2021-09-24 Thread Juergen Gross
The PV USB protocol is poorly described. Add a more detailed
description to the usbif.h header file.

Signed-off-by: Juergen Gross 
---
 xen/include/public/io/usbif.h | 164 ++
 1 file changed, 164 insertions(+)

diff --git a/xen/include/public/io/usbif.h b/xen/include/public/io/usbif.h
index fbd6f953f8..10ff2ded58 100644
--- a/xen/include/public/io/usbif.h
+++ b/xen/include/public/io/usbif.h
@@ -32,6 +32,34 @@
 #include "../grant_table.h"
 
 /*
+ * Detailed Interface Description
+ * ==============================
+ * The pvUSB interface is using a split driver design: a frontend driver in
+ * the guest and a backend driver in a driver domain (normally dom0) having
+ * access to the physical USB device(s) being passed to the guest.
+ *
+ * The frontend and backend drivers use XenStore to initiate the connection
+ * between them, the I/O activity is handled via two shared ring pages and an
+ * event channel. As the interface between frontend and backend is at the USB
+ * host connector level, multiple (up to 31) physical USB devices can be
+ * handled by a single connection.
+ *
+ * The Xen pvUSB device name is "qusb", so the frontend's XenStore entries are
+ * to be found under "device/qusb", while the backend's XenStore entries are
+ * under "backend//qusb".
+ *
+ * When a new pvUSB connection is established, the frontend needs to setup the
+ * two shared ring pages for communication and the event channel. The ring
+ * pages need to be made available to the backend via the grant table
+ * interface.
+ *
+ * One of the shared ring pages is used by the backend to inform the frontend
+ * about USB device plug events (device to be added or removed). This is the
+ * "conn-ring".
+ *
+ * The other ring page is used for USB I/O communication (requests and
+ * responses). This is the "urb-ring".
+ *
  * Feature and Parameter Negotiation
  * =================================
  * The two halves of a Xen pvUSB driver utilize nodes within the XenStore to
@@ -99,6 +127,142 @@
  *  The machine ABI rules governing the format of all ring request and
  *  response structures.
  *
+ * Protocol Description
+ * ====================
+ *
+ *              -------------- USB device plug events --------------
+ *
+ * USB device plug events are sent via the "conn-ring" shared page. As only
+ * events are being sent, the respective requests from the frontend to the
+ * backend are just dummy ones.
+ * The events sent to the frontend have the following layout:
+ *         0                1                 2               3        octet
+ * +----------------+----------------+----------------+----------------+
+ * |               id                |    portnum     |     speed      | 4
+ * +----------------+----------------+----------------+----------------+
+ *   id - uint16_t, event id (taken from the actual frontend dummy request)
+ *   portnum - uint8_t, port number (1 ... 31)
+ *   speed - uint8_t, device USBIF_SPEED_*, USBIF_SPEED_NONE == unplug
+ *
+ * The dummy request:
+ *         0                1        octet
+ * +----------------+----------------+
+ * |               id                | 2
+ * +----------------+----------------+
+ *   id - uint16_t, guest supplied value (no need for being unique)
+ *
+ *              ------------------- USB I/O request -------------------
+ *
+ * A single USB I/O request on the "urb-ring" has the following layout:
+ *         0                1                 2               3        octet
+ * +----------------+----------------+----------------+----------------+
+ * |               id                |         nr_buffer_segs          | 4
+ * +----------------+----------------+----------------+----------------+
+ * |                               pipe                                | 8
+ * +----------------+----------------+----------------+----------------+
+ * |         transfer_flags          |          buffer_length          | 12
+ * +----------------+----------------+----------------+----------------+
+ * |                       request type specific                       | 16
+ * |                               data                                | 20
+ * +----------------+----------------+----------------+----------------+
+ * |                              seg[0]                               | 24
+ * |                               data                                | 28
+ * +----------------+----------------+----------------+----------------+
+ * |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/|
+ * +----------------+----------------+----------------+----------------+
+ * |             seg[USBIF_MAX_SEGMENTS_PER_REQUEST - 1]               | 144
+ * |                               data                                | 148
+ * +----------------+----------------+----------------+----------------+
+ * Bit field bit number 0 is always least significant bit, undefined bits must
+ * be zero.
+ *   id - uint16_t,
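For readers following the layout descriptions above, a minimal C sketch of the
conn-ring message pair they describe (struct and field names are assumed for
illustration; the patch itself documents the wire format in comments only):

    #include <stdint.h>

    /* Sketch of the conn-ring messages described above (names assumed). */
    struct usbif_conn_request {
        uint16_t id;       /* guest supplied value, echoed back in the event */
    };

    struct usbif_conn_response {
        uint16_t id;       /* event id, taken from the frontend dummy request */
        uint8_t  portnum;  /* port number, 1 ... 31 */
        uint8_t  speed;    /* USBIF_SPEED_*, USBIF_SPEED_NONE == unplug */
    };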

[PATCH 0/2] include/public: update usbif.h

2021-09-24 Thread Juergen Gross
Add some missing defines and documentation to the pvUSB header file.

Juergen Gross (2):
  include/public: add possible status values to usbif.h
  include/public: add better interface description to usbif.h

 xen/include/public/io/usbif.h | 171 ++
 1 file changed, 171 insertions(+)

-- 
2.26.2




[xen-unstable test] 165174: regressions - FAIL

2021-09-24 Thread osstest service owner
flight 165174 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165174/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-prev   6 xen-buildfail REGR. vs. 164945
 build-amd64-prev  6 xen-buildfail REGR. vs. 164945
 test-arm64-arm64-libvirt-raw 17 guest-start/debian.repeat fail REGR. vs. 164945

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-migrupgrade  1 build-check(1)   blocked  n/a
 test-amd64-i386-migrupgrade   1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 164945
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 164945
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 164945
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 164945
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 164945
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 164945
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 164945
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 164945
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 164945
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 164945
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 164945
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 164945
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-arm

Re: [future abi] [RFC PATCH V3] xen/gnttab: Store frame GFN in struct page_info on Arm

2021-09-24 Thread Julien Grall

Hi Roger,

On 24/09/2021 13:41, Roger Pau Monné wrote:

On Thu, Sep 23, 2021 at 09:59:26PM +0100, Andrew Cooper wrote:

On 23/09/2021 20:32, Oleksandr Tyshchenko wrote:

Suggested-by: Julien Grall 
Signed-off-by: Oleksandr Tyshchenko 
---
You can find the related discussions at:
https://lore.kernel.org/xen-devel/93d0df14-2c8a-c2e3-8c51-544121901...@xen.org/
https://lore.kernel.org/xen-devel/1628890077-12545-1-git-send-email-olekst...@gmail.com/
https://lore.kernel.org/xen-devel/1631652245-30746-1-git-send-email-olekst...@gmail.com/

! Please note, there is still unresolved locking question here for which
I failed to find a suitable solution. So, it is still an RFC !


Just FYI, I thought I'd share some of the plans for ABI v2.  Obviously
these plans are future work and don't solve the current problem.

Guests mapping Xen pages is backwards.  There are reasons why it was
used for x86 PV guests, but the entire interface should have been design
differently for x86 HVM.

In particular, Xen should be mapping guest RAM, rather than the guest
manipulating the 2nd stage tables to map Xen RAM.  Amongst other things,
its far far lower overhead.


A much better design is one where the grant table looks like an MMIO
device.  The domain builder decides the ABI (v1 vs v2 - none of this
dynamic switch at runtime nonsense), and picks a block of guest physical
addresses, which are registered with Xen.  This forms the grant table,
status table (v2 only), and holes to map into.


I think this could be problematic for identity mapped Arm dom0, as
IIRC in that case grants are mapped so that gfn == mfn in order to
account for the lack of an IOMMU. You could use a bounce buffer, but
that would introduce a big performance penalty.


Or you could find a hole that is outside of the RAM regions. This is not 
trivial but not impossible (see [1]).




Other question is whether we want/need to keep such mode going
forward.


I am assuming by "such mode" you mean "identity mapped". If so, then I am
afraid this is not going to disappear on Arm, at least. There are still many
platforms out there without IOMMUs, or with devices which are not protected
(the GPU is a common one).


Furthermore, Arm just sent a series to introduce identity mapping for 
domUs as well (see [2]).


[1] <1631034578-12598-1-git-send-email-olekst...@gmail.com>
[2] <20210923031115.1429719-1-penny.zh...@arm.com>

Cheers,

--
Julien Grall



Re: [PATCH v2 01/12] xen/trace: Don't over-read trace objects

2021-09-24 Thread Dario Faggioli
On Mon, 2021-09-20 at 18:25 +0100, Andrew Cooper wrote:

> There is one buggy race record, TRC_RTDS_BUDGET_BURN.  As it must
> remain
> __packed (as cur_budget is misaligned), change bool has_extratime to
> uint32_t
> to compensate.
> 
Mmm... maybe my understanding of data alignment inside structs is a bit
lacking, but what is the actual issue here, and what would we need to do
to fix it (where, by fix, I mean us being able to get rid of the
`__packed`)?
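
To make the alignment point concrete, here is a hedged illustration (invented
field names, not the actual RTDS record): with an odd number of leading 32-bit
members, a following uint64_t sits at a 4-byte offset; without __packed the
compiler would insert padding and change the record layout, which is why the
attribute has to stay unless the fields are rearranged or shrunk:

    /* Illustration only -- not the real TRC_RTDS_BUDGET_BURN layout. */
    struct __packed example_rec {
        uint32_t vcpu;           /* 4 bytes ...                             */
        uint64_t cur_budget;     /* ... leaves this 64-bit field misaligned */
        uint32_t priority_level;
        uint32_t has_extratime;  /* was bool; widened per the quoted patch  */
    };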

If rearranging fields is not enough, we can think about making
priority_level and has_extratime smaller, or even combining them in
just one field and decode the information in xentrace.

Of course, I can send a patch for that myself, even as a followup of
this series when it's in, as soon as we agree about the best way
forward.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)




Re: [PATCH v2 02/18] VT-d: have callers specify the target level for page table walks

2021-09-24 Thread Roger Pau Monné
On Fri, Sep 24, 2021 at 11:42:13AM +0200, Jan Beulich wrote:
> In order to be able to insert/remove super-pages we need to allow
> callers of the walking function to specify at which point to stop the
> walk.
> 
> For intel_iommu_lookup_page() integrate the last level access into
> the main walking function.
> 
> dma_pte_clear_one() gets only partly adjusted for now: Error handling
> and order parameter get put in place, but the order parameter remains
> ignored (just like intel_iommu_map_page()'s order part of the flags).
> 
> Signed-off-by: Jan Beulich 
> ---
> I have to admit that I don't understand why domain_pgd_maddr() wants to
> populate all page table levels for DFN 0.

I think it would be enough to create up to the level requested by the
caller?

Seems like a lazy way to always assert that the level requested by the
caller would be present.

> 
> I was actually wondering whether it wouldn't make sense to integrate
> dma_pte_clear_one() into its only caller intel_iommu_unmap_page(), for
> better symmetry with intel_iommu_map_page().
> ---
> v2: Fix build.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -264,63 +264,116 @@ static u64 bus_to_context_maddr(struct v
>  return maddr;
>  }
>  
> -static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int alloc)
> +/*
> + * This function walks (and if requested allocates) page tables to the
> + * designated target level. It returns
> + * - 0 when a non-present entry was encountered and no allocation was
> + *   requested,
> + * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
> + *   failure,
> + * - for target > 0 the address of the page table holding the leaf PTE for
  ^ physical

I think it's clearer, as the return type could be ambiguous.

> + *   the requested address,
> + * - for target == 0 the full PTE.

Could this create confusion if for example one PTE maps physical page
0? A caller getting back a full PTE with address 0 and some of the low bits
set could interpret the result as an error.

I think we already had this discussion on other occasions, but I would
rather add a parameter to be used as a return placeholder (ie: a
*dma_pte maybe?) and use the function return value just for errors
because IMO it's clearer, but I know you don't usually like this
approach, so I'm not going to insist.
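
To spell the ambiguity out, a sketch of how a caller has to disambiguate the
combined return value described in the quoted comment (illustrative only):

    /* Sketch: decoding the overloaded return value of the quoted function. */
    uint64_t ret = addr_to_dma_page_maddr(d, daddr, target, &flush_flags, false);

    if ( !ret )
        /* non-present entry, no allocation requested */;
    else if ( ret < PAGE_SIZE )
        /* allocation failure at level 'ret' */;
    else if ( target )
        /* maddr of the page table holding the leaf PTE */;
    else
        /* full PTE -- a PTE mapping frame 0 is hard to tell from an error */;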

> + */
> +static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
> +   unsigned int target,
> +   unsigned int *flush_flags, bool alloc)
>  {
>  struct domain_iommu *hd = dom_iommu(domain);
>  int addr_width = agaw_to_width(hd->arch.vtd.agaw);
>  struct dma_pte *parent, *pte = NULL;
> -int level = agaw_to_level(hd->arch.vtd.agaw);
> -int offset;
> +unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
>  u64 pte_maddr = 0;
>  
>  addr &= (((u64)1) << addr_width) - 1;
>  ASSERT(spin_is_locked(&hd->arch.mapping_lock));
> +ASSERT(target || !alloc);

Might be better to use an if with ASSERT_UNREACHABLE and return an
error? (ie: level itself?)

> +
>  if ( !hd->arch.vtd.pgd_maddr )
>  {
>  struct page_info *pg;
>  
> -if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
> +if ( !alloc )
> +goto out;
> +
> +pte_maddr = level;
> +if ( !(pg = iommu_alloc_pgtable(domain)) )
>  goto out;
>  
>  hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
>  }
>  
> -parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
> -while ( level > 1 )
> +pte_maddr = hd->arch.vtd.pgd_maddr;
> +parent = map_vtd_domain_page(pte_maddr);
> +while ( level > target )
>  {
>  offset = address_level_offset(addr, level);
>  pte = &parent[offset];
>  
>  pte_maddr = dma_pte_addr(*pte);
> -if ( !pte_maddr )
> +if ( !dma_pte_present(*pte) || (level > 1 && 
> dma_pte_superpage(*pte)) )
>  {
>  struct page_info *pg;
> +/*
> + * Higher level tables always set r/w, last level page table
> + * controls read/write.
> + */
> +struct dma_pte new_pte = { DMA_PTE_PROT };
>  
>  if ( !alloc )
> -break;
> +{
> +pte_maddr = 0;
> +if ( !dma_pte_present(*pte) )
> +break;
> +
> +/*
> + * When the leaf entry was requested, pass back the full PTE,
> + * with the address adjusted to account for the residual of
> + * the walk.
> + */
> +pte_maddr = pte->val +

Wouldn't it be better to use dma_pte_addr(*pte) rather than accessing
pte->val, and then you could drop the PAGE_MASK?

Or is the addr parameter not guaranteed to be page aligned?

> +

Re: [PATCH v2 01/12] xen/trace: Don't over-read trace objects

2021-09-24 Thread Dario Faggioli
On Wed, 2021-09-22 at 13:58 +0100, Andrew Cooper wrote:
> On 22/09/2021 08:01, Jan Beulich wrote:
> 
> > 
> > Agreed. Whether the truncation is an issue in practice is
> > questionable,
> > as I wouldn't expect budget to be consumed in multiple-second
> > individual
> > steps. But I didn't check whether this scheduler might allow a vCPU
> > to
> > run for this long all in one go.
> 
> I expect it's marginal too.  
>
It is indeed.

> Honestly, its not a bug I care to fix right
> about now.  I could leave a /* TODO: truncation? */ in place so
> whomever
> encounters weird behaviour from this trace record has a bit more help
> of
> where to look?
> 
Sure, that's fine for me.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)




Re: [PATCH v2 2/2] arm/efi: Use dom0less configuration when using EFI boot

2021-09-24 Thread Jan Beulich
On 22.09.2021 16:13, Luca Fancellu wrote:
> +static unsigned int __init allocate_dom0less_file(EFI_FILE_HANDLE dir_handle,
> +  const char *name,
> +  unsigned int name_len)
> +{
> +dom0less_module_name* file_name;
> +union string module_name;
> +unsigned int ret_idx;
> +
> +/*
> + * Check if there is any space left for a domU module, the variable
> + * dom0less_modules_available is updated each time we use read_file(...)
> + * successfully.
> + */
> +if ( !dom0less_modules_available )
> +blexit(L"No space left for domU modules");
> +
> +module_name.s = (char*) name;

Unfortunately there are too many style issues in these Arm additions to
really enumerate; I'd like to ask that you go through yourself with
./CODING_STYLE, surrounding code, and review comments on earlier patches
of yours in mind. This cast stands out, though: I'm pretty sure you were
told before that casts are often dangerous and hence should be avoided
whenever (easily) possible. There was a prior case where union string
was used in a similar way, not all that long ago. Hence why it now has
a "const char *" member. (That's still somewhat risky, but imo way
better than a cast.)

> @@ -1361,12 +1360,21 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE 
> *SystemTable)
>  efi_bs->FreePages(cfg.addr, PFN_UP(cfg.size));
>  cfg.addr = 0;
>  
> -dir_handle->Close(dir_handle);
> -
>  if ( gop && !base_video )
>  gop_mode = efi_find_gop_mode(gop, cols, rows, depth);
>  }
>  
> +/*
> + * Check if a proper configuration is provided to start Xen:
> + *  - Dom0 specified (minimum required)
> + *  - Dom0 and DomU(s) specified
> + *  - DomU(s) specified
> + */
> +if ( !efi_arch_check_dom0less_boot(dir_handle) && !kernel.addr )
> +blexit(L"No Dom0 kernel image specified.");
> +
> +dir_handle->Close(dir_handle);

So far I was under the impression that handles and alike need closing
before calling Exit(), to prevent resource leaks. While I will admit
that likely there are more (pre-existing) affected paths, I think that
- if that understanding of mine is correct - it would be nice to avoid
adding yet more instances.

Jan




Re: [PATCH v2.1 14/12] xen: Switch to new TRACE() API

2021-09-24 Thread Jan Beulich
On 20.09.2021 21:32, Andrew Cooper wrote:
> (Almost) no functional change.
> 
> irq_move_cleanup_interrupt() changes two smp_processor_id() calls to the 'me'
> local variable which manifests as a minor code improvement.  All other
> differences in the compiled binary are to do with line numbers changing.
> 
> Some conversion notes:
>  * HVMTRACE_LONG_[234]D() and TRACE_2_LONG_[234]D() were latently buggy.  They
>blindly discard extra parameters, but luckily no users are impacted.  They
>are also obfuscated wrappers, depending on exactly one or two parameters
>being TRC_PAR_LONG() to compile successfully.
>  * HVMTRACE_LONG_1D() behaves unlike its named companions, and takes exactly
>one 64bit parameter which it splits manually.  It's one user,
>vmx_cr_access()'s LMSW path, is gets adjusted to use TRACE_PARAM64().
>  * TRACE_?D() and TRACE_2_LONG_*() change to TRACE_TIME() as cycles is always.
>  * HVMTRACE_ND() is opencoded for VMENTRY/VMEXIT records to include cycles.
>These are converted to TRACE_TIME(), with the old modifier parameter
>expressed as an OR at the callsite.  One callsite, svm_vmenter_helper() had
>a nested tb_init_done check, which is dropped.  (The optimiser also spotted
>this, which is why it doesn't manifest as a binary difference.)
>  * All HVMTRACE_?D() change to TRACE() as cycles is explicitly skipped.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Jan Beulich 

> I'm in two minds as to whether to split this up by subsystem or not.  It is
> 95% x86, and isn't a massive patch.

Either way looks fine to me in this case; splitting might allow parts
to go in before you've managed to get acks from all relevant people.
If anything I might have preferred seeing e.g. all the HVM*() macros
getting replaced and dropped at the same time, rather than the
dropping (combined with others) getting split off.

Jan




Re: [PATCH v2.1 15/12] xen/trace: Drop old trace macros

2021-09-24 Thread Jan Beulich
On 20.09.2021 21:33, Andrew Cooper wrote:
> With all users updated to the new API, drop the old API.  This includes all of
> asm/hvm/trace.h, which allows us to drop some includes.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Jan Beulich 
albeit I'd like to note that ...

> --- a/xen/include/asm-x86/hvm/trace.h
> +++ /dev/null
> @@ -1,114 +0,0 @@
> -#ifndef __ASM_X86_HVM_TRACE_H__
> -#define __ASM_X86_HVM_TRACE_H__
> -
> -#include 
> -
> -#define DEFAULT_HVM_TRACE_ON  1
> -#define DEFAULT_HVM_TRACE_OFF 0
> -
> -#define DEFAULT_HVM_VMSWITCH   DEFAULT_HVM_TRACE_ON
> -#define DEFAULT_HVM_PF DEFAULT_HVM_TRACE_ON
> -#define DEFAULT_HVM_INJECT DEFAULT_HVM_TRACE_ON
> -#define DEFAULT_HVM_IO DEFAULT_HVM_TRACE_ON
> -#define DEFAULT_HVM_REGACCESS  DEFAULT_HVM_TRACE_ON
> -#define DEFAULT_HVM_MISC   DEFAULT_HVM_TRACE_ON
> -#define DEFAULT_HVM_INTR   DEFAULT_HVM_TRACE_ON

... least the part up to here as potentially useful to limit trace
output. Afaics there's no replacement in the new model, as you
invoke the base tracing macros now directly.

Jan




Re: [PATCH v2.1 13/12] xen/trace: Introduce new API

2021-09-24 Thread Jan Beulich
On 20.09.2021 21:29, Andrew Cooper wrote:
> --- a/xen/include/xen/trace.h
> +++ b/xen/include/xen/trace.h
> @@ -74,6 +74,30 @@ static inline void __trace_hypercall(uint32_t event, 
> unsigned long op,
>   const xen_ulong_t *args) {}
>  #endif /* CONFIG_TRACEBUFFER */
>  
> +/*
> + * Create a trace record, packaging up to 7 additional parameters into a
> + * uint32_t array.
> + */
> +#define TRACE_INTERNAL(_e, _c, ...) \
> +do {\
> +if ( unlikely(tb_init_done) )   \
> +{   \
> +uint32_t _d[] = { __VA_ARGS__ };\
> +BUILD_BUG_ON(ARRAY_SIZE(_d) > TRACE_EXTRA_MAX); \
> +__trace_var(_e, _c, sizeof(_d), sizeof(_d) ? _d : NULL);\
> +}   \
> +} while ( 0 )

I know we sort of disagree on this aspect, but I would really like
to understand what you (and others) think the leading underscores
are good for in macro parameter names. And if those went away, I'd
like to ask that the local variable also become e.g. d_, like we
have started doing elsewhere.

> +/* Split a uint64_t into two adjacent uint32_t's for a trace record. */
> +#define TRACE_PARAM64(p)(uint32_t)(p), ((p) >> 32)

You don't have a leading underscore here, for example.

> +/* Create a trace record with time included. */
> +#define TRACE_TIME(_e, ...) TRACE_INTERNAL(_e, true,  ##__VA_ARGS__)
> +
> +/* Create a trace record with no time included. */
> +#define TRACE(_e, ...)  TRACE_INTERNAL(_e, false, ##__VA_ARGS__)

Is , ## __VA_ARGS__ really doing what you expect? So far it has been
my understanding that the special behavior concatenating with a
comma only applies to the GNU form of variable macro arguments, e.g.

#define TRACE(_e, args...)  TRACE_INTERNAL(_e, false, ## args)

As a minor aspect (nit) - iirc it was you who had been asking me in a
few cases to treat ## like a normal binary operator when considering
style, requesting me to have a blank on each side of it.

> +
> +

Nit: Please can you avoid introducing double blank lines?

Jan
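
For orientation, a minimal sketch of call sites using the quoted macros (the
event name and arguments are placeholders, not identifiers from the series):

    /* Hypothetical call sites for the macros quoted above. */
    TRACE_TIME(TRC_EXAMPLE, domid, vcpu_id);   /* record including cycles */
    TRACE(TRC_EXAMPLE);                        /* no extra data, no cycles */
    TRACE(TRC_EXAMPLE, TRACE_PARAM64(val64));  /* 64-bit value split in two */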




[xen-unstable-smoke test] 165180: tolerable all pass - PUSHED

2021-09-24 Thread osstest service owner
flight 165180 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165180/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  728998f6f2b7e1420e771236efec65cbf6143b7b
baseline version:
 xen  604be1b333b1b66052ab9b0133f156890549a4f0

 Last test of basis   165157  2021-09-22 15:01:38 Z    1 days
 Testing same since   165180  2021-09-24 10:01:36 Z    0 days    1 attempts


People who touched revisions under test:
  Ian Jackson 
  Jan Beulich 
  Stefano Stabellini 
  Wei Chen 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   604be1b333..728998f6f2  728998f6f2b7e1420e771236efec65cbf6143b7b -> smoke



Re: [PATCH v2 11/17] xen/arm: PCI host bridge discovery within XEN on ARM

2021-09-24 Thread Rahul Singh
Hi Stefano,

> On 23 Sep 2021, at 8:12 pm, Stefano Stabellini  wrote:
> 
> On Thu, 23 Sep 2021, Rahul Singh wrote:
 +goto err_exit;
 +}
>>> 
>>> This is unnecessary at the moment, right? Can we get rid of ops->init ?
>> 
>> No this is required for N1SDP board. Please check below patch.
>> https://gitlab.com/rahsingh/xen-integration/-/commit/6379ba5764df33d57547087cff4ffc078dc515d5
> 
> OK
> 
> 
 +int pci_host_common_probe(struct dt_device_node *dev, const void *data)
 +{
 +struct pci_host_bridge *bridge;
 +struct pci_config_window *cfg;
 +struct pci_ecam_ops *ops;
 +const struct dt_device_match *of_id;
 +int err;
 +
 +if ( dt_device_for_passthrough(dev) )
 +return 0;
 +
 +of_id = dt_match_node(dev->dev.of_match_table, dev->dev.of_node);
 +ops = (struct pci_ecam_ops *) of_id->data;
>>> 
>>> Do we really need dt_match_node and dev->dev.of_match_table to get
>>> dt_device_match.data?
>>> 
>> 
>>> data is passed as a parameter to pci_host_common_probe, isn't it enough
>>> to do:
>>> 
>>> ops = (struct pci_ecam_ops *) data;
>> 
>> As of now not required but in future we might need it if we implement other 
>> ecam supported bridge
>> 
>> static const struct dt_device_match gen_pci_dt_match[] = {   
>>
>>{ .compatible = "pci-host-ecam-generic",  
>>   
>>  .data =   &pci_generic_ecam_ops },
>> 
>>{ .compatible = "pci-host-cam-generic",
>>  .data = &gen_pci_cfg_cam_bus_ops }, 
>> 
>>{ },  
>>   
>> };
> 
> Even if we add another ECAM-supported bridge, the following:
> 
> ops = (struct pci_ecam_ops *) data;
> 
> could still work, right? The probe function will directly receive as
> parameter the .data pointer. You shouldn't need the indirection via
> dt_match_node?

As per my understanding, the probe function will not get the .data pointer.
The probe data argument is NULL in most of the cases in Xen.
Please have a look at the dt_pci_init() -> device_init(..) call flow
implementation.

Regards,
Rahul
 
> 
> If you are worried about gen_pci_cfg_cam_bus_ops not being a struct
> pci_ecam_ops: that problem can also be solved by making
> gen_pci_cfg_cam_bus_ops a struct containinig a struct pci_ecam_ops.




Re: [PATCH v3 3/9] x86/PVH: permit more physdevop-s to be used by Dom0

2021-09-24 Thread Jan Beulich
On 22.09.2021 16:22, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:17:37AM +0200, Jan Beulich wrote:
>> Certain notifications of Dom0 to Xen are independent of the mode Dom0 is
>> running in. Permit further PCI related ones (only their modern forms).
>> Also include the USB2 debug port operation at this occasion.
>>
>> Signed-off-by: Jan Beulich 
>> ---
>> I'm uncertain about the has_vpci() part of the check: I would think
>> is_hardware_domain() is both sufficient and concise. Without vPCI a PVH
>> Dom0 won't see any PCI devices in the first place (and hence would
>> effectively be non-functioning). Dropping this would in particular make
>> PHYSDEVOP_dbgp_op better fit in the mix.
>> ---
>> v3: New.
>>
>> --- a/xen/arch/x86/hvm/hypercall.c
>> +++ b/xen/arch/x86/hvm/hypercall.c
>> @@ -94,6 +94,12 @@ static long hvm_physdev_op(int cmd, XEN_
>>  break;
>>  
>>  case PHYSDEVOP_pci_mmcfg_reserved:
>> +case PHYSDEVOP_pci_device_add:
>> +case PHYSDEVOP_pci_device_remove:
>> +case PHYSDEVOP_restore_msi_ext:
> 
> Hm, I'm slightly unsure we need the restore operation. Wouldn't it be
> better to just reset all device state on suspend and then let dom0
> restore it's state as it does on native?

Hmm - Linux (even after my patch separating XEN_DOM0 from XEN_PV)
only issues this call when running in PV mode, so from that
perspective leaving it out would be okay. (Otherwise, i.e. if we
decide to permit its use, I guess we would better also permit
PHYSDEVOP_restore_msi. Somehow I had managed to not spot that.)
However, ...

> Maybe there's some wrinkle that prevents that from working properly.

... Xen might be using MSI for the serial console, and I'm not sure
this interrupt would get properly re-setup.

>> +case PHYSDEVOP_dbgp_op:
>> +case PHYSDEVOP_prepare_msix:
>> +case PHYSDEVOP_release_msix:
> 
> Albeit I think those two operations won't strictly conflict with vPCI
> usage (as they require no MSIX entries to be activ) I still wonder
> whether we will end up needing them on a PVH dom0. They are used by
> pciback and it's not yet clear how we will end up using pciback on a
> PVH dom0, hence I would prefer if we could leave them out until
> strictly required.

Even without a clear plan towards pciback, do you have any idea how
their function could sensibly be replaced in the PVH case? If there
is at least a rough idea, I'd be fine leaving them out here.

Jan




Re: [PATCH] tools/libxl: Remove page_size and page_shift from struct libxl_acpi_ctxt

2021-09-24 Thread Ian Jackson
Jan Beulich writes ("Re: [PATCH] tools/libxl: Remove page_size and page_shift 
from struct libxl_acpi_ctxt"):
> On 24.09.2021 13:05, Kevin Stefanov wrote:
> > As a result of recent work, two members of struct libxl_acpi_ctxt were
> > left with only one user. Thus, it becomes illogical for them to be
> > members of the struct at all.
> > 
> > Drop the two struct members and instead let the only function using
> > them have them as local variables.
> > 
> > Signed-off-by: Kevin Stefanov 
> 
> Reviewed-by: Jan Beulich 

Acked-by: Ian Jackson 

> I would like to suggest though to consider ...
> 
> > @@ -176,20 +174,19 @@ int libxl__dom_load_acpi(libxl__gc *gc,
> >  goto out;
> >  }
> >  
> > -config.rsdp = (unsigned long)libxl__malloc(gc, libxl_ctxt.page_size);
> > -config.infop = (unsigned long)libxl__malloc(gc, libxl_ctxt.page_size);
> > +config.rsdp = (unsigned long)libxl__malloc(gc, page_size);
> > +config.infop = (unsigned long)libxl__malloc(gc, page_size);
> >  /* Pages to hold ACPI tables */
> > -libxl_ctxt.buf = libxl__malloc(gc, NUM_ACPI_PAGES *
> > -   libxl_ctxt.page_size);
> > +libxl_ctxt.buf = libxl__malloc(gc, NUM_ACPI_PAGES * page_size);
> 
> ... using page_shift to replace all multiplications like the one here
> at this occasion.

I don't have an opinion about this; my tools ack can stand if this
change is made and reviewed.

Ian.



Re: [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks

2021-09-24 Thread Jan Beulich
On 24.09.2021 12:58, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:41:14AM +0200, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -178,7 +178,8 @@ void __init iommu_dte_add_device_entry(s
>>   * page tables.
>>   */
>>  static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
>> -  unsigned long *pt_mfn, bool map)
>> +  unsigned int target, unsigned long *pt_mfn,
>> +  bool map)
>>  {
>>  union amd_iommu_pte *pde, *next_table_vaddr;
>>  unsigned long  next_table_mfn;
>> @@ -189,7 +190,8 @@ static int iommu_pde_from_dfn(struct dom
>>  table = hd->arch.amd.root_table;
>>  level = hd->arch.amd.paging_mode;
>>  
>> -BUG_ON( table == NULL || level < 1 || level > 6 );
>> +if ( !table || target < 1 || level < target || level > 6 )
>> +return 1;
> 
> I would consider adding an ASSERT_UNREACHABLE here, since there should
> be no callers passing those parameters, and we shouldn't be
> introducing new ones. Unless you believe there could be valid callers
> passing level < target parameter.

Ah yes - added.
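
(Presumably the adjusted check then reads roughly as follows; a sketch
inferred from this exchange, not the posted v3:)

    if ( !table || target < 1 || level < target || level > 6 )
    {
        ASSERT_UNREACHABLE();
        return 1;
    }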

>> @@ -200,7 +202,7 @@ static int iommu_pde_from_dfn(struct dom
>>  
>>  next_table_mfn = mfn_x(page_to_mfn(table));
>>  
>> -while ( level > 1 )
>> +while ( level > target )
>>  {
>>  unsigned int next_level = level - 1;
> 
> There's a comment at the bottom of iommu_pde_from_dfn that needs to be
> adjusted to no longer explicitly mention level 1.

Oh, thanks for noticing. I recall spotting that comment as in
need of updating before starting any of this work. And then I
forgot ...

> With that adjusted:
> 
> Reviewed-by: Roger Pau Monné 

Thanks.

> FWIW, I always get confused with AMD and shadow code using level 1 to
> denote the smaller page size level while Intel uses 0.

Wait - with "Intel" you mean just EPT here, don't you? VT-d
code is using 1-based numbering again from all I can tell.

Jan




Re: [PATCH v3 2/2] xen-pciback: allow compiling on other archs than x86

2021-09-24 Thread Oleksandr Andrushchenko

On 24.09.21 08:46, Oleksandr Andrushchenko wrote:
> On 23.09.21 23:00, Stefano Stabellini wrote:
>> On Thu, 23 Sep 2021, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko 
>>>
>>> Xen-pciback driver was designed to be built for x86 only. But it
>>> can also be used by other architectures, e.g. Arm.
>>> Re-structure the driver in a way that it can be built for other
>>> platforms as well.
>>>
>>> Signed-off-by: Oleksandr Andrushchenko 
>>> Signed-off-by: Anastasiia Lukianenko 
>> The patch looks good to me. Only one thing: on ARM32 I get:
> WE do not yet support Xen PCI passthrough for ARM32
>> drivers/xen/xen-pciback/conf_space_header.c: In function ‘bar_init’:
>> drivers/xen/xen-pciback/conf_space_header.c:239:34: warning: right shift 
>> count >= width of type [-Wshift-count-overflow]
>>   bar->val = res[pos - 1].start >> 32;
>> ^~
>> drivers/xen/xen-pciback/conf_space_header.c:240:49: warning: right shift 
>> count >= width of type [-Wshift-count-overflow]
>>   bar->len_val = -resource_size(&res[pos - 1]) >> 32;
>>
>>
>> resource_size_t is defined as phys_addr_t and it can be 32bit on arm32.
>>
>>
>> One fix is to surround:
>>
>>  if (pos && (res[pos - 1].flags & IORESOURCE_MEM_64)) {
>>  bar->val = res[pos - 1].start >> 32;
>>  bar->len_val = -resource_size(&res[pos - 1]) >> 32;
>>  return bar;
>>  }
>>
>> with #ifdef PHYS_ADDR_T_64BIT
>>
> This might not be correct. We are dealing here with a 64-bit BAR on a 32-bit 
> OS.
>
> I think that this can still be valid use-case if BAR64.hi == 0. So, not sure
>
> we can just skip it with ifdef.
>
> Instead, to be on the safe side, we can have:
>
> config XEN_PCIDEV_STUB
>      tristate "Xen PCI-device stub driver"
>      depends on PCI && ARM64 && XEN
> e.g. only allow building the "stub" for ARM64 for now.

Or... there are a couple of places in the kernel where PCI deals with the
32-bit shift as:

drivers/pci/setup-res.c:108:        new = region.start >> 16 >> 16;
drivers/pci/iov.c:949:        new = region.start >> 16 >> 16;

commit cf7bee5a0bf270a4eace0be39329d6ac0136cc47
Date:   Sun Aug 7 13:49:59 *2005* +0400

[snip]

     Also make sure to write high bits - use "x >> 16 >> 16" (rather than the
     simpler ">> 32") to avoid warnings on 32-bit architectures where we're
     not going to have any high bits.

This might not be(?) immediately correct in the case of LPAE though, e.g. a
64-bit BAR may tolerate a 40-bit address in some use-cases?
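
A minimal illustration of the point (hypothetical helper, assuming
resource_size_t may be only 32 bits wide):

    /* Sketch only: "start >> 32" warns (and is undefined) when
     * resource_size_t is 32 bits wide; shifting twice by 16 yields 0 there
     * and the high bits on 64-bit configurations. */
    static u32 bar_start_hi(resource_size_t start)
    {
        return start >> 16 >> 16;
    }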



Re: [PATCH] tools/libxl: Remove page_size and page_shift from struct libxl_acpi_ctxt

2021-09-24 Thread Jan Beulich
On 24.09.2021 13:05, Kevin Stefanov wrote:
> As a result of recent work, two members of struct libxl_acpi_ctxt were
> left with only one user. Thus, it becomes illogical for them to be
> members of the struct at all.
> 
> Drop the two struct members and instead let the only function using
> them have them as local variables.
> 
> Signed-off-by: Kevin Stefanov 

Reviewed-by: Jan Beulich 

I would like to suggest though to consider ...

> @@ -176,20 +174,19 @@ int libxl__dom_load_acpi(libxl__gc *gc,
>  goto out;
>  }
>  
> -config.rsdp = (unsigned long)libxl__malloc(gc, libxl_ctxt.page_size);
> -config.infop = (unsigned long)libxl__malloc(gc, libxl_ctxt.page_size);
> +config.rsdp = (unsigned long)libxl__malloc(gc, page_size);
> +config.infop = (unsigned long)libxl__malloc(gc, page_size);
>  /* Pages to hold ACPI tables */
> -libxl_ctxt.buf = libxl__malloc(gc, NUM_ACPI_PAGES *
> -   libxl_ctxt.page_size);
> +libxl_ctxt.buf = libxl__malloc(gc, NUM_ACPI_PAGES * page_size);

... using page_shift to replace all multiplications like the one here
at this occasion.

Jan
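
Concretely, the suggestion amounts to replacements of this kind (sketch only):

    /* Shift instead of multiply, reusing the new page_shift local. */
    libxl_ctxt.buf = libxl__malloc(gc, NUM_ACPI_PAGES << page_shift);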




sh_unshadow_for_p2m_change() vs p2m_set_entry()

2021-09-24 Thread Jan Beulich
Tim,

I'm afraid you're still my best guess to hopefully get an insight
on issues like this one.

While doing IOMMU superpage work I was, just in the background,
considering in how far the superpage re-coalescing to be used there
couldn't be re-used for P2M / EPT / NPT. Which got me to think about
shadow mode's using of p2m-pt.c: That's purely software use of the
tables in that case, isn't it? In which case hardware support for
superpages shouldn't matter at all.

The only place where I could spot that P2M superpages would actually
make a difference to shadow code was sh_unshadow_for_p2m_change().
That one appears to have been dealing with 2M pages (more below)
already at the time of 0ca1669871f8a ("P2M: check whether hap mode
is enabled before using 2mb pages"), so I wonder what "potential
errors when hap is disabled" this commit's description might be
talking about. (Really, if it's software use of the tables only, in
principle even 512Gb superpages could be made use of there. But of
course sh_unshadow_for_p2m_change() wouldn't really like this, just
like it doesn't like 1Gb pages. So that's purely a theoretical
consideration.)

As to sh_unshadow_for_p2m_change()'s readiness for at least 2Mb
pages: The 4k page handling there differs from the 2M one primarily
in the p2m type checks: "p2m_is_valid(...) || p2m_is_grant(...)"
vs "p2m_is_valid(...)" plus later "!p2m_is_ram(...)", the first
three acting on the type taken from the old PTE, while the latter
acts on the type in the new (split) PTEs. Shouldn't the exact same
checks be used in both cases (less the p2m_is_grant() check which
can't be true for superpages)? IOW isn't !p2m_is_ram() at least
superfluous (and perhaps further redundant with the subsequent
!mfn_eq(l1e_get_mfn(npte[i]), omfn))? Instead I'm missing an
entry-is-present check, without which l1e_get_mfn(npte[i]) looks
risky at best. Or is p2m_is_ram() considered a sufficient
replacement, making assumptions on the behavior of a lot of other
code?

The 2M logic also first checks _PAGE_PRESENT (and _PAGE_PSE), while
the 4k logic appears to infer that the old page was present from
p2m_is_{valid,grant}().

And isn't this 2M page handling code (because of the commit pointed
at above) dead right now anyway? If not, where would P2M superpages
come from?

Thanks much,
Jan




[PATCH] tools/libxl: Remove page_size and page_shift from struct libxl_acpi_ctxt

2021-09-24 Thread Kevin Stefanov
As a result of recent work, two members of struct libxl_acpi_ctxt were
left with only one user. Thus, it becomes illogical for them to be
members of the struct at all.

Drop the two struct members and instead let the only function using
them have them as local variables.

Signed-off-by: Kevin Stefanov 
---
CC: Andrew Cooper 
CC: Ian Jackson 
CC: Wei Liu 
CC: Anthony PERARD 
---
 tools/libs/light/libxl_x86_acpi.c | 29 +
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/tools/libs/light/libxl_x86_acpi.c 
b/tools/libs/light/libxl_x86_acpi.c
index 57a6b63790..68902e7809 100644
--- a/tools/libs/light/libxl_x86_acpi.c
+++ b/tools/libs/light/libxl_x86_acpi.c
@@ -25,9 +25,6 @@
 struct libxl_acpi_ctxt {
 struct acpi_ctxt c;
 
-unsigned int page_size;
-unsigned int page_shift;
-
 /* Memory allocator */
 unsigned long guest_start;
 unsigned long guest_curr;
@@ -159,12 +156,13 @@ int libxl__dom_load_acpi(libxl__gc *gc,
 struct acpi_config config = {0};
 struct libxl_acpi_ctxt libxl_ctxt;
 int rc = 0, acpi_pages_num;
+unsigned int page_size, page_shift;
 
 if (b_info->type != LIBXL_DOMAIN_TYPE_PVH)
 goto out;
 
-libxl_ctxt.page_size = XC_DOM_PAGE_SIZE(dom);
-libxl_ctxt.page_shift =  XC_DOM_PAGE_SHIFT(dom);
+page_size = XC_DOM_PAGE_SIZE(dom);
+page_shift = XC_DOM_PAGE_SHIFT(dom);
 
 libxl_ctxt.c.mem_ops.alloc = mem_alloc;
 libxl_ctxt.c.mem_ops.v2p = virt_to_phys;
@@ -176,20 +174,19 @@ int libxl__dom_load_acpi(libxl__gc *gc,
 goto out;
 }
 
-config.rsdp = (unsigned long)libxl__malloc(gc, libxl_ctxt.page_size);
-config.infop = (unsigned long)libxl__malloc(gc, libxl_ctxt.page_size);
+config.rsdp = (unsigned long)libxl__malloc(gc, page_size);
+config.infop = (unsigned long)libxl__malloc(gc, page_size);
 /* Pages to hold ACPI tables */
-libxl_ctxt.buf = libxl__malloc(gc, NUM_ACPI_PAGES *
-   libxl_ctxt.page_size);
+libxl_ctxt.buf = libxl__malloc(gc, NUM_ACPI_PAGES * page_size);
 
 /*
  * Set up allocator memory.
  * Start next to acpi_info page to avoid fracturing e820.
  */
 libxl_ctxt.guest_start = libxl_ctxt.guest_curr = libxl_ctxt.guest_end =
-ACPI_INFO_PHYSICAL_ADDRESS + libxl_ctxt.page_size;
+ACPI_INFO_PHYSICAL_ADDRESS + page_size;
 
-libxl_ctxt.guest_end += NUM_ACPI_PAGES * libxl_ctxt.page_size;
+libxl_ctxt.guest_end += NUM_ACPI_PAGES * page_size;
 
 /* Build the tables. */
 rc = acpi_build_tables(&libxl_ctxt.c, &config);
@@ -199,8 +196,8 @@ int libxl__dom_load_acpi(libxl__gc *gc,
 }
 
 /* Calculate how many pages are needed for the tables. */
-acpi_pages_num = (ALIGN(libxl_ctxt.guest_curr, libxl_ctxt.page_size) -
-  libxl_ctxt.guest_start) >> libxl_ctxt.page_shift;
+acpi_pages_num = (ALIGN(libxl_ctxt.guest_curr, page_size) -
+  libxl_ctxt.guest_start) >> page_shift;
 
 dom->acpi_modules[0].data = (void *)config.rsdp;
 dom->acpi_modules[0].length = 64;
@@ -212,7 +209,7 @@ int libxl__dom_load_acpi(libxl__gc *gc,
 if (strcmp(xc_dom_guest_os(dom), "linux") ||
 xc_dom_feature_get(dom, XENFEAT_linux_rsdp_unrestricted))
 dom->acpi_modules[0].guest_addr_out = ACPI_INFO_PHYSICAL_ADDRESS +
-(1 + acpi_pages_num) * libxl_ctxt.page_size;
+(1 + acpi_pages_num) * page_size;
 else
 dom->acpi_modules[0].guest_addr_out = 0x10 - 64;
 
@@ -221,9 +218,9 @@ int libxl__dom_load_acpi(libxl__gc *gc,
 dom->acpi_modules[1].guest_addr_out = ACPI_INFO_PHYSICAL_ADDRESS;
 
 dom->acpi_modules[2].data = libxl_ctxt.buf;
-dom->acpi_modules[2].length = acpi_pages_num  << libxl_ctxt.page_shift;
+dom->acpi_modules[2].length = acpi_pages_num << page_shift;
 dom->acpi_modules[2].guest_addr_out = ACPI_INFO_PHYSICAL_ADDRESS +
-libxl_ctxt.page_size;
+page_size;
 
 out:
 return rc;
-- 
2.25.1




Re: [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks

2021-09-24 Thread Roger Pau Monné
On Fri, Sep 24, 2021 at 11:41:14AM +0200, Jan Beulich wrote:
> In order to be able to insert/remove super-pages we need to allow
> callers of the walking function to specify at which point to stop the
> walk. (For now at least gcc will instantiate just a variant of the
> function with the parameter eliminated, so effectively no change to
> generated code as far as the parameter addition goes.)
> 
> Instead of merely adjusting a BUG_ON() condition, convert it into an
> error return - there's no reason to crash the entire host in that case.
> 
> Signed-off-by: Jan Beulich 
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -178,7 +178,8 @@ void __init iommu_dte_add_device_entry(s
>   * page tables.
>   */
>  static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
> -  unsigned long *pt_mfn, bool map)
> +  unsigned int target, unsigned long *pt_mfn,
> +  bool map)
>  {
>  union amd_iommu_pte *pde, *next_table_vaddr;
>  unsigned long  next_table_mfn;
> @@ -189,7 +190,8 @@ static int iommu_pde_from_dfn(struct dom
>  table = hd->arch.amd.root_table;
>  level = hd->arch.amd.paging_mode;
>  
> -BUG_ON( table == NULL || level < 1 || level > 6 );
> +if ( !table || target < 1 || level < target || level > 6 )
> +return 1;

I would consider adding an ASSERT_UNREACHABLE here, since there should
be no callers passing those parameters, and we shouldn't be
introducing new ones. Unless you believe there could be valid callers
passing level < target parameter.

>  
>  /*
>   * A frame number past what the current page tables can represent can't
> @@ -200,7 +202,7 @@ static int iommu_pde_from_dfn(struct dom
>  
>  next_table_mfn = mfn_x(page_to_mfn(table));
>  
> -while ( level > 1 )
> +while ( level > target )
>  {
>  unsigned int next_level = level - 1;

There's a comment at the bottom of iommu_pde_from_dfn that needs to be
adjusted to no longer explicitly mention level 1.

With that adjusted:

Reviewed-by: Roger Pau Monné 

FWIW, I always get confused with AMD and shadow code using level 1 to
denote the smaller page size level while Intel uses 0.

Thanks, Roger.



Re: [PATCH v2 2/2] arm/efi: Use dom0less configuration when using EFI boot

2021-09-24 Thread Luca Fancellu



> On 23 Sep 2021, at 17:59, Stefano Stabellini  wrote:
> 
> On Thu, 23 Sep 2021, Luca Fancellu wrote:
 +/*
 + * Binaries will be translated into bootmodules, the maximum number for 
 them is
 + * MAX_MODULES where we should remove a unit for Xen and one for Xen DTB
 + */
 +#define MAX_DOM0LESS_MODULES (MAX_MODULES - 2)
 +static struct file __initdata dom0less_file;
 +static dom0less_module_name __initdata 
 dom0less_modules[MAX_DOM0LESS_MODULES];
 +static unsigned int __initdata dom0less_modules_available =
 +   MAX_DOM0LESS_MODULES;
 +static unsigned int __initdata dom0less_modules_idx;
>>> 
>>> This is a lot better!
>>> 
>>> We don't need both dom0less_modules_idx and dom0less_modules_available.
>>> You can just do:
>>> 
>>> #define dom0less_modules_available (MAX_DOM0LESS_MODULES - 
>>> dom0less_modules_idx)
>>> static unsigned int __initdata dom0less_modules_idx;
>>> 
>>> But maybe we can even get rid of dom0less_modules_available entirely?
>>> 
>>> We can change the check at the beginning of allocate_dom0less_file to:
>>> 
>>> if ( dom0less_modules_idx == dom0less_modules_available )
>>>   blexit
>>> 
>>> Would that work?
>> 
>> I thought about it, but I think they need to stay: dom0less_modules_available
>> is the upper bound for the additional dom0less modules (it is decremented
>> each time a dom0 module is added), while dom0less_modules_idx is the running
>> index for the array of dom0less modules.
> 
> [...]
> 
> 
 +/*
 + * Check if there is any space left for a domU module, the variable
 + * dom0less_modules_available is updated each time we use read_file(...)
 + * successfully.
 + */
 +if ( !dom0less_modules_available )
 +blexit(L"No space left for domU modules");
>>> 
>>> This is the check that could be based on dom0less_modules_idx
>>> 
>> 
>> The only way I see to base it on dom0less_modules_idx would be to compare it
>> to the number of modules still available, which is not constant because it
>> depends on how many dom0 modules are loaded, so two variables are still
>> needed. Don’t know if I’m missing something.
> 
> I think I understand where the confusion comes from. I am appending a
> small patch to show what I had in mind. We are already accounting for
> Xen and the DTB when declaring MAX_DOM0LESS_MODULES (MAX_MODULES - 2).
> The other binaries are the Dom0 kernel and ramdisk, however, in my setup
> they don't trigger a call to handle_dom0less_module_node because they
> are compatible xen,linux-zimage and xen,linux-initrd.
> 
> However, the Dom0 kernel and ramdisk can be also compatible
> multiboot,kernel and multiboot,ramdisk. If that is the case, then they
> would indeed trigger a call to handle_dom0less_module_node.
> 
> I think that is not a good idea: a function called
> handle_dom0less_module_node should only be called for dom0less modules
> (domUs) and not dom0.
> 
> But from the memory consumption point of view, it would be better
> actually to catch dom0 modules too as you intended. In that case we need to:
> 
> - add a check for xen,linux-zimage and xen,linux-initrd in
>  handle_dom0less_module_node also
> 
> - rename handle_dom0less_domain_node, handle_dom0less_module_node,
>  dom0less_file, dom0less_modules, dom0less_modules_idx to something
>  else more generic
> 
> 
> For instance they could be called:
> 
> handle_domain_node
> handle_module_node
> module_file
> modules
> modules_idx
> 
> 
> 
> 
> diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
> index e2b007ece0..812d0bd607 100644
> --- a/xen/arch/arm/efi/efi-boot.h
> +++ b/xen/arch/arm/efi/efi-boot.h
> @@ -22,8 +22,6 @@ typedef struct {
> #define MAX_DOM0LESS_MODULES (MAX_MODULES - 2)
> static struct file __initdata dom0less_file;
> static dom0less_module_name __initdata dom0less_modules[MAX_DOM0LESS_MODULES];
> -static unsigned int __initdata dom0less_modules_available =
> -   MAX_DOM0LESS_MODULES;
> static unsigned int __initdata dom0less_modules_idx;
> 
> #define ERROR_DOM0LESS_FILE_NOT_FOUND (-1)
> @@ -592,14 +590,6 @@ static void __init efi_arch_handle_module(const struct file *file,
>  * stop here.
>  */
> blexit(L"Unknown module type");
> -
> -/*
> - * dom0less_modules_available is decremented here because for each dom0
> - * file added, there will be an additional bootmodule, so the number
> - * of dom0less module files will be decremented because there is
> - * a maximum amount of bootmodules that can be loaded.
> - */
> -dom0less_modules_available--;
> }
> 
> /*
> @@ -643,7 +633,7 @@ static unsigned int __init allocate_dom0less_file(EFI_FILE_HANDLE dir_handle,
>  * dom0less_modules_available is updated each time we use read_file(...)
>  * successfully.
>  */
> -if ( !dom0less_modules_available )
> +if ( dom0le

Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture

2021-09-24 Thread Jan Beulich
On 24.09.2021 12:31, Wei Chen wrote:
>> From: Jan Beulich 
>> Sent: 2021年9月24日 15:59
>>
>> On 24.09.2021 06:34, Wei Chen wrote:
 From: Stefano Stabellini 
 Sent: 2021年9月24日 9:15

 On Thu, 23 Sep 2021, Wei Chen wrote:
> --- a/xen/common/Kconfig
> +++ b/xen/common/Kconfig
> @@ -11,6 +11,16 @@ config COMPAT
>  config CORE_PARKING
>   bool
>
> +config EFI
> + bool

 Without the title the option is not user-selectable (or de-selectable).
 So the help message below can never be seen.

 Either add a title, e.g.:

 bool "EFI support"

 Or fully make the option a silent option by removing the help text.
>>>
>>> OK, in the current Xen code, EFI is unconditionally compiled. Before
>>> we change the related code, I'd prefer to remove the help text.
>>
>> But that's not true: At least on x86 EFI gets compiled depending on
>> tool chain capabilities. Ultimately we may indeed want a user
>> selectable option here, but until then I'm afraid having this option
>> at all may be misleading on x86.
>>
> 
> I checked the build scripts, and yes, you're right. For x86, EFI is not a
> selectable option in Kconfig. I agree with you that we can't use the
> Kconfig system to decide whether to enable the EFI build for x86.
> 
> So how about we use this EFI option for Arm only? On Arm, we do not
> have such a toolchain dependency.

To be honest - don't know. That's because I don't know what you want
to use the option for subsequently.

Jan




Re: [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number

2021-09-24 Thread Jan Beulich
On 24.09.2021 12:33, Wei Chen wrote:
>> From: Jan Beulich 
>> Sent: 2021年9月24日 16:56
>>
>> On 23.09.2021 14:02, Wei Chen wrote:
>>> --- a/xen/arch/Kconfig
>>> +++ b/xen/arch/Kconfig
>>> @@ -17,3 +17,14 @@ config NR_CPUS
>>>   For CPU cores which support Simultaneous Multi-Threading or
>> similar
>>>   technologies, this the number of logical threads which Xen will
>>>   support.
>>> +
>>> +config NR_NUMA_NODES
>>> +   int "Maximum number of NUMA nodes supported"
>>> +   range 1 4095
>>
>> How was this upper bound established? Seeing 4095 is the limit of the
>> number of CPUs, do we really expect a CPU per node on such huge
>> systems? And did you check that whichever involved data types and
>> structures are actually suitable? I'm thinking e.g. of things like ...
>>
>>> --- a/xen/include/asm-x86/numa.h
>>> +++ b/xen/include/asm-x86/numa.h
>>> @@ -3,8 +3,6 @@
>>>
>>>  #include 
>>>
>>> -#define NODES_SHIFT 6
>>> -
>>>  typedef u8 nodeid_t;
>>
>> ... this.
>>
> 
> You're right, we use u8 as nodeid_t, so 4095 as the upper bound for the node
> number in this option is not reasonable. Maybe a 255 upper bound would be good?

I think it is, yes, but you will want to properly check.
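
For illustration, a minimal standalone sketch (not Xen code, with hypothetical
variable names) of the truncation being discussed:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint8_t nodeid_t;           /* as in xen/include/asm-x86/numa.h */

    int main(void)
    {
        unsigned int configured = 4095; /* hypothetical NR_NUMA_NODES bound */
        nodeid_t node = configured;     /* silently truncates modulo 256 */

        printf("%u\n", node);           /* prints 255, not 4095 */
        return 0;
    }

With nodeid_t being a u8, any node ID above 255 simply cannot be represented.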

Jan




RE: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA

2021-09-24 Thread Wei Chen

> -Original Message-
> From: Jan Beulich 
> Sent: 2021年9月24日 18:26
> To: Wei Chen 
> Cc: Bertrand Marquis ; xen-
> de...@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org
> Subject: Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to
> enable NUMA
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > --- a/xen/arch/arm/Kconfig
> > +++ b/xen/arch/arm/Kconfig
> > @@ -34,6 +34,17 @@ config ACPI
> >   Advanced Configuration and Power Interface (ACPI) support for Xen
> is
> >   an alternative to device tree on ARM64.
> >
> > + config DEVICE_TREE_NUMA
> > +   def_bool n
> > +   select NUMA
> 
> Two nits here: There's a stray blank on the first line, and you
> appear to mean just "bool", not "def_bool n" (there's no point
> in having defaults for select-only options).
> 

Ok

> > +config ARM_NUMA
> > +   bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if
> UNSUPPORTED
> > +   select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> > +   ---help---
> 
> And another nit here: We try to move away from "---help---", which
> is no longer supported by Linux'es newer kconfig. Please use just
> "help" in new code.
> 

Thanks, I will do it.

> Jan



RE: [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking NUMA node

2021-09-24 Thread Wei Chen


> -Original Message-
> From: Jan Beulich 
> Sent: 2021年9月24日 16:57
> To: Wei Chen 
> Cc: Bertrand Marquis ; xen-
> de...@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org
> Subject: Re: [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking
> NUMA node
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > When system turns NUMA off or system lacks of NUMA support,
> > Xen will fake a NUMA node to make system works as a single
> > node NUMA system.
> >
> > In this case the memory node map doesn't need to be allocated
> > from boot pages, it will use the _memnodemap directly. But
> > memnodemapsize hasn't been set. Xen should assert in phys_to_nid.
> > Because x86 was using an empty macro "VIRTUAL_BUG_ON" to replace
> > SSERT, this bug will not be triggered on x86.
> 
> Somehow an A got lost here, which I'll add back while committing.
> 

Thanks!

> > Actually, Xen will only use 1 slot of memnodemap in this case.
> > So we set memnodemap[0] to 0 and memnodemapsize to 1 in this
> > patch to fix it.
> >
> > Signed-off-by: Wei Chen 
> 
> Acked-by: Jan Beulich 



RE: [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number

2021-09-24 Thread Wei Chen
Hi Jan,

> -Original Message-
> From: Jan Beulich 
> Sent: 2021年9月24日 16:56
> To: Wei Chen 
> Cc: Bertrand Marquis ; xen-
> de...@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org
> Subject: Re: [PATCH 02/37] xen: introduce a Kconfig option to configure
> NUMA nodes number
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > Current NUMA nodes number is a hardcode configuration. This
> > configuration is difficult for an administrator to change
> > unless changing the code.
> >
> > So in this patch, we introduce this new Kconfig option for
> > administrators to change NUMA nodes number conveniently.
> > Also considering that not all architectures support NUMA,
> > this Kconfig option only can be visible on NUMA enabled
> > architectures. Non-NUMA supported architectures can still
> > use 1 as MAX_NUMNODES.
> 
> Do you really mean administrators here? To me command line options
> are for administrators, but build decisions are usually taken by
> build managers of distros.
> 
> > --- a/xen/arch/Kconfig
> > +++ b/xen/arch/Kconfig
> > @@ -17,3 +17,14 @@ config NR_CPUS
> >   For CPU cores which support Simultaneous Multi-Threading or
> similar
> >   technologies, this the number of logical threads which Xen will
> >   support.
> > +
> > +config NR_NUMA_NODES
> > +   int "Maximum number of NUMA nodes supported"
> > +   range 1 4095
> 
> How was this upper bound established? Seeing 4095 is the limit of the
> number of CPUs, do we really expect a CPU per node on such huge
> systems? And did you check that whichever involved data types and
> structures are actually suitable? I'm thinking e.g. of things like ...
> 
> > --- a/xen/include/asm-x86/numa.h
> > +++ b/xen/include/asm-x86/numa.h
> > @@ -3,8 +3,6 @@
> >
> >  #include 
> >
> > -#define NODES_SHIFT 6
> > -
> >  typedef u8 nodeid_t;
> 
> ... this.
> 

You're right, we use u8 as nodeid_t, so 4095 as the upper bound for the node
number in this option is not reasonable. Maybe a 255 upper bound would be good?

> Jan



RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture

2021-09-24 Thread Wei Chen
Hi Jan,

> -Original Message-
> From: Jan Beulich 
> Sent: 2021年9月24日 15:59
> To: Wei Chen 
> Cc: xen-devel@lists.xenproject.org; jul...@xen.org; Bertrand Marquis
> ; Stefano Stabellini 
> Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> EFI architecture
> 
> On 24.09.2021 06:34, Wei Chen wrote:
> >> From: Stefano Stabellini 
> >> Sent: 2021年9月24日 9:15
> >>
> >> On Thu, 23 Sep 2021, Wei Chen wrote:
> >>> --- a/xen/common/Kconfig
> >>> +++ b/xen/common/Kconfig
> >>> @@ -11,6 +11,16 @@ config COMPAT
> >>>  config CORE_PARKING
> >>>   bool
> >>>
> >>> +config EFI
> >>> + bool
> >>
> >> Without the title the option is not user-selectable (or de-selectable).
> >> So the help message below can never be seen.
> >>
> >> Either add a title, e.g.:
> >>
> >> bool "EFI support"
> >>
> >> Or fully make the option a silent option by removing the help text.
> >
> > OK, in the current Xen code, EFI is unconditionally compiled. Before
> > we change the related code, I'd prefer to remove the help text.
> 
> But that's not true: At least on x86 EFI gets compiled depending on
> tool chain capabilities. Ultimately we may indeed want a user
> selectable option here, but until then I'm afraid having this option
> at all may be misleading on x86.
> 

I checked the build scripts, and yes, you're right. For x86, EFI is not a
selectable option in Kconfig. I agree with you that we can't use the
Kconfig system to decide whether to enable the EFI build for x86.

So how about we use this EFI option for Arm only? On Arm, we do not
have such a toolchain dependency.

> Jan



Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA

2021-09-24 Thread Jan Beulich
On 23.09.2021 14:02, Wei Chen wrote:
> --- a/xen/arch/arm/Kconfig
> +++ b/xen/arch/arm/Kconfig
> @@ -34,6 +34,17 @@ config ACPI
> Advanced Configuration and Power Interface (ACPI) support for Xen is
> an alternative to device tree on ARM64.
>  
> + config DEVICE_TREE_NUMA
> + def_bool n
> + select NUMA

Two nits here: There's a stray blank on the first line, and you
appear to mean just "bool", not "def_bool n" (there's no point
in having defaults for select-only options).

> +config ARM_NUMA
> + bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if 
> UNSUPPORTED
> + select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> + ---help---

And another nit here: We try to move away from "---help---", which
is no longer supported by Linux'es newer kconfig. Please use just
"help" in new code.

Jan




RE: [PATCH 33/37] xen/arm: keep guest still be NUMA unware

2021-09-24 Thread Wei Chen

> -Original Message-
> From: Stefano Stabellini 
> Sent: 2021年9月24日 11:19
> To: Wei Chen 
> Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> Bertrand Marquis 
> Subject: Re: [PATCH 33/37] xen/arm: keep guest still be NUMA unware
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > The NUMA information provided in the host Device-Tree
> > are only for Xen. For dom0, we want to hide them as they
> > may be different (for now, dom0 is still not aware of NUMA)
> > The CPU and memory nodes are recreated from scratch for the
> > domain. So we already skip the "numa-node-id" property for
> > these two types of nodes.
> >
> > However, some devices like PCIe may have "numa-node-id"
> > property too. We have to skip them as well.
> >
> > Signed-off-by: Wei Chen 
> > ---
> >  xen/arch/arm/domain_build.c | 6 ++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> > index d233d634c1..6e94922238 100644
> > --- a/xen/arch/arm/domain_build.c
> > +++ b/xen/arch/arm/domain_build.c
> > @@ -737,6 +737,10 @@ static int __init write_properties(struct domain *d,
> struct kernel_info *kinfo,
> >  continue;
> >  }
> >
> > +/* Guest is numa unaware in current stage */
> 
> I would say: "Dom0 is currently NUMA unaware"
> 
> Reviewed-by: Stefano Stabellini 
> 

I will update the code comment in the next version.
Thanks!

> 
> > +if ( dt_property_name_is_equal(prop, "numa-node-id") )
> > +continue;
> > +
> >  res = fdt_property(kinfo->fdt, prop->name, prop_data, prop_len);
> >
> >  if ( res )
> > @@ -1607,6 +1611,8 @@ static int __init handle_node(struct domain *d,
> struct kernel_info *kinfo,
> >  DT_MATCH_TYPE("memory"),
> >  /* The memory mapped timer is not supported by Xen. */
> >  DT_MATCH_COMPATIBLE("arm,armv7-timer-mem"),
> > +/* Numa info doesn't need to be exposed to Domain-0 */
> > +DT_MATCH_COMPATIBLE("numa-distance-map-v1"),
> >  { /* sentinel */ },
> >  };
> >  static const struct dt_device_match timer_matches[] __initconst =
> > --
> > 2.25.1
> >


RE: [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API

2021-09-24 Thread Wei Chen

> -Original Message-
> From: Stefano Stabellini 
> Sent: 2021年9月24日 8:05
> To: Wei Chen 
> Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> Bertrand Marquis 
> Subject: Re: [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > We have introduced CONFIG_NUMA in previous patch. And this
>^ a
> 
> > option is enabled only on x86 in current stage. In a follow
> ^ at the
> 
> > up patch, we will enable this option for Arm. But we still
> > want users can disable the CONFIG_NUMA through Kconfig. In
>  ^ to be able to disable CONFIG_NUMA via Kconfig.
> 
> 
> > this case, keep current fake NUMA API, will make Arm code
>  ^ the
> 
> > still can work with NUMA aware memory allocation and scheduler.
> ^ able to work
> 
> >
> > Signed-off-by: Wei Chen 
> 
> With the small grammar fixes:
> 
> Reviewed-by: Stefano Stabellini 
> 
> 

Thanks, I will fix them in the next version.

> > ---
> >  xen/include/asm-arm/numa.h | 4 
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index 9d5739542d..8f1c67e3eb 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -5,6 +5,8 @@
> >
> >  typedef u8 nodeid_t;
> >
> > +#ifndef CONFIG_NUMA
> > +
> >  /* Fake one node for now. See also node_online_map. */
> >  #define cpu_to_node(cpu) 0
> >  #define node_to_cpumask(node)   (cpu_online_map)
> > @@ -25,6 +27,8 @@ extern mfn_t first_valid_mfn;
> >  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
> >  #define __node_distance(a, b) (20)
> >
> > +#endif
> > +
> >  static inline unsigned int arch_have_default_dmazone(void)
> >  {
> >  return 0;
> > --
> > 2.25.1
> >


Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA

2021-09-24 Thread Wei Chen

Hi Stefano,

On 2021/9/24 11:31, Stefano Stabellini wrote:

On Thu, 23 Sep 2021, Wei Chen wrote:

Arm platforms support both ACPI and device tree. We don't
want users to select device tree NUMA or ACPI NUMA manually.
We hope usrs can just enable NUMA for Arm, and device tree

   ^ users


NUMA and ACPI NUMA can be selected depends on device tree
feature and ACPI feature status automatically. In this case,
these two kinds of NUMA support code can be co-exist in one
Xen binary. Xen can check feature flags to decide using
device tree or ACPI as NUMA based firmware.

So in this patch, we introduce a generic option:
CONFIG_ARM_NUMA for user to enable NUMA for Arm.

   ^ users



OK


And one CONFIG_DEVICE_TREE_NUMA option for ARM_NUMA
to select when HAS_DEVICE_TREE option is enabled.
Once when ACPI NUMA for Arm is supported, ACPI_NUMA
can be selected here too.

Signed-off-by: Wei Chen 
---
  xen/arch/arm/Kconfig | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
index 865ad83a89..ded94ebd37 100644
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -34,6 +34,17 @@ config ACPI
  Advanced Configuration and Power Interface (ACPI) support for Xen is
  an alternative to device tree on ARM64.
  
+ config DEVICE_TREE_NUMA

+   def_bool n
+   select NUMA
+
+config ARM_NUMA
+   bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if 
UNSUPPORTED
+   select DEVICE_TREE_NUMA if HAS_DEVICE_TREE


Should it be: depends on HAS_DEVICE_TREE ?
(And eventually depends on HAS_DEVICE_TREE || ACPI)



As discussed in the RFC [1], we want to make ARM_NUMA a generic
option that users can select, and have it depend on HAS_DEVICE_TREE
or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.

If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
would that become a circular dependency?

https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00888.html



+   ---help---
+
+ Enable Non-Uniform Memory Access (NUMA) for Arm architecutres

   ^ architectures



+
  config GICV3
bool "GICv3 driver"
depends on ARM_64 && !NEW_VGIC
--
2.25.1





Re: [PATCH v3] tools/libxl: Correctly align the ACPI tables

2021-09-24 Thread Ian Jackson
Roger Pau Monné writes ("Re: [PATCH v3] tools/libxl: Correctly align the ACPI tables"):
> On Wed, Sep 15, 2021 at 03:30:00PM +0100, Kevin Stefanov wrote:
> > Fixes: 14c0d328da2b ("libxl/acpi: Build ACPI tables for HVMlite guests")
> > Signed-off-by: Kevin Stefanov 
> > Reviewed-by: Jan Beulich 
> 
> Reviewed-by: Roger Pau Monné 

Thanks to both of you.

Acked-by: Ian Jackson 

and pushed.

Ian.



Re: [PATCH v2] pci: fix handling of PCI bridges with subordinate bus number 0xff

2021-09-24 Thread Jan Beulich
On 24.09.2021 11:10, Igor Druzhinin wrote:
> Bus number 0xff is valid according to the PCI spec. Using u8 typed sub_bus
> and assigning 0xff to it will result in the following loop getting stuck.
> 
> for ( ; sec_bus <= sub_bus; sec_bus++ ) {...}
> 
> Just change its type to unsigned int similarly to what is already done in
> dmar_scope_add_buses().
> 
> Signed-off-by: Igor Druzhinin 

Reviewed-by: Jan Beulich 
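
For reference, a minimal standalone sketch (not the Xen code itself, with a
hypothetical function name) of why an 8-bit counter can never get past a
subordinate bus number of 0xff:

    #include <stdint.h>

    /*
     * With sub_bus == 0xff and 8-bit types, sec_bus wraps from 0xff back to 0
     * on increment, so "sec_bus <= sub_bus" never becomes false and the scan
     * never terminates.  Widening the types to unsigned int lets sec_bus
     * reach 0x100 and the loop exit normally.
     */
    static void scan_buses(uint8_t sec_bus, uint8_t sub_bus)
    {
        for ( ; sec_bus <= sub_bus; sec_bus++ )
        {
            /* ... probe devices behind bus sec_bus ... */
        }
    }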




[PATCH v2 18/18] VT-d: free all-empty page tables

2021-09-24 Thread Jan Beulich
When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that while its output isn't used there yet, update_contig_markers()
right away needs to be called in all places where entries get updated,
not just the one where entries get cleared.

Signed-off-by: Jan Beulich 
---
v2: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,6 +42,9 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define CONTIG_MASK DMA_PTE_CONTIG_MASK
+#include 
+
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.vtd.pgd_maddr)
 
@@ -368,6 +371,9 @@ static uint64_t addr_to_dma_page_maddr(s
 
 write_atomic(&pte->val, new_pte.val);
 iommu_sync_cache(pte, sizeof(struct dma_pte));
+update_contig_markers(&parent->val,
+  address_level_offset(addr, level),
+  level, PTE_kind_table);
 }
 
 if ( --level == target )
@@ -773,7 +779,7 @@ static int dma_pte_clear_one(struct doma
 struct domain_iommu *hd = dom_iommu(domain);
 struct dma_pte *page = NULL, *pte = NULL, old;
 u64 pg_maddr;
-unsigned int level = (order / LEVEL_STRIDE) + 1;
+unsigned int level = (order / LEVEL_STRIDE) + 1, pt_lvl = level;
 
 spin_lock(&hd->arch.mapping_lock);
 /* get target level pte */
@@ -796,9 +802,31 @@ static int dma_pte_clear_one(struct doma
 
 old = *pte;
 dma_clear_pte(*pte);
+iommu_sync_cache(pte, sizeof(*pte));
+
+while ( update_contig_markers(&page->val,
+  address_level_offset(addr, pt_lvl),
+  pt_lvl, PTE_kind_null) &&
+++pt_lvl < agaw_to_level(hd->arch.vtd.agaw) )
+{
+struct page_info *pg = maddr_to_page(pg_maddr);
+
+unmap_vtd_domain_page(page);
+
+pg_maddr = addr_to_dma_page_maddr(domain, addr, pt_lvl, flush_flags,
+  false);
+BUG_ON(pg_maddr < PAGE_SIZE);
+
+page = map_vtd_domain_page(pg_maddr);
+pte = &page[address_level_offset(addr, pt_lvl)];
+dma_clear_pte(*pte);
+iommu_sync_cache(pte, sizeof(*pte));
+
+*flush_flags |= IOMMU_FLUSHF_all;
+iommu_queue_free_pgtable(domain, pg);
+}
 
 spin_unlock(&hd->arch.mapping_lock);
-iommu_sync_cache(pte, sizeof(struct dma_pte));
 
 unmap_vtd_domain_page(page);
 
@@ -1952,8 +1980,11 @@ static int __must_check intel_iommu_map_
 }
 
 *pte = new;
-
 iommu_sync_cache(pte, sizeof(struct dma_pte));
+update_contig_markers(&page->val,
+  address_level_offset(dfn_to_daddr(dfn), level),
+  level, PTE_kind_leaf);
+
 spin_unlock(&hd->arch.mapping_lock);
 unmap_vtd_domain_page(page);
 




[PATCH v2 17/18] AMD/IOMMU: free all-empty page tables

2021-09-24 Thread Jan Beulich
When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that while its output isn't used there yet, update_contig_markers()
right away needs to be called in all places where entries get updated,
not just the one where entries get cleared.

Signed-off-by: Jan Beulich 
---
v2: New.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -21,6 +21,9 @@
 
 #include "iommu.h"
 
+#define CONTIG_MASK IOMMU_PTE_CONTIG_MASK
+#include 
+
 /* Given pfn and page table level, return pde index */
 static unsigned int pfn_to_pde_idx(unsigned long pfn, unsigned int level)
 {
@@ -33,16 +36,20 @@ static unsigned int pfn_to_pde_idx(unsig
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
unsigned long dfn,
-   unsigned int level)
+   unsigned int level,
+   bool *free)
 {
 union amd_iommu_pte *table, *pte, old;
+unsigned int idx = pfn_to_pde_idx(dfn, level);
 
 table = map_domain_page(_mfn(l1_mfn));
-pte = &table[pfn_to_pde_idx(dfn, level)];
+pte = &table[idx];
 old = *pte;
 
 write_atomic(&pte->raw, 0);
 
+*free = update_contig_markers(&table->raw, idx, level, PTE_kind_null);
+
 unmap_domain_page(table);
 
 return old;
@@ -85,7 +92,11 @@ static union amd_iommu_pte set_iommu_pte
 if ( !old.pr || old.next_level ||
  old.mfn != next_mfn ||
  old.iw != iw || old.ir != ir )
+{
 set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level), level,
+  PTE_kind_leaf);
+}
 else
 old.pr = false; /* signal "no change" to the caller */
 
@@ -259,6 +270,9 @@ static int iommu_pde_from_dfn(struct dom
 smp_wmb();
 set_iommu_pde_present(pde, next_table_mfn, next_level, true,
   true);
+update_contig_markers(&next_table_vaddr->raw,
+  pfn_to_pde_idx(dfn, level),
+  level, PTE_kind_table);
 
 *flush_flags |= IOMMU_FLUSHF_modified;
 }
@@ -284,6 +298,9 @@ static int iommu_pde_from_dfn(struct dom
 next_table_mfn = mfn_x(page_to_mfn(table));
 set_iommu_pde_present(pde, next_table_mfn, next_level, true,
   true);
+update_contig_markers(&next_table_vaddr->raw,
+  pfn_to_pde_idx(dfn, level),
+  level, PTE_kind_table);
 }
 else /* should never reach here */
 {
@@ -410,8 +427,25 @@ int amd_iommu_unmap_page(struct domain *
 
 if ( pt_mfn )
 {
+bool free;
+unsigned int pt_lvl = level;
+
 /* Mark PTE as 'page not present'. */
-old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
+old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
+
+while ( unlikely(free) && ++pt_lvl < hd->arch.amd.paging_mode )
+{
+struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
+
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_lvl, &pt_mfn,
+flush_flags, false) )
+BUG();
+BUG_ON(!pt_mfn);
+
+clear_iommu_pte_present(pt_mfn, dfn_x(dfn), pt_lvl, &free);
+*flush_flags |= IOMMU_FLUSHF_all;
+iommu_queue_free_pgtable(d, pg);
+}
 }
 
 spin_unlock(&hd->arch.mapping_lock);




[seabios test] 165173: tolerable FAIL - PUSHED

2021-09-24 Thread osstest service owner
flight 165173 seabios real [real]
http://logs.test-lab.xenproject.org/osstest/logs/165173/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 163203
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 163203
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 163203
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 163203
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 163203
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass

version targeted for testing:
 seabios  64f37cc530f144e53c190c9e8209a51b58fd5c43
baseline version:
 seabios  54082c81d96028ba8c76fbe6784085cf1df76b20

Last test of basis   163203  2021-06-30 21:10:04 Z   85 days
Testing same since   165173  2021-09-24 03:09:48 Z0 days1 attempts


People who touched revisions under test:
  Stefan Berger 
  Stefan Berger 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm   pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsmpass
 test-amd64-amd64-xl-qemuu-debianhvm-i386-xsm pass
 test-amd64-i386-xl-qemuu-debianhvm-i386-xsm  pass
 test-amd64-amd64-qemuu-nested-amdfail
 test-amd64-i386-qemuu-rhel6hvm-amd   pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64 pass
 test-amd64-amd64-qemuu-freebsd11-amd64   pass
 test-amd64-amd64-qemuu-freebsd12-amd64   pass
 test-amd64-amd64-xl-qemuu-win7-amd64 fail
 test-amd64-i386-xl-qemuu-win7-amd64  fail
 test-amd64-amd64-xl-qemuu-ws16-amd64 fail
 test-amd64-i386-xl-qemuu-ws16-amd64  fail
 test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrictpass
 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict pass
 test-amd64-amd64-qemuu-nested-intel  pass
 test-amd64-i386-qemuu-rhel6hvm-intel pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-shadow pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64-shadow  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/seabios.git
   54082c8..64f37cc  64f37cc530f144e53c190c9e8209a51b58fd5c43 -> 
xen-tested-master



[PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in page tables

2021-09-24 Thread Jan Beulich
This is a re-usable helper (kind of a template) which gets introduced
without users so that the individual subsequent patches introducing such
users can get committed independently of one another.

See the comment at the top of the new file. To demonstrate the effect,
if a page table had just 16 entries, this would be the set of markers
for a page table with fully contiguous mappings:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.

Signed-off-by: Jan Beulich 
---
v2: New.

--- /dev/null
+++ b/xen/include/asm-x86/contig-marker.h
@@ -0,0 +1,105 @@
+#ifndef __ASM_X86_CONTIG_MARKER_H
+#define __ASM_X86_CONTIG_MARKER_H
+
+/*
+ * Short of having function templates in C, the function defined below is
+ * intended to be used by multiple parties interested in recording the
+ * degree of contiguity in mappings by a single page table.
+ *
+ * Scheme: Every entry records the order of contiguous successive entries,
+ * up to the maximum order covered by that entry (which is the number of
+ * clear low bits in its index, with entry 0 being the exception using
+ * the base-2 logarithm of the number of entries in a single page table).
+ * While a few entries need touching upon update, knowing whether the
+ * table is fully contiguous (and can hence be replaced by a higher level
+ * leaf entry) is then possible by simply looking at entry 0's marker.
+ *
+ * Prereqs:
+ * - CONTIG_MASK needs to be #define-d, to a value having at least 4
+ *   contiguous bits (ignored by hardware), before including this file,
+ * - page tables to be passed here need to be initialized with correct
+ *   markers.
+ */
+
+#include 
+#include 
+#include 
+
+/* This is the same for all anticipated users, so doesn't need passing in. */
+#define CONTIG_LEVEL_SHIFT 9
+#define CONTIG_NR  (1 << CONTIG_LEVEL_SHIFT)
+
+#define GET_MARKER(e) MASK_EXTR(e, CONTIG_MASK)
+#define SET_MARKER(e, m) \
+((void)(e = ((e) & ~CONTIG_MASK) | MASK_INSR(m, CONTIG_MASK)))
+
+enum PTE_kind {
+PTE_kind_null,
+PTE_kind_leaf,
+PTE_kind_table,
+};
+
+static bool update_contig_markers(uint64_t *pt, unsigned int idx,
+  unsigned int level, enum PTE_kind kind)
+{
+unsigned int b, i = idx;
+unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
+
+ASSERT(idx < CONTIG_NR);
+ASSERT(!(pt[idx] & CONTIG_MASK));
+
+/* Step 1: Reduce markers in lower numbered entries. */
+while ( i )
+{
+b = find_first_set_bit(i);
+i &= ~(1U << b);
+if ( GET_MARKER(pt[i]) > b )
+SET_MARKER(pt[i], b);
+}
+
+/* An intermediate table is never contiguous with anything. */
+if ( kind == PTE_kind_table )
+return false;
+
+/*
+ * Present entries need in sync index and address to be a candidate
+ * for being contiguous: What we're after is whether ultimately the
+ * intermediate table can be replaced by a superpage.
+ */
+if ( kind != PTE_kind_null &&
+ idx != ((pt[idx] >> shift) & (CONTIG_NR - 1)) )
+return false;
+
+/* Step 2: Check higher numbered entries for contiguity. */
+for ( b = 0; b < CONTIG_LEVEL_SHIFT && !(idx & (1U << b)); ++b )
+{
+i = idx | (1U << b);
+if ( (kind == PTE_kind_leaf
+  ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
+  : pt[i] & ~CONTIG_MASK) ||
+ GET_MARKER(pt[i]) != b )
+break;
+}
+
+/* Step 3: Update markers in this and lower numbered entries. */
+for ( ; SET_MARKER(pt[idx], b), b < CONTIG_LEVEL_SHIFT; ++b )
+{
+i = idx ^ (1U << b);
+if ( (kind == PTE_kind_leaf
+  ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
+  : pt[i] & ~CONTIG_MASK) ||
+ GET_MARKER(pt[i]) != b )
+break;
+idx &= ~(1U << b);
+}
+
+return b == CONTIG_LEVEL_SHIFT;
+}
+
+#undef SET_MARKER
+#undef GET_MARKER
+#undef CONTIG_NR
+#undef CONTIG_LEVEL_SHIFT
+#undef CONTIG_MASK
+
+#endif /* __ASM_X86_CONTIG_MARKER_H */
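
As a standalone illustration of the scheme (not part of the patch): the initial
marker of a slot in a fully contiguous table is simply the number of clear low
bits in its index, with slot 0 holding the base-2 log of the table size. The
sketch below reproduces the 16-entry example from the description.

    #include <stdio.h>

    /* Illustration only: initial marker for slot idx of a 2^log2_entries table. */
    static unsigned int initial_marker(unsigned int idx, unsigned int log2_entries)
    {
        return idx ? __builtin_ctz(idx) : log2_entries;
    }

    int main(void)
    {
        for ( unsigned int i = 0; i < 16; ++i )
            printf("%u ", initial_marker(i, 4));
        printf("\n");    /* prints: 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0 */
        return 0;
    }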




[PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables

2021-09-24 Thread Jan Beulich
Page table are used for two purposes after allocation: They either start
out all empty, or they get filled to replace a superpage. Subsequently,
to replace all empty or fully contiguous page tables, contiguous sub-
regions will be recorded within individual page tables. Install the
initial set of markers immediately after allocation. Make sure to retain
these markers when further populating a page table in preparation for it
to replace a superpage.

The markers are simply 4-bit fields holding the order value of
contiguous entries. To demonstrate this, if a page table had just 16
entries, this would be the initial (fully contiguous) set of markers:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.

Signed-off-by: Jan Beulich 
---
An alternative to the ASSERT()s added to set_iommu_ptes_present() would
be to make the function less general-purpose; it's used in a single
place only after all (i.e. it might as well be folded into its only
caller).
---
v2: New.

--- a/xen/drivers/passthrough/amd/iommu-defs.h
+++ b/xen/drivers/passthrough/amd/iommu-defs.h
@@ -445,6 +445,8 @@ union amd_iommu_x2apic_control {
 #define IOMMU_PAGE_TABLE_U32_PER_ENTRY (IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
 #define IOMMU_PAGE_TABLE_ALIGNMENT 4096
 
+#define IOMMU_PTE_CONTIG_MASK   0x1e /* The ign0 field below. */
+
 union amd_iommu_pte {
 uint64_t raw;
 struct {
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -116,7 +116,19 @@ static void set_iommu_ptes_present(unsig
 
 while ( nr_ptes-- )
 {
-set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+ASSERT(!pde->next_level);
+ASSERT(!pde->u);
+
+if ( pde > table )
+ASSERT(pde->ign0 == find_first_set_bit(pde - table));
+else
+ASSERT(pde->ign0 == PAGE_SHIFT - 3);
+
+pde->iw = iw;
+pde->ir = ir;
+pde->fc = true; /* See set_iommu_pde_present(). */
+pde->mfn = next_mfn;
+pde->pr = true;
 
 ++pde;
 next_mfn += page_sz;
@@ -232,7 +244,7 @@ static int iommu_pde_from_dfn(struct dom
 mfn = next_table_mfn;
 
 /* allocate lower level page table */
-table = iommu_alloc_pgtable(d);
+table = iommu_alloc_pgtable(d, IOMMU_PTE_CONTIG_MASK);
 if ( table == NULL )
 {
 AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -262,7 +274,7 @@ static int iommu_pde_from_dfn(struct dom
 
 if ( next_table_mfn == 0 )
 {
-table = iommu_alloc_pgtable(d);
+table = iommu_alloc_pgtable(d, IOMMU_PTE_CONTIG_MASK);
 if ( table == NULL )
 {
 AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -648,7 +660,7 @@ int __init amd_iommu_quarantine_init(str
 
 spin_lock(&hd->arch.mapping_lock);
 
-hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
 if ( !hd->arch.amd.root_table )
 goto out;
 
@@ -663,7 +675,7 @@ int __init amd_iommu_quarantine_init(str
  * page table pages, and the resulting allocations are always
  * zeroed.
  */
-pg = iommu_alloc_pgtable(d);
+pg = iommu_alloc_pgtable(d, 0);
 if ( !pg )
 break;
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -238,7 +238,7 @@ int amd_iommu_alloc_root(struct domain *
 
 if ( unlikely(!hd->arch.amd.root_table) )
 {
-hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
 if ( !hd->arch.amd.root_table )
 return -ENOMEM;
 }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -297,7 +297,7 @@ static uint64_t addr_to_dma_page_maddr(s
 goto out;
 
 pte_maddr = level;
-if ( !(pg = iommu_alloc_pgtable(domain)) )
+if ( !(pg = iommu_alloc_pgtable(domain, 0)) )
 goto out;
 
 hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
@@ -339,7 +339,7 @@ static uint64_t addr_to_dma_page_maddr(s
 }
 
 pte_maddr = level - 1;
-pg = iommu_alloc_pgtable(domain);
+pg = iommu_alloc_pgtable(domain, DMA_PTE_CONTIG_MASK);
 if ( !pg )
 break;
 
@@ -351,12 +351,13 @@ static uint64_t addr_to_dma_page_maddr(s
 struct dma_pte *split = map_vtd_domain_page(pte_maddr);
 unsigned long inc = 1UL << level_to_offset_bits(level - 1);
 
-split[0].val = pte->val;
+split[0].val |= pte->val & ~DMA_PTE_CONTIG_MASK;
 if ( inc 

[PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"

2021-09-24 Thread Jan Beulich
Having a separate flush-all hook has always been puzzling me some. We
will want to be able to force a full flush via accumulated flush flags
from the map/unmap functions. Introduce a respective new flag and fold
all flush handling to use the single remaining hook.

Note that because of the respective comments in SMMU and IPMMU-VMSA
code, I've folded the two prior hook functions into one. For SMMU-v3,
which lacks a comment towards incapable hardware, I've left both
functions in place on the assumption that selective and full flushes
will eventually want separating.

Signed-off-by: Jan Beulich 
---
TBD: What we really are going to need is for the map/unmap functions to
 specify that a wider region needs flushing than just the one
 covered by the present set of (un)maps. This may still be less than
 a full flush, but at least as a first step it seemed better to me
 to keep things simple and go the flush-all route.
---
v2: New.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -242,7 +242,6 @@ int amd_iommu_get_reserved_device_memory
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
  unsigned long page_count,
  unsigned int flush_flags);
-int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
 void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int 
dev_id,
  dfn_t dfn);
 
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -475,15 +475,18 @@ int amd_iommu_flush_iotlb_pages(struct d
 {
 unsigned long dfn_l = dfn_x(dfn);
 
-ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
-ASSERT(flush_flags);
+if ( !(flush_flags & IOMMU_FLUSHF_all) )
+{
+ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
+ASSERT(flush_flags);
+}
 
 /* Unless a PTE was modified, no flush is required */
 if ( !(flush_flags & IOMMU_FLUSHF_modified) )
 return 0;
 
-/* If the range wraps then just flush everything */
-if ( dfn_l + page_count < dfn_l )
+/* If so requested or if the range wraps then just flush everything. */
+if ( (flush_flags & IOMMU_FLUSHF_all) || dfn_l + page_count < dfn_l )
 {
 amd_iommu_flush_all_pages(d);
 return 0;
@@ -508,13 +511,6 @@ int amd_iommu_flush_iotlb_pages(struct d
 
 return 0;
 }
-
-int amd_iommu_flush_iotlb_all(struct domain *d)
-{
-amd_iommu_flush_all_pages(d);
-
-return 0;
-}
 
 int amd_iommu_reserve_domain_unity_map(struct domain *d,
const struct ivrs_unity_map *map,
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -642,7 +642,6 @@ static const struct iommu_ops __initcons
 .map_page = amd_iommu_map_page,
 .unmap_page = amd_iommu_unmap_page,
 .iotlb_flush = amd_iommu_flush_iotlb_pages,
-.iotlb_flush_all = amd_iommu_flush_iotlb_all,
 .reassign_device = reassign_device,
 .get_device_group_id = amd_iommu_group_id,
 .enable_x2apic = iov_enable_xt,
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -930,13 +930,19 @@ out:
 }
 
 /* Xen IOMMU ops */
-static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
+static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
+  unsigned long page_count,
+  unsigned int flush_flags)
 {
 struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
 
+ASSERT(flush_flags);
+
 if ( !xen_domain || !xen_domain->root_domain )
 return 0;
 
+/* The hardware doesn't support selective TLB flush. */
+
 spin_lock(&xen_domain->lock);
 ipmmu_tlb_invalidate(xen_domain->root_domain);
 spin_unlock(&xen_domain->lock);
@@ -944,16 +950,6 @@ static int __must_check ipmmu_iotlb_flus
 return 0;
 }
 
-static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
-  unsigned long page_count,
-  unsigned int flush_flags)
-{
-ASSERT(flush_flags);
-
-/* The hardware doesn't support selective TLB flush. */
-return ipmmu_iotlb_flush_all(d);
-}
-
 static struct ipmmu_vmsa_domain *ipmmu_get_cache_domain(struct domain *d,
 struct device *dev)
 {
@@ -1303,7 +1299,6 @@ static const struct iommu_ops ipmmu_iomm
 .hwdom_init  = ipmmu_iommu_hwdom_init,
 .teardown= ipmmu_iommu_domain_teardown,
 .iotlb_flush = ipmmu_iotlb_flush,
-.iotlb_flush_all = ipmmu_iotlb_flush_all,
 .assign_device   = ipmmu_assign_device,
 .reassign_device = ipmmu_reassign_device,
 .map_page= arm_iommu_map_page,
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthr

[PATCH v2 13/18] VT-d: allow use of superpage mappings

2021-09-24 Thread Jan Beulich
... depending on feature availability (and absence of quirks).

Also make the page table dumping function aware of superpages.

Signed-off-by: Jan Beulich 

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -743,18 +743,37 @@ static int __must_check iommu_flush_iotl
 return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
+{
+if ( next_level > 1 )
+{
+struct dma_pte *pt = map_domain_page(mfn);
+unsigned int i;
+
+for ( i = 0; i < PTE_NUM; ++i )
+if ( dma_pte_present(pt[i]) && !dma_pte_superpage(pt[i]) )
+queue_free_pt(d, maddr_to_mfn(dma_pte_addr(pt[i])),
+  next_level - 1);
+
+unmap_domain_page(pt);
+}
+
+iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 /* clear one page's page table */
 static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
  unsigned int order,
  unsigned int *flush_flags)
 {
 struct domain_iommu *hd = dom_iommu(domain);
-struct dma_pte *page = NULL, *pte = NULL;
+struct dma_pte *page = NULL, *pte = NULL, old;
 u64 pg_maddr;
+unsigned int level = (order / LEVEL_STRIDE) + 1;
 
 spin_lock(&hd->arch.mapping_lock);
-/* get last level pte */
-pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
+/* get target level pte */
+pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags, false);
 if ( pg_maddr < PAGE_SIZE )
 {
 spin_unlock(&hd->arch.mapping_lock);
@@ -762,7 +781,7 @@ static int dma_pte_clear_one(struct doma
 }
 
 page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-pte = page + address_level_offset(addr, 1);
+pte = &page[address_level_offset(addr, level)];
 
 if ( !dma_pte_present(*pte) )
 {
@@ -771,14 +790,19 @@ static int dma_pte_clear_one(struct doma
 return 0;
 }
 
+old = *pte;
 dma_clear_pte(*pte);
-*flush_flags |= IOMMU_FLUSHF_modified;
 
 spin_unlock(&hd->arch.mapping_lock);
 iommu_sync_cache(pte, sizeof(struct dma_pte));
 
 unmap_vtd_domain_page(page);
 
+*flush_flags |= IOMMU_FLUSHF_modified;
+
+if ( level > 1 && !dma_pte_superpage(old) )
+queue_free_pt(domain, maddr_to_mfn(dma_pte_addr(old)), level - 1);
+
 return 0;
 }
 
@@ -1868,6 +1892,7 @@ static int __must_check intel_iommu_map_
 struct domain_iommu *hd = dom_iommu(d);
 struct dma_pte *page, *pte, old, new = {};
 u64 pg_maddr;
+unsigned int level = (IOMMUF_order(flags) / LEVEL_STRIDE) + 1;
 int rc = 0;
 
 /* Do nothing if VT-d shares EPT page table */
@@ -1892,7 +1917,7 @@ static int __must_check intel_iommu_map_
 return 0;
 }
 
-pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags,
+pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), level, flush_flags,
   true);
 if ( pg_maddr < PAGE_SIZE )
 {
@@ -1901,13 +1926,15 @@ static int __must_check intel_iommu_map_
 }
 
 page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-pte = &page[dfn_x(dfn) & LEVEL_MASK];
+pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
 old = *pte;
 
 dma_set_pte_addr(new, mfn_to_maddr(mfn));
 dma_set_pte_prot(new,
  ((flags & IOMMUF_readable) ? DMA_PTE_READ  : 0) |
  ((flags & IOMMUF_writable) ? DMA_PTE_WRITE : 0));
+if ( IOMMUF_order(flags) )
+dma_set_pte_superpage(new);
 
 /* Set the SNP on leaf page table if Snoop Control available */
 if ( iommu_snoop )
@@ -1928,8 +1955,13 @@ static int __must_check intel_iommu_map_
 
 *flush_flags |= IOMMU_FLUSHF_added;
 if ( dma_pte_present(old) )
+{
 *flush_flags |= IOMMU_FLUSHF_modified;
 
+if ( level > 1 && !dma_pte_superpage(old) )
+queue_free_pt(d, maddr_to_mfn(dma_pte_addr(old)), level - 1);
+}
+
 return rc;
 }
 
@@ -2286,6 +2318,7 @@ static int __init vtd_setup(void)
 {
 struct acpi_drhd_unit *drhd;
 struct vtd_iommu *iommu;
+unsigned int large_sizes = PAGE_SIZE_2M | PAGE_SIZE_1G;
 int ret;
 bool reg_inval_supported = true;
 
@@ -2328,6 +2361,11 @@ static int __init vtd_setup(void)
cap_sps_2mb(iommu->cap) ? ", 2MB" : "",
cap_sps_1gb(iommu->cap) ? ", 1GB" : "");
 
+if ( !cap_sps_2mb(iommu->cap) )
+large_sizes &= ~PAGE_SIZE_2M;
+if ( !cap_sps_1gb(iommu->cap) )
+large_sizes &= ~PAGE_SIZE_1G;
+
 #ifndef iommu_snoop
 if ( iommu_snoop && !ecap_snp_ctl(iommu->ecap) )
 iommu_snoop = false;
@@ -2399,6 +2437,9 @@ static int __init vtd_setup(void)
 if ( ret )
 goto error;
 
+ASSERT(iommu_ops.page_sizes & PAGE_SIZE_4K);
+iommu_ops.page_sizes |= large_sizes;
+
 register_k

RE: [PATCH 34/37] xen/arm: enable device tree based NUMA in system init

2021-09-24 Thread Wei Chen
Hi Stefano,

> -Original Message-
> From: Stefano Stabellini 
> Sent: 2021年9月24日 11:28
> To: Wei Chen 
> Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> Bertrand Marquis 
> Subject: Re: [PATCH 34/37] xen/arm: enable device tree based NUMA in
> system init
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > In this patch, we can start to create NUMA system that is
> > based on device tree.
> >
> > Signed-off-by: Wei Chen 
> > ---
> >  xen/arch/arm/numa.c| 55 ++
> >  xen/arch/arm/setup.c   |  7 +
> >  xen/include/asm-arm/numa.h |  6 +
> >  3 files changed, 68 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > index 7f05299b76..d7a3d32d4b 100644
> > --- a/xen/arch/arm/numa.c
> > +++ b/xen/arch/arm/numa.c
> > @@ -18,8 +18,10 @@
> >   *
> >   */
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  static uint8_t __read_mostly
> >  node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> > @@ -85,6 +87,59 @@ uint8_t __node_distance(nodeid_t from, nodeid_t to)
> >  }
> >  EXPORT_SYMBOL(__node_distance);
> >
> > +void __init numa_init(bool acpi_off)
> > +{
> > +uint32_t idx;
> > +paddr_t ram_start = ~0;
> 
> INVALID_PADDR
> 

Oh, yes

> 
> > +paddr_t ram_size = 0;
> > +paddr_t ram_end = 0;
> > +
> > +/* NUMA has been turned off through Xen parameters */
> > +if ( numa_off )
> > +goto mem_init;
> > +
> > +/* Initialize NUMA from device tree when system is not ACPI booted
> */
> > +if ( acpi_off )
> > +{
> > +int ret = numa_device_tree_init(device_tree_flattened);
> > +if ( ret )
> > +{
> > +printk(XENLOG_WARNING
> > +   "Init NUMA from device tree failed, ret=%d\n", ret);
> 
> As I mentioned in other patches we need to distinguish between two
> cases:
> 
> 1) NUMA initialization failed because no NUMA information has been found
> 2) NUMA initialization failed because wrong/inconsistent NUMA info has
>been found
> 
> In case of 1), we print nothing. Maybe a single XENLOG_DEBUG message.
> In case of 2), all the warnings are good to print.
> 
> 
> In this case, if ret != 0 because of 2), then it is fine to print this
> warning. But it looks like could be that ret is -EINVAL simply because a
> CPU node doesn't have numa-node-id, which is a normal condition for
> non-NUMA machines.
> 

Yes, we have to distinguish these two cases. I will try to address
this in the next version.

> 
> > +numa_off = true;
> > +}
> > +}
> > +else
> > +{
> > +/* We don't support NUMA for ACPI boot currently */
> > +printk(XENLOG_WARNING
> > +   "ACPI NUMA has not been supported yet, NUMA off!\n");
> > +numa_off = true;
> > +}
> > +
> > +mem_init:
> > +/*
> > + * Find the minimal and maximum address of RAM, NUMA will
> > + * build a memory to node mapping table for the whole range.
> > + */
> > +ram_start = bootinfo.mem.bank[0].start;
> > +ram_size  = bootinfo.mem.bank[0].size;
> > +ram_end   = ram_start + ram_size;
> > +for ( idx = 1 ; idx < bootinfo.mem.nr_banks; idx++ )
> > +{
> > +paddr_t bank_start = bootinfo.mem.bank[idx].start;
> > +paddr_t bank_size = bootinfo.mem.bank[idx].size;
> > +paddr_t bank_end = bank_start + bank_size;
> > +
> > +ram_size  = ram_size + bank_size;
> > +ram_start = min(ram_start, bank_start);
> > +ram_end   = max(ram_end, bank_end);
> > +}
> > +
> > +numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end));
> > +return;
> 
> No need for return
> 

Ok, I will remove it.

> 
> > +}
> > +
> >  uint32_t __init arch_meminfo_get_nr_bank(void)
> >  {
> > return bootinfo.mem.nr_banks;
> > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> > index 1f0fbc95b5..6097850682 100644
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -905,6 +905,13 @@ void __init start_xen(unsigned long
> boot_phys_offset,
> >  /* Parse the ACPI tables for possible boot-time configuration */
> >  acpi_boot_table_init();
> >
> > +/*
> > + * Try to initialize NUMA system, if failed, the system will
> > + * fallback to uniform system which means system has only 1
> > + * NUMA node.
> > + */
> > +numa_init(acpi_disabled);
> > +
> >  end_boot_allocator();
> >
> >  /*
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index f46e8e2935..5b03dde87f 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -24,6 +24,7 @@ typedef u8 nodeid_t;
> >
> >  extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t
> distance);
> >  extern int numa_device_tree_init(const void *fdt);
> > +extern void numa_init(bool acpi_off);
> >
> >  #else
> >
> > @@ -47,6 +48,11 @@ extern mfn_t first_valid_mfn;
> >  #define node_start_pfn(n

[PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings

2021-09-24 Thread Jan Beulich
No separate feature flags exist which would control availability of
these; the only restriction is HATS (establishing the maximum number of
page table levels in general), and even that has a lower bound of 4.
Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
non-default page sizes the implementation in principle permits arbitrary
size mappings, but these require multiple identical leaf PTEs to be
written, which isn't all that different from having to write multiple
consecutive PTEs with increasing frame numbers. IMO that's therefore
beneficial only on hardware where suitable TLBs exist; I'm unaware of
such hardware.)

Signed-off-by: Jan Beulich 
---
I'm not fully sure about allowing 512G mappings: The scheduling-for-
freeing of intermediate page tables can take quite a while when
replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
there's no present code path via which 512G chunks of memory could be
allocated (and hence mapped) anyway.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
 }
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
-   unsigned long dfn)
+   unsigned long dfn,
+   unsigned int level)
 {
 union amd_iommu_pte *table, *pte, old;
 
 table = map_domain_page(_mfn(l1_mfn));
-pte = &table[pfn_to_pde_idx(dfn, 1)];
+pte = &table[pfn_to_pde_idx(dfn, level)];
 old = *pte;
 
 write_atomic(&pte->raw, 0);
@@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
 return 0;
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
+{
+if ( next_level > 1 )
+{
+union amd_iommu_pte *pt = map_domain_page(mfn);
+unsigned int i;
+
+for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
+if ( pt[i].pr && pt[i].next_level )
+{
+ASSERT(pt[i].next_level < next_level);
+queue_free_pt(d, _mfn(pt[i].mfn), pt[i].next_level);
+}
+
+unmap_domain_page(pt);
+}
+
+iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 int amd_iommu_map_page(struct domain *d, dfn_t dfn, mfn_t mfn,
unsigned int flags, unsigned int *flush_flags)
 {
 struct domain_iommu *hd = dom_iommu(d);
+unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
 int rc;
 unsigned long pt_mfn = 0;
 union amd_iommu_pte old;
@@ -320,7 +342,7 @@ int amd_iommu_map_page(struct domain *d,
 return rc;
 }
 
-if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) 
||
  !pt_mfn )
 {
 spin_unlock(&hd->arch.mapping_lock);
@@ -330,8 +352,8 @@ int amd_iommu_map_page(struct domain *d,
 return -EFAULT;
 }
 
-/* Install 4k mapping */
-old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+/* Install mapping */
+old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
 (flags & IOMMUF_writable),
 (flags & IOMMUF_readable));
 
@@ -339,8 +361,13 @@ int amd_iommu_map_page(struct domain *d,
 
 *flush_flags |= IOMMU_FLUSHF_added;
 if ( old.pr )
+{
 *flush_flags |= IOMMU_FLUSHF_modified;
 
+if ( level > 1 && old.next_level )
+queue_free_pt(d, _mfn(old.mfn), old.next_level);
+}
+
 return 0;
 }
 
@@ -349,6 +376,7 @@ int amd_iommu_unmap_page(struct domain *
 {
 unsigned long pt_mfn = 0;
 struct domain_iommu *hd = dom_iommu(d);
+unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
 union amd_iommu_pte old = {};
 
 spin_lock(&hd->arch.mapping_lock);
@@ -359,7 +387,7 @@ int amd_iommu_unmap_page(struct domain *
 return 0;
 }
 
-if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) 
)
 {
 spin_unlock(&hd->arch.mapping_lock);
 AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -371,14 +399,19 @@ int amd_iommu_unmap_page(struct domain *
 if ( pt_mfn )
 {
 /* Mark PTE as 'page not present'. */
-old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
 }
 
 spin_unlock(&hd->arch.mapping_lock);
 
 if ( old.pr )
+{
 *flush_flags |= IOMMU_FLUSHF_modified;
 
+if ( level > 1 && old.next_level )
+queue_free_pt(d, _mfn(old.mfn), old.next_level);
+}
+
 return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -630,7 +630,7 @@ 

[PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()

2021-09-24 Thread Jan Beulich
In order to free intermediate page tables when replacing smaller
mappings by a single larger one callers will need to know the full PTE.
Flush indicators can be derived from this in the callers (and outside
the locked regions). First split set_iommu_pte_present() from
set_iommu_ptes_present(): Only the former needs to return the old PTE,
while the latter (like also set_iommu_pde_present()) doesn't even need
to return flush indicators. Then change return types/values and callers
accordingly.

Signed-off-by: Jan Beulich 
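
Purely for illustration (not part of the patch): a standalone sketch of the
resulting calling convention, using a toy PTE type in place of union
amd_iommu_pte and toy flush flag values, where the caller derives the flush
indicators from the returned old PTE outside any locked region.

#include <stdio.h>
#include <stdbool.h>

/* Toy flag values for this sketch only. */
#define IOMMU_FLUSHF_added    (1u << 0)
#define IOMMU_FLUSHF_modified (1u << 1)

/* Toy stand-in for union amd_iommu_pte; only the fields the callers
 * inspect are modelled here. */
struct toy_pte {
    bool pr;
    unsigned int next_level;
};

/* Pretend helper returning the previous PTE contents, as the real
 * set_iommu_pte_present() does after this change. */
static struct toy_pte toy_set_pte_present(void)
{
    struct toy_pte old = { .pr = true, .next_level = 0 };

    return old;
}

int main(void)
{
    struct toy_pte old = toy_set_pte_present();
    unsigned int flush_flags = IOMMU_FLUSHF_added;

    /* The caller, outside the mapping lock, derives flush indicators. */
    if ( old.pr )
        flush_flags |= IOMMU_FLUSHF_modified;

    printf("flush flags: %#x\n", flush_flags);
    return 0;
}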

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -31,30 +31,28 @@ static unsigned int pfn_to_pde_idx(unsig
 return idx;
 }
 
-static unsigned int clear_iommu_pte_present(unsigned long l1_mfn,
-unsigned long dfn)
+static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
+   unsigned long dfn)
 {
-union amd_iommu_pte *table, *pte;
-unsigned int flush_flags;
+union amd_iommu_pte *table, *pte, old;
 
 table = map_domain_page(_mfn(l1_mfn));
 pte = &table[pfn_to_pde_idx(dfn, 1)];
+old = *pte;
 
-flush_flags = pte->pr ? IOMMU_FLUSHF_modified : 0;
 write_atomic(&pte->raw, 0);
 
 unmap_domain_page(table);
 
-return flush_flags;
+return old;
 }
 
-static unsigned int set_iommu_pde_present(union amd_iommu_pte *pte,
-  unsigned long next_mfn,
-  unsigned int next_level, bool iw,
-  bool ir)
+static void set_iommu_pde_present(union amd_iommu_pte *pte,
+  unsigned long next_mfn,
+  unsigned int next_level,
+  bool iw, bool ir)
 {
-union amd_iommu_pte new = {}, old;
-unsigned int flush_flags = IOMMU_FLUSHF_added;
+union amd_iommu_pte new = {};
 
 /*
  * FC bit should be enabled in PTE, this helps to solve potential
@@ -68,28 +66,42 @@ static unsigned int set_iommu_pde_presen
 new.next_level = next_level;
 new.pr = true;
 
-old.raw = read_atomic(&pte->raw);
-old.ign0 = 0;
-old.ign1 = 0;
-old.ign2 = 0;
+write_atomic(&pte->raw, new.raw);
+}
 
-if ( old.pr && old.raw != new.raw )
-flush_flags |= IOMMU_FLUSHF_modified;
+static union amd_iommu_pte set_iommu_pte_present(unsigned long pt_mfn,
+ unsigned long dfn,
+ unsigned long next_mfn,
+ unsigned int level,
+ bool iw, bool ir)
+{
+union amd_iommu_pte *table, *pde, old;
 
-write_atomic(&pte->raw, new.raw);
+table = map_domain_page(_mfn(pt_mfn));
+pde = &table[pfn_to_pde_idx(dfn, level)];
+
+old = *pde;
+if ( !old.pr || old.next_level ||
+ old.mfn != next_mfn ||
+ old.iw != iw || old.ir != ir )
+set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+else
+old.pr = false; /* signal "no change" to the caller */
 
-return flush_flags;
+unmap_domain_page(table);
+
+return old;
 }
 
-static unsigned int set_iommu_ptes_present(unsigned long pt_mfn,
-   unsigned long dfn,
-   unsigned long next_mfn,
-   unsigned int nr_ptes,
-   unsigned int pde_level,
-   bool iw, bool ir)
+static void set_iommu_ptes_present(unsigned long pt_mfn,
+   unsigned long dfn,
+   unsigned long next_mfn,
+   unsigned int nr_ptes,
+   unsigned int pde_level,
+   bool iw, bool ir)
 {
 union amd_iommu_pte *table, *pde;
-unsigned int page_sz, flush_flags = 0;
+unsigned int page_sz;
 
 table = map_domain_page(_mfn(pt_mfn));
 pde = &table[pfn_to_pde_idx(dfn, pde_level)];
@@ -98,20 +110,18 @@ static unsigned int set_iommu_ptes_prese
 if ( (void *)(pde + nr_ptes) > (void *)table + PAGE_SIZE )
 {
 ASSERT_UNREACHABLE();
-return 0;
+return;
 }
 
 while ( nr_ptes-- )
 {
-flush_flags |= set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
 
 ++pde;
 next_mfn += page_sz;
 }
 
 unmap_domain_page(table);
-
-return flush_flags;
 }
 
 void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
@@ -284,6 +294,7 @@ int amd_iommu_map_page(struct domain *d,
 struct domain_iommu *hd = dom_iommu(d);
 int rc;
 unsigned long pt_mfn = 0;
+union amd_iommu_pte old;
 
 spin

[PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault

2021-09-24 Thread Jan Beulich
This is to aid diagnosing issues and largely matches VT-d's behavior.
Since I'm adding permissions output here as well, take the opportunity
and also add their displaying to amd_dump_page_table_level().

Signed-off-by: Jan Beulich 

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -243,6 +243,8 @@ int __must_check amd_iommu_flush_iotlb_p
  unsigned long page_count,
  unsigned int flush_flags);
 int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+ dfn_t dfn);
 
 /* device table functions */
 int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -573,6 +573,9 @@ static void parse_event_log_entry(struct
(flags & 0x002) ? " NX" : "",
(flags & 0x001) ? " GN" : "");
 
+if ( iommu_verbose )
+amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
+
 for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
 if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
 pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -363,6 +363,50 @@ int amd_iommu_unmap_page(struct domain *
 return 0;
 }
 
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+ dfn_t dfn)
+{
+mfn_t pt_mfn;
+unsigned int level;
+const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
+
+if ( !dt[dev_id].tv )
+{
+printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
+return;
+}
+
+pt_mfn = _mfn(dt[dev_id].pt_root);
+level = dt[dev_id].paging_mode;
+printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
+   &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
+
+while ( level )
+{
+const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
+unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
+union amd_iommu_pte pte = pt[idx];
+
+unmap_domain_page(pt);
+
+printk("  L%u[%03x] = %"PRIx64" %c%c\n", level, idx, pte.raw,
+   pte.pr ? pte.ir ? 'r' : '-' : 'n',
+   pte.pr ? pte.iw ? 'w' : '-' : 'p');
+
+if ( !pte.pr )
+break;
+
+if ( pte.next_level >= level )
+{
+printk("  L%u[%03x]: next: %u\n", level, idx, pte.next_level);
+break;
+}
+
+pt_mfn = _mfn(pte.mfn);
+level = pte.next_level;
+}
+}
+
 static unsigned long flush_count(unsigned long dfn, unsigned long page_count,
  unsigned int order)
 {
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -607,10 +607,11 @@ static void amd_dump_page_table_level(st
 mfn_to_page(_mfn(pde->mfn)), pde->next_level,
 address, indent + 1);
 else
-printk("%*sdfn: %08lx  mfn: %08lx\n",
+printk("%*sdfn: %08lx  mfn: %08lx  %c%c\n",
indent, "",
(unsigned long)PFN_DOWN(address),
-   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)));
+   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)),
+   pde->ir ? 'r' : '-', pde->iw ? 'w' : '-');
 }
 
 unmap_domain_page(table_vaddr);




Re: [PATCH v3] tools/libxl: Correctly align the ACPI tables

2021-09-24 Thread Roger Pau Monné
On Wed, Sep 15, 2021 at 03:30:00PM +0100, Kevin Stefanov wrote:
> The memory allocator currently calculates alignment in libxl's virtual
> address space, rather than guest physical address space. This results
> in the FACS being commonly misaligned.
> 
> Furthermore, the allocator has several other bugs.
> 
> The opencoded align-up calculation is currently susceptible to a bug
> that occurs in the corner case that the buffer is already aligned to
> begin with. In that case, an align-sized memory hole is introduced.
> 
> The while loop is dead logic because its effects are entirely and
> unconditionally overwritten immediately after it.
> 
> Rework the memory allocator to align in guest physical address space
> instead of libxl's virtual memory and improve the calculation, drop
> errant extra page in allocated buffer for ACPI tables, and give some
> of the variables better names/types.
> 
> Fixes: 14c0d328da2b ("libxl/acpi: Build ACPI tables for HVMlite guests")
> Signed-off-by: Kevin Stefanov 
> Reviewed-by: Jan Beulich 

Reviewed-by: Roger Pau Monné 

Thanks, Roger.
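
As a standalone illustration of the corner case described above (made-up
helper names, not the libxl code): an open-coded align-up that unconditionally
adds the alignment leaves an align-sized hole when the input is already
aligned, whereas the conventional form is a no-op in that case.

#include <stdio.h>
#include <inttypes.h>

/* Buggy pattern: always advances, so an already-aligned value gains an
 * align-sized hole. */
static uint64_t align_up_buggy(uint64_t v, uint64_t align)
{
    return (v + align) & ~(align - 1);
}

/* Conventional align-up: a no-op for already-aligned values. */
static uint64_t align_up(uint64_t v, uint64_t align)
{
    return (v + align - 1) & ~(align - 1);
}

int main(void)
{
    uint64_t v = 0x2000;    /* already 4k-aligned */

    printf("buggy: %#" PRIx64 "\n", align_up_buggy(v, 0x1000));  /* 0x3000 */
    printf("fixed: %#" PRIx64 "\n", align_up(v, 0x1000));        /* 0x2000 */
    return 0;
}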



[PATCH v2 09/18] AMD/IOMMU: drop stray TLB flush

2021-09-24 Thread Jan Beulich
I think this flush was overlooked when flushing was moved out of the
core (un)mapping functions. The flush the caller is required to invoke
anyway will satisfy the needs resulting from the splitting of a
superpage.

Signed-off-by: Jan Beulich 

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -179,7 +179,7 @@ void __init iommu_dte_add_device_entry(s
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
   unsigned int target, unsigned long *pt_mfn,
-  bool map)
+  unsigned int *flush_flags, bool map)
 {
 union amd_iommu_pte *pde, *next_table_vaddr;
 unsigned long  next_table_mfn;
@@ -237,7 +237,7 @@ static int iommu_pde_from_dfn(struct dom
 set_iommu_pde_present(pde, next_table_mfn, next_level, true,
   true);
 
-amd_iommu_flush_all_pages(d);
+*flush_flags |= IOMMU_FLUSHF_modified;
 }
 
 /* Install lower level page table for non-present entries */
@@ -309,7 +309,8 @@ int amd_iommu_map_page(struct domain *d,
 return rc;
 }
 
-if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, true) || !pt_mfn )
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+ !pt_mfn )
 {
 spin_unlock(&hd->arch.mapping_lock);
 AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -342,7 +343,7 @@ int amd_iommu_unmap_page(struct domain *
 return 0;
 }
 
-if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, false) )
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
 {
 spin_unlock(&hd->arch.mapping_lock);
 AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",




[PATCH v2 08/18] IOMMU/x86: support freeing of pagetables

2021-09-24 Thread Jan Beulich
For vendor specific code to support superpages we need to be able to
deal with a superpage mapping replacing an intermediate page table (or
hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
needed to free individual page tables while a domain is still alive.
Since the freeing needs to be deferred until after a suitable IOTLB
flush was performed, released page tables get queued for processing by a
tasklet.

Signed-off-by: Jan Beulich 
---
I was considering whether to use a softirq-tasklet instead. This would
have the benefit of avoiding extra scheduling operations, but come with
the risk of the freeing happening prematurely because of a
process_pending_softirqs() somewhere.
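
A rough standalone illustration of the deferral pattern (toy types, not the
Xen implementation): queued page tables are handed back to the allocator only
when the drain routine, standing in for the per-CPU tasklet, eventually runs.

#include <stdio.h>
#include <stdlib.h>

/* Toy model of the deferral: queued page tables are only released once
 * drain() runs, i.e. after the IOTLB flush is assumed to have happened. */
struct toy_pgtable {
    struct toy_pgtable *next;
    int id;
};

static struct toy_pgtable *free_list;

static void queue_free(struct toy_pgtable *pg)
{
    pg->next = free_list;
    free_list = pg;
}

static void drain(void)    /* stands in for the tasklet body */
{
    while ( free_list )
    {
        struct toy_pgtable *pg = free_list;

        free_list = pg->next;
        printf("freeing page table %d\n", pg->id);
        free(pg);
    }
}

int main(void)
{
    int i;

    for ( i = 0; i < 3; ++i )
    {
        struct toy_pgtable *pg = malloc(sizeof(*pg));

        pg->id = i;
        queue_free(pg);    /* intermediate table replaced by a superpage */
    }

    /* ... the IOTLB flush would happen here ... */
    drain();
    return 0;
}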

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -12,6 +12,7 @@
  * this program; If not, see .
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -463,6 +464,85 @@ struct page_info *iommu_alloc_pgtable(st
 return pg;
 }
 
+/*
+ * Intermediate page tables which get replaced by large pages may only be
+ * freed after a suitable IOTLB flush. Hence such pages get queued on a
+ * per-CPU list, with a per-CPU tasklet processing the list on the assumption
+ * that the necessary IOTLB flush will have occurred by the time tasklets get
+ * to run. (List and tasklet being per-CPU has the benefit of accesses not
+ * requiring any locking.)
+ */
+static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
+static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
+
+static void free_queued_pgtables(void *arg)
+{
+struct page_list_head *list = arg;
+struct page_info *pg;
+
+while ( (pg = page_list_remove_head(list)) )
+free_domheap_page(pg);
+}
+
+void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg)
+{
+struct domain_iommu *hd = dom_iommu(d);
+unsigned int cpu = smp_processor_id();
+
+spin_lock(&hd->arch.pgtables.lock);
+page_list_del(pg, &hd->arch.pgtables.list);
+spin_unlock(&hd->arch.pgtables.lock);
+
+page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
+
+tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu));
+}
+
+static int cpu_callback(
+struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+unsigned int cpu = (unsigned long)hcpu;
+struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
+struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
+
+switch ( action )
+{
+case CPU_DOWN_PREPARE:
+tasklet_kill(tasklet);
+break;
+
+case CPU_DEAD:
+page_list_splice(list, &this_cpu(free_pgt_list));
+INIT_PAGE_LIST_HEAD(list);
+tasklet_schedule(&this_cpu(free_pgt_tasklet));
+break;
+
+case CPU_UP_PREPARE:
+case CPU_DOWN_FAILED:
+tasklet_init(tasklet, free_queued_pgtables, list);
+break;
+}
+
+return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+.notifier_call = cpu_callback,
+};
+
+static int __init bsp_init(void)
+{
+if ( iommu_enabled )
+{
+cpu_callback(&cpu_nfb, CPU_UP_PREPARE,
+ (void *)(unsigned long)smp_processor_id());
+register_cpu_notifier(&cpu_nfb);
+}
+
+return 0;
+}
+presmp_initcall(bsp_init);
+
 bool arch_iommu_use_permitted(const struct domain *d)
 {
 /*
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -143,6 +143,7 @@ int pi_update_irte(const struct pi_desc
 
 int __must_check iommu_free_pgtables(struct domain *d);
 struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*




[PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches

2021-09-24 Thread Jan Beulich
For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).

The earlier establishing of PGT_writable_page | PGT_validated requires
the existing places where this gets done (through get_page_and_type())
to be updated: For pages which actually have a mapping, the type
refcount needs to be 1.

There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").

Signed-off-by: Jan Beulich 
---
Subsequently p2m_add_identity_entry() may want to also gain an order
parameter, for arch_iommu_hwdom_init() to use. While this only affects
non-RAM regions, systems typically have 2-16Mb of reserved space
immediately below 4Gb, which hence could be mapped more efficiently.

The installing of zero-ref writable types has in fact shown (observed
while putting together the change) that despite the intention by the
XSA-288 changes (affecting DomU-s only) for Dom0 a number of
sufficiently ordinary pages (at the very least initrd and P2M ones as
well as pages that are part of the initial allocation but not part of
the initial mapping) still have been starting out as PGT_none, meaning
that they would have gained IOMMU mappings only the first time these
pages would get mapped writably.

I didn't think I needed to address the bug mentioned in the description in
a separate (prereq) patch, but if others disagree I could certainly
break out that part (needing to first use iommu_legacy_unmap() then).

Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
They'll end up mapped via the later get_page_and_type().

As to the way these refs get installed: I've chosen to avoid the more
expensive {get,put}_page_and_type(), putting in place the intended type
directly. I guess I could be convinced to avoid this bypassing of the
actual logic; I merely think it's unnecessarily expensive.

--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -106,11 +106,26 @@ static __init void mark_pv_pt_pages_rdon
 unmap_domain_page(pl3e);
 }
 
+/*
+ * For IOMMU mappings done while building Dom0 the type of the pages needs to
+ * match (for _get_page_type() to unmap upon type change). Set the pages to
+ * writable with no type ref. NB: This is benign when !need_iommu_pt_sync(d).
+ */
+static void __init make_pages_writable(struct page_info *page, unsigned long nr)
+{
+for ( ; nr--; ++page )
+{
+ASSERT(!page->u.inuse.type_info);
+page->u.inuse.type_info = PGT_writable_page | PGT_validated;
+}
+}
+
 static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
 unsigned long v_start, unsigned long v_end,
 unsigned long vphysmap_start,
 unsigned long vphysmap_end,
-unsigned long nr_pages)
+unsigned long nr_pages,
+unsigned int *flush_flags)
 {
 struct page_info *page = NULL;
 l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
@@ -123,6 +138,8 @@ static __init void setup_pv_physmap(stru
 
 while ( vphysmap_start < vphysmap_end )
 {
+int rc = 0;
+
 if ( domain_tot_pages(d) +
  ((round_pgup(vphysmap_end) - vphysmap_start) >> PAGE_SHIFT) +
  3 > nr_pages )
@@ -176,7 +193,22 @@ static __init void setup_pv_physmap(stru
  L3_PAGETABLE_SHIFT - PAGE_SHIFT,
  MEMF_no_scrub)) != NULL )
 {
-*pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+mfn_t mfn = page_to_mfn(page);
+
+if ( need_iommu_pt_sync(d) )
+rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn,
+   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
+   IOMMUF_readable | IOMMUF_writable,
+   flush_flags);
+if ( !rc )
+make_pages_writable(page,
+SUPERPAGE_PAGES * SUPERPAGE_PAGES);
+else
+printk(XENLOG_ERR
+   "pre-mapping P2M 1G-MFN %lx into IOMMU failed: 
%d\n",
+   mfn_x(mfn), rc);
+
+*pl3e = l3e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
 vphysmap_start += 1UL << L3_PAGETAB

[PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

2021-09-24 Thread Jan Beulich
While already the case for PVH, there's no reason to treat PV
differently here, though of course the addresses get taken from another
source in this case. Except that, to match CPU side mappings, by default
we permit r/o ones. This then also means we now deal consistently with
IO-APICs whose MMIO is or is not covered by E820 reserved regions.

Signed-off-by: Jan Beulich 
---
[integrated] v1: Integrate into series.
[standalone] v2: Keep IOMMU mappings in sync with CPU ones.

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -253,12 +253,12 @@ void iommu_identity_map_teardown(struct
 }
 }
 
-static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
- unsigned long pfn,
- unsigned long max_pfn)
+static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
+ unsigned long pfn,
+ unsigned long max_pfn)
 {
 mfn_t mfn = _mfn(pfn);
-unsigned int i, type;
+unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
 
 /*
  * Set up 1:1 mapping for dom0. Default to include only conventional RAM
@@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
  * that fall in unusable ranges for PV Dom0.
  */
 if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-return false;
+return 0;
 
 switch ( type = page_get_ram_type(mfn) )
 {
 case RAM_TYPE_UNUSABLE:
-return false;
+return 0;
 
 case RAM_TYPE_CONVENTIONAL:
 if ( iommu_hwdom_strict )
-return false;
+return 0;
 break;
 
 default:
 if ( type & RAM_TYPE_RESERVED )
 {
 if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-return false;
+perms = 0;
 }
-else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
-return false;
+else if ( is_hvm_domain(d) )
+return 0;
+else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
+perms = 0;
 }
 
 /* Check that it doesn't overlap with the Interrupt Address Range. */
 if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-return false;
+return 0;
 /* ... or the IO-APIC */
-for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
-if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-return false;
+if ( has_vioapic(d) )
+{
+for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
+return 0;
+}
+else if ( is_pv_domain(d) )
+{
+/*
+ * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+ * ones there, so it should also have such established for IOMMUs.
+ */
+for ( i = 0; i < nr_ioapics; i++ )
+if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
+return rangeset_contains_singleton(mmio_ro_ranges, pfn)
+   ? IOMMUF_readable : 0;
+}
 /*
  * ... or the PCIe MCFG regions.
  * TODO: runtime added MMCFG regions are not checked to make sure they
  * don't overlap with already mapped regions, thus preventing trapping.
  */
 if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-return false;
+return 0;
 
-return true;
+return perms;
 }
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
@@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
 for ( ; i < top; i++ )
 {
 unsigned long pfn = pdx_to_pfn(i);
+unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
 int rc;
 
-if ( !hwdom_iommu_map(d, pfn, max_pfn) )
+if ( !perms )
 rc = 0;
 else if ( paging_mode_translate(d) )
-rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
+rc = set_identity_p2m_entry(d, pfn,
+perms & IOMMUF_writable ? p2m_access_rw
+: p2m_access_r,
+0);
 else
 rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-   IOMMUF_readable | IOMMUF_writable, &flush_flags);
+   perms, &flush_flags);
 
 if ( rc )
printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",




[PATCH v2 05/18] IOMMU: have iommu_{,un}map() split requests into largest possible chunks

2021-09-24 Thread Jan Beulich
Introduce a helper function to determine the largest possible mapping
that allows covering a request (or the next part of it that is left to
be processed).

In order to not add yet more recurring dfn_add() / mfn_add() to the two
callers of the new helper, also introduce local variables holding the
values presently operated on.

Signed-off-by: Jan Beulich 
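
For illustration, a standalone sketch (chunk_order is a made-up name mirroring
the helper introduced below, assuming a 4k base page and 4k/2M/1G support):
the order of each chunk is limited both by the alignment of dfn|mfn and by the
number of pages still left.

#include <stdio.h>

#define PAGE_SHIFT 12

/* dfn/mfn are 4k frame numbers, nr is the number of 4k pages still to be
 * processed, sizes is a bitmask of supported page sizes whose lowest set
 * bit is the 4k base page. */
static unsigned int chunk_order(unsigned long sizes, unsigned long dfn,
                                unsigned long mfn, unsigned long nr)
{
    unsigned long res = dfn | mfn;
    unsigned int bit = __builtin_ctzl(sizes), order = 0;

    while ( (sizes = (sizes >> bit) & ~1UL) )
    {
        unsigned long mask;

        bit = __builtin_ctzl(sizes);
        mask = (1UL << bit) - 1;
        if ( nr <= mask || (res & mask) )
            break;
        order += bit;
        nr >>= bit;
        res >>= bit;
    }

    return order;
}

int main(void)
{
    /* 4k, 2M and 1G supported. */
    unsigned long sizes = (1UL << PAGE_SHIFT) | (1UL << 21) | (1UL << 30);
    unsigned long dfn = 0x1ff, nr = 0x402;

    while ( nr )
    {
        unsigned int order = chunk_order(sizes, dfn, dfn, nr);

        printf("dfn %#lx: order %u (%lu page(s))\n", dfn, order, 1UL << order);
        dfn += 1UL << order;
        nr -= 1UL << order;
    }

    return 0;
}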

--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -260,12 +260,38 @@ void iommu_domain_destroy(struct domain
 arch_iommu_domain_destroy(d);
 }
 
-int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
+static unsigned int mapping_order(const struct domain_iommu *hd,
+  dfn_t dfn, mfn_t mfn, unsigned long nr)
+{
+unsigned long res = dfn_x(dfn) | mfn_x(mfn);
+unsigned long sizes = hd->platform_ops->page_sizes;
+unsigned int bit = find_first_set_bit(sizes), order = 0;
+
+ASSERT(bit == PAGE_SHIFT);
+
+while ( (sizes = (sizes >> bit) & ~1) )
+{
+unsigned long mask;
+
+bit = find_first_set_bit(sizes);
+mask = (1UL << bit) - 1;
+if ( nr <= mask || (res & mask) )
+break;
+order += bit;
+nr >>= bit;
+res >>= bit;
+}
+
+return order;
+}
+
+int iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
   unsigned long page_count, unsigned int flags,
   unsigned int *flush_flags)
 {
 const struct domain_iommu *hd = dom_iommu(d);
 unsigned long i;
+unsigned int order;
 int rc = 0;
 
 if ( !is_iommu_enabled(d) )
@@ -273,10 +299,16 @@ int iommu_map(struct domain *d, dfn_t df
 
 ASSERT(!IOMMUF_order(flags));
 
-for ( i = 0; i < page_count; i++ )
+for ( i = 0; i < page_count; i += 1UL << order )
 {
-rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
-mfn_add(mfn, i), flags, flush_flags);
+dfn_t dfn = dfn_add(dfn0, i);
+mfn_t mfn = mfn_add(mfn0, i);
+unsigned long j;
+
+order = mapping_order(hd, dfn, mfn, page_count - i);
+
+rc = iommu_call(hd->platform_ops, map_page, d, dfn, mfn,
+flags | IOMMUF_order(order), flush_flags);
 
 if ( likely(!rc) )
 continue;
@@ -284,14 +316,18 @@ int iommu_map(struct domain *d, dfn_t df
 if ( !d->is_shutting_down && printk_ratelimit() )
 printk(XENLOG_ERR
"d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" 
failed: %d\n",
-   d->domain_id, dfn_x(dfn_add(dfn, i)),
-   mfn_x(mfn_add(mfn, i)), rc);
+   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
+
+for ( j = 0; j < i; j += 1UL << order )
+{
+dfn = dfn_add(dfn0, j);
+order = mapping_order(hd, dfn, _mfn(0), i - j);
 
-while ( i-- )
 /* if statement to satisfy __must_check */
-if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-0, flush_flags) )
+if ( iommu_call(hd->platform_ops, unmap_page, d, dfn, order,
+flush_flags) )
 continue;
+}
 
 if ( !is_hardware_domain(d) )
 domain_crash(d);
@@ -322,20 +358,25 @@ int iommu_legacy_map(struct domain *d, d
 return rc;
 }
 
-int iommu_unmap(struct domain *d, dfn_t dfn, unsigned long page_count,
+int iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
 unsigned int *flush_flags)
 {
 const struct domain_iommu *hd = dom_iommu(d);
 unsigned long i;
+unsigned int order;
 int rc = 0;
 
 if ( !is_iommu_enabled(d) )
 return 0;
 
-for ( i = 0; i < page_count; i++ )
+for ( i = 0; i < page_count; i += 1UL << order )
 {
-int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
- 0, flush_flags);
+dfn_t dfn = dfn_add(dfn0, i);
+int err;
+
+order = mapping_order(hd, dfn, _mfn(0), page_count - i);
+err = iommu_call(hd->platform_ops, unmap_page, d, dfn,
+ order, flush_flags);
 
 if ( likely(!err) )
 continue;
@@ -343,7 +384,7 @@ int iommu_unmap(struct domain *d, dfn_t
 if ( !d->is_shutting_down && printk_ratelimit() )
 printk(XENLOG_ERR
"d%d: IOMMU unmapping dfn %"PRI_dfn" failed: %d\n",
-   d->domain_id, dfn_x(dfn_add(dfn, i)), err);
+   d->domain_id, dfn_x(dfn), err);
 
 if ( !rc )
 rc = err;




[PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks

2021-09-24 Thread Jan Beulich
Or really, in the case of ->map_page(), accommodate it in the existing
"flags" parameter. All call sites will pass 0 for now.

Signed-off-by: Jan Beulich 
Reviewed-by: Kevin Tian 
---
v2: Re-base over change earlier in the series.
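
A standalone sketch of the resulting flag layout (for illustration only; the
writable bit position is an assumption here, only the readable one is visible
in the truncated hunk below): the low six bits of the map_page() flags carry
the order, with the permission bits sitting above them.

#include <stdio.h>

/* Order mask and readable bit as introduced at the end of this patch;
 * _IOMMUF_writable at bit 7 is assumed for this sketch. */
#define IOMMUF_order(n)  ((n) & 0x3f)
#define _IOMMUF_readable 6
#define IOMMUF_readable  (1u << _IOMMUF_readable)
#define _IOMMUF_writable 7
#define IOMMUF_writable  (1u << _IOMMUF_writable)

int main(void)
{
    /* Order 9 on x86, i.e. a 2M mapping made of 512 4k pages. */
    unsigned int flags = IOMMUF_readable | IOMMUF_writable | IOMMUF_order(9);

    printf("order %u, readable %d, writable %d\n",
           IOMMUF_order(flags),
           !!(flags & IOMMUF_readable),
           !!(flags & IOMMUF_writable));
    return 0;
}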

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -230,6 +230,7 @@ int __must_check amd_iommu_map_page(stru
 mfn_t mfn, unsigned int flags,
 unsigned int *flush_flags);
 int __must_check amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
+  unsigned int order,
   unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain *d);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -328,7 +328,7 @@ int amd_iommu_map_page(struct domain *d,
 return 0;
 }
 
-int amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
+int amd_iommu_unmap_page(struct domain *d, dfn_t dfn, unsigned int order,
  unsigned int *flush_flags)
 {
 unsigned long pt_mfn = 0;
--- a/xen/drivers/passthrough/arm/iommu_helpers.c
+++ b/xen/drivers/passthrough/arm/iommu_helpers.c
@@ -57,11 +57,13 @@ int __must_check arm_iommu_map_page(stru
  * The function guest_physmap_add_entry replaces the current mapping
  * if there is already one...
  */
-return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0, t);
+return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+   IOMMUF_order(flags), t);
 }
 
 /* Should only be used if P2M Table is shared between the CPU and the IOMMU. */
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+  unsigned int order,
   unsigned int *flush_flags)
 {
 /*
@@ -71,7 +73,8 @@ int __must_check arm_iommu_unmap_page(st
 if ( !is_domain_direct_mapped(d) )
 return -EINVAL;
 
-return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0);
+return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+ order);
 }
 
 /*
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -271,6 +271,8 @@ int iommu_map(struct domain *d, dfn_t df
 if ( !is_iommu_enabled(d) )
 return 0;
 
+ASSERT(!IOMMUF_order(flags));
+
 for ( i = 0; i < page_count; i++ )
 {
 rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
@@ -288,7 +290,7 @@ int iommu_map(struct domain *d, dfn_t df
 while ( i-- )
 /* if statement to satisfy __must_check */
 if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-flush_flags) )
+0, flush_flags) )
 continue;
 
 if ( !is_hardware_domain(d) )
@@ -333,7 +335,7 @@ int iommu_unmap(struct domain *d, dfn_t
 for ( i = 0; i < page_count; i++ )
 {
 int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
- flush_flags);
+ 0, flush_flags);
 
 if ( likely(!err) )
 continue;
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1934,6 +1934,7 @@ static int __must_check intel_iommu_map_
 }
 
 static int __must_check intel_iommu_unmap_page(struct domain *d, dfn_t dfn,
+   unsigned int order,
unsigned int *flush_flags)
 {
 /* Do nothing if VT-d shares EPT page table */
@@ -1944,7 +1945,7 @@ static int __must_check intel_iommu_unma
 if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
 return 0;
 
-return dma_pte_clear_one(d, dfn_to_daddr(dfn), 0, flush_flags);
+return dma_pte_clear_one(d, dfn_to_daddr(dfn), order, flush_flags);
 }
 
 static int intel_iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
--- a/xen/include/asm-arm/iommu.h
+++ b/xen/include/asm-arm/iommu.h
@@ -31,6 +31,7 @@ int __must_check arm_iommu_map_page(stru
 unsigned int flags,
 unsigned int *flush_flags);
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+  unsigned int order,
   unsigned int *flush_flags);
 
 #endif /* __ARCH_ARM_IOMMU_H__ */
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -127,9 +127,10 @@ void arch_iommu_hwdom_init(struct domain
  * The following flags are passed to map operations and passed by lookup
  * operations.
  */
-#define _IOMMUF_readable 0
+#define IOMMUF_order(n)  ((n) & 0x3f)
+#define _IOMMUF_readable 6
 #define IO

[PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes

2021-09-24 Thread Jan Beulich
Generic code will use this information to determine what order values
can legitimately be passed to the ->{,un}map_page() hooks. For now all
ops structures simply get to announce 4k mappings (as base page size),
and there is (and always has been) an assumption that this matches the
CPU's MMU base page size (eventually we will want to permit IOMMUs with
a base page size smaller than the CPU MMU's).

Signed-off-by: Jan Beulich 
Reviewed-by: Kevin Tian 
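
For illustration (standalone sketch, not Xen code): the sanity check added to
iommu_setup() below relies on x & -x isolating the lowest set bit of the mask,
i.e. the smallest advertised page size, which has to come out as the 4k base
page.

#include <stdio.h>

#define PAGE_SIZE_4K (1UL << 12)

int main(void)
{
    /* A mask advertising 4k, 2M and 1G mappings. */
    unsigned long page_sizes = (1UL << 12) | (1UL << 21) | (1UL << 30);
    unsigned long smallest = page_sizes & -page_sizes;

    printf("smallest advertised size: %#lx (4k base page: %s)\n",
           smallest, smallest == PAGE_SIZE_4K ? "yes" : "no");
    return 0;
}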

--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -629,6 +629,7 @@ static void amd_dump_page_tables(struct
 }
 
 static const struct iommu_ops __initconstrel _iommu_ops = {
+.page_sizes = PAGE_SIZE_4K,
 .init = amd_iommu_domain_init,
 .hwdom_init = amd_iommu_hwdom_init,
 .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -1298,6 +1298,7 @@ static void ipmmu_iommu_domain_teardown(
 
 static const struct iommu_ops ipmmu_iommu_ops =
 {
+.page_sizes  = PAGE_SIZE_4K,
 .init= ipmmu_iommu_domain_init,
 .hwdom_init  = ipmmu_iommu_hwdom_init,
 .teardown= ipmmu_iommu_domain_teardown,
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2873,6 +2873,7 @@ static void arm_smmu_iommu_domain_teardo
 }
 
 static const struct iommu_ops arm_smmu_iommu_ops = {
+.page_sizes = PAGE_SIZE_4K,
 .init = arm_smmu_iommu_domain_init,
 .hwdom_init = arm_smmu_iommu_hwdom_init,
 .add_device = arm_smmu_dt_add_device_generic,
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -3426,7 +3426,8 @@ static void arm_smmu_iommu_xen_domain_te
 }
 
 static const struct iommu_ops arm_smmu_iommu_ops = {
-   .init   = arm_smmu_iommu_xen_domain_init,
+   .page_sizes = PAGE_SIZE_4K,
+   .init   = arm_smmu_iommu_xen_domain_init,
.hwdom_init = arm_smmu_iommu_hwdom_init,
.teardown   = arm_smmu_iommu_xen_domain_teardown,
.iotlb_flush= arm_smmu_iotlb_flush,
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -470,7 +470,17 @@ int __init iommu_setup(void)
 
 if ( iommu_enable )
 {
+const struct iommu_ops *ops = NULL;
+
 rc = iommu_hardware_setup();
+if ( !rc )
+ops = iommu_get_ops();
+if ( ops && (ops->page_sizes & -ops->page_sizes) != PAGE_SIZE )
+{
+printk(XENLOG_ERR "IOMMU: page size mask %lx unsupported\n",
+   ops->page_sizes);
+rc = ops->page_sizes ? -EPERM : -ENODATA;
+}
 iommu_enabled = (rc == 0);
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2806,6 +2806,7 @@ static int __init intel_iommu_quarantine
 }
 
 static struct iommu_ops __initdata vtd_ops = {
+.page_sizes = PAGE_SIZE_4K,
 .init = intel_iommu_domain_init,
 .hwdom_init = intel_iommu_hwdom_init,
 .quarantine_init = intel_iommu_quarantine_init,
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -231,6 +231,7 @@ struct page_info;
 typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
 
 struct iommu_ops {
+unsigned long page_sizes;
 int (*init)(struct domain *d);
 void (*hwdom_init)(struct domain *d);
 int (*quarantine_init)(struct domain *d);




[PATCH v2 02/18] VT-d: have callers specify the target level for page table walks

2021-09-24 Thread Jan Beulich
In order to be able to insert/remove super-pages we need to allow
callers of the walking function to specify at which point to stop the
walk.

For intel_iommu_lookup_page() integrate the last level access into
the main walking function.

dma_pte_clear_one() gets only partly adjusted for now: Error handling
and order parameter get put in place, but the order parameter remains
ignored (just like intel_iommu_map_page()'s order part of the flags).

Signed-off-by: Jan Beulich 
---
I have to admit that I don't understand why domain_pgd_maddr() wants to
populate all page table levels for DFN 0.

I was actually wondering whether it wouldn't make sense to integrate
dma_pte_clear_one() into its only caller intel_iommu_unmap_page(), for
better symmetry with intel_iommu_map_page().
---
v2: Fix build.
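
Purely illustrative (standalone sketch with a made-up PTE value, not the VT-d
code): when a present entry is replaced by a lower-level table further down in
this patch, entry 0 inherits the old value and every further entry advances
the address by the amount covered one level below; here a 2M entry split into
512 4k entries.

#include <stdio.h>
#include <inttypes.h>

#define PTE_NUM 512

int main(void)
{
    uint64_t val = 0x40000003;    /* made-up PTE: address 0x40000000, r/w low bits */
    uint64_t inc = 1UL << 12;     /* 4k covered per entry at the level below */
    uint64_t split[PTE_NUM];
    unsigned int i;

    split[0] = val;               /* (the real code also clears the SP bit here) */
    for ( i = 1; i < PTE_NUM; ++i )
        split[i] = split[i - 1] + inc;

    printf("split[0]   = %#" PRIx64 "\n", split[0]);
    printf("split[511] = %#" PRIx64 "\n", split[511]);
    return 0;
}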

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -264,63 +264,116 @@ static u64 bus_to_context_maddr(struct v
 return maddr;
 }
 
-static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int alloc)
+/*
+ * This function walks (and if requested allocates) page tables to the
+ * designated target level. It returns
+ * - 0 when a non-present entry was encountered and no allocation was
+ *   requested,
+ * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
+ *   failure,
+ * - for target > 0 the address of the page table holding the leaf PTE for
+ *   the requested address,
+ * - for target == 0 the full PTE.
+ */
+static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
+   unsigned int target,
+   unsigned int *flush_flags, bool alloc)
 {
 struct domain_iommu *hd = dom_iommu(domain);
 int addr_width = agaw_to_width(hd->arch.vtd.agaw);
 struct dma_pte *parent, *pte = NULL;
-int level = agaw_to_level(hd->arch.vtd.agaw);
-int offset;
+unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
 u64 pte_maddr = 0;
 
 addr &= (((u64)1) << addr_width) - 1;
 ASSERT(spin_is_locked(&hd->arch.mapping_lock));
+ASSERT(target || !alloc);
+
 if ( !hd->arch.vtd.pgd_maddr )
 {
 struct page_info *pg;
 
-if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
+if ( !alloc )
+goto out;
+
+pte_maddr = level;
+if ( !(pg = iommu_alloc_pgtable(domain)) )
 goto out;
 
 hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
 }
 
-parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
-while ( level > 1 )
+pte_maddr = hd->arch.vtd.pgd_maddr;
+parent = map_vtd_domain_page(pte_maddr);
+while ( level > target )
 {
 offset = address_level_offset(addr, level);
 pte = &parent[offset];
 
 pte_maddr = dma_pte_addr(*pte);
-if ( !pte_maddr )
+if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
 {
 struct page_info *pg;
+/*
+ * Higher level tables always set r/w, last level page table
+ * controls read/write.
+ */
+struct dma_pte new_pte = { DMA_PTE_PROT };
 
 if ( !alloc )
-break;
+{
+pte_maddr = 0;
+if ( !dma_pte_present(*pte) )
+break;
+
+/*
+ * When the leaf entry was requested, pass back the full PTE,
+ * with the address adjusted to account for the residual of
+ * the walk.
+ */
+pte_maddr = pte->val +
+(addr & ((1UL << level_to_offset_bits(level)) - 1) &
+ PAGE_MASK);
+if ( !target )
+break;
+}
 
+pte_maddr = level - 1;
 pg = iommu_alloc_pgtable(domain);
 if ( !pg )
 break;
 
 pte_maddr = page_to_maddr(pg);
-dma_set_pte_addr(*pte, pte_maddr);
+dma_set_pte_addr(new_pte, pte_maddr);
 
-/*
- * high level table always sets r/w, last level
- * page table control read/write
- */
-dma_set_pte_readable(*pte);
-dma_set_pte_writable(*pte);
+if ( dma_pte_present(*pte) )
+{
+struct dma_pte *split = map_vtd_domain_page(pte_maddr);
+unsigned long inc = 1UL << level_to_offset_bits(level - 1);
+
+split[0].val = pte->val;
+if ( inc == PAGE_SIZE )
+split[0].val &= ~DMA_PTE_SP;
+
+for ( offset = 1; offset < PTE_NUM; ++offset )
+split[offset].val = split[offset - 1].val + inc;
+
+iommu_sync_cache(split, PAGE_SIZE);
+unmap_vtd_domain_page(split);
+
+if ( flush_flags )
+*flush_flags |=

[PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks

2021-09-24 Thread Jan Beulich
In order to be able to insert/remove super-pages we need to allow
callers of the walking function to specify at which point to stop the
walk. (For now at least gcc will instantiate just a variant of the
function with the parameter eliminated, so effectively no change to
generated code as far as the parameter addition goes.)

Instead of merely adjusting a BUG_ON() condition, convert it into an
error return - there's no reason to crash the entire host in that case.

Signed-off-by: Jan Beulich 

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -178,7 +178,8 @@ void __init iommu_dte_add_device_entry(s
  * page tables.
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
-  unsigned long *pt_mfn, bool map)
+  unsigned int target, unsigned long *pt_mfn,
+  bool map)
 {
 union amd_iommu_pte *pde, *next_table_vaddr;
 unsigned long  next_table_mfn;
@@ -189,7 +190,8 @@ static int iommu_pde_from_dfn(struct dom
 table = hd->arch.amd.root_table;
 level = hd->arch.amd.paging_mode;
 
-BUG_ON( table == NULL || level < 1 || level > 6 );
+if ( !table || target < 1 || level < target || level > 6 )
+return 1;
 
 /*
  * A frame number past what the current page tables can represent can't
@@ -200,7 +202,7 @@ static int iommu_pde_from_dfn(struct dom
 
 next_table_mfn = mfn_x(page_to_mfn(table));
 
-while ( level > 1 )
+while ( level > target )
 {
 unsigned int next_level = level - 1;
 
@@ -307,7 +309,7 @@ int amd_iommu_map_page(struct domain *d,
 return rc;
 }
 
-if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, true) || !pt_mfn )
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, true) || !pt_mfn )
 {
 spin_unlock(&hd->arch.mapping_lock);
 AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -340,7 +342,7 @@ int amd_iommu_unmap_page(struct domain *
 return 0;
 }
 
-if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, false) )
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, false) )
 {
 spin_unlock(&hd->arch.mapping_lock);
 AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",




[PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables

2021-09-24 Thread Jan Beulich
For a long time we've been rather inefficient with IOMMU page table
management when not sharing page tables, i.e. in particular for PV (and
further specifically also for PV Dom0) and AMD (where nowadays we never
share page tables). While up to about 2.5 years ago AMD code had logic
to un-shatter page mappings, that logic was ripped out for being buggy
(XSA-275 plus follow-on).

This series enables use of large pages in AMD and Intel (VT-d) code;
Arm is presently not in need of any enabling as pagetables are always
shared there. It also augments PV Dom0 creation with suitable explicit
IOMMU mapping calls to facilitate use of large pages there without
getting into the business of un-shattering page mappings just yet.
Depending on the amount of memory handed to Dom0 this improves booting
time (latency until Dom0 actually starts) quite a bit; subsequent
shattering of some of the large pages may of course consume some of the
saved time.

Known fallout has been spelled out here:
https://lists.xen.org/archives/html/xen-devel/2021-08/msg00781.html

I'm inclined to say "of course" there are also a few seemingly unrelated
changes included here, which I just came to consider necessary or at
least desirable (in part for having been in need of adjustment for a
long time) along the way. Some of these changes are likely independent
of the bulk of the work here, and hence may be fine to go in ahead of
earlier patches.

While, as said above, un-shattering of mappings isn't an immediate goal,
the last few patches now at least arrange for freeing page tables which
have ended up all empty. This also introduces the underlying support to
then un-shatter large pages (potentially re-usable elsewhere as well),
but that's not part of this v2 of the series.

01: AMD/IOMMU: have callers specify the target level for page table walks
02: VT-d: have callers specify the target level for page table walks
03: IOMMU: have vendor code announce supported page sizes
04: IOMMU: add order parameter to ->{,un}map_page() hooks
05: IOMMU: have iommu_{,un}map() split requests into largest possible chunks
06: IOMMU/x86: restrict IO-APIC mappings for PV Dom0
07: IOMMU/x86: perform PV Dom0 mappings in batches
08: IOMMU/x86: support freeing of pagetables
09: AMD/IOMMU: drop stray TLB flush
10: AMD/IOMMU: walk trees upon page fault
11: AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
12: AMD/IOMMU: allow use of superpage mappings
13: VT-d: allow use of superpage mappings
14: IOMMU: fold flush-all hook into "flush one"
15: IOMMU/x86: prefill newly allocate page tables
16: x86: introduce helper for recording degree of contiguity in page tables
17: AMD/IOMMU: free all-empty page tables
18: VT-d: free all-empty page tables

While not directly related (except that making this mode work properly
here was a fair part of the overall work), at this occasion I'd also
like to renew my proposal to make "iommu=dom0-strict" the default going
forward. It already is not only the default, but the only possible mode
for PVH Dom0.

Jan




  1   2   >