date:20240327

Xen tries to load a PVH dom0 kernel at the fixed guest physical address
from the elf headers.  For Linux, this defaults to 0x100 (16MB), but
it can be configured.

Unfortunately there exist firmwares that have reserved regions at this
address, so Xen fails to load the dom0 kernel since it's not RAM.

The other issue is that the Linux PVH entry point is not
position-independent.  It expects to run at the compiled
CONFIG_PHYSICAL_ADDRESS.

This patch set expands the PVH dom0 builder to try to relocate the
kernel if needed and possible.  XEN_ELFNOTE_PHYS32_RELOC is added for
kernels to indicate they are relocatable and their acceptable address
range and alignment.

v6:
Choose the alignment from the Note if specified, or the Maximum PHDR
p_align value if greater than PAGE_SIZE.  Otherwise, it falls back to
the default 2MB.

Patches from v5 commited:
853c71dfbf xen/elfnote: Specify ELF Notes are x86-specific
7d8c9b4e8d libelf: Expand ELF note printing
8802230bfa Revert "xen/x86: bzImage parse kernel_alignment"

The first and second patches move MB/GB() to common-macros.h.

The third patch stores the maximum p_align value from the ELF PHDRs.

The fourth patch expands the pvh dom0 kernel placement code.

I'll post an additional patch showing the Linux changes to make PVH
relocatable.

Jason Andryuk (4):
  tools/init-xenstore-domain: Replace variable MB() usage
  tools: Move MB/GB() to common-macros.h
  libelf: Store maximum PHDR p_align
  x86/PVH: Support relocatable dom0 kernels

 tools/firmware/hvmloader/util.h |   3 -
 tools/helpers/init-xenstore-domain.c|  11 ++-
 tools/include/xen-tools/common-macros.h |   4 +
 tools/libs/light/libxl_internal.h   |   4 -
 xen/arch/x86/hvm/dom0_build.c   | 108 
 xen/common/libelf/libelf-dominfo.c  |  35 
 xen/common/libelf/libelf-loader.c   |  15 +++-
 xen/common/libelf/libelf-private.h  |   1 +
 xen/include/public/elfnote.h|  16 +++-
 xen/include/xen/libelf.h|   5 ++
 10 files changed, 186 insertions(+), 16 deletions(-)

-- 
2.44.0

[PATCH v6 4/4] x86/PVH: Support relocatable dom0 kernels

Xen tries to load a PVH dom0 kernel at the fixed guest physical address
from the elf headers.  For Linux, this defaults to 0x100 (16MB), but
it can be configured.

Unfortunately there exist firmwares that have reserved regions at this
address, so Xen fails to load the dom0 kernel since it's not RAM.

The PVH entry code is not relocatable - it loads from absolute
addresses, which fail when the kernel is loaded at a different address.
With a suitably modified kernel, a reloctable entry point is possible.

Add XEN_ELFNOTE_PHYS32_RELOC which specifies optional alignment,
minimum, and maximum addresses needed for the kernel.  The presence of
the NOTE indicates the kernel supports a relocatable entry path.

Change the loading to check for an acceptable load address.  If the
kernel is relocatable, support finding an alternate load address.

The primary motivation for an explicit align field is that Linux has a
configurable CONFIG_PHYSICAL_ALIGN field.  This value is present in the
bzImage setup header, but not the ELF program headers p_align, which
report 2MB even when CONFIG_PHYSICAL_ALIGN is greater.  Since a kernel
is only considered relocatable if the PHYS32_RELOC elf note is present,
the alignment contraints can just be specified within the note instead
of searching for an alignment value via a heuristic.

Load alignment uses the PHYS32_RELOC note value if specified.
Otherwise, the maxmum ELF PHDR p_align value is selected if greater than
or equal to PAGE_SIZE.  Finally, the fallback default is 2MB.

libelf-private.h includes common-macros.h to satisfy the fuzzer build.

Link: https://gitlab.com/xen-project/xen/-/issues/180
Signed-off-by: Jason Andryuk 
---
ELF Note printing looks like:
(XEN) ELF: note: PHYS32_RELOC align: 0x20 min: 0x100 max: 0x3fff

v2:
Use elfnote for min, max & align - use 64bit values.
Print original and relocated memory addresses
Use check_and_adjust_load_address() name
Return relocated base instead of offset
Use PAGE_ALIGN
Don't load above max_phys (expected to be 4GB in kernel elf note)
Use single line comments
Exit check_load_address loop earlier
Add __init to find_kernel_memory()

v3:
Remove kernel_start/end page rounding
Change loop comment to rely on a sorted memory map.
Reorder E820_RAM check first
Use %p for dest_base
Use PRIpaddr
Use 32bit phys_min/max/align
Consolidate to if ( x || y ) continue
Use max_t
Add parms->phys_reloc
Use common "%pd kernel: " prefix for messages
Re-order phys_entry assignment
Print range ends inclusively
Remove extra "Unable to load kernel" message
s/PVH_RELOCATION/PHYS32_RELOC/
Make PHYS32_RELOC contents optional
Use 2MB default alignment
Update ELF Note comment
Update XEN_ELFNOTE_MAX

v4:
Cast dest_base to uintptr_t
Use local start variable
Mention e820 requiring adjacent entries merged
Remove extra whitespace
Re-word elfnote comment & mention x86
Use ELFNOTE_NAME
Return -ENOSPC
Use ! instead of == 0
Check kend - 1 to avoid off by one
libelf: Use MB/GB() to define the size.
Use ARCH_PHYS_* defines

v5:
Place kernel in higher memory addresses
Remove stray semicolons
ELFNOTE_NAME comment about newline
Make PHYS32_RELOC element order align, min, max
Re-word PHYS32_RELOC comment
Move phys_align next to other bool variables

v6:
Select alignment from, in order, Note, PHDRs, then default
---
 xen/arch/x86/hvm/dom0_build.c  | 108 +
 xen/common/libelf/libelf-dominfo.c |  35 ++
 xen/common/libelf/libelf-private.h |   1 +
 xen/include/public/elfnote.h   |  16 -
 xen/include/xen/libelf.h   |   4 ++
 5 files changed, 163 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 0ceda4140b..01f39d650e 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -537,6 +537,111 @@ static paddr_t __init find_memory(
 return INVALID_PADDR;
 }
 
+static bool __init check_load_address(
+const struct domain *d, const struct elf_binary *elf)
+{
+paddr_t kernel_start = (uintptr_t)elf->dest_base;
+paddr_t kernel_end = kernel_start + elf->dest_size;
+unsigned int i;
+
+/* Relies on a sorted memory map with adjacent entries merged. */
+for ( i = 0; i < d->arch.nr_e820; i++ )
+{
+paddr_t start = d->arch.e820[i].addr;
+paddr_t end = start + d->arch.e820[i].size;
+
+if ( start >= kernel_end )
+return false;
+
+if ( d->arch.e820[i].type == E820_RAM &&
+ start <= kernel_start &&
+ end >= kernel_end )
+return true;
+}
+
+return false;
+}
+
+/* Find an e820 RAM region that fits the kernel at a suitable alignment. */
+static paddr_t __init find_kernel_memory(
+const struct domain *d, struct elf_binary *elf,
+const struct elf_dom_parms *parms)
+{
+paddr_t kernel_size = elf->dest_size;
+unsigned int align;
+int i;
+
+if ( parms->phys_align != UNSET_ADDR32 )
+align = parms->phys_align;
+

[PATCH v6 1/4] tools/init-xenstore-domain: Replace variable MB() usage

The local MB() & GB() macros will be replaced with a common
implementation, but those only work with constant values.  Introduce a
static inline mb_to_bytes() in place of the MB() macro to convert the
variable values.

Signed-off-by: Jason Andryuk 
---
v4:
New
---
 tools/helpers/init-xenstore-domain.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/helpers/init-xenstore-domain.c 
b/tools/helpers/init-xenstore-domain.c
index 1683438c5c..5405842dfe 100644
--- a/tools/helpers/init-xenstore-domain.c
+++ b/tools/helpers/init-xenstore-domain.c
@@ -20,7 +20,6 @@
 #include "init-dom-json.h"
 
 #define LAPIC_BASE_ADDRESS  0xfee0UL
-#define MB(x)   ((uint64_t)x << 20)
 #define GB(x)   ((uint64_t)x << 30)
 
 static uint32_t domid = ~0;
@@ -36,6 +35,11 @@ static xc_evtchn_port_or_error_t console_evtchn;
 static xentoollog_level minmsglevel = XTL_PROGRESS;
 static void *logger;
 
+static inline uint64_t mb_to_bytes(int mem)
+{
+   return (uint64_t)mem << 20;
+}
+
 static struct option options[] = {
 { "kernel", 1, NULL, 'k' },
 { "memory", 1, NULL, 'm' },
@@ -76,8 +80,8 @@ static int build(xc_interface *xch)
 int rv, xs_fd;
 struct xc_dom_image *dom = NULL;
 int limit_kb = (maxmem ? : memory) * 1024 + X86_HVM_NR_SPECIAL_PAGES * 4;
-uint64_t mem_size = MB(memory);
-uint64_t max_size = MB(maxmem ? : memory);
+uint64_t mem_size = mb_to_bytes(memory);
+uint64_t max_size = mb_to_bytes(maxmem ? : memory);
 struct e820entry e820[3];
 struct xen_domctl_createdomain config = {
 .ssidref = SECINITSID_DOMU,
-- 
2.44.0

[PATCH v6 3/4] libelf: Store maximum PHDR p_align

While parsing the PHDRs, store the maximum p_align value.  This may be
consulted for moving a PVH image's load address.

Signed-off-by: Jason Andryuk 
---
v6:
New
---
 xen/common/libelf/libelf-loader.c | 15 +++
 xen/include/xen/libelf.h  |  1 +
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/xen/common/libelf/libelf-loader.c 
b/xen/common/libelf/libelf-loader.c
index 629cc0d3e6..a5f6389f82 100644
--- a/xen/common/libelf/libelf-loader.c
+++ b/xen/common/libelf/libelf-loader.c
@@ -468,6 +468,7 @@ void elf_parse_binary(struct elf_binary *elf)
 {
 ELF_HANDLE_DECL(elf_phdr) phdr;
 uint64_t low = -1, high = 0, paddr, memsz;
+uint64_t max_align = 0, palign;
 unsigned i, count;
 
 count = elf_phdr_count(elf);
@@ -481,17 +482,23 @@ void elf_parse_binary(struct elf_binary *elf)
 continue;
 paddr = elf_uval(elf, phdr, p_paddr);
 memsz = elf_uval(elf, phdr, p_memsz);
-elf_msg(elf, "ELF: phdr: paddr=%#" PRIx64 " memsz=%#" PRIx64 "\n",
-paddr, memsz);
+palign = elf_uval(elf, phdr, p_align);
+elf_msg(elf,
+"ELF: phdr: paddr=%#" PRIx64 " memsz=%#" PRIx64 " palign=%#" 
PRIx64 "\n",
+paddr, memsz, palign);
 if ( low > paddr )
 low = paddr;
 if ( high < paddr + memsz )
 high = paddr + memsz;
+if ( max_align < palign )
+max_align = palign;
 }
 elf->pstart = low;
 elf->pend = high;
-elf_msg(elf, "ELF: memory: %#" PRIx64 " -> %#" PRIx64 "\n",
-elf->pstart, elf->pend);
+elf->palign = max_align;
+elf_msg(elf,
+"ELF: memory: %#" PRIx64 " -> %#" PRIx64 " align:%#" PRIx64 "\n",
+elf->pstart, elf->pend, elf->palign);
 }
 
 elf_errorstatus elf_load_binary(struct elf_binary *elf)
diff --git a/xen/include/xen/libelf.h b/xen/include/xen/libelf.h
index 1c77e3df31..2d971f958e 100644
--- a/xen/include/xen/libelf.h
+++ b/xen/include/xen/libelf.h
@@ -196,6 +196,7 @@ struct elf_binary {
 size_t dest_size;
 uint64_t pstart;
 uint64_t pend;
+uint64_t palign;
 uint64_t reloc_offset;
 
 uint64_t bsd_symtab_pstart;
-- 
2.44.0

[PATCH v6] RFC: x86/pvh: Make Xen PVH entrypoint PIC for x86-64

The Xen PVH entrypoint is 32bit non-PIC code running at a default load
address of 0x100 (16MB) (CONFIG_PHYSICAL_START).  Xen loads the
kernel at that physical address inside the PVH container.

When running a PVH Dom0, the system reserved addresses are mapped 1-1
into the PVH container.  There exist system firmwares (Coreboot/EDK2)
with reserved memory at 16MB.  This creates a conflict where the PVH
kernel cannot be loaded at that address.

Modify the PVH entrypoint to be position-indepedent to allow flexibility
in load address.  Only the 64bit entry path is converted.  A 32bit
kernel is not PIC, so calling into other parts of the kernel, like
xen_prepare_pvh() and mk_pgtable_32(), don't work properly when
relocated.

Initial PVH entry runs at the physical addresses and then transitions to
the identity mapped address.  While executing xen_prepare_pvh() calls
through pv_ops function pointers transition to the high mapped
addresses.  Additionally, __va() is called on some hvm_start_info
physical addresses, we need the directmap address range is used.  So we
need to run page tables with all of those ranges mapped.

Modifying init_top_pgt tables ran into issue since
startup_64/__startup_64() will modify those page tables again.  Use a
dedicated set of page tables - pvh_init_top_pgt  - for the PVH entry to
avoid unwanted interactions.

In xen_pvh_init(), __pa() is called to find the physical address of the
hypercall page.  Set phys_base temporarily before calling into
xen_prepare_pvh(), which calls xen_pvh_init(), and clear it afterwards.
__startup_64() assumes phys_base is zero and adds load_delta to it.  If
phys_base is already set, the calculation results in an incorrect cr3.

TODO: Sync elfnote.h from xen.git commit xx when it is
commited.

Signed-off-by: Jason Andryuk 
---
Put this out as an example for the Xen modifications

Instead of setting and clearing phys_base, add a dedicated variable?
Clearing phys_base is a little weird, but it leaves the kernel more
consistent when running non-entry code.

Make __startup_64() exit if phys_base is already set to allow calling
multiple times, and use that and init_top_pgt instead of adding
additional page tables?  That won't work.  __startup_64 is 64bit code,
and pvh needs to create page tables in 32bit code before it can
transition to 64bit long mode.  Hence it can't be re-used to relocate
page tables.
---
 arch/x86/platform/pvh/head.S| 182 +---
 include/xen/interface/elfnote.h |  16 ++-
 2 files changed, 185 insertions(+), 13 deletions(-)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index f7235ef87bc3..ce2b201210e3 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -50,11 +50,32 @@
 #define PVH_CS_SEL (PVH_GDT_ENTRY_CS * 8)
 #define PVH_DS_SEL (PVH_GDT_ENTRY_DS * 8)
 
+#define rva(x) ((x) - pvh_start_xen)
+
 SYM_CODE_START_LOCAL(pvh_start_xen)
UNWIND_HINT_END_OF_STACK
cld
 
-   lgdt (_pa(gdt))
+   /*
+* See the comment for startup_32 for more details.  We need to
+* execute a call to get the execution address to be position
+* independent, but we don't have a stack.  Save and restore the
+* magic field of start_info in ebx, and use that as the stack.
+*/
+   mov (%ebx), %eax
+   leal4(%ebx), %esp
+   ANNOTATE_INTRA_FUNCTION_CALL
+   call1f
+1: popl%ebp
+   mov %eax, (%ebx)
+   subl$ rva(1b), %ebp
+   movl$0, %esp
+
+   lealrva(gdt)(%ebp), %eax
+   movl%eax, %ecx
+   lealrva(gdt_start)(%ebp), %ecx
+   movl%ecx, 2(%eax)
+   lgdt(%eax)
 
mov $PVH_DS_SEL,%eax
mov %eax,%ds
@@ -62,14 +83,14 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
mov %eax,%ss
 
/* Stash hvm_start_info. */
-   mov $_pa(pvh_start_info), %edi
+   leal rva(pvh_start_info)(%ebp), %edi
mov %ebx, %esi
-   mov _pa(pvh_start_info_sz), %ecx
+   movl rva(pvh_start_info_sz)(%ebp), %ecx
shr $2,%ecx
rep
movsl
 
-   mov $_pa(early_stack_end), %esp
+   leal rva(early_stack_end)(%ebp), %esp
 
/* Enable PAE mode. */
mov %cr4, %eax
@@ -83,29 +104,81 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
btsl $_EFER_LME, %eax
wrmsr
 
+   mov %ebp, %ebx
+   subl $LOAD_PHYSICAL_ADDR, %ebx /* offset */
+   jz .Lpagetable_done
+
+   /* Fixup page-tables for relocation. */
+   leal rva(pvh_init_top_pgt)(%ebp), %edi
+   movl $512, %ecx
+2:
+   testl $_PAGE_PRESENT, 0x00(%edi)
+   jz 1f
+   addl %ebx, 0x00(%edi)
+1:
+   addl $8, %edi
+   decl %ecx
+   jnz 2b
+
+   /* L3 ident has a single entry. */
+   leal rva(pvh_level3_ident_pgt)(%ebp), %edi
+   addl %ebx, 0x00(%edi)
+
+   leal rva(pvh_level3_kernel_pgt)(%ebp), %edi
+   addl %ebx, (4096 - 16)(%edi)
+   addl %ebx, (4096

[linux-6.1 test] 185167: tolerable FAIL - PUSHED

flight 185167 linux-6.1 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185167/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185053
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185053
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185053
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185053
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185053
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185053
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-qcow214 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-qcow215 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-raw  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-raw  15 saverestore-support-checkfail   never pass

version targeted for testing:
 linuxe5cd595e23c1a075359a337c0e5c3a4f2dc28dd1
baseline version:
 linuxd7543167affd372819a94879b8b1e8b9b12547d9

Last test of basis   185053  2024-03-15 19:18:14 Z   12 days
Testing same since   185167  2024-03-26 22:43:26 Z1 days1 attempts


377 people touched revisions under test,
not listing them all

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt

Re: Linux Xen PV CPA W^X violation false-positives

On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß  wrote:
>
> On 24.01.24 17:54, Jason Andryuk wrote:
> > +
> > + return new;
> > + }
> > + }
> > +
> >   end = start + npg * PAGE_SIZE - 1;
> >   WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 
> > 0x%016lx - 0x%016lx PFN %lx\n",
> > (unsigned long long)pgprot_val(old),
>
> Jason, do you want to send a V2 with your Signed-off, or would you like me to
> try upstreaming the patch?

Hi Jürgen,

Yes, please upstream your approach.  I wasn't sure how to deal with
it, so it was more of a bug report.

Thanks,
Jason

[linux-5.4 test] 185168: tolerable FAIL - PUSHED

flight 185168 linux-5.4 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185168/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-credit1  14 guest-start  fail  like 185071
 test-armhf-armhf-xl-multivcpu 18 guest-start/debian.repeatfail like 185080
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 185108
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185108
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 185108
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185108
 test-armhf-armhf-xl-rtds 18 guest-start/debian.repeatfail  like 185108
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185108
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185108
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 185108
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185108
 test-armhf-armhf-xl-credit2  18 guest-start/debian.repeatfail  like 185108
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 185108
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185108
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-qcow2 14 migrate-support-checkfail  never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-qcow214 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-qcow215 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-raw  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-raw  15 saverestore-support-checkfail   never pass

version targeted for testing:
 linux24489321d0cd5339f9c2da01eb8bf2bccbac7956
baseline version:
 linux84075826304f1a297838de6bcfd9bd84f566026e

Last test of basis   185108  2024-03-20 07:24:30 Z7 d

[xen-unstable test] 185169: regressions - FAIL

flight 185169 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185169/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-arm64-xsm   6 xen-buildfail REGR. vs. 185162

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict 7 xen-install fail in 
185162 pass in 185169
 test-amd64-amd64-xl-qemut-debianhvm-amd64 20 guest-start/debianhvm.repeat fail 
pass in 185162

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-xsm   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail in 185162 never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail in 185162 never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail in 185162 never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail in 185162 never 
pass
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185162
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185162
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185162
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185162
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185162
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185162
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl-qcow214 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-qcow215 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-raw  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-raw  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  6f6de10ade5ade907f9e3f3c72b7b18f7852d9ff
baseline version:
 xen  6f6de10ade5ade907f9e3f3c72b7b18f7852d9ff

Last test of basis   185169  2024-03-27 01:54:33 Z1 days
Testing same since  (not found) 0 attempts

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  fail
 build-i386-xsm

[QEMU PATCH v5 0/1] Support device passthrough when dom0 is PVH on Xen

Hi All,
This is v5 series to support passthrough on Xen when dom0 is PVH.
v4->v5 changes:
* Add review by Stefano

v3->v4 changes:
* Add gsi into struct XenHostPCIDevice and use gsi number that read from gsi 
sysfs
  if it exists, if there is no gsi sysfs, still use irq.

v2->v3 changes:
* Du to changes in the implementation of the second patch on kernel side(that 
adds
  a new sysfs for gsi instead of a new syscall), so read gsi number from the 
sysfs of gsi.

Below is the description of v2 cover letter:
This patch is the v2 of the implementation of passthrough when dom0 is PVH on 
Xen.
Issues we encountered:
1. failed to map pirq for gsi
Problem: qemu will call xc_physdev_map_pirq() to map a passthrough device’s gsi 
to pirq in
function xen_pt_realize(). But failed.

Reason: According to the implement of xc_physdev_map_pirq(), it needs gsi 
instead of irq,
but qemu pass irq to it and treat irq as gsi, it is got from file
/sys/bus/pci/devices/:xx:xx.x/irq in function xen_host_pci_device_get(). 
But actually
the gsi number is not equal with irq. On PVH dom0, when it allocates irq for a 
gsi in
function acpi_register_gsi_ioapic(), allocation is dynamic, and follow the 
principle of
applying first, distributing first. And if you debug the kernel codes
(see function __irq_alloc_descs), you will find the irq number is allocated 
from small to
large by order, but the applying gsi number is not, gsi 38 may come before gsi 
28, that
causes gsi 38 get a smaller irq number than gsi 28, and then gsi != irq.

Solution: we can record the relation between gsi and irq, then when 
userspace(qemu) want
to use gsi, we can do a translation. The third patch of kernel(xen/privcmd: Add 
new syscall
to get gsi from irq) records all the relations in acpi_register_gsi_xen_pvh() 
when dom0
initialize pci devices, and provide a syscall for userspace to get the gsi from 
irq. The
third patch of xen(tools: Add new function to get gsi from irq) add a new 
function
xc_physdev_gsi_from_irq() to call the new syscall added on kernel side.
And then userspace can use that function to get gsi. Then xc_physdev_map_pirq() 
will success.

This v2 on qemu side is the same as the v1 
(qemu 
https://lore.kernel.org/xen-devel/20230312092244.451465-19-ray.hu...@amd.com/), 
just call
xc_physdev_gsi_from_irq() to get gsi from irq.


Jiqian Chen (1):
  xen: Use gsi instead of irq for mapping pirq

 hw/xen/xen-host-pci-device.c | 7 +++
 hw/xen/xen-host-pci-device.h | 1 +
 hw/xen/xen_pt.c  | 6 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

-- 
2.34.1

[RFC QEMU PATCH v5 1/1] xen: Use gsi instead of irq for mapping pirq

In PVH dom0, it uses the linux local interrupt mechanism,
when it allocs irq for a gsi, it is dynamic, and follow
the principle of applying first, distributing first. And
the irq number is alloced from small to large, but the
applying gsi number is not, may gsi 38 comes before gsi
28, that causes the irq number is not equal with the gsi
number. And when passthrough a device, qemu wants to use
gsi to map pirq, xen_pt_realize->xc_physdev_map_pirq, but
the gsi number is got from file
/sys/bus/pci/devices//irq in current code, so it
will fail when mapping.

Add gsi into XenHostPCIDevice and use gsi number that
read from gsi sysfs if it exists.

Signed-off-by: Huang Rui 
Signed-off-by: Jiqian Chen 
Reviewed-by: Stefano Stabellini 

---
RFC: discussions ongoing on the Linux side where/how to expose the gsi

---
 hw/xen/xen-host-pci-device.c | 7 +++
 hw/xen/xen-host-pci-device.h | 1 +
 hw/xen/xen_pt.c  | 6 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c
index 8c6e9a1716a2..5be3279aa25b 100644
--- a/hw/xen/xen-host-pci-device.c
+++ b/hw/xen/xen-host-pci-device.c
@@ -370,6 +370,13 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t 
domain,
 }
 d->irq = v;
 
+xen_host_pci_get_dec_value(d, "gsi", &v, errp);
+if (*errp) {
+d->gsi = -1;
+} else {
+d->gsi = v;
+}
+
 xen_host_pci_get_hex_value(d, "class", &v, errp);
 if (*errp) {
 goto error;
diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h
index 4d8d34ecb024..74c552bb5548 100644
--- a/hw/xen/xen-host-pci-device.h
+++ b/hw/xen/xen-host-pci-device.h
@@ -27,6 +27,7 @@ typedef struct XenHostPCIDevice {
 uint16_t device_id;
 uint32_t class_code;
 int irq;
+int gsi;
 
 XenHostPCIIORegion io_regions[PCI_NUM_REGIONS - 1];
 XenHostPCIIORegion rom;
diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index 3635d1b39f79..d34a7a8764ab 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -840,7 +840,11 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
 goto out;
 }
 
-machine_irq = s->real_device.irq;
+if (s->real_device.gsi < 0) {
+machine_irq = s->real_device.irq;
+} else {
+machine_irq = s->real_device.gsi;
+}
 if (machine_irq == 0) {
 XEN_PT_LOG(d, "machine irq is 0\n");
 cmd |= PCI_COMMAND_INTX_DISABLE;
-- 
2.34.1

Re: Serious AMD-Vi(?) issue

On 27.03.2024 18:27, Elliott Mitchell wrote:
> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
>>> On 22.03.2024 20:22, Elliott Mitchell wrote:
 On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote:
>
> I can see you've recently engaged with our community with some issues 
> you'd
> like help with.
> We love the fact you are participating in our project, however, our
> developers aren't able to help if you do not provide the specific details.

 Please point to specific details which have been omitted.  Fairly little
 data has been provided as fairly little data is available.  The primary
 observation is large numbers of:

 (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 
 0x8 I

 Lines in Xen's ring buffer.
>>>
>>> Yet this is (part of) the problem: By providing only the messages that 
>>> appear
>>> relevant to you, you imply that you know that no other message is in any way
>>> relevant. That's judgement you'd better leave to people actually trying to
>>> investigate. Unless of course you were proposing an actual code change, with
>>> suitable justification.
>>
>> Honestly, I forgot about the very small number of messages from the SATA
>> subsystem.  The question of whether the current mitigation actions are
>> effective right now was a bigger issue.  As such monitoring `xl dmesg`
>> was a priority to looking at SATA messages which failed to reliably
>> indicate status.
>>
>> I *thought* I would be able to retrieve those via other slow means, but a
>> different and possibly overlapping issue has shown up.  Unfortunately
>> this means those are no longer retrievable.   :-(
> 
> With some persistence I was able to retrieve them.  There are other
> pieces of software with worse UIs than Xen.
> 
>>> In fact when running into trouble, the usual course of action would be to
>>> increase verbosity in both hypervisor and kernel, just to make sure no
>>> potentially relevant message is missed.
>>
>> More/better information might have been obtained if I'd been engaged
>> earlier.
> 
> This is still true, things are in full mitigation mode and I'll be
> quite unhappy to go back with experiments at this point.

Well, it very likely won't work without further experimenting by someone
able to observe the bad behavior. Recall we're on xen-devel here; it is
kind of expected that without clear (and practical) repro instructions
experimenting as well as info collection will remain with the reporter.

> I now see why I left those out.  The messages from the SATA subsystem
> were from a kernel which a bad patch had leaked into a LTS branch.  Looks
> like the SATA subsystem was significantly broken and I'm unsure whether
> any useful information could be retrieved.  Notably there is quite a bit
> of noise from SATA devices not effected by this issue.
> 
> Some of the messages /might/ be useful, but the amount of noise is quite
> high.  Do messages from a broken kernel interest you?

If this was a less vague (in terms of possible root causes) issue, I'd
probably have answered "yes". But in the case here I'm afraid such might
further confuse things rather than clarifying them.

Jan

[XEN PATCH v6 1/5] xen/vpci: Clear all vpci status of device

When a device has been reset on dom0 side, the vpci on Xen
side won't get notification, so the cached state in vpci is
all out of date compare with the real device state.
To solve that problem, add a new hypercall to clear all vpci
device state. When the state of device is reset on dom0 side,
dom0 can call this hypercall to notify vpci.

Signed-off-by: Huang Rui 
Signed-off-by: Jiqian Chen 
Reviewed-by: Stewart Hildebrand 
Reviewed-by: Stefano Stabellini 
---
 xen/arch/x86/hvm/hypercall.c |  1 +
 xen/drivers/pci/physdev.c| 36 
 xen/drivers/vpci/vpci.c  | 10 ++
 xen/include/public/physdev.h |  7 +++
 xen/include/xen/vpci.h   |  6 ++
 5 files changed, 60 insertions(+)

diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index eeb73e1aa5d0..6ad5b4d5f11f 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -84,6 +84,7 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 case PHYSDEVOP_pci_mmcfg_reserved:
 case PHYSDEVOP_pci_device_add:
 case PHYSDEVOP_pci_device_remove:
+case PHYSDEVOP_pci_device_state_reset:
 case PHYSDEVOP_dbgp_op:
 if ( !is_hardware_domain(currd) )
 return -ENOSYS;
diff --git a/xen/drivers/pci/physdev.c b/xen/drivers/pci/physdev.c
index 42db3e6d133c..73dc8f058b0e 100644
--- a/xen/drivers/pci/physdev.c
+++ b/xen/drivers/pci/physdev.c
@@ -2,6 +2,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifndef COMPAT
 typedef long ret_t;
@@ -67,6 +68,41 @@ ret_t pci_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) 
arg)
 break;
 }
 
+case PHYSDEVOP_pci_device_state_reset: {
+struct physdev_pci_device dev;
+struct pci_dev *pdev;
+pci_sbdf_t sbdf;
+
+if ( !is_pci_passthrough_enabled() )
+return -EOPNOTSUPP;
+
+ret = -EFAULT;
+if ( copy_from_guest(&dev, arg, 1) != 0 )
+break;
+sbdf = PCI_SBDF(dev.seg, dev.bus, dev.devfn);
+
+ret = xsm_resource_setup_pci(XSM_PRIV, sbdf.sbdf);
+if ( ret )
+break;
+
+pcidevs_lock();
+pdev = pci_get_pdev(NULL, sbdf);
+if ( !pdev )
+{
+pcidevs_unlock();
+ret = -ENODEV;
+break;
+}
+
+write_lock(&pdev->domain->pci_lock);
+ret = vpci_reset_device_state(pdev);
+write_unlock(&pdev->domain->pci_lock);
+pcidevs_unlock();
+if ( ret )
+printk(XENLOG_ERR "%pp: failed to reset PCI device state\n", 
&sbdf);
+break;
+}
+
 default:
 ret = -ENOSYS;
 break;
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 260b72875ee1..310700c1e775 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -117,6 +117,16 @@ int vpci_assign_device(struct pci_dev *pdev)
 
 return rc;
 }
+
+int vpci_reset_device_state(struct pci_dev *pdev)
+{
+ASSERT(pcidevs_locked());
+ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
+vpci_deassign_device(pdev);
+return vpci_assign_device(pdev);
+}
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/public/physdev.h b/xen/include/public/physdev.h
index f0c0d4727c0b..f5bab1f29779 100644
--- a/xen/include/public/physdev.h
+++ b/xen/include/public/physdev.h
@@ -296,6 +296,13 @@ DEFINE_XEN_GUEST_HANDLE(physdev_pci_device_add_t);
  */
 #define PHYSDEVOP_prepare_msix  30
 #define PHYSDEVOP_release_msix  31
+/*
+ * Notify the hypervisor that a PCI device has been reset, so that any
+ * internally cached state is regenerated.  Should be called after any
+ * device reset performed by the hardware domain.
+ */
+#define PHYSDEVOP_pci_device_state_reset 32
+
 struct physdev_pci_device {
 /* IN */
 uint16_t seg;
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index e89c571890b2..ea64d94e818b 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -30,6 +30,7 @@ int __must_check vpci_assign_device(struct pci_dev *pdev);
 
 /* Remove all handlers and free vpci related structures. */
 void vpci_deassign_device(struct pci_dev *pdev);
+int __must_check vpci_reset_device_state(struct pci_dev *pdev);
 
 /* Add/remove a register handler. */
 int __must_check vpci_add_register_mask(struct vpci *vpci,
@@ -266,6 +267,11 @@ static inline int vpci_assign_device(struct pci_dev *pdev)
 
 static inline void vpci_deassign_device(struct pci_dev *pdev) { }
 
+static inline int __must_check vpci_reset_device_state(struct pci_dev *pdev)
+{
+return 0;
+}
+
 static inline void vpci_dump_msi(void) { }
 
 static inline uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg,
-- 
2.34.1

[RFC XEN PATCH v6 3/5] x86/pvh: Add PHYSDEVOP_setup_gsi for PVH dom0

On PVH dom0, the gsis don't get registered, but
the gsi of a passthrough device must be configured for it to
be able to be mapped into a hvm domU.
On Linux kernel side, it calles PHYSDEVOP_setup_gsi for
passthrough devices to register gsi when dom0 is PVH.
So, add PHYSDEVOP_setup_gsi for above purpose.

Signed-off-by: Huang Rui 
Signed-off-by: Jiqian Chen 
---
 xen/arch/x86/hvm/hypercall.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index 493998b42ec5..7d4e41f66885 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -76,6 +76,11 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) 
arg)
 case PHYSDEVOP_unmap_pirq:
 break;
 
+case PHYSDEVOP_setup_gsi:
+if ( !is_hardware_domain(currd) )
+return -EOPNOTSUPP;
+break;
+
 case PHYSDEVOP_eoi:
 case PHYSDEVOP_irq_status_query:
 case PHYSDEVOP_get_free_pirq:
-- 
2.34.1

[RFC XEN PATCH v6 0/5] Support device passthrough when dom0 is PVH on Xen

Hi All,
This is v6 series to support passthrough when dom0 is PVH
v5->v6 changes:
* patch#1: Add Reviewed-by Stefano and Stewart. Rebase code and change old 
function vpci_remove_device, vpci_add_handlers to vpci_deassign_device, 
vpci_assign_device
* patch#2: Add Reviewed-by Stefano
* patch#3: Remove unnecessary "ASSERT(!has_pirq(currd));"
* patch#4: Fix some coding style issues below directory tools
* patch#5: Modified some variable names and code logic to make code easier to 
be understood, which to use gsi by default and be compatible with older kernel 
versions to continue to use irq

v4->v5 changes:
* patch#1: add pci_lock wrap function vpci_reset_device_state
* patch#2: move the check of self map_pirq to physdev.c, and change to check if 
the caller has PIRQ flag, and just break for PHYSDEVOP_(un)map_pirq in 
hvm_physdev_op
* patch#3: return -EOPNOTSUPP instead, and use ASSERT(!has_pirq(currd));
* patch#4: is the patch#5 in v4 because patch#5 in v5 has some dependency on 
it. And add the handling of errno and add the Reviewed-by Stefano
* patch#5: is the patch#4 in v4. New implementation to add new hypercall 
XEN_DOMCTL_gsi_permission to grant gsi


v3->v4 changes:
* patch#1: change the comment of PHYSDEVOP_pci_device_state_reset; move 
printings behind pcidevs_unlock
* patch#2: add check to prevent PVH self map
* patch#3: new patch, The implementation of adding PHYSDEVOP_setup_gsi for PVH 
is treated as a separate patch
* patch#4: new patch to solve the map_pirq problem of PVH dom0. use gsi to 
grant irq permission in XEN_DOMCTL_irq_permission.
* patch#5: to be compatible with previous kernel versions, when there is no gsi 
sysfs, still use irq
v4 link:
https://lore.kernel.org/xen-devel/20240105070920.350113-1-jiqian.c...@amd.com/T/#t

v2->v3 changes:
* patch#1: move the content out of pci_reset_device_state and delete 
pci_reset_device_state; add xsm_resource_setup_pci check for 
PHYSDEVOP_pci_device_state_reset; add description for 
PHYSDEVOP_pci_device_state_reset;
* patch#2: du to changes in the implementation of the second patch on kernel 
side(that it will do setup_gsi and map_pirq when assigning a device to 
passthrough), add PHYSDEVOP_setup_gsi for PVH dom0, and we need to support self 
mapping.
* patch#3: du to changes in the implementation of the second patch on kernel 
side(that adds a new sysfs for gsi instead of a new syscall), so read gsi 
number from the sysfs of gsi.
v3 link:
https://lore.kernel.org/xen-devel/20231210164009.1551147-1-jiqian.c...@amd.com/T/#t

v2 link:
https://lore.kernel.org/xen-devel/20231124104136.3263722-1-jiqian.c...@amd.com/T/#t
Below is the description of v2 cover letter:
This series of patches are the v2 of the implementation of passthrough when 
dom0 is PVH on Xen.
We sent the v1 to upstream before, but the v1 had so many problems and we got 
lots of suggestions.
I will introduce all issues that these patches try to fix and the differences 
between v1 and v2.

Issues we encountered:
1. pci_stub failed to write bar for a passthrough device.
Problem: when we run “sudo xl pci-assignable-add ” to assign a device, 
pci_stub will call “pcistub_init_device() -> pci_restore_state() -> 
pci_restore_config_space() ->
pci_restore_config_space_range() -> pci_restore_config_dword() -> 
pci_write_config_dword()”, the pci config write will trigger an io interrupt to 
bar_write() in the xen, but the
bar->enabled was set before, the write is not allowed now, and then when 
bar->Qemu config the
passthrough device in xen_pt_realize(), it gets invalid bar values.

Reason: the reason is that we don't tell vPCI that the device has been reset, 
so the current cached state in pdev->vpci is all out of date and is different 
from the real device state.

Solution: to solve this problem, the first patch of kernel(xen/pci: Add 
xen_reset_device_state
function) and the fist patch of xen(xen/vpci: Clear all vpci status of device) 
add a new hypercall to reset the state stored in vPCI when the state of real 
device has changed.
Thank Roger for the suggestion of this v2, and it is different from v1 
(https://lore.kernel.org/xen-devel/20230312075455.450187-3-ray.hu...@amd.com/), 
v1 simply allow domU to write pci bar, it does not comply with the design 
principles of vPCI.

2. failed to do PHYSDEVOP_map_pirq when dom0 is PVH
Problem: HVM domU will do PHYSDEVOP_map_pirq for a passthrough device by using 
gsi. See xen_pt_realize->xc_physdev_map_pirq and 
pci_add_dm_done->xc_physdev_map_pirq. Then xc_physdev_map_pirq will call into 
Xen, but in hvm_physdev_op(), PHYSDEVOP_map_pirq is not allowed.

Reason: In hvm_physdev_op(), the variable "currd" is PVH dom0 and PVH has no 
X86_EMU_USE_PIRQ flag, it will fail at has_pirq check.

Solution: I think we may need to allow PHYSDEVOP_map_pirq when "currd" is dom0 
(at present dom0 is PVH). The second patch of xen(x86/pvh: Open 
PHYSDEVOP_map_pirq for PVH dom0) allow PVH dom0 do PHYSDEVOP_map_pirq. This v2 
patch is better than v1, v1 simply remove the

[XEN PATCH v6 2/5] x86/pvh: Allow (un)map_pirq when dom0 is PVH

If run Xen with PVH dom0 and hvm domU, hvm will map a pirq for
a passthrough device by using gsi, see
xen_pt_realize->xc_physdev_map_pirq and
pci_add_dm_done->xc_physdev_map_pirq. Then xc_physdev_map_pirq
will call into Xen, but in hvm_physdev_op, PHYSDEVOP_map_pirq
is not allowed because currd is PVH dom0 and PVH has no
X86_EMU_USE_PIRQ flag, it will fail at has_pirq check.

So, allow PHYSDEVOP_map_pirq when dom0 is PVH and also allow
PHYSDEVOP_unmap_pirq for the failed path to unmap pirq. And
add a new check to prevent self map when caller has no PIRQ
flag.

Signed-off-by: Huang Rui 
Signed-off-by: Jiqian Chen 
Reviewed-by: Stefano Stabellini 
---
 xen/arch/x86/hvm/hypercall.c |  2 ++
 xen/arch/x86/physdev.c   | 24 
 2 files changed, 26 insertions(+)

diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index 6ad5b4d5f11f..493998b42ec5 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -74,6 +74,8 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
 case PHYSDEVOP_map_pirq:
 case PHYSDEVOP_unmap_pirq:
+break;
+
 case PHYSDEVOP_eoi:
 case PHYSDEVOP_irq_status_query:
 case PHYSDEVOP_get_free_pirq:
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 7efa17cf4c1e..1367abc61e54 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -305,11 +305,23 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) 
arg)
 case PHYSDEVOP_map_pirq: {
 physdev_map_pirq_t map;
 struct msi_info msi;
+struct domain *d;
 
 ret = -EFAULT;
 if ( copy_from_guest(&map, arg, 1) != 0 )
 break;
 
+d = rcu_lock_domain_by_any_id(map.domid);
+if ( d == NULL )
+return -ESRCH;
+/* If it is an HVM guest, check if it has PIRQs */
+if ( !is_pv_domain(d) && !has_pirq(d) )
+{
+rcu_unlock_domain(d);
+return -EOPNOTSUPP;
+}
+rcu_unlock_domain(d);
+
 switch ( map.type )
 {
 case MAP_PIRQ_TYPE_MSI_SEG:
@@ -343,11 +355,23 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) 
arg)
 
 case PHYSDEVOP_unmap_pirq: {
 struct physdev_unmap_pirq unmap;
+struct domain *d;
 
 ret = -EFAULT;
 if ( copy_from_guest(&unmap, arg, 1) != 0 )
 break;
 
+d = rcu_lock_domain_by_any_id(unmap.domid);
+if ( d == NULL )
+return -ESRCH;
+/* If it is an HVM guest, check if it has PIRQs */
+if ( !is_pv_domain(d) && !has_pirq(d) )
+{
+rcu_unlock_domain(d);
+return -EOPNOTSUPP;
+}
+rcu_unlock_domain(d);
+
 ret = physdev_unmap_pirq(unmap.domid, unmap.pirq);
 break;
 }
-- 
2.34.1

[RFC XEN PATCH v6 5/5] domctl: Add XEN_DOMCTL_gsi_permission to grant gsi

Some type of domain don't have PIRQ, like PVH, when
passthrough a device to guest on PVH dom0, callstack
pci_add_dm_done->XEN_DOMCTL_irq_permission will failed
at domain_pirq_to_irq.

So, add a new hypercall to grant/revoke gsi permission
when dom0 is not PV or dom0 has not PIRQ flag.

Signed-off-by: Huang Rui 
Signed-off-by: Jiqian Chen 
---
 tools/include/xenctrl.h  |  5 
 tools/libs/ctrl/xc_domain.c  | 15 +++
 tools/libs/light/libxl_pci.c | 52 +---
 xen/arch/x86/domctl.c| 31 +
 xen/include/public/domctl.h  |  9 +++
 xen/xsm/flask/hooks.c|  1 +
 6 files changed, 103 insertions(+), 10 deletions(-)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 2ef8b4e05422..519c860a00d5 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1382,6 +1382,11 @@ int xc_domain_irq_permission(xc_interface *xch,
  uint32_t pirq,
  bool allow_access);
 
+int xc_domain_gsi_permission(xc_interface *xch,
+ uint32_t domid,
+ uint32_t gsi,
+ bool allow_access);
+
 int xc_domain_iomem_permission(xc_interface *xch,
uint32_t domid,
unsigned long first_mfn,
diff --git a/tools/libs/ctrl/xc_domain.c b/tools/libs/ctrl/xc_domain.c
index f2d9d14b4d9f..8540e84fda93 100644
--- a/tools/libs/ctrl/xc_domain.c
+++ b/tools/libs/ctrl/xc_domain.c
@@ -1394,6 +1394,21 @@ int xc_domain_irq_permission(xc_interface *xch,
 return do_domctl(xch, &domctl);
 }
 
+int xc_domain_gsi_permission(xc_interface *xch,
+ uint32_t domid,
+ uint32_t gsi,
+ bool allow_access)
+{
+struct xen_domctl domctl = {
+.cmd = XEN_DOMCTL_gsi_permission,
+.domain = domid,
+.u.gsi_permission.gsi = gsi,
+.u.gsi_permission.allow_access = allow_access,
+};
+
+return do_domctl(xch, &domctl);
+}
+
 int xc_domain_iomem_permission(xc_interface *xch,
uint32_t domid,
unsigned long first_mfn,
diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
index 2cec83e0b734..debf6ec6ddc7 100644
--- a/tools/libs/light/libxl_pci.c
+++ b/tools/libs/light/libxl_pci.c
@@ -1421,6 +1421,8 @@ static void pci_add_dm_done(libxl__egc *egc,
 uint32_t flag = XEN_DOMCTL_DEV_RDM_RELAXED;
 uint32_t domainid = domid;
 bool isstubdom = libxl_is_stubdom(ctx, domid, &domainid);
+int gsi;
+bool is_gsi = true;
 
 /* Convenience aliases */
 bool starting = pas->starting;
@@ -1485,6 +1487,7 @@ static void pci_add_dm_done(libxl__egc *egc,
 /* To compitable with old version of kernel, still need to use irq */
 sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain,
pci->bus, pci->dev, pci->func);
+is_gsi = false;
 }
 f = fopen(sysfs_path, "r");
 if (f == NULL) {
@@ -1492,6 +1495,13 @@ static void pci_add_dm_done(libxl__egc *egc,
 goto out_no_irq;
 }
 if ((fscanf(f, "%u", &irq) == 1) && irq) {
+/*
+ * If use gsi, save the value, because the value of irq
+ * will be changed by function xc_physdev_map_pirq
+ */
+if (is_gsi) {
+gsi = irq;
+}
 r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
 if (r < 0) {
 LOGED(ERROR, domainid, "xc_physdev_map_pirq irq=%d (error=%d)",
@@ -1500,13 +1510,25 @@ static void pci_add_dm_done(libxl__egc *egc,
 rc = ERROR_FAIL;
 goto out;
 }
-r = xc_domain_irq_permission(ctx->xch, domid, irq, 1);
-if (r < 0) {
-LOGED(ERROR, domainid,
-  "xc_domain_irq_permission irq=%d (error=%d)", irq, r);
-fclose(f);
-rc = ERROR_FAIL;
-goto out;
+if (is_gsi) {
+r = xc_domain_gsi_permission(ctx->xch, domid, gsi, 1);
+if (r < 0 && r != -EOPNOTSUPP) {
+LOGED(ERROR, domainid,
+  "xc_domain_gsi_permission gsi=%d (error=%d)", gsi, r);
+fclose(f);
+rc = ERROR_FAIL;
+goto out;
+}
+}
+if (!is_gsi || r == -EOPNOTSUPP) {
+r = xc_domain_irq_permission(ctx->xch, domid, irq, 1);
+if (r < 0) {
+LOGED(ERROR, domainid,
+"xc_domain_irq_permission irq=%d (error=%d)", irq, r);
+fclose(f);
+rc = ERROR_FAIL;
+goto out;
+}
 }
 }
 fclose(f);
@@ -2180,6 +2202,7 @@ static void pci_remove_detached(libxl__egc *egc,
 FILE *f;
 uint32_t domainid = prs->domid;
 bool isstubdom;
+bool is_gsi = true;
 
 /*

[RFC XEN PATCH v6 4/5] libxl: Use gsi instead of irq for mapping pirq

In PVH dom0, it uses the linux local interrupt mechanism,
when it allocs irq for a gsi, it is dynamic, and follow
the principle of applying first, distributing first. And
the irq number is alloced from small to large, but the
applying gsi number is not, may gsi 38 comes before gsi
28, that causes the irq number is not equal with the gsi
number. And when passthrough a device, xl wants to use
gsi to map pirq, see pci_add_dm_done->xc_physdev_map_pirq,
but the gsi number is got from file
/sys/bus/pci/devices//irq in current code, so it
will fail when mapping.

So, use real gsi number read from gsi sysfs.

Signed-off-by: Huang Rui 
Signed-off-by: Jiqian Chen 
Reviewed-by: Stefano Stabellini 

---
RFC: discussions ongoing on the Linux side where/how to expose the gsi

---
 tools/libs/light/libxl_pci.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
index 96cb4da0794e..2cec83e0b734 100644
--- a/tools/libs/light/libxl_pci.c
+++ b/tools/libs/light/libxl_pci.c
@@ -1478,8 +1478,14 @@ static void pci_add_dm_done(libxl__egc *egc,
 fclose(f);
 if (!pci_supp_legacy_irq())
 goto out_no_irq;
-sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain,
+sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/gsi", pci->domain,
 pci->bus, pci->dev, pci->func);
+r = access(sysfs_path, F_OK);
+if (r && errno == ENOENT) {
+/* To compitable with old version of kernel, still need to use irq */
+sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain,
+   pci->bus, pci->dev, pci->func);
+}
 f = fopen(sysfs_path, "r");
 if (f == NULL) {
 LOGED(ERROR, domainid, "Couldn't open %s", sysfs_path);
@@ -2229,9 +2235,15 @@ skip_bar:
 if (!pci_supp_legacy_irq())
 goto skip_legacy_irq;
 
-sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain,
+sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/gsi", pci->domain,
pci->bus, pci->dev, pci->func);
 
+rc = access(sysfs_path, F_OK);
+if (rc && errno == ENOENT) {
+/* To compitable with old version of kernel, still need to use irq */
+sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain,
+   pci->bus, pci->dev, pci->func);
+}
 f = fopen(sysfs_path, "r");
 if (f == NULL) {
 LOGED(ERROR, domid, "Couldn't open %s", sysfs_path);
-- 
2.34.1

Re: [PATCH v2 3/3] svm/nestedsvm: Introduce nested capabilities bit

On 27.03.2024 18:01, George Dunlap wrote:
> On Mon, Mar 18, 2024 at 2:17 PM Jan Beulich  wrote:
>> On 13.03.2024 13:24, George Dunlap wrote:
>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -673,6 +673,12 @@ int arch_sanitise_domain_config(struct 
>>> xen_domctl_createdomain *config)
>>>   */
>>>  config->flags |= XEN_DOMCTL_CDF_oos_off;
>>>
>>> +if ( nested_virt && !hvm_nested_virt_supported() )
>>> +{
>>> +dprintk(XENLOG_INFO, "Nested virt requested but not available\n");
>>> +return -EINVAL;
>>> +}
>>> +
>>>  if ( nested_virt && !hap )
>>>  {
>>>  dprintk(XENLOG_INFO, "Nested virt not supported without HAP\n");
>>
>> As mentioned in reply to v1, I think what both start_nested_{svm,vmx}() check
>> is redundant with this latter check. I think that check isn't necessary there
>> (yet unlike suggested in reply to v1 I don't think anymore that the check 
>> here
>> can alternatively be dropped). And even if it was to be kept for some reason,
>> I'm having some difficulty seeing why vendor code needs to do that check, 
>> when
>> nestedhvm_setup() could do it for both of them.
> 
> I'm having a bit of trouble resolving the antecedents in this
> paragraph (what "this" and "there" are referring to).
> 
> Are you saying that we should set hvm_funcs.caps.nested_virt to 'true'
> even if the hardware doesn't support HAP, because we check that here?
> 
> That seems like a very strange way to go about things; hvm_funcs.caps
> should reflect the actual capabilities of the hardware.  Suppose at
> some point we want to expose nested virt capability to the toolstack
> -- wouldn't it make more sense to be able to just read
> `hvm_funcs.caps.nested_virt`, rather than having to remember to && it
> with `hvm_funcs.caps.hap`?
> 
> And as you say, you can't get rid of the HAP check here, because we
> also want to deny nested virt if the admin deliberately created a
> guest in shadow mode on a system that has HAP available.  So it's not
> redundant: one is checking the capabilities of the system, the other
> is checking the requested guest configuration.

Hmm, yes, you're right.

> As to why to have each vendor independent code check for HAP -- there
> are in fact two implementations of the code; it's nice to be able to
> look in one place for each implementation to determine the
> requirements.  Additionally, it would be possible, in the future, for
> one of the nested virt implementations to enable shadow mode, while
> the other one didn't.  The current arrangement makes that easy.

Well, first - how likely is this, when instead we've been considering
whether we could get rid of shadow mode? And then this is balancing
between ease of future changes vs avoiding redundancy where it can be
avoided. I'm not going to insist on centralizing the HAP check, but I
certainly wanted the option to be considered.

>>> --- a/xen/arch/x86/hvm/nestedhvm.c
>>> +++ b/xen/arch/x86/hvm/nestedhvm.c
>>> @@ -150,6 +150,16 @@ static int __init cf_check nestedhvm_setup(void)
>>>  __clear_bit(0x80, shadow_io_bitmap[0]);
>>>  __clear_bit(0xed, shadow_io_bitmap[1]);
>>>
>>> +/*
>>> + * NB this must be called after all command-line processing has been
>>> + * done, so that if (for example) HAP is disabled, nested virt is
>>> + * disabled as well.
>>> + */
>>> +if ( cpu_has_vmx )
>>> +start_nested_vmx(&hvm_funcs);
>>> +else if ( cpu_has_svm )
>>> +start_nested_svm(&hvm_funcs);
>>
>> Instead of passing the pointer, couldn't you have the functions return
>> bool, and then set hvm_funcs.caps.nested_virt from that? Passing that
>> pointer looks somewhat odd to me.
> 
> For one, it makes start_nested_XXX symmetric to start_XXX, which also
> has access to the full hvm_funcs structure (albeit in the former case
> because it's creating the structure).

This last aspect is the crucial aspect: start{svm,vmx}() are indeed quite
special because of this. Everywhere else we have accessor helpers when
hvm_funcs needs consulting, e.g. ...

>  For two, both of them need to read caps.hap.

... caps _reads_ would want using (an) accessor(s), too.

>  For three, it's just more flexible -- there may be
> future things that we want to modify in the start_nested_*()
> functions.  If we did as you suggest, and then added (say)
> caps.nested_virt_needs_hap, then we'd either need to add a second
> function, or refactor it to look like this.

Right, I can see this as an argument when taking, as you do, speculation
on future needs into account. Albeit I'm then inclined to say that even
such modifications may better be done through accessor helpers.

>> Is there a reason to use direct calls here rather than a true hook
>> (seeing that hvm_funcs must have been populated by the time we make it
>> here)? I understand we're (remotely) considering to switch away from
>> using hooks at some point, but right now this feels somewhat
>> inconsistent if not furt

[RFC KERNEL PATCH v5 0/3] Support device passthrough when dom0 is PVH on Xen

Hi All,
This is v5 series to support passthrough on Xen when dom0 is PVH.
patch#2 change function acpi_pci_irq_lookup from a static function to 
non-static, need ACPI Maintainer to give some comments.
patch#3 linux internal changes, need PCI and ACPI Maintainer to give more 
comments.
v4->v5 changes:
* patch#1: Add Reviewed-by Stefano
* patch#2: Add Reviewed-by Stefano
* patch#3: No changes

v3->v4 changes:
* patch#1: change the comment of PHYSDEVOP_pci_device_state_reset; use a new 
function pcistub_reset_device_state to wrap __pci_reset_function_locked and 
xen_reset_device_state, and call pcistub_reset_device_state in pci_stub.c
* patch#2: remove map_pirq from xen_pvh_passthrough_gsi

v3 link:
https://lore.kernel.org/lkml/20231210161519.1550860-1-jiqian.c...@amd.com/T/#t
v2->v3 changes:
* patch#1: add condition to limit do xen_reset_device_state for no-pv domain in 
pcistub_init_device.
* patch#2: Abandoning previous implementations that call unmask_irq. To setup 
gsi and map pirq for passthrough device in pcistub_init_device.
* patch#3: Abandoning previous implementations that adds new syscall to get gsi 
from irq. To add a new sysfs for gsi, then userspace can get gsi number from 
sysfs.


v2 link:
https://lore.kernel.org/lkml/20231124103123.3263471-1-jiqian.c...@amd.com/T/#t

Below is the description of v2 cover letter:
This series of patches are the v2 of the implementation of passthrough when 
dom0 is PVH on Xen.
We sent the v1 to upstream before, but the v1 had so many problems and we got 
lots of suggestions.
I will introduce all issues that these patches try to fix and the differences 
between v1 and v2.

Issues we encountered:
1. pci_stub failed to write bar for a passthrough device.
Problem: when we run “sudo xl pci-assignable-add ” to assign a device, 
pci_stub will call “pcistub_init_device() -> pci_restore_state() -> 
pci_restore_config_space() ->
pci_restore_config_space_range() -> pci_restore_config_dword() -> 
pci_write_config_dword()”, the pci config write will trigger an io interrupt to 
bar_write() in the xen, but the
bar->enabled was set before, the write is not allowed now, and then when 
bar->Qemu config the
passthrough device in xen_pt_realize(), it gets invalid bar values.

Reason: the reason is that we don't tell vPCI that the device has been reset, 
so the current cached state in pdev->vpci is all out of date and is different 
from the real device state.

Solution: to solve this problem, the first patch of kernel(xen/pci: Add 
xen_reset_device_state
function) and the fist patch of xen(xen/vpci: Clear all vpci status of device) 
add a new hypercall to reset the state stored in vPCI when the state of real 
device has changed.
Thank Roger for the suggestion of this v2, and it is different from v1 
(https://lore.kernel.org/xen-devel/20230312075455.450187-3-ray.hu...@amd.com/), 
v1 simply allow domU to write pci bar, it does not comply with the design 
principles of vPCI.

2. failed to do PHYSDEVOP_map_pirq when dom0 is PVH
Problem: HVM domU will do PHYSDEVOP_map_pirq for a passthrough device by using 
gsi. See xen_pt_realize->xc_physdev_map_pirq and 
pci_add_dm_done->xc_physdev_map_pirq. Then xc_physdev_map_pirq will call into 
Xen, but in hvm_physdev_op(), PHYSDEVOP_map_pirq is not allowed.

Reason: In hvm_physdev_op(), the variable "currd" is PVH dom0 and PVH has no 
X86_EMU_USE_PIRQ flag, it will fail at has_pirq check.

Solution: I think we may need to allow PHYSDEVOP_map_pirq when "currd" is dom0 
(at present dom0 is PVH). The second patch of xen(x86/pvh: Open 
PHYSDEVOP_map_pirq for PVH dom0) allow PVH dom0 do PHYSDEVOP_map_pirq. This v2 
patch is better than v1, v1 simply remove the has_pirq check(xen 
https://lore.kernel.org/xen-devel/20230312075455.450187-4-ray.hu...@amd.com/).

3. the gsi of a passthrough device doesn't be unmasked
 3.1 failed to check the permission of pirq
 3.2 the gsi of passthrough device was not registered in PVH dom0

Problem:
3.1 callback function pci_add_dm_done() will be called when qemu config a 
passthrough device for domU.
This function will call xc_domain_irq_permission()-> pirq_access_permitted() to 
check if the gsi has corresponding mappings in dom0. But it didn’t, so failed. 
See XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH dom0 and 
it return irq is 0.
3.2 it's possible for a gsi (iow: vIO-APIC pin) to never get registered on PVH 
dom0, because the devices of PVH are using MSI(-X) interrupts. However, the 
IO-APIC pin must be configured for it to be able to be mapped into a domU.

Reason: After searching codes, I find "map_pirq" and "register_gsi" will be 
done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when the gsi(aka 
ioapic's pin) is unmasked in PVH dom0.
So the two problems can be concluded to that the gsi of a passthrough device 
doesn't be unmasked.

Solution: to solve these problems, the second patch of kernel(xen/pvh: Unmask 
irq for passthrough device in PVH dom0) call the unmask_irq() when

[RFC KERNEL PATCH v5 2/3] xen/pvh: Setup gsi for passthrough device

In PVH dom0, the gsis don't get registered, but the gsi of
a passthrough device must be configured for it to be able to be
mapped into a domU.

When assign a device to passthrough, proactively setup the gsi
of the device during that process.

Co-developed-by: Huang Rui 
Signed-off-by: Jiqian Chen 
Reviewed-by: Stefano Stabellini 
---
RFC: This patch change function acpi_pci_irq_lookup from a static function to 
non-static, need ACPI Maintainer to give some comments.

---
 arch/x86/xen/enlighten_pvh.c   | 92 ++
 drivers/acpi/pci_irq.c |  2 +-
 drivers/xen/xen-pciback/pci_stub.c |  8 +++
 include/linux/acpi.h   |  1 +
 include/xen/acpi.h |  6 ++
 5 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
index c28f073c1df5..12be665b27d8 100644
--- a/arch/x86/xen/enlighten_pvh.c
+++ b/arch/x86/xen/enlighten_pvh.c
@@ -2,6 +2,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -26,6 +27,97 @@
 bool __ro_after_init xen_pvh;
 EXPORT_SYMBOL_GPL(xen_pvh);
 
+typedef struct gsi_info {
+   int gsi;
+   int trigger;
+   int polarity;
+} gsi_info_t;
+
+struct acpi_prt_entry {
+   struct acpi_pci_id  id;
+   u8  pin;
+   acpi_handle link;
+   u32 index;  /* GSI, or link _CRS index */
+};
+
+static int xen_pvh_get_gsi_info(struct pci_dev *dev,
+   gsi_info_t 
*gsi_info)
+{
+   int gsi;
+   u8 pin;
+   struct acpi_prt_entry *entry;
+   int trigger = ACPI_LEVEL_SENSITIVE;
+   int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
+ ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
+
+   if (!dev || !gsi_info)
+   return -EINVAL;
+
+   pin = dev->pin;
+   if (!pin)
+   return -EINVAL;
+
+   entry = acpi_pci_irq_lookup(dev, pin);
+   if (entry) {
+   if (entry->link)
+   gsi = acpi_pci_link_allocate_irq(entry->link,
+entry->index,
+&trigger, &polarity,
+NULL);
+   else
+   gsi = entry->index;
+   } else
+   gsi = -1;
+
+   if (gsi < 0)
+   return -EINVAL;
+
+   gsi_info->gsi = gsi;
+   gsi_info->trigger = trigger;
+   gsi_info->polarity = polarity;
+
+   return 0;
+}
+
+static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
+{
+   struct physdev_setup_gsi setup_gsi;
+
+   if (!gsi_info)
+   return -EINVAL;
+
+   setup_gsi.gsi = gsi_info->gsi;
+   setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 
1);
+   setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
+
+   return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
+}
+
+int xen_pvh_passthrough_gsi(struct pci_dev *dev)
+{
+   int ret;
+   gsi_info_t gsi_info;
+
+   if (!dev)
+   return -EINVAL;
+
+   ret = xen_pvh_get_gsi_info(dev, &gsi_info);
+   if (ret) {
+   xen_raw_printk("Fail to get gsi info!\n");
+   return ret;
+   }
+
+   ret = xen_pvh_setup_gsi(&gsi_info);
+   if (ret == -EEXIST) {
+   xen_raw_printk("Already setup the GSI :%d\n", gsi_info.gsi);
+   ret = 0;
+   } else if (ret)
+   xen_raw_printk("Fail to setup GSI (%d)!\n", gsi_info.gsi);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
+
 void __init xen_pvh_init(struct boot_params *boot_params)
 {
u32 msr;
diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
index ff30ceca2203..630fe0a34bc6 100644
--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
 }
 #endif /* CONFIG_X86_IO_APIC */
 
-static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
 {
struct acpi_prt_entry *entry = NULL;
struct pci_dev *bridge;
diff --git a/drivers/xen/xen-pciback/pci_stub.c 
b/drivers/xen/xen-pciback/pci_stub.c
index 46c40ec8a18e..22d4380d2b04 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -435,6 +436,13 @@ static int pcistub_init_device(struct pci_dev *dev)
goto config_release;
pci_restore_state(dev);
}
+
+   if (xen_initial_domain() && xen_pvh_domain()) {
+   err = xen_pvh_passthrough_gsi(dev);
+   if (err)
+   goto config_rele

[RFC KERNEL PATCH v5 3/3] PCI/sysfs: Add gsi sysfs for pci_dev

There is a need for some scenarios to use gsi sysfs.
For example, when xen passthrough a device to dumU, it will
use gsi to map pirq, but currently userspace can't get gsi
number.
So, add gsi sysfs for that and for other potential scenarios.

Co-developed-by: Huang Rui 
Signed-off-by: Jiqian Chen 
---
RFC: No feasible suggestions were obtained in the discussion of v4.
Discussions are still needed where/how to expose the gsi.
Looking forward to get more comments and suggestions from PCI/ACPI Maintainers.

---
 drivers/acpi/pci_irq.c  |  1 +
 drivers/pci/pci-sysfs.c | 11 +++
 include/linux/pci.h |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
index 630fe0a34bc6..739a58755df2 100644
--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -449,6 +449,7 @@ int acpi_pci_irq_enable(struct pci_dev *dev)
kfree(entry);
return 0;
}
+   dev->gsi = gsi;
 
rc = acpi_register_gsi(&dev->dev, gsi, triggering, polarity);
if (rc < 0) {
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 2321fdfefd7d..c51df88d079e 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -71,6 +71,16 @@ static ssize_t irq_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(irq);
 
+static ssize_t gsi_show(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return sysfs_emit(buf, "%u\n", pdev->gsi);
+}
+static DEVICE_ATTR_RO(gsi);
+
 static ssize_t broken_parity_status_show(struct device *dev,
 struct device_attribute *attr,
 char *buf)
@@ -596,6 +606,7 @@ static struct attribute *pci_dev_attrs[] = {
&dev_attr_revision.attr,
&dev_attr_class.attr,
&dev_attr_irq.attr,
+   &dev_attr_gsi.attr,
&dev_attr_local_cpus.attr,
&dev_attr_local_cpulist.attr,
&dev_attr_modalias.attr,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7ab0d13672da..457043cfdfce 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -529,6 +529,8 @@ struct pci_dev {
 
/* These methods index pci_reset_fn_methods[] */
u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
+
+   unsigned intgsi;
 };
 
 static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
-- 
2.34.1

[KERNEL PATCH v5 1/3] xen/pci: Add xen_reset_device_state function

When device on dom0 side has been reset, the vpci on Xen side
won't get notification, so that the cached state in vpci is
all out of date with the real device state.
To solve that problem, add a new function to clear all vpci
device state when device is reset on dom0 side.

And call that function in pcistub_init_device. Because when
using "pci-assignable-add" to assign a passthrough device in
Xen, it will reset passthrough device and the vpci state will
out of date, and then device will fail to restore bar state.

Co-developed-by: Huang Rui 
Signed-off-by: Jiqian Chen 
Reviewed-by: Stefano Stabellini 
---
 drivers/xen/pci.c  | 12 
 drivers/xen/xen-pciback/pci_stub.c | 18 +++---
 include/xen/interface/physdev.h|  7 +++
 include/xen/pci.h  |  6 ++
 4 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index 72d4e3f193af..e9b30bc09139 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
return r;
 }
 
+int xen_reset_device_state(const struct pci_dev *dev)
+{
+   struct physdev_pci_device device = {
+   .seg = pci_domain_nr(dev->bus),
+   .bus = dev->bus->number,
+   .devfn = dev->devfn
+   };
+
+   return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
+}
+EXPORT_SYMBOL_GPL(xen_reset_device_state);
+
 static int xen_pci_notifier(struct notifier_block *nb,
unsigned long action, void *data)
 {
diff --git a/drivers/xen/xen-pciback/pci_stub.c 
b/drivers/xen/xen-pciback/pci_stub.c
index e34b623e4b41..46c40ec8a18e 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -89,6 +89,16 @@ static struct pcistub_device *pcistub_device_alloc(struct 
pci_dev *dev)
return psdev;
 }
 
+static int pcistub_reset_device_state(struct pci_dev *dev)
+{
+   __pci_reset_function_locked(dev);
+
+   if (!xen_pv_domain())
+   return xen_reset_device_state(dev);
+   else
+   return 0;
+}
+
 /* Don't call this directly as it's called by pcistub_device_put */
 static void pcistub_device_release(struct kref *kref)
 {
@@ -107,7 +117,7 @@ static void pcistub_device_release(struct kref *kref)
/* Call the reset function which does not take lock as this
 * is called from "unbind" which takes a device_lock mutex.
 */
-   __pci_reset_function_locked(dev);
+   pcistub_reset_device_state(dev);
if (dev_data &&
pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state))
dev_info(&dev->dev, "Could not reload PCI state\n");
@@ -284,7 +294,7 @@ void pcistub_put_pci_dev(struct pci_dev *dev)
 * (so it's ready for the next domain)
 */
device_lock_assert(&dev->dev);
-   __pci_reset_function_locked(dev);
+   pcistub_reset_device_state(dev);
 
dev_data = pci_get_drvdata(dev);
ret = pci_load_saved_state(dev, dev_data->pci_saved_state);
@@ -420,7 +430,9 @@ static int pcistub_init_device(struct pci_dev *dev)
dev_err(&dev->dev, "Could not store PCI conf saved state!\n");
else {
dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
-   __pci_reset_function_locked(dev);
+   err = pcistub_reset_device_state(dev);
+   if (err)
+   goto config_release;
pci_restore_state(dev);
}
/* Now disable the device (this also ensures some private device
diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
index a237af867873..8609770e28f5 100644
--- a/include/xen/interface/physdev.h
+++ b/include/xen/interface/physdev.h
@@ -256,6 +256,13 @@ struct physdev_pci_device_add {
  */
 #define PHYSDEVOP_prepare_msix  30
 #define PHYSDEVOP_release_msix  31
+/*
+ * Notify the hypervisor that a PCI device has been reset, so that any
+ * internally cached state is regenerated.  Should be called after any
+ * device reset performed by the hardware domain.
+ */
+#define PHYSDEVOP_pci_device_state_reset 32
+
 struct physdev_pci_device {
 /* IN */
 uint16_t seg;
diff --git a/include/xen/pci.h b/include/xen/pci.h
index b8337cf85fd1..b2e2e856efd6 100644
--- a/include/xen/pci.h
+++ b/include/xen/pci.h
@@ -4,10 +4,16 @@
 #define __XEN_PCI_H__
 
 #if defined(CONFIG_XEN_DOM0)
+int xen_reset_device_state(const struct pci_dev *dev);
 int xen_find_device_domain_owner(struct pci_dev *dev);
 int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
 int xen_unregister_device_domain_owner(struct pci_dev *dev);
 #else
+static inline int xen_reset_device_state(const struct pci_dev *dev)
+{
+   return -1;
+}
+
 static inline int xen_find_device_domain_owner(struct pci_dev *dev)
 {
return -1;
-- 
2.34.1

Re: [PATCH] x86/vcpu: relax VCPUOP_initialise restriction for non-PV vCPUs

On 26.03.2024 23:08, Julien Grall wrote:
> Hi Andrew,
> 
> On 20/03/2024 14:39, Andrew Cooper wrote:
>> On 20/03/2024 2:26 pm, Roger Pau Monné wrote:
>>> On Wed, Mar 20, 2024 at 02:06:27PM +, Andrew Cooper wrote:
 On 20/03/2024 1:57 pm, Roger Pau Monne wrote:
> There's no reason to force HVM guests to have a valid vcpu_info area when
> initializing a vCPU, as the vCPU can also be brought online using the 
> local
> APIC, and on that path there's no requirement for vcpu_info to be setup 
> ahead
> of the bring up.  Note an HVM vCPU can operate normally without making 
> use of
> vcpu_info.
>
> Restrict the check against dummy_vcpu_info to only apply to PV guests.
>
> Fixes: 192df6f9122d ('x86: allow HVM guests to use hypercalls to bring up 
> vCPUs')
> Signed-off-by: Roger Pau Monné 
 Thanks for looking into this.  But, do we actually need to force this on
 PV either?
>>> Possibly, having now taken a look at the users it does seem they could
>>> cope with unpopulated vcpu_info_area.
>>>
>>> Part of my understanding was that this was some kind of courtesy to PV
>>> guests in order to prevent starting them without a vcpu_info, which
>>> strictly speaking is not mandatory, but otherwise the guest vCPU won't
>>> be able to receive interrupts, not even IPIs.
>>
>> That's more of a stick than a carrot... "you must set up the area of a
>> CPU before you can bring it online". Except as we've seen, HVM has been
>> fine all along without it.
 The only interesting user of dummy_vcpu_info now is vcpu_info_populate()
 which can cope with any arbitrary vCPU.

 I have a suspicion that we can (now) delete these two checks, delete the
 dummy_vcpu_info object, and use a regular NULL pointer in
 vcpu_info_{populate,reset}(), and in doing so, remove one of the bigger
 pieces of PV-absurdity from Xen.
>>> I was expecting this to be something we can backport.  OTOH removing
>>> the check completely (or even getting rid of dummy_vcpu_info) is not
>>> something that we would want to backport.
>>>
>>> I would rather do the removal of dummy_vcpu_info as followup work.
>>
>> I was worried about ARM with your patch, because it's spelt HVM there,
>> rather than PV.
>>
>> However, I'd forgotten that ARM's do_vcpu_op() filters ops down to just
>> VCPUOP_register_{vcpu_info,runstate_memory_area} so VCPUOP_initialise
>> isn't reachable.
>>
>> Therefore, Reviewed-by: Andrew Cooper 
>>
>> It also means ARM can't use any of the PHYS registration yet...
> 
> Whoops. I don't think this was intended. Jan, I don't seem to find any 
> use in Linux. Do you have any patches you could share?

No, I don't. I did all development with hacked up XTF tests, and I was
expecting Linux folks to be looking into making use of the new subops.

Jan

> I would like to 
> give a try on Arm before sending a patch?
> 
> Cheers,
>

Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"

On 26.03.2024 22:38, Jason Andryuk wrote:
> A new ELF note will specify the alignment for a relocatable PVH kernel.
> ELF notes are suitable for vmlinux and other ELF files, so this
> Linux-specific bzImage parsing in unnecessary.
> 
> This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40.
> 
> Signed-off-by: Jason Andryuk 

Since you keep re-sending this: In private discussion Roger has indicated
that, like me, he too would prefer falling back to the ELF data, before
falling back to the arch default (Roger, please correct me if I got it
wrong). That would make it necessary that the change you're proposing to
revert here is actually kept.

Or wait - what you're reverting is taking the alignment out of the
bzImage header. I don't expect the BSDs to use that protocol; aiui that's
entirely Linux-specific.

I further meanwhile realized that consulting the ELF phdrs may also be
ambiguous, as there may be more than one. I guess it would need to be the
maximum of all of them then.

Jan

Re: [PATCH v5 5/6] xen/elfnote: Specify ELF Notes are x86-specific

On 26.03.2024 22:38, Jason Andryuk wrote:
> The Xen ELF Notes are only used with x86.  libelf's elf_xen_note_check()
> exits early for ARM binaries with "ELF: Not bothering with notes on
> ARM".
> 
> Add a comment to the top of elfnote.h specifying that Notes are only used
> with x86 binaries.  This is to avoid adding disclaimers for individual
> notes.
> 
> Signed-off-by: Jason Andryuk 

Acked-by: Jan Beulich

Re: [PATCH v5 4/6] libelf: Expand ELF note printing

On 26.03.2024 22:38, Jason Andryuk wrote:
> @@ -145,13 +150,20 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary 
> *elf,
>  elf_msg(elf, "ELF: note: %s = \"%s\"\n", note_desc[type].name, str);
>  parms->elf_notes[type].type = XEN_ENT_STR;
>  parms->elf_notes[type].data.str = str;
> -}
> -else
> -{
> +break;
> +
> +case ELFNOTE_INT:
>  val = elf_note_numeric(elf, note);
>  elf_msg(elf, "ELF: note: %s = %#" PRIx64 "\n", note_desc[type].name, 
> val);
>  parms->elf_notes[type].type = XEN_ENT_LONG;
>  parms->elf_notes[type].data.num = val;
> +break;
> +
> +case ELFNOTE_NAME:
> +/* ELFNOTE_NAME has a newline printed at the end of the function to
> + * optionally allow printing customized details. */
> +elf_msg(elf, "ELF: note: %s", note_desc[type].name);
> +break;

Well. I said "brief comment" for several reasons. One of them being that
it would best fit on a single line. Since now it doesn't, I have to point
out that this way comment style is violated.

/* NB: Newline emitted further down. */

?

Jan

Re: [PATCH v6 10/20] xen/riscv: introduce atomic.h

On 26.03.2024 20:02, Oleksii wrote:
> On Mon, 2024-03-25 at 09:18 +0100, Jan Beulich wrote:
>> On 22.03.2024 13:25, Oleksii wrote:
>>> On Thu, 2024-03-21 at 14:03 +0100, Jan Beulich wrote:
 On 15.03.2024 19:06, Oleksii Kurochko wrote:
> + */
> +static always_inline void read_atomic_size(const volatile void
> *p,
> +   void *res,
> +   unsigned int size)
> +{
> +    switch ( size )
> +    {
> +    case 1: *(uint8_t *)res = readb(p); break;
> +    case 2: *(uint16_t *)res = readw(p); break;
> +    case 4: *(uint32_t *)res = readl(p); break;
> +    case 8: *(uint32_t *)res  = readq(p); break;

 Nit: Excess blank before =.

 Also - no #ifdef here to be RV32-ready?
>>> Because there is #ifdef RV32 in io.h for readq().
>>
>> There you'd run into __raw_readq()'s BUILD_BUG_ON(), afaict even for
>> 1-, 2-, or 4-byte accesses. That's not quite what we want here.
> Do you mean that if someone will redefine readq() in another way and
> not wrap it by #ifdef RV32? Except this I am not sure that there is an
> issue as it will be still a compilation error, so anyway it will be
> needed to provide an implementation for __raw_readq().

No. BUILD_BUG_ON() is a compile-time thing. The compiler will encounter
this construct. And hence it will necessarily fail. Which is why the
other approach (causing a linker error) is used elsewhere. And we're
still only in the course of considering to utilize DCE for something
like STATIC_ASSERT_UNREACHABLE(); iirc there was something getting in
the way there.

> One of the reason why I decided to wrap with #ifdef RV32 in io.h to not
> go over the source code and add wrapping. Also for some code it will be
> needed to rewrite it. For example, I am not sure that I can add #ifdef
> inside macros, f.e.:
>#define write_atomic(p, x)  \
>({  \
>typeof(*(p)) x__ = (x); \
>switch ( sizeof(*(p)) ) \
>{   \
>case 1: writeb(x__, p); break;  \
>case 2: writew(x__, p); break;  \
>case 4: writel(x__, p); break;  \
>case 8: writeq(x__, p); break;  \
>default: __bad_atomic_size(); break;\
>}   \
>x__;\
>})

You can't add #ifdef there. Such needs abstracting differently.

But of course there's the option of simply not making any of these
constructs RV32-ready. Yet if so, that then will want doing
consistently.

> +/*
> + * First, the atomic ops that have no ordering constraints and
> therefor don't
> + * have the AQ or RL bits set.  These don't return anything,
> so
> there's only
> + * one version to worry about.
> + */
> +#define ATOMIC_OP(op, asm_op, I, asm_type, c_type, prefix)  \
> +static inline   \
> +void atomic##prefix##_##op(c_type i, atomic##prefix##_t *v) \
> +{   \
> +    asm volatile (  \
> +    "   amo" #asm_op "." #asm_type " zero, %1, %0"  \
> +    : "+A" (v->counter) \
> +    : "r" (I)   \

 Btw, I consider this pretty confusing. At the 1st and 2nd glance
 this
 looks like a mistake, i.e. as if i was meant. Imo ...

> +    : "memory" );   \
> +}   \
> +
> +/*
> + * Only CONFIG_GENERIC_ATOMIC64=y was ported to Xen that is
> the
> reason why
> + * last argument for ATOMIC_OP isn't used.
> + */
> +#define ATOMIC_OPS(op, asm_op, I)   \
> +    ATOMIC_OP (op, asm_op, I, w, int,   )
> +
> +ATOMIC_OPS(add, add,  i)
> +ATOMIC_OPS(sub, add, -i)
> +ATOMIC_OPS(and, and,  i)
> +ATOMIC_OPS( or,  or,  i)
> +ATOMIC_OPS(xor, xor,  i)

 ... here you want to only pass the (unary) operator (and leaving
 that
 blank
 is as fine as using +).
>>> I agree that a game with 'i' and 'I' looks confusing, but I am not
>>> really understand what is wrong with using ' i' here. It seems that
>>> preprocessed macros looks fine:
>>>    static inline void atomic_add(int i, atomic_t *v) { asm volatile
>>> ( "  
>>>    amo" "add" "." "w" " zero, %1, %0" : "+A" (v->counter) : "r" (i)
>>> :
>>>    "memory" ); }
>>>    
>>>    static inline void atomic_sub(int i, atomic_t *v) { asm volatile
>>> ( "  
>>>    amo

Re: [XEN PATCH 11/11] x86/public: hvm: address violations of MISRA C Rule 20.7

On 22.03.2024 17:02, Nicola Vetrini wrote:
> MISRA C Rule 20.7 states: "Expressions resulting from the expansion
> of macro parameters shall be enclosed in parentheses". Therefore, some
> macro definitions should gain additional parentheses to ensure that all
> current and future users will be safe with respect to expansions that
> can possibly alter the semantics of the passed-in macro parameter.
> 
> No functional change.
> 
> Signed-off-by: Nicola Vetrini 

Btw, I've taken the liberty to drop the (secondary) hvm: prefix from the
subject; I can't ...

> --- a/xen/include/public/arch-x86/xen.h
> +++ b/xen/include/public/arch-x86/xen.h
> @@ -36,7 +36,7 @@
>  #define __XEN_GUEST_HANDLE(name)__guest_handle_ ## name
>  #define XEN_GUEST_HANDLE(name)  __XEN_GUEST_HANDLE(name)
>  #define XEN_GUEST_HANDLE_PARAM(name)XEN_GUEST_HANDLE(name)
> -#define set_xen_guest_handle_raw(hnd, val)  do { (hnd).p = val; } while (0)
> +#define set_xen_guest_handle_raw(hnd, val)  do { (hnd).p = (val); } while (0)
>  #define set_xen_guest_handle(hnd, val) set_xen_guest_handle_raw(hnd, val)

... spot anything HVM-specific in this change.

Jan

Re: [PATCH v5] RFC: x86/pvh: Make Xen PVH entrypoint PIC for x86-64

On 26.03.2024 22:47, Jason Andryuk wrote:
> --- a/include/xen/interface/elfnote.h
> +++ b/include/xen/interface/elfnote.h
> @@ -185,9 +185,25 @@
>   */
>  #define XEN_ELFNOTE_PHYS32_ENTRY 18
>  
> +/*
> + * Physical loading constraints for PVH kernels
> + *
> + * Used to place constraints on the guest physical loading addresses and
> + * alignment for a PVH kernel.
> + *
> + * The presence of this note indicates the kernel supports relocating itself.
> + *
> + * The note may include up to three 32bit values in the following order:
> + *  - a maximum address for the entire image to be loaded below (default
> + *  0x)
> + *  - a minimum address for the start of the image (default 0)
> + *  - a required start alignment (default 0x20)

This looks to be stale now.

Jan

Re: [PATCH 11/11] xen/arm: List static shared memory regions as /memory nodes

2024-03-27 Thread Michal Orzel

Hi Luca,

On 26/03/2024 15:19, Luca Fancellu wrote:
> 
> 
>> On 25 Mar 2024, at 08:58, Michal Orzel  wrote:
>>
>> Hi Luca,
>>
> 
> Hi Michal,
> 
>> On 12/03/2024 14:03, Luca Fancellu wrote:
>>>
>>>
>>> Currently Xen is not exporting the static shared memory regions
>>> to the device tree as /memory node, this commit is fixing this
>>> issue.
>> Looking at the implementation, you will always call make_shm_memory_node() 
>> twice. For the first
>> time, to create /memory node and for the second time to create entry under 
>> /reserved-memory. Also,
>> you will create a separate /memory node for every single shmem region 
>> instead of combining them
>> in a single /memory region like make_memory_node() would do. Can't we reuse 
>> this function for simplicity?
> 
> You mean using make_memory_node() to populate /reserved-memory and /memory? I 
> feel they are very different
> In terms of properties to be created, so I don’t think we should create a 
> make_memory_node() that does both.
> 
> Otherwise if you were suggesting to modify make_memory_node() only for what 
> concerns the /memory nodes,
yes, this is what I meant

> it might be feasible, however there are some parts that need to be skipped 
> with some flags (all the code accessing .type
> member), if I understand correctly you like this function because it doesn’t 
> create one node for every bank, but it creates
> reg addresses instead, in that case I could modify the current 
> make_shm_memory_node() to do the same.
My concern is that we will have 2 functions to create memory nodes instead of 
one that can be reused. I know the issue is with
.type member. If skipping results in a worse code, then I'm ok with 
make_shm_memory_node() used for 2 purposes (would it be possible
to create /memory and entry under /reserved in the same call to a function?).

> 
>>
>> Also, afaict it is not forbidden to specify shmem range (correct me if I'm 
>> wrong), where guest address will be
>> within with RAM allocated by Xen (e.g. GPA RAM range 0x4000 - 0x6000 
>> and shmem is at 0x5000). In this case,
>> you would create yet another /memory node that would result in overlap (i.e. 
>> more than one /memory node specifying the same range).
> 
> This is a good point I didn’t think about, yes currently the code is creating 
> overlapping nodes in that case, wow so it means I
> need to compute the non overlapping regions and emit a /memory node then! :) 
> ouch
> 
> Please let me know if I understood correctly your comments.
> 
> Cheers,
> Luca
> 
>>
>> ~Michal
> 

~Michal

Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"

2024-03-27 Thread Roger Pau Monné

On Wed, Mar 27, 2024 at 08:22:41AM +0100, Jan Beulich wrote:
> On 26.03.2024 22:38, Jason Andryuk wrote:
> > A new ELF note will specify the alignment for a relocatable PVH kernel.
> > ELF notes are suitable for vmlinux and other ELF files, so this
> > Linux-specific bzImage parsing in unnecessary.
> > 
> > This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40.
> > 
> > Signed-off-by: Jason Andryuk 
> 
> Since you keep re-sending this: In private discussion Roger has indicated
> that, like me, he too would prefer falling back to the ELF data, before
> falling back to the arch default (Roger, please correct me if I got it
> wrong). That would make it necessary that the change you're proposing to
> revert here is actually kept.

Sorry, was meaning to reply yesterday but Jason is very fast at
sending new version so I'm always one version behind.

IMO the order: ELF note, PHDR alignment, arch default should be the
preferred one.

> Or wait - what you're reverting is taking the alignment out of the
> bzImage header. I don't expect the BSDs to use that protocol; aiui that's
> entirely Linux-specific.

Yeah, I don't have strong opinions in keeping this, we already do
bzImage parsing, so we might as well attempt to fetch the alignment
from there if correct:

ELF note, bzImage kernel_alignment, ELF PHDR alignment, arch default

> I further meanwhile realized that consulting the ELF phdrs may also be
> ambiguous, as there may be more than one. I guess it would need to be the
> maximum of all of them then.

My suggestion (not sure if I mentioned this before) was to use the
alignment of the first LOAD PHDR, which is the one that defines the
value of the dest_base field used as the image load start address.

Using the maximum of all load PHDRs might be safer.

Thanks, Roger.

Xen Summit Early Bird Rates - Ends 31st March 2024!

2024-03-27 Thread Kelly Choi

Hey everyone,

We've just announced our schedule for Xen Summit 2024 and can't wait to see
you all.

*Make sure to grab your early-rate tickets today, these end on 31st March
2024! *
Academics can also attend the event for free.

Tickets:
https://events.linuxfoundation.org/xen-project-summit/register/

Schedule:
https://events.linuxfoundation.org/xen-project-summit/program/schedule/

See you all there!

Many thanks,
Kelly Choi

Community Manager
Xen Project

Re: [PATCH v6 10/20] xen/riscv: introduce atomic.h

2024-03-27 Thread Oleksii

On Wed, 2024-03-27 at 08:40 +0100, Jan Beulich wrote:
...

> > > > > > +/* This is required to provide a full barrier on success.
> > > > > > */
> > > > > > +static inline int atomic_add_unless(atomic_t *v, int a,
> > > > > > int u)
> > > > > > +{
> > > > > > +   int prev, rc;
> > > > > > +
> > > > > > +    asm volatile (
> > > > > > +    "0: lr.w %[p],  %[c]\n"
> > > > > > +    "   beq  %[p],  %[u], 1f\n"
> > > > > > +    "   add  %[rc], %[p], %[a]\n"
> > > > > > +    "   sc.w.rl  %[rc], %[rc], %[c]\n"
> > > > > > +    "   bnez %[rc], 0b\n"
> > > > > > +    RISCV_FULL_BARRIER
> > > > > 
> > > > > With this and no .aq on the load, why the .rl on the store?
> > > > It is something that LKMM requires [1].
> > > > 
> > > > This is not fully clear to me what is so specific in LKMM, but
> > > > accoring
> > > > to the spec:
> > > >    Ordering Annotation Fence-based Equivalent
> > > >    l{b|h|w|d|r}.aq l{b|h|w|d|r}; fence r,rw
> > > >    l{b|h|w|d|r}.aqrl   fence rw,rw; l{b|h|w|d|r}; fence r,rw
> > > >    s{b|h|w|d|c}.rl fence rw,w; s{b|h|w|d|c}
> > > >    s{b|h|w|d|c}.aqrl   fence rw,w; s{b|h|w|d|c}
> > > >    amo.aq  amo; fence r,rw
> > > >    amo.rl  fence rw,w; amo
> > > >    amo.aqrl    fence rw,rw; amo; fence rw,rw
> > > >    Table 2.2: Mappings from .aq and/or .rl to fence-based
> > > > equivalents.
> > > >    An alternative mapping places a fence rw,rw after the
> > > > existing 
> > > >    s{b|h|w|d|c} mapping rather than at the front of the
> > > >    l{b|h|w|d|r} mapping.
> > > 
> > > I'm afraid I can't spot the specific case in this table. None of
> > > the
> > > stores in the right column have a .rl suffix.
> > Yes, it is expected.
> > 
> > I am reading this table as (f.e.) amo.rl is an equivalent of
> > fence
> > rw,w; amo. (without .rl) 
> 
> In which case: How does quoting the table answer my original
> question?
Agree, it is starting to be confusing, so let me give an answer to your
question below.

> 
> > > >    It is also safe to translate any .aq, .rl, or .aqrl
> > > > annotation
> > > > into
> > > >    the fence-based snippets of
> > > >    Table 2.2. These can also be used as a legal implementation
> > > > of
> > > >    l{b|h|w|d} or s{b|h|w|d} pseu-
> > > >    doinstructions for as long as those instructions are not
> > > > added
> > > > to
> > > >    the ISA.
> > > > 
> > > > So according to the spec, it should be:
> > > >  sc.w ...
> > > >  RISCV_FULL_BARRIER.
> > > > 
> > > > Considering [1] and how this code looks before, it seems to me
> > > > that
> > > > it
> > > > is safe to use lr.w.aq/sc.w.rl here or an fence equivalent.
> > > 
> > > Here you say "or". Then why dos the code use sc.?.rl _and_ a
> > > fence?
> > I confused this line with amo.aqrl, so based on the table 2.2
> > above
> > s{b|h|w|d|c}.aqrl is transformed to "fence rw,w; s{b|h|w|d|c}", but
> > Linux kernel decided to strengthen it with full barrier:
> >    -  "0:\n\t"
> >    -  "lr.w.aqrl  %[p],  %[c]\n\t"
> >    -  "beq    %[p],  %[u], 1f\n\t"
> >    -  "add   %[rc],  %[p], %[a]\n\t"
> >    -  "sc.w.aqrl %[rc], %[rc], %[c]\n\t"
> >    -  "bnez  %[rc], 0b\n\t"
> >    -  "1:"
> >    +   "0: lr.w %[p],  %[c]\n"
> >    +   "   beq  %[p],  %[u], 1f\n"
> >    +   "   add  %[rc], %[p], %[a]\n"
> >    +   "   sc.w.rl  %[rc], %[rc], %[c]\n"
> >    +   "   bnez %[rc], 0b\n"
> >    +   "   fence    rw, rw\n"
> >    +   "1:\n"
> > As they have the following issue:
> >    implementations of atomics such as atomic_cmpxchg() and
> >    atomic_add_unless() rely on LR/SC pairs,
> >    which do not give full-ordering with .aqrl; for example, current
> >    implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
> >    below to end up with the state indicated in the "exists" clause.
> 
> Is that really "current implementations", not "the abstract model"?
> If so, the use of an explicit fence would be more like a workaround
> (and would hence want commenting to that effect).
> 
> In neither case can I see my original question answered: Why the .rl
> on the store when there is a full fence later?
The good explanation for that was provided in the commit addressing a
similar issue for ARM64 [
https://patchwork.kernel.org/project/linux-arm-kernel/patch/1391516953-14541-1-git-send-email-will.dea...@arm.com/
].
The same holds true for RISC-V since ARM also employs WMO.

I would also like to mention another point, as I indicated in another
email thread
[ https://lists.xen.org/archives/html/xen-devel/2024-03/msg01582.html ]
, that now this fence can be omitted and .aqrl can be used instead.

This was confirmed by Dan (the author of the RVWMO spec)
[https://lore.kernel.org/linux-riscv/41e01514-74ca-84f2-f5cc-2645c444f...@nvidia.com/
]

I hope this

Re: xen | Failed pipeline for staging | e3883336

On 27.03.2024 10:58, GitLab wrote:
> 
> 
> Pipeline #1229415063 has failed!
> 
> Project: xen ( https://gitlab.com/xen-project/hardware/xen )
> Branch: staging ( 
> https://gitlab.com/xen-project/hardware/xen/-/commits/staging )
> 
> Commit: e3883336 ( 
> https://gitlab.com/xen-project/hardware/xen/-/commit/e3883336bb5abba2ec2231618f5b64f61b099b1e
>  )
> Commit Message: xen/efi: efibind: address violations of MISRA C...
> Commit Author: Nicola Vetrini
> Committed by: Jan Beulich ( https://gitlab.com/jbeulich )
> 
> 
> Pipeline #1229415063 ( 
> https://gitlab.com/xen-project/hardware/xen/-/pipelines/1229415063 ) 
> triggered by Jan Beulich ( https://gitlab.com/jbeulich )
> had 1 failed job.
> 
> Job #6487283739 ( 
> https://gitlab.com/xen-project/hardware/xen/-/jobs/6487283739/raw )
> 
> Stage: test
> Name: qemu-alpine-x86_64-gcc

qemu-system-x86_64: terminating on signal 15 from pid 52 (timeout)

I have to admit I have no idea what this is supposed to be telling me.

Jan

Re: [PATCH v6 10/20] xen/riscv: introduce atomic.h

On 27.03.2024 11:28, Oleksii wrote:
> On Wed, 2024-03-27 at 08:40 +0100, Jan Beulich wrote:
> ...
> 
>>> +/* This is required to provide a full barrier on success.
>>> */
>>> +static inline int atomic_add_unless(atomic_t *v, int a,
>>> int u)
>>> +{
>>> +   int prev, rc;
>>> +
>>> +    asm volatile (
>>> +    "0: lr.w %[p],  %[c]\n"
>>> +    "   beq  %[p],  %[u], 1f\n"
>>> +    "   add  %[rc], %[p], %[a]\n"
>>> +    "   sc.w.rl  %[rc], %[rc], %[c]\n"
>>> +    "   bnez %[rc], 0b\n"
>>> +    RISCV_FULL_BARRIER
>>
>> With this and no .aq on the load, why the .rl on the store?
> It is something that LKMM requires [1].
>
> This is not fully clear to me what is so specific in LKMM, but
> accoring
> to the spec:
>    Ordering Annotation Fence-based Equivalent
>    l{b|h|w|d|r}.aq l{b|h|w|d|r}; fence r,rw
>    l{b|h|w|d|r}.aqrl   fence rw,rw; l{b|h|w|d|r}; fence r,rw
>    s{b|h|w|d|c}.rl fence rw,w; s{b|h|w|d|c}
>    s{b|h|w|d|c}.aqrl   fence rw,w; s{b|h|w|d|c}
>    amo.aq  amo; fence r,rw
>    amo.rl  fence rw,w; amo
>    amo.aqrl    fence rw,rw; amo; fence rw,rw
>    Table 2.2: Mappings from .aq and/or .rl to fence-based
> equivalents.
>    An alternative mapping places a fence rw,rw after the
> existing 
>    s{b|h|w|d|c} mapping rather than at the front of the
>    l{b|h|w|d|r} mapping.

 I'm afraid I can't spot the specific case in this table. None of
 the
 stores in the right column have a .rl suffix.
>>> Yes, it is expected.
>>>
>>> I am reading this table as (f.e.) amo.rl is an equivalent of
>>> fence
>>> rw,w; amo. (without .rl) 
>>
>> In which case: How does quoting the table answer my original
>> question?
> Agree, it is starting to be confusing, so let me give an answer to your
> question below.
> 
>>
>    It is also safe to translate any .aq, .rl, or .aqrl
> annotation
> into
>    the fence-based snippets of
>    Table 2.2. These can also be used as a legal implementation
> of
>    l{b|h|w|d} or s{b|h|w|d} pseu-
>    doinstructions for as long as those instructions are not
> added
> to
>    the ISA.
>
> So according to the spec, it should be:
>  sc.w ...
>  RISCV_FULL_BARRIER.
>
> Considering [1] and how this code looks before, it seems to me
> that
> it
> is safe to use lr.w.aq/sc.w.rl here or an fence equivalent.

 Here you say "or". Then why dos the code use sc.?.rl _and_ a
 fence?
>>> I confused this line with amo.aqrl, so based on the table 2.2
>>> above
>>> s{b|h|w|d|c}.aqrl is transformed to "fence rw,w; s{b|h|w|d|c}", but
>>> Linux kernel decided to strengthen it with full barrier:
>>>    -  "0:\n\t"
>>>    -  "lr.w.aqrl  %[p],  %[c]\n\t"
>>>    -  "beq    %[p],  %[u], 1f\n\t"
>>>    -  "add   %[rc],  %[p], %[a]\n\t"
>>>    -  "sc.w.aqrl %[rc], %[rc], %[c]\n\t"
>>>    -  "bnez  %[rc], 0b\n\t"
>>>    -  "1:"
>>>    +   "0: lr.w %[p],  %[c]\n"
>>>    +   "   beq  %[p],  %[u], 1f\n"
>>>    +   "   add  %[rc], %[p], %[a]\n"
>>>    +   "   sc.w.rl  %[rc], %[rc], %[c]\n"
>>>    +   "   bnez %[rc], 0b\n"
>>>    +   "   fence    rw, rw\n"
>>>    +   "1:\n"
>>> As they have the following issue:
>>>    implementations of atomics such as atomic_cmpxchg() and
>>>    atomic_add_unless() rely on LR/SC pairs,
>>>    which do not give full-ordering with .aqrl; for example, current
>>>    implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
>>>    below to end up with the state indicated in the "exists" clause.
>>
>> Is that really "current implementations", not "the abstract model"?
>> If so, the use of an explicit fence would be more like a workaround
>> (and would hence want commenting to that effect).
>>
>> In neither case can I see my original question answered: Why the .rl
>> on the store when there is a full fence later?
> The good explanation for that was provided in the commit addressing a
> similar issue for ARM64 [
> https://patchwork.kernel.org/project/linux-arm-kernel/patch/1391516953-14541-1-git-send-email-will.dea...@arm.com/
> ].
> The same holds true for RISC-V since ARM also employs WMO.
> 
> I would also like to mention another point, as I indicated in another
> email thread
> [ https://lists.xen.org/archives/html/xen-devel/2024-03/msg01582.html ]
> , that now this fence can be omitted and .aqrl can be used instead.
> 
> This was confirmed by Dan (the author of the RVWMO spec)
> [https://lore.kernel.org/linux-riscv/41e01514-74ca-84f2-f5cc-2645c444f...@nvidia.com/
> ]
> 
> I hope this addresses your original question. Does it?

I think it does, thanks

Re: [PATCH v7 08/14] xen/page_alloc: introduce preserved page flags macro

2024-03-27 Thread Carlo Nonato

Hi guys,

> Question is: How would you justify such a change? IOW I'm not convinced
> (yet) this wants doing there.

You mean in this series?

> Looking at the code, the flag is originally set in
> alloc_domheap_pages(). So I guess it would make sense to do it in
> free_domheap_pages().

We don't hold the heap_lock there. Is it safe to change count_info without it?

Thanks.

> For PGC_static, I don't think we can reach free_domheap_pages() with the
> flag set (the pages should live in a separate pool). So I believe there
> is nothing to do for them.
>
> Cheers,
>
> --
> Julien Grall

Re: xen | Failed pipeline for staging | e3883336

On 27/03/2024 10:40 am, Jan Beulich wrote:
> On 27.03.2024 10:58, GitLab wrote:
>>
>> Pipeline #1229415063 has failed!
>>
>> Project: xen ( https://gitlab.com/xen-project/hardware/xen )
>> Branch: staging ( 
>> https://gitlab.com/xen-project/hardware/xen/-/commits/staging )
>>
>> Commit: e3883336 ( 
>> https://gitlab.com/xen-project/hardware/xen/-/commit/e3883336bb5abba2ec2231618f5b64f61b099b1e
>>  )
>> Commit Message: xen/efi: efibind: address violations of MISRA C...
>> Commit Author: Nicola Vetrini
>> Committed by: Jan Beulich ( https://gitlab.com/jbeulich )
>>
>>
>> Pipeline #1229415063 ( 
>> https://gitlab.com/xen-project/hardware/xen/-/pipelines/1229415063 ) 
>> triggered by Jan Beulich ( https://gitlab.com/jbeulich )
>> had 1 failed job.
>>
>> Job #6487283739 ( 
>> https://gitlab.com/xen-project/hardware/xen/-/jobs/6487283739/raw )
>>
>> Stage: test
>> Name: qemu-alpine-x86_64-gcc
> qemu-system-x86_64: terminating on signal 15 from pid 52 (timeout)
>
> I have to admit I have no idea what this is supposed to be telling me.

In these smoke test, Qemu is run with a timeout in case things hang/crash.

This is saying Qemu running Xen+Dom0 didn't appear to get to the login
prompt within the timeout that the test specifies.

As to why, that's less clear.  It appears that it was making progress,
just slowly.

~Andrew

Re: Xen NIC driver have page_pool memory leaks

2024-03-27 Thread Jesper Dangaard Brouer

On 25/03/2024 13.33, Paul Durrant wrote:

On 25/03/2024 12:21, Jesper Dangaard Brouer wrote:

Hi Arthur,

(Answer inlined below, which is custom on this mailing list)

On 23/03/2024 14.23, Arthur Borsboom wrote:

Hi Jesper,

After a recent kernel upgrade 6.7.6 > 6.8.1 all my Xen guests on Arch
Linux are dumping kernel traces.
It seems to be indirectly caused by the page pool memory leak
mechanism, which is probably a good thing.

I have created a bug report, but there is no response.

https://bugzilla.kernel.org/show_bug.cgi?id=218618

I am uncertain where and to whom I need to report this page leak.
Can you help me get this issue fixed?

I'm the page_pool maintainer, but as you say yourself in comment 2 then
since dba1b8a7ab68 ("mm/page_pool: catch page_pool memory leaks") this
indicated there is a problem in the xen_netfront driver, which was
previously not visible.

Cc'ing the "XEN NETWORK BACKEND DRIVER" maintainers, as this is a driver
bug. What confuses me it that I cannot find any modules named
"xen_netfront" in the upstream tree.

You should have tried '-' rather than '_' :-)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/xen-netfront.c

Looking at this driver, I think it is missing a call to
skb_mark_for_recycle().

I'll will submit at patch for this, with details for stable maintainers.

As I think it dates back to v5.9 via commit 6c5aa6fc4def ("xen
networking: add basic XDP support for xen-netfront"). I think this
commit is missing a call to page_pool_release_page()
between v5.9 to v5.14, after which is should have used
skb_mark_for_recycle().

Since v6.6 the call page_pool_release_page() were removed (in
535b9c61bdef ("net: page_pool: hide page_pool_release_page()") and
remaining callers converted (in commit 6bfef2ec0172 ("Merge branch
'net-page_pool-remove-page_pool_release_page'")).

This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool:
catch page_pool memory leaks").

--Jesper

Re: Xen NIC driver have page_pool memory leaks

On 27/03/2024 11:27 am, Jesper Dangaard Brouer wrote:
>
>
> On 25/03/2024 13.33, Paul Durrant wrote:
>> On 25/03/2024 12:21, Jesper Dangaard Brouer wrote:
>>> Hi Arthur,
>>>
>>> (Answer inlined below, which is custom on this mailing list)
>>>
>>> On 23/03/2024 14.23, Arthur Borsboom wrote:
 Hi Jesper,

 After a recent kernel upgrade 6.7.6 > 6.8.1 all my Xen guests on Arch
 Linux are dumping kernel traces.
 It seems to be indirectly caused by the page pool memory leak
 mechanism, which is probably a good thing.

 I have created a bug report, but there is no response.

 https://bugzilla.kernel.org/show_bug.cgi?id=218618

 I am uncertain where and to whom I need to report this page leak.
 Can you help me get this issue fixed?
>>>
>>> I'm the page_pool maintainer, but as you say yourself in comment 2 then
>>> since dba1b8a7ab68 ("mm/page_pool: catch page_pool memory leaks") this
>>> indicated there is a problem in the xen_netfront driver, which was
>>> previously not visible.
>>>
>>> Cc'ing the "XEN NETWORK BACKEND DRIVER" maintainers, as this is a
>>> driver
>>> bug.  What confuses me it that I cannot find any modules named
>>> "xen_netfront" in the upstream tree.
>>>
>>
>> You should have tried '-' rather than '_' :-)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/xen-netfront.c
>>
>>
>
> Looking at this driver, I think it is missing a call to
> skb_mark_for_recycle().
>
> I'll will submit at patch for this, with details for stable maintainers.
>
> As I think it dates back to v5.9 via commit 6c5aa6fc4def ("xen
> networking: add basic XDP support for xen-netfront"). I think this
> commit is missing a call to page_pool_release_page()
> between v5.9 to v5.14, after which is should have used
> skb_mark_for_recycle().
>
> Since v6.6 the call page_pool_release_page() were removed (in
> 535b9c61bdef ("net: page_pool: hide page_pool_release_page()") and
> remaining callers converted (in commit 6bfef2ec0172 ("Merge branch
> 'net-page_pool-remove-page_pool_release_page'")).
>
> This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool:
> catch page_pool memory leaks").

Thankyou very much for your help here.  Please CC
secur...@xenproject.org too, because we'll want to issue an XSA (Xen
Security Advisory) when this fix is ready.

~Andrew

Re: Backport request for 4.17

On 25.03.2024 18:02, Anthony PERARD wrote:
> Would it be possible to backport 18a36b4a9b08 ("tools: ipxe: update for
> fixing build with GCC12") to Xen 4.17 ?

Sure, now done.

Jan

> This would be to allow building Xen 4.17 on Debian Bookworm, and to allow
> osstest to test Xen 4.18 with Debian Bookworm. osstest always tries to
> migration from N-1 to N, so it would need to be able to build both 4.17
> and 4.18 to actually test 4.18. Otherwise, I can tell osstest to keep
> using Debian Buster to test 4.18.
> 
> For context:
> 
> https://lore.kernel.org/xen-devel/20240318165545.3898-36-anthony.per...@citrix.com/
> 
> That commit pulls a newer version of IPXE, I don't think there's be
> compatibility issue with Xen, and hopefully nothing breaks.
> 
> Thanks,
>

[xen-unstable-smoke test] 185170: tolerable all pass - PUSHED

flight 185170 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185170/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  e3883336bb5abba2ec2231618f5b64f61b099b1e
baseline version:
 xen  6f6de10ade5ade907f9e3f3c72b7b18f7852d9ff

Last test of basis   185157  2024-03-25 10:03:48 Z2 days
Testing same since   185170  2024-03-27 09:03:56 Z0 days1 attempts


People who touched revisions under test:
  Jan Beulich 
  Jason Andryuk 
  Nicola Vetrini 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   6f6de10ade..e3883336bb  e3883336bb5abba2ec2231618f5b64f61b099b1e -> smoke

Re: [PATCH v7 04/14] xen/arm: add Dom0 cache coloring support

2024-03-27 Thread Carlo Nonato

Hi guys,

On Fri, Mar 22, 2024 at 8:26 AM Jan Beulich  wrote:
>
> On 21.03.2024 18:31, Carlo Nonato wrote:
> > On Thu, Mar 21, 2024 at 4:57 PM Jan Beulich  wrote:
> >>
> >> On 21.03.2024 16:04, Carlo Nonato wrote:
> >>> On Tue, Mar 19, 2024 at 4:30 PM Jan Beulich  wrote:
>  On 15.03.2024 11:58, Carlo Nonato wrote:
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -963,6 +963,15 @@ Controls for the dom0 IOMMU setup.
> >
> >  Specify a list of IO ports to be excluded from dom0 access.
> >
> > +### dom0-llc-colors
> > +> `= List of [  | - ]`
> > +
> > +> Default: `All available LLC colors`
> > +
> > +Specify dom0 LLC color configuration. This option is available only 
> > when
> > +`CONFIG_LLC_COLORING` is enabled. If the parameter is not set, all 
> > available
> > +colors are used.
> 
>  My reservation towards this being a top-level option remains.
> >>>
> >>> How can I turn this into a lower-level option? Moving it into "dom0=" 
> >>> doesn't
> >>> seem possible to me. How can I express a list (llc-colors) inside another 
> >>> list
> >>> (dom0)? dom0=llc-colors=0-3,12-15,other-param=... How can I stop parsing
> >>> before reaching other-param?
> >>
> >> For example by using a different separator:
> >>
> >> dom0=llc-colors=0-3+12-15,other-param=...
> >
> > Ok, but that would mean to change the implementation of the parsing function
> > and to adopt this syntax also in other places, something that I would've
> > preferred to avoid. Anyway I'll follow your suggestion.
>
> Well, this is all used by Arm only for now. You will want to make sure Arm
> folks are actually okay with this alternative approach.
>
> Jan

Are you Arm maintainers ok with this?

Thanks.

Re: Linux Xen PV CPA W^X violation false-positives

2024-03-27 Thread Jürgen Groß


On 24.01.24 17:54, Jason Andryuk wrote:

Xen PV domains show CPA W^X violations like:

CPA detected W^X violation: 0064 -> 0067 range: 
0x8881 - 0x88810fff PFN 10
WARNING: CPU: 0 PID: 30 at arch/x86/mm/pat/set_memory.c:613 
__change_page_attr_set_clr+0x113a/0x11c0
Modules linked in: xt_physdev xt_MASQUERADE iptable_nat nf_nat nf_conntrack 
libcrc32c nf_defrag_ipv4 ip_tables x_tables xen_argo(O)
CPU: 0 PID: 30 Comm: kworker/0:2 Tainted: G   O   6.1.38 #1
Workqueue: events bpf_prog_free_deferred
RIP: e030:__change_page_attr_set_clr+0x113a/0x11c0
Code: 4c 89 f1 4c 89 e2 4c 89 d6 4c 89 8d 70 ff ff ff 4d 8d 86 ff 0f 00 00 48 c7 c7 
f0 3c da 81 c6 05 d0 0e 0e 01 01 e8 f6 71 00 00 <0f> 0b 4c 8b 8d 70 ff ff ff e9 
2a fd ff ff 48 8b 85 60 ff ff ff 48
RSP: e02b:c9367c48 EFLAGS: 00010282
RAX:  RBX: 000ef064 RCX: 
RDX: 0003 RSI: f7ff RDI: 
RBP: c9367d48 R08:  R09: c9367aa0
R10: 0001 R11: 0001 R12: 0067
R13: 0001 R14: 8881 R15: c9367d60
FS:  () GS:88800b80() knlGS:
CS:  e030 DS:  ES:  CR0: 80050033
CR2: 7fdbaeda01c0 CR3: 04312000 CR4: 00050660
Call Trace:
  
  ? show_regs.cold+0x1a/0x1f
  ? __change_page_attr_set_clr+0x113a/0x11c0
  ? __warn+0x7b/0xc0
  ? __change_page_attr_set_clr+0x113a/0x11c0
  ? report_bug+0x111/0x1a0
  ? handle_bug+0x4d/0xa0
  ? exc_invalid_op+0x19/0x70
  ? asm_exc_invalid_op+0x1b/0x20
  ? __change_page_attr_set_clr+0x113a/0x11c0
  ? __change_page_attr_set_clr+0x113a/0x11c0
  ? debug_smp_processor_id+0x17/0x20
  ? ___cache_free+0x2e/0x1e0
  ? _raw_spin_unlock+0x1e/0x40
  ? __purge_vmap_area_lazy+0x2ea/0x6b0
  set_direct_map_default_noflush+0x7c/0xa0
  __vunmap+0x1ac/0x280
  __vfree+0x1d/0x60
  vfree+0x27/0x40
  __bpf_prog_free+0x44/0x50
  bpf_prog_free_deferred+0x104/0x120
  process_one_work+0x1ca/0x3d0
  ? process_one_work+0x3d0/0x3d0
  worker_thread+0x45/0x3c0
  ? process_one_work+0x3d0/0x3d0
  kthread+0xe2/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  
---[ end trace  ]---

Xen provides a set of page tables that the guest executes out of when it
starts.  The L1 entries are shared between level2_ident_pgt and
level2_kernel_pgt, and xen_setup_kernel_pagetable() sets the NX bit in
the level2_ident_pgt entries.  verify_rwx() only checks the l1 entry and
reports a false-positive violation.

Here is a dump of some kernel virtual addresses and the corresponding
L1 and L2 entries:
This is the start of the directmap (ident) and they have NX (bit 63) set
in the PMD.
ndvm-pv (1): [0.466778] va=8880 pte=00100027 level: 1
ndvm-pv (1): [0.466788] va=8880 pmd=8242c067 level: 2
Directmap for kernel text:
ndvm-pv (1): [0.466795] va=88800100 pte=00100165 level: 1
ndvm-pv (1): [0.466801] va=88800100 pmd=82434067 level: 2
ndvm-pv (1): [0.466807] va=88800101 pte=001001010065 level: 1
ndvm-pv (1): [0.466814] va=88800101 pmd=82434067 level: 2
The start of the kernel text highmap is unmapped:
ndvm-pv (1): [0.466820] va=8000 pte= level: 3
ndvm-pv (1): [0.466826] va=8000 pmd= level: 3
Kernel PMD for .text has NX bit clear
ndvm-pv (1): [0.466832] va=8100 pte=00100165 level: 1
ndvm-pv (1): [0.466838] va=8100 pmd=02434067 level: 2
Kernel PTE for rodata_end has NX bit set
ndvm-pv (1): [0.466846] va=81e62000 pte=801001e62025 level: 1
ndvm-pv (1): [0.466874] va=81e62000 pmd=0243b067 level: 2
Directmap of rodata_end
ndvm-pv (1): [0.466907] va=888001e62000 pte=801001e62025 level: 1
ndvm-pv (1): [0.466913] va=888001e62000 pmd=8243b067 level: 2
Directmap of a low RAM address
ndvm-pv (1): [0.466920] va=8881 pte=00110027 level: 1
ndvm-pv (1): [0.466926] va=8881 pmd=8242c067 level: 2
Directmap of another RAM address close to but below kernel text
ndvm-pv (1): [0.466932] va=88800096c000 pte=00100096c027 level: 1
ndvm-pv (1): [0.466938] va=88800096c000 pmd=82430067 level: 2

Here are some L2 entries showing the differing NX bits for l2_ident vs.
l2_kernel while they point at the same L1 addresses
ndvm-pv (1): [0.466944]  l2_ident[  0] pmd=8242c067
ndvm-pv (1): [0.466949]  l2_ident[  1] pmd=8242d067
ndvm-pv (1): [0.466955]  l2_ident[  8] pmd=82434067
ndvm-pv (1): [0.466959]  l2_ident[  9] pmd=82435067
ndvm-pv (1): [0.466964]  l2_ident[ 14] pmd=8243a067
ndvm-pv (1): [0.466969]  l2_ident[ 15] pmd=8243b067
ndvm-pv (1): [0.466974]

Re: [PATCH-for-9.0 v2 05/19] hw/display: Restrict xen_register_framebuffer() call to Xen

On Tue, Nov 14, 2023 at 03:38:01PM +0100, Philippe Mathieu-Daudé wrote:
> Only call xen_register_framebuffer() when Xen is enabled.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

I don't think this patch is very useful but it's fine, so:
Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD

Re: [PATCH-for-9.0 v2 08/19] hw/xen: Remove unused Xen stubs

On Tue, Nov 14, 2023 at 03:38:04PM +0100, Philippe Mathieu-Daudé wrote:
> All these stubs are protected by a 'if (xen_enabled())' check.

Are you sure? There's still nothing that prevent a compiler from wanting
those, I don't think.

Sure, often compilers will remove dead code in `if(0){...}`, but there's
no guaranty, is there?

Cheers,

-- 
Anthony PERARD

Re: [PATCH v7 04/14] xen/arm: add Dom0 cache coloring support

2024-03-27 Thread Julien Grall


Hi,

On 27/03/2024 11:39, Carlo Nonato wrote:

On Fri, Mar 22, 2024 at 8:26 AM Jan Beulich  wrote:


On 21.03.2024 18:31, Carlo Nonato wrote:

On Thu, Mar 21, 2024 at 4:57 PM Jan Beulich  wrote:


On 21.03.2024 16:04, Carlo Nonato wrote:

On Tue, Mar 19, 2024 at 4:30 PM Jan Beulich  wrote:

On 15.03.2024 11:58, Carlo Nonato wrote:

--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -963,6 +963,15 @@ Controls for the dom0 IOMMU setup.

  Specify a list of IO ports to be excluded from dom0 access.

+### dom0-llc-colors
+> `= List of [  | - ]`
+
+> Default: `All available LLC colors`
+
+Specify dom0 LLC color configuration. This option is available only when
+`CONFIG_LLC_COLORING` is enabled. If the parameter is not set, all available
+colors are used.


My reservation towards this being a top-level option remains.


How can I turn this into a lower-level option? Moving it into "dom0=" doesn't
seem possible to me. How can I express a list (llc-colors) inside another list
(dom0)? dom0=llc-colors=0-3,12-15,other-param=... How can I stop parsing
before reaching other-param?


For example by using a different separator:

dom0=llc-colors=0-3+12-15,other-param=...


Ok, but that would mean to change the implementation of the parsing function
and to adopt this syntax also in other places, something that I would've
preferred to avoid. Anyway I'll follow your suggestion.


Well, this is all used by Arm only for now. You will want to make sure Arm
folks are actually okay with this alternative approach.

Jan


Are you Arm maintainers ok with this?


Unfortunately no. I find the use of + and "nested" = extremely confusing 
to read.


There might be other symbols to use (e.g. s/=/:/ s#+#/#), but I am not 
entirely sure the value of trying to cram all the options under a single 
top-level parameter. So I right now, I would prefrr if we stick with the 
existing approach (i.e. introducing dom0-llc-colors).


Cheers,

--
Julien Grall

Re: [PATCH v7 04/14] xen/arm: add Dom0 cache coloring support

2024-03-27 Thread Michal Orzel




On 27/03/2024 12:39, Carlo Nonato wrote:
> 
> 
> Hi guys,
> 
> On Fri, Mar 22, 2024 at 8:26 AM Jan Beulich  wrote:
>>
>> On 21.03.2024 18:31, Carlo Nonato wrote:
>>> On Thu, Mar 21, 2024 at 4:57 PM Jan Beulich  wrote:

 On 21.03.2024 16:04, Carlo Nonato wrote:
> On Tue, Mar 19, 2024 at 4:30 PM Jan Beulich  wrote:
>> On 15.03.2024 11:58, Carlo Nonato wrote:
>>> --- a/docs/misc/xen-command-line.pandoc
>>> +++ b/docs/misc/xen-command-line.pandoc
>>> @@ -963,6 +963,15 @@ Controls for the dom0 IOMMU setup.
>>>
>>>  Specify a list of IO ports to be excluded from dom0 access.
>>>
>>> +### dom0-llc-colors
>>> +> `= List of [  | - ]`
>>> +
>>> +> Default: `All available LLC colors`
>>> +
>>> +Specify dom0 LLC color configuration. This option is available only 
>>> when
>>> +`CONFIG_LLC_COLORING` is enabled. If the parameter is not set, all 
>>> available
>>> +colors are used.
>>
>> My reservation towards this being a top-level option remains.
>
> How can I turn this into a lower-level option? Moving it into "dom0=" 
> doesn't
> seem possible to me. How can I express a list (llc-colors) inside another 
> list
> (dom0)? dom0=llc-colors=0-3,12-15,other-param=... How can I stop parsing
> before reaching other-param?

 For example by using a different separator:

 dom0=llc-colors=0-3+12-15,other-param=...
>>>
>>> Ok, but that would mean to change the implementation of the parsing function
>>> and to adopt this syntax also in other places, something that I would've
>>> preferred to avoid. Anyway I'll follow your suggestion.
>>
>> Well, this is all used by Arm only for now. You will want to make sure Arm
>> folks are actually okay with this alternative approach.
>>
>> Jan
> 
> Are you Arm maintainers ok with this?
I'm not a fan of this syntax and I find it more difficult to parse compared to 
the usual case, where
every option is clearly separated. That said, I won't oppose to it.

~Michal

[PATCH net] xen-netfront: Add missing skb_mark_for_recycle

2024-03-27 Thread Jesper Dangaard Brouer

Notice that skb_mark_for_recycle() is introduced later than fixes tag in
6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling").

It is believed that fixes tag were missing a call to page_pool_release_page()
between v5.9 to v5.14, after which is should have used skb_mark_for_recycle().
Since v6.6 the call page_pool_release_page() were removed (in 535b9c61bdef
("net: page_pool: hide page_pool_release_page()") and remaining callers
converted (in commit 6bfef2ec0172 ("Merge branch
'net-page_pool-remove-page_pool_release_page'")).

This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool: catch
page_pool memory leaks").

Fixes: 6c5aa6fc4def ("xen networking: add basic XDP support for xen-netfront")
Reported-by: Arthur Borsboom 
Signed-off-by: Jesper Dangaard Brouer 
---
Compile tested only, can someone please test this

 drivers/net/xen-netfront.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index ad29f370034e..8d2aee88526c 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -285,6 +285,7 @@ static struct sk_buff *xennet_alloc_one_rx_buffer(struct 
netfront_queue *queue)
return NULL;
}
skb_add_rx_frag(skb, 0, page, 0, 0, PAGE_SIZE);
+   skb_mark_for_recycle(skb);
 
/* Align ip header to a 16 bytes boundary */
skb_reserve(skb, NET_IP_ALIGN);

preparations for 4.17.4

All,

the release is due in two to three weeks. Please point out backports you find
missing from the respective staging branch, but which you consider relevant.

Note that this is going to be the last Xen Project coordinated ordinary stable
release from this branch; the branch will move into security-only support mode
afterwards.

Jan

Re: preparations for 4.17.4

On 27/03/2024 12:23 pm, Jan Beulich wrote:
> All,
>
> the release is due in two to three weeks. Please point out backports you find
> missing from the respective staging branch, but which you consider relevant.
>
> Note that this is going to be the last Xen Project coordinated ordinary stable
> release from this branch; the branch will move into security-only support mode
> afterwards.

1) livepatching of .rodata:

989556c6f8ca - xen/virtual-region: Rename the start/end fields
ef969144a425 - xen/virtual-region: Include rodata pointers
b083b1c393dc - x86/livepatch: Relax permissions on rodata too

And technically "x86/mm: fix detection of last L1 entry in
modify_xen_mappings_lite()" too but you've already backported this one.

Patching .rodata worked before Xen 4.17, and was broken (left as a TODO)
when I adjusted Xen to stop using CR0.WP=0 for patching.

2) Policy fixes:

e2d8a6522516 - x86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max
policies

This is a real bugfix for a real regression we found updating from Xen
4.13 -> 4.17.  It has a dependency on

5420aa165dfa - x86/cpu-policy: Hide x2APIC from PV guests

which I know you had more concern with.  FWIW, I'm certain its a good
fix, and should be backported.

3) Test fixes:

0263dc9069dd - tests/resource: Fix HVM guest in !SHADOW builds

It's minor, but does make a difference for those of us who run these
tests regularly.

4) Watchdog fixes:

9e18f339830c - x86/boot: Improve the boot watchdog determination of
stuck cpus
131892e0dcc1 - x86/boot: Support the watchdog on newer AMD systems

You took "x86/boot: Fix setup_apic_nmi_watchdog() to fail more cleanly"
and the first of the two patches is in the same category IMO.  The
second I also feel ok to take for the in-support releases, particularly
as all it is doing is dropping a family list.

5) Ucode scan stability  (For 4.18 only)

Xen 4.18 had "x86/ucode: Refresh raw CPU policy after microcode load" in
it's .0 release, so should also gain:

cf7fe8b72dea - x86/ucode: Fix stability of the raw CPU Policy rescan

I've only noticed because I've got them both backported to 4.17 in
XenServer, but I don't think upstream wants to take that route.

~Andrew

Re: Xen NIC driver have page_pool memory leaks

2024-03-27 Thread Denis Kirjanov




On 3/27/24 14:27, Jesper Dangaard Brouer wrote:
> 
> 
> On 25/03/2024 13.33, Paul Durrant wrote:
>> On 25/03/2024 12:21, Jesper Dangaard Brouer wrote:
>>> Hi Arthur,
>>>
>>> (Answer inlined below, which is custom on this mailing list)
>>>
>>> On 23/03/2024 14.23, Arthur Borsboom wrote:
 Hi Jesper,

 After a recent kernel upgrade 6.7.6 > 6.8.1 all my Xen guests on Arch
 Linux are dumping kernel traces.
 It seems to be indirectly caused by the page pool memory leak
 mechanism, which is probably a good thing.

 I have created a bug report, but there is no response.

 https://bugzilla.kernel.org/show_bug.cgi?id=218618

 I am uncertain where and to whom I need to report this page leak.
 Can you help me get this issue fixed?
>>>
>>> I'm the page_pool maintainer, but as you say yourself in comment 2 then
>>> since dba1b8a7ab68 ("mm/page_pool: catch page_pool memory leaks") this
>>> indicated there is a problem in the xen_netfront driver, which was
>>> previously not visible.
>>>
>>> Cc'ing the "XEN NETWORK BACKEND DRIVER" maintainers, as this is a driver
>>> bug.  What confuses me it that I cannot find any modules named
>>> "xen_netfront" in the upstream tree.
>>>
>>
>> You should have tried '-' rather than '_' :-)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/xen-netfront.c
>>
> 
> Looking at this driver, I think it is missing a call to 
> skb_mark_for_recycle().
> 
> I'll will submit at patch for this, with details for stable maintainers.

Please do

> 
> As I think it dates back to v5.9 via commit 6c5aa6fc4def ("xen
> networking: add basic XDP support for xen-netfront"). I think this
> commit is missing a call to page_pool_release_page()
> between v5.9 to v5.14, after which is should have used
> skb_mark_for_recycle().

Hmm, looks like I've missed it  
> 
> Since v6.6 the call page_pool_release_page() were removed (in
> 535b9c61bdef ("net: page_pool: hide page_pool_release_page()") and
> remaining callers converted (in commit 6bfef2ec0172 ("Merge branch
> 'net-page_pool-remove-page_pool_release_page'")).
> 
> This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool:
> catch page_pool memory leaks").
> 
> --Jesper
>

Re: [PATCH v7 08/14] xen/page_alloc: introduce preserved page flags macro

2024-03-27 Thread Julien Grall


Hi Carlo,

On 27/03/2024 11:10, Carlo Nonato wrote:

Hi guys,


Question is: How would you justify such a change? IOW I'm not convinced
(yet) this wants doing there.


You mean in this series?


Looking at the code, the flag is originally set in
alloc_domheap_pages(). So I guess it would make sense to do it in
free_domheap_pages().


We don't hold the heap_lock there.
Regardless of the safety question (I will answer below), count_info is 
not meant to be protected by heap_lock. The lock is protecting the heap 
and ensure we are not corrupting them when there are concurrent call to 
free_heap_pages().


> Is it safe to change count_info without it?

count_info is meant to be accessed locklessly. The flag PGC_extra cannot 
be set/clear concurrently because you can't allocate a page that has not 
yet been freed.


So it would be fine to use clear_bit(..., ...);

Cheers,

--
Julien Grall

Re: [XEN PATCH 1/6] xen/arm: ffa: rename functions to use ffa_ prefix

Hi Jens,

> On 25 Mar 2024, at 10:38, Jens Wiklander  wrote:
> 
> Prepare to separate into modules by renaming functions that will need
> new names when becoming non-static in the following commit.
> 
> Signed-off-by: Jens Wiklander 

Reviewed-by: Bertrand Marquis 

Cheers
Bertrand

> ---
> xen/arch/arm/tee/ffa.c | 125 +
> 1 file changed, 65 insertions(+), 60 deletions(-)
> 
> diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c
> index 9a05dcede17a..0344a0f17e72 100644
> --- a/xen/arch/arm/tee/ffa.c
> +++ b/xen/arch/arm/tee/ffa.c
> @@ -4,7 +4,7 @@
>  *
>  * Arm Firmware Framework for ARMv8-A (FF-A) mediator
>  *
> - * Copyright (C) 2023  Linaro Limited
> + * Copyright (C) 2023-2024  Linaro Limited
>  *
>  * References:
>  * FF-A-1.0-REL: FF-A specification version 1.0 available at
> @@ -473,7 +473,7 @@ static bool ffa_get_version(uint32_t *vers)
> return true;
> }
> 
> -static int32_t get_ffa_ret_code(const struct arm_smccc_1_2_regs *resp)
> +static int32_t ffa_get_ret_code(const struct arm_smccc_1_2_regs *resp)
> {
> switch ( resp->a0 )
> {
> @@ -504,7 +504,7 @@ static int32_t ffa_simple_call(uint32_t fid, register_t 
> a1, register_t a2,
> 
> arm_smccc_1_2_smc(&arg, &resp);
> 
> -return get_ffa_ret_code(&resp);
> +return ffa_get_ret_code(&resp);
> }
> 
> static int32_t ffa_features(uint32_t id)
> @@ -546,7 +546,7 @@ static int32_t ffa_partition_info_get(uint32_t w1, 
> uint32_t w2, uint32_t w3,
> 
> arm_smccc_1_2_smc(&arg, &resp);
> 
> -ret = get_ffa_ret_code(&resp);
> +ret = ffa_get_ret_code(&resp);
> if ( !ret )
> {
> *count = resp.a2;
> @@ -654,15 +654,16 @@ static int32_t ffa_direct_req_send_vm(uint16_t sp_id, 
> uint16_t vm_id,
> return res;
> }
> 
> -static uint16_t get_vm_id(const struct domain *d)
> +static uint16_t ffa_get_vm_id(const struct domain *d)
> {
> /* +1 since 0 is reserved for the hypervisor in FF-A */
> return d->domain_id + 1;
> }
> 
> -static void set_regs(struct cpu_user_regs *regs, register_t v0, register_t 
> v1,
> - register_t v2, register_t v3, register_t v4, register_t 
> v5,
> - register_t v6, register_t v7)
> +static void ffa_set_regs(struct cpu_user_regs *regs, register_t v0,
> + register_t v1, register_t v2, register_t v3,
> + register_t v4, register_t v5, register_t v6,
> + register_t v7)
> {
> set_user_reg(regs, 0, v0);
> set_user_reg(regs, 1, v1);
> @@ -674,15 +675,15 @@ static void set_regs(struct cpu_user_regs *regs, 
> register_t v0, register_t v1,
> set_user_reg(regs, 7, v7);
> }
> 
> -static void set_regs_error(struct cpu_user_regs *regs, uint32_t error_code)
> +static void ffa_set_regs_error(struct cpu_user_regs *regs, uint32_t 
> error_code)
> {
> -set_regs(regs, FFA_ERROR, 0, error_code, 0, 0, 0, 0, 0);
> +ffa_set_regs(regs, FFA_ERROR, 0, error_code, 0, 0, 0, 0, 0);
> }
> 
> -static void set_regs_success(struct cpu_user_regs *regs, uint32_t w2,
> +static void ffa_set_regs_success(struct cpu_user_regs *regs, uint32_t w2,
>  uint32_t w3)
> {
> -set_regs(regs, FFA_SUCCESS_32, 0, w2, w3, 0, 0, 0, 0);
> +ffa_set_regs(regs, FFA_SUCCESS_32, 0, w2, w3, 0, 0, 0, 0);
> }
> 
> static void handle_version(struct cpu_user_regs *regs)
> @@ -697,11 +698,11 @@ static void handle_version(struct cpu_user_regs *regs)
> vers = FFA_VERSION_1_1;
> 
> ctx->guest_vers = vers;
> -set_regs(regs, vers, 0, 0, 0, 0, 0, 0, 0);
> +ffa_set_regs(regs, vers, 0, 0, 0, 0, 0, 0, 0);
> }
> 
> -static uint32_t handle_rxtx_map(uint32_t fid, register_t tx_addr,
> -register_t rx_addr, uint32_t page_count)
> +static uint32_t ffa_handle_rxtx_map(uint32_t fid, register_t tx_addr,
> +register_t rx_addr, uint32_t page_count)
> {
> uint32_t ret = FFA_RET_INVALID_PARAMETERS;
> struct domain *d = current->domain;
> @@ -789,7 +790,7 @@ static void rxtx_unmap(struct ffa_ctx *ctx)
> ctx->rx_is_free = false;
> }
> 
> -static uint32_t handle_rxtx_unmap(void)
> +static uint32_t ffa_handle_rxtx_unmap(void)
> {
> struct domain *d = current->domain;
> struct ffa_ctx *ctx = d->arch.tee;
> @@ -802,9 +803,10 @@ static uint32_t handle_rxtx_unmap(void)
> return FFA_RET_OK;
> }
> 
> -static int32_t handle_partition_info_get(uint32_t w1, uint32_t w2, uint32_t 
> w3,
> - uint32_t w4, uint32_t w5,
> - uint32_t *count, uint32_t *fpi_size)
> +static int32_t ffa_handle_partition_info_get(uint32_t w1, uint32_t w2,
> + uint32_t w3, uint32_t w4,
> + uint32_t w5, uint32_t *count,
> + uint32_t *fpi_size)
> {
> int32_t ret = FFA_RET_DENIED;
> struct domain

Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma

On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote:
> Except imported source files, QEMU code base uses
> the QEMU_ALIGNED() macro to align its structures.

This patch only convert the alignment, but discard pack. We need both or
the struct is changed.

> ---
>  hw/block/xen_blkif.h | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/xen_blkif.h b/hw/block/xen_blkif.h
> index 99733529c1..c1d154d502 100644
> --- a/hw/block/xen_blkif.h
> +++ b/hw/block/xen_blkif.h
> @@ -18,7 +18,6 @@ struct blkif_common_response {
>  };
>  
>  /* i386 protocol version */
> -#pragma pack(push, 4)
>  struct blkif_x86_32_request {
>  uint8_toperation;/* BLKIF_OP_??? 
> */
>  uint8_tnr_segments;  /* number of segments   
> */
> @@ -26,7 +25,7 @@ struct blkif_x86_32_request {
>  uint64_t   id;   /* private guest value, echoed in resp  
> */
>  blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  
> */
>  struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> -};
> +} QEMU_ALIGNED(4);

E.g. for this one, I've compare the output of
`pahole --class_name=blkif_x86_32_request build/qemu-system-i386`:

--- before
+++ after
@@ -1,11 +1,15 @@
 struct blkif_x86_32_request {
uint8_toperation;/* 0 1 */
uint8_tnr_segments;  /* 1 1 */
uint16_t   handle;   /* 2 2 */
-   uint64_t   id;   /* 4 8 */
-   uint64_t   sector_number;/*12 8 */
-   struct blkif_request_segment seg[11];/*2088 */

-   /* size: 108, cachelines: 2, members: 6 */
-   /* last cacheline: 44 bytes */
-} __attribute__((__packed__));
+   /* XXX 4 bytes hole, try to pack */
+
+   uint64_t   id;   /* 8 8 */
+   uint64_t   sector_number;/*16 8 */
+   struct blkif_request_segment seg[11];/*2488 */
+
+   /* size: 112, cachelines: 2, members: 6 */
+   /* sum members: 108, holes: 1, sum holes: 4 */
+   /* last cacheline: 48 bytes */
+} __attribute__((__aligned__(8)));

Thanks,

-- 
Anthony PERARD

Re: [XEN PATCH 2/6] xen/arm: ffa: move common things to ffa_private.h

Hi Jens,

> On 25 Mar 2024, at 10:39, Jens Wiklander  wrote:
> 
> Prepare to separate ffa.c into modules by moving common things into the
> new internal header file ffa_private.h.
> 
> Signed-off-by: Jens Wiklander 

Reviewed-by: Bertrand Marquis 

Cheers
Bertrand

> ---
> xen/arch/arm/tee/ffa.c | 298 +-
> xen/arch/arm/tee/ffa_private.h | 318 +
> 2 files changed, 319 insertions(+), 297 deletions(-)
> create mode 100644 xen/arch/arm/tee/ffa_private.h
> 
> diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c
> index 0344a0f17e72..259851f20bdb 100644
> --- a/xen/arch/arm/tee/ffa.c
> +++ b/xen/arch/arm/tee/ffa.c
> @@ -63,204 +63,7 @@
> #include 
> #include 
> 
> -/* Error codes */
> -#define FFA_RET_OK  0
> -#define FFA_RET_NOT_SUPPORTED   -1
> -#define FFA_RET_INVALID_PARAMETERS  -2
> -#define FFA_RET_NO_MEMORY   -3
> -#define FFA_RET_BUSY-4
> -#define FFA_RET_INTERRUPTED -5
> -#define FFA_RET_DENIED  -6
> -#define FFA_RET_RETRY   -7
> -#define FFA_RET_ABORTED -8
> -
> -/* FFA_VERSION helpers */
> -#define FFA_VERSION_MAJOR_SHIFT 16U
> -#define FFA_VERSION_MAJOR_MASK  0x7FFFU
> -#define FFA_VERSION_MINOR_SHIFT 0U
> -#define FFA_VERSION_MINOR_MASK  0xU
> -#define MAKE_FFA_VERSION(major, minor)  \
> -major) & FFA_VERSION_MAJOR_MASK) << FFA_VERSION_MAJOR_SHIFT) | \
> - ((minor) & FFA_VERSION_MINOR_MASK))
> -
> -#define FFA_VERSION_1_0 MAKE_FFA_VERSION(1, 0)
> -#define FFA_VERSION_1_1 MAKE_FFA_VERSION(1, 1)
> -/* The minimal FF-A version of the SPMC that can be supported */
> -#define FFA_MIN_SPMC_VERSIONFFA_VERSION_1_1
> -
> -/*
> - * This is the version we want to use in communication with guests and SPs.
> - * During negotiation with a guest or a SP we may need to lower it for
> - * that particular guest or SP.
> - */
> -#define FFA_MY_VERSION_MAJOR1U
> -#define FFA_MY_VERSION_MINOR1U
> -#define FFA_MY_VERSION  MAKE_FFA_VERSION(FFA_MY_VERSION_MAJOR, \
> - FFA_MY_VERSION_MINOR)
> -
> -/*
> - * The FF-A specification explicitly works with 4K pages as a measure of
> - * memory size, for example, FFA_RXTX_MAP takes one parameter "RX/TX page
> - * count" which is the number of contiguous 4K pages allocated. Xen may use
> - * a different page size depending on the configuration to avoid confusion
> - * with PAGE_SIZE use a special define when it's a page size as in the FF-A
> - * specification.
> - */
> -#define FFA_PAGE_SIZE   SZ_4K
> -
> -/*
> - * The number of pages used for each of the RX and TX buffers shared with
> - * the SPMC.
> - */
> -#define FFA_RXTX_PAGE_COUNT 1
> -
> -/*
> - * Limit the number of pages RX/TX buffers guests can map.
> - * TODO support a larger number.
> - */
> -#define FFA_MAX_RXTX_PAGE_COUNT 1
> -
> -/*
> - * Limit for shared buffer size. Please note that this define limits
> - * number of pages.
> - *
> - * FF-A doesn't have any direct requirements on GlobalPlatform or vice
> - * versa, but an implementation can very well use FF-A in order to provide
> - * a GlobalPlatform interface on top.
> - *
> - * Global Platform specification for TEE requires that any TEE
> - * implementation should allow to share buffers with size of at least
> - * 512KB, defined in TEEC-1.0C page 24, Table 4-1,
> - * TEEC_CONFIG_SHAREDMEM_MAX_SIZE.
> - * Due to overhead which can be hard to predict exactly, double this number
> - * to give a safe margin.
> - */
> -#define FFA_MAX_SHM_PAGE_COUNT  (2 * SZ_512K / FFA_PAGE_SIZE)
> -
> -/*
> - * Limits the number of shared buffers that guest can have at once. This
> - * is to prevent case, when guests trick XEN into exhausting its own
> - * memory by allocating many small buffers. This value has been chosen
> - * arbitrarily.
> - */
> -#define FFA_MAX_SHM_COUNT   32
> -
> -/*
> - * The time we wait until trying to tear down a domain again if it was
> - * blocked initially.
> - */
> -#define FFA_CTX_TEARDOWN_DELAY  SECONDS(1)
> -
> -/* FF-A-1.1-REL0 section 10.9.2 Memory region handle, page 167 */
> -#define FFA_HANDLE_HYP_FLAG BIT(63, ULL)
> -#define FFA_HANDLE_INVALID  0xULL
> -
> -/*
> - * Memory attributes: Normal memory, Write-Back cacheable, Inner shareable
> - * Defined in FF-A-1.1-REL0 Table 10.18 at page 175.
> - */
> -#define FFA_NORMAL_MEM_REG_ATTR 0x2fU
> -/*
> - * Memory access permissions: Read-write
> - * Defined in FF-A-1.1-REL0 Table 10.15 at page 168.
> - */
> -#define FFA_MEM_ACC_RW  0x2U
> -
> -/* FF-A-1.1-REL0 section 10.11.4 Flags usage, page 184-187 */
> -/* Clear memory before mapping in receiver */
> -#define FFA_MEMORY_REGION_FLAG_CLEARBIT(0, U)
> -/* Relayer may time sl

Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma

2024-03-27 Thread David Woodhouse

On 27 March 2024 13:31:52 GMT, Anthony PERARD  wrote:
>On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote:
>> Except imported source files, QEMU code base uses
>> the QEMU_ALIGNED() macro to align its structures.
>
>This patch only convert the alignment, but discard pack. We need both or
>the struct is changed.

Which means we need some build-time asserts on struct size and field offsets. 
That should never have passed a build test.

Re: [PATCH-for-9.0 v2 11/19] hw/xen/xen_arch_hvm: Rename prototypes using 'xen_arch_' prefix

On Tue, Nov 14, 2023 at 03:38:07PM +0100, Philippe Mathieu-Daudé wrote:
> Use a common 'xen_arch_' prefix for architecture-specific functions.
> Rename xen_arch_set_memory() and xen_arch_handle_ioreq().
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: David Woodhouse 
> Reviewed-by: Richard Henderson 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD

Re: [PATCH v7 08/14] xen/page_alloc: introduce preserved page flags macro

On 27.03.2024 14:28, Julien Grall wrote:
> Hi Carlo,
> 
> On 27/03/2024 11:10, Carlo Nonato wrote:
>> Hi guys,
>>
>>> Question is: How would you justify such a change? IOW I'm not convinced
>>> (yet) this wants doing there.
>>
>> You mean in this series?
>>
>>> Looking at the code, the flag is originally set in
>>> alloc_domheap_pages(). So I guess it would make sense to do it in
>>> free_domheap_pages().
>>
>> We don't hold the heap_lock there.
> Regardless of the safety question (I will answer below), count_info is 
> not meant to be protected by heap_lock. The lock is protecting the heap 
> and ensure we are not corrupting them when there are concurrent call to 
> free_heap_pages().
> 
>  > Is it safe to change count_info without it?
> 
> count_info is meant to be accessed locklessly. The flag PGC_extra cannot 
> be set/clear concurrently because you can't allocate a page that has not 
> yet been freed.
> 
> So it would be fine to use clear_bit(..., ...);

Actually we hardly ever use clear_bit() on count_info. Normally we use
ordinary C operators. Atomic (and otherwise lockless) updates are useful
only if done like this everywhere.

Jan

Re: [XEN PATCH 3/6] xen/arm: ffa: separate memory sharing routines

Hi Jens,

> On 25 Mar 2024, at 10:39, Jens Wiklander  wrote:
> 
> Move memory sharing routines into a separate file for easier navigation
> in the source code.
> 
> Add ffa_shm_domain_destroy() to isolate the ffa_shm things in
> ffa_domain_teardown_continue().
> 
> Signed-off-by: Jens Wiklander 

With the copyright date mentioned after fixed (which can be done on commit):
Reviewed-by: Bertrand Marquis  ---
> xen/arch/arm/tee/Makefile  |   1 +
> xen/arch/arm/tee/ffa.c | 708 +
> xen/arch/arm/tee/ffa_private.h |  10 +
> xen/arch/arm/tee/ffa_shm.c | 708 +
> 4 files changed, 729 insertions(+), 698 deletions(-)
> create mode 100644 xen/arch/arm/tee/ffa_shm.c
> 
> diff --git a/xen/arch/arm/tee/Makefile b/xen/arch/arm/tee/Makefile
> index 58a1015e40e0..0e683d23aa9d 100644
> --- a/xen/arch/arm/tee/Makefile
> +++ b/xen/arch/arm/tee/Makefile
> @@ -1,3 +1,4 @@
> obj-$(CONFIG_FFA) += ffa.o
> +obj-$(CONFIG_FFA) += ffa_shm.o
> obj-y += tee.o
> obj-$(CONFIG_OPTEE) += optee.o
> diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c
> index 259851f20bdb..db36292dc52f 100644
> --- a/xen/arch/arm/tee/ffa.c
> +++ b/xen/arch/arm/tee/ffa.c
> @@ -84,92 +84,6 @@ struct ffa_partition_info_1_1 {
> uint8_t uuid[16];
> };
> 
> -/* Constituent memory region descriptor */
> -struct ffa_address_range {
> -uint64_t address;
> -uint32_t page_count;
> -uint32_t reserved;
> -};
> -
> -/* Composite memory region descriptor */
> -struct ffa_mem_region {
> -uint32_t total_page_count;
> -uint32_t address_range_count;
> -uint64_t reserved;
> -struct ffa_address_range address_range_array[];
> -};
> -
> -/* Memory access permissions descriptor */
> -struct ffa_mem_access_perm {
> -uint16_t endpoint_id;
> -uint8_t perm;
> -uint8_t flags;
> -};
> -
> -/* Endpoint memory access descriptor */
> -struct ffa_mem_access {
> -struct ffa_mem_access_perm access_perm;
> -uint32_t region_offs;
> -uint64_t reserved;
> -};
> -
> -/* Lend, donate or share memory transaction descriptor */
> -struct ffa_mem_transaction_1_0 {
> -uint16_t sender_id;
> -uint8_t mem_reg_attr;
> -uint8_t reserved0;
> -uint32_t flags;
> -uint64_t handle;
> -uint64_t tag;
> -uint32_t reserved1;
> -uint32_t mem_access_count;
> -struct ffa_mem_access mem_access_array[];
> -};
> -
> -struct ffa_mem_transaction_1_1 {
> -uint16_t sender_id;
> -uint16_t mem_reg_attr;
> -uint32_t flags;
> -uint64_t handle;
> -uint64_t tag;
> -uint32_t mem_access_size;
> -uint32_t mem_access_count;
> -uint32_t mem_access_offs;
> -uint8_t reserved[12];
> -};
> -
> -/* Calculate offset of struct ffa_mem_access from start of buffer */
> -#define MEM_ACCESS_OFFSET(access_idx) \
> -( sizeof(struct ffa_mem_transaction_1_1) + \
> -  ( access_idx ) * sizeof(struct ffa_mem_access) )
> -
> -/* Calculate offset of struct ffa_mem_region from start of buffer */
> -#define REGION_OFFSET(access_count, region_idx) \
> -( MEM_ACCESS_OFFSET(access_count) + \
> -  ( region_idx ) * sizeof(struct ffa_mem_region) )
> -
> -/* Calculate offset of struct ffa_address_range from start of buffer */
> -#define ADDR_RANGE_OFFSET(access_count, region_count, range_idx) \
> -( REGION_OFFSET(access_count, region_count) + \
> -  ( range_idx ) * sizeof(struct ffa_address_range) )
> -
> -/*
> - * The parts needed from struct ffa_mem_transaction_1_0 or struct
> - * ffa_mem_transaction_1_1, used to provide an abstraction of difference in
> - * data structures between version 1.0 and 1.1. This is just an internal
> - * interface and can be changed without changing any ABI.
> - */
> -struct ffa_mem_transaction_int {
> -uint16_t sender_id;
> -uint8_t mem_reg_attr;
> -uint8_t flags;
> -uint8_t mem_access_size;
> -uint8_t mem_access_count;
> -uint16_t mem_access_offs;
> -uint64_t handle;
> -uint64_t tag;
> -};
> -
> /* Endpoint RX/TX descriptor */
> struct ffa_endpoint_rxtx_descriptor_1_0 {
> uint16_t sender_id;
> @@ -185,15 +99,6 @@ struct ffa_endpoint_rxtx_descriptor_1_1 {
> uint32_t tx_region_offs;
> };
> 
> -struct ffa_shm_mem {
> -struct list_head list;
> -uint16_t sender_id;
> -uint16_t ep_id; /* endpoint, the one lending */
> -uint64_t handle;/* FFA_HANDLE_INVALID if not set yet */
> -unsigned int page_count;
> -struct page_info *pages[];
> -};
> -
> /* Negotiated FF-A version to use with the SPMC */
> static uint32_t __ro_after_init ffa_version;
> 
> @@ -212,10 +117,10 @@ static uint16_t subscr_vm_destroyed_count __read_mostly;
>  * for calls which uses our RX buffer to deliver a result we must call
>  * ffa_rx_release() to let the SPMC know that we're done with the buffer.
>  */
> -static void *ffa_rx __read_mostly;
> -static void *ffa_tx __read_mostly;
> -static DEFINE_SPINLOCK(ffa_rx_buffer_lock);
> -static DEFINE_SPINLOCK(ffa_tx_buffer_lock

Re: [XEN PATCH 4/6] xen/arm: ffa: separate partition info get routines

Hi Jens,

> On 25 Mar 2024, at 10:39, Jens Wiklander  wrote:
> 
> Move partition info get routines into a separate file for easier
> navigation in the source code.
> 
> Add ffa_partinfo_init(), ffa_partinfo_domain_init(), and
> ffa_partinfo_domain_destroy() to handle the ffa_partinfo internal things
> on initialization and teardown.
> 
> Signed-off-by: Jens Wiklander 

Reviewed-by: Bertrand Marquis 

Cheers
Bertrand

> ---
> xen/arch/arm/tee/Makefile   |   1 +
> xen/arch/arm/tee/ffa.c  | 359 +-
> xen/arch/arm/tee/ffa_partinfo.c | 373 
> xen/arch/arm/tee/ffa_private.h  |  14 +-
> 4 files changed, 398 insertions(+), 349 deletions(-)
> create mode 100644 xen/arch/arm/tee/ffa_partinfo.c
> 
> diff --git a/xen/arch/arm/tee/Makefile b/xen/arch/arm/tee/Makefile
> index 0e683d23aa9d..be644fba8055 100644
> --- a/xen/arch/arm/tee/Makefile
> +++ b/xen/arch/arm/tee/Makefile
> @@ -1,4 +1,5 @@
> obj-$(CONFIG_FFA) += ffa.o
> obj-$(CONFIG_FFA) += ffa_shm.o
> +obj-$(CONFIG_FFA) += ffa_partinfo.o
> obj-y += tee.o
> obj-$(CONFIG_OPTEE) += optee.o
> diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c
> index db36292dc52f..7a2803881420 100644
> --- a/xen/arch/arm/tee/ffa.c
> +++ b/xen/arch/arm/tee/ffa.c
> @@ -70,20 +70,6 @@
>  * structs ending with _1_1 are defined in FF-A-1.1-REL0.
>  */
> 
> -/* Partition information descriptor */
> -struct ffa_partition_info_1_0 {
> -uint16_t id;
> -uint16_t execution_context;
> -uint32_t partition_properties;
> -};
> -
> -struct ffa_partition_info_1_1 {
> -uint16_t id;
> -uint16_t execution_context;
> -uint32_t partition_properties;
> -uint8_t uuid[16];
> -};
> -
> /* Endpoint RX/TX descriptor */
> struct ffa_endpoint_rxtx_descriptor_1_0 {
> uint16_t sender_id;
> @@ -102,11 +88,6 @@ struct ffa_endpoint_rxtx_descriptor_1_1 {
> /* Negotiated FF-A version to use with the SPMC */
> static uint32_t __ro_after_init ffa_version;
> 
> -/* SPs subscribing to VM_CREATE and VM_DESTROYED events */
> -static uint16_t *subscr_vm_created __read_mostly;
> -static uint16_t subscr_vm_created_count __read_mostly;
> -static uint16_t *subscr_vm_destroyed __read_mostly;
> -static uint16_t subscr_vm_destroyed_count __read_mostly;
> 
> /*
>  * Our rx/tx buffers shared with the SPMC. FFA_RXTX_PAGE_COUNT is the
> @@ -170,90 +151,6 @@ static int32_t ffa_rxtx_map(paddr_t tx_addr, paddr_t 
> rx_addr,
> return ffa_simple_call(FFA_RXTX_MAP_64, tx_addr, rx_addr, page_count, 0);
> }
> 
> -static int32_t ffa_partition_info_get(uint32_t w1, uint32_t w2, uint32_t w3,
> -  uint32_t w4, uint32_t w5,
> -  uint32_t *count, uint32_t *fpi_size)
> -{
> -const struct arm_smccc_1_2_regs arg = {
> -.a0 = FFA_PARTITION_INFO_GET,
> -.a1 = w1,
> -.a2 = w2,
> -.a3 = w3,
> -.a4 = w4,
> -.a5 = w5,
> -};
> -struct arm_smccc_1_2_regs resp;
> -uint32_t ret;
> -
> -arm_smccc_1_2_smc(&arg, &resp);
> -
> -ret = ffa_get_ret_code(&resp);
> -if ( !ret )
> -{
> -*count = resp.a2;
> -*fpi_size = resp.a3;
> -}
> -
> -return ret;
> -}
> -
> -static int32_t ffa_rx_release(void)
> -{
> -return ffa_simple_call(FFA_RX_RELEASE, 0, 0, 0, 0);
> -}
> -
> -static int32_t ffa_direct_req_send_vm(uint16_t sp_id, uint16_t vm_id,
> -  uint8_t msg)
> -{
> -uint32_t exp_resp = FFA_MSG_FLAG_FRAMEWORK;
> -unsigned int retry_count = 0;
> -int32_t res;
> -
> -if ( msg == FFA_MSG_SEND_VM_CREATED )
> -exp_resp |= FFA_MSG_RESP_VM_CREATED;
> -else if ( msg == FFA_MSG_SEND_VM_DESTROYED )
> -exp_resp |= FFA_MSG_RESP_VM_DESTROYED;
> -else
> -return FFA_RET_INVALID_PARAMETERS;
> -
> -do {
> -const struct arm_smccc_1_2_regs arg = {
> -.a0 = FFA_MSG_SEND_DIRECT_REQ_32,
> -.a1 = sp_id,
> -.a2 = FFA_MSG_FLAG_FRAMEWORK | msg,
> -.a5 = vm_id,
> -};
> -struct arm_smccc_1_2_regs resp;
> -
> -arm_smccc_1_2_smc(&arg, &resp);
> -if ( resp.a0 != FFA_MSG_SEND_DIRECT_RESP_32 || resp.a2 != exp_resp )
> -{
> -/*
> - * This is an invalid response, likely due to some error in the
> - * implementation of the ABI.
> - */
> -return FFA_RET_INVALID_PARAMETERS;
> -}
> -res = resp.a3;
> -if ( ++retry_count > 10 )
> -{
> -/*
> - * TODO
> - * FFA_RET_INTERRUPTED means that the SPMC has a pending
> - * non-secure interrupt, we need a way of delivering that
> - * non-secure interrupt.
> - * FFA_RET_RETRY is the SP telling us that it's temporarily
> - * blocked from handling the direct request, we need a generic
> - * way to deal with this.
> -

Re: [XEN PATCH 5/6] xen/arm: ffa: separate rxtx buffer routines

Hi Jens,

> On 25 Mar 2024, at 10:39, Jens Wiklander  wrote:
> 
> Move rxtx buffer routines into a spearate file for easier navigation in
> the source code.
> 
> Add ffa_rxtx_init(), ffa_rxtx_destroy(), and ffa_rxtx_domain_destroy() to
> handle the ffa_rxtx internal things on initialization and teardown.
> 
> Signed-off-by: Jens Wiklander 

Reviewed-by: Bertrand Marquis 

Cheers
Bertrand

> ---
> xen/arch/arm/tee/Makefile  |   1 +
> xen/arch/arm/tee/ffa.c | 174 +-
> xen/arch/arm/tee/ffa_private.h |   7 ++
> xen/arch/arm/tee/ffa_rxtx.c| 216 +
> 4 files changed, 229 insertions(+), 169 deletions(-)
> create mode 100644 xen/arch/arm/tee/ffa_rxtx.c
> 
> diff --git a/xen/arch/arm/tee/Makefile b/xen/arch/arm/tee/Makefile
> index be644fba8055..f0112a2f922d 100644
> --- a/xen/arch/arm/tee/Makefile
> +++ b/xen/arch/arm/tee/Makefile
> @@ -1,5 +1,6 @@
> obj-$(CONFIG_FFA) += ffa.o
> obj-$(CONFIG_FFA) += ffa_shm.o
> obj-$(CONFIG_FFA) += ffa_partinfo.o
> +obj-$(CONFIG_FFA) += ffa_rxtx.o
> obj-y += tee.o
> obj-$(CONFIG_OPTEE) += optee.o
> diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c
> index 7a2803881420..4f7775b8c890 100644
> --- a/xen/arch/arm/tee/ffa.c
> +++ b/xen/arch/arm/tee/ffa.c
> @@ -65,26 +65,6 @@
> 
> #include "ffa_private.h"
> 
> -/*
> - * Structs below ending with _1_0 are defined in FF-A-1.0-REL and
> - * structs ending with _1_1 are defined in FF-A-1.1-REL0.
> - */
> -
> -/* Endpoint RX/TX descriptor */
> -struct ffa_endpoint_rxtx_descriptor_1_0 {
> -uint16_t sender_id;
> -uint16_t reserved;
> -uint32_t rx_range_count;
> -uint32_t tx_range_count;
> -};
> -
> -struct ffa_endpoint_rxtx_descriptor_1_1 {
> -uint16_t sender_id;
> -uint16_t reserved;
> -uint32_t rx_region_offs;
> -uint32_t tx_region_offs;
> -};
> -
> /* Negotiated FF-A version to use with the SPMC */
> static uint32_t __ro_after_init ffa_version;
> 
> @@ -145,12 +125,6 @@ static bool check_mandatory_feature(uint32_t id)
> return !ret;
> }
> 
> -static int32_t ffa_rxtx_map(paddr_t tx_addr, paddr_t rx_addr,
> -uint32_t page_count)
> -{
> -return ffa_simple_call(FFA_RXTX_MAP_64, tx_addr, rx_addr, page_count, 0);
> -}
> -
> static void handle_version(struct cpu_user_regs *regs)
> {
> struct domain *d = current->domain;
> @@ -166,127 +140,6 @@ static void handle_version(struct cpu_user_regs *regs)
> ffa_set_regs(regs, vers, 0, 0, 0, 0, 0, 0, 0);
> }
> 
> -static uint32_t ffa_handle_rxtx_map(uint32_t fid, register_t tx_addr,
> -register_t rx_addr, uint32_t page_count)
> -{
> -uint32_t ret = FFA_RET_INVALID_PARAMETERS;
> -struct domain *d = current->domain;
> -struct ffa_ctx *ctx = d->arch.tee;
> -struct page_info *tx_pg;
> -struct page_info *rx_pg;
> -p2m_type_t t;
> -void *rx;
> -void *tx;
> -
> -if ( !smccc_is_conv_64(fid) )
> -{
> -/*
> - * Calls using the 32-bit calling convention must ignore the upper
> - * 32 bits in the argument registers.
> - */
> -tx_addr &= UINT32_MAX;
> -rx_addr &= UINT32_MAX;
> -}
> -
> -if ( page_count > FFA_MAX_RXTX_PAGE_COUNT )
> -{
> -printk(XENLOG_ERR "ffa: RXTX_MAP: error: %u pages requested (limit 
> %u)\n",
> -   page_count, FFA_MAX_RXTX_PAGE_COUNT);
> -return FFA_RET_INVALID_PARAMETERS;
> -}
> -
> -/* Already mapped */
> -if ( ctx->rx )
> -return FFA_RET_DENIED;
> -
> -tx_pg = get_page_from_gfn(d, gfn_x(gaddr_to_gfn(tx_addr)), &t, 
> P2M_ALLOC);
> -if ( !tx_pg )
> -return FFA_RET_INVALID_PARAMETERS;
> -
> -/* Only normal RW RAM for now */
> -if ( t != p2m_ram_rw )
> -goto err_put_tx_pg;
> -
> -rx_pg = get_page_from_gfn(d, gfn_x(gaddr_to_gfn(rx_addr)), &t, 
> P2M_ALLOC);
> -if ( !tx_pg )
> -goto err_put_tx_pg;
> -
> -/* Only normal RW RAM for now */
> -if ( t != p2m_ram_rw )
> -goto err_put_rx_pg;
> -
> -tx = __map_domain_page_global(tx_pg);
> -if ( !tx )
> -goto err_put_rx_pg;
> -
> -rx = __map_domain_page_global(rx_pg);
> -if ( !rx )
> -goto err_unmap_tx;
> -
> -ctx->rx = rx;
> -ctx->tx = tx;
> -ctx->rx_pg = rx_pg;
> -ctx->tx_pg = tx_pg;
> -ctx->page_count = page_count;
> -ctx->rx_is_free = true;
> -return FFA_RET_OK;
> -
> -err_unmap_tx:
> -unmap_domain_page_global(tx);
> -err_put_rx_pg:
> -put_page(rx_pg);
> -err_put_tx_pg:
> -put_page(tx_pg);
> -
> -return ret;
> -}
> -
> -static void rxtx_unmap(struct ffa_ctx *ctx)
> -{
> -unmap_domain_page_global(ctx->rx);
> -unmap_domain_page_global(ctx->tx);
> -put_page(ctx->rx_pg);
> -put_page(ctx->tx_pg);
> -ctx->rx = NULL;
> -ctx->tx = NULL;
> -ctx->rx_pg = NULL;
> -ctx->tx_pg = NULL;
> -ctx->page_count = 0;
> -ctx->rx_is_free = false;
> -}
> -
> -static uint32_t ffa_handle_rxtx_unmap(v

Re: [XEN PATCH 6/6] xen/arm: ffa: support FFA_FEATURES

Hi Jens,

> On 25 Mar 2024, at 10:39, Jens Wiklander  wrote:
> 
> Add support for the mandatory FF-A ABI function FFA_FEATURES.
> 
> Signed-off-by: Jens Wiklander 

Reviewed-by: Bertrand Marquis 

Cheers
Bertrand

> ---
> xen/arch/arm/tee/ffa.c | 57 ++
> 1 file changed, 57 insertions(+)
> 
> diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c
> index 4f7775b8c890..8665201e34a9 100644
> --- a/xen/arch/arm/tee/ffa.c
> +++ b/xen/arch/arm/tee/ffa.c
> @@ -192,6 +192,60 @@ out:
>  resp.a7 & mask);
> }
> 
> +static void handle_features(struct cpu_user_regs *regs)
> +{
> +uint32_t a1 = get_user_reg(regs, 1);
> +unsigned int n;
> +
> +for ( n = 2; n <= 7; n++ )
> +{
> +if ( get_user_reg(regs, n) )
> +{
> +ffa_set_regs_error(regs, FFA_RET_NOT_SUPPORTED);
> +return;
> +}
> +}
> +
> +switch ( a1 )
> +{
> +case FFA_ERROR:
> +case FFA_VERSION:
> +case FFA_SUCCESS_32:
> +case FFA_SUCCESS_64:
> +case FFA_FEATURES:
> +case FFA_ID_GET:
> +case FFA_RX_RELEASE:
> +case FFA_RXTX_UNMAP:
> +case FFA_MEM_RECLAIM:
> +case FFA_PARTITION_INFO_GET:
> +case FFA_MSG_SEND_DIRECT_REQ_32:
> +case FFA_MSG_SEND_DIRECT_REQ_64:
> +ffa_set_regs_success(regs, 0, 0);
> +break;
> +case FFA_MEM_SHARE_64:
> +case FFA_MEM_SHARE_32:
> +/*
> + * We currently don't support dynamically allocated buffers. Report
> + * that with 0 in bit[0] of w2.
> + */
> +ffa_set_regs_success(regs, 0, 0);
> +break;
> +case FFA_RXTX_MAP_64:
> +case FFA_RXTX_MAP_32:
> +/*
> + * We currently support 4k pages only, report that as 00 in
> + * bit[0:1] in w0. This needs to be revised if Xen page size
> + * differs from FFA_PAGE_SIZE (SZ_4K).
> + */
> +BUILD_BUG_ON(PAGE_SIZE != FFA_PAGE_SIZE);
> +ffa_set_regs_success(regs, 0, 0);
> +break;
> +default:
> +ffa_set_regs_error(regs, FFA_RET_NOT_SUPPORTED);
> +break;
> +}
> +}
> +
> static bool ffa_handle_call(struct cpu_user_regs *regs)
> {
> uint32_t fid = get_user_reg(regs, 0);
> @@ -212,6 +266,9 @@ static bool ffa_handle_call(struct cpu_user_regs *regs)
> case FFA_ID_GET:
> ffa_set_regs_success(regs, ffa_get_vm_id(d), 0);
> return true;
> +case FFA_FEATURES:
> +handle_features(regs);
> +return true;
> case FFA_RXTX_MAP_32:
> case FFA_RXTX_MAP_64:
> e = ffa_handle_rxtx_map(fid, get_user_reg(regs, 1),
> -- 
> 2.34.1
>

Re: [XEN PATCH 0/6] FF-A mediator reorganisation

Hi Jens,

> On 25 Mar 2024, at 10:38, Jens Wiklander  wrote:
> 
> Hi,
> 
> The FF-A mediator is reorganized into modules for easier maintenance and to
> prepare for future changes. This patch set is not expected add any changed
> behaviour, except for the "xen/arm: ffa: support FFA_FEATURES" patch.

I reviewed the serie and there is only one tiny fix of copyright date that
can be done on commit.

Appart from that, I tested FF-A with the whole serie applied and I can
confirm it works on my side so for the whole serie:

Tested-by: Bertrand Marquis 

Cheers
Bertrand

> 
> Thanks,
> Jens
> 
> Jens Wiklander (6):
>  xen/arm: ffa: rename functions to use ffa_ prefix
>  xen/arm: ffa: move common things to ffa_private.h
>  xen/arm: ffa: separate memory sharing routines
>  xen/arm: ffa: separate partition info get routines
>  xen/arm: ffa: separate rxtx buffer routines
>  xen/arm: ffa: support FFA_FEATURES
> 
> xen/arch/arm/tee/Makefile   |3 +
> xen/arch/arm/tee/ffa.c  | 1629 ++-
> xen/arch/arm/tee/ffa_partinfo.c |  373 +++
> xen/arch/arm/tee/ffa_private.h  |  347 +++
> xen/arch/arm/tee/ffa_rxtx.c |  216 
> xen/arch/arm/tee/ffa_shm.c  |  708 ++
> 6 files changed, 1750 insertions(+), 1526 deletions(-)
> create mode 100644 xen/arch/arm/tee/ffa_partinfo.c
> create mode 100644 xen/arch/arm/tee/ffa_private.h
> create mode 100644 xen/arch/arm/tee/ffa_rxtx.c
> create mode 100644 xen/arch/arm/tee/ffa_shm.c
> 
> -- 
> 2.34.1
>

Re: [PATCH v7 08/14] xen/page_alloc: introduce preserved page flags macro

2024-03-27 Thread Julien Grall





On 27/03/2024 13:38, Jan Beulich wrote:

On 27.03.2024 14:28, Julien Grall wrote:

Hi Carlo,

On 27/03/2024 11:10, Carlo Nonato wrote:

Hi guys,


Question is: How would you justify such a change? IOW I'm not convinced
(yet) this wants doing there.


You mean in this series?


Looking at the code, the flag is originally set in
alloc_domheap_pages(). So I guess it would make sense to do it in
free_domheap_pages().


We don't hold the heap_lock there.

Regardless of the safety question (I will answer below), count_info is
not meant to be protected by heap_lock. The lock is protecting the heap
and ensure we are not corrupting them when there are concurrent call to
free_heap_pages().

  > Is it safe to change count_info without it?

count_info is meant to be accessed locklessly. The flag PGC_extra cannot
be set/clear concurrently because you can't allocate a page that has not
yet been freed.

So it would be fine to use clear_bit(..., ...);


Actually we hardly ever use clear_bit() on count_info. Normally we use
ordinary C operators.


I knew you would say that. I am not convince it is safe to always using 
count_info without any atomic operations. But I never got around to 
check all them.



Atomic (and otherwise lockless) updates are useful
only if done like this everywhere.


You are right. But starting to use the bitops is not going to hurt 
anyone (other than maybe performance, but once we convert all of them, 
then this will become moot). In fact, it helps start to slowly move 
towards the aim to have count_info safe.


Cheers,

--
Julien Grall

Re: preparations for 4.17.4

On 27/03/2024 12:33 pm, Andrew Cooper wrote:
> On 27/03/2024 12:23 pm, Jan Beulich wrote:
>> All,
>>
>> the release is due in two to three weeks. Please point out backports you find
>> missing from the respective staging branch, but which you consider relevant.
>>
>> Note that this is going to be the last Xen Project coordinated ordinary 
>> stable
>> release from this branch; the branch will move into security-only support 
>> mode
>> afterwards.
> 1) livepatching of .rodata:
>
> 989556c6f8ca - xen/virtual-region: Rename the start/end fields
> ef969144a425 - xen/virtual-region: Include rodata pointers
> b083b1c393dc - x86/livepatch: Relax permissions on rodata too
>
> And technically "x86/mm: fix detection of last L1 entry in
> modify_xen_mappings_lite()" too but you've already backported this one.
>
> Patching .rodata worked before Xen 4.17, and was broken (left as a TODO)
> when I adjusted Xen to stop using CR0.WP=0 for patching.
>
>
> 2) Policy fixes:
>
> e2d8a6522516 - x86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max
> policies
>
> This is a real bugfix for a real regression we found updating from Xen
> 4.13 -> 4.17.  It has a dependency on
>
> 5420aa165dfa - x86/cpu-policy: Hide x2APIC from PV guests
>
> which I know you had more concern with.  FWIW, I'm certain its a good
> fix, and should be backported.
>
>
> 3) Test fixes:
>
> 0263dc9069dd - tests/resource: Fix HVM guest in !SHADOW builds
>
> It's minor, but does make a difference for those of us who run these
> tests regularly.
>
>
> 4) Watchdog fixes:
>
> 9e18f339830c - x86/boot: Improve the boot watchdog determination of
> stuck cpus
> 131892e0dcc1 - x86/boot: Support the watchdog on newer AMD systems
>
> You took "x86/boot: Fix setup_apic_nmi_watchdog() to fail more cleanly"
> and the first of the two patches is in the same category IMO.  The
> second I also feel ok to take for the in-support releases, particularly
> as all it is doing is dropping a family list.
>
>
> 5) Ucode scan stability  (For 4.18 only)
>
> Xen 4.18 had "x86/ucode: Refresh raw CPU policy after microcode load" in
> it's .0 release, so should also gain:
>
> cf7fe8b72dea - x86/ucode: Fix stability of the raw CPU Policy rescan
>
> I've only noticed because I've got them both backported to 4.17 in
> XenServer, but I don't think upstream wants to take that route.

It occurs to me that these want considering:

b6cf604207fd - tools/oxenstored: Use Map instead of Hashtbl for quotas
098d868e52ac - tools/oxenstored: Make Quota.t pure

while 4.17 is still in general support.  These came from a performance
regression we investigated.

I've done the backport to 4.17 and they're not entirely trivial (owing
to the major source reformat in 4.18) so can commit them if you'd prefer.

~Andrew

[linux-linus test] 185166: tolerable FAIL - PUSHED

flight 185166 linux-linus real [real]
flight 185174 linux-linus real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/185166/
http://logs.test-lab.xenproject.org/osstest/logs/185174/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-armhf-armhf-libvirt-vhd  8 xen-bootfail pass in 185174-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt-vhd 14 migrate-support-check fail in 185174 never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-check fail in 185174 never 
pass
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185160
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185160
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185160
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185160
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185160
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185160
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-qcow214 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-qcow215 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-xl-raw  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-raw  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass

version targeted for testing:
 linux7033999ecd7b8cf9ea59265035a0150961e023ee
baseline version:
 linux928a87efa42302a23bb9554be081a28058495f22

Last test of basis   185160  2024-03-25 18:44:17 Z1 days
Testing same since   185166  2024-03-26 21:44:24 Z0 days1 attempts


People who touched revisions under test:
  John Ogness 
  Linus Torvalds 
  Petr Mladek 
  Uwe Kleine-König 
  Zoltan HERPAI 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-a

Re: preparations for 4.17.4

On 27.03.2024 15:01, Andrew Cooper wrote:
> It occurs to me that these want considering:
> 
> b6cf604207fd - tools/oxenstored: Use Map instead of Hashtbl for quotas
> 098d868e52ac - tools/oxenstored: Make Quota.t pure
> 
> while 4.17 is still in general support.  These came from a performance
> regression we investigated.
> 
> I've done the backport to 4.17 and they're not entirely trivial (owing
> to the major source reformat in 4.18) so can commit them if you'd prefer.

Didn't you bring these up for 4.18.1 already, and I said that I'd leave
this for the maintainers to decide? Same here, in any event. Cc-ing them
both, just in case.

Jan

Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"


On 2024-03-27 04:59, Roger Pau Monné wrote:

On Wed, Mar 27, 2024 at 08:22:41AM +0100, Jan Beulich wrote:

On 26.03.2024 22:38, Jason Andryuk wrote:

A new ELF note will specify the alignment for a relocatable PVH kernel.
ELF notes are suitable for vmlinux and other ELF files, so this
Linux-specific bzImage parsing in unnecessary.

This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40.

Signed-off-by: Jason Andryuk 


Since you keep re-sending this: In private discussion Roger has indicated
that, like me, he too would prefer falling back to the ELF data, before
falling back to the arch default (Roger, please correct me if I got it
wrong). That would make it necessary that the change you're proposing to
revert here is actually kept.


Sorry, was meaning to reply yesterday but Jason is very fast at
sending new version so I'm always one version behind.


:)

I was hoping to finish this up and get it in...


IMO the order: ELF note, PHDR alignment, arch default should be the
preferred one.


Or wait - what you're reverting is taking the alignment out of the
bzImage header. I don't expect the BSDs to use that protocol; aiui that's
entirely Linux-specific.


Yeah, I don't have strong opinions in keeping this, we already do
bzImage parsing, so we might as well attempt to fetch the alignment
from there if correct:

ELF note, bzImage kernel_alignment, ELF PHDR alignment, arch default


I'm not sure how to handle ELF PHDR vs. arch default.  ELF PHDR will 
always be set, AFAIU.  Should that always be respected, which means we 
don't need an arch default?


To include arch default, it would be something like this:

if ( parms->phys_align != UNSET_ADDR )
align = parms->phys_align;
else if ( bz_align )
align = bz_align;
else if ( elf->palign > PHYS32_RELOC_ALIGN_DEFAULT )
align = elf->palign;
else
align = PHYS32_RELOC_ALIGN_DEFAULT;



I further meanwhile realized that consulting the ELF phdrs may also be
ambiguous, as there may be more than one. I guess it would need to be the
maximum of all of them then.


My suggestion (not sure if I mentioned this before) was to use the
alignment of the first LOAD PHDR, which is the one that defines the
value of the dest_base field used as the image load start address.

Using the maximum of all load PHDRs might be safer.


I'll find the maximum.

Thanks,
Jason

Re: [PATCH v5] RFC: x86/pvh: Make Xen PVH entrypoint PIC for x86-64


On 2024-03-27 04:20, Jan Beulich wrote:

On 26.03.2024 22:47, Jason Andryuk wrote:

--- a/include/xen/interface/elfnote.h
+++ b/include/xen/interface/elfnote.h
@@ -185,9 +185,25 @@
   */
  #define XEN_ELFNOTE_PHYS32_ENTRY 18
  
+/*

+ * Physical loading constraints for PVH kernels
+ *
+ * Used to place constraints on the guest physical loading addresses and
+ * alignment for a PVH kernel.
+ *
+ * The presence of this note indicates the kernel supports relocating itself.
+ *
+ * The note may include up to three 32bit values in the following order:
+ *  - a maximum address for the entire image to be loaded below (default
+ *  0x)
+ *  - a minimum address for the start of the image (default 0)
+ *  - a required start alignment (default 0x20)


This looks to be stale now.


Yes - I did not update this file.

The patch is functional as the ELF Note fields in 
arch/x86/platform/pvh/head.S were re-ordered to match the Xen side.


Regards,
Jason

Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"

On 27.03.2024 15:08, Jason Andryuk wrote:
> On 2024-03-27 04:59, Roger Pau Monné wrote:
>> On Wed, Mar 27, 2024 at 08:22:41AM +0100, Jan Beulich wrote:
>>> On 26.03.2024 22:38, Jason Andryuk wrote:
 A new ELF note will specify the alignment for a relocatable PVH kernel.
 ELF notes are suitable for vmlinux and other ELF files, so this
 Linux-specific bzImage parsing in unnecessary.

 This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40.

 Signed-off-by: Jason Andryuk 
>>>
>>> Since you keep re-sending this: In private discussion Roger has indicated
>>> that, like me, he too would prefer falling back to the ELF data, before
>>> falling back to the arch default (Roger, please correct me if I got it
>>> wrong). That would make it necessary that the change you're proposing to
>>> revert here is actually kept.
>>
>> Sorry, was meaning to reply yesterday but Jason is very fast at
>> sending new version so I'm always one version behind.
> 
> :)
> 
> I was hoping to finish this up and get it in...
> 
>> IMO the order: ELF note, PHDR alignment, arch default should be the
>> preferred one.
>>
>>> Or wait - what you're reverting is taking the alignment out of the
>>> bzImage header. I don't expect the BSDs to use that protocol; aiui that's
>>> entirely Linux-specific.
>>
>> Yeah, I don't have strong opinions in keeping this, we already do
>> bzImage parsing, so we might as well attempt to fetch the alignment
>> from there if correct:
>>
>> ELF note, bzImage kernel_alignment, ELF PHDR alignment, arch default
> 
> I'm not sure how to handle ELF PHDR vs. arch default.  ELF PHDR will 
> always be set, AFAIU.  Should that always be respected, which means we 
> don't need an arch default?

A value of 0 (and 1) is specifically permitted, to indicate no alignment.
We may take 0 to mean default, but what you suggest below is another
plausible approach. Yet another might be to take anything below PAGE_SIZE
as "use default".

> To include arch default, it would be something like this:
> 
>  if ( parms->phys_align != UNSET_ADDR )
>  align = parms->phys_align;
>  else if ( bz_align )
>  align = bz_align;

Why do you include bz again here? Didn't you previously indicate the
header field can't be relied upon? Which is also why, finally, I committed
this revert earlier today.

Jan

>  else if ( elf->palign > PHYS32_RELOC_ALIGN_DEFAULT )
>  align = elf->palign;
>  else
>  align = PHYS32_RELOC_ALIGN_DEFAULT;
> 
> 
>>> I further meanwhile realized that consulting the ELF phdrs may also be
>>> ambiguous, as there may be more than one. I guess it would need to be the
>>> maximum of all of them then.
>>
>> My suggestion (not sure if I mentioned this before) was to use the
>> alignment of the first LOAD PHDR, which is the one that defines the
>> value of the dest_base field used as the image load start address.
>>
>> Using the maximum of all load PHDRs might be safer.
> 
> I'll find the maximum.
> 
> Thanks,
> Jason

Re: [PATCH-for-9.0 v2 12/19] hw/xen: Merge 'hw/xen/arch_hvm.h' in 'hw/xen/xen-hvm-common.h'

On Tue, Nov 14, 2023 at 03:38:08PM +0100, Philippe Mathieu-Daudé wrote:
> We don't need a target-specific header for common target-specific
> prototypes. Declare xen_arch_handle_ioreq() and xen_arch_set_memory()
> in "hw/xen/xen-hvm-common.h".
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: David Woodhouse 
> Reviewed-by: Richard Henderson 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD

Re: preparations for 4.17.4

On 27/03/2024 2:06 pm, Jan Beulich wrote:
> On 27.03.2024 15:01, Andrew Cooper wrote:
>> It occurs to me that these want considering:
>>
>> b6cf604207fd - tools/oxenstored: Use Map instead of Hashtbl for quotas
>> 098d868e52ac - tools/oxenstored: Make Quota.t pure
>>
>> while 4.17 is still in general support.  These came from a performance
>> regression we investigated.
>>
>> I've done the backport to 4.17 and they're not entirely trivial (owing
>> to the major source reformat in 4.18) so can commit them if you'd prefer.
> Didn't you bring these up for 4.18.1 already, and I said that I'd leave
> this for the maintainers to decide? Same here, in any event. Cc-ing them
> both, just in case.

I could have sworn that I remembered requested this before, but couldn't
remember where.

I'll see about poking people for an answer.

~Andrew

Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"


On 2024-03-27 10:19, Jan Beulich wrote:

On 27.03.2024 15:08, Jason Andryuk wrote:

On 2024-03-27 04:59, Roger Pau Monné wrote:

On Wed, Mar 27, 2024 at 08:22:41AM +0100, Jan Beulich wrote:

On 26.03.2024 22:38, Jason Andryuk wrote:

A new ELF note will specify the alignment for a relocatable PVH kernel.
ELF notes are suitable for vmlinux and other ELF files, so this
Linux-specific bzImage parsing in unnecessary.

This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40.

Signed-off-by: Jason Andryuk 


Since you keep re-sending this: In private discussion Roger has indicated
that, like me, he too would prefer falling back to the ELF data, before
falling back to the arch default (Roger, please correct me if I got it
wrong). That would make it necessary that the change you're proposing to
revert here is actually kept.


Sorry, was meaning to reply yesterday but Jason is very fast at
sending new version so I'm always one version behind.


:)

I was hoping to finish this up and get it in...


IMO the order: ELF note, PHDR alignment, arch default should be the
preferred one.


Or wait - what you're reverting is taking the alignment out of the
bzImage header. I don't expect the BSDs to use that protocol; aiui that's
entirely Linux-specific.


Yeah, I don't have strong opinions in keeping this, we already do
bzImage parsing, so we might as well attempt to fetch the alignment
from there if correct:

ELF note, bzImage kernel_alignment, ELF PHDR alignment, arch default


I'm not sure how to handle ELF PHDR vs. arch default.  ELF PHDR will
always be set, AFAIU.  Should that always be respected, which means we
don't need an arch default?


A value of 0 (and 1) is specifically permitted, to indicate no alignment.
We may take 0 to mean default, but what you suggest below is another
plausible approach. Yet another might be to take anything below PAGE_SIZE
as "use default".


To include arch default, it would be something like this:

  if ( parms->phys_align != UNSET_ADDR )
  align = parms->phys_align;
  else if ( bz_align )
  align = bz_align;


Why do you include bz again here? Didn't you previously indicate the
header field can't be relied upon? Which is also why, finally, I committed
this revert earlier today.


Roger wanted to consult the bz value?  He mentioned it above.  Locally, 
I haven't synced with staging yet, and I dropped the revert to start 
re-working this...


If present, the bzImage header field is valid.  But being 
bzImage-specific, it isn't useful for other ELF files.  Xen will only 
move a kernel with the PHSY32_RELOC Note, so it can just specify an 
alignment if it needs to.


Personally, I think using the Note's value or a default is fine.  It 
seems like the PHDR aligment will just be 0x20 anyway (for x86-64 at 
lease), which matches the default.  Specifying the PHYS32_RELOC Note, 
but relying on a search for the alignment, seems slightly questionable 
to me.


Still, it seemed like the path of least resistance is to just implement 
the fallback search like Roger requested.  Dropping the bzImage, I guess 
I'd go with your PAGE_SIZE suggestions for:


if ( parms->phys_align != UNSET_ADDR )
align = parms->phys_align;
else if ( elf->palign > PAGE_SIZE )
align = elf->palign;
else
align = PHYS32_RELOC_ALIGN_DEFAULT;

Thanks for your reviews.

Regards,
Jason

Re: [PATCH] tools/oxenstored: Re-format

2024-03-27 Thread Edwin Torok

On Mon, Feb 26, 2024 at 10:48 AM Andrew Cooper
 wrote:
>
> Rerun make format.

Looks good, although not sure whether whitespace will be correctly
preserved in email, recommend using git to push the changes.
Reviewed-by: Edwin Török 

>
> Signed-off-by: Andrew Cooper 
> ---
> CC: Christian Lindig 
> CC: Edwin Török 
> CC: Rob Hoes 
> ---
>  tools/ocaml/xenstored/quota.ml | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/tools/ocaml/xenstored/quota.ml b/tools/ocaml/xenstored/quota.ml
> index 1f652040d898..082cd25f26fc 100644
> --- a/tools/ocaml/xenstored/quota.ml
> +++ b/tools/ocaml/xenstored/quota.ml
> @@ -55,13 +55,13 @@ let _check quota id size =
>  raise Data_too_big
>);
>if id > 0 then
> -  try
> -let entry = DomidMap.find id quota.cur in
> -if entry >= quota.maxent then (
> -  warn "domain %u cannot create entry: quota reached" id;
> -  raise Limit_reached
> -)
> -  with Not_found -> ()
> +try
> +  let entry = DomidMap.find id quota.cur in
> +  if entry >= quota.maxent then (
> +warn "domain %u cannot create entry: quota reached" id;
> +raise Limit_reached
> +  )
> +with Not_found -> ()
>
>  let check quota id size =
>if !activate then
> @@ -88,4 +88,4 @@ let merge orig_quota mod_quota dest_quota =
>  | diff -> update_entry dest id diff (* update with [x=x+diff] *)
>in
>{dest_quota with cur = DomidMap.fold fold_merge mod_quota.cur 
> dest_quota.cur}
> -  (* dest_quota = dest_quota + (mod_quota - orig_quota) *)
> +(* dest_quota = dest_quota + (mod_quota - orig_quota) *)
>
> base-commit: 8de3afc0b402bc17f65093a53e5870862707a8c7
> --
> 2.30.2
>

[PATCH v6 1/8] xen/spinlock: add explicit non-recursive locking functions

In order to prepare a type-safe recursive spinlock structure, add
explicitly non-recursive locking functions to be used for non-recursive
locking of spinlocks, which are used recursively, too.

Signed-off-by: Juergen Gross 
Acked-by: Jan Beulich 
---
V2:
- rename functions (Jan Beulich)
- get rid of !! in pcidevs_locked() (Jan Beulich)
V5:
- remove spurious change (Julien Grall)
- add nrspin_lock() description (Julien Grall)
---
 xen/arch/arm/mm.c |  4 ++--
 xen/arch/x86/domain.c | 12 ++--
 xen/arch/x86/mm.c | 12 ++--
 xen/arch/x86/mm/mem_sharing.c |  8 
 xen/arch/x86/mm/p2m-pod.c |  4 ++--
 xen/arch/x86/mm/p2m.c |  4 ++--
 xen/arch/x86/tboot.c  |  4 ++--
 xen/common/domctl.c   |  4 ++--
 xen/common/grant_table.c  | 10 +-
 xen/common/memory.c   |  4 ++--
 xen/common/numa.c |  4 ++--
 xen/common/page_alloc.c   | 16 
 xen/drivers/char/console.c| 16 
 xen/include/xen/spinlock.h| 29 +++--
 14 files changed, 74 insertions(+), 57 deletions(-)

diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index b15a18a494..def939172c 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -105,7 +105,7 @@ void share_xen_page_with_guest(struct page_info *page, 
struct domain *d,
 if ( page_get_owner(page) == d )
 return;
 
-spin_lock(&d->page_alloc_lock);
+nrspin_lock(&d->page_alloc_lock);
 
 /*
  * The incremented type count pins as writable or read-only.
@@ -136,7 +136,7 @@ void share_xen_page_with_guest(struct page_info *page, 
struct domain *d,
 page_list_add_tail(page, &d->xenpage_list);
 }
 
-spin_unlock(&d->page_alloc_lock);
+nrspin_unlock(&d->page_alloc_lock);
 }
 
 int xenmem_add_to_physmap_one(
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index a11c55f921..33a2830d9d 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -212,7 +212,7 @@ void dump_pageframe_info(struct domain *d)
 {
 unsigned long total[MASK_EXTR(PGT_type_mask, PGT_type_mask) + 1] = {};
 
-spin_lock(&d->page_alloc_lock);
+nrspin_lock(&d->page_alloc_lock);
 page_list_for_each ( page, &d->page_list )
 {
 unsigned int index = MASK_EXTR(page->u.inuse.type_info,
@@ -231,13 +231,13 @@ void dump_pageframe_info(struct domain *d)
_p(mfn_x(page_to_mfn(page))),
page->count_info, page->u.inuse.type_info);
 }
-spin_unlock(&d->page_alloc_lock);
+nrspin_unlock(&d->page_alloc_lock);
 }
 
 if ( is_hvm_domain(d) )
 p2m_pod_dump_data(d);
 
-spin_lock(&d->page_alloc_lock);
+nrspin_lock(&d->page_alloc_lock);
 
 page_list_for_each ( page, &d->xenpage_list )
 {
@@ -253,7 +253,7 @@ void dump_pageframe_info(struct domain *d)
page->count_info, page->u.inuse.type_info);
 }
 
-spin_unlock(&d->page_alloc_lock);
+nrspin_unlock(&d->page_alloc_lock);
 }
 
 void update_guest_memory_policy(struct vcpu *v,
@@ -2448,10 +2448,10 @@ int domain_relinquish_resources(struct domain *d)
 d->arch.auto_unmask = 0;
 }
 
-spin_lock(&d->page_alloc_lock);
+nrspin_lock(&d->page_alloc_lock);
 page_list_splice(&d->arch.relmem_list, &d->page_list);
 INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
-spin_unlock(&d->page_alloc_lock);
+nrspin_unlock(&d->page_alloc_lock);
 
 PROGRESS(xen):
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 62f5b811bb..b4d125db39 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -482,7 +482,7 @@ void share_xen_page_with_guest(struct page_info *page, 
struct domain *d,
 
 set_gpfn_from_mfn(mfn_x(page_to_mfn(page)), INVALID_M2P_ENTRY);
 
-spin_lock(&d->page_alloc_lock);
+nrspin_lock(&d->page_alloc_lock);
 
 /* The incremented type count pins as writable or read-only. */
 page->u.inuse.type_info =
@@ -502,7 +502,7 @@ void share_xen_page_with_guest(struct page_info *page, 
struct domain *d,
 page_list_add_tail(page, &d->xenpage_list);
 }
 
-spin_unlock(&d->page_alloc_lock);
+nrspin_unlock(&d->page_alloc_lock);
 }
 
 void make_cr3(struct vcpu *v, mfn_t mfn)
@@ -3597,11 +3597,11 @@ long do_mmuext_op(
 {
 bool drop_ref;
 
-spin_lock(&pg_owner->page_alloc_lock);
+nrspin_lock(&pg_owner->page_alloc_lock);
 drop_ref = (pg_owner->is_dying &&
 test_and_clear_bit(_PGT_pinned,
&page->u.inuse.type_info));
-spin_unlock(&pg_owner->page_alloc_lock);
+nrspin_unlock(&pg_owner->page_alloc_lock);
 if ( drop_ref )
 {
 pin_drop:
@@ -4424,7 +4424,7 @@ int steal_page(
  * that it might be upon return from alloc_domheap_pages

[PATCH v6 3/8] xen/spinlock: add missing rspin_is_locked() and rspin_barrier()

Add rspin_is_locked() and rspin_barrier() in order to prepare differing
spinlock_t and rspinlock_t types.

Signed-off-by: Juergen Gross 
---
V2:
- partially carved out from V1 patch, partially new
V5:
- let rspin_is_locked() return bool (Jan Beulich)
V6:
- Re-add comment to _spin_is_locked() (Jan Beulich)
---
 xen/arch/x86/mm/p2m-pod.c |  2 +-
 xen/common/domain.c   |  2 +-
 xen/common/page_alloc.c   |  2 +-
 xen/common/spinlock.c | 26 --
 xen/drivers/char/console.c|  4 ++--
 xen/drivers/passthrough/pci.c |  2 +-
 xen/include/xen/spinlock.h|  4 
 7 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
index c48ea169b7..9750a3a21b 100644
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -374,7 +374,7 @@ int p2m_pod_empty_cache(struct domain *d)
 
 /* After this barrier no new PoD activities can happen. */
 BUG_ON(!d->is_dying);
-spin_barrier(&p2m->pod.lock.lock);
+rspin_barrier(&p2m->pod.lock.lock);
 
 lock_page_alloc(p2m);
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index ceb44c8266..282c3ab623 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -991,7 +991,7 @@ int domain_kill(struct domain *d)
 case DOMDYING_alive:
 domain_pause(d);
 d->is_dying = DOMDYING_dying;
-spin_barrier(&d->domain_lock);
+rspin_barrier(&d->domain_lock);
 argo_destroy(d);
 vnuma_destroy(d->vnuma);
 domain_set_outstanding_pages(d, 0);
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 4d6ce726e3..7c1bdfc046 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -477,7 +477,7 @@ unsigned long domain_adjust_tot_pages(struct domain *d, 
long pages)
 {
 long dom_before, dom_after, dom_claimed, sys_before, sys_after;
 
-ASSERT(spin_is_locked(&d->page_alloc_lock));
+ASSERT(rspin_is_locked(&d->page_alloc_lock));
 d->tot_pages += pages;
 
 /*
diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 648393d95f..6572c76114 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -396,13 +396,10 @@ static bool always_inline spin_is_locked_common(const 
spinlock_tickets_t *t)
 int _spin_is_locked(const spinlock_t *lock)
 {
 /*
- * Recursive locks may be locked by another CPU, yet we return
- * "false" here, making this function suitable only for use in
- * ASSERT()s and alike.
+ * This function is suitable only for use in ASSERT()s and alike, as it
+ * doesn't tell _who_ is holding the lock.
  */
-return lock->recurse_cpu == SPINLOCK_NO_CPU
-   ? spin_is_locked_common(&lock->tickets)
-   : lock->recurse_cpu == smp_processor_id();
+return spin_is_locked_common(&lock->tickets);
 }
 
 static bool always_inline spin_trylock_common(spinlock_tickets_t *t,
@@ -465,6 +462,23 @@ void _spin_barrier(spinlock_t *lock)
 spin_barrier_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR);
 }
 
+bool _rspin_is_locked(const rspinlock_t *lock)
+{
+/*
+ * Recursive locks may be locked by another CPU, yet we return
+ * "false" here, making this function suitable only for use in
+ * ASSERT()s and alike.
+ */
+return lock->recurse_cpu == SPINLOCK_NO_CPU
+   ? spin_is_locked_common(&lock->tickets)
+   : lock->recurse_cpu == smp_processor_id();
+}
+
+void _rspin_barrier(rspinlock_t *lock)
+{
+spin_barrier_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR);
+}
+
 bool _rspin_trylock(rspinlock_t *lock)
 {
 unsigned int cpu = smp_processor_id();
diff --git a/xen/drivers/char/console.c b/xen/drivers/char/console.c
index 22f50fc617..d5e6aacc27 100644
--- a/xen/drivers/char/console.c
+++ b/xen/drivers/char/console.c
@@ -327,7 +327,7 @@ static void cf_check do_dec_thresh(unsigned char key, bool 
unused)
 
 static void conring_puts(const char *str, size_t len)
 {
-ASSERT(spin_is_locked(&console_lock));
+ASSERT(rspin_is_locked(&console_lock));
 
 while ( len-- )
 conring[CONRING_IDX_MASK(conringp++)] = *str++;
@@ -765,7 +765,7 @@ static void __putstr(const char *str)
 {
 size_t len = strlen(str);
 
-ASSERT(spin_is_locked(&console_lock));
+ASSERT(rspin_is_locked(&console_lock));
 
 console_serial_puts(str, len);
 video_puts(str, len);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 4fcc7e2cde..5a446d3dce 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -65,7 +65,7 @@ void pcidevs_unlock(void)
 
 bool pcidevs_locked(void)
 {
-return !!spin_is_locked(&_pcidevs_lock);
+return rspin_is_locked(&_pcidevs_lock);
 }
 
 static struct radix_tree_root pci_segments;
diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
index 8bc4652526..148be1e116 100644
--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -297,6 +297,8 @@ void _rspi

[PATCH v6 0/8] xen/spinlock: make recursive spinlocks a dedicated type

Instead of being able to use normal spinlocks as recursive ones, too,
make recursive spinlocks a special lock type.

This will make the spinlock structure smaller in production builds and
add type-safety.

This allows to increase the maximum number of physical cpus from 4095
to 65535 without increasing the size of the lock structure in production
builds (the size of recursive spinlocks in debug builds will grow to
12 bytes due to that change).

Note that rwlock handling is still limiting the number of cpus to 4095,
this is being taken care off in patch 12, which raises the rwlock limit
to 16384 cpus.

Iommu code imposes a limit of 16383 cpus.

Changes in V2:
- addressed comments by Jan Beulich
- lots of additional cleanups
- reorganized complete series

Changes in V3:
- addressed comments by Jan Beulich

Changes in V4:
- former patch 1 has already been applied
- fixed a coding style issue in patch 1

Changes in V5:
- new patches 1 + 10 + 12 + 13
- due to the recent Ghost-race patches the macro layer for calling
  spinlock functions is kept
- addressed comments

Changes in V6:
- patches 1-5 of V5 have been committed already
- addressed comments

Juergen Gross (8):
  xen/spinlock: add explicit non-recursive locking functions
  xen/spinlock: add another function level
  xen/spinlock: add missing rspin_is_locked() and rspin_barrier()
  xen/spinlock: split recursive spinlocks from normal ones
  xen/spinlock: let all is_locked and trylock variants return bool
  xen/spinlock: support higher number of cpus
  xen/rwlock: raise the number of possible cpus
  xen: allow up to 16383 cpus

 xen/arch/Kconfig  |   2 +-
 xen/arch/arm/mm.c |   4 +-
 xen/arch/x86/domain.c |  12 +--
 xen/arch/x86/mm.c |  12 +--
 xen/arch/x86/mm/mem_sharing.c |   8 +-
 xen/arch/x86/mm/p2m-pod.c |   6 +-
 xen/arch/x86/mm/p2m.c |   4 +-
 xen/arch/x86/tboot.c  |   4 +-
 xen/common/domain.c   |   2 +-
 xen/common/domctl.c   |   4 +-
 xen/common/grant_table.c  |  10 +-
 xen/common/memory.c   |   4 +-
 xen/common/numa.c |   4 +-
 xen/common/page_alloc.c   |  18 ++--
 xen/common/spinlock.c | 181 ++
 xen/drivers/char/console.c|  20 ++--
 xen/drivers/passthrough/pci.c |   2 +-
 xen/include/xen/rwlock.h  |  23 +++--
 xen/include/xen/spinlock.h| 110 -
 19 files changed, 297 insertions(+), 133 deletions(-)

-- 
2.35.3

[PATCH v6 5/8] xen/spinlock: let all is_locked and trylock variants return bool

Switch the remaining trylock and is_locked variants to return bool.

Signed-off-by: Juergen Gross 
Reviewed-by: Jan Beulich 
---
V5:
- new patch (Jan Beulich)
---
 xen/common/spinlock.c  | 4 ++--
 xen/include/xen/spinlock.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 5aaca49a61..7ccb725171 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -393,7 +393,7 @@ static bool always_inline spin_is_locked_common(const 
spinlock_tickets_t *t)
 return t->head != t->tail;
 }
 
-int _spin_is_locked(const spinlock_t *lock)
+bool _spin_is_locked(const spinlock_t *lock)
 {
 /*
  * This function is suitable only for use in ASSERT()s and alike, as it
@@ -433,7 +433,7 @@ static bool always_inline 
spin_trylock_common(spinlock_tickets_t *t,
 return true;
 }
 
-int _spin_trylock(spinlock_t *lock)
+bool _spin_trylock(spinlock_t *lock)
 {
 return spin_trylock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR);
 }
diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
index f49ba928f0..3a4092626c 100644
--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -234,8 +234,8 @@ void _spin_unlock(spinlock_t *lock);
 void _spin_unlock_irq(spinlock_t *lock);
 void _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
 
-int _spin_is_locked(const spinlock_t *lock);
-int _spin_trylock(spinlock_t *lock);
+bool _spin_is_locked(const spinlock_t *lock);
+bool _spin_trylock(spinlock_t *lock);
 void _spin_barrier(spinlock_t *lock);
 
 static always_inline void spin_lock(spinlock_t *l)
-- 
2.35.3

[PATCH v6 2/8] xen/spinlock: add another function level

Add another function level in spinlock.c hiding the spinlock_t layout
from the low level locking code.

This is done in preparation of introducing rspinlock_t for recursive
locks without having to duplicate all of the locking code.

Signed-off-by: Juergen Gross 
Reviewed-by: Jan Beulich 
---
V2:
- new patch
V5:
- don't regress spin_is_locked() for rspin-lock (Jan Beulich)
- use bool as return type of spin_is_locked_common() and
  spin_trylock_common() (Jan Beulich)
---
 xen/common/spinlock.c  | 103 -
 xen/include/xen/spinlock.h |   1 +
 2 files changed, 68 insertions(+), 36 deletions(-)

diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 874ed762b4..648393d95f 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -261,29 +261,31 @@ void spin_debug_disable(void)
 
 #ifdef CONFIG_DEBUG_LOCK_PROFILE
 
+#define LOCK_PROFILE_PAR lock->profile
 #define LOCK_PROFILE_REL \
-if ( lock->profile ) \
+if ( profile )   \
 {\
-lock->profile->time_hold += NOW() - lock->profile->time_locked;  \
-lock->profile->lock_cnt++;   \
+profile->time_hold += NOW() - profile->time_locked;  \
+profile->lock_cnt++; \
 }
 #define LOCK_PROFILE_VAR(var, val)s_time_t var = (val)
 #define LOCK_PROFILE_BLOCK(var)   var = var ? : NOW()
 #define LOCK_PROFILE_BLKACC(tst, val)\
 if ( tst )   \
 {\
-lock->profile->time_block += lock->profile->time_locked - (val); \
-lock->profile->block_cnt++;  \
+profile->time_block += profile->time_locked - (val); \
+profile->block_cnt++;\
 }
 #define LOCK_PROFILE_GOT(val)\
-if ( lock->profile ) \
+if ( profile )   \
 {\
-lock->profile->time_locked = NOW();  \
+profile->time_locked = NOW();\
 LOCK_PROFILE_BLKACC(val, val);   \
 }
 
 #else
 
+#define LOCK_PROFILE_PAR NULL
 #define LOCK_PROFILE_REL
 #define LOCK_PROFILE_VAR(var, val)
 #define LOCK_PROFILE_BLOCK(var)
@@ -307,17 +309,18 @@ static always_inline uint16_t observe_head(const 
spinlock_tickets_t *t)
 return read_atomic(&t->head);
 }
 
-static void always_inline spin_lock_common(spinlock_t *lock,
+static void always_inline spin_lock_common(spinlock_tickets_t *t,
+   union lock_debug *debug,
+   struct lock_profile *profile,
void (*cb)(void *data), void *data)
 {
 spinlock_tickets_t tickets = SPINLOCK_TICKET_INC;
 LOCK_PROFILE_VAR(block, 0);
 
-check_lock(&lock->debug, false);
+check_lock(debug, false);
 preempt_disable();
-tickets.head_tail = arch_fetch_and_add(&lock->tickets.head_tail,
-   tickets.head_tail);
-while ( tickets.tail != observe_head(&lock->tickets) )
+tickets.head_tail = arch_fetch_and_add(&t->head_tail, tickets.head_tail);
+while ( tickets.tail != observe_head(t) )
 {
 LOCK_PROFILE_BLOCK(block);
 if ( cb )
@@ -325,18 +328,19 @@ static void always_inline spin_lock_common(spinlock_t 
*lock,
 arch_lock_relax();
 }
 arch_lock_acquire_barrier();
-got_lock(&lock->debug);
+got_lock(debug);
 LOCK_PROFILE_GOT(block);
 }
 
 void _spin_lock(spinlock_t *lock)
 {
-spin_lock_common(lock, NULL, NULL);
+spin_lock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR, NULL,
+ NULL);
 }
 
 void _spin_lock_cb(spinlock_t *lock, void (*cb)(void *data), void *data)
 {
-spin_lock_common(lock, cb, data);
+spin_lock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR, cb, data);
 }
 
 void _spin_lock_irq(spinlock_t *lock)
@@ -355,16 +359,23 @@ unsigned long _spin_lock_irqsave(spinlock_t *lock)
 return flags;
 }
 
-void _spin_unlock(spinlock_t *lock)
+static void always_inline spin_unlock_common(spinlock_tickets_t *t,
+ union lock_debug *debug,
+ st

[PATCH v6 4/8] xen/spinlock: split recursive spinlocks from normal ones

Recursive and normal spinlocks are sharing the same data structure for
representation of the lock. This has two major disadvantages:

- it is not clear from the definition of a lock, whether it is intended
  to be used recursive or not, while a mixture of both usage variants
  needs to be

- in production builds (builds without CONFIG_DEBUG_LOCKS) the needed
  data size of an ordinary spinlock is 8 bytes instead of 4, due to the
  additional recursion data needed (associated with that the rwlock
  data is using 12 instead of only 8 bytes)

Fix that by introducing a struct spinlock_recursive for recursive
spinlocks only, and switch recursive spinlock functions to require
pointers to this new struct.

This allows to check the correct usage at build time.

Signed-off-by: Juergen Gross 
Reviewed-by: Jan Beulich 
---
V2:
- use shorter names (Jan Beulich)
- don't embed spinlock_t in rspinlock_t (Jan Beulich)
V5:
- some style fixes (Jan Beulich)
- bool instead of int (Jan Beulich)
---
 xen/common/spinlock.c  | 50 ++
 xen/include/xen/spinlock.h | 72 +-
 2 files changed, 105 insertions(+), 17 deletions(-)

diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 6572c76114..5aaca49a61 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -545,6 +545,56 @@ void _rspin_unlock_irqrestore(rspinlock_t *lock, unsigned 
long flags)
 local_irq_restore(flags);
 }
 
+bool _nrspin_trylock(rspinlock_t *lock)
+{
+check_lock(&lock->debug, true);
+
+if ( unlikely(lock->recurse_cpu != SPINLOCK_NO_CPU) )
+return false;
+
+return spin_trylock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR);
+}
+
+void _nrspin_lock(rspinlock_t *lock)
+{
+spin_lock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR, NULL,
+ NULL);
+}
+
+void _nrspin_unlock(rspinlock_t *lock)
+{
+spin_unlock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR);
+}
+
+void _nrspin_lock_irq(rspinlock_t *lock)
+{
+ASSERT(local_irq_is_enabled());
+local_irq_disable();
+_nrspin_lock(lock);
+}
+
+void _nrspin_unlock_irq(rspinlock_t *lock)
+{
+_nrspin_unlock(lock);
+local_irq_enable();
+}
+
+unsigned long _nrspin_lock_irqsave(rspinlock_t *lock)
+{
+unsigned long flags;
+
+local_irq_save(flags);
+_nrspin_lock(lock);
+
+return flags;
+}
+
+void _nrspin_unlock_irqrestore(rspinlock_t *lock, unsigned long flags)
+{
+_nrspin_unlock(lock);
+local_irq_restore(flags);
+}
+
 #ifdef CONFIG_DEBUG_LOCK_PROFILE
 
 struct lock_profile_anc {
diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
index 148be1e116..f49ba928f0 100644
--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -77,8 +77,6 @@ union lock_debug { };
 */
 
 struct spinlock;
-/* Temporary hack until a dedicated struct rspinlock is existing. */
-#define rspinlock spinlock
 
 struct lock_profile {
 struct lock_profile *next;   /* forward link */
@@ -110,6 +108,10 @@ struct lock_profile_qhead {
 __used_section(".lockprofile.data") = \
 &lock_profile_data__##name
 #define SPIN_LOCK_UNLOCKED_(x) {  \
+.debug = LOCK_DEBUG_, \
+.profile = x, \
+}
+#define RSPIN_LOCK_UNLOCKED_(x) { \
 .recurse_cpu = SPINLOCK_NO_CPU,   \
 .debug = LOCK_DEBUG_, \
 .profile = x, \
@@ -119,8 +121,9 @@ struct lock_profile_qhead {
 spinlock_t l = SPIN_LOCK_UNLOCKED_(NULL); \
 static struct lock_profile lock_profile_data__##l = LOCK_PROFILE_(l); \
 LOCK_PROFILE_PTR_(l)
+#define RSPIN_LOCK_UNLOCKED RSPIN_LOCK_UNLOCKED_(NULL)
 #define DEFINE_RSPINLOCK(l)   \
-rspinlock_t l = SPIN_LOCK_UNLOCKED_(NULL);\
+rspinlock_t l = RSPIN_LOCK_UNLOCKED_(NULL);   \
 static struct lock_profile lock_profile_data__##l = RLOCK_PROFILE_(l);\
 LOCK_PROFILE_PTR_(l)
 
@@ -145,8 +148,11 @@ struct lock_profile_qhead {
 
 #define spin_lock_init_prof(s, l) \
 spin_lock_init_prof__(s, l, lock, spinlock_t, false)
-#define rspin_lock_init_prof(s, l)\
-spin_lock_init_prof__(s, l, rlock, rspinlock_t, true)
+#define rspin_lock_init_prof(s, l) do {   \
+spin_lock_init_prof__(s, l, rlock, rspinlock_t, true);\
+(s)->l.recurse_cpu = SPINLOCK_NO_CPU; \
+(s)->l.recurse_cnt = 0;

[PATCH v6 7/8] xen/rwlock: raise the number of possible cpus

The rwlock handling is limiting the number of cpus to 4095 today. The
main reason is the use of the atomic_t data type for the main lock
handling, which needs 2 bits for the locking state (writer waiting or
write locked), 12 bits for the id of a possible writer, and a 12 bit
counter for readers. The limit isn't 4096 due to an off by one sanity
check.

The atomic_t data type is 32 bits wide, so in theory 15 bits for the
writer's cpu id and 15 bits for the reader count seem to be fine, but
via read_trylock() more readers than cpus are possible.

This means that it is possible to raise the number of cpus to 16384
without changing the rwlock_t data structure. In order to avoid the
reader count wrapping to zero, don't let read_trylock() succeed in case
the highest bit of the reader's count is set already. This leaves enough
headroom for non-recursive readers to enter without risking a wrap.

While at it calculate _QW_CPUMASK and _QR_SHIFT from _QW_SHIFT and
add a sanity check for not overflowing the atomic_t data type.

Signed-off-by: Juergen Gross 
---
V5:
- new patch
V6:
- add comment to _can_read_lock() (Jan Beulich)
---
 xen/include/xen/rwlock.h | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/xen/include/xen/rwlock.h b/xen/include/xen/rwlock.h
index 65d88b0ef4..232782801d 100644
--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -23,12 +23,12 @@ typedef struct {
 #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
 
 /* Writer states & reader shift and bias. */
-#define_QW_CPUMASK  0xfffU /* Writer CPU mask */
-#define_QW_SHIFT12 /* Writer flags shift */
-#define_QW_WAITING  (1U << _QW_SHIFT)  /* A writer is waiting */
-#define_QW_LOCKED   (3U << _QW_SHIFT)  /* A writer holds the lock */
-#define_QW_WMASK(3U << _QW_SHIFT)  /* Writer mask */
-#define_QR_SHIFT14 /* Reader count shift */
+#define_QW_SHIFT14  /* Writer flags shift */
+#define_QW_CPUMASK  ((1U << _QW_SHIFT) - 1) /* Writer CPU mask */
+#define_QW_WAITING  (1U << _QW_SHIFT)   /* A writer is waiting */
+#define_QW_LOCKED   (3U << _QW_SHIFT)   /* A writer holds the lock */
+#define_QW_WMASK(3U << _QW_SHIFT)   /* Writer mask */
+#define_QR_SHIFT(_QW_SHIFT + 2) /* Reader count shift */
 #define_QR_BIAS (1U << _QR_SHIFT)
 
 void queue_read_lock_slowpath(rwlock_t *lock);
@@ -36,14 +36,21 @@ void queue_write_lock_slowpath(rwlock_t *lock);
 
 static inline bool _is_write_locked_by_me(unsigned int cnts)
 {
-BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS);
+BUILD_BUG_ON((_QW_CPUMASK + 1) < NR_CPUS);
+BUILD_BUG_ON(NR_CPUS * _QR_BIAS > INT_MAX);
 return (cnts & _QW_WMASK) == _QW_LOCKED &&
(cnts & _QW_CPUMASK) == smp_processor_id();
 }
 
 static inline bool _can_read_lock(unsigned int cnts)
 {
-return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts);
+/*
+ * If write locked by the caller, no other readers are possible.
+ * Not allowing the lock holder to read_lock() another 32768 times ought
+ * to be fine.
+ */
+return cnts <= INT_MAX &&
+   (!(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts));
 }
 
 /*
-- 
2.35.3

[PATCH v6 6/8] xen/spinlock: support higher number of cpus

Allow 16 bits per cpu number, which is the limit imposed by
spinlock_tickets_t.

This will allow up to 65535 cpus, while increasing only the size of
recursive spinlocks in debug builds from 8 to 12 bytes.

The current Xen limit of 4095 cpus is imposed by SPINLOCK_CPU_BITS
being 12. There are machines available with more cpus than the current
Xen limit, so it makes sense to have the possibility to use more cpus.

Signed-off-by: Juergen Gross 
---
V5:
- keep previous recursion limit (Julien Grall)
V6:
- use unsigned int instead of uint32_t (Jan Beulich)
---
 xen/common/spinlock.c  |  2 ++
 xen/include/xen/spinlock.h | 20 ++--
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 7ccb725171..5aa9ba6188 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -485,7 +485,9 @@ bool _rspin_trylock(rspinlock_t *lock)
 
 /* Don't allow overflow of recurse_cpu field. */
 BUILD_BUG_ON(NR_CPUS > SPINLOCK_NO_CPU);
+BUILD_BUG_ON(SPINLOCK_CPU_BITS > sizeof(lock->recurse_cpu) * 8);
 BUILD_BUG_ON(SPINLOCK_RECURSE_BITS < 3);
+BUILD_BUG_ON(SPINLOCK_MAX_RECURSE > ((1u << SPINLOCK_RECURSE_BITS) - 1));
 
 check_lock(&lock->debug, true);
 
diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
index 3a4092626c..db00a24646 100644
--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -8,16 +8,16 @@
 #include 
 #include 
 
-#define SPINLOCK_CPU_BITS  12
+#define SPINLOCK_CPU_BITS  16
 
 #ifdef CONFIG_DEBUG_LOCKS
 union lock_debug {
-uint16_t val;
-#define LOCK_DEBUG_INITVAL 0x
+uint32_t val;
+#define LOCK_DEBUG_INITVAL 0x
 struct {
-uint16_t cpu:SPINLOCK_CPU_BITS;
-#define LOCK_DEBUG_PAD_BITS (14 - SPINLOCK_CPU_BITS)
-uint16_t :LOCK_DEBUG_PAD_BITS;
+unsigned int cpu:SPINLOCK_CPU_BITS;
+#define LOCK_DEBUG_PAD_BITS (30 - SPINLOCK_CPU_BITS)
+unsigned int :LOCK_DEBUG_PAD_BITS;
 bool irq_safe:1;
 bool unseen:1;
 };
@@ -211,11 +211,11 @@ typedef struct spinlock {
 
 typedef struct rspinlock {
 spinlock_tickets_t tickets;
-uint16_t recurse_cpu:SPINLOCK_CPU_BITS;
+uint16_t recurse_cpu;
 #define SPINLOCK_NO_CPU((1u << SPINLOCK_CPU_BITS) - 1)
-#define SPINLOCK_RECURSE_BITS  (16 - SPINLOCK_CPU_BITS)
-uint16_t recurse_cnt:SPINLOCK_RECURSE_BITS;
-#define SPINLOCK_MAX_RECURSE   ((1u << SPINLOCK_RECURSE_BITS) - 1)
+#define SPINLOCK_RECURSE_BITS  8
+uint8_t recurse_cnt;
+#define SPINLOCK_MAX_RECURSE   15
 union lock_debug debug;
 #ifdef CONFIG_DEBUG_LOCK_PROFILE
 struct lock_profile *profile;
-- 
2.35.3

[PATCH v6 8/8] xen: allow up to 16383 cpus

With lock handling now allowing up to 16384 cpus (spinlocks can handle
65535 cpus, rwlocks can handle 16384 cpus), raise the allowed limit for
the number of cpus to be configured to 16383.

The new limit is imposed by IOMMU_CMD_BUFFER_MAX_ENTRIES and
QINVAL_MAX_ENTRY_NR required to be larger than 2 * CONFIG_NR_CPUS.

Signed-off-by: Juergen Gross 
---
V5:
- new patch (Jan Beulich)
---
 xen/arch/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig
index 67ba38f32f..308ce129a8 100644
--- a/xen/arch/Kconfig
+++ b/xen/arch/Kconfig
@@ -6,7 +6,7 @@ config PHYS_ADDR_T_32
 
 config NR_CPUS
int "Maximum number of CPUs"
-   range 1 4095
+   range 1 16383
default "256" if X86
default "8" if ARM && RCAR3
default "4" if ARM && QEMU
-- 
2.35.3

Re: [PATCH] tools/oxenstored: Re-format

On 27/03/2024 3:11 pm, Edwin Torok wrote:
> On Mon, Feb 26, 2024 at 10:48 AM Andrew Cooper
>  wrote:
>> Rerun make format.
> Looks good, although not sure whether whitespace will be correctly
> preserved in email, recommend using git to push the changes.
> Reviewed-by: Edwin Török 

Thanks.  Don't worry about the patch workflow.

~Andrew

Re: preparations for 4.17.4

2024-03-27 Thread Christian Lindig




> On 27 Mar 2024, at 14:42, Andrew Cooper  wrote:
> 
> On 27/03/2024 2:06 pm, Jan Beulich wrote:
>> On 27.03.2024 15:01, Andrew Cooper wrote:
>>> It occurs to me that these want considering:
>>> 
>>> b6cf604207fd - tools/oxenstored: Use Map instead of Hashtbl for quotas
>>> 098d868e52ac - tools/oxenstored: Make Quota.t pure
>>> 
>>> while 4.17 is still in general support.  These came from a performance
>>> regression we investigated.
>>> 
>>> I've done the backport to 4.17 and they're not entirely trivial (owing
>>> to the major source reformat in 4.18) so can commit them if you'd prefer.
>> Didn't you bring these up for 4.18.1 already, and I said that I'd leave
>> this for the maintainers to decide? Same here, in any event. Cc-ing them
>> both, just in case.
> 
> I could have sworn that I remembered requested this before, but couldn't
> remember where.
> 
> I'll see about poking people for an answer.
> 
> ~Andrew

I understand we have backport; so I support upstreaming it. 

— C

[xen-unstable-smoke test] 185173: tolerable all pass - PUSHED

flight 185173 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185173/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  0cd50753eb40ca5f00ea1ced9f80ce5f478e560c
baseline version:
 xen  e3883336bb5abba2ec2231618f5b64f61b099b1e

Last test of basis   185170  2024-03-27 09:03:56 Z0 days
Testing same since   185173  2024-03-27 12:02:02 Z0 days1 attempts


People who touched revisions under test:
  George Dunlap 
  Jan Beulich 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   e3883336bb..0cd50753eb  0cd50753eb40ca5f00ea1ced9f80ce5f478e560c -> smoke

[PATCH v1] tools/ocaml: fix warnings

2024-03-27 Thread Edwin Török

Do not rely on the string values of the `Failure` exception,
 but use the `_opt` functions instead.

Signed-off-by: Edwin Török 
---
 tools/ocaml/xenstored/config.ml | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/tools/ocaml/xenstored/config.ml b/tools/ocaml/xenstored/config.ml
index 95ef745a54..e0df236f73 100644
--- a/tools/ocaml/xenstored/config.ml
+++ b/tools/ocaml/xenstored/config.ml
@@ -83,25 +83,27 @@ let validate cf expected other =
   let err = ref [] in
   let append x = err := x :: !err in
   List.iter (fun (k, v) ->
+  let parse ~err_msg parser v f =
+match parser v with
+| None -> append (k, err_msg)
+| Some r -> f r
+  in
   try
 if not (List.mem_assoc k expected) then
   other k v
 else let ty = List.assoc k expected in
   match ty with
   | Unit f   -> f ()
-  | Bool f   -> f (bool_of_string v)
+  | Bool f   -> parse ~err_msg:"expect bool arg" 
bool_of_string_opt v f
   | String f -> f v
-  | Int f-> f (int_of_string v)
-  | Float f  -> f (float_of_string v)
-  | Set_bool r   -> r := (bool_of_string v)
+  | Int f-> parse ~err_msg:"expect int arg" int_of_string_opt 
v f
+  | Float f  -> parse ~err_msg:"expect float arg" 
float_of_string_opt v f
+  | Set_bool r   -> parse ~err_msg:"expect bool arg" 
bool_of_string_opt v (fun x -> r := x)
   | Set_string r -> r := v
-  | Set_int r-> r := int_of_string v
-  | Set_float r  -> r := (float_of_string v)
+  | Set_int r-> parse ~err_msg:"expect int arg" int_of_string_opt 
v (fun x -> r:= x)
+  | Set_float r  -> parse ~err_msg:"expect float arg" 
float_of_string_opt v (fun x -> r := x)
   with
   | Not_found -> append (k, "unknown key")
-  | Failure "int_of_string"   -> append (k, "expect int arg")
-  | Failure "bool_of_string"  -> append (k, "expect bool arg")
-  | Failure "float_of_string" -> append (k, "expect float arg")
   | exn   -> append (k, Printexc.to_string exn)
 ) cf;
   if !err != [] then raise (Error !err)
-- 
2.44.0

Re: [PATCH v1] tools/ocaml: fix warnings

On 27/03/2024 4:30 pm, Edwin Török wrote:
> Do not rely on the string values of the `Failure` exception,
>  but use the `_opt` functions instead.
>
> Signed-off-by: Edwin Török 

FWIW, Tested-by: Andrew Cooper 

But I think the subject wants to say "in config.ml" and the commit
message gain something like:

Fixes warnings such as:

  File "config.ml", line 102, characters 12-27:
  102 |         | Failure "int_of_string"   -> append (k, "expect int arg")
        ^^^
  Warning 52: Code should not depend on the actual values of
  this constructor's arguments. They are only for information
  and may change in future versions. (See manual section 9.5)

I can adjust all on commit.

~Andrew

Re: [RFC PATCH-for-9.0 v2 13/19] hw/xen: Remove use of 'target_ulong' in handle_ioreq()

On Tue, Nov 14, 2023 at 03:38:09PM +0100, Philippe Mathieu-Daudé wrote:
> Per commit f17068c1c7 ("xen-hvm: reorganize xen-hvm and move common
> function to xen-hvm-common"), handle_ioreq() is expected to be
> target-agnostic. However it uses 'target_ulong', which is a target
> specific definition.
> 
> Per xen/include/public/hvm/ioreq.h header:
> 
>   struct ioreq {
> uint64_t addr;  /* physical address */
> uint64_t data;  /* data (or paddr of data) */
> uint32_t count; /* for rep prefixes */
> uint32_t size;  /* size in bytes */
> uint32_t vp_eport;  /* evtchn for notifications to/from device model 
> */
> uint16_t _pad0;
> uint8_t state:4;
> uint8_t data_is_ptr:1;  /* if 1, data above is the guest paddr
>  * of the real data to use. */
> uint8_t dir:1;  /* 1=read, 0=write */
> uint8_t df:1;
> uint8_t _pad1:1;
> uint8_t type;   /* I/O type */
>   };
>   typedef struct ioreq ioreq_t;
> 
> If 'data' is not a pointer, it is a u64.
> 
> - In PIO / VMWARE_PORT modes, only 32-bit are used.
> 
> - In MMIO COPY mode, memory is accessed by chunks of 64-bit

Looks like it could also be 8, 16, or 32 as well, depending on
req->size.

> - In PCI_CONFIG mode, access is u8 or u16 or u32.
> 
> - None of TIMEOFFSET / INVALIDATE use 'req'.
> 
> - Fallback is only used in x86 for VMWARE_PORT.
> 
> Masking the upper bits of 'data' to keep 'req->size' low bits
> is irrelevant of the target word size. Remove the word size
> check and always extract the relevant bits.

When building QEMU for Xen, we tend to build the target "i386-softmmu",
which looks like to have target_ulong == uint32_t. So the `data`
clamping would only apply to size 8 and 16. The clamping with
target_ulong was introduce long time ago, here:
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b4a663b87df3954557434a2d31bff7f6b2706ec1
and they were more IOREQ types.
So my guess is it isn't relevant anymore, but extending the clamping to
32-bits request should be fine, when using qemu-system-i386 that is, as
it is already be done if one use qemu-system-x86_64.

So I think the patch is fine, and the tests I've ran so far worked fine.

> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD

Re: [PATCH v2 3/3] svm/nestedsvm: Introduce nested capabilities bit

2024-03-27 Thread George Dunlap

On Mon, Mar 18, 2024 at 2:17 PM Jan Beulich  wrote:
>
> On 13.03.2024 13:24, George Dunlap wrote:
> > In order to make implementation and testing tractable, we will require
> > specific host functionality.  Add a nested_virt bit to hvm_funcs.caps,
> > and return an error if a domain is created with nested virt and this
> > bit isn't set.  Create VMX and SVM callbacks to be executed from
> > start_nested_svm(), which is guaranteed to execute after all
>
> Nit: nestedhvm_setup() (or, with different wording, start_nested_{svm,vmx}()).

Oops, fixed.

> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -673,6 +673,12 @@ int arch_sanitise_domain_config(struct 
> > xen_domctl_createdomain *config)
> >   */
> >  config->flags |= XEN_DOMCTL_CDF_oos_off;
> >
> > +if ( nested_virt && !hvm_nested_virt_supported() )
> > +{
> > +dprintk(XENLOG_INFO, "Nested virt requested but not available\n");
> > +return -EINVAL;
> > +}
> > +
> >  if ( nested_virt && !hap )
> >  {
> >  dprintk(XENLOG_INFO, "Nested virt not supported without HAP\n");
>
> As mentioned in reply to v1, I think what both start_nested_{svm,vmx}() check
> is redundant with this latter check. I think that check isn't necessary there
> (yet unlike suggested in reply to v1 I don't think anymore that the check here
> can alternatively be dropped). And even if it was to be kept for some reason,
> I'm having some difficulty seeing why vendor code needs to do that check, when
> nestedhvm_setup() could do it for both of them.

I'm having a bit of trouble resolving the antecedents in this
paragraph (what "this" and "there" are referring to).

Are you saying that we should set hvm_funcs.caps.nested_virt to 'true'
even if the hardware doesn't support HAP, because we check that here?

That seems like a very strange way to go about things; hvm_funcs.caps
should reflect the actual capabilities of the hardware.  Suppose at
some point we want to expose nested virt capability to the toolstack
-- wouldn't it make more sense to be able to just read
`hvm_funcs.caps.nested_virt`, rather than having to remember to && it
with `hvm_funcs.caps.hap`?

And as you say, you can't get rid of the HAP check here, because we
also want to deny nested virt if the admin deliberately created a
guest in shadow mode on a system that has HAP available.  So it's not
redundant: one is checking the capabilities of the system, the other
is checking the requested guest configuration.

As to why to have each vendor independent code check for HAP -- there
are in fact two implementations of the code; it's nice to be able to
look in one place for each implementation to determine the
requirements.  Additionally, it would be possible, in the future, for
one of the nested virt implementations to enable shadow mode, while
the other one didn't.  The current arrangement makes that easy.

> > --- a/xen/arch/x86/hvm/nestedhvm.c
> > +++ b/xen/arch/x86/hvm/nestedhvm.c
> > @@ -150,6 +150,16 @@ static int __init cf_check nestedhvm_setup(void)
> >  __clear_bit(0x80, shadow_io_bitmap[0]);
> >  __clear_bit(0xed, shadow_io_bitmap[1]);
> >
> > +/*
> > + * NB this must be called after all command-line processing has been
> > + * done, so that if (for example) HAP is disabled, nested virt is
> > + * disabled as well.
> > + */
> > +if ( cpu_has_vmx )
> > +start_nested_vmx(&hvm_funcs);
> > +else if ( cpu_has_svm )
> > +start_nested_svm(&hvm_funcs);
>
> Instead of passing the pointer, couldn't you have the functions return
> bool, and then set hvm_funcs.caps.nested_virt from that? Passing that
> pointer looks somewhat odd to me.

For one, it makes start_nested_XXX symmetric to start_XXX, which also
has access to the full hvm_funcs structure (albeit in the former case
because it's creating the structure).  For two, both of them need to
read caps.hap.  For three, it's just more flexible -- there may be
future things that we want to modify in the start_nested_*()
functions.  If we did as you suggest, and then added (say)
caps.nested_virt_needs_hap, then we'd either need to add a second
function, or refactor it to look like this.

> Is there a reason to use direct calls here rather than a true hook
> (seeing that hvm_funcs must have been populated by the time we make it
> here)? I understand we're (remotely) considering to switch away from
> using hooks at some point, but right now this feels somewhat
> inconsistent if not further justified.

Again it mimics the calls to start_vmx/svm in hvm_enable.  But I could
look at adding a function pointer to see if it works.

 -George

Re: [PATCH-for-9.0 v2 15/19] hw/xen: Reduce inclusion of 'cpu.h' to target-specific sources

On Tue, Nov 14, 2023 at 03:38:11PM +0100, Philippe Mathieu-Daudé wrote:
> We rarely need to include "cpu.h" in headers. Including it
> 'taint' headers to be target-specific. Here only the i386/arm
> implementations requires "cpu.h", so include it there and
> remove from the "hw/xen/xen-hvm-common.h" *common* header.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: Richard Henderson 
> Reviewed-by: David Woodhouse 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD

Re: [PATCH-for-9.0 v2 16/19] hw/xen/xen_pt: Add missing license

On Tue, Nov 14, 2023 at 03:38:12PM +0100, Philippe Mathieu-Daudé wrote:
> Commit eaab4d60d3 ("Introduce Xen PCI Passthrough, qdevice")
> introduced both xen_pt.[ch], but only added the license to
> xen_pt.c. Use the same license for xen_pt.h.
> 
> Suggested-by: David Woodhouse 
> Signed-off-by: Philippe Mathieu-Daudé 

Fine by me. Looks like there was a license header before:
https://xenbits.xen.org/gitweb/?p=qemu-xen-unstable.git;a=blob;f=hw/pass-through.h;h=0b5822414e24d199a064abccc4d378dcaf569bd6;hb=HEAD
I don't know why I didn't copied it over here.

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD

Re: Serious AMD-Vi(?) issue

2024-03-27 Thread Elliott Mitchell

On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> > On 22.03.2024 20:22, Elliott Mitchell wrote:
> > > On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote:
> > >>
> > >> I can see you've recently engaged with our community with some issues 
> > >> you'd
> > >> like help with.
> > >> We love the fact you are participating in our project, however, our
> > >> developers aren't able to help if you do not provide the specific 
> > >> details.
> > > 
> > > Please point to specific details which have been omitted.  Fairly little
> > > data has been provided as fairly little data is available.  The primary
> > > observation is large numbers of:
> > > 
> > > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 
> > > 0x8 I
> > > 
> > > Lines in Xen's ring buffer.
> > 
> > Yet this is (part of) the problem: By providing only the messages that 
> > appear
> > relevant to you, you imply that you know that no other message is in any way
> > relevant. That's judgement you'd better leave to people actually trying to
> > investigate. Unless of course you were proposing an actual code change, with
> > suitable justification.
> 
> Honestly, I forgot about the very small number of messages from the SATA
> subsystem.  The question of whether the current mitigation actions are
> effective right now was a bigger issue.  As such monitoring `xl dmesg`
> was a priority to looking at SATA messages which failed to reliably
> indicate status.
> 
> I *thought* I would be able to retrieve those via other slow means, but a
> different and possibly overlapping issue has shown up.  Unfortunately
> this means those are no longer retrievable.   :-(

With some persistence I was able to retrieve them.  There are other
pieces of software with worse UIs than Xen.

> > In fact when running into trouble, the usual course of action would be to
> > increase verbosity in both hypervisor and kernel, just to make sure no
> > potentially relevant message is missed.
> 
> More/better information might have been obtained if I'd been engaged
> earlier.

This is still true, things are in full mitigation mode and I'll be
quite unhappy to go back with experiments at this point.


I now see why I left those out.  The messages from the SATA subsystem
were from a kernel which a bad patch had leaked into a LTS branch.  Looks
like the SATA subsystem was significantly broken and I'm unsure whether
any useful information could be retrieved.  Notably there is quite a bit
of noise from SATA devices not effected by this issue.

Some of the messages /might/ be useful, but the amount of noise is quite
high.  Do messages from a broken kernel interest you?


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| ehem+sig...@m5p.com  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445

[PATCH 4/6] tools/misc: xenwatchdogd: add parse_secs()

From: Leigh Brown 

Create a new parse_secs() function to parse the timeout and sleep
parameters. This ensures that non-numeric parameters are not
accidentally treated as numbers.
---
 tools/misc/xenwatchdogd.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/tools/misc/xenwatchdogd.c b/tools/misc/xenwatchdogd.c
index 224753e824..26bfdef3db 100644
--- a/tools/misc/xenwatchdogd.c
+++ b/tools/misc/xenwatchdogd.c
@@ -49,6 +49,18 @@ static void catch_usr1(int sig)
 done = true;
 }
 
+static int parse_secs(const char *arg, const char *what)
+{
+char *endptr;
+unsigned long val;
+
+val = strtoul(arg, &endptr, 0);
+if (val > INT_MAX || *endptr != 0)
+   errx(EXIT_FAILURE, "invalid %s: '%s'", what, arg);
+
+return val;
+}
+
 int main(int argc, char **argv)
 {
 int id;
@@ -64,16 +76,11 @@ int main(int argc, char **argv)
 if (h == NULL)
err(EXIT_FAILURE, "xc_interface_open");
 
-t = strtoul(argv[1], NULL, 0);
-if (t == ULONG_MAX)
-   err(EXIT_FAILURE, "strtoul");
+t = parse_secs(argv[optind], "timeout");
 
 s = t / 2;
-if (argc == 3) {
-   s = strtoul(argv[2], NULL, 0);
-   if (s == ULONG_MAX)
-   err(EXIT_FAILURE, "strtoul");
-}
+if (argc == 3)
+   s = parse_secs(argv[optind], "sleep");
 
 if (signal(SIGHUP, &catch_exit) == SIG_ERR)
err(EXIT_FAILURE, "signal");
-- 
2.39.2

[PATCH 2/6] tools/misc: rework xenwatchdogd signal handling

From: Leigh Brown 

Rework xenwatchdogd signal handling to do the minimum in the signal
handler. This is a very minor enhancement.
---
 tools/misc/xenwatchdogd.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/tools/misc/xenwatchdogd.c b/tools/misc/xenwatchdogd.c
index 2f7c822d61..d4da0ad0b6 100644
--- a/tools/misc/xenwatchdogd.c
+++ b/tools/misc/xenwatchdogd.c
@@ -9,9 +9,11 @@
 #include 
 #include 
 #include 
+#include 
 
 xc_interface *h;
-int id = 0;
+bool safeexit = false;
+bool done = false;
 
 void daemonize(void)
 {
@@ -38,20 +40,18 @@ void daemonize(void)
 
 void catch_exit(int sig)
 {
-if (id)
-xc_watchdog(h, id, 300);
-exit(EXIT_SUCCESS);
+done = true;
 }
 
 void catch_usr1(int sig)
 {
-if (id)
-xc_watchdog(h, id, 0);
-exit(EXIT_SUCCESS);
+safeexit = true;
+done = true;
 }
 
 int main(int argc, char **argv)
 {
+int id;
 int t, s;
 int ret;
 
@@ -90,10 +90,14 @@ int main(int argc, char **argv)
 if (id <= 0)
 err(EXIT_FAILURE, "xc_watchdog setup");
 
-for (;;) {
+while (!done) {
 sleep(s);
 ret = xc_watchdog(h, id, t);
 if (ret != 0)
 err(EXIT_FAILURE, "xc_watchdog");
 }
+
+// Zero seconds timeout will disarm the watchdog timer
+xc_watchdog(h, id, safeexit ? 0 : 300);
+return 0;
 }
-- 
2.39.2

[PATCH 3/6] tools/misc: xenwatchdogd: make functions static

From: Leigh Brown 

Make all functions except main() static in xenwatchdogd.c.
---
 tools/misc/xenwatchdogd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/misc/xenwatchdogd.c b/tools/misc/xenwatchdogd.c
index d4da0ad0b6..224753e824 100644
--- a/tools/misc/xenwatchdogd.c
+++ b/tools/misc/xenwatchdogd.c
@@ -15,7 +15,7 @@ xc_interface *h;
 bool safeexit = false;
 bool done = false;
 
-void daemonize(void)
+static void daemonize(void)
 {
 switch (fork()) {
 case -1:
@@ -38,12 +38,12 @@ void daemonize(void)
 err(EXIT_FAILURE, "reopen stderr");
 }
 
-void catch_exit(int sig)
+static void catch_exit(int sig)
 {
 done = true;
 }
 
-void catch_usr1(int sig)
+static void catch_usr1(int sig)
 {
 safeexit = true;
 done = true;
-- 
2.39.2

[PATCH 6/6] docs/man: Add xenwatchdog manual page

From: Leigh Brown 

Add a manual page for xenwatchdogd.
---
 docs/man/xenwatchdogd.8.pod | 54 +
 1 file changed, 54 insertions(+)
 create mode 100644 docs/man/xenwatchdogd.8.pod

diff --git a/docs/man/xenwatchdogd.8.pod b/docs/man/xenwatchdogd.8.pod
new file mode 100644
index 00..2f6454f183
--- /dev/null
+++ b/docs/man/xenwatchdogd.8.pod
@@ -0,0 +1,54 @@
+=head1 NAME
+
+xenwatchdogd - Xen hypercall based watchdog daemon
+
+=head1 SYNOPSIS
+
+B [ I ] > [ > ]
+
+=head1 DESCRIPTION
+
+B arms the Xen watchdog timer to I every I
+seconds. If the xenwatchdogd process dies or is delayed for more than
+I seconds, then Xen will reboot the domain. If the domain being
+rebooted is domain 0, the whole system will reboot.
+
+=head1 OPTIONS
+
+=over 4
+
+=item B<-h>, B<--help>
+
+Display a help message.
+
+=item B<-F>, B<--foreground>
+
+Run in the foreground. The default behaviour is to daemonize.
+
+=item B<-x>, B<--safe-exit>
+
+Disable watchdog on orderly exit. The default behaviour is to arm the
+watchdog to 300 seconds to allow time for the domain to shutdown.  See 
+also the B section.
+
+=item B
+
+The number of seconds to arm the Xen watchdog timer. This must be set to
+a minimum of two.
+
+=item B
+
+The number of seconds to sleep in between calls to arm the Xen watchdog
+timer. This must be at least one second, and less than the I
+value. If not specified, it defaults to half the I value.
+
+=back
+
+=head1 SIGNALS
+
+B Will cause the program to disarm the watchdog timer and exit,
+regardless of whether the safe exit option was passed.
+
+=head1 AUTHOR
+
+Citrix Ltd and other contributors.
-- 
2.39.2

[PATCH 0/6] xenwatchdogd enhancements

From: Leigh Brown 

Following up on Cyril's email. I had been independently looking at this,
mainly because xenwatchdogd is simple enough for me to understand. The
primary intention of this patch series is to replace the pathologically
bad behaviour of rebooting the domain if you run "xenwatchdogd -h". To
that end, I have implemented comprehensive argument validation. This
validation ensures you can't pass arguments that instantly reboot the
domain or cause it to spin loop running sleep(0) repeatedly. I added a
couple of enhancements whilst working on the changes as they were easy
enough.

Full list of changes:
- Use getopt_long() to add -h/--help with associated usage help.
- Add -F/--foreground parameter to run without daemonizing.
- Add -x/--save-exit parameter to disarm the watchdog when exiting.
- Validate timeout is numeric and is at least two seconds.
- Validate sleep is numeric and is at least one and less than timeout.
- Check for too many arguments.
- Use symbol constants instead of magic numbers where possible.
- Add a manual page for xenwatchdogd().

I am not an expert in git or sending patches so forgive me if things
don't look quite right.

Signed-off-by: Leigh Brown 

Leigh Brown (6):
  tools/misc: xenwatchdogd: use EXIT_* constants
  tools/misc: rework xenwatchdogd signal handling
  tools/misc: xenwatchdogd: make functions static
  tools/misc: xenwatchdogd: add parse_secs()
  tools/misc: xenwatchdogd enhancements
  docs/man: Add xenwatchdog manual page

 docs/man/xenwatchdogd.8.pod |  54 +++
 tools/misc/xenwatchdogd.c   | 180 
 2 files changed, 195 insertions(+), 39 deletions(-)
 create mode 100644 docs/man/xenwatchdogd.8.pod

-- 
2.39.2

[PATCH 1/6] tools/misc: xenwatchdogd: use EXIT_* constants