[PATCH v6 0/4] x86/pvh: Support relocating dom0 kernel
Xen tries to load a PVH dom0 kernel at the fixed guest physical address from the elf headers. For Linux, this defaults to 0x100 (16MB), but it can be configured. Unfortunately there exist firmwares that have reserved regions at this address, so Xen fails to load the dom0 kernel since it's not RAM. The other issue is that the Linux PVH entry point is not position-independent. It expects to run at the compiled CONFIG_PHYSICAL_ADDRESS. This patch set expands the PVH dom0 builder to try to relocate the kernel if needed and possible. XEN_ELFNOTE_PHYS32_RELOC is added for kernels to indicate they are relocatable and their acceptable address range and alignment. v6: Choose the alignment from the Note if specified, or the Maximum PHDR p_align value if greater than PAGE_SIZE. Otherwise, it falls back to the default 2MB. Patches from v5 commited: 853c71dfbf xen/elfnote: Specify ELF Notes are x86-specific 7d8c9b4e8d libelf: Expand ELF note printing 8802230bfa Revert "xen/x86: bzImage parse kernel_alignment" The first and second patches move MB/GB() to common-macros.h. The third patch stores the maximum p_align value from the ELF PHDRs. The fourth patch expands the pvh dom0 kernel placement code. I'll post an additional patch showing the Linux changes to make PVH relocatable. Jason Andryuk (4): tools/init-xenstore-domain: Replace variable MB() usage tools: Move MB/GB() to common-macros.h libelf: Store maximum PHDR p_align x86/PVH: Support relocatable dom0 kernels tools/firmware/hvmloader/util.h | 3 - tools/helpers/init-xenstore-domain.c| 11 ++- tools/include/xen-tools/common-macros.h | 4 + tools/libs/light/libxl_internal.h | 4 - xen/arch/x86/hvm/dom0_build.c | 108 xen/common/libelf/libelf-dominfo.c | 35 xen/common/libelf/libelf-loader.c | 15 +++- xen/common/libelf/libelf-private.h | 1 + xen/include/public/elfnote.h| 16 +++- xen/include/xen/libelf.h| 5 ++ 10 files changed, 186 insertions(+), 16 deletions(-) -- 2.44.0
[PATCH v6 4/4] x86/PVH: Support relocatable dom0 kernels
Xen tries to load a PVH dom0 kernel at the fixed guest physical address from the elf headers. For Linux, this defaults to 0x100 (16MB), but it can be configured. Unfortunately there exist firmwares that have reserved regions at this address, so Xen fails to load the dom0 kernel since it's not RAM. The PVH entry code is not relocatable - it loads from absolute addresses, which fail when the kernel is loaded at a different address. With a suitably modified kernel, a reloctable entry point is possible. Add XEN_ELFNOTE_PHYS32_RELOC which specifies optional alignment, minimum, and maximum addresses needed for the kernel. The presence of the NOTE indicates the kernel supports a relocatable entry path. Change the loading to check for an acceptable load address. If the kernel is relocatable, support finding an alternate load address. The primary motivation for an explicit align field is that Linux has a configurable CONFIG_PHYSICAL_ALIGN field. This value is present in the bzImage setup header, but not the ELF program headers p_align, which report 2MB even when CONFIG_PHYSICAL_ALIGN is greater. Since a kernel is only considered relocatable if the PHYS32_RELOC elf note is present, the alignment contraints can just be specified within the note instead of searching for an alignment value via a heuristic. Load alignment uses the PHYS32_RELOC note value if specified. Otherwise, the maxmum ELF PHDR p_align value is selected if greater than or equal to PAGE_SIZE. Finally, the fallback default is 2MB. libelf-private.h includes common-macros.h to satisfy the fuzzer build. Link: https://gitlab.com/xen-project/xen/-/issues/180 Signed-off-by: Jason Andryuk --- ELF Note printing looks like: (XEN) ELF: note: PHYS32_RELOC align: 0x20 min: 0x100 max: 0x3fff v2: Use elfnote for min, max & align - use 64bit values. Print original and relocated memory addresses Use check_and_adjust_load_address() name Return relocated base instead of offset Use PAGE_ALIGN Don't load above max_phys (expected to be 4GB in kernel elf note) Use single line comments Exit check_load_address loop earlier Add __init to find_kernel_memory() v3: Remove kernel_start/end page rounding Change loop comment to rely on a sorted memory map. Reorder E820_RAM check first Use %p for dest_base Use PRIpaddr Use 32bit phys_min/max/align Consolidate to if ( x || y ) continue Use max_t Add parms->phys_reloc Use common "%pd kernel: " prefix for messages Re-order phys_entry assignment Print range ends inclusively Remove extra "Unable to load kernel" message s/PVH_RELOCATION/PHYS32_RELOC/ Make PHYS32_RELOC contents optional Use 2MB default alignment Update ELF Note comment Update XEN_ELFNOTE_MAX v4: Cast dest_base to uintptr_t Use local start variable Mention e820 requiring adjacent entries merged Remove extra whitespace Re-word elfnote comment & mention x86 Use ELFNOTE_NAME Return -ENOSPC Use ! instead of == 0 Check kend - 1 to avoid off by one libelf: Use MB/GB() to define the size. Use ARCH_PHYS_* defines v5: Place kernel in higher memory addresses Remove stray semicolons ELFNOTE_NAME comment about newline Make PHYS32_RELOC element order align, min, max Re-word PHYS32_RELOC comment Move phys_align next to other bool variables v6: Select alignment from, in order, Note, PHDRs, then default --- xen/arch/x86/hvm/dom0_build.c | 108 + xen/common/libelf/libelf-dominfo.c | 35 ++ xen/common/libelf/libelf-private.h | 1 + xen/include/public/elfnote.h | 16 - xen/include/xen/libelf.h | 4 ++ 5 files changed, 163 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c index 0ceda4140b..01f39d650e 100644 --- a/xen/arch/x86/hvm/dom0_build.c +++ b/xen/arch/x86/hvm/dom0_build.c @@ -537,6 +537,111 @@ static paddr_t __init find_memory( return INVALID_PADDR; } +static bool __init check_load_address( +const struct domain *d, const struct elf_binary *elf) +{ +paddr_t kernel_start = (uintptr_t)elf->dest_base; +paddr_t kernel_end = kernel_start + elf->dest_size; +unsigned int i; + +/* Relies on a sorted memory map with adjacent entries merged. */ +for ( i = 0; i < d->arch.nr_e820; i++ ) +{ +paddr_t start = d->arch.e820[i].addr; +paddr_t end = start + d->arch.e820[i].size; + +if ( start >= kernel_end ) +return false; + +if ( d->arch.e820[i].type == E820_RAM && + start <= kernel_start && + end >= kernel_end ) +return true; +} + +return false; +} + +/* Find an e820 RAM region that fits the kernel at a suitable alignment. */ +static paddr_t __init find_kernel_memory( +const struct domain *d, struct elf_binary *elf, +const struct elf_dom_parms *parms) +{ +paddr_t kernel_size = elf->dest_size; +unsigned int align; +int i; + +if ( parms->phys_align != UNSET_ADDR32 ) +align = parms->phys_align; +
[PATCH v6 1/4] tools/init-xenstore-domain: Replace variable MB() usage
The local MB() & GB() macros will be replaced with a common implementation, but those only work with constant values. Introduce a static inline mb_to_bytes() in place of the MB() macro to convert the variable values. Signed-off-by: Jason Andryuk --- v4: New --- tools/helpers/init-xenstore-domain.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tools/helpers/init-xenstore-domain.c b/tools/helpers/init-xenstore-domain.c index 1683438c5c..5405842dfe 100644 --- a/tools/helpers/init-xenstore-domain.c +++ b/tools/helpers/init-xenstore-domain.c @@ -20,7 +20,6 @@ #include "init-dom-json.h" #define LAPIC_BASE_ADDRESS 0xfee0UL -#define MB(x) ((uint64_t)x << 20) #define GB(x) ((uint64_t)x << 30) static uint32_t domid = ~0; @@ -36,6 +35,11 @@ static xc_evtchn_port_or_error_t console_evtchn; static xentoollog_level minmsglevel = XTL_PROGRESS; static void *logger; +static inline uint64_t mb_to_bytes(int mem) +{ + return (uint64_t)mem << 20; +} + static struct option options[] = { { "kernel", 1, NULL, 'k' }, { "memory", 1, NULL, 'm' }, @@ -76,8 +80,8 @@ static int build(xc_interface *xch) int rv, xs_fd; struct xc_dom_image *dom = NULL; int limit_kb = (maxmem ? : memory) * 1024 + X86_HVM_NR_SPECIAL_PAGES * 4; -uint64_t mem_size = MB(memory); -uint64_t max_size = MB(maxmem ? : memory); +uint64_t mem_size = mb_to_bytes(memory); +uint64_t max_size = mb_to_bytes(maxmem ? : memory); struct e820entry e820[3]; struct xen_domctl_createdomain config = { .ssidref = SECINITSID_DOMU, -- 2.44.0
[PATCH v6 3/4] libelf: Store maximum PHDR p_align
While parsing the PHDRs, store the maximum p_align value. This may be consulted for moving a PVH image's load address. Signed-off-by: Jason Andryuk --- v6: New --- xen/common/libelf/libelf-loader.c | 15 +++ xen/include/xen/libelf.h | 1 + 2 files changed, 12 insertions(+), 4 deletions(-) diff --git a/xen/common/libelf/libelf-loader.c b/xen/common/libelf/libelf-loader.c index 629cc0d3e6..a5f6389f82 100644 --- a/xen/common/libelf/libelf-loader.c +++ b/xen/common/libelf/libelf-loader.c @@ -468,6 +468,7 @@ void elf_parse_binary(struct elf_binary *elf) { ELF_HANDLE_DECL(elf_phdr) phdr; uint64_t low = -1, high = 0, paddr, memsz; +uint64_t max_align = 0, palign; unsigned i, count; count = elf_phdr_count(elf); @@ -481,17 +482,23 @@ void elf_parse_binary(struct elf_binary *elf) continue; paddr = elf_uval(elf, phdr, p_paddr); memsz = elf_uval(elf, phdr, p_memsz); -elf_msg(elf, "ELF: phdr: paddr=%#" PRIx64 " memsz=%#" PRIx64 "\n", -paddr, memsz); +palign = elf_uval(elf, phdr, p_align); +elf_msg(elf, +"ELF: phdr: paddr=%#" PRIx64 " memsz=%#" PRIx64 " palign=%#" PRIx64 "\n", +paddr, memsz, palign); if ( low > paddr ) low = paddr; if ( high < paddr + memsz ) high = paddr + memsz; +if ( max_align < palign ) +max_align = palign; } elf->pstart = low; elf->pend = high; -elf_msg(elf, "ELF: memory: %#" PRIx64 " -> %#" PRIx64 "\n", -elf->pstart, elf->pend); +elf->palign = max_align; +elf_msg(elf, +"ELF: memory: %#" PRIx64 " -> %#" PRIx64 " align:%#" PRIx64 "\n", +elf->pstart, elf->pend, elf->palign); } elf_errorstatus elf_load_binary(struct elf_binary *elf) diff --git a/xen/include/xen/libelf.h b/xen/include/xen/libelf.h index 1c77e3df31..2d971f958e 100644 --- a/xen/include/xen/libelf.h +++ b/xen/include/xen/libelf.h @@ -196,6 +196,7 @@ struct elf_binary { size_t dest_size; uint64_t pstart; uint64_t pend; +uint64_t palign; uint64_t reloc_offset; uint64_t bsd_symtab_pstart; -- 2.44.0
[PATCH v6] RFC: x86/pvh: Make Xen PVH entrypoint PIC for x86-64
The Xen PVH entrypoint is 32bit non-PIC code running at a default load address of 0x100 (16MB) (CONFIG_PHYSICAL_START). Xen loads the kernel at that physical address inside the PVH container. When running a PVH Dom0, the system reserved addresses are mapped 1-1 into the PVH container. There exist system firmwares (Coreboot/EDK2) with reserved memory at 16MB. This creates a conflict where the PVH kernel cannot be loaded at that address. Modify the PVH entrypoint to be position-indepedent to allow flexibility in load address. Only the 64bit entry path is converted. A 32bit kernel is not PIC, so calling into other parts of the kernel, like xen_prepare_pvh() and mk_pgtable_32(), don't work properly when relocated. Initial PVH entry runs at the physical addresses and then transitions to the identity mapped address. While executing xen_prepare_pvh() calls through pv_ops function pointers transition to the high mapped addresses. Additionally, __va() is called on some hvm_start_info physical addresses, we need the directmap address range is used. So we need to run page tables with all of those ranges mapped. Modifying init_top_pgt tables ran into issue since startup_64/__startup_64() will modify those page tables again. Use a dedicated set of page tables - pvh_init_top_pgt - for the PVH entry to avoid unwanted interactions. In xen_pvh_init(), __pa() is called to find the physical address of the hypercall page. Set phys_base temporarily before calling into xen_prepare_pvh(), which calls xen_pvh_init(), and clear it afterwards. __startup_64() assumes phys_base is zero and adds load_delta to it. If phys_base is already set, the calculation results in an incorrect cr3. TODO: Sync elfnote.h from xen.git commit xx when it is commited. Signed-off-by: Jason Andryuk --- Put this out as an example for the Xen modifications Instead of setting and clearing phys_base, add a dedicated variable? Clearing phys_base is a little weird, but it leaves the kernel more consistent when running non-entry code. Make __startup_64() exit if phys_base is already set to allow calling multiple times, and use that and init_top_pgt instead of adding additional page tables? That won't work. __startup_64 is 64bit code, and pvh needs to create page tables in 32bit code before it can transition to 64bit long mode. Hence it can't be re-used to relocate page tables. --- arch/x86/platform/pvh/head.S| 182 +--- include/xen/interface/elfnote.h | 16 ++- 2 files changed, 185 insertions(+), 13 deletions(-) diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S index f7235ef87bc3..ce2b201210e3 100644 --- a/arch/x86/platform/pvh/head.S +++ b/arch/x86/platform/pvh/head.S @@ -50,11 +50,32 @@ #define PVH_CS_SEL (PVH_GDT_ENTRY_CS * 8) #define PVH_DS_SEL (PVH_GDT_ENTRY_DS * 8) +#define rva(x) ((x) - pvh_start_xen) + SYM_CODE_START_LOCAL(pvh_start_xen) UNWIND_HINT_END_OF_STACK cld - lgdt (_pa(gdt)) + /* +* See the comment for startup_32 for more details. We need to +* execute a call to get the execution address to be position +* independent, but we don't have a stack. Save and restore the +* magic field of start_info in ebx, and use that as the stack. +*/ + mov (%ebx), %eax + leal4(%ebx), %esp + ANNOTATE_INTRA_FUNCTION_CALL + call1f +1: popl%ebp + mov %eax, (%ebx) + subl$ rva(1b), %ebp + movl$0, %esp + + lealrva(gdt)(%ebp), %eax + movl%eax, %ecx + lealrva(gdt_start)(%ebp), %ecx + movl%ecx, 2(%eax) + lgdt(%eax) mov $PVH_DS_SEL,%eax mov %eax,%ds @@ -62,14 +83,14 @@ SYM_CODE_START_LOCAL(pvh_start_xen) mov %eax,%ss /* Stash hvm_start_info. */ - mov $_pa(pvh_start_info), %edi + leal rva(pvh_start_info)(%ebp), %edi mov %ebx, %esi - mov _pa(pvh_start_info_sz), %ecx + movl rva(pvh_start_info_sz)(%ebp), %ecx shr $2,%ecx rep movsl - mov $_pa(early_stack_end), %esp + leal rva(early_stack_end)(%ebp), %esp /* Enable PAE mode. */ mov %cr4, %eax @@ -83,29 +104,81 @@ SYM_CODE_START_LOCAL(pvh_start_xen) btsl $_EFER_LME, %eax wrmsr + mov %ebp, %ebx + subl $LOAD_PHYSICAL_ADDR, %ebx /* offset */ + jz .Lpagetable_done + + /* Fixup page-tables for relocation. */ + leal rva(pvh_init_top_pgt)(%ebp), %edi + movl $512, %ecx +2: + testl $_PAGE_PRESENT, 0x00(%edi) + jz 1f + addl %ebx, 0x00(%edi) +1: + addl $8, %edi + decl %ecx + jnz 2b + + /* L3 ident has a single entry. */ + leal rva(pvh_level3_ident_pgt)(%ebp), %edi + addl %ebx, 0x00(%edi) + + leal rva(pvh_level3_kernel_pgt)(%ebp), %edi + addl %ebx, (4096 - 16)(%edi) + addl %ebx, (4096
[linux-6.1 test] 185167: tolerable FAIL - PUSHED
flight 185167 linux-6.1 real [real] http://logs.test-lab.xenproject.org/osstest/logs/185167/ Failures :-/ but no regressions. Tests which did not succeed, but are not blocking: test-armhf-armhf-libvirt 16 saverestore-support-checkfail like 185053 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185053 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185053 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185053 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185053 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185053 test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail never pass test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail never pass test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit1 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass test-armhf-armhf-libvirt 15 migrate-support-checkfail never pass test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail never pass test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail never pass test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail never pass test-arm64-arm64-xl-vhd 14 migrate-support-checkfail never pass test-arm64-arm64-xl-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-qcow214 migrate-support-checkfail never pass test-armhf-armhf-xl-qcow215 saverestore-support-checkfail never pass test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-raw 14 migrate-support-checkfail never pass test-armhf-armhf-xl-raw 15 saverestore-support-checkfail never pass version targeted for testing: linuxe5cd595e23c1a075359a337c0e5c3a4f2dc28dd1 baseline version: linuxd7543167affd372819a94879b8b1e8b9b12547d9 Last test of basis 185053 2024-03-15 19:18:14 Z 12 days Testing same since 185167 2024-03-26 22:43:26 Z1 days1 attempts 377 people touched revisions under test, not listing them all jobs: build-amd64-xsm pass build-arm64-xsm pass build-i386-xsm pass build-amd64 pass build-arm64 pass build-armhf pass build-i386 pass build-amd64-libvirt pass build-arm64-libvirt
Re: Linux Xen PV CPA W^X violation false-positives
On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß wrote: > > On 24.01.24 17:54, Jason Andryuk wrote: > > + > > + return new; > > + } > > + } > > + > > end = start + npg * PAGE_SIZE - 1; > > WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: > > 0x%016lx - 0x%016lx PFN %lx\n", > > (unsigned long long)pgprot_val(old), > > Jason, do you want to send a V2 with your Signed-off, or would you like me to > try upstreaming the patch? Hi Jürgen, Yes, please upstream your approach. I wasn't sure how to deal with it, so it was more of a bug report. Thanks, Jason
[linux-5.4 test] 185168: tolerable FAIL - PUSHED
flight 185168 linux-5.4 real [real] http://logs.test-lab.xenproject.org/osstest/logs/185168/ Failures :-/ but no regressions. Tests which did not succeed, but are not blocking: test-armhf-armhf-xl-credit1 14 guest-start fail like 185071 test-armhf-armhf-xl-multivcpu 18 guest-start/debian.repeatfail like 185080 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 185108 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185108 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 185108 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185108 test-armhf-armhf-xl-rtds 18 guest-start/debian.repeatfail like 185108 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185108 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185108 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 185108 test-armhf-armhf-libvirt 16 saverestore-support-checkfail like 185108 test-armhf-armhf-xl-credit2 18 guest-start/debian.repeatfail like 185108 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 185108 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185108 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-amd64-i386-xl-pvshim14 guest-start fail never pass test-amd64-i386-libvirt 15 migrate-support-checkfail never pass test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail never pass test-amd64-i386-libvirt-xsm 15 migrate-support-checkfail never pass test-amd64-i386-libvirt-qcow2 14 migrate-support-checkfail never pass test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass test-amd64-i386-libvirt-raw 14 migrate-support-checkfail never pass test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail never pass test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail never pass test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-qcow214 migrate-support-checkfail never pass test-armhf-armhf-xl-qcow215 saverestore-support-checkfail never pass test-arm64-arm64-xl-vhd 14 migrate-support-checkfail never pass test-arm64-arm64-xl-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-armhf-armhf-libvirt 15 migrate-support-checkfail never pass test-armhf-armhf-xl-raw 14 migrate-support-checkfail never pass test-armhf-armhf-xl-raw 15 saverestore-support-checkfail never pass version targeted for testing: linux24489321d0cd5339f9c2da01eb8bf2bccbac7956 baseline version: linux84075826304f1a297838de6bcfd9bd84f566026e Last test of basis 185108 2024-03-20 07:24:30 Z7 d
[xen-unstable test] 185169: regressions - FAIL
flight 185169 xen-unstable real [real] http://logs.test-lab.xenproject.org/osstest/logs/185169/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: build-arm64-xsm 6 xen-buildfail REGR. vs. 185162 Tests which are failing intermittently (not blocking): test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict 7 xen-install fail in 185162 pass in 185169 test-amd64-amd64-xl-qemut-debianhvm-amd64 20 guest-start/debianhvm.repeat fail pass in 185162 Tests which did not succeed, but are not blocking: test-arm64-arm64-libvirt-xsm 1 build-check(1) blocked n/a test-arm64-arm64-xl-xsm 1 build-check(1) blocked n/a test-arm64-arm64-xl-xsm 15 migrate-support-check fail in 185162 never pass test-arm64-arm64-xl-xsm 16 saverestore-support-check fail in 185162 never pass test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail in 185162 never pass test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail in 185162 never pass test-armhf-armhf-libvirt 16 saverestore-support-checkfail like 185162 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185162 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185162 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185162 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185162 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185162 test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-libvirt 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail never pass test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail never pass test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail never pass test-arm64-arm64-xl-vhd 14 migrate-support-checkfail never pass test-arm64-arm64-xl-vhd 15 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass test-armhf-armhf-xl-qcow214 migrate-support-checkfail never pass test-armhf-armhf-xl-qcow215 saverestore-support-checkfail never pass test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-raw 14 migrate-support-checkfail never pass test-armhf-armhf-xl-raw 15 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit1 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass version targeted for testing: xen 6f6de10ade5ade907f9e3f3c72b7b18f7852d9ff baseline version: xen 6f6de10ade5ade907f9e3f3c72b7b18f7852d9ff Last test of basis 185169 2024-03-27 01:54:33 Z1 days Testing same since (not found) 0 attempts jobs: build-amd64-xsm pass build-arm64-xsm fail build-i386-xsm
[QEMU PATCH v5 0/1] Support device passthrough when dom0 is PVH on Xen
Hi All, This is v5 series to support passthrough on Xen when dom0 is PVH. v4->v5 changes: * Add review by Stefano v3->v4 changes: * Add gsi into struct XenHostPCIDevice and use gsi number that read from gsi sysfs if it exists, if there is no gsi sysfs, still use irq. v2->v3 changes: * Du to changes in the implementation of the second patch on kernel side(that adds a new sysfs for gsi instead of a new syscall), so read gsi number from the sysfs of gsi. Below is the description of v2 cover letter: This patch is the v2 of the implementation of passthrough when dom0 is PVH on Xen. Issues we encountered: 1. failed to map pirq for gsi Problem: qemu will call xc_physdev_map_pirq() to map a passthrough device’s gsi to pirq in function xen_pt_realize(). But failed. Reason: According to the implement of xc_physdev_map_pirq(), it needs gsi instead of irq, but qemu pass irq to it and treat irq as gsi, it is got from file /sys/bus/pci/devices/:xx:xx.x/irq in function xen_host_pci_device_get(). But actually the gsi number is not equal with irq. On PVH dom0, when it allocates irq for a gsi in function acpi_register_gsi_ioapic(), allocation is dynamic, and follow the principle of applying first, distributing first. And if you debug the kernel codes (see function __irq_alloc_descs), you will find the irq number is allocated from small to large by order, but the applying gsi number is not, gsi 38 may come before gsi 28, that causes gsi 38 get a smaller irq number than gsi 28, and then gsi != irq. Solution: we can record the relation between gsi and irq, then when userspace(qemu) want to use gsi, we can do a translation. The third patch of kernel(xen/privcmd: Add new syscall to get gsi from irq) records all the relations in acpi_register_gsi_xen_pvh() when dom0 initialize pci devices, and provide a syscall for userspace to get the gsi from irq. The third patch of xen(tools: Add new function to get gsi from irq) add a new function xc_physdev_gsi_from_irq() to call the new syscall added on kernel side. And then userspace can use that function to get gsi. Then xc_physdev_map_pirq() will success. This v2 on qemu side is the same as the v1 (qemu https://lore.kernel.org/xen-devel/20230312092244.451465-19-ray.hu...@amd.com/), just call xc_physdev_gsi_from_irq() to get gsi from irq. Jiqian Chen (1): xen: Use gsi instead of irq for mapping pirq hw/xen/xen-host-pci-device.c | 7 +++ hw/xen/xen-host-pci-device.h | 1 + hw/xen/xen_pt.c | 6 +- 3 files changed, 13 insertions(+), 1 deletion(-) -- 2.34.1
[RFC QEMU PATCH v5 1/1] xen: Use gsi instead of irq for mapping pirq
In PVH dom0, it uses the linux local interrupt mechanism, when it allocs irq for a gsi, it is dynamic, and follow the principle of applying first, distributing first. And the irq number is alloced from small to large, but the applying gsi number is not, may gsi 38 comes before gsi 28, that causes the irq number is not equal with the gsi number. And when passthrough a device, qemu wants to use gsi to map pirq, xen_pt_realize->xc_physdev_map_pirq, but the gsi number is got from file /sys/bus/pci/devices//irq in current code, so it will fail when mapping. Add gsi into XenHostPCIDevice and use gsi number that read from gsi sysfs if it exists. Signed-off-by: Huang Rui Signed-off-by: Jiqian Chen Reviewed-by: Stefano Stabellini --- RFC: discussions ongoing on the Linux side where/how to expose the gsi --- hw/xen/xen-host-pci-device.c | 7 +++ hw/xen/xen-host-pci-device.h | 1 + hw/xen/xen_pt.c | 6 +- 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c index 8c6e9a1716a2..5be3279aa25b 100644 --- a/hw/xen/xen-host-pci-device.c +++ b/hw/xen/xen-host-pci-device.c @@ -370,6 +370,13 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t domain, } d->irq = v; +xen_host_pci_get_dec_value(d, "gsi", &v, errp); +if (*errp) { +d->gsi = -1; +} else { +d->gsi = v; +} + xen_host_pci_get_hex_value(d, "class", &v, errp); if (*errp) { goto error; diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h index 4d8d34ecb024..74c552bb5548 100644 --- a/hw/xen/xen-host-pci-device.h +++ b/hw/xen/xen-host-pci-device.h @@ -27,6 +27,7 @@ typedef struct XenHostPCIDevice { uint16_t device_id; uint32_t class_code; int irq; +int gsi; XenHostPCIIORegion io_regions[PCI_NUM_REGIONS - 1]; XenHostPCIIORegion rom; diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c index 3635d1b39f79..d34a7a8764ab 100644 --- a/hw/xen/xen_pt.c +++ b/hw/xen/xen_pt.c @@ -840,7 +840,11 @@ static void xen_pt_realize(PCIDevice *d, Error **errp) goto out; } -machine_irq = s->real_device.irq; +if (s->real_device.gsi < 0) { +machine_irq = s->real_device.irq; +} else { +machine_irq = s->real_device.gsi; +} if (machine_irq == 0) { XEN_PT_LOG(d, "machine irq is 0\n"); cmd |= PCI_COMMAND_INTX_DISABLE; -- 2.34.1
Re: Serious AMD-Vi(?) issue
On 27.03.2024 18:27, Elliott Mitchell wrote: > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote: >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote: >>> On 22.03.2024 20:22, Elliott Mitchell wrote: On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote: > > I can see you've recently engaged with our community with some issues > you'd > like help with. > We love the fact you are participating in our project, however, our > developers aren't able to help if you do not provide the specific details. Please point to specific details which have been omitted. Fairly little data has been provided as fairly little data is available. The primary observation is large numbers of: (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I Lines in Xen's ring buffer. >>> >>> Yet this is (part of) the problem: By providing only the messages that >>> appear >>> relevant to you, you imply that you know that no other message is in any way >>> relevant. That's judgement you'd better leave to people actually trying to >>> investigate. Unless of course you were proposing an actual code change, with >>> suitable justification. >> >> Honestly, I forgot about the very small number of messages from the SATA >> subsystem. The question of whether the current mitigation actions are >> effective right now was a bigger issue. As such monitoring `xl dmesg` >> was a priority to looking at SATA messages which failed to reliably >> indicate status. >> >> I *thought* I would be able to retrieve those via other slow means, but a >> different and possibly overlapping issue has shown up. Unfortunately >> this means those are no longer retrievable. :-( > > With some persistence I was able to retrieve them. There are other > pieces of software with worse UIs than Xen. > >>> In fact when running into trouble, the usual course of action would be to >>> increase verbosity in both hypervisor and kernel, just to make sure no >>> potentially relevant message is missed. >> >> More/better information might have been obtained if I'd been engaged >> earlier. > > This is still true, things are in full mitigation mode and I'll be > quite unhappy to go back with experiments at this point. Well, it very likely won't work without further experimenting by someone able to observe the bad behavior. Recall we're on xen-devel here; it is kind of expected that without clear (and practical) repro instructions experimenting as well as info collection will remain with the reporter. > I now see why I left those out. The messages from the SATA subsystem > were from a kernel which a bad patch had leaked into a LTS branch. Looks > like the SATA subsystem was significantly broken and I'm unsure whether > any useful information could be retrieved. Notably there is quite a bit > of noise from SATA devices not effected by this issue. > > Some of the messages /might/ be useful, but the amount of noise is quite > high. Do messages from a broken kernel interest you? If this was a less vague (in terms of possible root causes) issue, I'd probably have answered "yes". But in the case here I'm afraid such might further confuse things rather than clarifying them. Jan
[XEN PATCH v6 1/5] xen/vpci: Clear all vpci status of device
When a device has been reset on dom0 side, the vpci on Xen side won't get notification, so the cached state in vpci is all out of date compare with the real device state. To solve that problem, add a new hypercall to clear all vpci device state. When the state of device is reset on dom0 side, dom0 can call this hypercall to notify vpci. Signed-off-by: Huang Rui Signed-off-by: Jiqian Chen Reviewed-by: Stewart Hildebrand Reviewed-by: Stefano Stabellini --- xen/arch/x86/hvm/hypercall.c | 1 + xen/drivers/pci/physdev.c| 36 xen/drivers/vpci/vpci.c | 10 ++ xen/include/public/physdev.h | 7 +++ xen/include/xen/vpci.h | 6 ++ 5 files changed, 60 insertions(+) diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c index eeb73e1aa5d0..6ad5b4d5f11f 100644 --- a/xen/arch/x86/hvm/hypercall.c +++ b/xen/arch/x86/hvm/hypercall.c @@ -84,6 +84,7 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case PHYSDEVOP_pci_mmcfg_reserved: case PHYSDEVOP_pci_device_add: case PHYSDEVOP_pci_device_remove: +case PHYSDEVOP_pci_device_state_reset: case PHYSDEVOP_dbgp_op: if ( !is_hardware_domain(currd) ) return -ENOSYS; diff --git a/xen/drivers/pci/physdev.c b/xen/drivers/pci/physdev.c index 42db3e6d133c..73dc8f058b0e 100644 --- a/xen/drivers/pci/physdev.c +++ b/xen/drivers/pci/physdev.c @@ -2,6 +2,7 @@ #include #include #include +#include #ifndef COMPAT typedef long ret_t; @@ -67,6 +68,41 @@ ret_t pci_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) break; } +case PHYSDEVOP_pci_device_state_reset: { +struct physdev_pci_device dev; +struct pci_dev *pdev; +pci_sbdf_t sbdf; + +if ( !is_pci_passthrough_enabled() ) +return -EOPNOTSUPP; + +ret = -EFAULT; +if ( copy_from_guest(&dev, arg, 1) != 0 ) +break; +sbdf = PCI_SBDF(dev.seg, dev.bus, dev.devfn); + +ret = xsm_resource_setup_pci(XSM_PRIV, sbdf.sbdf); +if ( ret ) +break; + +pcidevs_lock(); +pdev = pci_get_pdev(NULL, sbdf); +if ( !pdev ) +{ +pcidevs_unlock(); +ret = -ENODEV; +break; +} + +write_lock(&pdev->domain->pci_lock); +ret = vpci_reset_device_state(pdev); +write_unlock(&pdev->domain->pci_lock); +pcidevs_unlock(); +if ( ret ) +printk(XENLOG_ERR "%pp: failed to reset PCI device state\n", &sbdf); +break; +} + default: ret = -ENOSYS; break; diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c index 260b72875ee1..310700c1e775 100644 --- a/xen/drivers/vpci/vpci.c +++ b/xen/drivers/vpci/vpci.c @@ -117,6 +117,16 @@ int vpci_assign_device(struct pci_dev *pdev) return rc; } + +int vpci_reset_device_state(struct pci_dev *pdev) +{ +ASSERT(pcidevs_locked()); +ASSERT(rw_is_write_locked(&pdev->domain->pci_lock)); + +vpci_deassign_device(pdev); +return vpci_assign_device(pdev); +} + #endif /* __XEN__ */ static int vpci_register_cmp(const struct vpci_register *r1, diff --git a/xen/include/public/physdev.h b/xen/include/public/physdev.h index f0c0d4727c0b..f5bab1f29779 100644 --- a/xen/include/public/physdev.h +++ b/xen/include/public/physdev.h @@ -296,6 +296,13 @@ DEFINE_XEN_GUEST_HANDLE(physdev_pci_device_add_t); */ #define PHYSDEVOP_prepare_msix 30 #define PHYSDEVOP_release_msix 31 +/* + * Notify the hypervisor that a PCI device has been reset, so that any + * internally cached state is regenerated. Should be called after any + * device reset performed by the hardware domain. + */ +#define PHYSDEVOP_pci_device_state_reset 32 + struct physdev_pci_device { /* IN */ uint16_t seg; diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h index e89c571890b2..ea64d94e818b 100644 --- a/xen/include/xen/vpci.h +++ b/xen/include/xen/vpci.h @@ -30,6 +30,7 @@ int __must_check vpci_assign_device(struct pci_dev *pdev); /* Remove all handlers and free vpci related structures. */ void vpci_deassign_device(struct pci_dev *pdev); +int __must_check vpci_reset_device_state(struct pci_dev *pdev); /* Add/remove a register handler. */ int __must_check vpci_add_register_mask(struct vpci *vpci, @@ -266,6 +267,11 @@ static inline int vpci_assign_device(struct pci_dev *pdev) static inline void vpci_deassign_device(struct pci_dev *pdev) { } +static inline int __must_check vpci_reset_device_state(struct pci_dev *pdev) +{ +return 0; +} + static inline void vpci_dump_msi(void) { } static inline uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, -- 2.34.1
[RFC XEN PATCH v6 3/5] x86/pvh: Add PHYSDEVOP_setup_gsi for PVH dom0
On PVH dom0, the gsis don't get registered, but the gsi of a passthrough device must be configured for it to be able to be mapped into a hvm domU. On Linux kernel side, it calles PHYSDEVOP_setup_gsi for passthrough devices to register gsi when dom0 is PVH. So, add PHYSDEVOP_setup_gsi for above purpose. Signed-off-by: Huang Rui Signed-off-by: Jiqian Chen --- xen/arch/x86/hvm/hypercall.c | 5 + 1 file changed, 5 insertions(+) diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c index 493998b42ec5..7d4e41f66885 100644 --- a/xen/arch/x86/hvm/hypercall.c +++ b/xen/arch/x86/hvm/hypercall.c @@ -76,6 +76,11 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case PHYSDEVOP_unmap_pirq: break; +case PHYSDEVOP_setup_gsi: +if ( !is_hardware_domain(currd) ) +return -EOPNOTSUPP; +break; + case PHYSDEVOP_eoi: case PHYSDEVOP_irq_status_query: case PHYSDEVOP_get_free_pirq: -- 2.34.1
[RFC XEN PATCH v6 0/5] Support device passthrough when dom0 is PVH on Xen
Hi All, This is v6 series to support passthrough when dom0 is PVH v5->v6 changes: * patch#1: Add Reviewed-by Stefano and Stewart. Rebase code and change old function vpci_remove_device, vpci_add_handlers to vpci_deassign_device, vpci_assign_device * patch#2: Add Reviewed-by Stefano * patch#3: Remove unnecessary "ASSERT(!has_pirq(currd));" * patch#4: Fix some coding style issues below directory tools * patch#5: Modified some variable names and code logic to make code easier to be understood, which to use gsi by default and be compatible with older kernel versions to continue to use irq v4->v5 changes: * patch#1: add pci_lock wrap function vpci_reset_device_state * patch#2: move the check of self map_pirq to physdev.c, and change to check if the caller has PIRQ flag, and just break for PHYSDEVOP_(un)map_pirq in hvm_physdev_op * patch#3: return -EOPNOTSUPP instead, and use ASSERT(!has_pirq(currd)); * patch#4: is the patch#5 in v4 because patch#5 in v5 has some dependency on it. And add the handling of errno and add the Reviewed-by Stefano * patch#5: is the patch#4 in v4. New implementation to add new hypercall XEN_DOMCTL_gsi_permission to grant gsi v3->v4 changes: * patch#1: change the comment of PHYSDEVOP_pci_device_state_reset; move printings behind pcidevs_unlock * patch#2: add check to prevent PVH self map * patch#3: new patch, The implementation of adding PHYSDEVOP_setup_gsi for PVH is treated as a separate patch * patch#4: new patch to solve the map_pirq problem of PVH dom0. use gsi to grant irq permission in XEN_DOMCTL_irq_permission. * patch#5: to be compatible with previous kernel versions, when there is no gsi sysfs, still use irq v4 link: https://lore.kernel.org/xen-devel/20240105070920.350113-1-jiqian.c...@amd.com/T/#t v2->v3 changes: * patch#1: move the content out of pci_reset_device_state and delete pci_reset_device_state; add xsm_resource_setup_pci check for PHYSDEVOP_pci_device_state_reset; add description for PHYSDEVOP_pci_device_state_reset; * patch#2: du to changes in the implementation of the second patch on kernel side(that it will do setup_gsi and map_pirq when assigning a device to passthrough), add PHYSDEVOP_setup_gsi for PVH dom0, and we need to support self mapping. * patch#3: du to changes in the implementation of the second patch on kernel side(that adds a new sysfs for gsi instead of a new syscall), so read gsi number from the sysfs of gsi. v3 link: https://lore.kernel.org/xen-devel/20231210164009.1551147-1-jiqian.c...@amd.com/T/#t v2 link: https://lore.kernel.org/xen-devel/20231124104136.3263722-1-jiqian.c...@amd.com/T/#t Below is the description of v2 cover letter: This series of patches are the v2 of the implementation of passthrough when dom0 is PVH on Xen. We sent the v1 to upstream before, but the v1 had so many problems and we got lots of suggestions. I will introduce all issues that these patches try to fix and the differences between v1 and v2. Issues we encountered: 1. pci_stub failed to write bar for a passthrough device. Problem: when we run “sudo xl pci-assignable-add ” to assign a device, pci_stub will call “pcistub_init_device() -> pci_restore_state() -> pci_restore_config_space() -> pci_restore_config_space_range() -> pci_restore_config_dword() -> pci_write_config_dword()”, the pci config write will trigger an io interrupt to bar_write() in the xen, but the bar->enabled was set before, the write is not allowed now, and then when bar->Qemu config the passthrough device in xen_pt_realize(), it gets invalid bar values. Reason: the reason is that we don't tell vPCI that the device has been reset, so the current cached state in pdev->vpci is all out of date and is different from the real device state. Solution: to solve this problem, the first patch of kernel(xen/pci: Add xen_reset_device_state function) and the fist patch of xen(xen/vpci: Clear all vpci status of device) add a new hypercall to reset the state stored in vPCI when the state of real device has changed. Thank Roger for the suggestion of this v2, and it is different from v1 (https://lore.kernel.org/xen-devel/20230312075455.450187-3-ray.hu...@amd.com/), v1 simply allow domU to write pci bar, it does not comply with the design principles of vPCI. 2. failed to do PHYSDEVOP_map_pirq when dom0 is PVH Problem: HVM domU will do PHYSDEVOP_map_pirq for a passthrough device by using gsi. See xen_pt_realize->xc_physdev_map_pirq and pci_add_dm_done->xc_physdev_map_pirq. Then xc_physdev_map_pirq will call into Xen, but in hvm_physdev_op(), PHYSDEVOP_map_pirq is not allowed. Reason: In hvm_physdev_op(), the variable "currd" is PVH dom0 and PVH has no X86_EMU_USE_PIRQ flag, it will fail at has_pirq check. Solution: I think we may need to allow PHYSDEVOP_map_pirq when "currd" is dom0 (at present dom0 is PVH). The second patch of xen(x86/pvh: Open PHYSDEVOP_map_pirq for PVH dom0) allow PVH dom0 do PHYSDEVOP_map_pirq. This v2 patch is better than v1, v1 simply remove the
[XEN PATCH v6 2/5] x86/pvh: Allow (un)map_pirq when dom0 is PVH
If run Xen with PVH dom0 and hvm domU, hvm will map a pirq for a passthrough device by using gsi, see xen_pt_realize->xc_physdev_map_pirq and pci_add_dm_done->xc_physdev_map_pirq. Then xc_physdev_map_pirq will call into Xen, but in hvm_physdev_op, PHYSDEVOP_map_pirq is not allowed because currd is PVH dom0 and PVH has no X86_EMU_USE_PIRQ flag, it will fail at has_pirq check. So, allow PHYSDEVOP_map_pirq when dom0 is PVH and also allow PHYSDEVOP_unmap_pirq for the failed path to unmap pirq. And add a new check to prevent self map when caller has no PIRQ flag. Signed-off-by: Huang Rui Signed-off-by: Jiqian Chen Reviewed-by: Stefano Stabellini --- xen/arch/x86/hvm/hypercall.c | 2 ++ xen/arch/x86/physdev.c | 24 2 files changed, 26 insertions(+) diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c index 6ad5b4d5f11f..493998b42ec5 100644 --- a/xen/arch/x86/hvm/hypercall.c +++ b/xen/arch/x86/hvm/hypercall.c @@ -74,6 +74,8 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) { case PHYSDEVOP_map_pirq: case PHYSDEVOP_unmap_pirq: +break; + case PHYSDEVOP_eoi: case PHYSDEVOP_irq_status_query: case PHYSDEVOP_get_free_pirq: diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c index 7efa17cf4c1e..1367abc61e54 100644 --- a/xen/arch/x86/physdev.c +++ b/xen/arch/x86/physdev.c @@ -305,11 +305,23 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case PHYSDEVOP_map_pirq: { physdev_map_pirq_t map; struct msi_info msi; +struct domain *d; ret = -EFAULT; if ( copy_from_guest(&map, arg, 1) != 0 ) break; +d = rcu_lock_domain_by_any_id(map.domid); +if ( d == NULL ) +return -ESRCH; +/* If it is an HVM guest, check if it has PIRQs */ +if ( !is_pv_domain(d) && !has_pirq(d) ) +{ +rcu_unlock_domain(d); +return -EOPNOTSUPP; +} +rcu_unlock_domain(d); + switch ( map.type ) { case MAP_PIRQ_TYPE_MSI_SEG: @@ -343,11 +355,23 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case PHYSDEVOP_unmap_pirq: { struct physdev_unmap_pirq unmap; +struct domain *d; ret = -EFAULT; if ( copy_from_guest(&unmap, arg, 1) != 0 ) break; +d = rcu_lock_domain_by_any_id(unmap.domid); +if ( d == NULL ) +return -ESRCH; +/* If it is an HVM guest, check if it has PIRQs */ +if ( !is_pv_domain(d) && !has_pirq(d) ) +{ +rcu_unlock_domain(d); +return -EOPNOTSUPP; +} +rcu_unlock_domain(d); + ret = physdev_unmap_pirq(unmap.domid, unmap.pirq); break; } -- 2.34.1
[RFC XEN PATCH v6 5/5] domctl: Add XEN_DOMCTL_gsi_permission to grant gsi
Some type of domain don't have PIRQ, like PVH, when passthrough a device to guest on PVH dom0, callstack pci_add_dm_done->XEN_DOMCTL_irq_permission will failed at domain_pirq_to_irq. So, add a new hypercall to grant/revoke gsi permission when dom0 is not PV or dom0 has not PIRQ flag. Signed-off-by: Huang Rui Signed-off-by: Jiqian Chen --- tools/include/xenctrl.h | 5 tools/libs/ctrl/xc_domain.c | 15 +++ tools/libs/light/libxl_pci.c | 52 +--- xen/arch/x86/domctl.c| 31 + xen/include/public/domctl.h | 9 +++ xen/xsm/flask/hooks.c| 1 + 6 files changed, 103 insertions(+), 10 deletions(-) diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h index 2ef8b4e05422..519c860a00d5 100644 --- a/tools/include/xenctrl.h +++ b/tools/include/xenctrl.h @@ -1382,6 +1382,11 @@ int xc_domain_irq_permission(xc_interface *xch, uint32_t pirq, bool allow_access); +int xc_domain_gsi_permission(xc_interface *xch, + uint32_t domid, + uint32_t gsi, + bool allow_access); + int xc_domain_iomem_permission(xc_interface *xch, uint32_t domid, unsigned long first_mfn, diff --git a/tools/libs/ctrl/xc_domain.c b/tools/libs/ctrl/xc_domain.c index f2d9d14b4d9f..8540e84fda93 100644 --- a/tools/libs/ctrl/xc_domain.c +++ b/tools/libs/ctrl/xc_domain.c @@ -1394,6 +1394,21 @@ int xc_domain_irq_permission(xc_interface *xch, return do_domctl(xch, &domctl); } +int xc_domain_gsi_permission(xc_interface *xch, + uint32_t domid, + uint32_t gsi, + bool allow_access) +{ +struct xen_domctl domctl = { +.cmd = XEN_DOMCTL_gsi_permission, +.domain = domid, +.u.gsi_permission.gsi = gsi, +.u.gsi_permission.allow_access = allow_access, +}; + +return do_domctl(xch, &domctl); +} + int xc_domain_iomem_permission(xc_interface *xch, uint32_t domid, unsigned long first_mfn, diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c index 2cec83e0b734..debf6ec6ddc7 100644 --- a/tools/libs/light/libxl_pci.c +++ b/tools/libs/light/libxl_pci.c @@ -1421,6 +1421,8 @@ static void pci_add_dm_done(libxl__egc *egc, uint32_t flag = XEN_DOMCTL_DEV_RDM_RELAXED; uint32_t domainid = domid; bool isstubdom = libxl_is_stubdom(ctx, domid, &domainid); +int gsi; +bool is_gsi = true; /* Convenience aliases */ bool starting = pas->starting; @@ -1485,6 +1487,7 @@ static void pci_add_dm_done(libxl__egc *egc, /* To compitable with old version of kernel, still need to use irq */ sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain, pci->bus, pci->dev, pci->func); +is_gsi = false; } f = fopen(sysfs_path, "r"); if (f == NULL) { @@ -1492,6 +1495,13 @@ static void pci_add_dm_done(libxl__egc *egc, goto out_no_irq; } if ((fscanf(f, "%u", &irq) == 1) && irq) { +/* + * If use gsi, save the value, because the value of irq + * will be changed by function xc_physdev_map_pirq + */ +if (is_gsi) { +gsi = irq; +} r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq); if (r < 0) { LOGED(ERROR, domainid, "xc_physdev_map_pirq irq=%d (error=%d)", @@ -1500,13 +1510,25 @@ static void pci_add_dm_done(libxl__egc *egc, rc = ERROR_FAIL; goto out; } -r = xc_domain_irq_permission(ctx->xch, domid, irq, 1); -if (r < 0) { -LOGED(ERROR, domainid, - "xc_domain_irq_permission irq=%d (error=%d)", irq, r); -fclose(f); -rc = ERROR_FAIL; -goto out; +if (is_gsi) { +r = xc_domain_gsi_permission(ctx->xch, domid, gsi, 1); +if (r < 0 && r != -EOPNOTSUPP) { +LOGED(ERROR, domainid, + "xc_domain_gsi_permission gsi=%d (error=%d)", gsi, r); +fclose(f); +rc = ERROR_FAIL; +goto out; +} +} +if (!is_gsi || r == -EOPNOTSUPP) { +r = xc_domain_irq_permission(ctx->xch, domid, irq, 1); +if (r < 0) { +LOGED(ERROR, domainid, +"xc_domain_irq_permission irq=%d (error=%d)", irq, r); +fclose(f); +rc = ERROR_FAIL; +goto out; +} } } fclose(f); @@ -2180,6 +2202,7 @@ static void pci_remove_detached(libxl__egc *egc, FILE *f; uint32_t domainid = prs->domid; bool isstubdom; +bool is_gsi = true; /*
[RFC XEN PATCH v6 4/5] libxl: Use gsi instead of irq for mapping pirq
In PVH dom0, it uses the linux local interrupt mechanism, when it allocs irq for a gsi, it is dynamic, and follow the principle of applying first, distributing first. And the irq number is alloced from small to large, but the applying gsi number is not, may gsi 38 comes before gsi 28, that causes the irq number is not equal with the gsi number. And when passthrough a device, xl wants to use gsi to map pirq, see pci_add_dm_done->xc_physdev_map_pirq, but the gsi number is got from file /sys/bus/pci/devices//irq in current code, so it will fail when mapping. So, use real gsi number read from gsi sysfs. Signed-off-by: Huang Rui Signed-off-by: Jiqian Chen Reviewed-by: Stefano Stabellini --- RFC: discussions ongoing on the Linux side where/how to expose the gsi --- tools/libs/light/libxl_pci.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c index 96cb4da0794e..2cec83e0b734 100644 --- a/tools/libs/light/libxl_pci.c +++ b/tools/libs/light/libxl_pci.c @@ -1478,8 +1478,14 @@ static void pci_add_dm_done(libxl__egc *egc, fclose(f); if (!pci_supp_legacy_irq()) goto out_no_irq; -sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain, +sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/gsi", pci->domain, pci->bus, pci->dev, pci->func); +r = access(sysfs_path, F_OK); +if (r && errno == ENOENT) { +/* To compitable with old version of kernel, still need to use irq */ +sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain, + pci->bus, pci->dev, pci->func); +} f = fopen(sysfs_path, "r"); if (f == NULL) { LOGED(ERROR, domainid, "Couldn't open %s", sysfs_path); @@ -2229,9 +2235,15 @@ skip_bar: if (!pci_supp_legacy_irq()) goto skip_legacy_irq; -sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain, +sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/gsi", pci->domain, pci->bus, pci->dev, pci->func); +rc = access(sysfs_path, F_OK); +if (rc && errno == ENOENT) { +/* To compitable with old version of kernel, still need to use irq */ +sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/irq", pci->domain, + pci->bus, pci->dev, pci->func); +} f = fopen(sysfs_path, "r"); if (f == NULL) { LOGED(ERROR, domid, "Couldn't open %s", sysfs_path); -- 2.34.1
Re: [PATCH v2 3/3] svm/nestedsvm: Introduce nested capabilities bit
On 27.03.2024 18:01, George Dunlap wrote: > On Mon, Mar 18, 2024 at 2:17 PM Jan Beulich wrote: >> On 13.03.2024 13:24, George Dunlap wrote: >>> --- a/xen/arch/x86/domain.c >>> +++ b/xen/arch/x86/domain.c >>> @@ -673,6 +673,12 @@ int arch_sanitise_domain_config(struct >>> xen_domctl_createdomain *config) >>> */ >>> config->flags |= XEN_DOMCTL_CDF_oos_off; >>> >>> +if ( nested_virt && !hvm_nested_virt_supported() ) >>> +{ >>> +dprintk(XENLOG_INFO, "Nested virt requested but not available\n"); >>> +return -EINVAL; >>> +} >>> + >>> if ( nested_virt && !hap ) >>> { >>> dprintk(XENLOG_INFO, "Nested virt not supported without HAP\n"); >> >> As mentioned in reply to v1, I think what both start_nested_{svm,vmx}() check >> is redundant with this latter check. I think that check isn't necessary there >> (yet unlike suggested in reply to v1 I don't think anymore that the check >> here >> can alternatively be dropped). And even if it was to be kept for some reason, >> I'm having some difficulty seeing why vendor code needs to do that check, >> when >> nestedhvm_setup() could do it for both of them. > > I'm having a bit of trouble resolving the antecedents in this > paragraph (what "this" and "there" are referring to). > > Are you saying that we should set hvm_funcs.caps.nested_virt to 'true' > even if the hardware doesn't support HAP, because we check that here? > > That seems like a very strange way to go about things; hvm_funcs.caps > should reflect the actual capabilities of the hardware. Suppose at > some point we want to expose nested virt capability to the toolstack > -- wouldn't it make more sense to be able to just read > `hvm_funcs.caps.nested_virt`, rather than having to remember to && it > with `hvm_funcs.caps.hap`? > > And as you say, you can't get rid of the HAP check here, because we > also want to deny nested virt if the admin deliberately created a > guest in shadow mode on a system that has HAP available. So it's not > redundant: one is checking the capabilities of the system, the other > is checking the requested guest configuration. Hmm, yes, you're right. > As to why to have each vendor independent code check for HAP -- there > are in fact two implementations of the code; it's nice to be able to > look in one place for each implementation to determine the > requirements. Additionally, it would be possible, in the future, for > one of the nested virt implementations to enable shadow mode, while > the other one didn't. The current arrangement makes that easy. Well, first - how likely is this, when instead we've been considering whether we could get rid of shadow mode? And then this is balancing between ease of future changes vs avoiding redundancy where it can be avoided. I'm not going to insist on centralizing the HAP check, but I certainly wanted the option to be considered. >>> --- a/xen/arch/x86/hvm/nestedhvm.c >>> +++ b/xen/arch/x86/hvm/nestedhvm.c >>> @@ -150,6 +150,16 @@ static int __init cf_check nestedhvm_setup(void) >>> __clear_bit(0x80, shadow_io_bitmap[0]); >>> __clear_bit(0xed, shadow_io_bitmap[1]); >>> >>> +/* >>> + * NB this must be called after all command-line processing has been >>> + * done, so that if (for example) HAP is disabled, nested virt is >>> + * disabled as well. >>> + */ >>> +if ( cpu_has_vmx ) >>> +start_nested_vmx(&hvm_funcs); >>> +else if ( cpu_has_svm ) >>> +start_nested_svm(&hvm_funcs); >> >> Instead of passing the pointer, couldn't you have the functions return >> bool, and then set hvm_funcs.caps.nested_virt from that? Passing that >> pointer looks somewhat odd to me. > > For one, it makes start_nested_XXX symmetric to start_XXX, which also > has access to the full hvm_funcs structure (albeit in the former case > because it's creating the structure). This last aspect is the crucial aspect: start{svm,vmx}() are indeed quite special because of this. Everywhere else we have accessor helpers when hvm_funcs needs consulting, e.g. ... > For two, both of them need to read caps.hap. ... caps _reads_ would want using (an) accessor(s), too. > For three, it's just more flexible -- there may be > future things that we want to modify in the start_nested_*() > functions. If we did as you suggest, and then added (say) > caps.nested_virt_needs_hap, then we'd either need to add a second > function, or refactor it to look like this. Right, I can see this as an argument when taking, as you do, speculation on future needs into account. Albeit I'm then inclined to say that even such modifications may better be done through accessor helpers. >> Is there a reason to use direct calls here rather than a true hook >> (seeing that hvm_funcs must have been populated by the time we make it >> here)? I understand we're (remotely) considering to switch away from >> using hooks at some point, but right now this feels somewhat >> inconsistent if not furt
[RFC KERNEL PATCH v5 0/3] Support device passthrough when dom0 is PVH on Xen
Hi All, This is v5 series to support passthrough on Xen when dom0 is PVH. patch#2 change function acpi_pci_irq_lookup from a static function to non-static, need ACPI Maintainer to give some comments. patch#3 linux internal changes, need PCI and ACPI Maintainer to give more comments. v4->v5 changes: * patch#1: Add Reviewed-by Stefano * patch#2: Add Reviewed-by Stefano * patch#3: No changes v3->v4 changes: * patch#1: change the comment of PHYSDEVOP_pci_device_state_reset; use a new function pcistub_reset_device_state to wrap __pci_reset_function_locked and xen_reset_device_state, and call pcistub_reset_device_state in pci_stub.c * patch#2: remove map_pirq from xen_pvh_passthrough_gsi v3 link: https://lore.kernel.org/lkml/20231210161519.1550860-1-jiqian.c...@amd.com/T/#t v2->v3 changes: * patch#1: add condition to limit do xen_reset_device_state for no-pv domain in pcistub_init_device. * patch#2: Abandoning previous implementations that call unmask_irq. To setup gsi and map pirq for passthrough device in pcistub_init_device. * patch#3: Abandoning previous implementations that adds new syscall to get gsi from irq. To add a new sysfs for gsi, then userspace can get gsi number from sysfs. v2 link: https://lore.kernel.org/lkml/20231124103123.3263471-1-jiqian.c...@amd.com/T/#t Below is the description of v2 cover letter: This series of patches are the v2 of the implementation of passthrough when dom0 is PVH on Xen. We sent the v1 to upstream before, but the v1 had so many problems and we got lots of suggestions. I will introduce all issues that these patches try to fix and the differences between v1 and v2. Issues we encountered: 1. pci_stub failed to write bar for a passthrough device. Problem: when we run “sudo xl pci-assignable-add ” to assign a device, pci_stub will call “pcistub_init_device() -> pci_restore_state() -> pci_restore_config_space() -> pci_restore_config_space_range() -> pci_restore_config_dword() -> pci_write_config_dword()”, the pci config write will trigger an io interrupt to bar_write() in the xen, but the bar->enabled was set before, the write is not allowed now, and then when bar->Qemu config the passthrough device in xen_pt_realize(), it gets invalid bar values. Reason: the reason is that we don't tell vPCI that the device has been reset, so the current cached state in pdev->vpci is all out of date and is different from the real device state. Solution: to solve this problem, the first patch of kernel(xen/pci: Add xen_reset_device_state function) and the fist patch of xen(xen/vpci: Clear all vpci status of device) add a new hypercall to reset the state stored in vPCI when the state of real device has changed. Thank Roger for the suggestion of this v2, and it is different from v1 (https://lore.kernel.org/xen-devel/20230312075455.450187-3-ray.hu...@amd.com/), v1 simply allow domU to write pci bar, it does not comply with the design principles of vPCI. 2. failed to do PHYSDEVOP_map_pirq when dom0 is PVH Problem: HVM domU will do PHYSDEVOP_map_pirq for a passthrough device by using gsi. See xen_pt_realize->xc_physdev_map_pirq and pci_add_dm_done->xc_physdev_map_pirq. Then xc_physdev_map_pirq will call into Xen, but in hvm_physdev_op(), PHYSDEVOP_map_pirq is not allowed. Reason: In hvm_physdev_op(), the variable "currd" is PVH dom0 and PVH has no X86_EMU_USE_PIRQ flag, it will fail at has_pirq check. Solution: I think we may need to allow PHYSDEVOP_map_pirq when "currd" is dom0 (at present dom0 is PVH). The second patch of xen(x86/pvh: Open PHYSDEVOP_map_pirq for PVH dom0) allow PVH dom0 do PHYSDEVOP_map_pirq. This v2 patch is better than v1, v1 simply remove the has_pirq check(xen https://lore.kernel.org/xen-devel/20230312075455.450187-4-ray.hu...@amd.com/). 3. the gsi of a passthrough device doesn't be unmasked 3.1 failed to check the permission of pirq 3.2 the gsi of passthrough device was not registered in PVH dom0 Problem: 3.1 callback function pci_add_dm_done() will be called when qemu config a passthrough device for domU. This function will call xc_domain_irq_permission()-> pirq_access_permitted() to check if the gsi has corresponding mappings in dom0. But it didn’t, so failed. See XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH dom0 and it return irq is 0. 3.2 it's possible for a gsi (iow: vIO-APIC pin) to never get registered on PVH dom0, because the devices of PVH are using MSI(-X) interrupts. However, the IO-APIC pin must be configured for it to be able to be mapped into a domU. Reason: After searching codes, I find "map_pirq" and "register_gsi" will be done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the two problems can be concluded to that the gsi of a passthrough device doesn't be unmasked. Solution: to solve these problems, the second patch of kernel(xen/pvh: Unmask irq for passthrough device in PVH dom0) call the unmask_irq() when
[RFC KERNEL PATCH v5 2/3] xen/pvh: Setup gsi for passthrough device
In PVH dom0, the gsis don't get registered, but the gsi of a passthrough device must be configured for it to be able to be mapped into a domU. When assign a device to passthrough, proactively setup the gsi of the device during that process. Co-developed-by: Huang Rui Signed-off-by: Jiqian Chen Reviewed-by: Stefano Stabellini --- RFC: This patch change function acpi_pci_irq_lookup from a static function to non-static, need ACPI Maintainer to give some comments. --- arch/x86/xen/enlighten_pvh.c | 92 ++ drivers/acpi/pci_irq.c | 2 +- drivers/xen/xen-pciback/pci_stub.c | 8 +++ include/linux/acpi.h | 1 + include/xen/acpi.h | 6 ++ 5 files changed, 108 insertions(+), 1 deletion(-) diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c index c28f073c1df5..12be665b27d8 100644 --- a/arch/x86/xen/enlighten_pvh.c +++ b/arch/x86/xen/enlighten_pvh.c @@ -2,6 +2,7 @@ #include #include #include +#include #include @@ -26,6 +27,97 @@ bool __ro_after_init xen_pvh; EXPORT_SYMBOL_GPL(xen_pvh); +typedef struct gsi_info { + int gsi; + int trigger; + int polarity; +} gsi_info_t; + +struct acpi_prt_entry { + struct acpi_pci_id id; + u8 pin; + acpi_handle link; + u32 index; /* GSI, or link _CRS index */ +}; + +static int xen_pvh_get_gsi_info(struct pci_dev *dev, + gsi_info_t *gsi_info) +{ + int gsi; + u8 pin; + struct acpi_prt_entry *entry; + int trigger = ACPI_LEVEL_SENSITIVE; + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ? + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW; + + if (!dev || !gsi_info) + return -EINVAL; + + pin = dev->pin; + if (!pin) + return -EINVAL; + + entry = acpi_pci_irq_lookup(dev, pin); + if (entry) { + if (entry->link) + gsi = acpi_pci_link_allocate_irq(entry->link, +entry->index, +&trigger, &polarity, +NULL); + else + gsi = entry->index; + } else + gsi = -1; + + if (gsi < 0) + return -EINVAL; + + gsi_info->gsi = gsi; + gsi_info->trigger = trigger; + gsi_info->polarity = polarity; + + return 0; +} + +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info) +{ + struct physdev_setup_gsi setup_gsi; + + if (!gsi_info) + return -EINVAL; + + setup_gsi.gsi = gsi_info->gsi; + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1); + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1); + + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi); +} + +int xen_pvh_passthrough_gsi(struct pci_dev *dev) +{ + int ret; + gsi_info_t gsi_info; + + if (!dev) + return -EINVAL; + + ret = xen_pvh_get_gsi_info(dev, &gsi_info); + if (ret) { + xen_raw_printk("Fail to get gsi info!\n"); + return ret; + } + + ret = xen_pvh_setup_gsi(&gsi_info); + if (ret == -EEXIST) { + xen_raw_printk("Already setup the GSI :%d\n", gsi_info.gsi); + ret = 0; + } else if (ret) + xen_raw_printk("Fail to setup GSI (%d)!\n", gsi_info.gsi); + + return ret; +} +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi); + void __init xen_pvh_init(struct boot_params *boot_params) { u32 msr; diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c index ff30ceca2203..630fe0a34bc6 100644 --- a/drivers/acpi/pci_irq.c +++ b/drivers/acpi/pci_irq.c @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev, } #endif /* CONFIG_X86_IO_APIC */ -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin) +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin) { struct acpi_prt_entry *entry = NULL; struct pci_dev *bridge; diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c index 46c40ec8a18e..22d4380d2b04 100644 --- a/drivers/xen/xen-pciback/pci_stub.c +++ b/drivers/xen/xen-pciback/pci_stub.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -435,6 +436,13 @@ static int pcistub_init_device(struct pci_dev *dev) goto config_release; pci_restore_state(dev); } + + if (xen_initial_domain() && xen_pvh_domain()) { + err = xen_pvh_passthrough_gsi(dev); + if (err) + goto config_rele
[RFC KERNEL PATCH v5 3/3] PCI/sysfs: Add gsi sysfs for pci_dev
There is a need for some scenarios to use gsi sysfs. For example, when xen passthrough a device to dumU, it will use gsi to map pirq, but currently userspace can't get gsi number. So, add gsi sysfs for that and for other potential scenarios. Co-developed-by: Huang Rui Signed-off-by: Jiqian Chen --- RFC: No feasible suggestions were obtained in the discussion of v4. Discussions are still needed where/how to expose the gsi. Looking forward to get more comments and suggestions from PCI/ACPI Maintainers. --- drivers/acpi/pci_irq.c | 1 + drivers/pci/pci-sysfs.c | 11 +++ include/linux/pci.h | 2 ++ 3 files changed, 14 insertions(+) diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c index 630fe0a34bc6..739a58755df2 100644 --- a/drivers/acpi/pci_irq.c +++ b/drivers/acpi/pci_irq.c @@ -449,6 +449,7 @@ int acpi_pci_irq_enable(struct pci_dev *dev) kfree(entry); return 0; } + dev->gsi = gsi; rc = acpi_register_gsi(&dev->dev, gsi, triggering, polarity); if (rc < 0) { diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 2321fdfefd7d..c51df88d079e 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -71,6 +71,16 @@ static ssize_t irq_show(struct device *dev, } static DEVICE_ATTR_RO(irq); +static ssize_t gsi_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sysfs_emit(buf, "%u\n", pdev->gsi); +} +static DEVICE_ATTR_RO(gsi); + static ssize_t broken_parity_status_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -596,6 +606,7 @@ static struct attribute *pci_dev_attrs[] = { &dev_attr_revision.attr, &dev_attr_class.attr, &dev_attr_irq.attr, + &dev_attr_gsi.attr, &dev_attr_local_cpus.attr, &dev_attr_local_cpulist.attr, &dev_attr_modalias.attr, diff --git a/include/linux/pci.h b/include/linux/pci.h index 7ab0d13672da..457043cfdfce 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -529,6 +529,8 @@ struct pci_dev { /* These methods index pci_reset_fn_methods[] */ u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */ + + unsigned intgsi; }; static inline struct pci_dev *pci_physfn(struct pci_dev *dev) -- 2.34.1
[KERNEL PATCH v5 1/3] xen/pci: Add xen_reset_device_state function
When device on dom0 side has been reset, the vpci on Xen side won't get notification, so that the cached state in vpci is all out of date with the real device state. To solve that problem, add a new function to clear all vpci device state when device is reset on dom0 side. And call that function in pcistub_init_device. Because when using "pci-assignable-add" to assign a passthrough device in Xen, it will reset passthrough device and the vpci state will out of date, and then device will fail to restore bar state. Co-developed-by: Huang Rui Signed-off-by: Jiqian Chen Reviewed-by: Stefano Stabellini --- drivers/xen/pci.c | 12 drivers/xen/xen-pciback/pci_stub.c | 18 +++--- include/xen/interface/physdev.h| 7 +++ include/xen/pci.h | 6 ++ 4 files changed, 40 insertions(+), 3 deletions(-) diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c index 72d4e3f193af..e9b30bc09139 100644 --- a/drivers/xen/pci.c +++ b/drivers/xen/pci.c @@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev) return r; } +int xen_reset_device_state(const struct pci_dev *dev) +{ + struct physdev_pci_device device = { + .seg = pci_domain_nr(dev->bus), + .bus = dev->bus->number, + .devfn = dev->devfn + }; + + return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device); +} +EXPORT_SYMBOL_GPL(xen_reset_device_state); + static int xen_pci_notifier(struct notifier_block *nb, unsigned long action, void *data) { diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c index e34b623e4b41..46c40ec8a18e 100644 --- a/drivers/xen/xen-pciback/pci_stub.c +++ b/drivers/xen/xen-pciback/pci_stub.c @@ -89,6 +89,16 @@ static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev) return psdev; } +static int pcistub_reset_device_state(struct pci_dev *dev) +{ + __pci_reset_function_locked(dev); + + if (!xen_pv_domain()) + return xen_reset_device_state(dev); + else + return 0; +} + /* Don't call this directly as it's called by pcistub_device_put */ static void pcistub_device_release(struct kref *kref) { @@ -107,7 +117,7 @@ static void pcistub_device_release(struct kref *kref) /* Call the reset function which does not take lock as this * is called from "unbind" which takes a device_lock mutex. */ - __pci_reset_function_locked(dev); + pcistub_reset_device_state(dev); if (dev_data && pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state)) dev_info(&dev->dev, "Could not reload PCI state\n"); @@ -284,7 +294,7 @@ void pcistub_put_pci_dev(struct pci_dev *dev) * (so it's ready for the next domain) */ device_lock_assert(&dev->dev); - __pci_reset_function_locked(dev); + pcistub_reset_device_state(dev); dev_data = pci_get_drvdata(dev); ret = pci_load_saved_state(dev, dev_data->pci_saved_state); @@ -420,7 +430,9 @@ static int pcistub_init_device(struct pci_dev *dev) dev_err(&dev->dev, "Could not store PCI conf saved state!\n"); else { dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n"); - __pci_reset_function_locked(dev); + err = pcistub_reset_device_state(dev); + if (err) + goto config_release; pci_restore_state(dev); } /* Now disable the device (this also ensures some private device diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h index a237af867873..8609770e28f5 100644 --- a/include/xen/interface/physdev.h +++ b/include/xen/interface/physdev.h @@ -256,6 +256,13 @@ struct physdev_pci_device_add { */ #define PHYSDEVOP_prepare_msix 30 #define PHYSDEVOP_release_msix 31 +/* + * Notify the hypervisor that a PCI device has been reset, so that any + * internally cached state is regenerated. Should be called after any + * device reset performed by the hardware domain. + */ +#define PHYSDEVOP_pci_device_state_reset 32 + struct physdev_pci_device { /* IN */ uint16_t seg; diff --git a/include/xen/pci.h b/include/xen/pci.h index b8337cf85fd1..b2e2e856efd6 100644 --- a/include/xen/pci.h +++ b/include/xen/pci.h @@ -4,10 +4,16 @@ #define __XEN_PCI_H__ #if defined(CONFIG_XEN_DOM0) +int xen_reset_device_state(const struct pci_dev *dev); int xen_find_device_domain_owner(struct pci_dev *dev); int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain); int xen_unregister_device_domain_owner(struct pci_dev *dev); #else +static inline int xen_reset_device_state(const struct pci_dev *dev) +{ + return -1; +} + static inline int xen_find_device_domain_owner(struct pci_dev *dev) { return -1; -- 2.34.1
Re: [PATCH] x86/vcpu: relax VCPUOP_initialise restriction for non-PV vCPUs
On 26.03.2024 23:08, Julien Grall wrote: > Hi Andrew, > > On 20/03/2024 14:39, Andrew Cooper wrote: >> On 20/03/2024 2:26 pm, Roger Pau Monné wrote: >>> On Wed, Mar 20, 2024 at 02:06:27PM +, Andrew Cooper wrote: On 20/03/2024 1:57 pm, Roger Pau Monne wrote: > There's no reason to force HVM guests to have a valid vcpu_info area when > initializing a vCPU, as the vCPU can also be brought online using the > local > APIC, and on that path there's no requirement for vcpu_info to be setup > ahead > of the bring up. Note an HVM vCPU can operate normally without making > use of > vcpu_info. > > Restrict the check against dummy_vcpu_info to only apply to PV guests. > > Fixes: 192df6f9122d ('x86: allow HVM guests to use hypercalls to bring up > vCPUs') > Signed-off-by: Roger Pau Monné Thanks for looking into this. But, do we actually need to force this on PV either? >>> Possibly, having now taken a look at the users it does seem they could >>> cope with unpopulated vcpu_info_area. >>> >>> Part of my understanding was that this was some kind of courtesy to PV >>> guests in order to prevent starting them without a vcpu_info, which >>> strictly speaking is not mandatory, but otherwise the guest vCPU won't >>> be able to receive interrupts, not even IPIs. >> >> That's more of a stick than a carrot... "you must set up the area of a >> CPU before you can bring it online". Except as we've seen, HVM has been >> fine all along without it. The only interesting user of dummy_vcpu_info now is vcpu_info_populate() which can cope with any arbitrary vCPU. I have a suspicion that we can (now) delete these two checks, delete the dummy_vcpu_info object, and use a regular NULL pointer in vcpu_info_{populate,reset}(), and in doing so, remove one of the bigger pieces of PV-absurdity from Xen. >>> I was expecting this to be something we can backport. OTOH removing >>> the check completely (or even getting rid of dummy_vcpu_info) is not >>> something that we would want to backport. >>> >>> I would rather do the removal of dummy_vcpu_info as followup work. >> >> I was worried about ARM with your patch, because it's spelt HVM there, >> rather than PV. >> >> However, I'd forgotten that ARM's do_vcpu_op() filters ops down to just >> VCPUOP_register_{vcpu_info,runstate_memory_area} so VCPUOP_initialise >> isn't reachable. >> >> Therefore, Reviewed-by: Andrew Cooper >> >> It also means ARM can't use any of the PHYS registration yet... > > Whoops. I don't think this was intended. Jan, I don't seem to find any > use in Linux. Do you have any patches you could share? No, I don't. I did all development with hacked up XTF tests, and I was expecting Linux folks to be looking into making use of the new subops. Jan > I would like to > give a try on Arm before sending a patch? > > Cheers, >
Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"
On 26.03.2024 22:38, Jason Andryuk wrote: > A new ELF note will specify the alignment for a relocatable PVH kernel. > ELF notes are suitable for vmlinux and other ELF files, so this > Linux-specific bzImage parsing in unnecessary. > > This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40. > > Signed-off-by: Jason Andryuk Since you keep re-sending this: In private discussion Roger has indicated that, like me, he too would prefer falling back to the ELF data, before falling back to the arch default (Roger, please correct me if I got it wrong). That would make it necessary that the change you're proposing to revert here is actually kept. Or wait - what you're reverting is taking the alignment out of the bzImage header. I don't expect the BSDs to use that protocol; aiui that's entirely Linux-specific. I further meanwhile realized that consulting the ELF phdrs may also be ambiguous, as there may be more than one. I guess it would need to be the maximum of all of them then. Jan
Re: [PATCH v5 5/6] xen/elfnote: Specify ELF Notes are x86-specific
On 26.03.2024 22:38, Jason Andryuk wrote: > The Xen ELF Notes are only used with x86. libelf's elf_xen_note_check() > exits early for ARM binaries with "ELF: Not bothering with notes on > ARM". > > Add a comment to the top of elfnote.h specifying that Notes are only used > with x86 binaries. This is to avoid adding disclaimers for individual > notes. > > Signed-off-by: Jason Andryuk Acked-by: Jan Beulich
Re: [PATCH v5 4/6] libelf: Expand ELF note printing
On 26.03.2024 22:38, Jason Andryuk wrote: > @@ -145,13 +150,20 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary > *elf, > elf_msg(elf, "ELF: note: %s = \"%s\"\n", note_desc[type].name, str); > parms->elf_notes[type].type = XEN_ENT_STR; > parms->elf_notes[type].data.str = str; > -} > -else > -{ > +break; > + > +case ELFNOTE_INT: > val = elf_note_numeric(elf, note); > elf_msg(elf, "ELF: note: %s = %#" PRIx64 "\n", note_desc[type].name, > val); > parms->elf_notes[type].type = XEN_ENT_LONG; > parms->elf_notes[type].data.num = val; > +break; > + > +case ELFNOTE_NAME: > +/* ELFNOTE_NAME has a newline printed at the end of the function to > + * optionally allow printing customized details. */ > +elf_msg(elf, "ELF: note: %s", note_desc[type].name); > +break; Well. I said "brief comment" for several reasons. One of them being that it would best fit on a single line. Since now it doesn't, I have to point out that this way comment style is violated. /* NB: Newline emitted further down. */ ? Jan
Re: [PATCH v6 10/20] xen/riscv: introduce atomic.h
On 26.03.2024 20:02, Oleksii wrote: > On Mon, 2024-03-25 at 09:18 +0100, Jan Beulich wrote: >> On 22.03.2024 13:25, Oleksii wrote: >>> On Thu, 2024-03-21 at 14:03 +0100, Jan Beulich wrote: On 15.03.2024 19:06, Oleksii Kurochko wrote: > + */ > +static always_inline void read_atomic_size(const volatile void > *p, > + void *res, > + unsigned int size) > +{ > + switch ( size ) > + { > + case 1: *(uint8_t *)res = readb(p); break; > + case 2: *(uint16_t *)res = readw(p); break; > + case 4: *(uint32_t *)res = readl(p); break; > + case 8: *(uint32_t *)res = readq(p); break; Nit: Excess blank before =. Also - no #ifdef here to be RV32-ready? >>> Because there is #ifdef RV32 in io.h for readq(). >> >> There you'd run into __raw_readq()'s BUILD_BUG_ON(), afaict even for >> 1-, 2-, or 4-byte accesses. That's not quite what we want here. > Do you mean that if someone will redefine readq() in another way and > not wrap it by #ifdef RV32? Except this I am not sure that there is an > issue as it will be still a compilation error, so anyway it will be > needed to provide an implementation for __raw_readq(). No. BUILD_BUG_ON() is a compile-time thing. The compiler will encounter this construct. And hence it will necessarily fail. Which is why the other approach (causing a linker error) is used elsewhere. And we're still only in the course of considering to utilize DCE for something like STATIC_ASSERT_UNREACHABLE(); iirc there was something getting in the way there. > One of the reason why I decided to wrap with #ifdef RV32 in io.h to not > go over the source code and add wrapping. Also for some code it will be > needed to rewrite it. For example, I am not sure that I can add #ifdef > inside macros, f.e.: >#define write_atomic(p, x) \ >({ \ >typeof(*(p)) x__ = (x); \ >switch ( sizeof(*(p)) ) \ >{ \ >case 1: writeb(x__, p); break; \ >case 2: writew(x__, p); break; \ >case 4: writel(x__, p); break; \ >case 8: writeq(x__, p); break; \ >default: __bad_atomic_size(); break;\ >} \ >x__;\ >}) You can't add #ifdef there. Such needs abstracting differently. But of course there's the option of simply not making any of these constructs RV32-ready. Yet if so, that then will want doing consistently. > +/* > + * First, the atomic ops that have no ordering constraints and > therefor don't > + * have the AQ or RL bits set. These don't return anything, > so > there's only > + * one version to worry about. > + */ > +#define ATOMIC_OP(op, asm_op, I, asm_type, c_type, prefix) \ > +static inline \ > +void atomic##prefix##_##op(c_type i, atomic##prefix##_t *v) \ > +{ \ > + asm volatile ( \ > + " amo" #asm_op "." #asm_type " zero, %1, %0" \ > + : "+A" (v->counter) \ > + : "r" (I) \ Btw, I consider this pretty confusing. At the 1st and 2nd glance this looks like a mistake, i.e. as if i was meant. Imo ... > + : "memory" ); \ > +} \ > + > +/* > + * Only CONFIG_GENERIC_ATOMIC64=y was ported to Xen that is > the > reason why > + * last argument for ATOMIC_OP isn't used. > + */ > +#define ATOMIC_OPS(op, asm_op, I) \ > + ATOMIC_OP (op, asm_op, I, w, int, ) > + > +ATOMIC_OPS(add, add, i) > +ATOMIC_OPS(sub, add, -i) > +ATOMIC_OPS(and, and, i) > +ATOMIC_OPS( or, or, i) > +ATOMIC_OPS(xor, xor, i) ... here you want to only pass the (unary) operator (and leaving that blank is as fine as using +). >>> I agree that a game with 'i' and 'I' looks confusing, but I am not >>> really understand what is wrong with using ' i' here. It seems that >>> preprocessed macros looks fine: >>> static inline void atomic_add(int i, atomic_t *v) { asm volatile >>> ( " >>> amo" "add" "." "w" " zero, %1, %0" : "+A" (v->counter) : "r" (i) >>> : >>> "memory" ); } >>> >>> static inline void atomic_sub(int i, atomic_t *v) { asm volatile >>> ( " >>> amo
Re: [XEN PATCH 11/11] x86/public: hvm: address violations of MISRA C Rule 20.7
On 22.03.2024 17:02, Nicola Vetrini wrote: > MISRA C Rule 20.7 states: "Expressions resulting from the expansion > of macro parameters shall be enclosed in parentheses". Therefore, some > macro definitions should gain additional parentheses to ensure that all > current and future users will be safe with respect to expansions that > can possibly alter the semantics of the passed-in macro parameter. > > No functional change. > > Signed-off-by: Nicola Vetrini Btw, I've taken the liberty to drop the (secondary) hvm: prefix from the subject; I can't ... > --- a/xen/include/public/arch-x86/xen.h > +++ b/xen/include/public/arch-x86/xen.h > @@ -36,7 +36,7 @@ > #define __XEN_GUEST_HANDLE(name)__guest_handle_ ## name > #define XEN_GUEST_HANDLE(name) __XEN_GUEST_HANDLE(name) > #define XEN_GUEST_HANDLE_PARAM(name)XEN_GUEST_HANDLE(name) > -#define set_xen_guest_handle_raw(hnd, val) do { (hnd).p = val; } while (0) > +#define set_xen_guest_handle_raw(hnd, val) do { (hnd).p = (val); } while (0) > #define set_xen_guest_handle(hnd, val) set_xen_guest_handle_raw(hnd, val) ... spot anything HVM-specific in this change. Jan
Re: [PATCH v5] RFC: x86/pvh: Make Xen PVH entrypoint PIC for x86-64
On 26.03.2024 22:47, Jason Andryuk wrote: > --- a/include/xen/interface/elfnote.h > +++ b/include/xen/interface/elfnote.h > @@ -185,9 +185,25 @@ > */ > #define XEN_ELFNOTE_PHYS32_ENTRY 18 > > +/* > + * Physical loading constraints for PVH kernels > + * > + * Used to place constraints on the guest physical loading addresses and > + * alignment for a PVH kernel. > + * > + * The presence of this note indicates the kernel supports relocating itself. > + * > + * The note may include up to three 32bit values in the following order: > + * - a maximum address for the entire image to be loaded below (default > + * 0x) > + * - a minimum address for the start of the image (default 0) > + * - a required start alignment (default 0x20) This looks to be stale now. Jan
Re: [PATCH 11/11] xen/arm: List static shared memory regions as /memory nodes
Hi Luca, On 26/03/2024 15:19, Luca Fancellu wrote: > > >> On 25 Mar 2024, at 08:58, Michal Orzel wrote: >> >> Hi Luca, >> > > Hi Michal, > >> On 12/03/2024 14:03, Luca Fancellu wrote: >>> >>> >>> Currently Xen is not exporting the static shared memory regions >>> to the device tree as /memory node, this commit is fixing this >>> issue. >> Looking at the implementation, you will always call make_shm_memory_node() >> twice. For the first >> time, to create /memory node and for the second time to create entry under >> /reserved-memory. Also, >> you will create a separate /memory node for every single shmem region >> instead of combining them >> in a single /memory region like make_memory_node() would do. Can't we reuse >> this function for simplicity? > > You mean using make_memory_node() to populate /reserved-memory and /memory? I > feel they are very different > In terms of properties to be created, so I don’t think we should create a > make_memory_node() that does both. > > Otherwise if you were suggesting to modify make_memory_node() only for what > concerns the /memory nodes, yes, this is what I meant > it might be feasible, however there are some parts that need to be skipped > with some flags (all the code accessing .type > member), if I understand correctly you like this function because it doesn’t > create one node for every bank, but it creates > reg addresses instead, in that case I could modify the current > make_shm_memory_node() to do the same. My concern is that we will have 2 functions to create memory nodes instead of one that can be reused. I know the issue is with .type member. If skipping results in a worse code, then I'm ok with make_shm_memory_node() used for 2 purposes (would it be possible to create /memory and entry under /reserved in the same call to a function?). > >> >> Also, afaict it is not forbidden to specify shmem range (correct me if I'm >> wrong), where guest address will be >> within with RAM allocated by Xen (e.g. GPA RAM range 0x4000 - 0x6000 >> and shmem is at 0x5000). In this case, >> you would create yet another /memory node that would result in overlap (i.e. >> more than one /memory node specifying the same range). > > This is a good point I didn’t think about, yes currently the code is creating > overlapping nodes in that case, wow so it means I > need to compute the non overlapping regions and emit a /memory node then! :) > ouch > > Please let me know if I understood correctly your comments. > > Cheers, > Luca > >> >> ~Michal > ~Michal
Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"
On Wed, Mar 27, 2024 at 08:22:41AM +0100, Jan Beulich wrote: > On 26.03.2024 22:38, Jason Andryuk wrote: > > A new ELF note will specify the alignment for a relocatable PVH kernel. > > ELF notes are suitable for vmlinux and other ELF files, so this > > Linux-specific bzImage parsing in unnecessary. > > > > This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40. > > > > Signed-off-by: Jason Andryuk > > Since you keep re-sending this: In private discussion Roger has indicated > that, like me, he too would prefer falling back to the ELF data, before > falling back to the arch default (Roger, please correct me if I got it > wrong). That would make it necessary that the change you're proposing to > revert here is actually kept. Sorry, was meaning to reply yesterday but Jason is very fast at sending new version so I'm always one version behind. IMO the order: ELF note, PHDR alignment, arch default should be the preferred one. > Or wait - what you're reverting is taking the alignment out of the > bzImage header. I don't expect the BSDs to use that protocol; aiui that's > entirely Linux-specific. Yeah, I don't have strong opinions in keeping this, we already do bzImage parsing, so we might as well attempt to fetch the alignment from there if correct: ELF note, bzImage kernel_alignment, ELF PHDR alignment, arch default > I further meanwhile realized that consulting the ELF phdrs may also be > ambiguous, as there may be more than one. I guess it would need to be the > maximum of all of them then. My suggestion (not sure if I mentioned this before) was to use the alignment of the first LOAD PHDR, which is the one that defines the value of the dest_base field used as the image load start address. Using the maximum of all load PHDRs might be safer. Thanks, Roger.
Xen Summit Early Bird Rates - Ends 31st March 2024!
Hey everyone, We've just announced our schedule for Xen Summit 2024 and can't wait to see you all. *Make sure to grab your early-rate tickets today, these end on 31st March 2024! * Academics can also attend the event for free. Tickets: https://events.linuxfoundation.org/xen-project-summit/register/ Schedule: https://events.linuxfoundation.org/xen-project-summit/program/schedule/ See you all there! Many thanks, Kelly Choi Community Manager Xen Project
Re: [PATCH v6 10/20] xen/riscv: introduce atomic.h
On Wed, 2024-03-27 at 08:40 +0100, Jan Beulich wrote: ... > > > > > > +/* This is required to provide a full barrier on success. > > > > > > */ > > > > > > +static inline int atomic_add_unless(atomic_t *v, int a, > > > > > > int u) > > > > > > +{ > > > > > > + int prev, rc; > > > > > > + > > > > > > + asm volatile ( > > > > > > + "0: lr.w %[p], %[c]\n" > > > > > > + " beq %[p], %[u], 1f\n" > > > > > > + " add %[rc], %[p], %[a]\n" > > > > > > + " sc.w.rl %[rc], %[rc], %[c]\n" > > > > > > + " bnez %[rc], 0b\n" > > > > > > + RISCV_FULL_BARRIER > > > > > > > > > > With this and no .aq on the load, why the .rl on the store? > > > > It is something that LKMM requires [1]. > > > > > > > > This is not fully clear to me what is so specific in LKMM, but > > > > accoring > > > > to the spec: > > > > Ordering Annotation Fence-based Equivalent > > > > l{b|h|w|d|r}.aq l{b|h|w|d|r}; fence r,rw > > > > l{b|h|w|d|r}.aqrl fence rw,rw; l{b|h|w|d|r}; fence r,rw > > > > s{b|h|w|d|c}.rl fence rw,w; s{b|h|w|d|c} > > > > s{b|h|w|d|c}.aqrl fence rw,w; s{b|h|w|d|c} > > > > amo.aq amo; fence r,rw > > > > amo.rl fence rw,w; amo > > > > amo.aqrl fence rw,rw; amo; fence rw,rw > > > > Table 2.2: Mappings from .aq and/or .rl to fence-based > > > > equivalents. > > > > An alternative mapping places a fence rw,rw after the > > > > existing > > > > s{b|h|w|d|c} mapping rather than at the front of the > > > > l{b|h|w|d|r} mapping. > > > > > > I'm afraid I can't spot the specific case in this table. None of > > > the > > > stores in the right column have a .rl suffix. > > Yes, it is expected. > > > > I am reading this table as (f.e.) amo.rl is an equivalent of > > fence > > rw,w; amo. (without .rl) > > In which case: How does quoting the table answer my original > question? Agree, it is starting to be confusing, so let me give an answer to your question below. > > > > > It is also safe to translate any .aq, .rl, or .aqrl > > > > annotation > > > > into > > > > the fence-based snippets of > > > > Table 2.2. These can also be used as a legal implementation > > > > of > > > > l{b|h|w|d} or s{b|h|w|d} pseu- > > > > doinstructions for as long as those instructions are not > > > > added > > > > to > > > > the ISA. > > > > > > > > So according to the spec, it should be: > > > > sc.w ... > > > > RISCV_FULL_BARRIER. > > > > > > > > Considering [1] and how this code looks before, it seems to me > > > > that > > > > it > > > > is safe to use lr.w.aq/sc.w.rl here or an fence equivalent. > > > > > > Here you say "or". Then why dos the code use sc.?.rl _and_ a > > > fence? > > I confused this line with amo.aqrl, so based on the table 2.2 > > above > > s{b|h|w|d|c}.aqrl is transformed to "fence rw,w; s{b|h|w|d|c}", but > > Linux kernel decided to strengthen it with full barrier: > > - "0:\n\t" > > - "lr.w.aqrl %[p], %[c]\n\t" > > - "beq %[p], %[u], 1f\n\t" > > - "add %[rc], %[p], %[a]\n\t" > > - "sc.w.aqrl %[rc], %[rc], %[c]\n\t" > > - "bnez %[rc], 0b\n\t" > > - "1:" > > + "0: lr.w %[p], %[c]\n" > > + " beq %[p], %[u], 1f\n" > > + " add %[rc], %[p], %[a]\n" > > + " sc.w.rl %[rc], %[rc], %[c]\n" > > + " bnez %[rc], 0b\n" > > + " fence rw, rw\n" > > + "1:\n" > > As they have the following issue: > > implementations of atomics such as atomic_cmpxchg() and > > atomic_add_unless() rely on LR/SC pairs, > > which do not give full-ordering with .aqrl; for example, current > > implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test > > below to end up with the state indicated in the "exists" clause. > > Is that really "current implementations", not "the abstract model"? > If so, the use of an explicit fence would be more like a workaround > (and would hence want commenting to that effect). > > In neither case can I see my original question answered: Why the .rl > on the store when there is a full fence later? The good explanation for that was provided in the commit addressing a similar issue for ARM64 [ https://patchwork.kernel.org/project/linux-arm-kernel/patch/1391516953-14541-1-git-send-email-will.dea...@arm.com/ ]. The same holds true for RISC-V since ARM also employs WMO. I would also like to mention another point, as I indicated in another email thread [ https://lists.xen.org/archives/html/xen-devel/2024-03/msg01582.html ] , that now this fence can be omitted and .aqrl can be used instead. This was confirmed by Dan (the author of the RVWMO spec) [https://lore.kernel.org/linux-riscv/41e01514-74ca-84f2-f5cc-2645c444f...@nvidia.com/ ] I hope this
Re: xen | Failed pipeline for staging | e3883336
On 27.03.2024 10:58, GitLab wrote: > > > Pipeline #1229415063 has failed! > > Project: xen ( https://gitlab.com/xen-project/hardware/xen ) > Branch: staging ( > https://gitlab.com/xen-project/hardware/xen/-/commits/staging ) > > Commit: e3883336 ( > https://gitlab.com/xen-project/hardware/xen/-/commit/e3883336bb5abba2ec2231618f5b64f61b099b1e > ) > Commit Message: xen/efi: efibind: address violations of MISRA C... > Commit Author: Nicola Vetrini > Committed by: Jan Beulich ( https://gitlab.com/jbeulich ) > > > Pipeline #1229415063 ( > https://gitlab.com/xen-project/hardware/xen/-/pipelines/1229415063 ) > triggered by Jan Beulich ( https://gitlab.com/jbeulich ) > had 1 failed job. > > Job #6487283739 ( > https://gitlab.com/xen-project/hardware/xen/-/jobs/6487283739/raw ) > > Stage: test > Name: qemu-alpine-x86_64-gcc qemu-system-x86_64: terminating on signal 15 from pid 52 (timeout) I have to admit I have no idea what this is supposed to be telling me. Jan
Re: [PATCH v6 10/20] xen/riscv: introduce atomic.h
On 27.03.2024 11:28, Oleksii wrote: > On Wed, 2024-03-27 at 08:40 +0100, Jan Beulich wrote: > ... > >>> +/* This is required to provide a full barrier on success. >>> */ >>> +static inline int atomic_add_unless(atomic_t *v, int a, >>> int u) >>> +{ >>> + int prev, rc; >>> + >>> + asm volatile ( >>> + "0: lr.w %[p], %[c]\n" >>> + " beq %[p], %[u], 1f\n" >>> + " add %[rc], %[p], %[a]\n" >>> + " sc.w.rl %[rc], %[rc], %[c]\n" >>> + " bnez %[rc], 0b\n" >>> + RISCV_FULL_BARRIER >> >> With this and no .aq on the load, why the .rl on the store? > It is something that LKMM requires [1]. > > This is not fully clear to me what is so specific in LKMM, but > accoring > to the spec: > Ordering Annotation Fence-based Equivalent > l{b|h|w|d|r}.aq l{b|h|w|d|r}; fence r,rw > l{b|h|w|d|r}.aqrl fence rw,rw; l{b|h|w|d|r}; fence r,rw > s{b|h|w|d|c}.rl fence rw,w; s{b|h|w|d|c} > s{b|h|w|d|c}.aqrl fence rw,w; s{b|h|w|d|c} > amo.aq amo; fence r,rw > amo.rl fence rw,w; amo > amo.aqrl fence rw,rw; amo; fence rw,rw > Table 2.2: Mappings from .aq and/or .rl to fence-based > equivalents. > An alternative mapping places a fence rw,rw after the > existing > s{b|h|w|d|c} mapping rather than at the front of the > l{b|h|w|d|r} mapping. I'm afraid I can't spot the specific case in this table. None of the stores in the right column have a .rl suffix. >>> Yes, it is expected. >>> >>> I am reading this table as (f.e.) amo.rl is an equivalent of >>> fence >>> rw,w; amo. (without .rl) >> >> In which case: How does quoting the table answer my original >> question? > Agree, it is starting to be confusing, so let me give an answer to your > question below. > >> > It is also safe to translate any .aq, .rl, or .aqrl > annotation > into > the fence-based snippets of > Table 2.2. These can also be used as a legal implementation > of > l{b|h|w|d} or s{b|h|w|d} pseu- > doinstructions for as long as those instructions are not > added > to > the ISA. > > So according to the spec, it should be: > sc.w ... > RISCV_FULL_BARRIER. > > Considering [1] and how this code looks before, it seems to me > that > it > is safe to use lr.w.aq/sc.w.rl here or an fence equivalent. Here you say "or". Then why dos the code use sc.?.rl _and_ a fence? >>> I confused this line with amo.aqrl, so based on the table 2.2 >>> above >>> s{b|h|w|d|c}.aqrl is transformed to "fence rw,w; s{b|h|w|d|c}", but >>> Linux kernel decided to strengthen it with full barrier: >>> - "0:\n\t" >>> - "lr.w.aqrl %[p], %[c]\n\t" >>> - "beq %[p], %[u], 1f\n\t" >>> - "add %[rc], %[p], %[a]\n\t" >>> - "sc.w.aqrl %[rc], %[rc], %[c]\n\t" >>> - "bnez %[rc], 0b\n\t" >>> - "1:" >>> + "0: lr.w %[p], %[c]\n" >>> + " beq %[p], %[u], 1f\n" >>> + " add %[rc], %[p], %[a]\n" >>> + " sc.w.rl %[rc], %[rc], %[c]\n" >>> + " bnez %[rc], 0b\n" >>> + " fence rw, rw\n" >>> + "1:\n" >>> As they have the following issue: >>> implementations of atomics such as atomic_cmpxchg() and >>> atomic_add_unless() rely on LR/SC pairs, >>> which do not give full-ordering with .aqrl; for example, current >>> implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test >>> below to end up with the state indicated in the "exists" clause. >> >> Is that really "current implementations", not "the abstract model"? >> If so, the use of an explicit fence would be more like a workaround >> (and would hence want commenting to that effect). >> >> In neither case can I see my original question answered: Why the .rl >> on the store when there is a full fence later? > The good explanation for that was provided in the commit addressing a > similar issue for ARM64 [ > https://patchwork.kernel.org/project/linux-arm-kernel/patch/1391516953-14541-1-git-send-email-will.dea...@arm.com/ > ]. > The same holds true for RISC-V since ARM also employs WMO. > > I would also like to mention another point, as I indicated in another > email thread > [ https://lists.xen.org/archives/html/xen-devel/2024-03/msg01582.html ] > , that now this fence can be omitted and .aqrl can be used instead. > > This was confirmed by Dan (the author of the RVWMO spec) > [https://lore.kernel.org/linux-riscv/41e01514-74ca-84f2-f5cc-2645c444f...@nvidia.com/ > ] > > I hope this addresses your original question. Does it? I think it does, thanks
Re: [PATCH v7 08/14] xen/page_alloc: introduce preserved page flags macro
Hi guys, > Question is: How would you justify such a change? IOW I'm not convinced > (yet) this wants doing there. You mean in this series? > Looking at the code, the flag is originally set in > alloc_domheap_pages(). So I guess it would make sense to do it in > free_domheap_pages(). We don't hold the heap_lock there. Is it safe to change count_info without it? Thanks. > For PGC_static, I don't think we can reach free_domheap_pages() with the > flag set (the pages should live in a separate pool). So I believe there > is nothing to do for them. > > Cheers, > > -- > Julien Grall
Re: xen | Failed pipeline for staging | e3883336
On 27/03/2024 10:40 am, Jan Beulich wrote: > On 27.03.2024 10:58, GitLab wrote: >> >> Pipeline #1229415063 has failed! >> >> Project: xen ( https://gitlab.com/xen-project/hardware/xen ) >> Branch: staging ( >> https://gitlab.com/xen-project/hardware/xen/-/commits/staging ) >> >> Commit: e3883336 ( >> https://gitlab.com/xen-project/hardware/xen/-/commit/e3883336bb5abba2ec2231618f5b64f61b099b1e >> ) >> Commit Message: xen/efi: efibind: address violations of MISRA C... >> Commit Author: Nicola Vetrini >> Committed by: Jan Beulich ( https://gitlab.com/jbeulich ) >> >> >> Pipeline #1229415063 ( >> https://gitlab.com/xen-project/hardware/xen/-/pipelines/1229415063 ) >> triggered by Jan Beulich ( https://gitlab.com/jbeulich ) >> had 1 failed job. >> >> Job #6487283739 ( >> https://gitlab.com/xen-project/hardware/xen/-/jobs/6487283739/raw ) >> >> Stage: test >> Name: qemu-alpine-x86_64-gcc > qemu-system-x86_64: terminating on signal 15 from pid 52 (timeout) > > I have to admit I have no idea what this is supposed to be telling me. In these smoke test, Qemu is run with a timeout in case things hang/crash. This is saying Qemu running Xen+Dom0 didn't appear to get to the login prompt within the timeout that the test specifies. As to why, that's less clear. It appears that it was making progress, just slowly. ~Andrew
Re: Xen NIC driver have page_pool memory leaks
On 25/03/2024 13.33, Paul Durrant wrote: On 25/03/2024 12:21, Jesper Dangaard Brouer wrote: Hi Arthur, (Answer inlined below, which is custom on this mailing list) On 23/03/2024 14.23, Arthur Borsboom wrote: Hi Jesper, After a recent kernel upgrade 6.7.6 > 6.8.1 all my Xen guests on Arch Linux are dumping kernel traces. It seems to be indirectly caused by the page pool memory leak mechanism, which is probably a good thing. I have created a bug report, but there is no response. https://bugzilla.kernel.org/show_bug.cgi?id=218618 I am uncertain where and to whom I need to report this page leak. Can you help me get this issue fixed? I'm the page_pool maintainer, but as you say yourself in comment 2 then since dba1b8a7ab68 ("mm/page_pool: catch page_pool memory leaks") this indicated there is a problem in the xen_netfront driver, which was previously not visible. Cc'ing the "XEN NETWORK BACKEND DRIVER" maintainers, as this is a driver bug. What confuses me it that I cannot find any modules named "xen_netfront" in the upstream tree. You should have tried '-' rather than '_' :-) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/xen-netfront.c Looking at this driver, I think it is missing a call to skb_mark_for_recycle(). I'll will submit at patch for this, with details for stable maintainers. As I think it dates back to v5.9 via commit 6c5aa6fc4def ("xen networking: add basic XDP support for xen-netfront"). I think this commit is missing a call to page_pool_release_page() between v5.9 to v5.14, after which is should have used skb_mark_for_recycle(). Since v6.6 the call page_pool_release_page() were removed (in 535b9c61bdef ("net: page_pool: hide page_pool_release_page()") and remaining callers converted (in commit 6bfef2ec0172 ("Merge branch 'net-page_pool-remove-page_pool_release_page'")). This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool: catch page_pool memory leaks"). --Jesper
Re: Xen NIC driver have page_pool memory leaks
On 27/03/2024 11:27 am, Jesper Dangaard Brouer wrote: > > > On 25/03/2024 13.33, Paul Durrant wrote: >> On 25/03/2024 12:21, Jesper Dangaard Brouer wrote: >>> Hi Arthur, >>> >>> (Answer inlined below, which is custom on this mailing list) >>> >>> On 23/03/2024 14.23, Arthur Borsboom wrote: Hi Jesper, After a recent kernel upgrade 6.7.6 > 6.8.1 all my Xen guests on Arch Linux are dumping kernel traces. It seems to be indirectly caused by the page pool memory leak mechanism, which is probably a good thing. I have created a bug report, but there is no response. https://bugzilla.kernel.org/show_bug.cgi?id=218618 I am uncertain where and to whom I need to report this page leak. Can you help me get this issue fixed? >>> >>> I'm the page_pool maintainer, but as you say yourself in comment 2 then >>> since dba1b8a7ab68 ("mm/page_pool: catch page_pool memory leaks") this >>> indicated there is a problem in the xen_netfront driver, which was >>> previously not visible. >>> >>> Cc'ing the "XEN NETWORK BACKEND DRIVER" maintainers, as this is a >>> driver >>> bug. What confuses me it that I cannot find any modules named >>> "xen_netfront" in the upstream tree. >>> >> >> You should have tried '-' rather than '_' :-) >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/xen-netfront.c >> >> > > Looking at this driver, I think it is missing a call to > skb_mark_for_recycle(). > > I'll will submit at patch for this, with details for stable maintainers. > > As I think it dates back to v5.9 via commit 6c5aa6fc4def ("xen > networking: add basic XDP support for xen-netfront"). I think this > commit is missing a call to page_pool_release_page() > between v5.9 to v5.14, after which is should have used > skb_mark_for_recycle(). > > Since v6.6 the call page_pool_release_page() were removed (in > 535b9c61bdef ("net: page_pool: hide page_pool_release_page()") and > remaining callers converted (in commit 6bfef2ec0172 ("Merge branch > 'net-page_pool-remove-page_pool_release_page'")). > > This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool: > catch page_pool memory leaks"). Thankyou very much for your help here. Please CC secur...@xenproject.org too, because we'll want to issue an XSA (Xen Security Advisory) when this fix is ready. ~Andrew
Re: Backport request for 4.17
On 25.03.2024 18:02, Anthony PERARD wrote: > Would it be possible to backport 18a36b4a9b08 ("tools: ipxe: update for > fixing build with GCC12") to Xen 4.17 ? Sure, now done. Jan > This would be to allow building Xen 4.17 on Debian Bookworm, and to allow > osstest to test Xen 4.18 with Debian Bookworm. osstest always tries to > migration from N-1 to N, so it would need to be able to build both 4.17 > and 4.18 to actually test 4.18. Otherwise, I can tell osstest to keep > using Debian Buster to test 4.18. > > For context: > > https://lore.kernel.org/xen-devel/20240318165545.3898-36-anthony.per...@citrix.com/ > > That commit pulls a newer version of IPXE, I don't think there's be > compatibility issue with Xen, and hopefully nothing breaks. > > Thanks, >
[xen-unstable-smoke test] 185170: tolerable all pass - PUSHED
flight 185170 xen-unstable-smoke real [real] http://logs.test-lab.xenproject.org/osstest/logs/185170/ Failures :-/ but no regressions. Tests which did not succeed, but are not blocking: test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass version targeted for testing: xen e3883336bb5abba2ec2231618f5b64f61b099b1e baseline version: xen 6f6de10ade5ade907f9e3f3c72b7b18f7852d9ff Last test of basis 185157 2024-03-25 10:03:48 Z2 days Testing same since 185170 2024-03-27 09:03:56 Z0 days1 attempts People who touched revisions under test: Jan Beulich Jason Andryuk Nicola Vetrini jobs: build-arm64-xsm pass build-amd64 pass build-armhf pass build-amd64-libvirt pass test-armhf-armhf-xl pass test-arm64-arm64-xl-xsm pass test-amd64-amd64-xl-qemuu-debianhvm-amd64pass test-amd64-amd64-libvirt pass sg-report-flight on osstest.test-lab.xenproject.org logs: /home/logs/logs images: /home/logs/images Logs, config files, etc. are available at http://logs.test-lab.xenproject.org/osstest/logs Explanation of these reports, and of osstest in general, is at http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master Test harness code can be found at http://xenbits.xen.org/gitweb?p=osstest.git;a=summary Pushing revision : To xenbits.xen.org:/home/xen/git/xen.git 6f6de10ade..e3883336bb e3883336bb5abba2ec2231618f5b64f61b099b1e -> smoke
Re: [PATCH v7 04/14] xen/arm: add Dom0 cache coloring support
Hi guys, On Fri, Mar 22, 2024 at 8:26 AM Jan Beulich wrote: > > On 21.03.2024 18:31, Carlo Nonato wrote: > > On Thu, Mar 21, 2024 at 4:57 PM Jan Beulich wrote: > >> > >> On 21.03.2024 16:04, Carlo Nonato wrote: > >>> On Tue, Mar 19, 2024 at 4:30 PM Jan Beulich wrote: > On 15.03.2024 11:58, Carlo Nonato wrote: > > --- a/docs/misc/xen-command-line.pandoc > > +++ b/docs/misc/xen-command-line.pandoc > > @@ -963,6 +963,15 @@ Controls for the dom0 IOMMU setup. > > > > Specify a list of IO ports to be excluded from dom0 access. > > > > +### dom0-llc-colors > > +> `= List of [ | - ]` > > + > > +> Default: `All available LLC colors` > > + > > +Specify dom0 LLC color configuration. This option is available only > > when > > +`CONFIG_LLC_COLORING` is enabled. If the parameter is not set, all > > available > > +colors are used. > > My reservation towards this being a top-level option remains. > >>> > >>> How can I turn this into a lower-level option? Moving it into "dom0=" > >>> doesn't > >>> seem possible to me. How can I express a list (llc-colors) inside another > >>> list > >>> (dom0)? dom0=llc-colors=0-3,12-15,other-param=... How can I stop parsing > >>> before reaching other-param? > >> > >> For example by using a different separator: > >> > >> dom0=llc-colors=0-3+12-15,other-param=... > > > > Ok, but that would mean to change the implementation of the parsing function > > and to adopt this syntax also in other places, something that I would've > > preferred to avoid. Anyway I'll follow your suggestion. > > Well, this is all used by Arm only for now. You will want to make sure Arm > folks are actually okay with this alternative approach. > > Jan Are you Arm maintainers ok with this? Thanks.
Re: Linux Xen PV CPA W^X violation false-positives
On 24.01.24 17:54, Jason Andryuk wrote: Xen PV domains show CPA W^X violations like: CPA detected W^X violation: 0064 -> 0067 range: 0x8881 - 0x88810fff PFN 10 WARNING: CPU: 0 PID: 30 at arch/x86/mm/pat/set_memory.c:613 __change_page_attr_set_clr+0x113a/0x11c0 Modules linked in: xt_physdev xt_MASQUERADE iptable_nat nf_nat nf_conntrack libcrc32c nf_defrag_ipv4 ip_tables x_tables xen_argo(O) CPU: 0 PID: 30 Comm: kworker/0:2 Tainted: G O 6.1.38 #1 Workqueue: events bpf_prog_free_deferred RIP: e030:__change_page_attr_set_clr+0x113a/0x11c0 Code: 4c 89 f1 4c 89 e2 4c 89 d6 4c 89 8d 70 ff ff ff 4d 8d 86 ff 0f 00 00 48 c7 c7 f0 3c da 81 c6 05 d0 0e 0e 01 01 e8 f6 71 00 00 <0f> 0b 4c 8b 8d 70 ff ff ff e9 2a fd ff ff 48 8b 85 60 ff ff ff 48 RSP: e02b:c9367c48 EFLAGS: 00010282 RAX: RBX: 000ef064 RCX: RDX: 0003 RSI: f7ff RDI: RBP: c9367d48 R08: R09: c9367aa0 R10: 0001 R11: 0001 R12: 0067 R13: 0001 R14: 8881 R15: c9367d60 FS: () GS:88800b80() knlGS: CS: e030 DS: ES: CR0: 80050033 CR2: 7fdbaeda01c0 CR3: 04312000 CR4: 00050660 Call Trace: ? show_regs.cold+0x1a/0x1f ? __change_page_attr_set_clr+0x113a/0x11c0 ? __warn+0x7b/0xc0 ? __change_page_attr_set_clr+0x113a/0x11c0 ? report_bug+0x111/0x1a0 ? handle_bug+0x4d/0xa0 ? exc_invalid_op+0x19/0x70 ? asm_exc_invalid_op+0x1b/0x20 ? __change_page_attr_set_clr+0x113a/0x11c0 ? __change_page_attr_set_clr+0x113a/0x11c0 ? debug_smp_processor_id+0x17/0x20 ? ___cache_free+0x2e/0x1e0 ? _raw_spin_unlock+0x1e/0x40 ? __purge_vmap_area_lazy+0x2ea/0x6b0 set_direct_map_default_noflush+0x7c/0xa0 __vunmap+0x1ac/0x280 __vfree+0x1d/0x60 vfree+0x27/0x40 __bpf_prog_free+0x44/0x50 bpf_prog_free_deferred+0x104/0x120 process_one_work+0x1ca/0x3d0 ? process_one_work+0x3d0/0x3d0 worker_thread+0x45/0x3c0 ? process_one_work+0x3d0/0x3d0 kthread+0xe2/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 ---[ end trace ]--- Xen provides a set of page tables that the guest executes out of when it starts. The L1 entries are shared between level2_ident_pgt and level2_kernel_pgt, and xen_setup_kernel_pagetable() sets the NX bit in the level2_ident_pgt entries. verify_rwx() only checks the l1 entry and reports a false-positive violation. Here is a dump of some kernel virtual addresses and the corresponding L1 and L2 entries: This is the start of the directmap (ident) and they have NX (bit 63) set in the PMD. ndvm-pv (1): [0.466778] va=8880 pte=00100027 level: 1 ndvm-pv (1): [0.466788] va=8880 pmd=8242c067 level: 2 Directmap for kernel text: ndvm-pv (1): [0.466795] va=88800100 pte=00100165 level: 1 ndvm-pv (1): [0.466801] va=88800100 pmd=82434067 level: 2 ndvm-pv (1): [0.466807] va=88800101 pte=001001010065 level: 1 ndvm-pv (1): [0.466814] va=88800101 pmd=82434067 level: 2 The start of the kernel text highmap is unmapped: ndvm-pv (1): [0.466820] va=8000 pte= level: 3 ndvm-pv (1): [0.466826] va=8000 pmd= level: 3 Kernel PMD for .text has NX bit clear ndvm-pv (1): [0.466832] va=8100 pte=00100165 level: 1 ndvm-pv (1): [0.466838] va=8100 pmd=02434067 level: 2 Kernel PTE for rodata_end has NX bit set ndvm-pv (1): [0.466846] va=81e62000 pte=801001e62025 level: 1 ndvm-pv (1): [0.466874] va=81e62000 pmd=0243b067 level: 2 Directmap of rodata_end ndvm-pv (1): [0.466907] va=888001e62000 pte=801001e62025 level: 1 ndvm-pv (1): [0.466913] va=888001e62000 pmd=8243b067 level: 2 Directmap of a low RAM address ndvm-pv (1): [0.466920] va=8881 pte=00110027 level: 1 ndvm-pv (1): [0.466926] va=8881 pmd=8242c067 level: 2 Directmap of another RAM address close to but below kernel text ndvm-pv (1): [0.466932] va=88800096c000 pte=00100096c027 level: 1 ndvm-pv (1): [0.466938] va=88800096c000 pmd=82430067 level: 2 Here are some L2 entries showing the differing NX bits for l2_ident vs. l2_kernel while they point at the same L1 addresses ndvm-pv (1): [0.466944] l2_ident[ 0] pmd=8242c067 ndvm-pv (1): [0.466949] l2_ident[ 1] pmd=8242d067 ndvm-pv (1): [0.466955] l2_ident[ 8] pmd=82434067 ndvm-pv (1): [0.466959] l2_ident[ 9] pmd=82435067 ndvm-pv (1): [0.466964] l2_ident[ 14] pmd=8243a067 ndvm-pv (1): [0.466969] l2_ident[ 15] pmd=8243b067 ndvm-pv (1): [0.466974]
Re: [PATCH-for-9.0 v2 05/19] hw/display: Restrict xen_register_framebuffer() call to Xen
On Tue, Nov 14, 2023 at 03:38:01PM +0100, Philippe Mathieu-Daudé wrote: > Only call xen_register_framebuffer() when Xen is enabled. > > Signed-off-by: Philippe Mathieu-Daudé I don't think this patch is very useful but it's fine, so: Reviewed-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [PATCH-for-9.0 v2 08/19] hw/xen: Remove unused Xen stubs
On Tue, Nov 14, 2023 at 03:38:04PM +0100, Philippe Mathieu-Daudé wrote: > All these stubs are protected by a 'if (xen_enabled())' check. Are you sure? There's still nothing that prevent a compiler from wanting those, I don't think. Sure, often compilers will remove dead code in `if(0){...}`, but there's no guaranty, is there? Cheers, -- Anthony PERARD
Re: [PATCH v7 04/14] xen/arm: add Dom0 cache coloring support
Hi, On 27/03/2024 11:39, Carlo Nonato wrote: On Fri, Mar 22, 2024 at 8:26 AM Jan Beulich wrote: On 21.03.2024 18:31, Carlo Nonato wrote: On Thu, Mar 21, 2024 at 4:57 PM Jan Beulich wrote: On 21.03.2024 16:04, Carlo Nonato wrote: On Tue, Mar 19, 2024 at 4:30 PM Jan Beulich wrote: On 15.03.2024 11:58, Carlo Nonato wrote: --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -963,6 +963,15 @@ Controls for the dom0 IOMMU setup. Specify a list of IO ports to be excluded from dom0 access. +### dom0-llc-colors +> `= List of [ | - ]` + +> Default: `All available LLC colors` + +Specify dom0 LLC color configuration. This option is available only when +`CONFIG_LLC_COLORING` is enabled. If the parameter is not set, all available +colors are used. My reservation towards this being a top-level option remains. How can I turn this into a lower-level option? Moving it into "dom0=" doesn't seem possible to me. How can I express a list (llc-colors) inside another list (dom0)? dom0=llc-colors=0-3,12-15,other-param=... How can I stop parsing before reaching other-param? For example by using a different separator: dom0=llc-colors=0-3+12-15,other-param=... Ok, but that would mean to change the implementation of the parsing function and to adopt this syntax also in other places, something that I would've preferred to avoid. Anyway I'll follow your suggestion. Well, this is all used by Arm only for now. You will want to make sure Arm folks are actually okay with this alternative approach. Jan Are you Arm maintainers ok with this? Unfortunately no. I find the use of + and "nested" = extremely confusing to read. There might be other symbols to use (e.g. s/=/:/ s#+#/#), but I am not entirely sure the value of trying to cram all the options under a single top-level parameter. So I right now, I would prefrr if we stick with the existing approach (i.e. introducing dom0-llc-colors). Cheers, -- Julien Grall
Re: [PATCH v7 04/14] xen/arm: add Dom0 cache coloring support
On 27/03/2024 12:39, Carlo Nonato wrote: > > > Hi guys, > > On Fri, Mar 22, 2024 at 8:26 AM Jan Beulich wrote: >> >> On 21.03.2024 18:31, Carlo Nonato wrote: >>> On Thu, Mar 21, 2024 at 4:57 PM Jan Beulich wrote: On 21.03.2024 16:04, Carlo Nonato wrote: > On Tue, Mar 19, 2024 at 4:30 PM Jan Beulich wrote: >> On 15.03.2024 11:58, Carlo Nonato wrote: >>> --- a/docs/misc/xen-command-line.pandoc >>> +++ b/docs/misc/xen-command-line.pandoc >>> @@ -963,6 +963,15 @@ Controls for the dom0 IOMMU setup. >>> >>> Specify a list of IO ports to be excluded from dom0 access. >>> >>> +### dom0-llc-colors >>> +> `= List of [ | - ]` >>> + >>> +> Default: `All available LLC colors` >>> + >>> +Specify dom0 LLC color configuration. This option is available only >>> when >>> +`CONFIG_LLC_COLORING` is enabled. If the parameter is not set, all >>> available >>> +colors are used. >> >> My reservation towards this being a top-level option remains. > > How can I turn this into a lower-level option? Moving it into "dom0=" > doesn't > seem possible to me. How can I express a list (llc-colors) inside another > list > (dom0)? dom0=llc-colors=0-3,12-15,other-param=... How can I stop parsing > before reaching other-param? For example by using a different separator: dom0=llc-colors=0-3+12-15,other-param=... >>> >>> Ok, but that would mean to change the implementation of the parsing function >>> and to adopt this syntax also in other places, something that I would've >>> preferred to avoid. Anyway I'll follow your suggestion. >> >> Well, this is all used by Arm only for now. You will want to make sure Arm >> folks are actually okay with this alternative approach. >> >> Jan > > Are you Arm maintainers ok with this? I'm not a fan of this syntax and I find it more difficult to parse compared to the usual case, where every option is clearly separated. That said, I won't oppose to it. ~Michal
[PATCH net] xen-netfront: Add missing skb_mark_for_recycle
Notice that skb_mark_for_recycle() is introduced later than fixes tag in 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling"). It is believed that fixes tag were missing a call to page_pool_release_page() between v5.9 to v5.14, after which is should have used skb_mark_for_recycle(). Since v6.6 the call page_pool_release_page() were removed (in 535b9c61bdef ("net: page_pool: hide page_pool_release_page()") and remaining callers converted (in commit 6bfef2ec0172 ("Merge branch 'net-page_pool-remove-page_pool_release_page'")). This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool: catch page_pool memory leaks"). Fixes: 6c5aa6fc4def ("xen networking: add basic XDP support for xen-netfront") Reported-by: Arthur Borsboom Signed-off-by: Jesper Dangaard Brouer --- Compile tested only, can someone please test this drivers/net/xen-netfront.c |1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c index ad29f370034e..8d2aee88526c 100644 --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -285,6 +285,7 @@ static struct sk_buff *xennet_alloc_one_rx_buffer(struct netfront_queue *queue) return NULL; } skb_add_rx_frag(skb, 0, page, 0, 0, PAGE_SIZE); + skb_mark_for_recycle(skb); /* Align ip header to a 16 bytes boundary */ skb_reserve(skb, NET_IP_ALIGN);
preparations for 4.17.4
All, the release is due in two to three weeks. Please point out backports you find missing from the respective staging branch, but which you consider relevant. Note that this is going to be the last Xen Project coordinated ordinary stable release from this branch; the branch will move into security-only support mode afterwards. Jan
Re: preparations for 4.17.4
On 27/03/2024 12:23 pm, Jan Beulich wrote: > All, > > the release is due in two to three weeks. Please point out backports you find > missing from the respective staging branch, but which you consider relevant. > > Note that this is going to be the last Xen Project coordinated ordinary stable > release from this branch; the branch will move into security-only support mode > afterwards. 1) livepatching of .rodata: 989556c6f8ca - xen/virtual-region: Rename the start/end fields ef969144a425 - xen/virtual-region: Include rodata pointers b083b1c393dc - x86/livepatch: Relax permissions on rodata too And technically "x86/mm: fix detection of last L1 entry in modify_xen_mappings_lite()" too but you've already backported this one. Patching .rodata worked before Xen 4.17, and was broken (left as a TODO) when I adjusted Xen to stop using CR0.WP=0 for patching. 2) Policy fixes: e2d8a6522516 - x86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max policies This is a real bugfix for a real regression we found updating from Xen 4.13 -> 4.17. It has a dependency on 5420aa165dfa - x86/cpu-policy: Hide x2APIC from PV guests which I know you had more concern with. FWIW, I'm certain its a good fix, and should be backported. 3) Test fixes: 0263dc9069dd - tests/resource: Fix HVM guest in !SHADOW builds It's minor, but does make a difference for those of us who run these tests regularly. 4) Watchdog fixes: 9e18f339830c - x86/boot: Improve the boot watchdog determination of stuck cpus 131892e0dcc1 - x86/boot: Support the watchdog on newer AMD systems You took "x86/boot: Fix setup_apic_nmi_watchdog() to fail more cleanly" and the first of the two patches is in the same category IMO. The second I also feel ok to take for the in-support releases, particularly as all it is doing is dropping a family list. 5) Ucode scan stability (For 4.18 only) Xen 4.18 had "x86/ucode: Refresh raw CPU policy after microcode load" in it's .0 release, so should also gain: cf7fe8b72dea - x86/ucode: Fix stability of the raw CPU Policy rescan I've only noticed because I've got them both backported to 4.17 in XenServer, but I don't think upstream wants to take that route. ~Andrew
Re: Xen NIC driver have page_pool memory leaks
On 3/27/24 14:27, Jesper Dangaard Brouer wrote: > > > On 25/03/2024 13.33, Paul Durrant wrote: >> On 25/03/2024 12:21, Jesper Dangaard Brouer wrote: >>> Hi Arthur, >>> >>> (Answer inlined below, which is custom on this mailing list) >>> >>> On 23/03/2024 14.23, Arthur Borsboom wrote: Hi Jesper, After a recent kernel upgrade 6.7.6 > 6.8.1 all my Xen guests on Arch Linux are dumping kernel traces. It seems to be indirectly caused by the page pool memory leak mechanism, which is probably a good thing. I have created a bug report, but there is no response. https://bugzilla.kernel.org/show_bug.cgi?id=218618 I am uncertain where and to whom I need to report this page leak. Can you help me get this issue fixed? >>> >>> I'm the page_pool maintainer, but as you say yourself in comment 2 then >>> since dba1b8a7ab68 ("mm/page_pool: catch page_pool memory leaks") this >>> indicated there is a problem in the xen_netfront driver, which was >>> previously not visible. >>> >>> Cc'ing the "XEN NETWORK BACKEND DRIVER" maintainers, as this is a driver >>> bug. What confuses me it that I cannot find any modules named >>> "xen_netfront" in the upstream tree. >>> >> >> You should have tried '-' rather than '_' :-) >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/xen-netfront.c >> > > Looking at this driver, I think it is missing a call to > skb_mark_for_recycle(). > > I'll will submit at patch for this, with details for stable maintainers. Please do > > As I think it dates back to v5.9 via commit 6c5aa6fc4def ("xen > networking: add basic XDP support for xen-netfront"). I think this > commit is missing a call to page_pool_release_page() > between v5.9 to v5.14, after which is should have used > skb_mark_for_recycle(). Hmm, looks like I've missed it > > Since v6.6 the call page_pool_release_page() were removed (in > 535b9c61bdef ("net: page_pool: hide page_pool_release_page()") and > remaining callers converted (in commit 6bfef2ec0172 ("Merge branch > 'net-page_pool-remove-page_pool_release_page'")). > > This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool: > catch page_pool memory leaks"). > > --Jesper >
Re: [PATCH v7 08/14] xen/page_alloc: introduce preserved page flags macro
Hi Carlo, On 27/03/2024 11:10, Carlo Nonato wrote: Hi guys, Question is: How would you justify such a change? IOW I'm not convinced (yet) this wants doing there. You mean in this series? Looking at the code, the flag is originally set in alloc_domheap_pages(). So I guess it would make sense to do it in free_domheap_pages(). We don't hold the heap_lock there. Regardless of the safety question (I will answer below), count_info is not meant to be protected by heap_lock. The lock is protecting the heap and ensure we are not corrupting them when there are concurrent call to free_heap_pages(). > Is it safe to change count_info without it? count_info is meant to be accessed locklessly. The flag PGC_extra cannot be set/clear concurrently because you can't allocate a page that has not yet been freed. So it would be fine to use clear_bit(..., ...); Cheers, -- Julien Grall
Re: [XEN PATCH 1/6] xen/arm: ffa: rename functions to use ffa_ prefix
Hi Jens, > On 25 Mar 2024, at 10:38, Jens Wiklander wrote: > > Prepare to separate into modules by renaming functions that will need > new names when becoming non-static in the following commit. > > Signed-off-by: Jens Wiklander Reviewed-by: Bertrand Marquis Cheers Bertrand > --- > xen/arch/arm/tee/ffa.c | 125 + > 1 file changed, 65 insertions(+), 60 deletions(-) > > diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c > index 9a05dcede17a..0344a0f17e72 100644 > --- a/xen/arch/arm/tee/ffa.c > +++ b/xen/arch/arm/tee/ffa.c > @@ -4,7 +4,7 @@ > * > * Arm Firmware Framework for ARMv8-A (FF-A) mediator > * > - * Copyright (C) 2023 Linaro Limited > + * Copyright (C) 2023-2024 Linaro Limited > * > * References: > * FF-A-1.0-REL: FF-A specification version 1.0 available at > @@ -473,7 +473,7 @@ static bool ffa_get_version(uint32_t *vers) > return true; > } > > -static int32_t get_ffa_ret_code(const struct arm_smccc_1_2_regs *resp) > +static int32_t ffa_get_ret_code(const struct arm_smccc_1_2_regs *resp) > { > switch ( resp->a0 ) > { > @@ -504,7 +504,7 @@ static int32_t ffa_simple_call(uint32_t fid, register_t > a1, register_t a2, > > arm_smccc_1_2_smc(&arg, &resp); > > -return get_ffa_ret_code(&resp); > +return ffa_get_ret_code(&resp); > } > > static int32_t ffa_features(uint32_t id) > @@ -546,7 +546,7 @@ static int32_t ffa_partition_info_get(uint32_t w1, > uint32_t w2, uint32_t w3, > > arm_smccc_1_2_smc(&arg, &resp); > > -ret = get_ffa_ret_code(&resp); > +ret = ffa_get_ret_code(&resp); > if ( !ret ) > { > *count = resp.a2; > @@ -654,15 +654,16 @@ static int32_t ffa_direct_req_send_vm(uint16_t sp_id, > uint16_t vm_id, > return res; > } > > -static uint16_t get_vm_id(const struct domain *d) > +static uint16_t ffa_get_vm_id(const struct domain *d) > { > /* +1 since 0 is reserved for the hypervisor in FF-A */ > return d->domain_id + 1; > } > > -static void set_regs(struct cpu_user_regs *regs, register_t v0, register_t > v1, > - register_t v2, register_t v3, register_t v4, register_t > v5, > - register_t v6, register_t v7) > +static void ffa_set_regs(struct cpu_user_regs *regs, register_t v0, > + register_t v1, register_t v2, register_t v3, > + register_t v4, register_t v5, register_t v6, > + register_t v7) > { > set_user_reg(regs, 0, v0); > set_user_reg(regs, 1, v1); > @@ -674,15 +675,15 @@ static void set_regs(struct cpu_user_regs *regs, > register_t v0, register_t v1, > set_user_reg(regs, 7, v7); > } > > -static void set_regs_error(struct cpu_user_regs *regs, uint32_t error_code) > +static void ffa_set_regs_error(struct cpu_user_regs *regs, uint32_t > error_code) > { > -set_regs(regs, FFA_ERROR, 0, error_code, 0, 0, 0, 0, 0); > +ffa_set_regs(regs, FFA_ERROR, 0, error_code, 0, 0, 0, 0, 0); > } > > -static void set_regs_success(struct cpu_user_regs *regs, uint32_t w2, > +static void ffa_set_regs_success(struct cpu_user_regs *regs, uint32_t w2, > uint32_t w3) > { > -set_regs(regs, FFA_SUCCESS_32, 0, w2, w3, 0, 0, 0, 0); > +ffa_set_regs(regs, FFA_SUCCESS_32, 0, w2, w3, 0, 0, 0, 0); > } > > static void handle_version(struct cpu_user_regs *regs) > @@ -697,11 +698,11 @@ static void handle_version(struct cpu_user_regs *regs) > vers = FFA_VERSION_1_1; > > ctx->guest_vers = vers; > -set_regs(regs, vers, 0, 0, 0, 0, 0, 0, 0); > +ffa_set_regs(regs, vers, 0, 0, 0, 0, 0, 0, 0); > } > > -static uint32_t handle_rxtx_map(uint32_t fid, register_t tx_addr, > -register_t rx_addr, uint32_t page_count) > +static uint32_t ffa_handle_rxtx_map(uint32_t fid, register_t tx_addr, > +register_t rx_addr, uint32_t page_count) > { > uint32_t ret = FFA_RET_INVALID_PARAMETERS; > struct domain *d = current->domain; > @@ -789,7 +790,7 @@ static void rxtx_unmap(struct ffa_ctx *ctx) > ctx->rx_is_free = false; > } > > -static uint32_t handle_rxtx_unmap(void) > +static uint32_t ffa_handle_rxtx_unmap(void) > { > struct domain *d = current->domain; > struct ffa_ctx *ctx = d->arch.tee; > @@ -802,9 +803,10 @@ static uint32_t handle_rxtx_unmap(void) > return FFA_RET_OK; > } > > -static int32_t handle_partition_info_get(uint32_t w1, uint32_t w2, uint32_t > w3, > - uint32_t w4, uint32_t w5, > - uint32_t *count, uint32_t *fpi_size) > +static int32_t ffa_handle_partition_info_get(uint32_t w1, uint32_t w2, > + uint32_t w3, uint32_t w4, > + uint32_t w5, uint32_t *count, > + uint32_t *fpi_size) > { > int32_t ret = FFA_RET_DENIED; > struct domain
Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma
On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote: > Except imported source files, QEMU code base uses > the QEMU_ALIGNED() macro to align its structures. This patch only convert the alignment, but discard pack. We need both or the struct is changed. > --- > hw/block/xen_blkif.h | 8 +++- > 1 file changed, 3 insertions(+), 5 deletions(-) > > diff --git a/hw/block/xen_blkif.h b/hw/block/xen_blkif.h > index 99733529c1..c1d154d502 100644 > --- a/hw/block/xen_blkif.h > +++ b/hw/block/xen_blkif.h > @@ -18,7 +18,6 @@ struct blkif_common_response { > }; > > /* i386 protocol version */ > -#pragma pack(push, 4) > struct blkif_x86_32_request { > uint8_toperation;/* BLKIF_OP_??? > */ > uint8_tnr_segments; /* number of segments > */ > @@ -26,7 +25,7 @@ struct blkif_x86_32_request { > uint64_t id; /* private guest value, echoed in resp > */ > blkif_sector_t sector_number;/* start sector idx on disk (r/w only) > */ > struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > -}; > +} QEMU_ALIGNED(4); E.g. for this one, I've compare the output of `pahole --class_name=blkif_x86_32_request build/qemu-system-i386`: --- before +++ after @@ -1,11 +1,15 @@ struct blkif_x86_32_request { uint8_toperation;/* 0 1 */ uint8_tnr_segments; /* 1 1 */ uint16_t handle; /* 2 2 */ - uint64_t id; /* 4 8 */ - uint64_t sector_number;/*12 8 */ - struct blkif_request_segment seg[11];/*2088 */ - /* size: 108, cachelines: 2, members: 6 */ - /* last cacheline: 44 bytes */ -} __attribute__((__packed__)); + /* XXX 4 bytes hole, try to pack */ + + uint64_t id; /* 8 8 */ + uint64_t sector_number;/*16 8 */ + struct blkif_request_segment seg[11];/*2488 */ + + /* size: 112, cachelines: 2, members: 6 */ + /* sum members: 108, holes: 1, sum holes: 4 */ + /* last cacheline: 48 bytes */ +} __attribute__((__aligned__(8))); Thanks, -- Anthony PERARD
Re: [XEN PATCH 2/6] xen/arm: ffa: move common things to ffa_private.h
Hi Jens, > On 25 Mar 2024, at 10:39, Jens Wiklander wrote: > > Prepare to separate ffa.c into modules by moving common things into the > new internal header file ffa_private.h. > > Signed-off-by: Jens Wiklander Reviewed-by: Bertrand Marquis Cheers Bertrand > --- > xen/arch/arm/tee/ffa.c | 298 +- > xen/arch/arm/tee/ffa_private.h | 318 + > 2 files changed, 319 insertions(+), 297 deletions(-) > create mode 100644 xen/arch/arm/tee/ffa_private.h > > diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c > index 0344a0f17e72..259851f20bdb 100644 > --- a/xen/arch/arm/tee/ffa.c > +++ b/xen/arch/arm/tee/ffa.c > @@ -63,204 +63,7 @@ > #include > #include > > -/* Error codes */ > -#define FFA_RET_OK 0 > -#define FFA_RET_NOT_SUPPORTED -1 > -#define FFA_RET_INVALID_PARAMETERS -2 > -#define FFA_RET_NO_MEMORY -3 > -#define FFA_RET_BUSY-4 > -#define FFA_RET_INTERRUPTED -5 > -#define FFA_RET_DENIED -6 > -#define FFA_RET_RETRY -7 > -#define FFA_RET_ABORTED -8 > - > -/* FFA_VERSION helpers */ > -#define FFA_VERSION_MAJOR_SHIFT 16U > -#define FFA_VERSION_MAJOR_MASK 0x7FFFU > -#define FFA_VERSION_MINOR_SHIFT 0U > -#define FFA_VERSION_MINOR_MASK 0xU > -#define MAKE_FFA_VERSION(major, minor) \ > -major) & FFA_VERSION_MAJOR_MASK) << FFA_VERSION_MAJOR_SHIFT) | \ > - ((minor) & FFA_VERSION_MINOR_MASK)) > - > -#define FFA_VERSION_1_0 MAKE_FFA_VERSION(1, 0) > -#define FFA_VERSION_1_1 MAKE_FFA_VERSION(1, 1) > -/* The minimal FF-A version of the SPMC that can be supported */ > -#define FFA_MIN_SPMC_VERSIONFFA_VERSION_1_1 > - > -/* > - * This is the version we want to use in communication with guests and SPs. > - * During negotiation with a guest or a SP we may need to lower it for > - * that particular guest or SP. > - */ > -#define FFA_MY_VERSION_MAJOR1U > -#define FFA_MY_VERSION_MINOR1U > -#define FFA_MY_VERSION MAKE_FFA_VERSION(FFA_MY_VERSION_MAJOR, \ > - FFA_MY_VERSION_MINOR) > - > -/* > - * The FF-A specification explicitly works with 4K pages as a measure of > - * memory size, for example, FFA_RXTX_MAP takes one parameter "RX/TX page > - * count" which is the number of contiguous 4K pages allocated. Xen may use > - * a different page size depending on the configuration to avoid confusion > - * with PAGE_SIZE use a special define when it's a page size as in the FF-A > - * specification. > - */ > -#define FFA_PAGE_SIZE SZ_4K > - > -/* > - * The number of pages used for each of the RX and TX buffers shared with > - * the SPMC. > - */ > -#define FFA_RXTX_PAGE_COUNT 1 > - > -/* > - * Limit the number of pages RX/TX buffers guests can map. > - * TODO support a larger number. > - */ > -#define FFA_MAX_RXTX_PAGE_COUNT 1 > - > -/* > - * Limit for shared buffer size. Please note that this define limits > - * number of pages. > - * > - * FF-A doesn't have any direct requirements on GlobalPlatform or vice > - * versa, but an implementation can very well use FF-A in order to provide > - * a GlobalPlatform interface on top. > - * > - * Global Platform specification for TEE requires that any TEE > - * implementation should allow to share buffers with size of at least > - * 512KB, defined in TEEC-1.0C page 24, Table 4-1, > - * TEEC_CONFIG_SHAREDMEM_MAX_SIZE. > - * Due to overhead which can be hard to predict exactly, double this number > - * to give a safe margin. > - */ > -#define FFA_MAX_SHM_PAGE_COUNT (2 * SZ_512K / FFA_PAGE_SIZE) > - > -/* > - * Limits the number of shared buffers that guest can have at once. This > - * is to prevent case, when guests trick XEN into exhausting its own > - * memory by allocating many small buffers. This value has been chosen > - * arbitrarily. > - */ > -#define FFA_MAX_SHM_COUNT 32 > - > -/* > - * The time we wait until trying to tear down a domain again if it was > - * blocked initially. > - */ > -#define FFA_CTX_TEARDOWN_DELAY SECONDS(1) > - > -/* FF-A-1.1-REL0 section 10.9.2 Memory region handle, page 167 */ > -#define FFA_HANDLE_HYP_FLAG BIT(63, ULL) > -#define FFA_HANDLE_INVALID 0xULL > - > -/* > - * Memory attributes: Normal memory, Write-Back cacheable, Inner shareable > - * Defined in FF-A-1.1-REL0 Table 10.18 at page 175. > - */ > -#define FFA_NORMAL_MEM_REG_ATTR 0x2fU > -/* > - * Memory access permissions: Read-write > - * Defined in FF-A-1.1-REL0 Table 10.15 at page 168. > - */ > -#define FFA_MEM_ACC_RW 0x2U > - > -/* FF-A-1.1-REL0 section 10.11.4 Flags usage, page 184-187 */ > -/* Clear memory before mapping in receiver */ > -#define FFA_MEMORY_REGION_FLAG_CLEARBIT(0, U) > -/* Relayer may time sl
Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma
On 27 March 2024 13:31:52 GMT, Anthony PERARD wrote: >On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote: >> Except imported source files, QEMU code base uses >> the QEMU_ALIGNED() macro to align its structures. > >This patch only convert the alignment, but discard pack. We need both or >the struct is changed. Which means we need some build-time asserts on struct size and field offsets. That should never have passed a build test.
Re: [PATCH-for-9.0 v2 11/19] hw/xen/xen_arch_hvm: Rename prototypes using 'xen_arch_' prefix
On Tue, Nov 14, 2023 at 03:38:07PM +0100, Philippe Mathieu-Daudé wrote: > Use a common 'xen_arch_' prefix for architecture-specific functions. > Rename xen_arch_set_memory() and xen_arch_handle_ioreq(). > > Signed-off-by: Philippe Mathieu-Daudé > Reviewed-by: David Woodhouse > Reviewed-by: Richard Henderson Acked-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [PATCH v7 08/14] xen/page_alloc: introduce preserved page flags macro
On 27.03.2024 14:28, Julien Grall wrote: > Hi Carlo, > > On 27/03/2024 11:10, Carlo Nonato wrote: >> Hi guys, >> >>> Question is: How would you justify such a change? IOW I'm not convinced >>> (yet) this wants doing there. >> >> You mean in this series? >> >>> Looking at the code, the flag is originally set in >>> alloc_domheap_pages(). So I guess it would make sense to do it in >>> free_domheap_pages(). >> >> We don't hold the heap_lock there. > Regardless of the safety question (I will answer below), count_info is > not meant to be protected by heap_lock. The lock is protecting the heap > and ensure we are not corrupting them when there are concurrent call to > free_heap_pages(). > > > Is it safe to change count_info without it? > > count_info is meant to be accessed locklessly. The flag PGC_extra cannot > be set/clear concurrently because you can't allocate a page that has not > yet been freed. > > So it would be fine to use clear_bit(..., ...); Actually we hardly ever use clear_bit() on count_info. Normally we use ordinary C operators. Atomic (and otherwise lockless) updates are useful only if done like this everywhere. Jan
Re: [XEN PATCH 3/6] xen/arm: ffa: separate memory sharing routines
Hi Jens, > On 25 Mar 2024, at 10:39, Jens Wiklander wrote: > > Move memory sharing routines into a separate file for easier navigation > in the source code. > > Add ffa_shm_domain_destroy() to isolate the ffa_shm things in > ffa_domain_teardown_continue(). > > Signed-off-by: Jens Wiklander With the copyright date mentioned after fixed (which can be done on commit): Reviewed-by: Bertrand Marquis --- > xen/arch/arm/tee/Makefile | 1 + > xen/arch/arm/tee/ffa.c | 708 + > xen/arch/arm/tee/ffa_private.h | 10 + > xen/arch/arm/tee/ffa_shm.c | 708 + > 4 files changed, 729 insertions(+), 698 deletions(-) > create mode 100644 xen/arch/arm/tee/ffa_shm.c > > diff --git a/xen/arch/arm/tee/Makefile b/xen/arch/arm/tee/Makefile > index 58a1015e40e0..0e683d23aa9d 100644 > --- a/xen/arch/arm/tee/Makefile > +++ b/xen/arch/arm/tee/Makefile > @@ -1,3 +1,4 @@ > obj-$(CONFIG_FFA) += ffa.o > +obj-$(CONFIG_FFA) += ffa_shm.o > obj-y += tee.o > obj-$(CONFIG_OPTEE) += optee.o > diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c > index 259851f20bdb..db36292dc52f 100644 > --- a/xen/arch/arm/tee/ffa.c > +++ b/xen/arch/arm/tee/ffa.c > @@ -84,92 +84,6 @@ struct ffa_partition_info_1_1 { > uint8_t uuid[16]; > }; > > -/* Constituent memory region descriptor */ > -struct ffa_address_range { > -uint64_t address; > -uint32_t page_count; > -uint32_t reserved; > -}; > - > -/* Composite memory region descriptor */ > -struct ffa_mem_region { > -uint32_t total_page_count; > -uint32_t address_range_count; > -uint64_t reserved; > -struct ffa_address_range address_range_array[]; > -}; > - > -/* Memory access permissions descriptor */ > -struct ffa_mem_access_perm { > -uint16_t endpoint_id; > -uint8_t perm; > -uint8_t flags; > -}; > - > -/* Endpoint memory access descriptor */ > -struct ffa_mem_access { > -struct ffa_mem_access_perm access_perm; > -uint32_t region_offs; > -uint64_t reserved; > -}; > - > -/* Lend, donate or share memory transaction descriptor */ > -struct ffa_mem_transaction_1_0 { > -uint16_t sender_id; > -uint8_t mem_reg_attr; > -uint8_t reserved0; > -uint32_t flags; > -uint64_t handle; > -uint64_t tag; > -uint32_t reserved1; > -uint32_t mem_access_count; > -struct ffa_mem_access mem_access_array[]; > -}; > - > -struct ffa_mem_transaction_1_1 { > -uint16_t sender_id; > -uint16_t mem_reg_attr; > -uint32_t flags; > -uint64_t handle; > -uint64_t tag; > -uint32_t mem_access_size; > -uint32_t mem_access_count; > -uint32_t mem_access_offs; > -uint8_t reserved[12]; > -}; > - > -/* Calculate offset of struct ffa_mem_access from start of buffer */ > -#define MEM_ACCESS_OFFSET(access_idx) \ > -( sizeof(struct ffa_mem_transaction_1_1) + \ > - ( access_idx ) * sizeof(struct ffa_mem_access) ) > - > -/* Calculate offset of struct ffa_mem_region from start of buffer */ > -#define REGION_OFFSET(access_count, region_idx) \ > -( MEM_ACCESS_OFFSET(access_count) + \ > - ( region_idx ) * sizeof(struct ffa_mem_region) ) > - > -/* Calculate offset of struct ffa_address_range from start of buffer */ > -#define ADDR_RANGE_OFFSET(access_count, region_count, range_idx) \ > -( REGION_OFFSET(access_count, region_count) + \ > - ( range_idx ) * sizeof(struct ffa_address_range) ) > - > -/* > - * The parts needed from struct ffa_mem_transaction_1_0 or struct > - * ffa_mem_transaction_1_1, used to provide an abstraction of difference in > - * data structures between version 1.0 and 1.1. This is just an internal > - * interface and can be changed without changing any ABI. > - */ > -struct ffa_mem_transaction_int { > -uint16_t sender_id; > -uint8_t mem_reg_attr; > -uint8_t flags; > -uint8_t mem_access_size; > -uint8_t mem_access_count; > -uint16_t mem_access_offs; > -uint64_t handle; > -uint64_t tag; > -}; > - > /* Endpoint RX/TX descriptor */ > struct ffa_endpoint_rxtx_descriptor_1_0 { > uint16_t sender_id; > @@ -185,15 +99,6 @@ struct ffa_endpoint_rxtx_descriptor_1_1 { > uint32_t tx_region_offs; > }; > > -struct ffa_shm_mem { > -struct list_head list; > -uint16_t sender_id; > -uint16_t ep_id; /* endpoint, the one lending */ > -uint64_t handle;/* FFA_HANDLE_INVALID if not set yet */ > -unsigned int page_count; > -struct page_info *pages[]; > -}; > - > /* Negotiated FF-A version to use with the SPMC */ > static uint32_t __ro_after_init ffa_version; > > @@ -212,10 +117,10 @@ static uint16_t subscr_vm_destroyed_count __read_mostly; > * for calls which uses our RX buffer to deliver a result we must call > * ffa_rx_release() to let the SPMC know that we're done with the buffer. > */ > -static void *ffa_rx __read_mostly; > -static void *ffa_tx __read_mostly; > -static DEFINE_SPINLOCK(ffa_rx_buffer_lock); > -static DEFINE_SPINLOCK(ffa_tx_buffer_lock
Re: [XEN PATCH 4/6] xen/arm: ffa: separate partition info get routines
Hi Jens, > On 25 Mar 2024, at 10:39, Jens Wiklander wrote: > > Move partition info get routines into a separate file for easier > navigation in the source code. > > Add ffa_partinfo_init(), ffa_partinfo_domain_init(), and > ffa_partinfo_domain_destroy() to handle the ffa_partinfo internal things > on initialization and teardown. > > Signed-off-by: Jens Wiklander Reviewed-by: Bertrand Marquis Cheers Bertrand > --- > xen/arch/arm/tee/Makefile | 1 + > xen/arch/arm/tee/ffa.c | 359 +- > xen/arch/arm/tee/ffa_partinfo.c | 373 > xen/arch/arm/tee/ffa_private.h | 14 +- > 4 files changed, 398 insertions(+), 349 deletions(-) > create mode 100644 xen/arch/arm/tee/ffa_partinfo.c > > diff --git a/xen/arch/arm/tee/Makefile b/xen/arch/arm/tee/Makefile > index 0e683d23aa9d..be644fba8055 100644 > --- a/xen/arch/arm/tee/Makefile > +++ b/xen/arch/arm/tee/Makefile > @@ -1,4 +1,5 @@ > obj-$(CONFIG_FFA) += ffa.o > obj-$(CONFIG_FFA) += ffa_shm.o > +obj-$(CONFIG_FFA) += ffa_partinfo.o > obj-y += tee.o > obj-$(CONFIG_OPTEE) += optee.o > diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c > index db36292dc52f..7a2803881420 100644 > --- a/xen/arch/arm/tee/ffa.c > +++ b/xen/arch/arm/tee/ffa.c > @@ -70,20 +70,6 @@ > * structs ending with _1_1 are defined in FF-A-1.1-REL0. > */ > > -/* Partition information descriptor */ > -struct ffa_partition_info_1_0 { > -uint16_t id; > -uint16_t execution_context; > -uint32_t partition_properties; > -}; > - > -struct ffa_partition_info_1_1 { > -uint16_t id; > -uint16_t execution_context; > -uint32_t partition_properties; > -uint8_t uuid[16]; > -}; > - > /* Endpoint RX/TX descriptor */ > struct ffa_endpoint_rxtx_descriptor_1_0 { > uint16_t sender_id; > @@ -102,11 +88,6 @@ struct ffa_endpoint_rxtx_descriptor_1_1 { > /* Negotiated FF-A version to use with the SPMC */ > static uint32_t __ro_after_init ffa_version; > > -/* SPs subscribing to VM_CREATE and VM_DESTROYED events */ > -static uint16_t *subscr_vm_created __read_mostly; > -static uint16_t subscr_vm_created_count __read_mostly; > -static uint16_t *subscr_vm_destroyed __read_mostly; > -static uint16_t subscr_vm_destroyed_count __read_mostly; > > /* > * Our rx/tx buffers shared with the SPMC. FFA_RXTX_PAGE_COUNT is the > @@ -170,90 +151,6 @@ static int32_t ffa_rxtx_map(paddr_t tx_addr, paddr_t > rx_addr, > return ffa_simple_call(FFA_RXTX_MAP_64, tx_addr, rx_addr, page_count, 0); > } > > -static int32_t ffa_partition_info_get(uint32_t w1, uint32_t w2, uint32_t w3, > - uint32_t w4, uint32_t w5, > - uint32_t *count, uint32_t *fpi_size) > -{ > -const struct arm_smccc_1_2_regs arg = { > -.a0 = FFA_PARTITION_INFO_GET, > -.a1 = w1, > -.a2 = w2, > -.a3 = w3, > -.a4 = w4, > -.a5 = w5, > -}; > -struct arm_smccc_1_2_regs resp; > -uint32_t ret; > - > -arm_smccc_1_2_smc(&arg, &resp); > - > -ret = ffa_get_ret_code(&resp); > -if ( !ret ) > -{ > -*count = resp.a2; > -*fpi_size = resp.a3; > -} > - > -return ret; > -} > - > -static int32_t ffa_rx_release(void) > -{ > -return ffa_simple_call(FFA_RX_RELEASE, 0, 0, 0, 0); > -} > - > -static int32_t ffa_direct_req_send_vm(uint16_t sp_id, uint16_t vm_id, > - uint8_t msg) > -{ > -uint32_t exp_resp = FFA_MSG_FLAG_FRAMEWORK; > -unsigned int retry_count = 0; > -int32_t res; > - > -if ( msg == FFA_MSG_SEND_VM_CREATED ) > -exp_resp |= FFA_MSG_RESP_VM_CREATED; > -else if ( msg == FFA_MSG_SEND_VM_DESTROYED ) > -exp_resp |= FFA_MSG_RESP_VM_DESTROYED; > -else > -return FFA_RET_INVALID_PARAMETERS; > - > -do { > -const struct arm_smccc_1_2_regs arg = { > -.a0 = FFA_MSG_SEND_DIRECT_REQ_32, > -.a1 = sp_id, > -.a2 = FFA_MSG_FLAG_FRAMEWORK | msg, > -.a5 = vm_id, > -}; > -struct arm_smccc_1_2_regs resp; > - > -arm_smccc_1_2_smc(&arg, &resp); > -if ( resp.a0 != FFA_MSG_SEND_DIRECT_RESP_32 || resp.a2 != exp_resp ) > -{ > -/* > - * This is an invalid response, likely due to some error in the > - * implementation of the ABI. > - */ > -return FFA_RET_INVALID_PARAMETERS; > -} > -res = resp.a3; > -if ( ++retry_count > 10 ) > -{ > -/* > - * TODO > - * FFA_RET_INTERRUPTED means that the SPMC has a pending > - * non-secure interrupt, we need a way of delivering that > - * non-secure interrupt. > - * FFA_RET_RETRY is the SP telling us that it's temporarily > - * blocked from handling the direct request, we need a generic > - * way to deal with this. > -
Re: [XEN PATCH 5/6] xen/arm: ffa: separate rxtx buffer routines
Hi Jens, > On 25 Mar 2024, at 10:39, Jens Wiklander wrote: > > Move rxtx buffer routines into a spearate file for easier navigation in > the source code. > > Add ffa_rxtx_init(), ffa_rxtx_destroy(), and ffa_rxtx_domain_destroy() to > handle the ffa_rxtx internal things on initialization and teardown. > > Signed-off-by: Jens Wiklander Reviewed-by: Bertrand Marquis Cheers Bertrand > --- > xen/arch/arm/tee/Makefile | 1 + > xen/arch/arm/tee/ffa.c | 174 +- > xen/arch/arm/tee/ffa_private.h | 7 ++ > xen/arch/arm/tee/ffa_rxtx.c| 216 + > 4 files changed, 229 insertions(+), 169 deletions(-) > create mode 100644 xen/arch/arm/tee/ffa_rxtx.c > > diff --git a/xen/arch/arm/tee/Makefile b/xen/arch/arm/tee/Makefile > index be644fba8055..f0112a2f922d 100644 > --- a/xen/arch/arm/tee/Makefile > +++ b/xen/arch/arm/tee/Makefile > @@ -1,5 +1,6 @@ > obj-$(CONFIG_FFA) += ffa.o > obj-$(CONFIG_FFA) += ffa_shm.o > obj-$(CONFIG_FFA) += ffa_partinfo.o > +obj-$(CONFIG_FFA) += ffa_rxtx.o > obj-y += tee.o > obj-$(CONFIG_OPTEE) += optee.o > diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c > index 7a2803881420..4f7775b8c890 100644 > --- a/xen/arch/arm/tee/ffa.c > +++ b/xen/arch/arm/tee/ffa.c > @@ -65,26 +65,6 @@ > > #include "ffa_private.h" > > -/* > - * Structs below ending with _1_0 are defined in FF-A-1.0-REL and > - * structs ending with _1_1 are defined in FF-A-1.1-REL0. > - */ > - > -/* Endpoint RX/TX descriptor */ > -struct ffa_endpoint_rxtx_descriptor_1_0 { > -uint16_t sender_id; > -uint16_t reserved; > -uint32_t rx_range_count; > -uint32_t tx_range_count; > -}; > - > -struct ffa_endpoint_rxtx_descriptor_1_1 { > -uint16_t sender_id; > -uint16_t reserved; > -uint32_t rx_region_offs; > -uint32_t tx_region_offs; > -}; > - > /* Negotiated FF-A version to use with the SPMC */ > static uint32_t __ro_after_init ffa_version; > > @@ -145,12 +125,6 @@ static bool check_mandatory_feature(uint32_t id) > return !ret; > } > > -static int32_t ffa_rxtx_map(paddr_t tx_addr, paddr_t rx_addr, > -uint32_t page_count) > -{ > -return ffa_simple_call(FFA_RXTX_MAP_64, tx_addr, rx_addr, page_count, 0); > -} > - > static void handle_version(struct cpu_user_regs *regs) > { > struct domain *d = current->domain; > @@ -166,127 +140,6 @@ static void handle_version(struct cpu_user_regs *regs) > ffa_set_regs(regs, vers, 0, 0, 0, 0, 0, 0, 0); > } > > -static uint32_t ffa_handle_rxtx_map(uint32_t fid, register_t tx_addr, > -register_t rx_addr, uint32_t page_count) > -{ > -uint32_t ret = FFA_RET_INVALID_PARAMETERS; > -struct domain *d = current->domain; > -struct ffa_ctx *ctx = d->arch.tee; > -struct page_info *tx_pg; > -struct page_info *rx_pg; > -p2m_type_t t; > -void *rx; > -void *tx; > - > -if ( !smccc_is_conv_64(fid) ) > -{ > -/* > - * Calls using the 32-bit calling convention must ignore the upper > - * 32 bits in the argument registers. > - */ > -tx_addr &= UINT32_MAX; > -rx_addr &= UINT32_MAX; > -} > - > -if ( page_count > FFA_MAX_RXTX_PAGE_COUNT ) > -{ > -printk(XENLOG_ERR "ffa: RXTX_MAP: error: %u pages requested (limit > %u)\n", > - page_count, FFA_MAX_RXTX_PAGE_COUNT); > -return FFA_RET_INVALID_PARAMETERS; > -} > - > -/* Already mapped */ > -if ( ctx->rx ) > -return FFA_RET_DENIED; > - > -tx_pg = get_page_from_gfn(d, gfn_x(gaddr_to_gfn(tx_addr)), &t, > P2M_ALLOC); > -if ( !tx_pg ) > -return FFA_RET_INVALID_PARAMETERS; > - > -/* Only normal RW RAM for now */ > -if ( t != p2m_ram_rw ) > -goto err_put_tx_pg; > - > -rx_pg = get_page_from_gfn(d, gfn_x(gaddr_to_gfn(rx_addr)), &t, > P2M_ALLOC); > -if ( !tx_pg ) > -goto err_put_tx_pg; > - > -/* Only normal RW RAM for now */ > -if ( t != p2m_ram_rw ) > -goto err_put_rx_pg; > - > -tx = __map_domain_page_global(tx_pg); > -if ( !tx ) > -goto err_put_rx_pg; > - > -rx = __map_domain_page_global(rx_pg); > -if ( !rx ) > -goto err_unmap_tx; > - > -ctx->rx = rx; > -ctx->tx = tx; > -ctx->rx_pg = rx_pg; > -ctx->tx_pg = tx_pg; > -ctx->page_count = page_count; > -ctx->rx_is_free = true; > -return FFA_RET_OK; > - > -err_unmap_tx: > -unmap_domain_page_global(tx); > -err_put_rx_pg: > -put_page(rx_pg); > -err_put_tx_pg: > -put_page(tx_pg); > - > -return ret; > -} > - > -static void rxtx_unmap(struct ffa_ctx *ctx) > -{ > -unmap_domain_page_global(ctx->rx); > -unmap_domain_page_global(ctx->tx); > -put_page(ctx->rx_pg); > -put_page(ctx->tx_pg); > -ctx->rx = NULL; > -ctx->tx = NULL; > -ctx->rx_pg = NULL; > -ctx->tx_pg = NULL; > -ctx->page_count = 0; > -ctx->rx_is_free = false; > -} > - > -static uint32_t ffa_handle_rxtx_unmap(v
Re: [XEN PATCH 6/6] xen/arm: ffa: support FFA_FEATURES
Hi Jens, > On 25 Mar 2024, at 10:39, Jens Wiklander wrote: > > Add support for the mandatory FF-A ABI function FFA_FEATURES. > > Signed-off-by: Jens Wiklander Reviewed-by: Bertrand Marquis Cheers Bertrand > --- > xen/arch/arm/tee/ffa.c | 57 ++ > 1 file changed, 57 insertions(+) > > diff --git a/xen/arch/arm/tee/ffa.c b/xen/arch/arm/tee/ffa.c > index 4f7775b8c890..8665201e34a9 100644 > --- a/xen/arch/arm/tee/ffa.c > +++ b/xen/arch/arm/tee/ffa.c > @@ -192,6 +192,60 @@ out: > resp.a7 & mask); > } > > +static void handle_features(struct cpu_user_regs *regs) > +{ > +uint32_t a1 = get_user_reg(regs, 1); > +unsigned int n; > + > +for ( n = 2; n <= 7; n++ ) > +{ > +if ( get_user_reg(regs, n) ) > +{ > +ffa_set_regs_error(regs, FFA_RET_NOT_SUPPORTED); > +return; > +} > +} > + > +switch ( a1 ) > +{ > +case FFA_ERROR: > +case FFA_VERSION: > +case FFA_SUCCESS_32: > +case FFA_SUCCESS_64: > +case FFA_FEATURES: > +case FFA_ID_GET: > +case FFA_RX_RELEASE: > +case FFA_RXTX_UNMAP: > +case FFA_MEM_RECLAIM: > +case FFA_PARTITION_INFO_GET: > +case FFA_MSG_SEND_DIRECT_REQ_32: > +case FFA_MSG_SEND_DIRECT_REQ_64: > +ffa_set_regs_success(regs, 0, 0); > +break; > +case FFA_MEM_SHARE_64: > +case FFA_MEM_SHARE_32: > +/* > + * We currently don't support dynamically allocated buffers. Report > + * that with 0 in bit[0] of w2. > + */ > +ffa_set_regs_success(regs, 0, 0); > +break; > +case FFA_RXTX_MAP_64: > +case FFA_RXTX_MAP_32: > +/* > + * We currently support 4k pages only, report that as 00 in > + * bit[0:1] in w0. This needs to be revised if Xen page size > + * differs from FFA_PAGE_SIZE (SZ_4K). > + */ > +BUILD_BUG_ON(PAGE_SIZE != FFA_PAGE_SIZE); > +ffa_set_regs_success(regs, 0, 0); > +break; > +default: > +ffa_set_regs_error(regs, FFA_RET_NOT_SUPPORTED); > +break; > +} > +} > + > static bool ffa_handle_call(struct cpu_user_regs *regs) > { > uint32_t fid = get_user_reg(regs, 0); > @@ -212,6 +266,9 @@ static bool ffa_handle_call(struct cpu_user_regs *regs) > case FFA_ID_GET: > ffa_set_regs_success(regs, ffa_get_vm_id(d), 0); > return true; > +case FFA_FEATURES: > +handle_features(regs); > +return true; > case FFA_RXTX_MAP_32: > case FFA_RXTX_MAP_64: > e = ffa_handle_rxtx_map(fid, get_user_reg(regs, 1), > -- > 2.34.1 >
Re: [XEN PATCH 0/6] FF-A mediator reorganisation
Hi Jens, > On 25 Mar 2024, at 10:38, Jens Wiklander wrote: > > Hi, > > The FF-A mediator is reorganized into modules for easier maintenance and to > prepare for future changes. This patch set is not expected add any changed > behaviour, except for the "xen/arm: ffa: support FFA_FEATURES" patch. I reviewed the serie and there is only one tiny fix of copyright date that can be done on commit. Appart from that, I tested FF-A with the whole serie applied and I can confirm it works on my side so for the whole serie: Tested-by: Bertrand Marquis Cheers Bertrand > > Thanks, > Jens > > Jens Wiklander (6): > xen/arm: ffa: rename functions to use ffa_ prefix > xen/arm: ffa: move common things to ffa_private.h > xen/arm: ffa: separate memory sharing routines > xen/arm: ffa: separate partition info get routines > xen/arm: ffa: separate rxtx buffer routines > xen/arm: ffa: support FFA_FEATURES > > xen/arch/arm/tee/Makefile |3 + > xen/arch/arm/tee/ffa.c | 1629 ++- > xen/arch/arm/tee/ffa_partinfo.c | 373 +++ > xen/arch/arm/tee/ffa_private.h | 347 +++ > xen/arch/arm/tee/ffa_rxtx.c | 216 > xen/arch/arm/tee/ffa_shm.c | 708 ++ > 6 files changed, 1750 insertions(+), 1526 deletions(-) > create mode 100644 xen/arch/arm/tee/ffa_partinfo.c > create mode 100644 xen/arch/arm/tee/ffa_private.h > create mode 100644 xen/arch/arm/tee/ffa_rxtx.c > create mode 100644 xen/arch/arm/tee/ffa_shm.c > > -- > 2.34.1 >
Re: [PATCH v7 08/14] xen/page_alloc: introduce preserved page flags macro
On 27/03/2024 13:38, Jan Beulich wrote: On 27.03.2024 14:28, Julien Grall wrote: Hi Carlo, On 27/03/2024 11:10, Carlo Nonato wrote: Hi guys, Question is: How would you justify such a change? IOW I'm not convinced (yet) this wants doing there. You mean in this series? Looking at the code, the flag is originally set in alloc_domheap_pages(). So I guess it would make sense to do it in free_domheap_pages(). We don't hold the heap_lock there. Regardless of the safety question (I will answer below), count_info is not meant to be protected by heap_lock. The lock is protecting the heap and ensure we are not corrupting them when there are concurrent call to free_heap_pages(). > Is it safe to change count_info without it? count_info is meant to be accessed locklessly. The flag PGC_extra cannot be set/clear concurrently because you can't allocate a page that has not yet been freed. So it would be fine to use clear_bit(..., ...); Actually we hardly ever use clear_bit() on count_info. Normally we use ordinary C operators. I knew you would say that. I am not convince it is safe to always using count_info without any atomic operations. But I never got around to check all them. Atomic (and otherwise lockless) updates are useful only if done like this everywhere. You are right. But starting to use the bitops is not going to hurt anyone (other than maybe performance, but once we convert all of them, then this will become moot). In fact, it helps start to slowly move towards the aim to have count_info safe. Cheers, -- Julien Grall
Re: preparations for 4.17.4
On 27/03/2024 12:33 pm, Andrew Cooper wrote: > On 27/03/2024 12:23 pm, Jan Beulich wrote: >> All, >> >> the release is due in two to three weeks. Please point out backports you find >> missing from the respective staging branch, but which you consider relevant. >> >> Note that this is going to be the last Xen Project coordinated ordinary >> stable >> release from this branch; the branch will move into security-only support >> mode >> afterwards. > 1) livepatching of .rodata: > > 989556c6f8ca - xen/virtual-region: Rename the start/end fields > ef969144a425 - xen/virtual-region: Include rodata pointers > b083b1c393dc - x86/livepatch: Relax permissions on rodata too > > And technically "x86/mm: fix detection of last L1 entry in > modify_xen_mappings_lite()" too but you've already backported this one. > > Patching .rodata worked before Xen 4.17, and was broken (left as a TODO) > when I adjusted Xen to stop using CR0.WP=0 for patching. > > > 2) Policy fixes: > > e2d8a6522516 - x86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max > policies > > This is a real bugfix for a real regression we found updating from Xen > 4.13 -> 4.17. It has a dependency on > > 5420aa165dfa - x86/cpu-policy: Hide x2APIC from PV guests > > which I know you had more concern with. FWIW, I'm certain its a good > fix, and should be backported. > > > 3) Test fixes: > > 0263dc9069dd - tests/resource: Fix HVM guest in !SHADOW builds > > It's minor, but does make a difference for those of us who run these > tests regularly. > > > 4) Watchdog fixes: > > 9e18f339830c - x86/boot: Improve the boot watchdog determination of > stuck cpus > 131892e0dcc1 - x86/boot: Support the watchdog on newer AMD systems > > You took "x86/boot: Fix setup_apic_nmi_watchdog() to fail more cleanly" > and the first of the two patches is in the same category IMO. The > second I also feel ok to take for the in-support releases, particularly > as all it is doing is dropping a family list. > > > 5) Ucode scan stability (For 4.18 only) > > Xen 4.18 had "x86/ucode: Refresh raw CPU policy after microcode load" in > it's .0 release, so should also gain: > > cf7fe8b72dea - x86/ucode: Fix stability of the raw CPU Policy rescan > > I've only noticed because I've got them both backported to 4.17 in > XenServer, but I don't think upstream wants to take that route. It occurs to me that these want considering: b6cf604207fd - tools/oxenstored: Use Map instead of Hashtbl for quotas 098d868e52ac - tools/oxenstored: Make Quota.t pure while 4.17 is still in general support. These came from a performance regression we investigated. I've done the backport to 4.17 and they're not entirely trivial (owing to the major source reformat in 4.18) so can commit them if you'd prefer. ~Andrew
[linux-linus test] 185166: tolerable FAIL - PUSHED
flight 185166 linux-linus real [real] flight 185174 linux-linus real-retest [real] http://logs.test-lab.xenproject.org/osstest/logs/185166/ http://logs.test-lab.xenproject.org/osstest/logs/185174/ Failures :-/ but no regressions. Tests which are failing intermittently (not blocking): test-armhf-armhf-libvirt-vhd 8 xen-bootfail pass in 185174-retest Tests which did not succeed, but are not blocking: test-armhf-armhf-libvirt-vhd 14 migrate-support-check fail in 185174 never pass test-armhf-armhf-libvirt-vhd 15 saverestore-support-check fail in 185174 never pass test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185160 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185160 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185160 test-armhf-armhf-libvirt 16 saverestore-support-checkfail like 185160 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185160 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185160 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit1 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail never pass test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail never pass test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail never pass test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-arm64-arm64-xl-vhd 14 migrate-support-checkfail never pass test-arm64-arm64-xl-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-qcow214 migrate-support-checkfail never pass test-armhf-armhf-xl-qcow215 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-armhf-armhf-xl-raw 14 migrate-support-checkfail never pass test-armhf-armhf-xl-raw 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-libvirt 15 migrate-support-checkfail never pass version targeted for testing: linux7033999ecd7b8cf9ea59265035a0150961e023ee baseline version: linux928a87efa42302a23bb9554be081a28058495f22 Last test of basis 185160 2024-03-25 18:44:17 Z1 days Testing same since 185166 2024-03-26 21:44:24 Z0 days1 attempts People who touched revisions under test: John Ogness Linus Torvalds Petr Mladek Uwe Kleine-König Zoltan HERPAI jobs: build-amd64-xsm pass build-arm64-xsm pass build-i386-xsm pass build-amd64 pass build-a
Re: preparations for 4.17.4
On 27.03.2024 15:01, Andrew Cooper wrote: > It occurs to me that these want considering: > > b6cf604207fd - tools/oxenstored: Use Map instead of Hashtbl for quotas > 098d868e52ac - tools/oxenstored: Make Quota.t pure > > while 4.17 is still in general support. These came from a performance > regression we investigated. > > I've done the backport to 4.17 and they're not entirely trivial (owing > to the major source reformat in 4.18) so can commit them if you'd prefer. Didn't you bring these up for 4.18.1 already, and I said that I'd leave this for the maintainers to decide? Same here, in any event. Cc-ing them both, just in case. Jan
Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"
On 2024-03-27 04:59, Roger Pau Monné wrote: On Wed, Mar 27, 2024 at 08:22:41AM +0100, Jan Beulich wrote: On 26.03.2024 22:38, Jason Andryuk wrote: A new ELF note will specify the alignment for a relocatable PVH kernel. ELF notes are suitable for vmlinux and other ELF files, so this Linux-specific bzImage parsing in unnecessary. This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40. Signed-off-by: Jason Andryuk Since you keep re-sending this: In private discussion Roger has indicated that, like me, he too would prefer falling back to the ELF data, before falling back to the arch default (Roger, please correct me if I got it wrong). That would make it necessary that the change you're proposing to revert here is actually kept. Sorry, was meaning to reply yesterday but Jason is very fast at sending new version so I'm always one version behind. :) I was hoping to finish this up and get it in... IMO the order: ELF note, PHDR alignment, arch default should be the preferred one. Or wait - what you're reverting is taking the alignment out of the bzImage header. I don't expect the BSDs to use that protocol; aiui that's entirely Linux-specific. Yeah, I don't have strong opinions in keeping this, we already do bzImage parsing, so we might as well attempt to fetch the alignment from there if correct: ELF note, bzImage kernel_alignment, ELF PHDR alignment, arch default I'm not sure how to handle ELF PHDR vs. arch default. ELF PHDR will always be set, AFAIU. Should that always be respected, which means we don't need an arch default? To include arch default, it would be something like this: if ( parms->phys_align != UNSET_ADDR ) align = parms->phys_align; else if ( bz_align ) align = bz_align; else if ( elf->palign > PHYS32_RELOC_ALIGN_DEFAULT ) align = elf->palign; else align = PHYS32_RELOC_ALIGN_DEFAULT; I further meanwhile realized that consulting the ELF phdrs may also be ambiguous, as there may be more than one. I guess it would need to be the maximum of all of them then. My suggestion (not sure if I mentioned this before) was to use the alignment of the first LOAD PHDR, which is the one that defines the value of the dest_base field used as the image load start address. Using the maximum of all load PHDRs might be safer. I'll find the maximum. Thanks, Jason
Re: [PATCH v5] RFC: x86/pvh: Make Xen PVH entrypoint PIC for x86-64
On 2024-03-27 04:20, Jan Beulich wrote: On 26.03.2024 22:47, Jason Andryuk wrote: --- a/include/xen/interface/elfnote.h +++ b/include/xen/interface/elfnote.h @@ -185,9 +185,25 @@ */ #define XEN_ELFNOTE_PHYS32_ENTRY 18 +/* + * Physical loading constraints for PVH kernels + * + * Used to place constraints on the guest physical loading addresses and + * alignment for a PVH kernel. + * + * The presence of this note indicates the kernel supports relocating itself. + * + * The note may include up to three 32bit values in the following order: + * - a maximum address for the entire image to be loaded below (default + * 0x) + * - a minimum address for the start of the image (default 0) + * - a required start alignment (default 0x20) This looks to be stale now. Yes - I did not update this file. The patch is functional as the ELF Note fields in arch/x86/platform/pvh/head.S were re-ordered to match the Xen side. Regards, Jason
Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"
On 27.03.2024 15:08, Jason Andryuk wrote: > On 2024-03-27 04:59, Roger Pau Monné wrote: >> On Wed, Mar 27, 2024 at 08:22:41AM +0100, Jan Beulich wrote: >>> On 26.03.2024 22:38, Jason Andryuk wrote: A new ELF note will specify the alignment for a relocatable PVH kernel. ELF notes are suitable for vmlinux and other ELF files, so this Linux-specific bzImage parsing in unnecessary. This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40. Signed-off-by: Jason Andryuk >>> >>> Since you keep re-sending this: In private discussion Roger has indicated >>> that, like me, he too would prefer falling back to the ELF data, before >>> falling back to the arch default (Roger, please correct me if I got it >>> wrong). That would make it necessary that the change you're proposing to >>> revert here is actually kept. >> >> Sorry, was meaning to reply yesterday but Jason is very fast at >> sending new version so I'm always one version behind. > > :) > > I was hoping to finish this up and get it in... > >> IMO the order: ELF note, PHDR alignment, arch default should be the >> preferred one. >> >>> Or wait - what you're reverting is taking the alignment out of the >>> bzImage header. I don't expect the BSDs to use that protocol; aiui that's >>> entirely Linux-specific. >> >> Yeah, I don't have strong opinions in keeping this, we already do >> bzImage parsing, so we might as well attempt to fetch the alignment >> from there if correct: >> >> ELF note, bzImage kernel_alignment, ELF PHDR alignment, arch default > > I'm not sure how to handle ELF PHDR vs. arch default. ELF PHDR will > always be set, AFAIU. Should that always be respected, which means we > don't need an arch default? A value of 0 (and 1) is specifically permitted, to indicate no alignment. We may take 0 to mean default, but what you suggest below is another plausible approach. Yet another might be to take anything below PAGE_SIZE as "use default". > To include arch default, it would be something like this: > > if ( parms->phys_align != UNSET_ADDR ) > align = parms->phys_align; > else if ( bz_align ) > align = bz_align; Why do you include bz again here? Didn't you previously indicate the header field can't be relied upon? Which is also why, finally, I committed this revert earlier today. Jan > else if ( elf->palign > PHYS32_RELOC_ALIGN_DEFAULT ) > align = elf->palign; > else > align = PHYS32_RELOC_ALIGN_DEFAULT; > > >>> I further meanwhile realized that consulting the ELF phdrs may also be >>> ambiguous, as there may be more than one. I guess it would need to be the >>> maximum of all of them then. >> >> My suggestion (not sure if I mentioned this before) was to use the >> alignment of the first LOAD PHDR, which is the one that defines the >> value of the dest_base field used as the image load start address. >> >> Using the maximum of all load PHDRs might be safer. > > I'll find the maximum. > > Thanks, > Jason
Re: [PATCH-for-9.0 v2 12/19] hw/xen: Merge 'hw/xen/arch_hvm.h' in 'hw/xen/xen-hvm-common.h'
On Tue, Nov 14, 2023 at 03:38:08PM +0100, Philippe Mathieu-Daudé wrote: > We don't need a target-specific header for common target-specific > prototypes. Declare xen_arch_handle_ioreq() and xen_arch_set_memory() > in "hw/xen/xen-hvm-common.h". > > Signed-off-by: Philippe Mathieu-Daudé > Reviewed-by: David Woodhouse > Reviewed-by: Richard Henderson Acked-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: preparations for 4.17.4
On 27/03/2024 2:06 pm, Jan Beulich wrote: > On 27.03.2024 15:01, Andrew Cooper wrote: >> It occurs to me that these want considering: >> >> b6cf604207fd - tools/oxenstored: Use Map instead of Hashtbl for quotas >> 098d868e52ac - tools/oxenstored: Make Quota.t pure >> >> while 4.17 is still in general support. These came from a performance >> regression we investigated. >> >> I've done the backport to 4.17 and they're not entirely trivial (owing >> to the major source reformat in 4.18) so can commit them if you'd prefer. > Didn't you bring these up for 4.18.1 already, and I said that I'd leave > this for the maintainers to decide? Same here, in any event. Cc-ing them > both, just in case. I could have sworn that I remembered requested this before, but couldn't remember where. I'll see about poking people for an answer. ~Andrew
Re: [PATCH v5 1/6] Revert "xen/x86: bzImage parse kernel_alignment"
On 2024-03-27 10:19, Jan Beulich wrote: On 27.03.2024 15:08, Jason Andryuk wrote: On 2024-03-27 04:59, Roger Pau Monné wrote: On Wed, Mar 27, 2024 at 08:22:41AM +0100, Jan Beulich wrote: On 26.03.2024 22:38, Jason Andryuk wrote: A new ELF note will specify the alignment for a relocatable PVH kernel. ELF notes are suitable for vmlinux and other ELF files, so this Linux-specific bzImage parsing in unnecessary. This reverts commit c44cac229067faeec8f49247d1cf281723ac2d40. Signed-off-by: Jason Andryuk Since you keep re-sending this: In private discussion Roger has indicated that, like me, he too would prefer falling back to the ELF data, before falling back to the arch default (Roger, please correct me if I got it wrong). That would make it necessary that the change you're proposing to revert here is actually kept. Sorry, was meaning to reply yesterday but Jason is very fast at sending new version so I'm always one version behind. :) I was hoping to finish this up and get it in... IMO the order: ELF note, PHDR alignment, arch default should be the preferred one. Or wait - what you're reverting is taking the alignment out of the bzImage header. I don't expect the BSDs to use that protocol; aiui that's entirely Linux-specific. Yeah, I don't have strong opinions in keeping this, we already do bzImage parsing, so we might as well attempt to fetch the alignment from there if correct: ELF note, bzImage kernel_alignment, ELF PHDR alignment, arch default I'm not sure how to handle ELF PHDR vs. arch default. ELF PHDR will always be set, AFAIU. Should that always be respected, which means we don't need an arch default? A value of 0 (and 1) is specifically permitted, to indicate no alignment. We may take 0 to mean default, but what you suggest below is another plausible approach. Yet another might be to take anything below PAGE_SIZE as "use default". To include arch default, it would be something like this: if ( parms->phys_align != UNSET_ADDR ) align = parms->phys_align; else if ( bz_align ) align = bz_align; Why do you include bz again here? Didn't you previously indicate the header field can't be relied upon? Which is also why, finally, I committed this revert earlier today. Roger wanted to consult the bz value? He mentioned it above. Locally, I haven't synced with staging yet, and I dropped the revert to start re-working this... If present, the bzImage header field is valid. But being bzImage-specific, it isn't useful for other ELF files. Xen will only move a kernel with the PHSY32_RELOC Note, so it can just specify an alignment if it needs to. Personally, I think using the Note's value or a default is fine. It seems like the PHDR aligment will just be 0x20 anyway (for x86-64 at lease), which matches the default. Specifying the PHYS32_RELOC Note, but relying on a search for the alignment, seems slightly questionable to me. Still, it seemed like the path of least resistance is to just implement the fallback search like Roger requested. Dropping the bzImage, I guess I'd go with your PAGE_SIZE suggestions for: if ( parms->phys_align != UNSET_ADDR ) align = parms->phys_align; else if ( elf->palign > PAGE_SIZE ) align = elf->palign; else align = PHYS32_RELOC_ALIGN_DEFAULT; Thanks for your reviews. Regards, Jason
Re: [PATCH] tools/oxenstored: Re-format
On Mon, Feb 26, 2024 at 10:48 AM Andrew Cooper wrote: > > Rerun make format. Looks good, although not sure whether whitespace will be correctly preserved in email, recommend using git to push the changes. Reviewed-by: Edwin Török > > Signed-off-by: Andrew Cooper > --- > CC: Christian Lindig > CC: Edwin Török > CC: Rob Hoes > --- > tools/ocaml/xenstored/quota.ml | 16 > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/tools/ocaml/xenstored/quota.ml b/tools/ocaml/xenstored/quota.ml > index 1f652040d898..082cd25f26fc 100644 > --- a/tools/ocaml/xenstored/quota.ml > +++ b/tools/ocaml/xenstored/quota.ml > @@ -55,13 +55,13 @@ let _check quota id size = > raise Data_too_big >); >if id > 0 then > - try > -let entry = DomidMap.find id quota.cur in > -if entry >= quota.maxent then ( > - warn "domain %u cannot create entry: quota reached" id; > - raise Limit_reached > -) > - with Not_found -> () > +try > + let entry = DomidMap.find id quota.cur in > + if entry >= quota.maxent then ( > +warn "domain %u cannot create entry: quota reached" id; > +raise Limit_reached > + ) > +with Not_found -> () > > let check quota id size = >if !activate then > @@ -88,4 +88,4 @@ let merge orig_quota mod_quota dest_quota = > | diff -> update_entry dest id diff (* update with [x=x+diff] *) >in >{dest_quota with cur = DomidMap.fold fold_merge mod_quota.cur > dest_quota.cur} > - (* dest_quota = dest_quota + (mod_quota - orig_quota) *) > +(* dest_quota = dest_quota + (mod_quota - orig_quota) *) > > base-commit: 8de3afc0b402bc17f65093a53e5870862707a8c7 > -- > 2.30.2 >
[PATCH v6 1/8] xen/spinlock: add explicit non-recursive locking functions
In order to prepare a type-safe recursive spinlock structure, add explicitly non-recursive locking functions to be used for non-recursive locking of spinlocks, which are used recursively, too. Signed-off-by: Juergen Gross Acked-by: Jan Beulich --- V2: - rename functions (Jan Beulich) - get rid of !! in pcidevs_locked() (Jan Beulich) V5: - remove spurious change (Julien Grall) - add nrspin_lock() description (Julien Grall) --- xen/arch/arm/mm.c | 4 ++-- xen/arch/x86/domain.c | 12 ++-- xen/arch/x86/mm.c | 12 ++-- xen/arch/x86/mm/mem_sharing.c | 8 xen/arch/x86/mm/p2m-pod.c | 4 ++-- xen/arch/x86/mm/p2m.c | 4 ++-- xen/arch/x86/tboot.c | 4 ++-- xen/common/domctl.c | 4 ++-- xen/common/grant_table.c | 10 +- xen/common/memory.c | 4 ++-- xen/common/numa.c | 4 ++-- xen/common/page_alloc.c | 16 xen/drivers/char/console.c| 16 xen/include/xen/spinlock.h| 29 +++-- 14 files changed, 74 insertions(+), 57 deletions(-) diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c index b15a18a494..def939172c 100644 --- a/xen/arch/arm/mm.c +++ b/xen/arch/arm/mm.c @@ -105,7 +105,7 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d, if ( page_get_owner(page) == d ) return; -spin_lock(&d->page_alloc_lock); +nrspin_lock(&d->page_alloc_lock); /* * The incremented type count pins as writable or read-only. @@ -136,7 +136,7 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d, page_list_add_tail(page, &d->xenpage_list); } -spin_unlock(&d->page_alloc_lock); +nrspin_unlock(&d->page_alloc_lock); } int xenmem_add_to_physmap_one( diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index a11c55f921..33a2830d9d 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -212,7 +212,7 @@ void dump_pageframe_info(struct domain *d) { unsigned long total[MASK_EXTR(PGT_type_mask, PGT_type_mask) + 1] = {}; -spin_lock(&d->page_alloc_lock); +nrspin_lock(&d->page_alloc_lock); page_list_for_each ( page, &d->page_list ) { unsigned int index = MASK_EXTR(page->u.inuse.type_info, @@ -231,13 +231,13 @@ void dump_pageframe_info(struct domain *d) _p(mfn_x(page_to_mfn(page))), page->count_info, page->u.inuse.type_info); } -spin_unlock(&d->page_alloc_lock); +nrspin_unlock(&d->page_alloc_lock); } if ( is_hvm_domain(d) ) p2m_pod_dump_data(d); -spin_lock(&d->page_alloc_lock); +nrspin_lock(&d->page_alloc_lock); page_list_for_each ( page, &d->xenpage_list ) { @@ -253,7 +253,7 @@ void dump_pageframe_info(struct domain *d) page->count_info, page->u.inuse.type_info); } -spin_unlock(&d->page_alloc_lock); +nrspin_unlock(&d->page_alloc_lock); } void update_guest_memory_policy(struct vcpu *v, @@ -2448,10 +2448,10 @@ int domain_relinquish_resources(struct domain *d) d->arch.auto_unmask = 0; } -spin_lock(&d->page_alloc_lock); +nrspin_lock(&d->page_alloc_lock); page_list_splice(&d->arch.relmem_list, &d->page_list); INIT_PAGE_LIST_HEAD(&d->arch.relmem_list); -spin_unlock(&d->page_alloc_lock); +nrspin_unlock(&d->page_alloc_lock); PROGRESS(xen): diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index 62f5b811bb..b4d125db39 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -482,7 +482,7 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d, set_gpfn_from_mfn(mfn_x(page_to_mfn(page)), INVALID_M2P_ENTRY); -spin_lock(&d->page_alloc_lock); +nrspin_lock(&d->page_alloc_lock); /* The incremented type count pins as writable or read-only. */ page->u.inuse.type_info = @@ -502,7 +502,7 @@ void share_xen_page_with_guest(struct page_info *page, struct domain *d, page_list_add_tail(page, &d->xenpage_list); } -spin_unlock(&d->page_alloc_lock); +nrspin_unlock(&d->page_alloc_lock); } void make_cr3(struct vcpu *v, mfn_t mfn) @@ -3597,11 +3597,11 @@ long do_mmuext_op( { bool drop_ref; -spin_lock(&pg_owner->page_alloc_lock); +nrspin_lock(&pg_owner->page_alloc_lock); drop_ref = (pg_owner->is_dying && test_and_clear_bit(_PGT_pinned, &page->u.inuse.type_info)); -spin_unlock(&pg_owner->page_alloc_lock); +nrspin_unlock(&pg_owner->page_alloc_lock); if ( drop_ref ) { pin_drop: @@ -4424,7 +4424,7 @@ int steal_page( * that it might be upon return from alloc_domheap_pages
[PATCH v6 3/8] xen/spinlock: add missing rspin_is_locked() and rspin_barrier()
Add rspin_is_locked() and rspin_barrier() in order to prepare differing spinlock_t and rspinlock_t types. Signed-off-by: Juergen Gross --- V2: - partially carved out from V1 patch, partially new V5: - let rspin_is_locked() return bool (Jan Beulich) V6: - Re-add comment to _spin_is_locked() (Jan Beulich) --- xen/arch/x86/mm/p2m-pod.c | 2 +- xen/common/domain.c | 2 +- xen/common/page_alloc.c | 2 +- xen/common/spinlock.c | 26 -- xen/drivers/char/console.c| 4 ++-- xen/drivers/passthrough/pci.c | 2 +- xen/include/xen/spinlock.h| 4 7 files changed, 30 insertions(+), 12 deletions(-) diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c index c48ea169b7..9750a3a21b 100644 --- a/xen/arch/x86/mm/p2m-pod.c +++ b/xen/arch/x86/mm/p2m-pod.c @@ -374,7 +374,7 @@ int p2m_pod_empty_cache(struct domain *d) /* After this barrier no new PoD activities can happen. */ BUG_ON(!d->is_dying); -spin_barrier(&p2m->pod.lock.lock); +rspin_barrier(&p2m->pod.lock.lock); lock_page_alloc(p2m); diff --git a/xen/common/domain.c b/xen/common/domain.c index ceb44c8266..282c3ab623 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -991,7 +991,7 @@ int domain_kill(struct domain *d) case DOMDYING_alive: domain_pause(d); d->is_dying = DOMDYING_dying; -spin_barrier(&d->domain_lock); +rspin_barrier(&d->domain_lock); argo_destroy(d); vnuma_destroy(d->vnuma); domain_set_outstanding_pages(d, 0); diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c index 4d6ce726e3..7c1bdfc046 100644 --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -477,7 +477,7 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) { long dom_before, dom_after, dom_claimed, sys_before, sys_after; -ASSERT(spin_is_locked(&d->page_alloc_lock)); +ASSERT(rspin_is_locked(&d->page_alloc_lock)); d->tot_pages += pages; /* diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c index 648393d95f..6572c76114 100644 --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -396,13 +396,10 @@ static bool always_inline spin_is_locked_common(const spinlock_tickets_t *t) int _spin_is_locked(const spinlock_t *lock) { /* - * Recursive locks may be locked by another CPU, yet we return - * "false" here, making this function suitable only for use in - * ASSERT()s and alike. + * This function is suitable only for use in ASSERT()s and alike, as it + * doesn't tell _who_ is holding the lock. */ -return lock->recurse_cpu == SPINLOCK_NO_CPU - ? spin_is_locked_common(&lock->tickets) - : lock->recurse_cpu == smp_processor_id(); +return spin_is_locked_common(&lock->tickets); } static bool always_inline spin_trylock_common(spinlock_tickets_t *t, @@ -465,6 +462,23 @@ void _spin_barrier(spinlock_t *lock) spin_barrier_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR); } +bool _rspin_is_locked(const rspinlock_t *lock) +{ +/* + * Recursive locks may be locked by another CPU, yet we return + * "false" here, making this function suitable only for use in + * ASSERT()s and alike. + */ +return lock->recurse_cpu == SPINLOCK_NO_CPU + ? spin_is_locked_common(&lock->tickets) + : lock->recurse_cpu == smp_processor_id(); +} + +void _rspin_barrier(rspinlock_t *lock) +{ +spin_barrier_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR); +} + bool _rspin_trylock(rspinlock_t *lock) { unsigned int cpu = smp_processor_id(); diff --git a/xen/drivers/char/console.c b/xen/drivers/char/console.c index 22f50fc617..d5e6aacc27 100644 --- a/xen/drivers/char/console.c +++ b/xen/drivers/char/console.c @@ -327,7 +327,7 @@ static void cf_check do_dec_thresh(unsigned char key, bool unused) static void conring_puts(const char *str, size_t len) { -ASSERT(spin_is_locked(&console_lock)); +ASSERT(rspin_is_locked(&console_lock)); while ( len-- ) conring[CONRING_IDX_MASK(conringp++)] = *str++; @@ -765,7 +765,7 @@ static void __putstr(const char *str) { size_t len = strlen(str); -ASSERT(spin_is_locked(&console_lock)); +ASSERT(rspin_is_locked(&console_lock)); console_serial_puts(str, len); video_puts(str, len); diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c index 4fcc7e2cde..5a446d3dce 100644 --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -65,7 +65,7 @@ void pcidevs_unlock(void) bool pcidevs_locked(void) { -return !!spin_is_locked(&_pcidevs_lock); +return rspin_is_locked(&_pcidevs_lock); } static struct radix_tree_root pci_segments; diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h index 8bc4652526..148be1e116 100644 --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -297,6 +297,8 @@ void _rspi
[PATCH v6 0/8] xen/spinlock: make recursive spinlocks a dedicated type
Instead of being able to use normal spinlocks as recursive ones, too, make recursive spinlocks a special lock type. This will make the spinlock structure smaller in production builds and add type-safety. This allows to increase the maximum number of physical cpus from 4095 to 65535 without increasing the size of the lock structure in production builds (the size of recursive spinlocks in debug builds will grow to 12 bytes due to that change). Note that rwlock handling is still limiting the number of cpus to 4095, this is being taken care off in patch 12, which raises the rwlock limit to 16384 cpus. Iommu code imposes a limit of 16383 cpus. Changes in V2: - addressed comments by Jan Beulich - lots of additional cleanups - reorganized complete series Changes in V3: - addressed comments by Jan Beulich Changes in V4: - former patch 1 has already been applied - fixed a coding style issue in patch 1 Changes in V5: - new patches 1 + 10 + 12 + 13 - due to the recent Ghost-race patches the macro layer for calling spinlock functions is kept - addressed comments Changes in V6: - patches 1-5 of V5 have been committed already - addressed comments Juergen Gross (8): xen/spinlock: add explicit non-recursive locking functions xen/spinlock: add another function level xen/spinlock: add missing rspin_is_locked() and rspin_barrier() xen/spinlock: split recursive spinlocks from normal ones xen/spinlock: let all is_locked and trylock variants return bool xen/spinlock: support higher number of cpus xen/rwlock: raise the number of possible cpus xen: allow up to 16383 cpus xen/arch/Kconfig | 2 +- xen/arch/arm/mm.c | 4 +- xen/arch/x86/domain.c | 12 +-- xen/arch/x86/mm.c | 12 +-- xen/arch/x86/mm/mem_sharing.c | 8 +- xen/arch/x86/mm/p2m-pod.c | 6 +- xen/arch/x86/mm/p2m.c | 4 +- xen/arch/x86/tboot.c | 4 +- xen/common/domain.c | 2 +- xen/common/domctl.c | 4 +- xen/common/grant_table.c | 10 +- xen/common/memory.c | 4 +- xen/common/numa.c | 4 +- xen/common/page_alloc.c | 18 ++-- xen/common/spinlock.c | 181 ++ xen/drivers/char/console.c| 20 ++-- xen/drivers/passthrough/pci.c | 2 +- xen/include/xen/rwlock.h | 23 +++-- xen/include/xen/spinlock.h| 110 - 19 files changed, 297 insertions(+), 133 deletions(-) -- 2.35.3
[PATCH v6 5/8] xen/spinlock: let all is_locked and trylock variants return bool
Switch the remaining trylock and is_locked variants to return bool. Signed-off-by: Juergen Gross Reviewed-by: Jan Beulich --- V5: - new patch (Jan Beulich) --- xen/common/spinlock.c | 4 ++-- xen/include/xen/spinlock.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c index 5aaca49a61..7ccb725171 100644 --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -393,7 +393,7 @@ static bool always_inline spin_is_locked_common(const spinlock_tickets_t *t) return t->head != t->tail; } -int _spin_is_locked(const spinlock_t *lock) +bool _spin_is_locked(const spinlock_t *lock) { /* * This function is suitable only for use in ASSERT()s and alike, as it @@ -433,7 +433,7 @@ static bool always_inline spin_trylock_common(spinlock_tickets_t *t, return true; } -int _spin_trylock(spinlock_t *lock) +bool _spin_trylock(spinlock_t *lock) { return spin_trylock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR); } diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h index f49ba928f0..3a4092626c 100644 --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -234,8 +234,8 @@ void _spin_unlock(spinlock_t *lock); void _spin_unlock_irq(spinlock_t *lock); void _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags); -int _spin_is_locked(const spinlock_t *lock); -int _spin_trylock(spinlock_t *lock); +bool _spin_is_locked(const spinlock_t *lock); +bool _spin_trylock(spinlock_t *lock); void _spin_barrier(spinlock_t *lock); static always_inline void spin_lock(spinlock_t *l) -- 2.35.3
[PATCH v6 2/8] xen/spinlock: add another function level
Add another function level in spinlock.c hiding the spinlock_t layout from the low level locking code. This is done in preparation of introducing rspinlock_t for recursive locks without having to duplicate all of the locking code. Signed-off-by: Juergen Gross Reviewed-by: Jan Beulich --- V2: - new patch V5: - don't regress spin_is_locked() for rspin-lock (Jan Beulich) - use bool as return type of spin_is_locked_common() and spin_trylock_common() (Jan Beulich) --- xen/common/spinlock.c | 103 - xen/include/xen/spinlock.h | 1 + 2 files changed, 68 insertions(+), 36 deletions(-) diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c index 874ed762b4..648393d95f 100644 --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -261,29 +261,31 @@ void spin_debug_disable(void) #ifdef CONFIG_DEBUG_LOCK_PROFILE +#define LOCK_PROFILE_PAR lock->profile #define LOCK_PROFILE_REL \ -if ( lock->profile ) \ +if ( profile ) \ {\ -lock->profile->time_hold += NOW() - lock->profile->time_locked; \ -lock->profile->lock_cnt++; \ +profile->time_hold += NOW() - profile->time_locked; \ +profile->lock_cnt++; \ } #define LOCK_PROFILE_VAR(var, val)s_time_t var = (val) #define LOCK_PROFILE_BLOCK(var) var = var ? : NOW() #define LOCK_PROFILE_BLKACC(tst, val)\ if ( tst ) \ {\ -lock->profile->time_block += lock->profile->time_locked - (val); \ -lock->profile->block_cnt++; \ +profile->time_block += profile->time_locked - (val); \ +profile->block_cnt++;\ } #define LOCK_PROFILE_GOT(val)\ -if ( lock->profile ) \ +if ( profile ) \ {\ -lock->profile->time_locked = NOW(); \ +profile->time_locked = NOW();\ LOCK_PROFILE_BLKACC(val, val); \ } #else +#define LOCK_PROFILE_PAR NULL #define LOCK_PROFILE_REL #define LOCK_PROFILE_VAR(var, val) #define LOCK_PROFILE_BLOCK(var) @@ -307,17 +309,18 @@ static always_inline uint16_t observe_head(const spinlock_tickets_t *t) return read_atomic(&t->head); } -static void always_inline spin_lock_common(spinlock_t *lock, +static void always_inline spin_lock_common(spinlock_tickets_t *t, + union lock_debug *debug, + struct lock_profile *profile, void (*cb)(void *data), void *data) { spinlock_tickets_t tickets = SPINLOCK_TICKET_INC; LOCK_PROFILE_VAR(block, 0); -check_lock(&lock->debug, false); +check_lock(debug, false); preempt_disable(); -tickets.head_tail = arch_fetch_and_add(&lock->tickets.head_tail, - tickets.head_tail); -while ( tickets.tail != observe_head(&lock->tickets) ) +tickets.head_tail = arch_fetch_and_add(&t->head_tail, tickets.head_tail); +while ( tickets.tail != observe_head(t) ) { LOCK_PROFILE_BLOCK(block); if ( cb ) @@ -325,18 +328,19 @@ static void always_inline spin_lock_common(spinlock_t *lock, arch_lock_relax(); } arch_lock_acquire_barrier(); -got_lock(&lock->debug); +got_lock(debug); LOCK_PROFILE_GOT(block); } void _spin_lock(spinlock_t *lock) { -spin_lock_common(lock, NULL, NULL); +spin_lock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR, NULL, + NULL); } void _spin_lock_cb(spinlock_t *lock, void (*cb)(void *data), void *data) { -spin_lock_common(lock, cb, data); +spin_lock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR, cb, data); } void _spin_lock_irq(spinlock_t *lock) @@ -355,16 +359,23 @@ unsigned long _spin_lock_irqsave(spinlock_t *lock) return flags; } -void _spin_unlock(spinlock_t *lock) +static void always_inline spin_unlock_common(spinlock_tickets_t *t, + union lock_debug *debug, + st
[PATCH v6 4/8] xen/spinlock: split recursive spinlocks from normal ones
Recursive and normal spinlocks are sharing the same data structure for representation of the lock. This has two major disadvantages: - it is not clear from the definition of a lock, whether it is intended to be used recursive or not, while a mixture of both usage variants needs to be - in production builds (builds without CONFIG_DEBUG_LOCKS) the needed data size of an ordinary spinlock is 8 bytes instead of 4, due to the additional recursion data needed (associated with that the rwlock data is using 12 instead of only 8 bytes) Fix that by introducing a struct spinlock_recursive for recursive spinlocks only, and switch recursive spinlock functions to require pointers to this new struct. This allows to check the correct usage at build time. Signed-off-by: Juergen Gross Reviewed-by: Jan Beulich --- V2: - use shorter names (Jan Beulich) - don't embed spinlock_t in rspinlock_t (Jan Beulich) V5: - some style fixes (Jan Beulich) - bool instead of int (Jan Beulich) --- xen/common/spinlock.c | 50 ++ xen/include/xen/spinlock.h | 72 +- 2 files changed, 105 insertions(+), 17 deletions(-) diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c index 6572c76114..5aaca49a61 100644 --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -545,6 +545,56 @@ void _rspin_unlock_irqrestore(rspinlock_t *lock, unsigned long flags) local_irq_restore(flags); } +bool _nrspin_trylock(rspinlock_t *lock) +{ +check_lock(&lock->debug, true); + +if ( unlikely(lock->recurse_cpu != SPINLOCK_NO_CPU) ) +return false; + +return spin_trylock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR); +} + +void _nrspin_lock(rspinlock_t *lock) +{ +spin_lock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR, NULL, + NULL); +} + +void _nrspin_unlock(rspinlock_t *lock) +{ +spin_unlock_common(&lock->tickets, &lock->debug, LOCK_PROFILE_PAR); +} + +void _nrspin_lock_irq(rspinlock_t *lock) +{ +ASSERT(local_irq_is_enabled()); +local_irq_disable(); +_nrspin_lock(lock); +} + +void _nrspin_unlock_irq(rspinlock_t *lock) +{ +_nrspin_unlock(lock); +local_irq_enable(); +} + +unsigned long _nrspin_lock_irqsave(rspinlock_t *lock) +{ +unsigned long flags; + +local_irq_save(flags); +_nrspin_lock(lock); + +return flags; +} + +void _nrspin_unlock_irqrestore(rspinlock_t *lock, unsigned long flags) +{ +_nrspin_unlock(lock); +local_irq_restore(flags); +} + #ifdef CONFIG_DEBUG_LOCK_PROFILE struct lock_profile_anc { diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h index 148be1e116..f49ba928f0 100644 --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -77,8 +77,6 @@ union lock_debug { }; */ struct spinlock; -/* Temporary hack until a dedicated struct rspinlock is existing. */ -#define rspinlock spinlock struct lock_profile { struct lock_profile *next; /* forward link */ @@ -110,6 +108,10 @@ struct lock_profile_qhead { __used_section(".lockprofile.data") = \ &lock_profile_data__##name #define SPIN_LOCK_UNLOCKED_(x) { \ +.debug = LOCK_DEBUG_, \ +.profile = x, \ +} +#define RSPIN_LOCK_UNLOCKED_(x) { \ .recurse_cpu = SPINLOCK_NO_CPU, \ .debug = LOCK_DEBUG_, \ .profile = x, \ @@ -119,8 +121,9 @@ struct lock_profile_qhead { spinlock_t l = SPIN_LOCK_UNLOCKED_(NULL); \ static struct lock_profile lock_profile_data__##l = LOCK_PROFILE_(l); \ LOCK_PROFILE_PTR_(l) +#define RSPIN_LOCK_UNLOCKED RSPIN_LOCK_UNLOCKED_(NULL) #define DEFINE_RSPINLOCK(l) \ -rspinlock_t l = SPIN_LOCK_UNLOCKED_(NULL);\ +rspinlock_t l = RSPIN_LOCK_UNLOCKED_(NULL); \ static struct lock_profile lock_profile_data__##l = RLOCK_PROFILE_(l);\ LOCK_PROFILE_PTR_(l) @@ -145,8 +148,11 @@ struct lock_profile_qhead { #define spin_lock_init_prof(s, l) \ spin_lock_init_prof__(s, l, lock, spinlock_t, false) -#define rspin_lock_init_prof(s, l)\ -spin_lock_init_prof__(s, l, rlock, rspinlock_t, true) +#define rspin_lock_init_prof(s, l) do { \ +spin_lock_init_prof__(s, l, rlock, rspinlock_t, true);\ +(s)->l.recurse_cpu = SPINLOCK_NO_CPU; \ +(s)->l.recurse_cnt = 0;
[PATCH v6 7/8] xen/rwlock: raise the number of possible cpus
The rwlock handling is limiting the number of cpus to 4095 today. The main reason is the use of the atomic_t data type for the main lock handling, which needs 2 bits for the locking state (writer waiting or write locked), 12 bits for the id of a possible writer, and a 12 bit counter for readers. The limit isn't 4096 due to an off by one sanity check. The atomic_t data type is 32 bits wide, so in theory 15 bits for the writer's cpu id and 15 bits for the reader count seem to be fine, but via read_trylock() more readers than cpus are possible. This means that it is possible to raise the number of cpus to 16384 without changing the rwlock_t data structure. In order to avoid the reader count wrapping to zero, don't let read_trylock() succeed in case the highest bit of the reader's count is set already. This leaves enough headroom for non-recursive readers to enter without risking a wrap. While at it calculate _QW_CPUMASK and _QR_SHIFT from _QW_SHIFT and add a sanity check for not overflowing the atomic_t data type. Signed-off-by: Juergen Gross --- V5: - new patch V6: - add comment to _can_read_lock() (Jan Beulich) --- xen/include/xen/rwlock.h | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/xen/include/xen/rwlock.h b/xen/include/xen/rwlock.h index 65d88b0ef4..232782801d 100644 --- a/xen/include/xen/rwlock.h +++ b/xen/include/xen/rwlock.h @@ -23,12 +23,12 @@ typedef struct { #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED) /* Writer states & reader shift and bias. */ -#define_QW_CPUMASK 0xfffU /* Writer CPU mask */ -#define_QW_SHIFT12 /* Writer flags shift */ -#define_QW_WAITING (1U << _QW_SHIFT) /* A writer is waiting */ -#define_QW_LOCKED (3U << _QW_SHIFT) /* A writer holds the lock */ -#define_QW_WMASK(3U << _QW_SHIFT) /* Writer mask */ -#define_QR_SHIFT14 /* Reader count shift */ +#define_QW_SHIFT14 /* Writer flags shift */ +#define_QW_CPUMASK ((1U << _QW_SHIFT) - 1) /* Writer CPU mask */ +#define_QW_WAITING (1U << _QW_SHIFT) /* A writer is waiting */ +#define_QW_LOCKED (3U << _QW_SHIFT) /* A writer holds the lock */ +#define_QW_WMASK(3U << _QW_SHIFT) /* Writer mask */ +#define_QR_SHIFT(_QW_SHIFT + 2) /* Reader count shift */ #define_QR_BIAS (1U << _QR_SHIFT) void queue_read_lock_slowpath(rwlock_t *lock); @@ -36,14 +36,21 @@ void queue_write_lock_slowpath(rwlock_t *lock); static inline bool _is_write_locked_by_me(unsigned int cnts) { -BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS); +BUILD_BUG_ON((_QW_CPUMASK + 1) < NR_CPUS); +BUILD_BUG_ON(NR_CPUS * _QR_BIAS > INT_MAX); return (cnts & _QW_WMASK) == _QW_LOCKED && (cnts & _QW_CPUMASK) == smp_processor_id(); } static inline bool _can_read_lock(unsigned int cnts) { -return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts); +/* + * If write locked by the caller, no other readers are possible. + * Not allowing the lock holder to read_lock() another 32768 times ought + * to be fine. + */ +return cnts <= INT_MAX && + (!(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts)); } /* -- 2.35.3
[PATCH v6 6/8] xen/spinlock: support higher number of cpus
Allow 16 bits per cpu number, which is the limit imposed by spinlock_tickets_t. This will allow up to 65535 cpus, while increasing only the size of recursive spinlocks in debug builds from 8 to 12 bytes. The current Xen limit of 4095 cpus is imposed by SPINLOCK_CPU_BITS being 12. There are machines available with more cpus than the current Xen limit, so it makes sense to have the possibility to use more cpus. Signed-off-by: Juergen Gross --- V5: - keep previous recursion limit (Julien Grall) V6: - use unsigned int instead of uint32_t (Jan Beulich) --- xen/common/spinlock.c | 2 ++ xen/include/xen/spinlock.h | 20 ++-- 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c index 7ccb725171..5aa9ba6188 100644 --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -485,7 +485,9 @@ bool _rspin_trylock(rspinlock_t *lock) /* Don't allow overflow of recurse_cpu field. */ BUILD_BUG_ON(NR_CPUS > SPINLOCK_NO_CPU); +BUILD_BUG_ON(SPINLOCK_CPU_BITS > sizeof(lock->recurse_cpu) * 8); BUILD_BUG_ON(SPINLOCK_RECURSE_BITS < 3); +BUILD_BUG_ON(SPINLOCK_MAX_RECURSE > ((1u << SPINLOCK_RECURSE_BITS) - 1)); check_lock(&lock->debug, true); diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h index 3a4092626c..db00a24646 100644 --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -8,16 +8,16 @@ #include #include -#define SPINLOCK_CPU_BITS 12 +#define SPINLOCK_CPU_BITS 16 #ifdef CONFIG_DEBUG_LOCKS union lock_debug { -uint16_t val; -#define LOCK_DEBUG_INITVAL 0x +uint32_t val; +#define LOCK_DEBUG_INITVAL 0x struct { -uint16_t cpu:SPINLOCK_CPU_BITS; -#define LOCK_DEBUG_PAD_BITS (14 - SPINLOCK_CPU_BITS) -uint16_t :LOCK_DEBUG_PAD_BITS; +unsigned int cpu:SPINLOCK_CPU_BITS; +#define LOCK_DEBUG_PAD_BITS (30 - SPINLOCK_CPU_BITS) +unsigned int :LOCK_DEBUG_PAD_BITS; bool irq_safe:1; bool unseen:1; }; @@ -211,11 +211,11 @@ typedef struct spinlock { typedef struct rspinlock { spinlock_tickets_t tickets; -uint16_t recurse_cpu:SPINLOCK_CPU_BITS; +uint16_t recurse_cpu; #define SPINLOCK_NO_CPU((1u << SPINLOCK_CPU_BITS) - 1) -#define SPINLOCK_RECURSE_BITS (16 - SPINLOCK_CPU_BITS) -uint16_t recurse_cnt:SPINLOCK_RECURSE_BITS; -#define SPINLOCK_MAX_RECURSE ((1u << SPINLOCK_RECURSE_BITS) - 1) +#define SPINLOCK_RECURSE_BITS 8 +uint8_t recurse_cnt; +#define SPINLOCK_MAX_RECURSE 15 union lock_debug debug; #ifdef CONFIG_DEBUG_LOCK_PROFILE struct lock_profile *profile; -- 2.35.3
[PATCH v6 8/8] xen: allow up to 16383 cpus
With lock handling now allowing up to 16384 cpus (spinlocks can handle 65535 cpus, rwlocks can handle 16384 cpus), raise the allowed limit for the number of cpus to be configured to 16383. The new limit is imposed by IOMMU_CMD_BUFFER_MAX_ENTRIES and QINVAL_MAX_ENTRY_NR required to be larger than 2 * CONFIG_NR_CPUS. Signed-off-by: Juergen Gross --- V5: - new patch (Jan Beulich) --- xen/arch/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig index 67ba38f32f..308ce129a8 100644 --- a/xen/arch/Kconfig +++ b/xen/arch/Kconfig @@ -6,7 +6,7 @@ config PHYS_ADDR_T_32 config NR_CPUS int "Maximum number of CPUs" - range 1 4095 + range 1 16383 default "256" if X86 default "8" if ARM && RCAR3 default "4" if ARM && QEMU -- 2.35.3
Re: [PATCH] tools/oxenstored: Re-format
On 27/03/2024 3:11 pm, Edwin Torok wrote: > On Mon, Feb 26, 2024 at 10:48 AM Andrew Cooper > wrote: >> Rerun make format. > Looks good, although not sure whether whitespace will be correctly > preserved in email, recommend using git to push the changes. > Reviewed-by: Edwin Török Thanks. Don't worry about the patch workflow. ~Andrew
Re: preparations for 4.17.4
> On 27 Mar 2024, at 14:42, Andrew Cooper wrote: > > On 27/03/2024 2:06 pm, Jan Beulich wrote: >> On 27.03.2024 15:01, Andrew Cooper wrote: >>> It occurs to me that these want considering: >>> >>> b6cf604207fd - tools/oxenstored: Use Map instead of Hashtbl for quotas >>> 098d868e52ac - tools/oxenstored: Make Quota.t pure >>> >>> while 4.17 is still in general support. These came from a performance >>> regression we investigated. >>> >>> I've done the backport to 4.17 and they're not entirely trivial (owing >>> to the major source reformat in 4.18) so can commit them if you'd prefer. >> Didn't you bring these up for 4.18.1 already, and I said that I'd leave >> this for the maintainers to decide? Same here, in any event. Cc-ing them >> both, just in case. > > I could have sworn that I remembered requested this before, but couldn't > remember where. > > I'll see about poking people for an answer. > > ~Andrew I understand we have backport; so I support upstreaming it. — C
[xen-unstable-smoke test] 185173: tolerable all pass - PUSHED
flight 185173 xen-unstable-smoke real [real] http://logs.test-lab.xenproject.org/osstest/logs/185173/ Failures :-/ but no regressions. Tests which did not succeed, but are not blocking: test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass version targeted for testing: xen 0cd50753eb40ca5f00ea1ced9f80ce5f478e560c baseline version: xen e3883336bb5abba2ec2231618f5b64f61b099b1e Last test of basis 185170 2024-03-27 09:03:56 Z0 days Testing same since 185173 2024-03-27 12:02:02 Z0 days1 attempts People who touched revisions under test: George Dunlap Jan Beulich jobs: build-arm64-xsm pass build-amd64 pass build-armhf pass build-amd64-libvirt pass test-armhf-armhf-xl pass test-arm64-arm64-xl-xsm pass test-amd64-amd64-xl-qemuu-debianhvm-amd64pass test-amd64-amd64-libvirt pass sg-report-flight on osstest.test-lab.xenproject.org logs: /home/logs/logs images: /home/logs/images Logs, config files, etc. are available at http://logs.test-lab.xenproject.org/osstest/logs Explanation of these reports, and of osstest in general, is at http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master Test harness code can be found at http://xenbits.xen.org/gitweb?p=osstest.git;a=summary Pushing revision : To xenbits.xen.org:/home/xen/git/xen.git e3883336bb..0cd50753eb 0cd50753eb40ca5f00ea1ced9f80ce5f478e560c -> smoke
[PATCH v1] tools/ocaml: fix warnings
Do not rely on the string values of the `Failure` exception, but use the `_opt` functions instead. Signed-off-by: Edwin Török --- tools/ocaml/xenstored/config.ml | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/tools/ocaml/xenstored/config.ml b/tools/ocaml/xenstored/config.ml index 95ef745a54..e0df236f73 100644 --- a/tools/ocaml/xenstored/config.ml +++ b/tools/ocaml/xenstored/config.ml @@ -83,25 +83,27 @@ let validate cf expected other = let err = ref [] in let append x = err := x :: !err in List.iter (fun (k, v) -> + let parse ~err_msg parser v f = +match parser v with +| None -> append (k, err_msg) +| Some r -> f r + in try if not (List.mem_assoc k expected) then other k v else let ty = List.assoc k expected in match ty with | Unit f -> f () - | Bool f -> f (bool_of_string v) + | Bool f -> parse ~err_msg:"expect bool arg" bool_of_string_opt v f | String f -> f v - | Int f-> f (int_of_string v) - | Float f -> f (float_of_string v) - | Set_bool r -> r := (bool_of_string v) + | Int f-> parse ~err_msg:"expect int arg" int_of_string_opt v f + | Float f -> parse ~err_msg:"expect float arg" float_of_string_opt v f + | Set_bool r -> parse ~err_msg:"expect bool arg" bool_of_string_opt v (fun x -> r := x) | Set_string r -> r := v - | Set_int r-> r := int_of_string v - | Set_float r -> r := (float_of_string v) + | Set_int r-> parse ~err_msg:"expect int arg" int_of_string_opt v (fun x -> r:= x) + | Set_float r -> parse ~err_msg:"expect float arg" float_of_string_opt v (fun x -> r := x) with | Not_found -> append (k, "unknown key") - | Failure "int_of_string" -> append (k, "expect int arg") - | Failure "bool_of_string" -> append (k, "expect bool arg") - | Failure "float_of_string" -> append (k, "expect float arg") | exn -> append (k, Printexc.to_string exn) ) cf; if !err != [] then raise (Error !err) -- 2.44.0
Re: [PATCH v1] tools/ocaml: fix warnings
On 27/03/2024 4:30 pm, Edwin Török wrote: > Do not rely on the string values of the `Failure` exception, > but use the `_opt` functions instead. > > Signed-off-by: Edwin Török FWIW, Tested-by: Andrew Cooper But I think the subject wants to say "in config.ml" and the commit message gain something like: Fixes warnings such as: File "config.ml", line 102, characters 12-27: 102 | | Failure "int_of_string" -> append (k, "expect int arg") ^^^ Warning 52: Code should not depend on the actual values of this constructor's arguments. They are only for information and may change in future versions. (See manual section 9.5) I can adjust all on commit. ~Andrew
Re: [RFC PATCH-for-9.0 v2 13/19] hw/xen: Remove use of 'target_ulong' in handle_ioreq()
On Tue, Nov 14, 2023 at 03:38:09PM +0100, Philippe Mathieu-Daudé wrote: > Per commit f17068c1c7 ("xen-hvm: reorganize xen-hvm and move common > function to xen-hvm-common"), handle_ioreq() is expected to be > target-agnostic. However it uses 'target_ulong', which is a target > specific definition. > > Per xen/include/public/hvm/ioreq.h header: > > struct ioreq { > uint64_t addr; /* physical address */ > uint64_t data; /* data (or paddr of data) */ > uint32_t count; /* for rep prefixes */ > uint32_t size; /* size in bytes */ > uint32_t vp_eport; /* evtchn for notifications to/from device model > */ > uint16_t _pad0; > uint8_t state:4; > uint8_t data_is_ptr:1; /* if 1, data above is the guest paddr > * of the real data to use. */ > uint8_t dir:1; /* 1=read, 0=write */ > uint8_t df:1; > uint8_t _pad1:1; > uint8_t type; /* I/O type */ > }; > typedef struct ioreq ioreq_t; > > If 'data' is not a pointer, it is a u64. > > - In PIO / VMWARE_PORT modes, only 32-bit are used. > > - In MMIO COPY mode, memory is accessed by chunks of 64-bit Looks like it could also be 8, 16, or 32 as well, depending on req->size. > - In PCI_CONFIG mode, access is u8 or u16 or u32. > > - None of TIMEOFFSET / INVALIDATE use 'req'. > > - Fallback is only used in x86 for VMWARE_PORT. > > Masking the upper bits of 'data' to keep 'req->size' low bits > is irrelevant of the target word size. Remove the word size > check and always extract the relevant bits. When building QEMU for Xen, we tend to build the target "i386-softmmu", which looks like to have target_ulong == uint32_t. So the `data` clamping would only apply to size 8 and 16. The clamping with target_ulong was introduce long time ago, here: https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b4a663b87df3954557434a2d31bff7f6b2706ec1 and they were more IOREQ types. So my guess is it isn't relevant anymore, but extending the clamping to 32-bits request should be fine, when using qemu-system-i386 that is, as it is already be done if one use qemu-system-x86_64. So I think the patch is fine, and the tests I've ran so far worked fine. > Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [PATCH v2 3/3] svm/nestedsvm: Introduce nested capabilities bit
On Mon, Mar 18, 2024 at 2:17 PM Jan Beulich wrote: > > On 13.03.2024 13:24, George Dunlap wrote: > > In order to make implementation and testing tractable, we will require > > specific host functionality. Add a nested_virt bit to hvm_funcs.caps, > > and return an error if a domain is created with nested virt and this > > bit isn't set. Create VMX and SVM callbacks to be executed from > > start_nested_svm(), which is guaranteed to execute after all > > Nit: nestedhvm_setup() (or, with different wording, start_nested_{svm,vmx}()). Oops, fixed. > > --- a/xen/arch/x86/domain.c > > +++ b/xen/arch/x86/domain.c > > @@ -673,6 +673,12 @@ int arch_sanitise_domain_config(struct > > xen_domctl_createdomain *config) > > */ > > config->flags |= XEN_DOMCTL_CDF_oos_off; > > > > +if ( nested_virt && !hvm_nested_virt_supported() ) > > +{ > > +dprintk(XENLOG_INFO, "Nested virt requested but not available\n"); > > +return -EINVAL; > > +} > > + > > if ( nested_virt && !hap ) > > { > > dprintk(XENLOG_INFO, "Nested virt not supported without HAP\n"); > > As mentioned in reply to v1, I think what both start_nested_{svm,vmx}() check > is redundant with this latter check. I think that check isn't necessary there > (yet unlike suggested in reply to v1 I don't think anymore that the check here > can alternatively be dropped). And even if it was to be kept for some reason, > I'm having some difficulty seeing why vendor code needs to do that check, when > nestedhvm_setup() could do it for both of them. I'm having a bit of trouble resolving the antecedents in this paragraph (what "this" and "there" are referring to). Are you saying that we should set hvm_funcs.caps.nested_virt to 'true' even if the hardware doesn't support HAP, because we check that here? That seems like a very strange way to go about things; hvm_funcs.caps should reflect the actual capabilities of the hardware. Suppose at some point we want to expose nested virt capability to the toolstack -- wouldn't it make more sense to be able to just read `hvm_funcs.caps.nested_virt`, rather than having to remember to && it with `hvm_funcs.caps.hap`? And as you say, you can't get rid of the HAP check here, because we also want to deny nested virt if the admin deliberately created a guest in shadow mode on a system that has HAP available. So it's not redundant: one is checking the capabilities of the system, the other is checking the requested guest configuration. As to why to have each vendor independent code check for HAP -- there are in fact two implementations of the code; it's nice to be able to look in one place for each implementation to determine the requirements. Additionally, it would be possible, in the future, for one of the nested virt implementations to enable shadow mode, while the other one didn't. The current arrangement makes that easy. > > --- a/xen/arch/x86/hvm/nestedhvm.c > > +++ b/xen/arch/x86/hvm/nestedhvm.c > > @@ -150,6 +150,16 @@ static int __init cf_check nestedhvm_setup(void) > > __clear_bit(0x80, shadow_io_bitmap[0]); > > __clear_bit(0xed, shadow_io_bitmap[1]); > > > > +/* > > + * NB this must be called after all command-line processing has been > > + * done, so that if (for example) HAP is disabled, nested virt is > > + * disabled as well. > > + */ > > +if ( cpu_has_vmx ) > > +start_nested_vmx(&hvm_funcs); > > +else if ( cpu_has_svm ) > > +start_nested_svm(&hvm_funcs); > > Instead of passing the pointer, couldn't you have the functions return > bool, and then set hvm_funcs.caps.nested_virt from that? Passing that > pointer looks somewhat odd to me. For one, it makes start_nested_XXX symmetric to start_XXX, which also has access to the full hvm_funcs structure (albeit in the former case because it's creating the structure). For two, both of them need to read caps.hap. For three, it's just more flexible -- there may be future things that we want to modify in the start_nested_*() functions. If we did as you suggest, and then added (say) caps.nested_virt_needs_hap, then we'd either need to add a second function, or refactor it to look like this. > Is there a reason to use direct calls here rather than a true hook > (seeing that hvm_funcs must have been populated by the time we make it > here)? I understand we're (remotely) considering to switch away from > using hooks at some point, but right now this feels somewhat > inconsistent if not further justified. Again it mimics the calls to start_vmx/svm in hvm_enable. But I could look at adding a function pointer to see if it works. -George
Re: [PATCH-for-9.0 v2 15/19] hw/xen: Reduce inclusion of 'cpu.h' to target-specific sources
On Tue, Nov 14, 2023 at 03:38:11PM +0100, Philippe Mathieu-Daudé wrote: > We rarely need to include "cpu.h" in headers. Including it > 'taint' headers to be target-specific. Here only the i386/arm > implementations requires "cpu.h", so include it there and > remove from the "hw/xen/xen-hvm-common.h" *common* header. > > Signed-off-by: Philippe Mathieu-Daudé > Reviewed-by: Richard Henderson > Reviewed-by: David Woodhouse Acked-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [PATCH-for-9.0 v2 16/19] hw/xen/xen_pt: Add missing license
On Tue, Nov 14, 2023 at 03:38:12PM +0100, Philippe Mathieu-Daudé wrote: > Commit eaab4d60d3 ("Introduce Xen PCI Passthrough, qdevice") > introduced both xen_pt.[ch], but only added the license to > xen_pt.c. Use the same license for xen_pt.h. > > Suggested-by: David Woodhouse > Signed-off-by: Philippe Mathieu-Daudé Fine by me. Looks like there was a license header before: https://xenbits.xen.org/gitweb/?p=qemu-xen-unstable.git;a=blob;f=hw/pass-through.h;h=0b5822414e24d199a064abccc4d378dcaf569bd6;hb=HEAD I don't know why I didn't copied it over here. Acked-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: Serious AMD-Vi(?) issue
On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote: > On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote: > > On 22.03.2024 20:22, Elliott Mitchell wrote: > > > On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote: > > >> > > >> I can see you've recently engaged with our community with some issues > > >> you'd > > >> like help with. > > >> We love the fact you are participating in our project, however, our > > >> developers aren't able to help if you do not provide the specific > > >> details. > > > > > > Please point to specific details which have been omitted. Fairly little > > > data has been provided as fairly little data is available. The primary > > > observation is large numbers of: > > > > > > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags > > > 0x8 I > > > > > > Lines in Xen's ring buffer. > > > > Yet this is (part of) the problem: By providing only the messages that > > appear > > relevant to you, you imply that you know that no other message is in any way > > relevant. That's judgement you'd better leave to people actually trying to > > investigate. Unless of course you were proposing an actual code change, with > > suitable justification. > > Honestly, I forgot about the very small number of messages from the SATA > subsystem. The question of whether the current mitigation actions are > effective right now was a bigger issue. As such monitoring `xl dmesg` > was a priority to looking at SATA messages which failed to reliably > indicate status. > > I *thought* I would be able to retrieve those via other slow means, but a > different and possibly overlapping issue has shown up. Unfortunately > this means those are no longer retrievable. :-( With some persistence I was able to retrieve them. There are other pieces of software with worse UIs than Xen. > > In fact when running into trouble, the usual course of action would be to > > increase verbosity in both hypervisor and kernel, just to make sure no > > potentially relevant message is missed. > > More/better information might have been obtained if I'd been engaged > earlier. This is still true, things are in full mitigation mode and I'll be quite unhappy to go back with experiments at this point. I now see why I left those out. The messages from the SATA subsystem were from a kernel which a bad patch had leaked into a LTS branch. Looks like the SATA subsystem was significantly broken and I'm unsure whether any useful information could be retrieved. Notably there is quite a bit of noise from SATA devices not effected by this issue. Some of the messages /might/ be useful, but the amount of noise is quite high. Do messages from a broken kernel interest you? -- (\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/) \BS (| ehem+sig...@m5p.com PGP 87145445 |) / \_CS\ | _ -O #include O- _ | / _/ 8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
[PATCH 4/6] tools/misc: xenwatchdogd: add parse_secs()
From: Leigh Brown Create a new parse_secs() function to parse the timeout and sleep parameters. This ensures that non-numeric parameters are not accidentally treated as numbers. --- tools/misc/xenwatchdogd.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/tools/misc/xenwatchdogd.c b/tools/misc/xenwatchdogd.c index 224753e824..26bfdef3db 100644 --- a/tools/misc/xenwatchdogd.c +++ b/tools/misc/xenwatchdogd.c @@ -49,6 +49,18 @@ static void catch_usr1(int sig) done = true; } +static int parse_secs(const char *arg, const char *what) +{ +char *endptr; +unsigned long val; + +val = strtoul(arg, &endptr, 0); +if (val > INT_MAX || *endptr != 0) + errx(EXIT_FAILURE, "invalid %s: '%s'", what, arg); + +return val; +} + int main(int argc, char **argv) { int id; @@ -64,16 +76,11 @@ int main(int argc, char **argv) if (h == NULL) err(EXIT_FAILURE, "xc_interface_open"); -t = strtoul(argv[1], NULL, 0); -if (t == ULONG_MAX) - err(EXIT_FAILURE, "strtoul"); +t = parse_secs(argv[optind], "timeout"); s = t / 2; -if (argc == 3) { - s = strtoul(argv[2], NULL, 0); - if (s == ULONG_MAX) - err(EXIT_FAILURE, "strtoul"); -} +if (argc == 3) + s = parse_secs(argv[optind], "sleep"); if (signal(SIGHUP, &catch_exit) == SIG_ERR) err(EXIT_FAILURE, "signal"); -- 2.39.2
[PATCH 2/6] tools/misc: rework xenwatchdogd signal handling
From: Leigh Brown Rework xenwatchdogd signal handling to do the minimum in the signal handler. This is a very minor enhancement. --- tools/misc/xenwatchdogd.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/tools/misc/xenwatchdogd.c b/tools/misc/xenwatchdogd.c index 2f7c822d61..d4da0ad0b6 100644 --- a/tools/misc/xenwatchdogd.c +++ b/tools/misc/xenwatchdogd.c @@ -9,9 +9,11 @@ #include #include #include +#include xc_interface *h; -int id = 0; +bool safeexit = false; +bool done = false; void daemonize(void) { @@ -38,20 +40,18 @@ void daemonize(void) void catch_exit(int sig) { -if (id) -xc_watchdog(h, id, 300); -exit(EXIT_SUCCESS); +done = true; } void catch_usr1(int sig) { -if (id) -xc_watchdog(h, id, 0); -exit(EXIT_SUCCESS); +safeexit = true; +done = true; } int main(int argc, char **argv) { +int id; int t, s; int ret; @@ -90,10 +90,14 @@ int main(int argc, char **argv) if (id <= 0) err(EXIT_FAILURE, "xc_watchdog setup"); -for (;;) { +while (!done) { sleep(s); ret = xc_watchdog(h, id, t); if (ret != 0) err(EXIT_FAILURE, "xc_watchdog"); } + +// Zero seconds timeout will disarm the watchdog timer +xc_watchdog(h, id, safeexit ? 0 : 300); +return 0; } -- 2.39.2
[PATCH 3/6] tools/misc: xenwatchdogd: make functions static
From: Leigh Brown Make all functions except main() static in xenwatchdogd.c. --- tools/misc/xenwatchdogd.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/misc/xenwatchdogd.c b/tools/misc/xenwatchdogd.c index d4da0ad0b6..224753e824 100644 --- a/tools/misc/xenwatchdogd.c +++ b/tools/misc/xenwatchdogd.c @@ -15,7 +15,7 @@ xc_interface *h; bool safeexit = false; bool done = false; -void daemonize(void) +static void daemonize(void) { switch (fork()) { case -1: @@ -38,12 +38,12 @@ void daemonize(void) err(EXIT_FAILURE, "reopen stderr"); } -void catch_exit(int sig) +static void catch_exit(int sig) { done = true; } -void catch_usr1(int sig) +static void catch_usr1(int sig) { safeexit = true; done = true; -- 2.39.2
[PATCH 6/6] docs/man: Add xenwatchdog manual page
From: Leigh Brown Add a manual page for xenwatchdogd. --- docs/man/xenwatchdogd.8.pod | 54 + 1 file changed, 54 insertions(+) create mode 100644 docs/man/xenwatchdogd.8.pod diff --git a/docs/man/xenwatchdogd.8.pod b/docs/man/xenwatchdogd.8.pod new file mode 100644 index 00..2f6454f183 --- /dev/null +++ b/docs/man/xenwatchdogd.8.pod @@ -0,0 +1,54 @@ +=head1 NAME + +xenwatchdogd - Xen hypercall based watchdog daemon + +=head1 SYNOPSIS + +B [ I ] > [ > ] + +=head1 DESCRIPTION + +B arms the Xen watchdog timer to I every I +seconds. If the xenwatchdogd process dies or is delayed for more than +I seconds, then Xen will reboot the domain. If the domain being +rebooted is domain 0, the whole system will reboot. + +=head1 OPTIONS + +=over 4 + +=item B<-h>, B<--help> + +Display a help message. + +=item B<-F>, B<--foreground> + +Run in the foreground. The default behaviour is to daemonize. + +=item B<-x>, B<--safe-exit> + +Disable watchdog on orderly exit. The default behaviour is to arm the +watchdog to 300 seconds to allow time for the domain to shutdown. See +also the B section. + +=item B + +The number of seconds to arm the Xen watchdog timer. This must be set to +a minimum of two. + +=item B + +The number of seconds to sleep in between calls to arm the Xen watchdog +timer. This must be at least one second, and less than the I +value. If not specified, it defaults to half the I value. + +=back + +=head1 SIGNALS + +B Will cause the program to disarm the watchdog timer and exit, +regardless of whether the safe exit option was passed. + +=head1 AUTHOR + +Citrix Ltd and other contributors. -- 2.39.2
[PATCH 0/6] xenwatchdogd enhancements
From: Leigh Brown Following up on Cyril's email. I had been independently looking at this, mainly because xenwatchdogd is simple enough for me to understand. The primary intention of this patch series is to replace the pathologically bad behaviour of rebooting the domain if you run "xenwatchdogd -h". To that end, I have implemented comprehensive argument validation. This validation ensures you can't pass arguments that instantly reboot the domain or cause it to spin loop running sleep(0) repeatedly. I added a couple of enhancements whilst working on the changes as they were easy enough. Full list of changes: - Use getopt_long() to add -h/--help with associated usage help. - Add -F/--foreground parameter to run without daemonizing. - Add -x/--save-exit parameter to disarm the watchdog when exiting. - Validate timeout is numeric and is at least two seconds. - Validate sleep is numeric and is at least one and less than timeout. - Check for too many arguments. - Use symbol constants instead of magic numbers where possible. - Add a manual page for xenwatchdogd(). I am not an expert in git or sending patches so forgive me if things don't look quite right. Signed-off-by: Leigh Brown Leigh Brown (6): tools/misc: xenwatchdogd: use EXIT_* constants tools/misc: rework xenwatchdogd signal handling tools/misc: xenwatchdogd: make functions static tools/misc: xenwatchdogd: add parse_secs() tools/misc: xenwatchdogd enhancements docs/man: Add xenwatchdog manual page docs/man/xenwatchdogd.8.pod | 54 +++ tools/misc/xenwatchdogd.c | 180 2 files changed, 195 insertions(+), 39 deletions(-) create mode 100644 docs/man/xenwatchdogd.8.pod -- 2.39.2
[PATCH 1/6] tools/misc: xenwatchdogd: use EXIT_* constants
From: Leigh Brown Use EXIT_SUCCESS/EXIT_FAILURE constants instead of magic numbers. --- tools/misc/xenwatchdogd.c | 40 +++ 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/tools/misc/xenwatchdogd.c b/tools/misc/xenwatchdogd.c index 254117b554..2f7c822d61 100644 --- a/tools/misc/xenwatchdogd.c +++ b/tools/misc/xenwatchdogd.c @@ -17,37 +17,37 @@ void daemonize(void) { switch (fork()) { case -1: - err(1, "fork"); + err(EXIT_FAILURE, "fork"); case 0: break; default: - exit(0); + exit(EXIT_SUCCESS); } umask(0); if (setsid() < 0) - err(1, "setsid"); + err(EXIT_FAILURE, "setsid"); if (chdir("/") < 0) - err(1, "chdir /"); + err(EXIT_FAILURE, "chdir /"); if (freopen("/dev/null", "r", stdin) == NULL) -err(1, "reopen stdin"); +err(EXIT_FAILURE, "reopen stdin"); if(freopen("/dev/null", "w", stdout) == NULL) -err(1, "reopen stdout"); +err(EXIT_FAILURE, "reopen stdout"); if(freopen("/dev/null", "w", stderr) == NULL) -err(1, "reopen stderr"); +err(EXIT_FAILURE, "reopen stderr"); } void catch_exit(int sig) { if (id) xc_watchdog(h, id, 300); -exit(0); +exit(EXIT_SUCCESS); } void catch_usr1(int sig) { if (id) xc_watchdog(h, id, 0); -exit(0); +exit(EXIT_SUCCESS); } int main(int argc, char **argv) @@ -56,44 +56,44 @@ int main(int argc, char **argv) int ret; if (argc < 2) - errx(1, "usage: %s ", argv[0]); + errx(EXIT_FAILURE, "usage: %s ", argv[0]); daemonize(); h = xc_interface_open(NULL, NULL, 0); if (h == NULL) - err(1, "xc_interface_open"); + err(EXIT_FAILURE, "xc_interface_open"); t = strtoul(argv[1], NULL, 0); if (t == ULONG_MAX) - err(1, "strtoul"); + err(EXIT_FAILURE, "strtoul"); s = t / 2; if (argc == 3) { s = strtoul(argv[2], NULL, 0); if (s == ULONG_MAX) - err(1, "strtoul"); + err(EXIT_FAILURE, "strtoul"); } if (signal(SIGHUP, &catch_exit) == SIG_ERR) - err(1, "signal"); + err(EXIT_FAILURE, "signal"); if (signal(SIGINT, &catch_exit) == SIG_ERR) - err(1, "signal"); + err(EXIT_FAILURE, "signal"); if (signal(SIGQUIT, &catch_exit) == SIG_ERR) - err(1, "signal"); + err(EXIT_FAILURE, "signal"); if (signal(SIGTERM, &catch_exit) == SIG_ERR) - err(1, "signal"); + err(EXIT_FAILURE, "signal"); if (signal(SIGUSR1, &catch_usr1) == SIG_ERR) - err(1, "signal"); + err(EXIT_FAILURE, "signal"); id = xc_watchdog(h, 0, t); if (id <= 0) -err(1, "xc_watchdog setup"); +err(EXIT_FAILURE, "xc_watchdog setup"); for (;;) { sleep(s); ret = xc_watchdog(h, id, t); if (ret != 0) -err(1, "xc_watchdog"); +err(EXIT_FAILURE, "xc_watchdog"); } } -- 2.39.2