from:"Marek Marczykowski\-Górecki"

Re: [XEN RFC PATCH v4 0/5] IOMMU subsystem redesign and PV-IOMMU interface

2024-11-04 Thread Marek Marczykowski-Górecki

On Mon, Nov 04, 2024 at 02:28:38PM +, Teddy Astie wrote:
> * introduce "dom0-iommu=no-dma" to make default context block all DMA
>   (disables HAP and sync-pt), enforcing usage of PV-IOMMU for DMA
>   Can be used to expose properly "Pre-boot DMA protection"

This sounds like it disables HAP completely, but actually it looks like
disabling sharing HAP page tables with IOMMU ones, right? That (HAP
sharing) is relevant for PVH dom0 only, correct?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

signature.asc
Description: PGP signature

Re: [PATCH v4] x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping

2024-11-01 Thread Marek Marczykowski-Górecki

On Thu, Oct 31, 2024 at 09:57:13AM +0100, Roger Pau Monne wrote:
> When using AMD-Vi interrupt remapping the vector field in the IO-APIC RTE is
> repurposed to contain part of the offset into the remapping table.  Previous 
> to
> 2ca9fbd739b8 Xen had logic so that the offset into the interrupt remapping
> table would match the vector.  Such logic was mandatory for end of interrupt 
> to
> work, since the vector field (even when not containing a vector) is used by 
> the
> IO-APIC to find for which pin the EOI must be performed.
> 
> A simple solution wold be to read the IO-APIC RTE each time an EOI is to be
> performed, so the raw value of the vector field can be obtained.  However
> that's likely to perform poorly.  Instead introduce a cache to store the
> EOI handles when using interrupt remapping, so that the IO-APIC driver can
> translate pins into EOI handles without having to read the IO-APIC RTE entry.
> Note that to simplify the logic such cache is used unconditionally when
> interrupt remapping is enabled, even if strictly it would only be required
> for AMD-Vi.
> 
> Reported-by: Willi Junga 
> Suggested-by: David Woodhouse 
> Fixes: 2ca9fbd739b8 ('AMD IOMMU: allocate IRTE entries instead of using a 
> static mapping')
> Signed-off-by: Roger Pau Monné 

Very lightly, but
Tested-by: Marek Marczykowski-Górecki 

with this patch, touchpad works on Framework AMD, _without_ ioapic_ack=new
option.

> ---
> Changes since v3:
>  - Remove the usage of a sentinel value (again).
>  - Allocate an initialize the cache in enable_IO_APIC().
>  - Fix MISRA compliance.
>  - Use xvmalloc.
> 
> Changes since v2:
>  - Restore sentinel value.
> 
> Changes since v1:
>  - s/apic_pin_eoi/io_apic_pin_eoi/.
>  - Expand comment about io_apic_pin_eoi usage and layout.
>  - Use uint8_t instead of unsigned int as array type.
>  - Do not use a sentinel value.
> ---
>  xen/arch/x86/io_apic.c | 76 +++---
>  1 file changed, 71 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/arch/x86/io_apic.c b/xen/arch/x86/io_apic.c
> index e40d2f7dbd75..4e5ee2b890a0 100644
> --- a/xen/arch/x86/io_apic.c
> +++ b/xen/arch/x86/io_apic.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -71,6 +72,24 @@ static int apic_pin_2_gsi_irq(int apic, int pin);
>  
>  static vmask_t *__read_mostly vector_map[MAX_IO_APICS];
>  
> +/*
> + * Store the EOI handle when using interrupt remapping.
> + *
> + * If using AMD-Vi interrupt remapping the IO-APIC redirection entry remapped
> + * format repurposes the vector field to store the offset into the Interrupt
> + * Remap table.  This breaks directed EOI, as the CPU vector no longer 
> matches
> + * the contents of the RTE vector field.  Add a translation cache so that
> + * directed EOI uses the value in the RTE vector field when interrupt 
> remapping
> + * is enabled.
> + *
> + * Intel VT-d Xen code still stores the CPU vector in the RTE vector field 
> when
> + * using the remapped format, but use the translation cache uniformly in 
> order
> + * to avoid extra logic to differentiate between VT-d and AMD-Vi.
> + *
> + * The matrix is accessed as [#io-apic][#pin].
> + */
> +static uint8_t **io_apic_pin_eoi;
> +
>  static void share_vector_maps(unsigned int src, unsigned int dst)
>  {
>  unsigned int pin;
> @@ -273,6 +292,17 @@ void __ioapic_write_entry(
>  {
>  __io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
>  __io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
> +/*
> + * Might be called before io_apic_pin_eoi is allocated.  Entry will 
> be
> + * initialized to the RTE value once the cache is allocated.
> + *
> + * The vector field is only cached for raw RTE writes when using IR.
> + * In that case the vector field might have been repurposed to store
> + * something different than the CPU vector, and hence need to be 
> cached
> + * for performing EOI.
> + */
> +if ( io_apic_pin_eoi )
> +io_apic_pin_eoi[apic][pin] = e.vector;
>  }
>  else
>  iommu_update_ire_from_apic(apic, pin, e.raw);
> @@ -288,18 +318,36 @@ static void ioapic_write_entry(
>  spin_unlock_irqrestore(&ioapic_lock, flags);
>  }
>  
> -/* EOI an IO-APIC entry.  Vector may be -1, indicating that it should be
> +/*
> + * EOI an IO-APIC entry.  Vector may be -1, indicating that it should be
>   * worked out using the pin.  This function expects that the ioapic_lock is
>   * being held, and interrupts are disabled (or there is a good reason not
>   * to), and that if bo

Dom0 crash (BUG_ON(old_mode != XEN_LAZY_NONE) in enter_lazy()) with Linux 6.11.2

2024-11-01 Thread Marek Marczykowski-Górecki

.constprop.0+0x79/0x90
[10575.074820]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.079785]  ? srso_alias_return_thunk+px5/0xfbef5
[10575.084754]  ? insert_pfn+0xbc/0x1f0
[10575.088475]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.093443]  ? vmf_insert_pfn_prot+0x85/0xd0
[10575.097878]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.102839]  ? ttm_bo_vm_fault_reserved+0x1b0/0x3b0 [ttm]
[10575.108424]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.113390]  ? ttm_resource_move_to_lru_tail+0x166/0x260 [ttm]
[10575.119420]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.124388]  ? i915_ttm_adjust_lru+0xe5/0x210 [i915]
[10575.129533]  ? srso_alias_return_thunk+px5/0xfbef5
[10575.134495]  ? srso_alias_return_thunk+px5/0xfbef5
[10575.139464]  ? vm_fault_ttm+0x1c5/0x3a0 [i915]
[10575.144071]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.149036]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.154004]  ? xen_pmd_val+0x35/0x70
[10575.157730]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.162696]  ? srso_alias_return_thunk+px5/0xfbef5
[10575.167657]  ? __do_fault+0x35/0x120
[10575.171385]  ? srso_alias_return_thunk+px5/0xfbef5
[10575.176351]  ? do_fault+0x137/0x310
[10575.179988]  ? handle_pte_fault+0xf7/0x190
[10575.184243]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.189216]  ? __handle_mm_fault+0x5d6/0x700
[10575.193645]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.198606]  ? __count_memcg_events+0x75/0x130
[10575.203217]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.208192]  ? count_memcg_events.constprop.0+0x24/0x30
[10575.213594]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.218567]  ? handle_mm_fault+0x21a/0x340
[10575.222815]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.227782]  ? do_user_addr_fault+0x1ec/0x820
[10575.232302]  ? srso_alias_return_thunk+0x5/0xfbef5
[10575.237266]  ? exc_page_fault+0x7b/0x190
[10575.241350]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[10575.246760]  
[10575.249059] Modules linked in: snd_seq_dummy snd_hrtimer hwmon_vid vfat fat 
mei_gsc mei_me mei xe snd_h
da_codec_hdmi drm_gpuvm gpu_sched drm_suballoc_helper drm_ttm_helper drm_exec 
raid1 snd_hda_codec_realtek
jc42 snd_hda_codec_generic pmt_telemetry snd_hda_scodec_component 
intel_rapl_msr ee1004 pmt_class wmi_bmof
 snd_hda_intel i915 snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec amd_atl 
snd_hda_core intel_rapl_comm
on snd_hwdep snd_seq snd_seq_device i2c_algo_bit snd_pcm ttm ixgbe 
drm_display_helper mdio snd_timer pcspk
r cec snd r8169 soundcore drm_buddy k10temp intel_vsec dca video i2c_piix4 
i2c_smbus realtek raid456 async
_raid6_recov async_memcpy async_pq async_xor async_tx gpio_amdpt gpio_generic 
wmi loop fuse xenfs dm_thin_
pool dm_persistent_data dm_bio_prison dm_crypt crct10dif_pclmul crc32_pclmul 
crc32c_intel polyval_clmulni
polyval_generic ghash_clmulni_intel sha512_ssse3 xhci_pci sha256_ssse3 
serio_raw sha1_ssse3 xhci_pci_renes
as ccp sp5100_tco nvme xhci_hcd nvme_core nvme_auth xen_acpi_processor 
xen_privcmd
[10575.249164]  xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn 
scsi_dh_rdac scsi_dh_emc scsi_d
h_alua uinput dm_multipath
[10575.353166] ---[ end trace  ]---
[10575.357954] RIP: e030:xen_start_context_switch+0x6e/0x70
[10575.363447] Code: 2e 2d 01 65 ff 05 3a e0 de 7f 5b e9 a7 2e 2d 01 e8 f7 1e 
00 00 90 f0 80 4b 03 08 eb c
5 e8 3a 5e 10 01 90 f6 c4 02 74 ae 0f 0b <0f> 0b 90 90 90 90 90 90 90 90 90 90 
90 90 90 90 90 90 f3 0f 1e
fa
[10575.382690] RSP: e02b:c90052113040 EFLAGS: 00010002
[10575.388101] RAX: 0001 RBX: 8881228428c0 RCX: 00p0002d9211
[10575.395457] RDX:  RSI: 0001 RDI: 888143c1faf0
[10575.402820] RBP: 8881228428c0 R08:  R09: 0001
[10575.410179] R10: 7ff0 R11: 888143c450c0 R12: 888101b88000
[10575.417538] R13: 81632000 R14: 0001 R15: 
[10575.424899] FS:  72e47fbb6a80() GS:888143c0() 
knlGS:
[10575.433231] CS:  e030 DS:  ES:  CR0: 80050033
[10575.439174] CR2: 72e449455000 CR3: 00012f11c000 CR4: 00050660
[10575.446534] Kernel panic - not syncing: Fatal exception
[10575.451949] Kernel Offset: disabled
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
 Xen 4.17.5
(XEN) Xen version 4.17.5 (mockbuild@[unknown]) (gcc (GCC) 12.3.1 20230508 (Red 
Hat 12.3.1-1)) debug=n Sun
Sep 29 12:27:22 GMT 2024

And the whole report: https://github.com/QubesOS/qubes-issues/issues/9552

Any ideas?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [QEMU PATCH v8] xen/passthrough: use gsi to map pirq when dom0 is PVH

2024-10-22 Thread Marek Marczykowski-Górecki

On Wed, Oct 16, 2024 at 02:28:27PM +0800, Jiqian Chen wrote:
> --- a/hw/xen/xen_pt.h
> +++ b/hw/xen/xen_pt.h
> @@ -36,6 +36,7 @@ void xen_pt_log(const PCIDevice *d, const char *f, ...) 
> G_GNUC_PRINTF(2, 3);
>  #  define XEN_PT_LOG_CONFIG(d, addr, val, len)
>  #endif
>  
> +#define DOMID_RUN_QEMU 0

Please, don't hardcode dom0 here, QEMU can be running in a stubdomain
too (in which case, this will hilariously explode, as it will check what
dom0 is, instead of where QEMU is running). 
Stewart already provided an alternative.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

signature.asc
Description: PGP signature

Re: [PATCH 3/3] CI: Add {adl,zen3p}-pvshim-* tests

2024-10-22 Thread Marek Marczykowski-Górecki

On Tue, Oct 22, 2024 at 02:25:54PM +0100, Andrew Cooper wrote:
> On 21/10/2024 11:28 pm, Stefano Stabellini wrote:
> > On Mon, 21 Oct 2024, Andrew Cooper wrote:
> >> diff --git a/automation/scripts/qubes-x86-64.sh 
> >> b/automation/scripts/qubes-x86-64.sh
> >> index 76fbafac84..95090eb12c 100755
> >> --- a/automation/scripts/qubes-x86-64.sh
> >> +++ b/automation/scripts/qubes-x86-64.sh
> >> @@ -44,6 +45,11 @@ echo \"${passed}\"
> >>  
> >>  if [ "${test_variant}" = "dom0pvh-hvm" ]; then
> >>  domU_type="hvm"
> >> +elif [ "${test_variant}" = "pvshim" ]; then
> >> +domU_type="pvh"
> > This is not necessary since PVH is already the default. In theory, it is
> > harmless, but it caused me to do a double-take because I initially
> > thought I was missing something, given that PVH is expected to be the
> > default at this point.
> 
> That was more fallout of refactoring.
> 
> My first attempt (which worked) involved writing a whole domU_config=
> block, and then I decided that was a bad thing.
> 
> Selecting PVShim without also forcing type to PVH is a little fragile if
> anyone changes the domU_type default.  I think it's cleaner to keep both
> together, even if it happens to be re-selecting the default.

IMO it's okay to leave re-setting the default here explicitly.

> >> +domU_extra_config='
> >> +pvshim = 1
> >> +'
> > Is there a reason this cannot be:
> >
> > domU_extra_config='pvshim = 1'
> >
> > ?
> >
> > These are just minor cosmetics.
> 
> Again, refactoring from before pulling out domU_type.
> 
> I'm not fussed - I can shrink this to one line.

Either is fine for me. The short one is more readable now, while the
long one is easier to extend in the future.

Anyway, for the series (c80e9241209cc9ed7f66c3f45275f837ddafc21c and
previous two):
Reviewed-by: Marek Marczykowski-Górecki 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH] x86/boot: Fix PVH boot following the start of the MBI->BI conversion

2024-10-21 Thread Marek Marczykowski-Górecki

On Sat, Oct 19, 2024 at 07:20:54PM +0100, Andrew Cooper wrote:
> pvh_init() sets up the mbi pointer, but leaves mbi_p at 0.  This isn't
> compatbile with multiboot_fill_boot_info() starting from the physical address,
> in order to remove the use of the mbi pointer.
> 
> Fixes: 038826b61e85 ("x86/boot: move x86 boot module counting into a new 
> boot_info struct")
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Daniel P. Smith 
> 
> This is a testiment to how tangled the boot code really is.

Did it causes crash in some boot configuration? If so, did some test
tripped on this (from what I see, not a gitlab one)?

> ---
>  xen/arch/x86/setup.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 6746ed8cced6..bfede5064e8c 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1048,6 +1048,7 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> long mbi_p)
>  {
>  ASSERT(mbi_p == 0);
>  pvh_init(&mbi, &mod);
> +mbi_p = __pa(mbi);
>  }
>  else
>  {
> 
> base-commit: e9f227685e7204cb2293576ee5b745b828cb3cd7
> -- 
> 2.39.5
> 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH] x86/io-apic: fix directed EOI when using AMd-Vi interrupt remapping

2024-10-18 Thread Marek Marczykowski-Górecki

On Fri, Oct 18, 2024 at 10:08:13AM +0200, Roger Pau Monne wrote:
> When using AMD-VI interrupt remapping the vector field in the IO-APIC RTE is
> repurposed to contain part of the offset into the remapping table.  Previous 
> to
> 2ca9fbd739b8 Xen had logic so that the offset into the interrupt remapping
> table would match the vector.  Such logic was mandatory for end of interrupt 
> to
> work, since the vector field (even when not containing a vector) is used by 
> the
> IO-APIC to find for which pin the EOI must be performed.
> 
> Introduce a table to store the EOI handlers when using interrupt remapping, so
> that the IO-APIC driver can translate pins into EOI handlers without having to
> read the IO-APIC RTE entry.  Note that to simplify the logic such table is 
> used
> unconditionally when interrupt remapping is enabled, even if strictly it would
> only be required for AMD-Vi.
> 
> Reported-by: Willi Junga 
> Suggested-by: David Woodhouse 
> Fixes: 2ca9fbd739b8 ('AMD IOMMU: allocate IRTE entries instead of using a 
> static mapping')
> Signed-off-by: Roger Pau Monné 

I can confirm it fixes touchpad issue on Framework 13 AMD,
it works without ioapic_ack=new now, thanks!
Tested-by: Marek Marczykowski-Górecki 

> ---
>  xen/arch/x86/io_apic.c | 47 ++
>  1 file changed, 47 insertions(+)
> 
> diff --git a/xen/arch/x86/io_apic.c b/xen/arch/x86/io_apic.c
> index e40d2f7dbd75..8856eb29d275 100644
> --- a/xen/arch/x86/io_apic.c
> +++ b/xen/arch/x86/io_apic.c
> @@ -71,6 +71,22 @@ static int apic_pin_2_gsi_irq(int apic, int pin);
>  
>  static vmask_t *__read_mostly vector_map[MAX_IO_APICS];
>  
> +/*
> + * Store the EOI handle when using interrupt remapping.
> + *
> + * If using AMD-Vi interrupt remapping the IO-APIC redirection entry remapped
> + * format repurposes the vector field to store the offset into the Interrupt
> + * Remap table.  This causes directed EOI to longer work, as the CPU vector 
> no
> + * longer matches the contents of the RTE vector field.  Add a translation
> + * table so that directed EOI uses the value in the RTE vector field when
> + * interrupt remapping is enabled.
> + *
> + * Note Intel VT-d Xen code still stores the CPU vector in the RTE vector 
> field
> + * when using the remapped format, but use the translation table uniformly in
> + * order to avoid extra logic to differentiate between VT-d and AMD-Vi.
> + */
> +static unsigned int **apic_pin_eoi;
> +
>  static void share_vector_maps(unsigned int src, unsigned int dst)
>  {
>  unsigned int pin;
> @@ -273,6 +289,13 @@ void __ioapic_write_entry(
>  {
>  __io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
>  __io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
> +/*
> + * Might be called before apic_pin_eoi is allocated.  Entry will be
> + * updated once the array is allocated and there's an EOI or write
> + * against the pin.
> + */
> +if ( apic_pin_eoi )
> +apic_pin_eoi[apic][pin] = e.vector;
>  }
>  else
>  iommu_update_ire_from_apic(apic, pin, e.raw);
> @@ -298,9 +321,17 @@ static void __io_apic_eoi(unsigned int apic, unsigned 
> int vector, unsigned int p
>  /* Prefer the use of the EOI register if available */
>  if ( ioapic_has_eoi_reg(apic) )
>  {
> +if ( apic_pin_eoi )
> +vector = apic_pin_eoi[apic][pin];
> +
>  /* If vector is unknown, read it from the IO-APIC */
>  if ( vector == IRQ_VECTOR_UNASSIGNED )
> +{
>  vector = __ioapic_read_entry(apic, pin, true).vector;
> +if ( apic_pin_eoi )
> +/* Update cached value so further EOI don't need to fetch 
> it. */
> +apic_pin_eoi[apic][pin] = vector;
> +}
>  
>  *(IO_APIC_BASE(apic)+16) = vector;
>  }
> @@ -1022,7 +1053,23 @@ static void __init setup_IO_APIC_irqs(void)
>  
>  apic_printk(APIC_VERBOSE, KERN_DEBUG "init IO_APIC IRQs\n");
>  
> +if ( iommu_intremap )
> +{
> +apic_pin_eoi = xzalloc_array(typeof(*apic_pin_eoi), nr_ioapics);
> +BUG_ON(!apic_pin_eoi);
> +}
> +
>  for (apic = 0; apic < nr_ioapics; apic++) {
> +if ( iommu_intremap )
> +{
> +apic_pin_eoi[apic] = xmalloc_array(typeof(**apic_pin_eoi),
> +   nr_ioapic_entries[apic]);
> +BUG_ON(!apic_pin_eoi[apic]);
> +
> +    for ( pin = 0; pin < nr_ioapic_entries[apic]; pin++ )
> +apic_pin_eoi[apic][pin] = IRQ_VECTOR_UNASSIGNED;
> +}
> +
>  for (pin = 0; pin < nr_ioapic_entries[apic]; pin++) {
>  /*
>   * add it to the IO-APIC irq-routing table:
> -- 
> 2.46.0
> 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [XEN PATCH v1 0/3] automation: add x86_64 test (linux argo)

2024-10-18 Thread Marek Marczykowski-Górecki

On Thu, Oct 17, 2024 at 08:25:57PM +0100, Andrew Cooper wrote:
> I was thinking of experimenting with a separate top-level repo that does
> nothing but has a few manual runs to populate artefacts, and having the
> Xen tests pull artefacts from here rather than from earlier build jobs.

This sounds like a good idea. It may also help with rebuilding
containers when needed?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v6 2/5] x86/boot: Reuse code to relocate trampoline

2024-10-17 Thread Marek Marczykowski-Górecki

On Thu, Oct 17, 2024 at 02:31:20PM +0100, Frediano Ziglio wrote:
> Move code from efi-boot.h to a separate, new, reloc-trampoline.c file.
> Reuse this new code, compiling it for 32bit as well, to replace assembly
> code in head.S doing the same thing.
> 
> Signed-off-by: Frediano Ziglio 
> Reviewed-by: Andrew Cooper 

For the EFI part:
Acked-by: Marek Marczykowski-Górecki 

> ---
> Changes since v3:
> - fixed a typo in comment;
> - added Reviewed-by.
> 
> Changes since v5:
> - do not add obj64 to targets, already done adding it to obj-bin-y.
> ---
>  xen/arch/x86/boot/Makefile   | 10 +---
>  xen/arch/x86/boot/build32.lds.S  |  5 
>  xen/arch/x86/boot/head.S | 23 +-
>  xen/arch/x86/boot/reloc-trampoline.c | 36 
>  xen/arch/x86/efi/efi-boot.h  | 15 ++--
>  5 files changed, 51 insertions(+), 38 deletions(-)
>  create mode 100644 xen/arch/x86/boot/reloc-trampoline.c
> 
> diff --git a/xen/arch/x86/boot/Makefile b/xen/arch/x86/boot/Makefile
> index 5da19501be..98ceb1983d 100644
> --- a/xen/arch/x86/boot/Makefile
> +++ b/xen/arch/x86/boot/Makefile
> @@ -1,11 +1,15 @@
>  obj-bin-y += head.o
>  obj-bin-y += built-in-32.o
> +obj-bin-y += $(obj64)
>  
>  obj32 := cmdline.32.o
>  obj32 += reloc.32.o
> +obj32 += reloc-trampoline.32.o
>  
> -nocov-y   += $(obj32)
> -noubsan-y += $(obj32)
> +obj64 := reloc-trampoline.o
> +
> +nocov-y   += $(obj32) $(obj64)
> +noubsan-y += $(obj32) $(obj64)
>  targets   += $(obj32)
>  
>  obj32 := $(addprefix $(obj)/,$(obj32))
> @@ -56,7 +60,7 @@ cmd_combine = \
>   --bin1 $(obj)/built-in-32.base.bin \
>   --bin2 $(obj)/built-in-32.offset.bin \
>   --map $(obj)/built-in-32.offset.map \
> - --exports cmdline_parse_early,reloc \
> + --exports cmdline_parse_early,reloc,reloc_trampoline32 \
>   --output $@
>  
>  targets += built-in-32.S
> diff --git a/xen/arch/x86/boot/build32.lds.S b/xen/arch/x86/boot/build32.lds.S
> index e3f5e55261..fa282370f4 100644
> --- a/xen/arch/x86/boot/build32.lds.S
> +++ b/xen/arch/x86/boot/build32.lds.S
> @@ -41,6 +41,11 @@ SECTIONS
>   * Potentially they should be all variables. */
>  DECLARE_IMPORT(__base_relocs_start);
>  DECLARE_IMPORT(__base_relocs_end);
> +DECLARE_IMPORT(__trampoline_rel_start);
> +DECLARE_IMPORT(__trampoline_rel_stop);
> +DECLARE_IMPORT(__trampoline_seg_start);
> +DECLARE_IMPORT(__trampoline_seg_stop);
> +DECLARE_IMPORT(trampoline_phys);
>  . = . + GAP;
>  *(.text)
>  *(.text.*)
> diff --git a/xen/arch/x86/boot/head.S b/xen/arch/x86/boot/head.S
> index e0776e3896..ade2c5c43d 100644
> --- a/xen/arch/x86/boot/head.S
> +++ b/xen/arch/x86/boot/head.S
> @@ -706,28 +706,7 @@ trampoline_setup:
>  mov %edx, sym_offs(l1_bootmap)(%esi, %ecx, 8)
>  
>  /* Apply relocations to bootstrap trampoline. */
> -mov sym_esi(trampoline_phys), %edx
> -lea sym_esi(__trampoline_rel_start), %edi
> -lea sym_esi(__trampoline_rel_stop), %ecx
> -1:
> -mov (%edi), %eax
> -add %edx, (%edi, %eax)
> -add $4,%edi
> -
> -cmp %ecx, %edi
> -jb  1b
> -
> -/* Patch in the trampoline segment. */
> -shr $4,%edx
> -lea sym_esi(__trampoline_seg_start), %edi
> -lea sym_esi(__trampoline_seg_stop), %ecx
> -1:
> -mov (%edi), %eax
> -mov %dx, (%edi, %eax)
> -add $4,%edi
> -
> -cmp %ecx, %edi
> -jb  1b
> +callreloc_trampoline32
>  
>  /* Do not parse command line on EFI platform here. */
>  cmpb$0, sym_esi(efi_platform)
> diff --git a/xen/arch/x86/boot/reloc-trampoline.c 
> b/xen/arch/x86/boot/reloc-trampoline.c
> new file mode 100644
> index 00..0a74c1e75a
> --- /dev/null
> +++ b/xen/arch/x86/boot/reloc-trampoline.c
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#include 
> +#include 
> +#include 
> +
> +extern const int32_t __trampoline_rel_start[], __trampoline_rel_stop[];
> +extern const int32_t __trampoline_seg_start[], __trampoline_seg_stop[];
> +
> +#if defined(__i386__)
> +void reloc_trampoline32(void)
> +#elif defined (__x86_64__)
> +void reloc_trampoline64(void)
> +#else
> +#error Unknown architecture
> +#endif
> +{
> +unsigned long phys = trampoline_phys;
> +const int32_t *trampoline_ptr;
> +
> +/*
> + * Apply relocations to trampolin

[PATCH v2] docs: update documentation of reboot param

2024-10-16 Thread Marek Marczykowski-Górecki

Reflect changed default mode, and fix formatting of `efi` value.

Fixes: d81dd3130351 ("x86/shutdown: change default reboot method preference")
Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v2:
- wrap
- update Default: line too
---
 docs/misc/xen-command-line.pandoc | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc 
b/docs/misc/xen-command-line.pandoc
index 5ce63044ade8..293dbc1a957b 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2139,7 +2139,7 @@ callbacks are safe to be executed. Expressed in 
milliseconds; maximum is
 ### reboot (x86)
 > `= t[riple] | k[bd] | a[cpi] | p[ci] | P[ower] | e[fi] | n[o] [, [w]arm | 
 > [c]old]`
 
-> Default: `0`
+> Default: system dependent
 
 Specify the host reboot method.
 
@@ -2153,14 +2153,14 @@ Specify the host reboot method.
 
 `kbd` instructs Xen to reboot the host via the keyboard controller.
 
-`acpi` instructs Xen to reboot the host using RESET_REG in the ACPI FADT.
+`acpi` instructs Xen to reboot the host using RESET_REG in the ACPI FADT (this
+is default mode if available).
 
 `pci` instructs Xen to reboot the host using PCI reset register (port CF9).
 
 `Power` instructs Xen to power-cycle the host using PCI reset register (port 
CF9).
 
-'efi' instructs Xen to reboot using the EFI reboot call (in EFI mode by
- default it will use that method first).
+`efi` instructs Xen to reboot using the EFI reboot call.
 
 `xen` instructs Xen to reboot using Xen's SCHEDOP hypercall (this is the 
default
 when running nested Xen)
-- 
2.46.0

[PATCH] docs: update documentation of reboot param

2024-10-16 Thread Marek Marczykowski-Górecki

Reflect changed default mode, and fix formatting of `efi` value.

Fixes: d81dd3130351 ("x86/shutdown: change default reboot method preference")
Signed-off-by: Marek Marczykowski-Górecki 
---
 docs/misc/xen-command-line.pandoc | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc 
b/docs/misc/xen-command-line.pandoc
index 5ce63044ade8..0e36608f13e6 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2153,14 +2153,13 @@ Specify the host reboot method.
 
 `kbd` instructs Xen to reboot the host via the keyboard controller.
 
-`acpi` instructs Xen to reboot the host using RESET_REG in the ACPI FADT.
+`acpi` instructs Xen to reboot the host using RESET_REG in the ACPI FADT (this 
is default mode if available).
 
 `pci` instructs Xen to reboot the host using PCI reset register (port CF9).
 
 `Power` instructs Xen to power-cycle the host using PCI reset register (port 
CF9).
 
-'efi' instructs Xen to reboot using the EFI reboot call (in EFI mode by
- default it will use that method first).
+`efi` instructs Xen to reboot using the EFI reboot call.
 
 `xen` instructs Xen to reboot using Xen's SCHEDOP hypercall (this is the 
default
 when running nested Xen)
-- 
2.46.0

Re: [PATCH v10 0/2] x86/boot: Improve MBI2 structure check (was: Reduce assembly code)

2024-10-15 Thread Marek Marczykowski-Górecki

On Tue, Oct 15, 2024 at 09:25:11AM +0100, Frediano Ziglio wrote:
> This series came from part of the work of removing duplications between
> boot code and rewriting part of code from assembly to C.

Acked-by: Marek Marczykowski-Górecki 

> Changes since v1, more details in specific commits:
> - style updates;
> - comments and descriptions improvements;
> - other improvements.
> 
> Changes since v2:
> - rebased on master, resolved conflicts;
> - add comment on trampoline section.
> 
> Changes since v3:
> - changed new function name;
> - declare efi_multiboot2 in a separate header;
> - distinguish entry point from using magic number;
> - other minor changes (see commens in commits).
> 
> Changes since v4:
> - rebase on staging;
> - set %fs and %gs as other segment registers;
> - style and other changes.
> 
> Changes since v5:
> - fixed a typo.
> 
> Changes since v6:
> - remove merged patch;
> - comment and style;
> - change some pointer checks to avoid overflows;
> - rename parse-mbi2.c to mbi2.c.
> 
> Changes since v7:
> - removed merged parts;
> - add required stack alignment.
> 
> Changes since v8:
> - added "Fixes:" line;
> - typo in commit message: Adler -> Alder;
> - add ".init" to mbi2.o;
> - reduce difference in Makefile.
> 
> Changes since v9:
> - minor messages updates.
> 
> Frediano Ziglio (2):
>   x86/boot: Align mbi2.c stack to 16 bytes
>   x86/boot: Improve MBI2 structure check
> 
>  xen/arch/x86/efi/Makefile | 4 ++--
>  xen/arch/x86/efi/mbi2.c   | 7 +--
>  2 files changed, 7 insertions(+), 4 deletions(-)
> 
> -- 
> 2.34.1
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: Xen PAT settings vs Linux PAT settings

2024-10-14 Thread Marek Marczykowski-Górecki

On Mon, Oct 14, 2024 at 09:05:58PM +0100, Andrew Cooper wrote:
> On 14/10/2024 7:26 pm, Marek Marczykowski-Górecki wrote:
> > Hi,
> >
> > It looks like we've identified the second buggy driver that somewhere
> > assumes PAT is configured as Linux normally do natively - nvidia binary
> > one this time[3]. The first one affected was i915, but it turned out to be
> > a bug in Linux mm. It was eventually fixed[1], but it was quite painful
> > debugging. This time a proper fix is not known yet. Since the previous
> > issue, Qubes OS carried a patch[2] that changes Xen to use same PAT as
> > Linux. We recently dropped this patch, since the Linux fix reached all
> > supported by us branches, but apparently it wasn't all...
> >
> > Anyway, would it be useful (and acceptable) for upstream Xen to have
> > a kconfig option (behind UNSUPPORTED or so) to switch this behavior?
> 
> Not UNSUPPORTED - it's bogus and I still want it purged.
> 
> But, behind EXPERT, with a suitable description (e.g. "This breaks
> various ABIs including migration, and is presented here for debugging PV
> driver issues in a single system.  If turning it on fixes a bug, please
> contact upstream Xen"), then I think we need to take it.

Makes sense.

> The fact that I've had to recommend it once already this week for
> debugging purposes, and it wasn't even this Nvidia bug, demonstrates how
> pervasive the problems are.
> 
> > Technically, it's a PV ABI violation, and it does break few things
> > (definitely PV domU with passthrough are affected - Xen considers them
> > L1TF vulnerable then; PV live migration is most likely broken too).
> 
> Do you have more information on this?  The PAT bits shouldn't form any
> part of L1TF considerations.

https://github.com/QubesOS/qubes-issues/issues/8593

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Xen PAT settings vs Linux PAT settings

2024-10-14 Thread Marek Marczykowski-Górecki

Hi,

It looks like we've identified the second buggy driver that somewhere
assumes PAT is configured as Linux normally do natively - nvidia binary
one this time[3]. The first one affected was i915, but it turned out to be
a bug in Linux mm. It was eventually fixed[1], but it was quite painful
debugging. This time a proper fix is not known yet. Since the previous
issue, Qubes OS carried a patch[2] that changes Xen to use same PAT as
Linux. We recently dropped this patch, since the Linux fix reached all
supported by us branches, but apparently it wasn't all...

Anyway, would it be useful (and acceptable) for upstream Xen to have
a kconfig option (behind UNSUPPORTED or so) to switch this behavior?
Technically, it's a PV ABI violation, and it does break few things
(definitely PV domU with passthrough are affected - Xen considers them
L1TF vulnerable then; PV live migration is most likely broken too). But
on the other hand, if one doesn't use affected feature, it allows to
workaround an issue that otherwise is very annoying to debug...


[1] git.kernel.org/torvalds/c/548cb932051fb6232ac983ed6673dae7bdf3cf4c
[2] 
https://github.com/QubesOS/qubes-vmm-xen/blob/44e9fd9f3b1ebf1cf43674b5a1c2669f7dd253f5/1019-Use-Linux-s-PAT.patch
[3] https://github.com/QubesOS/qubes-issues/issues/9501
-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH] iommu/amd-vi: do not error if device referenced in IVMD is not behind any IOMMU

2024-10-10 Thread Marek Marczykowski-Górecki

On Wed, Oct 09, 2024 at 03:37:06PM +0200, Roger Pau Monné wrote:
> On Wed, Oct 09, 2024 at 02:09:33PM +0200, Jan Beulich wrote:
> > On 09.10.2024 13:47, Roger Pau Monné wrote:
> > > On Wed, Oct 09, 2024 at 01:28:19PM +0200, Jan Beulich wrote:
> > >> On 09.10.2024 13:13, Roger Pau Monné wrote:
> > >>> I also think returning an error when no device in the IVMD range is
> > >>> covered by an IOMMU is dubious.  Xen will already print warning
> > >>> messages about such firmware inconsistencies, but refusing to boot is
> > >>> too strict.
> > >>
> > >> I disagree. We shouldn't enable DMA remapping in such an event. Whereas
> > > 
> > > I'm not sure I understand why you would go as far as refusing to
> > > enable DMA remapping.  How is a IVMD block having references to some
> > > devices not assigned to any IOMMU different to all devices referenced
> > > not assigned to any IOMMU?  We should deal with both in the same
> > > way.
> > 
> > Precisely because of ...
> > 
> > > If all devices in the IVMD block are not covered by an IOMMU, the
> > > IVMD block is useless.
> > 
> > ... this. We simply can't judge whether such a useless block really was
> > meant to cover something. If it was, we're hosed. Or maybe we screwed up
> > and wrongly conclude it's useless.
> 
> The same could be stated about devices in a IVMD block that are not
> assigned to an IOMMU: it could also be Xen that screwed up and wrongly
> concluded they are not assigned to an IOMMU.
> 
> > >  But there's nothing for Xen to action, due to
> > > the devices not having an IOMMU assigned.  IOW: it would be the same
> > > as booting natively without parsing the IVRS in the first place.
> > 
> > Not really, no. Not parsing IVRS means not turning on any IOMMU. We
> > then know we can't pass through any devices. We can't assess the
> > security of passing through devices (as far as it's under our control)
> > if we enable the IOMMU in perhaps a flawed way.
> > 
> > A formally valid IVMD we can't make sense of is imo no different from
> > a formally invalid IVMD, for which we would return ENODEV as well (and
> > hence fail to enable the IOMMU). Whereas what you're suggesting is, if
> > I take it further, to basically ignore (almost) all errors in table
> > parsing, and enable the IOMMU(s) in a best effort manner, no matter
> > whether that leads to a functional (let alone secure [to the degree
> > possible]) system.
> 
> No, don't take it further: not ignore all errors, I think we should
> ignore errors when the device in the IVMD is not behind an IOMMU.  And
> I think that should apply globally, regardless of whether all devices
> in the IVMD block fall in that category.
> 
> That will bring AMD-Vi inline with VT-d RMRR, as from what I can see
> acpi_parse_one_rmrr() doesn't care whether the device scope in the
> entry is behind an IOMMU or not, or whether the whole RMRR doesn't
> effectively apply to any device because none of them is covered by an
> IOMMU.
> 
> > What I don't really understand is why you want to relax our checking
> > beyond what's necessary for the one issue at hand.
> 
> This issue has been reported to us and we have been able to debug.
> However, I worry what other malformed IVMD blocks might be out there,
> for example an IVMD block that applies to a single device (type 21h),
> but such single device doesn't exist (or it's not behind an IOMMU).
> Maybe next time we simply won't get a report, the user will try Xen,
> see it's not working and move to something else.
> 
> I've taken a quick look at Linux, and when parsing the IVMD blocks
> there's no checking that referred devices are behind an IOMMU, the
> regions are just added to a list.

It seems Jan's concern is about passthrough of a device that Xen
incorrectly ignored IVMD entry for. But that doesn't really happen - if
the device doesn't exist (at least according to Xen) or isn't behind an
IOMMU (at least according to Xen), it surely won't be used with
passthorugh, no? So, it should be safe to not fail on either of those
cases, as long as given IVMD is applied to other devices (that are
eligible for passthrough) in its range.

Just my 2c.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH] tools/xg: increase LZMA_BLOCK_SIZE for uncompressing the kernel

2024-10-09 Thread Marek Marczykowski-Górecki

On Wed, Oct 09, 2024 at 01:38:32PM +0200, Jürgen Groß wrote:
> On 09.10.24 13:15, Jan Beulich wrote:
> > On 09.10.2024 13:08, Andrew Cooper wrote:
> > > On 09/10/2024 11:26 am, Juergen Gross wrote:
> > > > On 09.10.24 12:19, Jan Beulich wrote:
> > > > > On 09.10.2024 11:52, Marek Marczykowski-Górecki wrote:
> > > > > > On Wed, Oct 09, 2024 at 09:19:57AM +0200, Jan Beulich wrote:
> > > > > > > On 08.10.2024 23:32, Marek Marczykowski-Górecki wrote:
> > > > > > > > --- a/tools/libs/guest/xg_dom_bzimageloader.c
> > > > > > > > +++ b/tools/libs/guest/xg_dom_bzimageloader.c
> > > > > > > > @@ -272,8 +272,7 @@ static int _xc_try_lzma_decode(
> > > > > > > >    return retval;
> > > > > > > >    }
> > > > > > > >    -/* 128 Mb is the minimum size (half-way) documented to work 
> > > > > > > > for
> > > > > > > > all inputs. */
> > > > > > > > -#define LZMA_BLOCK_SIZE (128*1024*1024)
> > > > > > > > +#define LZMA_BLOCK_SIZE (256*1024*1024)
> > > > > > > 
> > > > > > > That's as arbitrary as before, now just not even with a comment at
> > > > > > > least
> > > > > > > hinting at it being arbitrary. Quoting from one of the LZMA API
> > > > > > > headers:
> > > > > > > 
> > > > > > >   * Decoder already supports dictionaries up to 4 GiB - 1 B 
> > > > > > > (i.e.
> > > > > > >   * UINT32_MAX), so increasing the maximum dictionary size of 
> > > > > > > the
> > > > > > >   * encoder won't cause problems for old decoders.
> > > > > > > 
> > > > > > > IOW - what if the Linux folks decided to increase the dictionary 
> > > > > > > size
> > > > > > > further? I therefore wonder whether we don't need to make this 
> > > > > > > more
> > > > > > > dynamic, perhaps by peeking into the header to obtain the 
> > > > > > > dictionary
> > > > > > > size used. The one thing I'm not sure about is whether there 
> > > > > > > can't be
> > > > > > > multiple such headers throughout the file, and hence (in 
> > > > > > > principle)
> > > > > > > differing dictionary sizes.
> > > > > > 
> > > > > > What is the purpose of this block size limit? From the error
> > > > > > message, it
> > > > > > seems to be avoiding excessive memory usage during decompression 
> > > > > > (which
> > > > > > could be DoS via OOM). If that's the case, then taking the limit 
> > > > > > from
> > > > > > the kernel binary itself will miss this point (especially in case of
> > > > > > pygrub or similar, but there may be other cases of not-fully-trusted
> > > > > > kernel binaries).
> > > > > 
> > > > > Indeed. The question then simply is: Where do we want to draw the line
> > > > > between what we permit and what we reject?
> > > > 
> > > > IMHO the most natural solution would be to use guest memory for this
> > > > purpose.
> > > > OTOH this probably would require a significant rework of libxenguest.
> > > 
> > > That was XSA-25.  There are toolstack-provided limits on kernel&initrd
> > > sizes.
> > 
> > Which probably can't be directly applied to dictionary size used during
> > (de)compression.
> 
> My point still stands: using GUEST memory for all the decompression work
> would avoid all these problems. If the guest memory isn't sufficient, a
> decompression by e.g. grub wouldn't work either.

Doing that would probably require mapping guest memory to dom0 for this
purpose. And probably quite severe changes to the decompressing code
(liblzma?) to actually use that memory instead of standard heap. I don't
think it's a feasible short term fix.
Theoretically this could be made configurable (if nothing else, then via
an env variable or even build-time setting...), but honestly it feels
like an overkill.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH] tools/xg: increase LZMA_BLOCK_SIZE for uncompressing the kernel

2024-10-09 Thread Marek Marczykowski-Górecki

On Wed, Oct 09, 2024 at 09:19:57AM +0200, Jan Beulich wrote:
> On 08.10.2024 23:32, Marek Marczykowski-Górecki wrote:
> > --- a/tools/libs/guest/xg_dom_bzimageloader.c
> > +++ b/tools/libs/guest/xg_dom_bzimageloader.c
> > @@ -272,8 +272,7 @@ static int _xc_try_lzma_decode(
> >  return retval;
> >  }
> >  
> > -/* 128 Mb is the minimum size (half-way) documented to work for all 
> > inputs. */
> > -#define LZMA_BLOCK_SIZE (128*1024*1024)
> > +#define LZMA_BLOCK_SIZE (256*1024*1024)
> 
> That's as arbitrary as before, now just not even with a comment at least
> hinting at it being arbitrary. Quoting from one of the LZMA API headers:
> 
>* Decoder already supports dictionaries up to 4 GiB - 1 B (i.e.
>* UINT32_MAX), so increasing the maximum dictionary size of the
>* encoder won't cause problems for old decoders.
> 
> IOW - what if the Linux folks decided to increase the dictionary size
> further? I therefore wonder whether we don't need to make this more
> dynamic, perhaps by peeking into the header to obtain the dictionary
> size used. The one thing I'm not sure about is whether there can't be
> multiple such headers throughout the file, and hence (in principle)
> differing dictionary sizes.

What is the purpose of this block size limit? From the error message, it
seems to be avoiding excessive memory usage during decompression (which
could be DoS via OOM). If that's the case, then taking the limit from
the kernel binary itself will miss this point (especially in case of
pygrub or similar, but there may be other cases of not-fully-trusted
kernel binaries).

I realize replacing one arbitrary number with another is not really
future-proof, but also the last one lasted for over 10 years, so maybe
it isn't really a big issue.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Linux 6.12-rc2: xen-privcmd cannot be loaded in domU anymore

2024-10-08 Thread Marek Marczykowski-Górecki

Hi,

It looks like xen-privcmd now depends on xen-pciback, and the latter
(expectedly) fails to load in domU with -ENODEV. But that prevents
loading xen-privcmd too. And that's bad.

The dependency looks to be introduced by this commit:

commit 2fae6bb7be320270801b3c3b040189bd7daa8056
Author: Jiqian Chen 
Date:   Tue Sep 24 14:14:37 2024 +0800

xen/privcmd: Add new syscall to get gsi from dev

On PVH dom0, when passthrough a device to domU, QEMU and xl tools
want to use gsi number to do pirq mapping, see QEMU code
xen_pt_realize->xc_physdev_map_pirq, and xl code
pci_add_dm_done->xc_physdev_map_pirq, but in current codes, the gsi
number is got from file /sys/bus/pci/devices//irq, that is
wrong, because irq is not equal with gsi, they are in different
spaces, so pirq mapping fails.
And in current linux codes, there is no method to get gsi
for userspace.

For above purpose, record gsi of pcistub devices when init
pcistub and add a new syscall into privcmd to let userspace
can get gsi when they have a need.

Signed-off-by: Jiqian Chen 
Signed-off-by: Huang Rui 
Signed-off-by: Jiqian Chen 
Reviewed-by: Stefano Stabellini 
Message-ID: <20240924061437.2636766-4-jiqian.c...@amd.com>
Signed-off-by: Juergen Gross 


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

[PATCH] tools/xg: increase LZMA_BLOCK_SIZE for uncompressing the kernel

2024-10-08 Thread Marek Marczykowski-Górecki

Linux 6.12-rc2 fails to decompress with the current 128MiB, contrary to
the code comment. It results in a failure like this:

domainbuilder: detail: xc_dom_kernel_file: 
filename="/var/lib/qubes/vm-kernels/6.12-rc2-1.1.fc37/vmlinuz"
domainbuilder: detail: xc_dom_malloc_filemap: 12104 kB
domainbuilder: detail: xc_dom_module_file: 
filename="/var/lib/qubes/vm-kernels/6.12-rc2-1.1.fc37/initramfs"
domainbuilder: detail: xc_dom_malloc_filemap: 7711 kB
domainbuilder: detail: xc_dom_boot_xen_init: ver 4.19, caps xen-3.0-x86_64 
hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
domainbuilder: detail: xc_dom_parse_image: called
domainbuilder: detail: xc_dom_find_loader: trying multiboot-binary loader 
...
domainbuilder: detail: loader probe failed
domainbuilder: detail: xc_dom_find_loader: trying HVM-generic loader ...
domainbuilder: detail: loader probe failed
domainbuilder: detail: xc_dom_find_loader: trying Linux bzImage loader ...
domainbuilder: detail: _xc_try_lzma_decode: XZ decompression error: Memory 
usage limit reached
xc: error: panic: xg_dom_bzimageloader.c:761: xc_dom_probe_bzimage_kernel 
unable to XZ decompress kernel: Invalid kernel
domainbuilder: detail: loader probe failed
domainbuilder: detail: xc_dom_find_loader: trying ELF-generic loader ...
domainbuilder: detail: loader probe failed
xc: error: panic: xg_dom_core.c:689: xc_dom_find_loader: no loader found: 
Invalid kernel
libxl: error: libxl_dom.c:566:libxl__build_dom: xc_dom_parse_image failed

The important part: XZ decompression error: Memory usage limit reached

This looks to be related to the following change in Linux:
8653c909922743bceb4800e5cc26087208c9e0e6 ("xz: use 128 MiB dictionary and force 
single-threaded mode")

Fix this by increasing the block size to 256MiB. And remove the
misleading comment (from lack of better ideas).

Signed-off-by: Marek Marczykowski-Górecki 
---
 tools/libs/guest/xg_dom_bzimageloader.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/libs/guest/xg_dom_bzimageloader.c 
b/tools/libs/guest/xg_dom_bzimageloader.c
index c6ee6d83e7c6..1fb4e5a1f728 100644
--- a/tools/libs/guest/xg_dom_bzimageloader.c
+++ b/tools/libs/guest/xg_dom_bzimageloader.c
@@ -272,8 +272,7 @@ static int _xc_try_lzma_decode(
 return retval;
 }
 
-/* 128 Mb is the minimum size (half-way) documented to work for all inputs. */
-#define LZMA_BLOCK_SIZE (128*1024*1024)
+#define LZMA_BLOCK_SIZE (256*1024*1024)
 
 static int xc_try_xz_decode(
 struct xc_dom_image *dom, void **blob, size_t *size)
-- 
2.46.0

Re: [XEN PATCH v2 1/3] EFI: address a violation of MISRA C Rule 13.6

2024-10-08 Thread Marek Marczykowski-Górecki

 > > > Proceeding with the series in its current form may be okay (as you say,
> > > > you view the changes as readability improvements anyway), but imo the
> > > > interpretation needs settling on no matter what. In fact even for these
> > > > two patches it may affect what their descriptions ought to say (would
> > > > be nice imo to avoid permanently recording potentially misleading
> > > > information by committing as is). And of course clarity would help
> > > > dealing with future instances that might appear. I take it you realize
> > > > that if someone had submitted a patch adding code similar to the
> > > > original forms of what's being altered here, it would be relatively
> > > > unlikely for a reviewer to spot the issue. IOW here we're making
> > > > ourselves heavily dependent upon Eclair spotting (supposed) issues,
> > > > adding extra work and delays for such changes to go in.
> > > 
> > > You can do two things to obtain a second opinion:
> > > 
> > > 1) Use the MISRA forum (here is the link to the forum
> > >  section devoted to the side-effect rules of MISRA C:2012
> > >  and MISRA C:2023 
> > > (https://forum.misra.org.uk/forumdisplay.php?fid=168).
> > >  The MISRA C Working Group will, in due course, provide
> > >  you with an official answer to your questions about what,
> > >  for the interpretation of Rule 13.6, has to be considered
> > >  a side effect.
> > > 
> > > 2) Reach out to your ISO National Body and try to obtain
> > >  an official answer from ISO/IEC JTC1/SC22/WG14 (the
> > >  international standardization working group for the
> > >  programming language C) to your questions about what the
> > >  C Standard considers to be side effects.
> > 
> > I took the latter route, and to my (positive) surprise I got back an answer
> > the same day. There was a request for someone to confirm, but so far I 
> > didn't
> > see further replies. Since this is a German institution I raised the 
> > question
> > in German and got back an answer in German (attached); I've tried my best to
> > translate this without falsifying anything, but I've omitted non-technical
> > parts:
> > 
> > "Initialization of an object is never a side effect of the initialization
> > by itself. Of course expressions used for initialization can themselves have
> > side effects on other objects.
> > 
> > @Andreas: Do you agree? C after all has a far simpler object model than C++.
> > The (potential) change in object representation (i.e. the bytes underlying
> > the object) is not a side effect of object initialization, but its primary
> > purpose."
> > 
> > Further for Misra she added a reference to a Swiss person, but I think with
> > Bugseng we have sufficient expertise there.
> 
> Unfortunately, the (translation of the) answer you received adds
> confusion to previous confusion: who answered has highlighted the
> "side" part of the term, which is indeed relevant in computer science,
> but not for the C standard.  To the point that the same argument could
> be used to deny that ++i has a side effect because the increment is
> the "primary" effect...
> 
> Part of the confusion is maybe in the the following paragraph Jan
> wrote earlier:
> 
> > Hmm, that's interesting: There's indeed an example with an initializer
> > there. Yet to me the text you quote from the C standard does not say
> > that initialization is a side effect - it would be "modifying an
> > object" aiui, yet ahead of initialization being complete the object
> > doesn't "exist" imo, and hence can be "modified" only afterwards.
> 
> In C, it is not true that the object does not exist ahead of
> initialization.  Try the following:
> 
> extern int f(int* p);
> 
> int main() {
>   int i = f(&i);
> }

This is interesting discussion, but I don't think it needs to block
anything. The proposed change doesn't violate any other rule/code style,
and I'd argue it's more readable. Taking more strict interpretation in
this case doesn't really hurt. I already acked this patch.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [XEN PATCH 10/12] efi: address violation of MISRA C Rule 16.3

2024-10-04 Thread Marek Marczykowski-Górecki

On Tue, Sep 10, 2024 at 12:09:02PM +0200, Federico Serafini wrote:
> Use agreed syntax for pseudo-keyword fallthrough to meet the
> requirements to deviate a violation of MISRA C:2012 Rule 16.3:
> "An unconditional `break' statement shall terminate every
> switch-clause".
> 
> No functional change.
> 
> Signed-off-by: Federico Serafini 

Acked-by: Marek Marczykowski-Górecki 

> ---
>  xen/common/efi/runtime.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
> index d03e5c90ce..376fcbd8e1 100644
> --- a/xen/common/efi/runtime.c
> +++ b/xen/common/efi/runtime.c
> @@ -691,7 +691,7 @@ int efi_runtime_call(struct xenpf_efi_runtime_call *op)
>  
>  if ( !efi_enabled(EFI_RS) || (efi_rs->Hdr.Revision >> 16) < 2 )
>  return -EOPNOTSUPP;
> -/* XXX fall through for now */
> +fallthrough; /* XXX fall through for now */
>  default:
>  return -ENOSYS;
>  }
> -- 
> 2.34.1
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [XEN PATCH v2 1/3] EFI: address a violation of MISRA C Rule 13.6

2024-10-04 Thread Marek Marczykowski-Górecki

On Wed, Oct 02, 2024 at 05:36:47PM -0700, Stefano Stabellini wrote:
> On Mon, 30 Sep 2024, Federico Serafini wrote:
> > guest_handle_ok()'s expansion contains a sizeof() involving its
> > first argument which is guest_handle_cast().
> > The expansion of the latter, in turn, contains a variable
> > initialization.
> > 
> > Since MISRA considers the initialization (even of a local variable)
> > a side effect, the chain of expansions mentioned above violates
> > MISRA C:2012 Rule 13.6 (The operand of the `sizeof' operator shall not
> > contain any expression which has potential side effect).
> > 
> > Refactor the code to address the rule violation.
> > 
> > Suggested-by: Andrew Cooper 
> > Signed-off-by: Federico Serafini 
> 
> Reviewed-by: Stefano Stabellini 

Acked-by: Marek Marczykowski-Górecki 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

xenvbd driver modifies in-flight data?

2024-10-04 Thread Marek Marczykowski-Górecki

Hi,

I've got a bit worrying report[1], that using Windows PV block driver,
together with backing file opened with O_DIRECT on BTRFS results in
checksum errors. At this point it's unclear what exactly happens there,
it could be some BTRFS issue, some xen-blkback issue, but since
according to reports it happens only with Windows DomU, the most likely
culprit is PV driver inside Windows DomU. It was detected on BTRFS
because it does checksumming, but it isn't clear whether other file
systems are not affected too. 

I guess using O_DIRECT is asking for troubles here, but nevertheless, it
looks like it might have uncovered some issue that would be silently
ignored otherwise.

[1] https://github.com/QubesOS/qubes-issues/issues/9488#issuecomment-2389152014

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

[PATCH v2 1/3] automation: preserve built xen.efi

2024-10-03 Thread Marek Marczykowski-Górecki

It will be useful for further tests.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v2:
- deduplicate via collect_xen_artifacts function
---
 automation/scripts/build | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/automation/scripts/build b/automation/scripts/build
index b3c71fb6fb60..302feeed2eea 100755
--- a/automation/scripts/build
+++ b/automation/scripts/build
@@ -41,19 +41,29 @@ cp xen/.config xen-config
 # Directory for the artefacts to be dumped into
 mkdir -p binaries
 
+collect_xen_artifacts()
+{
+local f
+for f in xen/xen xen/xen.efi; do
+if [[ -f $f ]]; then
+cp $f binaries/
+fi
+done
+}
+
 if [[ "${CPPCHECK}" == "y" ]] && [[ "${HYPERVISOR_ONLY}" == "y" ]]; then
 # Cppcheck analysis invokes Xen-only build
 xen/scripts/xen-analysis.py --run-cppcheck --cppcheck-misra -- -j$(nproc)
 
 # Preserve artefacts
-cp xen/xen binaries/xen
+collect_xen_artifacts
 cp xen/cppcheck-report/xen-cppcheck.txt xen-cppcheck.txt
 elif [[ "${HYPERVISOR_ONLY}" == "y" ]]; then
 # Xen-only build
 make -j$(nproc) xen
 
 # Preserve artefacts
-cp xen/xen binaries/xen
+collect_xen_artifacts
 else
 # Full build.  Figure out our ./configure options
 cfgargs=()
@@ -101,5 +111,5 @@ else
 # even though dist/ contains everything, while some containers don't even
 # build Xen
 cp -r dist binaries/
-if [[ -f xen/xen ]] ; then cp xen/xen binaries/xen; fi
+collect_xen_artifacts
 fi
-- 
git-series 0.9.1

[PATCH v2 2/3] automation: add a smoke test for xen.efi on X86

2024-10-03 Thread Marek Marczykowski-Górecki

Check if xen.efi is bootable with an XTF dom0.
The multiboot2+EFI path is tested on hardware tests already.

Signed-off-by: Marek Marczykowski-Górecki 
---
This requires rebuilding debian:bookworm container.

Changes in v2:
- drop forcing TEST_TIMEOUT in the script - now can be set from test.yml
  when needed; and move actually reducing timeout to separate commit
---
 automation/build/debian/bookworm.dockerfile |  1 +-
 automation/gitlab-ci/test.yaml  |  7 -
 automation/scripts/qemu-smoke-x86-64-efi.sh | 43 ++-
 3 files changed, 51 insertions(+)
 create mode 100755 automation/scripts/qemu-smoke-x86-64-efi.sh

diff --git a/automation/build/debian/bookworm.dockerfile 
b/automation/build/debian/bookworm.dockerfile
index 3dd70cb6b2e3..061114ba522d 100644
--- a/automation/build/debian/bookworm.dockerfile
+++ b/automation/build/debian/bookworm.dockerfile
@@ -46,6 +46,7 @@ RUN apt-get update && \
 # for test phase, qemu-smoke-* jobs
 qemu-system-x86 \
 expect \
+ovmf \
 # for test phase, qemu-alpine-* jobs
 cpio \
 busybox-static \
diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index e9477361955a..5687eaf9143d 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -463,6 +463,13 @@ qemu-smoke-x86-64-clang-pvh:
   needs:
 - debian-bookworm-clang-debug
 
+qemu-smoke-x86-64-gcc-efi:
+  extends: .qemu-x86-64
+  script:
+- ./automation/scripts/qemu-smoke-x86-64-efi.sh pv 2>&1 | tee ${LOGFILE}
+  needs:
+- debian-bookworm-gcc-debug
+
 qemu-smoke-riscv64-gcc:
   extends: .qemu-riscv64
   script:
diff --git a/automation/scripts/qemu-smoke-x86-64-efi.sh 
b/automation/scripts/qemu-smoke-x86-64-efi.sh
new file mode 100755
index ..7572722be6e5
--- /dev/null
+++ b/automation/scripts/qemu-smoke-x86-64-efi.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+
+set -ex -o pipefail
+
+# variant should be either pv or pvh
+variant=$1
+
+# Clone and build XTF
+git clone https://xenbits.xen.org/git-http/xtf.git
+cd xtf && make -j$(nproc) && cd -
+
+case $variant in
+pvh) k=test-hvm64-exampleextra="dom0-iommu=none dom0=pvh" ;;
+*)   k=test-pv64-example extra= ;;
+esac
+
+mkdir -p boot-esp/EFI/BOOT
+cp binaries/xen.efi boot-esp/EFI/BOOT/BOOTX64.EFI
+cp xtf/tests/example/$k boot-esp/EFI/BOOT/kernel
+
+cat > boot-esp/EFI/BOOT/BOOTX64.cfg <

[PATCH v2 0/3] automation: add smoke test for xen.efi on x86_64

2024-10-03 Thread Marek Marczykowski-Górecki

Marek Marczykowski-Górecki (3):
  automation: preserve built xen.efi
  automation: add a smoke test for xen.efi on X86
  automation: shorten the timeout for smoke tests

 automation/build/debian/bookworm.dockerfile |  1 +-
 automation/gitlab-ci/test.yaml  | 20 --
 automation/scripts/build| 16 ++--
 automation/scripts/qemu-smoke-x86-64-efi.sh | 43 ++-
 4 files changed, 73 insertions(+), 7 deletions(-)
 create mode 100755 automation/scripts/qemu-smoke-x86-64-efi.sh

base-commit: d82e0e094e7a07353ba0fb35732724316c2ec2f6
-- 
git-series 0.9.1

[PATCH v2 3/3] automation: shorten the timeout for smoke tests

2024-10-03 Thread Marek Marczykowski-Górecki

The smoke tests when successful complete in about 5s. Don't waste
20min+ on failure, shorten the timeout to 120s

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/gitlab-ci/test.yaml | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 5687eaf9143d..b27c2be17487 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -53,6 +53,11 @@
   tags:
 - x86_64
 
+.qemu-smoke-x86-64:
+  extends: .qemu-x86-64
+  variables:
+TEST_TIMEOUT_OVERRIDE: 120
+
 .qemu-riscv64:
   extends: .test-jobs-common
   variables:
@@ -436,35 +441,35 @@ qemu-alpine-x86_64-gcc:
 - alpine-3.18-gcc
 
 qemu-smoke-x86-64-gcc:
-  extends: .qemu-x86-64
+  extends: .qemu-smoke-x86-64
   script:
 - ./automation/scripts/qemu-smoke-x86-64.sh pv 2>&1 | tee ${LOGFILE}
   needs:
 - debian-bookworm-gcc-debug
 
 qemu-smoke-x86-64-clang:
-  extends: .qemu-x86-64
+  extends: .qemu-smoke-x86-64
   script:
 - ./automation/scripts/qemu-smoke-x86-64.sh pv 2>&1 | tee ${LOGFILE}
   needs:
 - debian-bookworm-clang-debug
 
 qemu-smoke-x86-64-gcc-pvh:
-  extends: .qemu-x86-64
+  extends: .qemu-smoke-x86-64
   script:
 - ./automation/scripts/qemu-smoke-x86-64.sh pvh 2>&1 | tee ${LOGFILE}
   needs:
 - debian-bookworm-gcc-debug
 
 qemu-smoke-x86-64-clang-pvh:
-  extends: .qemu-x86-64
+  extends: .qemu-smoke-x86-64
   script:
 - ./automation/scripts/qemu-smoke-x86-64.sh pvh 2>&1 | tee ${LOGFILE}
   needs:
 - debian-bookworm-clang-debug
 
 qemu-smoke-x86-64-gcc-efi:
-  extends: .qemu-x86-64
+  extends: .qemu-smoke-x86-64
   script:
 - ./automation/scripts/qemu-smoke-x86-64-efi.sh pv 2>&1 | tee ${LOGFILE}
   needs:
-- 
git-series 0.9.1

Re: [PATCH] automation: introduce TEST_TIMEOUT_OVERRIDE

2024-10-03 Thread Marek Marczykowski-Górecki

On Thu, Oct 03, 2024 at 01:22:51PM -0700, Stefano Stabellini wrote:
> TEST_TIME is set as a CI/CD project variable, as it should be, to match
> the capability and speed of the testing infrastructure.
> 
> As it turns out, TEST_TIME defined in test.yaml cannot override
> TEST_TIME defined as CI/CD project variable. As a consequence, today the
> TEST_TIME setting in test.yaml for the Xilinx jobs is ignored.

s/TEST_TIME/\0OUT/

> 
> Instead, rename TEST_TIMEOUT to TEST_TIMEOUT_OVERRIDE in test.yaml and
> check for TEST_TIMEOUT_OVERRIDE first in console.exp.
> 
> Signed-off-by: Stefano Stabellini 

with commit message fixed:
Reviewed-by: Marek Marczykowski-Górecki 

> diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
> index 8675016b6a..e947736195 100644
> --- a/automation/gitlab-ci/test.yaml
> +++ b/automation/gitlab-ci/test.yaml
> @@ -84,7 +84,7 @@
>variables:
>  CONTAINER: ubuntu:xenial-xilinx
>  LOGFILE: qemu-smoke-xilinx.log
> -TEST_TIMEOUT: 120
> +TEST_TIMEOUT_OVERRIDE: 120
>artifacts:
>  paths:
>- smoke.serial
> @@ -104,7 +104,7 @@
>  LOGFILE: xilinx-smoke-x86_64.log
>  XEN_CMD_CONSOLE: "console=com2 com2=57600,8n1,0x2F8,4"
>  TEST_BOARD: "crater"
> -TEST_TIMEOUT: 1000
> +TEST_TIMEOUT_OVERRIDE: 1000
>artifacts:
>  paths:
>- smoke.serial
> diff --git a/automation/scripts/console.exp b/automation/scripts/console.exp
> index f538aa6bd0..310543c33e 100755
> --- a/automation/scripts/console.exp
> +++ b/automation/scripts/console.exp
> @@ -1,6 +1,8 @@
>  #!/usr/bin/expect -f
>  
> -if {[info exists env(TEST_TIMEOUT)]} {
> +if {[info exists env(TEST_TIMEOUT_OVERRIDE)]} {
> +set timeout $env(TEST_TIMEOUT_OVERRIDE)
> +} elseif {[info exists env(TEST_TIMEOUT)]} {
>  set timeout $env(TEST_TIMEOUT)
>  } else {
>  set timeout 1500

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v7 1/2] x86/boot: Rewrite EFI/MBI2 code partly in C

2024-10-03 Thread Marek Marczykowski-Górecki

On Wed, Oct 02, 2024 at 10:31:50AM -0400, Daniel P. Smith wrote:
> On 10/1/24 06:22, Frediano Ziglio wrote:
> > No need to have it coded in assembly.
> > Declare efi_multiboot2 in a new header to reuse between implementations
> > and caller.
> > 
> > Signed-off-by: Frediano Ziglio 
> 
> I unfortunately do not have time to test this myself, but I have given a
> read through and it looks good to me. I will give it an R-b and let Marek
> provide the A-b when he is comfortable that CI failure is an artifact of the
> test system and not this series.
> 
> Reviewed-by: Daniel P. Smith 

Since it seems it's only the other patch causing issues, for this one:

Acked-by: Marek Marczykowski-Górecki 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v7 0/2] x86/boot: Reduce assembly code

2024-10-03 Thread Marek Marczykowski-Górecki

On Thu, Oct 03, 2024 at 10:27:15AM +0100, Frediano Ziglio wrote:
> On Thu, Oct 3, 2024 at 2:11 AM Marek Marczykowski-Górecki
>  wrote:
> >
> > On Wed, Oct 02, 2024 at 04:27:19PM +0100, Frediano Ziglio wrote:
> > > On Wed, Oct 2, 2024 at 3:04 PM Marek Marczykowski-Górecki
> > >  wrote:
> > > >
> > > > On Tue, Oct 01, 2024 at 11:22:37AM +0100, Frediano Ziglio wrote:
> > > > > This series came from part of the work of removing duplications 
> > > > > between
> > > > > boot code and rewriting part of code from assembly to C.
> > > > > Rewrites EFI code in pure C.
> > > >
> > > > The MB2+EFI tests on Adler Lake fail with this series:
> > > > https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1478766782
> > > > Looking at the VGA output (unfortunately not collected by the test
> > > > itself) it hangs just after bootloader, before printing anything on the
> > > > screen (or even clearing it after bootloader). The serial is silent too.
> > > >
> > >
> > > I tried multiple times to take a look at the logs, but I keep getting 
> > > error 500.
> >
> > 500? That's weird. Anyway, serial log is empty, so you haven't lost
> > much.
> 
> I'm getting pretty consistently. I can see the full pipeline, but not
> the single logs. I tried with both Firefox and Chrome, I also tried
> from home (just to check for something like firewall issues), always
> error 500.
> 
> > But also, I've ran it a couple more times and it is some regression.
> > Staging reliably passes while staging+this series fails.
> >
> > Unfortunately I don't have any more info besides it hangs before
> > printing anything on serial or VGA. Maybe it hanging only on Intel but
> > not AMD is some hint? Or maybe it's some memory layout or firmware
> > differences that matter here (bootloader is exactly the same)?
> > FWIW some links:
> > successful staging run on ADL: 
> > https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7980146338
> > failed staging+this run on ADL: 
> > https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7980330394
> > successful staging run on Zen3+: 
> > https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7980146359
> > successful staging+this run on Zen3+: 
> > https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7980146359
> >
> 
> Furthermore, I tried with 2 additional machines in our Lab, one Intel,
> another AMD, both working for me.
> Either your compiler did something different or something special on
> the firmware.
> 
> I could try downloading your executables and machines there, but as I
> said, I'm getting error 500 (not sure if we package artifacts).
> 
> Can you try without last commit ?

Yes, this seems to work:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1480052345

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v7 0/2] x86/boot: Reduce assembly code

2024-10-02 Thread Marek Marczykowski-Górecki

On Wed, Oct 02, 2024 at 04:27:19PM +0100, Frediano Ziglio wrote:
> On Wed, Oct 2, 2024 at 3:04 PM Marek Marczykowski-Górecki
>  wrote:
> >
> > On Tue, Oct 01, 2024 at 11:22:37AM +0100, Frediano Ziglio wrote:
> > > This series came from part of the work of removing duplications between
> > > boot code and rewriting part of code from assembly to C.
> > > Rewrites EFI code in pure C.
> >
> > The MB2+EFI tests on Adler Lake fail with this series:
> > https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1478766782
> > Looking at the VGA output (unfortunately not collected by the test
> > itself) it hangs just after bootloader, before printing anything on the
> > screen (or even clearing it after bootloader). The serial is silent too.
> >
> 
> I tried multiple times to take a look at the logs, but I keep getting error 
> 500.

500? That's weird. Anyway, serial log is empty, so you haven't lost
much.
But also, I've ran it a couple more times and it is some regression.
Staging reliably passes while staging+this series fails.

Unfortunately I don't have any more info besides it hangs before
printing anything on serial or VGA. Maybe it hanging only on Intel but
not AMD is some hint? Or maybe it's some memory layout or firmware
differences that matter here (bootloader is exactly the same)?
FWIW some links:
successful staging run on ADL: 
https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7980146338
failed staging+this run on ADL: 
https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7980330394
successful staging run on Zen3+: 
https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7980146359
successful staging+this run on Zen3+: 
https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7980146359

> > It does pass on Zen 3+ runners.
> >
> > Since there were some issues with the ADL runner today on plain staging,
> > I'm not 100% sure if it isn't some infrastructure issue yet. But the
> > symptoms look different than usual infra issues (and different than
> > todays failures on staging), so I think it's more likely an issue with
> > the patches here.
> >
> > > Changes since v1, more details in specific commits:
> > > - style updates;
> > > - comments and descriptions improvements;
> > > - other improvements.
> > >
> > > Changes since v2:
> > > - rebased on master, resolved conflicts;
> > > - add comment on trampoline section.
> > >
> > > Changes since v3:
> > > - changed new function name;
> > > - declare efi_multiboot2 in a separate header;
> > > - distinguish entry point from using magic number;
> > > - other minor changes (see commens in commits).
> > >
> > > Changes since v4:
> > > - rebase on staging;
> > > - set %fs and %gs as other segment registers;
> > > - style and other changes.
> > >
> > > Changes since v5:
> > > - fixed a typo.
> > >
> > > Changes since v6:
> > > - remove merged patch;
> > > - comment and style;
> > > - change some pointer checks to avoid overflows;
> > > - rename parse-mbi2.c to mbi2.c.
> > >
> > > Frediano Ziglio (2):
> > >   x86/boot: Rewrite EFI/MBI2 code partly in C
> > >   x86/boot: Improve MBI2 structure check
> > >
> > >  xen/arch/x86/boot/head.S   | 146 +++----------
> > >  xen/arch/x86/efi/Makefile  |   1 +
> > >  xen/arch/x86/efi/efi-boot.h|   7 +-
> > >  xen/arch/x86/efi/mbi2.c|  66 +++
> > >  xen/arch/x86/efi/stub.c|  10 +--
> > >  xen/arch/x86/include/asm/efi.h |  18 
> > >  6 files changed, 123 insertions(+), 125 deletions(-)
> > >  create mode 100644 xen/arch/x86/efi/mbi2.c
> > >  create mode 100644 xen/arch/x86/include/asm/efi.h
> > >
> 
> Frediano

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH 2/2] automation: add a smoke test for xen.efi on X86

2024-10-02 Thread Marek Marczykowski-Górecki

On Wed, Oct 02, 2024 at 04:30:25PM -0700, Stefano Stabellini wrote:
> On Thu, 3 Oct 2024, Marek Marczykowski-Górecki wrote:
> > The problem is this doesn't work. The group-level variable overrides the
> > one in yaml. See the commit message and the link there...
> 
> Now I understand the problem, well spotted, thanks!
> 
> The idea behind having TEST_TIMEOUT defined as a project CI/CD variable
> is that it depends on the test infrastructure rather than the Xen code.
> So if we suddenly had slower runners we could change TEST_TIMEOUT
> without having to change the Xen code itself. So I think we should keep
> TEST_TIMEOUT as a project CI/CD variable.
> 
> I am not a fan of overwriting the TEST_TIMEOUT variable in the test
> scripts, because one test script can be used for multiple different
> tests, possibly even with different runners. For instance
> qubes-x86-64.sh works with a couple of different hardware runners that
> could have different timeout values. But I think it would work OK for
> now for our hardware-based tests (e.g. xilinx-smoke-dom0less-arm64.sh
> and qubes-x86-64.sh could overwrite TEST_TIMEOUT).
> 
> For this specific XTF test, I am not sure it is worth optimizing the
> timeout, maybe we should leave it as default. 

The default of 25min is quite wasteful for XTF test that failed...

> However if we wanted to
> lower the timeout value, overwriting it the way you did is OKish as I
> cannot think of another way.

If we'd need this option more often, Maybe we could set
TEST_TIMEOUT_OVERRIDE in test yaml, and then test script use that (if
present) instead? Or maybe have few "classes" of timeouts set globally
(TEST_TIMEOUT_SHORT, TEST_TIMEOUT_MEDIUM, TEST_TIMEOUT_LONG? or some
better named categories). But I don't think it's worth it for this XTF
test yet.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH 2/2] automation: add a smoke test for xen.efi on X86

2024-10-02 Thread Marek Marczykowski-Górecki

On Wed, Oct 02, 2024 at 03:22:59PM -0700, Stefano Stabellini wrote:
> I forgot to reply to one important part below
> 
> 
> On Wed, 2 Oct 2024, Stefano Stabellini wrote:
> > On Wed, 2 Oct 2024, Marek Marczykowski-Górecki wrote:
> > > Check if xen.efi is bootable with an XTF dom0.
> > > 
> > > The TEST_TIMEOUT is set in the script to override project-global value.
> > > Setting it in the gitlab yaml file doesn't work, as it's too low
> > > priority
> > > (https://docs.gitlab.com/ee/ci/variables/#cicd-variable-precedence).
> > > 
> > > The multiboot2+EFI path is tested on hardware tests already.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki 
> > > 
> > > ---
> > > This requires rebuilding debian:bookworm container.
> > > 
> > > The TEST_TIMEOUT issue mentioned above applies to xilix-* jobs too. It's
> > > not clear to me why the default TEST_TIMEOUT is set at the group level
> > > instead of in the yaml file, so I'm not adjusting the other places.
> > 
> > Let me premise that now that we use "expect" all successful tests will
> > terminate as soon as the success condition is met, without waiting for
> > the test timeout to expire.
> > 
> > There is a CI/CD variable called TEST_TIMEOUT set at the
> > gitlab.com/xen-project level. (There is also a check in console.exp in
> > case TEST_TIMEOUT is not set so that we don't run into problems in case
> > the CI/CD variable is removed accidentally.) The global TEST_TIMEOUT is
> > meant to be a high value to account for slow QEMU tests running
> > potentially on our slowest cloud runners.
> > 
> > However, for hardware-based tests such as the xilinx-* jobs, we know
> > that the timeout is supposed to be less than that. The test is running
> > on real hardware which is considerably faster than QEMU running on our
> > slowest runners. Basically, the timeout depends on the runner more than
> > the test. So we override the TEST_TIMEOUT variable for the xilinx-* jobs
> > providing a lower timeout value.
> > 
> > The global TEST_TIMEOUT is set to 1500.
> > The xilinx-* timeout is set to 120 for ARM and 1000 for x86.
> > 
> > You are welcome to override the TEST_TIMEOUT value for the
> > hardware-based QubesOS tests. At the same time, given that on success
> > the timeout is not really used, it is also OK to leave it like this.
>  
>  
> > > ---
> > >  automation/build/debian/bookworm.dockerfile |  1 +
> > >  automation/gitlab-ci/test.yaml  |  7 
> > >  automation/scripts/qemu-smoke-x86-64-efi.sh | 44 +
> > >  3 files changed, 52 insertions(+)
> > >  create mode 100755 automation/scripts/qemu-smoke-x86-64-efi.sh
> > > 
> > > diff --git a/automation/build/debian/bookworm.dockerfile 
> > > b/automation/build/debian/bookworm.dockerfile
> > > index 3dd70cb6b2e3..061114ba522d 100644
> > > --- a/automation/build/debian/bookworm.dockerfile
> > > +++ b/automation/build/debian/bookworm.dockerfile
> > > @@ -46,6 +46,7 @@ RUN apt-get update && \
> > >  # for test phase, qemu-smoke-* jobs
> > >  qemu-system-x86 \
> > >  expect \
> > > +ovmf \
> > >  # for test phase, qemu-alpine-* jobs
> > >  cpio \
> > >  busybox-static \
> > > diff --git a/automation/gitlab-ci/test.yaml 
> > > b/automation/gitlab-ci/test.yaml
> > > index 8675016b6a37..74fd3f3109ae 100644
> > > --- a/automation/gitlab-ci/test.yaml
> > > +++ b/automation/gitlab-ci/test.yaml
> > > @@ -463,6 +463,13 @@ qemu-smoke-x86-64-clang-pvh:
> > >needs:
> > >  - debian-bookworm-clang-debug
> > >  
> > > +qemu-smoke-x86-64-gcc-efi:
> > > +  extends: .qemu-x86-64
> > > +  script:
> > > +- ./automation/scripts/qemu-smoke-x86-64-efi.sh pv 2>&1 | tee 
> > > ${LOGFILE}
> > > +  needs:
> > > +- debian-bookworm-gcc-debug
> > 
> > Given that the script you wrote (thank you!) can also handle pvh, can we
> > directly add a pvh job to test.yaml too?

I guess we can, but is xen.efi + PVH dom0 actually different enough to
worth testing given we already test MB2+EFI + PVH dom0?

> > >  qemu-smoke-riscv64-gcc:
> > >extends: .qemu-riscv64
> > >script:
> > > diff --git a/automation/scripts/qemu-smoke-x86-64-efi.sh 
> > > b/automation/scripts/qemu-smoke-x86-64-efi.sh
>

Re: [PATCH 1/2] automation: preserve built xen.efi

2024-10-02 Thread Marek Marczykowski-Górecki

On Wed, Oct 02, 2024 at 09:42:13PM +0100, Andrew Cooper wrote:
> On 02/10/2024 1:42 pm, Marek Marczykowski-Górecki wrote:
> > It will be useful for further tests.
> >
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> >  automation/scripts/build | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/automation/scripts/build b/automation/scripts/build
> > index b3c71fb6fb60..4cd41cb2c471 100755
> > --- a/automation/scripts/build
> > +++ b/automation/scripts/build
> > @@ -47,6 +47,7 @@ if [[ "${CPPCHECK}" == "y" ]] && [[ "${HYPERVISOR_ONLY}" 
> > == "y" ]]; then
> >  
> >  # Preserve artefacts
> >  cp xen/xen binaries/xen
> > +if [[ -f xen/xen.efi ]]; then cp xen/xen.efi binaries/xen.efi; fi
> 
> Wouldn't
> 
>     # Preserve xen and optionally xen.efi
>     cp -f xen/xen xen/xen.efi binaries/
> 
> do this in a more concise way?

I don't think so, `cp -f` still fails if the source cannot be found.

> Alternatively, what about this:
> 
> diff --git a/automation/scripts/build b/automation/scripts/build
> index b3c71fb6fb60..14815ea7ad9c 100755
> --- a/automation/scripts/build
> +++ b/automation/scripts/build
> @@ -41,6 +41,15 @@ cp xen/.config xen-config
>  # Directory for the artefacts to be dumped into
>  mkdir -p binaries
>  
> +collect_xen_artefacts ()
> +{
> +    for A in xen/xen xen/xen.efi; do
> +    if [[ -f $A ]]; then
> +    cp $A binaries/
> +    fi
> +    done
> +}
> +
>  if [[ "${CPPCHECK}" == "y" ]] && [[ "${HYPERVISOR_ONLY}" == "y" ]]; then
>  # Cppcheck analysis invokes Xen-only build
>  xen/scripts/xen-analysis.py --run-cppcheck --cppcheck-misra --
> -j$(nproc)
> @@ -53,7 +62,7 @@ elif [[ "${HYPERVISOR_ONLY}" == "y" ]]; then
>  make -j$(nproc) xen
>  
>  # Preserve artefacts
> -    cp xen/xen binaries/xen
> +    collect_xen_artefacts
>  else
>  # Full build.  Figure out our ./configure options
>  cfgargs=()
> 
> so we don't triplicate the handling?

That may be a better idea indeed.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v7 0/2] x86/boot: Reduce assembly code

2024-10-02 Thread Marek Marczykowski-Górecki

On Tue, Oct 01, 2024 at 11:22:37AM +0100, Frediano Ziglio wrote:
> This series came from part of the work of removing duplications between
> boot code and rewriting part of code from assembly to C.
> Rewrites EFI code in pure C.

The MB2+EFI tests on Adler Lake fail with this series:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1478766782
Looking at the VGA output (unfortunately not collected by the test
itself) it hangs just after bootloader, before printing anything on the
screen (or even clearing it after bootloader). The serial is silent too.

It does pass on Zen 3+ runners.

Since there were some issues with the ADL runner today on plain staging,
I'm not 100% sure if it isn't some infrastructure issue yet. But the
symptoms look different than usual infra issues (and different than
todays failures on staging), so I think it's more likely an issue with
the patches here.

> Changes since v1, more details in specific commits:
> - style updates;
> - comments and descriptions improvements;
> - other improvements.
> 
> Changes since v2:
> - rebased on master, resolved conflicts;
> - add comment on trampoline section.
> 
> Changes since v3:
> - changed new function name;
> - declare efi_multiboot2 in a separate header;
> - distinguish entry point from using magic number;
> - other minor changes (see commens in commits).
> 
> Changes since v4:
> - rebase on staging;
> - set %fs and %gs as other segment registers;
> - style and other changes.
> 
> Changes since v5:
> - fixed a typo.
> 
> Changes since v6:
> - remove merged patch;
> - comment and style;
> - change some pointer checks to avoid overflows;
> - rename parse-mbi2.c to mbi2.c.
> 
> Frediano Ziglio (2):
>   x86/boot: Rewrite EFI/MBI2 code partly in C
>   x86/boot: Improve MBI2 structure check
> 
>  xen/arch/x86/boot/head.S   | 146 +++--
>  xen/arch/x86/efi/Makefile  |   1 +
>  xen/arch/x86/efi/efi-boot.h|   7 +-
>  xen/arch/x86/efi/mbi2.c|  66 +++
>  xen/arch/x86/efi/stub.c|  10 +--
>  xen/arch/x86/include/asm/efi.h |  18 
>  6 files changed, 123 insertions(+), 125 deletions(-)
>  create mode 100644 xen/arch/x86/efi/mbi2.c
>  create mode 100644 xen/arch/x86/include/asm/efi.h
> 
> -- 
> 2.34.1
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH] automation: add a smoke test for xen.efi on X86

2024-10-02 Thread Marek Marczykowski-Górecki

On Wed, Oct 02, 2024 at 02:41:55PM +0200, Marek Marczykowski-Górecki wrote:
> Check if xen.efi is bootable with an XTF dom0.
> 
> The TEST_TIMEOUT is set in the script to override project-global value.
> Setting it in the gitlab yaml file doesn't work, as it's too low
> priority
> (https://docs.gitlab.com/ee/ci/variables/#cicd-variable-precedence).
> 
> The multiboot2+EFI path is tested on hardware tests already.
> 
> Signed-off-by: Marek Marczykowski-Górecki 

Sorry, forgot a prerequisite patch - resent now as 2 patch series.

> ---
> This requires rebuilding debian:bookworm container.
> 
> The TEST_TIMEOUT issue mentioned above applies to xilix-* jobs too. It's
> not clear to me why the default TEST_TIMEOUT is set at the group level
> instead of in the yaml file, so I'm not adjusting the other places.
> ---
>  automation/build/debian/bookworm.dockerfile |  1 +
>  automation/gitlab-ci/test.yaml  |  7 
>  automation/scripts/qemu-smoke-x86-64-efi.sh | 44 +
>  3 files changed, 52 insertions(+)
>  create mode 100755 automation/scripts/qemu-smoke-x86-64-efi.sh
> 
> diff --git a/automation/build/debian/bookworm.dockerfile 
> b/automation/build/debian/bookworm.dockerfile
> index 3dd70cb6b2e3..061114ba522d 100644
> --- a/automation/build/debian/bookworm.dockerfile
> +++ b/automation/build/debian/bookworm.dockerfile
> @@ -46,6 +46,7 @@ RUN apt-get update && \
>  # for test phase, qemu-smoke-* jobs
>  qemu-system-x86 \
>  expect \
> +ovmf \
>  # for test phase, qemu-alpine-* jobs
>  cpio \
>  busybox-static \
> diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
> index 8675016b6a37..74fd3f3109ae 100644
> --- a/automation/gitlab-ci/test.yaml
> +++ b/automation/gitlab-ci/test.yaml
> @@ -463,6 +463,13 @@ qemu-smoke-x86-64-clang-pvh:
>needs:
>  - debian-bookworm-clang-debug
>  
> +qemu-smoke-x86-64-gcc-efi:
> +  extends: .qemu-x86-64
> +  script:
> +- ./automation/scripts/qemu-smoke-x86-64-efi.sh pv 2>&1 | tee ${LOGFILE}
> +  needs:
> +- debian-bookworm-gcc-debug
> +
>  qemu-smoke-riscv64-gcc:
>extends: .qemu-riscv64
>script:
> diff --git a/automation/scripts/qemu-smoke-x86-64-efi.sh 
> b/automation/scripts/qemu-smoke-x86-64-efi.sh
> new file mode 100755
> index ..e053cfa995ba
> --- /dev/null
> +++ b/automation/scripts/qemu-smoke-x86-64-efi.sh
> @@ -0,0 +1,44 @@
> +#!/bin/bash
> +
> +set -ex -o pipefail
> +
> +# variant should be either pv or pvh
> +variant=$1
> +
> +# Clone and build XTF
> +git clone https://xenbits.xen.org/git-http/xtf.git
> +cd xtf && make -j$(nproc) && cd -
> +
> +case $variant in
> +pvh) k=test-hvm64-exampleextra="dom0-iommu=none dom0=pvh" ;;
> +*)   k=test-pv64-example extra= ;;
> +esac
> +
> +mkdir -p boot-esp/EFI/BOOT
> +cp binaries/xen.efi boot-esp/EFI/BOOT/BOOTX64.EFI
> +cp xtf/tests/example/$k boot-esp/EFI/BOOT/kernel
> +
> +cat > boot-esp/EFI/BOOT/BOOTX64.cfg < +[global]
> +default=test
> +
> +[test]
> +options=loglvl=all console=com1 noreboot console_timestamps=boot $extra
> +kernel=kernel
> +EOF
> +
> +cp /usr/share/OVMF/OVMF_CODE.fd OVMF_CODE.fd
> +cp /usr/share/OVMF/OVMF_VARS.fd OVMF_VARS.fd
> +
> +rm -f smoke.serial
> +export TEST_CMD="qemu-system-x86_64 -nographic -M q35,kernel-irqchip=split \
> +-drive if=pflash,format=raw,readonly=on,file=OVMF_CODE.fd \
> +-drive if=pflash,format=raw,file=OVMF_VARS.fd \
> +-drive file=fat:rw:boot-esp,media=disk,index=0,format=raw \
> +-m 512 -monitor none -serial stdio"
> +
> +export TEST_LOG="smoke.serial"
> +export PASSED="Test result: SUCCESS"
> +export TEST_TIMEOUT=120
> +
> +./automation/scripts/console.exp | sed 's/\r\+$//'
> -- 
> 2.46.0
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

[PATCH 2/2] automation: add a smoke test for xen.efi on X86

2024-10-02 Thread Marek Marczykowski-Górecki

Check if xen.efi is bootable with an XTF dom0.

The TEST_TIMEOUT is set in the script to override project-global value.
Setting it in the gitlab yaml file doesn't work, as it's too low
priority
(https://docs.gitlab.com/ee/ci/variables/#cicd-variable-precedence).

The multiboot2+EFI path is tested on hardware tests already.

Signed-off-by: Marek Marczykowski-Górecki 
---
This requires rebuilding debian:bookworm container.

The TEST_TIMEOUT issue mentioned above applies to xilix-* jobs too. It's
not clear to me why the default TEST_TIMEOUT is set at the group level
instead of in the yaml file, so I'm not adjusting the other places.
---
 automation/build/debian/bookworm.dockerfile |  1 +
 automation/gitlab-ci/test.yaml  |  7 
 automation/scripts/qemu-smoke-x86-64-efi.sh | 44 +
 3 files changed, 52 insertions(+)
 create mode 100755 automation/scripts/qemu-smoke-x86-64-efi.sh

diff --git a/automation/build/debian/bookworm.dockerfile 
b/automation/build/debian/bookworm.dockerfile
index 3dd70cb6b2e3..061114ba522d 100644
--- a/automation/build/debian/bookworm.dockerfile
+++ b/automation/build/debian/bookworm.dockerfile
@@ -46,6 +46,7 @@ RUN apt-get update && \
 # for test phase, qemu-smoke-* jobs
 qemu-system-x86 \
 expect \
+ovmf \
 # for test phase, qemu-alpine-* jobs
 cpio \
 busybox-static \
diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 8675016b6a37..74fd3f3109ae 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -463,6 +463,13 @@ qemu-smoke-x86-64-clang-pvh:
   needs:
 - debian-bookworm-clang-debug
 
+qemu-smoke-x86-64-gcc-efi:
+  extends: .qemu-x86-64
+  script:
+- ./automation/scripts/qemu-smoke-x86-64-efi.sh pv 2>&1 | tee ${LOGFILE}
+  needs:
+- debian-bookworm-gcc-debug
+
 qemu-smoke-riscv64-gcc:
   extends: .qemu-riscv64
   script:
diff --git a/automation/scripts/qemu-smoke-x86-64-efi.sh 
b/automation/scripts/qemu-smoke-x86-64-efi.sh
new file mode 100755
index ..e053cfa995ba
--- /dev/null
+++ b/automation/scripts/qemu-smoke-x86-64-efi.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+
+set -ex -o pipefail
+
+# variant should be either pv or pvh
+variant=$1
+
+# Clone and build XTF
+git clone https://xenbits.xen.org/git-http/xtf.git
+cd xtf && make -j$(nproc) && cd -
+
+case $variant in
+pvh) k=test-hvm64-exampleextra="dom0-iommu=none dom0=pvh" ;;
+*)   k=test-pv64-example extra= ;;
+esac
+
+mkdir -p boot-esp/EFI/BOOT
+cp binaries/xen.efi boot-esp/EFI/BOOT/BOOTX64.EFI
+cp xtf/tests/example/$k boot-esp/EFI/BOOT/kernel
+
+cat > boot-esp/EFI/BOOT/BOOTX64.cfg <

[PATCH 1/2] automation: preserve built xen.efi

2024-10-02 Thread Marek Marczykowski-Górecki

It will be useful for further tests.

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/scripts/build | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/automation/scripts/build b/automation/scripts/build
index b3c71fb6fb60..4cd41cb2c471 100755
--- a/automation/scripts/build
+++ b/automation/scripts/build
@@ -47,6 +47,7 @@ if [[ "${CPPCHECK}" == "y" ]] && [[ "${HYPERVISOR_ONLY}" == 
"y" ]]; then
 
 # Preserve artefacts
 cp xen/xen binaries/xen
+if [[ -f xen/xen.efi ]]; then cp xen/xen.efi binaries/xen.efi; fi
 cp xen/cppcheck-report/xen-cppcheck.txt xen-cppcheck.txt
 elif [[ "${HYPERVISOR_ONLY}" == "y" ]]; then
 # Xen-only build
@@ -54,6 +55,7 @@ elif [[ "${HYPERVISOR_ONLY}" == "y" ]]; then
 
 # Preserve artefacts
 cp xen/xen binaries/xen
+if [[ -f xen/xen.efi ]]; then cp xen/xen.efi binaries/xen.efi; fi
 else
 # Full build.  Figure out our ./configure options
 cfgargs=()
@@ -101,5 +103,8 @@ else
 # even though dist/ contains everything, while some containers don't even
 # build Xen
 cp -r dist binaries/
-if [[ -f xen/xen ]] ; then cp xen/xen binaries/xen; fi
+if [[ -f xen/xen ]] ; then
+cp xen/xen binaries/xen
+if [[ -f xen/xen.efi ]]; then cp xen/xen.efi binaries/xen.efi; fi
+fi
 fi
-- 
2.46.0

[PATCH] automation: add a smoke test for xen.efi on X86

2024-10-02 Thread Marek Marczykowski-Górecki

Check if xen.efi is bootable with an XTF dom0.

The TEST_TIMEOUT is set in the script to override project-global value.
Setting it in the gitlab yaml file doesn't work, as it's too low
priority
(https://docs.gitlab.com/ee/ci/variables/#cicd-variable-precedence).

The multiboot2+EFI path is tested on hardware tests already.

Signed-off-by: Marek Marczykowski-Górecki 
---
This requires rebuilding debian:bookworm container.

The TEST_TIMEOUT issue mentioned above applies to xilix-* jobs too. It's
not clear to me why the default TEST_TIMEOUT is set at the group level
instead of in the yaml file, so I'm not adjusting the other places.
---
 automation/build/debian/bookworm.dockerfile |  1 +
 automation/gitlab-ci/test.yaml  |  7 
 automation/scripts/qemu-smoke-x86-64-efi.sh | 44 +
 3 files changed, 52 insertions(+)
 create mode 100755 automation/scripts/qemu-smoke-x86-64-efi.sh

diff --git a/automation/build/debian/bookworm.dockerfile 
b/automation/build/debian/bookworm.dockerfile
index 3dd70cb6b2e3..061114ba522d 100644
--- a/automation/build/debian/bookworm.dockerfile
+++ b/automation/build/debian/bookworm.dockerfile
@@ -46,6 +46,7 @@ RUN apt-get update && \
 # for test phase, qemu-smoke-* jobs
 qemu-system-x86 \
 expect \
+ovmf \
 # for test phase, qemu-alpine-* jobs
 cpio \
 busybox-static \
diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 8675016b6a37..74fd3f3109ae 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -463,6 +463,13 @@ qemu-smoke-x86-64-clang-pvh:
   needs:
 - debian-bookworm-clang-debug
 
+qemu-smoke-x86-64-gcc-efi:
+  extends: .qemu-x86-64
+  script:
+- ./automation/scripts/qemu-smoke-x86-64-efi.sh pv 2>&1 | tee ${LOGFILE}
+  needs:
+- debian-bookworm-gcc-debug
+
 qemu-smoke-riscv64-gcc:
   extends: .qemu-riscv64
   script:
diff --git a/automation/scripts/qemu-smoke-x86-64-efi.sh 
b/automation/scripts/qemu-smoke-x86-64-efi.sh
new file mode 100755
index ..e053cfa995ba
--- /dev/null
+++ b/automation/scripts/qemu-smoke-x86-64-efi.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+
+set -ex -o pipefail
+
+# variant should be either pv or pvh
+variant=$1
+
+# Clone and build XTF
+git clone https://xenbits.xen.org/git-http/xtf.git
+cd xtf && make -j$(nproc) && cd -
+
+case $variant in
+pvh) k=test-hvm64-exampleextra="dom0-iommu=none dom0=pvh" ;;
+*)   k=test-pv64-example extra= ;;
+esac
+
+mkdir -p boot-esp/EFI/BOOT
+cp binaries/xen.efi boot-esp/EFI/BOOT/BOOTX64.EFI
+cp xtf/tests/example/$k boot-esp/EFI/BOOT/kernel
+
+cat > boot-esp/EFI/BOOT/BOOTX64.cfg <

Re: [PATCH v4] Avoid crash calling PrintErrMesg from efi_multiboot2

2024-09-26 Thread Marek Marczykowski-Górecki

On Mon, Aug 19, 2024 at 03:29:52PM +0100, Frediano Ziglio wrote:
> Although code is compiled with -fpic option data is not position
> independent. This causes data pointer to become invalid if
> code is not relocated properly which is what happens for
> efi_multiboot2 which is called by multiboot entry code.
> 
> Code tested adding
>PrintErrMesg(L"Test message", EFI_BUFFER_TOO_SMALL);
> in efi_multiboot2 before calling efi_arch_edd (this function
> can potentially call PrintErrMesg).
> 
> Before the patch (XenServer installation on Qemu, xen replaced
> with vanilla xen.gz):
>   Booting `XenServer (Serial)'Booting `XenServer (Serial)'
>   Test message:  X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 
>  
>   ExceptionData -   I:0 R:0 U:0 W:0 P:0 PK:0 SS:0 SGX:0
>   RIP  - 7DC29E46, CS  - 0038, RFLAGS - 00210246
>   RAX  - , RCX - 0050, RDX - 
>   RBX  - 7DAB4558, RSP - 7EFA1200, RBP - 
>   RSI  - 82D040467A88, RDI - 
>   R8   - 7EFA1238, R9  - 7EFA1230, R10 - 
>   R11  - 7CF42665, R12 - 82D040467A88, R13 - 7EFA1228
>   R14  - 7EFA1225, R15 - 7DAB45A8
>   DS   - 0030, ES  - 0030, FS  - 0030
>   GS   - 0030, SS  - 0030
>   CR0  - 80010033, CR2 - 82D040467A88, CR3 - 7EC01000
>   CR4  - 0668, CR8 - 
>   DR0  - , DR1 - , DR2 - 
>   DR3  - , DR6 - 0FF0, DR7 - 0400
>   GDTR - 7E9E2000 0047, LDTR - 
>   IDTR - 7E4E5018 0FFF,   TR - 
>   FXSAVE_STATE - 7EFA0E60
>    Find image based on IP(0x7DC29E46) (No PDB)  
> (ImageBase=7DC28000, EntryPoint=7DC2B917) 
> 
> After the patch:
>   Booting `XenServer (Serial)'Booting `XenServer (Serial)'
>   Test message: Buffer too small
>   BdsDxe: loading Boot "UiApp" from 
> Fv(7CB8BDC9-F8EB-4F34-AAEA-3EE4AF6516A1)/FvFile(462CAA21-7614-4503-836E-8AB6F4662331)
>   BdsDxe: starting Boot "UiApp" from 
> Fv(7CB8BDC9-F8EB-4F34-AAEA-3EE4AF6516A1)/FvFile(462CAA21-7614-4503-836E-8AB6F4662331)
> 
> Fixes: 9180f5365524 ("x86: add multiboot2 protocol support for EFI platforms")
> Signed-off-by: Frediano Ziglio 

I was hoping it would fix also an issue with xen.efi as the crash is
pretty similar
(https://github.com/QubesOS/qubes-issues/issues/8206#issuecomment-2366835136),
but it seems to be something different.

Anyway, 

Acked-by: Marek Marczykowski-Górecki 

> ---
>  xen/common/efi/boot.c | 46 ++-
>  1 file changed, 32 insertions(+), 14 deletions(-)
> ---
> Changes since v1:
> - added "Fixes:" tag;
> - fixed cast style change.
> 
> Changes since v2:
> - wrap long line.
> 
> Changes since v3:
> - fixed "Fixes:" tag.
> 
> diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
> index efbec00af9..fdbe75005c 100644
> --- a/xen/common/efi/boot.c
> +++ b/xen/common/efi/boot.c
> @@ -287,19 +287,36 @@ static bool __init match_guid(const EFI_GUID *guid1, 
> const EFI_GUID *guid2)
>  /* generic routine for printing error messages */
>  static void __init PrintErrMesg(const CHAR16 *mesg, EFI_STATUS ErrCode)
>  {
> -static const CHAR16* const ErrCodeToStr[] __initconstrel = {
> -[~EFI_ERROR_MASK & EFI_NOT_FOUND]   = L"Not found",
> -[~EFI_ERROR_MASK & EFI_NO_MEDIA]= L"The device has no 
> media",
> -[~EFI_ERROR_MASK & EFI_MEDIA_CHANGED]   = L"Media changed",
> -[~EFI_ERROR_MASK & EFI_DEVICE_ERROR]= L"Device error",
> -[~EFI_ERROR_MASK & EFI_VOLUME_CORRUPTED]= L"Volume corrupted",
> -[~EFI_ERROR_MASK & EFI_ACCESS_DENIED]   = L"Access denied",
> -[~EFI_ERROR_MASK & EFI_OUT_OF_RESOURCES]= L"Out of resources",
> -[~EFI_ERROR_MASK & EFI_VOLUME_FULL] = L"Volume is full",
> -[~EFI_ERROR_MASK & EFI_SECURITY_VIOLATION]  = L"Security violation",
> -[~EFI_ERROR_MASK & EFI_CRC_ERROR]   = L"CRC error",
> -[~EFI_ERROR_MASK & EFI_COMPROMISED_DATA]= L"Compromised data",
> -[~EFI_ERROR_MASK & EFI_BUFFER_TOO_SMALL]= L"Buffer too small",
> +#define ERROR_MESSAGE_LIST \
&

Re: [PATCH v7 2/2] x86/time: prefer CMOS over EFI_GET_TIME

2024-09-17 Thread Marek Marczykowski-Górecki

On Fri, Sep 13, 2024 at 09:59:07AM +0200, Roger Pau Monne wrote:
> The EFI_GET_TIME implementation is well known to be broken for many firmware
> implementations, for Xen the result on such implementations are:
> 
> [ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
> CPU:0
> RIP:e008:[<62ccfa70>] 62ccfa70
> [...]
> Xen call trace:
>[<62ccfa70>] R 62ccfa70
>[<732e9a3f>] S 732e9a3f
>[] F arch/x86/time.c#get_cmos_time+0x1b3/0x26e
>[] F init_xen_time+0x28/0xa4
>[] F __start_xen+0x1ee7/0x2578
>[] F __high_start+0x94/0xa0
> 
> Pagetable walk from 62ccfa70:
>  L4[0x000] = 00207ef1c063 
>  L3[0x001] = 5d6c0063 
>  L2[0x116] = 800062c001e3  (PSE)
> 
> 
> Panic on CPU 0:
> FATAL PAGE FAULT
> [error_code=0011]
> Faulting linear address: 62ccfa70
> 
> 
> Swap the preference to default to CMOS first, and EFI later, in an attempt to
> use EFI_GET_TIME as a last resort option only.  Note that Linux for example
> doesn't allow calling the get_time method, and instead provides a dummy 
> handler
> that unconditionally returns EFI_UNSUPPORTED on x86-64.
> 
> Such change in the preferences requires some re-arranging of the function
> logic, so that panic messages with workaround suggestions are suitably 
> printed.
> 
> Signed-off-by: Roger Pau Monné 

Since this changes behavior for running on EFI,
Acked-by: Marek Marczykowski-Górecki 

> ---
> Changes since v2:
>  - Updated to match previous changes.
> ---
>  xen/arch/x86/time.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index e4751684951e..b86e4d58b40c 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1592,14 +1592,14 @@ static void __init probe_wallclock(void)
>  wallclock_source = WALLCLOCK_XEN;
>  return;
>  }
> -if ( efi_enabled(EFI_RS) && efi_get_time() )
> +if ( cmos_rtc_probe() )
>  {
> -wallclock_source = WALLCLOCK_EFI;
> +wallclock_source = WALLCLOCK_CMOS;
>  return;
>  }
> -if ( cmos_rtc_probe() )
> +    if ( efi_enabled(EFI_RS) && efi_get_time() )
>  {
> -wallclock_source = WALLCLOCK_CMOS;
> +wallclock_source = WALLCLOCK_EFI;
>  return;
>  }
>  
> -- 
> 2.46.0
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2024-09-16 Thread Marek Marczykowski-Górecki

On Mon, Sep 16, 2024 at 02:11:08PM +0100, Andrew Cooper wrote:
> On 13/09/2024 8:59 am, Roger Pau Monne wrote:
> > diff --git a/docs/misc/xen-command-line.pandoc 
> > b/docs/misc/xen-command-line.pandoc
> > index 959cf45b55d9..2a9b3b9b8975 100644
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -2816,6 +2816,27 @@ vwfi to `native` reduces irq latency significantly. 
> > It can also lead to
> >  suboptimal scheduling decisions, but only when the system is
> >  oversubscribed (i.e., in total there are more vCPUs than pCPUs).
> >  
> > +### wallclock (x86)
> > +> `= auto | xen | cmos | efi`
> > +
> > +> Default: `auto`
> > +
> > +Allow forcing the usage of a specific wallclock source.
> > +
> > + * `auto` let the hypervisor select the clocksource based on internal
> > +   heuristics.
> > +
> > + * `xen` force usage of the Xen shared_info wallclock when booted as a Xen
> > +   guest.  This option is only available if the hypervisor was compiled 
> > with
> > +   `CONFIG_XEN_GUEST` enabled.
> > +
> > + * `cmos` force usage of the CMOS RTC wallclock.
> > +
> > + * `efi` force usage of the EFI_GET_TIME run-time method when booted from 
> > EFI
> > +   firmware.
> 
> For both `xen` and `efi`, something should be said about "if selected
> and not satisfied, Xen falls back to other heuristics".
> 
> > +
> > +If the selected option is invalid or not available Xen will default to 
> > `auto`.
> 
> I'm afraid that I'm firmly of the opinion that "auto" on the cmdline is
> unnecessary complexity.  Auto is the default, and doesn't need
> specifying explicitly.  That in turn simplifies ...

What about overriding earlier choice? For example overriding a built-in
cmdline? That said, with the current code, the same can be achieved with
wallclock=foo, and living with the warning in boot log...

> > +
> >  ### watchdog (x86)
> >  > `= force | `
> >  
> > diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> > index 29b026735e5d..e4751684951e 100644
> > --- a/xen/arch/x86/time.c
> > +++ b/xen/arch/x86/time.c
> > @@ -1552,6 +1552,37 @@ static const char *__init 
> > wallclock_type_to_string(void)
> >  return "";
> >  }
> >  
> > +static int __init cf_check parse_wallclock(const char *arg)
> > +{
> > +wallclock_source = WALLCLOCK_UNSET;
> > +
> > +if ( !arg )
> > +return -EINVAL;
> > +
> > +if ( !strcmp("auto", arg) )
> > +ASSERT(wallclock_source == WALLCLOCK_UNSET);
> 
> ... this.
> 
> Hitting this assert will manifest as a system reboot/hang with no
> information on serial/VGA, because all of this runs prior to getting up
> the console.  You only get to see it on a PVH boot because we bodge the
> Xen console into default-existence.

This assert is no-op as wallclock_source is unconditionally set to 
WALLCLOCK_UNSET few lines above.

> So, ASSERT()/etc really need avoiding wherever possible in cmdline parsing.
> 
> In this case, all it serves to do is break examples like `wallclock=xen
> wallclock=auto` case, which is unlike our latest-takes-precedence
> behaviour everywhere else.
> 
> > +else if ( !strcmp("xen", arg) )
> > +{
> > +if ( !xen_guest )
> 
> We don't normally treat this path as an error when parsing (we know what
> it is, even if we can't action it).  Instead, there's no_config_param()
> to be more friendly (for PVH at least).
> 
> It's a bit awkward, but this should do:
> 
>     {
> #ifdef CONFIG_XEN_GUEST
>         wallclock_source = WALLCLOCK_XEN;
> #else
>         no_config_param("XEN_GUEST", "wallclock", s, ss);
> #endif
>     }

Can you boot the binary build with CONFIG_XEN_GUEST=y as native? If so,
the above will not be enough, a runtime check is needed anyway.

> There probably wants to be something similar for EFI, although it's not
> a plain CONFIG so it might be more tricky.

It needs to be runtime check here even more. Not only because of
different boot modes, but due to interaction with efi=no-rs (or any
other reason for not having runtime services). See the comment there.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v6 1/2] x86/time: introduce command line option to select wallclock

2024-09-12 Thread Marek Marczykowski-Górecki

On Thu, Sep 12, 2024 at 03:47:53PM +0200, Roger Pau Monné wrote:
> On Thu, Sep 12, 2024 at 03:30:29PM +0200, Marek Marczykowski-Górecki wrote:
> > On Thu, Sep 12, 2024 at 02:56:55PM +0200, Roger Pau Monné wrote:
> > > On Thu, Sep 12, 2024 at 01:57:00PM +0200, Jan Beulich wrote:
> > > > On 12.09.2024 13:15, Roger Pau Monne wrote:
> > > > > --- a/xen/arch/x86/time.c
> > > > > +++ b/xen/arch/x86/time.c
> > > > > @@ -1552,6 +1552,35 @@ static const char *__init 
> > > > > wallclock_type_to_string(void)
> > > > >  return "";
> > > > >  }
> > > > >  
> > > > > +static int __init cf_check parse_wallclock(const char *arg)
> > > > > +{
> > > > > +if ( !arg )
> > > > > +return -EINVAL;
> > > > > +
> > > > > +if ( !strcmp("auto", arg) )
> > > > > +wallclock_source = WALLCLOCK_UNSET;
> > > > > +else if ( !strcmp("xen", arg) )
> > > > > +{
> > > > > +if ( !xen_guest )
> > > > > +return -EINVAL;
> > > > > +
> > > > > +wallclock_source = WALLCLOCK_XEN;
> > > > > +}
> > > > > +else if ( !strcmp("cmos", arg) )
> > > > > +wallclock_source = WALLCLOCK_CMOS;
> > > > > +else if ( !strcmp("efi", arg) )
> > > > > +/*
> > > > > + * Checking if run-time services are available must be done 
> > > > > after
> > > > > + * command line parsing.
> > > > > + */
> > > > > +wallclock_source = WALLCLOCK_EFI;
> > > > > +else
> > > > > +return -EINVAL;
> > > > > +
> > > > > +return 0;
> > > > > +}
> > > > > +custom_param("wallclock", parse_wallclock);
> > > > > +
> > > > >  static void __init probe_wallclock(void)
> > > > >  {
> > > > >  ASSERT(wallclock_source == WALLCLOCK_UNSET);
> > > > > @@ -2527,7 +2556,15 @@ int __init init_xen_time(void)
> > > > >  
> > > > >  open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
> > > > >  
> > > > > -probe_wallclock();
> > > > > +/*
> > > > > + * EFI run time services can be disabled from the command line, 
> > > > > hence the
> > > > > + * check for them cannot be done as part of the wallclock option 
> > > > > parsing.
> > > > > + */
> > > > > +if ( wallclock_source == WALLCLOCK_EFI && !efi_enabled(EFI_RS) )
> > > > > +wallclock_source = WALLCLOCK_UNSET;
> > > > > +
> > > > > +if ( wallclock_source == WALLCLOCK_UNSET )
> > > > > +probe_wallclock();
> > > > 
> > > > I don't want to stand in the way, and I can live with this form, so I'd 
> > > > like to
> > > > ask that EFI folks or Andrew provide the necessary A-b / R-b. I'd like 
> > > > to note
> > > > though that there continue to be quirks here. They may not be affecting 
> > > > the
> > > > overall result as long as we have only three possible wallclocks, but 
> > > > there
> > > > might be problems if a 4th one appeared.
> > > > 
> > > > With the EFI_RS check in the command line handler and assuming the 
> > > > default case
> > > > of no "efi=no-rs" on the command line, EFI_RS may still end up being 
> > > > off by the
> > > > time the command line is being parsed. With "wallclock=cmos 
> > > > wallclock=efi" this
> > > > would result in no probing and "cmos" chosen if there was that check in 
> > > > place.
> > > > With the logic as added here there will be probing in that case. 
> > > > Replace "cmos"
> > > > by a hypothetical non-default 4th wallclock type (i.e. probing picking 
> > > > "cmos"),
> > > > and I expect you can see how the result would then not necessarily be as
> > > > expected.
> > > 
> > > My expectation would be that if "wallclock=cmos wallclock=efi" is used
> > > the last option overrides any previous on

Re: [PATCH v6 1/2] x86/time: introduce command line option to select wallclock

2024-09-12 Thread Marek Marczykowski-Górecki

On Thu, Sep 12, 2024 at 02:56:55PM +0200, Roger Pau Monné wrote:
> On Thu, Sep 12, 2024 at 01:57:00PM +0200, Jan Beulich wrote:
> > On 12.09.2024 13:15, Roger Pau Monne wrote:
> > > --- a/xen/arch/x86/time.c
> > > +++ b/xen/arch/x86/time.c
> > > @@ -1552,6 +1552,35 @@ static const char *__init 
> > > wallclock_type_to_string(void)
> > >  return "";
> > >  }
> > >  
> > > +static int __init cf_check parse_wallclock(const char *arg)
> > > +{
> > > +if ( !arg )
> > > +return -EINVAL;
> > > +
> > > +if ( !strcmp("auto", arg) )
> > > +wallclock_source = WALLCLOCK_UNSET;
> > > +else if ( !strcmp("xen", arg) )
> > > +{
> > > +if ( !xen_guest )
> > > +return -EINVAL;
> > > +
> > > +wallclock_source = WALLCLOCK_XEN;
> > > +}
> > > +else if ( !strcmp("cmos", arg) )
> > > +wallclock_source = WALLCLOCK_CMOS;
> > > +else if ( !strcmp("efi", arg) )
> > > +/*
> > > + * Checking if run-time services are available must be done after
> > > + * command line parsing.
> > > + */
> > > +wallclock_source = WALLCLOCK_EFI;
> > > +else
> > > +return -EINVAL;
> > > +
> > > +return 0;
> > > +}
> > > +custom_param("wallclock", parse_wallclock);
> > > +
> > >  static void __init probe_wallclock(void)
> > >  {
> > >  ASSERT(wallclock_source == WALLCLOCK_UNSET);
> > > @@ -2527,7 +2556,15 @@ int __init init_xen_time(void)
> > >  
> > >  open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
> > >  
> > > -probe_wallclock();
> > > +/*
> > > + * EFI run time services can be disabled from the command line, 
> > > hence the
> > > + * check for them cannot be done as part of the wallclock option 
> > > parsing.
> > > + */
> > > +if ( wallclock_source == WALLCLOCK_EFI && !efi_enabled(EFI_RS) )
> > > +wallclock_source = WALLCLOCK_UNSET;
> > > +
> > > +if ( wallclock_source == WALLCLOCK_UNSET )
> > > +probe_wallclock();
> > 
> > I don't want to stand in the way, and I can live with this form, so I'd 
> > like to
> > ask that EFI folks or Andrew provide the necessary A-b / R-b. I'd like to 
> > note
> > though that there continue to be quirks here. They may not be affecting the
> > overall result as long as we have only three possible wallclocks, but there
> > might be problems if a 4th one appeared.
> > 
> > With the EFI_RS check in the command line handler and assuming the default 
> > case
> > of no "efi=no-rs" on the command line, EFI_RS may still end up being off by 
> > the
> > time the command line is being parsed. With "wallclock=cmos wallclock=efi" 
> > this
> > would result in no probing and "cmos" chosen if there was that check in 
> > place.
> > With the logic as added here there will be probing in that case. Replace 
> > "cmos"
> > by a hypothetical non-default 4th wallclock type (i.e. probing picking 
> > "cmos"),
> > and I expect you can see how the result would then not necessarily be as
> > expected.
> 
> My expectation would be that if "wallclock=cmos wallclock=efi" is used
> the last option overrides any previous one, and hence if that last
> option is not valid the logic will fallback to the default selection
> (in this case to probing).

That would be my expectation too. If some kind of preference would be
expected, it should looks like wallclock=efi,cmos, but I don't think we
need that.

> Thinking about this, it might make sense to unconditionally set
> wallclock_source = WALLCLOCK_UNSET at the start of parse_wallclock()
> to avoid previous instances carrying over if later ones are not valid.

This may be a good idea. But more importantly, the behavior should be
included in the option documentation (that if a selected value is not
available, it fallback to auto). And maybe a log message when that
happens (but I'm okay with skipping this one, as selected wallclock
source is logged already)?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [XEN PATCH 1/3] EFI: address violations of MISRA C Rule 13.6

2024-09-11 Thread Marek Marczykowski-Górecki

On Wed, Sep 11, 2024 at 02:50:03PM +0200, Jan Beulich wrote:
> On 10.09.2024 21:06, Federico Serafini wrote:
> > Refactor the code to improve readability
> 
> I question this aspect. I'm not the maintainer of this code anymore, so
> my view probably doesn't matter much here.
> 
> > and address violations of
> > MISRA C:2012 Rule 13.6 ("The operand of the `sizeof' operator shall
> > not contain any expression which has potential side effect").
> 
> Where's the potential side effect? Since you move ...
> 
> > --- a/xen/common/efi/runtime.c
> > +++ b/xen/common/efi/runtime.c
> > @@ -250,14 +250,20 @@ int efi_get_info(uint32_t idx, union xenpf_efi_info 
> > *info)
> >  info->cfg.addr = __pa(efi_ct);
> >  info->cfg.nent = efi_num_ct;
> >  break;
> > +
> >  case XEN_FW_EFI_VENDOR:
> > +{
> > +XEN_GUEST_HANDLE_PARAM(CHAR16) vendor_name =
> > +guest_handle_cast(info->vendor.name, CHAR16);
> 
> .. this out, it must be the one. I've looked at it, yet I can't spot
> anything:
> 
> #define guest_handle_cast(hnd, type) ({ \
> type *_x = (hnd).p; \
> (XEN_GUEST_HANDLE_PARAM(type)) { _x };  \
> })
> 
> As a rule of thumb, when things aren't obvious, please call out the
> specific aspect / property in descriptions of such patches.

I guess it's because guest_handle_cast() is a macro, yet it's lowercase
so looks like a function? Wasn't there some other MISRA rule about
lowercase/uppercase for macro names?

And yes, I don't really see why this would violate the side effect rule
either.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [REGRESSION] kernel NULL pointer dereference in xen-balloon with mem hotplug

2024-09-06 Thread Marek Marczykowski-Górecki

On Fri, Sep 06, 2024 at 12:30:03PM +0200, Linux regression tracking (Thorsten 
Leemhuis) wrote:
> On 08.08.24 12:31, Marek Marczykowski-Górecki wrote:
> > 
> > When testing Linux 6.11-rc2, I've got the crash like below. It's a PVH
> > guest started with 400MB memory, and then extended via mem hotplug (I
> > don't know to what exact size it was at this time, but up to 4GB), it
> > was quite early in the domU boot process, I suspect it could be the
> > first mem hotplug even happening there.
> > Unfortunately I don't have reliable reproducer, it crashed only once
> > over several test runs. I don't remember seeing such crash before, so it
> > looks like a regression in 6.11. I'm not sure if that matters, but it's
> > on ADL, very similar to the qubes-hw2 gitlab runner.
> 
> Marek, did this happen again or do things appear to be resolved? Asking
> because I'm tracking this as a regression.

I haven't investigated it more, and also haven't ran any later tests on
6.11, so I don't know if it's still there, but I suspect it might be.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v2] xen: PE/COFF image header

2024-08-23 Thread Marek Marczykowski-Górecki

On Mon, Jul 29, 2024 at 01:42:46PM +0200, Jan Beulich wrote:
> On 23.07.2024 20:22, Milan Djokic wrote:
> > From: Nikola Jelic 
> > 
> > Added PE/COFF generic image header which shall be used for EFI
> > application format for x86/risc-v. x86 and risc-v source shall be adjusted
> > to use this header in following commits. pe.h header is taken over from
> > linux kernel with minor changes in terms of formatting and structure member 
> > comments.
> > Also, COFF relocation and win cert structures are ommited, since these are 
> > not relevant for Xen.
> > 
> > Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> > 36e4fc57fc16
> > 
> > Signed-off-by: Nikola Jelic 
> > Signed-off-by: Milan Djokic 

Acked-by: Marek Marczykowski-Górecki 

> This looks okay to me now, but I can't ack it (or more precisely my ack
> wouldn't mean anything). There are a few style issues in comments, but
> leaving them as they are in Linux may be intentional. Just one question,
> more to other maintainers than to you:
> 
> > +#define IMAGE_DLL_CHARACTERISTICS_DYNAMIC_BASE 0x0040
> > +#define IMAGE_DLL_CHARACTERISTICS_FORCE_INTEGRITY  0x0080
> > +#define IMAGE_DLL_CHARACTERISTICS_NX_COMPAT0x0100
> > +#define IMAGE_DLLCHARACTERISTICS_NO_ISOLATION  0x0200
> > +#define IMAGE_DLLCHARACTERISTICS_NO_SEH0x0400
> > +#define IMAGE_DLLCHARACTERISTICS_NO_BIND   0x0800
> > +#define IMAGE_DLLCHARACTERISTICS_WDM_DRIVER0x2000
> > +#define IMAGE_DLLCHARACTERISTICS_TERMINAL_SERVER_AWARE 0x8000
> > +
> > +#define IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT 0x0001
> > +#define IMAGE_DLLCHARACTERISTICS_EX_FORWARD_CFI_COMPAT 0x0040
> 
> The naming inconsistency (underscore or not after DLL) is somewhat
> unhelpful. Do we maybe want to diverge from what Linux has here? Note
> that e.g. the GNU binutils header has at least a comment there.

Indeed it doesn't look great, but IMO leaving it consistent with Linux
is okay as it ease updating and porting/comparing other code if needed.

> What I'm puzzled by is IMAGE_DLLCHARACTERISTICS_EX_FORWARD_CFI_COMPAT
> having the same value as IMAGE_DLL_CHARACTERISTICS_DYNAMIC_BASE. Are
> these meant to apply to the same field? Or do these values rather
> relate to IMAGE_DEBUG_TYPE_EX_DLLCHARACTERISTICS? Some clarification
> may be needed here, or the two entries may simply want omitting for
> now.

One has _EX_ infix and the other doesn't so IMO together with visual
separation it's clear they apply to a different field.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: Assertion failed at arch/x86/genapic/x2apic.c:38 on S3 resume nested in KVM on AMD

2024-08-08 Thread Marek Marczykowski-Górecki

On Thu, Aug 08, 2024 at 01:22:30PM +0200, Jan Beulich wrote:
> On 23.07.2024 16:28, Marek Marczykowski-Górecki wrote:
> > I'm observing a crash like the one below when trying to resume from S3.
> > It happens on Xen nested in KVM (QEMU 9.0, Linux 6.9.3) but only on AMD.
> > The very same software stack on Intel works just fine. QEMU is running
> > with "-cpu host,+svm,+invtsc -machine q35,kernel-irqchip=split -device
> > amd-iommu,intremap=on -smp 2" among others.
> > 
> > (XEN) Preparing system for ACPI S3 state.
> > (XEN) Disabling non-boot CPUs ...
> > (XEN) Broke affinity for IRQ1, new: {0-1}
> > (XEN) Broke affinity for IRQ20, new: {0-1}
> > (XEN) Broke affinity for IRQ22, new: {0-1}
> > (XEN) Entering ACPI S3 state.
> > (XEN) Finishing wakeup from ACPI S3 state.
> > (XEN) Enabling non-boot CPUs  ...
> > (XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus, 
> > this_cpu))' failed at arch/x86/genapic/x2apic.c:38
> > (XEN) [ Xen-4.20  x86_64  debug=y  Not tainted ]
> > (XEN) CPU:1
> > (XEN) RIP:e008:[] 
> > x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
> > (XEN) RFLAGS: 00010096   CONTEXT: hypervisor
> > (XEN) rax: 830278a25f50   rbx: 0001   rcx: 
> > 82d0405e1700
> > (XEN) rdx: 003233412000   rsi: 8302739da2d8   rdi: 
> > 
> > (XEN) rbp: 00c8   rsp: 8302739d7e78   r8:  
> > 0001
> > (XEN) r9:  8302739d7fa0   r10: 0001   r11: 
> > 
> > (XEN) r12: 0001   r13: 0001   r14: 
> > 
> > (XEN) r15:    cr0: 8005003b   cr4: 
> > 007506e0
> > (XEN) cr3: 7fa7a000   cr2: 
> > (XEN) fsb:    gsb:    gss: 
> > 
> > (XEN) ds:    es:    fs:    gs:    ss:    cs: e008
> > (XEN) Xen code around  
> > (x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9):
> > (XEN)  cf 82 ff ff eb b7 0f 0b <0f> 0b 48 8d 05 9c fc 33 00 48 8b 0d a5 
> > 0a 35 00
> > (XEN) Xen stack trace from rsp=8302739d7e78:
> > (XEN) 00c8 0001 
> > 0001
> > (XEN) 82d0402f1d83 8302739d7fff 
> > 00c8
> > (XEN)0001 0001 82d04031adb9 
> > 0001
> > (XEN)   
> > 82d040276677
> > (XEN)   
> > 
> > (XEN)88810037c000 0001 0246 
> > deadbeefdeadf00d
> > (XEN)0001   
> > 811d130a
> > (XEN)deadbeefdeadf00d deadbeefdeadf00d deadbeefdeadf00d 
> > 0100
> > (XEN)811d130a e033 0246 
> > c900400b3ef8
> > (XEN)e02b beef beef 
> > beef
> > (XEN)beef e011 8302739de000 
> > 003233412000
> > (XEN)007506e0   
> > 0002
> > (XEN)0002
> > (XEN) Xen call trace:
> > (XEN)[] R 
> > x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
> > (XEN)[] S setup_local_APIC+0x26/0x449
> > (XEN)[] S start_secondary+0x1c4/0x37a
> > (XEN)[] S __high_start+0x87/0xd0
> > (XEN) 
> > (XEN) 
> > (XEN) 
> > (XEN) Panic on CPU 1:
> > (XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus, 
> > this_cpu))' failed at arch/x86/genapic/x2apic.c:38
> > (XEN) 
> 
> Would you mind giving the patch below a try?

Yes, this seems to fix the issue, thanks!

> Jan
> 
> x86/x2APIC: correct cluster tracking upon CPUs going down for S3
> 
> Downing CPUs for S3 is somewhat special: Since we can expect the system
> to come back up in exactly the same hardware configuration, per-CPU data
> for the secondary CPUs isn't de-allocated (and then cleared upon re-
> allocation when the CPUs are being brought back up). Therefore the
> cluster_cpus per-CPU pointer will retain its value for all CPUs other
> than the f

[REGRESSION] kernel NULL pointer dereference in xen-balloon with mem hotplug

2024-08-08 Thread Marek Marczykowski-Górecki

nqa.qubes-os.org/tests/108883/file/system_tests-qubes.tests.integ.vm_qrexec_gui.TC_20_NonAudio_whonix-workstation-17.test_105.guest-test-inst-vm2.log
Other logs, including dom0 and Xen messages:
https://openqa.qubes-os.org/tests/108883#downloads

Kernel config is build from merging
https://github.com/QubesOS/qubes-linux-kernel/blob/005ae1ac3819d957379e48fb2cfd33f511a47275/config-base
with
https://github.com/QubesOS/qubes-linux-kernel/blob/005ae1ac3819d957379e48fb2cfd33f511a47275/config-qubes
(options set in the latter takes precedence)
Especially, it has:
CONFIG_XEN_BALLOON_MEMORY_HOTPLUG=y
CONFIG_XEN_UNPOPULATED_ALLOC=y

#regzbot introduced: v6.10..v6.11-rc2

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: ACPI NVS range conflicting with Dom0 page tables (or kernel image)

2024-08-07 Thread Marek Marczykowski-Górecki

On Wed, Aug 07, 2024 at 12:26:26PM +0200, Jürgen Groß wrote:
> On 07.08.24 12:23, Marek Marczykowski-Górecki wrote:
> > On Tue, Aug 06, 2024 at 05:24:22PM +0200, Jürgen Groß wrote:
> > > On 06.08.24 17:21, Marek Marczykowski-Górecki wrote:
> > > > On Tue, Aug 06, 2024 at 04:12:32PM +0200, Jürgen Groß wrote:
> > > > > Marek,
> > > > > 
> > > > > On 17.06.24 16:03, Marek Marczykowski-Górecki wrote:
> > > > > > On Mon, Jun 17, 2024 at 01:22:37PM +0200, Jan Beulich wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > while it feels like we had a similar situation before, I can't 
> > > > > > > seem to be
> > > > > > > able to find traces thereof, or associated (Linux) commits.
> > > > > > 
> > > > > > Is it some AMD Threadripper system by a chance? Previous thread on 
> > > > > > this
> > > > > > issue:
> > > > > > https://lore.kernel.org/xen-devel/CAOCpoWdOH=xgxiqsc1c5ueb1thxajh4wizbczq-qt+d_kak...@mail.gmail.com/
> > > > > > 
> > > > > > > With
> > > > > > > 
> > > > > > > (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x400
> > > > > > > ...
> > > > > > > (XEN)  Dom0 alloc.:   00044000->00044800 (619175 
> > > > > > > pages to be allocated)
> > > > > > > ...
> > > > > > > (XEN)  Loaded kernel: 8100->8400
> > > > > > > 
> > > > > > > the kernel occupies the space from 16Mb to 64Mb in the initial 
> > > > > > > allocation.
> > > > > > > Page tables come (almost) directly above:
> > > > > > > 
> > > > > > > (XEN)  Page tables:   84001000->84026000
> > > > > > > 
> > > > > > > I.e. they're just above the 64Mb boundary. Yet sadly in the host 
> > > > > > > E820 map
> > > > > > > there is
> > > > > > > 
> > > > > > > (XEN)  [0400, 04009fff] (ACPI NVS)
> > > > > > > 
> > > > > > > i.e. a non-RAM range starting at 64Mb. The kernel (currently) 
> > > > > > > won't tolerate
> > > > > > > such an overlap (also if it was overlapping the kernel image, 
> > > > > > > e.g. if on the
> > > > > > > machine in question s sufficiently much larger kernel was used). 
> > > > > > > Yet with its
> > > > > > > fundamental goal of making its E820 match the host one I'm also 
> > > > > > > in trouble
> > > > > > > thinking of possible solutions / workarounds. I certainly do not 
> > > > > > > see Xen
> > > > > > > trying to cover for this, as the E820 map re-arrangement is 
> > > > > > > purely a kernel
> > > > > > > side decision (forward ported kernels got away without, and what 
> > > > > > > e.g. the
> > > > > > > BSDs do is entirely unknown to me).
> > > > > > 
> > > > > > In Qubes we have worked around the issue by moving the kernel lower
> > > > > > (CONFIG_PHYSICAL_START=0x20):
> > > > > > https://github.com/QubesOS/qubes-linux-kernel/commit/3e8be4ac1682370977d4d0dc1d782c428d860282
> > > > > > 
> > > > > > Far from ideal, but gets it bootable...
> > > > > > 
> > > > > 
> > > > > could you test the attached kernel patches? They should fix the issue 
> > > > > without
> > > > > having to modify CONFIG_PHYSICAL_START.
> > > > > 
> > > > > I have tested them to boot up without problem on my test system, but 
> > > > > I don't
> > > > > have access to a system showing the E820 map conflict you are seeing.
> > > > > 
> > > > > The patches have been developed against kernel 6.11-rc2, but I think 
> > > > > they
> > > > > should apply to a 6.10 and maybe even an older kernel.
> > > > 
> > > > Sure, but tomorrow-ish.
> > > 
> > > Thanks.
> > 
> > Seems to work :)
> > 
> > Snippets from Xen

Re: ACPI NVS range conflicting with Dom0 page tables (or kernel image)

2024-08-07 Thread Marek Marczykowski-Górecki

On Tue, Aug 06, 2024 at 05:24:22PM +0200, Jürgen Groß wrote:
> On 06.08.24 17:21, Marek Marczykowski-Górecki wrote:
> > On Tue, Aug 06, 2024 at 04:12:32PM +0200, Jürgen Groß wrote:
> > > Marek,
> > > 
> > > On 17.06.24 16:03, Marek Marczykowski-Górecki wrote:
> > > > On Mon, Jun 17, 2024 at 01:22:37PM +0200, Jan Beulich wrote:
> > > > > Hello,
> > > > > 
> > > > > while it feels like we had a similar situation before, I can't seem 
> > > > > to be
> > > > > able to find traces thereof, or associated (Linux) commits.
> > > > 
> > > > Is it some AMD Threadripper system by a chance? Previous thread on this
> > > > issue:
> > > > https://lore.kernel.org/xen-devel/CAOCpoWdOH=xgxiqsc1c5ueb1thxajh4wizbczq-qt+d_kak...@mail.gmail.com/
> > > > 
> > > > > With
> > > > > 
> > > > > (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x400
> > > > > ...
> > > > > (XEN)  Dom0 alloc.:   00044000->00044800 (619175 
> > > > > pages to be allocated)
> > > > > ...
> > > > > (XEN)  Loaded kernel: 8100->8400
> > > > > 
> > > > > the kernel occupies the space from 16Mb to 64Mb in the initial 
> > > > > allocation.
> > > > > Page tables come (almost) directly above:
> > > > > 
> > > > > (XEN)  Page tables:   84001000->84026000
> > > > > 
> > > > > I.e. they're just above the 64Mb boundary. Yet sadly in the host E820 
> > > > > map
> > > > > there is
> > > > > 
> > > > > (XEN)  [0400, 04009fff] (ACPI NVS)
> > > > > 
> > > > > i.e. a non-RAM range starting at 64Mb. The kernel (currently) won't 
> > > > > tolerate
> > > > > such an overlap (also if it was overlapping the kernel image, e.g. if 
> > > > > on the
> > > > > machine in question s sufficiently much larger kernel was used). Yet 
> > > > > with its
> > > > > fundamental goal of making its E820 match the host one I'm also in 
> > > > > trouble
> > > > > thinking of possible solutions / workarounds. I certainly do not see 
> > > > > Xen
> > > > > trying to cover for this, as the E820 map re-arrangement is purely a 
> > > > > kernel
> > > > > side decision (forward ported kernels got away without, and what e.g. 
> > > > > the
> > > > > BSDs do is entirely unknown to me).
> > > > 
> > > > In Qubes we have worked around the issue by moving the kernel lower
> > > > (CONFIG_PHYSICAL_START=0x20):
> > > > https://github.com/QubesOS/qubes-linux-kernel/commit/3e8be4ac1682370977d4d0dc1d782c428d860282
> > > > 
> > > > Far from ideal, but gets it bootable...
> > > > 
> > > 
> > > could you test the attached kernel patches? They should fix the issue 
> > > without
> > > having to modify CONFIG_PHYSICAL_START.
> > > 
> > > I have tested them to boot up without problem on my test system, but I 
> > > don't
> > > have access to a system showing the E820 map conflict you are seeing.
> > > 
> > > The patches have been developed against kernel 6.11-rc2, but I think they
> > > should apply to a 6.10 and maybe even an older kernel.
> > 
> > Sure, but tomorrow-ish.
> 
> Thanks.

Seems to work :)

Snippets from Xen log:

(XEN) EFI RAM map:
(XEN)  [, 0009] (usable)
(XEN)  [000a, 000f] (reserved)
(XEN)  [0010, 03ff] (usable)
(XEN)  [0400, 04011fff] (ACPI NVS)
(XEN)  [04012000, 09df1fff] (usable)
(XEN)  [09df2000, 09ff] (reserved)
(XEN)  [0a00, a8840fff] (usable)
(XEN)  [a8841000, a9d9] (reserved)
(XEN)  [a9da, a9dd4fff] (ACPI data)
(XEN)  [a9dd5000, a9dd5fff] (reserved)
(XEN)  [a9dd6000, a9f20fff] (ACPI data)
(XEN)  [a9f21000, aa099fff] (ACPI NVS)
(XEN)  [aa09a000, ab1fefff] (reserved)
(XEN)  [ab1ff000, abff] (usable)
(XEN)  [ac00, afff] (reserved)
(XEN)  [b250, b2580fff]

Re: ACPI NVS range conflicting with Dom0 page tables (or kernel image)

2024-08-06 Thread Marek Marczykowski-Górecki

On Tue, Aug 06, 2024 at 04:12:32PM +0200, Jürgen Groß wrote:
> Marek,
> 
> On 17.06.24 16:03, Marek Marczykowski-Górecki wrote:
> > On Mon, Jun 17, 2024 at 01:22:37PM +0200, Jan Beulich wrote:
> > > Hello,
> > > 
> > > while it feels like we had a similar situation before, I can't seem to be
> > > able to find traces thereof, or associated (Linux) commits.
> > 
> > Is it some AMD Threadripper system by a chance? Previous thread on this
> > issue:
> > https://lore.kernel.org/xen-devel/CAOCpoWdOH=xgxiqsc1c5ueb1thxajh4wizbczq-qt+d_kak...@mail.gmail.com/
> > 
> > > With
> > > 
> > > (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x400
> > > ...
> > > (XEN)  Dom0 alloc.:   00044000->00044800 (619175 pages to 
> > > be allocated)
> > > ...
> > > (XEN)  Loaded kernel: 8100->8400
> > > 
> > > the kernel occupies the space from 16Mb to 64Mb in the initial allocation.
> > > Page tables come (almost) directly above:
> > > 
> > > (XEN)  Page tables:   84001000->84026000
> > > 
> > > I.e. they're just above the 64Mb boundary. Yet sadly in the host E820 map
> > > there is
> > > 
> > > (XEN)  [0400, 04009fff] (ACPI NVS)
> > > 
> > > i.e. a non-RAM range starting at 64Mb. The kernel (currently) won't 
> > > tolerate
> > > such an overlap (also if it was overlapping the kernel image, e.g. if on 
> > > the
> > > machine in question s sufficiently much larger kernel was used). Yet with 
> > > its
> > > fundamental goal of making its E820 match the host one I'm also in trouble
> > > thinking of possible solutions / workarounds. I certainly do not see Xen
> > > trying to cover for this, as the E820 map re-arrangement is purely a 
> > > kernel
> > > side decision (forward ported kernels got away without, and what e.g. the
> > > BSDs do is entirely unknown to me).
> > 
> > In Qubes we have worked around the issue by moving the kernel lower
> > (CONFIG_PHYSICAL_START=0x20):
> > https://github.com/QubesOS/qubes-linux-kernel/commit/3e8be4ac1682370977d4d0dc1d782c428d860282
> > 
> > Far from ideal, but gets it bootable...
> > 
> 
> could you test the attached kernel patches? They should fix the issue without
> having to modify CONFIG_PHYSICAL_START.
> 
> I have tested them to boot up without problem on my test system, but I don't
> have access to a system showing the E820 map conflict you are seeing.
> 
> The patches have been developed against kernel 6.11-rc2, but I think they
> should apply to a 6.10 and maybe even an older kernel.

Sure, but tomorrow-ish.

> If possible it would be nice to verify suspend to disk still working, as
> the kernel will need to access the ACPI NVS area in this case.

That might be harder, as Qubes OS doesn't support suspend to disk, but
I'll see if something can be done.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v2] automation: upgrade Yocto to scarthgap

2024-07-30 Thread Marek Marczykowski-Górecki

On Tue, Jul 30, 2024 at 03:01:52PM +0100, Andrew Cooper wrote:
> On 30/07/2024 2:46 pm, Marek Marczykowski-Górecki wrote:
> > On Fri, Jul 26, 2024 at 05:19:42PM -0700, Stefano Stabellini wrote:
> >> Upgrade Yocto to a newer version. Use ext4 as image format for testing
> >> with QEMU on ARM and ARM64 as the default is WIC and it is not available
> >> for our xen-image-minimal target.
> >>
> >> Also update the tar.bz2 filename for the rootfs.
> >>
> >> Signed-off-by: Stefano Stabellini 
> > Reviewed-by: Marek Marczykowski-Górecki 
> >
> >> ---
> >>
> >> all yocto tests pass:
> >> https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/1390081173
> 
> That test ran on gitlab-docker-pug, not qubes-ambrosia, so doesn't
> confirm the fix to the xattr issue.

There is one on ambrosia too:
https://gitlab.com/xen-project/people/sstabellini/xen/-/jobs/7423043016

> Seeing as I'm going to need to rebuild the container anyway, I'll see
> about forcing this and double checking.

But double-checking is a good idea anyway.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v2] automation: upgrade Yocto to scarthgap

2024-07-30 Thread Marek Marczykowski-Górecki

On Fri, Jul 26, 2024 at 05:19:42PM -0700, Stefano Stabellini wrote:
> Upgrade Yocto to a newer version. Use ext4 as image format for testing
> with QEMU on ARM and ARM64 as the default is WIC and it is not available
> for our xen-image-minimal target.
> 
> Also update the tar.bz2 filename for the rootfs.
> 
> Signed-off-by: Stefano Stabellini 

Reviewed-by: Marek Marczykowski-Górecki 

> ---
> 
> all yocto tests pass:
> https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/1390081173
> 
> Changes in v2:
> - s/EXT4/IMAGE_FMT/
> - set IMAGE_FMT before the call to project_build
> - also update the filename xen-image-minimal-qemuarm.rootfs.tar.bz2
> ---
>  automation/build/yocto/build-yocto.sh   | 15 ---
>  automation/build/yocto/yocto.inc|  4 ++--
>  automation/gitlab-ci/build.yaml |  2 +-
>  automation/scripts/qemu-smoke-dom0-arm32.sh |  2 +-
>  4 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/automation/build/yocto/build-yocto.sh 
> b/automation/build/yocto/build-yocto.sh
> index 93ce81ce82..e1e69f2bb5 100755
> --- a/automation/build/yocto/build-yocto.sh
> +++ b/automation/build/yocto/build-yocto.sh
> @@ -25,6 +25,7 @@ TARGET_SUPPORTED="qemuarm qemuarm64 qemux86-64"
>  VERBOSE="n"
>  TARGETLIST=""
>  BUILDJOBS="8"
> +IMAGE_FMT=""
>  
>  # actions to do
>  do_clean="n"
> @@ -38,8 +39,9 @@ build_result=0
>  # layers to include in the project
>  build_layerlist="poky/meta poky/meta-poky poky/meta-yocto-bsp \
>   meta-openembedded/meta-oe meta-openembedded/meta-python \
> + meta-openembedded/meta-networking \
>   meta-openembedded/meta-filesystems \
> - meta-openembedded/meta-networking meta-virtualization"
> + meta-virtualization"
>  
>  # yocto image to build
>  build_image="xen-image-minimal"
> @@ -175,7 +177,7 @@ function project_build() {
>  mkdir -p $OUTPUTDIR
>  cp $BUILDDIR/tmp/deploy/images/qemuarm/zImage $OUTPUTDIR
>  cp $BUILDDIR/tmp/deploy/images/qemuarm/xen-qemuarm $OUTPUTDIR
> -cp 
> $BUILDDIR/tmp/deploy/images/qemuarm/xen-image-minimal-qemuarm.tar.bz2 
> $OUTPUTDIR
> +cp 
> $BUILDDIR/tmp/deploy/images/qemuarm/xen-image-minimal-qemuarm.rootfs.tar.bz2 
> $OUTPUTDIR
>  fi
>  fi
>  ) || return 1
> @@ -196,7 +198,7 @@ function project_run() {
>  
>  /usr/bin/expect <  set timeout 1000
> -spawn bash -c "runqemu serialstdio nographic slirp"
> +spawn bash -c "runqemu serialstdio nographic slirp ${IMAGE_FMT}"
>  
>  expect_after {
>  -re "(.*)\r" {
> @@ -356,6 +358,13 @@ for f in ${TARGETLIST}; do
>  run_task project_create "${f}"
>  fi
>  if [ -f "${BUILDDIR}/${f}/conf/local.conf" ]; then
> +# Set the right image target
> +if [ "$f" = "qemux86-64" ]; then
> +IMAGE_FMT=""
> +else
> +IMAGE_FMT="ext4"
> +fi
> +
>  if [ "${do_build}" = "y" ]; then
>  run_task project_build "${f}"
>  fi
> diff --git a/automation/build/yocto/yocto.inc 
> b/automation/build/yocto/yocto.inc
> index 2f3b1a5b2a..209df7dde9 100644
> --- a/automation/build/yocto/yocto.inc
> +++ b/automation/build/yocto/yocto.inc
> @@ -6,10 +6,10 @@
>  # YOCTOVERSION-TARGET for x86_64 hosts
>  # YOCTOVERSION-TARGET-arm64v8 for arm64 hosts
>  # For example you can build an arm64 container with the following command:
> -# make yocto/kirkstone-qemuarm64-arm64v8
> +# make yocto/scarthgap-qemuarm64-arm64v8
>  
>  # Yocto versions we are currently using.
> -YOCTO_VERSION = kirkstone
> +YOCTO_VERSION = scarthgap
>  
>  # Yocto BSPs we want to build for.
>  YOCTO_TARGETS = qemuarm64 qemuarm qemux86-64
> diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
> index 7ce88d38e7..32045cef0c 100644
> --- a/automation/gitlab-ci/build.yaml
> +++ b/automation/gitlab-ci/build.yaml
> @@ -212,7 +212,7 @@
>script:
>  - ./automation/build/yocto/build-yocto.sh -v --log-dir=./logs 
> --xen-dir=`pwd` ${YOCTO_BOARD} ${YOCTO_OUTPUT}
>variables:
> -YOCTO_VERSION: kirkstone
> +YOCTO_VERSION: scarthgap
>  CONTAINER: yocto:${YOCTO_VERSION}-${YOCTO_BOARD}${YOCTO_HOST}
>artifacts:
>  paths:
> diff --git a/automation/scripts/qemu-smoke-dom0-arm32.sh 
> b/automation/scripts/qemu-smoke-dom0-arm32.sh
> index 31c05cc840..eaaea5a982 100755
> --- a/automation/scripts/qemu-smoke-dom0-arm32.sh
> +++ b/automation/scripts/qemu-smoke-dom0-arm32.sh
> @@ -8,7 +8,7 @@ cd binaries
>  
>  mkdir rootfs
>  cd rootfs
> -tar xvf ../xen-image-minimal-qemuarm.tar.bz2
> +tar xvf ../xen-image-minimal-qemuarm.rootfs.tar.bz2
>  mkdir -p ./root
>  echo "name=\"test\"
>  memory=400
> -- 
> 2.25.1
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v2] x86/shutdown: change default reboot method preference

2024-07-29 Thread Marek Marczykowski-Górecki

On Fri, Sep 15, 2023 at 09:43:47AM +0200, Roger Pau Monne wrote:
> The current logic to chose the preferred reboot method is based on the mode 
> Xen
> has been booted into, so if the box is booted from UEFI, the preferred reboot
> method will be to use the ResetSystem() run time service call.
> 
> However, that method seems to be widely untested, and quite often leads to a
> result similar to:
> 
> Hardware Dom0 shutdown: rebooting machine
> [ Xen-4.18-unstable  x86_64  debug=y  Tainted:   C]
> CPU:0
> RIP:e008:[<0017>] 0017
> RFLAGS: 00010202   CONTEXT: hypervisor
> [...]
> Xen call trace:
>[<0017>] R 0017
>[] S 83207eff7b50
>[] F machine_restart+0x1da/0x261
>[] F apic_wait_icr_idle+0/0x37
>[] F smp_call_function_interrupt+0xc7/0xcb
>[] F call_function_interrupt+0x20/0x34
>[] F do_IRQ+0x150/0x6f3
>[] F common_interrupt+0x132/0x140
>[] F 
> arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x113/0x129
>[] F 
> arch/x86/acpi/cpu_idle.c#acpi_processor_idle+0x3eb/0x5f7
>[] F arch/x86/domain.c#idle_loop+0xec/0xee
> 
> 
> Panic on CPU 0:
> FATAL TRAP: vector = 6 (invalid opcode)
> 
> 
> Which in most cases does lead to a reboot, however that's unreliable.
> 
> Change the default reboot preference to prefer ACPI over UEFI if available and
> not in reduced hardware mode.
> 
> This is in line to what Linux does, so it's unlikely to cause issues on 
> current
> and future hardware, since there's a much higher chance of vendors testing
> hardware with Linux rather than Xen.
> 
> Add a special case for one Acer model that does require being rebooted using
> ResetSystem().  See Linux commit 0082517fa4bce for rationale.
> 
> I'm not aware of using ACPI reboot causing issues on boxes that do have
> properly implemented ResetSystem() methods.

With the Acer quirk, and the info Jan posted in the thread, this
sentence technically is not true. I don't think it warrants any code
change in this patch (it's clearly less common and less problematic
issue than crash during ResetSystem(), and still can be worked around
with a cmdline option). But might warrant adjusting commit message.

> Signed-off-by: Roger Pau Monné 

Other points still stand, and I think this generally is an improvement,
so, preferably with adjusted commit message:

Acked-by: Marek Marczykowski-Górecki 

> ---
> Changes since v1:
>  - Add special case for Acer model to use UEFI reboot.
>  - Adjust commit message.
> ---
>  xen/arch/x86/shutdown.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/shutdown.c b/xen/arch/x86/shutdown.c
> index 7619544d14da..3816ede1afe5 100644
> --- a/xen/arch/x86/shutdown.c
> +++ b/xen/arch/x86/shutdown.c
> @@ -150,19 +150,20 @@ static void default_reboot_type(void)
>  
>  if ( xen_guest )
>  reboot_type = BOOT_XEN;
> +else if ( !acpi_disabled && !acpi_gbl_reduced_hardware )
> +reboot_type = BOOT_ACPI;
>  else if ( efi_enabled(EFI_RS) )
>  reboot_type = BOOT_EFI;
> -else if ( acpi_disabled )
> -reboot_type = BOOT_KBD;
>  else
> -reboot_type = BOOT_ACPI;
> +reboot_type = BOOT_KBD;
>  }
>  
>  static int __init cf_check override_reboot(const struct dmi_system_id *d)
>  {
>  enum reboot_type type = (long)d->driver_data;
>  
> -if ( type == BOOT_ACPI && acpi_disabled )
> +if ( (type == BOOT_ACPI && acpi_disabled) ||
> + (type == BOOT_EFI && !efi_enabled(EFI_RS)) )
>  type = BOOT_KBD;
>  
>  if ( reboot_type != type )
> @@ -172,6 +173,7 @@ static int __init cf_check override_reboot(const struct 
> dmi_system_id *d)
>  [BOOT_KBD]  = "keyboard controller",
>  [BOOT_ACPI] = "ACPI",
>  [BOOT_CF9]  = "PCI",
> +[BOOT_EFI]  = "UEFI",
>  };
>  
>  reboot_type = type;
> @@ -530,6 +532,15 @@ static const struct dmi_system_id __initconstrel 
> reboot_dmi_table[] = {
>  DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R740"),
>  },
>  },
> +{    /* Handle problems with rebooting on Acer TravelMate X514-51T. */
> +.callback = override_reboot,
> +.driver_data = (void *)(long)BOOT_EFI,
> +.ident = "Acer TravelMate X514-51T",
> +.matches = {
> +DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
> +DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate X514-51T"),
> +},
> +},
>  { }
>  };
>  
> -- 
> 2.42.0
> 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

[PATCH v7 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-07-25 Thread Marek Marczykowski-Górecki

In some cases, only few registers on a page needs to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. Current API allows only marking whole pages pages read-only,
which sometimes may cover other registers that guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask which
qwords are supposed to be read-only and an address where page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of more efficient structure, because there
isn't supposed to be many pages needing this precise r/o control.

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash() for
non hardware domains.  To reach mmio_ro_emulated_write(), change how
write violations for p2m_mmio_direct are handled - specifically, check
if they relate to such partially protected page via
subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
them too. This decodes what guest is trying write and finally calls
mmio_ro_emulated_write(). The EPT write violation is detected as
npfec.write_access and npfec.present both being true (similar to other
places), which may cover some other (future?) cases - if that happens,
emulator might get involved unnecessarily, but since it's limited to
pages marked with subpage_mmio_ro_add() only, the impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests (the change in hvm_hap_nested_page_fault()), but only for pages
explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
passed through a device partially used by Xen.
As of the next patch, it applies only configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v7:
- refuse misaligned start in release build too, to have release build
  running what was tested in debug build
- simplify return from subpage_mmio_ro_add_page
Changes in v6:
- fix return type of subpage_mmio_find_page()
- change 'iter' pointer to 'new_entry' bool and move list_add()
- comment why different error handling for unaligned start / size
- code style
Changes in v5:
- use subpage_mmio_find_page helper, simplifying several functions
- use LIST_HEAD_RO_AFTER_INIT
- don't use subpage_ro_lock in __init
- drop #ifdef in mm.h
- return error on unaligned size in subpage_mmio_ro_add() instead of
  extending the size (in release build)
Changes in v4:
- rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
- guard subpage_mmio_write_accept with CONFIG_HVM, a

[PATCH v7 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-07-25 Thread Marek Marczykowski-Górecki

On older systems, XHCI xcap had a layout that no other (interesting) registers
were placed on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Marek Marczykowski-Górecki (2):
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   4 +-
 xen/arch/x86/include/asm/mm.h   |  23 +++-
 xen/arch/x86/mm.c   | 261 +-
 xen/arch/x86/pv/ro-page-fault.c |   6 +-
 xen/drivers/char/xhci-dbc.c |  36 +++--
 6 files changed, 313 insertions(+), 19 deletions(-)

base-commit: b25b28ede1cba43eda1e0b84ad967683b8196847
-- 
git-series 0.9.1

[PATCH v7 2/2] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-07-25 Thread Marek Marczykowski-Górecki

Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page do contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v4:
- restore mmio_ro_ranges in the fallback case
- set XHCI_SHARE_NONE in the fallback case
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..c45e4b6825cc 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,28 @@ static void __init cf_check 
dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+uart->dbc.share = XHCI_SHARE_NONE;
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+if ( rangeset_add_range(mmio_ro_ranges,
+ PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset),
+ PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+uart->dbc.xhc_dbc_offset +
+sizeof(*uart->dbc.dbc_reg)) - 1) )
+printk(XENLOG_INFO
+   "Error while adding MMIO range of device to 
mmio_ro_ranges\n");
+}
 #endif
 }
 
-- 
git-series 0.9.1

Re: [PATCH v6 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-25 Thread Marek Marczykowski-Górecki

On Thu, Jul 25, 2024 at 11:26:31AM +0200, Jan Beulich wrote:
> On 23.07.2024 05:24, Marek Marczykowski-Górecki wrote:
> > + * so tolerate it.
> > + * But unaligned size would result in smaller area, so deny it.
> > + */
> > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > +return -EINVAL;
> 
> I hoped you would, when adding the comment, recall an earlier comment of
> mine: If you want to tolerate mis-aligned start in release builds, you
> need to make further adjustments to the subsequent logic (at which
> point the respective assertion may become pointless); see below. While
> things may work okay without (I didn't fully convince myself either way),
> the main point here is that you want to make sure we test in debug builds
> what's actually used in release one. Hence subtleties like this would
> better be dealt with uniformly between release and debug builds.

Right, and I think this is a good argument to not try to accept
unaligned size either, even if it would be possible here.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH 07/12] libxl: Allow stubdomain to control interupts of PCI device

2024-07-25 Thread Marek Marczykowski-Górecki

On Thu, Jul 25, 2024 at 02:06:04PM +, Anthony PERARD wrote:
> On Thu, May 16, 2024 at 03:58:28PM +0200, Marek Marczykowski-Górecki wrote:
> > Especially allow it to control MSI/MSI-X enabling bits. This part only
> > writes a flag to a sysfs, the actual implementation is on the kernel
> > side.
> >
> > This requires Linux >= 5.10 in dom0 (or relevant patch backported).
> 
> Does it not work before 5.10? Because the
> Documentation/ABI/testing/sysfs-driver-pciback in linux tree say that
> allow_interrupt_control is in 5.6.

For MSI-X to work at least with Linux it needs a fixup
2c269f42d0f382743ab230308b836ffe5ae9b2ae, which was backported to
5.10.201, but not further. 

> > diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> > index 96cb4da0794e..6f357b70b815 100644
> > --- a/tools/libs/light/libxl_pci.c
> > +++ b/tools/libs/light/libxl_pci.c
> > @@ -1513,6 +1513,14 @@ static void pci_add_dm_done(libxl__egc *egc,
> >  rc = ERROR_FAIL;
> >  goto out;
> >  }
> > +} else if (libxl_is_stubdom(ctx, domid, NULL)) {
> > +/* Allow acces to MSI enable flag in PCI config space for the 
> > stubdom */
> 
> s/acces/access/
> 
> > +if ( sysfs_write_bdf(gc, 
> > SYSFS_PCIBACK_DRIVER"/allow_interrupt_control",
> > + pci) < 0 ) {
> > +LOGD(ERROR, domainid, "Setting allow_interrupt_control for 
> > device");
> > +rc = ERROR_FAIL;
> > +goto out;
> 
> Is it possible to make this non-fatal for cases where the kernel is
> older than the introduction of the new setting? Or does pci passthrough
> doesn't work at all with a stubdom before the change in the kernel?

MSI/MSI-X will not work. And if QEMU wouldn't hide MSI/MSI-X (upstream
one doesn't), Linux won't fallback to INTx, so the device won't work at
all.

> If making this new setting conditional is an option, you could
> potentially improve the error code returned by sysfs_write_bdf() to
> distinguish between an open() failure and write() failure, to avoid
> checking the existance of the path ahead of the call. But maybe that
> pointless because it doesn't appear possible to distinguish between
> permission denied and not found.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: xen | Failed pipeline for staging-4.19 | 2d7b6170

2024-07-24 Thread Marek Marczykowski-Górecki

On Wed, Jul 24, 2024 at 03:36:44PM +, Anthony PERARD wrote:
> On Wed, Jul 24, 2024 at 03:18:50PM +0200, Jan Beulich wrote:
> > On 24.07.2024 15:15, GitLab wrote:
> > > 
> > > 
> > > Pipeline #1385987377 has failed!
> > > 
> > > Project: xen ( https://gitlab.com/xen-project/hardware/xen )
> > > Branch: staging-4.19 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/commits/staging-4.19 )
> > > 
> > > Commit: 2d7b6170 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/commit/2d7b6170cc69f8a1a60c52d87ba61f6b1f180132
> > >  )
> > > Commit Message: hotplug: Restore block-tap phy compatibility (a...
> > > Commit Author: Jason Andryuk ( https://gitlab.com/jandryuk-amd )
> > > Committed by: Jan Beulich ( https://gitlab.com/jbeulich )
> > > 
> > > 
> > > Pipeline #1385987377 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/pipelines/1385987377 ) 
> > > triggered by Jan Beulich ( https://gitlab.com/jbeulich )
> > > had 3 failed jobs.
> > > 
> > > Job #7415912260 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/jobs/7415912260/raw )
> > > 
> > > Stage: test
> > > Name: qemu-alpine-x86_64-gcc
> > 
> > This is the one known to fail more often than not, I think, but ...
> > 
> > > Job #7415912175 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/jobs/7415912175/raw )
> > > 
> > > Stage: build
> > > Name: ubuntu-24.04-x86_64-clang
> > > Job #7415912173 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/jobs/7415912173/raw )
> > > 
> > > Stage: build
> > > Name: ubuntu-22.04-x86_64-gcc
> > 
> > ... for these two I can't spot any failure in the referenced logs. What's 
> > going on
> > there?
> 
> They are crutial information missing in that email, the actual reason
> given by gitlab for the failures: "There has been a timeout failure or
> the job got stuck." (That message can be seen when going to the url,
> removing "/raw" part, and scrolling to the top. Or looking at the side
> bar and seen a duration that well above 1h)
> 
> Communication between gitlab and the runner might be broken in those
> cases, or the runner stop working.

This time the runner VM got hit with
https://lore.kernel.org/xen-devel/ZO0WrR5J0xuwDIxW@mail-itl/ . So, I
guess the failure is warranted, just not the one you'd expect...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Assertion failed at arch/x86/genapic/x2apic.c:38 on S3 resume nested in KVM on AMD

2024-07-23 Thread Marek Marczykowski-Górecki

Hi,

I'm observing a crash like the one below when trying to resume from S3.
It happens on Xen nested in KVM (QEMU 9.0, Linux 6.9.3) but only on AMD.
The very same software stack on Intel works just fine. QEMU is running
with "-cpu host,+svm,+invtsc -machine q35,kernel-irqchip=split -device
amd-iommu,intremap=on -smp 2" among others.

(XEN) Preparing system for ACPI S3 state.
(XEN) Disabling non-boot CPUs ...
(XEN) Broke affinity for IRQ1, new: {0-1}
(XEN) Broke affinity for IRQ20, new: {0-1}
(XEN) Broke affinity for IRQ22, new: {0-1}
(XEN) Entering ACPI S3 state.
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs  ...
(XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus, 
this_cpu))' failed at arch/x86/genapic/x2apic.c:38
(XEN) [ Xen-4.20  x86_64  debug=y  Not tainted ]
(XEN) CPU:1
(XEN) RIP:e008:[] 
x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
(XEN) RFLAGS: 00010096   CONTEXT: hypervisor
(XEN) rax: 830278a25f50   rbx: 0001   rcx: 82d0405e1700
(XEN) rdx: 003233412000   rsi: 8302739da2d8   rdi: 
(XEN) rbp: 00c8   rsp: 8302739d7e78   r8:  0001
(XEN) r9:  8302739d7fa0   r10: 0001   r11: 
(XEN) r12: 0001   r13: 0001   r14: 
(XEN) r15:    cr0: 8005003b   cr4: 007506e0
(XEN) cr3: 7fa7a000   cr2: 
(XEN) fsb:    gsb:    gss: 
(XEN) ds:    es:    fs:    gs:    ss:    cs: e008
(XEN) Xen code around  
(x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9):
(XEN)  cf 82 ff ff eb b7 0f 0b <0f> 0b 48 8d 05 9c fc 33 00 48 8b 0d a5 0a 
35 00
(XEN) Xen stack trace from rsp=8302739d7e78:
(XEN) 00c8 0001 0001
(XEN) 82d0402f1d83 8302739d7fff 00c8
(XEN)0001 0001 82d04031adb9 0001
(XEN)   82d040276677
(XEN)   
(XEN)88810037c000 0001 0246 deadbeefdeadf00d
(XEN)0001   811d130a
(XEN)deadbeefdeadf00d deadbeefdeadf00d deadbeefdeadf00d 0100
(XEN)811d130a e033 0246 c900400b3ef8
(XEN)e02b beef beef beef
(XEN)beef e011 8302739de000 003233412000
(XEN)007506e0   0002
(XEN)0002
(XEN) Xen call trace:
(XEN)[] R 
x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
(XEN)[] S setup_local_APIC+0x26/0x449
(XEN)[] S start_secondary+0x1c4/0x37a
(XEN)[] S __high_start+0x87/0xd0
(XEN) 
(XEN) 
(XEN) 
(XEN) Panic on CPU 1:
(XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus, 
this_cpu))' failed at arch/x86/genapic/x2apic.c:38
(XEN) 

On a release build, it hits "BUG" later on in the same file.

I've tested the attached patch from Roger, but that assertion didn't
fail (or it crashed before reaching that part).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
diff --git a/xen/arch/x86/genapic/x2apic.c b/xen/arch/x86/genapic/x2apic.c
index 371dd100c742..fe8e664e1b63 100644
--- a/xen/arch/x86/genapic/x2apic.c
+++ b/xen/arch/x86/genapic/x2apic.c
@@ -30,8 +30,11 @@ static inline u32 x2apic_cluster(unsigned int cpu)
 static void cf_check init_apic_ldr_x2apic_cluster(void)
 {
 unsigned int cpu, this_cpu = smp_processor_id();
+uint32_t id = apic_read(APIC_ID);
+uint32_t ldr = apic_read(APIC_LDR);
 
-per_cpu(cpu_2_logical_apicid, this_cpu) = apic_read(APIC_LDR);
+ASSERT(x2apic_ldr_from_id(id) == ldr);
+per_cpu(cpu_2_logical_apicid, this_cpu) = ldr;
 
 if ( per_cpu(cluster_cpus, this_cpu) )
 {
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 9cfc82666ae5..2a010a6363b7 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1064,11 +1064,6 @@ static const struct hvm_mmio_ops vlapic_mmio_ops = {
 .write = vlapic_mmio_write,
 };
 
-static uint32_t x2apic_ldr_from_id(uint32_t id)
-{
-return ((id & ~0xf) << 12) | (1 << (id & 0xf));
-}
-
 static void set_x2apic_id(struct vlapic *vlapic)
 {
 const struct vcpu *v = vlapic_vcpu(vlapic);
diff --git a/xen/arch/x86/include/asm/apic.h b/xen/arch/x86/include/asm/api

Re: [PATCH 2/2] x86/efi: Unlock NX if necessary

2024-07-23 Thread Marek Marczykowski-Górecki

On Tue, Jul 23, 2024 at 12:25:32PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Jul 22, 2024 at 11:18:38AM +0100, Andrew Cooper wrote:
> > EFI systems can run with NX disabled, as has been discovered on a Broadwell
> > Supermicro X10SRM-TF system.
> > 
> > Prior to commit fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early 
> > boot
> > path"), the logic to unlock NX was common to all boot paths, but that commit
> > moved it out of the native-EFI booth path.
> > 
> > Have the EFI path attempt to unlock NX, rather than just blindly refusing to
> > boot when CONFIG_REQUIRE_NX is active.
> > 
> > Fixes: fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot path")
> > Link: https://xcp-ng.org/forum/post/80520
> > Reported-by: Gene Bright 
> > Signed-off-by: Andrew Cooper 
> 
> Acked-by: Andrew Cooper 

Ugh, wrong copy paste:
Acked-by: Marek Marczykowski-Górecki 

I should finish my coffee first...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH 2/2] x86/efi: Unlock NX if necessary

2024-07-23 Thread Marek Marczykowski-Górecki

On Mon, Jul 22, 2024 at 11:18:38AM +0100, Andrew Cooper wrote:
> EFI systems can run with NX disabled, as has been discovered on a Broadwell
> Supermicro X10SRM-TF system.
> 
> Prior to commit fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot
> path"), the logic to unlock NX was common to all boot paths, but that commit
> moved it out of the native-EFI booth path.
> 
> Have the EFI path attempt to unlock NX, rather than just blindly refusing to
> boot when CONFIG_REQUIRE_NX is active.
> 
> Fixes: fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot path")
> Link: https://xcp-ng.org/forum/post/80520
> Reported-by: Gene Bright 
> Signed-off-by: Andrew Cooper 

Acked-by: Andrew Cooper 

> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Daniel P. Smith 
> CC: Marek Marczykowski-Górecki 
> CC: Alejandro Vallejo 
> CC: Gene Bright 
> 
> Note.  Entirely speculative coding, based only on the forum report.
> ---
>  xen/arch/x86/efi/efi-boot.h | 33 ++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
> index 4e4be7174751..158350aa14e4 100644
> --- a/xen/arch/x86/efi/efi-boot.h
> +++ b/xen/arch/x86/efi/efi-boot.h
> @@ -736,13 +736,33 @@ static void __init efi_arch_handle_module(const struct 
> file *file,
>  efi_bs->FreePool(ptr);
>  }
>  
> +static bool __init intel_unlock_nx(void)
> +{
> +uint64_t val, disable;
> +
> +rdmsrl(MSR_IA32_MISC_ENABLE, val);
> +
> +disable = val & MSR_IA32_MISC_ENABLE_XD_DISABLE;
> +
> +if ( !disable )
> +return false;
> +
> +wrmsrl(MSR_IA32_MISC_ENABLE, val & ~disable);
> +trampoline_misc_enable_off |= disable;
> +
> +return true;
> +}
> +
>  static void __init efi_arch_cpu(void)
>  {
> -uint32_t eax;
> +uint32_t eax, ebx, ecx, edx;
>  uint32_t *caps = boot_cpu_data.x86_capability;
>  
>  boot_tsc_stamp = rdtsc();
>  
> +cpuid(0, &eax, &ebx, &ecx, &edx);
> +boot_cpu_data.x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
> +
>  caps[FEATURESET_1c] = cpuid_ecx(1);
>  
>  eax = cpuid_eax(0x8000U);
> @@ -752,10 +772,17 @@ static void __init efi_arch_cpu(void)
>  caps[FEATURESET_e1d] = cpuid_edx(0x8001U);
>  
>  /*
> - * This check purposefully doesn't use cpu_has_nx because
> + * These checks purposefully doesn't use cpu_has_nx because
>   * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> - * with CONFIG_REQUIRE_NX
> + * with CONFIG_REQUIRE_NX.
> + *
> + * If NX isn't available, it might be hidden.  Try to reactivate it.
>   */
> +if ( !boot_cpu_has(X86_FEATURE_NX) &&
> + boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> + intel_unlock_nx() )
> +caps[FEATURESET_e1d] = cpuid_edx(0x8001U);
> +
>  if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
>   !boot_cpu_has(X86_FEATURE_NX) )
>  blexit(L"This build of Xen requires NX support");
> -- 
> 2.39.2
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH 1/2] x86/efi: Simplify efi_arch_cpu() a little

2024-07-23 Thread Marek Marczykowski-Górecki

On Mon, Jul 22, 2024 at 11:18:37AM +0100, Andrew Cooper wrote:
> Make the "no extended leaves" case fatal and remove one level of indentation.
> Defer the max-leaf aquisition until it is first used.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Marek Marczykowski-Górecki 

> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Daniel P. Smith 
> CC: Marek Marczykowski-Górecki 
> CC: Alejandro Vallejo 
> CC: Gene Bright 
> ---
>  xen/arch/x86/efi/efi-boot.h | 31 ---
>  1 file changed, 16 insertions(+), 15 deletions(-)
> 
> diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
> index f282358435f1..4e4be7174751 100644
> --- a/xen/arch/x86/efi/efi-boot.h
> +++ b/xen/arch/x86/efi/efi-boot.h
> @@ -738,29 +738,30 @@ static void __init efi_arch_handle_module(const struct 
> file *file,
>  
>  static void __init efi_arch_cpu(void)
>  {
> -uint32_t eax = cpuid_eax(0x8000U);
> +uint32_t eax;
>  uint32_t *caps = boot_cpu_data.x86_capability;
>  
>  boot_tsc_stamp = rdtsc();
>  
>  caps[FEATURESET_1c] = cpuid_ecx(1);
>  
> -if ( (eax >> 16) == 0x8000 && eax > 0x8000U )
> -{
> -caps[FEATURESET_e1d] = cpuid_edx(0x8001U);
> +eax = cpuid_eax(0x8000U);
> +if ( (eax >> 16) != 0x8000 || eax < 0x8000U )
> +blexit(L"In 64bit mode, but no extended CPUID leaves?!?");
>  
> -/*
> - * This check purposefully doesn't use cpu_has_nx because
> - * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> - * with CONFIG_REQUIRE_NX
> - */
> -if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
> - !boot_cpu_has(X86_FEATURE_NX) )
> -blexit(L"This build of Xen requires NX support");
> +caps[FEATURESET_e1d] = cpuid_edx(0x8001U);
>  
> -if ( cpu_has_nx )
> -trampoline_efer |= EFER_NXE;
> -}
> +/*
> + * This check purposefully doesn't use cpu_has_nx because
> + * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> + * with CONFIG_REQUIRE_NX
> + */
> +if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
> +     !boot_cpu_has(X86_FEATURE_NX) )
> +blexit(L"This build of Xen requires NX support");
> +
> +if ( cpu_has_nx )
> +trampoline_efer |= EFER_NXE;
>  }
>  
>  static void __init efi_arch_blexit(void)
> -- 
> 2.39.2
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

[PATCH v6 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-22 Thread Marek Marczykowski-Górecki

In some cases, only few registers on a page needs to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. Current API allows only marking whole pages pages read-only,
which sometimes may cover other registers that guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask which
qwords are supposed to be read-only and an address where page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of more efficient structure, because there
isn't supposed to be many pages needing this precise r/o control.

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash() for
non hardware domains.  To reach mmio_ro_emulated_write(), change how
write violations for p2m_mmio_direct are handled - specifically, check
if they relate to such partially protected page via
subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
them too. This decodes what guest is trying write and finally calls
mmio_ro_emulated_write(). The EPT write violation is detected as
npfec.write_access and npfec.present both being true (similar to other
places), which may cover some other (future?) cases - if that happens,
emulator might get involved unnecessarily, but since it's limited to
pages marked with subpage_mmio_ro_add() only, the impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests (the change in hvm_hap_nested_page_fault()), but only for pages
explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
passed through a device partially used by Xen.
As of the next patch, it applies only configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v6:
- fix return type of subpage_mmio_find_page()
- change 'iter' pointer to 'new_entry' bool and move list_add()
- comment why different error handling for unaligned start / size
- code style
Changes in v5:
- use subpage_mmio_find_page helper, simplifying several functions
- use LIST_HEAD_RO_AFTER_INIT
- don't use subpage_ro_lock in __init
- drop #ifdef in mm.h
- return error on unaligned size in subpage_mmio_ro_add() instead of
  extending the size (in release build)
Changes in v4:
- rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
- guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
  there
- rename ro_qwords to ro_elems
- use unsigned arguments for subpage_mmio_ro_remove_page()
- use volatile for __iomem
- do not set mmio_ro_c

[PATCH v6 0/3] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-07-22 Thread Marek Marczykowski-Górecki

On older systems, XHCI xcap had a layout that no other (interesting) registers
were placed on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Marek Marczykowski-Górecki (3):
  xen/list: add LIST_HEAD_RO_AFTER_INIT
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   4 +-
 xen/arch/x86/include/asm/mm.h   |  23 +++-
 xen/arch/x86/mm.c   | 265 +-
 xen/arch/x86/pv/ro-page-fault.c |   6 +-
 xen/drivers/char/xhci-dbc.c |  36 ++--
 xen/include/xen/list.h  |   3 +-
 7 files changed, 320 insertions(+), 19 deletions(-)

base-commit: a99f25f7ac60544e9af4b3b516d7566ba8841cc4
-- 
git-series 0.9.1

[PATCH v6 1/3] xen/list: add LIST_HEAD_RO_AFTER_INIT

2024-07-22 Thread Marek Marczykowski-Górecki

Similar to LIST_HEAD_READ_MOSTLY.

Signed-off-by: Marek Marczykowski-Górecki 
Acked-by: Jan Beulich 
---
New in v5
---
 xen/include/xen/list.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/include/xen/list.h b/xen/include/xen/list.h
index 6506ac40893b..62169f46742e 100644
--- a/xen/include/xen/list.h
+++ b/xen/include/xen/list.h
@@ -42,6 +42,9 @@ struct list_head {
 #define LIST_HEAD_READ_MOSTLY(name) \
 struct list_head __read_mostly name = LIST_HEAD_INIT(name)
 
+#define LIST_HEAD_RO_AFTER_INIT(name) \
+struct list_head __ro_after_init name = LIST_HEAD_INIT(name)
+
 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
 list->next = list;
-- 
git-series 0.9.1

[PATCH v6 3/3] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-07-22 Thread Marek Marczykowski-Górecki

Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page do contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v4:
- restore mmio_ro_ranges in the fallback case
- set XHCI_SHARE_NONE in the fallback case
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..c45e4b6825cc 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,28 @@ static void __init cf_check 
dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+uart->dbc.share = XHCI_SHARE_NONE;
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+if ( rangeset_add_range(mmio_ro_ranges,
+ PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset),
+ PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+uart->dbc.xhc_dbc_offset +
+sizeof(*uart->dbc.dbc_reg)) - 1) )
+printk(XENLOG_INFO
+   "Error while adding MMIO range of device to 
mmio_ro_ranges\n");
+}
 #endif
 }
 
-- 
git-series 0.9.1

Re: [PATCH] CI: workaround broken selinux+docker interaction in yocto

2024-07-22 Thread Marek Marczykowski-Górecki

On Mon, Jul 22, 2024 at 06:16:51PM +0100, Andrew Cooper wrote:
> On 20/07/2024 1:15 am, Marek Marczykowski-Górecki wrote:
> > `cp --preserve=xattr` doesn't work in docker when SELinux is enabled. It
> > tries to set the "security.selinux" xattr, but SELinux (or overlay fs?)
> > denies it.
> > Workaround it by skipping selinux.selinux xattr copying.
> >
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> > Tested here:
> > https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7386198058
> >
> > But since yocto container fails to build, it isn't exactly easy to apply
> > this patch...
> > "kirkstone" branch of meta-virtualization seems to target Xen 4.15 and
> > 4.16, so it isn't exactly surprising it fails to build with 4.19.
> 
> Why is the external version of Xen relevant to rebuilding the container ?

I think it tries to build xen_git.bb, which fetches "master" branch, and
this fails to build with its current state.

> Or is it that kirkstone has updated since the container was last built?
> 
> I'm not familiar with yocto, and a quick glance at the docs haven't
> helped...
> 
> ~Andrew
> 
> >
> > I tried also bumping yocto version to scarthgap (which supposedly should
> > have updated pygrub patch), but that fails to build for me too, with a
> > different error:
> >
> > ERROR: Layer 'filesystems-layer' depends on layer 'networking-layer', 
> > but this layer is not enabled in your configuration
> > ERROR: Parse failure with the specified layer added, exiting.
> > ...
> > ERROR: Nothing PROVIDES 'xen-image-minimal'. Close matches:
> >   core-image-minimal
> >   core-image-minimal-dev
> > Parsing of 2472 .bb files complete (0 cached, 2472 parsed). 4309 
> > targets, 101 skipped, 0 masked, 0 errors.

In the meantime I've solved this issue by reordering layers in
build-yocto.sh (meta-networking before meta-filesystems). But then, ran
out of disk space (40GB wasn't enough) and hasn't retried yet...

> > ---
> >  automation/build/yocto/yocto.dockerfile.in | 4 
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/automation/build/yocto/yocto.dockerfile.in 
> > b/automation/build/yocto/yocto.dockerfile.in
> > index fbaa4e191caa..600db7bf4d19 100644
> > --- a/automation/build/yocto/yocto.dockerfile.in
> > +++ b/automation/build/yocto/yocto.dockerfile.in
> > @@ -68,6 +68,10 @@ RUN locale-gen en_US.UTF-8 && update-locale 
> > LC_ALL=en_US.UTF-8 \
> >  ENV LANG en_US.UTF-8
> >  ENV LC_ALL en_US.UTF-8
> >  
> > +# Workaround `cp --preserve=xattr` not working in docker when SELinux is
> > +# enabled
> > +RUN echo "security.selinux skip" >> /etc/xattr.conf
> > +
> >  # Create a user for the build (we don't want to build as root).
> >  ENV USER_NAME docker-build
> >  ARG host_uid=1000
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v5 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-22 Thread Marek Marczykowski-Górecki

On Mon, Jul 22, 2024 at 03:01:45PM +0200, Jan Beulich wrote:
> On 22.07.2024 14:36, Marek Marczykowski-Górecki wrote:
> > On Mon, Jul 22, 2024 at 02:09:15PM +0200, Jan Beulich wrote:
> >> On 19.07.2024 04:33, Marek Marczykowski-Górecki wrote:
> >>> +int __init subpage_mmio_ro_add(
> >>> +paddr_t start,
> >>> +size_t size)
> >>> +{
> >>> +mfn_t mfn_start = maddr_to_mfn(start);
> >>> +paddr_t end = start + size - 1;
> >>> +mfn_t mfn_end = maddr_to_mfn(end);
> >>> +unsigned int offset_end = 0;
> >>> +int rc;
> >>> +bool subpage_start, subpage_end;
> >>> +
> >>> +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> >>> +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> >>> +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> >>> +return -EINVAL;
> >>
> >> I think I had asked before: Why is misaligned size something that wants a
> >> release build fallback to the assertion, but not misaligned start?
> > 
> > Misaligned start will lead to protecting larger area, not smaller, so it
> > is not unsafe thing to do. But I can also make it return an error, it
> > shouldn't happen after all.
> 
> Well, I wouldn't mind if you kept what you have, just with a (brief) comment
> making clear why there is a difference in treatment. After all you could
> treat mis-aligned size similarly, making the protected area larger, too.

Ok.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v5 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-22 Thread Marek Marczykowski-Górecki

On Mon, Jul 22, 2024 at 02:09:15PM +0200, Jan Beulich wrote:
> On 19.07.2024 04:33, Marek Marczykowski-Górecki wrote:
> > @@ -4910,6 +4921,254 @@ long arch_memory_op(unsigned long cmd, 
> > XEN_GUEST_HANDLE_PARAM(void) arg)
> >  return rc;
> >  }
> >  
> > +static void __iomem *subpage_mmio_find_page(mfn_t mfn)
> > +{
> > +struct subpage_ro_range *entry;
> 
> With the function returning void*, my first reaction was to ask why this
> isn't pointer-to-const. Yet then ...
> 
> > +list_for_each_entry(entry, &subpage_ro_ranges, list)
> > +if ( mfn_eq(entry->mfn, mfn) )
> > +return entry;
> 
> ... you're actually returning entry here, just with its type zapped for
> no apparent reason. I also question the __iomem in the return type.

Right, a leftover from some earlier version.

> > +static int __init subpage_mmio_ro_add_page(
> > +mfn_t mfn,
> > +unsigned int offset_s,
> > +unsigned int offset_e)
> > +{
> > +struct subpage_ro_range *entry = NULL, *iter;
> > +unsigned int i;
> > +
> > +entry = subpage_mmio_find_page(mfn);
> > +if ( !entry )
> > +{
> > +/* iter == NULL marks it was a newly allocated entry */
> > +iter = NULL;
> 
> Yet you don't use "iter" for other purposes anymore. I think the variable
> wants renaming and shrinking to e.g. a simple bool.

+1

> > +entry = xzalloc(struct subpage_ro_range);
> > +if ( !entry )
> > +return -ENOMEM;
> > +entry->mfn = mfn;
> > +}
> > +
> > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > +{
> > +bool oldbit = __test_and_set_bit(i / MMIO_RO_SUBPAGE_GRAN,
> > +entry->ro_elems);
> 
> Nit: Indentation looks to be off by 1 here.
> 
> > +ASSERT(!oldbit);
> > +}
> > +
> > +if ( !iter )
> > +list_add(&entry->list, &subpage_ro_ranges);
> 
> What's wrong with doing this right in the earlier conditional?
> 
> > +int __init subpage_mmio_ro_add(
> > +paddr_t start,
> > +size_t size)
> > +{
> > +mfn_t mfn_start = maddr_to_mfn(start);
> > +paddr_t end = start + size - 1;
> > +mfn_t mfn_end = maddr_to_mfn(end);
> > +unsigned int offset_end = 0;
> > +int rc;
> > +bool subpage_start, subpage_end;
> > +
> > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > +return -EINVAL;
> 
> I think I had asked before: Why is misaligned size something that wants a
> release build fallback to the assertion, but not misaligned start?

Misaligned start will lead to protecting larger area, not smaller, so it
is not unsafe thing to do. But I can also make it return an error, it
shouldn't happen after all.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

[PATCH] CI: workaround broken selinux+docker interaction in yocto

2024-07-19 Thread Marek Marczykowski-Górecki

`cp --preserve=xattr` doesn't work in docker when SELinux is enabled. It
tries to set the "security.selinux" xattr, but SELinux (or overlay fs?)
denies it.
Workaround it by skipping selinux.selinux xattr copying.

Signed-off-by: Marek Marczykowski-Górecki 
---
Tested here:
https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7386198058

But since yocto container fails to build, it isn't exactly easy to apply
this patch...
"kirkstone" branch of meta-virtualization seems to target Xen 4.15 and
4.16, so it isn't exactly surprising it fails to build with 4.19.

I tried also bumping yocto version to scarthgap (which supposedly should
have updated pygrub patch), but that fails to build for me too, with a
different error:

ERROR: Layer 'filesystems-layer' depends on layer 'networking-layer', but 
this layer is not enabled in your configuration
ERROR: Parse failure with the specified layer added, exiting.
...
ERROR: Nothing PROVIDES 'xen-image-minimal'. Close matches:
  core-image-minimal
  core-image-minimal-dev
Parsing of 2472 .bb files complete (0 cached, 2472 parsed). 4309 targets, 
101 skipped, 0 masked, 0 errors.
---
 automation/build/yocto/yocto.dockerfile.in | 4 
 1 file changed, 4 insertions(+)

diff --git a/automation/build/yocto/yocto.dockerfile.in 
b/automation/build/yocto/yocto.dockerfile.in
index fbaa4e191caa..600db7bf4d19 100644
--- a/automation/build/yocto/yocto.dockerfile.in
+++ b/automation/build/yocto/yocto.dockerfile.in
@@ -68,6 +68,10 @@ RUN locale-gen en_US.UTF-8 && update-locale 
LC_ALL=en_US.UTF-8 \
 ENV LANG en_US.UTF-8
 ENV LC_ALL en_US.UTF-8
 
+# Workaround `cp --preserve=xattr` not working in docker when SELinux is
+# enabled
+RUN echo "security.selinux skip" >> /etc/xattr.conf
+
 # Create a user for the build (we don't want to build as root).
 ENV USER_NAME docker-build
 ARG host_uid=1000
-- 
2.45.2

[PATCH v5 0/3] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-07-18 Thread Marek Marczykowski-Górecki

On older systems, XHCI xcap had a layout that no other (interesting) registers
were placed on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Marek Marczykowski-Górecki (3):
  xen/list: add LIST_HEAD_RO_AFTER_INIT
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   4 +-
 xen/arch/x86/include/asm/mm.h   |  23 +++-
 xen/arch/x86/mm.c   | 262 +-
 xen/arch/x86/pv/ro-page-fault.c |   6 +-
 xen/drivers/char/xhci-dbc.c |  36 +++--
 xen/include/xen/list.h  |   3 +-
 7 files changed, 317 insertions(+), 19 deletions(-)

base-commit: a99f25f7ac60544e9af4b3b516d7566ba8841cc4
-- 
git-series 0.9.1

[PATCH v5 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-18 Thread Marek Marczykowski-Górecki

In some cases, only few registers on a page needs to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. Current API allows only marking whole pages pages read-only,
which sometimes may cover other registers that guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask which
qwords are supposed to be read-only and an address where page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of more efficient structure, because there
isn't supposed to be many pages needing this precise r/o control.

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash() for
non hardware domains.  To reach mmio_ro_emulated_write(), change how
write violations for p2m_mmio_direct are handled - specifically, check
if they relate to such partially protected page via
subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
them too. This decodes what guest is trying write and finally calls
mmio_ro_emulated_write(). The EPT write violation is detected as
npfec.write_access and npfec.present both being true (similar to other
places), which may cover some other (future?) cases - if that happens,
emulator might get involved unnecessarily, but since it's limited to
pages marked with subpage_mmio_ro_add() only, the impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests (the change in hvm_hap_nested_page_fault()), but only for pages
explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
passed through a device partially used by Xen.
As of the next patch, it applies only configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v5:
- use subpage_mmio_find_page helper, simplifying several functions
- use LIST_HEAD_RO_AFTER_INIT
- don't use subpage_ro_lock in __init
- drop #ifdef in mm.h
- return error on unaligned size in subpage_mmio_ro_add() instead of
  extending the size (in release build)
Changes in v4:
- rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
- guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
  there
- rename ro_qwords to ro_elems
- use unsigned arguments for subpage_mmio_ro_remove_page()
- use volatile for __iomem
- do not set mmio_ro_ctxt.mfn for mmcfg case
- comment where fields of mmio_ro_ctxt are used
- use bool for result of __test_and_set_bit
- do not open-code mfn_to_maddr()
- remove leftover RCU
- mention hvm_hap_nested_page_fault() explicitly in the co

[PATCH v5 3/3] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-07-18 Thread Marek Marczykowski-Górecki

Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page do contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v4:
- restore mmio_ro_ranges in the fallback case
- set XHCI_SHARE_NONE in the fallback case
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..c45e4b6825cc 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,28 @@ static void __init cf_check 
dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+uart->dbc.share = XHCI_SHARE_NONE;
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+if ( rangeset_add_range(mmio_ro_ranges,
+ PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset),
+ PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+uart->dbc.xhc_dbc_offset +
+sizeof(*uart->dbc.dbc_reg)) - 1) )
+printk(XENLOG_INFO
+   "Error while adding MMIO range of device to 
mmio_ro_ranges\n");
+}
 #endif
 }
 
-- 
git-series 0.9.1

[PATCH v5 1/3] xen/list: add LIST_HEAD_RO_AFTER_INIT

2024-07-18 Thread Marek Marczykowski-Górecki

Similar to LIST_HEAD_READ_MOSTLY.

Signed-off-by: Marek Marczykowski-Górecki 
---
New in v5
---
 xen/include/xen/list.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/include/xen/list.h b/xen/include/xen/list.h
index 6506ac40893b..62169f46742e 100644
--- a/xen/include/xen/list.h
+++ b/xen/include/xen/list.h
@@ -42,6 +42,9 @@ struct list_head {
 #define LIST_HEAD_READ_MOSTLY(name) \
 struct list_head __read_mostly name = LIST_HEAD_INIT(name)
 
+#define LIST_HEAD_RO_AFTER_INIT(name) \
+struct list_head __ro_after_init name = LIST_HEAD_INIT(name)
+
 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
 list->next = list;
-- 
git-series 0.9.1

Re: [PATCH for-4.19] docs/checklist: Fix XEN_EXTRAVERSION inconsistency for release candidates

2024-07-16 Thread Marek Marczykowski-Górecki

On Tue, Jul 16, 2024 at 03:55:58PM +0200, Anthony PERARD wrote:
> On Tue, Jul 16, 2024 at 10:22:18AM +0200, Juergen Gross wrote:
> > On 16.07.24 09:46, Jan Beulich wrote:
> > > On 16.07.2024 09:33, Julien Grall wrote:
> > > > Hi,
> > > > 
> > > > On 16/07/2024 08:24, Jan Beulich wrote:
> > > > > On 16.07.2024 09:22, Julien Grall wrote:
> > > > > > On 16/07/2024 07:47, Jan Beulich wrote:
> > > > > > > On 15.07.2024 18:56, Julien Grall wrote:
> > > > > > > > On 15/07/2024 16:50, Andrew Cooper wrote:
> > > > > > > > > An earlier part of the checklist states:
> > > > > > > > > 
> > > > > > > > >   * change xen-unstable README. The banner (generated 
> > > > > > > > > using figlet) should say:
> > > > > > > > >   - "Xen 4.5" in releases and on stable branches
> > > > > > > > >   - "Xen 4.5-unstable" on unstable
> > > > > > > > >   - "Xen 4.5-rc" for release candidate
> > > > > > > > > 
> > > > > > > > > Update the notes about XEN_EXTRAVERSION to match.
> > > > > 
> > > > > When this is the purpose of the patch, ...
> > > > > 
> > > > > > > > We have been tagging the tree with 4.5.0-rcX. So I think it 
> > > > > > > > would be
> > > > > > > > better to update the wording so we use a consistent naming.
> > > > > > > 
> > > > > > > I find:
> > > > > > > 
> > > > > > > 4.18-rc
> > > > > > > 4.17-rc
> > > > > > > 4.16-rc
> > > > > > > 4.15-rc
> > > > > > 
> > > > > > Hmmm... I don't think we are looking at the same thing. I was
> > > > > > specifically looking at the tag and *not* XEN_EXTRAVERSION.
> > > > > 
> > > > > ... why would we be looking at tags?
> > > > 
> > > > As I wrote, consistency across the naming scheme we use.
> > > > 
> > > > > The tags (necessarily) have RC numbers,
> > > > 
> > > > Right but they also *have* the .0.
> > > > 
> > > > > so are going to be different from XEN_EXTRAVERSION in any event.
> > > > 
> > > > Sure they are not going to be 100% the same. However, they could have
> > > > some similarity.
> > > > 
> > > > As I pointed out multiple times now, to me it is odd we are tagging the
> > > > tree with 4.19.0-rcX, but we use 4.19-rc.
> > > > 
> > > > Furthermore, if you look at the history of the document. It is quite
> > > > clear that the goal was consistency (the commit mentioned by Andrew
> > > > happened after). Yes it wasn't respected but I can't tell exactly why.
> > > > 
> > > > So as we try to correct the documentation, I think we should also look
> > > > at consistency. If you *really* want to drop the .0, then I think it
> > > > should happen for the tag as well (again for consistency).
> > > 
> > > I don't see why (but I also wouldn't mind the dropping from the tag).
> > > They are going to be different. Whether they're different in one or two
> > > aspects is secondary to me. I rather view the consistency goal to be
> > > with what we've been doing in the last so many releases.
> > 
> > Another aspect to look at would be version sorting. This will be interesting
> > when e.g. having a Xen rpm package installed and upgrading it with a later
> > version. I don't think we want to regard replacing an -rc version with the
> > .0 version to be a downgrade, so the the version numbers should be sorted by
> > "sort -V" in the correct order.
> 
> Packages version from distribution is not something we have to deal with
> upstream. It's for the one writing the package version to make sure
> that -rc are older than actual release.
> 
> While trying to to find if SPEC files where dealing with "-rc" suffix,
> I found a doc for fedora telling how to deal with RCs:
> https://docs.fedoraproject.org/en-US/packaging-guidelines/Versioning/
> They say to replace the dash with a tilde, so "-rc" become "~rc", and
> package manager know what to do with it.
> 
> Some other distribution know how to deal with "rc" suffix, but the dash
> "-" isn't actually allowed in the version string:
> https://man.archlinux.org/man/vercmp.8
> 
> So unless we forgo "-rc" in tags, there's no way we can take into
> account how distributions package manager sorts version numbers. Also,
> there's no need to, it is the job of the packager to deal with version
> number, we just need to make is simple enough and consistent.

XEN_EXTRAVERSION isn't only about version for packaging (where indeed
some changes for -rc will likely be needed anyway, as different packages
have different ways of dealing with it). It's also about version
reported by Xen in various places like `xl info xen_version`. IMO it
makes sense to have consistent format there (always 3 parts separated by
a dot). It makes live easier for any tooling making use of this value.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: systemd units are not installed in 4.19.0-rc2 anymore

2024-07-15 Thread Marek Marczykowski-Górecki

On Mon, Jul 15, 2024 at 11:07:42AM +0100, Andrew Cooper wrote:
> On 15/07/2024 9:11 am, Jan Beulich wrote:
> > On 13.07.2024 15:02, Andrew Cooper wrote:
> >> On 13/07/2024 3:45 am, Marek Marczykowski-Górecki wrote:
> >>> Hi,
> >>>
> >>> Something has changed between -rc1 and -rc2 that systemd units are not
> >>> installed anymore by default.
> >>>
> >>> Reproducer: 
> >>>
> >>> ./configure --prefix=/usr
> >>> make dist-tools
> >>> ls dist/install/usr/lib/systemd/system
> >>>
> >>> It does work, if I pass --enable-systemd to ./configure.
> >>>
> >>> My guess is the actual change is earlier, specifically 6ef4fa1e7fe7
> >>> "tools: (Actually) drop libsystemd as a dependency", but configure was
> >>> regenerated only later. But TBH, I don't fully understand interaction
> >>> between those m4 macros...
> >> Between -rc1 and -rc2 was 7cc95f41669d
> >>
> >> That regenerated the existing configure scripts with Autoconf 2.71, vs
> >> 2.69 previously.
> >>
> >> Glancing through again, I can't spot anything that looks relevant.
> >>
> >>
> >> 6ef4fa1e7fe7 for systemd itself was regenerated, and I had to go out of
> >> my way to get autoconf 2.69 to do it.
> > Yet was it correct for that commit to wholesale drop
> > AX_CHECK_SYSTEMD_ENABLE_AVAILABLE? That's, afaics, the only place where
> > $systemd would have been set to y in the absence of --enable-systemd.
> 
> Hmm.
> 
> Yes it was right to drop that, because the whole purpose of the change
> was to break the dependency with systemd.
> 
> Thereafter, looking for systemd in the build environment is a bogus
> heuristic, and certainly one which would go wrong in XenServer's build
> system where the Mock chroot strictly has only the declared dependencies.
> 
> I see two options.
> 
> 1) Activate Systemd by default on Linux now (as it's basically free), or
> 2) Update CHANGELOG to note this behaviour
> 
> Personally I think 2 is the better option, because we don't special case
> the other init systems.

But we do install classic init scripts by default, no? Why not systemd
units then (which are harmless on non-systemd system)?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

systemd units are not installed in 4.19.0-rc2 anymore

2024-07-12 Thread Marek Marczykowski-Górecki

Hi,

Something has changed between -rc1 and -rc2 that systemd units are not
installed anymore by default.

Reproducer: 

./configure --prefix=/usr
make dist-tools
ls dist/install/usr/lib/systemd/system

It does work, if I pass --enable-systemd to ./configure.

My guess is the actual change is earlier, specifically 6ef4fa1e7fe7
"tools: (Actually) drop libsystemd as a dependency", but configure was
regenerated only later. But TBH, I don't fully understand interaction
between those m4 macros...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH 06/17] x86/EFI: address violations of MISRA C:2012 Directive 4.10

2024-07-01 Thread Marek Marczykowski-Górecki

On Mon, Jul 01, 2024 at 03:36:01PM +0200, Alessandro Zucchelli wrote:
> From: Simone Ballarin 
> 
> Add inclusion guard to address violations of
> MISRA C:2012 Directive 4.10 ("Precautions shall be taken in order
> to prevent the contents of a header file being included more than
> once").
> 
> Mechanical change.
> 
> Signed-off-by: Simone Ballarin 
> Signed-off-by: Maria Celeste Cesario 
> Signed-off-by: Nicola Vetrini 
> Signed-off-by: Alessandro Zucchelli 
> 
> ---
> Changes in v4:
> - Modified inclusion guard.
> Changes in v3:
> - remove trailing underscores
> - change inclusion guard name to adhere to the new standard
> Changes in v2:
> - remove changes in "xen/arch/x86/efi/efi-boot.h"
> 
> Note:
> Changes in efi-boot.h have been removed since the file is
> intenteded to be included by common/efi/boot.c only. This motivation
> is not enough to raise a deviation record, so the violation is
> still present.

I'm confused by this comment. It says changes in efi-boot.h have been
removed, yet the patch does include them.

> ---
>  xen/arch/x86/efi/efi-boot.h | 7 +++
>  xen/arch/x86/efi/runtime.h  | 5 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
> index f282358435..c6be744f2b 100644
> --- a/xen/arch/x86/efi/efi-boot.h
> +++ b/xen/arch/x86/efi/efi-boot.h
> @@ -3,6 +3,11 @@
>   * is intended to be included by common/efi/boot.c _only_, and
>   * therefore can define arch specific global variables.
>   */
> +
> +#ifndef X86_EFI_EFI_BOOT_H
> +#define X86_EFI_EFI_BOOT_H
> +
> +
>  #include 
>  #include 
>  #include 
> @@ -912,6 +917,8 @@ void asmlinkage __init efi_multiboot2(EFI_HANDLE 
> ImageHandle,
>  efi_exit_boot(ImageHandle, SystemTable);
>  }
>  
> +#endif /* X86_EFI_EFI_BOOT_H */
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/x86/efi/runtime.h b/xen/arch/x86/efi/runtime.h
> index 77866c5f21..88ab5651e9 100644
> --- a/xen/arch/x86/efi/runtime.h
> +++ b/xen/arch/x86/efi/runtime.h
> @@ -1,3 +1,6 @@
> +#ifndef X86_EFI_RUNTIME_H
> +#define X86_EFI_RUNTIME_H
> +
>  #include 
>  #include 
>  #include 
> @@ -17,3 +20,5 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t 
> l4e)
>  }
>  }
>  #endif
> +
> +#endif /* X86_EFI_RUNTIME_H */
> -- 
> 2.34.1
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Regression in xen-blkfront regarding sector sizes

2024-06-24 Thread Marek Marczykowski-Górecki

Hi,

Some Qubes users report a regression in xen-blkfront regarding block
size reporting. It works fine on 6.8.8, but appears broken on 6.9.2.

The specific problem is that blkfront reports block size of 512, even for
backend devices of 4096. This, for example, fails 512-bytes reads with
O_DIRECT, and appears to break mounting a filesystem on such a device
(at least xfs one).

For example it looks like this:

[user@dom0 ~]$ head /sys/block/loop12/queue/*_block_size
==> /sys/block/loop12/queue/logical_block_size <==
4096

==> /sys/block/loop12/queue/physical_block_size <==
4096

[user@dom0 bin]$ qvm-run -p the-vm 'head /sys/block/xvdi/queue/*_block_size'
==> /sys/block/xvdi/queue/logical_block_size <==
512

==> /sys/block/xvdi/queue/physical_block_size <==
512

and then:

$ sudo dd if=/dev/xvdi of=/dev/null count=1 status=progress iflag=direct
/usr/bin/dd: error reading '/dev/xvdi': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.000170858 s, 0.0 kB/s

and mounting fails like this:

[   68.055045] SGI XFS with ACLs, security attributes, realtime, scrub, 
quota, no debug enabled
[   68.057308] I/O error, dev xvdi, sector 0 op 0x0:(READ) flags 0x1000 
phys_seg 1 prio class 0
[   68.057333] XFS (xvdi): SB validate failed with error -5.

More details at https://github.com/QubesOS/qubes-issues/issues/9293

Rusty suspects it's related to
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/block/xen-blkfront.c?id=ba3f67c1163812b5d7ec33705c31edaa30ce6c51,
so I'm cc-ing people mentioned there too.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH for-4.19 v2] tools/xl: Open xldevd.log with O_CLOEXEC

2024-06-21 Thread Marek Marczykowski-Górecki

On Fri, Jun 21, 2024 at 05:16:56PM +0100, Andrew Cooper wrote:
> `xl devd` has been observed leaking /var/log/xldevd.log into children.
> 
> Note this is specifically safe; dup2() leaves O_CLOEXEC disabled on newfd, so
> after setting up stdout/stderr, it's only the logfile fd which will close on
> exec().
> 
> Link: https://github.com/QubesOS/qubes-issues/issues/8292
> Reported-by: Demi Marie Obenour 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Marek Marczykowski-Górecki 

> ---
> CC: Anthony PERARD 
> CC: Juergen Gross 
> CC: Demi Marie Obenour 
> CC: Marek Marczykowski-Górecki 
> CC: Oleksii Kurochko 
> 
> Also entirely speculative based on the QubesOS ticket.
> 
> v2:
>  * Extend the commit message to explain why stdout/stderr aren't closed by
>this change
> 
> For 4.19.  This bugfix was posted earlier, but fell between the cracks.
> ---
>  tools/xl/xl_utils.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> index 17489d182954..060186db3a59 100644
> --- a/tools/xl/xl_utils.c
> +++ b/tools/xl/xl_utils.c
> @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
>  exit(-1);
>  }
>  
> -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
> +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | 
> O_CLOEXEC, 0644));
>  free(fullname);
>  assert(logfile >= 3);
>  
> -- 
> 2.39.2
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: Design session notes: GPU acceleration in Xen

2024-06-17 Thread Marek Marczykowski-Górecki

On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > In both cases, the device physical
> > addresses are identical to dom0’s physical addresses.
> 
> Yes, but a PV dom0 physical address space can be very scattered.
> 
> IIRC there's an hypercall to request physically contiguous memory for
> PV, but you don't want to be using that every time you allocate a
> buffer (not sure it would support the sizes needed by the GPU
> anyway).

Indeed that isn't going to fly. In older Qubes versions we had PV
sys-net with PCI passthrough for a network card. After some uptime it
was basically impossible to restart and still have enough contagious
memory for a network driver, and there it was about _much_ smaller
buffers, like 2M or 4M. At least not without shutting down a lot more
things to free some more memory.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

signature.asc
Description: PGP signature

Re: ACPI NVS range conflicting with Dom0 page tables (or kernel image)

2024-06-17 Thread Marek Marczykowski-Górecki

On Mon, Jun 17, 2024 at 01:22:37PM +0200, Jan Beulich wrote:
> Hello,
> 
> while it feels like we had a similar situation before, I can't seem to be
> able to find traces thereof, or associated (Linux) commits.

Is it some AMD Threadripper system by a chance? Previous thread on this
issue:
https://lore.kernel.org/xen-devel/CAOCpoWdOH=xgxiqsc1c5ueb1thxajh4wizbczq-qt+d_kak...@mail.gmail.com/

> With
> 
> (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x400
> ...
> (XEN)  Dom0 alloc.:   00044000->00044800 (619175 pages to be 
> allocated)
> ...
> (XEN)  Loaded kernel: 8100->8400
> 
> the kernel occupies the space from 16Mb to 64Mb in the initial allocation.
> Page tables come (almost) directly above:
> 
> (XEN)  Page tables:   84001000->84026000
> 
> I.e. they're just above the 64Mb boundary. Yet sadly in the host E820 map
> there is
> 
> (XEN)  [0400, 04009fff] (ACPI NVS)
> 
> i.e. a non-RAM range starting at 64Mb. The kernel (currently) won't tolerate
> such an overlap (also if it was overlapping the kernel image, e.g. if on the
> machine in question s sufficiently much larger kernel was used). Yet with its
> fundamental goal of making its E820 match the host one I'm also in trouble
> thinking of possible solutions / workarounds. I certainly do not see Xen
> trying to cover for this, as the E820 map re-arrangement is purely a kernel
> side decision (forward ported kernels got away without, and what e.g. the
> BSDs do is entirely unknown to me).

In Qubes we have worked around the issue by moving the kernel lower
(CONFIG_PHYSICAL_START=0x20):
https://github.com/QubesOS/qubes-linux-kernel/commit/3e8be4ac1682370977d4d0dc1d782c428d860282

Far from ideal, but gets it bootable...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Marek Marczykowski-Górecki

On Tue, Jun 11, 2024 at 04:07:03PM +0200, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 03:15:42PM +0200, Marek Marczykowski-Górecki wrote:
> > It's location is discovered at startup
> > (device presents a linked-list of capabilities in one of its BARs).
> > The spec talks only about alignment of individual registers, not the
> > whole group...
> 
> Never mind then, I had the expectation we could get away with a single
> page, but doesn't look to be the case.
> 
> I assume the spec doesn't mention anything about the BAR where the
> capabilities reside having a size <= 4KiB.

No, and in fact I see it's a BAR of 64KiB on one of devices...

> > > Maybe worth adding a comment that the logic here intends to deal only
> > > with the RW bits of a page that's otherwise RO, and that by not
> > > handling the RO regions the intention is that those are dealt just
> > > like fully RO pages.
> > 
> > I can extend the comment, but I assumed it's kinda implied already (if
> > nothing else, by the function name).
> 
> Well, at this point we know the write is not going to make it to host
> memory.  The only reason to not handle the access here is that we want
> to unify the consequences it has for a guest writing to a RO address.

Yup.

> > > I guess there's some message printed when attempting to write to a RO
> > > page that you would also like to print here?
> > 
> > If a HVM domain writes to an R/O area, it is crashed, so you will get a
> > message. This applies to both full page R/O and partial R/O. PV doesn't
> > go through subpage_mmio_write_accept().
> 
> Oh, crashing the domain is more strict than I was expecting.

That's how it was before, I'm not really changing it here. It's less
strict for PV though (it either gets a #PF forwarded back to the guest,
or is ignored).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Marek Marczykowski-Górecki

On Tue, Jun 11, 2024 at 02:55:22PM +0200, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 01:38:35PM +0200, Marek Marczykowski-Górecki wrote:
> > On Tue, Jun 11, 2024 at 12:40:49PM +0200, Roger Pau Monné wrote:
> > > On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki 
> > > wrote:
> > > > +if ( !entry )
> > > > +{
> > > > +/* iter == NULL marks it was a newly allocated entry */
> > > > +iter = NULL;
> > > > +entry = xzalloc(struct subpage_ro_range);
> > > > +if ( !entry )
> > > > +return -ENOMEM;
> > > > +entry->mfn = mfn;
> > > > +}
> > > > +
> > > > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > > > +{
> > > > +bool oldbit = __test_and_set_bit(i / MMIO_RO_SUBPAGE_GRAN,
> > > > +entry->ro_elems);
> > > > +ASSERT(!oldbit);
> > > > +}
> > > > +
> > > > +if ( !iter )
> > > > +list_add(&entry->list, &subpage_ro_ranges);
> > > > +
> > > > +return iter ? 1 : 0;
> > > > +}
> > > > +
> > > > +/* This needs subpage_ro_lock already taken */
> > > > +static void __init subpage_mmio_ro_remove_page(
> > > > +mfn_t mfn,
> > > > +unsigned int offset_s,
> > > > +unsigned int offset_e)
> > > > +{
> > > > +struct subpage_ro_range *entry = NULL, *iter;
> > > > +unsigned int i;
> > > > +
> > > > +list_for_each_entry(iter, &subpage_ro_ranges, list)
> > > > +{
> > > > +if ( mfn_eq(iter->mfn, mfn) )
> > > > +{
> > > > +entry = iter;
> > > > +break;
> > > > +}
> > > > +}
> > > > +if ( !entry )
> > > > +return;
> > > > +
> > > > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > > > +__clear_bit(i / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems);
> > > > +
> > > > +if ( !bitmap_empty(entry->ro_elems, PAGE_SIZE / 
> > > > MMIO_RO_SUBPAGE_GRAN) )
> > > > +return;
> > > > +
> > > > +list_del(&entry->list);
> > > > +if ( entry->mapped )
> > > > +iounmap(entry->mapped);
> > > > +xfree(entry);
> > > > +}
> > > > +
> > > > +int __init subpage_mmio_ro_add(
> > > > +paddr_t start,
> > > > +size_t size)
> > > > +{
> > > > +mfn_t mfn_start = maddr_to_mfn(start);
> > > > +paddr_t end = start + size - 1;
> > > > +mfn_t mfn_end = maddr_to_mfn(end);
> > > > +unsigned int offset_end = 0;
> > > > +int rc;
> > > > +bool subpage_start, subpage_end;
> > > > +
> > > > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > > > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > > > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > > > +size = ROUNDUP(size, MMIO_RO_SUBPAGE_GRAN);
> > > > +
> > > > +if ( !size )
> > > > +return 0;
> > > > +
> > > > +if ( mfn_eq(mfn_start, mfn_end) )
> > > > +{
> > > > +/* Both starting and ending parts handled at once */
> > > > +subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != 
> > > > PAGE_SIZE - 1;
> > > > +subpage_end = false;
> > > 
> > > Given the intended usage of this, don't we want to limit to only a
> > > single page?  So that PFN_DOWN(start + size) == PFN_DOWN/(start), as
> > > that would simplify the logic here?
> > 
> > I have considered that, but I haven't found anything in the spec
> > mandating the XHCI DbC registers to not cross page boundary. Currently
> > (on a system I test this on) they don't cross page boundary, but I don't
> > want to assume extra constrains - to avoid issues like before (when
> > on the older system I tested the DbC registers didn't shared page with
> > other registers, but then they shared the page on a newer hardware).
> 
> Oh, from our conversation at XenSummit I got the impression debug

Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Marek Marczykowski-Górecki

On Tue, Jun 11, 2024 at 12:40:49PM +0200, Roger Pau Monné wrote:
> On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki wrote:
> > In some cases, only few registers on a page needs to be write-protected.
> > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > PBA table (which doesn't need to span the whole table either), although
> > in the latter case the spec forbids placing other registers on the same
> > page. Current API allows only marking whole pages pages read-only,
> > which sometimes may cover other registers that guest may need to
> > write into.
> > 
> > Currently, when a guest tries to write to an MMIO page on the
> > mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> > from userspace (like, /dev/mem), it will try to fixup by updating page
> > tables (that Xen again will force to read-only) and will hit that #PF
> > again (looping endlessly). Both behaviors are undesirable if guest could
> > actually be allowed the write.
> > 
> > Introduce an API that allows marking part of a page read-only. Since
> > sub-page permissions are not a thing in page tables (they are in EPT,
> > but not granular enough), do this via emulation (or simply page fault
> > handler for PV) that handles writes that are supposed to be allowed.
> > The new subpage_mmio_ro_add() takes a start physical address and the
> > region size in bytes. Both start address and the size need to be 8-byte
> > aligned, as a practical simplification (allows using smaller bitmask,
> > and a smaller granularity isn't really necessary right now).
> > It will internally add relevant pages to mmio_ro_ranges, but if either
> > start or end address is not page-aligned, it additionally adds that page
> > to a list for sub-page R/O handling. The list holds a bitmask which
> > qwords are supposed to be read-only and an address where page is mapped
> > for write emulation - this mapping is done only on the first access. A
> > plain list is used instead of more efficient structure, because there
> > isn't supposed to be many pages needing this precise r/o control.
> > 
> > The mechanism this API is plugged in is slightly different for PV and
> > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > it's already called for #PF on read-only MMIO page. For HVM however, EPT
> > violation on p2m_mmio_direct page results in a direct domain_crash() for
> > non hardware domains.  To reach mmio_ro_emulated_write(), change how
> > write violations for p2m_mmio_direct are handled - specifically, check
> > if they relate to such partially protected page via
> > subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
> > them too. This decodes what guest is trying write and finally calls
> > mmio_ro_emulated_write(). The EPT write violation is detected as
> > npfec.write_access and npfec.present both being true (similar to other
> > places), which may cover some other (future?) cases - if that happens,
> > emulator might get involved unnecessarily, but since it's limited to
> > pages marked with subpage_mmio_ro_add() only, the impact is minimal.
> > Both of those paths need an MFN to which guest tried to write (to check
> > which part of the page is supposed to be read-only, and where
> > the page is mapped for writes). This information currently isn't
> > available directly in mmio_ro_emulated_write(), but in both cases it is
> > already resolved somewhere higher in the call tree. Pass it down to
> > mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
> > 
> > This may give a bit more access to the instruction emulator to HVM
> > guests (the change in hvm_hap_nested_page_fault()), but only for pages
> > explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
> > passed through a device partially used by Xen.
> > As of the next patch, it applies only configuration explicitly
> > documented as not security supported.
> > 
> > The subpage_mmio_ro_add() function cannot be called with overlapping
> > ranges, and on pages already added to mmio_ro_ranges separately.
> > Successful calls would result in correct handling, but error paths may
> > result in incorrect state (like pages removed from mmio_ro_ranges too
> > early). Debug build has asserts for relevant cases.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> > Shadow mode is not tested, but I don't expect it to work differently

Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Marek Marczykowski-Górecki

On Fri, Jun 07, 2024 at 09:01:25AM +0200, Jan Beulich wrote:
> On 22.05.2024 17:39, Marek Marczykowski-Górecki wrote:
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,34 @@ extern struct rangeset *mmio_ro_ranges;
> >  void memguard_guard_stack(void *p);
> >  void memguard_unguard_stack(void *p);
> >  
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Range specified here
> > + * will still be R/O, but the rest of the page (not marked as R/O via 
> > another
> > + * call) will have writes passed through.
> > + * The start address and the size must be aligned to MMIO_RO_SUBPAGE_GRAN.
> > + *
> > + * This API cannot be used for overlapping ranges, nor for pages already 
> > added
> > + * to mmio_ro_ranges separately.
> > + *
> > + * Since there is currently no subpage_mmio_ro_remove(), relevant device 
> > should
> > + * not be hot-unplugged.
> 
> Yet there are no guarantees whatsoever. I think we should refuse
> hot-unplug attempts (not just here, but also e.g. for an EHCI
> controller that we use the debug feature of), but doing so would
> likely require coordination with Dom0. Nothing to be done right
> here, of course.
> 
> > + * Return values:
> > + *  - negative: error
> > + *  - 0: success
> > + */
> > +#define MMIO_RO_SUBPAGE_GRAN 8
> > +int subpage_mmio_ro_add(paddr_t start, size_t size);
> > +#ifdef CONFIG_HVM
> > +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla);
> > +#endif
> 
> I'd suggest to omit the #ifdef here. The declaration alone doesn't
> hurt, and the #ifdef harms readability (if only a bit).

Ok.


> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -150,6 +150,17 @@ bool __read_mostly machine_to_phys_mapping_valid;
> >  
> >  struct rangeset *__read_mostly mmio_ro_ranges;
> >  
> > +/* Handling sub-page read-only MMIO regions */
> > +struct subpage_ro_range {
> > +struct list_head list;
> > +mfn_t mfn;
> > +void __iomem *mapped;
> > +DECLARE_BITMAP(ro_elems, PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN);
> > +};
> > +
> > +static LIST_HEAD(subpage_ro_ranges);
> 
> With modifications all happen from __init code, this likely wants
> to be LIST_HEAD_RO_AFTER_INIT() (which would need introducing, to
> parallel LIST_HEAD_READ_MOSTLY()).

Makes sense. And then I would be comfortable with dropping the spinlock
as Roger suggested.
I tried to make this API a bit more generic than I currently need, but
indeed it can be simplified for this particular use case.

> > +int __init subpage_mmio_ro_add(
> > +paddr_t start,
> > +size_t size)
> > +{
> > +mfn_t mfn_start = maddr_to_mfn(start);
> > +paddr_t end = start + size - 1;
> > +mfn_t mfn_end = maddr_to_mfn(end);
> > +unsigned int offset_end = 0;
> > +int rc;
> > +bool subpage_start, subpage_end;
> > +
> > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > +size = ROUNDUP(size, MMIO_RO_SUBPAGE_GRAN);
> 
> I'm puzzled: You first check suitable alignment and then adjust size
> to have suitable granularity. Either it is a mistake to call the
> function with a bad size, or it is not. If it's a mistake, the
> release build alternative to the assertion would be to return an
> error. If it's not a mistake, the assertion ought to go away.
> 
> If the assertion is to stay, then I'll further question why the
> other one doesn't also have release build safety fallback logic.

For some reason I read your earlier comment as a request to (try to)
continue safely in this case. But indeed an error is a better option, it
isn't supposed to happen anyway.

> > +if ( !size )
> > +return 0;
> > +
> > +if ( mfn_eq(mfn_start, mfn_end) )
> > +{
> > +/* Both starting and ending parts handled at once */
> > +subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != 
> > PAGE_SIZE - 1;
> > +subpage_end = false;
> > +}
> > +else
> > +{
> > +subpage_start = PAGE_OFFSET(start);
> > +subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > +}
> 
> Since you calculate "end" before adjusting "size", the logic here
> depends on there being the assertion further up.
> 
> Jan

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH for-4.19 v1] automation: add a test for HVM domU on PVH dom0

2024-06-10 Thread Marek Marczykowski-Górecki

On Mon, Jun 10, 2024 at 04:25:01PM +0100, Andrew Cooper wrote:
> On 10/06/2024 2:32 pm, Marek Marczykowski-Górecki wrote:
> > This tests if QEMU works in PVH dom0. QEMU in dom0 requires enabling TUN
> > in the kernel, so do that too.
> >
> > Add it to both x86 runners, similar to the PVH domU test.
> >
> > Signed-off-by: Marek Marczykowski-Górecki 
> 
> Acked-by: Andrew Cooper 
> 
> CC Oleksii.
> 
> > ---
> > Requires rebuilding test-artifacts/kernel/6.1.19
> 
> Ok.
> 
> But on a tangent, shouldn't that move forwards somewhat?

There is already "[PATCH 08/12] automation: update kernel for x86 tests"
in the stubdom test series. And as noted in the cover letter there, most
patches can be applied independently, and also they got R-by/A-by from
Stefano already.

> > I'm actually not sure if there is a sense in testing HVM domU on both
> > runners, when PVH domU variant is already tested on both. Are there any
> > differences between Intel and AMD relevant for QEMU in dom0?
> 
> It's not just Qemu, it's also HVMLoader, and the particulars of VT-x/SVM
> VMExit decode information in order to generate ioreqs.

For just HVM, we have PCI passthrough tests on both - they run HVM (but
on PV dom0). My question was more about PVH-dom0 specific parts.

> I'd firmly suggest having both.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

[PATCH v1] automation: add a test for HVM domU on PVH dom0

2024-06-10 Thread Marek Marczykowski-Górecki

This tests if QEMU works in PVH dom0. QEMU in dom0 requires enabling TUN
in the kernel, so do that too.

Add it to both x86 runners, similar to the PVH domU test.

Signed-off-by: Marek Marczykowski-Górecki 
---
Requires rebuilding test-artifacts/kernel/6.1.19

I'm actually not sure if there is a sense in testing HVM domU on both
runners, when PVH domU variant is already tested on both. Are there any
differences between Intel and AMD relevant for QEMU in dom0?
---
 automation/gitlab-ci/test.yaml| 16 
 automation/scripts/qubes-x86-64.sh| 19 ---
 .../tests-artifacts/kernel/6.1.19.dockerfile  |  1 +
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 902139e14893..898d2adc8c5b 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -175,6 +175,14 @@ adl-smoke-x86-64-dom0pvh-gcc-debug:
 - *x86-64-test-needs
 - alpine-3.18-gcc-debug
 
+adl-smoke-x86-64-dom0pvh-hvm-gcc-debug:
+  extends: .adl-x86-64
+  script:
+- ./automation/scripts/qubes-x86-64.sh dom0pvh-hvm 2>&1 | tee ${LOGFILE}
+  needs:
+- *x86-64-test-needs
+- alpine-3.18-gcc-debug
+
 adl-suspend-x86-64-gcc-debug:
   extends: .adl-x86-64
   script:
@@ -215,6 +223,14 @@ zen3p-smoke-x86-64-dom0pvh-gcc-debug:
 - *x86-64-test-needs
 - alpine-3.18-gcc-debug
 
+zen3p-smoke-x86-64-dom0pvh-hvm-gcc-debug:
+  extends: .zen3p-x86-64
+  script:
+- ./automation/scripts/qubes-x86-64.sh dom0pvh-hvm 2>&1 | tee ${LOGFILE}
+  needs:
+- *x86-64-test-needs
+- alpine-3.18-gcc-debug
+
 zen3p-pci-hvm-x86-64-gcc-debug:
   extends: .zen3p-x86-64
   script:
diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index 087ab2dc171c..816c5dd9aa77 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -19,8 +19,8 @@ vif = [ "bridge=xenbr0", ]
 disk = [ ]
 '
 
-### test: smoke test & smoke test PVH
-if [ -z "${test_variant}" ] || [ "${test_variant}" = "dom0pvh" ]; then
+### test: smoke test & smoke test PVH & smoke test HVM
+if [ -z "${test_variant}" ] || [ "${test_variant}" = "dom0pvh" ] || [ 
"${test_variant}" = "dom0pvh-hvm" ]; then
 passed="ping test passed"
 domU_check="
 ifconfig eth0 192.168.0.2
@@ -37,10 +37,23 @@ done
 set -x
 echo \"${passed}\"
 "
-if [ "${test_variant}" = "dom0pvh" ]; then
+if [ "${test_variant}" = "dom0pvh" ] || [ "${test_variant}" = "dom0pvh-hvm" ]; 
then
 extra_xen_opts="dom0=pvh"
 fi
 
+if [ "${test_variant}" = "dom0pvh-hvm" ]; then
+domU_config='
+type = "hvm"
+name = "domU"
+kernel = "/boot/vmlinuz"
+ramdisk = "/boot/initrd-domU"
+extra = "root=/dev/ram0 console=hvc0"
+memory = 512
+vif = [ "bridge=xenbr0", ]
+disk = [ ]
+'
+fi
+
 ### test: S3
 elif [ "${test_variant}" = "s3" ]; then
 passed="suspend test passed"
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile 
b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..021bde26c790 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,7 @@ RUN curl -fsSLO 
https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
 make xen.config && \
 scripts/config --enable BRIDGE && \
 scripts/config --enable IGC && \
+scripts/config --enable TUN && \
 cp .config .config.orig && \
 cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
 make -j$(nproc) bzImage && \
-- 
2.44.0

Re: Segment truncation in multi-segment PCI handling?

2024-06-10 Thread Marek Marczykowski-Górecki

On Mon, Jun 10, 2024 at 12:11:58PM +0200, Jan Beulich wrote:
> On 10.06.2024 11:46, Roger Pau Monné wrote:
> > On Mon, Jun 10, 2024 at 10:41:19AM +0200, Jan Beulich wrote:
> >> On 10.06.2024 10:28, Roger Pau Monné wrote:
> >>> On Mon, Jun 10, 2024 at 09:58:11AM +0200, Jan Beulich wrote:
> >>>> On 07.06.2024 21:52, Andrew Cooper wrote:
> >>>>> On 07/06/2024 8:46 pm, Marek Marczykowski-Górecki wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've got a new system, and it has two PCI segments:
> >>>>>>
> >>>>>> :00:00.0 Host bridge: Intel Corporation Device 7d14 (rev 04)
> >>>>>> :00:02.0 VGA compatible controller: Intel Corporation Meteor 
> >>>>>> Lake-P [Intel Graphics] (rev 08)
> >>>>>> ...
> >>>>>> 1:e0:06.0 System peripheral: Intel Corporation RST VMD Managed 
> >>>>>> Controller
> >>>>>> 1:e0:06.2 PCI bridge: Intel Corporation Device 7ecb (rev 10)
> >>>>>> 1:e1:00.0 Non-Volatile memory controller: Phison Electronics 
> >>>>>> Corporation PS5021-E21 PCIe4 NVMe Controller (DRAM-less) (rev 01)
> >>>>>>
> >>>>>> But looks like Xen doesn't handle it correctly:
> >>>
> >>> In the meantime you can probably disable VMD from the firmware and the
> >>> NVMe devices should appear on the regular PCI bus.
> >>>
> >>>>>> (XEN) :e0:06.0: unknown type 0
> >>>>>> (XEN) :e0:06.2: unknown type 0
> >>>>>> (XEN) :e1:00.0: unknown type 0
> >>>>>> ...
> >>>>>> (XEN)  PCI devices 
> >>>>>> (XEN)  segment  
> >>>>>> (XEN) :e1:00.0 - NULL - node -1 
> >>>>>> (XEN) :e0:06.2 - NULL - node -1 
> >>>>>> (XEN) :e0:06.0 - NULL - node -1 
> >>>>>> (XEN) :2b:00.0 - d0 - node -1  - MSIs < 161 >
> >>>>>> (XEN) :00:1f.6 - d0 - node -1  - MSIs < 148 >
> >>>>>> ...
> >>>>>>
> >>>>>> This isn't exactly surprising, since pci_sbdf_t.seg is uint16_t, so
> >>>>>> 0x1 doesn't fit. OSDev wiki says PCI Express can have 65536 PCI
> >>>>>> Segment Groups, each with 256 bus segments.
> >>>>>>
> >>>>>> Fortunately, I don't need this to work, if I disable VMD in the
> >>>>>> firmware, I get a single segment and everything works fine.
> >>>>>>
> >>>>>
> >>>>> This is a known issue.  Works is being done, albeit slowly.
> >>>>
> >>>> Is work being done? After the design session in Prague I put it on my
> >>>> todo list, but at low priority. I'd be happy to take it off there if I
> >>>> knew someone else is looking into this.
> >>>
> >>> We had a design session about VMD?  If so I'm afraid I've missed it.
> >>
> >> In Prague last year, not just now in Lisbon.
> >>
> >>>>> 0x1 is indeed not a spec-compliant PCI segment.  It's something
> >>>>> model specific the Linux VMD driver is doing.
> >>>>
> >>>> I wouldn't call this "model specific" - this numbering is purely a
> >>>> software one (and would need coordinating between Dom0 and Xen).
> >>>
> >>> Hm, TBH I'm not sure whether Xen needs to be aware of VMD devices.
> >>> The resources used by the VMD devices are all assigned to the VMD
> >>> root.  My current hypothesis is that it might be possible to manage
> >>> such devices without Xen being aware of their existence.
> >>
> >> Well, it may be possible to have things work in Dom0 without Xen
> >> knowing much. Then Dom0 would need to suppress any physdevop calls
> >> with such software-only segment numbers (in order to at least not
> >> confuse Xen). I'd be curious though how e.g. MSI setup would work in
> >> such a scenario.
> > 
> > IIRC from my read of the spec,
> 
> So you have found a spec somewhere? I didn't so far, and I had even asked
> Intel ...
> 
> > VMD devices don't use regular MSI
> > data/address fields, and

Segment truncation in multi-segment PCI handling?

2024-06-07 Thread Marek Marczykowski-Górecki

Hi,

I've got a new system, and it has two PCI segments:

:00:00.0 Host bridge: Intel Corporation Device 7d14 (rev 04)
:00:02.0 VGA compatible controller: Intel Corporation Meteor Lake-P 
[Intel Graphics] (rev 08)
...
1:e0:06.0 System peripheral: Intel Corporation RST VMD Managed 
Controller
1:e0:06.2 PCI bridge: Intel Corporation Device 7ecb (rev 10)
1:e1:00.0 Non-Volatile memory controller: Phison Electronics 
Corporation PS5021-E21 PCIe4 NVMe Controller (DRAM-less) (rev 01)

But looks like Xen doesn't handle it correctly:

(XEN) :e0:06.0: unknown type 0
(XEN) :e0:06.2: unknown type 0
(XEN) :e1:00.0: unknown type 0
...
(XEN)  PCI devices 
(XEN)  segment  
(XEN) :e1:00.0 - NULL - node -1 
(XEN) :e0:06.2 - NULL - node -1 
(XEN) :e0:06.0 - NULL - node -1 
(XEN) :2b:00.0 - d0 - node -1  - MSIs < 161 >
(XEN) :00:1f.6 - d0 - node -1  - MSIs < 148 >
...

This isn't exactly surprising, since pci_sbdf_t.seg is uint16_t, so
0x1 doesn't fit. OSDev wiki says PCI Express can have 65536 PCI
Segment Groups, each with 256 bus segments.

Fortunately, I don't need this to work, if I disable VMD in the
firmware, I get a single segment and everything works fine.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: NULL pointer dereference in xenbus_thread->...

2024-05-31 Thread Marek Marczykowski-Górecki

On Tue, Mar 26, 2024 at 11:00:50AM +, Julien Grall wrote:
> Hi Marek,
> 
> +Juergen for visibility
> 
> When sending a bug report, I would suggest to CC relevant people as
> otherwise it can get lost (not may people monitors Xen devel if they are not
> CCed).
> 
> Cheers,
> 
> On 25/03/2024 16:17, Marek Marczykowski-Górecki wrote:
> > On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki 
> > > wrote:
> > > > Hi,
> > > > 
> > > > I've noticed in Qubes's CI failure like this:
> > > > 
> > > > [  871.271292] BUG: kernel NULL pointer dereference, address: 
> > > > 
> > > > [  871.275290] #PF: supervisor read access in kernel mode
> > > > [  871.277282] #PF: error_code(0x) - not-present page
> > > > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0
> > > > [  871.281071] Oops:  [#1] PREEMPT SMP NOPTI
> > > > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 
> > > > 6.1.43-1.qubes.fc37.x86_64 #1
> > > > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > > BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > > > [  871.23] RIP: e030:__wake_up_common+0x4c/0x180
> > > > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 
> > > > a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 
> > > > 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > [  871.299776] RSP: e02b:c900400f7e10 EFLAGS: 00010082
> > > > [  871.301656] RAX:  RBX: 88810541ce98 RCX: 
> > > > 
> > > > [  871.304255] RDX: 0001 RSI: 0003 RDI: 
> > > > 88810541ce90
> > > > [  871.306714] RBP: c900400f0280 R08: ffe8 R09: 
> > > > c900400f7e68
> > > > [  871.309937] R10: 7ff0 R11: 888100ad3000 R12: 
> > > > c900400f7e68
> > > > [  871.312326] R13:  R14:  R15: 
> > > > 
> > > > [  871.314647] FS:  () GS:88813ff0() 
> > > > knlGS:
> > > > [  871.317677] CS:  1e030 DS:  ES:  CR0: 80050033
> > > > [  871.319644] CR2:  CR3: 0001067fe000 CR4: 
> > > > 00040660
> > > > [  871.321973] Call Trace:
> > > > [  871.322782]  
> > > > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > > > [  871.329147]  ? __die_body.cold+0x8/0xd
> > > > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > > > [  871.331691]  ? exc_page_fault+0x70/0x170
> > > > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > > > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > > > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > > > [  871.337183]  ? process_writes+0x240/0x240
> > > > [  871.338461]  process_msg+0x18e/0x2f0
> > > > [  871.339627]  xenbus_thread+0x165/0x1c0
> > > > [  871.340830]  ? cpuusage_read+0x10/0x10
> > > > [  871.342032]  kthread+0xe9/0x110
> > > > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > > > [  871.345020]  ret_from_fork+0x22/0x30
> > > > [  871.346239]  
> > > > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio 
> > > > snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec 
> > > > snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm 
> > > > intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd 
> > > > soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt 
> > > > crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni 
> > > > polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw 
> > > > virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd 
> > > > qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi 
> > > > xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn 
> > > > scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinpu

Re: [PATCH 1/3] CI: Remove CI_COMMIT_REF_PROTECTED requirement for HW jobs

2024-05-31 Thread Marek Marczykowski-Górecki

On Thu, May 30, 2024 at 05:43:12PM -0700, Stefano Stabellini wrote:
> On Thu, 30 May 2024, Marek Marczykowski-Górecki wrote:
> > On Wed, May 29, 2024 at 03:19:43PM +0100, Andrew Cooper wrote:
> > > This restriction doesn't provide any security because anyone with suitable
> > > permissions on the HW runners can bypass it with this local patch.
> > > 
> > > Requiring branches to be protected hampers usability of transient testing
> > > branches (specifically, can't delete branches except via the Gitlab UI).
> > >
> > > Drop the requirement.
> > > 
> > > Fixes: 746774cd1786 ("automation: introduce a dom0less test run on Xilinx 
> > > hardware")
> > > Fixes: 0ab316e7e15f ("automation: add a smoke and suspend test on an 
> > > Alder Lake system")
> > > Signed-off-by: Andrew Cooper 
> > 
> > Runners used to be set to run only on protected branches. I think it
> > isn't the case anymore from what I see, but it needs checking (I don't
> > see specific settings in all the projects). If it were still the case,
> > removing variable check would result in jobs forever pending.
> 
> Andrew, thank you so much for pointing this out.
> 
> I think the idea was that we can specify the individual users with
> access to protected branches. We cannot add restrictions for unprotected
> branches. So if we set the gitlab runner to only run protected jobs,
> then the $CI_COMMIT_REF_PROTECTED check makes sense. Not for security,
> but to prevent the jobs from getting stuck waiting for a runner that
> will never arrive.
> 
> However, like Marek said, now the gitlab runners don't have the
> "Protected" check set, so it is all useless :-(
> 
> I would prefer to set "Protected" in the gitlab runners settings so that
> it becomes easier to specify users that can and cannot trigger the jobs.

Owners of subprojects can control branch protection rules, so this
feature doesn't help with limiting access to runners added to the whole
group. Qubes runners are not group runners, they are project runners
added only to select projects.

I don't remember why exactly runners got "protected" disabled, but AFAIR
there was some issue with that setting.

> Then, we'll need the $CI_COMMIT_REF_PROTECTED check, not for security,
> but to avoid pipelines getting stuck for unprotected branches.
> 
> It is really difficult to restrict users from triggering jobs in other
> way because they are all automatically added to all subprojects.
> 
> 
> Would you guys be OK if I set "Protected" in the Xilinx and Qubes gitlab
> runners as soon as possible?


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH 1/3] CI: Remove CI_COMMIT_REF_PROTECTED requirement for HW jobs

2024-05-29 Thread Marek Marczykowski-Górecki

On Wed, May 29, 2024 at 03:19:43PM +0100, Andrew Cooper wrote:
> This restriction doesn't provide any security because anyone with suitable
> permissions on the HW runners can bypass it with this local patch.
> 
> Requiring branches to be protected hampers usability of transient testing
> branches (specifically, can't delete branches except via the Gitlab UI).
>
> Drop the requirement.
> 
> Fixes: 746774cd1786 ("automation: introduce a dom0less test run on Xilinx 
> hardware")
> Fixes: 0ab316e7e15f ("automation: add a smoke and suspend test on an Alder 
> Lake system")
> Signed-off-by: Andrew Cooper 

Runners used to be set to run only on protected branches. I think it
isn't the case anymore from what I see, but it needs checking (I don't
see specific settings in all the projects). If it were still the case,
removing variable check would result in jobs forever pending.

Other than that, I'm okay with this change, since the hw runners are
added only to select projects. You can interpret this as Acked-by, if
you verify if indeed runners are not limited to protected branches only.

I will need to adjust setting of my project, to set "QUBES_JOBS" only
to some branches - I used to use branch protection rules as a proxy to
selecting on which branch to run hw tests...

> ---
> CC: Roger Pau Monné 
> CC: Stefano Stabellini 
> CC: Michal Orzel 
> CC: Marek Marczykowski-Górecki 
> CC: Oleksii Kurochko 
> 
> Fixes because this wants backporting, but it also needs acks from both Marek
> and Stefano as the owners of the hardware in question.
> ---
>  automation/gitlab-ci/test.yaml | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
> index ad249fa0a5d9..efd3ad46f08e 100644
> --- a/automation/gitlab-ci/test.yaml
> +++ b/automation/gitlab-ci/test.yaml
> @@ -92,7 +92,7 @@
>  when: always
>only:
>  variables:
> -  - $XILINX_JOBS == "true" && $CI_COMMIT_REF_PROTECTED == "true"
> +  - $XILINX_JOBS == "true"
>tags:
>  - xilinx
>  
> @@ -112,7 +112,7 @@
>  when: always
>    only:
>  variables:
> -  - $QUBES_JOBS == "true" && $CI_COMMIT_REF_PROTECTED == "true"
> +  - $QUBES_JOBS == "true"
>tags:
>  - qubes-hw2
>  
> -- 
> 2.30.2
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

Re: [PATCH v4 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-05-23 Thread Marek Marczykowski-Górecki

On Wed, May 22, 2024 at 05:39:02PM +0200, Marek Marczykowski-Górecki wrote:
> On older systems, XHCI xcap had a layout that no other (interesting) registers
> were placed on the same page as the debug capability, so Linux was fine with
> making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
> needs to write to some other registers on the same page too.
> 
> Add a generic API for making just parts of an MMIO page R/O and use it to fix
> USB3 console with share=yes or share=hwdom options. More details in commit
> messages.
> 
> Marek Marczykowski-Górecki (2):
>   x86/mm: add API for marking only part of a MMIO page read only
>   drivers/char: Use sub-page ro API to make just xhci dbc cap RO

Does any other x86 maintainer feel comfortable ack-ing this series? Jan
already reviewed 2/2 here (but not 1/2 in this version), but also said
he is not comfortable with letting this in without a second maintainer
approval: 
https://lore.kernel.org/xen-devel/7655e401-b927-4250-ae63-05361a5ee...@suse.com/

> 
>  xen/arch/x86/hvm/emulate.c  |   2 +-
>  xen/arch/x86/hvm/hvm.c  |   4 +-
>  xen/arch/x86/include/asm/mm.h   |  25 +++-
>  xen/arch/x86/mm.c   | 273 +-
>  xen/arch/x86/pv/ro-page-fault.c |   6 +-
>  xen/drivers/char/xhci-dbc.c |  36 ++--
>  6 files changed, 327 insertions(+), 19 deletions(-)
> 
> base-commit: b0082b908391b29b7c4dd5e6c389ebd6481926f8
> -- 
> git-series 0.9.1

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature

[PATCH v4 2/2] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-05-22 Thread Marek Marczykowski-Górecki

Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page do contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v4:
- restore mmio_ro_ranges in the fallback case
- set XHCI_SHARE_NONE in the fallback case
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..c45e4b6825cc 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,28 @@ static void __init cf_check 
dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+uart->dbc.share = XHCI_SHARE_NONE;
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+if ( rangeset_add_range(mmio_ro_ranges,
+ PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset),
+ PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+uart->dbc.xhc_dbc_offset +
+sizeof(*uart->dbc.dbc_reg)) - 1) )
+printk(XENLOG_INFO
+   "Error while adding MMIO range of device to 
mmio_ro_ranges\n");
+}
 #endif
 }
 
-- 
git-series 0.9.1

[PATCH v4 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-05-22 Thread Marek Marczykowski-Górecki

On older systems, XHCI xcap had a layout that no other (interesting) registers
were placed on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Marek Marczykowski-Górecki (2):
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   4 +-
 xen/arch/x86/include/asm/mm.h   |  25 +++-
 xen/arch/x86/mm.c   | 273 +-
 xen/arch/x86/pv/ro-page-fault.c |   6 +-
 xen/drivers/char/xhci-dbc.c |  36 ++--
 6 files changed, 327 insertions(+), 19 deletions(-)

base-commit: b0082b908391b29b7c4dd5e6c389ebd6481926f8
-- 
git-series 0.9.1

[PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-22 Thread Marek Marczykowski-Górecki

In some cases, only few registers on a page needs to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. Current API allows only marking whole pages pages read-only,
which sometimes may cover other registers that guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask which
qwords are supposed to be read-only and an address where page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of more efficient structure, because there
isn't supposed to be many pages needing this precise r/o control.

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash() for
non hardware domains.  To reach mmio_ro_emulated_write(), change how
write violations for p2m_mmio_direct are handled - specifically, check
if they relate to such partially protected page via
subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
them too. This decodes what guest is trying write and finally calls
mmio_ro_emulated_write(). The EPT write violation is detected as
npfec.write_access and npfec.present both being true (similar to other
places), which may cover some other (future?) cases - if that happens,
emulator might get involved unnecessarily, but since it's limited to
pages marked with subpage_mmio_ro_add() only, the impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests (the change in hvm_hap_nested_page_fault()), but only for pages
explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
passed through a device partially used by Xen.
As of the next patch, it applies only configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v4:
- rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
- guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
  there
- rename ro_qwords to ro_elems
- use unsigned arguments for subpage_mmio_ro_remove_page()
- use volatile for __iomem
- do not set mmio_ro_ctxt.mfn for mmcfg case
- comment where fields of mmio_ro_ctxt are used
- use bool for result of __test_and_set_bit
- do not open-code mfn_to_maddr()
- remove leftover RCU
- mention hvm_hap_nested_page_fault() explicitly in the commit message
Changes in v3:
- use unsigned int for loop iterators
- use __set_bit/__clear_bit when under spinlock
- avoid ioremap() under spinlock
- do not cast away const
- handle unaligned parameters in release build
- comment fixes
- remove RCU - the add functions are __init and

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1047 matches

Mail list logo