Re: [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register
On 13/12/2022 11:50, Jan Beulich wrote:
> On 13.12.2022 12:34, David Vrabel wrote:
>> On 12/12/2022 17:04, Jan Beulich wrote:
>>> On 10.11.2022 17:59, David Vrabel wrote:
>>>> --- a/xen/arch/x86/include/asm/msi.h
>>>> +++ b/xen/arch/x86/include/asm/msi.h
>>>> @@ -237,7 +237,10 @@ struct arch_msix {
>>>>      int table_refcnt[MAX_MSIX_TABLE_PAGES];
>>>>      int table_idx[MAX_MSIX_TABLE_PAGES];
>>>>      spinlock_t table_lock;
>>>> +    spinlock_t control_lock;
>>>>      bool host_maskall, guest_maskall;
>>>> +    uint16_t host_enable;
>>>
>>> If you want to keep this more narrow than "unsigned int", then please
>>> add a BUILD_BUG_ON() against NR_CPUS, so the need to update the field
>>> can be easily noticed (in some perhaps distant future).
>>
>> This is only incremented:
>>
>> - while holding the pci_devs lock, or
>> - while holding a lock for one of the associated IRQs.
>
> How do the locks held matter here, especially given that - as iirc you
> say in the description - neither lock is held uniformly?

Because the usage here is:

    lock()
    host_enable++
    ...
    host_enable--
    unlock()

>> Since there are at most 4096 MSI-X vectors (and thus at most 4096
>> IRQs), the highest value this can be (even with >> 4096 PCPUs) is
>> 4097, thus a uint16_t is fine.
>
> Where's the 4096 coming from as a limit for MSI-X vectors? DYM 2048,
> which is the per-device limit (because the qsize field is 11 bits
> wide)? If so, yes, I think that's indeed restricting how large the
> number can get.

Yes, I did mean 2048 here.

> I could only think of less neat ones like host_enable_override or
> forced_enable or some such. If I had any good name in mind, I would
> certainly have said so.

host_enable_override works for me.

David
Re: [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register
On 12/12/2022 17:04, Jan Beulich wrote:
> On 10.11.2022 17:59, David Vrabel wrote:
>> --- a/xen/arch/x86/include/asm/msi.h
>> +++ b/xen/arch/x86/include/asm/msi.h
>> @@ -237,7 +237,10 @@ struct arch_msix {
>>      int table_refcnt[MAX_MSIX_TABLE_PAGES];
>>      int table_idx[MAX_MSIX_TABLE_PAGES];
>>      spinlock_t table_lock;
>> +    spinlock_t control_lock;
>>      bool host_maskall, guest_maskall;
>> +    uint16_t host_enable;
>
> If you want to keep this more narrow than "unsigned int", then please
> add a BUILD_BUG_ON() against NR_CPUS, so the need to update the field
> can be easily noticed (in some perhaps distant future).

This is only incremented:

- while holding the pci_devs lock, or
- while holding a lock for one of the associated IRQs.

Since there are at most 4096 MSI-X vectors (and thus at most 4096
IRQs), the highest value this can be (even with >> 4096 PCPUs) is 4097,
thus a uint16_t is fine.

>> +static void msix_update_unlock(struct pci_dev *dev, unsigned int pos,
>> +                               uint16_t control)
>> +{
>> +    uint16_t new_control;
>> +    unsigned long flags;
>> +
>> +    spin_lock_irqsave(&dev->msix->control_lock, flags);
>> +
>> +    dev->msix->host_enable--;
>> +
>> +    new_control = control & ~(PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
>> +
>> +    if ( dev->msix->host_enable || dev->msix->guest_enable )
>> +        new_control |= PCI_MSIX_FLAGS_ENABLE;
>> +    if ( dev->msix->host_maskall || dev->msix->guest_maskall ||
>> +         dev->msix->host_enable )
>> +        new_control |= PCI_MSIX_FLAGS_MASKALL;
>
> In particular this use of "host_enable" suggests the field wants to be
> named differently: It makes no sense to derive MASKALL from ENABLE
> without it being clear (from the name) that the field is an init-time
> override only.

I think the name as-is makes sense. It is analogous to host_maskall and
complements guest_enable. I can't think of a better name, so what name
do you suggest?

David
Re: [PATCH] xen-pciback: Consider MSI-X enabled only when MASKALL bit is cleared
On 17/11/2022 11:41, Marek Marczykowski-Górecki wrote:
> Linux enables MSI-X before disabling INTx, but keeps MSI-X masked until
> the table is filled. Then it disables INTx just before clearing the
> MASKALL bit. Currently this approach is rejected by xen-pciback. Allow
> setting PCI_MSIX_FLAGS_ENABLE while INTx is still enabled, as long as
> PCI_MSIX_FLAGS_MASKALL is set too.

The use of MSI-X interrupts is conditional on only the MSI-X Enable
bit. Setting MSI-X Enable effectively overrides the Interrupt Disable
bit in the Command register.

PCIe 6.0.1 section 7.7.2.2: "MSI-X Enable ... is prohibited from using
INTx interrupts (if implemented)." And there is similar wording for MSI
Enable.

I think you need to shuffle the checks for MSI/MSI-X in
xen_pcibk_get_interrupt_type() so they are _before_ the check of the
Interrupt Disable bit in the Command register.

David
Re: [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit()
On 11/11/2022 09:44, Jan Beulich wrote:
> The idea of the WARN() / BUG_ON() is to
> - not leave failed unmasking unrecorded,
> - not continue after failure to mask an entry.

Then let's make msi_set_mask_bit() unable to fail, with something like
this (untested) patch. Would this be acceptable?

David

From 837649a70d44455f4fd98e2eaa46dcf35a56d00a Mon Sep 17 00:00:00 2001
From: David Vrabel
Date: Fri, 11 Nov 2022 14:30:16 +
Subject: [PATCH] x86: Always enable memory space decodes when using MSI-X

Instead of the numerous (racy) checks for memory space accesses being
enabled before writing to the MSI-X table, force Memory Space Enable to
be set in the Command register if MSI-X is used.

This allows the memory_decoded() function and the associated error
paths to be removed (since it will always return true). In particular,
msi_set_mask_bit() can no longer fail and its return value is removed.

Note that if the PCI device is a virtual function, the relevant Command
register is in the corresponding physical function.
Signed-off-by: David Vrabel
---
 xen/arch/x86/include/asm/pci.h |   3 +
 xen/arch/x86/msi.c             | 116 +
 xen/arch/x86/pci.c             |  39 ++-
 3 files changed, 71 insertions(+), 87 deletions(-)

diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
index f4a58c8acf..4f59b70959 100644
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -32,8 +32,11 @@ struct arch_pci_dev {
     domid_t pseudo_domid;
     mfn_t leaf_mfn;
     struct page_list_head pgtables_list;
+    uint16_t host_command;
+    uint16_t guest_command;
 };
 
+void pci_command_override(struct pci_dev *pdev, uint16_t val);
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
                              unsigned int reg, unsigned int size,
                              uint32_t *data);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..2f8667aa7b 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -124,27 +124,11 @@ static void msix_put_fixmap(struct arch_msix *msix, int idx)
     spin_unlock(&msix->table_lock);
 }
 
-static bool memory_decoded(const struct pci_dev *dev)
-{
-    pci_sbdf_t sbdf = dev->sbdf;
-
-    if ( dev->info.is_virtfn )
-    {
-        sbdf.bus = dev->info.physfn.bus;
-        sbdf.devfn = dev->info.physfn.devfn;
-    }
-
-    return pci_conf_read16(sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY;
-}
-
 static bool msix_memory_decoded(const struct pci_dev *dev, unsigned int pos)
 {
     uint16_t control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
 
-    if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
-        return false;
-
-    return memory_decoded(dev);
+    return control & PCI_MSIX_FLAGS_ENABLE;
 }
 
 /*
@@ -314,7 +298,7 @@ int msi_maskable_irq(const struct msi_desc *entry)
            || entry->msi_attrib.maskbit;
 }
 
-static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
+static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 {
     struct msi_desc *entry = desc->msi_desc;
     struct pci_dev *pdev;
@@ -354,45 +338,26 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
                              control | (PCI_MSIX_FLAGS_ENABLE |
                                         PCI_MSIX_FLAGS_MASKALL));
         }
-        if ( likely(memory_decoded(pdev)) )
-        {
-            writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
-            readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
-            if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
-                break;
-
-            entry->msi_attrib.host_masked = host;
-            entry->msi_attrib.guest_masked = guest;
+        writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+        readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
-            flag = true;
-        }
-        else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
+        if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
         {
-            domid_t domid = pdev->domain->domain_id;
-
-            maskall = true;
-            if ( pdev->msix->warned != domid )
-            {
-                pdev->msix->warned = domid;
-                printk(XENLOG_G_WARNING
-                       "cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n",
-                       desc->irq, domid, &pdev->sbdf);
-            }
+            pdev->msix->host_maskall = maskall;
+            if ( maskall || pdev->msix->guest_maskall )
+                control |= PCI_MSIX_FLAGS_MASKALL;
+            pci_conf_write16(pdev->sbdf,
+                             msix_control_reg(entry->msi_attrib.pos), control);
         }
-        pdev->msix->host_maskall = maskall;
-
[PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register
Concurrent accesses to the MSI-X Control register are not serialized
with a suitable lock. For example, msix_capability_init() accesses it
under pcidevs_lock(), but some calls to msi_set_mask_bit() use the
interrupt descriptor lock. This can lead to MSI-X being incorrectly
disabled and subsequent failures due to msix_memory_decoded() calls
that check for MSI-X being enabled.

This was seen with some non-compliant hardware that gated MSI-X
messages on the per-vector mask bit only (i.e., the MSI-X Enable bit
and Function Mask bits in the MSI-X Control register were ignored). An
interrupt (with a pending move) for vector 0 would occur while vector 1
was being initialized in msix_capability_init(). Updates to the Control
register would race and the vector 1 initialization would
intermittently fail with -ENXIO.

Typically a race between initializing a vector and another vector being
moved doesn't occur because:

1. Racy Control accesses only occur when MSI-X is (guest) disabled.
2. Hardware should only raise interrupts when MSI-X is enabled and
   unmasked.
3. Xen always sets Function Mask when temporarily enabling MSI-X.

But there may be other race conditions depending on hardware and guest
driver behaviour (e.g., disabling MSI-X while an IRQ has a pending move
on another PCPU).

Fix this by:

1. Tracking the host and guest enable state in a similar way to the
   host and guest maskall state. Note that since multiple CPUs can be
   updating different vectors concurrently, a counter is needed for the
   host enable state.
2. Adding a new lock to serialize the Control read/modify/write
   sequence.
3. Wrapping the above in two helper functions (msix_update_lock() and
   msix_update_unlock()), which bracket any MSI-X register updates that
   require MSI-X to be (temporarily) enabled.
Signed-off-by: David Vrabel

SIM: https://t.corp.amazon.com/P63914633
CR: https://code.amazon.com/reviews/CR-79020945
---
 xen/arch/x86/include/asm/msi.h |   3 +
 xen/arch/x86/msi.c             | 215 +
 xen/drivers/passthrough/msi.c  |   1 +
 3 files changed, 114 insertions(+), 105 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index fe670895ee..aa36e44f4e 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -237,7 +237,10 @@ struct arch_msix {
     int table_refcnt[MAX_MSIX_TABLE_PAGES];
     int table_idx[MAX_MSIX_TABLE_PAGES];
     spinlock_t table_lock;
+    spinlock_t control_lock;
     bool host_maskall, guest_maskall;
+    uint16_t host_enable;
+    bool guest_enable;
     domid_t warned;
 };

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 6c675d11d1..8e394da07a 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -147,6 +147,57 @@ static bool msix_memory_decoded(const struct pci_dev *dev, unsigned int pos)
     return memory_decoded(dev);
 }
 
+
+/*
+ * Ensure MSI-X interrupts are masked during setup. Some devices require
+ * MSI-X to be enabled before we can touch the MSI-X registers. We need
+ * to mask all the vectors to prevent interrupts coming in before they're
+ * fully set up.
+ */
+static uint16_t msix_update_lock(struct pci_dev *dev, unsigned int pos)
+{
+    uint16_t control, new_control;
+    unsigned long flags;
+
+    spin_lock_irqsave(&dev->msix->control_lock, flags);
+
+    dev->msix->host_enable++;
+
+    control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
+    if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
+    {
+        new_control = control | PCI_MSIX_FLAGS_ENABLE |
+                      PCI_MSIX_FLAGS_MASKALL;
+        pci_conf_write16(dev->sbdf, msix_control_reg(pos), new_control);
+    }
+    else
+        dev->msix->guest_enable = true;
+
+    spin_unlock_irqrestore(&dev->msix->control_lock, flags);
+
+    return control;
+}
+
+static void msix_update_unlock(struct pci_dev *dev, unsigned int pos,
+                               uint16_t control)
+{
+    uint16_t new_control;
+    unsigned long flags;
+
+    spin_lock_irqsave(&dev->msix->control_lock, flags);
+
+    dev->msix->host_enable--;
+
+    new_control = control & ~(PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
+
+    if ( dev->msix->host_enable || dev->msix->guest_enable )
+        new_control |= PCI_MSIX_FLAGS_ENABLE;
+    if ( dev->msix->host_maskall || dev->msix->guest_maskall ||
+         dev->msix->host_enable )
+        new_control |= PCI_MSIX_FLAGS_MASKALL;
+    if ( new_control != control )
+        pci_conf_write16(dev->sbdf, msix_control_reg(pos), new_control);
+
+    spin_unlock_irqrestore(&dev->msix->control_lock, flags);
+}
+
 /*
  * MSI message composition
  */
@@ -288,7 +339,7 @@ static void msi_set_enable(struct pci_dev *dev, int enable)
         __msi_set_enable(seg, bus, slot, func, pos, enable);
 }
 
-static void msix_set_enable(struct pci_dev *dev, int enable)
+static void msix_force_disable(struct pci_dev *dev)
 {
     int pos;
     u16 control, seg = dev->seg;
@@ -299,11 +350,16 @@ stat
[PATCH 0/3] x86: Fix racy accesses to MSI-X Control register
The main patch in this series is 3/3, with some preparatory patches to
simplify the implementation. To summarize:

Concurrent accesses to the MSI-X Control register are not serialized
with a suitable lock. For example, msix_capability_init() accesses it
under pcidevs_lock(), but some calls to msi_set_mask_bit() use the
interrupt descriptor lock. This can lead to MSI-X being incorrectly
disabled and subsequent failures due to msix_memory_decoded() calls
that check for MSI-X being enabled.

David
[PATCH 1/3] x86/msi: consistently handle BAR mapping failures in MSI-X setup
When setting up an MSI-X vector in msix_capability_init() the error
handling after a BAR mapping failure differs depending on whether the
first page fails or a subsequent page. There's no reason to break
working vectors, so consistently use the latter error handling
behaviour.

The zap_on_error flag was added as part of XSA-337, beb54596cfda
("x86/MSI-X: restrict reading of table/PBA bases from BARs"), but
appears to be unrelated to XSA-337 and is not useful because:

1. table.first and pba.first are not used unless msix->used_vectors > 0.
2. Force disabling MSI-X in this error path is not necessary as the
   per-vector mask is still set.

Signed-off-by: David Vrabel

CR: https://code.amazon.com/reviews/CR-79020908
---
 xen/arch/x86/msi.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..8bde6b9be1 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -776,7 +776,7 @@ static int msix_capability_init(struct pci_dev *dev,
     u8 bus = dev->bus;
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
-    bool maskall = msix->host_maskall, zap_on_error = false;
+    bool maskall = msix->host_maskall;
     unsigned int pos = pci_find_cap_offset(seg, bus, slot, func,
                                            PCI_CAP_ID_MSIX);
@@ -875,8 +875,6 @@ static int msix_capability_init(struct pci_dev *dev,
                               BITS_TO_LONGS(msix->nr_entries) - 1);
         WARN_ON(rangeset_overlaps_range(mmio_ro_ranges, msix->pba.first,
                                         msix->pba.last));
-
-        zap_on_error = true;
     }
     else if ( !msix->table.first )
     {
@@ -897,14 +895,6 @@ static int msix_capability_init(struct pci_dev *dev,
 
     if ( idx < 0 )
     {
-        if ( zap_on_error )
-        {
-            msix->table.first = 0;
-            msix->pba.first = 0;
-
-            control &= ~PCI_MSIX_FLAGS_ENABLE;
-        }
-
         pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
         xfree(entry);
         return idx;
-- 
2.30.2
[PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit()
The return value was only used for WARN()s or BUG()s, so it has no
functional purpose. Simplify the code by removing it.

The meaning of the return value and the purpose of the various WARN()s
and BUG()s is rather unclear. The only failure path (where an MSI-X
vector needs to be masked but the MSI-X table is not accessible) has a
useful warning message.

Signed-off-by: David Vrabel

CR: https://code.amazon.com/reviews/CR-79020927
---
 xen/arch/x86/msi.c | 34 +-
 1 file changed, 9 insertions(+), 25 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 8bde6b9be1..6c675d11d1 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -314,7 +314,7 @@ int msi_maskable_irq(const struct msi_desc *entry)
            || entry->msi_attrib.maskbit;
 }
 
-static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
+static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 {
     struct msi_desc *entry = desc->msi_desc;
     struct pci_dev *pdev;
@@ -361,11 +361,6 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 
             if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
                 break;
-
-            entry->msi_attrib.host_masked = host;
-            entry->msi_attrib.guest_masked = guest;
-
-            flag = true;
         }
         else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
         {
@@ -385,14 +380,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
             control |= PCI_MSIX_FLAGS_MASKALL;
         pci_conf_write16(pdev->sbdf,
                          msix_control_reg(entry->msi_attrib.pos), control);
-        return flag;
-    default:
-        return 0;
+        break;
     }
     entry->msi_attrib.host_masked = host;
     entry->msi_attrib.guest_masked = guest;
-
-    return 1;
 }
 
 static int msi_get_mask_bit(const struct msi_desc *entry)
@@ -418,16 +409,12 @@ static int msi_get_mask_bit(const struct msi_desc *entry)
 
 void cf_check mask_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 1,
-                                    desc->msi_desc->msi_attrib.guest_masked)) )
-        BUG_ON(!(desc->status & IRQ_DISABLED));
+    msi_set_mask_bit(desc, 1, desc->msi_desc->msi_attrib.guest_masked);
}
 
 void cf_check unmask_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 0,
-                                    desc->msi_desc->msi_attrib.guest_masked)) )
-        WARN();
+    msi_set_mask_bit(desc, 0, desc->msi_desc->msi_attrib.guest_masked);
 }
 
 void guest_mask_msi_irq(struct irq_desc *desc, bool mask)
@@ -437,15 +424,13 @@ void guest_mask_msi_irq(struct irq_desc *desc, bool mask)
 
 static unsigned int cf_check startup_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 0, !!(desc->status & IRQ_GUEST))) )
-        WARN();
+    msi_set_mask_bit(desc, 0, !!(desc->status & IRQ_GUEST));
     return 0;
 }
 
 static void cf_check shutdown_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 1, 1)) )
-        BUG_ON(!(desc->status & IRQ_DISABLED));
+    msi_set_mask_bit(desc, 1, 1);
 }
 
 void cf_check ack_nonmaskable_msi_irq(struct irq_desc *desc)
@@ -1358,10 +1343,9 @@ int pci_restore_msi_state(struct pci_dev *pdev)
 
     for ( i = 0; ; )
     {
-        if ( unlikely(!msi_set_mask_bit(desc,
-                                        entry[i].msi_attrib.host_masked,
-                                        entry[i].msi_attrib.guest_masked)) )
-            BUG();
+        msi_set_mask_bit(desc,
                         entry[i].msi_attrib.host_masked,
                         entry[i].msi_attrib.guest_masked);
 
         if ( !--nr )
             break;
-- 
2.30.2
Re: Regression with CET: [PATCH v1] x86/mm: avoid inadvertently degrading a TLB flush to local only
On 27/04/2022 19:03, Andrew Cooper wrote:
> On 19/04/2022 16:03, David Vrabel wrote:
>> From: David Vrabel
>>
>> If the direct map is incorrectly modified with interrupts disabled,
>> the required TLB flushes are degraded to flushing the local CPU only.
>> This could lead to very hard to diagnose problems as different CPUs
>> will end up with different views of memory. Although, no such issues
>> have yet been identified.
>>
>> Change the check in the flush_area() macro to look at system_state
>> instead. This defers the switch from local to all later in the boot
>> (see xen/arch/x86/setup.c:__start_xen()). This is fine because
>> additional PCPUs are not brought up until after the system state is
>> SYS_STATE_smp_boot.
>>
>> Signed-off-by: David Vrabel
>
> This explodes on CET systems:
>
> (XEN) Assertion 'local_irq_is_enabled()' failed at arch/x86/smp.c:265
> (XEN) [ Xen-4.17.0-10.24-d  x86_64  debug=y  Not tainted ]
> (XEN) CPU:    0
> (XEN) RIP:    e008:[] flush_area_mask+0x40/0x13e
> (XEN) Xen call trace:
> (XEN)    [] R flush_area_mask+0x40/0x13e
> (XEN)    [] F modify_xen_mappings+0xc5/0x958
> (XEN)    [] F arch/x86/alternative.c#_alternative_instructions+0xb7/0xb9
> (XEN)    [] F alternative_branches+0xf/0x12
> (XEN)    [] F __start_xen+0x1ef4/0x2776
> (XEN)    [] F __high_start+0x94/0xa0
> (XEN)
> (XEN) Panic on CPU 0:
> (XEN) Assertion 'local_irq_is_enabled()' failed at arch/x86/smp.c:265
>
> We really did want a local-only flush here, because we specifically
> intended to make self-modifying changes before bringing secondary CPUs
> up.

I think the transition to the SYS_STATE_smp_boot system state should be
later, i.e., the last point where only 1 PCPU is guaranteed.

David
Re: [PATCH v4] page_alloc: assert IRQs are enabled in heap alloc/free
On 26/04/2022 15:14, Julien Grall wrote:
> Hi,
>
> On 26/04/2022 15:01, Jan Beulich wrote:
>> On 25.04.2022 15:28, David Vrabel wrote:
>>> --- a/xen/common/page_alloc.c
>>> +++ b/xen/common/page_alloc.c
>>> @@ -162,6 +162,13 @@ static char __initdata opt_badpage[100] = "";
>>>  string_param("badpage", opt_badpage);
>>>
>>> +/*
>>> + * Heap allocations may need TLB flushes which require IRQs to be
>>> + * enabled (except when only 1 PCPU is online).
>>> + */
>>> +#define ASSERT_ALLOC_CONTEXT() \
>>> +    ASSERT(!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1))
>>
>> At least one of these tightened assertions triggers on Arm, as per the
>> most recent smoke flight. I'm going to revert this for the time being.
>
> From the serial console [1]:
>
> (XEN) Xen call trace:
> (XEN)    [<0022a510>] alloc_xenheap_pages+0x120/0x150 (PC)
> (XEN)    [<>] (LR)
> (XEN)    [<002736ac>] arch/arm/mm.c#xen_pt_update+0x144/0x6e4
> (XEN)    [<002740d4>] map_pages_to_xen+0x10/0x20
> (XEN)    [<00236864>] __vmap+0x400/0x4a4
> (XEN)    [<0026aee8>] arch/arm/alternative.c#__apply_alternatives_multi_stop+0x144/0x1ec
> (XEN)    [<0022fe40>] stop_machine_run+0x23c/0x300

An allocation inside a stop_machine_run() action function. That is what
the asserts were designed to catch.

I did try to run the GitLab CI pipelines, but it is set up to use
runners that are only available to the Xen Project group, so forking
the repo doesn't work. Can my (personal) GitLab be added as a Developer
to the Xen Project group? I think this is the intended way for people
to run the CI pipelines on their own branches.

David
Re: [PATCH v4] page_alloc: assert IRQs are enabled in heap alloc/free
On 25/04/2022 14:43, Julien Grall wrote:
> Hi Jan,
>
> On 25/04/2022 14:37, Jan Beulich wrote:
>> On 25.04.2022 15:34, Julien Grall wrote:
>>> On 25/04/2022 14:28, David Vrabel wrote:
>>>> --- a/xen/common/page_alloc.c
>>>> +++ b/xen/common/page_alloc.c
>>>> @@ -162,6 +162,13 @@ static char __initdata opt_badpage[100] = "";
>>>>  string_param("badpage", opt_badpage);
>>>>
>>>> +/*
>>>> + * Heap allocations may need TLB flushes which require IRQs to be
>>>
>>> The comment needs to be updated to reflect the fact that at least
>>> Arm doesn't use IPI to flush TLBs.
>>
>> I thought the use of "may" was satisfying your earlier request?
>
> Maybe I read this comment wrongly... To me, anything after 'which' is
> optional (it is a non-defining clause) and describes how the TLB
> flushes work. So the 'may' here is referring to the possibility of a
> TLB flush.

Oh dear, you're using formal grammar with a native English speaker
who's never seen a grammar rule in any of his schooling.

I think this should be:

"Heap allocations may need TLB flushes that require IRQs..."

i.e., "that" instead of "which".

David
[PATCH v4] page_alloc: assert IRQs are enabled in heap alloc/free
From: David Vrabel

Heap pages can only be safely allocated and freed with interrupts
enabled, as they may require a TLB flush which will send IPIs (on x86).

Normally spinlock debugging would catch calls from the incorrect
context, but not from stop_machine_run() action functions, as these are
called with spin lock debugging disabled.

Enhance the assertions in alloc_xenheap_pages() and
alloc_domheap_pages() to check that interrupts are enabled. For
consistency the same asserts are used when freeing heap pages.

As an exception, when only 1 PCPU is online, allocations are permitted
with interrupts disabled as any TLB flushes would be local only. This
is necessary during early boot.

Signed-off-by: David Vrabel
---
Changes in v4:
- Tweak comment.
Changes in v3:
- Use num_online_cpus() in assert.
Changes in v2:
- Set SYS_STATE_smp_boot on arm.
---
 xen/common/page_alloc.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 319029140f..739ca6e74b 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,13 @@ static char __initdata opt_badpage[100] = "";
 string_param("badpage", opt_badpage);
 
+/*
+ * Heap allocations may need TLB flushes which require IRQs to be
+ * enabled (except when only 1 PCPU is online).
+ */
+#define ASSERT_ALLOC_CONTEXT() \
+    ASSERT(!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1))
+
 /*
  * no-bootscrub -> Free pages are not zeroed during boot.
*/ @@ -2160,7 +2167,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) { struct page_info *pg; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); pg = alloc_heap_pages(MEMZONE_XEN, MEMZONE_XEN, order, memflags | MEMF_no_scrub, NULL); @@ -2173,7 +2180,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) void free_xenheap_pages(void *v, unsigned int order) { -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( v == NULL ) return; @@ -2202,7 +2209,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) struct page_info *pg; unsigned int i; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits ) memflags &= ~MEMF_bits(~0U); @@ -2224,7 +2231,7 @@ void free_xenheap_pages(void *v, unsigned int order) struct page_info *pg; unsigned int i; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( v == NULL ) return; @@ -2249,7 +2256,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe) { mfn_t smfn, emfn; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); smfn = maddr_to_mfn(round_pgup(ps)); emfn = maddr_to_mfn(round_pgdown(pe)); @@ -2369,7 +2376,7 @@ struct page_info *alloc_domheap_pages( unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1; unsigned int dma_zone; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d, bits ? : (BITS_PER_LONG+PAGE_SHIFT)); @@ -2419,7 +2426,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order) unsigned int i; bool drop_dom_ref; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( unlikely(is_xen_heap_page(pg)) ) { @@ -2738,7 +2745,7 @@ int __init acquire_domstatic_pages(struct domain *d, mfn_t smfn, { struct page_info *pg; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); pg = acquire_staticmem_pages(smfn, nr_mfns, memflags); if ( !pg ) -- 2.30.2
[PATCH v3] page_alloc: assert IRQs are enabled in heap alloc/free
From: David Vrabel

Heap pages can only be safely allocated and freed with interrupts
enabled, as they may require a TLB flush which will send IPIs.

Normally spinlock debugging would catch calls from the incorrect
context, but not from stop_machine_run() action functions, as these are
called with spin lock debugging disabled.

Enhance the assertions in alloc_xenheap_pages() and
alloc_domheap_pages() to check that interrupts are enabled. For
consistency the same asserts are used when freeing heap pages.

As an exception, when only 1 PCPU is online, allocations are permitted
with interrupts disabled as any TLB flushes would be local only. This
is necessary during early boot.

Signed-off-by: David Vrabel
---
Changes in v3:
- Use num_online_cpus() in assert.
Changes in v2:
- Set SYS_STATE_smp_boot on arm.
---
 xen/common/page_alloc.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 319029140f..516ffa2a97 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,13 @@ static char __initdata opt_badpage[100] = "";
 string_param("badpage", opt_badpage);
 
+/*
+ * Heap allocations may need TLB flushes which require IRQs to be
+ * enabled (except during early boot when only 1 PCPU is online).
+ */
+#define ASSERT_ALLOC_CONTEXT() \
+    ASSERT(!in_irq() && (local_irq_is_enabled() || num_online_cpus() == 1))
+
 /*
  * no-bootscrub -> Free pages are not zeroed during boot.
*/ @@ -2160,7 +2167,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) { struct page_info *pg; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); pg = alloc_heap_pages(MEMZONE_XEN, MEMZONE_XEN, order, memflags | MEMF_no_scrub, NULL); @@ -2173,7 +2180,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) void free_xenheap_pages(void *v, unsigned int order) { -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( v == NULL ) return; @@ -2202,7 +2209,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) struct page_info *pg; unsigned int i; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits ) memflags &= ~MEMF_bits(~0U); @@ -2224,7 +2231,7 @@ void free_xenheap_pages(void *v, unsigned int order) struct page_info *pg; unsigned int i; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( v == NULL ) return; @@ -2249,7 +2256,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe) { mfn_t smfn, emfn; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); smfn = maddr_to_mfn(round_pgup(ps)); emfn = maddr_to_mfn(round_pgdown(pe)); @@ -2369,7 +2376,7 @@ struct page_info *alloc_domheap_pages( unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1; unsigned int dma_zone; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d, bits ? : (BITS_PER_LONG+PAGE_SHIFT)); @@ -2419,7 +2426,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order) unsigned int i; bool drop_dom_ref; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( unlikely(is_xen_heap_page(pg)) ) { @@ -2738,7 +2745,7 @@ int __init acquire_domstatic_pages(struct domain *d, mfn_t smfn, { struct page_info *pg; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); pg = acquire_staticmem_pages(smfn, nr_mfns, memflags); if ( !pg ) -- 2.30.2
Re: [PATCH v2] page_alloc: assert IRQs are enabled in heap alloc/free
On 21/04/2022 12:38, Jan Beulich wrote: On 21.04.2022 12:43, David Vrabel wrote: --- a/xen/arch/arm/setup.c +++ b/xen/arch/arm/setup.c @@ -984,6 +984,8 @@ void __init start_xen(unsigned long boot_phys_offset, console_init_postirq(); +system_state = SYS_STATE_smp_boot + do_presmp_initcalls(); for_each_present_cpu ( i ) I'm afraid it's not this simple: There are two "ASSERT(system_state != SYS_STATE_boot)" in Arm-specific code. While both could in principle be left as is, I think both want modifying to ">= SYS_STATE_active", such that they would also trigger when in this newly set state (in case registration of the notifiers was altered). These asserts are already too-relaxed given that there's an early_boot state. It also wants at least mentioning that setting this state is okay with all uses of system_state in common code (where it's not impossible that x86-isms still exist, having gone unnoticed so far), just to indicate that all of these were actually inspected (there's just one where it looks to be unobvious when simply looking at grep output, the one in keyhandler.c). As a result this may want to be a separate, prereq patch. At which point it will want considering whether to put the setting of the state _in_ do_presmp_initcalls() instead of ahead of its invocation. Not sure I understand this comment. The transition to the smp_boot state on arm makes the state machine on x86 and arm look _more_ alike, thus common code should be happier. --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -162,6 +162,14 @@ static char __initdata opt_badpage[100] = ""; string_param("badpage", opt_badpage); +/* + * Heap allocations may need TLB flushes which require IRQs to be + * enabled (except during early boot when only 1 PCPU is online). 
+ */ +#define ASSERT_ALLOC_CONTEXT() \ +ASSERT(!in_irq() && (local_irq_is_enabled() \ + || system_state < SYS_STATE_smp_boot)) Upon further consideration: In principle IRQs would be okay to be off whenever we're in UP mode (and hence flush IPIs don't need sending). Provided of course spin debug is off as well and no other IRQs-on checks get in the way (like that in flush_area_mask()). This might be more robust overall than depending on system_state, but I'm not going to exclude there may also be arguments against doing so. Not sure I understand what you're suggesting here. Do you mean something like this? #define ASSERT_ALLOC_CONTEXT() \ ASSERT(!in_irq() && (local_irq_is_enabled() \ || nr_online_cpus == 1)) In any event, looking back at my v1 comment, it would have been nice if the spinlock related aspect was at least also mentioned here, even if - as you did say in reply - the uses of the new macro aren't fully redundant with check_lock(). Also, nit: The || belongs on the earlier line as per our coding style. CODING_STYLE says: "Long lines should be split at sensible places and the trailing portions indented." If you're going to have rules (that have, IMO[1], worse readability) please document them. David

[1] Compare

a = b + dksaldksa_daskldsa_dsakdlsad + hds +
    dsadjka_jdaksjdk_daskajd;

and

a = b + dksaldksa_daskldsa_dsakdlsad + hds
    + dsadjka_jdaksjdk_daskajd;

Which one is more clearly readable as a sum?
[PATCH v2] page_alloc: assert IRQs are enabled in heap alloc/free
From: David Vrabel Heap pages can only be safely allocated and freed with interrupts enabled as they may require a TLB flush which will send IPIs. Normally spinlock debugging would catch calls from the incorrect context, but not from stop_machine_run() action functions as these are called with spin lock debugging disabled. Enhance the assertions in alloc_xenheap_pages() and alloc_domheap_pages() to check interrupts are enabled. For consistency the same asserts are used when freeing heap pages. As an exception, during early boot when only 1 PCPU is online, allocations are permitted with interrupts disabled. This required setting the SYS_STATE_smp_boot system state on arm, to match x86. Signed-off-by: David Vrabel --- Changes in v2: - Set SYS_STATE_smp_boot on arm. --- xen/arch/arm/setup.c| 2 ++ xen/common/page_alloc.c | 24 2 files changed, 18 insertions(+), 8 deletions(-) diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c index d5d0792ed4..44d45f1449 100644 --- a/xen/arch/arm/setup.c +++ b/xen/arch/arm/setup.c @@ -984,6 +984,8 @@ void __init start_xen(unsigned long boot_phys_offset, console_init_postirq(); +system_state = SYS_STATE_smp_boot; + do_presmp_initcalls(); for_each_present_cpu ( i ) diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c index 319029140f..e1ce38df13 100644 --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -162,6 +162,14 @@ static char __initdata opt_badpage[100] = ""; string_param("badpage", opt_badpage); +/* + * Heap allocations may need TLB flushes which require IRQs to be + * enabled (except during early boot when only 1 PCPU is online). 
+ */ +#define ASSERT_ALLOC_CONTEXT() \ +ASSERT(!in_irq() && (local_irq_is_enabled() \ + || system_state < SYS_STATE_smp_boot)) + /* * no-bootscrub -> Free pages are not zeroed during boot. */ @@ -2160,7 +2168,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) { struct page_info *pg; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); pg = alloc_heap_pages(MEMZONE_XEN, MEMZONE_XEN, order, memflags | MEMF_no_scrub, NULL); @@ -2173,7 +2181,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) void free_xenheap_pages(void *v, unsigned int order) { -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( v == NULL ) return; @@ -2202,7 +2210,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) struct page_info *pg; unsigned int i; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits ) memflags &= ~MEMF_bits(~0U); @@ -2224,7 +2232,7 @@ void free_xenheap_pages(void *v, unsigned int order) struct page_info *pg; unsigned int i; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( v == NULL ) return; @@ -2249,7 +2257,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe) { mfn_t smfn, emfn; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); smfn = maddr_to_mfn(round_pgup(ps)); emfn = maddr_to_mfn(round_pgdown(pe)); @@ -2369,7 +2377,7 @@ struct page_info *alloc_domheap_pages( unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1; unsigned int dma_zone; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d, bits ? : (BITS_PER_LONG+PAGE_SHIFT)); @@ -2419,7 +2427,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order) unsigned int i; bool drop_dom_ref; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( unlikely(is_xen_heap_page(pg)) ) { @@ -2738,7 +2746,7 @@ int __init acquire_domstatic_pages(struct domain *d, mfn_t smfn, { struct page_info *pg; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); pg = acquire_staticmem_pages(smfn, nr_mfns, memflags); if ( !pg ) -- 2.30.2
Re: [PATCH v1] page_alloc: assert IRQs are enabled in heap alloc/free
On 20/04/2022 07:26, Jan Beulich wrote: On 19.04.2022 17:01, David Vrabel wrote: From: David Vrabel Heap pages can only be safely allocated and freed with interrupts enabled as they may require a TLB flush which will send IPIs. Enhance the assertions in alloc_xenheap_pages() and alloc_domheap_pages() to check interrupts are enabled. For consistency the same asserts are used when freeing heap pages. As an exception, during early boot when only 1 PCPU is online, allocations are permitted with interrupts disabled. This exception is tightly coupled with spin lock checking, i.e. the point in time when spin_debug_enable() is called. I think this wants making explicit at least in the code comment, but as a result I also wonder in how far the extended assertions are really worthwhile: Any violation would be detected in check_lock() anyway. Thoughts? I was caught out by stop_machine_run() disabling both interrupts and spin lock debugging when running the action function, so check_lock() didn't help in this (admittedly) narrow use case. Furthermore I'm concerned about Arm not using either SYS_STATE_smp_boot or spin_debug_enable(). David
[PATCH v1] x86/mm: avoid inadvertently degrading a TLB flush to local only
From: David Vrabel If the direct map is incorrectly modified with interrupts disabled, the required TLB flushes are degraded to flushing the local CPU only. This could lead to very hard to diagnose problems as different CPUs will end up with different views of memory, although no such issues have yet been identified. Change the check in the flush_area() macro to look at system_state instead. This defers the switch from local to all flushes until later in the boot (see xen/arch/x86/setup.c:__start_xen()). This is fine because additional PCPUs are not brought up until after the system state is SYS_STATE_smp_boot. Signed-off-by: David Vrabel --- xen/arch/x86/mm.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index c271e383b5..72dbce43b1 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -5071,11 +5071,10 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v) #define lNf_to_l1f(f) (((f) & _PAGE_PRESENT) ? ((f) & ~_PAGE_PSE) : (f)) /* - * map_pages_to_xen() can be called with interrupts disabled during - * early bootstrap. In this case it is safe to use flush_area_local() - * and avoid locking because only the local CPU is online. + * map_pages_to_xen() can be called early in boot before any other + * CPUs are online. Use flush_area_local() in this case. */ -#define flush_area(v,f) (!local_irq_is_enabled() ? \ +#define flush_area(v,f) (system_state < SYS_STATE_smp_boot ?\ flush_area_local((const void *)v, f) : \ flush_area_all((const void *)v, f)) -- 2.30.2
[PATCH v1] page_alloc: assert IRQs are enabled in heap alloc/free
From: David Vrabel Heap pages can only be safely allocated and freed with interrupts enabled as they may require a TLB flush which will send IPIs. Enhance the assertions in alloc_xenheap_pages() and alloc_domheap_pages() to check interrupts are enabled. For consistency the same asserts are used when freeing heap pages. As an exception, during early boot when only 1 PCPU is online, allocations are permitted with interrupts disabled. Signed-off-by: David Vrabel --- xen/common/page_alloc.c | 24 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c index 319029140f..e1ce38df13 100644 --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -162,6 +162,14 @@ static char __initdata opt_badpage[100] = ""; string_param("badpage", opt_badpage); +/* + * Heap allocations may need TLB flushes which require IRQs to be + * enabled (except during early boot when only 1 PCPU is online). + */ +#define ASSERT_ALLOC_CONTEXT() \ +ASSERT(!in_irq() && (local_irq_is_enabled() \ + || system_state < SYS_STATE_smp_boot)) + /* * no-bootscrub -> Free pages are not zeroed during boot. 
*/ @@ -2160,7 +2168,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) { struct page_info *pg; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); pg = alloc_heap_pages(MEMZONE_XEN, MEMZONE_XEN, order, memflags | MEMF_no_scrub, NULL); @@ -2173,7 +2181,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) void free_xenheap_pages(void *v, unsigned int order) { -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( v == NULL ) return; @@ -2202,7 +2210,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags) struct page_info *pg; unsigned int i; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits ) memflags &= ~MEMF_bits(~0U); @@ -2224,7 +2232,7 @@ void free_xenheap_pages(void *v, unsigned int order) struct page_info *pg; unsigned int i; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( v == NULL ) return; @@ -2249,7 +2257,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe) { mfn_t smfn, emfn; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); smfn = maddr_to_mfn(round_pgup(ps)); emfn = maddr_to_mfn(round_pgdown(pe)); @@ -2369,7 +2377,7 @@ struct page_info *alloc_domheap_pages( unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1; unsigned int dma_zone; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d, bits ? : (BITS_PER_LONG+PAGE_SHIFT)); @@ -2419,7 +2427,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order) unsigned int i; bool drop_dom_ref; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); if ( unlikely(is_xen_heap_page(pg)) ) { @@ -2738,7 +2746,7 @@ int __init acquire_domstatic_pages(struct domain *d, mfn_t smfn, { struct page_info *pg; -ASSERT(!in_irq()); +ASSERT_ALLOC_CONTEXT(); pg = acquire_staticmem_pages(smfn, nr_mfns, memflags); if ( !pg ) -- 2.30.2
Re: [PATCH] xen/evtchn: Add design for static event channel signaling for domUs..
On 23/03/2022 15:43, Rahul Singh wrote: in dom0less system. This patch introduces a new feature to support signaling between two domUs in a dom0less system. Signed-off-by: Rahul Singh --- docs/designs/dom0less-evtchn.md | 96 + 1 file changed, 96 insertions(+) create mode 100644 docs/designs/dom0less-evtchn.md diff --git a/docs/designs/dom0less-evtchn.md b/docs/designs/dom0less-evtchn.md new file mode 100644 index 00..6a1b7e8c22 --- /dev/null +++ b/docs/designs/dom0less-evtchn.md @@ -0,0 +1,96 @@ +# Signaling support between two domUs on dom0less system + +## Current state: Draft version + +## Proposer(s): Rahul Singh, Bertrand Marquis + +## Problem Statement: + +The goal of this work is to define a simple signaling system between Xen guests +in dom0less systems. + +In a dom0less system, we cannot make use of xenbus and xenstore that are used in +normal systems with dynamic VMs to communicate between domains by providing a +bus abstraction for paravirtualized drivers. + +One possible solution to implement the signaling system between domUs is based +on event channels. This problem statement could do with some example use cases that are usefully solved by this proposed solution. "We don't have xenstore so can't set up shared rings, but here's a replacement comms mechanism that can do a single bit." Doesn't seem very compelling to me. +chosen { + +domU1: domU1 { +.. +}; + +domU2: domU2 { +.. +}; + +evtchn@1 { +compatible = "xen,evtchn"; +xen,evtchn = <0xa 0xb >; +}; + +evtchn@2 { +compatible = "xen,evtchn"; +xen,evtchn = <0xc 0xd >; +}; How is the domain supposed to know what these event channels are for? I'm not that familiar with device tree. Is it possible to give these entries names? David
Re: [XEN PATCH] evtchn/fifo: Don't set PENDING bit if guest misbehaves
On 16/03/2022 18:38, Raphael Ning wrote: Currently, evtchn_fifo_set_pending() will mark the event as PENDING even if it fails to lock the FIFO event queue(s), or if the guest has not initialized the FIFO control block for the target vCPU. A well-behaved guest should never trigger either of these cases. Reviewed-by: David Vrabel David
Re: [XEN PATCH] evtchn/fifo: Don't set PENDING bit if guest misbehaves
On 17/03/2022 06:28, Juergen Gross wrote: On 16.03.22 19:38, Raphael Ning wrote: From: Raphael Ning Currently, evtchn_fifo_set_pending() will mark the event as PENDING even if it fails to lock the FIFO event queue(s), or if the guest has not initialized the FIFO control block for the target vCPU. A well-behaved guest should never trigger either of these cases. Is this true even for the resume case e.g. after a migration? The guests starts on the new host with no FIFO control block for any vcpu registered, so couldn't an event get lost with your patch in case it was sent before the target vcpu's control block gets registered? An event that is PENDING but not LINKED is not reachable by the guest so it won't ever see such an event, so the event is lost whether the P bit is set or not. Guests ensure that event channels are not bound to VCPUs that don't (yet) have FIFO control blocks. For example, in Linux xen_irq_resume() reinitializes the control blocks (in xen_evtchn_resume()) before restoring any of the event channels. David
Re: [PATCH] x86/kexec: Fix kexec-reboot with CET active
On 07/03/2022 20:53, Andrew Cooper wrote: The kexec_reloc() asm has an indirect jump to relocate onto the identity trampoline. While we clear CET in machine_crash_shutdown(), we fail to clear CET for the non-crash path. This in turn highlights that the same is true of resetting the CPUID masking/faulting. Move both pieces of logic from machine_crash_shutdown() to machine_kexec(), the latter being common for all kexec transitions. Adjust the condition for CET being considered active to check in CR4, which is simpler and more robust. Reviewed-by: David Vrabel Fixes: 311434bfc9d1 ("x86/setup: Rework MSR_S_CET handling for CET-IBT") Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks") Fixes: 5ab9564c6fa1 ("x86/cpu: Context switch cpuid masks and faulting state in context_switch()") Reported-by: David Vrabel (XXX which alias to use?) Amazon, please. David
CET-IBT and kexec?
kexec_reloc (see xen/arch/x86/x86_64/kexec_reloc.S) has an indirect branch as part of switching page tables. I understand that if CET-IBT is enabled this will raise an exception since there's no ENDBR64 instruction and (as far as I could tell) CET-IBT has not been disabled in machine_kexec() prior to calling kexec_reloc(). Have I correctly spotted an issue, and if so, would the correct fix be to disable CET-IBT in machine_kexec()? I guess this would also be an issue if kexec'ing to an image without CET-IBT support. David
Re: [PATCH 2/2] xen/x86: Livepatch: support patching CET-enhanced functions
On 07/03/2022 14:03, Jan Beulich wrote: On 07.03.2022 12:53, Bjoern Doebel wrote: @@ -104,18 +122,36 @@ void noinline arch_livepatch_revive(void) int arch_livepatch_verify_func(const struct livepatch_func *func) { +BUILD_BUG_ON(sizeof(struct x86_livepatch_meta) != LIVEPATCH_OPAQUE_SIZE); + /* If NOPing.. */ if ( !func->new_addr ) { +struct x86_livepatch_meta *lp; + +lp = (struct x86_livepatch_meta *)func->opaque; /* Only do up to maximum amount we can put in the ->opaque. */ -if ( func->new_size > sizeof(func->opaque) ) +if ( func->new_size > sizeof(lp->instruction) ) return -EOPNOTSUPP; if ( func->old_size < func->new_size ) return -EINVAL; } I continue to be concerned of the new local variable causing compiler warnings. With the adjustment made compared to v1, the specific warning would have changed, and we're now liable to see set-but-never-used ones. Linux has a sizeof_field() macro for this sort of use. /** * sizeof_field() - Report the size of a struct field in bytes * * @TYPE: The structure containing the field of interest * @MEMBER: The field to return the size of */ #define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER)) David
Re: [PATCH v2 21/70] xen/evtchn: CFI hardening
On 14/02/2022 12:50, Andrew Cooper wrote: Control Flow Integrity schemes use toolchain and optionally hardware support to help protect against call/jump/return oriented programming attacks. Use cf_check to annotate function pointer targets for the toolchain. [...] -static void evtchn_2l_set_pending(struct vcpu *v, struct evtchn *evtchn) +static void cf_check evtchn_2l_set_pending( +struct vcpu *v, struct evtchn *evtchn) Why manually annotate functions instead of getting the compiler to automatically work it out? David
Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
On 18/01/2022 08:50, Jan Beulich wrote: On 13.01.2022 14:41, Jan Beulich wrote: Calibration logic assumes that the platform timer (HPET or ACPI PM timer) and the TSC are read at about the same time. This assumption may not hold when a long latency event (e.g. SMI or NMI) occurs between the two reads. Reduce the risk of reading uncorrelated values by doing at least four pairs of reads, using the tuple where the delta between the enclosing TSC reads was smallest. From the fourth iteration onwards bail if the new TSC delta isn't better (smaller) than the best earlier one. Signed-off-by: Jan Beulich When running virtualized, scheduling in the host would also constitute long latency events. I wonder whether, to compensate for that, we'd want more than 3 "base" iterations, as I would expect scheduling events to occur more frequently than e.g. SMI (and with a higher probability of multiple ones occurring in close succession). Should Xen be continually or periodically recalibrating, rather than trying to get it perfect at the start of day? You may also be able to find inspiration from the design or implementation of the Precision Time Protocol, which has to similarly filter out outliers due to transmission delays. David
Re: [PATCH] x86/hvm: reserve another HVM context save record ID for Amazon
On 14/01/2022 07:08, Jan Beulich wrote: On 07.01.2022 13:55, David Vrabel wrote: Amazon's guest transparent live migration work needs another save record (for event channel upcall vectors). Reserve another HVM context save record ID for this. I have to admit that I have reservations: I didn't really like seeing the original range getting reserved. Even less so I'd like to see extensions / further such reservations. The more that iirc the original reservation was accepted based on a (perhaps vague) promise of the respective uses actually getting upstreamed. Yet that hasn't happened (nor even started to happen) in slightly over 2 years time, iirc. I think this is fair. I hadn't realized we'd sat on this work for so long. What I could see as a compromise is to have, say, vendor ranges higher up in number space. I (personally) would accept removing the reservation entirely -- we didn't follow the upstreaming process, so we take the cost of fixing up any compatibility issues. David
Re: [XEN PATCH 1/7] xen: introduce XENFEAT_xenstore_late_init
On 10/01/2022 22:55, Stefano Stabellini wrote: I have a patch for Linux that if XENFEAT_xenstore_late_init is present makes Linux wait for an event notification before initializing xenstore: https://marc.info/?l=xen-devel&m=164160299315589 So with v1 of the Xen and Linux patch series: - Xen allocates the event channel at domain creation - Linux boots, sees XENFEAT_xenstore_late_init and waits for an event - init-dom0less later allocates the xenstore page - init-dom0less triggers the xenstore event channel - Linux receives the event and finishes the initialization, including mapping the xenstore page You can get this behaviour without the new feature. - Xen allocates the event channel at domain creation - Linux boots, sees HVM_PARAM_STORE_PFN is invalid and there's a xenstore event channel and waits for an event - init-dom0less later allocates the xenstore page - init-dom0less triggers the xenstore event channel - Linux receives the event and finishes the initialization, including mapping the xenstore page. David
Re: [PATCHv2] x86/hvm: add more callback/upcall info to 'I' debug key
On 07/01/2022 13:45, Andrew Cooper wrote: printk("Callback via PCI dev %u INTx %u%s\n", PCI 00:%02x.0 ? Is this correct? If I remember right, the INTx lines are associated with a PCI device, with the function then reporting which line it uses. So Xen neither knows (nor cares) what the function is, so it would be misleading to report it. + hvm_irq->callback_via.pci.dev, + hvm_irq->callback_via.pci.intx, + via_asserted); +break; + +case HVMIRQ_callback_vector: +printk("Callback via vector %u%s\n", + hvm_irq->callback_via.vector, + via_asserted); ... here, vectors ought to be 0x%02x. Amongst other things, it makes the priority class instantly readable. I realise this is all a complete mess, but is via_asserted correct for HVMIRQ_callback_vector? It's mismatched between the two, and the best metric that exists is "is pending in IRR". Also, looking at struct hvm_irq, all the callback information is in the wrong structure, because it absolutely shouldn't be duplicated for each GSI. I'm not sure what changes to this patch you want here.. David
Re: [PATCH] x86/hvm: save/restore per-VCPU event channel upcall vector
On 06/01/2022 16:48, Jan Beulich wrote: On 06.01.2022 16:54, David Vrabel wrote: The Windows XENBUS driver sets the per-VCPU LAPIC vector for event channel interrupts using the HVMOP_set_evtchn_upcall_vector hypercall (rather than using a vector-type callback in the CALLBACK_IRQ HVM parameter since the vectors might be different for different VCPUs). This state needs to be saved/restored or a restored guest may not be able to get event channel interrupts. Note that the Windows XENBUS driver works around this by reissuing the hypercall when resuming after a migration, but this workaround would not be possible in a guest transparent live migration or a live update. Why would this be needed only for this one form of "via"? And how do you guarantee no observable change in behavior for existing guests? It would seem to me that this information is something to be saved / restored _only_ during transparent migration, as aware guests may deliberately defer re-registration. I was under the impression that the HVM parameters (including CALLBACK_IRQ) were saved/restored so the intent here was to save/restore all event channel delivery mechanism state consistently but it seems I was confused by some internal work that's not upstream yet. So, I agree, this patch on its own doesn't make sense. I've sent a patch reserving another save record ID instead. +static int hvm_load_evtchn_upcall_vector( +struct domain *d, hvm_domain_context_t *h) +{ +unsigned int vcpuid; +struct vcpu *v; +struct hvm_evtchn_upcall_vector upcall; + +vcpuid = hvm_load_instance(h); +if ( (v = domain_vcpu(d, vcpuid)) == NULL ) +return -EINVAL; + +if ( hvm_load_entry(EVTCHN_UPCALL_VECTOR, h, &upcall) != 0 ) +return -EINVAL; + +hvm_set_evtchn_upcall_vector(v, upcall.vector); + +return 0; +} Don't you need to also set callback_via_type accordingly? The callback_via_type is ignored if a per-vcpu upcall vector is set. 
You can use a mix of CALLBACK_IRQ-defined upcalls on some VCPUs, and a HVMOP_set_evtchn_upcall_vector-defined one on others, so setting the per-domain type wouldn't work. I'm not sure why you would do this, but the ABI (as implemented, if not intentionally) allows it... David
[PATCH] x86/hvm: reserve another HVM context save record ID for Amazon
Amazon's guest transparent live migration work needs another save record (for event channel upcall vectors). Reserve another HVM context save record ID for this. Signed-off-by: David Vrabel --- I've added it to the end, keeping the unused ID at 21. --- xen/include/public/arch-x86/hvm/save.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/xen/include/public/arch-x86/hvm/save.h b/xen/include/public/arch-x86/hvm/save.h index 773a380bc2..2de3dfd9bb 100644 --- a/xen/include/public/arch-x86/hvm/save.h +++ b/xen/include/public/arch-x86/hvm/save.h @@ -641,7 +641,7 @@ struct hvm_msr { #define CPU_MSR_CODE 20 -/* Range 22 - 34 (inclusive) reserved for Amazon */ +/* Range 22 - 35 (inclusive) reserved for Amazon */ /* * Largest type-code in use -- 2.30.2
[PATCHv2] x86/hvm: add more callback/upcall info to 'I' debug key
Include the type of the callback via and the per-VCPU upcall vector. Signed-off-by: David Vrabel --- v2: - fix style - make upcall vector output distinguishable from logs prior to this patch - use fewer lines for callback via. --- xen/arch/x86/hvm/irq.c | 49 ++ 1 file changed, 45 insertions(+), 4 deletions(-) diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c index 52aae4565f..8b1bb79127 100644 --- a/xen/arch/x86/hvm/irq.c +++ b/xen/arch/x86/hvm/irq.c @@ -598,7 +598,11 @@ int hvm_local_events_need_delivery(struct vcpu *v) static void irq_dump(struct domain *d) { struct hvm_irq *hvm_irq = hvm_domain_irq(d); -int i; +unsigned int i; +const struct vcpu *v; +bool have_upcall_vector = false; +const char *via_asserted; + printk("Domain %d:\n", d->domain_id); printk("PCI 0x%16.16"PRIx64"%16.16"PRIx64 " ISA 0x%8.8"PRIx32" ROUTE %u %u %u %u\n", @@ -630,9 +634,46 @@ static void irq_dump(struct domain *d) hvm_irq->pci_link_assert_count[1], hvm_irq->pci_link_assert_count[2], hvm_irq->pci_link_assert_count[3]); -printk("Callback via %i:%#"PRIx32",%s asserted\n", - hvm_irq->callback_via_type, hvm_irq->callback_via.gsi, - hvm_irq->callback_via_asserted ? "" : " not"); + +printk("Per-VCPU upcall vector:\n"); +for_each_vcpu ( d, v ) +{ +if ( v->arch.hvm.evtchn_upcall_vector ) +{ +printk(" v%u: %u\n", + v->vcpu_id, v->arch.hvm.evtchn_upcall_vector); +have_upcall_vector = true; +} +} +if ( !have_upcall_vector ) +printk(" none\n"); + +via_asserted = hvm_irq->callback_via_asserted ? 
" (asserted)" : ""; +switch( hvm_irq->callback_via_type ) +{ +case HVMIRQ_callback_none: +printk("Callback via none\n"); +break; + +case HVMIRQ_callback_gsi: +printk("Callback via GSI %u%s\n", + hvm_irq->callback_via.gsi, + via_asserted); +break; + +case HVMIRQ_callback_pci_intx: +printk("Callback via PCI dev %u INTx %u%s\n", + hvm_irq->callback_via.pci.dev, + hvm_irq->callback_via.pci.intx, + via_asserted); +break; + +case HVMIRQ_callback_vector: +printk("Callback via vector %u%s\n", + hvm_irq->callback_via.vector, + via_asserted); +break; +} } static void dump_irq_info(unsigned char key) -- 2.30.2
[PATCH] x86/hvm: save/restore per-VCPU event channel upcall vector
The Windows XENBUS driver sets the per-VCPU LAPIC vector for event channel interrupts using the HVMOP_set_evtchn_upcall_vector hypercall (rather than using a vector-type callback in the CALLBACK_IRQ HVM parameter since the vectors might be different for different VCPUs). This state needs to be saved/restored or a restored guest may not be able to get event channel interrupts. Note that the Windows XENBUS driver works around this by reissuing the hypercall when resuming after a migration, but this workaround would not be possible in a guest transparent live migration or a live update. Signed-off-by: David Vrabel --- xen/arch/x86/hvm/hvm.c | 50 -- xen/include/public/arch-x86/hvm/save.h | 12 ++- 2 files changed, 58 insertions(+), 4 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 350dc396e3..be2e676c4a 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -4071,6 +4071,52 @@ static int hvmop_flush_tlb_all(void) return paging_flush_tlb(NULL) ? 0 : -ERESTART; } +static void hvm_set_evtchn_upcall_vector(struct vcpu *v, uint8_t vector) +{ +printk(XENLOG_G_INFO "%pv: upcall vector %02x\n", v, vector); + +v->arch.hvm.evtchn_upcall_vector = vector; +hvm_assert_evtchn_irq(v); +} + +static int hvm_save_evtchn_upcall_vector( +struct vcpu *v, hvm_domain_context_t *h) +{ +struct hvm_evtchn_upcall_vector upcall; + +/* Skip record if VCPU is down or the upcall vector is not used. 
*/ +if ( test_bit(_VPF_down, &v->pause_flags) ) +return 0; +if ( v->arch.hvm.evtchn_upcall_vector == 0 ) +return 0; + +upcall.vector = v->arch.hvm.evtchn_upcall_vector; + +return hvm_save_entry(EVTCHN_UPCALL_VECTOR, v->vcpu_id, h, &upcall); +} + +static int hvm_load_evtchn_upcall_vector( +struct domain *d, hvm_domain_context_t *h) +{ +unsigned int vcpuid; +struct vcpu *v; +struct hvm_evtchn_upcall_vector upcall; + +vcpuid = hvm_load_instance(h); +if ( (v = domain_vcpu(d, vcpuid)) == NULL ) +return -EINVAL; + +if ( hvm_load_entry(EVTCHN_UPCALL_VECTOR, h, &upcall) != 0 ) +return -EINVAL; + +hvm_set_evtchn_upcall_vector(v, upcall.vector); + +return 0; +} + +HVM_REGISTER_SAVE_RESTORE(EVTCHN_UPCALL_VECTOR, hvm_save_evtchn_upcall_vector, + hvm_load_evtchn_upcall_vector, 1, HVMSR_PER_VCPU); + static int hvmop_set_evtchn_upcall_vector( XEN_GUEST_HANDLE_PARAM(xen_hvm_evtchn_upcall_vector_t) uop) { @@ -4090,10 +4136,8 @@ static int hvmop_set_evtchn_upcall_vector( if ( (v = domain_vcpu(d, op.vcpu)) == NULL ) return -ENOENT; -printk(XENLOG_G_INFO "%pv: upcall vector %02x\n", v, op.vector); +hvm_set_evtchn_upcall_vector(v, op.vector); -v->arch.hvm.evtchn_upcall_vector = op.vector; -hvm_assert_evtchn_irq(v); return 0; } diff --git a/xen/include/public/arch-x86/hvm/save.h b/xen/include/public/arch-x86/hvm/save.h index 773a380bc2..177cb09150 100644 --- a/xen/include/public/arch-x86/hvm/save.h +++ b/xen/include/public/arch-x86/hvm/save.h @@ -641,12 +641,22 @@ struct hvm_msr { #define CPU_MSR_CODE 20 +/** + * Per-VCPU event channel upcall vector as set by + * HVMOP_set_evtchn_upcall_vector hypercall. + */ +struct hvm_evtchn_upcall_vector { +uint8_t vector; +}; + +DECLARE_HVM_SAVE_TYPE(EVTCHN_UPCALL_VECTOR, 21, struct hvm_evtchn_upcall_vector); + /* Range 22 - 34 (inclusive) reserved for Amazon */ /* * Largest type-code in use */ -#define HVM_SAVE_CODE_MAX 20 +#define HVM_SAVE_CODE_MAX 21 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */ -- 2.30.2
[PATCH] x86/hvm: add more callback/upcall info to 'I' debug key
Include the type of the callback via and the per-VCPU upcall vector. Signed-off-by: David Vrabel --- xen/arch/x86/hvm/irq.c | 31 +++ 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c index 52aae4565f..6a1edb99f2 100644 --- a/xen/arch/x86/hvm/irq.c +++ b/xen/arch/x86/hvm/irq.c @@ -598,7 +598,9 @@ int hvm_local_events_need_delivery(struct vcpu *v) static void irq_dump(struct domain *d) { struct hvm_irq *hvm_irq = hvm_domain_irq(d); -int i; +int i; +struct vcpu *v; + printk("Domain %d:\n", d->domain_id); printk("PCI 0x%16.16"PRIx64"%16.16"PRIx64 " ISA 0x%8.8"PRIx32" ROUTE %u %u %u %u\n", @@ -630,9 +632,30 @@ static void irq_dump(struct domain *d) hvm_irq->pci_link_assert_count[1], hvm_irq->pci_link_assert_count[2], hvm_irq->pci_link_assert_count[3]); -printk("Callback via %i:%#"PRIx32",%s asserted\n", - hvm_irq->callback_via_type, hvm_irq->callback_via.gsi, - hvm_irq->callback_via_asserted ? "" : " not"); +for_each_vcpu( d, v ) +{ +if ( v->arch.hvm.evtchn_upcall_vector ) +printk("%pv: upcall vector: %u\n", + v, v->arch.hvm.evtchn_upcall_vector); +} +switch( hvm_irq->callback_via_type ) +{ +case HVMIRQ_callback_none: +printk("Callback via none\n"); +break; +case HVMIRQ_callback_gsi: +printk("Callback via GSI %u\n", hvm_irq->callback_via.gsi); +break; +case HVMIRQ_callback_pci_intx: +printk("Callback via PCI dev %u INTx %u\n", + hvm_irq->callback_via.pci.dev, + hvm_irq->callback_via.pci.intx); +break; +case HVMIRQ_callback_vector: +printk("Callback via vector %u\n", hvm_irq->callback_via.vector); +break; +} +printk(" %s asserted\n", hvm_irq->callback_via_asserted ? "" : " not"); } static void dump_irq_info(unsigned char key) -- 2.30.2