Re: [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register

2022-12-13 Thread David Vrabel




On 13/12/2022 11:50, Jan Beulich wrote:

On 13.12.2022 12:34, David Vrabel wrote:

On 12/12/2022 17:04, Jan Beulich wrote:

On 10.11.2022 17:59, David Vrabel wrote:


--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -237,7 +237,10 @@ struct arch_msix {
   int table_refcnt[MAX_MSIX_TABLE_PAGES];
   int table_idx[MAX_MSIX_TABLE_PAGES];
   spinlock_t table_lock;
+spinlock_t control_lock;
   bool host_maskall, guest_maskall;
+uint16_t host_enable;


If you want to keep this more narrow than "unsigned int", then please
add a BUILD_BUG_ON() against NR_CPUS, so the need to update the field
can be easily noticed (in some perhaps distant future).
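(For illustration, the kind of compile-time check being asked for -- assuming
the field stays a 16-bit counter that could in principle be bumped once per
CPU -- is simply:

    BUILD_BUG_ON(NR_CPUS > 0xffff);

placed next to the code that increments the field.)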


This is only incremented:

- while holding the pci_devs lock, or
- while holding a lock for one of the associated IRQs.


How do the locks held matter here, especially given that - as iirc you say
in the description - neither lock is held uniformly?


Because the usage here is:

   lock()
   host_enable++
   ...
   host_enable--
   unlock()


Since there are at most 4096 MSI-X vectors (and thus at most 4096 IRQs),
the highest value this can be (even with >> 4096 PCPUs) is 4097, thus a
uint16_t is fine.


Where's the 4096 coming from as a limit for MSI-X vectors? DYM 2048, which
is the per-device limit (because the qsize field is 11 bits wide)? If so,
yes, I think that's indeed restricting how large the number can get.


Yes, I did mean 2048 here.


I could only think of less neat ones like host_enable_override or
forced_enable or some such. If I had any good name in mind, I would certainly
have said so.


host_enable_override works for me.

David



Re: [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register

2022-12-13 Thread David Vrabel




On 12/12/2022 17:04, Jan Beulich wrote:

On 10.11.2022 17:59, David Vrabel wrote:


--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -237,7 +237,10 @@ struct arch_msix {
  int table_refcnt[MAX_MSIX_TABLE_PAGES];
  int table_idx[MAX_MSIX_TABLE_PAGES];
  spinlock_t table_lock;
+spinlock_t control_lock;
  bool host_maskall, guest_maskall;
+uint16_t host_enable;


If you want to keep this more narrow than "unsigned int", then please
add a BUILD_BUG_ON() against NR_CPUS, so the need to update the field
can be easily noticed (in some perhaps distant future).


This is only incremented:

- while holding the pci_devs lock, or
- while holding a lock for one of the associated IRQs.

Since there are at most 4096 MSI-X vectors (and thus at most 4096 IRQs), 
the highest value this can be (even with >> 4096 PCPUs) is 4097, thus a 
uint16_t is fine.



+static void msix_update_unlock(struct pci_dev *dev, unsigned int pos, uint16_t 
control)
+{
+uint16_t new_control;
+unsigned long flags;
+
+spin_lock_irqsave(&dev->msix->control_lock, flags);
+
+dev->msix->host_enable--;
+
+new_control = control & ~(PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
+
+if ( dev->msix->host_enable || dev->msix->guest_enable )
+new_control |= PCI_MSIX_FLAGS_ENABLE;
+if ( dev->msix->host_maskall || dev->msix->guest_maskall || 
dev->msix->host_enable )
+new_control |= PCI_MSIX_FLAGS_MASKALL;


In particular this use of "host_enable" suggests the field wants to be
named differently: It makes no sense to derive MASKALL from ENABLE
without it being clear (from the name) that the field is an init-time
override only.


I think the name as-is makes sense. It is analogous to the host_maskall 
and complements guest_enable. I can't think of a better name, so what 
name do you suggest.


David



Re: [PATCH] xen-pciback: Consider MSI-X enabled only when MASKALL bit is cleared

2022-11-17 Thread David Vrabel

On 17/11/2022 11:41, Marek Marczykowski-Górecki wrote:

Linux enables MSI-X before disabling INTx, but keeps MSI-X masked until
the table is filled. Then it disables INTx just before clearing MASKALL
bit. Currently this approach is rejected by xen-pciback.
Allow setting PCI_MSIX_FLAGS_ENABLE while INTx is still enabled as long
as PCI_MSIX_FLAGS_MASKALL is set too.


The use of MSI-X interrupts is conditional on only the MSI-X Enable bit. 
Setting MSI-X Enable effectively overrides the Interrupt Disable bit in 
the Command register.


PCIe 6.0.1 section 7.7.2.2. "MSI-X Enable ... is prohibited from using 
INTx interrupts (if implemented)." And there is similar wording for MSI 
Enable.


I think you need to shuffle the checks for MSI/MSI-X in 
xen_pcibk_get_interrupt_type() so they are _before_ the check of the 
Interrupt Disable bit in the Command register.
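Roughly something like the following (a sketch of the ordering only, not the
actual xen-pciback code; the INTERRUPT_TYPE_* values are xen-pciback's own
flags and the capability lookups here are illustrative):

	int msi = pci_find_capability(dev, PCI_CAP_ID_MSI);
	int msix = pci_find_capability(dev, PCI_CAP_ID_MSIX);
	u16 val;
	int ret = 0;

	if (msi) {
		pci_read_config_word(dev, msi + PCI_MSI_FLAGS, &val);
		if (val & PCI_MSI_FLAGS_ENABLE)
			ret |= INTERRUPT_TYPE_MSI;
	}
	if (msix) {
		pci_read_config_word(dev, msix + PCI_MSIX_FLAGS, &val);
		if (val & PCI_MSIX_FLAGS_ENABLE)
			ret |= INTERRUPT_TYPE_MSIX;
	}
	/* INTx only counts if no message-signalled mechanism is enabled. */
	if (!ret) {
		pci_read_config_word(dev, PCI_COMMAND, &val);
		if (!(val & PCI_COMMAND_INTX_DISABLE))
			ret |= INTERRUPT_TYPE_INTX;
	}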


David



Re: [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit()

2022-11-11 Thread David Vrabel




On 11/11/2022 09:44, Jan Beulich wrote:


The idea of the WARN() / BUG_ON() is to
- not leave failed unmasking unrecorded,
- not continue after failure to mask an entry.


Then let's make msi_set_mask_bit() unable to fail with something like
this (untested) patch. Would this be acceptable?
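The rough shape of the change is to force memory decoding on whenever MSI-X is
in use, resolving a virtual function to its physical function first. As an
illustrative sketch only (names may differ from the patch below):

    static void force_memory_decode(struct pci_dev *dev)
    {
        pci_sbdf_t sbdf = dev->sbdf;
        uint16_t cmd;

        /* A VF's Command register lives in its physical function. */
        if ( dev->info.is_virtfn )
        {
            sbdf.bus = dev->info.physfn.bus;
            sbdf.devfn = dev->info.physfn.devfn;
        }

        cmd = pci_conf_read16(sbdf, PCI_COMMAND);
        if ( !(cmd & PCI_COMMAND_MEMORY) )
            pci_conf_write16(sbdf, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
    }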


David

From 837649a70d44455f4fd98e2eaa46dcf35a56d00a Mon Sep 17 00:00:00 2001
From: David Vrabel 
Date: Fri, 11 Nov 2022 14:30:16 +
Subject: [PATCH] x86: Always enable memory space decodes when using MSI-X

Instead of the numerous (racy) checks for memory space accesses being
enabled before writing to the MSI-X table, force Memory Space Enable
to be set in the Command register if MSI-X is used.

This allows the memory_decoded() function and the associated error
paths to be removed (since it will always return true). In particular,
msi_set_mask_bit() can no longer fail and its return value is removed.

Note that if the PCI device is a virtual function, the relevant
command register is in the corresponding physical function.

Signed-off-by: David Vrabel 
---
 xen/arch/x86/include/asm/pci.h |   3 +
 xen/arch/x86/msi.c | 116 +
 xen/arch/x86/pci.c |  39 ++-
 3 files changed, 71 insertions(+), 87 deletions(-)

diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
index f4a58c8acf..4f59b70959 100644
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -32,8 +32,11 @@ struct arch_pci_dev {
 domid_t pseudo_domid;
 mfn_t leaf_mfn;
 struct page_list_head pgtables_list;
+uint16_t host_command;
+uint16_t guest_command;
 };

+void pci_command_override(struct pci_dev *pdev, uint16_t val);
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
  unsigned int reg, unsigned int size,
  uint32_t *data);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..2f8667aa7b 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -124,27 +124,11 @@ static void msix_put_fixmap(struct arch_msix 
*msix, int idx)

 spin_unlock(&msix->table_lock);
 }

-static bool memory_decoded(const struct pci_dev *dev)
-{
-pci_sbdf_t sbdf = dev->sbdf;
-
-if ( dev->info.is_virtfn )
-{
-sbdf.bus = dev->info.physfn.bus;
-sbdf.devfn = dev->info.physfn.devfn;
-}
-
-return pci_conf_read16(sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY;
-}
-
 static bool msix_memory_decoded(const struct pci_dev *dev, unsigned 
int pos)

 {
 uint16_t control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));

-if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
-return false;
-
-return memory_decoded(dev);
+return control & PCI_MSIX_FLAGS_ENABLE;
 }

 /*
@@ -314,7 +298,7 @@ int msi_maskable_irq(const struct msi_desc *entry)
|| entry->msi_attrib.maskbit;
 }

-static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
+static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 {
 struct msi_desc *entry = desc->msi_desc;
 struct pci_dev *pdev;
@@ -354,45 +338,26 @@ static bool msi_set_mask_bit(struct irq_desc 
*desc, bool host, bool guest)

  control | (PCI_MSIX_FLAGS_ENABLE |
 PCI_MSIX_FLAGS_MASKALL));
 }
-if ( likely(memory_decoded(pdev)) )
-{
-writel(flag, entry->mask_base + 
PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);

-readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);

-if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
-break;
-
-entry->msi_attrib.host_masked = host;
-entry->msi_attrib.guest_masked = guest;
+writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);

-flag = true;
-}
-else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
+if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
 {
-domid_t domid = pdev->domain->domain_id;
-
-maskall = true;
-if ( pdev->msix->warned != domid )
-{
-pdev->msix->warned = domid;
-printk(XENLOG_G_WARNING
-   "cannot mask IRQ %d: masking MSI-X on Dom%d's 
%pp\n",

-   desc->irq, domid, &pdev->sbdf);
-}
+pdev->msix->host_maskall = maskall;
+if ( maskall || pdev->msix->guest_maskall )
+control |= PCI_MSIX_FLAGS_MASKALL;
+pci_conf_write16(pdev->sbdf,
+ msix_control_reg(entry->msi_attrib.pos), 
control);

 }
-pdev->msix->host_maskall = maskall;
-

[PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register

2022-11-10 Thread David Vrabel
Concurrent accesses to the MSI-X Control register are not serialized
with a suitable lock. For example, accesses in msix_capability_init()
use pcidevs_lock() but some calls to msi_set_mask_bit() use the
interrupt descriptor lock.

This can lead to MSI-X being incorrectly disabled and subsequent
failures due to msix_memory_decoded() calls that check for MSI-X being
enabled.

This was seen with some non-compliant hardware that gated MSI-X
messages on the per-vector mask bit only (i.e., the MSI-X Enable bit
and Function Mask bits in the MSI-X Control register were ignored). An
interrupt (with a pending move) for vector 0 would occur while vector
1 was being initialized in msix_capability_init(). Updates to the
Control register would race and the vector 1 initialization would
intermittently fail with -ENXIO.

Typically a race between initializing a vector and another vector
being moved doesn't occur because:

1. Racy Control accesses only occur when MSI-X is (guest) disabled

2. Hardware should only raise interrupts when MSI-X is enabled and unmasked.

3. Xen always sets Function Mask when temporarily enabling MSI-X.

But there may be other race conditions depending on hardware and guest
driver behaviour (e.g., disabling MSI-X when an IRQ has a pending move
on another PCPU).

Fix this by:

1. Tracking the host and guest enable state in a similar way to the
   host and guest maskall state. Note that since multiple CPUs can be
   updating different vectors concurrently, a counter is needed for
   the host enable state.

2. Add a new lock to serialize the Control read/modify/write
   sequence.

3. Wrap the above in two helper functions (msix_update_lock(), and
   msix_update_unlock()), which bracket any MSI-X register updates
   that require MSI-X to be (temporarily) enabled.
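
For illustration, a caller that needs to touch the MSI-X table then brackets
its accesses with the two helpers roughly as follows (sketch only, reusing the
dev/pos/entry/flag names from the surrounding patches):

    uint16_t control = msix_update_lock(dev, pos);    /* MSI-X on, masked */

    writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
    readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);

    msix_update_unlock(dev, pos, control);             /* restore prior state */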

Signed-off-by: David Vrabel 
SIM: https://t.corp.amazon.com/P63914633

CR: https://code.amazon.com/reviews/CR-79020945
---
 xen/arch/x86/include/asm/msi.h |   3 +
 xen/arch/x86/msi.c | 215 +
 xen/drivers/passthrough/msi.c  |   1 +
 3 files changed, 114 insertions(+), 105 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index fe670895ee..aa36e44f4e 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -237,7 +237,10 @@ struct arch_msix {
 int table_refcnt[MAX_MSIX_TABLE_PAGES];
 int table_idx[MAX_MSIX_TABLE_PAGES];
 spinlock_t table_lock;
+spinlock_t control_lock;
 bool host_maskall, guest_maskall;
+uint16_t host_enable;
+bool guest_enable;
 domid_t warned;
 };
 
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 6c675d11d1..8e394da07a 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -147,6 +147,57 @@ static bool msix_memory_decoded(const struct pci_dev *dev, 
unsigned int pos)
 return memory_decoded(dev);
 }
 
+
+/*
+ * Ensure MSI-X interrupts are masked during setup. Some devices require
+ * MSI-X to be enabled before we can touch the MSI-X registers. We need
+ * to mask all the vectors to prevent interrupts coming in before they're
+ * fully set up.
+ */
+static uint16_t msix_update_lock(struct pci_dev *dev, unsigned int pos)
+{
+uint16_t control, new_control;
+unsigned long flags;
+
+spin_lock_irqsave(&dev->msix->control_lock, flags);
+
+dev->msix->host_enable++;
+
+control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
+if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
+{
+new_control = control | PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL;
+pci_conf_write16(dev->sbdf, msix_control_reg(pos), new_control);
+}
+else
+dev->msix->guest_enable = true;
+
+spin_unlock_irqrestore(&dev->msix->control_lock, flags);
+
+return control;
+}
+
+static void msix_update_unlock(struct pci_dev *dev, unsigned int pos, uint16_t 
control)
+{
+uint16_t new_control;
+unsigned long flags;
+
+spin_lock_irqsave(&dev->msix->control_lock, flags);
+
+dev->msix->host_enable--;
+
+new_control = control & ~(PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
+
+if ( dev->msix->host_enable || dev->msix->guest_enable )
+new_control |= PCI_MSIX_FLAGS_ENABLE;
+if ( dev->msix->host_maskall || dev->msix->guest_maskall || 
dev->msix->host_enable )
+new_control |= PCI_MSIX_FLAGS_MASKALL;
+if ( new_control != control )
+pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
+
+spin_unlock_irqrestore(&dev->msix->control_lock, flags);
+}
+
 /*
  * MSI message composition
  */
@@ -288,7 +339,7 @@ static void msi_set_enable(struct pci_dev *dev, int enable)
 __msi_set_enable(seg, bus, slot, func, pos, enable);
 }
 
-static void msix_set_enable(struct pci_dev *dev, int enable)
+static void msix_force_disable(struct pci_dev *dev)
 {
 int pos;
 u16 control, seg = dev->seg;
@@ -299,11 +350,16 @@ stat

[PATCH 0/3] x86: Fix racy accesses to MSI-X Control register

2022-11-10 Thread David Vrabel
The main patch in this series is 3/3 with some preparatory patches to
simplify the implementation. To summarize:

Concurrent accesses to the MSI-X Control register are not serialized
with a suitable lock. For example, accesses in msix_capability_init()
use pcidevs_lock() but some calls to msi_set_mask_bit() use the
interrupt descriptor lock.

This can lead to MSI-X being incorrectly disabled and subsequent
failures due to msix_memory_decoded() calls that check for MSI-X being
enabled.

David





[PATCH 1/3] x86/msi: consistently handle BAR mapping failures in MSI-X setup

2022-11-10 Thread David Vrabel
When setting up an MSI-X vector in msix_capability_init() the error
handling after a BAR mapping failure is different depending on whether
the first page fails or a subsequent page. There's no reason to break
working vectors so consistently use the later error handling
behaviour.

The zap_on_error flag was added as part of XSA-337, beb54596cfda
(x86/MSI-X: restrict reading of table/PBA bases from BARs), but
appears to be unrelated to XSA-337 and is not useful because:

1. table.first and pba.first are not used unless msix->used_vectors > 0.

2. Force disabling MSI-X in this error path is not necessary as the
   per-vector mask is still set.

Signed-off-by: David Vrabel 

CR: https://code.amazon.com/reviews/CR-79020908
---
 xen/arch/x86/msi.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..8bde6b9be1 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -776,7 +776,7 @@ static int msix_capability_init(struct pci_dev *dev,
 u8 bus = dev->bus;
 u8 slot = PCI_SLOT(dev->devfn);
 u8 func = PCI_FUNC(dev->devfn);
-bool maskall = msix->host_maskall, zap_on_error = false;
+bool maskall = msix->host_maskall;
 unsigned int pos = pci_find_cap_offset(seg, bus, slot, func,
PCI_CAP_ID_MSIX);
 
@@ -875,8 +875,6 @@ static int msix_capability_init(struct pci_dev *dev,
   BITS_TO_LONGS(msix->nr_entries) - 1);
 WARN_ON(rangeset_overlaps_range(mmio_ro_ranges, msix->pba.first,
 msix->pba.last));
-
-zap_on_error = true;
 }
 else if ( !msix->table.first )
 {
@@ -897,14 +895,6 @@ static int msix_capability_init(struct pci_dev *dev,
 
 if ( idx < 0 )
 {
-if ( zap_on_error )
-{
-msix->table.first = 0;
-msix->pba.first = 0;
-
-control &= ~PCI_MSIX_FLAGS_ENABLE;
-}
-
 pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
 xfree(entry);
 return idx;
-- 
2.30.2




[PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit()

2022-11-10 Thread David Vrabel
The return value was only used for WARN()s or BUG()s so it has no
functional purpose. Simplify the code by removing it.

The meaning of the return value and the purpose of the various WARNs()
and BUGs() is rather unclear. The only failure path (where an MSI-X
vector needs to be masked but the MSI-X table is not accessible) has a
useful warning message.

Signed-off-by: David Vrabel 

CR: https://code.amazon.com/reviews/CR-79020927
---
 xen/arch/x86/msi.c | 34 +-
 1 file changed, 9 insertions(+), 25 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 8bde6b9be1..6c675d11d1 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -314,7 +314,7 @@ int msi_maskable_irq(const struct msi_desc *entry)
|| entry->msi_attrib.maskbit;
 }
 
-static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
+static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 {
 struct msi_desc *entry = desc->msi_desc;
 struct pci_dev *pdev;
@@ -361,11 +361,6 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool 
host, bool guest)
 
 if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
 break;
-
-entry->msi_attrib.host_masked = host;
-entry->msi_attrib.guest_masked = guest;
-
-flag = true;
 }
 else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
 {
@@ -385,14 +380,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool 
host, bool guest)
 control |= PCI_MSIX_FLAGS_MASKALL;
 pci_conf_write16(pdev->sbdf,
  msix_control_reg(entry->msi_attrib.pos), control);
-return flag;
-default:
-return 0;
+break;
 }
 entry->msi_attrib.host_masked = host;
 entry->msi_attrib.guest_masked = guest;
-
-return 1;
 }
 
 static int msi_get_mask_bit(const struct msi_desc *entry)
@@ -418,16 +409,12 @@ static int msi_get_mask_bit(const struct msi_desc *entry)
 
 void cf_check mask_msi_irq(struct irq_desc *desc)
 {
-if ( unlikely(!msi_set_mask_bit(desc, 1,
-desc->msi_desc->msi_attrib.guest_masked)) )
-BUG_ON(!(desc->status & IRQ_DISABLED));
+msi_set_mask_bit(desc, 1, desc->msi_desc->msi_attrib.guest_masked);
 }
 
 void cf_check unmask_msi_irq(struct irq_desc *desc)
 {
-if ( unlikely(!msi_set_mask_bit(desc, 0,
-desc->msi_desc->msi_attrib.guest_masked)) )
-WARN();
+msi_set_mask_bit(desc, 0, desc->msi_desc->msi_attrib.guest_masked);
 }
 
 void guest_mask_msi_irq(struct irq_desc *desc, bool mask)
@@ -437,15 +424,13 @@ void guest_mask_msi_irq(struct irq_desc *desc, bool mask)
 
 static unsigned int cf_check startup_msi_irq(struct irq_desc *desc)
 {
-if ( unlikely(!msi_set_mask_bit(desc, 0, !!(desc->status & IRQ_GUEST))) )
-WARN();
+msi_set_mask_bit(desc, 0, !!(desc->status & IRQ_GUEST));
 return 0;
 }
 
 static void cf_check shutdown_msi_irq(struct irq_desc *desc)
 {
-if ( unlikely(!msi_set_mask_bit(desc, 1, 1)) )
-BUG_ON(!(desc->status & IRQ_DISABLED));
+msi_set_mask_bit(desc, 1, 1);
 }
 
 void cf_check ack_nonmaskable_msi_irq(struct irq_desc *desc)
@@ -1358,10 +1343,9 @@ int pci_restore_msi_state(struct pci_dev *pdev)
 
 for ( i = 0; ; )
 {
-if ( unlikely(!msi_set_mask_bit(desc,
-entry[i].msi_attrib.host_masked,
-entry[i].msi_attrib.guest_masked)) 
)
-BUG();
+msi_set_mask_bit(desc,
+ entry[i].msi_attrib.host_masked,
+ entry[i].msi_attrib.guest_masked);
 
 if ( !--nr )
 break;
-- 
2.30.2




Re: Regression with CET: [PATCH v1] x86/mm: avoid inadvertently degrading a TLB flush to local only

2022-04-27 Thread David Vrabel




On 27/04/2022 19:03, Andrew Cooper wrote:

On 19/04/2022 16:03, David Vrabel wrote:

From: David Vrabel 

If the direct map is incorrectly modified with interrupts disabled,
the required TLB flushes are degraded to flushing the local CPU only.

This could lead to very hard to diagnose problems as different CPUs will
end up with different views of memory. Although, no such issues have yet
been identified.

Change the check in the flush_area() macro to look at system_state
instead. This defers the switch from local to all later in the boot
(see xen/arch/x86/setup.c:__start_xen()). This is fine because
additional PCPUs are not brought up until after the system state is
SYS_STATE_smp_boot.

Signed-off-by: David Vrabel 


This explodes on CET systems:

(XEN) Assertion 'local_irq_is_enabled()' failed at arch/x86/smp.c:265
(XEN) [ Xen-4.17.0-10.24-d  x86_64  debug=y  Not tainted ]
(XEN) CPU:    0
(XEN) RIP:    e008:[] flush_area_mask+0x40/0x13e

(XEN) Xen call trace:
(XEN)    [] R flush_area_mask+0x40/0x13e
(XEN)    [] F modify_xen_mappings+0xc5/0x958
(XEN)    [] F
arch/x86/alternative.c#_alternative_instructions+0xb7/0xb9
(XEN)    [] F alternative_branches+0xf/0x12
(XEN)    [] F __start_xen+0x1ef4/0x2776
(XEN)    [] F __high_start+0x94/0xa0
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) Assertion 'local_irq_is_enabled()' failed at arch/x86/smp.c:265
(XEN) 
(XEN)

We really did want a local-only flush here, because we specifically
intended to make self-modifying changes before bringing secondary CPUs up.


I think the transition to SYS_STATE_smp_boot system state should be 
later, i.e., the last point where only 1 PCPU is guaranteed.


David



Re: [PATCH v4] page_alloc: assert IRQs are enabled in heap alloc/free

2022-04-26 Thread David Vrabel




On 26/04/2022 15:14, Julien Grall wrote:

Hi,

On 26/04/2022 15:01, Jan Beulich wrote:

On 25.04.2022 15:28, David Vrabel wrote:

--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,13 @@
  static char __initdata opt_badpage[100] = "";
  string_param("badpage", opt_badpage);
+/*
+ * Heap allocations may need TLB flushes which require IRQs to be
+ * enabled (except when only 1 PCPU is online).
+ */
+#define ASSERT_ALLOC_CONTEXT() \
+    ASSERT(!in_irq() && (local_irq_is_enabled() || num_online_cpus() 
<= 1))


At least one of these tightened assertions triggers on Arm, as per the
most recent smoke flight. I'm going to revert this for the time being.


 From the serial console [1]:

(XEN) Xen call trace:
(XEN)    [<0022a510>] alloc_xenheap_pages+0x120/0x150 (PC)
(XEN)    [<>]  (LR)
(XEN)    [<002736ac>] arch/arm/mm.c#xen_pt_update+0x144/0x6e4
(XEN)    [<002740d4>] map_pages_to_xen+0x10/0x20
(XEN)    [<00236864>] __vmap+0x400/0x4a4
(XEN)    [<0026aee8>] 
arch/arm/alternative.c#__apply_alternatives_multi_stop+0x144/0x1ec

(XEN)    [<0022fe40>] stop_machine_run+0x23c/0x300


An allocation inside a stop_machine_run() action function. That is what 
the asserts were designed to catch.


I did try to run the GitLab CI pipelines but it is set up to use runners 
that are only available to the Xen Project group, so forking the repo 
doesn't work.


Can my (personal) GitLab be added as a Developer to the Xen Project 
group? I think this is the intended way for people to run the CI 
pipelines on their own branches.


David



Re: [PATCH v4] page_alloc: assert IRQs are enabled in heap alloc/free

2022-04-25 Thread David Vrabel




On 25/04/2022 14:43, Julien Grall wrote:

Hi Jan,

On 25/04/2022 14:37, Jan Beulich wrote:

On 25.04.2022 15:34, Julien Grall wrote:

On 25/04/2022 14:28, David Vrabel wrote:

--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,13 @@
   static char __initdata opt_badpage[100] = "";
   string_param("badpage", opt_badpage);
+/*
+ * Heap allocations may need TLB flushes which require IRQs to be


The comment needs to be updated to reflect the fact that at least Arm
doesn't use IPI to flush TLBs.


I thought the use of "may" was satisfying your earlier request?


Maybe I read this comment wrongly... To me, anything after 'which' is 
optional (it is a non-defining clause) and describes how the TLB flushes 
work. So the 'may' here is referring to the possibility of using a TLB flush.


Oh dear, you're using formal grammar with a native English speaker who's 
never seen a grammar rule in any of his schooling.


I think this should be:

"Heap allocations may need TLB flushes that require IRQs..."

i.e., "that" instead of "which"

David



[PATCH v4] page_alloc: assert IRQs are enabled in heap alloc/free

2022-04-25 Thread David Vrabel
From: David Vrabel 

Heap pages can only be safely allocated and freed with interrupts
enabled as they may require a TLB flush which will send IPIs (on x86).

Normally spinlock debugging would catch calls from the incorrect
context, but not from stop_machine_run() action functions as these are
called with spin lock debugging disabled.

Enhance the assertions in alloc_xenheap_pages() and
alloc_domheap_pages() to check interrupts are enabled. For consistency
the same asserts are used when freeing heap pages.

As an exception, when only 1 PCPU is online, allocations are permitted
with interrupts disabled as any TLB flushes would be local only. This
is necessary during early boot.

Signed-off-by: David Vrabel 
---
Changes in v4:
- Tweak comment.

Changes in v3:
- Use num_online_cpus() in assert.

Changes in v2:
- Set SYS_STATE_smp_boot on arm.
---
 xen/common/page_alloc.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 319029140f..739ca6e74b 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,13 @@
 static char __initdata opt_badpage[100] = "";
 string_param("badpage", opt_badpage);
 
+/*
+ * Heap allocations may need TLB flushes which require IRQs to be
+ * enabled (except when only 1 PCPU is online).
+ */
+#define ASSERT_ALLOC_CONTEXT() \
+ASSERT(!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1))
+
 /*
  * no-bootscrub -> Free pages are not zeroed during boot.
  */
@@ -2160,7 +2167,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 {
 struct page_info *pg;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 pg = alloc_heap_pages(MEMZONE_XEN, MEMZONE_XEN,
   order, memflags | MEMF_no_scrub, NULL);
@@ -2173,7 +2180,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
@@ -2202,7 +2209,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 struct page_info *pg;
 unsigned int i;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits )
 memflags &= ~MEMF_bits(~0U);
@@ -2224,7 +2231,7 @@ void free_xenheap_pages(void *v, unsigned int order)
 struct page_info *pg;
 unsigned int i;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
@@ -2249,7 +2256,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe)
 {
 mfn_t smfn, emfn;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 smfn = maddr_to_mfn(round_pgup(ps));
 emfn = maddr_to_mfn(round_pgdown(pe));
@@ -2369,7 +2376,7 @@ struct page_info *alloc_domheap_pages(
 unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1;
 unsigned int dma_zone;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d,
   bits ? : (BITS_PER_LONG+PAGE_SHIFT));
@@ -2419,7 +2426,7 @@ void free_domheap_pages(struct page_info *pg, unsigned 
int order)
 unsigned int i;
 bool drop_dom_ref;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( unlikely(is_xen_heap_page(pg)) )
 {
@@ -2738,7 +2745,7 @@ int __init acquire_domstatic_pages(struct domain *d, 
mfn_t smfn,
 {
 struct page_info *pg;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 pg = acquire_staticmem_pages(smfn, nr_mfns, memflags);
 if ( !pg )
-- 
2.30.2




[PATCH v3] page_alloc: assert IRQs are enabled in heap alloc/free

2022-04-22 Thread David Vrabel
From: David Vrabel 

Heap pages can only be safely allocated and freed with interrupts
enabled as they may require a TLB flush which will send IPIs.

Normally spinlock debugging would catch calls from the incorrect
context, but not from stop_machine_run() action functions as these are
called with spin lock debugging disabled.

Enhance the assertions in alloc_xenheap_pages() and
alloc_domheap_pages() to check interrupts are enabled. For consistency
the same asserts are used when freeing heap pages.

As an exception, when only 1 PCPU is online, allocations are permitted
with interrupts disabled as any TLB flushes would be local only. This
is necessary during early boot.

Signed-off-by: David Vrabel 
---
Changes in v3:
- Use num_online_cpus() in assert.

Changes in v2:
- Set SYS_STATE_smp_boot on arm.
---
 xen/common/page_alloc.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 319029140f..516ffa2a97 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,13 @@
 static char __initdata opt_badpage[100] = "";
 string_param("badpage", opt_badpage);
 
+/*
+ * Heap allocations may need TLB flushes which require IRQs to be
+ * enabled (except during early boot when only 1 PCPU is online).
+ */
+#define ASSERT_ALLOC_CONTEXT() \
+ASSERT(!in_irq() && (local_irq_is_enabled() || num_online_cpus() == 1))
+
 /*
  * no-bootscrub -> Free pages are not zeroed during boot.
  */
@@ -2160,7 +2167,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 {
 struct page_info *pg;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 pg = alloc_heap_pages(MEMZONE_XEN, MEMZONE_XEN,
   order, memflags | MEMF_no_scrub, NULL);
@@ -2173,7 +2180,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
@@ -2202,7 +2209,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 struct page_info *pg;
 unsigned int i;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits )
 memflags &= ~MEMF_bits(~0U);
@@ -2224,7 +2231,7 @@ void free_xenheap_pages(void *v, unsigned int order)
 struct page_info *pg;
 unsigned int i;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
@@ -2249,7 +2256,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe)
 {
 mfn_t smfn, emfn;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 smfn = maddr_to_mfn(round_pgup(ps));
 emfn = maddr_to_mfn(round_pgdown(pe));
@@ -2369,7 +2376,7 @@ struct page_info *alloc_domheap_pages(
 unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1;
 unsigned int dma_zone;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d,
   bits ? : (BITS_PER_LONG+PAGE_SHIFT));
@@ -2419,7 +2426,7 @@ void free_domheap_pages(struct page_info *pg, unsigned 
int order)
 unsigned int i;
 bool drop_dom_ref;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( unlikely(is_xen_heap_page(pg)) )
 {
@@ -2738,7 +2745,7 @@ int __init acquire_domstatic_pages(struct domain *d, 
mfn_t smfn,
 {
 struct page_info *pg;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 pg = acquire_staticmem_pages(smfn, nr_mfns, memflags);
 if ( !pg )
-- 
2.30.2




Re: [PATCH v2] page_alloc: assert IRQs are enabled in heap alloc/free

2022-04-21 Thread David Vrabel




On 21/04/2022 12:38, Jan Beulich wrote:

On 21.04.2022 12:43, David Vrabel wrote:

--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -984,6 +984,8 @@ void __init start_xen(unsigned long boot_phys_offset,
  
  console_init_postirq();
  
+system_state = SYS_STATE_smp_boot;

+
  do_presmp_initcalls();
  
  for_each_present_cpu ( i )


I'm afraid it's not this simple: There are two
"ASSERT(system_state != SYS_STATE_boot)" in Arm-specific code. While
both could in principle be left as is, I think both want modifying to
">= SYS_STATE_active", such that they would also trigger when in this
newly set state (in case registration of the notifiers was altered).


These asserts are already too-relaxed given that there's an early_boot 
state.



It also wants at least mentioning that setting this state is okay with
all uses of system_state in common code (where it's not impossible
that x86-isms still exist, having gone unnoticed so far), just to
indicate that all of these were actually inspected (there's just one
where it looks to be unobvious when simply looking at grep output, the
one in keyhandler.c). As a result this may want to be a separate,
prereq patch. At which point it will want considering whether to put
the setting of the state _in_ do_presmp_initcalls() instead of ahead
of its invocation.


Not sure I understand this comment. The transition to the smp_boot state 
on arm makes the state machine on x86 and arm look _more_ alike, thus 
common code should be happier.



--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,14 @@
  static char __initdata opt_badpage[100] = "";
  string_param("badpage", opt_badpage);
  
+/*

+ * Heap allocations may need TLB flushes which require IRQs to be
+ * enabled (except during early boot when only 1 PCPU is online).
+ */
+#define ASSERT_ALLOC_CONTEXT()  \
+ASSERT(!in_irq() && (local_irq_is_enabled() \
+ || system_state < SYS_STATE_smp_boot))


Upon further consideration: In principle IRQs would be okay to be off
whenever we're in UP mode (and hence flush IPIs don't need sending).
Provided of course spin debug is off as well and no other IRQs-on
checks get in the way (like that in flush_area_mask()). This might be
more robust overall than depending on system_state, but I'm not going
to exclude there may also be arguments against doing so.


Not sure I understand what you're suggesting here. Do you mean something 
like this?


#define ASSERT_ALLOC_CONTEXT() \
ASSERT(!in_irq() && (local_irq_is_enabled()\
 || nr_online_cpus == 1))


In any event, looking back at my v1 comment, it would have been nice
if the spinlock related aspect was at least also mentioned here, even
if - as you did say in reply - the uses of the new macro aren't fully
redundant with check_lock().

Also, nit: The || belongs on the earlier line as per our coding style.


CODING_STYLE says: "Long lines should be split at sensible places and 
the trailing portions indented."


If you're going to have rules (that have, IMO[1], worse readability) 
please document them.


David

[1] Compare

a = b
+ dksaldksa_daskldsa_dsakdlsad
+ hds
+ dsadjka_jdaksjdk_daskajd;

and

a = b +
dksaldksa_daskldsa_dsakdlsad +
hds +
dsadjka_jdaksjdk_daskajd;

Which one is more clearly readable as a sum?



[PATCH v2] page_alloc: assert IRQs are enabled in heap alloc/free

2022-04-21 Thread David Vrabel
From: David Vrabel 

Heap pages can only be safely allocated and freed with interrupts
enabled as they may require a TLB flush which will send IPIs.

Normally spinlock debugging would catch calls from the incorrect
context, but not from stop_machine_run() action functions as these are
called with spin lock debugging disabled.

Enhance the assertions in alloc_xenheap_pages() and
alloc_domheap_pages() to check interrupts are enabled. For consistency
the same asserts are used when freeing heap pages.

As an exception, during early boot when only 1 PCPU is online,
allocations are permitted with interrupts disabled. This required
setting the SYS_STATE_smp_boot system state on arm, to match x86.

Signed-off-by: David Vrabel 
---
Changes in v2:
- Set SYS_STATE_smp_boot on arm.
---
 xen/arch/arm/setup.c|  2 ++
 xen/common/page_alloc.c | 24 
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index d5d0792ed4..44d45f1449 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -984,6 +984,8 @@ void __init start_xen(unsigned long boot_phys_offset,
 
 console_init_postirq();
 
+system_state = SYS_STATE_smp_boot;
+
 do_presmp_initcalls();
 
 for_each_present_cpu ( i )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 319029140f..e1ce38df13 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,14 @@
 static char __initdata opt_badpage[100] = "";
 string_param("badpage", opt_badpage);
 
+/*
+ * Heap allocations may need TLB flushes which require IRQs to be
+ * enabled (except during early boot when only 1 PCPU is online).
+ */
+#define ASSERT_ALLOC_CONTEXT()  \
+ASSERT(!in_irq() && (local_irq_is_enabled() \
+ || system_state < SYS_STATE_smp_boot))
+
 /*
  * no-bootscrub -> Free pages are not zeroed during boot.
  */
@@ -2160,7 +2168,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 {
 struct page_info *pg;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 pg = alloc_heap_pages(MEMZONE_XEN, MEMZONE_XEN,
   order, memflags | MEMF_no_scrub, NULL);
@@ -2173,7 +2181,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
@@ -2202,7 +2210,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 struct page_info *pg;
 unsigned int i;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits )
 memflags &= ~MEMF_bits(~0U);
@@ -2224,7 +2232,7 @@ void free_xenheap_pages(void *v, unsigned int order)
 struct page_info *pg;
 unsigned int i;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
@@ -2249,7 +2257,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe)
 {
 mfn_t smfn, emfn;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 smfn = maddr_to_mfn(round_pgup(ps));
 emfn = maddr_to_mfn(round_pgdown(pe));
@@ -2369,7 +2377,7 @@ struct page_info *alloc_domheap_pages(
 unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1;
 unsigned int dma_zone;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d,
   bits ? : (BITS_PER_LONG+PAGE_SHIFT));
@@ -2419,7 +2427,7 @@ void free_domheap_pages(struct page_info *pg, unsigned 
int order)
 unsigned int i;
 bool drop_dom_ref;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( unlikely(is_xen_heap_page(pg)) )
 {
@@ -2738,7 +2746,7 @@ int __init acquire_domstatic_pages(struct domain *d, 
mfn_t smfn,
 {
 struct page_info *pg;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 pg = acquire_staticmem_pages(smfn, nr_mfns, memflags);
 if ( !pg )
-- 
2.30.2




Re: [PATCH v1] page_alloc: assert IRQs are enabled in heap alloc/free

2022-04-20 Thread David Vrabel




On 20/04/2022 07:26, Jan Beulich wrote:

On 19.04.2022 17:01, David Vrabel wrote:

From: David Vrabel 

Heap pages can only be safely allocated and freed with interrupts
enabled as they may require a TLB flush which will send IPIs.

Enhance the assertions in alloc_xenheap_pages() and
alloc_domheap_pages() to check interrupts are enabled. For consistency
the same asserts are used when freeing heap pages.

As an exception, during early boot when only 1 PCPU is online,
allocations are permitted with interrupts disabled.


This exception is tightly coupled with spin lock checking, i.e. the
point in time when spin_debug_enable() is called. I think this wants
making explicit at least in the code comment, but as a result I also
wonder in how far the extended assertions are really worthwhile: Any
violation would be detected in check_lock() anyway. Thoughts?


I was caught out by stop_machine_run() disabling both interrupts and 
spin lock debugging when running the action function, so check_lock() 
didn't help in this (admittedly) narrow use case.



Furthermore I'm concerned of Arm not using either SYS_STATE_smp_boot
or spin_debug_enable().


David



[PATCH v1] x86/mm: avoid inadvertently degrading a TLB flush to local only

2022-04-19 Thread David Vrabel
From: David Vrabel 

If the direct map is incorrectly modified with interrupts disabled,
the required TLB flushes are degraded to flushing the local CPU only.

This could lead to very hard to diagnose problems as different CPUs will
end up with different views of memory. Although, no such issues have yet
been identified.

Change the check in the flush_area() macro to look at system_state
instead. This defers the switch from local to all later in the boot
(see xen/arch/x86/setup.c:__start_xen()). This is fine because
additional PCPUs are not brought up until after the system state is
SYS_STATE_smp_boot.

Signed-off-by: David Vrabel 
---
 xen/arch/x86/mm.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c271e383b5..72dbce43b1 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5071,11 +5071,10 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
 #define lNf_to_l1f(f) (((f) & _PAGE_PRESENT) ? ((f) & ~_PAGE_PSE) : (f))
 
 /*
- * map_pages_to_xen() can be called with interrupts disabled during
- * early bootstrap. In this case it is safe to use flush_area_local()
- * and avoid locking because only the local CPU is online.
+ * map_pages_to_xen() can be called early in boot before any other
+ * CPUs are online. Use flush_area_local() in this case.
  */
-#define flush_area(v,f) (!local_irq_is_enabled() ?  \
+#define flush_area(v,f) (system_state < SYS_STATE_smp_boot ?\
  flush_area_local((const void *)v, f) : \
  flush_area_all((const void *)v, f))
 
-- 
2.30.2




[PATCH v1] page_alloc: assert IRQs are enabled in heap alloc/free

2022-04-19 Thread David Vrabel
From: David Vrabel 

Heap pages can only be safely allocated and freed with interrupts
enabled as they may require a TLB flush which will send IPIs.

Enhance the assertions in alloc_xenheap_pages() and
alloc_domheap_pages() to check interrupts are enabled. For consistency
the same asserts are used when freeing heap pages.

As an exception, during early boot when only 1 PCPU is online,
allocations are permitted with interrupts disabled.

Signed-off-by: David Vrabel 
---
 xen/common/page_alloc.c | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 319029140f..e1ce38df13 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -162,6 +162,14 @@
 static char __initdata opt_badpage[100] = "";
 string_param("badpage", opt_badpage);
 
+/*
+ * Heap allocations may need TLB flushes which require IRQs to be
+ * enabled (except during early boot when only 1 PCPU is online).
+ */
+#define ASSERT_ALLOC_CONTEXT()  \
+ASSERT(!in_irq() && (local_irq_is_enabled() \
+ || system_state < SYS_STATE_smp_boot))
+
 /*
  * no-bootscrub -> Free pages are not zeroed during boot.
  */
@@ -2160,7 +2168,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 {
 struct page_info *pg;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 pg = alloc_heap_pages(MEMZONE_XEN, MEMZONE_XEN,
   order, memflags | MEMF_no_scrub, NULL);
@@ -2173,7 +2181,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
@@ -2202,7 +2210,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
int memflags)
 struct page_info *pg;
 unsigned int i;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits )
 memflags &= ~MEMF_bits(~0U);
@@ -2224,7 +2232,7 @@ void free_xenheap_pages(void *v, unsigned int order)
 struct page_info *pg;
 unsigned int i;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( v == NULL )
 return;
@@ -2249,7 +2257,7 @@ void init_domheap_pages(paddr_t ps, paddr_t pe)
 {
 mfn_t smfn, emfn;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 smfn = maddr_to_mfn(round_pgup(ps));
 emfn = maddr_to_mfn(round_pgdown(pe));
@@ -2369,7 +2377,7 @@ struct page_info *alloc_domheap_pages(
 unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1;
 unsigned int dma_zone;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d,
   bits ? : (BITS_PER_LONG+PAGE_SHIFT));
@@ -2419,7 +2427,7 @@ void free_domheap_pages(struct page_info *pg, unsigned 
int order)
 unsigned int i;
 bool drop_dom_ref;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 if ( unlikely(is_xen_heap_page(pg)) )
 {
@@ -2738,7 +2746,7 @@ int __init acquire_domstatic_pages(struct domain *d, 
mfn_t smfn,
 {
 struct page_info *pg;
 
-ASSERT(!in_irq());
+ASSERT_ALLOC_CONTEXT();
 
 pg = acquire_staticmem_pages(smfn, nr_mfns, memflags);
 if ( !pg )
-- 
2.30.2




Re: [PATCH] xen/evtchn: Add design for static event channel signaling for domUs..

2022-03-24 Thread David Vrabel




On 23/03/2022 15:43, Rahul Singh wrote:

in dom0less system. This patch introduce the new feature to support the
signaling between two domUs in dom0less system.

Signed-off-by: Rahul Singh 
---
  docs/designs/dom0less-evtchn.md | 96 +
  1 file changed, 96 insertions(+)
  create mode 100644 docs/designs/dom0less-evtchn.md

diff --git a/docs/designs/dom0less-evtchn.md b/docs/designs/dom0less-evtchn.md
new file mode 100644
index 00..6a1b7e8c22
--- /dev/null
+++ b/docs/designs/dom0less-evtchn.md
@@ -0,0 +1,96 @@
+# Signaling support between two domUs on dom0less system
+
+## Current state: Draft version
+
+## Proposer(s): Rahul Singh, Bertrand Marquis
+
+## Problem Statement:
+
+The goal of this work is to define a simple signaling system between Xen guests
+in dom0less systems.
+
+In dom0less system, we cannot make use of xenbus and xenstore that are used in
+normal systems with dynamic VMs to communicate between domains by providing a
+bus abstraction for paravirtualized drivers.
+
+One possible solution to implement the signaling system between domUs is based
+on event channels.


This problem statement could do with some example use cases that are 
usefully solved by this proposed solution.


"We don't have xenstore so can't set up shared rings, but here's a 
replacement comms mechanism that can do a single bit." Doesn't seem very 
compelling to me.



+chosen {
+
+
+domU1: domU1 {
+..
+};
+
+domU2: domU2 {
+..
+};
+
+evtchn@1 {
+compatible = "xen,evtchn";
+xen,evtchn = <0xa  0xb >;
+};
+
+evtchn@2 {
+compatible = "xen,evtchn";
+xen,evtchn = <0xc  0xd >;
+};


How is the domain supposed to know what these event channels are for?

I'm not that familiar with device tree. Is it possible to give these 
entries names?


David



Re: [XEN PATCH] evtchn/fifo: Don't set PENDING bit if guest misbehaves

2022-03-21 Thread David Vrabel

On 16/03/2022 18:38, Raphael Ning wrote:

Currently, evtchn_fifo_set_pending() will mark the event as PENDING even
if it fails to lock the FIFO event queue(s), or if the guest has not
initialized the FIFO control block for the target vCPU. A well-behaved
guest should never trigger either of these cases.


Reviewed-by: David Vrabel 

David



Re: [XEN PATCH] evtchn/fifo: Don't set PENDING bit if guest misbehaves

2022-03-17 Thread David Vrabel




On 17/03/2022 06:28, Juergen Gross wrote:

On 16.03.22 19:38, Raphael Ning wrote:

From: Raphael Ning 

Currently, evtchn_fifo_set_pending() will mark the event as PENDING even
if it fails to lock the FIFO event queue(s), or if the guest has not
initialized the FIFO control block for the target vCPU. A well-behaved
guest should never trigger either of these cases.


Is this true even for the resume case e.g. after a migration?

The guest starts on the new host with no FIFO control block for any
vcpu registered, so couldn't an event get lost with your patch in case
it was sent before the target vcpu's control block gets registered?


An event that is PENDING but not LINKED is not reachable by the guest so 
it won't ever see such an event, so the event is lost whether the P bit 
is set or not.


Guests ensure that event channels are not bound to VCPUs that don't 
(yet) have FIFO control blocks.


For example, in Linux xen_irq_resume() reinitializes the control blocks 
(in xen_evtchn_resume()) before restoring any of the event channels.


David



Re: [PATCH] x86/kexec: Fix kexec-reboot with CET active

2022-03-07 Thread David Vrabel

On 07/03/2022 20:53, Andrew Cooper wrote:

The kexec_reloc() asm has an indirect jump to relocate onto the identity
trampoline.  While we clear CET in machine_crash_shutdown(), we fail to clear
CET for the non-crash path.  This in turn highlights that the same is true of
resetting the CPUID masking/faulting.

Move both pieces of logic from machine_crash_shutdown() to machine_kexec(),
the latter being common for all kexec transitions.  Adjust the condition for
CET being considered active to check in CR4, which is simpler and more robust.


Reviewed-by: David Vrabel 


Fixes: 311434bfc9d1 ("x86/setup: Rework MSR_S_CET handling for CET-IBT")
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Fixes: 5ab9564c6fa1 ("x86/cpu: Context switch cpuid masks and faulting state in 
context_switch()")
Reported-by: David Vrabel (XXX which alias to use?)


Amazon, please.

David



CET-IBT and kexec?

2022-03-07 Thread David Vrabel
kexec_reloc (see xen/arch/x86/x86_64/kexec_reloc.S) has an indirect 
branch as part of switching page tables. I understand that if CET-IBT is 
enabled this will raise an exception since there's no ENDBR64 
instruction and (as far as I could tell) CET-IBT has not been disabled 
in machine_kexec() prior to calling kexec_reloc().


Have I correctly spotted an issue, and if so, would the correct fix be 
to disable CET-IBT in machine_kexec()?


I guess this would also be an issue if kexec'ing to a image without 
CET-IBT support.


David



Re: [PATCH 2/2] xen/x86: Livepatch: support patching CET-enhanced functions

2022-03-07 Thread David Vrabel

On 07/03/2022 14:03, Jan Beulich wrote:

On 07.03.2022 12:53, Bjoern Doebel wrote:

@@ -104,18 +122,36 @@ void noinline arch_livepatch_revive(void)
  
  int arch_livepatch_verify_func(const struct livepatch_func *func)

  {
+BUILD_BUG_ON(sizeof(struct x86_livepatch_meta) != LIVEPATCH_OPAQUE_SIZE);
+
  /* If NOPing.. */
  if ( !func->new_addr )
  {
+struct x86_livepatch_meta *lp;
+
+lp = (struct x86_livepatch_meta *)func->opaque;
  /* Only do up to maximum amount we can put in the ->opaque. */
-if ( func->new_size > sizeof(func->opaque) )
+if ( func->new_size > sizeof(lp->instruction) )
  return -EOPNOTSUPP;
  
  if ( func->old_size < func->new_size )

  return -EINVAL;
  }


I continue to be concerned of the new local variable causing compiler
warnings. With the adjustment made compared to v1, the specific
warning would have changed, and we're now liable to see set-but-never-
used ones.


Linux has a sizeof_field() macro for this sort of use.

/**
 * sizeof_field() - Report the size of a struct field in bytes
 *
 * @TYPE: The structure containing the field of interest
 * @MEMBER: The field to return the size of
 */
#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
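
For the hunk above that would avoid the local variable entirely, e.g.:

    if ( func->new_size > sizeof_field(struct x86_livepatch_meta, instruction) )
        return -EOPNOTSUPP;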

David



Re: [PATCH v2 21/70] xen/evtchn: CFI hardening

2022-02-14 Thread David Vrabel

On 14/02/2022 12:50, Andrew Cooper wrote:

Control Flow Integrity schemes use toolchain and optionally hardware support
to help protect against call/jump/return oriented programming attacks.

Use cf_check to annotate function pointer targets for the toolchain.

[...]

-static void evtchn_2l_set_pending(struct vcpu *v, struct evtchn *evtchn)
+static void cf_check evtchn_2l_set_pending(
+struct vcpu *v, struct evtchn *evtchn)


Why manually annotate functions instead of getting the compiler to 
automatically work it out?


David



Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy

2022-01-18 Thread David Vrabel




On 18/01/2022 08:50, Jan Beulich wrote:

On 13.01.2022 14:41, Jan Beulich wrote:

Calibration logic assumes that the platform timer (HPET or ACPI PM
timer) and the TSC are read at about the same time. This assumption may
not hold when a long latency event (e.g. SMI or NMI) occurs between the
two reads. Reduce the risk of reading uncorrelated values by doing at
least four pairs of reads, using the tuple where the delta between the
enclosing TSC reads was smallest. From the fourth iteration onwards bail
if the new TSC delta isn't better (smaller) than the best earlier one.

Signed-off-by: Jan Beulich 


When running virtualized, scheduling in the host would also constitute
long latency events. I wonder whether, to compensate for that, we'd want
more than 3 "base" iterations, as I would expect scheduling events to
occur more frequently than e.g. SMI (and with a higher probability of
multiple ones occurring in close succession).


Should Xen be continually or periodically recalibrating? Rather than 
trying to get it perfect at the start of day?


You may also be able to find inspiration from the design or 
implementation of the Precision Time Protocol which has to similarly 
filter out outliers due to transmission delays.
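
For reference, the minimum-delta pair selection described above boils down to
something like this (illustrative sketch only, not the actual patch;
read_platform_timer() stands in for the HPET/ACPI PM timer read and the
iteration cap is arbitrary):

    uint64_t best_delta = ~0ULL, best_tsc = 0, best_pt = 0;
    unsigned int i;

    for ( i = 0; i < 16; i++ )
    {
        uint64_t tsc1 = rdtsc_ordered();
        uint64_t pt = read_platform_timer();
        uint64_t tsc2 = rdtsc_ordered();

        if ( tsc2 - tsc1 < best_delta )
        {
            /* Best-correlated (TSC, platform timer) pair so far. */
            best_delta = tsc2 - tsc1;
            best_tsc = tsc1;
            best_pt = pt;
        }
        else if ( i >= 3 )
            break;    /* four or more pairs taken and no improvement: bail */
    }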


David



Re: [PATCH] x86/hvm: reserve another HVM context save record ID for Amazon

2022-01-14 Thread David Vrabel




On 14/01/2022 07:08, Jan Beulich wrote:

On 07.01.2022 13:55, David Vrabel wrote:

Amazon's guest transparent live migration work needs another save
record (for event channel upcall vectors). Reserve another HVM context
save record ID for this.


I have to admit that I have reservations: I didn't really like seeing
the original range getting reserved. Even less so I'd like to see
extensions / further such reservations. The more that iirc the
original reservation was accepted based on a (perhaps vague) promise
of the respective uses actually getting upstreamed. Yet that hasn't
happened (nor even started to happen) in slightly over 2 years time,
iirc.


I think this is fair. I hadn't realized we'd sat on this work for so long.


What I could see as a compromise is to have, say, vendor ranges
higher up in number space.


I (personally) would accept removing the reservation entirely -- we 
didn't follow the upstreaming process, so we take the cost of fixing up 
any compatibility issues.


David



Re: [XEN PATCH 1/7] xen: introduce XENFEAT_xenstore_late_init

2022-01-11 Thread David Vrabel




On 10/01/2022 22:55, Stefano Stabellini wrote:


I have a patch for Linux that if XENFEAT_xenstore_late_init is present
makes Linux wait for an event notification before initializing xenstore:
https://marc.info/?l=xen-devel&m=164160299315589

So with v1 of the Xen and Linux patches series:
- Xen allocates the event channel at domain creation
- Linux boots, sees XENFEAT_xenstore_late_init and wait for an event
- init-dom0less later allocates the xenstore page
- init-dom0less triggers the xenstore event channel
- Linux receives the event and finishes the initialization, including
   mapping the xenstore page


You can get this behaviour without the new feature.

- Xen allocates the event channel at domain creation
- Linux boots, sees HVM_PARAM_STORE_PFN is invalid and there's a 
xenstore event channel and waits for an event

- init-dom0less later allocates the xenstore page
- init-dom0less triggers the xenstore event channel
- Linux receives the event and finishes the initialization, including
   mapping the xenstore page.
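
A sketch of the guest-side check in the second step (illustrative only; the
HVM params are the existing ones, the wait helper is hypothetical):

	uint64_t pfn = 0, evtchn = 0;

	hvm_get_parameter(HVM_PARAM_STORE_PFN, &pfn);
	hvm_get_parameter(HVM_PARAM_STORE_EVTCHN, &evtchn);

	if (!pfn && evtchn)
		wait_for_xenstore_ready(evtchn);  /* hypothetical: block until notified */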

David



Re: [PATCHv2] x86/hvm: add more callback/upcall info to 'I' debug key

2022-01-07 Thread David Vrabel




On 07/01/2022 13:45, Andrew Cooper wrote:

 printk("Callback via PCI dev %u INTx %u%s\n",

PCI 00:%02x.0  ?


Is this correct? If I remember right, the INTx lines are associated with 
a PCI device, with the function then reporting which line it uses.


So Xen neither knows (nor cares) what the function is, so it would be 
misleading to report it.



+   hvm_irq->callback_via.pci.dev,
+   hvm_irq->callback_via.pci.intx,
+   via_asserted);
+break;
+
+case HVMIRQ_callback_vector:
+printk("Callback via vector %u%s\n",
+   hvm_irq->callback_via.vector,
+   via_asserted);


... here, vectors ought to be 0x%02x.  Amongst other things, it makes
the priority class instantly readable.

I realise this is all a complete mess, but is via_asserted correct for
HVMIRQ_callback_vector?  It's mismatched between the two, and the best
metric that exists is "is pending in IRR".  Also, looking at struct
hvm_irq, all the callback information is in the wrong structure, because
it absolutely shouldn't be duplicated for each GSI.


I'm not sure what changes to this patch you want here..

David



Re: [PATCH] x86/hvm: save/restore per-VCPU event channel upcall vector

2022-01-07 Thread David Vrabel

On 06/01/2022 16:48, Jan Beulich wrote:

On 06.01.2022 16:54, David Vrabel wrote:

The Windows XENBUS driver sets the per-VCPU LAPIC vector for event
channel interrupts using the HVMOP_set_evtchn_upcall_vector hypercall
(rather than using a vector-type callback in the CALLBACK_IRQ HVM
parameter since the vectors might be different for different VCPUs).

This state needs to be saved/restored or a restored guest may not be
able to get event channel interrupts.

Note that the Windows XENBUS driver works around this by reissuing the
hypercall when resuming after a migration, but this workaround would
not be possible in a guest transparent live migration or a live
update.


Why would this be needed only for this one form of "via"? And how do
you guarantee no observable change in behavior for existing guests?
It would seem to me that this information is something to be saved /
restored _only_ during transparent migration, as aware guests may
deliberately defer re-registration.


I was under the impression that the HVM parameters (including 
CALLBACK_IRQ) were saved/restored, so the intent here was to save/restore 
all event channel delivery mechanism state consistently, but it seems I 
was confused by some internal work that's not upstream yet.


So, I agree, this patch on its own doesn't make sense.

I've sent a patch reserving another save record ID instead.


+static int hvm_load_evtchn_upcall_vector(
+struct domain *d, hvm_domain_context_t *h)
+{
+unsigned int vcpuid;
+struct vcpu *v;
+struct hvm_evtchn_upcall_vector upcall;
+
+vcpuid = hvm_load_instance(h);
+if ( (v = domain_vcpu(d, vcpuid)) == NULL )
+return -EINVAL;
+
if ( hvm_load_entry(EVTCHN_UPCALL_VECTOR, h, &upcall) != 0 )
+return -EINVAL;
+
+hvm_set_evtchn_upcall_vector(v, upcall.vector);
+
+return 0;
+}


Don't you need to also set callback_via_type accordingly?


The callback_via_type is ignored if a per-vcpu upcall vector is set.

You can use a mix of CALLBACK_IRQ-defined upcalls on some VCPUs and an 
HVMOP_set_evtchn_upcall_vector-defined one on others, so setting the 
per-domain type wouldn't work.


I'm not sure why you would do this, but the ABI (as implemented, if not 
intentionally) allows it...
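
A minimal sketch of the resulting delivery priority (illustrative only; 
the real logic lives in hvm_assert_evtchn_irq() and is more involved, 
e.g. it also handles the vCPU 0 and tasklet cases):

/* Illustrative sketch, not the actual Xen code. */
static void deliver_evtchn_upcall(struct vcpu *v)
{
    if ( v->arch.hvm.evtchn_upcall_vector != 0 )
        /* Per-vCPU vector set via HVMOP_set_evtchn_upcall_vector wins... */
        vlapic_set_irq(vcpu_vlapic(v), v->arch.hvm.evtchn_upcall_vector, 0);
    else
        /* ...otherwise fall back to the domain-wide CALLBACK_IRQ setting. */
        hvm_set_callback_irq_level(v);
}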


David



[PATCH] x86/hvm: reserve another HVM context save record ID for Amazon

2022-01-07 Thread David Vrabel
Amazon's guest transparent live migration work needs another save
record (for event channel upcall vectors). Reserve another HVM context
save record ID for this.

Signed-off-by: David Vrabel 
---
I've added it to the end, keeping the unused ID at 21.
---
 xen/include/public/arch-x86/hvm/save.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/include/public/arch-x86/hvm/save.h b/xen/include/public/arch-x86/hvm/save.h
index 773a380bc2..2de3dfd9bb 100644
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -641,7 +641,7 @@ struct hvm_msr {
 
 #define CPU_MSR_CODE  20
 
-/* Range 22 - 34 (inclusive) reserved for Amazon */
+/* Range 22 - 35 (inclusive) reserved for Amazon */
 
 /*
  * Largest type-code in use
-- 
2.30.2




[PATCHv2] x86/hvm: add more callback/upcall info to 'I' debug key

2022-01-07 Thread David Vrabel
Include the type of the callback via and the per-VCPU upcall vector.

Signed-off-by: David Vrabel 
---
v2:
- fix style
- make upcall vector output distinguishable from logs prior to this patch
- use fewer lines for callback via.
---
 xen/arch/x86/hvm/irq.c | 49 ++
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 52aae4565f..8b1bb79127 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -598,7 +598,11 @@ int hvm_local_events_need_delivery(struct vcpu *v)
 static void irq_dump(struct domain *d)
 {
 struct hvm_irq *hvm_irq = hvm_domain_irq(d);
-int i; 
+unsigned int i;
+const struct vcpu *v;
+bool have_upcall_vector = false;
+const char *via_asserted;
+
 printk("Domain %d:\n", d->domain_id);
 printk("PCI 0x%16.16"PRIx64"%16.16"PRIx64
" ISA 0x%8.8"PRIx32" ROUTE %u %u %u %u\n",
@@ -630,9 +634,46 @@ static void irq_dump(struct domain *d)
hvm_irq->pci_link_assert_count[1],
hvm_irq->pci_link_assert_count[2],
hvm_irq->pci_link_assert_count[3]);
-printk("Callback via %i:%#"PRIx32",%s asserted\n",
-   hvm_irq->callback_via_type, hvm_irq->callback_via.gsi, 
-   hvm_irq->callback_via_asserted ? "" : " not");
+
+printk("Per-VCPU upcall vector:\n");
+for_each_vcpu ( d, v )
+{
+if ( v->arch.hvm.evtchn_upcall_vector )
+{
+printk("  v%u: %u\n",
+   v->vcpu_id, v->arch.hvm.evtchn_upcall_vector);
+have_upcall_vector = true;
+}
+}
+if ( !have_upcall_vector )
+printk("  none\n");
+
+via_asserted = hvm_irq->callback_via_asserted ? " (asserted)" : "";
+switch( hvm_irq->callback_via_type )
+{
+case HVMIRQ_callback_none:
+printk("Callback via none\n");
+break;
+
+case HVMIRQ_callback_gsi:
+printk("Callback via GSI %u%s\n",
+   hvm_irq->callback_via.gsi,
+   via_asserted);
+break;
+
+case HVMIRQ_callback_pci_intx:
+printk("Callback via PCI dev %u INTx %u%s\n",
+   hvm_irq->callback_via.pci.dev,
+   hvm_irq->callback_via.pci.intx,
+   via_asserted);
+break;
+
+case HVMIRQ_callback_vector:
+printk("Callback via vector %u%s\n",
+   hvm_irq->callback_via.vector,
+   via_asserted);
+break;
+}
 }
 
 static void dump_irq_info(unsigned char key)
-- 
2.30.2




[PATCH] x86/hvm: save/restore per-VCPU event channel upcall vector

2022-01-06 Thread David Vrabel
The Windows XENBUS driver sets the per-VCPU LAPIC vector for event
channel interrupts using the HVMOP_set_evtchn_upcall_vector hypercall
(rather than using a vector-type callback in the CALLBACK_IRQ HVM
parameter since the vectors might be different for different VCPUs).

This state needs to be saved/restored or a restored guest may not be
able to get event channel interrupts.

Note that the Windows XENBUS driver works around this by reissuing the
hypercall when resuming after a migration, but this workaround would
not be possible in a guest transparent live migration or a live
update.

Signed-off-by: David Vrabel 
---
 xen/arch/x86/hvm/hvm.c | 50 --
 xen/include/public/arch-x86/hvm/save.h | 12 ++-
 2 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 350dc396e3..be2e676c4a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4071,6 +4071,52 @@ static int hvmop_flush_tlb_all(void)
 return paging_flush_tlb(NULL) ? 0 : -ERESTART;
 }
 
+static void hvm_set_evtchn_upcall_vector(struct vcpu *v, uint8_t vector)
+{
+printk(XENLOG_G_INFO "%pv: upcall vector %02x\n", v, vector);
+
+v->arch.hvm.evtchn_upcall_vector = vector;
+hvm_assert_evtchn_irq(v);
+}
+
+static int hvm_save_evtchn_upcall_vector(
+struct vcpu *v, hvm_domain_context_t *h)
+{
+struct hvm_evtchn_upcall_vector upcall;
+
+/* Skip record if VCPU is down or the upcall vector is not used. */
+if ( test_bit(_VPF_down, &v->pause_flags) )
+return 0;
+if ( v->arch.hvm.evtchn_upcall_vector == 0 )
+return 0;
+
+upcall.vector = v->arch.hvm.evtchn_upcall_vector;
+
+return hvm_save_entry(EVTCHN_UPCALL_VECTOR, v->vcpu_id, h, &upcall);
+}
+
+static int hvm_load_evtchn_upcall_vector(
+struct domain *d, hvm_domain_context_t *h)
+{
+unsigned int vcpuid;
+struct vcpu *v;
+struct hvm_evtchn_upcall_vector upcall;
+
+vcpuid = hvm_load_instance(h);
+if ( (v = domain_vcpu(d, vcpuid)) == NULL )
+return -EINVAL;
+
+if ( hvm_load_entry(EVTCHN_UPCALL_VECTOR, h, &upcall) != 0 )
+return -EINVAL;
+
+hvm_set_evtchn_upcall_vector(v, upcall.vector);
+
+return 0;
+}
+
+HVM_REGISTER_SAVE_RESTORE(EVTCHN_UPCALL_VECTOR, hvm_save_evtchn_upcall_vector,
+  hvm_load_evtchn_upcall_vector, 1, HVMSR_PER_VCPU);
+
 static int hvmop_set_evtchn_upcall_vector(
 XEN_GUEST_HANDLE_PARAM(xen_hvm_evtchn_upcall_vector_t) uop)
 {
@@ -4090,10 +4136,8 @@ static int hvmop_set_evtchn_upcall_vector(
 if ( (v = domain_vcpu(d, op.vcpu)) == NULL )
 return -ENOENT;
 
-printk(XENLOG_G_INFO "%pv: upcall vector %02x\n", v, op.vector);
+hvm_set_evtchn_upcall_vector(v, op.vector);
 
-v->arch.hvm.evtchn_upcall_vector = op.vector;
-hvm_assert_evtchn_irq(v);
 return 0;
 }
 
diff --git a/xen/include/public/arch-x86/hvm/save.h b/xen/include/public/arch-x86/hvm/save.h
index 773a380bc2..177cb09150 100644
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -641,12 +641,22 @@ struct hvm_msr {
 
 #define CPU_MSR_CODE  20
 
+/**
+ * Per-VCPU event channel upcall vector as set by
+ * HVMOP_set_evtchn_upcall_vector hypercall.
+ */
+struct hvm_evtchn_upcall_vector {
+uint8_t vector;
+};
+
+DECLARE_HVM_SAVE_TYPE(EVTCHN_UPCALL_VECTOR, 21, struct hvm_evtchn_upcall_vector);
+
 /* Range 22 - 34 (inclusive) reserved for Amazon */
 
 /*
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 20
+#define HVM_SAVE_CODE_MAX 21
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
 
-- 
2.30.2




[PATCH] x86/hvm: add more callback/upcall info to 'I' debug key

2022-01-06 Thread David Vrabel
Include the type of the callback via and the per-VCPU upcall vector.

Signed-off-by: David Vrabel 
---
 xen/arch/x86/hvm/irq.c | 31 +++
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 52aae4565f..6a1edb99f2 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -598,7 +598,9 @@ int hvm_local_events_need_delivery(struct vcpu *v)
 static void irq_dump(struct domain *d)
 {
 struct hvm_irq *hvm_irq = hvm_domain_irq(d);
-int i; 
+int i;
+struct vcpu *v;
+
 printk("Domain %d:\n", d->domain_id);
 printk("PCI 0x%16.16"PRIx64"%16.16"PRIx64
" ISA 0x%8.8"PRIx32" ROUTE %u %u %u %u\n",
@@ -630,9 +632,30 @@ static void irq_dump(struct domain *d)
hvm_irq->pci_link_assert_count[1],
hvm_irq->pci_link_assert_count[2],
hvm_irq->pci_link_assert_count[3]);
-printk("Callback via %i:%#"PRIx32",%s asserted\n",
-   hvm_irq->callback_via_type, hvm_irq->callback_via.gsi, 
-   hvm_irq->callback_via_asserted ? "" : " not");
+for_each_vcpu( d, v )
+{
+if ( v->arch.hvm.evtchn_upcall_vector )
+printk("%pv: upcall vector: %u\n",
+   v, v->arch.hvm.evtchn_upcall_vector);
+}
+switch( hvm_irq->callback_via_type )
+{
+case HVMIRQ_callback_none:
+printk("Callback via none\n");
+break;
+case HVMIRQ_callback_gsi:
+printk("Callback via GSI %u\n", hvm_irq->callback_via.gsi);
+break;
+case HVMIRQ_callback_pci_intx:
+printk("Callback via PCI dev %u INTx %u\n",
+   hvm_irq->callback_via.pci.dev,
+   hvm_irq->callback_via.pci.intx);
+break;
+case HVMIRQ_callback_vector:
+printk("Callback via vector %u\n", hvm_irq->callback_via.vector);
+break;
+}
+printk("  %s asserted\n", hvm_irq->callback_via_asserted ? "" : " not");
 }
 
 static void dump_irq_info(unsigned char key)
-- 
2.30.2