Re: [PATCH] x86/EPT: relax iPAT for "invalid" MFNs

2024-06-11 Thread Roger Pau Monné
On Tue, Jun 11, 2024 at 04:53:22PM +0200, Jan Beulich wrote:
> On 11.06.2024 15:52, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 01:52:58PM +0200, Jan Beulich wrote:
> >> On 11.06.2024 13:08, Roger Pau Monné wrote:
> >>> I really wonder whether Xen has enough information to figure out
> >>> whether a hole (MMIO region) is supposed to be accessed as UC or
> >>> something else.
> >>
> >> It certainly hasn't, and hence is erring on the (safe) side of forcing
> >> UC.
> > 
> > Except that for the vesa framebuffer at least this is a bad choice :).
> 
> Well, yes, that's where we want WC to be permitted. But for that we only
> need to avoid setting iPAT; we still can uniformly hand back UC. Except
> (as mentioned elsewhere earlier) if the guest uses MTRRs rather than PAT
> to arrange for WC.

If we want to get this into 4.19, we likely want to go with your proposed
approach then, as it's less risky.

I think a comment would be helpful to note that the fix here of not
enforcing iPAT while still returning UC is mostly done to accommodate vesa
regions mapped with PAT attributes to use WC.

I would also like to add some kind of note that special casing
!mfn_valid() might not be needed, but that removing it must be done
carefully to not cause regressions.
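
Something along these lines, maybe (just an editorial sketch of how such a
comment could sit next to the v1 hunk quoted elsewhere in this thread; the
condition is the one from the patch, while the UC constant name is an
assumption):

    if ( !mfn_valid(mfn) )
    {
        /*
         * For directly assigned MMIO do not force iPAT: UC is still
         * returned, but leaving iPAT clear lets guest PAT arrange for WC
         * (e.g. for a VESA framebuffer).  The !mfn_valid() special casing
         * itself might not be strictly needed, but removing it has to be
         * done carefully to avoid regressions.
         */
        *ipat = type != p2m_mmio_direct ||
                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
        return X86_MT_UC; /* assumed name of the UC memory type constant */
    }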

> >>>  Maybe the mfn_valid() check should be
> >>> inverted, and return WB when the underlying mfn is RAM, and otherwise
> >>> use the guest MTRRs to decide the cache attribute?
> >>
> >> First: Whether WB is correct for RAM isn't known. With some peculiar device
> >> assigned, the guest may want/need part of its RAM be e.g. WC or WT. (It's
> >> only without any physical devices assigned that we can be quite sure that
> >> WB is good for all of RAM.) Therefore, second, I think respecting MTRRs for
> >> RAM is less likely to cause problems than respecting them for MMIO.
> >>
> >> I think at this point the main question is: Do we want to do things at 
> >> least
> >> along the lines of this v1, or do we instead feel certain enough to switch
> >> the mfn_valid() to a comparison against INVALID_MFN (and perhaps moving it
> >> up to almost the top of the function)?
> > 
> > My preferred option would be the later, as that would remove a special
> > casing.  However, I'm unsure how much fallout this could cause - those
> > caching changes are always tricky and lead to unexpected fallout.
> 
> Which is the very reason why I tried to avoid going too far with this.
> 
> > OTOH the current !mfn_valid() check is very restrictive, as it forces
> > all MMIO to UC.
> 
> Which is why, in this v1, I'm relaxing only the iPAT part.
> 
> >  So by removing it we allow guest chosen types to take
> > effect, which are likely less restrictive than UC (whether those are
> > correct is another question).
> 
> No, guest chosen types still wouldn't come into play, due to what the
> switch() further down in the function does for p2m_mmio_direct.

Indeed.  That should also be removed if we decide for MMIO cache
attributes to be controlled by guest MTRRs.
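
For reference, a rough illustration (not the actual Xen code) of what "the
switch() further down" amounts to for this discussion: for p2m_mmio_direct
the EMT is forced to UC regardless of guest MTRRs, so relaxing only iPAT
does not by itself let guest-chosen types through.

    switch ( type )
    {
    case p2m_mmio_direct:
        /*
         * Guest MTRRs are ignored; UC is handed back (guest PAT may still
         * yield WC when iPAT is not forced).
         */
        return X86_MT_UC; /* assumed constant name */

    default:
        /* Other p2m types continue with the MTRR/PAT based calculation. */
        break;
    }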

> 
> >> One caveat here that I forgot to
> >> mention before: MFNs taken out of EPT entries will never be INVALID_MFN, 
> >> for
> >> the truncation that happens when populating entries. In that case we rely 
> >> on
> >> mfn_valid() to be "rejecting" them.
> > 
> > The only caller where mfns from EPT entries are passed to
> > epte_get_entry_emt() is in resolve_misconfig() AFAICT, and in that
> > case the EPT entry must be present for epte_get_entry_emt() to be
> > called.  So it seems to me that epte_get_entry_emt() can never be
> > called from an mfn constructed from an INVALID_MFN EPT entry (but it's
> > worth adding an assert for it).
> 
> Are you sure? I agree for the first of those two calls, but the second
> isn't quite as obvious. There we'd need to first prove that we will
> never create non-present super-page entries. Yet if I'm not mistaken
> for PoD we may create such.

I should go look then, didn't know PoD would do that.

Regards, Roger.



Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Roger Pau Monné
On Tue, Jun 11, 2024 at 03:15:42PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Jun 11, 2024 at 02:55:22PM +0200, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 01:38:35PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, Jun 11, 2024 at 12:40:49PM +0200, Roger Pau Monné wrote:
> > > > On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki 
> > > > wrote:
> > > > > +if ( !entry )
> > > > > +{
> > > > > +/* iter == NULL marks it was a newly allocated entry */
> > > > > +iter = NULL;
> > > > > +entry = xzalloc(struct subpage_ro_range);
> > > > > +if ( !entry )
> > > > > +return -ENOMEM;
> > > > > +entry->mfn = mfn;
> > > > > +}
> > > > > +
> > > > > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > > > > +{
> > > > > +bool oldbit = __test_and_set_bit(i / MMIO_RO_SUBPAGE_GRAN,
> > > > > +entry->ro_elems);
> > > > > +ASSERT(!oldbit);
> > > > > +}
> > > > > +
> > > > > +if ( !iter )
> > > > > +list_add(&entry->list, &subpage_ro_ranges);
> > > > > +
> > > > > +return iter ? 1 : 0;
> > > > > +}
> > > > > +
> > > > > +/* This needs subpage_ro_lock already taken */
> > > > > +static void __init subpage_mmio_ro_remove_page(
> > > > > +mfn_t mfn,
> > > > > +unsigned int offset_s,
> > > > > +unsigned int offset_e)
> > > > > +{
> > > > > +struct subpage_ro_range *entry = NULL, *iter;
> > > > > +unsigned int i;
> > > > > +
> > > > > +list_for_each_entry(iter, &subpage_ro_ranges, list)
> > > > > +{
> > > > > +if ( mfn_eq(iter->mfn, mfn) )
> > > > > +{
> > > > > +entry = iter;
> > > > > +break;
> > > > > +}
> > > > > +}
> > > > > +if ( !entry )
> > > > > +return;
> > > > > +
> > > > > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > > > > +__clear_bit(i / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems);
> > > > > +
> > > > > +if ( !bitmap_empty(entry->ro_elems, PAGE_SIZE / 
> > > > > MMIO_RO_SUBPAGE_GRAN) )
> > > > > +return;
> > > > > +
> > > > > +list_del(&entry->list);
> > > > > +if ( entry->mapped )
> > > > > +iounmap(entry->mapped);
> > > > > +xfree(entry);
> > > > > +}
> > > > > +
> > > > > +int __init subpage_mmio_ro_add(
> > > > > +paddr_t start,
> > > > > +size_t size)
> > > > > +{
> > > > > +mfn_t mfn_start = maddr_to_mfn(start);
> > > > > +paddr_t end = start + size - 1;
> > > > > +mfn_t mfn_end = maddr_to_mfn(end);
> > > > > +unsigned int offset_end = 0;
> > > > > +int rc;
> > > > > +bool subpage_start, subpage_end;
> > > > > +
> > > > > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > > > > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > > > > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > > > > +size = ROUNDUP(size, MMIO_RO_SUBPAGE_GRAN);
> > > > > +
> > > > > +if ( !size )
> > > > > +return 0;
> > > > > +
> > > > > +if ( mfn_eq(mfn_start, mfn_end) )
> > > > > +{
> > > > > +/* Both starting and ending parts handled at once */
> > > > > +subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != 
> > > > > PAGE_SIZE - 1;
> > > > > +subpage_end = false;
> > > > 
> > > > Given the intended usage of this, don't we want to limit to only a
> > > > single page?  So that PFN_DOWN(start + size) == PFN_DOWN(start), as
> > > > that would simplify the logic here?
> > > 
> > > I have considered that, but I haven't found anything in the spec

Re: [PATCH] x86/EPT: relax iPAT for "invalid" MFNs

2024-06-11 Thread Roger Pau Monné
On Tue, Jun 11, 2024 at 01:52:58PM +0200, Jan Beulich wrote:
> On 11.06.2024 13:08, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 11:33:24AM +0200, Jan Beulich wrote:
> >> On 11.06.2024 11:02, Roger Pau Monné wrote:
> >>> On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
> >>>> On 11.06.2024 09:41, Roger Pau Monné wrote:
> >>>>> On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
> >>>>>> --- a/xen/arch/x86/mm/p2m-ept.c
> >>>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
> >>>>>> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
> >>>>>>  
> >>>>>>  if ( !mfn_valid(mfn) )
> >>>>>>  {
> >>>>>> -*ipat = true;
> >>>>>> +*ipat = type != p2m_mmio_direct ||
> >>>>>> +(!is_iommu_enabled(d) && !cache_flush_permitted(d));
> >>>>>
> >>>>> Looking at this, shouldn't the !mfn_valid special case be removed, and
> >>>>> mfns without a valid page be processed normally, so that the guest
> >>>>> MTRR values are taken into account, and no iPAT is enforced?
> >>>>
> >>>> Such removal is what, in the post commit message remark, I'm referring to
> >>>> as "moving to too lax". Doing so might be okay, but will imo be hard to
> >>>> prove to be correct for all possible cases. Along these lines goes also
> >>>> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
> >>>> p2m_mmio_direct should not be used when neither of these return true. Yet
> >>>> a similar consideration would apply to the immediately subsequent if().
> >>>>
> >>>> Removing this code would, in particular, result in INVALID_MFN getting a
> >>>> type of WB by way of the subsequent if(), unless the type there would
> >>>> also be p2m_mmio_direct (which, as said, it ought to never be for non-
> >>>> pass-through domains). That again _may_ not be a problem as long as such
> >>>> EPT entries would never be marked present, yet that's again difficult to
> >>>> prove.
> >>>
> >>> My understanding is that the !mfn_valid() check was a way to detect
> >>> MMIO regions in order to exit early and set those to UC.  I however
> >>> don't follow why the guest MTRR settings shouldn't also be applied to
> >>> those regions.
> >>
> >> It's unclear to me whether the original purpose of the check really was
> >> (just) MMIO. It could as well also have been to cover the (then not yet
> >> named that way) case of INVALID_MFN.
> >>
> >> As to ignoring guest MTRRs for MMIO: I think that's to be on the safe
> >> side. We don't want guests to map uncachable memory with a cachable
> >> memory type. Yet control isn't fine grained enough to prevent just
> >> that. Hence why we force UC, allowing merely to move to WC via PAT.
> > 
> > Would that be to cover up for guest bugs, or is there a coherency
> > reason for not allowing guests to access memory using fully guest
> > chosen cache attributes?
> 
> I think the main reason is that this way we don't need to bother thinking
> of whether MMIO regions may need caches flushed in order for us to be
> sure memory is all up-to-date. But I have no insight into what the
> original reasons here may have been.
> 
> > I really wonder whether Xen has enough information to figure out
> > whether a hole (MMIO region) is supposed to be accessed as UC or
> > something else.
> 
> It certainly hasn't, and hence is erring on the (safe) side of forcing
> UC.

Except that for the vesa framebuffer at least this is a bad choice :).

> > Your proposed patch already allows the guest to set such attributes in
> > PAT, and hence I don't see why also taking guest MTRRs into account
> > would be any worse.
> 
> Whatever the guest sets in PAT, UC in EMT will win except for the
> special case of WC.
> 
> >>>>> I also think this likely wants a:
> >>>>>
> >>>>> Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in 
> >>>>> epte_get_entry_emt()')
> >>>>
> >>>> Oh, indeed, I should have dug out when this broke. I didn't because I
> >>>> knew this mfn_valid() check was there forever, neglecting that it wasn't
> >>>> always (almost) first.
> >>>>
> >>>>>

Re: [PATCH for-4.19] CI: Update FreeBSD to 13.3

2024-06-11 Thread Roger Pau Monné
On Tue, Jun 11, 2024 at 01:47:01PM +0100, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper 

Acked-by: Roger Pau Monné 

Albeit I'm unsure if that's some kind of glitch or error on the
FreeBSD side.  13.2 is not EOL until June 30, 2024 [0].

Thanks, Roger.

[0] https://www.freebsd.org/security/#sup



Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Roger Pau Monné
On Tue, Jun 11, 2024 at 01:38:35PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Jun 11, 2024 at 12:40:49PM +0200, Roger Pau Monné wrote:
> > On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki wrote:
> > > +if ( !entry )
> > > +{
> > > +/* iter == NULL marks it was a newly allocated entry */
> > > +iter = NULL;
> > > +entry = xzalloc(struct subpage_ro_range);
> > > +if ( !entry )
> > > +return -ENOMEM;
> > > +entry->mfn = mfn;
> > > +}
> > > +
> > > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > > +{
> > > +bool oldbit = __test_and_set_bit(i / MMIO_RO_SUBPAGE_GRAN,
> > > +entry->ro_elems);
> > > +ASSERT(!oldbit);
> > > +}
> > > +
> > > +if ( !iter )
> > > +list_add(&entry->list, &subpage_ro_ranges);
> > > +
> > > +return iter ? 1 : 0;
> > > +}
> > > +
> > > +/* This needs subpage_ro_lock already taken */
> > > +static void __init subpage_mmio_ro_remove_page(
> > > +mfn_t mfn,
> > > +unsigned int offset_s,
> > > +unsigned int offset_e)
> > > +{
> > > +struct subpage_ro_range *entry = NULL, *iter;
> > > +unsigned int i;
> > > +
> > > +list_for_each_entry(iter, &subpage_ro_ranges, list)
> > > +{
> > > +if ( mfn_eq(iter->mfn, mfn) )
> > > +{
> > > +entry = iter;
> > > +break;
> > > +}
> > > +}
> > > +if ( !entry )
> > > +return;
> > > +
> > > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > > +__clear_bit(i / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems);
> > > +
> > > +if ( !bitmap_empty(entry->ro_elems, PAGE_SIZE / 
> > > MMIO_RO_SUBPAGE_GRAN) )
> > > +return;
> > > +
> > > +list_del(&entry->list);
> > > +if ( entry->mapped )
> > > +iounmap(entry->mapped);
> > > +xfree(entry);
> > > +}
> > > +
> > > +int __init subpage_mmio_ro_add(
> > > +paddr_t start,
> > > +size_t size)
> > > +{
> > > +mfn_t mfn_start = maddr_to_mfn(start);
> > > +paddr_t end = start + size - 1;
> > > +mfn_t mfn_end = maddr_to_mfn(end);
> > > +unsigned int offset_end = 0;
> > > +int rc;
> > > +bool subpage_start, subpage_end;
> > > +
> > > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > > +size = ROUNDUP(size, MMIO_RO_SUBPAGE_GRAN);
> > > +
> > > +if ( !size )
> > > +return 0;
> > > +
> > > +if ( mfn_eq(mfn_start, mfn_end) )
> > > +{
> > > +/* Both starting and ending parts handled at once */
> > > +subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != 
> > > PAGE_SIZE - 1;
> > > +subpage_end = false;
> > 
> > Given the intended usage of this, don't we want to limit to only a
> > single page?  So that PFN_DOWN(start + size) == PFN_DOWN(start), as
> > that would simplify the logic here?
> 
> I have considered that, but I haven't found anything in the spec
> mandating the XHCI DbC registers to not cross page boundary. Currently
> (on a system I test this on) they don't cross page boundary, but I don't
> want to assume extra constrains - to avoid issues like before (when
> on the older system I tested the DbC registers didn't shared page with
> other registers, but then they shared the page on a newer hardware).

Oh, from our conversation at XenSummit I got the impression debug registers
were always at the same position.  Looking at patch 2/2, it seems you
only need to block access to a single register.  Are registers in XHCI
size aligned?  As this would guarantee it doesn't cross a page
boundary (as long as the register is <= 4096 in size).
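
For clarity, the single-page restriction mentioned above would boil down to
something like the check below (illustrative only; it uses the inclusive
end so a region finishing exactly at a page boundary is still accepted, and
the error value is just an example):

    /* Restrict subpage r/o regions to a single page. */
    if ( PFN_DOWN(start + size - 1) != PFN_DOWN(start) )
        return -EOPNOTSUPP;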

> > > +if ( !addr )
> > > +{
> > > +gprintk(XENLOG_ERR,
> > > +"Failed to map page for MMIO write at 
> > > 0x%"PRI_mfn"%03x\n",
> > > +mfn_x(mfn), offset);
> > > +return;

Re: [PATCH] x86/EPT: relax iPAT for "invalid" MFNs

2024-06-11 Thread Roger Pau Monné
On Tue, Jun 11, 2024 at 11:33:24AM +0200, Jan Beulich wrote:
> On 11.06.2024 11:02, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
> >> On 11.06.2024 09:41, Roger Pau Monné wrote:
> >>> On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
> >>>> --- a/xen/arch/x86/mm/p2m-ept.c
> >>>> +++ b/xen/arch/x86/mm/p2m-ept.c
> >>>> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
> >>>>  
> >>>>  if ( !mfn_valid(mfn) )
> >>>>  {
> >>>> -*ipat = true;
> >>>> +*ipat = type != p2m_mmio_direct ||
> >>>> +(!is_iommu_enabled(d) && !cache_flush_permitted(d));
> >>>
> >>> Looking at this, shouldn't the !mfn_valid special case be removed, and
> >>> mfns without a valid page be processed normally, so that the guest
> >>> MTRR values are taken into account, and no iPAT is enforced?
> >>
> >> Such removal is what, in the post commit message remark, I'm referring to
> >> as "moving to too lax". Doing so might be okay, but will imo be hard to
> >> prove to be correct for all possible cases. Along these lines goes also
> >> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
> >> p2m_mmio_direct should not be used when neither of these return true. Yet
> >> a similar consideration would apply to the immediately subsequent if().
> >>
> >> Removing this code would, in particular, result in INVALID_MFN getting a
> >> type of WB by way of the subsequent if(), unless the type there would
> >> also be p2m_mmio_direct (which, as said, it ought to never be for non-
> >> pass-through domains). That again _may_ not be a problem as long as such
> >> EPT entries would never be marked present, yet that's again difficult to
> >> prove.
> > 
> > My understanding is that the !mfn_valid() check was a way to detect
> > MMIO regions in order to exit early and set those to UC.  I however
> > don't follow why the guest MTRR settings shouldn't also be applied to
> > those regions.
> 
> It's unclear to me whether the original purpose of the check really was
> (just) MMIO. It could as well also have been to cover the (then not yet
> named that way) case of INVALID_MFN.
> 
> As to ignoring guest MTRRs for MMIO: I think that's to be on the safe
> side. We don't want guests to map uncachable memory with a cachable
> memory type. Yet control isn't fine grained enough to prevent just
> that. Hence why we force UC, allowing merely to move to WC via PAT.

Would that be to cover up for guest bugs, or is there a coherency
reason for not allowing guests to access memory using fully guest
chosen cache attributes?

I really wonder whether Xen has enough information to figure out
whether a hole (MMIO region) is supposed to be accessed as UC or
something else.

Your proposed patch already allows the guest to set such attributes in
PAT, and hence I don't see why also taking guest MTRRs into account
would be any worse.

> > I'm also confused by your comment about "as such EPT entries would
> > never be marked present": non-present EPT entries don't even get into
> > epte_get_entry_emt(), and hence we could assert in epte_get_entry_emt
> > that mfn != INVALID_MFN?
> 
> I don't think we can. Especially for the call from ept_set_entry() I
> can't spot anything that would prevent the call for non-present entries.
> This may be a mistake, but I can't do anything about it right here.

Hm, I see, then we should explicitly handle INVALID_MFN in
epte_get_entry_emt(), and just return early.
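
I.e. something along the lines of the below near the top of the function (a
sketch only: it simply mirrors what the current !mfn_valid() path does for
such entries, and the UC constant name is assumed):

    if ( mfn_eq(mfn, INVALID_MFN) )
    {
        *ipat = true;
        return X86_MT_UC;
    }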

> >>> I also think this likely wants a:
> >>>
> >>> Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in 
> >>> epte_get_entry_emt()')
> >>
> >> Oh, indeed, I should have dug out when this broke. I didn't because I
> >> knew this mfn_valid() check was there forever, neglecting that it wasn't
> >> always (almost) first.
> >>
> >>> As AFAICT before that commit direct MMIO regions would set iPAT to WB,
> >>> which would result in the correct attributes (albeit guest MTRR was
> >>> still ignored).
> >>
> >> Two corrections here: First iPAT is a boolean; it can't be set to WB.
> >> And then what was happening prior to that change was that for the APIC
> >> access page iPAT was set to true, thus forcing WB there. iPAT was left
> >> set to false for all other p2m_mmio_direct pages, yielding (PAT-
> >> overridable) UC there.
> > 
> > 

Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Roger Pau Monné
On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki wrote:
> In some cases, only a few registers on a page need to be write-protected.
> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole table either), although
> in the latter case the spec forbids placing other registers on the same
> page. The current API allows only marking whole pages read-only,
> which sometimes may cover other registers that the guest may need to
> write into.
> 
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> from userspace (like, /dev/mem), it will try to fix up by updating page
> tables (that Xen again will force to read-only) and will hit that #PF
> again (looping endlessly). Both behaviors are undesirable if the guest could
> actually be allowed the write.
> 
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables (they are in EPT,
> but not granular enough), do this via emulation (or simply page fault
> handler for PV) that handles writes that are supposed to be allowed.
> The new subpage_mmio_ro_add() takes a start physical address and the
> region size in bytes. Both start address and the size need to be 8-byte
> aligned, as a practical simplification (allows using smaller bitmask,
> and a smaller granularity isn't really necessary right now).
> It will internally add relevant pages to mmio_ro_ranges, but if either
> start or end address is not page-aligned, it additionally adds that page
> to a list for sub-page R/O handling. The list holds a bitmask of which
> qwords are supposed to be read-only and an address where the page is mapped
> for write emulation - this mapping is done only on the first access. A
> plain list is used instead of a more efficient structure, because there
> aren't supposed to be many pages needing this precise r/o control.
> 
> The mechanism this API is plugged into is slightly different for PV and
> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> it's already called for #PF on read-only MMIO page. For HVM however, EPT
> violation on p2m_mmio_direct page results in a direct domain_crash() for
> non hardware domains.  To reach mmio_ro_emulated_write(), change how
> write violations for p2m_mmio_direct are handled - specifically, check
> if they relate to such partially protected page via
> subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
> them too. This decodes what guest is trying write and finally calls
> mmio_ro_emulated_write(). The EPT write violation is detected as
> npfec.write_access and npfec.present both being true (similar to other
> places), which may cover some other (future?) cases - if that happens,
> the emulator might get involved unnecessarily, but since it's limited to
> pages marked with subpage_mmio_ro_add() only, the impact is minimal.
> Both of those paths need an MFN to which the guest tried to write (to check
> which part of the page is supposed to be read-only, and where
> the page is mapped for writes). This information currently isn't
> available directly in mmio_ro_emulated_write(), but in both cases it is
> already resolved somewhere higher in the call tree. Pass it down to
> mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
> 
> This may give HVM guests a bit more access to the instruction emulator
> (the change in hvm_hap_nested_page_fault()), but only for pages
> explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
> passed-through device partially used by Xen.
> As of the next patch, it applies only to configuration explicitly
> documented as not security supported.
> 
> The subpage_mmio_ro_add() function cannot be called with overlapping
> ranges, and on pages already added to mmio_ro_ranges separately.
> Successful calls would result in correct handling, but error paths may
> result in incorrect state (like pages removed from mmio_ro_ranges too
> early). Debug build has asserts for relevant cases.
> 
> Signed-off-by: Marek Marczykowski-Górecki 
> ---
> Shadow mode is not tested, but I don't expect it to work differently than
> HAP in areas related to this patch.
> 
> Changes in v4:
> - rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
> - guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
>   there
> - rename ro_qwords to ro_elems
> - use unsigned arguments for subpage_mmio_ro_remove_page()
> - use volatile for __iomem
> - do not set mmio_ro_ctxt.mfn for mmcfg case
> - comment where fields of mmio_ro_ctxt are used
> - use bool for result of __test_and_set_bit
> - do not open-code mfn_to_maddr()
> - remove leftover RCU
> - mention hvm_hap_nested_page_fault() explicitly in the commit message
> Changes in v3:
> - use unsigned int for loop iterators
> - use __set_bit/__clear_bit 
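
As a concrete illustration of the API described in the commit message above
(the real caller lives in patch 2/2 and is not shown here; dbc_regs_phys
and DBC_REGS_SIZE are made-up names, and both arguments must be 8-byte,
i.e. MMIO_RO_SUBPAGE_GRAN, aligned):

    /* Mark the (hypothetical) XHCI DbC register range read-only for guests. */
    int rc = subpage_mmio_ro_add(dbc_regs_phys, DBC_REGS_SIZE);

    if ( rc )
        printk(XENLOG_WARNING
               "Failed to mark DbC registers read-only: %d\n", rc);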

Re: [PATCH] x86/EPT: relax iPAT for "invalid" MFNs

2024-06-11 Thread Roger Pau Monné
On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
> On 11.06.2024 09:41, Roger Pau Monné wrote:
> > On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
> >> mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
> >> access to actual MMIO space should not generally be restricted to UC
> >> only; especially video frame buffer accesses are unduly affected by such
> >> a restriction. Permit PAT use for directly assigned MMIO as long as the
> >> domain is known to have been granted some level of cache control.
> >>
> >> Signed-off-by: Jan Beulich 
> >> ---
> >> Considering that we've just declared PVH Dom0 "supported", this may well
> >> qualify for 4.19. The issue was specifically very noticeable there.
> >>
> >> The conditional may be more complex than really necessary, but it's in
> >> line with what we do elsewhere. And imo better continue to be a little
> >> too restrictive, than moving to too lax.
> >>
> >> --- a/xen/arch/x86/mm/p2m-ept.c
> >> +++ b/xen/arch/x86/mm/p2m-ept.c
> >> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
> >>  
> >>  if ( !mfn_valid(mfn) )
> >>  {
> >> -*ipat = true;
> >> +*ipat = type != p2m_mmio_direct ||
> >> +(!is_iommu_enabled(d) && !cache_flush_permitted(d));
> > 
> > Looking at this, shouldn't the !mfn_valid special case be removed, and
> > mfns without a valid page be processed normally, so that the guest
> > MTRR values are taken into account, and no iPAT is enforced?
> 
> Such removal is what, in the post commit message remark, I'm referring to
> as "moving to too lax". Doing so might be okay, but will imo be hard to
> prove to be correct for all possible cases. Along these lines goes also
> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
> p2m_mmio_direct should not be used when neither of these return true. Yet
> a similar consideration would apply to the immediately subsequent if().
> 
> Removing this code would, in particular, result in INVALID_MFN getting a
> type of WB by way of the subsequent if(), unless the type there would
> also be p2m_mmio_direct (which, as said, it ought to never be for non-
> pass-through domains). That again _may_ not be a problem as long as such
> EPT entries would never be marked present, yet that's again difficult to
> prove.

My understanding is that the !mfn_valid() check was a way to detect
MMIO regions in order to exit early and set those to UC.  I however
don't follow why the guest MTRR settings shouldn't also be applied to
those regions.

I'm also confused by your comment about "as such EPT entries would
never be marked present": non-present EPT entries don't even get into
epte_get_entry_emt(), and hence we could assert in epte_get_entry_emt
that mfn != INVALID_MFN?

> I was in fact wondering whether to special-case INVALID_MFN in the change
> I'm making. Question there is: Are we sure that by now we've indeed got
> rid of all arithmetic mistakenly done on MFN variables happening to hold
> INVALID_MFN as the value? IOW I fear that there might be code left which
> would pass in INVALID_MFN masked down to a 2M or 1G boundary. At which
> point checking for just INVALID_MFN would end up insufficient. If we
> meant to rely on this (tagging possible leftover issues as bugs we don't
> mean to attempt to cover for here anymore), then indeed the mfn_valid()
> check could be replaced by a comparison with INVALID_MFN (following a
> pattern we've been slowly trying to carry through elsewhere, especially
> in shadow code). Yet it could still not be outright dropped imo.
> 
> Furthermore simply dropping (or replacing as per above) that check won't
> work either: Further down in the function we use mfn_to_page(), which
> requires an up-front mfn_valid() check. That said, this code looks
> partly broken to me anyway: For a 1G page mfn_valid() on the start of it
> doesn't really imply all parts of it are valid. I guess I need to make a
> 2nd patch to address that as well, which may then want to be a prereq
> change to the one here (if we decided to go the route you're asking for).

I see, yes, the loop over the special pages array will need to be
adjusted to account for mfn_to_page() possibly returning NULL.

Overall I don't understand the need for this special case for
!mfn_valid().  The rest of special cases we have (the special pages
and domains without devices or MMIO regions assigned) are performance
optimizations which I do understand.  Yet the special casing of
!mfn_valid regions bypassing guest MTRR settings seems bogus to me.

> 
> > I also 

Re: [PATCH for-4.19 v2] x86/pvh: declare PVH dom0 supported with caveats

2024-06-11 Thread Roger Pau Monné
On Mon, Jun 10, 2024 at 04:55:34PM +0100, George Dunlap wrote:
> On Mon, Jun 10, 2024 at 9:50 AM Roger Pau Monne  wrote:
> >
> > PVH dom0 is functionally very similar to PVH domU except for the domain
> > builder and the added set of hypercalls available to it.
> >
> > The main concern with declaring it "Supported" is the lack of some features
> > when compared to classic PV dom0, hence switch its status to supported with
> > caveats.  List the known missing features, there might be more features 
> > missing
> > or not working as expected apart from the ones listed.
> >
> > Note there's some (limited) PVH dom0 testing on both osstest and gitlab.
> >
> > Signed-off-by: Roger Pau Monné 
> > Acked-by: Andrew Cooper 
> > Release-Acked-by: Oleksii Kurochko 
> > ---
> > Changes since v1:
> >  - Remove boot warning.
> > ---
> >  CHANGELOG.md  |  1 +
> >  SUPPORT.md| 15 ++-
> >  xen/arch/x86/hvm/dom0_build.c |  1 -
> >  3 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/CHANGELOG.md b/CHANGELOG.md
> > index 201478aa1c0e..1778419cae64 100644
> > --- a/CHANGELOG.md
> > +++ b/CHANGELOG.md
> > @@ -14,6 +14,7 @@ The format is based on [Keep a 
> > Changelog](https://keepachangelog.com/en/1.0.0/)
> > - HVM PIRQs are disabled by default.
> > - Reduce IOMMU setup time for hardware domain.
> > - Allow HVM/PVH domains to map foreign pages.
> > +   - Declare PVH dom0 supported with caveats.
> >   - xl/libxl configures vkb=[] for HVM domains with priority over 
> > vkb_device.
> >   - Increase the maximum number of CPUs Xen can be built for from 4095 to
> > 16383.
> > diff --git a/SUPPORT.md b/SUPPORT.md
> > index d5d60c62ec11..711aacf34662 100644
> > --- a/SUPPORT.md
> > +++ b/SUPPORT.md
> > @@ -161,7 +161,20 @@ Requires hardware virtualisation support (Intel VMX / 
> > AMD SVM).
> >  Dom0 support requires an IOMMU (Intel VT-d / AMD IOMMU).
> >
> >  Status, domU: Supported
> > -Status, dom0: Experimental
> > +Status, dom0: Supported, with caveats
> > +
> > +PVH dom0 hasn't received the same test coverage as PV dom0, so it can 
> > exhibit
> > +unexpected behavior or issues on some hardware.
> 
> What's the criteria for removing this paragraph?
> 
> FAOD I'm OK with it being checked in as-is, but I feel like this
> paragraph is somewhat anomalous, and would at least like to have an
> idea what might trigger its removal.

More testing is the only way this paragraph can be removed IMO.

For example I would be happy to remove it if dom0 PVH works on all
hardware in the XenRT lab.  So far the Linux dom0 version used by
XenServer is missing some required fixes for such testing to be
feasible.

Thanks, Roger.



Re: [PATCH] x86/EPT: relax iPAT for "invalid" MFNs

2024-06-11 Thread Roger Pau Monné
On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
> mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
> access to actual MMIO space should not generally be restricted to UC
> only; especially video frame buffer accesses are unduly affected by such
> a restriction. Permit PAT use for directly assigned MMIO as long as the
> domain is known to have been granted some level of cache control.
> 
> Signed-off-by: Jan Beulich 
> ---
> Considering that we've just declared PVH Dom0 "supported", this may well
> qualify for 4.19. The issue was specifically very noticeable there.
> 
> The conditional may be more complex than really necessary, but it's in
> line with what we do elsewhere. And imo better continue to be a little
> too restrictive, than moving to too lax.
> 
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
>  
>  if ( !mfn_valid(mfn) )
>  {
> -*ipat = true;
> +*ipat = type != p2m_mmio_direct ||
> +(!is_iommu_enabled(d) && !cache_flush_permitted(d));

Looking at this, shouldn't the !mfn_valid special case be removed, and
mfns without a valid page be processed normally, so that the guest
MTRR values are taken into account, and no iPAT is enforced?

I also think this likely wants a:

Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in 
epte_get_entry_emt()')

As AFAICT before that commit direct MMIO regions would set iPAT to WB,
which would result in the correct attributes (albeit guest MTRR was
still ignored).

Thanks, Roger.



Re: Segment truncation in multi-segment PCI handling?

2024-06-10 Thread Roger Pau Monné
On Mon, Jun 10, 2024 at 10:41:19AM +0200, Jan Beulich wrote:
> On 10.06.2024 10:28, Roger Pau Monné wrote:
> > On Mon, Jun 10, 2024 at 09:58:11AM +0200, Jan Beulich wrote:
> >> On 07.06.2024 21:52, Andrew Cooper wrote:
> >>> On 07/06/2024 8:46 pm, Marek Marczykowski-Górecki wrote:
> >>>> Hi,
> >>>>
> >>>> I've got a new system, and it has two PCI segments:
> >>>>
> >>>> 0000:00:00.0 Host bridge: Intel Corporation Device 7d14 (rev 04)
> >>>> 0000:00:02.0 VGA compatible controller: Intel Corporation Meteor 
> >>>> Lake-P [Intel Graphics] (rev 08)
> >>>> ...
> >>>> 10000:e0:06.0 System peripheral: Intel Corporation RST VMD Managed 
> >>>> Controller
> >>>> 10000:e0:06.2 PCI bridge: Intel Corporation Device 7ecb (rev 10)
> >>>> 10000:e1:00.0 Non-Volatile memory controller: Phison Electronics 
> >>>> Corporation PS5021-E21 PCIe4 NVMe Controller (DRAM-less) (rev 01)
> >>>>
> >>>> But looks like Xen doesn't handle it correctly:
> > 
> > In the meantime you can probably disable VMD from the firmware and the
> > NVMe devices should appear on the regular PCI bus.
> > 
> >>>> (XEN) 0000:e0:06.0: unknown type 0
> >>>> (XEN) 0000:e0:06.2: unknown type 0
> >>>> (XEN) 0000:e1:00.0: unknown type 0
> >>>> ...
> >>>> (XEN) ==== PCI devices ====
> >>>> (XEN) ==== segment 0000 ====
> >>>> (XEN) 0000:e1:00.0 - NULL - node -1 
> >>>> (XEN) 0000:e0:06.2 - NULL - node -1 
> >>>> (XEN) 0000:e0:06.0 - NULL - node -1 
> >>>> (XEN) 0000:2b:00.0 - d0 - node -1  - MSIs < 161 >
> >>>> (XEN) 0000:00:1f.6 - d0 - node -1  - MSIs < 148 >
> >>>> ...
> >>>>
> >>>> This isn't exactly surprising, since pci_sbdf_t.seg is uint16_t, so
> >>>> 0x10000 doesn't fit. OSDev wiki says PCI Express can have 65536 PCI
> >>>> Segment Groups, each with 256 bus segments.
> >>>>
> >>>> Fortunately, I don't need this to work, if I disable VMD in the
> >>>> firmware, I get a single segment and everything works fine.
> >>>>
> >>>
> >>> This is a known issue.  Work is being done, albeit slowly.
> >>
> >> Is work being done? After the design session in Prague I put it on my
> >> todo list, but at low priority. I'd be happy to take it off there if I
> >> knew someone else is looking into this.
> > 
> > We had a design session about VMD?  If so I'm afraid I've missed it.
> 
> In Prague last year, not just now in Lisbon.
> 
> >>> 0x10000 is indeed not a spec-compliant PCI segment.  It's something
> >>> model specific the Linux VMD driver is doing.
> >>
> >> I wouldn't call this "model specific" - this numbering is purely a
> >> software one (and would need coordinating between Dom0 and Xen).
> > 
> > Hm, TBH I'm not sure whether Xen needs to be aware of VMD devices.
> > The resources used by the VMD devices are all assigned to the VMD
> > root.  My current hypothesis is that it might be possible to manage
> > such devices without Xen being aware of their existence.
> 
> Well, it may be possible to have things work in Dom0 without Xen
> knowing much. Then Dom0 would need to suppress any physdevop calls
> with such software-only segment numbers (in order to at least not
> confuse Xen). I'd be curious though how e.g. MSI setup would work in
> such a scenario.

IIRC from my read of the spec, VMD devices don't use regular MSI
data/address fields, and instead configure an index into the MSI table
on the VMD root for the interrupt they want to use.  It's only the VMD
root device (which is a normal device on the PCI bus) that has
MSI(-X) configured with real vectors, and multiplexes interrupts for
all devices behind it.

If we had to passthrough VMD devices we might have to intercept writes
to the VMD MSI(-X) entries, but since they can only be safely assigned
to dom0 I think it's not an issue ATM (see below).

> Plus clearly any passing through of a device behind
> the VMD bridge will quite likely need Xen involvement (unless of
> course the only way of doing such pass-through was to pass on the
> entire hierarchy).

All VMD devices share the Requestor ID of the VMD root, so AFAIK it's
not possible to passthrough them (unless you passthrough the whole VMD
root) because they all share the same context entry on the IOMMU.

Thanks, Roger.



Re: Segment truncation in multi-segment PCI handling?

2024-06-10 Thread Roger Pau Monné
On Mon, Jun 10, 2024 at 09:58:11AM +0200, Jan Beulich wrote:
> On 07.06.2024 21:52, Andrew Cooper wrote:
> > On 07/06/2024 8:46 pm, Marek Marczykowski-Górecki wrote:
> >> Hi,
> >>
> >> I've got a new system, and it has two PCI segments:
> >>
> >> 0000:00:00.0 Host bridge: Intel Corporation Device 7d14 (rev 04)
> >> 0000:00:02.0 VGA compatible controller: Intel Corporation Meteor 
> >> Lake-P [Intel Graphics] (rev 08)
> >> ...
> >> 10000:e0:06.0 System peripheral: Intel Corporation RST VMD Managed 
> >> Controller
> >> 10000:e0:06.2 PCI bridge: Intel Corporation Device 7ecb (rev 10)
> >> 10000:e1:00.0 Non-Volatile memory controller: Phison Electronics 
> >> Corporation PS5021-E21 PCIe4 NVMe Controller (DRAM-less) (rev 01)
> >>
> >> But looks like Xen doesn't handle it correctly:

In the meantime you can probably disable VMD from the firmware and the
NVMe devices should appear on the regular PCI bus.

> >> (XEN) 0000:e0:06.0: unknown type 0
> >> (XEN) 0000:e0:06.2: unknown type 0
> >> (XEN) 0000:e1:00.0: unknown type 0
> >> ...
> >> (XEN) ==== PCI devices ====
> >> (XEN) ==== segment 0000 ====
> >> (XEN) 0000:e1:00.0 - NULL - node -1 
> >> (XEN) 0000:e0:06.2 - NULL - node -1 
> >> (XEN) 0000:e0:06.0 - NULL - node -1 
> >> (XEN) 0000:2b:00.0 - d0 - node -1  - MSIs < 161 >
> >> (XEN) 0000:00:1f.6 - d0 - node -1  - MSIs < 148 >
> >> ...
> >>
> >> This isn't exactly surprising, since pci_sbdf_t.seg is uint16_t, so
> >> 0x10000 doesn't fit. OSDev wiki says PCI Express can have 65536 PCI
> >> Segment Groups, each with 256 bus segments.
> >>
> >> Fortunately, I don't need this to work, if I disable VMD in the
> >> firmware, I get a single segment and everything works fine.
> >>
> > 
> > This is a known issue.  Work is being done, albeit slowly.
> 
> Is work being done? After the design session in Prague I put it on my
> todo list, but at low priority. I'd be happy to take it off there if I
> knew someone else is looking into this.

We had a design session about VMD?  If so I'm afraid I've missed it.

> > 0x10000 is indeed not a spec-compliant PCI segment.  It's something
> > model specific the Linux VMD driver is doing.
> 
> I wouldn't call this "model specific" - this numbering is purely a
> software one (and would need coordinating between Dom0 and Xen).

Hm, TBH I'm not sure whether Xen needs to be aware of VMD devices.
The resources used by the VMD devices are all assigned to the VMD
root.  My current hypothesis is that it might be possible to manage
such devices without Xen being aware of their existence.

Regards, Roger.



Re: [PATCH for-4.19] x86/pvh: declare PVH dom0 supported with caveats

2024-06-07 Thread Roger Pau Monné
On Fri, Jun 07, 2024 at 12:03:20PM +0200, Roger Pau Monne wrote:
> PVH dom0 is functionally very similar to PVH domU except for the domain
> builder and the added set of hypercalls available to it.
> 
> The main concern with declaring it "Supported" is the lack of some features
> when compared to classic PV dom0, hence switch its status to supported with
> caveats.  List the known missing features, there might be more features 
> missing
> or not working as expected apart from the ones listed.
> 
> Note there's some (limited) PVH dom0 testing on both osstest and gitlab.
> 
> Signed-off-by: Roger Pau Monné 
> ---
> Hopefully this will attract more testing and resources to PVH dom0 in order to
> try to finish the missing features.
> ---
>  CHANGELOG.md |  1 +
>  SUPPORT.md   | 15 ++-

Bah, forgot to remove the boot warning message, will send v2.

Sorry, Roger.



Re: [PATCH v3 2/6] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-05-31 Thread Roger Pau Monné
On Thu, May 30, 2024 at 12:08:26PM +0100, Andrew Cooper wrote:
> On 29/05/2024 3:32 pm, Alejandro Vallejo wrote:
> > diff --git a/xen/lib/x86/policy.c b/xen/lib/x86/policy.c
> > index f033d22785be..b70b22d55fcf 100644
> > --- a/xen/lib/x86/policy.c
> > +++ b/xen/lib/x86/policy.c
> > @@ -2,6 +2,17 @@
> >  
> >  #include 
> >  
> > +uint32_t x86_x2apic_id_from_vcpu_id(const struct cpu_policy *p, uint32_t 
> > id)
> > +{
> > +/*
> > + * TODO: Derive x2APIC ID from the topology information inside `p`
> > + *   rather than from the vCPU ID alone. This bodge is a temporary
> > + *   measure until all infra is in place to retrieve or derive the
> > + *   initial x2APIC ID from migrated domains.
> > + */
> > +return id * 2;
> > +}
> > +
> 
> I'm afraid it's nonsensical to try and derive x2APIC ID from a
> policy+vcpu_id.
> 
> Take a step back, and think the data through.
> 
> A VM has:
> * A unique APIC_ID for each vCPU
> * Info in CPUID describing how to decompose the APIC_ID into topology
> 
> Right now, because this is all completely broken, we have:
> * Hardcoded APIC_ID = vCPU_ID * 2
> * Total nonsense in CPUID
> 
> 
> When constructing a VM, the toolstack (given suitable admin
> guidance/defaults) *must* choose both:
> * The APIC_ID themselves
> * The CPUID topo data to match
> 
> i.e. this series should be editing the toolstack's call to
> xc_domain_hvm_setcontext().
> 
> It's not, because AFAICT you're depending on the migration compatibility
> logic and inserting a new hardcoded assumption about symmetry of the layout.
> 
> 
> The data flows we need are:
> 
> (New) create:
> * Toolstack chooses both parts of topo information
> * Xen needs a default, which reasonably can be APIC_ID=vCPU_ID when the
> rest of the data flow has been cleaned up.  But this needs to be
> explicit in vcpu_create() and without reference to the policy.

Doesn't using APIC_ID=vCPU_ID limit us to only being able to expose
certain topologies (as vCPU IDs are contiguous)?  For example,
exposing a topology with 3 cores per package won't be possible?

Not saying it's a bad move to start this way, but if we want to
support exposing more exotic topology sooner or later we will need
some kind of logic that assigns the APIC IDs based on the knowledge of
the expected topology.  Whether is gets such knowledge from the CPU
policy or directly from the toolstack is another question.

Thanks, Roger.



Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway

2024-05-31 Thread Roger Pau Monné
On Fri, May 31, 2024 at 10:33:58AM +0200, Jan Beulich wrote:
> On 31.05.2024 09:31, Roger Pau Monné wrote:
> > On Fri, May 31, 2024 at 09:02:20AM +0200, Jan Beulich wrote:
> >> On 29.05.2024 18:14, Roger Pau Monné wrote:
> >>> On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
> >>>> On 29.05.2024 17:03, Roger Pau Monné wrote:
> >>>>> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
> >>>>>> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>>>>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() 
> >>>>>>> does so from
> >>>>>>> a cpu_hotplug_{begin,done}() region the function will still return 
> >>>>>>> success,
> >>>>>>> because a CPU taking the rwlock in read mode after having taken it in 
> >>>>>>> write
> >>>>>>> mode is allowed.  Such behavior however defeats the purpose of 
> >>>>>>> get_cpu_maps(),
> >>>>>>> as it should always return false when a CPU hot{,un}plug 
> >>>>>>> operation
> >>>>>>> is in progress.
> >>>>>>
> >>>>>> I'm not sure I can agree with this. The CPU doing said operation ought 
> >>>>>> to be
> >>>>>> aware of what it is itself doing. And all other CPUs will get back 
> >>>>>> false from
> >>>>>> get_cpu_maps().
> >>>>>
> >>>>> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
> >>>>> the interrupts that might be handled while that operation is in
> >>>>> progress, see below for a concrete example.
> >>>>>
> >>>>>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
> >>>>>>> as it could decide to use the shorthand even when a CPU operation is 
> >>>>>>> in
> >>>>>>> progress.
> >>>>>>
> >>>>>> It's also not becoming clear what's wrong there: As long as a CPU isn't
> >>>>>> offline enough to not be in cpu_online_map anymore, it may well need 
> >>>>>> to still
> >>>>>> be the target of IPIs, and targeting it with a shorthand then is still 
> >>>>>> fine.
> >>>>>
> >>>>> The issue is in the online path: there's a window where the CPU is
> >>>>> online (and the lapic active), but cpu_online_map hasn't been updated
> >>>>> yet.  A specific example would be time_calibration() being executed on
> >>>>> the CPU that is running cpu_up().  That could result in a shorthand
> >>>>> IPI being used, but the mask in r.cpu_calibration_map not containing
> >>>>> the CPU that's being brought up online because it's not yet added to
> >>>>> cpu_online_map.  Then the number of CPUs actually running
> >>>>> time_calibration_rendezvous_fn won't match the weight of the cpumask
> >>>>> in r.cpu_calibration_map.
> >>>>
> >>>> I see, but maybe only partly. Prior to the CPU having its bit set in
> >>>> cpu_online_map, can it really take interrupts already? Shouldn't it be
> >>>> running with IRQs off until later, thus preventing it from making it
> >>>> into the rendezvous function in the first place? But yes, I can see
> >>>> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
> >>>> might cause problems, too.
> >>>
> >>> The interrupt will get set in IRR and handled when interrupts are
> >>> enabled.
> >>>
> >>>>
> >>>> Plus, with how the rendezvous function is invoked (via
> >>>> on_selected_cpus() with the mask copied from cpu_online_map), the
> >>>> first check in smp_call_function_interrupt() ought to prevent the
> >>>> function from being called on the CPU being onlined. A problem would
> >>>> arise though if the IPI arrived later and call_data was already
> >>>> (partly or fully) overwritten with the next request.
> >>>
> >>> Yeah, there's a small window where the fields in call_data are out of
> >>> sync.
> >>>
> >>>>>> In any event this would again affect only the CPU leading the CPU 
> >>>>>> operati

Re: [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works

2024-05-31 Thread Roger Pau Monné
On Fri, May 31, 2024 at 09:06:10AM +0200, Jan Beulich wrote:
> On 29.05.2024 17:28, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 03:57:19PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>> --- a/xen/arch/x86/include/asm/irq.h
> >>> +++ b/xen/arch/x86/include/asm/irq.h
> >>> @@ -28,6 +28,32 @@ typedef struct {
> >>>  
> >>>  struct irq_desc;
> >>>  
> >>> +/*
> >>> + * Xen logic for moving interrupts around CPUs allows manipulating 
> >>> interrupts
> >>> + * that target remote CPUs.  The logic to move an interrupt from CPU(s) 
> >>> is as
> >>> + * follows:
> >>> + *
> >>> + * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
> >>> + * 2. New cpu_mask and vector are set, vector is setup at the new 
> >>> destination.
> >>> + * 3. move_in_progress is set.
> >>> + * 4. Interrupt source is updated to target new CPU and vector.
> >>> + * 5. Interrupts arriving at old_cpu_mask are processed normally.
> >>> + * 6. When an interrupt is delivered at the new destination (cpu_mask) 
> >>> as part
> >>> + *of acking the interrupt move_in_progress is cleared and 
> >>> move_cleanup_count
> >>
> >> Nit: A comma after "interrupt" may help reading.
> >>
> >>> + *is set to the weight of online CPUs in old_cpu_mask.
> >>> + *IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.
> >>
> >> These last two steps aren't precise enough, compared to what the code does.
> >> old_cpu_mask is first reduced to online CPUs therein. If the result is non-
> >> empty, what you describe is done. If, however, the result is empty, the
> >> vector is released right away (this code may be there just in case, but I
> >> think it shouldn't be omitted here).
> > 
> > I've left that out because I got the impression it made the text more
> > complex to follow (with the extra branch) for no real benefit, but I'm
> > happy to attempt to add it.
> 
> Why "no real benefit"? Isn't the goal to accurately describe what code does
> (in various places)? If the result isn't an accurate description in one
> specific regard, how reliable would the rest be from a reader's perspective?

FWIW, it seemed to me the reduction of old_cpu_mask was (kind of) a
shortcut to what the normal path does, by releasing the vector early
if there are no online CPUs in old_cpu_mask.

Now that you made me look into it, I think after this series the
old_cpu_mask should never contain offline CPUs, as fixup_irqs() will
take care of removing offlined CPUs from old_cpu_mask, and freeing the
vector if the set becomes empty.

I will expand the comment to mention this case, and consider adjusting
it if this series gets merged.
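
For reference, the fixup_irqs() behaviour described above amounts to
something like the below (an illustration only; release_old_vec() stands in
for whatever helper actually frees the old vector, and the field names
follow the quoted comment):

    /* Drop offlined CPUs from the old destination set... */
    cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
                &cpu_online_map);
    /* ...and release the old vector once no online CPU remains in it. */
    if ( cpumask_empty(desc->arch.old_cpu_mask) )
        release_old_vec(desc); /* hypothetical helper name */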

> >>> + * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean 
> >>> the
> >>> + *vector entry and decrease the count in move_cleanup_count.  The 
> >>> CPU that
> >>> + *sets move_cleanup_count to 0 releases the vector.
> >>> + *
> >>> + * Note that when interrupt movement (either move_in_progress or
> >>> + * move_cleanup_count set) is in progress it's not possible to move the
> >>> + * interrupt to yet a different CPU.
> >>> + *
> >>> + * By keeping the vector in the old CPU(s) configured until the 
> >>> interrupt is
> >>> + * acked on the new destination Xen allows draining any pending 
> >>> interrupts at
> >>> + * the old destinations.
> >>> + */
> >>>  struct arch_irq_desc {
> >>>  s16 vector;  /* vector itself is only 8 bits, */
> >>>  s16 old_vector;  /* but we use -1 for unassigned  */
> >>
> >> I take it that it is not a goal to (also) describe under what conditions
> >> an IRQ move may actually be initiated (IRQ_MOVE_PENDING)? I ask not the
> >> least because the 2nd from last paragraph lightly touches that area.
> > 
> > Right, I was mostly focused on moves (forcefully) initiated from
> > fixup_irqs(), which is different from the opportunistic affinity
> > changes signaled by IRQ_MOVE_PENDING.
> > 
> > Not sure whether I want to mention this ahead of the list in a
> > paragraph, or just add it as a step.  Do you have any preference?
> 
> I think ahead might be better. But I also won't insist on it being added.
> Just if you don't, perhaps mention in the description that leaving that
> out is intentional.

No, I'm fine with adding it.

Thanks, Roger.



Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway

2024-05-31 Thread Roger Pau Monné
On Fri, May 31, 2024 at 09:02:20AM +0200, Jan Beulich wrote:
> On 29.05.2024 18:14, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 17:03, Roger Pau Monné wrote:
> >>> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
> >>>> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does 
> >>>>> so from
> >>>>> a cpu_hotplug_{begin,done}() region the function will still return 
> >>>>> success,
> >>>>> because a CPU taking the rwlock in read mode after having taken it in 
> >>>>> write
> >>>>> mode is allowed.  Such behavior however defeats the purpose of 
> >>>>> get_cpu_maps(),
> >>>>> as it should always return false when a CPU hot{,un}plug 
> >>>>> operation
> >>>>> is in progress.
> >>>>
> >>>> I'm not sure I can agree with this. The CPU doing said operation ought 
> >>>> to be
> >>>> aware of what it is itself doing. And all other CPUs will get back false 
> >>>> from
> >>>> get_cpu_maps().
> >>>
> >>> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
> >>> the interrupts that might be handled while that operation is in
> >>> progress, see below for a concrete example.
> >>>
> >>>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
> >>>>> as it could decide to use the shorthand even when a CPU operation is in
> >>>>> progress.
> >>>>
> >>>> It's also not becoming clear what's wrong there: As long as a CPU isn't
> >>>> offline enough to not be in cpu_online_map anymore, it may well need to 
> >>>> still
> >>>> be the target of IPIs, and targeting it with a shorthand then is still 
> >>>> fine.
> >>>
> >>> The issue is in the online path: there's a window where the CPU is
> >>> online (and the lapic active), but cpu_online_map hasn't been updated
> >>> yet.  A specific example would be time_calibration() being executed on
> >>> the CPU that is running cpu_up().  That could result in a shorthand
> >>> IPI being used, but the mask in r.cpu_calibration_map not containing
> >>> the CPU that's being brought up online because it's not yet added to
> >>> cpu_online_map.  Then the number of CPUs actually running
> >>> time_calibration_rendezvous_fn won't match the weight of the cpumask
> >>> in r.cpu_calibration_map.
> >>
> >> I see, but maybe only partly. Prior to the CPU having its bit set in
> >> cpu_online_map, can it really take interrupts already? Shouldn't it be
> >> running with IRQs off until later, thus preventing it from making it
> >> into the rendezvous function in the first place? But yes, I can see
> >> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
> >> might cause problems, too.
> > 
> > The interrupt will get set in IRR and handled when interrupts are
> > enabled.
> > 
> >>
> >> Plus, with how the rendezvous function is invoked (via
> >> on_selected_cpus() with the mask copied from cpu_online_map), the
> >> first check in smp_call_function_interrupt() ought to prevent the
> >> function from being called on the CPU being onlined. A problem would
> >> arise though if the IPI arrived later and call_data was already
> >> (partly or fully) overwritten with the next request.
> > 
> > Yeah, there's a small window where the fields in call_data are out of
> > sync.
> > 
> >>>> In any event this would again affect only the CPU leading the CPU 
> >>>> operation,
> >>>> which should clearly know at which point(s) it is okay to send IPIs. Are 
> >>>> we
> >>>> actually sending any IPIs from within CPU-online or CPU-offline paths?
> >>>
> >>> Yes, I've seen the time rendezvous happening while in the middle of a
> >>> hotplug operation, and the CPU coordinating the rendezvous being the
> >>> one doing the CPU hotplug operation, so get_cpu_maps() returning true.
> >>
> >> Right, yet together with ...
> >>
> >>>> Together with the earlier paragraph the critical window would be between 
> >>>> the
> >>>> CPU 

Re: [PATCH v3 2/6] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-05-30 Thread Roger Pau Monné
On Thu, May 30, 2024 at 02:48:10PM +0100, Alejandro Vallejo wrote:
> I'll try to do that soon-ish. I suspect the pain points are going to be
> making it work nicely as well on 1vCPU systems with no APIC (are
> those expected to work?).

We do not allow creation of PVH/HVM domains without an emulated local
APIC, and I don't think we ever want to allow doing so (see
emulation_flags_ok()).

Thanks, Roger.



Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic

2024-05-30 Thread Roger Pau Monné
On Thu, May 30, 2024 at 02:45:18PM +0200, Jürgen Groß wrote:
> On 29.05.24 18:03, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 03:08:49PM +0200, Jürgen Groß wrote:
> > > On 29.05.24 14:46, Roger Pau Monné wrote:
> > > > On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote:
> > > > > On 28.05.24 13:22, Roger Pau Monné wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > When the stop_machine_run() call in cpu_down() fails and calls the 
> > > > > > CPU
> > > > > > notifier CPU_DOWN_FAILED hook the following assert triggers in the
> > > > > > scheduling code:
> > > > > > 
> > > > > > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at 
> > > > > > common/sched/cred1
> > > > > > [ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
> > > > > > CPU:0
> > > > > > RIP:e008:[] 
> > > > > > common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
> > > > > > RFLAGS: 00010093   CONTEXT: hypervisor
> > > > > > rax:    rbx: 83202ecc2f80   rcx: 
> > > > > > 83202f3e64c0
> > > > > > rdx: 0001   rsi: 0002   rdi: 
> > > > > > 83202ecc2f88
> > > > > > rbp: 83203d58   rsp: 83203d30   r8:  
> > > > > > 
> > > > > > r9:  83202f3e6e01   r10:    r11: 
> > > > > > 0f0f0f0f0f0f0f0f
> > > > > > r12: 83202ecb80b0   r13: 0001   r14: 
> > > > > > 0282
> > > > > > r15: 83202ecbbf00   cr0: 8005003b   cr4: 
> > > > > > 007526e0
> > > > > > cr3: 574c2000   cr2: 
> > > > > > fsb:    gsb:    gss: 
> > > > > > 
> > > > > > ds:    es:    fs:    gs:    ss:    cs: e008
> > > > > > Xen code around  
> > > > > > (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177):
> > > > > > fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 
> > > > > > 48 08 75 2e
> > > > > > Xen stack trace from rsp=83203d30:
> > > > > >   83202d74d100 0001 82d0404c4430 
> > > > > > 0006
> > > > > >    83203d78 82d040257454 
> > > > > > 
> > > > > >   0001 83203da8 82d04021f303 
> > > > > > 82d0404c4628
> > > > > >   82d0404c4620 82d0404c4430 0006 
> > > > > > 83203df0
> > > > > >   82d04022bc4c 83203e18 0001 
> > > > > > 0001
> > > > > >   fff0   
> > > > > > 82d0405e6500
> > > > > >   83203e08 82d040204fd5 0001 
> > > > > > 83203e30
> > > > > >   82d0402054f0 82d0404c5860 0001 
> > > > > > 83202ec75000
> > > > > >   83203e48 82d040348c25 83202d74d0d0 
> > > > > > 83203e68
> > > > > >   82d0402071aa 83202ec751d0 82d0405ce210 
> > > > > > 83203e80
> > > > > >   82d0402343c9 82d0405ce200 83203eb0 
> > > > > > 82d040234631
> > > > > >    7fff 82d0405d5080 
> > > > > > 82d0405ce210
> > > > > >   83203ee8 82d040321411 82d040321399 
> > > > > > 83202f3a9000
> > > > > >    001d91a6fa2d 82d0405e6500 
> > > > > > 83203de0
> > > > > >   82d040324391   
> > > > > > 
> > > > > >      
> > > > > > 
> > > > > >      
> > > > > > 
> > > > &

Re: [PATCH] tools: (Actually) drop libsystemd as a dependency

2024-05-30 Thread Roger Pau Monné
On Thu, May 30, 2024 at 12:12:19PM +0100, Andrew Cooper wrote:
> On 30/05/2024 12:02 pm, Roger Pau Monné wrote:
> > On Thu, May 30, 2024 at 11:14:39AM +0100, Andrew Cooper wrote:
> >> When reinstating some of systemd.m4 between v1 and v2, I reintroduced a 
> >> little
> >> too much.  While {c,o}xenstored are indeed no longer linked against
> >> libsystemd, ./configure still looks for it.
> >>
> >> Drop this too.
> >>
> >> Fixes: ae26101f6bfc ("tools: Drop libsystemd as a dependency")
> >> Signed-off-by: Andrew Cooper 
> > LGTM, but my knowledge of systemd is very limited.
> >
> > Reviewed-by: Roger Pau Monné 
> 
> Thanks.  TBH, this is all M4/autoconf, rather than systemd.

Right, but it's about systemd dependencies which is what I don't know
about.  The m4 stuff LGTM, whether it's appropriate to drop the
dependency is what I can't be sure about.

Thanks, Roger.



Re: [PATCH] tools: (Actually) drop libsystemd as a dependency

2024-05-30 Thread Roger Pau Monné
On Thu, May 30, 2024 at 11:14:39AM +0100, Andrew Cooper wrote:
> When reinstating some of systemd.m4 between v1 and v2, I reintroduced a little
> too much.  While {c,o}xenstored are indeed no longer linked against
> libsystemd, ./configure still looks for it.
> 
> Drop this too.
> 
> Fixes: ae26101f6bfc ("tools: Drop libsystemd as a dependency")
> Signed-off-by: Andrew Cooper 

LGTM, but my knowledge of systemd is very limited.

Reviewed-by: Roger Pau Monné 

Thanks, Roger.



Re: [PATCH v3 2/6] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-05-30 Thread Roger Pau Monné
On Wed, May 29, 2024 at 03:32:31PM +0100, Alejandro Vallejo wrote:
> This allows the initial x2APIC ID to be sent on the migration stream. The
> hardcoded mapping x2apic_id=2*vcpu_id is maintained for the time being.
> Given the vlapic data is zero-extended on restore, fix up migrations from
> hosts without the field by setting it to the old convention if zero.
> 
> x2APIC IDs are calculated from the CPU policy where the guest topology is
> defined. For the time being, the function simply returns the old
> relationship, but will eventually return results consistent with the
> topology.
> 
> Signed-off-by: Alejandro Vallejo 

Reviewed-by: Roger Pau Monné 

Thanks, Roger.
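
To make the restore-time fallback concrete, here is a minimal standalone
sketch of the zero-extension fixup described in the commit message above.
This is not the Xen code (the real logic lives in lapic_load_fixup()); it
only models the "x2apic_id = 2 * vcpu_id" convention and the "use the old
convention if the loaded value is zero" rule:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Old hardcoded relation: x2apic_id = 2 * vcpu_id. */
    static uint32_t legacy_x2apic_id(uint32_t vcpu_id)
    {
        return vcpu_id * 2;
    }

    /*
     * Records saved by hosts without the new field are zero-extended, so a
     * zero x2apic_id means "old stream" and the legacy relation is applied
     * (vCPU 0 maps to 0 either way).
     */
    static uint32_t fixup_x2apic_id(uint32_t vcpu_id, uint32_t loaded_id)
    {
        return loaded_id ? loaded_id : legacy_x2apic_id(vcpu_id);
    }

    int main(void)
    {
        for ( uint32_t v = 0; v < 4; v++ )
            printf("vcpu %" PRIu32 " -> x2apic_id %" PRIu32 "\n",
                   v, fixup_x2apic_id(v, 0));
        return 0;
    }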



Re: [PATCH v3 1/6] x86/vlapic: Move lapic migration checks to the check hooks

2024-05-30 Thread Roger Pau Monné
On Wed, May 29, 2024 at 03:32:30PM +0100, Alejandro Vallejo wrote:
> While doing this, factor out checks common to architectural and hidden state.
> 
> Signed-off-by: Alejandro Vallejo 

Reviewed-by: Roger Pau Monné 

With the BUG() possibly replaced with ASSERT_UNREACHABLE(),

> ---
> v3:
>   * Moved from v2/patch3.
>   * Added check hook for the architectural state as well.
>   * Use domain_vcpu() rather than the previous open coded checks for vcpu 
> range.
> ---
>  xen/arch/x86/hvm/vlapic.c | 81 +--
>  1 file changed, 53 insertions(+), 28 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> index 9cfc82666ae5..a0df62b5ec0a 100644
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -1553,60 +1553,85 @@ static void lapic_load_fixup(struct vlapic *vlapic)
> v, vlapic->loaded.id, vlapic->loaded.ldr, good_ldr);
>  }
>  
> -static int cf_check lapic_load_hidden(struct domain *d, hvm_domain_context_t 
> *h)
> -{
> -unsigned int vcpuid = hvm_load_instance(h);
> -struct vcpu *v;
> -struct vlapic *s;
>  
> +static int lapic_check_common(const struct domain *d, unsigned int vcpuid)
> +{
>  if ( !has_vlapic(d) )
>  return -ENODEV;
>  
>  /* Which vlapic to load? */
> -if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
> +if ( !domain_vcpu(d, vcpuid) )
>  {
>  dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no apic%u\n",
>  d->domain_id, vcpuid);

The message here is kind of misleading as printing apic%u makes it
look like it's an APIC ID, but it's a vCPU ID.  It would be best to
just print: "HVM restore: dom%d has no vCPU %u\n".

>  return -EINVAL;
>  }
> -s = vcpu_vlapic(v);
>  
> >> -if ( hvm_load_entry_zeroextend(LAPIC, h, &s->hw) != 0 )
> +return 0;
> +}
> +
> +static int cf_check lapic_check_hidden(const struct domain *d,
> +   hvm_domain_context_t *h)
> +{
> +unsigned int vcpuid = hvm_load_instance(h);
> +struct hvm_hw_lapic s;
> +int rc;
> +
> +if ( (rc = lapic_check_common(d, vcpuid)) )
> +return rc;

Nit: I don't much like assigning values inside of conditions, I would
rather do:

int rc = lapic_check_common(d, vcpuid);

if ( rc )
return rc;

> +
> >> +if ( hvm_load_entry_zeroextend(LAPIC, h, &s) != 0 )
> +return -ENODATA;
> +
> +/* EN=0 with EXTD=1 is illegal */
> +if ( (s.apic_base_msr & (APIC_BASE_ENABLE | APIC_BASE_EXTD)) ==
> + APIC_BASE_EXTD )
>  return -EINVAL;
>  
> +return 0;
> +}
> +
> +static int cf_check lapic_load_hidden(struct domain *d, hvm_domain_context_t 
> *h)
> +{
> +unsigned int vcpuid = hvm_load_instance(h);
> +struct vcpu *v = d->vcpu[vcpuid];

Not sure whether it's worth using domain_vcpu() here.  We have already
checked the vCPU is valid.

> +struct vlapic *s = vcpu_vlapic(v);
> +
> >> +if ( hvm_load_entry_zeroextend(LAPIC, h, &s->hw) != 0 )
> +BUG();

I would use { ASSERT_UNREACHABLE(); return -EINVAL; } here, there's
IMO no strict reason to panic on non-debug builds.

Thanks, Roger.



Re: [PATCH v4 2/2] tools/xg: Clean up xend-style overrides for CPU policies

2024-05-30 Thread Roger Pau Monné
On Wed, May 29, 2024 at 03:30:38PM +0100, Alejandro Vallejo wrote:
> Factor out policy getters/setters from both (CPUID and MSR) policy override
> functions. Additionally, use host policy rather than featureset when
> preparing the cur policy, saving one hypercall and several lines of
> boilerplate.
> 
> No functional change intended.

One change that's worth mentioning is that now the policy gets set
only once per domain, as the whole policy is prepared and uploaded to
the hypervisor, rather than uploading a partial policy which is then
further adjusted by xc_{cpuid_xend,msr}_policy() calls.

> 
> Signed-off-by: Alejandro Vallejo 

Reviewed-by: Roger Pau Monné 

> ---
> v4:
>   * Indentation adjustment.
>   * Fix unhandled corner case using bsearch() with MSR and leaf buffers
> ---
>  tools/libs/guest/xg_cpuid_x86.c | 437 ++--
>  1 file changed, 130 insertions(+), 307 deletions(-)
> 
> diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
> index 6cab5c60bb41..552ec2ab7312 100644
> --- a/tools/libs/guest/xg_cpuid_x86.c
> +++ b/tools/libs/guest/xg_cpuid_x86.c
> @@ -36,6 +36,34 @@ enum {
>  #define bitmaskof(idx)  (1u << ((idx) & 31))
>  #define featureword_of(idx) ((idx) >> 5)
>  
> +static int deserialize_policy(xc_interface *xch, xc_cpu_policy_t *policy)
> +{
> +uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
> +int rc;
> +
> > +rc = x86_cpuid_copy_from_buffer(&policy->policy, policy->leaves,
> > +policy->nr_leaves, &err_leaf, &err_subleaf);
> +if ( rc )
> +{
> +if ( err_leaf != -1 )
> +ERROR("Failed to deserialise CPUID (err leaf %#x, subleaf %#x) 
> (%d = %s)",
> +  err_leaf, err_subleaf, -rc, strerror(-rc));
> +return rc;
> +}
> +
> > +rc = x86_msr_copy_from_buffer(&policy->policy, policy->msrs,
> > +  policy->nr_msrs, &err_msr);
> +if ( rc )
> +{
> +if ( err_msr != -1 )
> +ERROR("Failed to deserialise MSR (err MSR %#x) (%d = %s)",
> +  err_msr, -rc, strerror(-rc));
> +return rc;
> +}
> +
> +return 0;
> +}
> +
>  int xc_get_cpu_levelling_caps(xc_interface *xch, uint32_t *caps)
>  {
>  struct xen_sysctl sysctl = {};
> @@ -260,102 +288,37 @@ static int compare_leaves(const void *l, const void *r)
>  return 0;
>  }
>  
> -static xen_cpuid_leaf_t *find_leaf(
> -xen_cpuid_leaf_t *leaves, unsigned int nr_leaves,
> -const struct xc_xend_cpuid *xend)
> +static xen_cpuid_leaf_t *find_leaf(xc_cpu_policy_t *p,
> +   const struct xc_xend_cpuid *xend)
>  {
>  const xen_cpuid_leaf_t key = { xend->leaf, xend->subleaf };
>  
> > -return bsearch(&key, leaves, nr_leaves, sizeof(*leaves), compare_leaves);
> > +return bsearch(&key, p->leaves, p->nr_leaves,
> +   sizeof(*p->leaves), compare_leaves);
>  }
>  
> -static int xc_cpuid_xend_policy(
> -xc_interface *xch, uint32_t domid, const struct xc_xend_cpuid *xend)
> +static int xc_cpuid_xend_policy(xc_interface *xch, uint32_t domid,
> +const struct xc_xend_cpuid *xend,
> +xc_cpu_policy_t *host,
> +xc_cpu_policy_t *def,
> +xc_cpu_policy_t *cur)
>  {
> -int rc;
> -bool hvm;
> -xc_domaininfo_t di;
> -unsigned int nr_leaves, nr_msrs;
> -uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
> -/*
> - * Three full policies.  The host, default for the domain type,
> - * and domain current.
> - */
> -xen_cpuid_leaf_t *host = NULL, *def = NULL, *cur = NULL;
> -unsigned int nr_host, nr_def, nr_cur;
> -
> > -if ( (rc = xc_domain_getinfo_single(xch, domid, &di)) < 0 )
> -{
> -PERROR("Failed to obtain d%d info", domid);
> -rc = -errno;
> -goto fail;
> -}
> -hvm = di.flags & XEN_DOMINF_hvm_guest;
> -
> > -rc = xc_cpu_policy_get_size(xch, &nr_leaves, &nr_msrs);
> -if ( rc )
> -{
> -PERROR("Failed to obtain policy info size");
> -rc = -errno;
> -goto fail;
> -}
> -
> -rc = -ENOMEM;
> -if ( (host = calloc(nr_leaves, sizeof(*host))) == NULL ||
> - (def  = calloc(nr_leaves, sizeof(*def)))  == NULL ||
> - (cur  = calloc(nr_leaves, sizeof(*cur)))  == NULL )
> -{
> -ERROR("Unable to allocate memory for %u CPUID leaves", nr_leaves);
> -goto fail;
> -}
> -

Re: [PATCH] x86/hvm: allow XENMEM_machine_memory_map

2024-05-30 Thread Roger Pau Monné
On Thu, May 30, 2024 at 09:04:08AM +0100, Andrew Cooper wrote:
> On 30/05/2024 8:53 am, Roger Pau Monne wrote:
> > For HVM based control domains XENMEM_machine_memory_map must be available so
> > that the `e820_host` xl.cfg option can be used.
> >
> > Signed-off-by: Roger Pau Monné 
> 
> Seems safe enough to allow.
> 
> Does this want a reported-by, or some further discussion about how it
> was found?

I've found it while attempting to repro an issue with e820_host
reported by Marek, but the issue he reported is not related to this.
It's just that I have most of my test systems set as PVH dom0.

> Also, as it's mostly PVH Dom0 bugfixing, shouldn't we want it in 4.19?

Yeah, forgot to add the for-4.19 line and Oleksii, adding him now for
consideration for 4.19.

Thanks, Roger.



Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
> On 29.05.2024 17:03, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does 
> >>> so from
> >>> a cpu_hotplug_{begin,done}() region the function will still return 
> >>> success,
> >>> because a CPU taking the rwlock in read mode after having taken it in 
> >>> write
> >>> mode is allowed.  Such behavior however defeats the purpose of 
> >>> get_cpu_maps(),
> >>> as it should always return false when called while a CPU hot{,un}plug
> >>> operation is in progress.
> >>
> >> I'm not sure I can agree with this. The CPU doing said operation ought to 
> >> be
> >> aware of what it is itself doing. And all other CPUs will get back false 
> >> from
> >> get_cpu_maps().
> > 
> > Well, the CPU is aware in the context of cpu_{up,down}(), but not in
> > the interrupts that might be handled while that operation is in
> > progress, see below for a concrete example.
> > 
> >>>  Otherwise the logic in send_IPI_mask() for example is wrong,
> >>> as it could decide to use the shorthand even when a CPU operation is in
> >>> progress.
> >>
> >> It's also not becoming clear what's wrong there: As long as a CPU isn't
> >> offline enough to not be in cpu_online_map anymore, it may well need to 
> >> still
> >> be the target of IPIs, and targeting it with a shorthand then is still 
> >> fine.
> > 
> > The issue is in the online path: there's a window where the CPU is
> > online (and the lapic active), but cpu_online_map hasn't been updated
> > yet.  A specific example would be time_calibration() being executed on
> > the CPU that is running cpu_up().  That could result in a shorthand
> > IPI being used, but the mask in r.cpu_calibration_map not containing
> > the CPU that's being brought up online because it's not yet added to
> > cpu_online_map.  Then the number of CPUs actually running
> > time_calibration_rendezvous_fn won't match the weight of the cpumask
> > in r.cpu_calibration_map.
> 
> I see, but maybe only partly. Prior to the CPU having its bit set in
> cpu_online_map, can it really take interrupts already? Shouldn't it be
> running with IRQs off until later, thus preventing it from making it
> into the rendezvous function in the first place? But yes, I can see
> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
> might cause problems, too.

The interrupt will get set in IRR and handled when interrupts are
enabled.

> 
> Plus, with how the rendezvous function is invoked (via
> on_selected_cpus() with the mask copied from cpu_online_map), the
> first check in smp_call_function_interrupt() ought to prevent the
> function from being called on the CPU being onlined. A problem would
> arise though if the IPI arrived later and call_data was already
> (partly or fully) overwritten with the next request.

Yeah, there's a small window where the fields in call_data are out of
sync.

> >> In any event this would again affect only the CPU leading the CPU 
> >> operation,
> >> which should clearly know at which point(s) it is okay to send IPIs. Are we
> >> actually sending any IPIs from within CPU-online or CPU-offline paths?
> > 
> > Yes, I've seen the time rendezvous happening while in the middle of a
> > hotplug operation, and the CPU coordinating the rendezvous being the
> > one doing the CPU hotplug operation, so get_cpu_maps() returning true.
> 
> Right, yet together with ...
> 
> >> Together with the earlier paragraph the critical window would be between 
> >> the
> >> CPU being taken off of cpu_online_map and the CPU actually going "dead" 
> >> (i.e.
> >> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
> >> then the question would be what bad, if any, would happen to that CPU if an
> >> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
> >> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
> >>
> >>> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
> >>> already held in write mode by the current CPU, as read_trylock() would
> >>> otherwise return true.
> >>>
> >>> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already 
> >>> locked in write mode')

Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 03:08:49PM +0200, Jürgen Groß wrote:
> On 29.05.24 14:46, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote:
> > > On 28.05.24 13:22, Roger Pau Monné wrote:
> > > > Hello,
> > > > 
> > > > When the stop_machine_run() call in cpu_down() fails and calls the CPU
> > > > notifier CPU_DOWN_FAILED hook the following assert triggers in the
> > > > scheduling code:
> > > > 
> > > > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at 
> > > > common/sched/cred1
> > > > [ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
> > > > CPU:0
> > > > RIP:e008:[] 
> > > > common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
> > > > RFLAGS: 00010093   CONTEXT: hypervisor
> > > > rax:    rbx: 83202ecc2f80   rcx: 83202f3e64c0
> > > > rdx: 0001   rsi: 0002   rdi: 83202ecc2f88
> > > > rbp: 83203d58   rsp: 83203d30   r8:  
> > > > r9:  83202f3e6e01   r10:    r11: 0f0f0f0f0f0f0f0f
> > > > r12: 83202ecb80b0   r13: 0001   r14: 0282
> > > > r15: 83202ecbbf00   cr0: 8005003b   cr4: 007526e0
> > > > cr3: 574c2000   cr2: 
> > > > fsb:    gsb:    gss: 
> > > > ds:    es:    fs:    gs:    ss:    cs: e008
> > > > Xen code around  
> > > > (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177):
> > > >fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 
> > > > 75 2e
> > > > Xen stack trace from rsp=83203d30:
> > > >  83202d74d100 0001 82d0404c4430 0006
> > > >   83203d78 82d040257454 
> > > >  0001 83203da8 82d04021f303 82d0404c4628
> > > >  82d0404c4620 82d0404c4430 0006 83203df0
> > > >  82d04022bc4c 83203e18 0001 0001
> > > >  fff0   82d0405e6500
> > > >  83203e08 82d040204fd5 0001 83203e30
> > > >  82d0402054f0 82d0404c5860 0001 83202ec75000
> > > >  83203e48 82d040348c25 83202d74d0d0 83203e68
> > > >  82d0402071aa 83202ec751d0 82d0405ce210 83203e80
> > > >  82d0402343c9 82d0405ce200 83203eb0 82d040234631
> > > >   7fff 82d0405d5080 82d0405ce210
> > > >  83203ee8 82d040321411 82d040321399 83202f3a9000
> > > >   001d91a6fa2d 82d0405e6500 83203de0
> > > >  82d040324391   
> > > >     
> > > >     
> > > >     
> > > >     
> > > >     
> > > > Xen call trace:
> > > >  [] R 
> > > > common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
> > > >  [] F free_cpu_rm_data+0x41/0x58
> > > >  [] F 
> > > > common/sched/cpupool.c#cpu_callback+0xfb/0x466
> > > >  [] F notifier_call_chain+0x6c/0x96
> > > >  [] F 
> > > > common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
> > > >  [] F cpu_down+0xa7/0x143
> > > >  [] F cpu_down_helper+0x11/0x27
> > > >  [] F 
> > > > common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
> > > >  [] F common/tasklet.c#do_tasklet_work+0x76/0xaf
> > > >  [] F do_tasklet+0x5b/0x8d
> > > >  [] F arch/x86/domain.c#idle_loop+0x78/0xe6
> > > >  [] F continue_running+0x5b/0x5d
> > > > 
> > > > 
> > > > 
> > > > Panic on CPU 0:
> > > > Assertion '!cpumask_test_cpu(cpu, &prv->init

Re: [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 05:31:02PM +0200, Jan Beulich wrote:
> On 29.05.2024 17:20, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 03:04:13PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>> The current callers of stop_machine_run() outside of init code already 
> >>> have the
> >>> CPU maps locked, and hence there's no reason for stop_machine_run() to 
> >>> attempt
> >>> to lock again.
> >>
> >> While purely from a description perspective this is okay, ...
> >>
> >>> --- a/xen/common/stop_machine.c
> >>> +++ b/xen/common/stop_machine.c
> >>> @@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void 
> >>> *data, unsigned int cpu)
> >>>  BUG_ON(!local_irq_is_enabled());
> >>>  BUG_ON(!is_idle_vcpu(current));
> >>>  
> >>> -/* cpu_online_map must not change. */
> >>> -if ( !get_cpu_maps() )
> >>> +/*
> >>> + * cpu_online_map must not change.  The only two callers of
> >>> + * stop_machine_run() outside of init code already have the CPU map 
> >>> locked.
> >>> + */
> >>
> >> ... the "two" here is not unlikely to quickly go stale; who knows what PPC
> >> and RISC-V will have as their code becomes more complete?
> >>
> >> I'm also unconvinced that requiring ...
> >>
> >>> +if ( system_state >= SYS_STATE_active && !cpu_map_locked() )
> >>
> >> ... this for all future (post-init) uses of stop_machine_run() is a good
> >> idea. It is quite a bit more natural, to me at least, for the function to
> >> effect this itself, as is the case prior to your change.
> > 
> > This is mostly a pre-req for the next change that switches
> > get_cpu_maps() to return false if the current CPU is holding the CPU
> > maps lock in write mode.
> > 
> > If we don't want to go this route we need a way to signal
> > send_IPI_mask() when a CPU hot{,un}plug operation is taking place,
> > because get_cpu_maps() alone is not suitable.
> > 
> > Overall I don't like the corner case where get_cpu_maps() returns true
> > if a CPU hot{,un}plug operation is taking place in the current CPU
> > context.  The guarantee of get_cpu_maps() is that no CPU hot{,un}plug
> > operations can be in progress if it returns true.
> 
> I'm not convinced of looking at it this way. To me the guarantee is
> merely that no CPU operation is taking place _elsewhere_. As indicated,
> imo the local CPU should be well aware of what context it's actually in,
> and hence what is (or is not) appropriate to do at a particular point in
> time.
> 
> I guess what I'm missing is an example of a concrete code path where
> things presently go wrong.

See the specific example in patch 3/9 with time_calibration() and its
usage of send_IPI_mask() when called from a CPU executing in cpu_up()
context.

Thanks, Roger.



Re: [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 05:27:06PM +0200, Jan Beulich wrote:
> On 29.05.2024 17:15, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 02:40:51PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>> When adjusting move_cleanup_count to account for CPUs that are offline 
> >>> also
> >>> adjust old_cpu_mask, otherwise further calls to fixup_irqs() could 
> >>> subtract
> >>> those again creating and create an imbalance in move_cleanup_count.
> >>
> >> I'm in trouble with "creating"; I can't seem to be able to guess what you 
> >> may
> >> have meant.
> > 
> > Oh, sorry, that's a typo.
> > 
> > I was meaning to point out that not removing the already subtracted
> > CPUs from the mask can lead to further calls to fixup_irqs()
> > subtracting them again and move_cleanup_count possibly underflowing.
> > 
> > Would you prefer to write it as:
> > 
> > "... could subtract those again and possibly underflow move_cleanup_count."
> 
> Fine with me. Looks like simply deleting "creating" and keeping the rest
> as it was would be okay too? Whatever you prefer in the end.

Yes, whatever you think is clearer TBH, I don't really have a
preference.

Thanks, Roger.



Re: [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 03:57:19PM +0200, Jan Beulich wrote:
> On 29.05.2024 11:01, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/include/asm/irq.h
> > +++ b/xen/arch/x86/include/asm/irq.h
> > @@ -28,6 +28,32 @@ typedef struct {
> >  
> >  struct irq_desc;
> >  
> > +/*
> > + * Xen logic for moving interrupts around CPUs allows manipulating 
> > interrupts
> > + * that target remote CPUs.  The logic to move an interrupt from CPU(s) is 
> > as
> > + * follows:
> > + *
> > + * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
> > + * 2. New cpu_mask and vector are set, vector is setup at the new 
> > destination.
> > + * 3. move_in_progress is set.
> > + * 4. Interrupt source is updated to target new CPU and vector.
> > + * 5. Interrupts arriving at old_cpu_mask are processed normally.
> > + * 6. When an interrupt is delivered at the new destination (cpu_mask) as 
> > part
> > + *of acking the interrupt move_in_progress is cleared and 
> > move_cleanup_count
> 
> Nit: A comma after "interrupt" may help reading.
> 
> > + *is set to the weight of online CPUs in old_cpu_mask.
> > + *IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.
> 
> These last two steps aren't precise enough, compared to what the code does.
> old_cpu_mask is first reduced to online CPUs therein. If the result is non-
> empty, what you describe is done. If, however, the result is empty, the
> vector is released right away (this code may be there just in case, but I
> think it shouldn't be omitted here).

I've left that out because I got the impression it made the text more
complex to follow (with the extra branch) for no real benefit, but I'm
happy to attempt to add it.

> 
> > + * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
> > + *vector entry and decrease the count in move_cleanup_count.  The CPU 
> > that
> > + *sets move_cleanup_count to 0 releases the vector.
> > + *
> > + * Note that when interrupt movement (either move_in_progress or
> > + * move_cleanup_count set) is in progress it's not possible to move the
> > + * interrupt to yet a different CPU.
> > + *
> > + * By keeping the vector in the old CPU(s) configured until the interrupt 
> > is
> > + * acked on the new destination Xen allows draining any pending interrupts 
> > at
> > + * the old destinations.
> > + */
> >  struct arch_irq_desc {
> >  s16 vector;  /* vector itself is only 8 bits, */
> >  s16 old_vector;  /* but we use -1 for unassigned  */
> 
> I take it that it is not a goal to (also) describe under what conditions
> an IRQ move may actually be initiated (IRQ_MOVE_PENDING)? I ask not the
> least because the 2nd from last paragraph lightly touches that area.

Right, I was mostly focused on moves (forcefully) initiated from
fixup_irqs(), which is different from the opportunistic affinity
changes signaled by IRQ_MOVE_PENDING.

Not sure whether I want to mention this ahead of the list in a
paragraph, or just add it as a step.  Do you have any preference?

Thanks, Roger.



Re: [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 03:04:13PM +0200, Jan Beulich wrote:
> On 29.05.2024 11:01, Roger Pau Monne wrote:
> > The current callers of stop_machine_run() outside of init code already have 
> > the
> > CPU maps locked, and hence there's no reason for stop_machine_run() to 
> > attempt
> > to lock again.
> 
> While purely from a description perspective this is okay, ...
> 
> > --- a/xen/common/stop_machine.c
> > +++ b/xen/common/stop_machine.c
> > @@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void *data, 
> > unsigned int cpu)
> >  BUG_ON(!local_irq_is_enabled());
> >  BUG_ON(!is_idle_vcpu(current));
> >  
> > -/* cpu_online_map must not change. */
> > -if ( !get_cpu_maps() )
> > +/*
> > + * cpu_online_map must not change.  The only two callers of
> > + * stop_machine_run() outside of init code already have the CPU map 
> > locked.
> > + */
> 
> ... the "two" here is not unlikely to quickly go stale; who knows what PPC
> and RISC-V will have as their code becomes more complete?
> 
> I'm also unconvinced that requiring ...
> 
> > +if ( system_state >= SYS_STATE_active && !cpu_map_locked() )
> 
> ... this for all future (post-init) uses of stop_machine_run() is a good
> idea. It is quite a bit more natural, to me at least, for the function to
> effect this itself, as is the case prior to your change.

This is mostly a pre-req for the next change that switches
get_cpu_maps() to return false if the current CPU is holding the CPU
maps lock in write mode.

If we don't want to go this route we need a way to signal
send_IPI_mask() when a CPU hot{,un}plug operation is taking place,
because get_cpu_maps() alone is not suitable.

Overall I don't like the corner case where get_cpu_maps() returns true
if a CPU hot{,un}plug operation is taking place in the current CPU
context.  The guarantee of get_cpu_maps() is that no CPU hot{,un}plug
operations can be in progress if it returns true.

Thanks, Roger.



Re: [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 02:40:51PM +0200, Jan Beulich wrote:
> On 29.05.2024 11:01, Roger Pau Monne wrote:
> > When adjusting move_cleanup_count to account for CPUs that are offline also
> > adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
> > those again creating and create an imbalance in move_cleanup_count.
> 
> I'm in trouble with "creating"; I can't seem to be able to guess what you may
> have meant.

Oh, sorry, that's a typo.

I was meaning to point out that not removing the already subtracted
CPUs from the mask can lead to further calls to fixup_irqs()
subtracting them again and move_cleanup_count possibly underflowing.

Would you prefer to write it as:

"... could subtract those again and possibly underflow move_cleanup_count."

> > Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in 
> > fixup_irqs()')
> > Signed-off-by: Roger Pau Monné 
> 
> With the above clarified (adjustment can be done while committing)
> Reviewed-by: Jan Beulich 
> 
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -2572,6 +2572,14 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
> >  desc->arch.move_cleanup_count -= cpumask_weight(affinity);
> >  if ( !desc->arch.move_cleanup_count )
> >  release_old_vec(desc);
> > +else
> > +/*
> > + * Adjust old_cpu_mask to account for the offline CPUs,
> > + * otherwise further calls to fixup_irqs() could subtract 
> > those
> > + * again and possibly underflow the counter.
> > + */
> > +cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
> > +&cpu_online_map);
> >  }
> 
> While functionality-wise okay, imo it would be slightly better to use
> "affinity" here as well, so that even without looking at context beyond
> what's shown here there is a direct connection to the cpumask_weight()
> call. I.e.
> 
> cpumask_andnot(desc->arch.old_cpu_mask, 
> desc->arch.old_cpu_mask,
>affinity);
> 
> Thoughts?

It was more straightforward for me to reason that removing the offline
CPUs is OK, but I can see that you might prefer to use 'affinity',
because that's the weight that's subtracted from move_cleanup_count.
Using either should lead to the same result if my understanding is
correct.
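
For what it's worth, a tiny standalone model of the two variants (assuming,
as in the code above, that 'affinity' at that point holds exactly the
offline CPUs of old_cpu_mask) shows they produce the same mask:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Toy bitmasks standing in for cpumask_t. */
        uint64_t old_cpu_mask   = 0x0f;  /* CPUs 0-3 were targets */
        uint64_t cpu_online_map = 0x0b;  /* CPU 2 went offline */
        uint64_t affinity = old_cpu_mask & ~cpu_online_map; /* offline subset */

        uint64_t v1 = old_cpu_mask & cpu_online_map; /* cpumask_and() variant */
        uint64_t v2 = old_cpu_mask & ~affinity;      /* cpumask_andnot() variant */

        printf("and: %#llx andnot: %#llx\n",
               (unsigned long long)v1, (unsigned long long)v2);
        return 0;
    }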

Thanks, Roger.



Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
> On 29.05.2024 11:01, Roger Pau Monne wrote:
> > Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so 
> > from
> > a cpu_hotplug_{begin,done}() region the function will still return success,
> > because a CPU taking the rwlock in read mode after having taken it in write
> > mode is allowed.  Such behavior however defeats the purpose of 
> > get_cpu_maps(),
> > as it should always return false when called while a CPU hot{,un}plug
> > operation is in progress.
> 
> I'm not sure I can agree with this. The CPU doing said operation ought to be
> aware of what it is itself doing. And all other CPUs will get back false from
> get_cpu_maps().

Well, the CPU is aware in the context of cpu_{up,down}(), but not in
the interrupts that might be handled while that operation is in
progress, see below for a concrete example.

> >  Otherwise the logic in send_IPI_mask() for example is wrong,
> > as it could decide to use the shorthand even when a CPU operation is in
> > progress.
> 
> It's also not becoming clear what's wrong there: As long as a CPU isn't
> offline enough to not be in cpu_online_map anymore, it may well need to still
> be the target of IPIs, and targeting it with a shorthand then is still fine.

The issue is in the online path: there's a window where the CPU is
online (and the lapic active), but cpu_online_map hasn't been updated
yet.  A specific example would be time_calibration() being executed on
the CPU that is running cpu_up().  That could result in a shorthand
IPI being used, but the mask in r.cpu_calibration_map not containing
the CPU that's being brought up online because it's not yet added to
cpu_online_map.  Then the number of CPUs actually running
time_calibration_rendezvous_fn won't match the weight of the cpumask
in r.cpu_calibration_map.

> In any event this would again affect only the CPU leading the CPU operation,
> which should clearly know at which point(s) it is okay to send IPIs. Are we
> actually sending any IPIs from within CPU-online or CPU-offline paths?

Yes, I've seen the time rendezvous happening while in the middle of a
hotplug operation, and the CPU coordinating the rendezvous being the
one doing the CPU hotplug operation, so get_cpu_maps() returning true.

> Together with the earlier paragraph the critical window would be between the
> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
> then the question would be what bad, if any, would happen to that CPU if an
> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
> 
> > Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
> > already held in write mode by the current CPU, as read_trylock() would
> > otherwise return true.
> > 
> > Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already 
> > locked in write mode')
> 
> I'm puzzled by this as well: Prior to that and the change referenced by its
> Fixes: tag, recursive spin locks were used. For the purposes here that's the
> same as permitting read locking even when the write lock is already held by
> the local CPU.

I see, so the Fixes should be:

x86/smp: use APIC ALLBUT destination shorthand when possible

Instead, which is the commit that started using get_cpu_maps() in
send_IPI_mask().

Thanks, Roger.
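
As a reference point, a minimal sketch of the adjustment described in the
quoted commit message; the lock and helper names (cpu_add_remove_lock,
rw_is_write_locked_by_me()) are illustrative rather than the final Xen
interface:

    /* Sketch: refuse get_cpu_maps() while the local CPU is inside a
     * cpu_hotplug_{begin,done}() region, instead of recursively taking the
     * read lock it already holds for writing. */
    bool get_cpu_maps(void)
    {
        if ( rw_is_write_locked_by_me(&cpu_add_remove_lock) )
            return false;

        return read_trylock(&cpu_add_remove_lock);
    }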



Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic

2024-05-29 Thread Roger Pau Monné
On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote:
> On 28.05.24 13:22, Roger Pau Monné wrote:
> > Hello,
> > 
> > When the stop_machine_run() call in cpu_down() fails and calls the CPU
> > notifier CPU_DOWN_FAILED hook the following assert triggers in the
> > scheduling code:
> > 
> > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at 
> > common/sched/cred1
> > [ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
> > CPU:0
> > RIP:e008:[] 
> > common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
> > RFLAGS: 00010093   CONTEXT: hypervisor
> > rax:    rbx: 83202ecc2f80   rcx: 83202f3e64c0
> > rdx: 0001   rsi: 0002   rdi: 83202ecc2f88
> > rbp: 83203d58   rsp: 83203d30   r8:  
> > r9:  83202f3e6e01   r10:    r11: 0f0f0f0f0f0f0f0f
> > r12: 83202ecb80b0   r13: 0001   r14: 0282
> > r15: 83202ecbbf00   cr0: 8005003b   cr4: 007526e0
> > cr3: 574c2000   cr2: 
> > fsb:    gsb:    gss: 
> > ds:    es:    fs:    gs:    ss:    cs: e008
> > Xen code around  
> > (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177):
> >   fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e
> > Xen stack trace from rsp=83203d30:
> > 83202d74d100 0001 82d0404c4430 0006
> >  83203d78 82d040257454 
> > 0001 83203da8 82d04021f303 82d0404c4628
> > 82d0404c4620 82d0404c4430 0006 83203df0
> > 82d04022bc4c 83203e18 0001 0001
> > fff0   82d0405e6500
> > 83203e08 82d040204fd5 0001 83203e30
> > 82d0402054f0 82d0404c5860 0001 83202ec75000
> > 83203e48 82d040348c25 83202d74d0d0 83203e68
> > 82d0402071aa 83202ec751d0 82d0405ce210 83203e80
> > 82d0402343c9 82d0405ce200 83203eb0 82d040234631
> >  7fff 82d0405d5080 82d0405ce210
> > 83203ee8 82d040321411 82d040321399 83202f3a9000
> >  001d91a6fa2d 82d0405e6500 83203de0
> > 82d040324391   
> >    
> >    
> >    
> >    
> >    
> > Xen call trace:
> > [] R 
> > common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
> > [] F free_cpu_rm_data+0x41/0x58
> > [] F common/sched/cpupool.c#cpu_callback+0xfb/0x466
> > [] F notifier_call_chain+0x6c/0x96
> > [] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
> > [] F cpu_down+0xa7/0x143
> > [] F cpu_down_helper+0x11/0x27
> > [] F 
> > common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
> > [] F common/tasklet.c#do_tasklet_work+0x76/0xaf
> > [] F do_tasklet+0x5b/0x8d
> > [] F arch/x86/domain.c#idle_loop+0x78/0xe6
> > [] F continue_running+0x5b/0x5d
> > 
> > 
> > 
> > Panic on CPU 0:
> > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at 
> > common/sched/credit2.c:4111
> > 
> > 
> > The issue seems to be that since the CPU hasn't been removed, it's
> > still part of prv->initialized and the assert in csched2_free_pdata()
> > called as part of free_cpu_rm_data() triggers.
> > 
> > It's easy to reproduce by substituting the stop_machine_run() call in
> > cpu_down() with an error.
> 
> Could you please give the attached patch a try?

I still get the following assert:

Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at 
common/sched/credit2.c:4111
[ Xen-4.19-unstable  x86_64  debug=y  Not tainted ]
CPU:0
RIP:e008:[] 
common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
RFLAGS: 00010093   CONTEXT: hypervisor
rax:    rbx: 83202ec

CPU_DOWN_FAILED hits ASSERTs in scheduling logic

2024-05-28 Thread Roger Pau Monné
Hello,

When the stop_machine_run() call in cpu_down() fails and calls the CPU
notifier CPU_DOWN_FAILED hook the following assert triggers in the
scheduling code:

Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at 
common/sched/cred1
[ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
CPU:0
RIP:e008:[] 
common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
RFLAGS: 00010093   CONTEXT: hypervisor
rax:    rbx: 83202ecc2f80   rcx: 83202f3e64c0
rdx: 0001   rsi: 0002   rdi: 83202ecc2f88
rbp: 83203d58   rsp: 83203d30   r8:  
r9:  83202f3e6e01   r10:    r11: 0f0f0f0f0f0f0f0f
r12: 83202ecb80b0   r13: 0001   r14: 0282
r15: 83202ecbbf00   cr0: 8005003b   cr4: 007526e0
cr3: 574c2000   cr2: 
fsb:    gsb:    gss: 
ds:    es:    fs:    gs:    ss:    cs: e008
Xen code around  
(common/sched/credit2.c#csched2_free_pdata+0xc8/0x177):
 fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e
Xen stack trace from rsp=83203d30:
   83202d74d100 0001 82d0404c4430 0006
    83203d78 82d040257454 
   0001 83203da8 82d04021f303 82d0404c4628
   82d0404c4620 82d0404c4430 0006 83203df0
   82d04022bc4c 83203e18 0001 0001
   fff0   82d0405e6500
   83203e08 82d040204fd5 0001 83203e30
   82d0402054f0 82d0404c5860 0001 83202ec75000
   83203e48 82d040348c25 83202d74d0d0 83203e68
   82d0402071aa 83202ec751d0 82d0405ce210 83203e80
   82d0402343c9 82d0405ce200 83203eb0 82d040234631
    7fff 82d0405d5080 82d0405ce210
   83203ee8 82d040321411 82d040321399 83202f3a9000
    001d91a6fa2d 82d0405e6500 83203de0
   82d040324391   
      
      
      
      
      
Xen call trace:
   [] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
   [] F free_cpu_rm_data+0x41/0x58
   [] F common/sched/cpupool.c#cpu_callback+0xfb/0x466
   [] F notifier_call_chain+0x6c/0x96
   [] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
   [] F cpu_down+0xa7/0x143
   [] F cpu_down_helper+0x11/0x27
   [] F 
common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
   [] F common/tasklet.c#do_tasklet_work+0x76/0xaf
   [] F do_tasklet+0x5b/0x8d
   [] F arch/x86/domain.c#idle_loop+0x78/0xe6
   [] F continue_running+0x5b/0x5d



Panic on CPU 0:
Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at 
common/sched/credit2.c:4111


The issue seems to be that since the CPU hasn't been removed, it's
still part of prv->initialized and the assert in csched2_free_pdata()
called as part of free_cpu_rm_data() triggers.

It's easy to reproduce by substituting the stop_machine_run() call in
cpu_down() with an error.

Thanks, Roger.



Re: [PATCH v16 1/5] arm/vpci: honor access size when returning an error

2024-05-28 Thread Roger Pau Monné
On Mon, May 27, 2024 at 10:14:59PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 23/05/2024 08:55, Roger Pau Monné wrote:
> > On Wed, May 22, 2024 at 06:59:20PM -0400, Stewart Hildebrand wrote:
> > > From: Volodymyr Babchuk 
> > > 
> > > Guest can try to read config space using different access sizes: 8,
> > > 16, 32, 64 bits. We need to take this into account when we are
> > > returning an error back to MMIO handler, otherwise it is possible to
> > > provide more data than requested: i.e. guest issues LDRB instruction
> > > to read one byte, but we are writing 0x in the target
> > > register.
> > 
> > Shouldn't this be taken care of in the trap handler subsystem, rather
> > than forcing each handler to ensure the returned data matches the
> > access size?
> 
> I understand how this can be useful when we return all 1s.
> 
> However, in most of the current cases, we already need to deal with the
> masking because the data is extracted from a wider field (for instance, see
> the vGIC emulation). For those handlers, I would argue it would be
concerning / a bug if the handler returns bits above the access size.
> Although, this would only impact the guest itself.

Even if there was a bug in the handler, it would be mitigated by the
truncation done in io.c.

> So overall, this seems to be a matter of taste and I don't quite (yet) see
> the benefits to do it in io.c. Regardless that...

It's up to you really, it's all ARM code so I don't really have a
stake.  IMO it makes the handlers more complicated and fragile.

If nothing else I would at least add an ASSERT() in io.c to ensure
that the data returned from the handler matches the size constraints
you expect.

> > 
> > IOW, something like:
> > 
> > diff --git a/xen/arch/arm/io.c b/xen/arch/arm/io.c
> > index 96c740d5636c..b7e12df85f87 100644
> > --- a/xen/arch/arm/io.c
> > +++ b/xen/arch/arm/io.c
> > @@ -37,6 +37,7 @@ static enum io_state handle_read(const struct 
> > mmio_handler *handler,
> >   return IO_ABORT;
> > 
> >   r = sign_extend(dabt, r);
> > +r = r & GENMASK_ULL((1U << dabt.size) * 8 - 1, 0);
> 
> ... in some case we need to sign extend up to the width of the register
> (even if the access is 8-byte). So we would need to do the masking *before*
> calling sign_extend().

I would consider doing the truncation in sign_extend() if suitable,
even if that's doing more than what the function name implies.

Thanks, Roger.
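
As a standalone illustration of the masking being discussed (dabt.size
encodes the access width as log2 of the byte count, matching the
GENMASK_ULL expression in the snippet above):

    #include <stdint.h>
    #include <stdio.h>

    /* Truncate a read value to the access size, as in the suggested
     * handle_read() change; size_log2 corresponds to dabt.size above. */
    static uint64_t truncate_to_size(uint64_t r, unsigned int size_log2)
    {
        unsigned int bits = (1u << size_log2) * 8;

        return bits >= 64 ? r : r & ((1ULL << bits) - 1);
    }

    int main(void)
    {
        /* A byte-wide access (size_log2 = 0) of an all-ones "invalid" value. */
        printf("%#llx\n", (unsigned long long)truncate_to_size(~0ULL, 0));
        return 0;
    }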



Re: [PATCH v2 5/8] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-05-27 Thread Roger Pau Monné
On Fri, May 24, 2024 at 04:16:01PM +0100, Alejandro Vallejo wrote:
> On 23/05/2024 17:13, Roger Pau Monné wrote:
> > On Wed, May 08, 2024 at 01:39:24PM +0100, Alejandro Vallejo wrote:
> >> @@ -86,10 +113,11 @@ static void boot_cpu(unsigned int cpu)
> >>  BUG();
> >>  
> >>  /*
> >> - * Wait for the secondary processor to complete initialisation.
> >> + * Wait for the secondary processor to complete initialisation,
> >> + * which is signaled by its x2APIC ID being written to the LUT.
> >>   * Do not touch shared resources meanwhile.
> >>   */
> >> -while ( !ap_callin )
> >> +while ( !ACCESS_ONCE(CPU_TO_X2APICID[cpu]) )
> >>  cpu_relax();
> > 
> > As a further improvement, we could launch all APs in pararell, and use
> > a for loop to wait until all positions of the CPU_TO_X2APICID array
> > are set.
> 
> I thought about it, but then we'd need locking for the prints as well,
> or refactor things so only the BSP prints on success.

Hm, I see, yes, we would likely need to refactor the printing a bit so
each AP only prints one line, and then add locking around the calls in
`cpu_setup()`.

> The time taken is truly negligible, so I reckon it's better left for
> another patch.

Oh, indeed, sorry if I made it look like it should be part of this patch,
that wasn't my intention.  Just something that might be worth looking
into doing in the future.

Thanks, Roger.
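
A rough sketch of the parallel bring-up idea mentioned above, assuming a
hypothetical non-blocking start_cpu() helper plus the CPU_TO_X2APICID
table and cpu_relax()/ACCESS_ONCE() usage from the quoted patch
(hvmloader's real boot_cpu() currently waits for each AP in turn):

    /* Sketch only: kick all APs, then wait for each to publish its x2APIC ID. */
    static void boot_all_aps(unsigned int nr_cpus)
    {
        unsigned int cpu;

        for ( cpu = 1; cpu < nr_cpus; cpu++ )
            start_cpu(cpu);                     /* hypothetical, non-blocking */

        for ( cpu = 1; cpu < nr_cpus; cpu++ )
            while ( !ACCESS_ONCE(CPU_TO_X2APICID[cpu]) )
                cpu_relax();
    }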



Re: [PATCH v2 5/8] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-05-27 Thread Roger Pau Monné
On Fri, May 24, 2024 at 04:15:34PM +0100, Alejandro Vallejo wrote:
> On 24/05/2024 08:21, Roger Pau Monné wrote:
> > On Wed, May 08, 2024 at 01:39:24PM +0100, Alejandro Vallejo wrote:
> >> Make it so the APs expose their own APIC IDs in a LUT. We can use that LUT 
> >> to
> >> populate the MADT, decoupling the algorithm that relates CPU IDs and APIC 
> >> IDs
> >> from hvmloader.
> >>
> >> While at this also remove ap_callin, as writing the APIC ID may serve the 
> >> same
> >> purpose.
> >>
> >> Signed-off-by: Alejandro Vallejo 
> >> ---
> >> v2:
> >>   * New patch. Replaces adding cpu policy to hvmloader in v1.
> >> ---
> >>  tools/firmware/hvmloader/config.h|  6 -
> >>  tools/firmware/hvmloader/hvmloader.c |  4 +--
> >>  tools/firmware/hvmloader/smp.c   | 40 +++-
> >>  tools/firmware/hvmloader/util.h  |  5 
> >>  xen/arch/x86/include/asm/hvm/hvm.h   |  1 +
> >>  5 files changed, 47 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/tools/firmware/hvmloader/config.h 
> >> b/tools/firmware/hvmloader/config.h
> >> index c82adf6dc508..edf6fa9c908c 100644
> >> --- a/tools/firmware/hvmloader/config.h
> >> +++ b/tools/firmware/hvmloader/config.h
> >> @@ -4,6 +4,8 @@
> >>  #include 
> >>  #include 
> >>  
> >> +#include 
> >> +
> >>  enum virtual_vga { VGA_none, VGA_std, VGA_cirrus, VGA_pt };
> >>  extern enum virtual_vga virtual_vga;
> >>  
> >> @@ -49,8 +51,10 @@ extern uint8_t ioapic_version;
> >>  
> >>  #define IOAPIC_ID   0x01
> >>  
> >> +extern uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
> >> +
> >>  #define LAPIC_BASE_ADDRESS  0xfee0
> >> -#define LAPIC_ID(vcpu_id)   ((vcpu_id) * 2)
> >> +#define LAPIC_ID(vcpu_id)   (CPU_TO_X2APICID[(vcpu_id)])
> >>  
> >>  #define PCI_ISA_DEVFN   0x08/* dev 1, fn 0 */
> >>  #define PCI_ISA_IRQ_MASK0x0c20U /* ISA IRQs 5,10,11 are PCI connected 
> >> */
> >> diff --git a/tools/firmware/hvmloader/hvmloader.c 
> >> b/tools/firmware/hvmloader/hvmloader.c
> >> index c58841e5b556..1eba92229925 100644
> >> --- a/tools/firmware/hvmloader/hvmloader.c
> >> +++ b/tools/firmware/hvmloader/hvmloader.c
> >> @@ -342,11 +342,11 @@ int main(void)
> >>  
> >>  printf("CPU speed is %u MHz\n", get_cpu_mhz());
> >>  
> >> +smp_initialise();
> >> +
> >>  apic_setup();
> >>  pci_setup();
> >>  
> >> -smp_initialise();
> >> -
> >>  perform_tests();
> >>  
> >>  if ( bios->bios_info_setup )
> >> diff --git a/tools/firmware/hvmloader/smp.c 
> >> b/tools/firmware/hvmloader/smp.c
> >> index a668f15d7e1f..4d75f239c2f5 100644
> >> --- a/tools/firmware/hvmloader/smp.c
> >> +++ b/tools/firmware/hvmloader/smp.c
> >> @@ -29,7 +29,34 @@
> >>  
> >>  #include 
> >>  
> >> -static int ap_callin, ap_cpuid;
> >> +static int ap_cpuid;
> >> +
> >> +/**
> >> + * Lookup table of x2APIC IDs.
> >> + *
> >> + * Each entry is populated by its respective CPU as it comes online. This 
> >> is required
> >> + * for generating the MADT with minimal assumptions about ID 
> >> relationships.
> >> + */
> >> +uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
> >> +
> >> +static uint32_t read_apic_id(void)
> >> +{
> >> +uint32_t apic_id;
> >> +
> >> +cpuid(1, NULL, &apic_id, NULL, NULL);
> >> +apic_id >>= 24;
> >> +
> >> +/*
> >> + * APIC IDs over 255 are represented by 255 in leaf 1 and are meant 
> >> to be
> >> + * read from topology leaves instead. Xen exposes x2APIC IDs in leaf 
> >> 0xb,
> >> + * but only if the x2APIC feature is present. If there are that many 
> >> CPUs
> >> + * it's guaranteed to be there so we can avoid checking for it 
> >> specifically
> >> + */
> >> +if ( apic_id == 255 )
> >> +cpuid(0xb, NULL, NULL, NULL, &apic_id);
> >> +
> >> +return apic_id;
> >> +}
> >>  
> >>  static void ap_start(void)
> >>  {
> >> @@ -37,12 +64,12 @@ static void ap_start(void)
> >>  cacheattr_init();
> >>  

Re: [PATCH v2 8/8] xen/x86: Synthesise domain topologies

2024-05-27 Thread Roger Pau Monné
On Fri, May 24, 2024 at 06:16:01PM +0100, Alejandro Vallejo wrote:
> On 24/05/2024 09:58, Roger Pau Monné wrote:
> > On Wed, May 08, 2024 at 01:39:27PM +0100, Alejandro Vallejo wrote:
> > 
> >> +rc = x86_topo_from_parts(>policy, threads_per_core, 
> >> cores_per_pkg);
> > 
> > I assume this generates the same topology as the current code, or will
> > the population of the leaves be different in some way?
> > 
> 
> The current code does not populate 0xb. This generates a topology
> consistent with the existing INTENDED topology. The actual APIC IDs will
> be different though (because there's no skipping of odd values).
> 
> All the dance in patch 1 was to make this migrate-safe. The x2apic ID is
> stored in the lapic hidden regs so differences with previous behaviour
> don't matter.

What about systems without CPU policy in the migration stream, will
those also get restored as expected?

I think you likely need to check whether 'restore' is set and keep the
old logic in that case?

As otherwise migrated systems without a CPU policy will get the new
topology information instead of the old one?

> IOW, The differences are:
>   * 0xb is exposed, whereas previously it wasn't
>   * APIC IDs are compacted such that new_apicid=old_apicid/2
>   * There's also a cleanup of the murkier paths to put the right core
> counts in the right leaves (whereas previously it was bonkers)

This needs to be in the commit message IMO.

> > 
> > Note that currently the host policy also gets the topology leaves
> > cleared, is it intended to not clear them anymore after this change?
> > 
> > (as you only clear the leaves for the guest {max,def} policies)
> > 
> > Thanks, Roger.
> 
> It was like that originally in v1, I changed in v2 as part of feedback
> from Jan.

I think that's fine, but this divergence from current behavior of
cleaning the topology for the host policy needs to be mentioned in
the commit message.

Thanks, Roger.



Re: [PATCH v2 7/8] xen/x86: Derive topologically correct x2APIC IDs from the policy

2024-05-27 Thread Roger Pau Monné
On Fri, May 24, 2024 at 06:03:22PM +0100, Alejandro Vallejo wrote:
> On 24/05/2024 09:39, Roger Pau Monné wrote:
> > On Wed, May 08, 2024 at 01:39:26PM +0100, Alejandro Vallejo wrote:
> > 
> > Also you could initialize x2apic_id at definition:
> > 
> > const struct test *t = [j];
> > struct cpu_policy policy = { .x86_vendor = vendors[i] };
> > int rc = x86_topo_from_parts(&policy, t->threads_per_core, 
> > t->cores_per_pkg);
> > uint32_t x2apic_id = x86_x2apic_id_from_vcpu_id(&policy, t->vcpu_id);
> 
> Seeing this snippet I just realized there's a bug. The second loop
> should use j rather than i. Ugh.

Well, you shadow the outer variable with the inner one, which makes it
still fine.  Yet I don't like that shadowing much.  I was going to
comment, but for the requested change you need to not shadow the outer
loop variable (in the example chunk I've used 'j' to signal the outer
loop index).

> >> +}
> >> +
> >>  uint32_t x86_x2apic_id_from_vcpu_id(const struct cpu_policy *p, uint32_t 
> >> id)
> >>  {
> >> +uint32_t shift = 0, x2apic_id = 0;
> >> +
> >> +/* In the absence of topology leaves, fallback to traditional mapping 
> >> */
> >> +if ( !p->topo.subleaf[0].type )
> >> +return id * 2;
> >> +
> >>  /*
> >> - * TODO: Derive x2APIC ID from the topology information inside `p`
> >> - *   rather than from vCPU ID. This bodge is a temporary measure
> >> - *   until all infra is in place to retrieve or derive the initial
> >> - *   x2APIC ID from migrated domains.
> > 
> > I'm a bit confused with this, the policy is domain wide, so we will
> > always need to pass the vCPU ID into x86_x2apic_id_from_vcpu_id()?
> > IOW: the x2APIC ID will always be derived from the vCPU ID.
> > 
> > Thanks, Roger.
> 
> The x2APIC ID is derived (after the series) from the vCPU ID _and_ the
> topology information. The vCPU ID alone will work out in all cases because
> it'll be cached in the vlapic hvm structure.
> 
> I guess the comment could be rewritten as "... rather than from the vCPU
> ID alone..."

Yup, that's clearer :).

Thanks, Roger.



Re: [PATCH v16 4/5] xen/arm: translate virtual PCI bus topology for guests

2024-05-27 Thread Roger Pau Monné
On Fri, May 24, 2024 at 02:21:09PM +0100, Julien Grall wrote:
> Hi,
> 
> Sorry I didn't notice there was a v16 and posted comments on the v15. The
> only one is about the size of the list we iterate.
> 
> On 23/05/2024 08:48, Roger Pau Monné wrote:
> > On Wed, May 22, 2024 at 06:59:23PM -0400, Stewart Hildebrand wrote:
> > > From: Oleksandr Andrushchenko 
> > > +}
> > > -return sbdf;
> > > +return translated;
> > >   }
> > >   static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
> > > register_t *r, void *p)
> > >   {
> > >   struct pci_host_bridge *bridge = p;
> > > -pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> > > +pci_sbdf_t sbdf;
> > >   const unsigned int access_size = (1U << info->dabt.size) * 8;
> > >   const register_t invalid = GENMASK_ULL(access_size - 1, 0);
> > 
> > Do you know why the invalid value is truncated to the access size.
> 
> Because no other callers are doing the truncation and therefore the guest
> would read 1s even for 8-byte unsigned access.

I think forcing all handlers to do the truncation is a lot of
duplication, and more risky than just doing it in the dispatcher
itself (handle_read()), see my reply to 1/5.

Thanks, Roger.



Re: [PATCH v2 3/8] x86/vlapic: Move lapic_load_hidden migration checks to the check hook

2024-05-24 Thread Roger Pau Monné
On Fri, May 24, 2024 at 12:16:00PM +0100, Alejandro Vallejo wrote:
> On 23/05/2024 15:50, Roger Pau Monné wrote:
> > On Wed, May 08, 2024 at 01:39:22PM +0100, Alejandro Vallejo wrote:
> >> While at it, add a check for the reserved field in the hidden save area.
> >>
> >> Signed-off-by: Alejandro Vallejo 
> >> ---
> >> v2:
> >>   * New patch. Addresses the missing check for rsvd_zero in v1.
> > 
> > Oh, it would be better if this was done at the time when rsvd_zero is
> > introduced.  I think this should be moved ahead of the series, so that
> > the patch that introduces rsvd_zero can add the check in
> > lapic_check_hidden().
> 
> I'll give that a whirl.
> 
> > 
> >> ---
> >>  xen/arch/x86/hvm/vlapic.c | 41 ---
> >>  1 file changed, 30 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> >> index 8a24419c..2f06bff1b2cc 100644
> >> --- a/xen/arch/x86/hvm/vlapic.c
> >> +++ b/xen/arch/x86/hvm/vlapic.c
> >> @@ -1573,35 +1573,54 @@ static void lapic_load_fixup(struct vlapic *vlapic)
> >> v, vlapic->loaded.id, vlapic->loaded.ldr, good_ldr);
> >>  }
> >>  
> >> -static int cf_check lapic_load_hidden(struct domain *d, 
> >> hvm_domain_context_t *h)
> >> +static int cf_check lapic_check_hidden(const struct domain *d,
> >> +   hvm_domain_context_t *h)
> >>  {
> >>  unsigned int vcpuid = hvm_load_instance(h);
> >> -struct vcpu *v;
> >> -struct vlapic *s;
> >> +struct hvm_hw_lapic s;
> >>  
> >>  if ( !has_vlapic(d) )
> >>  return -ENODEV;
> >>  
> >>  /* Which vlapic to load? */
> >> -if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
> >> +if ( vcpuid >= d->max_vcpus || d->vcpu[vcpuid] == NULL )
> >>  {
> >>  dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no apic%u\n",
> >>  d->domain_id, vcpuid);
> >>  return -EINVAL;
> >>  }
> >> -s = vcpu_vlapic(v);
> >>  
> >> -if ( hvm_load_entry_zeroextend(LAPIC, h, &s->hw) != 0 )
> >> +if ( hvm_load_entry_zeroextend(LAPIC, h, &s) )
> > 
> > Can't you use hvm_get_entry() to perform the sanity checks:
> > 
> > const struct hvm_hw_lapic *s = hvm_get_entry(LAPIC, h);
> > 
> > Thanks, Roger.
> 
> I don't think I can. Because the last field (rsvd_zero) might or might
> not be there, so it needs to be zero-extended. Unless I misunderstood
> what hvm_get_entry() is meant to do. It seems to check for exact sizes.

Oh, indeed, hvm_get_entry() uses strict checking and will refuse to
return the entry if sizes don't match.  There seems to be no way to
avoid the copy if we want to do this in a sane way.

Thanks, Roger.



Re: [PATCH v2 1/8] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-05-24 Thread Roger Pau Monné
On Fri, May 24, 2024 at 11:58:44AM +0100, Alejandro Vallejo wrote:
> On 23/05/2024 15:32, Roger Pau Monné wrote:
> >>  case 0xb:
> >> -/*
> >> - * In principle, this leaf is Intel-only.  In practice, it is 
> >> tightly
> >> - * coupled with x2apic, and we offer an x2apic-capable APIC 
> >> emulation
> >> - * to guests on AMD hardware as well.
> >> - *
> >> - * TODO: Rework topology logic.
> >> - */
> >> -if ( p->basic.x2apic )
> >> +/* Don't expose topology information to PV guests */
> > 
> > Not sure whether we want to keep part of the comment about exposing
> > x2APIC to guests even when x2APIC is not present in the host.  I think
> > this code has changed and the comment is kind of stale now.
> 
> The comment is definitely stale. Nowadays x2APIC is fully supported by
> AMD, as is leaf 0xb. The fact we emulate the x2APIC seems hardly
> relevant in a CPUID leaf about topology. I could keep a note showing...
> 
> /* Exposed alongside x2apic, as it's tightly coupled with it */
> 
> ... although that's directly implied by the conditional.

Yeah, I haven't gone through the history of this file, but I bet at
some point before the introduction of CPUID policies we leaked (part
of) the host CPUID contents in here.

It's also no longer true that the leaf is Intel only.

I'm fine with either adding your newly proposed comment, or leaving it as-is.

> >> +}
> >> +
> >>  int guest_wrmsr_apic_base(struct vcpu *v, uint64_t val)
> >>  {
> >>  const struct cpu_policy *cp = v->domain->arch.cpu_policy;
> >> @@ -1449,7 +1465,7 @@ void vlapic_reset(struct vlapic *vlapic)
> >>  if ( v->vcpu_id == 0 )
> >>  vlapic->hw.apic_base_msr |= APIC_BASE_BSP;
> >>  
> >> -vlapic_set_reg(vlapic, APIC_ID, (v->vcpu_id * 2) << 24);
> >> +vlapic_set_reg(vlapic, APIC_ID, SET_xAPIC_ID(vlapic->hw.x2apic_id));
> >>  vlapic_do_init(vlapic);
> >>  }
> >>  
> >> @@ -1514,6 +1530,16 @@ static void lapic_load_fixup(struct vlapic *vlapic)
> >>  const struct vcpu *v = vlapic_vcpu(vlapic);
> >>  uint32_t good_ldr = x2apic_ldr_from_id(vlapic->loaded.id);
> >>  
> >> +/*
> >> + * Loading record without hw.x2apic_id in the save stream, calculate 
> >> using
> >> + * the traditional "vcpu_id * 2" relation. There's an implicit 
> >> assumption
> >> + * that vCPU0 always has x2APIC0, which is true for the old relation, 
> >> and
> >> + * still holds under the new x2APIC generation algorithm. While that 
> >> case
> >> + * goes through the conditional it's benign because it still maps to 
> >> zero.
> >> + */
> >> +if ( !vlapic->hw.x2apic_id )
> >> +vlapic->hw.x2apic_id = v->vcpu_id * 2;
> >> +
> >>  /* Skip fixups on xAPIC mode, or if the x2APIC LDR is already correct 
> >> */
> >>  if ( !vlapic_x2apic_mode(vlapic) ||
> >>   (vlapic->loaded.ldr == good_ldr) )
> >> diff --git a/xen/arch/x86/include/asm/hvm/hvm.h 
> >> b/xen/arch/x86/include/asm/hvm/hvm.h
> >> index 0c9e6f15645d..e1f0585d75a9 100644
> >> --- a/xen/arch/x86/include/asm/hvm/hvm.h
> >> +++ b/xen/arch/x86/include/asm/hvm/hvm.h
> >> @@ -448,6 +448,7 @@ static inline void hvm_update_guest_efer(struct vcpu 
> >> *v)
> >>  static inline void hvm_cpuid_policy_changed(struct vcpu *v)
> >>  {
> >>  alternative_vcall(hvm_funcs.cpuid_policy_changed, v);
> >> +vlapic_cpu_policy_changed(v);
> > 
> > Note sure whether this call would better be placed in
> > cpu_policy_updated() inside the is_hvm_vcpu() conditional branch.
> > 
> > hvm_cpuid_policy_changed()  are just wrappers around the hvm_funcs
> > hooks, pulling vlapic functions in there is likely to complicate the
> > header dependencies in the long term.
> > 
> 
> That's how it was in v1 and I moved it in v2 answering one of Jan's
> feedback points.
> 
> I don't mind either way.

Oh (goes and reads Jan's reply to v1) I see.  Let's leave it as-is
then.

> 
> >>  }
> >>  
> >>  static inline void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset,
> >> diff --git a/xen/arch/x86/include/asm/hvm/vlapic.h 
> >> b/xen/arch/x86/include/asm/hvm/vlapic.h
> >> index 88ef94524339..e8d41313abd3 100644
> >> --- a/xen/arch/x86/include

Re: [PATCH v3 1/2] tools/xg: Streamline cpu policy serialise/deserialise calls

2024-05-24 Thread Roger Pau Monné
On Fri, May 24, 2024 at 11:32:50AM +0100, Alejandro Vallejo wrote:
> On 23/05/2024 11:21, Roger Pau Monné wrote:
> > On Thu, May 23, 2024 at 10:41:29AM +0100, Alejandro Vallejo wrote:
> >> -int xc_cpu_policy_serialise(xc_interface *xch, const xc_cpu_policy_t *p,
> >> -xen_cpuid_leaf_t *leaves, uint32_t *nr_leaves,
> >> -xen_msr_entry_t *msrs, uint32_t *nr_msrs)
> >> +int xc_cpu_policy_serialise(xc_interface *xch, xc_cpu_policy_t *p)
> >>  {
> >> +unsigned int nr_leaves = ARRAY_SIZE(p->leaves);
> >> +unsigned int nr_msrs = ARRAY_SIZE(p->msrs);
> >>  int rc;
> >>  
> >> -if ( leaves )
> >> +rc = x86_cpuid_copy_to_buffer(&p->policy, p->leaves, &nr_leaves);
> >> +if ( rc )
> >>  {
> >> -rc = x86_cpuid_copy_to_buffer(&p->policy, leaves, nr_leaves);
> >> -if ( rc )
> >> -{
> >> -ERROR("Failed to serialize CPUID policy");
> >> -errno = -rc;
> >> -return -1;
> >> -}
> >> +ERROR("Failed to serialize CPUID policy");
> >> +errno = -rc;
> >> +return -1;
> >>  }
> >>  
> >> -if ( msrs )
> >> +p->nr_leaves = nr_leaves;
> > 
> > Nit: FWIW, I think you could avoid having to introduce local
> > nr_{leaves,msrs} variables and just use p->nr_{leaves,msrs}?  By
> > setting them to ARRAY_SIZE() at the top of the function and then
> > letting x86_{cpuid,msr}_copy_to_buffer() adjust as necessary.
> > 
> > Thanks, Roger.
> 
> The intent was to avoid mutating the policy object in the error cases
> during deserialise. Then I adjusted the serialise case to have symmetry.

It's currently unavoidable that the policy may be mutated even in case
of error, as x86_{cpuid,msr}_copy_to_buffer() are two separate
operations, and hence the first succeeding but the second failing will
already result in the policy being mutated on error.

> It's true the preservation is not meaningful in the serialise case
> because at that point the serialised form is already corrupted.
> 
> I don't mind either way. Seeing how I'm sending one final version with
> the comments of patch2 I'll just adjust as you proposed.

I'm fine either way (hence why I prefixed it with "nit:"), albeit I have a
preference for not introducing the local variables if they are not
needed.
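
For reference, the shape being suggested is roughly the following (a sketch
only, assuming the nr_leaves/nr_msrs fields added earlier in the series are
the uint32_t lengths the copy helpers expect):

int xc_cpu_policy_serialise(xc_interface *xch, xc_cpu_policy_t *p)
{
    int rc;

    /* Start from the buffer capacities; the helpers trim them in place. */
    p->nr_leaves = ARRAY_SIZE(p->leaves);
    p->nr_msrs = ARRAY_SIZE(p->msrs);

    rc = x86_cpuid_copy_to_buffer(&p->policy, p->leaves, &p->nr_leaves);
    if ( rc )
    {
        ERROR("Failed to serialize CPUID policy");
        errno = -rc;
        return -1;
    }

    rc = x86_msr_copy_to_buffer(&p->policy, p->msrs, &p->nr_msrs);
    if ( rc )
    {
        ERROR("Failed to serialize MSR policy");
        errno = -rc;
        return -1;
    }

    return 0;
}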

Thanks, Roger.



Re: [PATCH v2 8/8] xen/x86: Synthesise domain topologies

2024-05-24 Thread Roger Pau Monné
On Wed, May 08, 2024 at 01:39:27PM +0100, Alejandro Vallejo wrote:
> Expose sensible topologies in leaf 0xb. At the moment it synthesises non-HT
> systems, in line with the previous code intent.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * Zap the topology leaves of (pv/hvm)_(def/max)_policy rather than the host 
> policy
> ---
>  tools/libs/guest/xg_cpuid_x86.c | 62 +
>  xen/arch/x86/cpu-policy.c   |  9 +++--
>  2 files changed, 15 insertions(+), 56 deletions(-)
> 
> diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
> index 4453178100ad..8170769dbe43 100644
> --- a/tools/libs/guest/xg_cpuid_x86.c
> +++ b/tools/libs/guest/xg_cpuid_x86.c
> @@ -584,7 +584,7 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t 
> domid, bool restore,
>  bool hvm;
>  xc_domaininfo_t di;
>  struct xc_cpu_policy *p = xc_cpu_policy_init();
> -unsigned int i, nr_leaves = ARRAY_SIZE(p->leaves), nr_msrs = 0;
> +unsigned int nr_leaves = ARRAY_SIZE(p->leaves), nr_msrs = 0;
>  uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
>  uint32_t host_featureset[FEATURESET_NR_ENTRIES] = {};
>  uint32_t len = ARRAY_SIZE(host_featureset);
> @@ -727,59 +727,15 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t 
> domid, bool restore,
>  }
>  else
>  {
> -/*
> - * Topology for HVM guests is entirely controlled by Xen.  For now, 
> we
> - * hardcode APIC_ID = vcpu_id * 2 to give the illusion of no SMT.
> - */
> -p->policy.basic.htt = true;
> -p->policy.extd.cmp_legacy = false;
> -
> -/*
> - * Leaf 1 EBX[23:16] is Maximum Logical Processors Per Package.
> - * Update to reflect vLAPIC_ID = vCPU_ID * 2, but make sure to avoid
> - * overflow.
> - */
> -if ( !p->policy.basic.lppp )
> -p->policy.basic.lppp = 2;
> -else if ( !(p->policy.basic.lppp & 0x80) )
> -p->policy.basic.lppp *= 2;
> -
> -switch ( p->policy.x86_vendor )
> +/* TODO: Expose the ability to choose a custom topology for HVM/PVH 
> */
> +unsigned int threads_per_core = 1;
> +unsigned int cores_per_pkg = di.max_vcpu_id + 1;

Newline.

> +rc = x86_topo_from_parts(&p->policy, threads_per_core, 
> cores_per_pkg);

I assume this generates the same topology as the current code, or will
the population of the leaves be different in some way?

> +if ( rc )
>  {
> -case X86_VENDOR_INTEL:
> -for ( i = 0; (p->policy.cache.subleaf[i].type &&
> -  i < ARRAY_SIZE(p->policy.cache.raw)); ++i )
> -{
> -p->policy.cache.subleaf[i].cores_per_package =
> -(p->policy.cache.subleaf[i].cores_per_package << 1) | 1;
> -p->policy.cache.subleaf[i].threads_per_cache = 0;
> -}
> -break;
> -
> -case X86_VENDOR_AMD:
> -case X86_VENDOR_HYGON:
> -/*
> - * Leaf 0x8008 ECX[15:12] is ApicIdCoreSize.
> - * Leaf 0x8008 ECX[7:0] is NumberOfCores (minus one).
> - * Update to reflect vLAPIC_ID = vCPU_ID * 2.  But avoid
> - * - overflow,
> - * - going out of sync with leaf 1 EBX[23:16],
> - * - incrementing ApicIdCoreSize when it's zero (which changes 
> the
> - *   meaning of bits 7:0).
> - *
> - * UPDATE: I addition to avoiding overflow, some
> - * proprietary operating systems have trouble with
> - * apic_id_size values greater than 7.  Limit the value to
> - * 7 for now.
> - */
> -if ( p->policy.extd.nc < 0x7f )
> -{
> -if ( p->policy.extd.apic_id_size != 0 && 
> p->policy.extd.apic_id_size < 0x7 )
> -p->policy.extd.apic_id_size++;
> -
> -p->policy.extd.nc = (p->policy.extd.nc << 1) | 1;
> -}
> -break;
> +ERROR("Failed to generate topology: t/c=%u c/p=%u",
> +  threads_per_core, cores_per_pkg);

Could you also print the error code?
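
Something like the following would do (a sketch; the exact format specifier
is illustrative only):

ERROR("Failed to generate topology: t/c=%u c/p=%u: %d",
      threads_per_core, cores_per_pkg, rc);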

> +goto out;
>  }
>  }
>  
> diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> index 4b6d96276399..0ad871732ba0 100644
> --- a/xen/arch/x86/cpu-policy.c
> +++ b/xen/arch/x86/cpu-policy.c
> @@ -278,9 +278,6 @@ static void recalculate_misc(struct cpu_policy *p)
>  
>  p->basic.raw[0x8] = EMPTY_LEAF;
>  
> -/* TODO: Rework topology logic. */
> -memset(p->topo.raw, 0, sizeof(p->topo.raw));
> -
>  p->basic.raw[0xc] = EMPTY_LEAF;
>  
>  p->extd.e1d &= ~CPUID_COMMON_1D_FEATURES;
> @@ -621,6 +618,9 @@ static void __init calculate_pv_max_policy(void)
>  recalculate_xstate(p);
>  
>  p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
> +
> +/* Wipe host topology. Toolstack 

Re: [PATCH v2 7/8] xen/x86: Derive topologically correct x2APIC IDs from the policy

2024-05-24 Thread Roger Pau Monné
On Wed, May 08, 2024 at 01:39:26PM +0100, Alejandro Vallejo wrote:
> Implements the helper for mapping vcpu_id to x2apic_id given a valid
> topology in a policy. The algo is written with the intention of extending
> it to leaves 0x1f and e26 in the future.

Using 0x1f and e26 is kind of confusing.  I would word as "0x1f and
extended leaf 0x26" to avoid confusion.

> 
> Toolstack doesn't set leaf 0xb and the HVM default policy has it cleared,
> so the leaf is not implemented. In that case, the new helper just returns
> the legacy mapping.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * const-ify the test definitions
>   * Cosmetic changes (newline + parameter name in prototype)
> ---
>  tools/tests/cpu-policy/test-cpu-policy.c | 63 
>  xen/include/xen/lib/x86/cpu-policy.h |  2 +
>  xen/lib/x86/policy.c | 73 ++--
>  3 files changed, 133 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/tests/cpu-policy/test-cpu-policy.c 
> b/tools/tests/cpu-policy/test-cpu-policy.c
> index 0ba8c418b1b3..82a6aeb23317 100644
> --- a/tools/tests/cpu-policy/test-cpu-policy.c
> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
> @@ -776,6 +776,68 @@ static void test_topo_from_parts(void)
>  }
>  }
>  
> +static void test_x2apic_id_from_vcpu_id_success(void)
> +{
> +static const struct test {
> +unsigned int vcpu_id;
> +unsigned int threads_per_core;
> +unsigned int cores_per_pkg;
> +uint32_t x2apic_id;
> +uint8_t x86_vendor;
> +} tests[] = {
> +{
> +.vcpu_id = 3, .threads_per_core = 3, .cores_per_pkg = 8,
> +.x2apic_id = 1 << 2,
> +},
> +{
> +.vcpu_id = 6, .threads_per_core = 3, .cores_per_pkg = 8,
> +.x2apic_id = 2 << 2,
> +},
> +{
> +.vcpu_id = 24, .threads_per_core = 3, .cores_per_pkg = 8,
> +.x2apic_id = 1 << 5,
> +},
> +{
> +.vcpu_id = 35, .threads_per_core = 3, .cores_per_pkg = 8,
> +.x2apic_id = (35 % 3) | (((35 / 3) % 8)  << 2) | ((35 / 24) << 
> 5),
> +},
> +{
> +.vcpu_id = 96, .threads_per_core = 7, .cores_per_pkg = 3,
> +.x2apic_id = (96 % 7) | (((96 / 7) % 3)  << 3) | ((96 / 21) << 
> 5),
  ^ extra space (same above)

> +},
> +};
> +
> +const uint8_t vendors[] = {
> +X86_VENDOR_INTEL,
> +X86_VENDOR_AMD,
> +X86_VENDOR_CENTAUR,
> +X86_VENDOR_SHANGHAI,
> +X86_VENDOR_HYGON,
> +};
> +
> +printf("Testing x2apic id from vcpu id success:\n");
> +
> +/* Perform the test run on every vendor we know about */
> +for ( size_t i = 0; i < ARRAY_SIZE(vendors); ++i )
> +{
> +struct cpu_policy policy = { .x86_vendor = vendors[i] };

Newline.

> +for ( size_t i = 0; i < ARRAY_SIZE(tests); ++i )
> +{
> +const struct test *t = &tests[i];
> +uint32_t x2apic_id;
> +int rc = x86_topo_from_parts(&policy, t->threads_per_core, 
> t->cores_per_pkg);

Overly long line.

Won't it be better to define `policy` in this scope, so that for each
test you start with a clean policy, rather than having leftover data
from the previous test?

Also you could initialize x2apic_id at definition:

const struct test *t = [j];
struct cpu_policy policy = { .x86_vendor = vendors[i] };
int rc = x86_topo_from_parts(, t->threads_per_core, t->cores_per_pkg);
uint32_t x2apic_id = x86_x2apic_id_from_vcpu_id(, t->vcpu_id);

> +
> +x2apic_id = x86_x2apic_id_from_vcpu_id(&policy, t->vcpu_id);
> +if ( rc || x2apic_id != t->x2apic_id )
> +fail("FAIL[%d] - '%s cpu%u %u t/c %u c/p'. bad x2apic_id: 
> expected=%u actual=%u\n",
> + rc,
> + x86_cpuid_vendor_to_str(policy.x86_vendor),
> + t->vcpu_id, t->threads_per_core, t->cores_per_pkg,
> + t->x2apic_id, x2apic_id);
> +}
> +}
> +}
> +
>  int main(int argc, char **argv)
>  {
>  printf("CPU Policy unit tests\n");
> @@ -794,6 +856,7 @@ int main(int argc, char **argv)
>  test_is_compatible_failure();
>  
>  test_topo_from_parts();
> +test_x2apic_id_from_vcpu_id_success();
>  
>  if ( nr_failures )
>  printf("Done: %u failures\n", nr_failures);
> diff --git a/xen/include/xen/lib/x86/cpu-policy.h 
> b/xen/include/xen/lib/x86/cpu-policy.h
> index f5df18e9f77c..2cbc2726a861 100644
> --- a/xen/include/xen/lib/x86/cpu-policy.h
> +++ b/xen/include/xen/lib/x86/cpu-policy.h
> @@ -545,6 +545,8 @@ int x86_cpu_policies_are_compatible(const struct 
> cpu_policy *host,
>  /**
>   * Calculates the x2APIC ID of a vCPU given a CPU policy
>   *
> + * If the policy lacks leaf 0xb falls back to legacy mapping of apic_id=cpu*2
> + *
>   * @param p  CPU policy of the domain.
>   * @param id vCPU ID of the 

Re: [PATCH v2 3/8] x86/vlapic: Move lapic_load_hidden migration checks to the check hook

2024-05-24 Thread Roger Pau Monné
On Thu, May 23, 2024 at 07:58:57PM +0100, Andrew Cooper wrote:
> On 08/05/2024 1:39 pm, Alejandro Vallejo wrote:
> > diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> > index 8a24419c..2f06bff1b2cc 100644
> > --- a/xen/arch/x86/hvm/vlapic.c
> > +++ b/xen/arch/x86/hvm/vlapic.c
> > @@ -1573,35 +1573,54 @@ static void lapic_load_fixup(struct vlapic *vlapic)
> > v, vlapic->loaded.id, vlapic->loaded.ldr, good_ldr);
> >  }
> >  
> > -static int cf_check lapic_load_hidden(struct domain *d, 
> > hvm_domain_context_t *h)
> > +static int cf_check lapic_check_hidden(const struct domain *d,
> > +   hvm_domain_context_t *h)
> >  {
> >  unsigned int vcpuid = hvm_load_instance(h);
> > -struct vcpu *v;
> > -struct vlapic *s;
> > +struct hvm_hw_lapic s;
> >  
> >  if ( !has_vlapic(d) )
> >  return -ENODEV;
> >  
> >  /* Which vlapic to load? */
> > -if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
> > +if ( vcpuid >= d->max_vcpus || d->vcpu[vcpuid] == NULL )
> 
> As you're editing this anyway, swap for
> 
>     if ( !domain_vcpu(d, vcpuid) )
> 
> please.
> 
> >  {
> >  dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no apic%u\n",
> >  d->domain_id, vcpuid);
> >  return -EINVAL;
> >  }
> > -s = vcpu_vlapic(v);
> >  
> > -if ( hvm_load_entry_zeroextend(LAPIC, h, &s->hw) != 0 )
> > +if ( hvm_load_entry_zeroextend(LAPIC, h, &s) )
> > +return -ENODATA;
> > +
> > +/* EN=0 with EXTD=1 is illegal */
> > +if ( (s.apic_base_msr & (APIC_BASE_ENABLE | APIC_BASE_EXTD)) ==
> > + APIC_BASE_EXTD )
> > +return -EINVAL;
> 
> This is very insufficient auditing for the incoming value, but it turns
> out that there's no nice logic for this at all.
> 
> As it's just a less obfuscated form of the logic from
> lapic_load_hidden(), it's probably fine to stay as it is for now.
> 
> The major changes since this logic was written originally are that the
> CPU policy correct (so we can reject EXTD on VMs which can't see
> x2apic), and that we now prohibit VMs moving the xAPIC MMIO window away
> from its default location (as this would require per-vCPU P2Ms in order
> to virtualise properly.)

Since this is just migration of the existing checks I think keeping
them as-is is best.  Adding new checks should be done in a followup
patch.

Thanks, Roger.



Re: [PATCH v2 5/8] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-05-24 Thread Roger Pau Monné
On Wed, May 08, 2024 at 01:39:24PM +0100, Alejandro Vallejo wrote:
> Make it so the APs expose their own APIC IDs in a LUT. We can use that LUT to
> populate the MADT, decoupling the algorithm that relates CPU IDs and APIC IDs
> from hvmloader.
> 
> While at this also remove ap_callin, as writing the APIC ID may serve the same
> purpose.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * New patch. Replaces adding cpu policy to hvmloader in v1.
> ---
>  tools/firmware/hvmloader/config.h|  6 -
>  tools/firmware/hvmloader/hvmloader.c |  4 +--
>  tools/firmware/hvmloader/smp.c   | 40 +++-
>  tools/firmware/hvmloader/util.h  |  5 
>  xen/arch/x86/include/asm/hvm/hvm.h   |  1 +
>  5 files changed, 47 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/firmware/hvmloader/config.h 
> b/tools/firmware/hvmloader/config.h
> index c82adf6dc508..edf6fa9c908c 100644
> --- a/tools/firmware/hvmloader/config.h
> +++ b/tools/firmware/hvmloader/config.h
> @@ -4,6 +4,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  enum virtual_vga { VGA_none, VGA_std, VGA_cirrus, VGA_pt };
>  extern enum virtual_vga virtual_vga;
>  
> @@ -49,8 +51,10 @@ extern uint8_t ioapic_version;
>  
>  #define IOAPIC_ID   0x01
>  
> +extern uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
> +
>  #define LAPIC_BASE_ADDRESS  0xfee0
> -#define LAPIC_ID(vcpu_id)   ((vcpu_id) * 2)
> +#define LAPIC_ID(vcpu_id)   (CPU_TO_X2APICID[(vcpu_id)])
>  
>  #define PCI_ISA_DEVFN   0x08/* dev 1, fn 0 */
>  #define PCI_ISA_IRQ_MASK0x0c20U /* ISA IRQs 5,10,11 are PCI connected */
> diff --git a/tools/firmware/hvmloader/hvmloader.c 
> b/tools/firmware/hvmloader/hvmloader.c
> index c58841e5b556..1eba92229925 100644
> --- a/tools/firmware/hvmloader/hvmloader.c
> +++ b/tools/firmware/hvmloader/hvmloader.c
> @@ -342,11 +342,11 @@ int main(void)
>  
>  printf("CPU speed is %u MHz\n", get_cpu_mhz());
>  
> +smp_initialise();
> +
>  apic_setup();
>  pci_setup();
>  
> -smp_initialise();
> -
>  perform_tests();
>  
>  if ( bios->bios_info_setup )
> diff --git a/tools/firmware/hvmloader/smp.c b/tools/firmware/hvmloader/smp.c
> index a668f15d7e1f..4d75f239c2f5 100644
> --- a/tools/firmware/hvmloader/smp.c
> +++ b/tools/firmware/hvmloader/smp.c
> @@ -29,7 +29,34 @@
>  
>  #include 
>  
> -static int ap_callin, ap_cpuid;
> +static int ap_cpuid;
> +
> +/**
> + * Lookup table of x2APIC IDs.
> + *
> + * Each entry is populated its respective CPU as they come online. This is 
> required
> + * for generating the MADT with minimal assumptions about ID relationships.
> + */
> +uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
> +
> +static uint32_t read_apic_id(void)
> +{
> +uint32_t apic_id;
> +
> +cpuid(1, NULL, &apic_id, NULL, NULL);
> +apic_id >>= 24;
> +
> +/*
> + * APIC IDs over 255 are represented by 255 in leaf 1 and are meant to be
> + * read from topology leaves instead. Xen exposes x2APIC IDs in leaf 0xb,
> + * but only if the x2APIC feature is present. If there are that many CPUs
> + * it's guaranteed to be there so we can avoid checking for it 
> specifically
> + */
> +if ( apic_id == 255 )
> +cpuid(0xb, NULL, NULL, NULL, &apic_id);
> +
> +return apic_id;
> +}
>  
>  static void ap_start(void)
>  {
> @@ -37,12 +64,12 @@ static void ap_start(void)
>  cacheattr_init();
>  printf("done.\n");
>  
> +wmb();
> +ACCESS_ONCE(CPU_TO_X2APICID[ap_cpuid]) = read_apic_id();

Further thinking about this: do we really need the wmb(), given the
usage of ACCESS_ONCE()?

wmb() is a compiler barrier, and the usage of volatile in
ACCESS_ONCE() should already prevent any compiler re-ordering.
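
For context, the conventional shapes of these constructs look roughly like
the following (a sketch only, not necessarily hvmloader's exact definitions):

/* Force the compiler to emit exactly one access through a volatile lvalue. */
#define ACCESS_ONCE(x)  (*(volatile typeof(x) *)&(x))

/* Compiler-only barrier: stops reordering by the compiler, not the CPU. */
#define wmb()           asm volatile ( "" ::: "memory" )

Both only constrain the compiler, and x86 doesn't reorder a store with
earlier stores anyway, which is why the explicit wmb() before the
ACCESS_ONCE() store looks redundant here.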

Thanks, Roger.



Re: [PATCH v2 6/8] xen/lib: Add topology generator for x86

2024-05-23 Thread Roger Pau Monné
On Wed, May 08, 2024 at 01:39:25PM +0100, Alejandro Vallejo wrote:
> Add a helper to populate topology leaves in the cpu policy from
> threads/core and cores/package counts.
> 
> No functional change, as it's not connected to anything yet.

There is a functional change in test-cpu-policy.c.

Maybe the commit message needs to be updated to reflect the added
testing to test-cpu-policy.c using the newly introduced helper to
generate topologies?

> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * New patch. Extracted from v1/patch6
> ---
>  tools/tests/cpu-policy/test-cpu-policy.c | 128 +++
>  xen/include/xen/lib/x86/cpu-policy.h |  16 +++
>  xen/lib/x86/policy.c |  86 +++
>  3 files changed, 230 insertions(+)
> 
> diff --git a/tools/tests/cpu-policy/test-cpu-policy.c 
> b/tools/tests/cpu-policy/test-cpu-policy.c
> index 301df2c00285..0ba8c418b1b3 100644
> --- a/tools/tests/cpu-policy/test-cpu-policy.c
> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
> @@ -650,6 +650,132 @@ static void test_is_compatible_failure(void)
>  }
>  }
>  
> +static void test_topo_from_parts(void)
> +{
> +static const struct test {
> +unsigned int threads_per_core;
> +unsigned int cores_per_pkg;
> +struct cpu_policy policy;
> +} tests[] = {
> +{
> +.threads_per_core = 3, .cores_per_pkg = 1,
> +.policy = {
> +.x86_vendor = X86_VENDOR_AMD,
> +.topo.subleaf = {
> +[0] = { .nr_logical = 3, .level = 0, .type = 1, 
> .id_shift = 2, },
> +[1] = { .nr_logical = 1, .level = 1, .type = 2, 
> .id_shift = 2, },
> +},
> +},
> +},
> +{
> +.threads_per_core = 1, .cores_per_pkg = 3,
> +.policy = {
> +.x86_vendor = X86_VENDOR_AMD,
> +.topo.subleaf = {
> +[0] = { .nr_logical = 1, .level = 0, .type = 1, 
> .id_shift = 0, },
> +[1] = { .nr_logical = 3, .level = 1, .type = 2, 
> .id_shift = 2, },
> +},
> +},
> +},
> +{
> +.threads_per_core = 7, .cores_per_pkg = 5,
> +.policy = {
> +.x86_vendor = X86_VENDOR_AMD,
> +.topo.subleaf = {
> +[0] = { .nr_logical = 7, .level = 0, .type = 1, 
> .id_shift = 3, },
> +[1] = { .nr_logical = 5, .level = 1, .type = 2, 
> .id_shift = 6, },
> +},
> +},
> +},
> +{
> +.threads_per_core = 2, .cores_per_pkg = 128,
> +.policy = {
> +.x86_vendor = X86_VENDOR_AMD,
> +.topo.subleaf = {
> +[0] = { .nr_logical = 2, .level = 0, .type = 1, 
> .id_shift = 1, },
> +[1] = { .nr_logical = 128, .level = 1, .type = 2, 
> .id_shift = 8, },
> +},
> +},
> +},
> +{
> +.threads_per_core = 3, .cores_per_pkg = 1,
> +.policy = {
> +.x86_vendor = X86_VENDOR_INTEL,
> +.topo.subleaf = {
> +[0] = { .nr_logical = 3, .level = 0, .type = 1, 
> .id_shift = 2, },
> +[1] = { .nr_logical = 3, .level = 1, .type = 2, 
> .id_shift = 2, },
> +},
> +},
> +},
> +{
> +.threads_per_core = 1, .cores_per_pkg = 3,
> +.policy = {
> +.x86_vendor = X86_VENDOR_INTEL,
> +.topo.subleaf = {
> +[0] = { .nr_logical = 1, .level = 0, .type = 1, 
> .id_shift = 0, },
> +[1] = { .nr_logical = 3, .level = 1, .type = 2, 
> .id_shift = 2, },
> +},
> +},
> +},
> +{
> +.threads_per_core = 7, .cores_per_pkg = 5,
> +.policy = {
> +.x86_vendor = X86_VENDOR_INTEL,
> +.topo.subleaf = {
> +[0] = { .nr_logical = 7, .level = 0, .type = 1, 
> .id_shift = 3, },
> +[1] = { .nr_logical = 35, .level = 1, .type = 2, 
> .id_shift = 6, },
> +},
> +},
> +},
> +{
> +.threads_per_core = 2, .cores_per_pkg = 128,
> +.policy = {
> +.x86_vendor = X86_VENDOR_INTEL,
> +.topo.subleaf = {
> +[0] = { .nr_logical = 2, .level = 0, .type = 1, 
> .id_shift = 1, },
> +[1] = { .nr_logical = 256, .level = 1, .type = 2, 
> .id_shift = 8, },

You don't need the array index in the initialization:

.topo.subleaf = {
{ .nr_logical = 2, .level = 0, .type = 1, .id_shift = 1, },
{ .nr_logical = 256, .level = 1, .type = 2,
  .id_shift = 8, },
}

And lines should be limited to 80 

Re: [PATCH v2 5/8] tools/hvmloader: Retrieve (x2)APIC IDs from the APs themselves

2024-05-23 Thread Roger Pau Monné
On Wed, May 08, 2024 at 01:39:24PM +0100, Alejandro Vallejo wrote:
> Make it so the APs expose their own APIC IDs in a LUT. We can use that LUT to
> populate the MADT, decoupling the algorithm that relates CPU IDs and APIC IDs
> from hvmloader.
> 
> While at this also remove ap_callin, as writing the APIC ID may serve the same
> purpose.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * New patch. Replaces adding cpu policy to hvmloader in v1.
> ---
>  tools/firmware/hvmloader/config.h|  6 -
>  tools/firmware/hvmloader/hvmloader.c |  4 +--
>  tools/firmware/hvmloader/smp.c   | 40 +++-
>  tools/firmware/hvmloader/util.h  |  5 
>  xen/arch/x86/include/asm/hvm/hvm.h   |  1 +
>  5 files changed, 47 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/firmware/hvmloader/config.h 
> b/tools/firmware/hvmloader/config.h
> index c82adf6dc508..edf6fa9c908c 100644
> --- a/tools/firmware/hvmloader/config.h
> +++ b/tools/firmware/hvmloader/config.h
> @@ -4,6 +4,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  enum virtual_vga { VGA_none, VGA_std, VGA_cirrus, VGA_pt };
>  extern enum virtual_vga virtual_vga;
>  
> @@ -49,8 +51,10 @@ extern uint8_t ioapic_version;
>  
>  #define IOAPIC_ID   0x01
>  
> +extern uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
> +
>  #define LAPIC_BASE_ADDRESS  0xfee0
> -#define LAPIC_ID(vcpu_id)   ((vcpu_id) * 2)
> +#define LAPIC_ID(vcpu_id)   (CPU_TO_X2APICID[(vcpu_id)])
>  
>  #define PCI_ISA_DEVFN   0x08/* dev 1, fn 0 */
>  #define PCI_ISA_IRQ_MASK0x0c20U /* ISA IRQs 5,10,11 are PCI connected */
> diff --git a/tools/firmware/hvmloader/hvmloader.c 
> b/tools/firmware/hvmloader/hvmloader.c
> index c58841e5b556..1eba92229925 100644
> --- a/tools/firmware/hvmloader/hvmloader.c
> +++ b/tools/firmware/hvmloader/hvmloader.c
> @@ -342,11 +342,11 @@ int main(void)
>  
>  printf("CPU speed is %u MHz\n", get_cpu_mhz());
>  
> +smp_initialise();
> +
>  apic_setup();
>  pci_setup();
>  
> -smp_initialise();
> -
>  perform_tests();
>  
>  if ( bios->bios_info_setup )
> diff --git a/tools/firmware/hvmloader/smp.c b/tools/firmware/hvmloader/smp.c
> index a668f15d7e1f..4d75f239c2f5 100644
> --- a/tools/firmware/hvmloader/smp.c
> +++ b/tools/firmware/hvmloader/smp.c
> @@ -29,7 +29,34 @@
>  
>  #include 
>  
> -static int ap_callin, ap_cpuid;
> +static int ap_cpuid;
> +
> +/**
> + * Lookup table of x2APIC IDs.
> + *
> + * Each entry is populated its respective CPU as they come online. This is 
> required
> + * for generating the MADT with minimal assumptions about ID relationships.
> + */
> +uint32_t CPU_TO_X2APICID[HVM_MAX_VCPUS];
> +
> +static uint32_t read_apic_id(void)
> +{
> +uint32_t apic_id;
> +
> +cpuid(1, NULL, &apic_id, NULL, NULL);
> +apic_id >>= 24;
> +
> +/*
> + * APIC IDs over 255 are represented by 255 in leaf 1 and are meant to be
> + * read from topology leaves instead. Xen exposes x2APIC IDs in leaf 0xb,
> + * but only if the x2APIC feature is present. If there are that many CPUs
> + * it's guaranteed to be there so we can avoid checking for it 
> specifically
> + */

Maybe I'm missing something, but given the current code won't Xen just
return the low 8 bits from the x2APIC ID?  I don't see any code in
guest_cpuid() that adjusts the IDs to be 255 when > 255.

> +if ( apic_id == 255 )
> +cpuid(0xb, NULL, NULL, NULL, &apic_id);

Won't the correct logic be to check if x2APIC is set in CPUID, and
then fetch the APIC ID from leaf 0xb, otherwise fall back to fetching
the APIC ID from leaf 1?
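
I.e. roughly the following (a sketch of the suggested flow only; CPUID leaf 1
ECX bit 21 is the x2APIC feature flag):

static uint32_t read_apic_id(void)
{
    uint32_t ebx, ecx, edx;

    cpuid(1, NULL, &ebx, &ecx, NULL);
    if ( ecx & (1u << 21) )          /* x2APIC advertised? */
    {
        /* Leaf 0xb EDX holds the full 32-bit x2APIC ID. */
        cpuid(0xb, NULL, NULL, NULL, &edx);
        return edx;
    }

    /* No x2APIC: use the 8-bit initial APIC ID in leaf 1 EBX[31:24]. */
    return ebx >> 24;
}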

> +
> +return apic_id;
> +}
>  
>  static void ap_start(void)
>  {
> @@ -37,12 +64,12 @@ static void ap_start(void)
>  cacheattr_init();
>  printf("done.\n");
>  
> +wmb();
> +ACCESS_ONCE(CPU_TO_X2APICID[ap_cpuid]) = read_apic_id();

A comment would be helpful here, that CPU_TO_X2APICID[ap_cpuid] is
used as synchronization that the AP has started.

You probably want to assert that read_apic_id() doesn't return 0,
otherwise we are skewed.
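
E.g. something along these lines (a sketch only; the exact check is up for
debate):

uint32_t apic_id = read_apic_id();

/*
 * 0 doubles as the "not yet online" sentinel polled by boot_cpu(), so a
 * secondary CPU reporting APIC ID 0 would wedge the BSP's wait loop.
 */
if ( ap_cpuid && !apic_id )
    BUG();

wmb();
/* Publishing the ID also signals boot_cpu() that this AP has started. */
ACCESS_ONCE(CPU_TO_X2APICID[ap_cpuid]) = apic_id;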

> +
>  if ( !ap_cpuid )
>  return;
>  
> -wmb();
> -ap_callin = 1;
> -
>  while ( 1 )
>  asm volatile ( "hlt" );
>  }
> @@ -86,10 +113,11 @@ static void boot_cpu(unsigned int cpu)
>  BUG();
>  
>  /*
> - * Wait for the secondary processor to complete initialisation.
> + * Wait for the secondary processor to complete initialisation,
> + * which is signaled by its x2APIC ID being writted to the LUT.
>   * Do not touch shared resources meanwhile.
>   */
> -while ( !ap_callin )
> +while ( !ACCESS_ONCE(CPU_TO_X2APICID[cpu]) )
>  cpu_relax();

As a further improvement, we could launch all APs in parallel, and use
a for loop to wait until all positions of the CPU_TO_X2APICID array
are set.
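
Roughly like this (a sketch only; boot_cpu_async() is hypothetical, and
hvm_info->nr_vcpus is assumed to be the vCPU count):

void smp_initialise(void)
{
    unsigned int i, nr = hvm_info->nr_vcpus;

    /* Kick all APs without waiting for each one in turn. */
    for ( i = 1; i < nr; i++ )
        boot_cpu_async(i);

    /* Then wait for every LUT slot to become non-zero. */
    for ( i = 1; i < nr; i++ )
        while ( !ACCESS_ONCE(CPU_TO_X2APICID[i]) )
            cpu_relax();
}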

>  
>  /* Take the secondary processor offline. */
> diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
> index 

Re: [PATCH 4.5/8] tools/hvmloader: Further simplify SMP setup

2024-05-23 Thread Roger Pau Monné
On Thu, May 09, 2024 at 06:50:57PM +0100, Andrew Cooper wrote:
> Now that we're using hypercalls to start APs, we can replace the 'ap_cpuid'
> global with a regular function parameter.  This requires telling the compiler
> that we'd like the parameter in a register rather than on the stack.
> 
> While adjusting, rename to cpu_setup().  It's always been used on the BSP,
> making the name ap_start() specifically misleading.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Roger Pau Monné 

Thanks, Roger.



Re: [PATCH v2 2/8] xen/x86: Simplify header dependencies in x86/hvm

2024-05-23 Thread Roger Pau Monné
On Thu, May 23, 2024 at 04:40:06PM +0200, Jan Beulich wrote:
> On 23.05.2024 16:37, Roger Pau Monné wrote:
> > On Wed, May 08, 2024 at 01:39:21PM +0100, Alejandro Vallejo wrote:
> >> --- a/xen/arch/x86/include/asm/hvm/hvm.h
> >> +++ b/xen/arch/x86/include/asm/hvm/hvm.h
> >> @@ -798,6 +798,12 @@ static inline void hvm_update_vlapic_mode(struct vcpu 
> >> *v)
> >>  alternative_vcall(hvm_funcs.update_vlapic_mode, v);
> >>  }
> >>  
> >> +static inline void hvm_vlapic_sync_pir_to_irr(struct vcpu *v)
> >> +{
> >> +if ( hvm_funcs.sync_pir_to_irr )
> >> +alternative_vcall(hvm_funcs.sync_pir_to_irr, v);
> > 
> > Nit: for consistency the wrappers are usually named hvm_<name of the hook>,
> > so in this case it would be hvm_sync_pir_to_irr(), or the hvm_funcs
> > field should be renamed to vlapic_sync_pir_to_irr.
> 
> Funny you should mention that: See my earlier comment as well as what
> was committed.

Oh, sorry, didn't realize you already replied, adjusted and committed.

Thanks, Roger.



Re: [PATCH v2 3/8] x86/vlapic: Move lapic_load_hidden migration checks to the check hook

2024-05-23 Thread Roger Pau Monné
On Wed, May 08, 2024 at 01:39:22PM +0100, Alejandro Vallejo wrote:
> While at it, add a check for the reserved field in the hidden save area.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * New patch. Addresses the missing check for rsvd_zero in v1.

Oh, it would be better if this was done at the time when rsvd_zero is
introduced.  I think this should be moved ahead of the series, so that
the patch that introduces rsvd_zero can add the check in
lapic_check_hidden().

> ---
>  xen/arch/x86/hvm/vlapic.c | 41 ---
>  1 file changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> index 8a24419c..2f06bff1b2cc 100644
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -1573,35 +1573,54 @@ static void lapic_load_fixup(struct vlapic *vlapic)
> v, vlapic->loaded.id, vlapic->loaded.ldr, good_ldr);
>  }
>  
> -static int cf_check lapic_load_hidden(struct domain *d, hvm_domain_context_t 
> *h)
> +static int cf_check lapic_check_hidden(const struct domain *d,
> +   hvm_domain_context_t *h)
>  {
>  unsigned int vcpuid = hvm_load_instance(h);
> -struct vcpu *v;
> -struct vlapic *s;
> +struct hvm_hw_lapic s;
>  
>  if ( !has_vlapic(d) )
>  return -ENODEV;
>  
>  /* Which vlapic to load? */
> -if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
> +if ( vcpuid >= d->max_vcpus || d->vcpu[vcpuid] == NULL )
>  {
>  dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no apic%u\n",
>  d->domain_id, vcpuid);
>  return -EINVAL;
>  }
> -s = vcpu_vlapic(v);
>  
> -if ( hvm_load_entry_zeroextend(LAPIC, h, &s->hw) != 0 )
> +if ( hvm_load_entry_zeroextend(LAPIC, h, &s) )

Can't you use hvm_get_entry() to perform the sanity checks:

const struct hvm_hw_lapic *s = hvm_get_entry(LAPIC, h);

Thanks, Roger.



Re: [PATCH v2 2/8] xen/x86: Simplify header dependencies in x86/hvm

2024-05-23 Thread Roger Pau Monné
On Wed, May 08, 2024 at 01:39:21PM +0100, Alejandro Vallejo wrote:
> Otherwise it's not possible to call functions described in hvm/vlapic.h from 
> the
> inline functions of hvm/hvm.h.
> 
> This is because a static inline in vlapic.h depends on hvm.h, and pulls it
> transitively through vpt.h. The ultimate cause is having hvm.h included in any
> of the "v*.h" headers, so break the cycle moving the guilty inline into hvm.h.
> 
> No functional change.
> 
> Signed-off-by: Alejandro Vallejo 

Acked-by: Roger Pau Monné 

One cosmetic comment below.

> ---
> v2:
>   * New patch. Prereq to moving vlapic_cpu_policy_changed() onto hvm.h
> ---
>  xen/arch/x86/hvm/irq.c| 6 +++---
>  xen/arch/x86/hvm/vlapic.c | 4 ++--
>  xen/arch/x86/include/asm/hvm/hvm.h| 6 ++
>  xen/arch/x86/include/asm/hvm/vlapic.h | 6 --
>  xen/arch/x86/include/asm/hvm/vpt.h| 1 -
>  5 files changed, 11 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
> index 4a9fe82cbd8d..4f5479b12c98 100644
> --- a/xen/arch/x86/hvm/irq.c
> +++ b/xen/arch/x86/hvm/irq.c
> @@ -512,13 +512,13 @@ struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu 
> *v)
>  int vector;
>  
>  /*
> - * Always call vlapic_sync_pir_to_irr so that PIR is synced into IRR when
> - * using posted interrupts. Note this is also done by
> + * Always call hvm_vlapic_sync_pir_to_irr so that PIR is synced into IRR
> + * when using posted interrupts. Note this is also done by
>   * vlapic_has_pending_irq but depending on which interrupts are pending
>   * hvm_vcpu_has_pending_irq will return early without calling
>   * vlapic_has_pending_irq.
>   */
> -vlapic_sync_pir_to_irr(v);
> +hvm_vlapic_sync_pir_to_irr(v);
>  
>  if ( unlikely(v->arch.nmi_pending) )
>  return hvm_intack_nmi;
> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> index 61a96474006b..8a24419c 100644
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -98,7 +98,7 @@ static void vlapic_clear_irr(int vector, struct vlapic 
> *vlapic)
>  
>  static int vlapic_find_highest_irr(struct vlapic *vlapic)
>  {
> -vlapic_sync_pir_to_irr(vlapic_vcpu(vlapic));
> +hvm_vlapic_sync_pir_to_irr(vlapic_vcpu(vlapic));
>  
>  return vlapic_find_highest_vector(&vlapic->regs->data[APIC_IRR]);
>  }
> @@ -1516,7 +1516,7 @@ static int cf_check lapic_save_regs(struct vcpu *v, 
> hvm_domain_context_t *h)
>  if ( !has_vlapic(v->domain) )
>  return 0;
>  
> -vlapic_sync_pir_to_irr(v);
> +hvm_vlapic_sync_pir_to_irr(v);
>  
>  return hvm_save_entry(LAPIC_REGS, v->vcpu_id, h, vcpu_vlapic(v)->regs);
>  }
> diff --git a/xen/arch/x86/include/asm/hvm/hvm.h 
> b/xen/arch/x86/include/asm/hvm/hvm.h
> index e1f0585d75a9..84911f3ebcb4 100644
> --- a/xen/arch/x86/include/asm/hvm/hvm.h
> +++ b/xen/arch/x86/include/asm/hvm/hvm.h
> @@ -798,6 +798,12 @@ static inline void hvm_update_vlapic_mode(struct vcpu *v)
>  alternative_vcall(hvm_funcs.update_vlapic_mode, v);
>  }
>  
> +static inline void hvm_vlapic_sync_pir_to_irr(struct vcpu *v)
> +{
> +if ( hvm_funcs.sync_pir_to_irr )
> +alternative_vcall(hvm_funcs.sync_pir_to_irr, v);

Nit: for consistency the wrappers are usually named hvm_<name of the hook>,
so in this case it would be hvm_sync_pir_to_irr(), or the hvm_funcs
field should be renamed to vlapic_sync_pir_to_irr.

Thanks, Roger.



Re: [PATCH v2 1/8] xen/x86: Add initial x2APIC ID to the per-vLAPIC save area

2024-05-23 Thread Roger Pau Monné
On Wed, May 08, 2024 at 01:39:20PM +0100, Alejandro Vallejo wrote:
> This allows the initial x2APIC ID to be sent on the migration stream. The
> hardcoded mapping x2apic_id=2*vcpu_id is maintained for the time being.
> Given the vlapic data is zero-extended on restore, fix up migrations from
> hosts without the field by setting it to the old convention if zero.
> 
> x2APIC IDs are calculated from the CPU policy where the guest topology is
> defined. For the time being, the function simply returns the old
> relationship, but will eventually return results consistent with the
> topology.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * Removed usage of SET_xAPIC_ID().
>   * Restored previous logic when exposing leaf 0xb, and gate it for HVM only.
>   * Rewrote comment in lapic_load_fixup, including the implicit assumption.
>   * Moved vlapic_cpu_policy_changed() into hvm_cpuid_policy_changed())
>   * const-ified policy in vlapic_cpu_policy_changed()
> ---
>  xen/arch/x86/cpuid.c   | 15 -
>  xen/arch/x86/hvm/vlapic.c  | 30 --
>  xen/arch/x86/include/asm/hvm/hvm.h |  1 +
>  xen/arch/x86/include/asm/hvm/vlapic.h  |  2 ++
>  xen/include/public/arch-x86/hvm/save.h |  2 ++
>  xen/include/xen/lib/x86/cpu-policy.h   |  9 
>  xen/lib/x86/policy.c   | 11 ++
>  7 files changed, 57 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
> index 7a38e032146a..242c21ec5bb6 100644
> --- a/xen/arch/x86/cpuid.c
> +++ b/xen/arch/x86/cpuid.c
> @@ -139,10 +139,9 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
>  const struct cpu_user_regs *regs;
>  
>  case 0x1:
> -/* TODO: Rework topology logic. */
>  res->b &= 0x00ffu;
>  if ( is_hvm_domain(d) )
> -res->b |= (v->vcpu_id * 2) << 24;
> +res->b |= vlapic_x2apic_id(vcpu_vlapic(v)) << 24;
>  
>  /* TODO: Rework vPMU control in terms of toolstack choices. */
>  if ( vpmu_available(v) &&
> @@ -311,19 +310,13 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
>  break;
>  
>  case 0xb:
> -/*
> - * In principle, this leaf is Intel-only.  In practice, it is tightly
> - * coupled with x2apic, and we offer an x2apic-capable APIC emulation
> - * to guests on AMD hardware as well.
> - *
> - * TODO: Rework topology logic.
> - */
> -if ( p->basic.x2apic )
> +/* Don't expose topology information to PV guests */

Not sure whether we want to keep part of the comment about exposing
x2APIC to guests even when x2APIC is not present in the host.  I think
this code has changed and the comment is kind of stale now.

> +if ( is_hvm_domain(d) && p->basic.x2apic )
>  {
> +*(uint8_t *)&res->c = subleaf;
>  
>  /* Fix the x2APIC identifier. */
> -res->d = v->vcpu_id * 2;
> +res->d = vlapic_x2apic_id(vcpu_vlapic(v));
>  }
>  break;
>  
> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> index 05072a21bf38..61a96474006b 100644
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -1069,7 +1069,7 @@ static uint32_t x2apic_ldr_from_id(uint32_t id)
>  static void set_x2apic_id(struct vlapic *vlapic)
>  {
>  const struct vcpu *v = vlapic_vcpu(vlapic);
> -uint32_t apic_id = v->vcpu_id * 2;
> +uint32_t apic_id = vlapic->hw.x2apic_id;
>  uint32_t apic_ldr = x2apic_ldr_from_id(apic_id);
>  
>  /*
> @@ -1083,6 +1083,22 @@ static void set_x2apic_id(struct vlapic *vlapic)
>  vlapic_set_reg(vlapic, APIC_LDR, apic_ldr);
>  }
>  
> +void vlapic_cpu_policy_changed(struct vcpu *v)
> +{
> +struct vlapic *vlapic = vcpu_vlapic(v);
> +const struct cpu_policy *cp = v->domain->arch.cpu_policy;
> +
> +/*
> + * Don't override the initial x2APIC ID if we have migrated it or
> + * if the domain doesn't have vLAPIC at all.
> + */
> +if ( !has_vlapic(v->domain) || vlapic->loaded.hw )
> +return;
> +
> +vlapic->hw.x2apic_id = x86_x2apic_id_from_vcpu_id(cp, v->vcpu_id);
> +vlapic_set_reg(vlapic, APIC_ID, SET_xAPIC_ID(vlapic->hw.x2apic_id));

Nit: in case we decide to start APICs in x2APIC mode, might be good to
take this into account here and use vlapic_x2apic_mode(vlapic) to
select whether SET_xAPIC_ID() needs to be used or not:

vlapic_set_reg(vlapic, APIC_ID,
vlapic_x2apic_mode(vlapic) ? vlapic->hw.x2apic_id
   : SET_xAPIC_ID(vlapic->hw.x2apic_id));

Or similar.

> +}
> +
>  int guest_wrmsr_apic_base(struct vcpu *v, uint64_t val)
>  {
>  const struct cpu_policy *cp = v->domain->arch.cpu_policy;
> @@ -1449,7 +1465,7 @@ void vlapic_reset(struct vlapic *vlapic)
>  if ( v->vcpu_id == 0 )
>  vlapic->hw.apic_base_msr |= APIC_BASE_BSP;
>  
> -vlapic_set_reg(vlapic, APIC_ID, 

Re: [PATCH for-4.19 v3 2/3] xen: enable altp2m at create domain domctl

2024-05-23 Thread Roger Pau Monné
On Fri, May 17, 2024 at 03:33:51PM +0200, Roger Pau Monne wrote:
> Enabling it using an HVM param is fragile, and complicates the logic when
> deciding whether options that interact with altp2m can also be enabled.
> 
> Leave the HVM param value for consumption by the guest, but prevent it from
> being set.  Enabling is now done using and additional altp2m specific field in
> xen_domctl_createdomain.
> 
> Note that albeit only currently implemented in x86, altp2m could be 
> implemented
> in other architectures, hence why the field is added to 
> xen_domctl_createdomain
> instead of xen_arch_domainconfig.
> 
> Signed-off-by: Roger Pau Monné 
> ---
> Changes since v2:
>  - Introduce a new altp2m field in xen_domctl_createdomain.
> 
> Changes since v1:
>  - New in this version.
> ---
>  tools/libs/light/libxl_create.c | 23 ++-
>  tools/libs/light/libxl_x86.c| 26 --
>  tools/ocaml/libs/xc/xenctrl_stubs.c |  2 +-
>  xen/arch/arm/domain.c   |  6 ++

Could I get an Ack from one of the Arm maintainers for the trivial Arm
change?

Thanks, Roger.



Re: [PATCH v3 2/2] tools/xg: Clean up xend-style overrides for CPU policies

2024-05-23 Thread Roger Pau Monné
On Thu, May 23, 2024 at 10:41:30AM +0100, Alejandro Vallejo wrote:
> Factor out policy getters/setters from both (CPUID and MSR) policy override
> functions. Additionally, use host policy rather than featureset when
> preparing the cur policy, saving one hypercall and several lines of
> boilerplate.
> 
> No functional change intended.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v3:
>   * Restored overscoped loop indices
>   * Split long line in conditional
> ---
>  tools/libs/guest/xg_cpuid_x86.c | 438 ++--
>  1 file changed, 131 insertions(+), 307 deletions(-)
> 
> diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
> index 4f4b86b59470..1e631fd46d2f 100644
> --- a/tools/libs/guest/xg_cpuid_x86.c
> +++ b/tools/libs/guest/xg_cpuid_x86.c
> @@ -36,6 +36,34 @@ enum {
>  #define bitmaskof(idx)  (1u << ((idx) & 31))
>  #define featureword_of(idx) ((idx) >> 5)
>  
> +static int deserialize_policy(xc_interface *xch, xc_cpu_policy_t *policy)
> +{
> +uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
> +int rc;
> +
> +rc = x86_cpuid_copy_from_buffer(&policy->policy, policy->leaves,
> +policy->nr_leaves, &err_leaf, 
> &err_subleaf);
> +if ( rc )
> +{
> +if ( err_leaf != -1 )
> +ERROR("Failed to deserialise CPUID (err leaf %#x, subleaf %#x) 
> (%d = %s)",
> +  err_leaf, err_subleaf, -rc, strerror(-rc));
> +return rc;
> +}
> +
> +rc = x86_msr_copy_from_buffer(&policy->policy, policy->msrs,
> +  policy->nr_msrs, &err_msr);
> +if ( rc )
> +{
> +if ( err_msr != -1 )
> +ERROR("Failed to deserialise MSR (err MSR %#x) (%d = %s)",
> +  err_msr, -rc, strerror(-rc));
> +return rc;
> +}
> +
> +return 0;
> +}
> +
>  int xc_get_cpu_levelling_caps(xc_interface *xch, uint32_t *caps)
>  {
>  struct xen_sysctl sysctl = {};
> @@ -260,102 +288,37 @@ static int compare_leaves(const void *l, const void *r)
>  return 0;
>  }
>  
> -static xen_cpuid_leaf_t *find_leaf(
> -xen_cpuid_leaf_t *leaves, unsigned int nr_leaves,
> -const struct xc_xend_cpuid *xend)
> +static xen_cpuid_leaf_t *find_leaf(xc_cpu_policy_t *p,
> +   const struct xc_xend_cpuid *xend)
>  {
>  const xen_cpuid_leaf_t key = { xend->leaf, xend->subleaf };
>  
> -return bsearch(&key, leaves, nr_leaves, sizeof(*leaves), compare_leaves);
> +return bsearch(&key, p->leaves, ARRAY_SIZE(p->leaves),

Don't you need to use p->nr_leaves here, as otherwise we could check
against possibly uninitialized leaves (or leaves with stale data)?

> +   sizeof(*p->leaves), compare_leaves);
>  }
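
For reference, a sketch of the call with the suggested length fix (field
names as in the quoted patch):

return bsearch(&key, p->leaves, p->nr_leaves,
               sizeof(*p->leaves), compare_leaves);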
>  
> -static int xc_cpuid_xend_policy(
> -xc_interface *xch, uint32_t domid, const struct xc_xend_cpuid *xend)
> +static int xc_cpuid_xend_policy(xc_interface *xch, uint32_t domid,
> +const struct xc_xend_cpuid *xend,
> +xc_cpu_policy_t *host,
> +xc_cpu_policy_t *def,
> +xc_cpu_policy_t *cur)
>  {
> -int rc;
> -bool hvm;
> -xc_domaininfo_t di;
> -unsigned int nr_leaves, nr_msrs;
> -uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
> -/*
> - * Three full policies.  The host, default for the domain type,
> - * and domain current.
> - */
> -xen_cpuid_leaf_t *host = NULL, *def = NULL, *cur = NULL;
> -unsigned int nr_host, nr_def, nr_cur;
> -
> -if ( (rc = xc_domain_getinfo_single(xch, domid, &di)) < 0 )
> -{
> -PERROR("Failed to obtain d%d info", domid);
> -rc = -errno;
> -goto fail;
> -}
> -hvm = di.flags & XEN_DOMINF_hvm_guest;
> -
> -rc = xc_cpu_policy_get_size(xch, &nr_leaves, &nr_msrs);
> -if ( rc )
> -{
> -PERROR("Failed to obtain policy info size");
> -rc = -errno;
> -goto fail;
> -}
> -
> -rc = -ENOMEM;
> -if ( (host = calloc(nr_leaves, sizeof(*host))) == NULL ||
> - (def  = calloc(nr_leaves, sizeof(*def)))  == NULL ||
> - (cur  = calloc(nr_leaves, sizeof(*cur)))  == NULL )
> -{
> -ERROR("Unable to allocate memory for %u CPUID leaves", nr_leaves);
> -goto fail;
> -}
> -
> -/* Get the domain's current policy. */
> -nr_msrs = 0;
> -nr_cur = nr_leaves;
> -rc = get_domain_cpu_policy(xch, domid, &nr_cur, cur, &nr_msrs, NULL);
> -if ( rc )
> -{
> -PERROR("Failed to obtain d%d current policy", domid);
> -rc = -errno;
> -goto fail;
> -}
> +if ( !xend )
> +return 0;
>  
> -/* Get the domain type's default policy. */
> -nr_msrs = 0;
> -nr_def = nr_leaves;
> -rc = get_system_cpu_policy(xch, hvm ? XEN_SYSCTL_cpu_policy_hvm_default
> -: XEN_SYSCTL_cpu_policy_pv_default,
> -   

Re: [PATCH v3 1/2] tools/xg: Streamline cpu policy serialise/deserialise calls

2024-05-23 Thread Roger Pau Monné
On Thu, May 23, 2024 at 10:41:29AM +0100, Alejandro Vallejo wrote:
> The idea is to use xc_cpu_policy_t as a single object containing both the
> serialised and deserialised forms of the policy. Note that we need lengths
> for the arrays, as the serialised policies may be shorter than the array
> capacities.
> 
> * Add the serialised lengths to the struct so we can distinguish
>   between length and capacity of the serialisation buffers.
> * Remove explicit buffer+lengths in serialise/deserialise calls
>   and use the internal buffer inside xc_cpu_policy_t instead.
> * Refactor everything to use the new serialisation functions.
> * Remove redundant serialization calls and avoid allocating dynamic
>   memory aside from the policy objects in xen-cpuid. Also minor cleanup
>   in the policy print call sites.
> 
> No functional change intended.
> 
> Signed-off-by: Alejandro Vallejo 

Acked-by: Roger Pau Monné 

Just two comments.

> ---
> v3:
>   * Better context scoping in xg_sr_common_x86.
> * Can't be const because write_record() takes non-const.
>   * Adjusted line length of xen-cpuid's print_policy.
>   * Adjusted error messages in xen-cpuid's print_policy.
>   * Reverted removal of overscoped loop indices.
> ---
>  tools/include/xenguest.h|  8 ++-
>  tools/libs/guest/xg_cpuid_x86.c | 98 -
>  tools/libs/guest/xg_private.h   |  2 +
>  tools/libs/guest/xg_sr_common_x86.c | 56 ++---
>  tools/misc/xen-cpuid.c  | 41 
>  5 files changed, 106 insertions(+), 99 deletions(-)
> 
> diff --git a/tools/include/xenguest.h b/tools/include/xenguest.h
> index e01f494b772a..563811cd8dde 100644
> --- a/tools/include/xenguest.h
> +++ b/tools/include/xenguest.h
> @@ -799,14 +799,16 @@ int xc_cpu_policy_set_domain(xc_interface *xch, 
> uint32_t domid,
>   xc_cpu_policy_t *policy);
>  
>  /* Manipulate a policy via architectural representations. */
> -int xc_cpu_policy_serialise(xc_interface *xch, const xc_cpu_policy_t *policy,
> -xen_cpuid_leaf_t *leaves, uint32_t *nr_leaves,
> -xen_msr_entry_t *msrs, uint32_t *nr_msrs);
> +int xc_cpu_policy_serialise(xc_interface *xch, xc_cpu_policy_t *policy);
>  int xc_cpu_policy_update_cpuid(xc_interface *xch, xc_cpu_policy_t *policy,
> const xen_cpuid_leaf_t *leaves,
> uint32_t nr);
>  int xc_cpu_policy_update_msrs(xc_interface *xch, xc_cpu_policy_t *policy,
>const xen_msr_entry_t *msrs, uint32_t nr);
> +int xc_cpu_policy_get_leaves(xc_interface *xch, const xc_cpu_policy_t 
> *policy,
> + const xen_cpuid_leaf_t **leaves, uint32_t *nr);
> +int xc_cpu_policy_get_msrs(xc_interface *xch, const xc_cpu_policy_t *policy,
> +   const xen_msr_entry_t **msrs, uint32_t *nr);

Maybe it would be helpful to have a comment clarifying that the return
of xc_cpu_policy_get_{leaves,msrs}() is a reference to the content of
the policy, not a copy of it (and hence is tied to the lifetime of
policy, and doesn't require explicit freeing).
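
Something along these lines could work as the clarifying comment (wording is
only a suggestion):

/*
 * Note: the pointers returned through @leaves / @msrs alias the policy's
 * internal buffers.  They are only valid for the lifetime of @policy and
 * must not be freed by the caller.
 */
int xc_cpu_policy_get_leaves(xc_interface *xch, const xc_cpu_policy_t *policy,
                             const xen_cpuid_leaf_t **leaves, uint32_t *nr);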

>  
>  /* Compatibility calculations. */
>  bool xc_cpu_policy_is_compatible(xc_interface *xch, xc_cpu_policy_t *host,
> diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
> index 4453178100ad..4f4b86b59470 100644
> --- a/tools/libs/guest/xg_cpuid_x86.c
> +++ b/tools/libs/guest/xg_cpuid_x86.c
> @@ -834,14 +834,13 @@ void xc_cpu_policy_destroy(xc_cpu_policy_t *policy)
>  }
>  }
>  
> -static int deserialize_policy(xc_interface *xch, xc_cpu_policy_t *policy,
> -  unsigned int nr_leaves, unsigned int 
> nr_entries)
> +static int deserialize_policy(xc_interface *xch, xc_cpu_policy_t *policy)
>  {
>  uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
>  int rc;
>  
>  rc = x86_cpuid_copy_from_buffer(&policy->policy, policy->leaves,
> -nr_leaves, &err_leaf, &err_subleaf);
> +policy->nr_leaves, &err_leaf, 
> &err_subleaf);
>  if ( rc )
>  {
>  if ( err_leaf != -1 )
> @@ -851,7 +850,7 @@ static int deserialize_policy(xc_interface *xch, 
> xc_cpu_policy_t *policy,
>  }
>  
>  rc = x86_msr_copy_from_buffer(&policy->policy, policy->msrs,
> -  nr_entries, &err_msr);
> +  policy->nr_msrs, &err_msr);
>  if ( rc )
>  {
>  if ( err_msr != -1 )
> @@ -878,7 +877,10 @@ int xc_cpu_policy_get_system(xc_interface *xch, unsigned 
> int policy_idx,
>  return rc;
>  }
&

Re: [XEN PATCH] x86/iommu: Conditionally compile platform-specific union entries

2024-05-23 Thread Roger Pau Monné
On Thu, May 23, 2024 at 09:19:53AM +, Teddy Astie wrote:
> If some platform driver isn't compiled in, remove its related union
> entries as they are not used.
> 
> Signed-off-by Teddy Astie 
> ---
>  xen/arch/x86/include/asm/iommu.h | 4 
>  xen/arch/x86/include/asm/pci.h   | 4 
>  2 files changed, 8 insertions(+)
> 
> diff --git a/xen/arch/x86/include/asm/iommu.h 
> b/xen/arch/x86/include/asm/iommu.h
> index 8dc464fbd3..99180940c4 100644
> --- a/xen/arch/x86/include/asm/iommu.h
> +++ b/xen/arch/x86/include/asm/iommu.h
> @@ -42,17 +42,21 @@ struct arch_iommu
>  struct list_head identity_maps;
>  
>  union {
> +#ifdef CONFIG_INTEL_IOMMU
>  /* Intel VT-d */
>  struct {
>  uint64_t pgd_maddr; /* io page directory machine address */
>  unsigned int agaw; /* adjusted guest address width, 0 is level 2 
> 30-bit */
>  unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the 
> domain uses */
>  } vtd;
> +#endif
> +#ifdef CONFIG_AMD_IOMMU
>  /* AMD IOMMU */
>  struct {
>  unsigned int paging_mode;
>  struct page_info *root_table;
>  } amd;
> +#endif
>  };
>  };
>  
> diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
> index fd5480d67d..842710f0dc 100644
> --- a/xen/arch/x86/include/asm/pci.h
> +++ b/xen/arch/x86/include/asm/pci.h
> @@ -22,12 +22,16 @@ struct arch_pci_dev {
>   */
>  union {
>  /* Subset of struct arch_iommu's fields, to be used in dom_io. */
> +#ifdef CONFIG_INTEL_IOMMU
>  struct {
>  uint64_t pgd_maddr;
>  } vtd;
> +#endif
> +#ifdef CONFIG_AMD_IOMMU
>  struct {
>  struct page_info *root_table;
>  } amd;
> +#endif
>  };

The #ifdef and #endif processor directives shouldn't be indented.

Would you mind adding /* CONFIG_{AMD,INTEL}_IOMMU */ comments in the
#endif directives?
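
I.e. something along these lines (a sketch of the requested layout only):

    union {
#ifdef CONFIG_INTEL_IOMMU
        /* Intel VT-d */
        struct {
            uint64_t pgd_maddr;
            unsigned int agaw;
            unsigned long *iommu_bitmap;
        } vtd;
#endif /* CONFIG_INTEL_IOMMU */
#ifdef CONFIG_AMD_IOMMU
        /* AMD IOMMU */
        struct {
            unsigned int paging_mode;
            struct page_info *root_table;
        } amd;
#endif /* CONFIG_AMD_IOMMU */
    };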

I wonder if we could move the definitions of those structures to the
vendor specific headers, but that's more convoluted, and would require
including the iommu headers in pci.h

Thanks, Roger.



Re: [PATCH v16 5/5] xen/arm: account IO handlers for emulated PCI MSI-X

2024-05-23 Thread Roger Pau Monné
On Wed, May 22, 2024 at 06:59:24PM -0400, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko 
> 
> At the moment, we always allocate an extra 16 slots for IO handlers
> (see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
> MSI-X registers we need to explicitly tell that we have additional IO
> handlers, so those are accounted.
> 
> Signed-off-by: Oleksandr Andrushchenko 
> Acked-by: Julien Grall 
> Signed-off-by: Volodymyr Babchuk 
> Signed-off-by: Stewart Hildebrand 
> ---
> This depends on a constant defined in ("vpci: add initial support for
> virtual PCI bus topology"), so cannot be committed without the
> dependency.
> 
> Since v5:
> - optimize with IS_ENABLED(CONFIG_HAS_PCI_MSI) since VPCI_MAX_VIRT_DEV is
>   defined unconditionally
> New in v5
> ---
>  xen/arch/arm/vpci.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
> index 516933bebfb3..4779bbfa9be3 100644
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -132,6 +132,8 @@ static int vpci_get_num_handlers_cb(struct domain *d,
>  
>  unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
>  {
> +unsigned int count;
> +
>  if ( !has_vpci(d) )
>  return 0;
>  
> @@ -152,7 +154,17 @@ unsigned int domain_vpci_get_num_mmio_handlers(struct 
> domain *d)
>   * For guests each host bridge requires one region to cover the
>   * configuration space. At the moment, we only expose a single host 
> bridge.
>   */
> -return 1;
> +count = 1;
> +
> +/*
> + * There's a single MSI-X MMIO handler that deals with both PBA
> + * and MSI-X tables per each PCI device being passed through.
> + * Maximum number of emulated virtual devices is VPCI_MAX_VIRT_DEV.
> + */
> +if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
> +count += VPCI_MAX_VIRT_DEV;

I think this was already raised in a previous version, at some point
you will need to consider making this a linker list or similar.  The
array is also not very helpful, as you still need to iterate over the
slots in order to find which handler should dispatch the access.

(Not that I oppose this patch, but the handlers array cannot be
expanded indefinitely).

Thanks, Roger.



Re: [PATCH v16 1/5] arm/vpci: honor access size when returning an error

2024-05-23 Thread Roger Pau Monné
On Wed, May 22, 2024 at 06:59:20PM -0400, Stewart Hildebrand wrote:
> From: Volodymyr Babchuk 
> 
> Guest can try to read config space using different access sizes: 8,
> 16, 32, 64 bits. We need to take this into account when we are
> returning an error back to MMIO handler, otherwise it is possible to
> provide more data than requested: i.e. guest issues LDRB instruction
> to read one byte, but we are writing 0xFFFFFFFFFFFFFFFF in the target
> register.

Shouldn't this be taken care of in the trap handler subsystem, rather
than forcing each handler to ensure the returned data matches the
access size?

IOW, something like:

diff --git a/xen/arch/arm/io.c b/xen/arch/arm/io.c
index 96c740d5636c..b7e12df85f87 100644
--- a/xen/arch/arm/io.c
+++ b/xen/arch/arm/io.c
@@ -37,6 +37,7 @@ static enum io_state handle_read(const struct mmio_handler 
*handler,
 return IO_ABORT;

 r = sign_extend(dabt, r);
+r = r & GENMASK_ULL((1U << dabt.size) * 8 - 1, 0);

 set_user_reg(regs, dabt.reg, r);

(not even build tested)

Thanks, Roger.



Re: [PATCH v16 4/5] xen/arm: translate virtual PCI bus topology for guests

2024-05-23 Thread Roger Pau Monné
On Wed, May 22, 2024 at 06:59:23PM -0400, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko 
> 
> There are three  originators for the PCI configuration space access:
> 1. The domain that owns physical host bridge: MMIO handlers are
> there so we can update vPCI register handlers with the values
> written by the hardware domain, e.g. physical view of the registers
> vs guest's view on the configuration space.
> 2. Guest access to the passed through PCI devices: we need to properly
> map virtual bus topology to the physical one, e.g. pass the configuration
> space access to the corresponding physical devices.
> 3. Emulated host PCI bridge access. It doesn't exist in the physical
> topology, e.g. it can't be mapped to some physical host bridge.
> So, all access to the host bridge itself needs to be trapped and
> emulated.
> 
> Signed-off-by: Oleksandr Andrushchenko 
> Signed-off-by: Volodymyr Babchuk 
> Signed-off-by: Stewart Hildebrand 

Acked-by: Roger Pau Monné 

One unrelated question below.

> ---
> In v15:
> - base on top of ("arm/vpci: honor access size when returning an error")
> In v11:
> - Fixed format issues
> - Added ASSERT_UNREACHABLE() to the dummy implementation of
> vpci_translate_virtual_device()
> - Moved variable in vpci_sbdf_from_gpa(), now it is easier to follow
> the logic in the function
> Since v9:
> - Commend about required lock replaced with ASSERT()
> - Style fixes
> - call to vpci_translate_virtual_device folded into vpci_sbdf_from_gpa
> Since v8:
> - locks moved out of vpci_translate_virtual_device()
> Since v6:
> - add pcidevs locking to vpci_translate_virtual_device
> - update wrt to the new locking scheme
> Since v5:
> - add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
>   case to simplify ifdefery
> - add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
> - reset output register on failed virtual SBDF translation
> Since v4:
> - indentation fixes
> - constify struct domain
> - updated commit message
> - updates to the new locking scheme (pdev->vpci_lock)
> Since v3:
> - revisit locking
> - move code to vpci.c
> Since v2:
>  - pass struct domain instead of struct vcpu
>  - constify arguments where possible
>  - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/arch/arm/vpci.c | 45 -
>  xen/drivers/vpci/vpci.c | 24 ++
>  xen/include/xen/vpci.h  | 12 +++
>  3 files changed, 71 insertions(+), 10 deletions(-)
> 
> diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
> index b63a356bb4a8..516933bebfb3 100644
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -7,33 +7,53 @@
>  
>  #include 
>  
> -static pci_sbdf_t vpci_sbdf_from_gpa(const struct pci_host_bridge *bridge,
> - paddr_t gpa)
> +static bool vpci_sbdf_from_gpa(struct domain *d,
> +   const struct pci_host_bridge *bridge,
> +   paddr_t gpa, pci_sbdf_t *sbdf)
>  {
> -pci_sbdf_t sbdf;
> +bool translated = true;
> +
> +ASSERT(sbdf);
>  
>  if ( bridge )
>  {
> -sbdf.sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
> -sbdf.seg = bridge->segment;
> -sbdf.bus += bridge->cfg->busn_start;
> +sbdf->sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
> +sbdf->seg = bridge->segment;
> +sbdf->bus += bridge->cfg->busn_start;
>  }
>  else
> -sbdf.sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
> +{
> +/*
> + * For the passed through devices we need to map their virtual SBDF
> + * to the physical PCI device being passed through.
> + */
> +sbdf->sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
> +read_lock(&d->pci_lock);
> +translated = vpci_translate_virtual_device(d, sbdf);
> +read_unlock(&d->pci_lock);

I would consider moving the read_{,un}lock() calls inside
vpci_translate_virtual_device(), if that's the only caller of
vpci_translate_virtual_device().  Maybe further patches add other
instances that call from an already locked context.
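
IOW, something along these lines (rough sketch, assuming no caller needs
to enter with d->pci_lock already held):

    bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
    {
        bool found;

        read_lock(&d->pci_lock);
        /* ... existing virtual-to-physical SBDF lookup sets 'found' ... */
        read_unlock(&d->pci_lock);

        return found;
    }

with the caller here reduced to:

    translated = vpci_translate_virtual_device(d, sbdf);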

> +}
>  
> -return sbdf;
> +return translated;
>  }
>  
>  static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
>register_t *r, void *p)
>  {
>  struct pci_host_bridge *bridge = p;
> -pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +pci_sbdf_t sbdf;
>  const unsigned int access_size = (1U << info->dabt.size) * 8;
>  const register_t invalid = GENMASK_ULL(access_size - 1, 0);

Do you know why the invalid value is truncated to the access size?
Won't it be simpler to just set the whole variable to 1s? (~0)

TBH, you could just set:

*r = ~(register_t)0;

At the top of the function and get rid of the local invalid variable
plus having to set r on all error paths.
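
I.e. something along these lines (untested sketch, just to illustrate):

    static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
                              register_t *r, void *p)
    {
        struct pci_host_bridge *bridge = p;
        pci_sbdf_t sbdf;

        /* Default to all-ones, so error paths below can simply bail out. */
        *r = ~(register_t)0;

        if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
            return 1;
        ...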

Thanks, Roger.



Re: [PATCH v16 3/5] vpci: add initial support for virtual PCI bus topology

2024-05-23 Thread Roger Pau Monné
On Wed, May 22, 2024 at 06:59:22PM -0400, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko 
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko 
> Signed-off-by: Volodymyr Babchuk 
> Signed-off-by: Stewart Hildebrand 
> Acked-by: Jan Beulich 

Acked-by: Roger Pau Monné 

Thanks, Roger.



Re: [PATCH for-4.19 v3 2/3] xen: enable altp2m at create domain domctl

2024-05-22 Thread Roger Pau Monné
On Wed, May 22, 2024 at 03:34:29PM +0200, Jan Beulich wrote:
> On 22.05.2024 15:16, Roger Pau Monné wrote:
> > On Tue, May 21, 2024 at 12:30:32PM +0200, Jan Beulich wrote:
> >> On 17.05.2024 15:33, Roger Pau Monne wrote:
> >>> Enabling it using an HVM param is fragile, and complicates the logic when
> >>> deciding whether options that interact with altp2m can also be enabled.
> >>>
> >>> Leave the HVM param value for consumption by the guest, but prevent it 
> >>> from
> >>> being set.  Enabling is now done using an additional altp2m specific 
> >>> field in
> >>> xen_domctl_createdomain.
> >>>
> >>> Note that albeit only currently implemented in x86, altp2m could be 
> >>> implemented
> >>> in other architectures, hence why the field is added to 
> >>> xen_domctl_createdomain
> >>> instead of xen_arch_domainconfig.
> >>>
> >>> Signed-off-by: Roger Pau Monné 
> >>
> >> Reviewed-by: Jan Beulich  # hypervisor
> >> albeit with one question:
> >>
> >>> --- a/xen/arch/x86/domain.c
> >>> +++ b/xen/arch/x86/domain.c
> >>> @@ -637,6 +637,8 @@ int arch_sanitise_domain_config(struct 
> >>> xen_domctl_createdomain *config)
> >>>  bool hap = config->flags & XEN_DOMCTL_CDF_hap;
> >>>  bool nested_virt = config->flags & XEN_DOMCTL_CDF_nested_virt;
> >>>  unsigned int max_vcpus;
> >>> +unsigned int altp2m_mode = MASK_EXTR(config->altp2m_opts,
> >>> + XEN_DOMCTL_ALTP2M_mode_mask);
> >>>  
> >>>  if ( hvm ? !hvm_enabled : !IS_ENABLED(CONFIG_PV) )
> >>>  {
> >>> @@ -715,6 +717,26 @@ int arch_sanitise_domain_config(struct 
> >>> xen_domctl_createdomain *config)
> >>>  return -EINVAL;
> >>>  }
> >>>  
> >>> +if ( config->altp2m_opts & ~XEN_DOMCTL_ALTP2M_mode_mask )
> >>> +{
> >>> +dprintk(XENLOG_INFO, "Invalid altp2m options selected: %#x\n",
> >>> +config->flags);
> >>> +return -EINVAL;
> >>> +}
> >>> +
> >>> +if ( altp2m_mode && nested_virt )
> >>> +{
> >>> +dprintk(XENLOG_INFO,
> >>> +"Nested virt and altp2m are not supported together\n");
> >>> +return -EINVAL;
> >>> +}
> >>> +
> >>> +if ( altp2m_mode && !hap )
> >>> +{
> >>> +dprintk(XENLOG_INFO, "altp2m is only supported with HAP\n");
> >>> +return -EINVAL;
> >>> +}
> >>
> >> Should this last one perhaps be further extended to permit altp2m with EPT
> >> only?
> > 
> > Hm, yes, that would be more accurate as:
> > 
> > if ( altp2m_mode && (!hap || !hvm_altp2m_supported()) )
> 
> Wouldn't
> 
>if ( altp2m_mode && !hvm_altp2m_supported() )
> 
> suffice? hvm_funcs.caps.altp2m is not supposed to be set when no HAP,
> as long as HAP continues to be a pre-condition?

No, `hap` here signals whether the domain is using HAP, and we need to
take this into account, otherwise we would allow enabling altp2m for
domains using shadow.

Thanks, Roger.



Re: [PATCH for-4.19 v3 2/3] xen: enable altp2m at create domain domctl

2024-05-22 Thread Roger Pau Monné
On Tue, May 21, 2024 at 12:30:32PM +0200, Jan Beulich wrote:
> On 17.05.2024 15:33, Roger Pau Monne wrote:
> > Enabling it using an HVM param is fragile, and complicates the logic when
> > deciding whether options that interact with altp2m can also be enabled.
> > 
> > Leave the HVM param value for consumption by the guest, but prevent it from
> > being set.  Enabling is now done using an additional altp2m specific field 
> > in
> > xen_domctl_createdomain.
> > 
> > Note that albeit only currently implemented in x86, altp2m could be 
> > implemented
> > in other architectures, hence why the field is added to 
> > xen_domctl_createdomain
> > instead of xen_arch_domainconfig.
> > 
> > Signed-off-by: Roger Pau Monné 
> 
> Reviewed-by: Jan Beulich  # hypervisor
> albeit with one question:
> 
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -637,6 +637,8 @@ int arch_sanitise_domain_config(struct 
> > xen_domctl_createdomain *config)
> >  bool hap = config->flags & XEN_DOMCTL_CDF_hap;
> >  bool nested_virt = config->flags & XEN_DOMCTL_CDF_nested_virt;
> >  unsigned int max_vcpus;
> > +unsigned int altp2m_mode = MASK_EXTR(config->altp2m_opts,
> > + XEN_DOMCTL_ALTP2M_mode_mask);
> >  
> >  if ( hvm ? !hvm_enabled : !IS_ENABLED(CONFIG_PV) )
> >  {
> > @@ -715,6 +717,26 @@ int arch_sanitise_domain_config(struct 
> > xen_domctl_createdomain *config)
> >  return -EINVAL;
> >  }
> >  
> > +if ( config->altp2m_opts & ~XEN_DOMCTL_ALTP2M_mode_mask )
> > +{
> > +dprintk(XENLOG_INFO, "Invalid altp2m options selected: %#x\n",
> > +config->flags);
> > +return -EINVAL;
> > +}
> > +
> > +if ( altp2m_mode && nested_virt )
> > +{
> > +dprintk(XENLOG_INFO,
> > +"Nested virt and altp2m are not supported together\n");
> > +return -EINVAL;
> > +}
> > +
> > +if ( altp2m_mode && !hap )
> > +{
> > +dprintk(XENLOG_INFO, "altp2m is only supported with HAP\n");
> > +return -EINVAL;
> > +}
> 
> Should this last one perhaps be further extended to permit altp2m with EPT
> only?

Hm, yes, that would be more accurate as:

if ( altp2m_mode && (!hap || !hvm_altp2m_supported()) )

Would you be fine adjusting at commit, or would you prefer me to send
an updated version?

Thanks, Roger.



Re: [PATCH] x86/shadow: don't leave trace record field uninitialized

2024-05-22 Thread Roger Pau Monné
On Wed, May 22, 2024 at 12:17:30PM +0200, Jan Beulich wrote:
> The emulation_count field is set only conditionally right now. Convert
> all field setting to an initializer, thus guaranteeing that field to be
> set to 0 (default initialized) when GUEST_PAGING_LEVELS != 3.
> 
> While there also drop the "event" local variable, thus eliminating an
> instance of the being phased out u32 type.
> 
> Coverity ID: 1598430
> Fixes: 9a86ac1aa3d2 ("xentrace 5/7: Additional tracing for the shadow code")
> Signed-off-by: Jan Beulich 

Acked-by: Roger Pau Monné 

Thanks, Roger.



Re: [PATCH v15 3/5] vpci: add initial support for virtual PCI bus topology

2024-05-22 Thread Roger Pau Monné
On Fri, May 17, 2024 at 01:06:13PM -0400, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko 
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko 
> Signed-off-by: Volodymyr Babchuk 
> Signed-off-by: Stewart Hildebrand 
> Acked-by: Jan Beulich 
> ---
> In v15:
> - add Jan's A-b
> In v13:
> - s/depends on/select/ in Kconfig
> - check pdev->sbdf.fn instead of two booleans in add_virtual_device()
> - comment #endifs in sched.h
> - clarify comment about limits in vpci.h with seg/bus limit
> In v11:
> - Fixed code formatting
> - Removed bogus write_unlock() call
> - Fixed type for new_dev_number
> In v10:
> - Removed ASSERT(pcidevs_locked())
> - Removed redundant code (local sbdf variable, clearing sbdf during
> device removal, etc)
> - Added __maybe_unused attribute to "out:" label
> - Introduced HAS_VPCI_GUEST_SUPPORT Kconfig option, as this is the
>   first patch where it is used (previously was in "vpci: add hooks for
>   PCI device assign/de-assign")
> In v9:
> - Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
> In v8:
> - Added write lock in add_virtual_device
> Since v6:
> - re-work wrt new locking scheme
> - OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
> Since v5:
> - s/vpci_add_virtual_device/add_virtual_device and make it static
> - call add_virtual_device from vpci_assign_device and do not use
>   REGISTER_VPCI_INIT machinery
> - add pcidevs_locked ASSERT
> - use DECLARE_BITMAP for vpci_dev_assigned_map
> Since v4:
> - moved and re-worked guest sbdf initializers
> - s/set_bit/__set_bit
> - s/clear_bit/__clear_bit
> - minor comment fix s/Virtual/Guest/
> - added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
>   later for counting the number of MMIO handlers required for a guest
>   (Julien)
> Since v3:
>  - make use of VPCI_INIT
>  - moved all new code to vpci.c which belongs to it
>  - changed open-coded 31 to PCI_SLOT(~0)
>  - added comments and code to reject multifunction devices with
>functions other than 0
>  - updated comment about vpci_dev_next and made it unsigned int
>  - implement roll back in case of error while assigning/deassigning devices
>  - s/dom%pd/%pd
> Since v2:
>  - remove casts that are (a) malformed and (b) unnecessary
>  - add new line for better readability
>  - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
> functions are now completely gated with this config
>  - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/drivers/Kconfig |  4 +++
>  xen/drivers/vpci/vpci.c | 57 +
>  xen/include/xen/sched.h | 10 +++-
>  xen/include/xen/vpci.h  | 12 +
>  4 files changed, 82 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
> index db94393f47a6..20050e9bb8b3 100644
> --- a/xen/drivers/Kconfig
> +++ b/xen/drivers/Kconfig
> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>  config HAS_VPCI
>   bool
>  
> +config HAS_VPCI_GUEST_SUPPORT
> + bool
> + select HAS_VPCI
> +
>  endmenu
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 97e115dc5798..23722634d50b 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -40,6 +40,49 @@ extern vpci_register_init_t *const __start_vpci_array[];
>  extern vpci_register_init_t *const __end_vpci_array[];
>  #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)

This seems quite generic, IMO it would better named
`assign_{guest,virtual}_sbdf()` or similar, unless there are plans to
add more code here that's not strictly only about setting the guest
SBDF.

> +{
> +struct domain *d = pdev->domain;
> +unsigned int new_dev_number;
> +
> +if ( is_hardware_domain(d) )
> +return 0;
> +
> +ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

Shouldn't the assert be done before the is_hardware_domain() check, so
that we assert that all possible paths (even those from dom0) have
taken the correct lock?
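
I.e. (sketch):

    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

    if ( is_hardware_domain(d) )
        return 0;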

> +
> +/*
> + * Each PCI bus supports 32 devices/slots at max or up to 256 when
> + * there are multi-function ones which are not yet supported.
> + */
> +if ( pdev->sbdf.fn )
> +{
> +gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> + &pdev->sbdf);
> +return -EOPNOTSUPP;
> +}
> +new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> + VPCI_MAX_VIRT_DEV);
> +if ( new_dev_number == 

Re: [PATCH v15 2/5] vpci/header: emulate PCI_COMMAND register for guests

2024-05-22 Thread Roger Pau Monné
= vpci_add_register_mask(pdev->vpci,
> +is_hwdom ? vpci_hw_read16 : guest_cmd_read,
> +cmd_write, PCI_COMMAND, 2, header, 0, 0,
> +PCI_COMMAND_RSVDP_MASK |
> +(is_hwdom ? 0
> +  : PCI_COMMAND_IO |
> +PCI_COMMAND_PARITY |
> +PCI_COMMAND_WAIT |
> +PCI_COMMAND_SERR |
> +PCI_COMMAND_FAST_BACK),

We want to allow full access to the hw domain and only apply the
PCI_COMMAND_RSVDP_MASK when !is_hwdom in order to keep the current
behavior for dom0.

I don't think it makes a difference in practice, but we are very lax
in explicitly not applying any of such restrictions to dom0.
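
I.e. the mask argument quoted above would become something like (sketch):

    is_hwdom ? 0
             : PCI_COMMAND_RSVDP_MASK |
               PCI_COMMAND_IO |
               PCI_COMMAND_PARITY |
               PCI_COMMAND_WAIT |
               PCI_COMMAND_SERR |
               PCI_COMMAND_FAST_BACK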

With that fixed:

Reviewed-by: Roger Pau Monné 

Thanks, Roger.



Re: [PATCH v2 05/12] IOMMU: rename and re-type ats_enabled

2024-05-21 Thread Roger Pau Monné
On Tue, May 21, 2024 at 08:21:35AM +0200, Jan Beulich wrote:
> On 20.05.2024 12:29, Roger Pau Monné wrote:
> > On Wed, May 15, 2024 at 12:07:50PM +0200, Jan Beulich wrote:
> >> On 06.05.2024 15:53, Roger Pau Monné wrote:
> >>> On Mon, May 06, 2024 at 03:20:38PM +0200, Jan Beulich wrote:
> >>>> On 06.05.2024 14:42, Roger Pau Monné wrote:
> >>>>> On Thu, Feb 15, 2024 at 11:15:39AM +0100, Jan Beulich wrote:
> >>>>>> @@ -196,7 +196,7 @@ static int __must_check amd_iommu_setup_
> >>>>>>  dte->sys_mgt = MASK_EXTR(ivrs_dev->device_flags, 
> >>>>>> ACPI_IVHD_SYSTEM_MGMT);
> >>>>>>  
> >>>>>>  if ( use_ats(pdev, iommu, ivrs_dev) )
> >>>>>> -dte->i = ats_enabled;
> >>>>>> +dte->i = true;
> >>>>>
> >>>>> Might be easier to just use:
> >>>>>
> >>>>> dte->i = use_ats(pdev, iommu, ivrs_dev);
> >>>>
> >>>> I'm hesitant here, as in principle we might be overwriting a "true" by
> >>>> "false" then.
> >>>
> >>> Hm, but that would be fine, what's the point in enabling the IOMMU to
> >>> reply to ATS requests if ATS is not enabled on the device?
> >>>
> >>> IOW: overwriting a "true" with a "false" seem like the correct
> >>> behavior if it's based on the output of use_ats().
> >>
> >> I don't think so, unless there were flow guarantees excluding the 
> >> possibility
> >> of taking this path twice without intermediately disabling the device 
> >> again.
> >> Down from here the enabling of ATS is gated on use_ats(). Hence if, in an
> >> earlier invocation, we enabled ATS (and set dte->i), we wouldn't turn off 
> >> ATS
> >> below (there's only code to turn it on), yet with what you suggest we'd 
> >> clear
> >> dte->i.
> > 
> > Please bear with me, I think I'm confused, why would use_ats(), and if
> > that's the case, don't we want to update dte->i so that it matches the
> > ATS state?
> 
> I'm afraid I can't parse this. Maybe a result of incomplete editing? The
> topic is complex enough that I don't want to even try to guess what you
> may have meant to ask ...

Oh, indeed, sorry, the full sentences should have been:

Please bear with me, I think I'm confused, why would use_ats() return
different values for the same device?

And if that's the case, don't we want to update dte->i so that it
matches the ATS state signaled by use_ats()?

> > Otherwise we would fail to disable IOMMU device address translation
> > support if ATS was disabled?
> 
> I think the answer here is "no", but with the above I'm not really sure
> here, either.

Given the current logic in use_ats() AFAICT the return value of that
function should not change for a given device?

Thanks, Roger.



Re: [PATCH v3.5 3/4] tools/xen-cpuid: Use automatically generated feature names

2024-05-20 Thread Roger Pau Monné
On Mon, May 20, 2024 at 04:20:37PM +0100, Andrew Cooper wrote:
> On 20/05/2024 4:07 pm, Roger Pau Monné wrote:
> > On Mon, May 20, 2024 at 03:33:59PM +0100, Andrew Cooper wrote:
> >> From: Roger Pau Monné 
> >>
> >> Have gen-cpuid.py write out INIT_FEATURE_VAL_TO_NAME, derived from the same
> >> data source as INIT_FEATURE_NAME_TO_VAL, although both aliases of common_1d
> >> are needed.
> >>
> >> In xen-cpuid.c, sanity check at build time that leaf_info[] and
> >> feature_names[] are of sensible length.
> >>
> >> As dump_leaf() rendered missing names as numbers, always dump leaves even 
> >> if
> >> we don't have the leaf name.  This conversion was argumably missed in 
> >> commit
> >> 59afdb8a81d6 ("tools/misc: Tweak reserved bit handling for xen-cpuid").
> >>
> >> Signed-off-by: Roger Pau Monné 
> >> Signed-off-by: Andrew Cooper 
> > Reviewed-by: Roger Pau Monné 
> 
> Thanks.
> 
> >
> > Just one question below.
> >
> >> ---
> >> CC: Jan Beulich 
> >> CC: Roger Pau Monné 
> >>
> >> Differences in names are:
> >>
> >>  sysenter-> sep
> >>  tm  -> tm1
> >>  ds-cpl  -> dscpl
> >>  est -> eist
> >>  sse41   -> sse4-1
> >>  sse42   -> sse4-2
> >>  movebe  -> movbe
> >>  tsc-dl  -> tsc-deadline
> >>  rdrnd   -> rdrand
> >>  hyper   -> hypervisor
> >>  mmx+-> mmext
> >>  fxsr+   -> ffxsr
> >>  pg1g-> page1gb
> >>  3dnow+  -> 3dnowext
> >>  cmp -> cmp-legacy
> >>  cr8d-> cr8-legacy
> >>  lzcnt   -> abm
> >>  msse-> misalignsse
> >>  3dnowpf -> 3dnowprefetch
> >>  nodeid  -> nodeid-msr
> >>  dbx -> dbext
> >>  tsc-adj -> tsc-adjust
> >>  fdp-exn -> fdp-excp-only
> >>  deffp   -> no-fpu-sel
> >>  <24>-> bld
> >>  ppin-> amd-ppin
> >>  lfence+ -> lfence-dispatch
> >>  ppin-> intel-ppin
> >>  energy-ctrl -> energy-filtering
> >>
> >> Apparently BLD missed the update to xen-cpuid.c.  It appears to be the only
> >> one.  Several of the + names would be nice to keep as were, but doing so 
> >> isn't
> >> nice in gen-cpuid.  Any changes would alter the {dom0-}cpuid= cmdline 
> >> options,
> >> but we intentionally don't list them, so I'm not worried.
> >>
> >> Thoughts?
> >>
> >> v3:
> >>  * Rework somewhat.
> >>  * Insert aliases of common_1d.
> >>
> >> v4:
> >>  * Pad at the gen stage.  I don't like this, but I'm clearly outvoted on 
> >> the matter.
> >> ---
> >>  tools/misc/xen-cpuid.c | 16 
> >>  xen/tools/gen-cpuid.py | 29 +
> >>  2 files changed, 37 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
> >> index 6ee835b22949..51009683da1b 100644
> >> --- a/tools/misc/xen-cpuid.c
> >> +++ b/tools/misc/xen-cpuid.c
> >> @@ -11,6 +11,7 @@
> >>  #include 
> >>  
> >>  #include 
> >> +#include 
> >>  
> >>  static uint32_t nr_features;
> >>  
> >> @@ -291,6 +292,8 @@ static const struct {
> >>  
> >>  #define COL_ALIGN "24"
> >>  
> >> +static const char *const feature_names[] = INIT_FEATURE_VAL_TO_NAME;
> >> +
> >>  static const char *const fs_names[] = {
> >>  [XEN_SYSCTL_cpu_featureset_raw] = "Raw",
> >>  [XEN_SYSCTL_cpu_featureset_host]= "Host",
> >> @@ -304,12 +307,6 @@ static void dump_leaf(uint32_t leaf, const char 
> >> *const *strs)
> >>  {
> >>  unsigned i;
> >>  
> >> -if ( !strs )
> >> -{
> >> -printf(" ???");
> >> -return;
> >> -}
> >> -
> >>  for ( i = 0; i < 32; ++i )
> >>  if ( leaf & (1u << i) )
> >>  {
> >> @@ -327,6 +324,10 @@ static void decode_featureset(const uint32_t 
> >> *features,
> >>  {
> >>  unsigned int i;
> >>  
> >> +/* If this trips, you probabl

Re: [PATCH v3.5 3/4] tools/xen-cpuid: Use automatically generated feature names

2024-05-20 Thread Roger Pau Monné
On Mon, May 20, 2024 at 03:33:59PM +0100, Andrew Cooper wrote:
> From: Roger Pau Monné 
> 
> Have gen-cpuid.py write out INIT_FEATURE_VAL_TO_NAME, derived from the same
> data source as INIT_FEATURE_NAME_TO_VAL, although both aliases of common_1d
> are needed.
> 
> In xen-cpuid.c, sanity check at build time that leaf_info[] and
> feature_names[] are of sensible length.
> 
> As dump_leaf() rendered missing names as numbers, always dump leaves even if
> we don't have the leaf name.  This conversion was arguably missed in commit
> 59afdb8a81d6 ("tools/misc: Tweak reserved bit handling for xen-cpuid").
> 
> Signed-off-by: Roger Pau Monné 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Roger Pau Monné 

Just one question below.

> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> 
> Differences in names are:
> 
>  sysenter-> sep
>  tm  -> tm1
>  ds-cpl  -> dscpl
>  est -> eist
>  sse41   -> sse4-1
>  sse42   -> sse4-2
>  movebe  -> movbe
>  tsc-dl  -> tsc-deadline
>  rdrnd   -> rdrand
>  hyper   -> hypervisor
>  mmx+-> mmext
>  fxsr+   -> ffxsr
>  pg1g-> page1gb
>  3dnow+  -> 3dnowext
>  cmp -> cmp-legacy
>  cr8d-> cr8-legacy
>  lzcnt   -> abm
>  msse-> misalignsse
>  3dnowpf -> 3dnowprefetch
>  nodeid  -> nodeid-msr
>  dbx -> dbext
>  tsc-adj -> tsc-adjust
>  fdp-exn -> fdp-excp-only
>  deffp   -> no-fpu-sel
>  <24>-> bld
>  ppin-> amd-ppin
>  lfence+ -> lfence-dispatch
>  ppin-> intel-ppin
>  energy-ctrl -> energy-filtering
> 
> Apparently BLD missed the update to xen-cpuid.c.  It appears to be the only
> one.  Several of the + names would be nice to keep as were, but doing so isn't
> nice in gen-cpuid.  Any changes would alter the {dom0-}cpuid= cmdline options,
> but we intentionally don't list them, so I'm not worried.
> 
> Thoughts?
> 
> v3:
>  * Rework somewhat.
>  * Insert aliases of common_1d.
> 
> v4:
>  * Pad at the gen stage.  I don't like this, but I'm clearly outvoted on the 
> matter.
> ---
>  tools/misc/xen-cpuid.c | 16 
>  xen/tools/gen-cpuid.py | 29 +
>  2 files changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
> index 6ee835b22949..51009683da1b 100644
> --- a/tools/misc/xen-cpuid.c
> +++ b/tools/misc/xen-cpuid.c
> @@ -11,6 +11,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  static uint32_t nr_features;
>  
> @@ -291,6 +292,8 @@ static const struct {
>  
>  #define COL_ALIGN "24"
>  
> +static const char *const feature_names[] = INIT_FEATURE_VAL_TO_NAME;
> +
>  static const char *const fs_names[] = {
>  [XEN_SYSCTL_cpu_featureset_raw] = "Raw",
>  [XEN_SYSCTL_cpu_featureset_host]= "Host",
> @@ -304,12 +307,6 @@ static void dump_leaf(uint32_t leaf, const char *const 
> *strs)
>  {
>  unsigned i;
>  
> -if ( !strs )
> -{
> -printf(" ???");
> -return;
> -}
> -
>  for ( i = 0; i < 32; ++i )
>  if ( leaf & (1u << i) )
>  {
> @@ -327,6 +324,10 @@ static void decode_featureset(const uint32_t *features,
>  {
>  unsigned int i;
>  
> +/* If this trips, you probably need to extend leaf_info[] above. */
> +BUILD_BUG_ON(ARRAY_SIZE(leaf_info) != FEATURESET_NR_ENTRIES);
> +BUILD_BUG_ON(ARRAY_SIZE(feature_names) != FEATURESET_NR_ENTRIES * 32);
> +
>  printf("%-"COL_ALIGN"s", name);
>  for ( i = 0; i < length; ++i )
>  printf("%08x%c", features[i],
> @@ -338,8 +339,7 @@ static void decode_featureset(const uint32_t *features,
>  for ( i = 0; i < length && i < ARRAY_SIZE(leaf_info); ++i )
>  {
>  printf("  [%02u] %-"COL_ALIGN"s", i, leaf_info[i].name ?: 
> "");
> -if ( leaf_info[i].name )
> -dump_leaf(features[i], leaf_info[i].strs);
> +dump_leaf(features[i], &feature_names[i * 32]);
>  printf("\n");
>  }
>  }
> diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
> index 79d7f5c8e1c9..601eec608983 100755
> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -470,6 +470,35 @@ def write_results(state):
>  state.output.write(
>  """}
>  
> +""")
> +
> +state.output.write(
> +"""
>

Re: [PATCH v2 2/2] tools/xg: Clean up xend-style overrides for CPU policies

2024-05-20 Thread Roger Pau Monné
On Fri, May 17, 2024 at 05:08:35PM +0100, Alejandro Vallejo wrote:
> Factor out policy getters/setters from both (CPUID and MSR) policy override
> functions. Additionally, use host policy rather than featureset when
> preparing the cur policy, saving one hypercall and several lines of
> boilerplate.
> 
> No functional change intended.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * Cosmetic change to comment (// into /**/).
>   * Added missing null pointer check to MSR override function.
> ---
>  tools/libs/guest/xg_cpuid_x86.c | 445 ++--
>  1 file changed, 130 insertions(+), 315 deletions(-)
> 
> diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
> index 4f4b86b59470..74bca0e65b69 100644
> --- a/tools/libs/guest/xg_cpuid_x86.c
> +++ b/tools/libs/guest/xg_cpuid_x86.c
> @@ -36,6 +36,34 @@ enum {
>  #define bitmaskof(idx)  (1u << ((idx) & 31))
>  #define featureword_of(idx) ((idx) >> 5)
>  
> +static int deserialize_policy(xc_interface *xch, xc_cpu_policy_t *policy)
> +{
> +uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
> +int rc;
> +
> +rc = x86_cpuid_copy_from_buffer(&policy->policy, policy->leaves,
> +policy->nr_leaves, &err_leaf, 
> &err_subleaf);
> +if ( rc )
> +{
> +if ( err_leaf != -1 )
> +ERROR("Failed to deserialise CPUID (err leaf %#x, subleaf %#x) 
> (%d = %s)",
> +  err_leaf, err_subleaf, -rc, strerror(-rc));
> +return rc;
> +}
> +
> +rc = x86_msr_copy_from_buffer(&policy->policy, policy->msrs,
> +  policy->nr_msrs, &err_msr);
> +if ( rc )
> +{
> +if ( err_msr != -1 )
> +ERROR("Failed to deserialise MSR (err MSR %#x) (%d = %s)",
> +  err_msr, -rc, strerror(-rc));
> +return rc;
> +}
> +
> +return 0;
> +}
> +
>  int xc_get_cpu_levelling_caps(xc_interface *xch, uint32_t *caps)
>  {
>  struct xen_sysctl sysctl = {};
> @@ -260,102 +288,34 @@ static int compare_leaves(const void *l, const void *r)
>  return 0;
>  }
>  
> -static xen_cpuid_leaf_t *find_leaf(
> -xen_cpuid_leaf_t *leaves, unsigned int nr_leaves,
> -const struct xc_xend_cpuid *xend)
> +static xen_cpuid_leaf_t *find_leaf(xc_cpu_policy_t *p,
> +   const struct xc_xend_cpuid *xend)
>  {
>  const xen_cpuid_leaf_t key = { xend->leaf, xend->subleaf };
>  
> -return bsearch(&key, leaves, nr_leaves, sizeof(*leaves), compare_leaves);
> +return bsearch(&key, p->leaves, ARRAY_SIZE(p->leaves),
> +   sizeof(*p->leaves), compare_leaves);
>  }
>  
> -static int xc_cpuid_xend_policy(
> -xc_interface *xch, uint32_t domid, const struct xc_xend_cpuid *xend)
> +static int xc_cpuid_xend_policy(xc_interface *xch, uint32_t domid,
> +const struct xc_xend_cpuid *xend,
> +xc_cpu_policy_t *host,
> +xc_cpu_policy_t *def,
> +xc_cpu_policy_t *cur)
>  {
> -int rc;
> -bool hvm;
> -xc_domaininfo_t di;
> -unsigned int nr_leaves, nr_msrs;
> -uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
> -/*
> - * Three full policies.  The host, default for the domain type,
> - * and domain current.
> - */
> -xen_cpuid_leaf_t *host = NULL, *def = NULL, *cur = NULL;
> -unsigned int nr_host, nr_def, nr_cur;
> -
> -if ( (rc = xc_domain_getinfo_single(xch, domid, &di)) < 0 )
> -{
> -PERROR("Failed to obtain d%d info", domid);
> -rc = -errno;
> -goto fail;
> -}
> -hvm = di.flags & XEN_DOMINF_hvm_guest;
> -
> -rc = xc_cpu_policy_get_size(xch, &nr_leaves, &nr_msrs);
> -if ( rc )
> -{
> -PERROR("Failed to obtain policy info size");
> -rc = -errno;
> -goto fail;
> -}
> -
> -rc = -ENOMEM;
> -if ( (host = calloc(nr_leaves, sizeof(*host))) == NULL ||
> - (def  = calloc(nr_leaves, sizeof(*def)))  == NULL ||
> - (cur  = calloc(nr_leaves, sizeof(*cur)))  == NULL )
> -{
> -ERROR("Unable to allocate memory for %u CPUID leaves", nr_leaves);
> -goto fail;
> -}
> -
> -/* Get the domain's current policy. */
> -nr_msrs = 0;
> -nr_cur = nr_leaves;
> -rc = get_domain_cpu_policy(xch, domid, &nr_cur, cur, &nr_msrs, NULL);
> -if ( rc )
> -{
> -PERROR("Failed to obtain d%d current policy", domid);
> -rc = -errno;
> -goto fail;
> -}
> -
> -/* Get the domain type's default policy. */
> -nr_msrs = 0;
> -nr_def = nr_leaves;
> -rc = get_system_cpu_policy(xch, hvm ? XEN_SYSCTL_cpu_policy_hvm_default
> -: XEN_SYSCTL_cpu_policy_pv_default,
> -   &nr_def, def, &nr_msrs, NULL);
> -if ( rc )
> -{
> -PERROR("Failed to obtain %s def policy", hvm ? "hvm" : "pv");
> -rc = 

Re: [PATCH v2 1/2] tools/xg: Streamline cpu policy serialise/deserialise calls

2024-05-20 Thread Roger Pau Monné
On Fri, May 17, 2024 at 05:08:34PM +0100, Alejandro Vallejo wrote:
> The idea is to use xc_cpu_policy_t as a single object containing both the
> serialised and deserialised forms of the policy. Note that we need lengths
> for the arrays, as the serialised policies may be shorter than the array
> capacities.
> 
> * Add the serialised lengths to the struct so we can distinguish
>   between length and capacity of the serialisation buffers.
> * Remove explicit buffer+lengths in serialise/deserialise calls
>   and use the internal buffer inside xc_cpu_policy_t instead.
> * Refactor everything to use the new serialisation functions.
> * Remove redundant serialization calls and avoid allocating dynamic
>   memory aside from the policy objects in xen-cpuid. Also minor cleanup
>   in the policy print call sites.
> 
> No functional change intended.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
> v2:
>   * Removed v1/patch1.
>   * Added the accessors suggested in feedback.
> ---
>  tools/include/xenguest.h|  8 ++-
>  tools/libs/guest/xg_cpuid_x86.c | 98 -
>  tools/libs/guest/xg_private.h   |  2 +
>  tools/libs/guest/xg_sr_common_x86.c | 54 ++--
>  tools/misc/xen-cpuid.c  | 43 -
>  5 files changed, 104 insertions(+), 101 deletions(-)
> 
> diff --git a/tools/include/xenguest.h b/tools/include/xenguest.h
> index e01f494b772a..563811cd8dde 100644
> --- a/tools/include/xenguest.h
> +++ b/tools/include/xenguest.h
> @@ -799,14 +799,16 @@ int xc_cpu_policy_set_domain(xc_interface *xch, 
> uint32_t domid,
>   xc_cpu_policy_t *policy);
>  
>  /* Manipulate a policy via architectural representations. */
> -int xc_cpu_policy_serialise(xc_interface *xch, const xc_cpu_policy_t *policy,
> -xen_cpuid_leaf_t *leaves, uint32_t *nr_leaves,
> -xen_msr_entry_t *msrs, uint32_t *nr_msrs);
> +int xc_cpu_policy_serialise(xc_interface *xch, xc_cpu_policy_t *policy);
>  int xc_cpu_policy_update_cpuid(xc_interface *xch, xc_cpu_policy_t *policy,
> const xen_cpuid_leaf_t *leaves,
> uint32_t nr);
>  int xc_cpu_policy_update_msrs(xc_interface *xch, xc_cpu_policy_t *policy,
>const xen_msr_entry_t *msrs, uint32_t nr);
> +int xc_cpu_policy_get_leaves(xc_interface *xch, const xc_cpu_policy_t 
> *policy,
> + const xen_cpuid_leaf_t **leaves, uint32_t *nr);
> +int xc_cpu_policy_get_msrs(xc_interface *xch, const xc_cpu_policy_t *policy,
> +   const xen_msr_entry_t **msrs, uint32_t *nr);
>  
>  /* Compatibility calculations. */
>  bool xc_cpu_policy_is_compatible(xc_interface *xch, xc_cpu_policy_t *host,
> diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
> index 4453178100ad..4f4b86b59470 100644
> --- a/tools/libs/guest/xg_cpuid_x86.c
> +++ b/tools/libs/guest/xg_cpuid_x86.c
> @@ -834,14 +834,13 @@ void xc_cpu_policy_destroy(xc_cpu_policy_t *policy)
>  }
>  }
>  
> -static int deserialize_policy(xc_interface *xch, xc_cpu_policy_t *policy,
> -  unsigned int nr_leaves, unsigned int 
> nr_entries)
> +static int deserialize_policy(xc_interface *xch, xc_cpu_policy_t *policy)
>  {
>  uint32_t err_leaf = -1, err_subleaf = -1, err_msr = -1;
>  int rc;
>  
>  rc = x86_cpuid_copy_from_buffer(&policy->policy, policy->leaves,
> -nr_leaves, &err_leaf, &err_subleaf);
> +policy->nr_leaves, &err_leaf, 
> &err_subleaf);
>  if ( rc )
>  {
>  if ( err_leaf != -1 )
> @@ -851,7 +850,7 @@ static int deserialize_policy(xc_interface *xch, 
> xc_cpu_policy_t *policy,
>  }
>  
>  rc = x86_msr_copy_from_buffer(&policy->policy, policy->msrs,
> -  nr_entries, &err_msr);
> +  policy->nr_msrs, &err_msr);
>  if ( rc )
>  {
>  if ( err_msr != -1 )
> @@ -878,7 +877,10 @@ int xc_cpu_policy_get_system(xc_interface *xch, unsigned 
> int policy_idx,
>  return rc;
>  }
>  
> -rc = deserialize_policy(xch, policy, nr_leaves, nr_msrs);
> +policy->nr_leaves = nr_leaves;
> +policy->nr_msrs = nr_msrs;
> +
> +rc = deserialize_policy(xch, policy);
>  if ( rc )
>  {
>  errno = -rc;
> @@ -903,7 +905,10 @@ int xc_cpu_policy_get_domain(xc_interface *xch, uint32_t 
> domid,
>  return rc;
>  }
>  
> -rc = deserialize_policy(xch, policy, nr_leaves, nr_msrs);
> +policy->nr_leaves = nr_leaves;
> +policy->nr_msrs = nr_msrs;
> +
> +rc = deserialize_policy(xch, policy);
>  if ( rc )
>  {
>  errno = -rc;
> @@ -917,17 +922,14 @@ int xc_cpu_policy_set_domain(xc_interface *xch, 
> uint32_t domid,
>   xc_cpu_policy_t *policy)
>  {
>  uint32_t err_leaf = -1, err_subleaf = -1, err_msr = 

Re: [PATCH v2 0/3] Clean the policy manipulation path in domain creation

2024-05-20 Thread Roger Pau Monné
On Fri, May 17, 2024 at 05:08:33PM +0100, Alejandro Vallejo wrote:
> v2:
>   * Removed xc_cpu_policy from xenguest.h
>   * Added accessors for xc_cpu_policy so the serialised form can be extracted.
>   * Modified xen-cpuid to use accessors.
> 
>  Original cover letter 
> 
> In the context of creating a domain, we currently issue a lot of hypercalls
> redundantly while populating its CPU policy; likely a side effect of
> organic growth more than anything else.
> 
> However, the worst part is not the overhead (this is a glacially cold
> path), but the insane amounts of boilerplate that make it really hard to
> pick apart what's going on. One major contributor to this situation is the
> fact that what's effectively "setup" and "teardown" phases in policy
> manipulation are not factored out from the functions that perform said
> manipulations, leading to the same getters and setter being invoked many
> times, when once each would do.
> 
> Another big contributor is the code being unaware of when a policy is
> serialised and when it's not.
> 
> This patch attempts to alleviate this situation, yielding over 200 LoC
> reduction.
> 
> Patch 1: Mechanical change. Makes xc_cpu_policy_t public so it's usable
>  from clients of libxc/libxg.
> Patch 2: Changes the (de)serialization wrappers in xenguest so they always
>  serialise to/from the internal buffers of xc_cpu_policy_t. The
>  struct is suitably expanded to hold extra information required.
> Patch 3: Performs the refactor of the policy manipulation code so that it
>  follows a strict: PULL_POLICIES, MUTATE_POLICY (n times), 
> PUSH_POLICY.

I think patch 3 is no longer part of the set?  I don't see anything
in the review of v1 that suggests patch 3 was not going to be part of
the next submission?

Thanks, Roger.



Re: [PATCH v2 06/12] VT-d: respect ACPI SATC's ATC_REQUIRED flag

2024-05-20 Thread Roger Pau Monné
On Wed, May 15, 2024 at 12:42:40PM +0200, Jan Beulich wrote:
> On 06.05.2024 15:38, Roger Pau Monné wrote:
> > On Thu, Feb 15, 2024 at 11:16:11AM +0100, Jan Beulich wrote:
> >> When the flag is set, permit Dom0 to control the device (no worse than
> >> what we had before and in line with other "best effort" behavior we use
> >> when it comes to Dom0),
> > 
> > I think we should somehow be able to signal dom0 that this device
> > might not operate as expected, otherwise dom0 might use it and the
> > device could silently malfunction due to ATS not being enabled.
> 
> Whatever signaling we invented, no Dom0 would be required to respect it,
> and for (perhaps quite) some time no Dom0 kernel would even exist to query
> that property.
> 
> > Otherwise we should just hide the device from dom0.
> 
> This would feel wrong to me, almost like a regression from what we had
> before.

Exposing a device to dom0 that won't be functional doesn't seem like a
very wise choice from Xen TBH.

> > I assume setting the IOMMU context entry to passthrough mode would
> > also be fine for such devices that require ATS?
> 
> I'm afraid I'm lacking the connection of the question to what is being
> done here. Can you perhaps provide some more context? To provide some
> context from my side: Using pass-through mode would be excluded when Dom0
> is PVH. Hence why I'm not getting why we would want to even just consider
> doing so.
> 
> Yet, looking at the spec, in pass-through mode translation requests are
> treated as UR. So maybe your question was towards there needing to be
> handling (whichever way) for the case where pass-through mode was
> requested for PV Dom0? The only half-way sensible thing to do in that case
> that I can think of right now would be to ignore that command line option,

Hm, maybe I'm confused, but if the IOMMU device context entry is set
in pass-through mode ATS won't be enabled and hence no translation
requests would be send from the device?

IOW, devices listed in the SATC can only mandate ATS enabled when the
IOMMU is enforcing translation.   IF the IOMMU is not enabled or if
the device is in passthrough mode then the requirement for having ATS
enabled no longer applies.

Thanks, Roger.



Re: [PATCH v2 05/12] IOMMU: rename and re-type ats_enabled

2024-05-20 Thread Roger Pau Monné
On Wed, May 15, 2024 at 12:07:50PM +0200, Jan Beulich wrote:
> On 06.05.2024 15:53, Roger Pau Monné wrote:
> > On Mon, May 06, 2024 at 03:20:38PM +0200, Jan Beulich wrote:
> >> On 06.05.2024 14:42, Roger Pau Monné wrote:
> >>> On Thu, Feb 15, 2024 at 11:15:39AM +0100, Jan Beulich wrote:
> >>>> Make the variable a tristate, with (as done elsewhere) a negative value
> >>>> meaning "default". Since all use sites need looking at, also rename it
> >>>> to match our usual "opt_*" pattern. While touching it, also move it to
> >>>> .data.ro_after_init.
> >>>
> >>> I guess I need to look at further patches, as given the feedback on
> >>> the past version I think we agreed we want to set ATS unconditionally
> >>> disabled by default, and hence I'm not sure I see the point of the
> >>> tri-state if enabling ATS will require an explicit opt-in on the
> >>> command line (ats=1).
> >>
> >> With the present wording in the VT-d spec (which we've now had vague
> >> indication that it may not be meant that way) there needs to be
> >> tristate behavior:
> >> - With "ats=0" ATS won't be used.
> >> - With "ats=1" ATS will be used for all ATS-capable devices.
> >> - Without either option ATS will be used for devices where firmware
> >>   mandates its use.
> > 
> > I'm afraid I don't agree to this behavior.  Regardless of what the
> > firmware requests ATS must only be enabled on user-request (iow: when
> > the ats=1 command line option is passed).  Otherwise ATS must remain
> > disabled for all devices.  It's not fine for firmware to trigger the
> > enabling of a feature that's not supported on Xen.
> 
> Well. On one hand I can see your point. Otoh with the spec still being the
> way it is, on systems mandating ATS use for at least one device we'd then
> simply need to deem Xen unsupported there altogether. The goal of the
> series, though, is to make things work as mandated by the spec on such
> systems, which to me implies we need to consider use of ATS supported in
> such cases (and only for those specific devices, i.e. still without
> considering use of "ats" on the command line supported).

I'm in general hesitant of ATS because I think it undermines the
security of PCI passthrough.  However this would still be acceptable
for dom0 because it's (usually?) part of the trusted base of a Xen
host.

If we want to make use of ATS for devices assigned to dom0 we should
clarify the warning in xen-command-line.pandoc.

We should also consider that dom0 usually does a lot of p2m
manipulations (by having to map grants and foreign pages).  Those will
result in p2m flushes that will lead to IOMMU flushes, and when using
ATS that will require device TLB flushes.  I wonder how much of an
overhead this will add to normal dom0 operations (plus the added risk
of those device TLB flushes stalling the IOMMU queue).

I would be much more comfortable with making the ats= command line
option a tri-state:

ats={0,1,mandatory}

Where the 'mandatory' option or equivalent enables ATS only for
devices that mandate it.  However I still think the default option
should be disabled for all devices.  If devices that require ATS are
found on the system I would use `warning_add()` to notify the user
of the need to consider adding ats=mandatory to the command line.
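
A rough sketch of how the tri-state option could be wired up (the names
and the use of parse_bool()/custom_param() are just illustrative, not a
tested implementation):

    static int8_t __initdata opt_ats; /* 0 = off, 1 = on, -1 = mandatory only */

    static int __init cf_check parse_ats(const char *s)
    {
        if ( !strcmp(s, "mandatory") )
            opt_ats = -1;
        else
        {
            int val = parse_bool(s, NULL);

            if ( val < 0 )
                return -EINVAL;

            opt_ats = val;
        }

        return 0;
    }
    custom_param("ats", parse_ats);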

> If and when the spec was changed to clarify the flag is a performance hint,
> not a functional requirement, then we could do as you suggest. At which
> point, as mentioned before, opt_ats may be possible to become a plain
> boolean variable.

It's a complex situation, and I'm kind of surprised by the
introduction of this mandatory ATS requirement by Intel in a
non-backwards compatible way (as the specification claims the device
won't be functional without ATS enabled if required).

> >>>> @@ -196,7 +196,7 @@ static int __must_check amd_iommu_setup_
> >>>>  dte->sys_mgt = MASK_EXTR(ivrs_dev->device_flags, 
> >>>> ACPI_IVHD_SYSTEM_MGMT);
> >>>>  
> >>>>  if ( use_ats(pdev, iommu, ivrs_dev) )
> >>>> -dte->i = ats_enabled;
> >>>> +dte->i = true;
> >>>
> >>> Might be easier to just use:
> >>>
> >>> dte->i = use_ats(pdev, iommu, ivrs_dev);
> >>
> >> I'm hesitant here, as in principle we might be overwriting a "true" by
> >> "false" then.
> > 
> > Hm, but that would be fine, what's the point in enabling the IOMMU to
> > reply to ATS requests if ATS is not enabled on the device?

Re: [PATCH for-4.19? v4 1/6] x86/p2m: Add braces for better code clarity

2024-05-20 Thread Roger Pau Monné
On Sat, May 18, 2024 at 11:02:12AM +, Petr Beneš wrote:
> From: Petr Beneš 
> 
> No functional change.
> 
> Signed-off-by: Petr Beneš 
> Reviewed-by: Stefano Stabellini 

TBH I'm fine without the braces, but if lack of them can cause
confusion:

Acked-by: Roger Pau Monné 

CODING_STYLE states that braces can be omitted for blocks with single
statements, I guess we should clarify whether multi-line statements
are accepted, as the example contains a single-line statement.

Thanks, Roger.



Re: [PATCH for-4.19? v4 4/6] x86: Make the maximum number of altp2m views configurable

2024-05-20 Thread Roger Pau Monné
On Sat, May 18, 2024 at 11:02:15AM +, Petr Beneš wrote:
> From: Petr Beneš 
> 
> This commit introduces the ability to configure the maximum number of altp2m
> views for the domain during its creation. Previously, the limits were 
> hardcoded
> to a maximum of 10. This change allows for greater flexibility in environments
> that require more or fewer altp2m views.
> 
> The maximum configurable limit for max_altp2m on x86 is now set to MAX_EPTP
> (512). This cap is linked to the architectural limit of the EPTP-switching
> VMFUNC, which supports up to 512 entries. Despite there being no inherent need
> for limiting max_altp2m in scenarios not utilizing VMFUNC, decoupling these
> components would necessitate substantial code changes.
> 
> Signed-off-by: Petr Beneš 
> ---
>  xen/arch/x86/domain.c | 12 
>  xen/arch/x86/hvm/hvm.c|  8 ++-
>  xen/arch/x86/hvm/vmx/vmx.c|  2 +-
>  xen/arch/x86/include/asm/domain.h |  7 +--
>  xen/arch/x86/include/asm/p2m.h|  6 +-
>  xen/arch/x86/mm/altp2m.c  | 91 +++
>  xen/arch/x86/mm/hap/hap.c |  6 +-
>  xen/arch/x86/mm/mem_access.c  | 24 
>  xen/arch/x86/mm/mem_sharing.c |  2 +-
>  xen/arch/x86/mm/p2m-ept.c | 12 ++--
>  xen/arch/x86/mm/p2m.c |  8 +--
>  xen/common/domain.c   |  1 +
>  xen/include/public/domctl.h   |  5 +-
>  xen/include/xen/sched.h   |  2 +
>  14 files changed, 116 insertions(+), 70 deletions(-)
> 
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 00a3aaa576..3bd18cb2d0 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -685,6 +685,18 @@ int arch_sanitise_domain_config(struct 
> xen_domctl_createdomain *config)
>  return -EINVAL;
>  }

You also need to adjust arch_sanitise_domain_config() in ARM to return
an error if nr_altp2m is set, as there's no support for altp2m on ARM
yet.
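
Something along these lines in the Arm variant (sketch, assuming the
field keeps the nr_altp2m name used in this patch):

    if ( config->nr_altp2m )
    {
        dprintk(XENLOG_INFO, "altp2m is not supported on Arm\n");
        return -EINVAL;
    }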

> 
> +if ( config->nr_altp2m && !hvm_altp2m_supported() )
> +{
> +dprintk(XENLOG_INFO, "altp2m requested but not available\n");
> +return -EINVAL;
> +}
> +
> +if ( config->nr_altp2m > MAX_EPTP )
> +{
> +dprintk(XENLOG_INFO, "nr_altp2m must be <= %lu\n", MAX_EPTP);
> +return -EINVAL;
> +}
> +
>  if ( config->vmtrace_size )
>  {
>  unsigned int size = config->vmtrace_size;
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 9594e0a5c5..77e4016bdb 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4639,6 +4639,12 @@ static int do_altp2m_op(
>  goto out;
>  }
> 
> +if ( d->nr_altp2m == 0 )

I would merge this with the previous check, which also returns
-EINVAL.

> +{
> +rc = -EINVAL;
> +goto out;
> +}
> +
>  if ( (rc = xsm_hvm_altp2mhvm_op(XSM_OTHER, d, mode, a.cmd)) )
>  goto out;
> 
> @@ -5228,7 +5234,7 @@ void hvm_fast_singlestep(struct vcpu *v, uint16_t 
> p2midx)
>  if ( !hvm_is_singlestep_supported() )
>  return;
> 
> -if ( p2midx >= MAX_ALTP2M )
> +if ( p2midx >= v->domain->nr_altp2m )
>  return;
> 
>  v->arch.hvm.single_step = true;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 5f67a48592..76ee09b701 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -4888,7 +4888,7 @@ bool asmlinkage vmx_vmenter_helper(const struct 
> cpu_user_regs *regs)
>  {
>  unsigned int i;
> 
> -for ( i = 0; i < MAX_ALTP2M; ++i )
> +for ( i = 0; i < currd->nr_altp2m; ++i )
>  {
>  if ( currd->arch.altp2m_eptp[i] == mfn_x(INVALID_MFN) )
>  continue;
> diff --git a/xen/arch/x86/include/asm/domain.h 
> b/xen/arch/x86/include/asm/domain.h
> index f5daeb182b..3935328781 100644
> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -258,11 +258,10 @@ struct paging_vcpu {
>  struct shadow_vcpu shadow;
>  };
> 
> -#define MAX_NESTEDP2M 10
> -
> -#define MAX_ALTP2M  10 /* arbitrary */
>  #define INVALID_ALTP2M  0xffff
>  #define MAX_EPTP(PAGE_SIZE / sizeof(uint64_t))
> +#define MAX_NESTEDP2M   10
> +
>  struct p2m_domain;
>  struct time_scale {
>  int shift;
> @@ -353,7 +352,7 @@ struct arch_domain
> 
>  /* altp2m: allow multiple copies of host p2m */
>  bool altp2m_active;
> -struct p2m_domain *altp2m_p2m[MAX_ALTP2M];
> +struct p2m_domain **altp2m_p2m;
>  mm_lock_t altp2m_list_lock;
>  uint64_t *altp2m_eptp;
>  uint64_t *altp2m_visible_eptp;
> diff --git a/xen/arch/x86/include/asm/p2m.h b/xen/arch/x86/include/asm/p2m.h
> index 111badf89a..e66c081149 100644
> --- a/xen/arch/x86/include/asm/p2m.h
> +++ b/xen/arch/x86/include/asm/p2m.h
> @@ -881,7 +881,7 @@ static inline struct p2m_domain *p2m_get_altp2m(struct 
> vcpu *v)
>  if ( index == INVALID_ALTP2M )
>  return NULL;
> 
> -

Re: [PATCH for-4.19?] xen/x86: pretty print interrupt CPU affinity masks

2024-05-17 Thread Roger Pau Monné
On Thu, May 16, 2024 at 06:13:29PM +0100, Andrew Cooper wrote:
> On 15/05/2024 4:29 pm, Roger Pau Monne wrote:
> > Print the CPU affinity masks as numeric ranges instead of plain hexadecimal
> > bitfields.
> >
> > Signed-off-by: Roger Pau Monné 
> > ---
> >  xen/arch/x86/irq.c | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> > index 80ba8d9fe912..3b951d81bd6d 100644
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -1934,10 +1934,10 @@ void do_IRQ(struct cpu_user_regs *regs)
> >  if ( ~irq < nr_irqs && irq_desc_initialized(desc) )
> >  {
> >  spin_lock(&desc->lock);
> > -printk("IRQ%d a=%04lx[%04lx,%04lx] v=%02x[%02x] t=%s 
> > s=%08x\n",
> > -   ~irq, *cpumask_bits(desc->affinity),
> > -   *cpumask_bits(desc->arch.cpu_mask),
> > -   *cpumask_bits(desc->arch.old_cpu_mask),
> > +printk("IRQ%d a={%*pbl}[{%*pbl},{%*pbl}] v=%02x[%02x] 
> > t=%s s=%08x\n",
> 
> Looking at this more closely, there's still some information obfuscation
> going on.
> 
> How about "... a={} o={} n={} v=..."
> 
> so affinity, old and new masks are all stated explicitly, instead of
> having to remember what the square brackets mean, and in particular that
> the masks are backwards?
> 
> Happy to adjust on commit.

Sure, I guess I got used to it and didn't think of adjusting the
format.  The only risk is anyone having an automated parser to consume
that information, but I think it's unlikely.

Thanks, Roger.



Re: [PATCH for-4.19] xen/x86: limit interrupt movement done by fixup_irqs()

2024-05-16 Thread Roger Pau Monné
On Thu, May 16, 2024 at 06:04:22PM +0200, Jan Beulich wrote:
> On 16.05.2024 17:56, Roger Pau Monné wrote:
> > On Thu, May 16, 2024 at 05:00:54PM +0200, Jan Beulich wrote:
> >> On 16.05.2024 15:22, Roger Pau Monne wrote:
> >>> @@ -2576,7 +2576,12 @@ void fixup_irqs(const cpumask_t *mask, bool 
> >>> verbose)
> >>>  release_old_vec(desc);
> >>>  }
> >>>  
> >>> -if ( !desc->action || cpumask_subset(desc->affinity, mask) )
> >>> +/*
> >>> + * Avoid shuffling the interrupt around if it's assigned to a 
> >>> CPU set
> >>> + * that's all covered by the requested affinity mask.
> >>> + */
> >>> +cpumask_and(affinity, desc->arch.cpu_mask, &cpu_online_map);
> >>> +if ( !desc->action || cpumask_subset(affinity, mask) )
> >>>  {
> >>>  spin_unlock(&desc->lock);
> >>>  continue;
> >>[...]
> >> In
> >> which case cpumask_subset() is going to always return true with your
> >> change in place, if I'm not mistaken. That seems to make your change
> >> questionable. Yet with that I guess I'm overlooking something.)
> > 
> > I might be wrong, but I think you are missing that the to be offlined
> > CPU has been removed from cpu_online_map by the time it gets passed
> > to fixup_irqs().
> 
> Just on this part (I'll need to take more time to reply to other parts):
> No, I've specifically paid attention to that fact. Yet for this particular
> observation of mine is doesn't matter. If mask == _online_map, then
> no matter what is in cpu_online_map
> 
> cpumask_and(affinity, desc->arch.cpu_mask, &cpu_online_map);
> 
> will mask things down to a subset of cpu_online_map, and hence
> 
> if ( !desc->action || cpumask_subset(affinity, mask) )
> 
> (effectively being
> 
> if ( !desc->action || cpumask_subset(affinity, &cpu_online_map) )
> 
> ) is nothing else than
> 
> if ( !desc->action || true )
> 
> . Yet that doesn't feel quite right.

Oh, I get it now.  Ideally we would use cpu_online_map with the to be
removed CPU set, but that's complicated in this context.

For the purposes here we might as well avoid the AND of
->arch.cpu_mask with cpu_online_map and just check:

if ( !desc->action || cpumask_subset(desc->arch.cpu_mask, mask) )

As even if ->arch.cpu_mask has non-online CPUs set aside from the to
be offlined CPU, it would just mean that we might be shuffling more
than strictly necessary.  Note this will only be an issue with cluster
mode, physical mode must always have a single online CPU set in
->arch.cpu_mask.

Thanks, Roger.



Re: [PATCH for-4.19] xen/x86: limit interrupt movement done by fixup_irqs()

2024-05-16 Thread Roger Pau Monné
On Thu, May 16, 2024 at 05:00:54PM +0200, Jan Beulich wrote:
> On 16.05.2024 15:22, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -2527,7 +2527,7 @@ static int __init cf_check setup_dump_irqs(void)
> >  }
> >  __initcall(setup_dump_irqs);
> >  
> > -/* Reset irq affinities to match the given CPU mask. */
> > +/* Evacuate interrupts assigned to CPUs not present in the input CPU mask. 
> > */
> >  void fixup_irqs(const cpumask_t *mask, bool verbose)
> >  {
> 
> Evacuating is one purpose. Updating affinity, if need be, is another. I've
> been wondering more than once though whether it is actually correct /
> necessary for ->affinity to be updated by the function. As it stands you
> don't remove the respective code, though.

Yeah, I didn't want to get into updating ->affinity in this patch, so
decided to leave that as-is.

Note however that if we shuffle the interrupt around we should update
->affinity, so that the new destination is part of ->affinity?

Otherwise we could end up with the interrupt assigned to CPU(s) that
are not part of the ->affinity mask.  Maybe that's OK, TBH I'm not
sure I understand the purpose of the ->affinity mask, hence why I've
decided to leave it alone in this patch.

> 
> > @@ -2576,7 +2576,12 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
> >  release_old_vec(desc);
> >  }
> >  
> > -if ( !desc->action || cpumask_subset(desc->affinity, mask) )
> > +/*
> > + * Avoid shuffling the interrupt around if it's assigned to a CPU 
> > set
> > + * that's all covered by the requested affinity mask.
> > + */
> > +cpumask_and(affinity, desc->arch.cpu_mask, &cpu_online_map);
> > +if ( !desc->action || cpumask_subset(affinity, mask) )
> >  {
> >  spin_unlock(&desc->lock);
> >  continue;
> 
> First my understanding of how the two CPU sets are used: ->affinity is
> merely a representation of where the IRQ is permitted to be handled.
> ->arch.cpu_mask describes all CPUs where the assigned vector is valid
> (and would thus need cleaning up when a new vector is assigned). Neither
> of the two needs to be a strict subset of the other.

Oh, so it's allowed to have the interrupt target a CPU
(->arch.cpu_mask) that's not set in the affinity mask?

> 
> (It's not really clear whether ->arch.cpu_mask is [supposed to be] a
> subset of cpu_online_map.)

Not according to the description in arch_irq_desc:

/*
 * Except for high priority interrupts @cpu_mask may have bits set for
 * offline CPUs.  Consumers need to be careful to mask this down to
 * online ones as necessary.  There is supposed to always be a non-
 * empty intersection with cpu_online_map.
 */

So, according to the comment, ->arch.cpu_mask is not necessarily a
subset of cpu_online_map.

Note this is only possible when using logical destination mode, so
removing that would turn the destination field into an unsigned int
that would point to a single CPU that must be present in
cpu_online_map.
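
For reference, the pattern the comment asks consumers to follow is what
the hunk above already does with the scratch mask (sketch; the assertion
is only there to illustrate the non-empty-intersection guarantee):

cpumask_and(affinity, desc->arch.cpu_mask, &cpu_online_map);
/* The comment guarantees this intersection is never empty. */
ASSERT(!cpumask_empty(affinity));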

> If that understanding of mine is correct, going from just ->arch.cpu_mask
> doesn't look quite right to me, as that may include CPUs not in ->affinity.
> As in: Things could be further limited, by also ANDing in ->affinity.

Hm, my expectation would be that ->arch.cpu_mask is a subset of
->affinity, but even if it's not, what we care about in fixup_irqs() is
which CPUs the interrupt targets, as we need to move the interrupt if
the target set is not contained in the input mask.  I don't think
->affinity should be taken into account for that decision; it should be
based exclusively on which CPU(s) the interrupt targets
(->arch.cpu_mask).

> At the same time your(?) and my variant suffer from cpumask_subset()
> possibly returning true despite the CPU the IRQ is presently being
> handled on being the one that we're in the process of bringing down.

No, I'm not sure I see how that can happen.  The CPU we are bringing
down will always be clear from the input CPU mask, and hence
cpumask_subset(->arch.cpu_mask, mask) will only return true if all the
set CPUs in ->arch.cpu_mask are also set in mask.  IOW: none of the
possible target destinations is a CPU to be removed.

> In
> that case we absolutely cannot skip the move. (I'd like to note that
> there are only two possible input values of "mask" for the function. The
> case I think you've been looking into is when it's &cpu_online_map.

Well, it's cpu_online_map which already has the CPU to be offlined
cleared.  See the call to cpumask_clear_cpu() ahead of calling
fixup_irqs().
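
I.e. the offline path looks roughly like this (a simplified sketch of
the ordering being referred to, not the literal code):

/* The parting CPU leaves the online map first... */
cpumask_clear_cpu(cpu, &cpu_online_map);
/* ...so the mask handed to fixup_irqs() never contains it. */
fixup_irqs(&cpu_online_map, true);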

> In
> which case cpumask_subset() is going to always return true with your
> change in place, if I'm not mistaken. That seems to make your change
> questionable. Yet with that I guess I'm overlooking something.)

I might be wrong, but I think you are missing that the to be offlined
CPU has been removed from cpu_online_map by the time it gets passed
to fixup_irqs().


Re: [PATCH] x86/ucode: Further fixes to identify "ucode already up to date"

2024-05-16 Thread Roger Pau Monné
On Thu, May 16, 2024 at 01:30:21PM +0100, Andrew Cooper wrote:
> On 16/05/2024 12:50 pm, Roger Pau Monné wrote:
> > On Thu, May 16, 2024 at 12:31:03PM +0100, Andrew Cooper wrote:
> >> When the revision in hardware is newer than anything Xen has to hand,
> >> 'microcode_cache' isn't set up.  Then, `xen-ucode` initiates the update
> >> because it doesn't know whether the revisions across the system are 
> >> symmetric
> >> or not.  This involves the patch getting all the way into the
> >> apply_microcode() hooks before being found to be too old.
> >>
> >> This is all a giant mess and needs an overhaul, but in the short term 
> >> simply
> >> adjust the apply_microcode() to return -EEXIST.
> >>
> >> Also, unconditionally print the preexisting microcode revision on boot.  
> >> It's
> >> relevant information which is otherwise unavailable if Xen doesn't find new
> >> microcode to use.
> >>
> >> Fixes: 648db37a155a ("x86/ucode: Distinguish "ucode already up to date"")
> >> Signed-off-by: Andrew Cooper 
> >> ---
> >> CC: Jan Beulich 
> >> CC: Roger Pau Monné 
> >> CC: Fouad Hilly 
> >>
> >> Sorry Fouad, but this collides with your `--force` series once again.
> >> Hopefully it might make things fractionally easier.
> >>
> >> Background: For 06-55-04 (Skylake server, stepping 4 specifically), 
> >> there's a
> >> recent production firmware update which has a newer microcode revision than
> >> exists in the Intel public microcode repository.  It's causing a mess in 
> >> our
> >> automated testing, although it is finding good bugs...
> >> ---
> >>  xen/arch/x86/cpu/microcode/amd.c   | 7 +--
> >>  xen/arch/x86/cpu/microcode/core.c  | 2 ++
> >>  xen/arch/x86/cpu/microcode/intel.c | 7 +--
> >>  3 files changed, 12 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/xen/arch/x86/cpu/microcode/amd.c 
> >> b/xen/arch/x86/cpu/microcode/amd.c
> >> index 17e68697d5bf..f76a563c8b84 100644
> >> --- a/xen/arch/x86/cpu/microcode/amd.c
> >> +++ b/xen/arch/x86/cpu/microcode/amd.c
> >> @@ -222,12 +222,15 @@ static int cf_check apply_microcode(const struct 
> >> microcode_patch *patch)
> >>  uint32_t rev, old_rev = sig->rev;
> >>  enum microcode_match_result result = microcode_fits(patch);
> >>  
> >> +if ( result == MIS_UCODE )
> >> +return -EINVAL;
> >> +
> >>  /*
> >>   * Allow application of the same revision to pick up SMT-specific 
> >> changes
> >>   * even if the revision of the other SMT thread is already up-to-date.
> >>   */
> >> -if ( result != NEW_UCODE && result != SAME_UCODE )
> >> -return -EINVAL;
> >> +if ( result == OLD_UCODE )
> >> +return -EEXIST;
> > Won't it be simpler to just add this check ahead of the existing one,
> > so that you can leave the code as-is, iow:
> >
> > if ( result == OLD_UCODE )
> > return -EEXIST;
> >
> > /*
> >  * Allow application of the same revision to pick up SMT-specific 
> > changes
> >  * even if the revision of the other SMT thread is already up-to-date.
> >  */
> > if ( result != NEW_UCODE && result != SAME_UCODE )
> > return -EINVAL;
> >
> > Thanks, Roger.
> 
> Not really, no.  That still leaves this piece of logic which is
> misleading IMO.
> 
> MIS_UCODE is the only -EINVAL worthy case.
> 
> Every other *_UCODE constant needs to be 0 or -EEXIST, depending on
> allow-same/--force.

OK, my main concern was that the previous logic wouldn't allow a newly
introduced state to get past the -EINVAL return, while the new logic
could possibly let one pass through.

I don't think adding states is that common, and if you prefer it that
way it's fine.
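
For reference, the overall mapping Andrew describes then ends up as
(a condensed sketch of the logic in the hunk above):

if ( result == MIS_UCODE )
    return -EINVAL;   /* patch is not applicable to this CPU */
if ( result == OLD_UCODE )
    return -EEXIST;   /* nothing newer than what is already loaded */
/* NEW_UCODE / SAME_UCODE: go ahead with the update. */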

Acked-by: Roger Pau Monné 

Thanks, Roger.



Re: [PATCH] x86/ucode: Further fixes to identify "ucode already up to date"

2024-05-16 Thread Roger Pau Monné
On Thu, May 16, 2024 at 12:31:03PM +0100, Andrew Cooper wrote:
> When the revision in hardware is newer than anything Xen has to hand,
> 'microcode_cache' isn't set up.  Then, `xen-ucode` initiates the update
> because it doesn't know whether the revisions across the system are symmetric
> or not.  This involves the patch getting all the way into the
> apply_microcode() hooks before being found to be too old.
> 
> This is all a giant mess and needs an overhaul, but in the short term simply
> adjust the apply_microcode() to return -EEXIST.
> 
> Also, unconditionally print the preexisting microcode revision on boot.  It's
> relevant information which is otherwise unavailable if Xen doesn't find new
> microcode to use.
> 
> Fixes: 648db37a155a ("x86/ucode: Distinguish "ucode already up to date"")
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Fouad Hilly 
> 
> Sorry Fouad, but this collides with your `--force` series once again.
> Hopefully it might make things fractionally easier.
> 
> Background: For 06-55-04 (Skylake server, stepping 4 specifically), there's a
> recent production firmware update which has a newer microcode revision than
> exists in the Intel public microcode repository.  It's causing a mess in our
> automated testing, although it is finding good bugs...
> ---
>  xen/arch/x86/cpu/microcode/amd.c   | 7 +--
>  xen/arch/x86/cpu/microcode/core.c  | 2 ++
>  xen/arch/x86/cpu/microcode/intel.c | 7 +--
>  3 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/cpu/microcode/amd.c 
> b/xen/arch/x86/cpu/microcode/amd.c
> index 17e68697d5bf..f76a563c8b84 100644
> --- a/xen/arch/x86/cpu/microcode/amd.c
> +++ b/xen/arch/x86/cpu/microcode/amd.c
> @@ -222,12 +222,15 @@ static int cf_check apply_microcode(const struct 
> microcode_patch *patch)
>  uint32_t rev, old_rev = sig->rev;
>  enum microcode_match_result result = microcode_fits(patch);
>  
> +if ( result == MIS_UCODE )
> +return -EINVAL;
> +
>  /*
>   * Allow application of the same revision to pick up SMT-specific changes
>   * even if the revision of the other SMT thread is already up-to-date.
>   */
> -if ( result != NEW_UCODE && result != SAME_UCODE )
> -return -EINVAL;
> +if ( result == OLD_UCODE )
> +return -EEXIST;

Won't it be simpler to just add this check ahead of the existing one,
so that you can leave the code as-is, iow:

if ( result == OLD_UCODE )
return -EEXIST;

/*
 * Allow application of the same revision to pick up SMT-specific changes
 * even if the revision of the other SMT thread is already up-to-date.
 */
if ( result != NEW_UCODE && result != SAME_UCODE )
return -EINVAL;

Thanks, Roger.



Re: [PATCH] tools: Add install/uninstall targets to tests/x86_emulator

2024-05-16 Thread Roger Pau Monné
On Thu, May 16, 2024 at 12:07:10PM +0100, Alejandro Vallejo wrote:
> Bring test_x86_emulator in line with other tests by adding
> install/uninstall rules.
> 
> Signed-off-by: Alejandro Vallejo 
> ---
>  tools/tests/x86_emulator/Makefile | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/tests/x86_emulator/Makefile 
> b/tools/tests/x86_emulator/Makefile
> index 834b2112e7fe..30edf7e0185d 100644
> --- a/tools/tests/x86_emulator/Makefile
> +++ b/tools/tests/x86_emulator/Makefile
> @@ -269,8 +269,15 @@ clean:
>  .PHONY: distclean
>  distclean: clean
>  
> -.PHONY: install uninstall
> -install uninstall:
> +.PHONY: install
> +install: all
> + $(INSTALL_DIR) $(DESTDIR)$(LIBEXEC_BIN)
> + $(if $(TARGET-y),$(INSTALL_PROG) $(TARGET-y) $(DESTDIR)$(LIBEXEC_BIN))
> +
> +.PHONY: uninstall
> +uninstall:
> + $(RM) -- $(addprefix $(DESTDIR)$(LIBEXEC_BIN)/,$(TARGET-y))
> +

FWIW, should you check that HOSTCC == CC before installing?  Otherwise
I'm unsure of the result in cross-compiled builds, as the x86_emulator
is built with HOSTCC, not CC.

Thanks, Roger.



Re: [PATCH V3 (resend) 06/19] x86: Add a boot option to enable and disable the direct map

2024-05-16 Thread Roger Pau Monné
On Wed, May 15, 2024 at 03:54:51PM +0200, Jan Beulich wrote:
> On 14.05.2024 11:20, Roger Pau Monné wrote:
> > On Mon, May 13, 2024 at 01:40:33PM +, Elias El Yandouzi wrote:
> >> --- a/docs/misc/xen-command-line.pandoc
> >> +++ b/docs/misc/xen-command-line.pandoc
> >> @@ -799,6 +799,18 @@ that enabling this option cannot guarantee anything 
> >> beyond what underlying
> >>  hardware guarantees (with, where available and known to Xen, respective
> >>  tweaks applied).
> >>  
> >> +### directmap (x86)
> >> +> `= `
> >> +
> >> +> Default: `true`
> >> +
> >> +Enable or disable the directmap region in Xen.
> > 
> > Enable or disable fully populating the directmap region in Xen.
> 
> Elias, would you please take care to address earlier review comments
> before sending a new version? This and ...
> 
> >> +
> >> +By default, Xen creates the directmap region which maps physical memory
> >   ^ all?
> >> +in that region. Setting this to no will sparsely populate the directmap,
> > 
> > "Setting this to no" => "Disabling this option will sparsely..."
> 
> ... this is what I had already asked for in reply to v2, of course with
> different wording.
> 
> >> --- a/xen/arch/x86/setup.c
> >> +++ b/xen/arch/x86/setup.c
> >> @@ -1517,6 +1517,8 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> >> long mbi_p)
> >>  if ( highmem_start )
> >>  xenheap_max_mfn(PFN_DOWN(highmem_start - 1));
> >>  
> >> +printk("Booting with directmap %s\n", has_directmap() ? "on" : "off");
> > 
> > IMO this wants to be printed as part of the speculation mitigations, see
> > print_details() in spec_ctrl.c
> 
> And not "on" / "off", but "full" / "sparse" (and word order changed 
> accordingly)
> perhaps.

I've been thinking about this, and I'm leaning towards calling this
new mode "ondemand" rather than "sparse".  The fact that the direct
map ends up sparsely populated is a consequence of populating it on
demand, and hence the latter would be more descriptive IMO.

(Same for the Kconfig option then: ONDEMAND_DIRECTMAP, or some such.)

Thanks, Roger.



Re: [PATCH] x86: respect mapcache_domain_init() failing

2024-05-15 Thread Roger Pau Monné
On Wed, May 15, 2024 at 03:35:15PM +0200, Jan Beulich wrote:
> The function itself properly handles and hands onwards failure from
> create_perdomain_mapping(). Therefore its caller should respect possible
> failure, too.
> 
> Fixes: 4b28bf6ae90b ("x86: re-introduce map_domain_page() et al")
> Signed-off-by: Jan Beulich 

Acked-by: Roger Pau Monné 

Thanks, Roger.



Re: [PATCH V3 (resend) 14/19] Rename mfn_to_virt() calls

2024-05-15 Thread Roger Pau Monné
On Tue, May 14, 2024 at 06:22:59PM +0200, Jan Beulich wrote:
> On 14.05.2024 17:45, Roger Pau Monné wrote:
> > On Mon, May 13, 2024 at 01:40:41PM +, Elias El Yandouzi wrote:
> >> Until directmap gets completely removed, we'd still need to
> >> keep some calls to mfn_to_virt() for xenheap pages or when the
> >> directmap is enabled.
> >>
> >> Rename the macro to mfn_to_directmap_virt() to flag them and
> >> prevent further use of mfn_to_virt().
> > 
> > Both here and in the following patch, I'm afraid I'm unsure of its
> > usefulness.  I'm leaning towards this being code churn for very little
> > benefit.
> 
> I expect this patch is a response to an earlier comment of mine. I'm
> rather worried that at the time this series actually goes in, un-audited
> mfn_to_virt() uses remain (perhaps because of introduction between patch
> submission and its committing). Such uses would all very likely end in
> crashes or worse, but they may not be found by testing.

I see, it would be good to note the intention in the commit message then.

> > Also, I'm not sure I see how the patch prevents further usage of
> > mfn_to_virt(), as (for Arm) the existing macro is not removed.  If
> > anything I would prefer a comment clearly stating that the macro
> > operates on directmap space, and avoid the name change.
> 
> But Arm isn't switched to this sparse direct map mode, I think? At which
> point uses in Arm-specific code continue to be okay.

Right, it's just that Arm will have both mfn_to_virt() and
mfn_to_directmap_virt() which seems a bit confusing when they are
actually the same implementation.

Thanks, Roger.



Re: [PATCH V3 01/19] x86: Create per-domain mapping of guest_root_pt

2024-05-15 Thread Roger Pau Monné
On Tue, May 14, 2024 at 06:15:57PM +0100, Elias El Yandouzi wrote:
> Hi Roger,
> 
> On 13/05/2024 16:27, Roger Pau Monné wrote:
> > > diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> > > index 2a445bb17b..1b025986f7 100644
> > > --- a/xen/arch/x86/pv/domain.c
> > > +++ b/xen/arch/x86/pv/domain.c
> > > @@ -288,6 +288,21 @@ static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
> > > 1U << GDT_LDT_VCPU_SHIFT);
> > >   }
> > > +static int pv_create_root_pt_l1tab(struct vcpu *v)
> > > +{
> > > +return create_perdomain_mapping(v->domain,
> > > +
> > > PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v),
> > > +1, v->domain->arch.pv.root_pt_l1tab,
> > > +NULL);
> > > +}
> > > +
> > > +static void pv_destroy_root_pt_l1tab(struct vcpu *v)
> > 
> > The two 'v' parameters could be const here.
> 
> I could constify the parameters but the functions wouldn't be consistent
> with the two above for gdt/ldt.

The fact they are not const for the other helpers would also need
fixing at some point IMO; it's best if these new ones already use the
correct type.
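
I.e. the new helpers would simply take (a sketch of the constified
prototypes):

static int pv_create_root_pt_l1tab(const struct vcpu *v);
static void pv_destroy_root_pt_l1tab(const struct vcpu *v);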

> > > diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> > > index df015589ce..c1377da7a5 100644
> > > --- a/xen/arch/x86/x86_64/entry.S
> > > +++ b/xen/arch/x86/x86_64/entry.S
> > > @@ -162,7 +162,15 @@ FUNC_LOCAL(restore_all_guest)
> > >   and   %rsi, %rdi
> > >   and   %r9, %rsi
> > >   add   %rcx, %rdi
> > > +
> > > +/*
> > > + * The address in the vCPU cr3 is always mapped in the per-domain
> > > + * pv_root_pt virt area.
> > > + */
> > > +imul  $PAGE_SIZE, VCPU_id(%rbx), %esi
> > 
> > Aren't some of the previous operations against %rsi now useless since
> > it gets unconditionally overwritten here?
> 
I think I can just get rid of:
> 
> and   %r9, %rsi
> 
> > and   %r9, %rsi
> > [...]
> > add   %rcx, %rsi
> 
> The second operation you suggested is actually used to retrieve the VA of
> the PV_ROOT_PT.

Oh, yes, sorry, got confused when looking at the source file together
with the diff, it's only the `and` that can be removed.

Thanks, Roger.



Re: [PATCH V3 (resend) 14/19] Rename mfn_to_virt() calls

2024-05-14 Thread Roger Pau Monné
On Mon, May 13, 2024 at 01:40:41PM +, Elias El Yandouzi wrote:
> Until directmap gets completely removed, we'd still need to
> keep some calls to mfn_to_virt() for xenheap pages or when the
> directmap is enabled.
> 
> Rename the macro to mfn_to_directmap_virt() to flag them and
> prevent further use of mfn_to_virt().

Both here and in the following patch, I'm afraid I'm unsure of its
usefulness.  I'm leaning towards this being code churn for very little
benefit.

Also, I'm not sure I see how the patch prevents further usage of
mfn_to_virt(), as (for Arm) the existing macro is not removed.  If
anything I would prefer a comment clearly stating that the macro
operates on directmap space, and avoid the name change.

Thanks, Roger.



Re: [PATCH V3 (resend) 13/19] x86/setup: Do not create valid mappings when directmap=no

2024-05-14 Thread Roger Pau Monné
On Mon, May 13, 2024 at 01:40:40PM +, Elias El Yandouzi wrote:
> From: Hongyan Xia 
> 
> Create empty mappings in the second e820 pass. Also, destroy existing
> direct map mappings created in the first pass.
> 
> To make xenheap pages visible in guests, it is necessary to create empty
> L3 tables in the direct map even when directmap=no, since guest cr3s
> copy idle domain's L4 entries, which means they will share mappings in
> the direct map if we pre-populate idle domain's L4 entries and L3
> tables. A helper is introduced for this.
> 
> Also, after the direct map is actually gone, we need to stop updating
> the direct map in update_xen_mappings().
> 
> Signed-off-by: Hongyan Xia 
> Signed-off-by: Julien Grall 
> Signed-off-by: Elias El Yandouzi 
> 
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index f26c9799e4..919347d8c2 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -978,6 +978,57 @@ static struct domain *__init create_dom0(const module_t 
> *image,
>  /* How much of the directmap is prebuilt at compile time. */
>  #define PREBUILT_MAP_LIMIT (1 << L2_PAGETABLE_SHIFT)
>  
> +/*
> + * This either populates a valid direct map, or allocates empty L3 tables and
> + * creates the L4 entries for virtual address between [start, end) in the
> + * direct map depending on has_directmap();
> + *
> + * When directmap=no, we still need to populate empty L3 tables in the
> + * direct map region. The reason is that on-demand xenheap mappings are
> + * created in the idle domain's page table but must be seen by
> + * everyone. Since all domains share the direct map L4 entries, they
> + * will share xenheap mappings if we pre-populate the L4 entries and L3
> + * tables in the direct map region for all RAM. We also rely on the fact
> + * that L3 tables are never freed.
> + */
> +static void __init populate_directmap(uint64_t pstart, uint64_t pend,

paddr_t for both.

> +  unsigned int flags)
> +{
> +unsigned long vstart = (unsigned long)__va(pstart);
> +unsigned long vend = (unsigned long)__va(pend);
> +
> +if ( pstart >= pend )
> +return;
> +
> +BUG_ON(vstart < DIRECTMAP_VIRT_START);
> +BUG_ON(vend > DIRECTMAP_VIRT_END);
> +
> +if ( has_directmap() )
> +/* Populate valid direct map. */
> +BUG_ON(map_pages_to_xen(vstart, maddr_to_mfn(pstart),
> +PFN_DOWN(pend - pstart), flags));
> +else
> +{
> +/* Create empty L3 tables. */
> +unsigned long vaddr = vstart & ~((1UL << L4_PAGETABLE_SHIFT) - 1);
> +
> +for ( ; vaddr < vend; vaddr += (1UL << L4_PAGETABLE_SHIFT) )

It might be clearer (by avoiding some of the bitops and masks) to simply
do:

for ( unsigned int idx = l4_table_offset(vstart);
  idx <= l4_table_offset(vend);
  idx++ )
{
...

> +{
> > +l4_pgentry_t *pl4e = &idle_pg_table[l4_table_offset(vaddr)];
> +
> +if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
> +{
> +mfn_t mfn = alloc_boot_pages(1, 1);

Hm, why not use alloc_xen_pagetable()?

> +void *v = map_domain_page(mfn);
> +
> +clear_page(v);
> +UNMAP_DOMAIN_PAGE(v);

Maybe use clear_domain_page()?

> +l4e_write(pl4e, l4e_from_mfn(mfn, __PAGE_HYPERVISOR));
> +}
> +}
> +}
> +}
> +
>  void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
>  {
>  const char *memmap_type = NULL, *loader, *cmdline = "";
> @@ -1601,8 +1652,17 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> long mbi_p)
>  map_e = min_t(uint64_t, e,
>ARRAY_SIZE(l2_directmap) << L2_PAGETABLE_SHIFT);
>  
> -/* Pass mapped memory to allocator /before/ creating new mappings. */
> +/*
> + * Pass mapped memory to allocator /before/ creating new mappings.
> + * The direct map for the bottom 4GiB has been populated in the first
> + * e820 pass. In the second pass, we make sure those existing 
> mappings
> + * are destroyed when directmap=no.

Quite likely a stupid question, but why has the directmap been
populated for memory below 4GB?  IOW: why do we need to create those
mappings just to have them destroyed here?

Thanks, Roger.



Re: [PATCH V3 (resend) 12/19] x86/setup: vmap heap nodes when they are outside the direct map

2024-05-14 Thread Roger Pau Monné
On Mon, May 13, 2024 at 01:40:39PM +, Elias El Yandouzi wrote:
> From: Hongyan Xia 
> 
> When we do not have a direct map, archs_mfn_in_direct_map() will always
> return false, thus init_node_heap() will allocate xenheap pages from an
> existing node for the metadata of a new node. This means that the
> metadata of a new node is in a different node, slowing down heap
> allocation.
> 
> Since we now have early vmap, vmap the metadata locally in the new node.
> 
> Signed-off-by: Hongyan Xia 
> Signed-off-by: Julien Grall 
> Signed-off-by: Elias El Yandouzi 
> 
> 
> 
> Changes in v2:
> * vmap_contig_pages() was renamed to vmap_contig()
> * Fix indentation and coding style
> 
> Changes from Hongyan's version:
> * arch_mfn_in_direct_map() was renamed to
>   arch_mfns_in_direct_map()
> * Use vmap_contig_pages() rather than __vmap(...).
> * Add missing include (xen/vmap.h) so it compiles on Arm
> 
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index dfb2c05322..3c0909f333 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -136,6 +136,7 @@
>  #include 
>  #include 
>  #include 
> +#include <xen/vmap.h>
>  
>  #include 
>  #include 
> @@ -605,22 +606,44 @@ static unsigned long init_node_heap(int node, unsigned 
> long mfn,
>  needed = 0;
>  }
>  else if ( *use_tail && nr >= needed &&
> -  arch_mfns_in_directmap(mfn + nr - needed, needed) &&
>(!xenheap_bits ||
> -   !((mfn + nr - 1) >> (xenheap_bits - PAGE_SHIFT))) )
> +  !((mfn + nr - 1) >> (xenheap_bits - PAGE_SHIFT))) )

Unrelated change?  The indentation was correct for this line and you
are breaking it.

>  {
> -_heap[node] = mfn_to_virt(mfn + nr - needed);
> -avail[node] = mfn_to_virt(mfn + nr - 1) +
> -  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
> +if ( arch_mfns_in_directmap(mfn + nr - needed, needed) )
> +{
> +_heap[node] = mfn_to_virt(mfn + nr - needed);
> +avail[node] = mfn_to_virt(mfn + nr - 1) +
> +  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
> +}
> +else
> +{
> +mfn_t needed_start = _mfn(mfn + nr - needed);
> +
> +_heap[node] = vmap_contig(needed_start, needed);
> +BUG_ON(!_heap[node]);
> +avail[node] = (void *)(_heap[node]) + (needed << PAGE_SHIFT) -
> +  sizeof(**avail) * NR_ZONES;
> +}

You could shorten the blocks I think:

if ( arch_mfns_in_directmap(mfn + nr - needed, needed) )
_heap[node] = mfn_to_virt(mfn + nr - needed);
else
_heap[node] = vmap_contig(_mfn(mfn + nr - needed), needed);

BUG_ON(!_heap[node]);
avail[node] = (void *)(_heap[node]) + (needed << PAGE_SHIFT) -
  sizeof(**avail) * NR_ZONES;

So that more part of the logic is shared between both.

>  }
>  else if ( nr >= needed &&
> -  arch_mfns_in_directmap(mfn, needed) &&
>(!xenheap_bits ||
> -   !((mfn + needed - 1) >> (xenheap_bits - PAGE_SHIFT))) )
> +  !((mfn + needed - 1) >> (xenheap_bits - PAGE_SHIFT))) )
>  {
> -_heap[node] = mfn_to_virt(mfn);
> -avail[node] = mfn_to_virt(mfn + needed - 1) +
> -  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
> +if ( arch_mfns_in_directmap(mfn, needed) )
> +{
> +_heap[node] = mfn_to_virt(mfn);
> +avail[node] = mfn_to_virt(mfn + needed - 1) +
> +  PAGE_SIZE - sizeof(**avail) * NR_ZONES;
> +}
> +else
> +{
> +mfn_t needed_start = _mfn(mfn);
> +
> +_heap[node] = vmap_contig(needed_start, needed);
> +BUG_ON(!_heap[node]);
> +avail[node] = (void *)(_heap[node]) + (needed << PAGE_SHIFT) -
> +  sizeof(**avail) * NR_ZONES;
> +}

Same here.

Thanks, Roger.



Re: [PATCH for-4.19] x86/mtrr: avoid system wide rendezvous when setting AP MTRRs

2024-05-14 Thread Roger Pau Monné
On Tue, May 14, 2024 at 02:50:18PM +0100, Andrew Cooper wrote:
> On 14/05/2024 12:09 pm, Andrew Cooper wrote:
> > On 13/05/2024 9:59 am, Roger Pau Monne wrote:
> >> There's no point in forcing a system wide update of the MTRRs on all 
> >> processors
> >> when there are no changes to be propagated.  On AP startup it's only the AP
> >> that needs to write the system wide MTRR values in order to match the rest 
> >> of
> >> the already online CPUs.
> >>
> >> We have occasionally seen the watchdog trigger during `xen-hptool 
> >> cpu-online`
> >> in one Intel Cascade Lake box with 448 CPUs due to the re-setting of the 
> >> MTRRs
> >> on all the CPUs in the system.
> >>
> >> While there adjust the comment to clarify why the system-wide resetting of 
> >> the
> >> MTRR registers is not needed for the purposes of mtrr_ap_init().
> >>
> >> Signed-off-by: Roger Pau Monné 
> >> ---
> >> For consideration for 4.19: it's a bugfix of a rare instance of the 
> >> watchdog
> >> triggering, but it's also a good performance improvement when performing
> >> cpu-online.
> >>
> >> Hopefully runtime changes to MTRR will affect a single MSR at a time, 
> >> lowering
> >> the chance of the watchdog triggering due to the system-wide resetting of 
> >> the
> >> range.
> > "Runtime" changes will only be during dom0 boot, if at all, but yes - it
> > is restricted to a single MTRR at a time.
> >
> > It's XENPF_{add,del,read}_memtype, but it's only used by Classic Linux. 
> > PVOps only issues read_memtype.
> >
> > Acked-by: Andrew Cooper 
> 
> Actually no - this isn't safe in all cases.
> 
> There are BIOSes which get MTRRs wrong, and with the APs having UC
> covering a wider region than the BSP.
> 
> In this case, creating consistency will alter the MTRRs on all CPUs
> currently up, and we do need to perform the rendezvous in that case.

I'm confused, the state that gets applied in mtrr_set_all() is not
modified to match what's in the started AP registers.

An AP starting with a different set of MTRR registers than the saved
state will result in the MTRR state on the AP being changed, but not
the Xen state stored in mtrr_state, and hence there will be no changes
to synchronize.

> There are 3 cases:
> 
> 1) Nothing to do.  This is the overwhemlingly common case.
> 2) Local changes only.  No broadcast, but we do need to enter CD mode.
> 3) Remote changes needed.  Needs full broadcast.

Please bear with me, but I don't think 3) is possible during AP
bringup.  It's possible I'm missing a path where the differences in
the started AP MTRR state are somehow reconciled with the cached MTRR
state?

Thanks, Roger.



Re: [PATCH 3/4] tools/xen-cpuid: Use automatically generated feature names

2024-05-14 Thread Roger Pau Monné
On Tue, May 14, 2024 at 02:05:10PM +0100, Andrew Cooper wrote:
> On 14/05/2024 8:53 am, Roger Pau Monné wrote:
> > On Fri, May 10, 2024 at 11:40:01PM +0100, Andrew Cooper wrote:
> >> diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
> >> index 6ee835b22949..2f34694e9c57 100644
> >> --- a/tools/misc/xen-cpuid.c
> >> +++ b/tools/misc/xen-cpuid.c
> >> @@ -291,6 +292,9 @@ static const struct {
> >>  
> >>  #define COL_ALIGN "24"
> >>  
> >> +static const char *const feature_names[(FEATURESET_NR_ENTRIES + 1) << 5] =
> >> +INIT_FEATURE_VAL_TO_NAME;
> > I've also considered this when doing the original patch, but it seemed
> > worse to force each user of INIT_FEATURE_VAL_TO_NAME to have to
> > correctly size the array.  I would also use '* 32', as it's IMO
> > clearer and already used below when accessing the array.  I'm fine
> > if we want to go this way, but the extra Python code to add a last
> > array entry if required didn't seem that much TBH.
> 
> I was looking to avoid the other BUILD_BUG_ON()'s, and in particular
> bringing in known_features just for a build time check.
> 
> Given that there's only one instance right now, and no obvious other
> usecase, I'd say this is better.  In terms of just xen-cpuid.c, it's
> clearly correct whereas leaving it implicitly to
> INIT_FEATURE_VAL_TO_NAME is not.

If you dislike my original attempt at doing this, what about casting
the literal array initializer created by gen-cpuid.py, so that the
result ends up looking like:

#define INIT_FEATURE_NAME_ARRAY (const char *[(FEATURESET_NR_ENTRIES + 1) * 
32]) { \
...

Would that be better?
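
To make the idea concrete, usage could then look along these lines (an
illustrative sketch only; the entry names and layout are just
placeholders for what gen-cpuid.py would actually emit):

#define INIT_FEATURE_NAME_ARRAY                                   \
    (const char *[(FEATURESET_NR_ENTRIES + 1) * 32]) {            \
        [0] = "fpu",                                              \
        [1] = "vme",                                              \
        /* ... generated entries ... */                           \
    }

static const char *const *feature_names = INIT_FEATURE_NAME_ARRAY;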

Regards, Roger.



Re: [PATCH V3 (resend) 11/19] x86/setup: Leave early boot slightly earlier

2024-05-14 Thread Roger Pau Monné
On Mon, May 13, 2024 at 01:40:38PM +, Elias El Yandouzi wrote:
> From: Hongyan Xia 
> 
> When we do not have a direct map, memory for metadata of heap nodes in
> init_node_heap() is allocated from xenheap, which needs to be mapped and
> unmapped on demand. However, we cannot just take memory from the boot
> allocator to create the PTEs while we are passing memory to the heap
> allocator.
> 
> To solve this race, we leave early boot slightly sooner so that Xen PTE
> pages are allocated from the heap instead of the boot allocator. We can
> do this because the metadata for the 1st node is statically allocated,
> and by the time we need memory to create mappings for the 2nd node, we
> already have enough memory in the heap allocator in the 1st node.
> 
> Signed-off-by: Hongyan Xia 
> Signed-off-by: Julien Grall 
> Signed-off-by: Elias El Yandouzi 
> 
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index bd6b1184f5..f26c9799e4 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1751,6 +1751,22 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> long mbi_p)
>  
>  numa_initmem_init(0, raw_max_page);
>  
> +/*
> + * When we do not have a direct map, memory for metadata of heap nodes in
> + * init_node_heap() is allocated from xenheap, which needs to be mapped 
> and
> + * unmapped on demand.

Hm, maybe I'm confused, but isn't xenheap memory supposed to be always
mapped when in use?  In one of the previous patches xenheap memory is
unconditionally mapped in alloc_xenheap_pages().

IMO, this would better be worded as:  "... is allocated from xenheap,
which needs to be mapped at allocation and unmapped when freed."

> However, we cannot just take memory from the boot
> + * allocator to create the PTEs while we are passing memory to the heap
> + * allocator during end_boot_allocator().

Could you elaborate here?  It's not obvious to me why we can't consume
memory from the boot allocator.  Is it because under certain
conditions we might try to allocate memory from the boot allocator in
order to fulfill a call to map_pages_to_xen() and find the boot
allocator empty?

Thanks, Roger.



Re: [PATCH for-4.19] x86/mtrr: avoid system wide rendezvous when setting AP MTRRs

2024-05-14 Thread Roger Pau Monné
On Tue, May 14, 2024 at 01:57:13PM +0200, Jan Beulich wrote:
> On 13.05.2024 10:59, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/cpu/mtrr/main.c
> > +++ b/xen/arch/x86/cpu/mtrr/main.c
> > @@ -573,14 +573,15 @@ void mtrr_ap_init(void)
> > if (!mtrr_if || hold_mtrr_updates_on_aps)
> > return;
> > /*
> > -* Ideally we should hold mtrr_mutex here to avoid mtrr entries changed,
> > -* but this routine will be called in cpu boot time, holding the lock
> > -* breaks it. This routine is called in two cases: 1.very earily time
> > -* of software resume, when there absolutely isn't mtrr entry changes;
> > -* 2.cpu hotadd time. We let mtrr_add/del_page hold cpuhotplug lock to
> > -* prevent mtrr entry changes
> > +* hold_mtrr_updates_on_aps takes care of preventing unnecessary MTRR
> > +* updates when batch starting the CPUs (see
> > +* mtrr_aps_sync_{begin,end}()).
> > +*
> > +* Otherwise just apply the current system wide MTRR values to this AP.
> > +* Note this doesn't require synchronization with the other CPUs, as
> > +* there are strictly no modifications of the current MTRR values.
> >  */
> > -   set_mtrr(~0U, 0, 0, 0);
> > +   mtrr_set_all();
> >  }
> 
> While I agree with the change here, it doesn't go quite far enough. Originally
> I meant to ask that, with this (supposedly) sole use of ~0U gone, you please
> also drop the handling of that special case from set_mtrr(). But another
> similar call exist in mtrr_aps_sync_end(). Yet while that's "fine" for the
> boot case (watchdog being started only slightly later), it doesn't look to be
> for the S3 resume one: The watchdog is re-enabled quite a bit earlier there.
> I actually wonder whether mtrr_aps_sync_{begin,end}() wouldn't better
> themselves invoke watchdog_{dis,en}able(), thus also making the boot case
> explicitly safe, not just safe because of ordering.

Hm, I don't like disabling the watchdog, I guess it could be
acceptable here because both usages of mtrr_aps_sync_end() are limited
to specific scenarios (boot or resume from suspension).  I can prepare
a separate patch, but I don't think the watchdog disabling should be
part of this patch.
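
If it helps, such a separate patch might roughly look like this (an
illustrative sketch only, reusing the existing watchdog_{dis,en}able()
helpers Jan mentions; not the actual code):

void mtrr_aps_sync_begin(void)
{
    /* Keep the watchdog quiet across the batched MTRR rendezvous. */
    watchdog_disable();
    hold_mtrr_updates_on_aps = 1;
}

void mtrr_aps_sync_end(void)
{
    hold_mtrr_updates_on_aps = 0;
    set_mtrr(~0U, 0, 0, 0);
    watchdog_enable();
}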

Thanks, Roger.



Re: [PATCH V3 (resend) 10/19] xen/page_alloc: Add a path for xenheap when there is no direct map

2024-05-14 Thread Roger Pau Monné
On Mon, May 13, 2024 at 01:40:37PM +, Elias El Yandouzi wrote:
> From: Hongyan Xia 
> 
> When there is not an always-mapped direct map, xenheap allocations need
> to be mapped and unmapped on-demand.
> 
> Signed-off-by: Hongyan Xia 
> Signed-off-by: Julien Grall 
> Signed-off-by: Elias El Yandouzi 
> 
> 
> 
> I have left the call to map_pages_to_xen() and destroy_xen_mappings()
> in the split heap for now. I am not entirely convinced this is necessary
> because in that setup only the xenheap would be always mapped and
> this doesn't contain any guest memory (aside the grant-table).
> So map/unmapping for every allocation seems unnecessary.

I'm also concerned by this: did you test that
CONFIG_SEPARATE_XENHEAP=y works properly with the added {,un}map
calls?

If CONFIG_SEPARATE_XENHEAP=y I would expect the memory returned by
alloc_heap_pages(MEMZONE_XEN...) to already have the virtual mappings
created ahead of time?

The comment at the top of page_alloc.c also needs to be updated to
note how the removal of the direct map affects xenheap allocations,
AFAICT a new combination is now possible:

CONFIG_SEPARATE_XENHEAP=n & CONFIG_NO_DIRECTMAP=y

> Changes in v2:
> * Fix remaining wrong indentation in alloc_xenheap_pages()
> 
> Changes since Hongyan's version:
> * Rebase
> * Fix indentation in alloc_xenheap_pages()
> * Fix build for arm32
> 
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 9b7e4721cd..dfb2c05322 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -2242,6 +2242,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe)
>  void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
>  {
>  struct page_info *pg;
> +void *ret;

virt_addr maybe? ret is what I would expect to store the return value
of the function usually.

>  
>  ASSERT_ALLOC_CONTEXT();
>  
> @@ -2250,17 +2251,36 @@ void *alloc_xenheap_pages(unsigned int order, 
> unsigned int memflags)
>  if ( unlikely(pg == NULL) )
>  return NULL;
>  
> +ret = page_to_virt(pg);
> +
> +if ( !has_directmap() &&
> + map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
> +  PAGE_HYPERVISOR) )
> +{
> +/* Failed to map xenheap pages. */
> +free_heap_pages(pg, order, false);
> +return NULL;
> +}
> +
>  return page_to_virt(pg);
>  }
>  
>  
>  void free_xenheap_pages(void *v, unsigned int order)
>  {
> +unsigned long va = (unsigned long)v & PAGE_MASK;
> +
>  ASSERT_ALLOC_CONTEXT();
>  
>  if ( v == NULL )
>  return;
>  
> +if ( !has_directmap() &&
> + destroy_xen_mappings(va, va + (1UL << (order + PAGE_SHIFT))) )
> +dprintk(XENLOG_WARNING,
> +"Error while destroying xenheap mappings at %p, order %u\n",
> +v, order);
> +
>  free_heap_pages(virt_to_page(v), order, false);
>  }
>  
> @@ -2284,6 +2304,7 @@ void *alloc_xenheap_pages(unsigned int order, unsigned 
> int memflags)
>  {
>  struct page_info *pg;
>  unsigned int i;
> +void *ret;
>  
>  ASSERT_ALLOC_CONTEXT();
>  
> @@ -2296,16 +2317,28 @@ void *alloc_xenheap_pages(unsigned int order, 
> unsigned int memflags)
>  if ( unlikely(pg == NULL) )
>  return NULL;
>  
> +ret = page_to_virt(pg);
> +
> +if ( !has_directmap() &&
> + map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
> +  PAGE_HYPERVISOR) )
> +{
> +/* Failed to map xenheap pages. */
> +free_domheap_pages(pg, order);
> +return NULL;
> +}
> +
>  for ( i = 0; i < (1u << order); i++ )
>  pg[i].count_info |= PGC_xen_heap;
>  
> -return page_to_virt(pg);
> +return ret;
>  }
>  
>  void free_xenheap_pages(void *v, unsigned int order)
>  {
>  struct page_info *pg;
>  unsigned int i;
> +unsigned long va = (unsigned long)v & PAGE_MASK;
>  
>  ASSERT_ALLOC_CONTEXT();
>  
> @@ -2317,6 +2350,12 @@ void free_xenheap_pages(void *v, unsigned int order)
>  for ( i = 0; i < (1u << order); i++ )
>  pg[i].count_info &= ~PGC_xen_heap;
>  
> +if ( !has_directmap() &&
> + destroy_xen_mappings(va, va + (1UL << (order + PAGE_SHIFT))) )
> +dprintk(XENLOG_WARNING,
> +"Error while destroying xenheap mappings at %p, order %u\n",
> +v, order);

I don't think this should be a dprintk(): leaving mappings behind
could be a severe issue, given the point of this work is to prevent
leaking data by not having everything mapped in the direct map.

This needs to be a printk() IMO.  I'm unsure whether freeing the memory
would need to be avoided if destroying the mappings failed; I can't
think of how we could recover from this gracefully.

Thanks, Roger.



Re: [PATCH V3 (resend) 07/19] xen/x86: Add support for the PMAP

2024-05-14 Thread Roger Pau Monné
On Tue, May 14, 2024 at 12:26:29PM +0200, Jan Beulich wrote:
> On 14.05.2024 12:22, Roger Pau Monné wrote:
> > On Tue, May 14, 2024 at 11:43:14AM +0200, Jan Beulich wrote:
> >> On 14.05.2024 11:40, Roger Pau Monné wrote:
> >>> On Mon, May 13, 2024 at 01:40:34PM +, Elias El Yandouzi wrote:
> >>>> @@ -53,6 +55,8 @@ enum fixed_addresses {
> >>>>  FIX_PV_CONSOLE,
> >>>>  FIX_XEN_SHARED_INFO,
> >>>>  #endif /* CONFIG_XEN_GUEST */
> >>>> +FIX_PMAP_BEGIN,
> >>>> +FIX_PMAP_END = FIX_PMAP_BEGIN + NUM_FIX_PMAP,
> >>>
> >>> This would better have
> >>>
> >>> #ifdef CONFIG_HAS_PMAP
> >>>
> >>> guards?
> >>
> >> That's useful only when the option can actually be off in certain
> >> configurations, isn't it?
> > 
> > My comment earlier on this patch suggested to make CONFIG_HAS_PMAP be
> > selected by HAS_SECRET_HIDING, rather than being unconditionally
> > arch-selected (if that's possible, I certainly don't know the usage in
> > further patches).
> 
> Right, but in patch 6 HAS_SECRET_HIDING is selected unconditionally,
> which would then also select HAS_PMAP. If, otoh, HAS_PMAP was selected
> only when SECRET_HIDING (or whatever its name is going to be), then an
> #ifdef would indeed be wanted here.

Oh, indeed, I meant to tie it to SECRET_HIDING and not
HAS_SECRET_HIDING.  I have to admit (as I've already commented on the
patch) I don't much like those names; they are far too generic.

Thanks, Roger.



Re: [PATCH V3 (resend) 09/19] x86/domain_page: Remove the fast paths when mfn is not in the directmap

2024-05-14 Thread Roger Pau Monné
On Mon, May 13, 2024 at 01:40:36PM +, Elias El Yandouzi wrote:
> From: Hongyan Xia 
> 
> When mfn is not in direct map, never use mfn_to_virt for any mappings.
> 
> We replace mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) with
> arch_mfns_in_direct_map(mfn, 1) because these two are equivalent. The
> extra comparison in arch_mfns_in_direct_map() looks different but because
> DIRECTMAP_VIRT_END is always higher, it does not make any difference.
> 
> Lastly, domain_page_map_to_mfn() needs to gain a special case for
> the PMAP.
> 
> Signed-off-by: Hongyan Xia 
> Signed-off-by: Julien Grall 
> 
> 
> 
> Changes since Hongyan's version:
> * arch_mfn_in_direct_map() was renamed to arch_mfns_in_directmap()
> * add a special case for the PMAP in domain_page_map_to_mfn()
> 
> diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> index 55e337aaf7..89caefc8a2 100644
> --- a/xen/arch/x86/domain_page.c
> +++ b/xen/arch/x86/domain_page.c
> @@ -14,8 +14,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  static DEFINE_PER_CPU(struct vcpu *, override);
> @@ -35,10 +37,11 @@ static inline struct vcpu *mapcache_current_vcpu(void)
>  /*
>   * When using efi runtime page tables, we have the equivalent of the idle
>   * domain's page tables but current may point at another domain's VCPU.
> - * Return NULL as though current is not properly set up yet.
> + * Return the idle domains's vcpu on that core because the efi per-domain
> + * region (where the mapcache is) is in-sync with the idle domain.
>   */
>  if ( efi_rs_using_pgtables() )
> -return NULL;
> +return idle_vcpu[smp_processor_id()];

There's already an existing instance of idle_vcpu[smp_processor_id()]
down in the function, it might make sense to put this in a local
variable.

>  
>  /*
>   * If guest_table is NULL, and we are running a paravirtualised guest,
> @@ -77,18 +80,24 @@ void *map_domain_page(mfn_t mfn)
>  struct vcpu_maphash_entry *hashent;
>  
>  #ifdef NDEBUG
> -if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
> +if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
>  return mfn_to_virt(mfn_x(mfn));
>  #endif
>  
>  v = mapcache_current_vcpu();
> -if ( !v )
> -return mfn_to_virt(mfn_x(mfn));
> +if ( !v || !v->domain->arch.mapcache.inuse )
> +{
> +if ( arch_mfns_in_directmap(mfn_x(mfn), 1) )
> +return mfn_to_virt(mfn_x(mfn));
> +else
> +{
> +BUG_ON(system_state >= SYS_STATE_smp_boot);
> +return pmap_map(mfn);
> +}
> +}
>  
> >  dcache = &v->domain->arch.mapcache;
> >  vcache = &v->arch.mapcache;
> -if ( !dcache->inuse )
> -return mfn_to_virt(mfn_x(mfn));
>  
>  perfc_incr(map_domain_page_count);
>  
> @@ -184,6 +193,12 @@ void unmap_domain_page(const void *ptr)
>  if ( !va || va >= DIRECTMAP_VIRT_START )
>  return;
>  
> +if ( va >= FIXADDR_START && va < FIXADDR_TOP )

This should be a fixmap helper IMO. virt_is_fixmap(addr) or similar.
There's already an existing instance in virt_to_fix().
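
Something like (one possible shape for such a helper, sketch only):

static inline bool virt_is_fixmap(unsigned long va)
{
    return va >= FIXADDR_START && va < FIXADDR_TOP;
}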

> +{
> +pmap_unmap((void *)ptr);
> +return;
> +}
> +
>  ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
>  
>  v = mapcache_current_vcpu();
> @@ -237,7 +252,7 @@ int mapcache_domain_init(struct domain *d)
>  unsigned int bitmap_pages;
>  
>  #ifdef NDEBUG
> -if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) 
> )
> +if ( !mem_hotplug && arch_mfn_in_directmap(0, max_page) )
>  return 0;
>  #endif
>  
> @@ -308,7 +323,7 @@ void *map_domain_page_global(mfn_t mfn)
>  local_irq_is_enabled()));
>  
>  #ifdef NDEBUG
> -if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
> +if ( arch_mfn_in_directmap(mfn_x(mfn, 1)) )
>  return mfn_to_virt(mfn_x(mfn));
>  #endif
>  
> @@ -335,6 +350,23 @@ mfn_t domain_page_map_to_mfn(const void *ptr)
>  if ( va >= DIRECTMAP_VIRT_START )
>  return _mfn(virt_to_mfn(ptr));
>  
> +/*
> + * The fixmap is stealing the top-end of the VMAP. So the check for
> + * the PMAP *must* happen first.
> + *
> > + * Also, the fixmap translates a slot to an address backwards. The
> + * logic will rely on it to avoid any complexity. So check at
> + * compile time this will always hold.
> +*/
> +BUILD_BUG_ON(fix_to_virt(FIX_PMAP_BEGIN) < fix_to_virt(FIX_PMAP_END));
> +
> +if ( ((unsigned long)fix_to_virt(FIX_PMAP_END) <= va) &&
> + ((va & PAGE_MASK) <= (unsigned long)fix_to_virt(FIX_PMAP_BEGIN)) )
> +{

Can we place this as some kind of helper in fixmap.h?

It's already quite ugly, and could be useful in other places.

bool virt_in_fixmap_range(addr, start idx, end idx)

Or something similar.
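
E.g. (one possible shape, simply wrapping the open-coded check above;
sketch only):

static inline bool virt_in_fixmap_range(unsigned long va,
                                        enum fixed_addresses start,
                                        enum fixed_addresses end)
{
    /* Fixmap entries are laid out backwards: higher index, lower address. */
    return (unsigned long)fix_to_virt(end) <= va &&
           (va & PAGE_MASK) <= (unsigned long)fix_to_virt(start);
}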

Thanks, Roger.



Re: [PATCH V3 (resend) 07/19] xen/x86: Add support for the PMAP

2024-05-14 Thread Roger Pau Monné
On Tue, May 14, 2024 at 11:43:14AM +0200, Jan Beulich wrote:
> On 14.05.2024 11:40, Roger Pau Monné wrote:
> > On Mon, May 13, 2024 at 01:40:34PM +, Elias El Yandouzi wrote:
> >> @@ -53,6 +55,8 @@ enum fixed_addresses {
> >>  FIX_PV_CONSOLE,
> >>  FIX_XEN_SHARED_INFO,
> >>  #endif /* CONFIG_XEN_GUEST */
> >> +FIX_PMAP_BEGIN,
> >> +FIX_PMAP_END = FIX_PMAP_BEGIN + NUM_FIX_PMAP,
> > 
> > This would better have
> > 
> > #ifdef CONFIG_HAS_PMAP
> > 
> > guards?
> 
> That's useful only when the option can actually be off in certain
> configurations, isn't it?

My comment earlier on this patch suggested to make CONFIG_HAS_PMAP be
selected by HAS_SECRET_HIDING, rather than being unconditionally
arch-selected (if that's possible, I certainly don't know the usage in
further patches).

Regards, Roger.



Re: [PATCH V3 (resend) 06/19] x86: Add a boot option to enable and disable the direct map

2024-05-14 Thread Roger Pau Monné
On Tue, May 14, 2024 at 11:20:21AM +0200, Roger Pau Monné wrote:
> On Mon, May 13, 2024 at 01:40:33PM +, Elias El Yandouzi wrote:
> > From: Hongyan Xia 
> > diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> > index 7561297a75..9d4f1f2d0d 100644
> > --- a/xen/include/xen/mm.h
> > +++ b/xen/include/xen/mm.h
> > @@ -167,6 +167,13 @@ extern unsigned long max_page;
> >  extern unsigned long total_pages;
> >  extern paddr_t mem_hotplug;
> >  
> > +extern bool opt_directmap;
> > +
> > +static inline bool has_directmap(void)
> > +{
> > +return opt_directmap;
> 
> This likely wants:
> 
> return IS_ENABLED(CONFIG_HAS_SECRET_HIDING) && opt_directmap;

Er, sorry, this is wrong, should be:

return !IS_ENABLED(CONFIG_HAS_SECRET_HIDING) || opt_directmap;

Roger.



  1   2   3   4   5   6   7   8   9   10   >