On Wed, 26 Oct 2022 16:47:18 -0400
Gregory Price <gregory.pr...@memverge.com> wrote:

> On Wed, Oct 26, 2022 at 08:13:24PM +0000, Adam Manzanares wrote:
> > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote:  
> > > Submitted as an extention to the multi-feature branch maintained
> > > by Jonathan Cameron at:
> > > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$
> > >    
> > > 
> > > 
> > > Summary of Changes:
> > > 1) E820 CFMW Bug fix.  
> > > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers
> > > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices
> > > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev

+CC Dan for a question on status of Generic Ports ACPI code first ECN.
Given that was done on the mailing list I can find any public tracking
of whether it was accepted or not - hence whether we can get on with
implementation.  There hasn't been a release ACPI spec since before
that was proposed so we need that confirmation of the code first proposal
being accepted to get things moving.

> > > 
> > > 
> > > Regarding the E820 fix
> > >   * This bugfix is required for memory regions to work on x86
> > >   * input from Dan Williams and others suggest that E820 entry for
> > >     the CFMW should not exist, as it is expected to be dynamically
> > >     assigned at runtime.  If this entry exists, it instead blocks
> > >     region creation by nature of the memory region being marked as
> > >     reserved.  
> > 
> > For CXL 2.0 it is my understanding that volatile capacity present at boot 
> > will
> > be advertised by the firmware. In the absence of EFI I would assume this 
> > would
> > be provided in the e820 map.   

Whilst this is one option, it is certainly not the case that all CXL 2.0
platforms will decide to do any setup of CXL memory (volatile or not) in the
firmware.  They can leave it entirely to the OS, so a cold plug flow.
Early platforms will do the setup in BIOS to support unware OSes, once
that's not a problem any more the only reason you'd want to do this is if
you don't have other RAM to boot the system, or for some reason you want
your host kernel etc in the CXL attached memory.

I'd expect to see BIOS having OS managed configuration as an option in the
intermediate period where some OSes are aware, others not.
OS knows more about usecase / policy so can make better choices on interleaving
etc of volatile CXL type 3 memory (let alone the fun corner of devices
where you can dynamically change the mix of volatile and non volatile
memory).


> 
> The issue in this case is very explicitly that a double-mapping occurs
> for the same region.  An E820 mapping for RESERVED is set *and* the
> region driver allocates a CXL CFMW mapping.  As a result the region
> drive straight up fails to allocate regions.
> 
> So in either case - at least from my view - the entry added as RESERVED
> is just wrong.
> 
> This is separate from type-3 device SRAT entries and default mappings
> for volatile regions.  For this situation, if you explicitly assign the
> memdev backing a type-3 device to a numa node, then an SRAT area is
> generated and an explicit e820 entry is generated and marked usable -
> though I think there are likely issues with this kind of
> double-referencing.

SRAT setup for CXL type 3 devices is to my mind is a job for a full BIOS,
not QEMU level of faking a few things. That BIOS should also
be doing the full configuration (HDM Decoders and all the rest).  ARM has
a prototype for one of the fixed virtual platforms that does this (talk
at Plumbers Uconf), we should look to do the same if we want to test
those flows using QEMU via appropriate changes in EDK2 to walk topology
and configure everything.  Until the devices are configured there is no
way to configure the SLIT, HMAT entries that align with the SRAT ones
(in theory those can be updated at runtime via _SLI, _HMA but in 
practice, I'm fairly sure Linux doesn't support doing that.)


> 
> > 
> > Is the region driver meant to cover volatile capacity present at boot? I was
> > under the impression that it would be used for hot added volatile memory. It
> > would be good to cover all of these assumptions for the e820 fix.  
> 
> This region appears to cover hotplug memory behind the CFMW.  The
> problem is that this e820 RESERVED mapping blocks the CFMW region from
> being used at all.
> 
> Without this, you can't use a type-3 persistent region, even with
> support, let alone a volatile region.  In attempting to use a persistent
> region as volatile via ndctl and friends, I'm seeing further issues (it
> cannot be assigned to a numa node successfully), but that's a separate
> issue.
For the Numa node bit...

So far, the CDAT table isn't used in Linux (read out for debug purposes
is supported only).  That all needs to be added yet to get the NUMA node
allocations to work correctly.  It shouldn't have anything to do with SRAT
unless the BIOS has done the full full configuration (same way CXL will work
with a legacy OS).  Come to think of it, that should definitely be on the
priority list for kernel support (don't think it was on the list on Tuesday?)

> 
> > 
> > Lastly it is my understanding that the region driver does not have support 
> > for
> > volatile memory. It would be great to add that functionality if we make this
> > change in QEMU.
> >   
> 
> Right now this is true, but it seems a bit of a chicken/egg scenario.
> Nothing to test against vs no support.  Nudging this along such that we
> can at least report an (unusable) hot-add volatile memory region would
> provide someone working with the region driver something to poke and
> prod at.

Agreed. This is same place as Ben's original CXL QEMU work on non volatile.
Need something to develop against so we'll at least have QEMU patches
available whilst the kernel side is in development (hopefully this cycle)
> 
> > > Regarding SRAT Generation for Type-3 Devices
> > >   * Co-Developed by Davidlohr Bueso.  Built from his base patch and
> > >     extended to work with both volatile and persistent regions.
> > >   * This can be used to demonstrate static type-3 device mapping and
> > >     testing numa-access to type-3 device memory regions.  
> 
> Regarding "volatile memory present at boot" - there is still two ways
> for that memory to be onlined: Statically (entered as an explicit e820
> region after reading the SRAT), or Dynamically (hot-add by the region
> driver).
> 
> This patch would at least allow an SRAT to be generated if you
> explicitly add a NUMA node mapping to it.  Although I concede that I'm
> not entirely sure what is "correct" here.

For hotplug, needs to be the region driver, not here (or BIOS if you
are doing a BIOS driven flow - in which case the whole topology is
discovered and mostly configured by the BIOS before the OS reaches it
- including filling in SRAT, SLIT, HMAT etCc).
> 
> What this ends up looking like is mapping a memdev to both a numa node
> and to a type-3 device.  Though that seems wrong.
> 
> After further testing it seems like creating a CPU-less, Memory-less
> NUMA node with the intent of mapping volatile memory regions to it is
> not supported (yet?).

Yup, and I doubt it ever will be for reasons above. 

BIOS does it all, or OS does it all.  Mixing and matching is a mess
(there is an exception - Generic Port entries in SRAT - those we do
need for the OS driven flow and possibly also the BIOS flow
- patches welcome! I'd forgotten that on my list of QEMU stuff that
needs doing.)

https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.ca...@intel.com/

If anyone is implementing that, also good to do Generic Initiators
as they are very similar.

Jonathan
 


Reply via email to