On Wed, 26 Oct 2022 16:47:18 -0400 Gregory Price <gregory.pr...@memverge.com> wrote:
> On Wed, Oct 26, 2022 at 08:13:24PM +0000, Adam Manzanares wrote: > > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > > > Submitted as an extention to the multi-feature branch maintained > > > by Jonathan Cameron at: > > > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$ > > > > > > > > > > > > Summary of Changes: > > > 1) E820 CFMW Bug fix. > > > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > > > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > > > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev +CC Dan for a question on status of Generic Ports ACPI code first ECN. Given that was done on the mailing list I can find any public tracking of whether it was accepted or not - hence whether we can get on with implementation. There hasn't been a release ACPI spec since before that was proposed so we need that confirmation of the code first proposal being accepted to get things moving. > > > > > > > > > Regarding the E820 fix > > > * This bugfix is required for memory regions to work on x86 > > > * input from Dan Williams and others suggest that E820 entry for > > > the CFMW should not exist, as it is expected to be dynamically > > > assigned at runtime. If this entry exists, it instead blocks > > > region creation by nature of the memory region being marked as > > > reserved. > > > > For CXL 2.0 it is my understanding that volatile capacity present at boot > > will > > be advertised by the firmware. In the absence of EFI I would assume this > > would > > be provided in the e820 map. Whilst this is one option, it is certainly not the case that all CXL 2.0 platforms will decide to do any setup of CXL memory (volatile or not) in the firmware. They can leave it entirely to the OS, so a cold plug flow. Early platforms will do the setup in BIOS to support unware OSes, once that's not a problem any more the only reason you'd want to do this is if you don't have other RAM to boot the system, or for some reason you want your host kernel etc in the CXL attached memory. I'd expect to see BIOS having OS managed configuration as an option in the intermediate period where some OSes are aware, others not. OS knows more about usecase / policy so can make better choices on interleaving etc of volatile CXL type 3 memory (let alone the fun corner of devices where you can dynamically change the mix of volatile and non volatile memory). > > The issue in this case is very explicitly that a double-mapping occurs > for the same region. An E820 mapping for RESERVED is set *and* the > region driver allocates a CXL CFMW mapping. As a result the region > drive straight up fails to allocate regions. > > So in either case - at least from my view - the entry added as RESERVED > is just wrong. > > This is separate from type-3 device SRAT entries and default mappings > for volatile regions. For this situation, if you explicitly assign the > memdev backing a type-3 device to a numa node, then an SRAT area is > generated and an explicit e820 entry is generated and marked usable - > though I think there are likely issues with this kind of > double-referencing. SRAT setup for CXL type 3 devices is to my mind is a job for a full BIOS, not QEMU level of faking a few things. That BIOS should also be doing the full configuration (HDM Decoders and all the rest). ARM has a prototype for one of the fixed virtual platforms that does this (talk at Plumbers Uconf), we should look to do the same if we want to test those flows using QEMU via appropriate changes in EDK2 to walk topology and configure everything. Until the devices are configured there is no way to configure the SLIT, HMAT entries that align with the SRAT ones (in theory those can be updated at runtime via _SLI, _HMA but in practice, I'm fairly sure Linux doesn't support doing that.) > > > > > Is the region driver meant to cover volatile capacity present at boot? I was > > under the impression that it would be used for hot added volatile memory. It > > would be good to cover all of these assumptions for the e820 fix. > > This region appears to cover hotplug memory behind the CFMW. The > problem is that this e820 RESERVED mapping blocks the CFMW region from > being used at all. > > Without this, you can't use a type-3 persistent region, even with > support, let alone a volatile region. In attempting to use a persistent > region as volatile via ndctl and friends, I'm seeing further issues (it > cannot be assigned to a numa node successfully), but that's a separate > issue. For the Numa node bit... So far, the CDAT table isn't used in Linux (read out for debug purposes is supported only). That all needs to be added yet to get the NUMA node allocations to work correctly. It shouldn't have anything to do with SRAT unless the BIOS has done the full full configuration (same way CXL will work with a legacy OS). Come to think of it, that should definitely be on the priority list for kernel support (don't think it was on the list on Tuesday?) > > > > > Lastly it is my understanding that the region driver does not have support > > for > > volatile memory. It would be great to add that functionality if we make this > > change in QEMU. > > > > Right now this is true, but it seems a bit of a chicken/egg scenario. > Nothing to test against vs no support. Nudging this along such that we > can at least report an (unusable) hot-add volatile memory region would > provide someone working with the region driver something to poke and > prod at. Agreed. This is same place as Ben's original CXL QEMU work on non volatile. Need something to develop against so we'll at least have QEMU patches available whilst the kernel side is in development (hopefully this cycle) > > > > Regarding SRAT Generation for Type-3 Devices > > > * Co-Developed by Davidlohr Bueso. Built from his base patch and > > > extended to work with both volatile and persistent regions. > > > * This can be used to demonstrate static type-3 device mapping and > > > testing numa-access to type-3 device memory regions. > > Regarding "volatile memory present at boot" - there is still two ways > for that memory to be onlined: Statically (entered as an explicit e820 > region after reading the SRAT), or Dynamically (hot-add by the region > driver). > > This patch would at least allow an SRAT to be generated if you > explicitly add a NUMA node mapping to it. Although I concede that I'm > not entirely sure what is "correct" here. For hotplug, needs to be the region driver, not here (or BIOS if you are doing a BIOS driven flow - in which case the whole topology is discovered and mostly configured by the BIOS before the OS reaches it - including filling in SRAT, SLIT, HMAT etCc). > > What this ends up looking like is mapping a memdev to both a numa node > and to a type-3 device. Though that seems wrong. > > After further testing it seems like creating a CPU-less, Memory-less > NUMA node with the intent of mapping volatile memory regions to it is > not supported (yet?). Yup, and I doubt it ever will be for reasons above. BIOS does it all, or OS does it all. Mixing and matching is a mess (there is an exception - Generic Port entries in SRAT - those we do need for the OS driven flow and possibly also the BIOS flow - patches welcome! I'd forgotten that on my list of QEMU stuff that needs doing.) https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.ca...@intel.com/ If anyone is implementing that, also good to do Generic Initiators as they are very similar. Jonathan