On Thu, Oct 27, 2022 at 11:58:54AM +0100, Jonathan Cameron wrote: > On Wed, 26 Oct 2022 16:47:18 -0400 > Gregory Price <gregory.pr...@memverge.com> wrote: > > > On Wed, Oct 26, 2022 at 08:13:24PM +0000, Adam Manzanares wrote: > > > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > > > > Submitted as an extention to the multi-feature branch maintained > > > > by Jonathan Cameron at: > > > > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$ > > > > > > > > > > > > > > > > Summary of Changes: > > > > 1) E820 CFMW Bug fix. > > > > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > > > > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > > > > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev > > +CC Dan for a question on status of Generic Ports ACPI code first ECN. > Given that was done on the mailing list I can find any public tracking > of whether it was accepted or not - hence whether we can get on with > implementation. There hasn't been a release ACPI spec since before > that was proposed so we need that confirmation of the code first proposal > being accepted to get things moving. > > > > > > > > > > > > > Regarding the E820 fix > > > > * This bugfix is required for memory regions to work on x86 > > > > * input from Dan Williams and others suggest that E820 entry for > > > > the CFMW should not exist, as it is expected to be dynamically > > > > assigned at runtime. If this entry exists, it instead blocks > > > > region creation by nature of the memory region being marked as > > > > reserved. > > > > > > For CXL 2.0 it is my understanding that volatile capacity present at boot > > > will > > > be advertised by the firmware. In the absence of EFI I would assume this > > > would > > > be provided in the e820 map. > > Whilst this is one option, it is certainly not the case that all CXL 2.0 > platforms will decide to do any setup of CXL memory (volatile or not) in the > firmware. They can leave it entirely to the OS, so a cold plug flow. > Early platforms will do the setup in BIOS to support unware OSes, once > that's not a problem any more the only reason you'd want to do this is if > you don't have other RAM to boot the system, or for some reason you want > your host kernel etc in the CXL attached memory. > > I'd expect to see BIOS having OS managed configuration as an option in the > intermediate period where some OSes are aware, others not. > OS knows more about usecase / policy so can make better choices on > interleaving > etc of volatile CXL type 3 memory (let alone the fun corner of devices > where you can dynamically change the mix of volatile and non volatile > memory). > > > > > > The issue in this case is very explicitly that a double-mapping occurs > > for the same region. An E820 mapping for RESERVED is set *and* the > > region driver allocates a CXL CFMW mapping. As a result the region > > drive straight up fails to allocate regions. > > > > So in either case - at least from my view - the entry added as RESERVED > > is just wrong. > > > > This is separate from type-3 device SRAT entries and default mappings > > for volatile regions. For this situation, if you explicitly assign the > > memdev backing a type-3 device to a numa node, then an SRAT area is > > generated and an explicit e820 entry is generated and marked usable - > > though I think there are likely issues with this kind of > > double-referencing. > > SRAT setup for CXL type 3 devices is to my mind is a job for a full BIOS, > not QEMU level of faking a few things. That BIOS should also > be doing the full configuration (HDM Decoders and all the rest). ARM has > a prototype for one of the fixed virtual platforms that does this (talk > at Plumbers Uconf), we should look to do the same if we want to test > those flows using QEMU via appropriate changes in EDK2 to walk topology > and configure everything. Until the devices are configured there is no > way to configure the SLIT, HMAT entries that align with the SRAT ones > (in theory those can be updated at runtime via _SLI, _HMA but in > practice, I'm fairly sure Linux doesn't support doing that.) > > > > > > > > > > Is the region driver meant to cover volatile capacity present at boot? I > > > was > > > under the impression that it would be used for hot added volatile memory. > > > It > > > would be good to cover all of these assumptions for the e820 fix. > > > > This region appears to cover hotplug memory behind the CFMW. The > > problem is that this e820 RESERVED mapping blocks the CFMW region from > > being used at all. > > > > Without this, you can't use a type-3 persistent region, even with > > support, let alone a volatile region. In attempting to use a persistent > > region as volatile via ndctl and friends, I'm seeing further issues (it > > cannot be assigned to a numa node successfully), but that's a separate > > issue. > For the Numa node bit... > > So far, the CDAT table isn't used in Linux (read out for debug purposes > is supported only). That all needs to be added yet to get the NUMA node > allocations to work correctly. It shouldn't have anything to do with SRAT > unless the BIOS has done the full full configuration (same way CXL will work > with a legacy OS). Come to think of it, that should definitely be on the > priority list for kernel support (don't think it was on the list on Tuesday?) > > > > > > > > > Lastly it is my understanding that the region driver does not have > > > support for > > > volatile memory. It would be great to add that functionality if we make > > > this > > > change in QEMU. > > > > > > > Right now this is true, but it seems a bit of a chicken/egg scenario. > > Nothing to test against vs no support. Nudging this along such that we > > can at least report an (unusable) hot-add volatile memory region would > > provide someone working with the region driver something to poke and > > prod at. > > Agreed. This is same place as Ben's original CXL QEMU work on non volatile. > Need something to develop against so we'll at least have QEMU patches > available whilst the kernel side is in development (hopefully this cycle) > > > > > > Regarding SRAT Generation for Type-3 Devices > > > > * Co-Developed by Davidlohr Bueso. Built from his base patch and > > > > extended to work with both volatile and persistent regions. > > > > * This can be used to demonstrate static type-3 device mapping and > > > > testing numa-access to type-3 device memory regions. > > > > Regarding "volatile memory present at boot" - there is still two ways > > for that memory to be onlined: Statically (entered as an explicit e820 > > region after reading the SRAT), or Dynamically (hot-add by the region > > driver). > > > > This patch would at least allow an SRAT to be generated if you > > explicitly add a NUMA node mapping to it. Although I concede that I'm > > not entirely sure what is "correct" here. > > For hotplug, needs to be the region driver, not here (or BIOS if you > are doing a BIOS driven flow - in which case the whole topology is > discovered and mostly configured by the BIOS before the OS reaches it > - including filling in SRAT, SLIT, HMAT etCc). > > > > What this ends up looking like is mapping a memdev to both a numa node > > and to a type-3 device. Though that seems wrong. > > > > After further testing it seems like creating a CPU-less, Memory-less > > NUMA node with the intent of mapping volatile memory regions to it is > > not supported (yet?). > > Yup, and I doubt it ever will be for reasons above. > > BIOS does it all, or OS does it all. Mixing and matching is a mess > (there is an exception - Generic Port entries in SRAT - those we do > need for the OS driven flow and possibly also the BIOS flow > - patches welcome! I'd forgotten that on my list of QEMU stuff that > needs doing.)
Based on the discussions is it safe to say we have settled on an OS driven flow for the current QEMU CXL emulation. > > https://urldefense.com/v3/__https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.ca...@intel.com/__;!!EwVzqGoTKBqv-0DWAJBm!XLqjPd1aXt3i5NXrZhakQlGdgJ5o4tcfV_5iUEp8vwBLv8T1Ft1OVHPI0p7KpYSFDYzhAGj_ulMz1LfZVmJgrOvrO-_v7udl$ > > > If anyone is implementing that, also good to do Generic Initiators > as they are very similar. > > Jonathan > >