Brice Goglin wrote:
> On 11/02/2023 at 02:53, Dan Williams wrote:
>
> > Brice Goglin wrote:
> > [..]
> >>>> By the way, once configured in system ram, my CXL ram is merged into an
> >>>> existing "normal" NUMA node. How do I tell Qemu that a CXL region should
> >>>> be part of a new NUMA node? I assume that's what's going to happen on
> >>>> real hardware?
> >>> We don't yet have kernel code to deal with assigning a new NUMA node.
> >>> Was on the todo list in last sync call I think.
> >>
> > In fact, there is no plan to support "new" NUMA node creation. A node
> > can only be onlined / populated from the set of static nodes defined by
> > platform-firmware. The set of static nodes is defined by the union of
> > all the proximity domain numbers in the SRAT as well as a node per
> > CFMWS / QTG id. See:
> >
> >     fd49f99c1809 ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT
> >
> > ...for the CXL node enumeration scheme.
> >
> > Once you have a node per CFMWS then it is up to CDAT and the QTG DSM to
> > group devices by window. This scheme attempts to be as simple as
> > possible, but no simpler. If more granularity is necessary in practice,
> > that would be a good discussion to have soonish... LSF/MM comes to mind.
>
> Actually I was mistaken, there's already a new NUMA node when creating
> a region under Qemu, but my tools ignored it because it's empty.
> After daxctl online-memory, things look good.
>
> Can you clarify your above sentences on a real node? If I connect two
> memory expanders on two slots of the same CPU, do I get a single CFMWS or two?
> What if I connect two devices to a single slot across a CXL switch?
Ultimately the answer is "ask your platform vendor", because this is a
firmware decision. However, since the ACPI HMAT requires a proximity
domain per distinct performance class, and needs to distinguish the
memory that is "attached" to a CPU initiator domain, my expectation is
that CXL memory will at a minimum be described in a proximity domain
distinct from "local DRAM".

The number of CFMWS windows published is gated by the degrees of
freedom platform-firmware wants to give the OS relative to the number
of CXL host-bridges in the system. One scheme that seems plausible is
one CFMWS window per host-bridge at x1 interleave (to maximize RAS)
plus one CFMWS with all host-bridges interleaved together (to maximize
performance).

The above is just my personal opinion as a Linux kernel developer; a
platform implementation is free to be as restrictive or generous as it
wants with CFMWS resources.
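
If you want to play with that "x1 window per host-bridge plus one
interleaved window" layout under QEMU, something along these lines
should publish three CFMWS entries (and so three potential CXL nodes).
Untested sketch, built from QEMU's documented cxl-fmw / pxb-cxl /
cxl-rp / cxl-type3 options: the ids, sizes, and slot numbers are
arbitrary, repeated -M options merge into one machine config, and
volatile-memdev needs a QEMU with CXL volatile-memory support:

    # Two CXL host bridges, one volatile type-3 expander behind each;
    # cxl-fmw.0 / cxl-fmw.1 are the per-host-bridge x1 windows, and
    # cxl-fmw.2 interleaves both host bridges together.
    qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=20G,slots=8 -smp 4 \
        -object memory-backend-ram,id=vmem1,size=256M \
        -object memory-backend-ram,id=vmem2,size=256M \
        -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
        -device pxb-cxl,bus_nr=13,bus=pcie.0,id=cxl.2 \
        -device cxl-rp,port=0,bus=cxl.1,id=rp1,chassis=0,slot=2 \
        -device cxl-rp,port=0,bus=cxl.2,id=rp2,chassis=0,slot=3 \
        -device cxl-type3,bus=rp1,volatile-memdev=vmem1,id=mem-dev1 \
        -device cxl-type3,bus=rp2,volatile-memdev=vmem2,id=mem-dev2 \
        -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
        -M cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=4G \
        -M cxl-fmw.2.targets.0=cxl.1,cxl-fmw.2.targets.1=cxl.2,cxl-fmw.2.size=8G,cxl-fmw.2.interleave-granularity=8k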
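
In the guest, each window then shows up as its own, initially
memoryless, node that only gains capacity once a region is created and
onlined, e.g. (hypothetical decoder/memdev/dax names, whatever
enumeration hands you; -t ram needs an ndctl new enough to create
ram-type regions):

    cxl list -M                            # enumerate the type-3 memdevs
    cxl create-region -m -d decoder0.0 -w 1 -t ram mem0
    daxctl online-memory dax0.0            # if the memory did not auto-online
    numactl -H                             # the new node now has capacity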
