On 2025-11-26 at 02:05 +1100, Gregory Price <[email protected]> wrote... > On Tue, Nov 25, 2025 at 02:09:39PM +0000, Kiryl Shutsemau wrote: > > On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote: > > > With this set, we aim to enable allocation of "special purpose memory" > > > with the page allocator (mm/page_alloc.c) without exposing the same > > > memory as "System RAM". Unless a non-userland component, and does so > > > with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated. > > > > How special is "special purpose memory"? If the only difference is a > > latency/bandwidth discrepancy compared to "System RAM", I don't believe > > it deserves this designation. > > > > That is not the only discrepancy, but it can certainly be one of them. > > I do think, at a certain latency/bandwidth level, memory becomes > "Specific Purpose" - because the performance implications become so > dramatic that you cannot allow just anything to land there. > > In my head, I've been thinking about this list > > 1) Plain old memory (<100ns) > 2) Kinda slower, but basically still memory (100-300ns) > 3) Slow Memory (>300ns, up to 2-3us loaded latencies) > 4) Types 1-3, but with a special feature (Such as compression) > 5) Coherent Accelerator Memory (various interconnects now exist) > 6) Non-coherent Shared Memory and PMEM (FAMFS, Optane, etc) > > Originally I was considering [3,4], but with Alistar's comments I am > also thinking about [5] since apparently some accelerators already > toss their memory into the page allocator for management.
Thanks. > Re: Slow memory -- > > Think >500-700ns cache line fetches, or 1-2us loaded. > > It's still "Basically just memory", but the scenarios in which > you can use it transparently shrink significantly. If you can > control what and how things can land there with good policy, > this can still be a boon compared to hitting I/O. > > But you still want things like reclaim and compaction to run > on this memory, and you still want buddy-allocation of this memory. > > Re: Compression > > This is a class of memory device which presents "usable memory" > but which carries stipulations around its use. > > The compressed case is the example I use in this set. There is an > inline compression mechanism on the device. If the compression ratio > drops to low, writes can get dropped resulting in memory poison. > > We could solve this kind of problem only allowing allocation via > demotion and hack off the Write-bit in the PTE. This provides the > interposition needed to fend-off compression ratio issues. > > But... it's basically still "just memory" - you can even leave it > mapped in the CPU page tables and allow userland to read unimpeded. > > In fact, we even want things like compaction and reclaim to run here. > This cannot be done *unless* this memory is in the page allocator, > and basically necessitates reimplementing all the core services the > kernel provides. > > Re: Accelerators > > Alistair has described accelerators onlining their memory as NUMA > nodes being an existing pattern (apparently not in-tree as far as I > can see, though). Yeah, sadly not yet :-( Hopefully "soon". Although onlining the memory doesn't have much driver involvement as the GPU memory all just appears in the ACPI tables as a CPU-less memory node anyway (which is why it ended up being easy for people to toss it into the page allocator). > General consensus is "don't do this" - and it should be obvious > why. Memory pressure can cause non-workload memory to spill to > these NUMA nodes as fallback allocation targets. Indeed, this is a common complaint when people have done this. > But if we had a strong isolation mechanism, this could be supported. > I'm not convinced this kind of memory actually needs core services > like reclaim, so I will wait to see those arguments/data before I > conclude whether the idea is sound. Sounds reasonable, I don't have strong arugments either way at the moment so will see if we can gather some data. > > > > > > I am not in favor of the new GFP flag approach. To me, this indicates > > that our infrastructure surrounding nodemasks is lacking. I believe we > > would benefit more by improving it rather than simply adding a GFP flag > > on top. > > > > The core of this series is not the GFP flag, it is the splitting of > (cpuset.mems_allowed) into (cpuset.mems_allowed, cpuset.sysram_nodes) > > That is the nodemask infrastructure improvement. The GFP flag is one > mechanism of loosening the validation logic from limiting allocations > from (sysram_nodes) to including all nodes present in (mems_allowed). > > > While I am not an expert in NUMA, it appears that the approach with > > default and opt-in NUMA nodes could be generally useful. Like, > > introduce a system-wide default NUMA nodemask that is a subset of all > > possible nodes. > > This patch set does that (cpuset.sysram_nodes and mt_sysram_nodemask) > > > This way, users can request the "special" nodes by using > > a wider mask than the default. > > > > I describe in the response to David that this is possible, but creates > extreme tripping hazards for a large swath of existing software. > > snippet > ''' > Simple answer: We can choose how hard this guardrail is to break. > > This initial attempt makes it "Hard": > You cannot "accidentally" allocate SPM, the call must be explicit. > > Removing the GFP would work, and make it "Easier" to access SPM memory. > > This would allow a trivial > > mbind(range, SPM_NODE_ID) > > Which is great, but is also an incredible tripping hazard: > > numactl --interleave --all > > and in kernel land: > > __alloc_pages_noprof(..., nodes[N_MEMORY]) > > These will now instantly be subject to SPM node memory. > ''' > > There are many places that use these patterns already. > > But at the end of the day, it is preference: we can choose to do that. > > > cpusets should allow to set both default and possible masks in a > > hierarchical manner where a child's default/possible mask cannot be > > wider than the parent's possible mask and default is not wider that > > own possible. > > > > This patch set implements exactly what you describe: > sysram_nodes = default > mems_allowed = possible > > > > Userspace-driven allocations are restricted by the sysram_nodes mask, > > > nothing in userspace can explicitly request memory from SPM nodes. > > > > > > Instead, the intent is to create new components which understand memory > > > features and register those nodes with those components. This abstracts > > > the hardware complexity away from userland while also not requiring new > > > memory innovations to carry entirely new allocators. > > > > I don't see how it is a positive. It seems to be negative side-effect of > > GFP being a leaky abstraction. > > > > It's a matter of applying an isolation mechanism and then punching an > explicit hole in it. As it is right now, GFP is "leaky" in that there > are, basically, no walls. Reclaim even ignored cpuset controls until > recently, and the page_alloc code even says to ignore cpuset when > in an interrupt context. > > The core of the proposal here is to provide a strong isolation mechanism > and then allow punching explicit holes in it. The GFP flag is one > pattern, I'm open to others. > > ~Gregory

