On Wed, Jul 23, 2025 at 03:34:34PM +0200, Vlastimil Babka wrote:
> The sheaves do not distinguish NUMA locality of the cached objects.
While currently sheaves are opt-in, to my understanding the plan is to
make this the default. I would argue a hard requirement for a general
purpose allocator in this day and age is to provide node-local memory by
default. Notably if you have a workload which was careful to bind itself
to one node, it should not receive memory backed by other nodes unless
there is no other option. AFAIU this is satisfied with the stock
allocator on the grounds of running on a given domain, without having to
explicitly demand memory from it for everything.

I expect the lack of NUMA-awareness to result in increased accumulation
of "mismatched" memory as uptime goes up, violating the above. Some
examples of how I expect that to happen should this get expanded to all
allocations:
- wherever init happens to reap a zombie, task_struct and some more
  stuff may be "misplaced"
- even ignoring init, literally any fork/exec/exit heavy workload which
  runs on more than one node will be rife with mismatched frees as the
  scheduler moves things around and the original parent reaps children
- a process passes a file descriptor to a process on another domain and
  the latter is the last to fput
- a container creates a bunch of dentries and whacks them
etc.

In all of these cases getting unlucky means you are using non-local
memory, which in turn will result in weird anomalies which suddenly
clear themselves up if you restart the program (or which show up out of
nowhere).

Arguably, the fork thing is a problem as is and *probably* could be
reduced by asking the scheduler upfront where it would run the child
domain-wise if it had to do it right now, and making fork allocate
memory from that domain (rough sketch at the end of this mail). But even
with this or some other mitigation in place there would be plenty of
potential to free non-local memory, so the general problem statement
stands.

I admit though I don't have a good solution as to how to handle the
"bad" frees. Someone (I think you?) stated that one of the previous
allocators was just freeing to per-domain lists or arrays and that was
causing trouble -- perhaps this would work if it came with small limits
in place for how big these can get?
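
To be clear, something along these lines is what I have in mind, purely
as a sketch for the sake of discussion -- struct node_spill, the cap and
flush_spill_to_node() are all invented here, not actual sheaf internals:

	/* purely illustrative, assumes the usual mm/slub.c context */
	#define NODE_SPILL_MAX	8	/* arbitrary small cap */

	struct node_spill {
		unsigned int count;
		void *objects[NODE_SPILL_MAX];
	};

	/* hypothetical: hand the batch back to its home node's slabs */
	static void flush_spill_to_node(struct kmem_cache *s,
					struct node_spill *spill, int node);

	/*
	 * On free, detect a remote object and stash it on a small
	 * bounded per-node array instead of the local sheaf; once the
	 * array fills up, flush the whole batch back to its home node.
	 */
	static bool maybe_spill_remote_free(struct kmem_cache *s,
					    struct node_spill *spills,
					    void *obj)
	{
		int node = page_to_nid(virt_to_page(obj));
		struct node_spill *spill;

		if (node == numa_mem_id())
			return false;	/* local, normal sheaf path */

		spill = &spills[node];
		spill->objects[spill->count++] = obj;

		if (spill->count == NODE_SPILL_MAX) {
			flush_spill_to_node(s, spill, node);
			spill->count = 0;
		}
		return true;
	}

The point being the per-node arrays stay tiny and get drained in
batches, so remote memory can't accumulate the way an uncapped
per-domain magazine would.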
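
And for completeness, the fork mitigation I was handwaving about above
would be roughly the below -- sched_hint_child_cpu() is invented, it
would be something the scheduler has to export, more or less "where
would select_task_rq() put a child of current if it woke up right now":

	/* invented helper, would have to come from the scheduler */
	int sched_hint_child_cpu(struct task_struct *parent);

	static struct task_struct *alloc_child_task_struct(void)
	{
		int node = cpu_to_node(sched_hint_child_cpu(current));

		/*
		 * Allocate the child's task_struct from the node it is
		 * expected to end up running on, instead of blindly
		 * from the parent's current node.
		 */
		return kmem_cache_alloc_node(task_struct_cachep,
					     GFP_KERNEL, node);
	}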

