On Wed, Jun 10, 2026, at 7:40 PM, Mateusz Guzik wrote:
> [...]
>
> As I tried to explain in my previous e-mail this approach does not cut
> it because of NUMA.
>
> Suppose you have a machine with 2 nodes. The parent-to-be is running
> on node 0 and the child is intended to exec something on node 1.
>
> When the parent-to-be allocates and populates stuff, it takes place with
> memory backed by node 0. If you allocate task_struct, the file table and
> other frequently used (and modified!) objs in this way, you are
> guaranteeing performance loss due to interconnect traffic to access it.
>
> Trying to add plumbing so that all allocations respect numa placement is
> probably too cumbersome.

Are we sure that last part is true? Let's also assume when this stuff
was initially implemented, we didn't have it. If the basic thrust of
this work is to replace functions that previously only worked on the
current thread with those that worked on either arbitrary (not yet
started) threads or the current thread, would that not prepare us for
slowly migrating the allocation choice to reflect the node of the
target task (new parameter) rather than the node of the current task
over time? (This assumes the task is pre-placed on a node before it is
actually run there, and that pre-placement happens as early in the
allocation process as possible, so subsequent allocations can read off
the partially-initialized task's node.)

"Slowly migrating" is good here! It doesn't need to be the fastest
thing out of the gate, but if this new proper spawning API gets popular
as I think it would, and there is a clear path to optimizing it per the
above, then I am confident that over the years it will happen.

> The primary example for that is looking up the binary to exec in the
> first place.
>
> userspace likes to pass paths which don't exist, meaning checking for
> the binary before any hard work is a useful optimization. Suppose the
> binary to be executed is in a container bound with a taskset using
> node 1 and the content of the fs part of the container is currently
> fully uncached.
>
> When you perform the lookup on node 0, you are populating a bunch of
> metadata (inode, dentry) using memory from that domain. But the intended
> user will only execute on node 1, again resulting in a performance loss.
>
> In order to not do it you would need to convince VFS to allocate memory
> elsewhere.

One thing I don't get about this is that isn't the cost doing a bunch
of work searching the PATH for the directories where the executable
*doesn't* exist?
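
To make the PATH-crawl cost concrete, here is a rough userspace sketch
(not kernel code) of the kind of lookup a shell-like parent does. The
helper name find_in_path and its signature are made up for
illustration; the only real calls are snprintf(), strcspn() and
access(). It counts how many directories are probed and missed before
the executable is found:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a libc or kernel API): walk a colon-separated
 * directory list looking for an executable called `name`.
 *
 * Returns 0 and leaves the full path in `out` on success, -1 if no
 * directory contains it. `*misses` is set to the number of directories
 * that were probed without success.
 */
int find_in_path(const char *path_list, const char *name,
                 char *out, size_t outlen, int *misses)
{
    const char *p = path_list;

    assert(path_list && name && out && misses);
    *misses = 0;

    while (*p != '\0') {
        size_t len = strcspn(p, ":");   /* length of this directory entry */
        int n;

        if (len == 0)   /* empty entry: treat as the current directory */
            n = snprintf(out, outlen, "./%s", name);
        else
            n = snprintf(out, outlen, "%.*s/%s", (int)len, p, name);

        /* Each probe is a path lookup that has to happen somewhere. */
        if (n > 0 && (size_t)n < outlen && access(out, X_OK) == 0)
            return 0;

        (*misses)++;
        p += len;
        if (*p == ':')
            p++;        /* skip the separator */
    }
    return -1;
}
```

With a list such as "/nonexistent-a:/nonexistent-b:/bin" and the name
"sh", two of the three probes are misses. Those misses are the part I
would expect to be shared across many spawns from the same parent.
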
In the case of something like a shell that is going to spawn a lot of
processes, I would think it is *good* to keep all that PATH crawling
VFS filling to be on the shell's node, rather than the child processes'
nodes. It is only the executable itself, the final step of the VFS
crawl, that should be loaded into the other NUMA nodes.

Insofar as (unless I am missing something) creating the process means
finding the inode for the executable but not loading those pages,
aren't we OK here? Only when the new process is actually scheduled and
run must the ELF be paged into memory, and then that will happen on the
correct node.

> So I stand by my previous claim that ultimately a pristine child has to
> be created (like in this patch), but which also has to do the work on
> its own.

I have not been a kernel dev, so my apologies if I am missing things.
But in conclusion for me, the FS and other resource access patterns of
*creating a process* vs *that process itself running* do not seem
necessarily coincident to me. What you are describing as for sure a
problem might possibly be a *good thing*, if they are in fact quite
different.

> Suppose there is no explicit placement requested anywhere. Even in that
> case there are legitimate workloads which will eventually be forced to
> exec stuff on another node. Even these have a better chance retaining
> full locality if the child process does all the work.
>
> Per my previous message I don't see a clean interface to do it.
> something quasi-posix_spawn is probably the least bad way out, it will
> also allow userspace to easily wrap the new thing with posix_spawn
> itself.
>
> Also note there is another issue with the fd-based approach: the fd will
> get inherited on fork and will hang out in the child afterwards unless
> explicitly closed. Suppose you have a multithreaded program which likes
> to both fork(+no exec) and fork+exec. With the fd-based approach you
> have no means of stopping another thread from grabbing your state thanks
> to unix defaulting to copying everything. There was an attempt to fix
> this aspect with O_CLOFORK, but this got rejected.

I would think we don't need to worry about clone/fork very much, right?
I think the premise of your emails, and just about everyone else's in
this thread too, is that we agree fork+exec is bad, and the problem of
unnecessarily sharing resources is inherent to fork. Furthermore, I
think we all agree that while `O_CLOEXEC` and `O_CLOFORK` may help,
both are unsatisfying solutions because they are opt-out not opt-in,
and global to the parent process / preexec state (respectively) rather
than local to the specific fork / exec in question.

pidfds encounter these problems no more than any other
file-descriptor-based UAPI, right? And I don't think it is good to
blame any such file-descriptor-based UAPI when fork/exec are at fault.

Maybe during the transition, when some things use fork and some things
use this new API, stuff will be awkward, but I would rather that just
be an incentive to complete the transition away from fork, not a reason
to second-guess the plan. Once the transition is complete, and everyone
is diligently assembling their child processes from scratch as is
proposed, `O_CLOEXEC` and `O_CLOFORK` are both unneeded, and
oversharing privileges will be much less common simply because "lazy
coding"/"minimal typing" will only share what is needed --- anything
else is more code/keystrokes!

> Whatever exactly happens, NUMA is a sad fact of computing and needs to
> be accounted for. The approach as proposed not only does not do it, but
> it actively hinders such deployments.

Despite everything I said, I want to be clear that I do agree that NUMA
performance should be accounted for. Even if the first version isn't as
great as it could be on that metric, there should be a clear plan for
how future work can conclusively address it.

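As a toy illustration of the "use the target task's node once it has
been pre-placed, otherwise the current task's node" rule I described
near the top, here is a small userspace C model. Everything in it
(struct toy_task, NODE_UNSET, alloc_node_for) is hypothetical and is
not kernel API; in the kernel the chosen node would presumably be
handed to a node-aware allocator such as kmalloc_node(), but that part
is an assumption on my side and is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_UNSET (-1)   /* hypothetical marker: task not placed yet */

/* Toy stand-in for a task; deliberately NOT the kernel's task_struct. */
struct toy_task {
    int node;   /* NUMA node the task is placed on, or NODE_UNSET */
};

/*
 * Hypothetical policy: pick the node to allocate from when `creator`
 * is setting up state on behalf of `target`.
 *
 * If the target has already been pre-placed, use its node; otherwise
 * (no target, or target not placed yet) fall back to the creator's node,
 * which matches today's "allocate wherever the caller runs" behaviour.
 */
int alloc_node_for(const struct toy_task *creator,
                   const struct toy_task *target)
{
    assert(creator != NULL && creator->node != NODE_UNSET);

    if (target != NULL && target->node != NODE_UNSET)
        return target->node;
    return creator->node;
}
```

The point of the fallback branch is that call sites could be converted
one at a time: until a given site passes a pre-placed target, it keeps
behaving exactly as it does now.
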
Cheers,
John

