On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > A NUMA system may have many NUMA nodes and many CPUs.
> > For example, a Hygon server has 12 NUMA nodes and 384 CPUs.
> > UnixBench includes a test, "execl", which exercises the execve
> > system call.
> >
> > When we test our server with "./Run -c 384 execl", the result is
> > not good enough. The i_mmap locks are contended heavily on
> > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so"
> > can hold over 6000 VMAs, and the VMAs can belong to different
> > NUMA nodes. The insert/remove operations do not run quickly
> > enough.
> >
> > Patches 1 and 2 hide the direct access to i_mmap.
> > Patch 3 splits the i_mmap tree into sibling trees, and we get
> > better performance with this patch set:
> > a 77% performance improvement (average over 10 runs).
>
> To my reading you kept the lock as-is and only distributed the
> protected state.
>
> While I don't doubt the improvement, I'm confident that should you
> take a look at the profile, you are going to find this still does
> not scale, with the rwsem being one of the problems (there are
> other global locks, some of which have experimental patches).
>
> Apart from that, this does nothing to help high-core-count systems
> which are all one node, which imo puts another question mark on
> this specific proposal.
>
> Of course one may question whether an RB tree is the right choice
> here; it may be that the lock-protected cost can go way down with
> merely a better data structure.
>
> Regardless of that, for actual scalability, there will be no way
> around decentralizing the locking around this and partitioning per
> some core count (not just by NUMA awareness).
>
> Decentralizing the locking is definitely possible, but I have not
> looked into the specifics of how problematic it is. Best case
> scenario, it will merely work with separate locks.
> Worst case scenario, something needs a fully stabilized state for
> traversal; in that case another rw lock can be slapped around this,
> creating the locking order: read lock -> per-subset write lock.
> This will suffer scalability-wise due to the read locking, but it
> will still scale drastically better, as apart from that there will
> be no serialization. In this setting the problematic consumer will
> write lock the new thing to stabilize the state.

For your proposal for the non-NUMA case, I hope you can create a
patch set for it. I can test it on our machine.
Thanks
Huang Shijie
