BTW you're missing _a lot_ of CC's here, including the whole of mm/rmap.c maintainership.
On Mon, Apr 20, 2026 at 10:10:19AM +0800, Huang Shijie wrote:
> On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > > In NUMA, there may be many NUMA nodes and many CPUs.
> > > For example, a Hygon server has 12 NUMA nodes and 384 CPUs.
> > > In the UnixBench tests, there is a test "execl" which exercises
> > > the execve system call.
> > >
> > > When we test our server with "./Run -c 384 execl",
> > > the test result is not good enough. The i_mmap locks are contended
> > > heavily on "libc.so" and "ld.so". For example, the i_mmap tree for
> > > "libc.so" can have over 6000 VMAs, and the VMAs can be on different
> > > NUMA nodes. The insert/remove operations do not run quickly enough.
> > >
> > > Patch 1 and patch 2 try to hide the direct access to i_mmap.
> > > Patch 3 splits the i_mmap into sibling trees, and we can get better
> > > performance with this patch set:
> > > we can get a 77% performance improvement (average of 10 runs).
> > >
> >
> > To my reading you kept the lock as-is and only distributed the
> > protected state.
> >
> > While I don't doubt the improvement, I'm confident that, should you
> > take a look at the profile, you are going to find this still does not
> > scale, with the rwsem being one of the problems (there are other
> > global locks, for some of which there are experimental patches).
> >
> > Apart from that, this does nothing to help high-core-count systems
> > which are all one node, which imo puts another question mark on this
> > specific proposal.
> >
> > Of course one may question whether an RB tree is the right choice
> > here; it may be that the lock-protected cost can go way down with
> > merely a better data structure.
> >
> > Regardless of that, for actual scalability, there will be no way
> > around decentralizing locking around this and partitioning per some
> > core count (not just by NUMA awareness).
> >
> > Decentralizing locking is definitely possible, but I have not looked
> > into the specifics of how problematic it is. Best case scenario it
> > will merely work with separate locks. Worst case scenario something
> > needs a fully stabilized state for traversal; in that case another rw
> > lock can be slapped around this, creating the locking order read lock
> > -> per-subset write lock. This will suffer scalability-wise due to
> > the read locking, but it will still scale drastically better, as
> > apart from that there will be no serialization. In this setting the
> > problematic consumer will write lock the new thing to stabilize the
> > state.
>
> I thought it over again.
> I can change this patch set to support the non-NUMA case by:
> 1.) Still use one rw lock.

No. This doesn't help anything.

> 2.) For NUMA, keep the patch set as it is.

Please no. No NUMA vs non-NUMA case.

> 3.) For the non-NUMA case, split the i_mmap tree into several subtrees.
> For example, if a machine has 192 CPUs, use one subtree per 32 CPUs.

If lock contention is the problem, I don't see how splitting the tree
helps, unless it helps reduce lock hold time in a way that randomly
helps your workload. But that's entirely random.

> So extend the patch set to support both NUMA and non-NUMA machines.

FYI I've discussed some concrete ideas for reworking file rmap with
Mateusz. I'll be giving them a shot.

Note that this needs to be done _carefully_, particularly as there are
some hidden assumptions wrt forking that aren't very clear as to how
they work [1].

[1] https://lore.kernel.org/all/bnukmnuxxuhdfeasjz33miemgr7w35c4aa6pqdmgupx7oxmeeb@gozgc3yxhcdd/

--
Pedro
