BTW you're missing _a lot_ of CC's here, including the whole of mm/rmap.c
maintainership.

On Mon, Apr 20, 2026 at 10:10:19AM +0800, Huang Shijie wrote:
> On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > >   On NUMA systems there may be many NUMA nodes and many CPUs.
> > > For example, a Hygon server has 12 NUMA nodes and 384 CPUs.
> > > UnixBench includes an "execl" test which exercises the execve
> > > system call.
> > > 
> > >   When we test our server with "./Run -c 384 execl",
> > > the result is not good enough: the i_mmap locks are contended
> > > heavily on "libc.so" and "ld.so". For example, the i_mmap tree for
> > > "libc.so" can have over 6000 VMAs, and those VMAs can be on
> > > different NUMA nodes. The insert/remove operations do not run
> > > quickly enough.
> > > 
> > > Patches 1 and 2 try to hide direct access to i_mmap.
> > > Patch 3 splits the i_mmap tree into sibling trees, and we get
> > > better performance with this patch set:
> > >     a 77% performance improvement (average of 10 runs)
> > > 
> > 
> > To my reading you kept the lock as-is and only distributed the protected
> > state.
> > 
> > While I don't doubt the improvement, I'm confident that should you
> > take a look at the profile, you are going to find this still does not
> > scale, with the rwsem being one of the problems (there are other
> > global locks, some of which have experimental patches).
> > 
> > Apart from that this does nothing to help high core systems which are
> > all one node, which imo puts another question mark on this specific
> > proposal.
> > 
> > Of course one may question whether an RB tree is the right choice
> > here; it may be that the lock-protected cost can go way down with
> > merely a better data structure.
> > 
> > Regardless of that, for actual scalability there will be no way
> > around decentralizing the locking here and partitioning by some core
> > count (not just by NUMA awareness).
> > 
> > Decentralizing the locking is definitely possible, but I have not
> > looked into the specifics of how problematic it is. In the best case
> > it will work with merely separate locks. In the worst case something
> > needs a fully stabilized state for traversal; in that case another rw
> > lock can be slapped around this, creating the locking order read lock
> > -> per-subset write lock. This will suffer some scalability loss due
> > to the read locking, but it will still scale drastically better,
> > since apart from that there will be no serialization. In this setting
> > the problematic consumer write-locks the new lock to stabilize the
> > state.
> > 
> I thought it over again.
> I can change this patch set to also support the non-NUMA case by:
>   1.) Still use one rw lock.

No. This doesn't help anything.

>   2.) For NUMA, keep the patch set as it is.

Please no. No NUMA vs non-NUMA case.

>   3.) For the non-NUMA case, split the i_mmap tree into several
>       subtrees. For example, on a machine with 192 CPUs, give each
>       group of 32 CPUs its own tree.

If lock contention is the problem, I don't see how splitting the tree
helps, unless it reduces lock hold time in a way that happens to help
your workload. But that's entirely incidental.

> 
> So the patch set would be extended to support both NUMA and non-NUMA
> machines.

FYI I've discussed some concrete ideas for reworking file rmap with Mateusz.
I'll be giving them a shot. Note that this needs to be done _carefully_,
particularly as there are some hidden assumptions wrt forking that aren't
very clear as to how they work[1].

[1] https://lore.kernel.org/all/bnukmnuxxuhdfeasjz33miemgr7w35c4aa6pqdmgupx7oxmeeb@gozgc3yxhcdd/
-- 
Pedro