Re: libc.so's vmobjlock / v_interlock
On Tue, Jan 21, 2020 at 09:46:03PM +, Andrew Doran wrote:

> - PV lists in the pmap need explicit locking (unless the pmap uses a global
>   lock, which is fine if it suits the machine).  I have done the PV list
>   locking for x86 and it's in the patch.

To follow up, the x86 changes are in and I'm checking over the other pmaps at the moment.  Most are in good shape.

Anyway this detangles the x86 pmap from UVM object locks.  It doesn't care any more (other than e.g. needing to hold a lock across pmap_remove() -> pmap_update(), to stop pages gaining a new identity while still visible somewhere via a stale TLB entry).

It would be nice to track all these mappings in batch, at the uvm_map_entry level or something like that, cos the PV tracking scheme uses loads of memory and causes big cache contention.

Andrew
Re: libc.so's vmobjlock / v_interlock
On Sun, Jan 19, 2020 at 09:50:24PM +, Andrew Doran wrote:

> Here's a dtrace flamegraph for a kernel build on my system, while running a
> kernel from the ad-namecache branch.
>
> http://www.netbsd.org/~ad/2020/jan19.svg
>
> The biggest single remaining concurrency problem is the mutex on libc.so's
> uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> above the "uvm_fault_internal" box.
>
> I think making the vmobjlock / v_interlock a rwlock could help with making
> inroads on this problem, because in the libc.so case it's not modifying too
> much under cover of the lock (mostly pmap state).
>
> That's not a small undertaking, but I reckon it would take about a day of
> donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and
> make a start at choosing rw_write_held() or rw_lock_held() for the
> assertions, followed by a bit of benchmarking to check for a regression and
> a jumbo commit.  From there on things could be changed incrementally.
>
> Thoughts?

Here's a first set of changes to do just this.  I don't plan on doing more on this for about a month because I will be travelling soon.

http://www.netbsd.org/~ad/2020/vmobjlock.diff

- This splits v_interlock apart from vmobjlock.  v_interlock stays a mutex, vmobjlock becomes a RW lock.  The lock order is vmobjlock -> v_interlock.  There are a few shared items that need to be examined closely and fixed, or good excuses made, e.g. VI_EXECMAP, but it's mostly right.

- I made anons/amaps have a rwlock too.  There is bound to be an application for this and I think it might be PostgreSQL; must check.

- There seems to be no performance regression from doing this.  Not having vnode stuff adding contention to the vmobjlock makes the numbers better.

Andrew
To get to the point of doing concurrent faults for the simple cases, like are needed for libc.so, I think the following are also needed:

- PV lists in the pmap need explicit locking (unless the pmap uses a global lock, which is fine if it suits the machine).  I have done the PV list locking for x86 and it's in the patch.  I'm suggesting a lock still needs to be held on the UVM object / amap for pmap ops, but that can be a read lock; for x86 I want a write lock held only for pmap_page_protect(VM_PROT_NONE) aka pmap_page_remove(), because trying to have the x86 pmap manage walking the PV list while removing PTEs, while everything is changing around it, and have it all nice and concurrent, would be really tough.  I don't think ppp(VM_PROT_NONE) is a hotspot anyway.

- With the above done, uvm_unmap_remove() and uvm_map_protect() can take a reader lock on the object.  I've tried this out and it works fine "on my machine", except that in the libc.so case it then starts to compete with uvm_fault() for the vmobjlock.

- PG_BUSY could become a reader/writer lock of sorts.  I'm thinking something like a counter in struct vm_page, and I like the idea of covering it with pg->interlock.  Interested to hear any other ideas.

- Then it's a trip into the uvm_fault() -> getpages -> uvn_findpages() path looking for what's needed; I expect probably not a whole lot if you only want to handle faults for stuff that's in-core using a read lock.  I don't know the uvm_fault() code too well.
Re: libc.so's vmobjlock / v_interlock
On Sun, Jan 19, 2020 at 10:08:12PM +, Taylor R Campbell wrote:

> > Date: Sun, 19 Jan 2020 21:50:24 +
> > From: Andrew Doran
> >
> > The biggest single remaining concurrency problem is the mutex on libc.so's
> > uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> > above the "uvm_fault_internal" box.
> >
> > I think making the vmobjlock / v_interlock a rwlock could help with making
> > inroads on this problem, because in the libc.so case it's not modifying too
> > much under cover of the lock (mostly pmap state).
> >
> > [...]
> >
> > Thoughts?
>
> I think we should try to go straight to pserialized page lookup and
> skip rwlocks.  Should be relatively easy to teach radixtree.c to work
> pserialized.
>
> We probably won't want to use pserialize_perform() but can perhaps
> instead use an epoch-based system with page daemon queues of dying
> pages that may still be referenced under pserialize_read.

This is an intriguing idea and I suspect that you've already thought about it quite a bit, but I don't think we're there just yet, because there's more to be dealt with first: vmobjlock/v_interlock covers a lot more than the index of pages in the object.  For example, right now it covers the usecounts, the flags in the vnode and the page, the PV entries in the pmap, interlocking the long term page locks (PG_BUSY), correct serialization of page visibility and TLB shootdown, and a whole lot more.

I'm not fond of RW locks as a synchronisation primitive; there are usually better ways to do things.  But I like the RW lock idea in this instance for two reasons.  Firstly, RW locks are pretty easy to understand and use, so we can hopefully get a nice win for the system without it being too taxing in terms of the amount of work needed or the potential for introducing bugs.
Secondly, I think it would enable us to incrementally attack some of the areas where we have what's really per-page state being covered by a per-object lock, which I think ties in with what you're suggesting.

Andrew
Re: libc.so's vmobjlock / v_interlock
> Date: Sun, 19 Jan 2020 21:50:24 +
> From: Andrew Doran
>
> The biggest single remaining concurrency problem is the mutex on libc.so's
> uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> above the "uvm_fault_internal" box.
>
> I think making the vmobjlock / v_interlock a rwlock could help with making
> inroads on this problem, because in the libc.so case it's not modifying too
> much under cover of the lock (mostly pmap state).
>
> [...]
>
> Thoughts?

I think we should try to go straight to pserialized page lookup and skip rwlocks.  Should be relatively easy to teach radixtree.c to work pserialized.

We probably won't want to use pserialize_perform() but can perhaps instead use an epoch-based system with page daemon queues of dying pages that may still be referenced under pserialize_read.
libc.so's vmobjlock / v_interlock
Hi,

Here's a dtrace flamegraph for a kernel build on my system, while running a kernel from the ad-namecache branch.

http://www.netbsd.org/~ad/2020/jan19.svg

The biggest single remaining concurrency problem is the mutex on libc.so's uvm_object / vnode.  It pops up in a number of places, but most clearly seen above the "uvm_fault_internal" box.

I think making the vmobjlock / v_interlock a rwlock could help with making inroads on this problem, because in the libc.so case it's not modifying too much under cover of the lock (mostly pmap state).

That's not a small undertaking, but I reckon it would take about a day of donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and make a start at choosing rw_write_held() or rw_lock_held() for the assertions, followed by a bit of benchmarking to check for a regression and a jumbo commit.  From there on things could be changed incrementally.

Thoughts?

Andrew