Re: libc.so's vmobjlock / v_interlock

2020-03-15 Thread Andrew Doran
On Tue, Jan 21, 2020 at 09:46:03PM +, Andrew Doran wrote:

> - PV lists in the pmap need explicit locking (unless the pmap uses a global
>   lock, which is fine if it suits the machine).  I have done the PV list
>   locking for x86 and it's in the patch.

To follow up, the x86 changes are in and I'm checking over the other
pmaps at the moment.  Most are in good shape.  Anyway this detangles the x86
pmap from UVM object locks: it doesn't care about them any more, other than
e.g. needing to hold a lock across pmap_remove() -> pmap_update(), to stop
pages gaining a new identity while still visible somewhere via a stale TLB
entry.
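
A minimal sketch of that constraint, assuming vmobjlock is the rwlock from
the changes further down the thread; the function and variables are
illustrative, and only pmap_remove()/pmap_update() and the rwlock calls are
real interfaces:

    #include <sys/rwlock.h>
    #include <uvm/uvm.h>

    static void
    example_remove_mapping(struct uvm_object *uobj, struct pmap *pmap,
        vaddr_t start, vaddr_t end)
    {

            rw_enter(uobj->vmobjlock, RW_READER);
            pmap_remove(pmap, start, end);  /* tear down the PTEs */
            pmap_update(pmap);              /* finish the TLB shootdown */
            rw_exit(uobj->vmobjlock);

            /*
             * Only now may the pages gain a new identity: no CPU can
             * still reach them through a stale TLB entry.
             */
    }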

It would be nice to track all these mappings in batch, at the uvm_map_entry
level or something like that, because the PV tracking scheme uses loads of
memory and causes big cache contention.

Andrew


Re: libc.so's vmobjlock / v_interlock

2020-01-21 Thread Andrew Doran
On Sun, Jan 19, 2020 at 09:50:24PM +, Andrew Doran wrote:

> Here's a dtrace flamegraph for a kernel build on my system, while running a
> kernel from the ad-namecache branch.
> 
>   http://www.netbsd.org/~ad/2020/jan19.svg
> 
> The biggest single remaining concurrency problem is the mutex on libc.so's
> uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> above the "uvm_fault_internal" box.
> 
> I think making the vmobjlock / v_interlock a rwlock could help with making
> inroads on this problem, because in the libc.so case it's not modifying too
> much under cover of the lock (mostly pmap state).
> 
> That's not a small undertaking, but I reckon it would take about a day of
> donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and
> make a start at choosing rw_write_held() or rw_lock_held() for the
> assertions, followed by a bit of benchmarking to check for a regression and
> a jumbo commit.  From there on things could be changed incrementally.
> 
> Thoughts?

Here's a first set of changes to do just this.  I don't plan on doing more
on this for about a month because I will be travelling soon.

http://www.netbsd.org/~ad/2020/vmobjlock.diff

Andrew


- This splits v_interlock apart from vmobjlock.  v_interlock stays a mutex,
  vmobjlock becomes a RW lock.  The lock order is vmobjlock -> v_interlock.
  There are a few shared items that need to be examined closely and fixed,
  or good excuses made (e.g. VI_EXECMAP), but it's mostly right.  (A minimal
  ordering sketch follows this list.)

- I made anons/amaps have a rwlock too.  There is bound to be an application
  for this and I think it might be PostgreSQL; must check.

- There seems to be no performance regression from doing this.  Not having
  vnode stuff adding contention to the vmobjlock makes the numbers better.
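
Here's a minimal sketch of the ordering from the first item above, using
the plain locking primitives; the function and variable names are
illustrative only, not code from the patch:

    #include <sys/mutex.h>
    #include <sys/rwlock.h>

    /* vmobjlock (now a rwlock) is always taken before v_interlock. */
    static void
    example_lock_order(krwlock_t *vmobjlock, kmutex_t *v_interlock)
    {

            rw_enter(vmobjlock, RW_READER); /* object lock first */
            mutex_enter(v_interlock);       /* then the vnode interlock */

            /* ... examine state covered by both locks ... */

            mutex_exit(v_interlock);
            rw_exit(vmobjlock);
    }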

To get to the point of doing concurrent faults for the simple cases, like
those needed for libc.so, I think the following are also needed:

- PV lists in the pmap need explicit locking (unless the pmap uses a global
  lock, which is fine if it suits the machine).  I have done the PV list
  locking for x86 and it's in the patch.

  I'm suggesting a lock still needs to be held on the UVM object / amap for
  pmap ops.  But that can be a read lock; for x86 I want a write lock held
  only for pmap_page_protect(VM_PROT_NONE) aka pmap_page_remove(), because
  trying to have the x86 pmap manage walking the PV list while removing
  PTEs, while everything is changing around it, and have it all nice and
  concurrent, would be really tough.  I don't think ppp(VM_PROT_NONE) is a
  hotspot anyway.  (A caller-side sketch of this rule follows this list.)

- With the above done, uvm_unmap_remove() and uvm_map_protect() can take
  a reader lock on the object.  I've tried this out and it works fine
  "on my machine", except that in the libc.so case it then starts to
  compete with uvm_fault() for the vmobjlock.

- PG_BUSY could become a reader/writer lock of sorts: I'm thinking something
  like a counter in struct vm_page, and I like the idea of covering it with
  pg->interlock.  Interested to hear any other ideas.  (A sketch of the
  counter idea also follows this list.)

- Then it's a trip into the uvm_fault() -> getpages -> uvn_findpages()
  path looking for what's needed; probably not a whole lot, I expect, if you
  only want to handle faults for stuff that's in-core using a read lock.
  I don't know the uvm_fault() code too well.
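
As promised in the PV list item above, a hedged sketch of the caller-side
rule: a reader hold on the object for most pmap ops, a writer hold only for
pmap_page_protect(VM_PROT_NONE).  The wrapper and its assertions are
invented for illustration, assuming uobj->vmobjlock is the rwlock described
above; only pmap_page_protect() and the rwlock primitives are existing
interfaces:

    #include <sys/rwlock.h>
    #include <sys/systm.h>
    #include <uvm/uvm.h>

    static void
    example_page_protect(struct uvm_object *uobj, struct vm_page *pg,
        vm_prot_t prot)
    {

            if (prot == VM_PROT_NONE) {
                    /* Removing every mapping: exclude concurrent faults. */
                    KASSERT(rw_write_held(uobj->vmobjlock));
            } else {
                    /* Downgrading protection: a read hold is enough. */
                    KASSERT(rw_lock_held(uobj->vmobjlock));
            }
            pmap_page_protect(pg, prot);
    }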
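
And a sketch of the PG_BUSY-as-counter idea: a busy count in struct vm_page
covered by pg->interlock.  The pg_busycnt field and both helpers are
invented for illustration and do not exist; pg->flags, PG_BUSY and
pg->interlock are the current names:

    #include <sys/mutex.h>
    #include <sys/systm.h>
    #include <uvm/uvm.h>

    /*
     * Take a shared (reader-style) busy hold if there is no exclusive
     * holder.  Returns true on success.
     */
    static bool
    example_page_busy_shared(struct vm_page *pg)
    {
            bool ok;

            mutex_enter(&pg->interlock);
            ok = (pg->flags & PG_BUSY) == 0;
            if (ok)
                    pg->pg_busycnt++;       /* hypothetical counter */
            mutex_exit(&pg->interlock);
            return ok;
    }

    /* Drop a shared busy hold taken above. */
    static void
    example_page_unbusy_shared(struct vm_page *pg)
    {

            mutex_enter(&pg->interlock);
            KASSERT(pg->pg_busycnt > 0);    /* hypothetical counter */
            pg->pg_busycnt--;
            mutex_exit(&pg->interlock);
    }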


Re: libc.so's vmobjlock / v_interlock

2020-01-20 Thread Andrew Doran
On Sun, Jan 19, 2020 at 10:08:12PM +, Taylor R Campbell wrote:

> > Date: Sun, 19 Jan 2020 21:50:24 +
> > From: Andrew Doran 
> > 
> > The biggest single remaining concurrency problem is the mutex on libc.so's
> > uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> > above the "uvm_fault_internal" box.
> > 
> > I think making the vmobjlock / v_interlock a rwlock could help with making
> > inroads on this problem, because in the libc.so case it's not modifying too
> > much under cover of the lock (mostly pmap state).
> > 
> > [...]
> > 
> > Thoughts?
> 
> I think we should try to go straight to pserialized page lookup and
> skip rwlocks.  Should be relatively easy to teach radixtree.c to work
> pserialized.
> 
> We probably won't want to use pserialize_perform() but can perhaps
> instead use an epoch-based system with page daemon queues of dying
> pages that may still be referenced under pserialize_read.

This is an intriguing idea and I suspect that you've already thought about
it quite a bit, but I don't think we're there just yet because there's more
to be dealt with first, namely that vmobjlock/v_interlock covers a lot more
than the index of pages in the object.

For example, right now it's covering the usecounts, the flags in the vnode
and the page, the PV entries in the pmap, interlocking of the long term page
locks (PG_BUSY), correct serialization of page visibility and TLB shootdown,
and a whole lot more.

I'm not fond of RW locks as a synchronisation primitive; there are usually
better ways to do things.  But I like the RW lock idea in this instance for
two reasons.

Firstly, RW locks are pretty easy to understand and use, so we can hopefully
get a nice win for the system without it being too taxing in terms of the
amount of work needed or the potential for introducing bugs.  Secondly, I
think it would enable us to incrementally attack some of the areas where we
have what's really per-page state being covered by a per-object lock, which
I think ties in with what you're suggesting.

Andrew


Re: libc.so's vmobjlock / v_interlock

2020-01-19 Thread Taylor R Campbell
> Date: Sun, 19 Jan 2020 21:50:24 +
> From: Andrew Doran 
> 
> The biggest single remaining concurrency problem is the mutex on libc.so's
> uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> above the "uvm_fault_internal" box.
> 
> I think making the vmobjlock / v_interlock a rwlock could help with making
> inroads on this problem, because in the libc.so case it's not modifying too
> much under cover of the lock (mostly pmap state).
> 
> [...]
> 
> Thoughts?

I think we should try to go straight to pserialized page lookup and
skip rwlocks.  Should be relatively easy to teach radixtree.c to work
pserialized.

We probably won't want to use pserialize_perform() but can perhaps
instead use an epoch-based system with page daemon queues of dying
pages that may still be referenced under pserialize_read.
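
A hedged sketch of what such a lookup might look like, assuming radixtree.c
were taught to tolerate lock-free readers; uo_pages and
radix_tree_lookup_node() are the existing names, while the pserialize usage
and the deferred-free rule in the comment are the proposal, not code that
exists today:

    #include <sys/pserialize.h>
    #include <sys/radixtree.h>
    #include <uvm/uvm.h>

    static struct vm_page *
    example_pserialized_lookup(struct uvm_object *uobj, voff_t off)
    {
            struct vm_page *pg;
            int s;

            s = pserialize_read_enter();
            pg = radix_tree_lookup_node(&uobj->uo_pages, off >> PAGE_SHIFT);

            /*
             * Before the read section ends the caller would need to take
             * some kind of hold on the page, and dying pages would have
             * to stay valid across a grace period (e.g. parked on a page
             * daemon queue) rather than being freed immediately.
             */
            pserialize_read_exit(s);
            return pg;
    }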


libc.so's vmobjlock / v_interlock

2020-01-19 Thread Andrew Doran
Hi,

Here's a dtrace flamegraph for a kernel build on my system, while running a
kernel from the ad-namecache branch.

http://www.netbsd.org/~ad/2020/jan19.svg

The biggest single remaining concurrency problem is the mutex on libc.so's
uvm_object / vnode.  It pops up in a number of places, but most clearly seen
above the "uvm_fault_internal" box.

I think making the vmobjlock / v_interlock a rwlock could help with making
inroads on this problem, because in the libc.so case it's not modifying too
much under cover of the lock (mostly pmap state).

That's not a small undertaking, but I reckon it would take about a day of
donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and
make a start at choosing rw_write_held() or rw_lock_held() for the
assertions, followed by a bit of benchmarking to check for a regression and
a jumbo commit.  From there on things could be changed incrementally.
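
To make the mechanical part concrete, a hedged before/after fragment, with
the lock passed in directly since the field changes type; only the
primitives (mutex_enter(9), rw_enter(9), rw_write_held(), rw_lock_held())
are the real interfaces being swapped, the functions themselves are
illustrative:

    #include <sys/mutex.h>
    #include <sys/rwlock.h>
    #include <sys/systm.h>

    /* Before: vmobjlock is a mutex, always held exclusively. */
    static void
    example_before(kmutex_t *vmobjlock)
    {

            mutex_enter(vmobjlock);
            KASSERT(mutex_owned(vmobjlock));
            /* ... work on object/page state ... */
            mutex_exit(vmobjlock);
    }

    /* After: every mutex_enter() becomes rw_enter(.., RW_WRITER). */
    static void
    example_after(krwlock_t *vmobjlock)
    {

            rw_enter(vmobjlock, RW_WRITER);
            /*
             * Assertions are then chosen case by case: rw_write_held()
             * where the code really modifies state, rw_lock_held() where
             * a reader could later be allowed in.
             */
            KASSERT(rw_write_held(vmobjlock));
            /* ... work on object/page state ... */
            rw_exit(vmobjlock);
    }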

Thoughts?

Andrew