On Sun, Sep 10, 2017 at 06:51:31PM +0100, Mindaugas Rasiukevicius wrote:
> Mateusz Guzik <mjgu...@gmail.com> wrote:
> > 1. exclusive vnode locking (genfs_lock)
> >
> > ...
> >
> > 2. uvm_fault_internal
> >
> > ...
> >
> > 4. vm locks in general
> >
>
> We know these points of lock contention, but they are not really that
> trivial to fix.  Breaking down the UVM pagequeue locks would generally
> be a major project, as it would be the first step towards NUMA support.
> In any case, patches are welcome. :)
>
Breaking down locks is of course the preferred long-term solution, but it
is also time-consuming. On the other hand, there are most likely reasonably
easy fixes, such as collapsing several lock/unlock cycles into a single
lock/unlock. FreeBSD is no saint here either, with one global lock for free
pages, yet it manages to work OK-ish with 80 hardware threads and is quite
nice with 40. That said, I had enough problems $elsewhere not to be
interested in looking too hard here. :>

> > 3. pmap
> >
> > It seems most issues stem from slow pmap handling. Chances are there are
> > perfectly avoidable shootdowns and in fact cases where there is no need
> > to alter KVA in the first place.
>
> At least x86 pmap already performs batching and has quite efficient
> synchronisation logic.  You are right that there are some key places
> where avoiding KVA map/unmap would have a major performance improvement,
> e.g. UBC and mbuf zero-copy mechanisms (it could operate on physical
> pages for I/O).  However, these changes are not really related to pmap.
> Some subsystems just need an alternative to temporary KVA mappings.
>

I was predominantly looking at the teardown of UBC mappings. The flame
graph suggests an overly high cost there.

> > I would like to add a remark about locking primitives.
> >
> > Today the rage is with MCS locks, which are fine but not trivial to
> > integrate with sleepable locks like your mutexes.  Even so, the current
> > implementation is significantly slower than it has to be.
> >
> > ...
> >
> > Spinning mutexes should probably be handled by a different routine.
> >
> > ...
>
> I disagree, because this is a wrong approach to the problem.  Instead of
> marginally optimising the slow-path (and the more contended the lock is,
> the less impact these micro-optimisations have), the subsystems should be
> refactored to eliminate the lock contention in the first place.  Yes, it
> is much more work, but it is the long term fix.
> Having said that, I can
> see some use cases where MCS locks could be useful, but it is really a low
> priority in the big picture.
>

Locks are fundamentally about damage control. As noted earlier, spurious
bus transactions due to an avoidable read make performance unnecessarily a
tad worse. That was minor anyway; the more important bit was the backoff.
Even on systems modest by today's standards, the quality of locking
primitives can be the difference between a system which is slower than
ideal but perfectly usable and one which is just dog slow. That said,
making backoff parameters autoscale with the number of CPUs, with some
kind of upper cap, is definitely warranted.

-- 
Mateusz Guzik
Swearing Maintenance Engineer