Hello,

I have been playing a bit with a NetBSD VM running on CentOS 7 + KVM. I ran into severe performance issues which I partially investigated. A bunch of total hacks was written to confirm a few problems, but there is nothing committable without doing actual work, and major problems remain.
I think the kernel is in dire need of someone to sit on the issues reported below and see them through. I'm happy to test patches, although I won't necessarily have access to the same hardware used for the current tests.

Hardware specs:
Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
2 sockets * 10 cores * 2 hardware threads
32GB of RAM

I assigned all 40 threads to the VM and gave it 16GB of RAM. The host is otherwise idle.

I installed the 7.1 release, downloaded a recent git snapshot and built the trunk kernel using a config stolen from the release (I had to edit out something about 3G modems to make it compile). I presume this is enough to not have debugging of any sort enabled. The filesystem is just ufs mounted with noatime.

Attempts to use virtio for storage resulted in abysmal performance which I did not investigate. Using SATA gave read errors and the VM failed to boot multiuser. I settled for IDE, which works reasonably well but inherently makes the test worse. All tests were performed with the trunk kernel booted.

Here is a bunch of runs of "./build.sh -j 40 kernel=MYCONF > /dev/null" on the stock kernel:

618.65s user 1097.80s system 2502% cpu 1:08.60 total
628.73s user 1128.71s system 2540% cpu 1:09.18 total
629.05s user 1082.58s system 2517% cpu 1:07.99 total
641.11s user 1081.05s system 2545% cpu 1:07.65 total
641.18s user 1079.89s system 2522% cpu 1:08.24 total

And on the kernel with total hacks:

594.08s user 693.11s system 2459% cpu 52.331 total
594.81s user 711.90s system 2498% cpu 52.292 total
600.34s user 676.39s system 2486% cpu 51.336 total
597.33s user 725.78s system 2536% cpu 52.157 total
597.13s user 708.79s system 2510% cpu 52.011 total

i.e. it's still pretty bad, with system time above user time. However, real time dropped from ~68s to ~52s and system time from ~1100s to ~700s.

Hacks can be seen here (wear gloves and something to protect your eyes):
https://people.freebsd.org/~mjg/netbsd/hacks.diff

1. #define UBC_NWINS 1024

The parameter was set in 2001 and is used on amd64 to this very day. lockstat says:

51.63   585505 321201.06 ffffe4011d8304c0       <all>
40.39   291550 251302.17 ffffe4011d8304c0       ubc_alloc+69
 9.13   255967  56776.26 ffffe4011d8304c0       ubc_release+a5
 1.72    35632  10680.06 ffffe4011d8304c0       uvm_fault_internal+532
[snip]

The contention is on the global ubc vmobj lock just prior to the hash lookup. I recompiled the kernel with a randomly slapped value of 65536 and the problem cleared itself, with ubc_alloc going way down. I made no attempt to check what value actually makes sense or how to autoscale it. This change alone accounts for most of the speedup, giving:

586.87s user 919.99s system 2612% cpu 57.676 total

2. uvm_pageidlezero

Idle zeroing these days definitely makes no sense on amd64. Whatever pages it manages to prepare are quickly consumed, and the vast majority of allocations end up zeroing in place anyway. With rep stosb this is even less of a problem. Here it turned out to be actively harmful by inducing avoidable cacheline traffic.

Look at nm kernel | sort -nk 1:
----------------
ffffffff810b8fc0 B uvm_swap_data_lock
ffffffff810b8fc8 B uvm_kentry_lock
ffffffff810b8fd0 B uvm_fpageqlock
ffffffff810b8fd8 B uvm_pageqlock
ffffffff810b8fe0 B uvm_kernel_object
----------------

All these locks false-share a cacheline. In particular uvm_fpageqlock is obstructing uvm_pageqlock. An attempt to run zeroing performs mutex_tryenter, which unconditionally does lock cmpxchg. That dirties the cacheline, so even if zeroing ends up not being performed the damage is already done.
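To illustrate the point with a generic sketch (C11 atomics, not the actual NetBSD mutex code; all names below are made up), a try-lock that peeks at the lock word with a plain load can fail without ever requesting the line in exclusive state, while an unconditional cmpxchg dirties it on every attempt:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_lock {
        atomic_uintptr_t owner;         /* 0 means unowned */
};

/* Always issues lock cmpxchg, pulling the line exclusive even on failure. */
static bool
toy_tryenter_eager(struct toy_lock *l, uintptr_t self)
{
        uintptr_t expected = 0;

        return (atomic_compare_exchange_strong(&l->owner, &expected, self));
}

/* Peek first; the line can stay in shared state if the lock is held. */
static bool
toy_tryenter_polite(struct toy_lock *l, uintptr_t self)
{
        uintptr_t expected = 0;

        if (atomic_load_explicit(&l->owner, memory_order_relaxed) != 0)
                return (false);         /* owned: leave the cacheline alone */
        return (atomic_compare_exchange_strong(&l->owner, &expected, self));
}

A tryenter of the second flavor would presumably at least keep the idle-zero probing from bouncing the line that uvm_pageqlock happens to share.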
Chances are successful zeroing is also a problem, but I did not investigate that. #if 0'ing the uvm_pageidlezero call in the idle function shaved about 2 seconds of real time:

589.02s user 792.62s system 2541% cpu 54.365 total

This should definitely be disabled for amd64 altogether and probably removed in general.

3. false sharing

Following the issue noted above, I marked the aforementioned locks __cacheline_aligned, and also moved atomically updated counters out of uvmexp. uvmexp is full of counters updated with plain increments, possibly by multiple threads, so the problem with that object is not fully resolved. Nonetheless, these annotations combined with the rest give the improvement mentioned earlier.

======================================

Here is a flamegraph from a fully patched kernel:
https://people.freebsd.org/~mjg/netbsd/build-kernel-j40.svg

And here are the top mutex spinners:

59.42  1560022 184255.00 ffffe40138351180       <all>
57.52  1538978 178356.84 ffffe40138351180       uvm_fault_internal+7e0
 1.23     8884   3819.43 ffffe40138351180       uvm_unmap_remove+101
 0.67    12159   2078.61 ffffe40138351180       cache_lookup+97

(see https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-lockstat.txt)

Note that netbsd`0xffffffff802249ba in the flamegraph is x86_pause. Since the function does not push a frame pointer it is shown next to the actual caller, as opposed to above it. Sometimes called functions get misplaced anyway, I don't know why.

1. exclusive vnode locking (genfs_lock)

It is used even for path lookup, which as can be seen leads to avoidable contention. From what I'm told the primary reason is ufs constructing some state just in case it has to create an inode at the end of the lookup. However, since most lookups are not intended to create anything, this behavior can be made conditional. I don't know the details, but ufs on FreeBSD most certainly uses shared locking for common-case lookups.

2. uvm_fault_internal

It is shown as the main waiter for a vm object lock. The flamegraph hints that the real problem is with uvm_pageqlock & friends taken elsewhere: most page fault handlers serialize on the vm object lock while its holder waits for uvm_pageqlock.

3. pmap

It seems most issues stem from slow pmap handling. Chances are there are perfectly avoidable shootdowns, and in fact cases where there is no need to alter KVA in the first place.

4. vm locks in general

Most likely there are trivial cases where operations can be batched, especially on process exit where there are multiple pages to operate on.

======================================

I would like to add a remark about locking primitives. Today the rage is MCS locks, which are fine but not trivial to integrate with sleepable locks like your mutexes. Even so, the current implementation is significantly slower than it has to be.

First, the lock word is read twice on entry to mutex_vector_enter - once to determine the lock type and then to read the owner. Spinning mutexes should probably be handled by a different routine.

lock cmpxchg already returns the value found in the lock word (the owner). It can be passed by the assembly routine to the slow path. This allows making an initial pass at backoff without accessing the lock in the meantime. In the face of contention the cacheline could have changed ownership by the time you get to the read, so using the value we already saw avoids spurious bus transactions. Given a low initial spin count this should not have negative effects.

The backoff parameters were hardcoded a decade ago and are really off even for today's modest servers.
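To make the shape of that concrete, here is a rough C sketch (the real fast path is assembly and every name and constant below is made up, so treat it as an illustration only): the failed cmpxchg hands the owner it observed to the slow path, which does its first backoff pass without re-reading the contended lock word.

#include <stdatomic.h>
#include <stdint.h>

/* Example values only; the right numbers would need measuring. */
#define BACKOFF_MIN     8
#define BACKOFF_MAX     256

static inline void
cpu_pause(void)
{
        __asm __volatile("pause");
}

static void
mtx_enter_slow(atomic_uintptr_t *lock, uintptr_t self, uintptr_t seen)
{
        unsigned int spins = BACKOFF_MIN;
        uintptr_t expected;

        for (;;) {
                if (seen != 0) {
                        /* Back off based on the owner value we already saw. */
                        for (unsigned int i = 0; i < spins; i++)
                                cpu_pause();
                        if (spins < BACKOFF_MAX)
                                spins <<= 1;
                }
                expected = 0;
                if (atomic_compare_exchange_weak(lock, &expected, self))
                        return;
                seen = expected;        /* owner observed by this attempt */
        }
}

static void
mtx_enter(atomic_uintptr_t *lock, uintptr_t self)
{
        uintptr_t expected = 0;

        if (atomic_compare_exchange_strong(lock, &expected, self))
                return;                 /* uncontended fast path */
        /* 'expected' now holds whatever owner the cmpxchg found. */
        mtx_enter_slow(lock, self, expected);
}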
For kicks I changed the max spin count to 1024, and in a trivial microbenchmark of doing dup2 + close in 40 threads I got almost double the throughput. Interestingly, this change caused a regression for the kernel build. I did not investigate, but I suspect the cause was that the vm object lock holder was now less aggressive about trying to grab the lock it was waiting for, and that caused problems for everyone else waiting on the vm object lock.

The spin loop itself is weird in the sense that instead of just having the pause instruction embedded it calls a function. This is probably less power- and sibling-thread-friendly than it needs to be.

Cheers,

-- 
Mateusz Guzik <mjguzik gmail.com>
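P.S. For anyone who wants to poke at the spin count themselves, a microbenchmark along the lines of the dup2 + close loop mentioned above can be as simple as the following (a sketch, not the exact program behind the numbers; time it externally and compare runs):

#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define NTHREADS        40
#define ITERS           2000000         /* arbitrary iteration count */

static int basefd;

static void *
worker(void *arg)
{
        /* Give each thread its own target descriptor slot. */
        int slot = 100 + (int)(intptr_t)arg;

        for (long i = 0; i < ITERS; i++) {
                if (dup2(basefd, slot) == -1)
                        err(1, "dup2");
                if (close(slot) == -1)
                        err(1, "close");
        }
        return (NULL);
}

int
main(void)
{
        pthread_t thr[NTHREADS];
        int i;

        basefd = open("/dev/null", O_RDONLY);
        if (basefd == -1)
                err(1, "open");

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&thr[i], NULL, worker, (void *)(intptr_t)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(thr[i], NULL);
        return (0);
}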