GSoC Proposal - Defragmentation for FFS
Hi NetBSD Community,

I am currently a first-year Master's student in Computer Science at Northwest Missouri State University. I am a professional C/C++ developer with about four years of experience as a storage systems engineer, and my work has mostly revolved around the storage stack and the I/O path in general. One of the issues that hits very close to home for me is file system aging, and I am interested in learning in detail how to tackle fragmentation. I believe the "Defragmentation for FFS" project provides an excellent opportunity to do that.

I have experience using and writing software on Linux, including writing and debugging kernel modules. Although my work has mostly involved XFS, I am more than willing to learn and contribute.

Looking forward to learning more about the project and working with the NetBSD community.

Cheers,
Xizi Chen, NWMSU
Re: Adaptive RW locks
hi,

On Mon, Jan 20, 2020 at 6:07 AM Andrew Doran wrote:
> In my testing lately I've often run into a high context switch rate
> over RW locks.
>
> The RW locks we have now are "half adaptive", so to speak.  Lock
> waiters can spin while the lock is write locked and the owner is
> running, because the owner can be identified in that case, and then it
> can be determined whether or not the owner is running on a CPU: if so,
> the waiter spins; if not, it sleeps.
>
> The same does not happen if a RW lock is read locked and someone wants
> to take a write hold, because in that case there is no identifiable
> owner, and so it can't be determined whether the owner(s) are running
> or not.  In that case the lock waiter always sleeps.  This has a kind
> of snowball effect, because it delays the lock getting back to its
> unheld state, and every thread arriving during the aftermath sleeps
> too.
>
> If you mostly take read holds, or mostly take write holds, on an
> individual RW lock this isn't a big deal because it occurs rarely.
> But if you introduce more of a mix of read and write holds you start
> to get the poor behaviour.  The correct answer here may be "don't do
> that", but it's not a very practical one.
>
> I'd like to reduce this blocking activity, so here is a patch against
> -current that implements the adaptive behaviour in the reader case
> too, by having each LWP track the RW locks it holds and, if going to
> sleep, temporarily put any held RW locks into "sleep mode", the
> default being "spin mode".  In my testing with mixed vnode locks this
> works well.
>
> http://www.netbsd.org/~ad/2020/rwlock.diff

I guess in rw_downgrade

	newown = (owner & RW_NODEBUG) | RW_SPIN;

should be

	newown = (owner & (RW_NODEBUG | RW_SPIN));

as the owner might have failed to record the lock on acquisition.

> This removes the assembly stubs, but the compiler seems to do a pretty
> good job on the C ones I added, and in microbenchmarking the
> difference is tiny.  I've only included the amd64 MD changes in the
> diff.
>
> In trying this out on builds it seems that tracking at most 2 holds
> catches ~99.8% of activity, but I have made the limit 4 (it's a soft
> limit; if it is exceeded, all further taken locks are put into "sleep
> mode").
>
> I also added a rw_owner_running() for the pagedaemon, to match mutexes
> (not used yet).
>
> Thoughts?
>
> cheers,
> Andrew
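For context, here is a rough sketch of where that change would sit in rw_downgrade(). This is schematic only, not quoted from kern_rwlock.c: the declarations and loop are simplified, waiter handling is omitted, and rw_cas() here just stands in for the lock word compare-and-swap helper.

	uintptr_t owner, newown;

	for (;;) {
		owner = rw->rw_owner;
		/*
		 * One read hold, preserving the debug bit and, crucially,
		 * the existing spin/sleep state: if the owner failed to
		 * record the hold on acquisition, the lock may already be
		 * in "sleep mode", and unconditionally OR-ing RW_SPIN back
		 * in would undo that.
		 */
		newown = RW_READ_INCR | (owner & (RW_NODEBUG | RW_SPIN));
		if (rw_cas(rw, owner, newown) == owner)
			break;
	}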
Re: libc.so's vmobjlock / v_interlock
> Date: Sun, 19 Jan 2020 21:50:24 +
> From: Andrew Doran
>
> The biggest single remaining concurrency problem is the mutex on
> libc.so's uvm_object / vnode.  It pops up in a number of places, but
> is most clearly seen above the "uvm_fault_internal" box.
>
> I think making the vmobjlock / v_interlock an rwlock could help with
> making inroads on this problem, because in the libc.so case not much
> is modified under cover of the lock (mostly pmap state).
>
> [...]
>
> Thoughts?

I think we should try to go straight to pserialized page lookup and skip rwlocks. It should be relatively easy to teach radixtree.c to work pserialized. We probably won't want to use pserialize_perform(), but can perhaps instead use an epoch-based system with page daemon queues of dying pages that may still be referenced under pserialize_read.
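To make the idea a bit more concrete, here is a minimal sketch of what a pserialized lookup might look like. It assumes the object's pages live in a radix tree (uo_pages) keyed by page index; the function name uvm_page_lookup_pserialized(), the helper try_take_page_reference() and the retry policy are invented for illustration, and freed pages are assumed to sit on a page daemon "dying" queue until the current epoch expires.

	struct vm_page *
	uvm_page_lookup_pserialized(struct uvm_object *uobj, voff_t off)
	{
		struct vm_page *pg;
		int s;

		s = pserialize_read_enter();
		pg = radix_tree_lookup_node(&uobj->uo_pages, off >> PAGE_SHIFT);
		if (pg != NULL && !try_take_page_reference(pg)) {
			/*
			 * The page is on its way out; it stays on the page
			 * daemon's dying queue until the epoch expires, so
			 * reading it here was safe, but the caller must
			 * retry or fall back to the object lock.
			 */
			pg = NULL;
		}
		pserialize_read_exit(s);
		return pg;
	}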
libc.so's vmobjlock / v_interlock
Hi,

Here's a dtrace flamegraph for a kernel build on my system, while running a kernel from the ad-namecache branch:

http://www.netbsd.org/~ad/2020/jan19.svg

The biggest single remaining concurrency problem is the mutex on libc.so's uvm_object / vnode. It pops up in a number of places, but is most clearly seen above the "uvm_fault_internal" box.

I think making the vmobjlock / v_interlock an rwlock could help with making inroads on this problem, because in the libc.so case not much is modified under cover of the lock (mostly pmap state).

That's not a small undertaking, but I reckon it would take about a day of donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and make a start at choosing rw_write_held() or rw_lock_held() for the assertions, followed by a bit of benchmarking to check for a regression, and then a jumbo commit. From there on, things could be changed incrementally.

Thoughts?

Andrew
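For what it's worth, the mechanical part of that donkey work would look roughly like this at each site. This is illustrative only: it assumes the vmobjlock field becomes a pointer to a krwlock_t, and the choice between rw_write_held() and rw_lock_held() is the per-site judgement call mentioned above.

	/* before */
	mutex_enter(uobj->vmobjlock);
	KASSERT(mutex_owned(uobj->vmobjlock));
	/* ... */
	mutex_exit(uobj->vmobjlock);

	/* after */
	rw_enter(uobj->vmobjlock, RW_WRITER);
	/*
	 * rw_write_held() where the code really modifies object state,
	 * rw_lock_held() where a read hold would also have been enough.
	 */
	KASSERT(rw_write_held(uobj->vmobjlock));
	/* ... */
	rw_exit(uobj->vmobjlock);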
Adaptive RW locks
In my testing lately I've often run into a high context switch rate over RW locks.

The RW locks we have now are "half adaptive", so to speak. Lock waiters can spin while the lock is write locked and the owner is running, because the owner can be identified in that case, and then it can be determined whether or not the owner is running on a CPU: if so, the waiter spins; if not, it sleeps.

The same does not happen if a RW lock is read locked and someone wants to take a write hold, because in that case there is no identifiable owner, and so it can't be determined whether the owner(s) are running or not. In that case the lock waiter always sleeps. This has a kind of snowball effect, because it delays the lock getting back to its unheld state, and every thread arriving during the aftermath sleeps too.

If you mostly take read holds, or mostly take write holds, on an individual RW lock this isn't a big deal because it occurs rarely. But if you introduce more of a mix of read and write holds you start to get the poor behaviour. The correct answer here may be "don't do that", but it's not a very practical one.

I'd like to reduce this blocking activity, so here is a patch against -current that implements the adaptive behaviour in the reader case too, by having each LWP track the RW locks it holds and, if going to sleep, temporarily put any held RW locks into "sleep mode", the default being "spin mode". In my testing with mixed vnode locks this works well.

http://www.netbsd.org/~ad/2020/rwlock.diff

This removes the assembly stubs, but the compiler seems to do a pretty good job on the C ones I added, and in microbenchmarking the difference is tiny. I've only included the amd64 MD changes in the diff.

In trying this out on builds it seems that tracking at most 2 holds catches ~99.8% of activity, but I have made the limit 4 (it's a soft limit; if it is exceeded, all further taken locks are put into "sleep mode").

I also added a rw_owner_running() for the pagedaemon, to match mutexes (not used yet).

Thoughts?

cheers,
Andrew
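A rough sketch of the per-LWP tracking idea described above, not the patch itself: the l_rwholds[]/l_rwcount fields, RW_HELD_MAX and rw_set_sleep_mode() are invented names used only to illustrate the soft limit and the switch to "sleep mode" when the holder blocks.

	#define	RW_HELD_MAX	4	/* soft limit on tracked holds */

	/* On acquisition: remember the hold, or fall back to sleep mode. */
	static inline void
	rw_track_hold(struct lwp *l, krwlock_t *rw)
	{
		if (l->l_rwcount < RW_HELD_MAX)
			l->l_rwholds[l->l_rwcount] = rw;
		else
			rw_set_sleep_mode(rw);	/* over the soft limit */
		l->l_rwcount++;
	}

	/* Just before the LWP blocks: stop waiters spinning on our locks. */
	static inline void
	rw_track_sleep(struct lwp *l)
	{
		u_int n = MIN(l->l_rwcount, RW_HELD_MAX);

		for (u_int i = 0; i < n; i++)
			rw_set_sleep_mode(l->l_rwholds[i]);
	}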
use-after-free in fd code?
I've been working on code that involves meddling with unp_internalize and unp_externalize (specifically, a way to pass memory, not just file descriptors, through AF_LOCAL sockets), and I think I may have found a use-after-free bug, one that appears to be present even in 8.0 (I'm targeting my work at my 5.2 derivative).

I got my stuff "working", but when testing the codepath for dealing with descriptors dropped while in transit (not memory blocks - I was testing to make sure I hadn't broken descriptor passing), I found that very occasionally, on the order of one in 5-10 million cases, a KASSERT in unp_gc would trip. Comparing my uipc_usrreq.c with the 8.0 version from a work machine, I find that the KASSERT in question is gone, replaced with a comment "XXX: closef() doesn't pay attention to FDEFER", an unlock, and a continue. This might fix the symptom, but I think there's something worse lurking.

The debugging code I added in an attempt to figure this out clearly shows that, in most cases, file_ts are used after being freed. Here's an example. The first string is a short summary of the operation (-- and ++ are decrement and increment of f_count; A is a KASSERT testing the indicated condition; the a or b after -- disambiguates two decrements that would otherwise be indistinguishable). "test" marks a more complicated case where, for example, f_count is compared to f_msgcount. The number in () is the value of f_count. The number in [] is the CPU the record was made on. The file and line are what they look like, though the line numbers won't match up with anything because I've got a lot of debugging code in those files. The stuff on the right is manually added, describing what the file-and-line indicates. (internalize_rights and externalize_process_rights are SCM_RIGHTS processing pulled out into their own functions as part of adding memory passing.)

	init  (1) [0]: uipc_usrreq.c, line 1800   internalize_rights
	++    (2) [0]: uipc_usrreq.c, line 1802   internalize_rights
	A >0  (2) [0]: kern_descrip.c, line 648   fd_close
	--a   (1) [0]: kern_descrip.c, line 648   fd_close
	A !=0 (1) [1]: uipc_usrreq.c, line 2378   unp_gc (scan 1)
	A !=0 (1) [1]: uipc_usrreq.c, line 2343   unp_gc (FDEFER)
	++    (2) [1]: uipc_usrreq.c, line 2369   unp_gc (hold)
	++    (3) [0]: kern_descrip.c, line 1063  fd_affix
	A >0  (3) [0]: uipc_usrreq.c, line 1387   externalize_process_rights
	--a   (2) [0]: uipc_usrreq.c, line 1387   externalize_process_rights
	A >0  (2) [1]: uipc_usrreq.c, line 2390   unp_gc (unhold)
	--a   (1) [1]: uipc_usrreq.c, line 2390   unp_gc (unhold)
	test  (1) [1]: uipc_usrreq.c, line 2421   unp_gc (sweep)
	A >0  (1) [0]: kern_descrip.c, line 648   fd_close
	--b   (0) [0]: kern_descrip.c, line 648   fd_close
	A ==0 (0) [0]: kern_descrip.c, line 648   fd_close
	A ==0 (0) [0]: kern_descrip.c, line 764   ffree (closef)
	test  (0) [1]: uipc_usrreq.c, line 2346   unp_gc (!FDEFER)
	test  (0) [1]: uipc_usrreq.c, line 2421   unp_gc (sweep)

Now, this particular case didn't trip the KASSERT; that is in what appears here as "unp_gc (FDEFER)", which ran while f_count was still nonzero. But it _is_ quite clear that unp_gc referred to fp->f_count twice *after* fd_close called closef, which called ffree on it. (The actual trace logs include the file_t pointer; the above is just the lines for one particular file_t.)
My trace includes 55 examples; this happens in all 55 of them, and then crashes with:

	init  (1) [0]: uipc_usrreq.c, line 1800   internalize_rights
	++    (2) [0]: uipc_usrreq.c, line 1802   internalize_rights
	A >0  (2) [0]: kern_descrip.c, line 648   fd_close
	--a   (1) [0]: kern_descrip.c, line 648   fd_close
	A !=0 (1) [1]: uipc_usrreq.c, line 2378   unp_gc (scan 1)
	++    (2) [0]: kern_descrip.c, line 1063  fd_affix
	A >0  (2) [0]: uipc_usrreq.c, line 1387   externalize_process_rights
	--a   (1) [0]: uipc_usrreq.c, line 1387   externalize_process_rights
	A >0  (1) [0]: kern_descrip.c, line 648   fd_close
	--b   (0) [0]: kern_descrip.c, line 648   fd_close
	A ==0 (0) [0]: kern_descrip.c, line 648   fd_close
	A !=0 (0) [1]: uipc_usrreq.c, line 2343   unp_gc (FDEFER)

In my tests I've never seen a file_t reallocated before unp_gc stops looking at it, but I don't see any reason that couldn't happen, especially if the system is close to ENFILE. (I suspect it hasn't happened to me in part because my tester is the only thing creating or closing file descriptors during these runs, it's doing only one transaction at a time, and the system is nowhere close to ENFILE.)

It looks to me like a classic use-after-free, and, comparing my uipc_usrreq.c to 8.0's, I don't see any reason it can't happen there too. Of course, it's possible this arises from something I've changed (though I can't see how; I didn't change the basic control flow). It's also possible there's so
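For what it's worth, the shape of the hazard described above (and of the classic remedy) looks roughly like this. The locking and helper names below (filelist_lock, defer_scan()) are schematic, and this is not a proposed patch for unp_gc, just the pattern in miniature.

	/*
	 * Hazard: nothing stops another thread from running
	 * closef() -> ffree() between finding fp on the list and
	 * examining it, after which fp points at freed memory.
	 */
	if (fp->f_count != 0 && fp->f_msgcount != 0)	/* may read freed memory */
		defer_scan(fp);

	/*
	 * Classic remedy: take a reference while fp is known to be live
	 * (i.e. while still holding the lock that protects the file list),
	 * examine it, and drop the reference when the scan is done.
	 */
	mutex_enter(&filelist_lock);
	fp->f_count++;				/* keep fp alive for the scan */
	mutex_exit(&filelist_lock);
	if (fp->f_msgcount != 0)
		defer_scan(fp);
	closef(fp);				/* drop our reference */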