GSoC Proposal - Defragmentation for FFS

2020-01-19 Thread Chen,Xizi
Hi NetBSD Community,

I am currently a first-year Master's student in Computer Science at
Northwest Missouri State University. I am a professional C/C++ developer with
about four years of experience as a storage systems engineer. My work has
mostly revolved around the storage stack and the I/O path in general.

One of the issues that hits very close to home for me is file system aging. I'm
interested in learning in detail how to tackle fragmentation, and I believe
the "Defragmentation for FFS" project offers an excellent opportunity to do
so.

I have experience using and writing software on Linux, including writing and
debugging kernel modules. Although my work has mostly involved XFS rather
than FFS, I am more than willing to learn and contribute to the
"Defragmentation for FFS" project.

Looking forward to learning more about the project and working with the
NetBSD community.

Cheers,
Xizi Chen,
NWMSU


Re: Adaptive RW locks

2020-01-19 Thread Takashi YAMAMOTO
hi,

On Mon, Jan 20, 2020 at 6:07 AM Andrew Doran  wrote:

> In my testing lately I've often run into a high context switch rate over RW
> locks.
>
> The RW locks we have now are "half adaptive" so to speak.  Lock waiters can
> spin while the lock is write locked and the owner is running, because the
> owner can be identified in that case, and then it can be determined whether
> or not the owner is running on a CPU: if so, the waiter spins, if not, it
> sleeps.
>
> The same does not happen if a RW lock is read locked and someone wants to
> take a write hold, because in that case there is no identifiable owner, and
> so it can't be determined whether the owner(s) are running or not.  In that
> case the lock waiter always sleeps.  It then has a kind of snowball effect
> because it delays the lock getting back to its unheld state, and every
> thread arriving during the aftermath sleeps too.
>
> If you only take mostly read holds, or only take mostly write holds on an
> individual RW lock this isn't a big deal because it occurs rarely.  But if
> you introduce more of a mix of read/write you start to get the poor
> behaviour.  The correct answer here may be "don't do that", but it's not a
> very practical one.
>
> I'd like to reduce this blocking activity, so here is a patch against
> -current that implements the adaptive behaviour in the reader case too, by
> having each LWP holding RW locks track those holds, and if going to sleep,
> cause any held RW locks to be temporarily put into "sleep mode", with the
> default being "spin mode".  In my testing with mixed vnode locks this works
> well.
>
> http://www.netbsd.org/~ad/2020/rwlock.diff


i guess in rw_downgrade
newown = (owner & RW_NODEBUG) | RW_SPIN;

should be
newown = (owner & (RW_NODEBUG | RW_SPIN));

as the owner might have failed to record the lock on acquisition.



>
>
> This removes the assembly stubs but the compiler seems to do a pretty good
> job on the C ones I added, and in microbenchmarking it the difference is
> tiny.  I've only included the amd64 MD changes in the diff.
>
> In trying this out on builds it seems that tracking at most 2 holds catches
> ~99.8% of activity, but I have made the limit 4 (it's a soft limit, if
> exceeded all further taken locks are put into "sleep mode").
>
> I also added a rw_owner_running() for the pagedaemon, to match mutexes (not
> used yet).
>
> Thoughts?
>
> cheers,
> Andrew
>


Re: libc.so's vmobjlock / v_interlock

2020-01-19 Thread Taylor R Campbell
> Date: Sun, 19 Jan 2020 21:50:24 +
> From: Andrew Doran 
> 
> The biggest single remaining concurrency problem is the mutex on libc.so's
> uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> above the "uvm_fault_internal" box.
> 
> I think making the vmobjlock / v_interlock a rwlock could help with making
> inroads on this problem, because in the libc.so case it's not modifying too
> much under cover of the lock (mostly pmap state).
> 
> [...]
> 
> Thoughts?

I think we should try to go straight to pserialized page lookup and
skip rwlocks.  Should be relatively easy to teach radixtree.c to work
pserialized.

We probably won't want to use pserialize_perform() but can perhaps
instead use an epoch-based system with page daemon queues of dying
pages that may still be referenced under pserialize_read.


libc.so's vmobjlock / v_interlock

2020-01-19 Thread Andrew Doran
Hi,

Here's a dtrace flamegraph for a kernel build on my system, while running a
kernel from the ad-namecache branch.

http://www.netbsd.org/~ad/2020/jan19.svg

The biggest single remaining concurrency problem is the mutex on libc.so's
uvm_object / vnode.  It pops up in a number of places, but most clearly seen
above the "uvm_fault_internal" box.

I think making the vmobjlock / v_interlock a rwlock could help with making
inroads on this problem, because in the libc.so case it's not modifying too
much under cover of the lock (mostly pmap state).

That's not a small undertaking, but I reckon it would take about a day of
donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and
make a start at choosing rw_write_held() or rw_lock_held() for the
assertions, followed by a bit of benchmarking to check for a regression and
a jumbo commit.  From there on things could be changed incrementally.

Thoughts?

Andrew


Adaptive RW locks

2020-01-19 Thread Andrew Doran
In my testing lately I've often run into a high context switch rate over RW
locks.

The RW locks we have now are "half adaptive" so to speak.  Lock waiters can
spin while the lock is write locked and the owner is running, because the
owner can be identified in that case, and then it can be determined whether
or not the owner is running on a CPU: if so, the waiter spins, if not, it
sleeps.

The same does not happen if a RW lock is read locked and someone wants to
take a write hold, because in that case there is no identifiable owner, and
so it can't be determined whether the owner(s) are running or not.  In that
case the lock waiter always sleeps.  It then has a kind of snowball effect
because it delays the lock getting back to its unheld state, and every
thread arriving during the aftermath sleeps too.

If you only take mostly read holds, or only take mostly write holds on an
individual RW lock this isn't a big deal because it occurs rarely.  But if
you introduce more of a mix of read/write you start to get the poor
behaviour.  The correct answer here may be "don't do that", but it's not a
very practical one.

I'd like to reduce this blocking activity, so here is a patch against
-current that implements the adaptive behaviour in the reader case too, by
having each LWP holding RW locks track those holds, and if going to sleep,
cause any held RW locks to be temporarily put into "sleep mode", with the
default being "spin mode".  In my testing with mixed vnode locks this works
well.

http://www.netbsd.org/~ad/2020/rwlock.diff

This removes the assembly stubs but the compiler seems to do a pretty good
job on the C ones I added, and in microbenchmarking it the difference is
tiny.  I've only included the amd64 MD changes in the diff.

In trying this out on builds it seems that tracking at most 2 holds catches
~99.8% of activity, but I have made the limit 4 (it's a soft limit, if
exceeded all further taken locks are put into "sleep mode").

I also added a rw_owner_running() for the pagedaemon, to match mutexes (not
used yet).

Thoughts?

cheers,
Andrew


use-after-free in fd code?

2020-01-19 Thread Mouse
I've been working on code that involves meddling with unp_internalize
and unp_externalize (specifically, a way to pass memory, not just file
descriptors, through AF_LOCAL sockets), and I think I may have found a
use-after-free bug, one that appears to be present even in 8.0 (I'm
targeting my work for my 5.2 derivative).

I got my stuff "working", but when testing the codepath for dealing
with descriptors dropped while in transit (not memory blocks - I was
testing to make sure I hadn't broken descriptor passing), I found that
very occasionally, on the order of one in 5-10 million cases, a KASSERT
in unp_gc would trip.  Comparing my uipc_usrreq.c with the 8.0 version
from a work machine, I find that the KASSERT in question is gone,
replaced with a comment "XXX: closef() doesn't pay attention to
FDEFER", an unlock, and a continue.  This might fix the symptom, but I
think there's something worse lurking.

The debugging code I added in an attempt to figure this out clearly
shows that, in most cases, file_ts are used after being freed.  Here's
an example.  In each record:

- The first string is a short summary of the operation: -- and ++ are
  decrement and increment of f_count; A is a KASSERT testing the
  indicated condition; the a or b after -- disambiguates two decrements
  that would otherwise be indistinguishable; test is a more complicated
  case where, for example, f_count is compared to f_msgcount.
- The number in () is the value of f_count.
- The number in [] is the CPU the record was made on.
- The file and line are what they look like, though the line numbers
  won't match up with anything because I've got a lot of debugging code
  in those files.
- The stuff on the right is manually added, describing what the
  file-and-line indicates.  (internalize_rights and
  externalize_process_rights are SCM_RIGHTS processing pulled out into
  their own functions as part of adding memory passing.)

init (1) [0]: uipc_usrreq.c, line 1800  internalize_rights
++ (2) [0]: uipc_usrreq.c, line 1802internalize_rights
A >0 (2) [0]: kern_descrip.c, line 648  fd_close
--a (1) [0]: kern_descrip.c, line 648   fd_close
A !=0 (1) [1]: uipc_usrreq.c, line 2378 unp_gc (scan 1)
A !=0 (1) [1]: uipc_usrreq.c, line 2343 unp_gc (FDEFER)
++ (2) [1]: uipc_usrreq.c, line 2369unp_gc (hold)
++ (3) [0]: kern_descrip.c, line 1063   fd_affix
A >0 (3) [0]: uipc_usrreq.c, line 1387  externalize_process_rights
--a (2) [0]: uipc_usrreq.c, line 1387   externalize_process_rights
A >0 (2) [1]: uipc_usrreq.c, line 2390  unp_gc (unhold)
--a (1) [1]: uipc_usrreq.c, line 2390   unp_gc (unhold)
test (1) [1]: uipc_usrreq.c, line 2421  unp_gc (sweep)
A >0 (1) [0]: kern_descrip.c, line 648  fd_close
--b (0) [0]: kern_descrip.c, line 648   fd_close
A ==0 (0) [0]: kern_descrip.c, line 648 fd_close 
A ==0 (0) [0]: kern_descrip.c, line 764 ffree (closef)
test (0) [1]: uipc_usrreq.c, line 2346  unp_gc (!FDEFER)
test (0) [1]: uipc_usrreq.c, line 2421  unp_gc (sweep)

Now, this particular case didn't trip the KASSERT; that is in what
appears here as "unp_gc (FDEFER)", which ran while f_count was still
nonzero.  But it _is_ quite clear that unp_gc referred to fp->f_count
twice *after* fd_close called closef called ffree on it.  (The actual
trace logs include the file_t pointer; the above is just the lines for
one particular file_t.)  My trace includes 55 examples; this happens in
all 55 of them, and then crashes with

init (1) [0]: uipc_usrreq.c, line 1800  internalize_rights
++ (2) [0]: uipc_usrreq.c, line 1802internalize_rights
A >0 (2) [0]: kern_descrip.c, line 648  fd_close
--a (1) [0]: kern_descrip.c, line 648   fd_close
A !=0 (1) [1]: uipc_usrreq.c, line 2378 unp_gc (scan 1)
++ (2) [0]: kern_descrip.c, line 1063   fd_affix
A >0 (2) [0]: uipc_usrreq.c, line 1387  externalize_process_rights
--a (1) [0]: uipc_usrreq.c, line 1387   externalize_process_rights
A >0 (1) [0]: kern_descrip.c, line 648  fd_close
--b (0) [0]: kern_descrip.c, line 648   fd_close
A ==0 (0) [0]: kern_descrip.c, line 648 fd_close
A !=0 (0) [1]: uipc_usrreq.c, line 2343 unp_gc (FDEFER)

In my tests I've never seen a file_t reallocated before unp_gc stops
looking at it, but I don't see any reason that couldn't happen,
especially if the system is close to ENFILE.  (I suspect it hasn't
happened to me in part because my tester is the only thing creating or
closing file descriptors during these runs, it's doing only one
transaction at a time, and the system is nowhere close to ENFILE.)  It
looks to me like a classic use-after-free, and, comparing my
uipc_usrreq.c to 8.0's, I don't see any reason it can't happen there
too.

Of course, it's possible this arises from something I've changed
(though I can't see how; I didn't change the basic control flow).  It's
also possible there's