Re: vfs_lockf changes (was: CVS commit: src)

2023-09-12 Thread Andrew Doran
On Mon, Sep 11, 2023 at 03:29:39PM +, Martin Husemann wrote:

> On Sun, Sep 10, 2023 at 02:45:53PM +0000, Andrew Doran wrote:
> > Module Name:    src
> > Committed By:   ad
> > Date:   Sun Sep 10 14:45:53 UTC 2023
> > 
> > Modified Files:
> > src/common/lib/libc/gen: radixtree.c
> > src/sys/kern: init_main.c kern_descrip.c kern_lwp.c kern_mutex_obj.c
> > kern_resource.c kern_rwlock_obj.c kern_turnstile.c subr_kcpuset.c
> > vfs_cwd.c vfs_init.c vfs_lockf.c
> > src/sys/rump/librump/rumpkern: rump.c
> > src/sys/rump/librump/rumpvfs: rump_vfs.c
> > src/sys/sys: namei.src
> > src/sys/uvm: uvm_init.c uvm_map.c uvm_readahead.c
> > 
> > Log Message:
> > - Do away with separate pool_cache for some kernel objects that have no
> >   special requirements and use the general purpose allocator instead.  On
> >   one of my test systems this makes for a small (~1%) but repeatable
> >   reduction in system time during builds presumably because it decreases
> >   the kernel's cache / memory bandwidth footprint a little.
> > - vfs_lockf: cache a pointer to the uidinfo and put mutex in the data
> >   segment.
> 
> Hi Andrew, 
> 
> most (all?) vfs_lockf tests are now crashing, e.g.
> 
> https://netbsd.org/~martin/evbearmv7hf-atf/469_atf.html#failed-tcs-summary

I backed out these changes since I don't have time to debug this week. 
Apologies for the interruption and thanks a lot for the reports.

Andrew
 
> 
> Core was generated by `t_vnops'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  rumpuser_mutex_spin_p (mtx=0x0) at 
> /work/src/lib/librumpuser/rumpuser_pth.c:166
> [Current thread is 1 (process 18892)]
> #0  rumpuser_mutex_spin_p (mtx=0x0) at 
> /work/src/lib/librumpuser/rumpuser_pth.c:166
> #1  0x71c672dc in mutex_enter (mtx=0x71d78fc0 ) at 
> /work/src/lib/librump/../../sys/rump/librump/rumpkern/locks.c:164
> #2  0x71d3dd58 in lf_advlock (ap=0x7ff897b8, head=0x6cdc7140, size=<optimized out>) at /work/src/lib/librumpvfs/../../sys/rump/../kern/vfs_lockf.c:896
> #3  0x71d51bec in VOP_ADVLOCK (vp=0x6cde0600, id=, 
> op=, fl=, flags=64) at 
> /work/src/lib/librumpvfs/../../sys/rump/../kern/vnode_if.c:1867
> #4  0x71be5f64 in do_fcntl_lock (fd=fd@entry=0, cmd=cmd@entry=8, 
> fl=fl@entry=0x7ff89818) at 
> /work/src/lib/librump/../../sys/rump/../kern/sys_descrip.c:285
> #5  0x71be6154 in sys_fcntl (l=, uap=0x7ff898fc, 
> retval=) at 
> /work/src/lib/librump/../../sys/rump/../kern/sys_descrip.c:365
> #6  0x71c70d9c in sy_call (rval=0x7ff898f4, uap=0x7ff898fc, l=0x6ce09c80, 
> sy=0x71cbf6f0 ) at 
> /work/src/lib/librump/../../sys/rump/../sys/syscallvar.h:65
> #7  sy_invoke (code=0, rval=0x7ff898f4, uap=0x7ff898fc, l=0x6ce09c80, 
> sy=0x71cbf6f0 ) at 
> /work/src/lib/librump/../../sys/rump/../sys/syscallvar.h:94
> #8  rump_syscall (num=0, num@entry=92, data=0x7ff898fc, 
> data@entry=0x7ff898f4, dlen=dlen@entry=12, retval=0x7ff898f4, 
> retval@entry=0x7ff898ec) at 
> /work/src/lib/librump/../../sys/rump/librump/rumpkern/rump.c:782
> #9  0x71c63e8c in rump___sysimpl_fcntl (fd=fd@entry=0, cmd=cmd@entry=8, 
> arg=arg@entry=0x7ff899e8) at 
> /work/src/lib/librump/../../sys/rump/librump/rumpkern/rump_syscalls.c:1322
> #10 0x007f8120 in fcntl_getlock_pids (tc=0x855234 
> , mp=0x8329c0 "/mnt") at 
> /work/src/tests/fs/vfs/t_vnops.c:941
> #11 0x00811134 in atfu_ext2fs_fcntl_getlock_pids_body (tc=0x855234 
> ) at /work/src/tests/fs/vfs/t_vnops.c:1085
> #12 0x71ae8658 in atf_tc_run (tc=0x855234 
> , resfile=) at 
> /work/src/external/bsd/atf/dist/atf-c/tc.c:1024
> #13 0x71ae4f84 in run_tc (exitcode=, p=0x7ff8a840, 
> tp=0x7ff8a83c) at /work/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:510
> #14 controlled_main (exitcode=, add_tcs_hook=0x7ff8a86c, 
> argv=, argc=) at 
> /work/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:580
> #15 atf_tp_main (argc=, argv=, 
> add_tcs_hook=0x7ff8a86c) at 
> /work/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:610
> #16 0x007e6aa4 in ___start ()
> 
> Martin


Re: malloc(9) vs kmem(9) interfaces

2023-05-31 Thread Andrew Doran
On Sat, Oct 29, 2022 at 02:42:27PM +, Taylor R Campbell wrote:

> - Was the rationale migrating to kmem(9) written down or discussed
>   publicly anywhere?

Some I know of are:

- escape limitations of (interrupt safe) kmem_map by moving to kernel_map
- experiment with new layout strategies (vmem)
- experiment with caching, per-CPU stuff
- .. without upsetting existing malloc() users

Given the chance for a do-over I'd remove the ability to do allocation from
interrupt context with malloc() (why allow variable sized allocs there?),
use pools for interrupt context allocs and retain the old malloc()
interface, or something along those lines.

> - What would the cost of restoring attribution be, other than the
>   obvious O(ntag*nsizebuckets) memory cost to record it and the effort
>   to annotate allocations?

Related to this, in my experiments it turns out that using dedicated pools
for objects for no other reason than to get attribution or be space
efficient seems to exert a lot of pressure on the CPU cache, to the point
that reverting to the general purpose allocator where possible yields a
small but measurable and repeatable reduction in system time during builds.
I plan to do this for some objects.
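
A rough sketch of the kind of substitution meant here, with made-up names
("struct foo" is hypothetical, not one of the objects in the commit): an
object with no special alignment, IPL or ctor/dtor requirements moves from
its own pool_cache to kmem(9), which shares its sized caches with the rest
of the kernel instead of keeping another private set of per-CPU magazines
warm.

#include <sys/intr.h>
#include <sys/kmem.h>
#include <sys/pool.h>

struct foo {
	int	f_refs;
};

/* Before: a dedicated cache just for struct foo. */
static pool_cache_t foo_cache;

static void
foo_init(void)
{
	foo_cache = pool_cache_init(sizeof(struct foo), 0, 0, 0, "foopl",
	    NULL, IPL_NONE, NULL, NULL, NULL);
}

static struct foo *
foo_alloc_pool(void)
{
	return pool_cache_get(foo_cache, PR_WAITOK);
}

/* After: the general purpose allocator, no private per-CPU caches. */
static struct foo *
foo_alloc_kmem(void)
{
	return kmem_zalloc(sizeof(struct foo), KM_SLEEP);
}

static void
foo_free_kmem(struct foo *f)
{
	kmem_free(f, sizeof(*f));
}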

Andrew


Re: TSC improvement

2020-06-14 Thread Andrew Doran
On Thu, Jun 11, 2020 at 04:50:40AM +, Taylor R Campbell wrote:

> What's trickier is synchronizing per-CPU timecounters so that they all
> give a reasonably consistent view of absolute wall clock time -- and
> so it's not just one CPU that leads while the others play catchup
> every time they try to read the clock.  (In other words, adding atomic
> catchup logic certainly does not obviate the need to synchronize
> per-CPU timecounters!)
>
> But neither synchronization nor global monotonicity is always
> necessary -- e.g., for rusage we really only need a local view of time
> since we're only measuring relative time durations spent on the
> current CPU anyway.
> 
> > >This is what the timecounter(9) API per se expects of timecounters,
> > >and right now tsc (along with various other per-CPU cycle counters)
> > >fails to guarantee that.
> > 
> > How so, do you see a bug?  I think it's okay.  The TSC is only used for the
> > timecounter where it's known that it's insensitive to core speed variations
> > and is driven by a PLL related to the bus clock.  Fortunately that covers
> > most x86 systems, excepting a window of some years starting roughly around
> > the time of the Pentium 4.
> 
> If tc_get_timecount goes backward by a little, e.g. because you
> queried it on cpu0 the first time and on cpu1 the second time,
> kern_tc.c will interpret that to mean that it has instead jumped
> forward by a lot -- nothing in the timecounter abstraction copes with
> a timecounter that goes backwards at all.

I thought about it some more and I just don't think we have this problem on
x86 anyway.  The way I see it, with any counter if you make explicit
comparisons on a global basis the counter could appear to go a tiny bit
backwards due to timing differences in execution - unless you want to go to
some lengths to work around that.

I think all you can really expect is for the clock to not go backwards
within a single thread of execution.  By my understanding that's all the
timecounter code expects and the TSC code on x86 makes sure of that.  I
changed tsc_get_timecount so it'll print a message out if it's ever
observed.
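
Purely to illustrate that expectation (a hypothetical helper, not the actual
x86 tsc code): clamp against a caller-held high-water mark and complain once
if the counter is ever seen to step backwards within one thread of execution.

#include <sys/param.h>
#include <sys/systm.h>
#include <machine/cpu_counter.h>	/* cpu_counter(); per-port header */

static bool backwards_complained;

static uint64_t
read_counter_monotonic(uint64_t *lastp)	/* *lastp is per-thread state */
{
	uint64_t now = cpu_counter();

	if (__predict_false(now < *lastp)) {
		if (!backwards_complained) {
			backwards_complained = true;
			printf("counter went backwards by %llu ticks\n",
			    (unsigned long long)(*lastp - now));
		}
		now = *lastp;		/* clamp rather than step backwards */
	}
	*lastp = now;
	return now;
}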
 
> (There's also an issue where the `monotonic' clock goes backwards
> sometimes, as reported by sched_pstats.  I'm not sure anyone has
> tracked down where that's coming from -- it seems unlikely to be
> related to cross-CPU tsc synchronization because lwp rtime should
> generally be computed from differences between samples on a single CPU
> at a time, but I don't know.)

Hmm.  There was a race condition with rusage and softints that I fixed about
6 months ago where proc0 had absurd times in ps/top but I have not seen the
"clock has gone backwards" one in a long time.  I wonder if it's related.

Andrew


Re: TSC improvement

2020-06-10 Thread Andrew Doran
On Tue, Jun 09, 2020 at 05:16:27PM +, Taylor R Campbell wrote:

> It's great to see improvements to our calibration of the TSC (and I
> tend to agree that cpu_counter should be serializing, so that, e.g.,
> cpu_counter(); ...; cpu_counter() reliably measures time taken in the
> ellipsis).
> 
> At the same time, I wonder whether we should _also_:
> 
> 1. Modify the tsc timecounter so that it uses a global atomic to
>ensure that there is a global view of time as counted by the tsc.

A long time ago we used to do something quite like this for x86, but it
worked poorly, which is somewhat understandable because it's really difficult
to get something like that right.  (Which reminds me, someone posted a patch
for kern_cctr.c for alpha a couple of years ago.)

>This is what the timecounter(9) API per se expects of timecounters,
>and right now tsc (along with various other per-CPU cycle counters)
>fails to guarantee that.

How so, do you see a bug?  I think it's okay.  The TSC is only used for the
timecounter where it's known that it's insensitive to core speed variations
and is driven by a PLL related to the bus clock.  Fortunately that covers
most x86 systems, excepting a window of some years starting roughly around
the time of the Pentium 4.

Cheers,
Andrew

> 2. Introduce an API for a local timecounter -- a per-CPU timecounter
>that never goes backwards on a single CPU, but:
>(a) measures units of wall clock time, unlike cpu_counter();
>(b) need not be synchronized between CPUs; and
>(c) may be cheaper to read than a global timecounter.
>We could then use that for, e.g., rusage calculations.


Re: Atomic vcache_tryvget()

2020-05-24 Thread Andrew Doran
On Sat, Apr 04, 2020 at 09:33:26PM +, Andrew Doran wrote:

> This change makes vcache_tryvget() do its work by means of a single atomic
> operation, which removes the need to take v_interlock beforehand.

I re-did this one more elegantly.  Assuming no problems I plan to commit
some time this week.

http://www.netbsd.org/~ad/2020/vget.diff

The basic idea: have a "gate" bit in v_usecount that's only set when the
vnode is LOADED.  vcache_tryvget() needs to see this otherwise it fails and
we go down the slow path instead.  The state transition for reclaiming is
then:

...
LOADED
-> BLOCKED
confirm that caller still holds the last vnode reference
-> RECLAIMING
...
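
A minimal sketch of the gate idea (the VUSECOUNT_GATE name and exact bit
layout are made up here, not necessarily what's in vget.diff):

#include <sys/types.h>
#include <sys/atomic.h>
#include <sys/errno.h>

#define	VUSECOUNT_GATE	0x80000000	/* set only while the vnode is LOADED */

static int
tryvget_sketch(volatile uint32_t *usecountp)
{
	uint32_t use, next;

	for (;;) {
		use = *usecountp;
		if ((use & VUSECOUNT_GATE) == 0) {
			/* Not LOADED (or being reclaimed): take the slow path. */
			return EBUSY;
		}
		next = use + 1;
		if (atomic_cas_32(usecountp, use, next) == use)
			return 0;	/* gate observed and ref taken atomically */
		/* Lost a race with another update; reload and retry. */
	}
}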

Cheers,
Andrew


Re: ACL patch

2020-05-15 Thread Andrew Doran
On Thu, May 14, 2020 at 08:16:50PM -, Christos Zoulas wrote:
> In article <20200512235040.gd9...@homeworld.netbsd.org>,
> Andrew Doran   wrote:
> >On Tue, May 12, 2020 at 10:37:27PM +, Andrew Doran wrote:
> >> On Sun, May 10, 2020 at 09:48:10AM -0400, Christos Zoulas wrote:
> >> > 
> >> > > Maybe I missed it but I didn't see a way for cache_lookup_linked() and
> >> > > cache_have_id() to know a vnode has ACLs.  The presence of ACLs
> >means those
> >> > > routines can't do their imitation of VOP_ACCESS() and need to fail so 
> >> > > that
> >> > > the lookup is handled by VOP_LOOKUP().  To handle that on a
> >per-vnode basis
> >> > > you'd want to flag the "has ACLs" fact when an inode is loaded, and do 
> >> > > the
> >> > > same when an ACL is added to an in-core inode.  I'll look into it over 
> >> > > the
> >> > > next few days unless you want to take care of it.
> >> > 
> >> > So I looked into this a bit and genfs_can_access() is also used by 
> >cache_revlookup.
> >> > I am not sure what to do here. It is simple enough to add a flag to
> >the vnode to indicate
> >> > if it has ACLs, but is it the right thing to do?
> >> 
> >> I tried to pull down your patch again over the weekend to take a look but 
> >> it
> >> was gone.  The best place for a flag would be cache_enter_id() since it
> >> deals with the necessary synchronisation.  I'll see about adding a flag 
> >> now.
> >> In addition to the places it's called now it would need to be called
> >> whenever an inode gains an ACL.
> >
> >Ok, done - should just need to mark the cached ID as invalid when the inode
> >has / gains an ACL.
> 
> Ok, I made some changes using this in my latest acl patch. Can you please
> check?  https://www.netbsd.org/~christos/acl.diff

One problem with the VV_HASACLS thing is that VOP_ACCESS() is called with an
LK_SHARED lock and you can't modify v_vflag with that.

It looks like you have to explicitly enable ACLs with tunefs before the file
system is mounted, and it's not the default?  If that's true then when either
FS_ACLS or FS_POSIX1EACLS is set during mount you could instead not set
IMNT_NCLOOKUP, which would disable the lookup-in-namecache fastpath and
everything would go through VOP_ACCESS() instead.  It's kind of a big hammer
but at least everything is always correct then.  If not I'll dig into it in
more detail tomorrow morning.

Cheers,
Andrew


Re: ACL patch

2020-05-12 Thread Andrew Doran
On Tue, May 12, 2020 at 10:37:27PM +, Andrew Doran wrote:
> On Sun, May 10, 2020 at 09:48:10AM -0400, Christos Zoulas wrote:
> > 
> > > Maybe I missed it but I didn't see a way for cache_lookup_linked() and
> > > cache_have_id() to know a vnode has ACLs.  The presence of ACLs means 
> > > those
> > > routines can't do their imitation of VOP_ACCESS() and need to fail so that
> > > the lookup is handled by VOP_LOOKUP().  To handle that on a per-vnode 
> > > basis
> > > you'd want to flag the "has ACLs" fact when an inode is loaded, and do the
> > > same when an ACL is added to an in-core inode.  I'll look into it over the
> > > next few days unless you want to take care of it.
> > 
> > So I looked into this a bit and genfs_can_access() is also used by  
> > cache_revlookup.
> > I am not sure what to do here. It is simple enough to add a flag to the 
> > vnode to indicate
> > if it has ACLs, but is it the right thing to do?
> 
> I tried to pull down your patch again over the weekend to take a look but it
> was gone.  The best place for a flag would be cache_enter_id() since it
> deals with the necessary synchronisation.  I'll see about adding a flag now.
> In addition to the places it's called now it would need to be called
> whenever an inode gains an ACL.

Ok, done - should just need to mark the cached ID as invalid when the inode
has / gains an ACL.

Andrew


Re: ACL patch

2020-05-12 Thread Andrew Doran
On Sun, May 10, 2020 at 09:48:10AM -0400, Christos Zoulas wrote:
> 
> > Maybe I missed it but I didn't see a way for cache_lookup_linked() and
> > cache_have_id() to know a vnode has ACLs.  The presence of ACLs means those
> > routines can't do their imitation of VOP_ACCESS() and need to fail so that
> > the lookup is handled by VOP_LOOKUP().  To handle that on a per-vnode basis
> > you'd want to flag the "has ACLs" fact when an inode is loaded, and do the
> > same when an ACL is added to an in-core inode.  I'll look into it over the
> > next few days unless you want to take care of it.
> 
> So I looked into this a bit and genfs_can_access() is also used by  
> cache_revlookup.
> I am not sure what to do here. It is simple enough to add a flag to the vnode 
> to indicate
> if it has ACLs, but is it the right thing to do?

I tried to pull down your patch again over the weekend to take a look but it
was gone.  The best place for a flag would be cache_enter_id() since it
deals with the necessary synchronisation.  I'll see about adding a flag now.
In addition to the places it's called now it would need to be called
whenever an inode gains an ACL.

Andrew


Re: ACL patch

2020-05-04 Thread Andrew Doran
On Sat, May 02, 2020 at 07:10:29PM -0400, Christos Zoulas wrote:

> Hi,
> 
> The following patch ports the FreeBSD FFS ACLS (both posix1e and nfs4) to
> NetBSD. Comments? I will let this sit for a week or so and then commit it
> if I don't hear screams.
> 
> [it is ~24K lines, so not posted inline]
> 
> https://www.netbsd.org/~christos/acl.diff

+++ sbin/tunefs/tunefs.8
...
+.It Fl n Cm enable | disable
+Turn on/off the administrative NFSv4 ACL enable flag.

Is that handled?  I don't see it.

--- share/mk/bsd.own.mk 27 Apr 2020 03:15:12 -  1.1187
+++ share/mk/bsd.own.mk 2 May 2020 23:06:42 -
@@ -164,7 +164,9 @@ EXTERNAL_GDB_SUBDIR=/does/not/exist
#
 .if ${MACHINE_ARCH} == "x86_64" || ${MACHINE_ARCH} == "i386" || \
 ${MACHINE_ARCH} == "powerpc64" || ${MACHINE_ARCH} == "powerpc" || \
-${MACHINE_CPU} == "aarch64" || ${MACHINE_CPU} == "arm"
+${MACHINE_CPU} == "aarch64" || ${MACHINE_CPU} == "arm" || \
+${MACHINE_CPU} == "sparc64" || ${MACHINE_CPU} == "sparc" || \
+${MACHINE_CPU} == "mips"

Seems unrelated.

@@ -739,10 +1206,6 @@ genfs_can_chmod(enum vtype type, kauth_c
 {
int error;
 
-   /* The user must own the file. */
-   if (kauth_cred_geteuid(cred) != cur_uid)
-   return (EPERM);
-
/*
 * Unprivileged users can't set the sticky bit on files.
 */
@@ -762,6 +1225,12 @@ genfs_can_chmod(enum vtype type, kauth_c
return (EPERM);
}
 
+   /*
+* Deny setting setuid if we are not the file owner.
+*/
+   if ((new_mode & S_ISUID) && cur_uid != kauth_cred_geteuid(cred))
+   return (EPERM);
+

Are you sure this is correct?

Maybe I missed it but I didn't see a way for cache_lookup_linked() and
cache_have_id() to know a vnode has ACLs.  The presence of ACLs means those
routines can't do their imitation of VOP_ACCESS() and need to fail so that
the lookup is handled by VOP_LOOKUP().  To handle that on a per-vnode basis
you'd want to flag the "has ACLs" fact when an inode is loaded, and do the
same when an ACL is added to an in-core inode.  I'll look into it over the
next few days unless you want to take care of it.

Cheers,
Andrew


> Best,
> 
> 
> christos


Re: RB tree in the buffer cache

2020-04-19 Thread Andrew Doran
I thought about this one some more and came to the conclusion that this is a half
measure, and to do it properly the buffer cache should catch up with the
page cache (per-vnode locking and radixtree).  That's a fair bit of effort,
so I'm going to leave this change for now.

I still plan to merge the changes for sync-on-shutdown to wait on vnode
output instead of buffer output since that's needed with whatever scheme is
chosen to replace the global hash.

Andrew

On Sat, Apr 04, 2020 at 10:53:12PM +0000, Andrew Doran wrote:

> Despite repeatedly messing around with the hash function over the last few
> months, and despite being ~64MB, bufhash often has long chains on my test
> system.
> 
> And, whatever is done to the hash function it will always throw away the
> valuable natural partitioning that's there to begin with, which is that
> buffers are cached with specific vnodes:
> 
> # vmstat -H
>                    total     used     util      num  average  maximum
> hash table       buckets  buckets        %    items    chain    chain
> bufhash          8388608    24222     0.29    36958     1.53      370
> in_ifaddrhash        512        2     0.39        2     1.00        1
> uihash              1024        4     0.39        4     1.00        1
> vcache_hashmask  8388608   247582     2.95   252082     1.02        2
> 
> Changing this to use a per-vnode index makes use of that partitioning, and
> moves things closer to the point where bufcache_lock can be replaced by
> v_interlock in most places (I have not made that replacement yet - it's a
> decent chunk of work).
> 
> Despite the best efforts of all involved the buffer cache code is a bit of a
> minefield because of things the users do.  For example LFS and WAPBL want
> some combination of I/O buffers for COW type stuff, pre-allocated memory,
> and inclusion on the vnode buf lists presumably for vflushbuf(), but those
> buffers shouldn't be returned via incore().  To handle that case I added a
> new flag BC_IOBUF.
> 
> I don't have performance measurements but from lockstat I see about 5-10x
> reduction in contention on bufcache_lock in my tests, which suggests to me
> that less time is being spent in incore().
> 
> Changes here:
> 
>   http://www.netbsd.org/~ad/2020/bufcache.diff
> 
> Comments welcome.
> 
> Thanks,
> Andrew


Re: detecting userconf disable

2020-04-19 Thread Andrew Doran
On Fri, Apr 17, 2020 at 10:49:53PM +0200, Manuel Bouyer wrote:

> Hello,
> As part of my Xen HVM work, I'd like to be able to disable Xen
> support from userconf to boot as a plain x86 emulated host
> 
> The obvious way would be to disable the
> hypervisor* at mainbus? # Xen hypervisor
> device. But, for CPU setup I need to do some Xen setup before attaching the
> hypervisor device. And I need CPUs to be set up to attach the hypervisor
> device. So I have a xen_hvm_init() function called early.
> Is there a way, in this function, to detect if hypervisor has been disabled by
> userconf ?

I know little about userconf but RB_MD4 (boot -4) is free on x86 which may
be another option.

Andrew


RB tree in the buffer cache

2020-04-04 Thread Andrew Doran
Despite repeatedly messing around with the hash function over the last few
months, and despite being ~64MB, bufhash often has long chains on my test
system.

And, whatever is done to the hash function it will always throw away the
valuable natural partitioning that's there to begin with, which is that
buffers are cached with specific vnodes:

# vmstat -H
                   total     used     util      num  average  maximum
hash table       buckets  buckets        %    items    chain    chain
bufhash          8388608    24222     0.29    36958     1.53      370
in_ifaddrhash        512        2     0.39        2     1.00        1
uihash              1024        4     0.39        4     1.00        1
vcache_hashmask  8388608   247582     2.95   252082     1.02        2

Changing this to use a per-vnode index makes use of that partitioning, and
moves things closer to the point where bufcache_lock can be replaced by
v_interlock in most places (I have not made that replacement yet - it's a
decent chunk of work).
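
As a rough illustration of what a per-vnode index can look like (struct
buf_sketch, v_buftree and the helper names are made up; the real change is in
the diff linked below), using the in-tree rbtree(3) keyed on the logical
block number:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/rbtree.h>

struct buf_sketch {
	rb_node_t	b_node;		/* linkage in the per-vnode tree */
	daddr_t		b_lblkno;	/* key: logical block number */
};

static signed int
buf_compare_nodes(void *ctx, const void *a, const void *b)
{
	const struct buf_sketch *ba = a, *bb = b;

	if (ba->b_lblkno < bb->b_lblkno)
		return -1;
	return ba->b_lblkno > bb->b_lblkno;
}

static signed int
buf_compare_key(void *ctx, const void *n, const void *key)
{
	const struct buf_sketch *bn = n;
	const daddr_t *lblkno = key;

	if (bn->b_lblkno < *lblkno)
		return -1;
	return bn->b_lblkno > *lblkno;
}

static const rb_tree_ops_t buf_tree_ops = {
	.rbto_compare_nodes = buf_compare_nodes,
	.rbto_compare_key = buf_compare_key,
	.rbto_node_offset = offsetof(struct buf_sketch, b_node),
};

static void
vnode_buftree_init(rb_tree_t *v_buftree)
{
	rb_tree_init(v_buftree, &buf_tree_ops);
}

/* incore()-style lookup: the vnode itself partitions the keys. */
static struct buf_sketch *
incore_sketch(rb_tree_t *v_buftree, daddr_t lblkno)
{
	return rb_tree_find_node(v_buftree, &lblkno);
}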

Despite the best efforts of all involved the buffer cache code is a bit of a
minefield because of things the users do.  For example LFS and WAPBL want
some combination of I/O buffers for COW type stuff, pre-allocated memory,
and inclusion on the vnode buf lists presumably for vflushbuf(), but those
buffers shouldn't be returned via incore().  To handle that case I added a
new flag BC_IOBUF.

I don't have performance measurements but from lockstat I see about 5-10x
reduction in contention on bufcache_lock in my tests, which suggests to me
that less time is being spent in incore().

Changes here:

http://www.netbsd.org/~ad/2020/bufcache.diff

Comments welcome.

Thanks,
Andrew


Atomic vcache_tryvget()

2020-04-04 Thread Andrew Doran
Hi,

This change makes vcache_tryvget() do its work by means of a single atomic
operation, which removes the need to take v_interlock beforehand.

There are two reasons I want that: it reduces the contention on v_interlock
during build.sh to near-zero, and it gets us a lot closer to a situation
where it's possible to use vcache_tryvget() where the caller cannot tolerate
blocking, for example in a pserialize/RCU type lookup where no other locks
are held.

Unfortunately the change is a little bit ugly.  Two things need to be tested
simultaneously and are combined into a single 32-bit field: vnode state and
vnode usecount.

The core of the change is here:

http://www.netbsd.org/~ad/2020/vget1.diff

Boring changes for vp->v_usecount -> vusecount(vp) are here:

http://www.netbsd.org/~ad/2020/vget2.diff

Comments welcome.

Thanks,
Andrew



Reduce contention on fstrans_lock

2020-04-04 Thread Andrew Doran
Over the course of a long job with lots of forks like build.sh, small but
frequent contention on fstrans_lock can start to add up.  This cures it by
maintaining the global list of fstrans_lwp_info in a pool cache constructor
and destructor, much the same pattern as is done for struct file and struct
vm_amap.
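
The pattern, sketched with made-up names (not the real fstrans code): the
pool cache constructor and destructor keep the global list consistent, so
the global lock is only taken when the per-CPU layer of the cache misses,
not on every LWP's first use.

#include <sys/mutex.h>
#include <sys/pool.h>
#include <sys/queue.h>

struct flwp_sketch {
	LIST_ENTRY(flwp_sketch)	fls_list;
};

static kmutex_t fls_lock;
static LIST_HEAD(, flwp_sketch) fls_head = LIST_HEAD_INITIALIZER(fls_head);
static pool_cache_t fls_cache;

static int
fls_ctor(void *arg, void *obj, int flags)
{
	struct flwp_sketch *fls = obj;

	mutex_enter(&fls_lock);
	LIST_INSERT_HEAD(&fls_head, fls, fls_list);
	mutex_exit(&fls_lock);
	return 0;
}

static void
fls_dtor(void *arg, void *obj)
{
	struct flwp_sketch *fls = obj;

	mutex_enter(&fls_lock);
	LIST_REMOVE(fls, fls_list);
	mutex_exit(&fls_lock);
}

static void
fls_init(void)
{
	mutex_init(&fls_lock, MUTEX_DEFAULT, IPL_NONE);
	fls_cache = pool_cache_init(sizeof(struct flwp_sketch), 0, 0, 0,
	    "flwpsk", NULL, IPL_NONE, fls_ctor, fls_dtor, NULL);
}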

fstrans_mount_lock is taken in the same path but less work is done with that
lock held, so it doesn't really show up in the profile for me.

http://www.netbsd.org/~ad/2020/fstrans_lock.diff

Any comments?

Thanks,
Andrew


Re: Please review: lookup changes

2020-04-04 Thread Andrew Doran
On Wed, Mar 11, 2020 at 05:40:59PM +0100, J. Hannken-Illjes wrote:

> > On 5. Mar 2020, at 23:48, Andrew Doran  wrote:
> > 
> > Hi,
> > 
> > I'd like to merge the changes on the ad-namecache branch, and would
> > appreciate any code review.  The motivation for these changes is, as
> > you might guess, performance.  I put an overview below.  I won't be
> > merging this for a couple of weeks in any case.
> 
> [snip]
> 
> > vfs_vnode.c:
> > 
> > Namecache related changes, and the change to not block new vnode
> > references across VOP_INACTIVE() mentioned on tech-kern.  I don't
> > think this blocking is needed at least for well behaved FSes like
> > ffs & tmpfs.
> 
> I suppose this blocking is no longer needed.  It originates from
> the early steps to clean up the vnode lifecycle.  If anything
> goes wrong it is quite simple to undo this change.
> 
> Blocking further references should only be needed for vrecycle().
> 
> We have to terminate vrelel() early if we got new references,
> diff attached.

I was wondering about the same the other day.  Changes applied, thank you!

Andrew


Re: Please review: lookup changes

2020-03-17 Thread Andrew Doran
On Sun, Mar 08, 2020 at 09:22:28PM +, Taylor R Campbell wrote:

> Maybe we could use a critbit tree?  It will have cache utilization
> characteristics similar to a red/black tree (fanout is 2 either way),
> but the lookup can be pserialized and each step may be a little
> cheaper computationally.
> 
> https://mumble.net/~campbell/hg/critbit/
> 
> Alternatively, there is rmind's thmap already in tree.

I had a look and this is a really interesting data structure.  I think this
plus a seqlock, and a nice reclamation method would do great.  My only
concern is that single threaded performance should be top notch too because
single component name lookups are often more frequent even than traps and
syscalls put together.  I suppose it's a case of trying it.

Another application I can think of is _lwp_unpark() although the minor
complication there is the radixtree is now used to allocate LIDs too by
scanning for unused slots.

Andrew


Re: Patch: allow concurrent faults on a single uvm_object

2020-03-17 Thread Andrew Doran
Updated patch for current as of right now.  Only one change, fix an invalid
pointer deref in uvmfault_promote():

http://www.netbsd.org/~ad/2020/fault2.diff

It's not high-tech, no RCU or other fancy stuff, but it does cut system time
during build.sh by 40-50% for me.

Cheers,
Andrew

On Sun, Mar 15, 2020 at 09:40:57PM +0000, Andrew Doran wrote:

> Hi,
> 
> This addresses the problem with lock contention due to page faults on libc.so
> and other shared objects.  It also allows for concurrent faults on shared
> amaps, for example PostgreSQL's shared buffer.
> 
> The approach is:
> 
> - Add a PGO_NOBUSY flag that has getpages return pages without busying them. 
>   It has to come with PGO_LOCKED and VM_PROT_WRITE == 0.  This allows
>   getpages to be called with a RW_READER lock for the in-core case.
> 
> - For each fault, keep track of the type of lock on the upper & lower
>   objects.  Start out with the assumption that RW_READER is good for both
>   and then make some guesses to refine that.  If the fault runs into trouble
>   because it has a RW_READER lock, try to upgrade to RW_WRITER and if that
>   fails, give up and restart the fault to happen from the beginning with a
>   RW_WRITER lock.
> 
> - Add a couple of "vmstat -s" counters to see how it does in practice.
> 
> - Allow some more uvm_map() operations with RW_READER on the underlock
>   objects.
> 
> - Cut contention from v_interlock in genfs_getpages by adding a flag that's
>   set earlier with both v_interlock + vmobjlock set.
> 
> Changes here:
> 
>   http://www.netbsd.org/~ad/2020/fault.diff
> 
> Comments welcome.
> 
> Thanks,
> Andrew


Patch: allow concurrent faults on a single uvm_object

2020-03-15 Thread Andrew Doran
Hi,

This addresses the problem with lock contention due to page faults on libc.so
and other shared objects.  It also allows for concurrent faults on shared
amaps, for example PostgreSQL's shared buffer.

The approach is:

- Add a PGO_NOBUSY flag that has getpages return pages without busying them. 
  It has to come with PGO_LOCKED and VM_PROT_WRITE == 0.  This allows
  getpages to be called with a RW_READER lock for the in-core case.

- For each fault, keep track of the type of lock on the upper & lower
  objects.  Start out with the assumption that RW_READER is good for both
  and then make some guesses to refine that.  If the fault runs into trouble
  because it has a RW_READER lock, try to upgrade to RW_WRITER and if that
  fails, give up and restart the fault to happen from the beginning with a
  RW_WRITER lock (sketched below).

- Add a couple of "vmstat -s" counters to see how it does in practice.

- Allow some more uvm_map() operations with RW_READER on the underlock
  objects.

- Cut contention from v_interlock in genfs_getpages by adding a flag that's
  set earlier with both v_interlock + vmobjlock set.
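
A rough sketch of the upgrade-or-restart step from the list above (the helper
is made up for illustration; the real logic is in fault.diff):

#include <sys/types.h>
#include <sys/rwlock.h>

/*
 * Returns true if the caller now holds the lock RW_WRITER and can carry on;
 * returns false if the lock has been dropped and the whole fault must be
 * restarted, this time taking RW_WRITER from the start.
 */
static bool
fault_upgrade_or_restart(krwlock_t *lock, bool *want_writer)
{
	if (rw_write_held(lock))
		return true;		/* already exclusive */
	if (rw_tryupgrade(lock))
		return true;		/* upgraded in place, nobody slept */
	rw_exit(lock);			/* can't upgrade: back out... */
	*want_writer = true;		/* ...and redo the fault as a writer */
	return false;
}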

Changes here:

http://www.netbsd.org/~ad/2020/fault.diff

Comments welcome.

Thanks,
Andrew


Re: libc.so's vmobjlock / v_interlock

2020-03-15 Thread Andrew Doran
On Tue, Jan 21, 2020 at 09:46:03PM +, Andrew Doran wrote:

> - PV lists in the pmap need explicit locking (unless the pmap uses a global
>   lock, which is fine if it suits the machine).  I have done the PV list
>   locking for x86 and it's in the patch.

To follow up, the x86 changes are in and I'm checking over the other
pmaps at the moment.  Most are in good shape.  Anyway this detangles the x86
pmap from UVM object locks.  It doesn't care any more *.  (Other than e.g. 
needing to hold a lock across pmap_remove() -> pmap_update(), to stop pages
gaining a new identity while still visible somewhere via stale TLB entry.)

It would be nice to track all these mappings in batch, at the uvm_map_entry
level or something like that, because the PV tracking scheme uses loads of
memory and causes big cache contention.

Andrew


Re: Adaptive RW locks

2020-03-10 Thread Andrew Doran
Following up again..  It turns out the deadlocks I saw with this were down
to a problem with kernel_lock that has since been solved.

I made a couple more tweaks to it: up the max number of tracked holds to 8
because vmobjlock became a rwlock, and, because aiodoned is gone now, be more
selective about giving up and going to sleep in the softint case (walk down
the list of pinned LWPs on curcpu and see if any of them holds the lock that's
wanted; if it's not held on curcpu then some other CPU has it, and it's OK to
spin).

http://www.netbsd.org/~ad/2020/rwlock3.diff

I'm going to sit on this change for a while longer though to make sure there
is no performance regression or other problems.

Andrew


Re: Please review: lookup changes

2020-03-09 Thread Andrew Doran
On Mon, Mar 09, 2020 at 07:59:03PM +, Andrew Doran wrote:

> > Why don't we check our watch immediately when we set IN_CHANGE &c.,
> > and defer the disk write to update the inode like UFS_WAPBL_UPDATE
> > does -- and just skip UFS_ITIMES in VOP_GETATTR?
> 
> This looks to go back at least as far as Unix 32V (& possibly further) in one
> form or another, from what I can see.  Whatever the original intent I think
> it might have morphed into an attempt to avoid querying the system clock at
> some point, which is about the only point I can see for it now.

Considering this originates in a different era..  I suppose it might also
yield a better chance of observing distinct timestamps, if the system had
poor clock resolution.

Andrew


Re: Please review: lookup changes

2020-03-09 Thread Andrew Doran
On Sun, Mar 08, 2020 at 09:22:28PM +, Taylor R Campbell wrote:

> > Date: Thu, 5 Mar 2020 22:48:25 +
> > From: Andrew Doran 
> > 
> > I'd like to merge the changes on the ad-namecache branch, and would
> > appreciate any code review.  The motivation for these changes is, as
> > you might guess, performance.  I put an overview below.  I won't be
> > merging this for a couple of weeks in any case.
> 
> Thanks for working on this.  Lookups are a big point of contention, so
> it's worth putting effort into fixing that.  At the same time, I want
> to make sure we don't regress into the vfs locking insanity we had ten
> years ago; dholland@ and especially hannken@ have put a lot of work
> into cleaning it up and making it understandable and maintainable.
> Please make sure to wait until hannken@ has had an opportunity weigh
> in on it.

Absolutely.  Some philosophical ramblings:

I see the namecache as a thing that allows you to avoid re-doing expensive
stuff that you've already done once before - whether that's disc I/O and
metadata crawls (old application) or multiprocessor locking (the app here).

Were it 1993 I might well be up for ripping out vnode locking and going back
to inode locking, but we're way too far down the chosen path now, and there
has been major effort invested by many people over the years to work the
vnode locking into shape, as you note.  It works very well now, and I don't
think we should mess with it too much.

With that in mind I have tried do this in a way that requires as few changes
to the vnode locking and file systems as possible: use the namecache to do
things fast whenever possible, but preserve the strongly locked path.

> > If the file system exports IMNT_NCLOOKUP, this now tries to "fast
> > forward" through as many component names as possible by looking
> > directly in the namecache, without taking vnode locks nor vnode
> > refs, and stops when the leaf is reached or there is something that
> > can't be handled using the namecache.  In the most optimistic case a
> > vnode reference is taken only at the root and the leaf. 
> > IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.
> 
> What file systems can't support IMNT_NCLOOKUP and why?

Good question.  Other than layered file systems I don't know for sure, but I
think it can be done for most FSes.  NFS might have its own complications; I
haven't looked yet.  IMNT_NCLOOKUP is more of a feature gate than anything
else.  The file system hooks are unintrusive and few in number, but adding
those hooks takes careful, time-consuming code inspection to make sure
nothing is missed.
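
To illustrate how small the opt-in itself is (hypothetical file system name;
the careful audit of its lookup path is the real work):

#include <sys/mount.h>

static void
examplefs_enable_fast_lookup(struct mount *mp)
{
	/* Namecache fast-forward and LK_SHARED lookups are both safe here. */
	mp->mnt_iflag |= IMNT_NCLOOKUP | IMNT_SHRLOOKUP;
}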

> > Layered file systems can't work that way (names are cached below),
> 
> It seems there's a tradeoff here:
> 
> 1. We could cache layered vnodes, at the cost of doubling the size of
>the namecache.
> 
> 2. We could _not_ cache layered vnodes, at the cost of having to redo
>layer_node_create and vcache_get for every lookup -- a process
>which scales worse than namecache lookups (global vcache_lock).
> 
> Am I missing anything else about this?  Is this tradeoff the only
> reason that it's still up to the vnode operations to call cache_lookup
> and cache_enter, rather than something namei does directly -- so that
> layer_lookup can skip them?

As far as I recall layer vnodes persist unless recycled like any other
vnode, so there should be nothing special happening there around contention
due to lifecycle.  I think you could cache or transpose the cached names
above, maybe using some kind of VOP.  I definitely think it's doable but
beyond that, at least for me, there are lots of unknowns so I have avoided it.

> > and the permissions check won't work for anything unusual like ACLs,
> > so there is also IMNT_SHRLOOKUP.  If the file system exports
> > IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
> > RENAME & friends) is tried first with an LK_SHARED lock on the
> > directory.  For an FS that can do the whole lookup this way (e.g. 
> > tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
> > VOP_LOOKUP() if something tricky comes up, and the lookup gets
> > retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
> > metadata).
> 
> Except for access to the create/rename/remove hints (which we should
> do differently anyway [*]), in what circumstances does ufs lookup need
> an exclusive lock?  Why does scanning directory metadata require that?
>
> [*] Specifically: We should push the last component lookup into
> directory operations, and leave it to *fs_rename/remove/&c. to i

PG_BUSY and read locks in the VM system

2020-03-08 Thread Andrew Doran
The recent change to make vmobjlock & friends into RW locks is still a bit
useless because while PG_BUSY can be observed or held off while holding a
RW_READER lock, nothing can be done if it's set.  To fix that I'd like to
change it so that the WANTED flag is stored in pg->pqflags and covered by
pg->interlock.

From there it's not a huge stretch to imagine how you might do basic page
fault handling for already in-core pages while holding only a read lock.
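
A hedged sketch of what that would permit (PQ_WANTED is the proposed flag
placement; the surrounding fault code and the wakeup side are elided):

#include <sys/param.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/rwlock.h>
#include <uvm/uvm.h>

static void
wait_for_busy_page_sketch(struct vm_page *pg, krwlock_t *objlock)
{
	/* Caller has just seen PG_BUSY with the object lock held. */
	mutex_enter(&pg->interlock);
	pg->pqflags |= PQ_WANTED;	/* covered by pg->interlock */
	rw_exit(objlock);		/* reader or writer, no longer needed */

	/* PNORELOCK: return with the page interlock dropped as well. */
	(void)mtsleep(pg, PVM | PNORELOCK, "pgbusy", 0, &pg->interlock);

	/* The caller restarts the fault from scratch. */
}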

http://www.netbsd.org/~ad/2020/wanted.diff

Comments?

Thanks,
Andrew


Object dirtyness

2020-03-08 Thread Andrew Doran
With the merge of yamt-pagecache we have solid tracking of page dirtyness. 
Dirtyness tracking of the containing uvm_object/vnode is not so solid by
comparison.  For example you can call uvm_pagemarkdirty() on a vnode page,
but letting ioflush know about that is a separate step.  When the page &
uvm_object state aren't consistent we get panics.

I'd like to change it so that:

- the vnode/uvm_object gets put on the syncer worklist as soon as it gains a
  first dirty page, by having uvm_pagemarkdirty() do it.

- VI_WRMAPDIRTY is eliminated.  I think it was really only to stop dirty
  metadata buffers muddying the waters by implying dirty uvm_object, but the
  radix tree can be checked cheaply & directly now.

http://www.netbsd.org/~ad/2020/dirty.diff

Comments?

Thanks,
Andrew


Re: Please review: lookup changes

2020-03-08 Thread Andrew Doran
On Thu, Mar 05, 2020 at 10:48:25PM +, Andrew Doran wrote:

>   http://www.netbsd.org/~ad/2020/namecache.diff
>   http://www.netbsd.org/~ad/2020/vfs_lookup.c
>   http://www.netbsd.org/~ad/2020/vfs_cache.c

This was missing one small change:

http://www.netbsd.org/~ad/2020/namecache2.diff

To compile it also needs this first:

cd src/sys/sys
USETOOLS=no make namei

Andrew


Re: Please review: lookup changes

2020-03-08 Thread Andrew Doran
On Sat, Mar 07, 2020 at 12:14:05AM +0100, Mateusz Guzik wrote:

> On 3/5/20, Andrew Doran  wrote:
> > vfs_cache.c:
> >
> >   I changed the namecache to operate using a per-directory index of
> >   names rather than a global index, and changed the reclamation
> >   mechanism for cache entries to ape the VM system's CLOCK algorithm
> >   rather than using LRU / pseudo-LRU.  This made concurrency easier to
> >   manage, and the nasty locking scheme from 5.0 is gone.
> >
> 
> I don't think rb trees (or in fact anything non-hash) is a good idea for
> this purpose. See way below.

Right, I've admitted to the fact that rbtree is not ideal here.  But it is
"good enough", and I'm happy for someone to replace it with some other data
structure later on.
 
> I doubt aging namecache entries is worth the complexity. Note that said
> very likely this code does not really win anything in practice.

If there was a 1:1 relationship I would agree but due to links (including
dot and dotdot) it's N:M and I don't fancy changing that behaviour too much
right now.  Definitely something to consider in the future.
 
> >   I also changed the namecache to cache permission information for
> >   directories, so that it can be used to do single component lookups
> >   fully in-cache for file systems that can work that way.
> >
> 
> Low priority, but I think this comes with a bug. kauth() has this support
> for "listeners" and so happens in the base system at least there is
> nobody hooked for vnode lookup. Key point being that normally the vnode
> is passed in shared-locked, but fast path lookup avoids this and
> consequently if there is a real-world uesr of the feature somewhere, they
> may be silently broken.

Good point.  Yes that aspect of kauth is a bit of a mess.  Plus it adds a
lot of CPU overhead.  It could do with its own solution for concurrency
that's straightforward, easy to maintain and doesn't have too many tentacles
that depend on concurrency rules in the various subsystems.

> I had a similar problem with the MAC framework and devised a simple flag
> which can be checked upfront to disable fast path lookup.
> 
> So this would have a side effect of removing the call in the first place.
> genfs_can_access is rather pessimal for checking VEXEC. I wrote a dedicated
> routine which only checks for VEXEC and starts with a check for
> (mode & 0111) == 0111 upfront. It happens to almost always be true.
> 
> >   This is described in the comments in the file.
> >
> > vfs_getcwd.c:
> >
> >   The design pattern here was to hold vnode locks as long as possible
> >   and clearly delineate the handoff from one lock to another.  It's a
> >   nice way of working but not good for MP performance, and not needed
> >   for correctness.  I pushed the locks back as far as possible and try
> >   to do the lookup in the namecache without taking vnode locks (not
> >   needed to look at namecache).
> >
> > vfs_lookup.c:
> >
> >   Like vfs_getcwd.c the vnode locking is pushed back far as possible.
> >
> >   If the file system exports IMNT_NCLOOKUP, this now tries to "fast
> >   forward" through as many component names as possible by looking
> >   directly in the namecache, without taking vnode locks nor vnode
> >   refs, and stops when the leaf is reached or there is something that
> >   can't be handled using the namecache.  In the most optimistic case a
> >   vnode reference is taken only at the root and the leaf.
> >   IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.
> >
> >   Layered file systems can't work that way (names are cached below),
> >   and the permissions check won't work for anything unusual like ACLs,
> >   so there is also IMNT_SHRLOOKUP.  If the file system exports
> >   IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
> >   RENAME & friends) is tried first with an LK_SHARED lock on the
> >   directory.  For an FS that can do the whole lookup this way (e.g.
> >   tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
> >   VOP_LOOKUP() if something tricky comes up, and the lookup gets
> >   retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
> >   metadata).
> >
> >   I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
> >   lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.
> >
> > vfs_syscalls.c:
> >
> >   Corrected v

Please review: lookup changes

2020-03-05 Thread Andrew Doran
Hi,

I'd like to merge the changes on the ad-namecache branch, and would
appreciate any code review.  The motivation for these changes is, as
you might guess, performance.  I put an overview below.  I won't be
merging this for a couple of weeks in any case.

Thanks,
Andrew


Changes here, with a couple of whole files for ease of reading:

http://www.netbsd.org/~ad/2020/namecache.diff
http://www.netbsd.org/~ad/2020/vfs_lookup.c
http://www.netbsd.org/~ad/2020/vfs_cache.c

vfs_cache.c:

I changed the namecache to operate using a per-directory index of
names rather than a global index, and changed the reclamation
mechanism for cache entries to ape the VM system's CLOCK algorithm
rather than using LRU / pseudo-LRU.  This made concurrency easier to
manage, and the nasty locking scheme from 5.0 is gone.

I also changed the namecache to cache permission information for
directories, so that it can be used to do single component lookups
fully in-cache for file systems that can work that way.

This is described in the comments in the file.

vfs_getcwd.c:

The design pattern here was to hold vnode locks as long as possible
and clearly delineate the handoff from one lock to another.  It's a
nice way of working but not good for MP performance, and not needed
for correctness.  I pushed the locks back as far as possible and try
to do the lookup in the namecache without taking vnode locks (not
needed to look at namecache).

vfs_lookup.c:

Like vfs_getcwd.c the vnode locking is pushed back far as possible.

If the file system exports IMNT_NCLOOKUP, this now tries to "fast
forward" through as many component names as possible by looking
directly in the namecache, without taking vnode locks nor vnode
refs, and stops when the leaf is reached or there is something that
can't be handled using the namecache.  In the most optimistic case a
vnode reference is taken only at the root and the leaf. 
IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.

Layered file systems can't work that way (names are cached below),
and the permissions check won't work for anything unusual like ACLs,
so there is also IMNT_SHRLOOKUP.  If the file system exports
IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
RENAME & friends) is tried first with an LK_SHARED lock on the
directory.  For an FS that can do the whole lookup this way (e.g. 
tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
VOP_LOOKUP() if something tricky comes up, and the lookup gets
retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
metadata).
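
A hedged sketch of the retry, not the actual vfs_lookup.c code (lock and
reference handling of the result vnode is elided):

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/namei.h>
#include <sys/vnode.h>

static int
lookup_once_sketch(struct vnode *dvp, struct vnode **vpp,
    struct componentname *cnp, bool shared_ok)
{
	int error;

	(void)vn_lock(dvp, (shared_ok ? LK_SHARED : LK_EXCLUSIVE) | LK_RETRY);
	error = VOP_LOOKUP(dvp, vpp, cnp);
	if (error == ENOLCK && shared_ok) {
		/* The FS hit something it can't do under a shared lock. */
		VOP_UNLOCK(dvp);
		(void)vn_lock(dvp, LK_EXCLUSIVE | LK_RETRY);
		error = VOP_LOOKUP(dvp, vpp, cnp);
	}
	VOP_UNLOCK(dvp);
	return error;
}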

I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.

vfs_syscalls.c:

Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
That's a shame and it would be nice to use a seqlock or something
there eventually.

vfs_vnode.c:

Namecache related changes, and the change to not block new vnode
references across VOP_INACTIVE() mentioned on tech-kern.  I don't
think this blocking is needed at least for well behaved FSes like
ffs & tmpfs.

Performance:

During a kernel build on my test system it yields about a 15% drop
in system time:

HEAD     79.16s real  1602.97s user   419.54s system
changes  77.26s real  1564.38s user   368.41s system

The rbtree lookup can be assumed to be slightly slower than a hash
table lookup and microbenchmarking shows this sometimes in the
single CPU case, but as part of namei() it ends up being ameliorated
somewhat by the reduction in locking costs, the absence of hash
bucket collisions with rbtrees, and better L2/L3 cache use on hot
directories.  I don't have numbers but I know joerg@ saw significant
speedups on pbulk builds with an earlier version of this code with
little changed from HEAD other than hash -> rbtree.  Anyway rbtree
isn't set in stone as the data structure, it's just what we have in
tree that's most suitable right now.  It can be changed.

That aside, here's the result of a little benchmark, with 1-12 cores
in a single CPU package all doing access() on a single file name in
a kernel compile directory:

http://www.netbsd.org/~ad/graph.png

The gap between green and blue lines is the result of the changes;
basically better cache usage and the number of atomic ops for each
path name component down to 2, from 7 or 10 depending on arch.

Red 

Re: panic: softint screwup

2020-02-17 Thread Andrew Doran
On Wed, Feb 12, 2020 at 11:09:24AM +, Andrew Doran wrote:
> On Tue, Feb 11, 2020 at 07:26:29AM +, Nick Hudson wrote:
> > On 04/02/2020 23:17, Andrew Doran wrote:
> > > On Tue, Feb 04, 2020 at 07:03:28AM -0400, Jared McNeill wrote:
> > >
> > >> First time seeing this one.. an arm64 board sitting idle at the login 
> > >> prompt
> > >> rebooted itself with this panic. Unfortunately the default ddb.onpanic=0
> > >> strikes again and I can't get any more information than this:
> > >
> > > I added this recently to replace a vague KASSERT.  Thanks for grabbing the
> > > output.
> > >
> > >> [ 364.3342263] curcpu=0, spl=4 curspl=7
> > >> [ 364.3342263] onproc=0x00237f743080 => l_stat=7 l_flag=2201 
> > >> l_cpu=0
> > >> [ 364.3342263] curlwp=0x00237f71e580 => l_stat=1 l_flag=0200 
> > >> l_cpu=0
> > >> [ 364.3342263] pinned=0x00237f71e100 => l_stat=7 l_flag=0200 
> > >> l_cpu=0
> > >> [ 364.3342263] panic: softint screwup
> > >> [ 364.3342263] cpu0: Begin traceback...
> > >> [ 364.3342263] trace fp ffc101da7be0
> > >> [ 364.3342263] fp ffc101da7c00 vpanic() at ffc0004ad728 
> > >> netbsd:vpanic+0x160
> > >> [ 364.3342263] fp ffc101da7c70 panic() at ffc0004ad81c 
> > >> netbsd:panic+0x44
> > >> [ 364.3342263] fp ffc101da7d40 softint_dispatch() at 
> > >> ffc00047bda4 netbsd:softint_dispatch+0x5c4
> > >> [ 364.3342263] fp ffc101d9fc30 cpu_switchto_softint() at 
> > >> ffc85198 netbsd:cpu_switchto_softint+0x68
> > >> [ 364.3342263] fp ffc101d9fc80 splx() at ffc040d4 
> > >> netbsd:splx+0xbc
> > >> [ 364.3342263] fp ffc101d9fcb0 callout_softclock() at 
> > >> ffc000489e04 netbsd:callout_softclock+0x36c
> > >> [ 364.3342263] fp ffc101d9fd40 softint_dispatch() at 
> > >> ffc00047b8dc netbsd:softint_dispatch+0xfc
> > >> [ 364.3342263] fp ffc101d3fcc0 cpu_switchto_softint() at 
> > >> ffc85198 netbsd:cpu_switchto_softint+0x68
> > >> [ 364.3342263] fp ffc101d3fdf8 cpu_idle() at ffc86128 
> > >> netbsd:cpu_idle+0x58
> > >> [ 364.3342263] fp ffc101d3fe40 idle_loop() at ffc0004546a4 
> > >> netbsd:idle_loop+0x174
> > >
> > > Something has cleared the LW_RUNNING flag on softclk/0 between where it is
> > > set (unlocked) at line 884 of kern_softint.c and callout_softclock().
> > 
> > Isn't it the case that softclk/0 is the victim/interrupted LWP for a 
> > soft{serial,net,bio}.
> > That's certainly how I read the FP values.
> > 
> > the callout handler blocked and softclk/0 became a victim as well maybe?
> > 
> > http://src.illumos.org/source/xref/netbsd-src/sys/kern/kern_synch.c#687
> > 
> > a soft{serial,net,bio} happens before curlwp is changed away from the 
> > blocking softint thread
> 
> I suspect putting the RUNNING flag back into l_pflag will cure it, since
> the update of l_flag without the LWP locked is dodgy. I can't think of
> something that would clobber the update, but it is breaking the rules so
> to speak..
> 
> I'll do just that on Saturday once back in front of a real computer.

Change made on Saturday.

Andrew


Re: panic: softint screwup

2020-02-12 Thread Andrew Doran
On Tue, Feb 11, 2020 at 07:26:29AM +, Nick Hudson wrote:
> On 04/02/2020 23:17, Andrew Doran wrote:
> > On Tue, Feb 04, 2020 at 07:03:28AM -0400, Jared McNeill wrote:
> >
> >> First time seeing this one.. an arm64 board sitting idle at the login 
> >> prompt
> >> rebooted itself with this panic. Unfortunately the default ddb.onpanic=0
> >> strikes again and I can't get any more information than this:
> >
> > I added this recently to replace a vague KASSERT.  Thanks for grabbing the
> > output.
> >
> >> [ 364.3342263] curcpu=0, spl=4 curspl=7
> >> [ 364.3342263] onproc=0x00237f743080 => l_stat=7 l_flag=2201 
> >> l_cpu=0
> >> [ 364.3342263] curlwp=0x00237f71e580 => l_stat=1 l_flag=0200 
> >> l_cpu=0
> >> [ 364.3342263] pinned=0x00237f71e100 => l_stat=7 l_flag=0200 
> >> l_cpu=0
> >> [ 364.3342263] panic: softint screwup
> >> [ 364.3342263] cpu0: Begin traceback...
> >> [ 364.3342263] trace fp ffc101da7be0
> >> [ 364.3342263] fp ffc101da7c00 vpanic() at ffc0004ad728 
> >> netbsd:vpanic+0x160
> >> [ 364.3342263] fp ffc101da7c70 panic() at ffc0004ad81c 
> >> netbsd:panic+0x44
> >> [ 364.3342263] fp ffc101da7d40 softint_dispatch() at ffc00047bda4 
> >> netbsd:softint_dispatch+0x5c4
> >> [ 364.3342263] fp ffc101d9fc30 cpu_switchto_softint() at 
> >> ffc85198 netbsd:cpu_switchto_softint+0x68
> >> [ 364.3342263] fp ffc101d9fc80 splx() at ffc040d4 
> >> netbsd:splx+0xbc
> >> [ 364.3342263] fp ffc101d9fcb0 callout_softclock() at ffc000489e04 
> >> netbsd:callout_softclock+0x36c
> >> [ 364.3342263] fp ffc101d9fd40 softint_dispatch() at ffc00047b8dc 
> >> netbsd:softint_dispatch+0xfc
> >> [ 364.3342263] fp ffc101d3fcc0 cpu_switchto_softint() at 
> >> ffc85198 netbsd:cpu_switchto_softint+0x68
> >> [ 364.3342263] fp ffc101d3fdf8 cpu_idle() at ffc86128 
> >> netbsd:cpu_idle+0x58
> >> [ 364.3342263] fp ffc101d3fe40 idle_loop() at ffc0004546a4 
> >> netbsd:idle_loop+0x174
> >
> > Something has cleared the LW_RUNNING flag on softclk/0 between where it is
> > set (unlocked) at line 884 of kern_softint.c and callout_softclock().
> 
> Isn't it the case that softclk/0 is the victim/interrupted LWP for a 
> soft{serial,net,bio}.
> That's certainly how I read the FP values.
> 
> the callout handler blocked and softclk/0 became a victim as well maybe?
> 
> http://src.illumos.org/source/xref/netbsd-src/sys/kern/kern_synch.c#687
> 
> a soft{serial,net,bio} happens before curlwp is changed away from the 
> blocking softint thread

I suspect putting the RUNNING flag back into l_pflag will cure it, since
the update of l_flag without the LWP locked is dodgy. I can't think of
something that would clobber the update, but it is breaking the rules so
to speak..

I'll do just that on Saturday once back in front of a real computer.

Andrew


Re: Kernel (9.99.44) responsiveness issues

2020-02-04 Thread Andrew Doran
On Sun, Feb 02, 2020 at 01:02:58PM +0100, Kamil Rytarowski wrote:

> I keep observing responsiveness issues on NetBSD-current. This happened
> in last 2 months.
> 
> Whenever I start building something with -j${CORES}, I have significant
> delays of responsiveness in other applications.

What does your disk I/O look like during this time?  Are you using ffs
logging?  Do you have a tmpfs /tmp or a custom TMPDIR set?  You have enough
CPUs there to hammer the disk with a compile job given the right
circumstances.

At the end of the crash(8) manual page someone has added a nice recipe to
get backtraces from all of the kernel stacks.  If you added a "grep tstile"
in the pipeline you could see where these particular threads are blocking.

Andrew

> load averages:  2.69,  5.56,  6.22;   up 0+01:32:42 12:12:34
> 71 processes: 69 sleeping, 2 on CPU
> CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.1% interrupt, 99.8%
> idle
> Memory: 19G Act, 9639M Inact, 416K Wired, 34M Exec, 19G File, 43M Free
> Swap: 64G Total, 64G Free
> 
>   PID USERNAME PRI NICE   SIZE   RES STATE  TIME   WCPUCPU COMMAND
> 0 root   00 0K   87M CPU/7  0:40  0.00%  0.49% [system]
> 15823 root  85016M 2508K poll/1 0:01  0.00%  0.00% nbmake
> 25446 kamil 43028M 2452K CPU/0  0:00  0.00%  0.00% top
> 14117 root 114027M 3356K tstile/0   0:00  0.00%  0.00% ld
> 29088 root 114027M 3280K tstile/3   0:00  0.00%  0.00% ld
> 20839 root 114027M 3208K tstile/1   0:00  0.00%  0.00% ld
> 19550 root 114026M 3184K tstile/6   0:00  0.00%  0.00% ld
> 13716 root 114026M 3104K tstile/2   0:00  0.00%  0.00% ld
>  8758 root 114026M 3048K tstile/7   0:00  0.00%  0.00% ld
>   240 root 114026M 2580K tstile/0   0:00  0.00%  0.00% ld
> 
> I can see in top(1) that processes are locked in turnstiles and load
> goes down.
> 
> $ uname -a
> NetBSD chieftec 9.99.44 NetBSD 9.99.44 (GENERIC) #0: Fri Jan 31 19:26:07
> CET 2020
> root@chieftec:/public/netbsd-root/sys/arch/amd64/compile/GENERIC amd64
> 
> 135 kamil@chieftec /home/kamil $ cpuctl list
> 
> Num  HwId Unbound LWPs Interrupts Last change  #Intr
>    --  -
> 00online   intr   Sun Feb  2 10:40:26 2020 13
> 12online   intr   Sun Feb  2 10:40:26 2020 0
> 24online   intr   Sun Feb  2 10:40:26 2020 0
> 36online   intr   Sun Feb  2 10:40:26 2020 0
> 41online   intr   Sun Feb  2 10:40:26 2020 0
> 53online   intr   Sun Feb  2 10:40:26 2020 0
> 65online   intr   Sun Feb  2 10:40:26 2020 0
> 77online   intr   Sun Feb  2 10:40:26 2020 0
> 136 kamil@chieftec /home/kamil $ cpuctl identify 0
> Cannot bind to target CPU.  Output may not accurately describe the target.
> Run as root to allow binding.
> 
> cpu0: highest basic info 000d
> cpu0: highest extended info 8008
> cpu0: "Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz"
> cpu0: Intel Xeon E3-1200v2 and 3rd gen core, Ivy Bridge (686-class),
> 3392.48 MHz
> cpu0: family 0x6 model 0x3a stepping 0x9 (id 0x306a9)
> cpu0: features
> 0xbfebfbff
> cpu0: features
> 0xbfebfbff
> cpu0: features 0xbfebfbff
> cpu0: features1 0x7fbae3ff
> cpu0: features1 0x7fbae3ff
> cpu0: features1
> 0x7fbae3ff
> cpu0: features2 0x28100800
> cpu0: features3 0x1
> cpu0: features5 0x281
> cpu0: xsave features 0x7
> cpu0: xsave instructions 0x1
> cpu0: xsave area size: current 832, maximum 832, xgetbv enabled
> cpu0: enabled xsave 0x7
> cpu0: I-cache 32KB 64B/line 8-way, D-cache 32KB 64B/line 8-way
> cpu0: L2 cache 256KB 64B/line 8-way
> cpu0: L3 cache 8MB 64B/line 16-way
> cpu0: 64B prefetching
> cpu0: ITLB 64 4KB entries 4-way, 2M/4M: 8 entries
> cpu0: DTLB 64 4KB entries 4-way, 2M/4M: 32 entries (L0)
> cpu0: L2 STLB 512 4KB entries 4-way
> cpu0: Initial APIC ID 1
> cpu0: Cluster/Package ID 0
> cpu0: Core ID 0
> cpu0: SMT ID 1
> cpu0: MONITOR/MWAIT extensions 0x3
> cpu0: monitor-line size 64
> cpu0: C1 substates 2
> cpu0: C2 substates 1
> cpu0: C3 substates 1
> cpu0: DSPM-eax 0x77
> cpu0: DSPM-ecx 0x9
> cpu0: SEF highest subleaf 
> cpu0: Perfmon-eax 0x7300403
> cpu0: Perfmon-eax 0x7300403
> cpu0: Perfmon-edx 0x603
> cpu0: microcode version 0x15, platform ID 1
> 





Re: panic: softint screwup

2020-02-04 Thread Andrew Doran
On Tue, Feb 04, 2020 at 07:03:28AM -0400, Jared McNeill wrote:

> First time seeing this one.. an arm64 board sitting idle at the login prompt
> rebooted itself with this panic. Unfortunately the default ddb.onpanic=0
> strikes again and I can't get any more information than this:

I added this recently to replace a vague KASSERT.  Thanks for grabbing the
output.

> [ 364.3342263] curcpu=0, spl=4 curspl=7
> [ 364.3342263] onproc=0x00237f743080 => l_stat=7 l_flag=2201 l_cpu=0
> [ 364.3342263] curlwp=0x00237f71e580 => l_stat=1 l_flag=0200 l_cpu=0
> [ 364.3342263] pinned=0x00237f71e100 => l_stat=7 l_flag=0200 l_cpu=0
> [ 364.3342263] panic: softint screwup
> [ 364.3342263] cpu0: Begin traceback...
> [ 364.3342263] trace fp ffc101da7be0
> [ 364.3342263] fp ffc101da7c00 vpanic() at ffc0004ad728 
> netbsd:vpanic+0x160
> [ 364.3342263] fp ffc101da7c70 panic() at ffc0004ad81c 
> netbsd:panic+0x44
> [ 364.3342263] fp ffc101da7d40 softint_dispatch() at ffc00047bda4 
> netbsd:softint_dispatch+0x5c4
> [ 364.3342263] fp ffc101d9fc30 cpu_switchto_softint() at ffc85198 
> netbsd:cpu_switchto_softint+0x68
> [ 364.3342263] fp ffc101d9fc80 splx() at ffc040d4 netbsd:splx+0xbc
> [ 364.3342263] fp ffc101d9fcb0 callout_softclock() at ffc000489e04 
> netbsd:callout_softclock+0x36c
> [ 364.3342263] fp ffc101d9fd40 softint_dispatch() at ffc00047b8dc 
> netbsd:softint_dispatch+0xfc
> [ 364.3342263] fp ffc101d3fcc0 cpu_switchto_softint() at ffc85198 
> netbsd:cpu_switchto_softint+0x68
> [ 364.3342263] fp ffc101d3fdf8 cpu_idle() at ffc86128 
> netbsd:cpu_idle+0x58
> [ 364.3342263] fp ffc101d3fe40 idle_loop() at ffc0004546a4 
> netbsd:idle_loop+0x174

Something has cleared the LW_RUNNING flag on softclk/0 between where it is
set (unlocked) at line 884 of kern_softint.c and callout_softclock(). 
Should not be possible.  Hmm.

Andrew


Re: Revamping lwp_park()

2020-01-28 Thread Andrew Doran
Updated diff:

- use the tree for lookup of zombies and fix a bug in _lwp_wait()
  (ESRCH -> EINVAL when waiting on a detached LWP)

- use the tree to allocate/compact LIDs for new LWPs.

- fix problems encountered compiling rust (first LWP is not always LID 1)

http://www.netbsd.org/~ad/2020/lwp_park2.diff

Andrew

On Sun, Jan 26, 2020 at 08:16:39PM +0000, Andrew Doran wrote:
> There have been hangs with threaded tests recently so I took the opportunity
> to go back and see if the kernel side of this could be simplified.  What I
> have done is:
> 
> - Gotten rid of the hashed sleep queues for parking LWPs.
> 
> - Added a per-process RW lock and radix tree for looking up LWPs by ID (due
>   to a nice property of our radix tree implementation this doesn't allocate
>   any memory in the single-threaded case).
> 
> - Added a function to radixtree, for waiting for memory (radixtree was
>   designed for use in the VM system where the caller would eventually do a
>   uvm_wait() if there was an ENOMEM, this is a new kind of application).
> 
> - Made the sleep locked by the per-CPU spc_lwplock so it's kind of
>   optimistically per-CPU as far as the scheduler goes.
> 
> It works better than the hashed sleep queues but what I like personally is
> that the code is a lot simpler.  Any comments, thoughts?
> 
> Thanks,
> Andrew
> 
> 
> Diff including a change to radixtree.c/.h:
> 
>   http://www.netbsd.org/~ad/2020/lwp_park.diff
> 
> sys_lwp.c in case it's easier to read:
> 
>   http://www.netbsd.org/~ad/2020/sys_lwp.c
> 
> Kernel lock contention before and after on a pthread_mutex stress test:
> 
>   http://www.netbsd.org/~ad/2020/stressor-before.txt
>   http://www.netbsd.org/~ad/2020/stressor-after.txt


Revamping lwp_park()

2020-01-26 Thread Andrew Doran
There have been hangs with threaded tests recently so I took the opportunity
to go back and see if the kernel side of this could be simplified.  What I
have done is:

- Gotten rid of the hashed sleep queues for parking LWPs.

- Added a per-process RW lock and radix tree for looking up LWPs by ID (due
  to a nice property of our radix tree implementation this doesn't allocate
  any memory in the single-threaded case; see the sketch after this list).

- Added a function to radixtree, for waiting for memory (radixtree was
  designed for use in the VM system where the caller would eventually do a
  uvm_wait() if there was an ENOMEM, this is a new kind of application).

- Made the sleep locked by the per-CPU spc_lwplock so it's kind of
  optimistically per-CPU as far as the scheduler goes.
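
Here is the sketch referred to above.  It is purely illustrative: the field
names (p_treelock, p_lwptree) are assumptions for the example and need not
match the diff.

	/*
	 * Hypothetical sketch: find an LWP in its owning process by LID
	 * using a per-process radix tree guarded by a per-process RW lock.
	 */
	struct lwp *
	lwp_lookup_by_lid(struct proc *p, lwpid_t lid)
	{
		struct lwp *l;

		rw_enter(&p->p_treelock, RW_READER);
		l = radix_tree_lookup_node(&p->p_lwptree, (uint64_t)lid);
		rw_exit(&p->p_treelock);

		/* Caller still has to validate state and take a reference. */
		return l;
	}

Because the tree for a single-threaded process holds at most one leaf, no
intermediate nodes need to be allocated until a second LWP appears, which is
the property mentioned above.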

It works better than the hashed sleep queues but what I like personally is
that the code is a lot simpler.  Any comments, thoughts?

Thanks,
Andrew


Diff including a change to radixtree.c/.h:

http://www.netbsd.org/~ad/2020/lwp_park.diff

sys_lwp.c in case it's easier to read:

http://www.netbsd.org/~ad/2020/sys_lwp.c

Kernel lock contention before and after on a pthread_mutex stress test:

http://www.netbsd.org/~ad/2020/stressor-before.txt
http://www.netbsd.org/~ad/2020/stressor-after.txt


Re: Folding vnode_impl_t back into struct vnode

2020-01-24 Thread Andrew Doran
On Fri, Jan 24, 2020 at 09:36:36AM +0100, J. Hannken-Illjes wrote:

> > On 23. Jan 2020, at 17:37, Andrew Doran  wrote:
> > 
> > Hi,
> > 
> > I would like to unify these back into a single struct vnode.
> > 
> > I think I understand the motivation for the split in the first place and I
> > very much agree code outside the VFS core should not be playing around with
> > a lot of the stuff in struct vnode, but I think the split causes more
> > trouble than it's worth.
> 
> Please do not, I strongly object.
> 
> It was hard work to clean up file systems and kernel components using
> and abusing vnode fields that should be opaque outside the VFS implementation.
> 
> I'm quite sure people will start again using these fields because
> it is so simple.  For me the unique vnode was a maintenance nightmare.

Ok cool - it's a preference on my part and I don't have a strong opinion.  I
understand the frustration; I have run into a lot of nasty business in the
file system code over the years.

> All (v_lock at least) contained in struct vnode_impl, all locks should
> go here as they are opaque to any consumer.

I have found a way to cluster the fields in both structures that seems to
work well enough so all is well in the end.

Thank you,
Andrew


kmem caches and the cache line size

2020-01-24 Thread Andrew Doran
Hi,

This zero-pads the cache names to be more readable in vmstat output, and
avoids creating caches where the item size is neither a factor nor a multiple
of the cache line size.

http://www.netbsd.org/~ad/2020/kmem.diff

This gives me a repeatable 1-2% speed up on compile jobs on my test system,
means that __aligned(COHERENCY_UNIT) can be used more easily in structures
allocated with kmem_alloc(), and permits removal of a bunch of padding and
rounding done during allocation.
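
For illustration only, the kind of size test meant by "factor or multiple of
the cache line size" could look like the helper below; the name is invented
and the real diff may express this differently.

	/*
	 * Sketch: true if items of this size never share a cache line
	 * with a neighbouring item.  COHERENCY_UNIT is the line size.
	 */
	static bool
	kmem_size_cacheline_friendly(size_t size)
	{

		if (size < COHERENCY_UNIT)
			/* e.g. 16, 32: a whole number of items per line. */
			return (COHERENCY_UNIT % size) == 0;
		/* e.g. 128, 192: the item spans whole lines only. */
		return (size % COHERENCY_UNIT) == 0;
	}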

Comments?

Cheers,
Andrew


Folding vnode_impl_t back into struct vnode

2020-01-23 Thread Andrew Doran
Hi,

I would like to unify these back into a single struct vnode.

I think I understand the motivation for the split in the first place and I
very much agree code outside the VFS core should not be playing around with
a lot of the stuff in struct vnode, but I think the split causes more
trouble than it's worth.

Here are my reasons for wanting to go back to a single struct:

1) Performance.  I want to be able to cluster the fields around the kind of
usage they get.  For example changing v_usecount on libc.so's vnode or the
root vnode shouldn't force a cache miss on a different CPU that's just
trying to look up pages, or get at v_op or v_type or whatever; they're
different kinds of activity (see below for example).

2) Memory usage; to do the above in a way that doesn't use more memory than
needed.  With the structs split in two it's difficult to arrange all of
this cleanly.

3) Clarity.  I'm having a hard time working in the code with the split and
the different macros to convert back and forth.  It's visually untidy.  I
also think that if NetBSD's is to be a reference implementation or teaching
aid then well known stuff like this should be straightforward.

Thoughts?

Cheers,
Andrew

struct vnode {
	/*
	 * more stable stuff goes at the front.
	 */
	struct uvm_object	v_uobj;
	...
	enum vtype		v_type;
	...

	/*
	 * counters usecounts etc.  these guys get hammered on.
	 * think of v_usecount on "rootvnode" for example.
	 * (v_usecount will come out of struct uvm_object with uvm
	 * rwlock changes.)
	 */
	int			v_usecount __cacheline_aligned(COHERENCY_UNIT);
	int			v_holdcount;
	int			v_numoutput;
	int			v_writecount;
	int			v_someothercounter;
	...

	/*
	 * then locks at the back in their own line so they don't
	 * impinge on any of the above either; different kind of
	 * activity again.
	 */
	krwlock_t		v_lock __cacheline_aligned(COHERENCY_UNIT);
	krwlock_t		v_nc_lock;	/* namecache locks */
	krwlock_t		v_nc_listlock;
};


Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-22 Thread Andrew Doran
On Tue, Jan 21, 2020 at 05:30:55PM -0500, Thor Lancelot Simon wrote:

> I think there's probably a generic bucket-locked hashtable implementation
> in one of the code contributions Coyote Point made long ago.  That said,
> we're talking about a screenful of code, so it's not like it'd be hard to
> rewrite, either.  Do we want such a thing?  Or do we want to press on
> towards more modern "that didn't work, try again" approaches like
> pserialize or the COW scheme you describe?

The way I see it, the original idea of the namecache was to have a thing
that allowed you avoid doing expensive hardware and file system stuff a
second time, because you'd already done it once before.  In this context
MP locking also looks a lot like expensive hardware and file system stuff.

Making the namecache's index per-directory matches the pattern in which the
data is accessed and hey presto all of a sudden there are no big concurrency
problems, any locking or whatever becomes trivial and follows the pattern of
access closely, things like pserialize/RCU start to look feasible, and one
can get back to the important business of avoiding expensive work instead of
generating it!  That's my thinking on it anyway.

The rbtrees are just what we have at hand.  I think a dynamically sized
hash, like a Robin Hood hash looks very interesting though, because we can
afford to resize occasionally on entry/removal, and modern pipelined CPUs
are great at that kind of thing.  I'm sure other people have better ideas
than I do here, though.  I just want to try out the directory approach, and
if it works well, get the basis in place.

Andrew


Re: libc.so's vmobjlock / v_interlock

2020-01-21 Thread Andrew Doran
On Sun, Jan 19, 2020 at 09:50:24PM +, Andrew Doran wrote:

> Here's a dtrace flamegraph for a kernel build on my system, while running a
> kernel from the ad-namecache branch.
> 
>   http://www.netbsd.org/~ad/2020/jan19.svg
> 
> The biggest single remaining concurency problem is the mutex on libc.so's
> uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> above the "uvm_fault_internal" box.
> 
> I think making the vmobjlock / v_interlock a rwlock could help with making
> inroads on this problem, because in the libc.so case it's not modifying too
> much under cover of the lock (mostly pmap state).
> 
> That's not a small undertaking, but I reckon it would take about a day of
> donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and
> make a start at choosing rw_write_held() or rw_lock_held() for the
> assertions, followed by a bit of benchmarking to check for a regression and
> a jumbo commit.  From there on things could be changed incrementally.
> 
> Thoughts?

Here's a first set of changes to do just this.  I don't plan on doing more
on this for about a ~month because I will be travelling soon.

http://www.netbsd.org/~ad/2020/vmobjlock.diff

Andrew


- This splits v_interlock apart from vmobjlock.  v_interlock stays a mutex,
  vmobjlock becomes a RW lock.  The lock order is vmobjlock -> v_interlock. 
  There are a few shared items that need to be examined closely and fixed or
  good excuses made, e.g. VI_EXECMAP but it's mostly right.

- I made anons/amaps have a rwlock too.  There is bound to be an application
  for this and I think it might be PostgreSQL, must check.

- There seems to be no performance regression from doing this.  Not having
  vnode stuff adding contention to the vmobjlock makes the numbers better.

To get to the point of doing concurrent faults for the simple cases, like
are needed for libc.so I think the following are also needed:

- PV lists in the pmap need explicit locking (unless the pmap uses a global
  lock, which is fine if it suits the machine).  I have done the PV list
  locking for x86 and it's in the patch.

  I'm suggesting a lock still needs to be held on the UVM object / amap for
  pmap ops.  But that can be a read lock; for x86 I want a write lock held
  only for pmap_page_protect(VM_PROT_NONE) aka pmap_page_remove(), because
  trying to have the x86 pmap manage walking the PV list while removing PTEs,
  and while everything is changing around it, and have it all nice and
  concurrent, would be really tough.  I don't think ppp(VM_PROT_NONE) is a
  hotspot anyway.

- With the above done uvm_unmap_remove() and uvm_map_protect() can take
  a reader lock on the object.  I've tried this out and it works fine
  "on my machine", except that in the libc.so case it then starts to 
  compete with uvm_fault() for the vmobjlock.

- PG_BUSY could become a reader/writer lock of sorts - I'm thinking something
  like a counter in struct vm_page, and I like the idea of covering it with
  pg->interlock (a rough sketch follows at the end of this list).  Interested
  to hear any other ideas.

- Then it's a trip into the uvm_fault() -> getpages -> uvn_findpages()
  path looking for what's needed, I expect probably not a whole lot if you
  only want to handle faults for stuff that's in-core using a read lock.
  I don't know the uvm_fault() code too well.
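
Here is the rough sketch promised above for the PG_BUSY idea.  Everything in
it is hypothetical - pg_busycnt and pg_busycv are invented fields, and the
real flag/lock rules would need more care; it only shows the shape of a
counted busy hold covered by pg->interlock.

	void
	uvm_page_busy_acquire(struct vm_page *pg)
	{

		mutex_enter(&pg->interlock);
		while ((pg->flags & PG_BUSY) != 0) {
			pg->pg_busycnt++;	/* record a waiter (invented) */
			cv_wait(&pg->pg_busycv, &pg->interlock);
			pg->pg_busycnt--;
		}
		pg->flags |= PG_BUSY;
		mutex_exit(&pg->interlock);
	}

	void
	uvm_page_busy_release(struct vm_page *pg)
	{

		mutex_enter(&pg->interlock);
		pg->flags &= ~PG_BUSY;
		if (pg->pg_busycnt != 0)
			cv_broadcast(&pg->pg_busycv);
		mutex_exit(&pg->interlock);
	}

In practice a per-page condvar is too fat and sleeping keyed on the page
address would do; the point is only that the wait bookkeeping lives under
pg->interlock rather than the object lock.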


Re: Adaptive RW locks

2020-01-21 Thread Andrew Doran
To follow up: in testing this I ended up discovering a number of tedious,
complicated deadlocks that could occur due to softints, kernel_lock and
other factors.  Trying to mitigate them killed the performance gain and it
still wasn't right.  I'm abandoning this idea because in practice it seems
too complicated.  On a more positive note there are a couple of LOCKDEBUG
improvements out of it.

Andrew


Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-21 Thread Andrew Doran
On Thu, Jan 16, 2020 at 04:51:44AM +0100, Mateusz Guzik wrote:

> On 1/15/20, Andrew Doran  wrote:
> > I don't understand why we do this.  I don't think it's needed at all
> > because
> > if the file really is deleted and the inode is getting cleaned out, then it
> > shouldn't be getting new refs in any of the usual ways and we don't need to
> > worry about it.  And in any case the vnode lock prevents anyone from
> > messing
> > with the inode during VOP_INACTIVE().
> >
> > I want to change this because it causes nasty problems on MP.  For example
> > I
> > see threads very regularly thrown out of the namecache by vcache_tryvget(),
> > and when they finally get the vnode ref'd and locked they do it by digging
> > through the file system metadata (when everything is already in cache).
> >
> >   http://www.netbsd.org/~ad/inactive.diff
> >
> 
> I think looking at a bigger picture suggests a different solution, which
> will also keep the current blocked <-> loaded transition for consideration
> later.
> 
> I'm assuming the goal for the foreseeable future is to achieve path lookup
> with shared vnode locks.

Good guess.  There is a prototype of LK_SHARED lookup on the ad-namecache
branch, along with lookup using namecache only that takes as few vnode locks
and refcounts as possible.  In the easiest case only the root vnode and the
leaf vnode are touched (although whether the root vnode ever actually needs
a reference is another question).  There are many "non-easy" cases though.

> Since ->v_usecont transition to 0 happens all the time and VOP_INACTIVE
> processing is almost never needed, the current requirement to hold an
> exclusive lock just to find out there is nothing to do should be lifted.
> Apart from getting away with only shared lock (or no locks) this would
> also naturally eliminate vnode state changes for the common case.
> 
> Preferably vfs would provide a method for filesystems to call to signify
> they want VOP_INACTIVE to be called for given vnode, then vrelel could
> just test a bit. However, this is probably not worth the effort right now.
> 
> Said locking was also a problem in FreeBSD. It got temporarily sorted
> out with introduction of VOP_NEED_INACTIVE -- by default it returns 1,
> but filesystems which provide their own method easily avoid the problem.
> I think starting with requiring at least shared lock would do the trick
> just fine while providing race-free methods at least for ufs and tmpfs.
> 
> Then vrelel could stick to:
> diff --git a/sys/kern/vfs_vnode.c b/sys/kern/vfs_vnode.c
> index d05d5180f730..45d80dab4a2c 100644
> --- a/sys/kern/vfs_vnode.c
> +++ b/sys/kern/vfs_vnode.c
> @@ -774,7 +774,7 @@ vrelel(vnode_t *vp, int flags, int lktype)
>  * If not clean, deactivate the vnode, but preserve
>  * our reference across the call to VOP_INACTIVE().
>  */
> -   if (VSTATE_GET(vp) == VS_RECLAIMED) {
> +   if (VSTATE_GET(vp) == VS_RECLAIMED || !VOP_NEED_INACTIVE(vp)) {
> VOP_UNLOCK(vp);
> } else {
> VSTATE_CHANGE(vp, VS_LOADED, VS_BLOCKED);
> 
> which even with exclusive locks already avoids the blocked state problem.
> 
> Another option is to move some of the current body into smaller halpers
> and then let filesystem piece together their own VOP_VRELE.

Interesting.  I had tried another approach, a VOP_INACTIVE() that could be
called with an LK_SHARED lock.  It returned ENOLCK if there was work to do
that needed an exclusive lock, but it was ugly.  I will take a look at the
FreeBSD solution.

Thanks,
Andrew


Re: Adaptive RW locks

2020-01-20 Thread Andrew Doran
Hi,

On Mon, Jan 20, 2020 at 02:18:15PM +0900, Takashi YAMAMOTO wrote:

> > http://www.netbsd.org/~ad/2020/rwlock.diff
> 
> 
> i guess in rw_downgrade
> newown = (owner & RW_NODEBUG) | RW_SPIN;
> 
> should be
> newown = (owner & (RW_NODEBUG | RW_SPIN));
> 
> as the owner might have failed to record the lock on acquisition.

Right.  I'll fix it and add a comment.  Good catch, thank you.

Andrew


Re: libc.so's vmobjlock / v_interlock

2020-01-20 Thread Andrew Doran
On Sun, Jan 19, 2020 at 10:08:12PM +, Taylor R Campbell wrote:

> > Date: Sun, 19 Jan 2020 21:50:24 +
> > From: Andrew Doran 
> > 
> > The biggest single remaining concurency problem is the mutex on libc.so's
> > uvm_object / vnode.  It pops up in a number of places, but most clearly seen
> > above the "uvm_fault_internal" box.
> > 
> > I think making the vmobjlock / v_interlock a rwlock could help with making
> > inroads on this problem, because in the libc.so case it's not modifying too
> > much under cover of the lock (mostly pmap state).
> > 
> > [...]
> > 
> > Thoughts?
> 
> I think we should try to go straight to pserialized page lookup and
> skip rwlocks.  Should be relatively easy to teach radixtree.c to work
> pserialized.
> 
> We probably won't want to use pserialize_perform() but can perhaps
> instead use an epoch-based system with page daemon queues of dying
> pages that may still be referenced under pserialize_read.

This is an intriguing idea and I suspect that you've already thought about
it quite a bit but I don't think we're there just yet because there's more
to be dealt with first, namely that vmobjlock/v_interlock covers a lot more
than the index of pages in the object.

For example right now it's covering the usecounts, the flags in the vnode
and the page, the PV entries in the pmap, interlocking the long term page
locks (PG_BUSY), correct serialization of page visibility and TLB shootdown,
and a whole lot more.

I'm not fond of RW locks as a synchronisation primitive, there are usually
better ways to do things, but I like the RW lock idea in this instance for
two reasons.

Firstly RW locks are pretty easy to understand and use so we can hopefully
get a nice win for the system without it being too taxing in terms of the
amount of work needed or the potential for introducing bugs.  Secondly I
think it would enable us incrementally attack some of the areas where we
have what's really per-page state being covered by a per-object lock, which
I think ties in with what you're suggesting.

Andrew


libc.so's vmobjlock / v_interlock

2020-01-19 Thread Andrew Doran
Hi,

Here's a dtrace flamegraph for a kernel build on my system, while running a
kernel from the ad-namecache branch.

http://www.netbsd.org/~ad/2020/jan19.svg

The biggest single remaining concurency problem is the mutex on libc.so's
uvm_object / vnode.  It pops up in a number of places, but most clearly seen
above the "uvm_fault_internal" box.

I think making the vmobjlock / v_interlock a rwlock could help with making
inroads on this problem, because in the libc.so case it's not modifying too
much under cover of the lock (mostly pmap state).

That's not a small undertaking, but I reckon it would take about a day of
donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and
make a start at choosing rw_write_held() or rw_lock_held() for the
assertions, followed by a bit of benchmarking to check for a regression and
a jumbo commit.  From there on things could be changed incrementally.

Thoughts?

Andrew


Adaptive RW locks

2020-01-19 Thread Andrew Doran
In my testing lately I've often run into a high context switch rate over RW
locks.

The RW locks we have now are "half adaptive" so to speak.  Lock waiters can
spin while the lock is write locked and the owner is running, because the
owner can be identified in that case, and then it can be determined whether
or not the owner is running on a CPU: if so, the waiter spins, if not, it
sleeps.

The same does not happen if a RW lock is read locked and someone wants to
take a write hold, because in that case there is no identifiable owner, and
so it can't be determined whether the owner(s) are running or not.  In that
case the lock waiter always sleeps.  It then has a kind of snowball effect
because it delays the lock getting back to its unheld state, and every
thread arriving during the aftermath sleeps too.

If you only take mostly read holds, or only take mostly write holds on an
individual RW lock this isn't a big deal because it occurs rarely.  But if
you introduce more of a mix of read/write you start to get the poor
behaviour.  The correct answer here may be "don't do that", but it's not a
very practical one.

I'd like to reduce this blocking activity, so here is a patch against
-current that implements the adaptive behaviour in the reader case too, by
having each LWP holding RW locks track those holds, and if going to sleep,
cause any held RW locks to be temporarily put into "sleep mode", with the
default being "spin mode".  In my testing with mixed vnode locks this works
well.
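
As a purely illustrative sketch of that bookkeeping (the l_rwlocks /
l_nrwlocks fields and the rw_force_sleep_mode() helper are invented names,
not necessarily what the patch uses):

	#define	LWP_MAXHELDRW	4	/* soft limit, as described above */

	/* Remember a newly taken RW lock, or give up and make it sleepy. */
	static inline void
	rw_record_held(lwp_t *l, krwlock_t *rw)
	{

		if (l->l_nrwlocks < LWP_MAXHELDRW)
			l->l_rwlocks[l->l_nrwlocks++] = rw;
		else
			rw_force_sleep_mode(rw);
	}

	/* Called when the holder is about to block: stop waiters spinning. */
	static inline void
	rw_holder_blocking(lwp_t *l)
	{
		u_int i;

		for (i = 0; i < l->l_nrwlocks; i++)
			rw_force_sleep_mode(l->l_rwlocks[i]);
	}

The second function is the interesting part: once the holder sleeps, any lock
it still holds is flipped into "sleep mode" so waiters block instead of
spinning on a lock whose owner is off-CPU.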

http://www.netbsd.org/~ad/2020/rwlock.diff

This removes the assembly stubs but the compiler seems to do a pretty good
job on the C ones I added, and in microbenchmarking it the difference is
tiny.  I've only included the amd64 MD changes in the diff.

In trying this out on builds it seems that tracking at most 2 holds catches
~99.8% of activity, but I have made the limit 4 (it's a soft limit, if
exceeded all further taken locks are put into "sleep mode").

I also added a rw_owner_running() for the pagedaemon, to match mutexes (not
used yet).

Thoughts?

cheers,
Andrew


Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-15 Thread Andrew Doran
I don't understand why we do this.  I don't think it's needed at all because
if the file really is deleted and the inode is getting cleaned out, then it
shouldn't be getting new refs in any of the usual ways and we don't need to
worry about it.  And in any case the vnode lock prevents anyone from messing
with the inode during VOP_INACTIVE().

I want to change this because it causes nasty problems on MP.  For example I
see threads very regularly thrown out of the namecache by vcache_tryvget(),
and when they finally get the vnode ref'd and locked they do it by digging
through the file system metadata (when everything is already in cache).

http://www.netbsd.org/~ad/inactive.diff

What am I missing here?  Is there a buggy file system or something like
that?

Cheers,
Andrew


Re: Synchronization protocol around TODR / time-related variables

2020-01-01 Thread Andrew Doran
On Wed, Jan 01, 2020 at 09:05:43AM -0800, Jason Thorpe wrote:

> It looks to me like in our kern_tc.c, timebasebin has actually replaced
> the historical function of the boottime variable, though I have to admit I
> haven't fully wrapped my head around kern_tc.c.  Anyway, if that's true,
> we can nuke the old boottime completely, and have getboottime() return the
> converted-to-ts value of timebasebin using the same lockless strategy.
> 
> Agree?

kern_tc.c is one of those alright.  Yes it looks that way to me too.

Andrew


Re: Solving the last piece of the uvm_pageqlock problem

2019-12-25 Thread Andrew Doran
On Wed, Dec 25, 2019 at 03:02:41AM +0100, Joerg Sonnenberger wrote:
> On Tue, Dec 24, 2019 at 10:08:01PM +0000, Andrew Doran wrote:
> > This is a diff against a tree containing the allocator patch I posted
> > previously:
> > 
> > http://www.netbsd.org/~ad/2019/pdpol.diff
> 
> I wanted to give this a spin before travelling, but it doesn't survive
> very long here. I get NULL pointer derefs in
> uvmpdpol_pageactivate_locked, coming from uvmpdpool_pageintent_realize.
> That's within seconds of the scan phase of a bulk build.

I had a look and I think I can see why.  I'll come back with another diff
soon.  Thank you for trying it, Joerg, and Merry Christmas.

Andrew


Solving the last piece of the uvm_pageqlock problem

2019-12-24 Thread Andrew Doran
This is a diff against a tree containing the allocator patch I posted
previously:

http://www.netbsd.org/~ad/2019/pdpol.diff

The idea here is to buffer updates to the page's status (active, inactive,
dequeued) and then sync those to the pdpolicy / pagedaemon queues regularly,
a bit like the way the file system syncer works.  Notes:

- Since uvm_pageqlock was replaced with pg->interlock & a private lock for
  the pdpolicy code, pages can occasionally appear on a pdpolicy queue when
  they shouldn't be considered for pageout & reclaim (if the pagedaemon and
  object owner race), but it's not a problem because the pagedaemon can take
  pg->interlock and determine that a page is wired or in a state of flux or
  whatever, and so should be ignored because it'll be gone from the queues
  soon.

- This patch takes it a little further.  The pdpolicy code gets a dedicated
  TAILQ_ENTRY in struct vm_page so it doesn't need to share with the page
  allocator.  A page can be PG_FREE and still on a pdpolicy queue (but not
  for long).  We set an intended state for the page on pg->pqflags using
  atomics (active, inactive, dequeued) and then those pages are queued in a
  per-CPU buffer for their status updates to be purged and made real in the
  pdpol code's global state at some point in the near future.

- The pagedaemon can also see those updates in real time by inspecting
  pg->pqflags and make real the page's status.  So basically what I'm doing
  is batching the updates, trying to not let the global state fall too far
  behind, and always give the pagedaemon enough information to know the true
  picture for individual pages when it does its labourious scan of the
  queues, even if viewed globally the queues are a little bit behind.

This seems to work really well, I think because a page can have multiple
state transitions while it's in a queue waiting for its intended status
change to be purged and made global.
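
A minimal sketch of that flow, with invented names (PQ_INTENT_MASK stands in
for however the intent bits are encoded, and pdpol_curcpu() is an imaginary
per-CPU hook):

	static void
	uvmpdpol_set_intent(struct vm_page *pg, uint32_t intent)
	{
		struct pdpol_cpu *pc;
		uint32_t old, new;

		/* Publish the intended state in pg->pqflags with atomics. */
		do {
			old = atomic_load_relaxed(&pg->pqflags);
			new = (old & ~PQ_INTENT_MASK) | intent;
		} while (atomic_cas_32(&pg->pqflags, old, new) != old);

		/* Stash the page in a small per-CPU buffer of updates. */
		kpreempt_disable();
		pc = pdpol_curcpu();
		pc->pc_buf[pc->pc_count++] = pg;
		if (pc->pc_count == PDPOL_BUFSIZE)
			uvmpdpol_purge(pc);	/* takes the global pdpol lock */
		kpreempt_enable();
	}

The pagedaemon never has to wait for a purge: it can read pg->pqflags
directly and treat the intent as authoritative, which is the "real time" view
mentioned above.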

Shortly before composing this e-mail it occurred to me that FreeBSD may do
something similar but to be honest I didn't dig into their code.

I need to tweak this to allocate a smaller buffer for uniprocessor systems,
and maybe consider using prefetching instructions when purging.  I also want
to re-run the tests because I changed a couple of things, but I'm basically
happy with it.

Results on my kernel build test:

72.66 real  1653.86 user   593.19 sys   new allocator
71.26 real  1671.13 user   502.94 sys   new allocator + pdpol.diff

Lock contention before and after:

Total%  Count     Time/ms   Lock          Caller
------ --------- ---------  ------------  ---------------------------
 28.86  44056935  77553.77  pdpol_state
 15.62  22177251  41978.93  pdpol_state   uvmpdpol_pageactivate+36
 13.12  21656129  35251.99  pdpol_state   uvmpdpol_pagedequeue+18
  0.12    223482    322.77  pdpol_state   uvmpdpol_pagedeactivate+18
  0.00        73      0.07  pdpol_state   uvmpdpol_pageenqueue+18

Total%  Count   Time/ms  Lock   Caller
-- --- - -- --
  0.23   11301362.35 pdpol_stateuvmpdpol_pageintent_set+b9

Andrew


Re: Synchronization protocol around TODR / time-related variables

2019-12-24 Thread Andrew Doran
On Mon, Dec 23, 2019 at 08:06:04PM -0800, Jason Thorpe wrote:

> Now let's talk about settime1() in kern_time.c.  It does a few things that
> seem dicey vis a vis modifying global state without proper serialization,
> and just about the entire function is wrapped in an splclock()/splx()
> pair.
>
> First of all, it's not clear to me at all that splclock()/splx() is really
> needed here...  *maybe* across the call to tc_setclock()?  But the mutex
> it takes internally is an IPL_HIGH spin mutex, so splclock() seems
> superfluous, just a legacy hold-over. 

The splclock() doesn't buy us anything as far as I can tell.

IPL_HIGH for the timecounter lock irks me given that the clock interrupt is
IPL_SCHED, but I have nothing to back that up.  In any case I suppose it's a
bit like kernel printf(): people are going to call it from wherever they
like, even if it's somewhere we'd prefer they didn't, and that's
understandable.

> The only thing that requires some synchronization is modifying "boottime",
> but maybe it's not super critical (all of the consumers of this variable
> would need significant cleanup.

Looks like it wouldn't be too hard to wrap this in a function to return it
and at least keep the problem in one place.  Ah ha!  When searching for
boottime on OpenGrok I found a FreeBSDism in the NFS code that we pulled in:
getboottime().  Digging into it a bit further, it seems they have managed
this using the same concurrency strategy used for timecounters:

http://fxr.watson.org/fxr/source/kern/kern_tc.c#L506
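
Roughly the shape of it, sketched with a generation counter in the style of
the timecounter code (boottimebin_gen and boottimebin_stable are invented
names, and memory ordering is simplified):

	void
	getboottime(struct timeval *tvp)
	{
		struct bintime bt;
		u_int gen;

		do {
			gen = atomic_load_acquire(&boottimebin_gen);
			bt = boottimebin_stable;
		} while ((gen & 1) != 0 ||
		    atomic_load_acquire(&boottimebin_gen) != gen);
		bintime2timeval(&bt, tvp);
	}

A writer bumps the generation to an odd value, updates the bintime, then
bumps it again, so readers retry rather than seeing a torn update.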

Andrew


A new page allocator for UVM

2019-12-22 Thread Andrew Doran
Hi,

Anyone interested in taking a look?  This solves the problem we have with
uvm_fpageqlock.  Here's the code, and the blurb is below:

http://www.netbsd.org/~ad/2019/allocator.diff

Cheers,
Andrew


General overview

The page allocator currently works in 4 dimensions, with freelist
being the most important consideration, page colour the next, and so
on.  It has a global counter tracking the number of free pages.

uvm -> freelist[f] -> colour[c] -> queue[q] +-> global list
                                            \-> local CPU list

uvm -> uvmexp.free

It turns out that this allocator has a "lock contention problem",
which is actually more of a false sharing problem, because
allocating a page involves touching and writing back to many cache
lines.

The new allocator is reduced to working in 3 dimensions.  The
separate lists for zeroed/unknown pages have been deleted (but the
ability to track whether an individual page is zeroed is retained),
the CPU free lists have been deleted, and a new concept of "bucket"
introduced.  Again freelist is the most important consideration,
bucket is the next, and colour the last (* unless the experimental
NUMA stuff is enabled, but that's disabled for now).

uvm -> freelist[f] -> bucket[b] -> colour[c]
                               \
                                -> pgb_nfree

The data structures to manage this are carefully arranged so that
there's minimal false sharing, fewer write backs need to occur, and
empty freelists (of which there are often many) can be skipped
quickly without having to scan everything nor take locks.
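
As a rough picture of how that might be laid out in C (struct and field
names here are guesses for illustration, not necessarily the ones in
allocator.diff):

	/* One bucket: its own lock, free count and per-colour queues. */
	struct pgflbucket {
		kmutex_t	pgb_lock;
		uintptr_t	pgb_nfree;
		struct pglist	pgb_colour[1];	/* really [ncolours] */
	} __aligned(COHERENCY_UNIT);

	/* One free list: an array of buckets. */
	struct pgfreelist {
		struct pgflbucket *pgfl_buckets[PGFL_MAX_BUCKETS];
	};

An allocating CPU tries its preferred bucket within each freelist, taking
only that bucket's lock, and can skip a freelist entirely when the bucket's
pgb_nfree reads as zero, without locking anything.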

Buckets

In both old & new allocators we have the ability to dynamically
rearrange the free lists to support an arbitrary number of page
colours.  Buckets work this way too but there is a maximum number
(PGFL_MAX_BUCKETS, currently 8).

The number of buckets is sized at runtime according to the number of
physical CPU packages in the system (ci->ci_package_id), and each
logical CPU is assigned a preferred bucket to allocate from
(ucpu->pgflbucket).
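
In outline the assignment is something like this (uvm_page_numbuckets is an
invented name for the runtime bucket count, and real hardware can have
sparse package IDs, so the real code may map them through a table):

	void
	uvm_page_assign_bucket(struct cpu_info *ci, struct uvm_cpu *ucpu)
	{

		ucpu->pgflbucket = ci->ci_package_id % uvm_page_numbuckets;
	}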

This does a few things for us (when the system is not very low on
memory):

- It exponentially reduces lock contention by reducing the number of
  CPUs competing for access to each bucket.

- The lock and data structures for each bucket typically remain in
  cache somewhere on the chip, and thereby don't need to go over the
  bus, further reducing contention.

- It has a positive effect on L2/L3 cache locality, because if we
  free pages back to a bucket corresponding to the CPU where they
  were last used, the pages held in free list are very likely to be
  in cache on the chip they will next be allocated from.  The
  current allocator has a problem where logical CPUs "hoard" L2/L3
  cache by having their own free lists.

Experimental rudimentary NUMA support

I thought it would be interesting to try this, just for fun.

This changes the configuration of the buckets to correspond to NUMA
node, assigns pages to buckets according to which node the hardware
tells us they live on, changes the strategy for freeing pages so
that pages ALWAYS go back to their assigned bucket, instead of
moving about as they do with the L2/L3 strategy, and changes the
strategy for allocating so that we consider allocating from the
correct bucket to be more important than allocating from the correct
free list.

I have disabled this for now even if NUMA is enabled on the system
because it's not quite there yet and more testing needs doing.  It
does seem to work though (for a compile job).

Per-cpu caching

There is a tiny per-CPU caching layer that can be paused and resumed
at runtime, and it's only enabled when CPUs are sharing buckets. 
This functions as a buffer, where pages are shuffled to and from
buckets in batch.  On a typical Intel system, about 224kB worth of
pages will be cached.  That's not much, but is enough to further
reduce lock contention by an order of magnitude without exerting too
much pressure on the L2/L3 cache by hoarding.

Idlezero

I have removed the guts of this, but the hooks to support it are
still in place.  I think it may be useful to experiment with zeroing
the per-CPU cached pages, or try some other method, but what we have
now doesn't work well on modern systems.

How this looks in practice

Here is what the freelists look like on my system, and the dmesg
from said system.

http://www.netbsd.org/~ad/2019/freelist.txt
http://www.net

Re: Adaptec 6445 not supported?

2019-12-21 Thread Andrew Doran
Hi Darren,

On Sat, Dec 21, 2019 at 11:23:05PM +1100, Darren Reed wrote:

> Using NetBSD 8.1, the NetBSD kernel boots up with this:
> 
> Dec 21 10:53:56 unix /netbsd: vendor 9005 product 028b (RAID mass
> storage, revision 0x01) at pci1 dev 0 function 0 not configured
> 
> https://storage.microsemi.com/en-us/support/raid/sas_raid/sas-6445/
> 
> Under "Unix", there's no "NetBSD', only "FreeBSD".
> 
> Is that driver supplied to FreeBSD as a binary or can it be ported?

It looks to me like the "aacraid" driver in FreeBSD handles this, and that
looks to be an evolution of the "aac" driver.  The driver would not be too
hard to port I reckon.

We used to get engineering samples from Mark Salyzyn at DPT/Adaptec every
time they released something significant but he has long left the company.

Andrew


Re: New cpu_count races; seqlocks?

2019-12-17 Thread Andrew Doran
A bit off on a tangent, but my intent was to keep this stuff very simple and
distinct from the general (evcnt) type case because these are special to my
mind due to their well established use and how clustered they are.

On Tue, Dec 17, 2019 at 05:07:12AM +, Taylor R Campbell wrote:

> It is great to get rid of global contention over a bunch of counters
> that don't need to be read exactly.  However, the new cpu_count logic
> is racy -- particularly on 32-bit platforms, where it's not simply a
> matter of reading stale values but a matter of reading total garbage:

I'm not averse to solving the problem at all, but my thinking is that in
most cases it doesn't really matter too much since they're sampled with some
regularity, which is how we've been running without a stable/reliable view
of these numbers for 10+ years now.  However having thought it over again
perhaps we've grown a new problem that we simply don't need.

On a 32-bit platform, I don't see why any of these actually need to be
64-bit.  They've been wrappable since we've had them, they're still
wrappable in 64-bit form, and I'd be quite happy to say that if you're
running 32-bit then you get 32-bit wraparound.  Making them uintptr_t's
internally in the implementation at least would mean we don't introduce
any new problems for the user nor ourselves.

>   /* Claim exclusive access for modifications.  */
>   do {
>   while (1 & (gen = atomic_load_relaxed(&cpu_count_gen)))
>   SPINLOCK_BACKOFF_HOOK;
>   } while (atomic_cas_32(&cpu_count_gen, gen, gen | 1) != gen);

I actually quite like this from the POV of making sure no more than one LWP
at a time tries to update the sum totals since the cost is huge.
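
For contrast with the writer-side claim quoted above, the matching read side
might look roughly like this (a sketch only - cpu_count_total[] is invented
storage for the summed values and barriers are elided):

	int64_t
	cpu_count_read(enum cpu_count idx)
	{
		int64_t val;
		uint32_t gen;

		do {
			/* Wait out any in-progress update (odd generation). */
			while (1 & (gen = atomic_load_relaxed(&cpu_count_gen)))
				SPINLOCK_BACKOFF_HOOK;
			val = cpu_count_total[idx];
			/* Retry if a writer slipped in while we read. */
		} while (atomic_load_relaxed(&cpu_count_gen) != gen);

		return val;
	}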

> Perhaps we should have a seqlock(9) API for broader use.  I drafted an
> implementation of Linux's seqlock API for a drm import from Linux
> 4.19:
> 
> https://github.com/riastradh/netbsd-src/blob/riastradh-drm419/sys/external/bsd/drm2/include/linux/seqlock.h

I think we are very likely to run into a need for such a thing for the file
system and network code.

Andrew


Re: [filemon] CVS commit: htdocs/support/security

2019-12-17 Thread Andrew Doran
On Tue, Dec 17, 2019 at 04:06:12PM +0100, Kamil Rytarowski wrote:
> On 17.12.2019 15:44, Andrew Doran wrote:
> >>>> Typically with a character device, the kmod can get unloaded while an 
> >>>> ioctl
> >>>> is being executed on it.
> >
> > That's solvable.  Conceptually I think the main stumbling block is that
> > there are two layers at play which need to be reconciled: specfs and devsw.  It
> > could also be an opportunity to lessen the distinction between block and
> > character devices, there's no real need for cached access from userspace,
> > that would simplify things too.
> >
> >>>> When it comes to syscalls, I haven't looked
> >>>> closely, but the issue is likely the same.
> >
> > It's atomic and side effect free if done correctly.  We have pleasing
> > examples of this.  This is hard to get right though, so naturally mistakes
> > are made.
> >
> 
> It would be nice to have at least an example of doing it right.

Sure, look at ksem_sysfini().

It tries to uninstall the syscall package; if any syscall is in use, that
fails.  It then looks to see whether it has created any ksems; if so, it
fails and plugs the syscalls back in.

While that's happening new ksem syscalls are being gated by sys_nomodule()
(this all happens under lock).  The legitimate users of the ksem syscalls
will never see a spurious failure due to an attempt to unload.
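
The same dance in outline, as a hedged sketch rather than the actual ksem
code (foo_syscalls, foo_nactive and foo_lock are invented):

	static int
	foo_fini(void)
	{
		int error;

		mutex_enter(&foo_lock);
		/* Gate new callers; fails if any syscall is executing now. */
		error = syscall_disestablish(NULL, foo_syscalls);
		if (error != 0) {
			mutex_exit(&foo_lock);
			return error;
		}
		/* Objects still live?  Back out: plug the syscalls back in. */
		if (foo_nactive != 0) {
			error = syscall_establish(NULL, foo_syscalls);
			KASSERT(error == 0);
			mutex_exit(&foo_lock);
			return EBUSY;
		}
		mutex_exit(&foo_lock);
		return 0;
	}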

ksem_sysinit() seems to have a bug though; ksem_max should be set before the
syscall package is installed.

Andrew


Re: [filemon] CVS commit: htdocs/support/security

2019-12-17 Thread Andrew Doran
> > > Typically with a character device, the kmod can get unloaded while an 
> > > ioctl
> > > is being executed on it.

That's solvable.  Conceptually I think the main stumbling block is that
there are two layers at play which need to be reconciled: specfs and devsw.  It
could also be an opportunity to lessen the distinction between block and
character devices, there's no real need for cached access from userspace,
that would simplify things too.

> > > When it comes to syscalls, I haven't looked
> > > closely, but the issue is likely the same.

It's atomic and side effect free if done correctly.  We have pleasing
examples of this.  This is hard to get right though, so naturally mistakes
are made.

Andrew


Re: Changes to reduce pressure on uvm_pageqlock

2019-12-12 Thread Andrew Doran
On Thu, Dec 12, 2019 at 06:30:49PM +0900, Takashi YAMAMOTO wrote:

> > I do want to add more fields.  It turns out the first two fit into the
> > unused lower bits of pg->phys_addr without too much inconvenience for
> > anybody:
> >
> > - freelist: 'cos uvm_page_lookup_freelist() turns up prominently in traces
> >
> 
> how about having physseg id instead?
> i think it isn't too expensive to get freelist and phys_addr using it.
> i guess numa things have good uses of physseg-equivalent too.

It looks doable, and still better than using the rbtree in uvm_pagefree(),
but would require changes to the allocation of physsegs for UVM_HOTPLUG.
I will go with freelist # to begin with, but it will be easy to change to
seg later with some small changes to the allocation.  I will include that in
the comments.
 
> > - bucket: which cpu->ci_package_id I am from, for L2/L3 locality or NUMA
> >
> > The third looks like it'll fit in the bottom bits of pg->offset but I need
> > to investigate what dificulties it might cause:
> >
> > - waiter count for PG_BUSY, replacing PG_WANTED, to avoid thundering herd
> >
> 
> i guess sooner or later we will need a major surgery to allow concurrent
> faults on a page. at least for easy/common cases.
> (well, it has been "sooner or later" for a decade.)

Yup a waiters count only treats the symptom.  I looked at FreeBSD's vm_page
out of interest and it seems they have turned PG_BUSY into a kind of
reader/writer lock.  Hmm.

Andrew


Re: Changes to reduce pressure on uvm_pageqlock

2019-12-11 Thread Andrew Doran
On Sun, Dec 08, 2019 at 07:00:43AM -0800, Jason Thorpe wrote:

> > On Dec 8, 2019, at 6:00 AM, Mindaugas Rasiukevicius  
> > wrote:
> > 
> > Just a small note: since vm_page::wire_count and vm_page::loan_count are
> > kind of mutually exclusive, I think suggestion by yamt@ (long time ago) to
> > merge them (positive value meaning wired count and negative -- loan) is a
> > useful one.  It would also free up those 16-bits, since Andrew seems to be
> > in need of some.

Thank you for the suggestion.  I have added stuff into vm_page which I don't
particularly like doing, so I'll look into the change you mentioned, maybe
not this week though and after I deliver this set of changes.

I do want to add more fields.  It turns out the first two fit into the
unused lower bits of pg->phys_addr without too much inconvenience for
anybody:

- freelist: 'cos uvm_page_lookup_freelist() turns up prominently in traces
- bucket: which cpu->ci_package_id I am from, for L2/L3 locality or NUMA

The third looks like it'll fit in the bottom bits of pg->offset but I need
to investigate what difficulties it might cause:

- waiter count for PG_BUSY, replacing PG_WANTED, to avoid thundering herd

> That seems like a good change to make, but omg please wrap the
> manipulation in macros / inlines because it seems like otherwise it could
> be quite error-prone.

Agreed.

Andrew


Re: Changes to reduce pressure on uvm_pageqlock

2019-12-09 Thread Andrew Doran
Complete & compilable diff as promised.  Having taken feedback into
consideration and considering that with the last revision uvm_pageqlock
wasn't doing a whole lot any more, this takes it further than before, about
as far as I want to go right now.  Summary:

- uvm_pageqlock is gone.
- instead we have vm_page::interlock protecting identity, and...
- the pdpolicy has its own lock.  could do multiple queues/locks later.
- the ground is paved for lazy dequeueing of pages.
- vm_page has grown (yamt's radixtree can eliminate rbtree node later).
- i think it fixes this recent PR: http://gnats.netbsd.org/54727
- vm_page::pqflags becomes totally private to pdpolicy.  it now needs
  to be 32-bits to stand on its own for architectures that can't do
  atomic sub-word writes, because it's locked by a different lock to
  the adjacent fields.  i am using the spare bits for lazy activation.
- i have bumped loan & wire counts to 32-bits.  i don't think 16 is safe.
- clockpro was already broken but I have tried not to make it worse.

http://www.netbsd.org/~ad/2019/uvm_pageqlock.diff

Cheers,
Andrew

On Sat, Dec 07, 2019 at 10:40:39PM +, Andrew Doran wrote:
> On Sat, Dec 07, 2019 at 01:32:35PM -0800, Chuck Silvers wrote:
> 
> > I like this new version much more than the previous version.
> > The new diff looks incomplete though, it uses the "pdpol" field in
> > vm_page that was new in the previous diff.  It looks like there
> > should be changes in uvm_pdaemon.c to go with the new diff too?
> > Could you send the complete new diff again please?
> 
> Yes it's incomplete.  I'll split these pieces out of my tree, merge into
> -current tomorrow and get a clean diff.
> 
> I think the locking change is enough on its own, so I will exclude the
> "pdpol" piece to avoid clouding things, and go again with another review for
> that and other parts in the near future.
> 
> > > With that done, we'd then only need one version of uvm_pageactivate() etc,
> > > because those functions will no longer care about uvm_pageqlock.
> > 
> > The pageq lock is named that because it currently protects the pageq's.  :-)
> > If it's not going to protect the pageq's anymore then it ought to be 
> > renamed.
> > Probably better to just work on eliminating it though.
> 
> Agreed.
>  
> > In your new diff, the pageq lock is still used by the pagedaemon
> > (for uvmpd_trylockowner() as you noted in the comment).
> > It would be good not to have a global lock involved in changing page 
> > identity,
> > and with the pageq's now protected by a separate lock, we should be able to
> > eliminate the need for the pageq lock when freeing a page with an RCUy thing
> > to prevent freeing uvm_object and uvm_anon structures while the pagedaemon 
> > might
> > still be accessing them via the vm_page fields.  We should be able to 
> > improve
> > the locking for pg->loan_count with a similar RCUy thing... my recollection 
> > is
> > that the only reason loan_count uses the pageq lock currently is to 
> > stabilize
> > the page identity while locking the owning object.  It might make sense to 
> > have
> 
> On stablisation yes I agree.  My reading has always been that it's safe to
> test those properties with either the pageqlock or the object locked - but
> on what happens next, your intent matters.
> 
> > pg->wire_count protected by the pdpol lock too (though I have some thoughts
> > about changing the nature of wire_count that would probably make that not
> > make sense, I haven't had time to think that through yet).
> > 
> > But anyway, I'll stop rambling at this point so we can focus on the diff at 
> > hand.
> > :-)
> 
> That's an interesting point re stabilisation and a RCUy type thing I see
> exactly what you're getting at and it's related to the few points of
> connection between datasets I mentioned.  I'll have a think about that.
> 
> Thanks for taking the time to look this over Chuck.
> 
> Cheers,
> Andrew
> 
> > 
> > -Chuck
> > 
> > 
> > > I tried it out: the code looks more elegant and the scheme works in 
> > > practice
> > > with contention on uvm_pageqlock now quite minor, and the sum of 
> > > contention
> > > on the two locks lower than it was on uvm_pageqlock prior.
> > > 
> > > With the locking for the pdpolicy code being private to the file I think 
> > > it
> > > would then be easier to experiment with different strategies to reduce
> > > contention further.
> > > 
> > > http://www.netbsd.org/~ad/2019/pdpolicy.diff
> > > 
> > > Andrew


Re: modules item #14 revisited

2019-12-08 Thread Andrew Doran
On Sat, Dec 07, 2019 at 04:35:07PM -0800, John Nemeth wrote:

> On Dec 7,  4:31pm, Christos Zoulas wrote:
> } On Dec 7,  8:55pm, a...@absd.org (David Brownlee) wrote:
> } 
> } | Very much like this - would assume that modules.tgz goes away?
> 
>  I can't say I'm a fan of this.  I would hope that it goes away
> once we get serious about having a stable KABI for modules.  Modules
> shouldn't be tied to a particular kernel the way they currently
> are.

I was of the same opinion but have realised we'll never get to that point. 
For a plain release version of NetBSD I think it's not really an issue.
Also going for something more rigid makes sense for a commercial Unix but
for us I think it would needlessly put a limit on innovation.

> Note that i386 includes an additional set of modules for Xen PAE.

This was something that Manuel and I originally disagreed on.  I wanted x86
native & Xen to share modules but I don't have a strong opinion on it any
more.

Andrew


Re: Changes to reduce pressure on uvm_pageqlock

2019-12-07 Thread Andrew Doran
On Sat, Dec 07, 2019 at 01:32:35PM -0800, Chuck Silvers wrote:

> I like this new version much more than the previous version.
> The new diff looks incomplete though, it uses the "pdpol" field in
> vm_page that was new in the previous diff.  It looks like there
> should be changes in uvm_pdaemon.c to go with the new diff too?
> Could you send the complete new diff again please?

Yes it's incomplete.  I'll split these pieces out of my tree, merge into
-current tomorrow and get a clean diff.

I think the locking change is enough on its own, so I will exclude the
"pdpol" piece to avoid clouding things, and go again with another review for
that and other parts in the near future.

> > With that done, we'd then only need one version of uvm_pageactivate() etc,
> > because those functions will no longer care about uvm_pageqlock.
> 
> The pageq lock is named that because it currently protects the pageq's.  :-)
> If it's not going to protect the pageq's anymore then it ought to be renamed.
> Probably better to just work on eliminating it though.

Agreed.
 
> In your new diff, the pageq lock is still used by the pagedaemon
> (for uvmpd_trylockowner() as you noted in the comment).
> It would be good not to have a global lock involved in changing page identity,
> and with the pageq's now protected by a separate lock, we should be able to
> eliminate the need for the pageq lock when freeing a page with an RCUy thing
> to prevent freeing uvm_object and uvm_anon structures while the pagedaemon 
> might
> still be accessing them via the vm_page fields.  We should be able to improve
> the locking for pg->loan_count with a similar RCUy thing... my recollection is
> that the only reason loan_count uses the pageq lock currently is to stabilize
> the page identity while locking the owning object.  It might make sense to 
> have

On stablisation yes I agree.  My reading has always been that it's safe to
test those properties with either the pageqlock or the object locked - but
on what happens next, your intent matters.

> pg->wire_count protected by the pdpol lock too (though I have some thoughts
> about changing the nature of wire_count that would probably make that not
> make sense, I haven't had time to think that through yet).
> 
> But anyway, I'll stop rambling at this point so we can focus on the diff at 
> hand.
> :-)

That's an interesting point re stabilisation and a RCUy type thing I see
exactly what you're getting at and it's related to the few points of
connection between datasets I mentioned.  I'll have a think about that.

Thanks for taking the time to look this over Chuck.

Cheers,
Andrew

> 
> -Chuck
> 
> 
> > I tried it out: the code looks more elegant and the scheme works in practice
> > with contention on uvm_pageqlock now quite minor, and the sum of contention
> > on the two locks lower than it was on uvm_pageqlock prior.
> > 
> > With the locking for the pdpolicy code being private to the file I think it
> > would then be easier to experiment with different strategies to reduce
> > contention further.
> > 
> > http://www.netbsd.org/~ad/2019/pdpolicy.diff
> > 
> > Andrew


Re: modules item #14 revisited

2019-12-07 Thread Andrew Doran
On Fri, Dec 06, 2019 at 09:42:24PM -0500, Christos Zoulas wrote:

> This is a quick and dirty implementation of:
> 
> http://mail-index.NetBSD.org/current-users/2009/05/10/msg009372.html

I like this approach a lot.

Andrew


Re: Changes to reduce pressure on uvm_pageqlock

2019-12-06 Thread Andrew Doran
Ok here's is a more complete patch for the change to uvm_pageqlock.  I have
edited by hand to remove lots of unrelated changes so it's to get a picture
of the changes - it won't apply or compile.

http://www.netbsd.org/~ad/2019/partial.diff

(It's time consuming to detangle this from the rest of my changes which are
per-CPU stat collection, a topology-aware page allocator that dispenses with
the "per-cpu" lists we have now, and yamt's radixtree).

There are a few more things I want to look at tomorrow including the
PQ_READAHEAD flag (it's not safe now) whether the pagedaemon needs to know
about the pdpolicy lock at all, and a last go over it all checking
everything.

The earlier changes to vm_page I mentioned for alignment and the page
replacement policy I am going to postpone because radixtree changes the
picture bigtime and more testing is required. (That's a separate review.)

As to the reason why: at the start of November on my machine system time for
a kernel build was 3200-3300s.  With this plus the remaining changes it's
down to 850s so far and with !DIAGNOSTIC about 580s.

Andrew

On Fri, Dec 06, 2019 at 04:34:50PM +, Andrew Doran wrote:

> I ended up taking a disliking to these changes so I went back and took
> another look at the problem yesterday evening.
> 
> There are two distinct sets of data involved that need protection: page
> identity (connected anon, uobj, wiring, loaning) and page replacement state.
> 
> There are limited points of connection between those two sets, so it's easy
> to do two locks.  We can give the pdpolicy code a lock for its state and let
> it manage that lock, which only it and the pagedaemon need to know about. 
> The lock can be private to the file (uvm_pdpolicy_clock / clockpro.c), with
> a couple of hooks for the pagedaemon to acquire and release it.
> 
> With that done, we'd then only need one version of uvm_pageactivate() etc,
> because those functions will no longer care about uvm_pageqlock.
> 
> I tried it out: the code looks more elegant and the scheme works in practice
> with contention on uvm_pageqlock now quite minor, and the sum of contention
> on the two locks lower than it was on uvm_pageqlock prior.
> 
> With the locking for the pdpolicy code being private to the file I think it
> would then be easier to experiment with different strategies to reduce
> contention further.
> 
> http://www.netbsd.org/~ad/2019/pdpolicy.diff
> 
> Andrew
> 
> On Tue, Dec 03, 2019 at 12:41:25AM +, Andrew Doran wrote:
> > Hi,
> > 
> > I have some straightforward changes for uvm to improve the situation on
> > many-CPU machines.  I'm going to break them into pieces to make them easier
> > to review (only this first piece and what's already in CVS is ready).
> > 
> > I have carefully measured the impact of these over hundreds of kernel builds,
> > using lockstat, tprof and some custom instrumentation so I'm confident that
> > for each, the effects at least are of value.
> > 
> > Anyway I'd be grateful if someone could take a look.  This one is about
> > reducing pressure on uvm_pageqlock, and cache misses on struct vm_page.
> > 
> > Cheers,
> > Andrew
> > 
> > 
> > http://www.netbsd.org/~ad/2019/uvm1.diff
> > 
> > 
> > vm_page: cluster largely static fields used during page lookup in the first
> > 64-bytes.  Increase wire_count to 32-bits, and add a field for use of the
> > page replacement policy.  This leaves 2 bytes spare in a word locked by
> > uvm_fpageqlock/uvm_pageqlock which I want to use later for changes to the
> > page allocator.  It also brings vm_page up to 128 bytes on amd64.
> > 
> > New functions:
> > 
> > => uvmpdpol_pageactivate_p()
> > 
> > For page replacement policy.  Returns true if pdpol thinks activation info
> > would be useful enough to cause disruption to page queues, vm_page and
> > uvm_fpageqlock.  For CLOCK this returns true if page is not active, or was
> > not activated within the last second.
> > 
> > => uvm_pageenqueue1()
> > 
> > Call without uvm_pageqlock.  Acquires the lock and enqueues the page only
> > if not already enqueued.
> > 
> > => uvm_pageactivate1()
> > 
> > Call without uvm_pageqlock.  Acquires the lock and activates the page only
> > if uvmpdpol_pageactivate_p() says yes.  No similar change for deactivate nor
> > dequeue as they are much more definite.
> > 
> > => uvm_pagefree1()
> > 
> > First part of uvm_pagefree() - strip page of identity.  Requires
> > uvm_pageqlock if associated with an object.
> > 
> > => uvm_pagefree2()
> > 
> > Second part of uvm_pagefree().  Send page to free list.  No locks required.


Re: racy access in kern_runq.c

2019-12-06 Thread Andrew Doran
On Fri, Dec 06, 2019 at 05:22:39PM +0900, Kengo NAKAHARA wrote:

>   /* Update the worker */
> - worker_ci = hci;
> + atomic_swap_ptr(&worker_ci, hci);

Why atomic_swap_ptr() and not atomic_store_relaxed()?  I don't see any bug that
it fixes.  Other than that it looks OK to me.
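
Something like this is all I had in mind (a sketch only, assuming the new
relaxed accessors from <sys/atomic.h>; worker_ci and hci as in your diff):

	/* Writer: publish the new worker; no ordering is implied or needed. */
	atomic_store_relaxed(&worker_ci, hci);

	/* Reader: a relaxed load still yields a whole, untorn pointer. */
	ci = atomic_load_relaxed(&worker_ci);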

Cheers,
Andrew


Re: Changes to reduce pressure on uvm_pageqlock

2019-12-06 Thread Andrew Doran
I ended up taking a disliking to these changes so I went back and took
another look at the problem yesterday evening.

There are two distinct sets of data involved that need protection: page
identity (connected anon, uobj, wiring, loaning) and page replacement state.

There are limited points of connection between those two sets, so it's easy
to do two locks.  We can give the pdpolicy code a lock for its state and let
it manage that lock, which only it and the pagedaemon need to know about. 
The lock can be private to the file (uvm_pdpolicy_clock / clockpro.c), with
a couple of hooks for the pagedaemon to acquire and release it.

With that done, we'd then only need one version of uvm_pageactivate() etc,
because those functions will no longer care about uvm_pageqlock.

I tried it out: the code looks more elegant and the scheme works in practice
with contention on uvm_pageqlock now quite minor, and the sum of contention
on the two locks lower than it was on uvm_pageqlock prior.

With the locking for the pdpolicy code being private to the file I think it
would then be easier to experiment with different strategies to reduce
contention further.
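
Roughly the shape of it, with illustrative names only (the real thing is in
the diff below):

/* uvm_pdpolicy_clock.c / clockpro.c: the lock is private to the policy. */
static kmutex_t	pdpol_lock;

/* The only two hooks the pagedaemon needs; nobody else sees the lock. */
void
uvmpdpol_lock(void)
{
	mutex_enter(&pdpol_lock);
}

void
uvmpdpol_unlock(void)
{
	mutex_exit(&pdpol_lock);
}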

http://www.netbsd.org/~ad/2019/pdpolicy.diff

Andrew

On Tue, Dec 03, 2019 at 12:41:25AM +0000, Andrew Doran wrote:
> Hi,
> 
> I have some straightforward changes for uvm to improve the situation on
> many-CPU machines.  I'm going to break them into pieces to make them easier
> to review (only this first piece and what's already in CVS is ready).
> 
> I have carefully measured the impact of these over hundreds of kernel builds,
> using lockstat, tprof and some custom instrumentation so I'm confident that
> for each, the effects at least are of value.
> 
> Anyway I'd be grateful if someone could take a look.  This one is about
> reducing pressure on uvm_pageqlock, and cache misses on struct vm_page.
> 
> Cheers,
> Andrew
> 
> 
> http://www.netbsd.org/~ad/2019/uvm1.diff
> 
> 
> vm_page: cluster largely static fields used during page lookup in the first
> 64-bytes.  Increase wire_count to 32-bits, and add a field for use of the
> page replacement policy.  This leaves 2 bytes spare in a word locked by
> uvm_fpageqlock/uvm_pageqlock which I want to use later for changes to the
> page allocator.  It also brings vm_page up to 128 bytes on amd64.
> 
> New functions:
> 
> => uvmpdpol_pageactivate_p()
> 
> For page replacement policy.  Returns true if pdpol thinks activation info
> would be useful enough to cause disruption to page queues, vm_page and
> uvm_fpageqlock.  For CLOCK this returns true if page is not active, or was
> not activated within the last second.
> 
> => uvm_pageenqueue1()
> 
> Call without uvm_pageqlock.  Acquires the lock and enqueues the page only
> if not already enqueued.
> 
> => uvm_pageactivate1()
> 
> Call without uvm_pageqlock.  Acquires the lock and activates the page only
> if uvmpdpol_pageactivate_p() says yes.  No similar change for deactivate nor
> dequeue as they are much more definite.
> 
> => uvm_pagefree1()
> 
> First part of uvm_pagefree() - strip page of identity.  Requires
> uvm_pageqlock if associated with an object.
> 
> => uvm_pagefree2()
> 
> Second part of uvm_pagefree().  Send page to free list.  No locks required.


Re: racy access in kern_runq.c

2019-12-06 Thread Andrew Doran
On Fri, Dec 06, 2019 at 02:55:47PM +, Mindaugas Rasiukevicius wrote:

> Compilers have become much more aggressive over the years.  But they are
> allowed to be so by the C standard.  Specifically, in addition to code-level
> re-ordering,

Yup the rules around that are well understood.

> plain accesses (loads/stores) are subject to load/store fusing, tearing as
> well as invented loads/stores.  At least load/store fusing and tearing
> *have* been observed in reality [1] [2] [3].  So, they are *not* merely
> theoretical or esoteric problems, although there are plenty of these in
> the C11 memory model too [4] [5].

Thank you for the references, very interesting - if distressing - reading.
So it does happen.

> Linux kernel developers went through this already.  Perhaps the C standard
> will plug the holes or perhaps compilers will just change their behaviour,
> as they get enough criticism [6] [7].  However, in the mean time, I think
> let's just accept that things moved on and let's embrace the new primitives.
> While these primitives might be slightly verbose, they are in C11, they fix
> real bugs, they definitely make code less error-prone and they have other
> merits too (e.g. they accommodate static analysers which find some real bugs).

Well I'm firmly of the opinion that this is a bug in the compiler.  By all
means a nice behaviour to have as an option and I very much see the use of
it, but having it as the default for compiling old code is bad news indeed.

No argument from me on accommodating static analysers; I think that's a good
thing.  Presumably then we have no other choice than to work around it,
because even if we can tell the compiler not to be a jerk today, that option
won't be feasible to use in the future.

Thanks again,
Andrew


Re: racy access in kern_runq.c

2019-12-06 Thread Andrew Doran
On Fri, Dec 06, 2019 at 10:27:20AM +0100, Maxime Villard wrote:

> With 'worker_ci', there is an actual safety issue, because the compiler could
> split the accesses and the hardware may not use atomics by default like x86.
> This could cause random page faults; so it needs to be strictly atomic.

No I don't accept that.

The ability to load and store a native word sized int (and in more recent
years a pointer) with a single instruction is a fundamental assumption that
every operating system written in C rests upon.

If the compiler splits one of those accesses, then you are either using some
other data type, or have a broken compiler on your hands.  If the compiler
is broken it's the compiler you should be looking at, not the program it
compiled.  It's as simple as that.

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html

Andrew


Re: racy access in kern_runq.c

2019-12-06 Thread Andrew Doran
Hi,

On Fri, Dec 06, 2019 at 03:52:23PM +0900, Kengo NAKAHARA wrote:

> There are some racy accesses in kern_runq.c detected by KCSAN.  Those
> racy access messages are so frequent that they drown out other messages,
> so I want to fix them.  They can be fixed by the following patch.

Please don't commit this.  These accesses are racy by design.  There is no
safety issue and we do not want to disturb the activity of other CPUs in
this code path by locking them.  We don't want to use atomics either.
This code path is extremely hot, and using atomics would impose a severe
performance penalty on the system under certain workloads.

Also if you change this to use strong ordering, you're quite likely to
change the selection behaviour of LWPs.  This is something delicate that
exhibits reasonable scheduling behaviour in large part through randomness
i.e. by accident, so serialising it is likely to have strong effects on how
LWPs are distributed.

Lastly I have a number of experimental changes to this code which I'll be
committing in the near future, allowing people to change the LWP selection
policy to see if we can improve performance under different workloads.
They will also likely show up in KCSAN as racy/dangerous accesses.

My suggestion is to find a way to teach KCSAN that we know something is
racy, we like it being racy, and that it's not a problem, so that it no
longer shows up in the KCSAN output.
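
For example, something like this (a sketch; it assumes the C11-style relaxed
accessors in <sys/atomic.h> count as "marked" accesses as far as KCSAN is
concerned):

	/*
	 * Intentionally racy by design: we only want an approximate view
	 * of the remote run queue, so mark the load rather than order it.
	 * This keeps KCSAN quiet without adding any ordering or atomic-RMW
	 * cost to this hot path.
	 */
	if (atomic_load_relaxed(&ci_rq->r_count) == 0) {
		/* ... prefer this CPU ... */
	}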

Thank you,
Andrew


Changes to reduce pressure on uvm_pageqlock

2019-12-02 Thread Andrew Doran
Hi,

I have some straightforward changes for uvm to improve the situation on
many-CPU machines.  I'm going to break them into pieces to make them easier
to review (only this first piece and what's already in CVS is ready).

I have carefully measured the impact of these over hundreds of kernel builds,
using lockstat, tprof and some custom instrumentation so I'm confident that
for each, the effects at least are of value.

Anyway I'd be grateful if someone could take a look.  This one is about
reducing pressure on uvm_pageqlock, and cache misses on struct vm_page.

Cheers,
Andrew


http://www.netbsd.org/~ad/2019/uvm1.diff


vm_page: cluster largely static fields used during page lookup in the first
64-bytes.  Increase wire_count to 32-bits, and add a field for use of the
page replacement policy.  This leaves 2 bytes spare in a word locked by
uvm_fpageqlock/uvm_pageqlock which I want to use later for changes to the
page allocator.  It also brings vm_page up to 128 bytes on amd64.

New functions:

=> uvmpdpol_pageactivate_p()

For page replacement policy.  Returns true if pdpol thinks activation info
would be useful enough to cause disruption to page queues, vm_page and
uvm_fpageqlock.  For CLOCK this returns true if page is not active, or was
not activated within the last second.

=> uvm_pageenqueue1()

Call without uvm_pageqlock.  Acquires the lock and enqueues the page only
if not already enqueued.

=> uvm_pageactivate1()

Call without uvm_pageqlock.  Acquires the lock and activates the page only
if uvmpdpol_pageactivate_p() says yes.  No similar change for deactivate nor
dequeue as they are much more definite.

=> uvm_pagefree1()

First part of uvm_pagefree() - strip page of identity.  Requires
uvm_pageqlock if associated with an object.

=> uvm_pagefree2()

Second part of uvm_pagefree().  Send page to free list.  No locks required.
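
To make the intent of uvm_pageactivate1() concrete, the pattern is roughly
this (a sketch of the idea, not the code in the diff):

void
uvm_pageactivate1(struct vm_page *pg)
{
	/* Cheap, unlocked check: does the pdpolicy even want the info? */
	if (!uvmpdpol_pageactivate_p(pg))
		return;

	/*
	 * Only now pay for the global lock and the queue manipulation.
	 * The real version would presumably re-evaluate under the lock.
	 */
	mutex_enter(&uvm_pageqlock);
	uvm_pageactivate(pg);
	mutex_exit(&uvm_pageqlock);
}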


Re: Why NetBSD x86's bus_space_barrier does not use [sml]fence?

2019-12-02 Thread Andrew Doran
On Mon, Dec 02, 2019 at 07:03:50PM +, Andrew Doran wrote:

> I think the way to avoid both of those would be to key off the bus_space_tag_t.

Ah, it can't be in the tag.  If we do it, it has to be via the handle, since
the same tag can be used with different options.

Andrew


Re: Why NetBSD x86's bus_space_barrier does not use [sml]fence?

2019-12-02 Thread Andrew Doran
On Mon, Dec 02, 2019 at 07:03:50PM +, Andrew Doran wrote:

> We're talking UC/WB mappings here though, so:

Meant to write UC/WC - I must be getting old.


Re: Why NetBSD x86's bus_space_barrier does not use [sml]fence?

2019-12-02 Thread Andrew Doran
On Mon, Dec 02, 2019 at 06:30:21AM +, Taylor R Campbell wrote:

> > Date: Fri, 29 Nov 2019 14:49:33 +
> > From: Andrew Doran 
> > 
> > If no, the CALL instruction that calls bus_space_barrier() produces a write
> > to the stack when storing the return address.  On x86, stores are totally
> > ordered, and loads are never reordered around stores.  No further barrier
> > is needed.  This is the idea anyway, sometimes reality does not match..
> 
> I don't understand this.  The Intel manual (volume 3 of the 3-volume
> manual, System Programming Guide, Sec. 8.2.2 `Memory Ordering in P6
> and More Recent Processor Families') says:

I have not been paying attention, refused to engage my brain and gotten
confused.  My fault.  In the case of ordering in WB regions, I put a comment
in sys/arch/x86/include/lock.h describing the situation the last time this
discussion came up, and that was in regards to locks.  I have strong suspicions
about what Intel writes but probably better not to muddy the discussion with
them!

We're talking UC/WB mappings here though, so:

> It's not entirely clear to me from the Intel manual, but in the AMD
> manual (e.g., Volume 2, chapter on Memory System) it is quite clear
> that write-combining `WC' regions may issue stores out of order -- and
> that's what you get from BUS_SPACE_MAP_PREFETCHABLE.
> 
> So it seems to me we really do need
> 
> switch (flags) {
> case BUS_SPACE_BARRIER_READ:
>   lfence or locked instruction;
>   break;
> case BUS_SPACE_BARRIER_WRITE:
>   sfence or locked instruction;
>   break;
> case BUS_SPACE_BARRIER_READ|BUS_SPACE_BARRIER_WRITE:
>   mfence or locked instruction;
>   break;
> }

I see a couple of problems with what's in x86_bus_space_barrier() now:

- It will apply to UC mappings too and it's an unneeded pessimisation,
  because the majority of bus_space accesses on x86 don't need any kind of
  barrier.  Maybe the other systems are good with that, I think we should
  try to do better.

- It will also apply to I/O space mappings.  There are, if I recall
  correctly, some drivers that can work with either I/O or memory mapped
  chips and bus_space is designed to allow this.  In the case of I/O port
  access there is no memory access going on (at least from the kernel's POV,
  I don't know how the chipset implements it under the covers).  There will
  also be drivers that are I/O port mapping only and the author has followed
  the bus_space manual page diligently and put in barriers anyway.

I think the way to avoid both of those would be to key off the bus_space_tag_t.
There are already two, one for I/O and one for memory.  We could do multiple
memory tags.  That's probably very easy, I think the harder bit would likely
be passing those down to PCI and having it choose.  I don't know that code.
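
Very roughly, keying off the mapping could end up looking like this (purely
illustrative: the "needs fences" bit on the tag/handle does not exist today,
and the fence helpers are assumed):

void
x86_bus_space_barrier(bus_space_tag_t t, bus_space_handle_t h,
    bus_size_t offset, bus_size_t len, int flags)
{
	/*
	 * Hypothetical: only mappings created with BUS_SPACE_MAP_PREFETCHABLE
	 * (i.e. WC) would record a "needs fences" bit.  I/O port and UC
	 * mappings fall straight through with no penalty.
	 */
	if (!BUS_SPACE_NEEDS_FENCE(t, h))
		return;

	switch (flags) {
	case BUS_SPACE_BARRIER_READ:
		x86_lfence();
		break;
	case BUS_SPACE_BARRIER_WRITE:
		x86_sfence();
		break;
	default:
		x86_mfence();
		break;
	}
}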
 
Andrew


Re: Why NetBSD x86's bus_space_barrier does not use [sml]fence?

2019-11-30 Thread Andrew Doran
Hi,

I assume this is a PCI device.  Could you please dump the device's
configuration registers for us with "pcictl dump"?

Thanks,
Andrew


Re: Why NetBSD x86's bus_space_barrier does not use [sml]fence?

2019-11-29 Thread Andrew Doran
On Fri, Nov 29, 2019 at 10:39:09PM +0900, Shoichi Yamaguchi wrote:

> FreeBSD and OpenBSD use memory fences(mfence, sfence, lfence) to
> implement bus_space_barrier().
> On the other hand, NetBSD does not use them.
> 
> Do you know the background about current implementation?
> 
> I found an issue caused by this implementation difference while porting
> ixl(4).

Are you using BUS_SPACE_MAP_PREFETCHABLE?

If yes, I think there might be the possibility of reordering.  There should
probably be a fence.

If no, the CALL instruction that calls bus_space_barrier() produces a write
to the stack when storing the return address.  On x86, stores are totally
ordered, and loads are never reordered around stores.  No further barrier
is needed.  This is the idea anyway, sometimes reality does not match..

Andrew


Re: Some minor improvements to select/poll

2019-11-20 Thread Andrew Doran
Hi Joerg,

On Wed, Nov 20, 2019 at 11:14:32PM +0100, Joerg Sonnenberger wrote:
> On Wed, Nov 20, 2019 at 09:38:56PM +0000, Andrew Doran wrote:
> > (1) Increase the maximum number of clusters from 32 to 64 for large systems.
> > kcpuset_t could potentially be used here but that's an excursion I don't
> > want to go on right now.  uint32_t -> uint64_t is very simple.
> 
> Careful about alignment here. There is hidden padding on ILP32 with
> 64bit alignment for uint64_t.

Good catch.  Probably better to take the hit, bump the version and make sure
the alignment is safe.
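
For anyone following along, the issue is simply this (illustrative struct,
not the actual one from the diff):

struct cluster_mask {
	uint64_t	cm_mask;	/* was uint32_t: one bit per cluster */
	int		cm_count;
	/*
	 * On ILP32 ports where uint64_t is 8-byte aligned (arm and others)
	 * the struct now carries hidden tail padding and its alignment
	 * changes, hence the version bump or an explicit layout.
	 */
};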

Cheers,
Andrew


Some minor improvements to select/poll

2019-11-20 Thread Andrew Doran
Hi all.

Long time no see.  For review:

(1) Increase the maximum number of clusters from 32 to 64 for large systems.
kcpuset_t could potentially be used here but that's an excursion I don't
want to go on right now.  uint32_t -> uint64_t is very simple.

(2) In the case of a non-blocking select/poll, or where we won't block
because there are events ready to report, stop registering interest in
the back-end objects early.

(3) Change the wmesg for poll back to "poll".

diff:
http://www.netbsd.org/~ad/2019/select.diff

libmicro comparison:
http://www.netbsd.org/~ad/2019/select.html

Any comments?

Andrew


Re: Merge of rmind-uvmplock branch

2011-06-14 Thread Andrew Doran
Mindaugas, thank you very much for your hard work!!

On Tue, Jun 07, 2011 at 03:16:06AM +, YAMAMOTO Takashi wrote:

> the idea is to protect pv chains with object lock, right?

That was the initial driver for me anyway.

Now that this work is in, there are some changes that can be built upon it
to realise the full value -- and hopefully make exit() etc. very cheap.
Looking at this from an x86 perspective but some of the ideas will apply
to other ports:

- My initial idea was to kill the global PV hash and associated locks.  To
  replace this we would embed a list head in each uvm_map_entry (or wherever
  else a mapping is managed).  This would be supplied by the caller on each
  relevant pmap call - pmap_enter() and so on.  PV entries would be added to
  and removed from this structure by the pmap module.  An initial
  implementation could get away with a dumb linked list I think.

- So then PV entries tracked with the mapping instead of globally. Once
  pmap_remove_all() has been called pmap_remove() could switch to a
  "shortcut" mode and become very quick.  From memory I believe all it would
  need to do is tear down the software PV entries, and not touch any page or
  pmap state.  Tearing down pmap state would be deferred to pmap_destroy(). 
  At that point we can then clear out the pmap's uvm_objects and free all
  pmap pages in bulk.  This would avoid potentially thousands of expensive
  scans and atomic updates to the hardware paging structures, which account
  for the bulk of expense during exit() etc.  If the system is short on
  memory we might want a mechanism to switch CPUs away from this pmap if
  they are hanging onto it as pmap_kernel - i.e. preventing pmap_destroy()
  from being called.

- In the x86 pmap, we have a number of retryable sequences when we operate in
  the P->V direction.  pmap_page_remove() for instance.  I think these can
  now be simplified as there are fewer concurrency issues, no need for
  retryable sequences.  Yamamoto-san can you confirm?


Re: [PATCH] bufq_priocscan enhancement

2011-06-14 Thread Andrew Doran
I have not read the whole thread -- sorry.  I have a suggestion. If I were
to set out to make a dent on I/O performance, I would invest time in some
groundwork first:

- Stop abusing "struct buf" to describe I/O requests. Introduce an
  iorequest_t or something.

- Currently we do a virtual->physical->virtual->physical dance for many
  I/Os.  The process has quite a bit of overhead, and for many of those I/Os
  we won't need them to appear in kernel space hence the dance is not
  needed.  Make I/O requests pass around lists of vm_pages or physical
  addresses or whatever,


Re: diff: add show proc command to ddb

2011-04-06 Thread Andrew Doran
On Wed, Apr 06, 2011 at 03:45:25PM +, Eduardo Horvath wrote:

> On Wed, 6 Apr 2011, Vladimir Kirillov wrote:
> 
> > Hello, tech-kern@!
> > 
> > I really wanted a show proc command to avoid looking up process
> > information by running ps with all flags and intensive scrolling.
> > 
> > The show proc output mostly combines the outputs of all switches
> > in ps.
> 
> Neat!
> 
> Since you're doing a lot of DDB rototilling, could you add an option to 
> your 'show proc' command that also dumps each lwp's stacktrace?  I find 
> that extremely useful for diagnosing deadlocks.  (The version I hacked up 
> is not very clean.)

Good idea.  I think it could be an involved project but one of the things
I'd love to see eventually is a ddb command that prints a diagram of "who is
waiting on who" so you don't have to piece it all together by yourself.

With "crash" you could do some of this by piping things in and out of it and
playing with awk etc, but it could do with some work and is obviously no use
if you're sitting there looking at a ddb> prompt!


Re: pg->offset and pg->flags

2011-04-04 Thread Andrew Doran
On Mon, Apr 04, 2011 at 09:08:51AM +1000, matthew green wrote:
> 
> > On Sun, Apr 03, 2011 at 07:35:04PM +1000, matthew green wrote:
> > > 
> > > > Ignoring the free page allocator which abuses pg->offset, is there any
> > > > reason we cannot fold pg->flags into pg->offset?  The lower PAGE_SHIFT 
> > > > bits
> > > > of pg->offset are not used.
> > > 
> > > is this about making vm_page smaller?  if so, and it works, i guess that
> > > seems fine, but how many bits do you want to use?  ie, what is the
> > > smallest PAGE_SIZE we will support?
> > > 
> > > if not that, why?
> > 
> > And is the memory saved enough to be significant compared to the
> > number of masking operations needed to get pg->offset.
> 
> actually, it won't really help unless we rearrange a lot more in
> vm_page{}.  right now the flags member is 1 uint16_t in a series
> of 4 in a row, so removing one is unlikely to help, as it will
> just introduce padding on most platforms.

I'm considering the mechanics of introducing an additional field to give CPU
affinity to pages.  This would be used to hash out uvm_pageqlock and the
inactive/active lists by CPU.  Very much like we do for callouts, they have
weak CPU affinity and so there's almost no lock/cache contention on the
callout state.

Since we have a bunch of sub-word sized fields in vm_page the placement of
new fields is tricky, as they need to be locked the same as adjacent fields
in the same 32-bit cell.  Merging pg->flags into pg->offset is attractive
because the same cell locking where pg->flags is would work for the purpose
I'm thinking of, and it would use no more space in vm_page.
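
For illustration, the folding would amount to something like this (invented
macro names; assumes the flag bits fit in PAGE_SHIFT bits on the smallest
supported PAGE_SIZE):

/* pg->offset is always page aligned, so the low bits are free. */
#define	PG_OFFSET(pg)	((pg)->offset & ~(voff_t)(PAGE_SIZE - 1))
#define	PG_FLAGS(pg)	((int)((pg)->offset & (PAGE_SIZE - 1)))

#define	PG_SET_FLAGS(pg, f)						\
	((pg)->offset = PG_OFFSET(pg) | (voff_t)(f))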


pg->offset and pg->flags

2011-04-02 Thread Andrew Doran
Ignoring the free page allocator which abuses pg->offset, is there any
reason we cannot fold pg->flags into pg->offset?  The lower PAGE_SHIFT bits
of pg->offset are not used.


Crazy idea #1: uvm_fpageqlock

2011-04-02 Thread Andrew Doran
Ok, so relating to the freelist I have an idea for uvm_fpageqlock.  This
would be a mid term solution not taking direct account of NUMA and so forth.

Looking at all of the accounting data we have that decides how the system
behaves such as uvmexp.free and so on (currently protected by
uvm_fpageqlock), we don't really need locked access to these when reading
because we continually check and re-check those values.  So the system
will eventually sort itself out even if we get a bad picture of things from
time to time.  All that really matters is that we maintain the values
consistently, using atomics or locks.  So uvm_fpageqlock's not needed there.

uvm_fpageqlock also protects one set of data that is not directly related to
free memory and that's the pagedaemon wakeup and pageout state.  We could
put in a new low traffic mutex there, say uvm_pageout_lock.  (Incidentally
it looks like updates to uvmexp.pdpending might be racy, just noting it here
so I remember.)

So that leaves only the page allocator needing uvm_fpageqlock.

Currently the page allocator maintains per-CPU and global lists of free
pages.  Pages reside on both lists.  We prefer to hand out pages from the
per-CPU list: on machines with physically indexed caches, it's likely that
we'll have lines from those pages in cache on the local CPU, which is
beneficial when it comes time to fill the pages.  All lists are protected by
uvm_fpageqlock.

What I propose is to maintain the global list of pages pretty much as is,
but to split off the per-CPU lists so that they would have their own locks. 
With the exception of uvm_pglistalloc() they would only be accessed by the
local CPU, effectively functioning as a local cache of free pages.

When allocating, we'd try the local list first and then try the global list
if no pages are available.  When freeing, we'd always put back to the local
list.  When allocating from and freeing back to this local list of free
pages we would not touch any global state, even uvmexp.free.  The idlezero
code would only consider the local list of pages.
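
In pseudo-C the allocation path would look something like this (structure
and field names are made up; uvm_pagealloc_global() stands in for the
existing global-freelist path):

struct uvm_cpu_cache {				/* one per CPU */
	kmutex_t	ucc_lock;
	struct pglist	ucc_pages;		/* local cache of free pages */
	u_int		ucc_count;
};

struct vm_page *
uvm_pagealloc_pcpu(void)
{
	struct uvm_cpu_cache *ucc = curcpu()->ci_data.cpu_uvm_cache;
	struct vm_page *pg;

	/* Fast path: local list, local lock, no global state touched. */
	mutex_enter(&ucc->ucc_lock);
	if ((pg = TAILQ_FIRST(&ucc->ucc_pages)) != NULL) {
		TAILQ_REMOVE(&ucc->ucc_pages, pg, pageq.queue);
		ucc->ucc_count--;
	}
	mutex_exit(&ucc->ucc_lock);
	if (pg != NULL)
		return pg;

	/* Slow path: fall back to the global list under uvm_fpageqlock. */
	return uvm_pagealloc_global();
}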

At some point we'd need to redistribute those cached pages back to the
global list of free pages.  This would be a fairly neat and tidy operation
as all we'd need to do is go through the color buckets, chop the list of
pages out and splice it into the head of the global list, then do some
accounting updates (e.g. uvmexp.free).

I'm thinking this redistribution should happen fairly regularly so perhaps
we could change the xcall thread on each CPU to awaken once per second.
Change cv_wait() in xc_thread() into a cv_timedwait(), and have it hand back
cached pages if (a) not done recently or (b) the system is struggling.

The pagedaemon would get code to directly trigger the redistribution when
under pressure but I am thinking that some sort of rate limiting would be
needed.

Thoughts?


Re: lockstat from pathological builds Re: high sys time, very very slow builds on new 24-core system

2011-04-02 Thread Andrew Doran
On Sat, Apr 02, 2011 at 01:03:13AM -0400, Kevin Bowling wrote:

> Is anyone working on lockless implementations within NetBSD?
> 24-core systems are becoming very common, even in two socket
> pizzabox rack servers.
> 
> RCU is one of the foundations of the Linux kernel but is patent
> encumbered with a GPL exception by IBM.  There are may be valid
> alternatives using something like epoch observation.  Check out this
> BSD licensed project
> http://concurrencykit.org/doc/safeMemoryReclamation.html for a
> rundown and good implementations.
> 
> I'm not sure how realistic widespread implementation would be as a
> GSoC project but I'd be willing to investigate some limited scope if
> there is a suitable mentor available.

We have code for interesting & low overhead synchronization methods like RCU
but it hasn't gone in yet; I'll let others say more if they want to.  That
said, there's already a ton of lockless stuff going on in the kernel, just
no song and dance made about it.

If you look at the lockstat output from Thor's system there are only a
couple of problem areas, the most significant being management of a central
set of data structures in the VM system.  There is no silver bullet like RCU
for that, it needs more involved work.  On the positive side, the rest of
the system scales up just fine on this workload.


Re: lockstat from pathological builds Re: high sys time, very very slow builds on new 24-core system

2011-04-01 Thread Andrew Doran
On Fri, Apr 01, 2011 at 05:47:59PM +, Andrew Doran wrote:

> The global freelist is another area of concern.

Thinking about some first steps.  We have a bunch of global counters and
statistics stored in struct uvmexp.  For example: pga_zerohit, cpuhit,
colorhit etc.

While they might seem like small change, I think these should become per-CPU
as a first step because they cause all kinds of cacheline bouncing and will
get in the way of whatever sort of scheme we go with to make the allocator
scalable.

So my suggestion is that where currently we have something like:

atomic_inc_uint(&uvmexp.pga_colorhit);

.. it would change to something like:

KPREEMPT_DISABLE(l);
l->l_cpu->cpu_data.cpu_uvmexp.pga_colorhit++;
KPREEMPT_ENABLE(l);

Anyplace we need to use these values we'd call a function to sum them
across all CPUs, maybe at splhigh() so the sum is "reasonably" quick and
thereby "reasonably" accurate.  I don't foresee a performance problem on
the reader side since it's rare that we read these values.
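
The reader side would then just be a sum over the CPUs, something like
(sketch; field names as in the snippet above):

uint64_t
uvmexp_sum_colorhit(void)
{
	CPU_INFO_ITERATOR cii;
	struct cpu_info *ci;
	uint64_t sum = 0;
	int s;

	/* splhigh() so the walk is "reasonably" quick and consistent. */
	s = splhigh();
	for (CPU_INFO_FOREACH(cii, ci))
		sum += ci->cpu_data.cpu_uvmexp.pga_colorhit;
	splx(s);
	return sum;
}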

Then there are filepages, execpages, anonpages, zeropages, etc.  The same
could naively be done for these as a short term step.  In the places these
are used, we don't need a completely accurate view.  In the longer term we
may want these to become per-NUMA node or whatever.

I don't have a good suggestion for uvmexp.free since we check that one quite
regularly.  I think that one is quite closely tied to whatever scheme is
chosen for the allocator.

Thoughts?


Re: lockstat from pathological builds Re: high sys time, very very slow builds on new 24-core system

2011-04-01 Thread Andrew Doran
On Fri, Apr 01, 2011 at 01:05:02PM -0400, Thor Lancelot Simon wrote:

> On Thu, Mar 31, 2011 at 04:32:12PM -0400, Thor Lancelot Simon wrote:
> > On Thu, Mar 24, 2011 at 12:04:02PM +0000, Andrew Doran wrote:
> > > 
> > > Try lockstat as suggested to see if something pathological is going on.
> > > In addition to showing lock contention problems it can often highlight
> > > a code path being hit too frequently for some reason.
> > 
> > I have attached build.sh and lockstat output from 24-way and 8-way builds
> > of the amd64 GENERIC kernel on this system.  Also dmesg and mount -v output
> > so you can see what's mounted how, etc.
> 
> Mindaugas spent quite a while looking at this with me.  It appears to
> be largely due to contention on the page queue lock.

I did a few back-of-a-cigarette-packet calculations.

On the 8 CPU system we are burning about 1% of the total CPU time on lock
contention, so not all that significant.

On the 24 CPU system we're burning 27% of the available CPU time.  So
roughly 6.5 cores worth of your machine are doing nothing but spinning on
locks.  So lock contention in the VM system is a problem but it doesn't tell
the full story.

> Some of the contention is caused by the idle loop page zeroing code.
> Mindaugas has dramatically improved its performance on large systems
> but I would still recommend turning it off on systems with more than
> 12 CPUs.
> 
> I would actually like to know for what workload, on what hardware, the
> idle loop zeroer is actually beneficial.  I'm sure there are some
> but clearly they are not workloads I've tried testing recently.

It was a pretty big win on the 8 core system I have.  Clearly it needs
more tuning!

> His improvement shaves about 20 seconds off the build time.  Now we
> scale nearly linearly up to about 12 build jobs and get a very slight
> performance improvement from jobs 12-16.  With vm.idlezero=0 set the
> slight performance improvement is a little less slight.
> 
> I believe he's currently working on hashing the page queue lock to see
> if that gives us some performance back for CPUs 12 through 24.

We'll likely see some contention pushed back to uvm_fpageqlock (the freelist
lock), and will still get murdered on cache line contention from the
active/inactive and free lists.

In the mid term I'm thinking we need to investigate an alternative strategy
such as hashing out the complete paging state, or move away from continual
activation/deactivation of pages, or doing things lazily so we don't need to
have so much massive churn of global state as happens now during build.sh.
The global freelist is another area of concern.


Re: high sys time, very very slow builds on new 24-core system

2011-03-25 Thread Andrew Doran
On Fri, Mar 25, 2011 at 12:16:32AM -0700, Erik Fair wrote:

> It will be interesting to see if this class of problem shows up again when
> NetBSD is supported on the UltraSPARC "Niagara" CPUs.

Agree.  Although as it stands 24 cores/threads would be a problem for
sparc64 since it has a single mutex around the pmap.  One of the goals of
the uvmplock branch is to make it very easy to address that problem.

On x86 the majority of lock contention is likely to come from uvm_pageqlock,
v_interlock on libc.so and ld.so, and directory vnode locks for / and /usr
and /usr/lib and so on.  Things like the scheduler and memory allocators and
so on should not pose any problem, unless there is a bug/regression.


Re: high sys time, very very slow builds on new 24-core system

2011-03-24 Thread Andrew Doran
On Wed, Mar 23, 2011 at 05:24:12PM -0400, Thor Lancelot Simon wrote:

> I have a new machine with 24 2Ghz Opteron cores.  It has 32GB of RAM.
> 
> Building with sources on a fast SSD ("preloaded" into the page
> cache before the build using tar > /dev/null) and obj, dest, and rel
> dirs on tmpfs, system builds are extraordinarily slow.  The system
> takes about 20 minutes to build a netbsd-5 based source tree
> with -j24 -- about the same amount of time as an older 8-core Intel
> based system running netbsd-5 requires with -j8.
> 
> All cores spend well over 50% time in 'sys', even when all or almost
> all are running cc1 processes.  The kernel is amd64 -current GENERIC
> from about 1 week ago -- no DIAGNOSTIC, DEBUG, KMEMSTATS, LOCKDEBUG,
> etc.
> 
> Does anyone have any idea what might be wrong here?

Try lockstat as suggested to see if something pathological is going on.  In
addition to showing lock contention problems it can often highlight a code
path being hit too frequently for some reason.

Have a look at the event counters and see if anything obviously ugly is
going on.

Also look for evidence of context switch storms. lockstat will show those
for lock objects, but not for condition variables or homegrown stuff
based off sleep queues.

What sort of TLB shootdown rate does systat vmstat show?  We have changes
forthcoming that should help with this during build.sh.  Don't get me wrong,
the situation isn't particularly bad now on x86 for shootdowns but the
forthcoming changes improve it quite a bit.

I have a suspicion that SSD could cause issues for us because the buffer
cache and other parts of the I/O system are not designed with near
instantaneous request->response in mind, but that likely isn't at play
here.  Do you have logging turned on for the SSD?

We have some algorithms in the scheduler and mutual exclusion code that
aren't designed for large numbers of cores, but I think they should be OK
with 24 CPUs.  (While I'm rambling about this I think the SPINLOCK_BACKOFF
stuff should have some sort of randomness to it, perhaps based off curlwp and
cpu_counter(), otherwise things could proceed in lockstep, although again
probably not the issue here).


Re: Status and future of 3rd party ABI compatibility layer

2011-03-23 Thread Andrew Doran
Hi Joerg.

On Wed, Mar 23, 2011 at 04:06:07PM +0100, Joerg Sonnenberger wrote:

> Hi all,
> the following is what I consider as summary of the thread.
> 
> Use cases mentioned for or considered simple enough:
> - Linux
> - FreeBSD
> - OSF1
> - Ultrix
> - SVR/SVR4
> 
> Incomplete, broken and of questionable use:
> - Darwin and Mach
> - IRIX
> 
> Outdated and not spoken up for:
> - Windows/PECOFF
>
> As such, I want to propose moving the last two categories into the Attic
> for further dusting.

Moving those to the Attic is definitely the right thing to do, so I fully
support your proposal.


Re: inheriting the lwp private area

2011-03-23 Thread Andrew Doran
On Wed, Mar 23, 2011 at 02:19:53PM +, Mindaugas Rasiukevicius wrote:

> Andrew Doran  wrote:
> > > 
> > > You mean l->l_private should be copied on fork?  Absolutely.
> > 
> > For native fork I agree, but that's probably not the right thing to do if
> > FORK_SHARESIGS or whatever it's called (i.e. clone()ing off a Linux-style
> > thread).
> > 
> > (Some day I hope we can emulate clone with _lwp_create() and undo a lot of
> > the complication to the process data structures.)
> 
> Chuck already did that for COMPAT_LINUX. :)  Do you mean something else?

Then its major use case is gone and emulation may not be required in the long
term.  I'm fully aware of IRIX emulation and compatibility headaches.


Re: inheriting the lwp private area

2011-03-23 Thread Andrew Doran
On Wed, Mar 23, 2011 at 12:53:24AM -0700, Matt Thomas wrote:
> 
> On Mar 22, 2011, at 8:51 PM, Antti Kantee wrote:
> 
> > Hi,
> > 
> > On Julio's request I was looking at the now-failing tests and resulting
> > hanging processes.  Basically at least on i386 the failures are a
> > result of fork() + setcontext() (plus some voodoo) after which calling
> > pthread_mutex_lock() with signals masked causes a busyloop due to
> > pthread__self() causing a segv but the signal never getting delivered
> > (that in itself seems like stinky business, but not the motivating factor
> > of this mail).
> > 
> > Per 4am intuition it seems pretty obvious that a child should inherit
> > the forking lwp's private data.  Does that make sense to everyone else?
> > At least patching fork1() to do so fixes the hanging processes and
> > failing tests and a quick roll around qemu hasn't caused any problems.
> > 
> > If it doesn't make sense, I'll disable the pthread bits (per commit
> > guideline clause 5) until support is fully fixed so that others don't have
> > to suffer from untested half-baked commits causing juju hangs and crashes.
> 
> You mean l->l_private should be copied on fork?  Absolutely.

For native fork I agree, but that's probably not the right thing to do if
FORK_SHARESIGS or whatever it's called (i.e. clone()ing off a Linux-style
thread).

(Some day I hope we can emulate clone with _lwp_create() and undo a lot of
the complication to the process data structures.)


Re: Decomposing vfs_subr.c

2011-03-23 Thread Andrew Doran
On Wed, Mar 23, 2011 at 04:07:29AM +, Mindaugas Rasiukevicius wrote:

> I would like to split-off parts of vfs_subr.c into vfs_node.c * and
> vfs_mount.c modules.  Decomposing should hopefully bring some better
> abstraction, as well as make it easier to work with VFS subsystem.
> 
> Any objections?

Sounds good to me.  Some comments:

- I think it should be vfs_vnode.c?

- If there are mutexes, global data structures etc. consider making them
  static.  File systems shouldn't be playing with this stuff in so far as
  possible.

- Much code in vfs_syscalls.c belongs in a vfs_mount.c.

- Random thought: some day it would be nice to dump all the syscall code
  into its own directory.


Re: Bad threading performance

2011-03-08 Thread Andrew Doran
Your program makes life somewhat difficult for the scheduler
because your jobs are short lived.  It sees a number of threads in a 
tight loop of communication together, running short tasks and so will try to
run these on the same CPU.

When you signal and unlock each task queue, it's likely to preempt the
controlling thread and run the short job on the same CPU, so you get a sort
of serializing effect because the controlling thread is off the CPU and
can't signal the next worker thread to run.

As each worker thread consumes CPU time it will be penalised by having its
priority lowered below that of the controlling thread.  This cancels the
preemption effect, causing things to be queued up on the controlling CPU
thereby allowing remote CPUs to steal tasks from it and get a slice of
the action.

For some reason this distribution as a result of lowered priority is not
happening, perhaps because of this test in sched_takecpu():

385 /* If CPU of this thread is idling - run there */
386 if (ci_rq->r_count == 0) {

Can you try changing this to:

/*
 * If CPU of this thread is idling - run there.
 * XXXAD Test for PL_PPWAIT and special case for vfork()??
 */
if (ci->ci_data.cpu_onproc == ci->ci_data.cpu_idlelwp)

On Tue, Mar 08, 2011 at 05:19:35PM +, Sad Clouds wrote:

> On Mon, 7 Mar 2011 18:39:39 +
> Sad Clouds  wrote:
> 
> > Below are test results. Any ideas why concurrency on NetBSD is so bad
> > compared to Solaris? It seems as if on NetBSD threads are stuck on a
> > wait list for much longer.
> 
> OK this is a follow up on the issue I've raised previously. I have
> attached a test program, which has good concurrency on Linux and
> Solaris, however on NetBSD concurrency is very poor, i.e. CPU sits idle
> a lot of the time.
> 
> I'd be interested to hear from NetBSD developers as to what causes
> this, or at least if this is a known issue...

> /*
> Build with:
> gcc -O1 test_ptask.c -lpthread
> */
> 
> #include 
> #include 
> #include 
> #include 
> 
> #define NTHREADS 4
> 
> /* Function call structure */
> struct fcall
> {
>   /* Function pointers */
>   void *(*fptr)(void *arg);
>   /* Pointer to function arguments structure */
>   void *arg;
> };
> 
> /* Function arguments */
> struct farg
> {
>   uint32_t n1;
>   uint32_t n2;
>   /* Pad to 64-byte boundary */
>   uint8_t pad[64 - (sizeof(uint32_t) * 2)];
> };
> 
> /* Parallel task */
> struct ptask
> {
>   pthread_mutex_t mutex;
>   pthread_cond_t cond;
>   uint8_t pad0[64];
> 
>   /* Array of function call parameters */
>   struct fcall fcalls[NTHREADS];
>   uint8_t pad1[64];
> 
>   /* Array of function arguments */
>   struct farg args[NTHREADS];
> 
>   uint32_t task_run_cnt; /* Counter of tasks to run */
> };
> 
> /* Thread instance */
> struct thread
> {
>   pthread_mutex_t mutex;
>   pthread_cond_t cond;
> 
>   struct fcall *fcall_ptr;
>   struct ptask *ptask_ptr;
> 
>   uint8_t pad[64];
> };
> 
> /* Thread pool */
> struct tpool
> {
>   struct thread threads_array[NTHREADS];
> };
> 
> /* Thread function passed to pthread_create() */
> void *thread_func(void *arg)
> {
>   struct thread *tptr = (struct thread *)arg;
>   struct fcall *fcall;
>   struct ptask *ptask;
> 
>   while (1)
>   {
>   if (pthread_mutex_lock(&(tptr->mutex)) != 0)
>   abort();
> 
>   /* Sleep on a condition variable */
>   while (tptr->fcall_ptr == NULL)
>   {
>   if (pthread_cond_wait(&(tptr->cond), &(tptr->mutex)) != 
> 0)
>   abort();
>   }
> 
>   /* Copy pointers to local variables */
>   fcall = tptr->fcall_ptr;
>   ptask = tptr->ptask_ptr;
> 
>   /* Reset to null values */
>   tptr->fcall_ptr = NULL;
>   tptr->ptask_ptr = NULL;
> 
>   if (pthread_mutex_unlock(&(tptr->mutex)) != 0)
>   abort();
> 
>   /* Run current task */
>   fcall->fptr(fcall->arg);
> 
>   if (pthread_mutex_lock(&(ptask->mutex)) != 0)
>   abort();
> 
>   /* If this is last task, signal to waiting main thread */
>   if (--(ptask->task_run_cnt) == 0)
>   {
>   if (pthread_cond_signal(&(ptask->cond)) != 0)
>   abort();
>   }
> 
>   if (pthread_mutex_unlock(&(ptask->mutex)) != 0)
>   abort();
>   } /* while (1) */
> }
> 
> void *test_func(void *arg)
> {
>   struct farg *farg = (struct farg *)arg;
>   int i;
> 
>   for (i = 0; i < 100; i++)
>   {
>   farg->n1++;
>   farg->n2++;
>   }
>   return NULL;
> }
> 
> static struct tpool tpool;
> 
> int main(

Re: Fwd: Status and future of 3rd party ABI compatibility layer

2011-03-01 Thread Andrew Doran
With modules now basically working we should either retire or move
some of these items to pkgsrc so that the interested parties maintain them.
An awful lot of the compat stuff is now very compartmentalised, with not
much more work to do.

On Mon, Feb 28, 2011 at 11:13:36AM +0200, haad wrote:

> > Darwin (no GUI, doesn't to have been updated in the last 5 years)
> > IRIX

These two are strange and very broken, i.e. internally they are in
very bad shape.  I vote to delete.  The version control history will still
be there.  Can't see strong use cases for either.

> > FreeBSD (does it even handle FreeBSD 4?)
> > Ultrix
+ IBCS2
+ SunOS

These ones are fairly simple.  I see a point to FreeBSD because we have one
tool that it supports (we should have picked the Linux version instead
but whatever).

> > Linux

Obvious.

> > OSF1
> > Solaris
+SVR4

The emulation is very incomplete.  A nice mental exercise for sure but 
in the current form I can't see a good use case.



Re: Fwd: Status and future of 3rd party ABI compatibility layer

2011-03-01 Thread Andrew Doran
On Tue, Mar 01, 2011 at 12:07:19PM +0200, Antti Kantee wrote:
> On Tue Mar 01 2011 at 09:55:38 +0000, Andrew Doran wrote:
> > On Mon, Feb 28, 2011 at 11:25:07AM -0500, Thor Lancelot Simon wrote:
> > 
> > > On Mon, Feb 28, 2011 at 11:13:36AM +0200, haad wrote:
> > > > 
> > > > With solaris.kmod we are compatible with solaris kernel, (we should
> > > > be able to load solaris kernel modules).
> > > 
> > > Have you actually tried this?  I am pretty sure it would not work.
> > > 
> > > It appears to me that solaris.kmod includes shims that provide some
> > > Solaris kernel interfaces at the *source* level in NetBSD, which
> > > certainly makes it easier to port kernel code from Solaris but does
> > > not (as far as I can tell) give us binary compatibility.
> > 
> > Adam may have meant source level compat, it definitely does provide some
> > level of that. Of course no binary compat as you say.
> 
> If Solaris has a "module-compatible" kernel ABI it's most likely possible
> to be binary compatible considering we're source-compatible already
> (cf. rump ABI compatibility with the kernel).  Of course it doesn't
> happen accidentally and there's some amount of work involved.  But if
> someone finds a use case for it, why not?

Are you volunteering to maintain that, then?  Good idea, yeah why not.  :-)



Re: Fwd: Status and future of 3rd party ABI compatibility layer

2011-03-01 Thread Andrew Doran
On Mon, Feb 28, 2011 at 11:25:07AM -0500, Thor Lancelot Simon wrote:

> On Mon, Feb 28, 2011 at 11:13:36AM +0200, haad wrote:
> > 
> > With solaris.kmod we are compatible with solaris kernel, (we should
> > be able to load solaris kernel modules).
> 
> Have you actually tried this?  I am pretty sure it would not work.
> 
> It appears to me that solaris.kmod includes shims that provide some
> Solaris kernel interfaces at the *source* level in NetBSD, which
> certainly makes it easier to port kernel code from Solaris but does
> not (as far as I can tell) give us binary compatibility.

Adam may have meant source level compat, it definitely does provide some
level of that. Of course no binary compat as you say.




Re: Per-CPU Unit (PCU) interface

2011-01-27 Thread Andrew Doran
On Tue, Jan 25, 2011 at 10:10:14AM +0100, Martin Husemann wrote:
> On Mon, Jan 24, 2011 at 03:27:58PM +, Mindaugas Rasiukevicius wrote:
> > While looking at the bugs on still work-in-progress mips64 FPU code on
> > Matt's branch, it occurred to me that we could abstract SMP handling
> > complexity into MI interface.  Basically, all architectures are using
> > similar logic for FPU handling on SMP, but each have own variations,
> > confusions, and therefore each fall into the bugs.  Hence, PCU:
> 
> I can't make up my mind if this is a complication or proper abstraction.
> 
> Assuming it is only used for lazy FPU saving and an arch does not have
> other PCU needs, it overall does not save a lot of work. On the other
> hand it does not allow MD optimiziations (obvious example are the fpu
> handling IPIs on sparc64 where we do not bother to create a full C runtime
> environment in the IPI handler).

I can't speak for Mindaugas but I suspect that one of the reasons for
centralising is that this sort of stuff is difficult to get right.

I don't buy into the microoptimization aspect so much because simply by
choosing to do an IPI we're already sucking rocks.  As long as the IPI
doesn't go mucking around with much more state than it would as an
assembly stub, then in relative terms I don't see too much loss.
So I understand the concern and share it but I have trouble writing off
the MI idea on that alone.

> > - There can be multiple PCUs, so this can be re-used not only for FPU,
> > but any similar MD functionality, e.g. PowerPC AltiVec.
> 
> Are there other examples of this?
> 
> > - Once there is MI IPI support, it is ~trivial to convert the code to
> > use them by: 1) splsoftclock() -> splhigh() 2) replacing xc_unicast()
> > calls with cpu_send_ipi() and moving them *before* splx().
> 
> I can not parse this paragraph - and what "MI IPI support" are you talking
> about? How does it differ form xcall(9)?
> 
> Martin


Re: Question about spin-lock/interrupt/ipl interaction.

2011-01-27 Thread Andrew Doran
On Wed, Jan 26, 2011 at 11:18:12PM -0800, Matt Thomas wrote:

> Spin locks have the semantics that IPL doesn't get lowered until the last 
> spin lock is unlocked.  For example, taking a IPL_VM spinlock will make sure 
> IPL is at least IPL_VM while the lock is locked.  If then a IPL_SCHED lock is 
> locked, and the current ipl is IPL_VM, ipl is raised to IPL_SCHED.  If the 
> IPL_SCHED lock is unlocked, the ipl stays at IPL_SCHED until the last spin 
> lock is unlocked.  This is done to allow spin-locks to be unlocked in any 
> order.
> 
> Now consider this scenario: a kernel takes a IPL_VM spin lock so IPL is now 
> IPL_VM.  Now an interrupt happens from a IPL_SCHED source so that during 
> execution of the handler a IPL_SCHED spin lock is locked and unlocked leaving 
> IPL_SCHED the current ipl.  In this instance, we know that any spin-locks
> locked in the interrupt handler must be unlocked before the handler returns.
> 
> Now my question is whether the interrupt dispatch code should process 
> interrupt sources higher than IPL_SCHED (honoring the defined spin-lock/ipl 
> semantics) or can it drop IPL back to its original level allowing more 
> IPL_SCHED interrupt sources to be serviced (or not honoring the defined 
> spin-lock/ipl semantics)?
> 
> Why the latter is a violation, it seems wrong to me for a interrupt to return 
> to the code that was interrupted at a different ipl from which it was 
> invoked.  
> 
> Should the spin-locks semantics be modified for interrupt usage?

One of the design assumptions for spin mutexes was that on EOI we restore
the IPL back to what it was at the time the interrupt occurred *, regardless
of what has been done in the ISR.   So it soothes that particular bit of
nastiness with nested locks+IPLs because we're pretty much guaranteed to be
dropping back to low(ish) IPL regularly.  Does that answer the question or 
did you have something else in mind?  I think you might but I'm not sure.

* excluding of course syscalls + traps, must naturally be @ IPL_NONE on
  return there or something has gone all wacky.



Re: Per-CPU Unit (PCU) interface

2011-01-25 Thread Andrew Doran
On Mon, Jan 24, 2011 at 03:27:58PM +, Mindaugas Rasiukevicius wrote:
> Hello,
> 
> While looking at the bugs on still work-in-progress mips64 FPU code on
> Matt's branch, it occurred to me that we could abstract SMP handling
> complexity into MI interface.  Basically, all architectures are using
> similar logic for FPU handling on SMP, but each have own variations,
> confusions, and therefore each fall into the bugs.  Hence, PCU:
> 
> http://www.netbsd.org/~rmind/subr_pcu.c
> http://www.netbsd.org/~rmind/pcu.h
> http://www.netbsd.org/~rmind/pcu.diff

I have three problems with this:

- It uses our xcall interface which provides a mailbox to pass messages 
  between CPUs.  The intent seems to be to change the above code to use
  IPIs in the future. An MI IPI interface would likely not have mailbox 
  capability, as it would probably render the interface+implementation
  too specific and too inefficient for general use.  I can envision some
  other framework built on top of a raw IPI interface.

- When re-implementing the x86 lazy FPU switching for 5.0 I dropped
  the notion of discarding state.  There's very little benefit, but
  many drawbacks with potential race conditions.  It's easier just to
  save whatever state is on the CPU.  You lose an optimization for a 
  corner condition and gain code that's easier to maintain and prove.

- I would like to see immediate users of this interface, ideally at
  the time of commit.  In its current form it seems not an acceptable
  replacement for the x86 lazy FPU code because it uses xcalls and those
  are somewhat heavyweight.  Efficiency is important in this case because
  it's the raison d'etre for lazy FPU context switch.

> Few notes:
> 
> - MD code provides pcu_state_save() and pcu_state_load() primitives
> via pcu_ops_t.
> 
> - PCU interface provides pcu_load(), pcu_save_lwp(), pcu_discard()
> and other routines, which carry the synchronisation logic.  See the
> "Concurrency notes" description in the top.
> 
> - There can be multiple PCUs, so this can be re-used not only for FPU,
> but any similar MD functionality, e.g. PowerPC AltiVec.
> 
> - Instead of IPIs, PCU is currently using XC_HIGHPRI cross-calls and
> therefore is running at IPL_SOFTCLOCK.
> 
> - Once there is MI IPI support, it is ~trivial to convert the code to
> use them by: 1) splsoftclock() -> splhigh() 2) replacing xc_unicast()
> calls with cpu_send_ipi() and moving them *before* splx().
> 
> API is not yet set in stone, but I think the essential logic is there.
> Matt is trying this code on mips64.  I will try to adapt x86 to it.
> 
> Please review.
> 
> -- 
> Mindaugas


Re: kernel memory allocators

2011-01-20 Thread Andrew Doran
On Thu, Jan 20, 2011 at 04:30:49PM +1100, matthew green wrote:
> 
> > > Benefits I've thought about:
> > > - The kmem pools use pool_caches therefor scalability will be much
> > > better as the old malloc has a single lock for all access, the pools
> > > have one each with a per cpu cache layer.
> > > - The old malloc only returns oversized allocations back to the kmem
> > > layer but nothing that is in it's bucket, pools can be drained...
> > > - Removing one redundant interface in the kernel-api (in the long
> > > term, when dropping the malloc wrapper)
> > 
> > thanks for working on this.
> > while i'm all for removing malloc(9), i tend to think it should be done
> > by changing the users to use either kmem_alloc or pool_cache, instead of
> > making kmem_alloc interrupt-safe.  i don't think there's much demand for
> > interrupt-safe variable-sized memory allocations.
> 
> while we're discussions allocators, i miss the extra accounting
> information that malloc(9) provides that kmem(9) does not.  pool(9)
> has the info in a slightly different form, but all kmem(9) users
> are dumped into the same pile.
> 
> this part of malloc -> kmem conversions i feel is a step backwards,
> though i'm not going to claim it is more important than the general
> benefits of kmem(9) vs. malloc(9).
> 
> 
> are there any plans to modify kmem to support this sort of usage,
> where a client names itself somehow?

No plans, but in the past I have hacked lockstat to track who is
allocating memory, how much, how often and so on.  A similar mechanism
with a bit of memory (pardon the pun) and basic support in kmem would 
be easy enough to implement and could give an "over time" report to
show who holds what.  So my idea is a lockstat-style report showing the
exact call sites where things were allocated, how much is still held etc.

dtrace provides us with this capability but I haven't thought it through.



Re: kernel memory allocators

2011-01-20 Thread Andrew Doran
On Thu, Jan 20, 2011 at 02:11:59AM +, YAMAMOTO Takashi wrote:
> hi,
> 
> > Benefits I've thought about:
> > - The kmem pools use pool_caches therefor scalability will be much
> > better as the old malloc has a single lock for all access, the pools
> > have one each with a per cpu cache layer.
> > - The old malloc only returns oversized allocations back to the kmem
> > layer but nothing that is in it's bucket, pools can be drained...
> > - Removing one redundant interface in the kernel-api (in the long
> > term, when dropping the malloc wrapper)
> 
> thanks for working on this.
> while i'm all for removing malloc(9), i tend to think it should be done
> by changing the users to use either kmem_alloc or pool_cache, instead of
> making kmem_alloc interrupt-safe.  i don't think there's much demand for
> interrupt-safe variable-sized memory allocations.

Agreed.  Variable-sized allocations from interrupt context are indicative
of a design flaw.  Either the processing should happen at another level,
or allocations should be from a pool (whether that is pool_cache or some
private mechanism).



Re: Heads up: moving some uvmexp stat to being per-cpu

2010-12-15 Thread Andrew Doran
On Tue, Dec 14, 2010 at 08:49:14PM -0800, Matt Thomas wrote:
> I have a fairly large but mostly simple patch which changes the stats
> collected in uvmexp for faults, intrs, softs, syscalls, and traps from
> 32 bit to 64 bits and puts them in cpu_data (in cpu_info).  This makes
> them more accurate and a little cheaper to update on 64-bit systems.
> 
> I've had to modify some assembly from architectures I'm not really
> proficient at (m68k, sparc, sparc64, sh3, i386, amd64) but I think I did
> so correctly.  I would appreciate some reviews of those changes.  I've
> verified that all kernels still build (except those that didn't before
> the changes).
> 
> This change also remove the include of  from files where it
> isn't needed.
> 
> The diffs are at http://www.netbsd.org/uvmexp-diff.txt 
> 
> Have fun reading them!

I did!

I have three concerns around atomicity and concurrency.

- curcpu()->ci_data.cpu_ we can be preempted in the middle of this type
  of dereference.  So sometimes we could end up with the counter going
  backwards.  Admittedly there are many places where we are counting,
  where interrupts are off.

- 64 bit increment isn't atomic on a 32-bit or RISCy platform.  So the
  same sort of problem as above.

- In uvm where we tally the counters, we can read them mid-update so
  maybe we'll only see half the updated value.  The reason I had the 
  periodic update in the clock interrupt was to avoid this problem.

So all that said, maybe we don't care about the above problems but I wanted
to point them out.  As long as there is not a "really bad" case then
perhaps it's OK.
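
For what it's worth, the usual idiom for the first point is cheap enough
(sketch; the field name is illustrative), though it does nothing for the
64-bit tearing on 32-bit platforms:

	kpreempt_disable();
	curcpu()->ci_data.cpu_nsyscall++;	/* illustrative counter */
	kpreempt_enable();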




Re: mutexes, locks and so on...

2010-11-23 Thread Andrew Doran
On Tue, Nov 23, 2010 at 06:49:47PM +0200, Antti Kantee wrote:
> On Fri Nov 19 2010 at 00:11:12 +0000, Andrew Doran wrote:
> > You can release it with either call, mutex_spin_ is just a way to avoid
> > additional atomic operations.  The ususal case is adaptive mutex, but
> > stuff like the dispatcher/scheduler makes use of spin mutexes exclusively
> > and the fast path versions were invented for that. (Because you can measure
> > the effect with benchmarks :-).
> 
> Speaking of which, something I (and a few others) have been thinking
> about is to have constantly running benchmarks (akin to constantly
> running tests).  That way we can have a rough idea which way performance
> and resource consumption is going and if there are any sharp regressions.
> Are your old benchmarking programs still available somewhere?

I think this is a great idea.  Here are the easy ones:

- libmicro.  I had adjustments to this to use -O0 and something to do with
  the timekeeping, I can't remember exactly.

- hackbench

- build.sh on a static, unchanging source tree.

- mysql sysbench.

I had lots of one-offs and more complicated setups but the above are really
easy to set up and repeat.


