Re: Unexpected out of memory kills when running parallel find instances over millions of files

2023-10-19 Thread Mateusz Guzik
On Thu, Oct 19, 2023 at 10:49:37AM +, Michael van Elst wrote:
> mjgu...@gmail.com (Mateusz Guzik) writes:
> 
> >Running 20 find(1) instances, where each has a "private" tree with
> >million of files runs into trouble with the kernel killing them (and
> >others):
> >[   785.194378] UVM: pid 1998.1998 (find), uid 0 killed: out of swap
> 
> 
> >This should not be happening -- there is tons of reusable RAM as
> >virtually all of the vnodes getting here are immediately recyclable.
> 
> While vnodes would be recyclable, they hardly get recycled unless
> a filesystem object is deleted or the filesystem is unmounted.
> 

They get recycled all the time by the vdrain thread if numvnodes goes above
desiredvnodes, as it does in this test.

Part of the problem is that with 20 processes walking the filesystem it
gets outpaced (20:1) and has no means of stopping it, while the memory
allocator just keeps handing objects out.

> >Specs are 24 cores, 24G of RAM and ufs2 with noatime. swap is *not* 
> >configured.
> 
> Without swap, the kernel also has no chance to evict process pages
> to grow the vnode cache further.
> 

It should not be trying to grow the vnode cache. If anything it should
stop it from blowing out of proportion, and it definitely should not kill
processes in the presence of swaths of immediately freeable vnodes.

As noted above there is code to try to do it, but it is not sufficient.

I tested several systems (Linux, all the BSDs and even Illumos) and only
NetBSD fails to complete the run. That is to say even OpenBSD chugs
along no problem.

This is definitely a reliability problem in the kernel.

Traditionally vnode allocation would recycle something from the "free"
list if need be. Perhaps restoring this behavior is the easiest way out
for the time being.
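
For illustration, a rough sketch of that shape (hedged: try_recycle_one()
and vnode_cache are made-up names, numvnodes/desiredvnodes are the existing
counters; this is not the actual NetBSD code):

struct vnode *
vnalloc_sketch(void)
{
    struct vnode *vp;

    if (numvnodes > desiredvnodes) {
        /* Over the target: reuse an immediately recyclable vnode. */
        vp = try_recycle_one();        /* hypothetical helper */
        if (vp != NULL)
            return vp;
    }
    /* Under the target (or nothing recyclable): allocate fresh. */
    vp = pool_cache_get(vnode_cache, PR_WAITOK);
    atomic_inc_uint(&numvnodes);
    return vp;
}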



Unexpected out of memory kills when running parallel find instances over millions of files

2023-10-19 Thread Mateusz Guzik
Running 20 find(1) instances, where each has a "private" tree with a
million files, runs into trouble with the kernel killing them (and
others):
[   785.194378] UVM: pid 1998.1998 (find), uid 0 killed: out of swap
[   785.194378] UVM: pid 2010.2010 (find), uid 0 killed: out of swap
[   785.224675] UVM: pid 1771.1771 (top), uid 0 killed: out of swap
[   785.285291] UVM: pid 1960.1960 (zsh), uid 0 killed: out of swap
[   785.376172] UVM: pid 2013.2013 (find), uid 0 killed: out of swap
[   785.416572] UVM: pid 1760.1760 (find), uid 0 killed: out of swap
[   785.416572] UVM: pid 1683.1683 (tmux), uid 0 killed: out of swap

This should not be happening -- there is tons of reusable RAM as
virtually all of the vnodes getting here are immediately recyclable.

$elsewhere I got a report of a workload with hundreds of millions of
files which get walked in parallel -- a number high enough that it
does not fit in RAM on the boxes which run it. Out of curiosity I figured
I'd check how others are doing on this front, but the key point is that
this is not a made-up problem.

I'm running NetBSD 10, kernel built from this commit at top of the tree:
Author: andvar 
Date:   Sat Oct 14 08:05:25 2023 +

fix various typos in comments and documentation, mainly in word "between".

Specs are 24 cores, 24G of RAM and ufs2 with noatime. swap is *not* configured.

The test generates 20 separate trees, each with 1000 directories of 1000
files (20 million files in total, plus some directories).

Repro instructions are here:
https://people.freebsd.org/~mjg/.junk/fstree.tgz

Note that parallel creation of these trees is dog slow; it took over
40 minutes for me.

I had to pass extra flags to newfs for the target fs to even fit
this inode count:
newfs -n 22000 -O 2 /dev/wd1e

So the expected outcome is that this finishes (extra points for a
reasonable time) instead of userspace getting killed.

I don't know what kind of diagnostic info would be best here, but
given the repro steps above I don't think I need to dig for anything.
Have fun. :)

-- 
Mateusz Guzik 



Re: tmpfs vs VV_LOCKSWORK

2020-08-20 Thread Mateusz Guzik
Now that I sent the e-mail I had a look at access().

The benchmark at hand is custom code added to will-it-scale, pasted at
the bottom.

On 8/20/20, Mateusz Guzik  wrote:
> tmpfs does *NOT* set the flag. While I can't be arsed to verify
> semantics, I suspect it is more than eligible.
>
> This came up as a slowdown in an "access" microbenchmark with target
> file on tmpfs, stemming from:
>
> commit 5645873e2da675f475c8b8d61260c8ffe9d411ab
> Author: riastradh 
> Date:   Tue Aug 4 03:00:10 2020 +
>
> Fix bogus fast path in vput.
>
> I wrote a trivial patch for it (see below).
> Results are (ops/s):
> stock: 1721941
> aug 4: 1592518
> patch: 1664363
>
> The difference stems from calling VOP_ISLOCKED.
>

I meant not all performance is recovered because of the newly added
VOP_ISLOCKED call.
However, looking closer, access() passes a share-locked vnode so the
Aug 4 change fixed a serious bug for tmpfs handling.

As for recovering perf, I once more suggest VOP_NEED_INACTIVE which is
the easiest way to avoid the read -> write transition in the common
case. Preferably the need to perform this processing would be
expressed with a bit in the usecount or similar, but that's rather
invasive.

> The patch is a PoC, I don't claim much correctness. Basically
> someone(tm) will have to take care of the problem.
> diff --git a/sys/fs/tmpfs/tmpfs_subr.c b/sys/fs/tmpfs/tmpfs_subr.c
> index 92a0eedf4443..394320fe71ce 100644
> --- a/sys/fs/tmpfs/tmpfs_subr.c
> +++ b/sys/fs/tmpfs/tmpfs_subr.c
> @@ -136,6 +136,8 @@ tmpfs_init_vnode(struct vnode *vp, tmpfs_node_t *node)
> /* FALLTHROUGH */
> case VLNK:
> case VREG:
> +   vp->v_vflag |= VV_LOCKSWORK;
> +   /* FALLTHROUGH */
> case VSOCK:
> vp->v_op = tmpfs_vnodeop_p;
> break;
>
> --
> Mateusz Guzik 
>

#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

char *testcase_description = "Separate file access(2)";

void testcase(unsigned long long *iterations, unsigned long nr)
{
    char tmpfile[] = "/tmp/willitscale.XXXXXX";
    int fd = mkstemp(tmpfile);

    assert(fd >= 0);
    close(fd);

    while (1) {
        int error = access(tmpfile, R_OK);
        assert(error == 0);

        (*iterations)++;
    }

    unlink(tmpfile);
}

-- 
Mateusz Guzik 



tmpfs vs VV_LOCKSWORK

2020-08-20 Thread Mateusz Guzik
tmpfs does *NOT* set the flag. While I can't be arsed to verify
semantics, I suspect it is more than eligible.

This came up as a slowdown in an "access" microbenchmark with target
file on tmpfs, stemming from:

commit 5645873e2da675f475c8b8d61260c8ffe9d411ab
Author: riastradh 
Date:   Tue Aug 4 03:00:10 2020 +

Fix bogus fast path in vput.

I wrote a trivial patch for it (see below).
Results are (ops/s):
stock: 1721941
aug 4: 1592518
patch: 1664363

The difference stems from calling VOP_ISLOCKED.

The patch is a PoC, I don't claim much correctness. Basically
someone(tm) will have to take care of the problem.
diff --git a/sys/fs/tmpfs/tmpfs_subr.c b/sys/fs/tmpfs/tmpfs_subr.c
index 92a0eedf4443..394320fe71ce 100644
--- a/sys/fs/tmpfs/tmpfs_subr.c
+++ b/sys/fs/tmpfs/tmpfs_subr.c
@@ -136,6 +136,8 @@ tmpfs_init_vnode(struct vnode *vp, tmpfs_node_t *node)
/* FALLTHROUGH */
case VLNK:
case VREG:
+   vp->v_vflag |= VV_LOCKSWORK;
+   /* FALLTHROUGH */
case VSOCK:
vp->v_op = tmpfs_vnodeop_p;
break;

-- 
Mateusz Guzik 



Re: notes from running will-it-scale

2020-07-19 Thread Mateusz Guzik
There is:

Built-in Function: void * __builtin_assume_aligned (const void *exp,
size_t align, ...)

but I don't know how it works out in practice. Note that both the direct
map base and the address should be aligned.
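
For illustration, a sketch of how that could be applied to the
pmap_zero_page() case discussed below (the function name is made up;
pmap_direct_base and PAGE_SIZE are the existing symbols):

void
pmap_zero_page_sketch(paddr_t pa)
{
    void *va = (void *)(pmap_direct_base + pa);

    /*
     * Promise the compiler the VA is page-aligned so the inlined
     * memset can drop the head/tail alignment fixups.
     */
    va = __builtin_assume_aligned(va, PAGE_SIZE);
    memset(va, 0, PAGE_SIZE);
}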

However, an argument for dedicated routines is that on CPUs with ERMS
it is faster to rep stosb/movsb than stosq/movsq (and the other way
around without said extension). If rep-using memset is inlined you are
stuck with stosq by default.

On 7/19/20, Jaromír Doleček  wrote:
> Very interesting, particularly the outrageous assembly for
> pmap_{zero,copy}_page().
>
> Is there some way to tell the compiler that the address is already
> 4096-aligned and avoid the conditionals? Failing that, we could just
> adopt the FreeBSD assembly for this.
>
> Does anyone see a problem with introducing a vfs.timestamp_precision
> to avoid the rtdscp?
>
> Jaromir
>
> Le dim. 19 juil. 2020 à 13:21, Mateusz Guzik  a écrit :
>>
>> Hello,
>>
>> I recently took an opportunity to run cross-systems microbenchmarks
>> with will-it-scale and included NetBSD (amd64).
>>
>> https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
>> [no linux in this doc, I will probably create a new one soon(tm)]
>>
>> The system has a lot of problems in the vfs layer, vm is a mixed bag
>> with multithreaded cases lagging behind and some singlethreaded being
>> pretty good (and at least one winning against the other systems).
>>
>> Notes:
>> - rtdscp is very expensive in vms, yet the kernel unconditionally
>> performs by calling vfs_timestamp. Both FreeBSD and DragonflyBSD have
>> a knob to change the resolution (and consequently avoid the
>> instruction), I think you should introduce it and default to less
>> accuracy on vms. Sample results:
>> stock pipe1: 2413901
>> patched pipe1: 3147312
>> stock vfsmix: 13889
>> patched vfsmix: 73477
>> - sched_yield is apparently a nop when the binary is not linked with
>> pthread. this does not match other systems and is probably a bug.
>> - pmap_zero_page/pmap_copy_page compile to atrocious code which keeps
>> checking for alignment. The compiler does not know what values can be
>> assigned to pmap_direct_base and improvises.
>>
>>    0x805200c3 <+0>:   add    0xf93b46(%rip),%rdi        # 0x814b3c10
>>    0x805200ca <+7>:   mov    $0x1000,%edx
>>    0x805200cf <+12>:  xor    %eax,%eax
>>    0x805200d1 <+14>:  test   $0x1,%dil
>>    0x805200d5 <+18>:  jne    0x805200ff
>>    0x805200d7 <+20>:  test   $0x2,%dil
>>    0x805200db <+24>:  jne    0x8052010b
>>    0x805200dd <+26>:  test   $0x4,%dil
>>    0x805200e1 <+30>:  jne    0x80520116
>>    0x805200e3 <+32>:  mov    %edx,%ecx
>>    0x805200e5 <+34>:  shr    $0x3,%ecx
>>    0x805200e8 <+37>:  rep stos %rax,%es:(%rdi)
>>    0x805200eb <+40>:  test   $0x4,%dl
>>    0x805200ee <+43>:  je     0x805200f1
>>    0x805200f0 <+45>:  stos   %eax,%es:(%rdi)
>>    0x805200f1 <+46>:  test   $0x2,%dl
>>    0x805200f4 <+49>:  je     0x805200f8
>>    0x805200f6 <+51>:  stos   %ax,%es:(%rdi)
>>    0x805200f8 <+53>:  and    $0x1,%edx
>>    0x805200fb <+56>:  je     0x805200fe
>>    0x805200fd <+58>:  stos   %al,%es:(%rdi)
>>    0x805200fe <+59>:  retq
>>    0x805200ff <+60>:  stos   %al,%es:(%rdi)
>>    0x80520100 <+61>:  mov    $0xfff,%edx
>>    0x80520105 <+66>:  test   $0x2,%dil
>>    0x80520109 <+70>:  je     0x805200dd
>>    0x8052010b <+72>:  stos   %ax,%es:(%rdi)
>>    0x8052010d <+74>:  sub    $0x2,%edx
>>    0x80520110 <+77>:  test   $0x4,%dil
>>    0x80520114 <+81>:  je     0x805200e3
>>    0x80520116 <+83>:  stos   %eax,%es:(%rdi)
>>    0x80520117 <+84>:  sub    $0x4,%edx
>>    0xffff8052011a <+87>:  jmp    0x805200e3
>>
>> The thing to do in my opinion is to just provide dedicated asm funcs.
>> This is the equivalent on FreeBSD (ifunc'ed):
>>
>> ENTRY(pagezero_std)
>>     PUSH_FRAME_POINTER
>>     movl    $PAGE_SIZE/8,%ecx
>>     xorl    %eax,%eax
>>     rep
>>     stosq
>>     POP_FRAME_POINTER
>>     ret
>> END(pagezero_std)
>>
>> ENTRY(pagezero_erms)
>>     PUSH_FRAME_POINTER
>>     movl    $PAGE_SIZE,%ecx
>>     xorl    %eax,%eax
>>     rep
>>     stosb
>>     POP_FRAME_POINTER
>>     ret
>> END(pagezero_erms)
>>
>> --
>> Mateusz Guzik 
>>
>


-- 
Mateusz Guzik 



notes from running will-it-scale

2020-07-19 Thread Mateusz Guzik
Hello,

I recently took an opportunity to run cross-systems microbenchmarks
with will-it-scale and included NetBSD (amd64).

https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
[no linux in this doc, I will probably create a new one soon(tm)]

The system has a lot of problems in the vfs layer, vm is a mixed bag
with multithreaded cases lagging behind and some singlethreaded being
pretty good (and at least one winning against the other systems).

Notes:
- rdtscp is very expensive in VMs, yet the kernel unconditionally
executes it by calling vfs_timestamp. Both FreeBSD and DragonflyBSD have
a knob to change the resolution (and consequently avoid the
instruction); I think you should introduce one and default to less
accuracy on VMs. Sample results:
stock pipe1: 2413901
patched pipe1: 3147312
stock vfsmix: 13889
patched vfsmix: 73477
- sched_yield is apparently a nop when the binary is not linked with
pthread. This does not match other systems and is probably a bug.
- pmap_zero_page/pmap_copy_page compile to atrocious code which keeps
checking for alignment. The compiler does not know what values can be
assigned to pmap_direct_base and improvises.

   0x805200c3 <+0>:   add    0xf93b46(%rip),%rdi        # 0x814b3c10
   0x805200ca <+7>:   mov    $0x1000,%edx
   0x805200cf <+12>:  xor    %eax,%eax
   0x805200d1 <+14>:  test   $0x1,%dil
   0x805200d5 <+18>:  jne    0x805200ff
   0x805200d7 <+20>:  test   $0x2,%dil
   0x805200db <+24>:  jne    0x8052010b
   0x805200dd <+26>:  test   $0x4,%dil
   0x805200e1 <+30>:  jne    0x80520116
   0x805200e3 <+32>:  mov    %edx,%ecx
   0x805200e5 <+34>:  shr    $0x3,%ecx
   0x805200e8 <+37>:  rep stos %rax,%es:(%rdi)
   0x805200eb <+40>:  test   $0x4,%dl
   0x805200ee <+43>:  je     0x805200f1
   0x805200f0 <+45>:  stos   %eax,%es:(%rdi)
   0x805200f1 <+46>:  test   $0x2,%dl
   0x805200f4 <+49>:  je     0x805200f8
   0x805200f6 <+51>:  stos   %ax,%es:(%rdi)
   0x805200f8 <+53>:  and    $0x1,%edx
   0x805200fb <+56>:  je     0x805200fe
   0x805200fd <+58>:  stos   %al,%es:(%rdi)
   0x805200fe <+59>:  retq
   0x805200ff <+60>:  stos   %al,%es:(%rdi)
   0x80520100 <+61>:  mov    $0xfff,%edx
   0x80520105 <+66>:  test   $0x2,%dil
   0x80520109 <+70>:  je     0x805200dd
   0x8052010b <+72>:  stos   %ax,%es:(%rdi)
   0x8052010d <+74>:  sub    $0x2,%edx
   0x80520110 <+77>:  test   $0x4,%dil
   0x80520114 <+81>:  je     0x805200e3
   0x80520116 <+83>:  stos   %eax,%es:(%rdi)
   0x80520117 <+84>:  sub    $0x4,%edx
   0x8052011a <+87>:  jmp    0x805200e3

The thing to do in my opinion is to just provide dedicated asm funcs.
This is the equivalent on FreeBSD (ifunc'ed):

ENTRY(pagezero_std)
    PUSH_FRAME_POINTER
    movl    $PAGE_SIZE/8,%ecx
    xorl    %eax,%eax
    rep
    stosq
    POP_FRAME_POINTER
    ret
END(pagezero_std)

ENTRY(pagezero_erms)
    PUSH_FRAME_POINTER
    movl    $PAGE_SIZE,%ecx
    xorl    %eax,%eax
    rep
    stosb
    POP_FRAME_POINTER
    ret
END(pagezero_erms)

-- 
Mateusz Guzik 



Re: Please review: lookup changes

2020-03-11 Thread Mateusz Guzik
>> locks and not hashing itself. Note that at the time vnode interlock and
>> vm object lock were the same thing and in this workload the lock is
>> under a lot of contention. Should the lock owner get preempted, anyone
>> doing a lookup to the affected vnode (e.g., libc) will be holding
>> the relevant per-cpu lock and will block on a turnstile. Whoever ends
>> up running on the affected cpu is likely to do a lookup on their own,
>> but the relevant per-cpu lock is taken and they go off cpu. The same thing
>> happening on more than one cpu at a time could easily result in a
>> cascading failure, which I strongly suspect is precisely what happened.
>>
>> That is, the win does not stem from rb trees but finer-grained locking
>> which does not block other threads which look up something else on
>> the same cpu.
>
> Not on NetBSD.  Kernel preemption is possible and allowed (mainly for real
> time applications), but happens infrequently during normal operation.
> There
> are a number of pieces of code that take advantage of that fact and are
> "optimistically per-CPU", and they work very well as preemption is rarely
> observed.  Further the blocking case on v_interlock in cache_lookup() is
> rare.  That's not to say it doesn't happen, it does, but I don't think it
> enough to explain the performance differences.
>

I noted suspected preemption was occurring because of contention on the vm
side. It should only take one to start the cascade.

>> As mentioned earlier I think rb trees instead of a hash are pessimal
>> here.
>>
>> First, a little step back. The lookup starts with securing vnodes from
>> cwdinfo. This represents a de facto global serialisation point (times two
>> since the work has to be reverted later). In FreeBSD I implemented an
>> equivalent with copy-on-write semantics. Easy take on it is that I take
>> an rwlock-equivalent and then grab a reference on the found struct.
>> This provides me with an implicit reference on root and current working
>> directory vnodes. If the struct is unshared on fork, aforementioned
>> serialisation point becomes localized to the process.
>
> Interesting.  Sounds somewhat like both NetBSD and FreeBSD do for process
> credentials.
>

I was thinking about doing precisely that but I found it iffy to have
permanently stored references per-thread. With the proposal they get
"gained" around the actual lookup; otherwise this is very similar to
what mount_checkdirs is dealing with right now.

>> In my tests even with lookups which share most path components, the
>> last one tends to be different. Using a hash means this typically
>> results in grabbing different locks for that case and consequently
>> fewer cache-line ping pongs.
>
> My experience has been different.  What I've observed is that shared hash
> tables usually generate huge cache pressure unless they're small and rarely
> updated.  If the hash were small, matching the access pattern (e.g.
> per-dir) then I think it would have the opportunity to make maximum use of
> the cache.  That could be a great win and certainly better than rbtree.
>

Well in my tests this is all heavily dominated by SMP-effects, which I
expect to be exacerbated by just one lock.

A side note: I had a look at your vput. The pre-read + VOP_UNLOCK +
actual loop to drop the ref definitely slows things down, if only a little
bit, as this can force a shared cacheline transition from under someone
cmpxchg-ing.

That said, can you generate a flamegraph from a fully patched kernel?
Curious where the time is spent now, my bet is spinning on vnode locks.

-- 
Mateusz Guzik 



Re: Please review: lookup changes

2020-03-11 Thread Mateusz Guzik
Something ate a huge chunk of my e-mail, resending

On 3/8/20, Andrew Doran  wrote:
> On Sat, Mar 07, 2020 at 12:14:05AM +0100, Mateusz Guzik wrote:
>> I believe it is always legal to just upgrade the lock in getattr if
>> necessary. Since itimes updates are typically not needed, this would mean
>> common case would get away with shared locking. This would run into
>> issues
>> in vput, for which see below.
>
> Hmm.  With NetBSD upgrading a lock in the middle of a VOP could have
> consequences for fstrans.  I'm not sure.
>

I don't know how you do upgrades specifically. "Mere" upgrades can of
course fail in the presence of additional readers. In FreeBSD there is an
option to release the lock entirely and relock from scratch, in particular
meaning you can undo the fstrans enter (if necessary). While I don't know the
specifics of your VFS layer, the v_usecount which is kept should postpone
freeing of the vnode. Should it fall victim to a forced unmount or some
other state change you can just return an error, pretending getattr
never got this far.

>> Assuming this is fixed fstat could also shared lock. I think this is a
>> simple change to make and most certainly worth doing.
>>
>> > vfs_vnode.c:
>> >
>> >   Namecache related changes, and the change to not block new vnode
>> >   references across VOP_INACTIVE() mentioned on tech-kern.  I don't
>> >   think this blocking is needed at least for well behaved FSes like
>> >   ffs & tmpfs.
>> >
>>
>> I think this is very hairy and the entire thing is avoidable.
>>
>> I believe I mentioned some other time that FreeBSD got VOP_NEED_INACTIVE.
>> You should be able to trivially implement an equivalent which would be
>> called if the vnode is only shared locked. Supporting an unlocked case
>> comes with extra woes which may not be worth it at this stage.
>>
>> Since inactive processing is almost never needed this would avoid a lot
>> of lock contention.
>>
>> With all this in place you would also not run into the problematic state
>> in the fast path in the common case.
>>
>> This is a cheap cop out, the real solution would encode need for inactive
>> in usecount as a flag but that's a lot of work.
>
> Two seperate issues - I agree on avoiding VOP_INACTIVE(), but the issue
> here
> is peculiar to NetBSD I think.  NetBSD sets the equivalent of the 4.4BSD
> VXLOCK flag to block vget() across VOP_INACTIVE().  Should only be needed
> for revoke/clean.  At one point we did have VXLOCK embedded in v_usecount
> which allowed for vget() with one atomic op, but that was undone at some
> point, I don't know why.
>

If you have a dedicated spot which is guaranteed to be executed for
exclusion against vget, then things become easier in terms of the effort
required. A dedicated vget variant for the namecache can xadd into the
counter and, if the bit is set, backpedal from it. New arrivals should be
eliminated by purging nc entries beforehand. Everyone else can
cmpxchg-loop with the interlock like right now.
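
Rough sketch of what I mean (the VC_BLOCKED bit and the helper are made up
and do not match the real v_usecount encoding):

#define VC_BLOCKED    0x80000000u    /* hypothetical "no new refs" bit */

static bool
vcache_tryvget_fast(struct vnode *vp)
{
    u_int old;

    /* Optimistically gain a reference with a single atomic add. */
    old = atomic_inc_uint_nv(&vp->v_usecount) - 1;
    if (__predict_true((old & VC_BLOCKED) == 0))
        return true;

    /* Reclaim in progress: backpedal and fall back to the interlock. */
    atomic_dec_uint(&vp->v_usecount);
    return false;
}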

>> > Performance:
>> >
>> >   During a kernel build on my test system it yields about a 15%
>> > drop
>> >   in system time:
>> >
>> >   HEAD79.16s real  1602.97s user   419.54s system
>> >   changes 77.26s real  1564.38s user   368.41s system
>> >
>> >   The rbtree lookup can be assumed to be slightly slower than a
>> > hash
>> >   table lookup and microbenchmarking shows this sometimes in the
>> >   single CPU case, but as part of namei() it ends up being
>> > ameliorated
>> >   somewhat by the reduction in locking costs, the absence of hash
>> >   bucket collisions with rbtrees, and better L2/L3 cache use on hot
>> >   directories.  I don't have numbers but I know joerg@ saw
>> > significant
>> >   speedups on pbulk builds with an earlier version of this code
>> > with
>> >   little changed from HEAD other than hash -> rbtree.  Anyway
>> > rbtree
>> >   isn't set in stone as the data structure, it's just what we have
>> > in
>> >   tree that's most suitable right now.  It can be changed.
>> >
>>
>> I strongly suspect the win observed with pbulk had to with per-cpu
>> locks and not hashing itself. Note that at the time vnode interlock and
>> vm object lock were the same thing and in this workload the lock is
>> under a lot of contention. Should the lock owner get preempted, anyone
>> doing a lookup to the affected vnode (e.g., libc) will be holding
>> the relevant per-cp

Re: Adaptive RW locks

2020-03-11 Thread Mateusz Guzik
 On 3/10/20, Andrew Doran  wrote:
> Following up again..  It turns out the deadlocks I saw with this were down
> to a problem with kernel_lock that has since been solved.
>
> I made a couple more tweaks to it: up the max number of tracked holds to 8
> because vmobjlock became a rwlock, and because aiodoned is gone now be more
> selective giving up and going to sleep in the softint case (walk down the
> list of pinned LWPs on curcpu and see if any of them holds the lock that's
> wanted, if not held on curcpu then some other CPU has it, and it's OK to
> spin).

I think the proposed patch avoidably adds overhead to the most common use
case to cover a corner case. Sensibly bounded spinning on readers can be
done in a speculative fashion, i.e. without actively tracking if there are
threads off cpu.

FreeBSD is not completely there yet, but the general idea is pretty trivial:
once a writer shows up it sets a bit (RW_SPINNER or so), then it can
observe whether the read owners are draining, with some pre-defined upper
bound on how long it is willing to wait. There is no spinning if there are
already blocked would-be writers. Thus this might see some waste initially,
but it is bounded.
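
A sketch of the shape I have in mind (RW_SPINNER handling, the helpers and
the spin bound are made up, not the real NetBSD rwlock encoding):

static bool
rw_writer_spin_sketch(krwlock_t *rw)
{
    uintptr_t owner;
    u_int spins = 0;

    rw_set_spinner(rw);    /* hypothetical: announce a waiting writer */

    for (;;) {
        owner = rw->rw_owner;
        if (rw_read_count(owner) == 0)    /* hypothetical decode */
            return true;     /* readers drained; go try the write CAS */
        if ((owner & RW_HAS_WAITERS) != 0)
            return false;    /* writers already blocked; just sleep */
        if (++spins > 1000)
            return false;    /* bounded wait exceeded; go to sleep */
        SPINLOCK_BACKOFF_HOOK;
    }
}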

Side note is that waiting for threads to drain can use the reader count
to speculate how long to wait before re-reading the lock instead of doing
it every pause.

>  void
> +rw_enter(krwlock_t *rw, const krw_t op)
> +{
> [..]
> +
> + /*
> +  * Read the lock owner field.  If the need-to-wait
> +  * indicator is clear, then try to acquire the lock.
> +  */
> + owner = rw->rw_owner;
> + if (__predict_true((owner & need_wait) == 0)) {
> + next = rw_cas(rw, owner, (owner + incr) & mask);
> + if (__predict_true(next == owner)) {
> + /* Got it! */
> + RW_LOCKED(rw, op);
> + KPREEMPT_ENABLE(l);
> + RW_MEMBAR_ENTER();
> + return;
> + }
> + }

Both this and the unlock path lose on throughput if the lock is heavily
used by readers.

The original amd64 fast path has a tight loop which immediately retries
cmpxchg, while the patched variant aborts immediately.

Assume the lock is only used by readers. Patched rw_enter will always
conclude it can rw_cas on it. However, this can very easily fail if
racing against another rw_enter or rw_exit. In this case this will fall
to the slow path which starts with a re-read and then performs a tight
loop.

Whether a strict tight loop is a good idea is highly situation-dependent; I
believe it is the best thing to do on amd64 by default. Other architectures
may want some form of pause-equivalent. Nonetheless, the fallback to a
re-read is definitely pessimal and will keep happening under load.
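
To make the point concrete, something along these lines, reusing the names
from the quoted patch and assuming rw_cas returns the value it found:

static inline bool
rw_try_enter_sketch(krwlock_t *rw, uintptr_t incr, uintptr_t mask,
    uintptr_t need_wait)
{
    uintptr_t owner, next;

    for (owner = rw->rw_owner;; owner = next) {
        if (__predict_false((owner & need_wait) != 0))
            return false;    /* contended: take the slow path */
        next = rw_cas(rw, owner, (owner + incr) & mask);
        if (__predict_true(next == owner))
            return true;     /* got it; no separate re-read */
        /* CAS failed: retry with the value it returned. */
    }
}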

I don't know if it is worth the effort at this stage, but this will eventually
have to morph into an xadd-based lock. As noted elsewhere, cmpxchg-based
loops degrade under load to a significantly higher degree (of course
everything is already pretty bad if you have a heavily bounced lock
in the first place). Note that the conversion would also take care of
the pause problem.

I was looking at doing it in FreeBSD but there are some constraints which
pose important design questions and to my understanding your code shares
most of them (if not all).

Blindly adding into the lock word means there is no space to store the
entire thread pointer (since it could partially overwrite the address),
but the owner needs to be findable in some manner in order to lend it
priority.

Assuming 32-bit architectures can stick to the current code there are
some options without growing the lock.

One variant was "compressing" the pointer -- if all threads are guaranteed
to come from a pre-defined VA range, the lock can store only the relevant
portion of the address, which can be pretty small. Alternatively there
can be a lockless hash of threads keyed by thread id. Or this can
use a variant of storing the lock which you implemented here, except by
the writer. It can denote in the lock that it went off cpu with the relevant
turnstile lock held. Then everyone waiting can look it up based on the
lock.

Of course one can also "give up" and add another word for the pointer,
then this mostly degrades to regular read-write lock problems.

-- 
Mateusz Guzik 



Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-23 Thread Mateusz Guzik
On 1/22/20, Andrew Doran  wrote:
> On Tue, Jan 21, 2020 at 05:30:55PM -0500, Thor Lancelot Simon wrote:
>
>> I think there's probably a generic bucket-locked hashtable implementation
>> in one of the code contributions Coyote Point made long ago.  That said,
>> we're talking about a screenful of code, so it's not like it'd be hard to
>> rewrite, either.  Do we want such a thing?  Or do we want to press on
>> towards more modern "that didn't work, try again" approaches like
>> pserialize or the COW scheme you describe?
>
> The way I see it, the original idea of the namecache was to have a thing
> that allowed you avoid doing expensive hardware and file system stuff a
> second time, because you'd already done it once before.  In this context
> MP locking also looks a lot like expensive hardware and file system stuff.
>
> Making the namecache's index per-directory matches the pattern in which the
> data is accessed and hey presto all of a sudden there are no big
> concurrency
> problems, any locking or whatever becomes trivial and follows the pattern
> of
> access closely, things like pserialize/RCU start to look feasible, and one
> can get back to the important business of avoiding expensive work instead
> of
> generating it!  That's my thinking on it anyway.
>
> The rbtrees are just what we have at hand.  I think a dynamically sized
> hash, like a Robin Hood hash looks very interesting though, because we can
> afford to resize occasionally on entry/removal, and modern pipelined CPUs
> are great at that kind of thing.  I'm sure other people have better ideas
> than I do here, though.  I just want to try out the directory approach, and
> if it works well, get the basis in place.
>

But a hash which has collisions taken care of with linked lists is the
easiest data structure to scale in this regard. Additions and removals
in the face of concurrent lookups are easy to handle (and in fact the
current code already does it to some extent).

With no safe memory reclamation in use this still provides a win as
contention is spread out across different buckets.
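
For example, a bucket-locked chained hash along these lines (a sketch only;
struct nc_bucket is made up, the namecache fields are used loosely and
lifetime handling of the returned entry is elided):

struct nc_bucket {
    kmutex_t               b_lock;    /* per-bucket, not per-cpu */
    LIST_HEAD(, namecache) b_list;
};

static struct namecache *
nc_bucket_lookup(struct nc_bucket *bp, struct vnode *dvp,
    const char *name, size_t namelen)
{
    struct namecache *ncp;

    /* Only lookups hashing to this bucket contend on b_lock. */
    mutex_enter(&bp->b_lock);
    LIST_FOREACH(ncp, &bp->b_list, nc_hash) {
        if (ncp->nc_dvp == dvp && ncp->nc_nlen == namelen &&
            memcmp(ncp->nc_name, name, namelen) == 0)
            break;
    }
    mutex_exit(&bp->b_lock);
    return ncp;
}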

-- 
Mateusz Guzik 



Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-21 Thread Mateusz Guzik
On 1/21/20, Andrew Doran  wrote:
> On Thu, Jan 16, 2020 at 04:51:44AM +0100, Mateusz Guzik wrote:
>>
>> I'm assuming the goal for the foreseeable future is to achieve path
>> lookup
>> with shared vnode locks.
>
> Good guess.  There is a prototype of LK_SHARED lookup on the ad-namecache
> branch, along with lookup using namecache only that takes as few vnode
> locks
> and refcounts as possible.  In the easiest case only the root vnode and the
> leaf vnode are touched (although whether the root vnode ever actually needs
> a reference is another question).  There are many "non-easy" cases though.
>

As in, you can get away without ever accessing it if there are no long
enough ".." chains and no absolute paths (along with symlinks)?

My solution to the problem (not implemented yet) is to introduce a
copy-on-write structure which holds the ref to the rootvnode. Then lwps
can just use it, at worst refing/unrefing that struct but never the
vnode. This also covers the current working directory.
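
Roughly (a sketch, all names made up; the point is that lookups reference
the small struct, never the vnodes):

struct cwdref {
    u_int        cr_refcnt;     /* drops the vnode refs when it hits 0 */
    struct vnode *cr_rootdir;   /* implicit ref on the root vnode */
    struct vnode *cr_cwd;       /* implicit ref on the current dir */
};

static struct cwdref *
cwdref_get(struct proc *p)
{
    struct cwdref *cr = p->p_cwdref;    /* hypothetical field */

    /* One atomic on a small private struct instead of two vnode refs. */
    atomic_inc_uint(&cr->cr_refcnt);
    return cr;
}

chdir()/chroot() would install a freshly allocated struct instead of
mutating the one lookups may still be holding.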

I have seen the branch. I think the use of rb instead of a hash is
pessimal.

My speculation is that the wins over the current code boil down to the
per-cpu locks being eliminated. Specifically, since v_interlock
is a lock from bufobj and is heavily used in vm, the likelihood of getting
preempted while holding it increases. Say CPU0 takes its own pcpu lock and
proceeds to take v_interlock. Now the thread is taken off CPU to wait. If
the new thread running on CPU0 performs *any* path lookup it will also block,
instead of only blocking if it was looking up the affected vnode. IOW
there is huge potential to stall *all* lookups on a given CPU.

This does not happen with rb trees and would not happen with the hash
table if there was bucket locking instead of per-cpu.

I would stick to hash tables since they are easier to scale (both with
and without locks).

For instance if 2 threads look up "/bin" and "/usr" and there is no
hash collision, they lock *separate* buckets and suffer less in terms
of cacheline bouncing. In comparison with rb trees they will take the
same lock. Of course this similarly does not scale if they are looking
up the same path.

-- 
Mateusz Guzik 



Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-15 Thread Mateusz Guzik
On 1/15/20, Andrew Doran  wrote:
> I don't understand why we do this.  I don't think it's needed at all
> because
> if the file really is deleted and the inode is getting cleaned out, then it
> shouldn't be getting new refs in any of the usual ways and we don't need to
> worry about it.  And in any case the vnode lock prevents anyone from
> messing
> with the inode during VOP_INACTIVE().
>
> I want to change this because it causes nasty problems on MP.  For example
> I
> see threads very regulary thrown out of the namecache by vcache_vtryget(),
> and when they finally get the vnode ref'd and locked they do it by digging
> through the file system metadata (when everything is already in cache).
>
>   http://www.netbsd.org/~ad/inactive.diff
>

I think looking at a bigger picture suggests a different solution, which
will also keep the current blocked <-> loaded transition for consideration
later.

I'm assuming the goal for the foreseeable future is to achieve path lookup
with shared vnode locks.

Since ->v_usecount transitions to 0 happen all the time and VOP_INACTIVE
processing is almost never needed, the current requirement to hold an
exclusive lock just to find out there is nothing to do should be lifted.
Apart from getting away with only a shared lock (or no locks), this would
also naturally eliminate vnode state changes for the common case.

Preferably vfs would provide a method for filesystems to call to signify
they want VOP_INACTIVE to be called for given vnode, then vrelel could
just test a bit. However, this is probably not worth the effort right now.

Said locking was also a problem in FreeBSD. It got temporarily sorted
out with introduction of VOP_NEED_INACTIVE -- by default it returns 1,
but filesystems which provide their own method easily avoid the problem.
I think starting with requiring at least shared lock would do the trick
just fine while providing race-free methods at least for ufs and tmpfs.

Then vrelel could stick to:
diff --git a/sys/kern/vfs_vnode.c b/sys/kern/vfs_vnode.c
index d05d5180f730..45d80dab4a2c 100644
--- a/sys/kern/vfs_vnode.c
+++ b/sys/kern/vfs_vnode.c
@@ -774,7 +774,7 @@ vrelel(vnode_t *vp, int flags, int lktype)
 * If not clean, deactivate the vnode, but preserve
 * our reference across the call to VOP_INACTIVE().
 */
-   if (VSTATE_GET(vp) == VS_RECLAIMED) {
+   if (VSTATE_GET(vp) == VS_RECLAIMED || !VOP_NEED_INACTIVE(vp)) {
VOP_UNLOCK(vp);
} else {
VSTATE_CHANGE(vp, VS_LOADED, VS_BLOCKED);

which even with exclusive locks already avoids the blocked state problem.

Another option is to move some of the current body into smaller helpers
and then let filesystems piece together their own VOP_VRELE.

-- 
Mateusz Guzik 



Re: fcntl(F_GETPATH) support

2019-09-16 Thread Mateusz Guzik
On 9/15/19, Jason Thorpe  wrote:
>
>
>> On Sep 15, 2019, at 11:03 AM, Christos Zoulas 
>> wrote:
>>
>> I think it is quite reliable because all the file descriptors would be
>> recently
>> opened and therefore be in the cache.  One would need to DoS the cache
>> cause eviction. If that turns out to be false, we can make the namecache
>> reliable, or withdraw it.
>
> If we had a way to ask the file system for the / a name+parent in the event
> we don't find an entry in the name cache, then we don't have to make the
> name cache "reliable".

See my other e-mail about turning realpath into a syscall. Then you have
everything to work with and the code should be trivial unless you want
to handle hardlinks. In the most simplistic implementation you do the
lookup and then do the reverse lookup of what you found. Things still
can get evicted in the time window in-between but that's probably
tolerable.
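
For instance something along these lines (a sketch with abbreviated error
handling; the syscall name is made up, vnode_to_path is the helper from the
patch under discussion):

int
sys_realpath_sketch(struct lwp *l, const char *upath, char *uresolved)
{
    struct vnode *vp;
    char *kbuf;
    int error;

    /* Forward lookup of the user-supplied path. */
    error = namei_simple_user(upath, NSM_FOLLOW_TRYEMULROOT, &vp);
    if (error != 0)
        return error;

    /* Reverse lookup via the namecache; can still fail on eviction. */
    kbuf = PNBUF_GET();
    error = vnode_to_path(kbuf, MAXPATHLEN, vp, l, l->l_proc);
    if (error == 0)
        error = copyoutstr(kbuf, uresolved, MAXPATHLEN, NULL);
    PNBUF_PUT(kbuf);
    vrele(vp);
    return error;
}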

>
> FWIW, I don't think making the name cache "reliable" is necessarily a good
> idea -- it could be abused to consume kernel memory by creating lots of
> long-named files / directories.
>

I don't think it's a material change in terms of memory use. You can
already force this scenario by creating files and stating them.
Getting rid of namecache entries is not going to save you much compared
to all the vnodes which need to be dropped/recycled (which will also
remove the nc entries).

> In the event that the name really can't be determined (perhaps it's an
> open-unlinked file?), then I think it's perfectly reasonable to return
> ENOENT for F_GETPATH; callers need to be capable of handling that.
>

A corner case of this sort is probably not accounted for either and not
expected to be run into -- if you remove the files from under the compiler
you are tampering with the build and it is free to fail.

On the other hand name eviction is not a problem you expect to run
into.

I can't easily test right now to be sure, but historically the BSDs have not
been adding newly created files to the namecache and perhaps
NetBSD is not doing it either. In that case F_GETPATH on newly
created files is guaranteed to fail unless someone looked them up
separately. Whether this should be modified is imho a separate
discussion, but it is relevant to the points made earlier about memory
use and DoSability. Imho it should be fine to change the kernel
to always add these.

-- 
Mateusz Guzik 



Re: fcntl(F_GETPATH) support

2019-09-15 Thread Mateusz Guzik
On 9/15/19, Mateusz Guzik  wrote:
> On 9/14/19, Christos Zoulas  wrote:
>>
>> Comments?
>>
>> +error = vnode_to_path(kpath, MAXPATHLEN, fp->f_vnode, l, l->l_proc);
>
> What motivates this change?
>
> I think it is a little problematic in that namecache resolution is not
> guaranteed to work. If the consumer does not check for failure or fallback
> to something else the user is going to experience "once in the blue moon"
> failures.
>
> For example clang likes to get full paths to names. On Linux it can
> trivially do it thanks to dentry cache. On other systems there is a
> compilation check for F_GETPATH. If none of this is present it goes for
> realpath, which is slover but basically works. If F_GETPATH is present,
> it is always used and there is no fallback if it fails. Then you risk
> setting yourself up for a weird compilation failure, which may not even
> repeat itself after you retry.
>
> All in all I don't think this should be added unless namecache becomes
> reliable. The above reasoning is why I did not add it to FreeBSD.
>

Now that I wrote this I suspect a kernel-based realpath could do the
trick here. Will have to take a closer look.

-- 
Mateusz Guzik 



Re: NCHNAMLEN vnode cache limitation removal

2019-09-15 Thread Mateusz Guzik
On 9/14/19, Mouse  wrote:
>> [...], because fexecve is causing rumbles about doing significantly
>> more reverse lookups.
>
> Why is everyone so concerned about finding "the" name of an inode, or
> indeed any name for an inode?  As far as I can tell there has never
> been any promise that any given inode has any names pointing to it, at
> least not unless it's got no other references to it (in which case it
> would be freed if it had no names).
>
> Given the list of smart people I've seen discussing this, I'm
> presumably just missing something, but what?
>

I think always-working name resolution is nicer to users when it comes
to diagnosing issues in userspace. In particular, if someone spawns a new
program and that program performs a lot of ops on a given fd, you can check
what file it is if name resolution works (or you have a facility similar
to Linux's /proc/<pid>/fd). Otherwise you are left with dirty hacks.

There is a very different angle to this though and it has to do with
performance, especially in the face of concurrent access. When doing SMP
you want to avoid writes to shared areas as they induce cache misses
on the next access by other CPUs. If everyone accesses read-only they can
keep doing it without interfering with others. If everyone writes to
it, everyone else (minus 1) stalls waiting for their turn.

The vast majority of all lookups only care about the last path component.
Currently the kernel leapfrogs through all of them, e.g. consider the
lookup of /foo/bar/baz/qux.

Along the way it refs foo, locks foo, finds bar, references bar, locks
bar, unlocks and unrefs foo... repeat that with s/bar/baz/
and s/foo/bar/. Note that's very simplified (e.g. i skipped the
beginning of the lookup and other work like permission checking).
Crossing filesystems is another hurdle. This is only to highlight
the point below.
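
Spelled out as (very simplified) code, the per-component pattern is roughly
the following; dir_lookup_one() is a made-up stand-in for the real
per-filesystem work:

struct vnode *
leapfrog_lookup_sketch(struct vnode *dvp, char **comp, int ncomp)
{
    struct vnode *vp;
    int i;

    vref(dvp);
    vn_lock(dvp, LK_EXCLUSIVE | LK_RETRY);
    for (i = 0; i < ncomp; i++) {
        vp = dir_lookup_one(dvp, comp[i]);      /* hypothetical */
        vref(vp);                               /* write to shared counter */
        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);   /* write to shared lock */
        VOP_UNLOCK(dvp);
        vrele(dvp);                             /* two more shared writes */
        dvp = vp;
    }
    return dvp;    /* referenced and locked */
}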

That's a lot of work single-threaded which can be partially avoided in
the common case, provided all the info is structured for it. But most
importantly that's a lot of atomic ops on shared areas which severely
limit your performance in case of concurrent access (especially on
multi-socket systems). It's a long way to get to a point where you can
do this in a write-free manner and FreeBSD is very early in there.

If everything is cached and you can roll forward in the common case
without refing/locking anything but the last component you win big
time. (Of course you can't "just" do it, the leapfrogging is there for
a reason but it can be worked out with tracking the state and having
safe memory reclamation.)

TL;DR it's a minor quality-of-life improvement for users and a de facto
mandatory part of VFS if it is to scale on big systems.

-- 
Mateusz Guzik 



Re: fcntl(F_GETPATH) support

2019-09-15 Thread Mateusz Guzik
On 9/14/19, Christos Zoulas  wrote:
>
> Comments?
>
> + error = vnode_to_path(kpath, MAXPATHLEN, fp->f_vnode, l, l->l_proc);

What motivates this change?

I think it is a little problematic in that namecache resolution is not
guaranteed to work. If the consumer does not check for failure or fallback
to something else the user is going to experience "once in the blue moon"
failures.

For example clang likes to get full paths to names. On Linux it can
trivially do it thanks to dentry cache. On other systems there is a
compilation check for F_GETPATH. If none of this is present it goes for
realpath, which is slover but basically works. If F_GETPATH is present,
it is always used and there is no fallback if it fails. Then you risk
setting yourself up for a weird compilation failure, which may not even
repeat itself after you retry.

All in all I don't think this should be added unless namecache becomes
reliable. The above reasoning is why I did not add it to FreeBSD.

-- 
Mateusz Guzik 



Re: mutex vs turnstile

2018-02-14 Thread Mateusz Guzik
On Thu, Jan 18, 2018 at 10:38:02AM +, Nick Hudson wrote:
> On 01/09/18 03:30, Mateusz Guzik wrote:
> > Some time ago I wrote about performance problems when doing high -j
> > build.sh and made few remarks about mutex implementation.
> >
> > TL;DR for that one was mostly that cas returns the found value, so
> > explicit re-reads can be avoided. There are also avoidable explicit
> > barriers (yes, I know about the old Opteron errata).
> >
> > I had another look and have a remark definitely worth looking at (and
> > easy to fix). Unfortunately I lost my test jig so can't benchmark.
> >
> > That said, here it is:
> >
> > After it is is concluded that lock owner sleeps:
>
> itym... after it is concluded that the thread should sleep as the lock is
> owned (by another lwp) and the owner is not on cpu.

I don't think this distinction makes a difference here.

>
> >
> >  ts = turnstile_lookup(mtx);
> >  /*
> >   * Once we have the turnstile chain interlock, mark the
> >   * mutex has having waiters.  If that fails, spin again:
> >   * chances are that the mutex has been released.
> >   */
> >  if (!MUTEX_SET_WAITERS(mtx, owner)) {
> >  turnstile_exit(mtx);
> >  owner = mtx->mtx_owner;
> >  continue;
> >  }
> >
> > Note that the lock apart from being free, can be:
> > 1. owned by the same owner, which is now running
> >
> > In this case the bit is set spuriously and triggers slow path
> > unlock.
>
> Recursive locks aren't allowed so I don't follow what you mean here.
>
> 544 if (__predict_false(MUTEX_OWNER(owner) == curthread)) {
> 545 MUTEX_ABORT(mtx, "locking against myself");
> 546
>
> > 2. owned by a different owner, which is NOT running
>
> Is this still a possibility given the above?
>

This is not about recursive locking. I explicitly noted 2 scenarios
which are handled suboptimally:

Lock owner is remembered across turnstile_lookup.

MUTEX_SET_WAITERS does:
rv = MUTEX_CAS(&mtx->mtx_owner, owner, owner | MUTEX_BIT_WAITERS);

If the lock owner is the same and sleeps, it's fine.

If the lock owner is the same but does not sleep, the bit is set for no
reason.

If there is a different owner (or the lock is free), cas fails and
turnstile_exit is called.

If the lock is free or said different owner does not sleep, it's fine.

If said different owner sleeps, turnstile_exit is called for no good
reason and the code circles back.

-- 
Mateusz Guzik 


Re: Race condition between an LWP migration and curlwp_bind

2018-02-14 Thread Mateusz Guzik
On Tue, Feb 13, 2018 at 05:14:38PM +0900, Ryota Ozaki wrote:
>   panic: kernel diagnostic assertion "(psref->psref_cpu == curcpu())"
> failed: file "/(hidden)/sys/kern/subr_psref.c", line 317 passive
> reference transferred from CPU 0 to CPU 3
>
> I first thought that something went wrong in an ioctl handler
> for example curlwp_bindx was called doubly and LP_BOUND was unset
> then the LWP was migrated to another CPU. However, this kind of
> assumptions was denied by KASSERTs in psref_release. So I doubted
> of LP_BOUND and found there is a race condition on LWP migrations.
>
> curlwp_bind sets LP_BOUND to l_pflags of curlwp and that prevents
> curlwp from migrating to another CPU until curlwp_bindx is called.

The bug you found (and I trimmed) looks like the culprit, but there is
an extra problem which probably happens to not manifest itself in terms
of code generation: the bind/unbind inlines lack compiler barriers. See
KPREEMPT_* inlines for comparison. The diff is definitely trivial:

diff --git a/sys/sys/lwp.h b/sys/sys/lwp.h
index 47d162271f9c..f18b76b984e4 100644
--- a/sys/sys/lwp.h
+++ b/sys/sys/lwp.h
@@ -536,6 +536,7 @@ curlwp_bind(void)

bound = curlwp->l_pflag & LP_BOUND;
curlwp->l_pflag |= LP_BOUND;
+   __insn_barrier();

return bound;
 }
@@ -545,6 +546,7 @@ curlwp_bindx(int bound)
 {

KASSERT(curlwp->l_pflag & LP_BOUND);
+   __insn_barrier();
curlwp->l_pflag ^= bound ^ LP_BOUND;
 }

-- 
Mateusz Guzik 


mutex vs turnstile

2018-01-09 Thread Mateusz Guzik
Some time ago I wrote about performance problems when doing high -j
build.sh and made few remarks about mutex implementation.

TL;DR for that one was mostly that cas returns the found value, so
explicit re-reads can be avoided. There are also avoidable explicit
barriers (yes, I know about the old Opteron errata).

I had another look and have a remark definitely worth looking at (and
easy to fix). Unfortunately I lost my test jig so can't benchmark.

That said, here it is:

After it is concluded that the lock owner sleeps:

ts = turnstile_lookup(mtx);
/*
 * Once we have the turnstile chain interlock, mark the
 * mutex has having waiters.  If that fails, spin again:
 * chances are that the mutex has been released.
 */
if (!MUTEX_SET_WAITERS(mtx, owner)) {
turnstile_exit(mtx);
owner = mtx->mtx_owner;
continue;
}

Note that the lock apart from being free, can be:
1. owned by the same owner, which is now running

In this case the bit is set spuriously and triggers a slow-path
unlock.

2. owned by a different owner, which is NOT running

Assuming the owner still has the lock and is not running after
turnstile_exit, this causes an extra trip.

That said, the proposed change is trivial and follows what FreeBSD is doing:
re-read and abort the sleep unless the lock is owned by an off-cpu thread.

Note there is a minor optimisation of not setting the bit if already
seen.

That said, the minimal fix would look like this (untested):
    ts = turnstile_lookup(mtx);
    owner = mtx->mtx_owner;
    if (mutex_oncpu(owner) || !MUTEX_SET_WAITERS(mtx, owner)) {
        turnstile_exit(mtx);
        owner = mtx->mtx_owner;
        continue;
    }

If the value from cas were preserved it would definitely make sense
to re-test a failed MUTEX_SET_WAITERS while still holding the turnstile lock.

It looks like rw locks can use a similar treatment.

A fun fact is that the implementation on Illumos behaves worse in this regard:
it sets the waiters bit regardless of *who* owns the lock (or whether they
are running), but then only goes to sleep if the *original* owner has
the lock.

-- 
Mateusz Guzik 


Re: performance issues during build.sh -j 40 kernel

2017-09-11 Thread Mateusz Guzik
On Sat, Sep 09, 2017 at 08:48:19PM +0200, Mateusz Guzik wrote:
>
> Here is a bunch of "./build.sh -j 40 kernel=MYCONF > /dev/null" on stock
> kernel:
>   618.65s user 1097.80s system 2502% cpu 1:08.60 total
[..]
>
> And on kernel with total hacks:
>   594.08s user 693.11s system 2459% cpu 52.331 total
[..]
>
> ==
>
> Here is a flamegraph from a fully patched kernel:
> https://people.freebsd.org/~mjg/netbsd/build-kernel-j40.svg
>
> And here are top mutex spinners:
>  59.42 1560022 184255.00 e40138351180   
>  57.52 1538978 178356.84 e40138351180   uvm_fault_internal+7e0
>   1.238884   3819.43 e40138351180   uvm_unmap_remove+101
>   0.67   12159   2078.61 e40138351180   cache_lookup+97
>
> (see https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-lockstat.txt
)
>

So I added PoC batching to uvm_fault_lower_lookup and uvm_anon_dispose.

While real time barely moved and %sys is floating somewhere around 630,
I'm happy to report that wait time on global locks dropped
significantly:

 46.03 1162651  85410.88 e40127167040   
 43.80 1146153  81273.38 e40127167040   uvm_fault_internal+7c0
  1.527112   2827.06 e40127167040   uvm_unmap_remove+101
  0.719385   1310.42 e40127167040   cache_lookup+a5
  0.00   1  0.01 e40127167040   vfs_vnode_iterator_next1+87

https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-hacks2.svg

https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-hacks2-lockstat.txt

You can see on the flamegraph that the entire time spent in the page
fault handler dropped and the non-user time shifted to syscall handling.

Specifically, now genfs_lock is a more significant player accounting for
about 8.7% total time (6.6% previously).

Batching can be enabled with:
sysctl -w use_anon_dispose_pagelocked=1
sysctl -w uvm_fault_batch_requeue=1

Mix of total hackery is here:
https://people.freebsd.org/~mjg/netbsd/hacks2.diff

I'm quite certain there are other trivial wins in the handler.

I also noted that the mutex_spin_retry routine never lowered the spl. I added
total-crap support for changing it, but did not measure any difference.

There is also a currently-wrong hack for the namecache: instead of
taking the interlock, first check if the usecount is 0 and, if not, try to
bump it by 1. This races with possible transitions to VS_BLOCKED.

I think the general idea will work fine if the prohibited state gets
embedded into the top bits of v_usecount. Regular bumps will be
unaffected, while a cmpxchg like here will automagically fail. There is
only the problem of code reading the count "by hand", which would have to
be updated to mask the bit.

The reason for the hack is that the interlock is in fact the vm obj
lock and it adds a tad bit of contention.

Probably with a few more fixes of the sort, cranking up the backoff will be
beneficial.

-- 
Mateusz Guzik
Swearing Maintenance Engineer


Re: performance issues during build.sh -j 40 kernel

2017-09-11 Thread Mateusz Guzik
On Sun, Sep 10, 2017 at 06:51:31PM +0100, Mindaugas Rasiukevicius wrote:
> Mateusz Guzik <mjgu...@gmail.com> wrote:
> > 1. exclusive vnode locking (genfs_lock)
> >
> > ...
> >
> > 2. uvm_fault_internal
> >
> > ...
> >
> > 4. vm locks in general
> >
>
> We know these points of lock contention, but they are not really that
> trivial to fix.  Breaking down the UVM pagequeue locks would generally
> be a major project, as it would be the first step towards NUMA support.
> In any case, patches are welcome. :)
>

Breaking locks is of course the preferred long term solution, but also
time consuming. On the other hand there are most likely reasonably easy
fixes consisting of collapsing lock/unlock cycles into just one lock/unlock
etc.

FreeBSD is no saint here either with one global lock for free pages, yet
it manages to work OK-ish with 80 hardware threads and is quite nice
with 40.

That said, I had enough problems $elsewhere to not be interested in
looking too hard here. :>

> > 3. pmap
> >
> > It seems most issues stem from slow pmap handling. Chances are there are
> > perfectly avoidable shootdowns and in fact cases where there is no need
> > to alter KVA in the first place.
>
> At least x86 pmap already performs batching and has quite efficient
> synchronisation logic.  You are right that there are some key places
> where avoiding KVA map/unmap would have a major performance improvement,
> e.g. UBC and mbuf zero-copy mechanisms (it could operate on physical
> pages for I/O).  However, these changes are not really related to pmap.
> Some subsystems just need an alternative to temporary KVA mappings.
>

I was predominantly looking at teardown of ubc mappings. The flamegraph
suggests overly high cost there.

> >
> > I would like to add a remark about locking primitives.
> >
> > Today the rage is with MCS locks, which are fine but not trivial to
> > integrate with sleepable locks like your mutexes. Even so, the current
> > implementation is significantly slower than it has to be.
> >
> > ...
> >
> > Spinning mutexes should probably be handled by a different routine.
> >
> > ...
> >
>
> I disagree, because this is a wrong approach to the problem.  Instead of
> marginally optimising the slow-path (and the more contended is the lock,
> the less impact these micro-optimisations have), the subsystems should be
> refactored to eliminate the lock contention in the first place.  Yes, it
> is much more work, but it is the long term fix.  Having said that, I can
> see some use cases where MCS locks could be useful, but it is really a low
> priority in the big picture.
>

Locks are fundamentally about damage control. As noted earlier, a spurious
bus transaction due to an avoidable read makes performance unnecessarily a
tad bit worse. That was minor anyway; the more important bit was the
backoff.

Even on systems modest by today's standards the quality of locking
primitives can be the difference between a system which is slower than
ideal but perfectly usable and one which is just dog slow.

That said, making the backoff parameters autoscale with the number of CPUs,
with some kind of upper cap, is definitely warranted.
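
As a sketch of what I mean by autoscaling (names and numbers made up, only
the shape matters):

static u_int backoff_limit;    /* hypothetical global, set at boot */

static void
backoff_limit_init(u_int ncpus)
{
    /* Grow the spin cap with the machine size, but bound it. */
    backoff_limit = MIN(64 * ncpus, 4096);
}

static void
backoff_spin(u_int *count)
{
    u_int i;

    for (i = 0; i < *count; i++)
        SPINLOCK_BACKOFF_HOOK;    /* pause or equivalent */
    if (*count < backoff_limit)
        *count <<= 1;             /* exponential up to the cap */
}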

-- 
Mateusz Guzik
Swearing Maintenance Engineer


Re: performance issues during build.sh -j 40 kernel

2017-09-11 Thread Mateusz Guzik
On Sun, Sep 10, 2017 at 07:29:11PM +0200, Maxime Villard wrote:
> Le 09/09/2017 à 20:48, Mateusz Guzik a écrit :
> > [...]
> > I installed the 7.1 release, downloaded recent git snapshot and built the
> > trunk kernel while using config stolen from the release (had to edit out
> > something about 3g modems to make it compile). I presume this is enough
> > to not have debug of any sort enabled.
>
> Not sure I understand; did you test a kernel from the netbsd-7.1 branch, or
> from netbsd-current? You might want to test netbsd-current, I know that several
> performance-related improvements were made.
>

I noted it's a current kernel. The 7.1 release bits were there to ensure
I don't run into userspace/kernel debug.

> > 3. pmap
> >
> > It seems most issues stem from slow pmap handling. Chances are there are
> > perfectly avoidable shootdowns and in fact cases where there is no need
> > to alter KVA in the first place.
>
> This seems rather surprising to me. I tried to reduce the number of shootdowns
> some time ago, but they were already optimized, and my attempts just made them
> slower to process. The only related thing I fixed was making sure there is no
> kernel page that gets flushed under a local shootdown, but as far as I
> remember, it didn't significantly improve performance (on a somewhat old
> hardware, I must admit).
>

Note this was tested on kvm, where shootdowns are more expensive than on
bare metal so the result is probably worsened compared to bare-metal
(still, kvm is a perfectly fine production vm deployment, so I don't
feel bad for testing on it).

I did not investigate in detail (I'll have to), but I believe
DragonflyBSD went to extensive measures to reduce/eliminate IPIs in
general. Most definitely worth looking at.

-- 
Mateusz Guzik
Swearing Maintenance Engineer


performance issues during build.sh -j 40 kernel

2017-09-10 Thread Mateusz Guzik
fe40138351180   cache_lookup+97

(see https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-lockstat.txt )

Note that netbsd`0x802249ba is x86_pause. Since the function
does not push a frame pointer it is shown next to the actual caller, as
opposed to above it. Sometimes called functions get misplaced anyway;
I don't know why.

1. exclusive vnode locking (genfs_lock)

It is used even for path lookup which, as can be seen, leads to avoidable
contention. From what I'm told the primary reason is ufs constructing
some state just in case it has to create an inode at the end of the lookup.

However, since most lookups are not intended to create anything, this
behavior can be made conditional. I don't know the details, but ufs on
FreeBSD most certainly uses shared locking for common case lookups.

2. uvm_fault_internal

It's shown as the main waiter for a vm obj lock. The flamegraph hints
the real problem is with uvm_pageqlock & friends taken elsewhere:
most page fault handlers serialize on the vm obj lock, while its holder
waits for uvm_pageqlock.

3. pmap

It seems most issues stem from slow pmap handling. Chances are there are
perfectly avoidable shootdowns and in fact cases where there is no need
to alter KVA in the first place.

4. vm locks in general

Most likely there are trivial cases where operations can be batched,
especially so on process exit where there are multiple pages to operate
on.

==

I would like to add a remark about locking primitives.

Today the rage is with MCS locks, which are fine but not trivial to
integrate with sleepable locks like your mutexes. Even so, the current
implementation is significantly slower than it has to be.

First, the lock word is read twice on entry to mutex_vector_enter - once
to determine the lock type and then to read the owner.

Spinning mutexes should probably be handled by a different routine.

lock cmpxchg already returns the found value (the owner). It can be
passed by the assembly routine to the slow path. This allows making
an initial pass at backoff without accessing the lock in the meantime.
In the face of contention the cacheline could have changed ownership by
the time you get to the read, thus using the value we already saw avoids
spurious bus transactions. Given the low initial spin count it should not
have negative effects.
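
A C rendition of the idea; the real fast path lives in assembly, and the
field/helper names here are made up:

/*
 * Sketch: the fast path hands whatever owner value the cmpxchg observed
 * to the slow path, so the first backoff pass can happen without
 * touching the (likely bouncing) lock word again.
 */
void
mutex_enter_sketch(kmutex_t *mtx)
{
	uintptr_t owner;

	/* lock cmpxchg returns the value found in the lock word */
	owner = (uintptr_t)atomic_cas_ptr(&mtx->mtx_owner, NULL, curlwp);
	if (__predict_true(owner == 0))
		return;				/* uncontended */

	/* contended: let the slow path start off with the owner we saw */
	mutex_vector_enter_owner(mtx, owner);	/* hypothetical slow path */
}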

Backoff parameters were hardcoded last decade and are really off even
when looking at today's modest servers. For kicks I changed the max spin
count to 1024 and in a trivial microbenchmark of doing dup2 + close
in 40 threads I got almost double the throughput.

Interestingly this change caused a regression for the kernel build.
I did not investigate, but I suspect the cause was that the vm obj lock
holder was now less aggressive in trying to grab the lock and that
caused problems for everyone else waiting on the vm obj lock.

The spin loop itself is weird in the sense that instead of just having
the pause instruction embedded it calls a function. This is probably
less power- and sibling-thread-friendly than it needs to be.
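
That is, something along these lines emitted inline instead of a call per
iteration (illustrative, x86-only, condition name made up):

	while (mutex_owner_spinning(mtx)) {
		__asm __volatile("pause" ::: "memory");
	}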

Cheers,

-- 
Mateusz Guzik 


unsafe ->p_cwdi access in mount_checkdirs

2017-09-08 Thread Mateusz Guzik
In mount_checkdirs you can find a loop:

	mutex_enter(proc_lock);
	PROCLIST_FOREACH(p, &allproc) {
		if ((cwdi = p->p_cwdi) == NULL)
			continue;
		if (cwdi->cwdi_cdir != olddp &&
		    cwdi->cwdi_rdir != olddp)
			continue;
		retry = true;
		rele1 = NULL;
		rele2 = NULL;
		atomic_inc_uint(&cwdi->cwdi_refcnt);
		mutex_exit(proc_lock);
		rw_enter(&cwdi->cwdi_lock, RW_WRITER);

The problem is that p_cwdi IS NOT protected by proc_lock. The exit path
uses reflock to destroy the object.

Since this function is not performance critical, my suggested fix is to
unconditionally take p_lock and ref cwdi only if the process is alive
and fully constructed.
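
Roughly, with the exact liveness checks being illustrative:

	mutex_enter(proc_lock);
	PROCLIST_FOREACH(p, &allproc) {
		mutex_enter(p->p_lock);
		/* skip half-constructed or exiting processes */
		if (p->p_stat == SIDL || P_ZOMBIE(p)) {
			mutex_exit(p->p_lock);
			continue;
		}
		cwdi = p->p_cwdi;
		if (cwdi == NULL ||
		    (cwdi->cwdi_cdir != olddp && cwdi->cwdi_rdir != olddp)) {
			mutex_exit(p->p_lock);
			continue;
		}
		atomic_inc_uint(&cwdi->cwdi_refcnt);
		mutex_exit(p->p_lock);
		/* ... drop proc_lock, take cwdi_lock etc. as before ... */
	}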

Alternatively, cwdi freeing can be split into 2 parts where actual freeing
of the cwdi object itself is postponed until after proc_lock lock/unlock
dance.

Pseudo-code:
struct cwdinfo *
cwddestroy(struct cwdinfo *cwdi)
{

	if (atomic_dec_uint_nv(&cwdi->cwdi_refcnt) > 0)
		return NULL;

	vrele(cwdi->cwdi_cdir);
	if (cwdi->cwdi_rdir)
		vrele(cwdi->cwdi_rdir);
	if (cwdi->cwdi_edir)
		vrele(cwdi->cwdi_edir);
	return cwdi;
}

void
cwdfree(struct cwdinfo *cwdi)
{

	pool_cache_put(cwdi_cache, cwdi);
}

Then the exit path would:
	cwdi = cwddestroy(p->p_cwdi);
	p->p_cwdi = NULL;

	mutex_enter(proc_lock);

	mutex_exit(proc_lock);
	cwdfree(cwdi);



mount_checkdirs would then use atomic_inc_not_zero (or however you have
this named).

With the above, if cwdi was spotted, it is guaranteed to not get freed
until after proc_lock is dropped. A successful non-zero -> non-zero + 1
refcount bump guarantees it won't get freed and the content will remain
valid.
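
A sketch of such a primitive built on atomic_cas_uint (the name is made
up and obviously up for grabs):

static bool
cwdi_hold_if_referenced(struct cwdinfo *cwdi)
{
	u_int cur;

	do {
		cur = cwdi->cwdi_refcnt;
		/* 0 means cwddestroy() already won; do not resurrect */
		if (cur == 0)
			return false;
	} while (atomic_cas_uint(&cwdi->cwdi_refcnt, cur, cur + 1) != cur);

	return true;
}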

-- 
Mateusz Guzik 


fork1 use-after-free of the child process

2017-09-08 Thread Mateusz Guzik
The fork1 routine can wait for the child to exit (if vforked) and/or
return the pointer to the child.

Neither case guarantees the safety of said operation. The key is that
the parent can be ignoring SIGCHLD, which results in autoreaping the
child and the child itself is made runnable. Thus in particular it can
exit and get its struct proc freed. No prevention measures are taken
AFAICS.

I did not test it on the NetBSD kernel, but it's a standard problem and
the code definitely looks buggy.

1. returning the pointer

The last parameter to fork1 is struct proc **rnewprocp. I found only 2
uses:
- linux_sys_clone - the parameter is populated from a local variable
  which is not used afterwards, i.e. the func can just as well start
  passing NULL instead
- creation of initproc - since this becomes the only remaining use of
  the parameter I think it makes sense to get rid of it altogether. The
  simplest hack which comes to mind is to simply proc_find_raw the
  process after fork returns (see the sketch below)
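
For the initproc case that could be as little as the following, assuming
pid 1 is guaranteed to be ours at that point:

	mutex_enter(proc_lock);
	initproc = proc_find_raw(1);
	mutex_exit(proc_lock);
	KASSERT(initproc != NULL);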

2. vfork

By the end there is:
	while (p2->p_lflag & PL_PPWAIT)
		cv_wait(&p1->p_waitcv, proc_lock);

Retesting after wake up triggers the bug.

The cleanest solution I'm aware of introduces a dedicated condvar to use
by vfork. The parent uses a local variable for this purpose. Since the
condvar is only used by vfork to signal exec/exit, there is no need
to access the child after waking up. The downside is extension of struct
proc with a pointer. Interestingly this is what they do in Linux.
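
A rough sketch of that approach, with a hypothetical p_vforkcv pointer
added to struct proc and all error handling omitted:

struct vfork_waiter {
	kcondvar_t	vw_cv;
	bool		vw_done;
};

/* parent side, in fork1(), set up before the child is made runnable */
	struct vfork_waiter vw;

	cv_init(&vw.vw_cv, "vforkwt");
	vw.vw_done = false;

	mutex_enter(proc_lock);
	p2->p_vforkcv = &vw;			/* hypothetical field */
	/* ... make the child runnable ... */
	while (!vw.vw_done)
		cv_wait(&vw.vw_cv, proc_lock);
	mutex_exit(proc_lock);
	cv_destroy(&vw.vw_cv);

/* child side, on exec or exit, while PL_PPWAIT is set */
	mutex_enter(proc_lock);
	if (p->p_vforkcv != NULL) {
		p->p_vforkcv->vw_done = true;
		cv_broadcast(&p->p_vforkcv->vw_cv);
		p->p_vforkcv = NULL;
	}
	mutex_exit(proc_lock);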

FreeBSD does the obvious of grabbing a reference preventing the
destruction of struct proc from going through.

Solaris surprisingly has a total hack: they look up the pid and see if
the found process (if any) has to be waited on. Apart from general ugliness
it's also a minor scalability problem due to extra lock/unlock of the
global process list lock.

-- 
Mateusz Guzik 


Re: In-kernel process exit hooks?

2016-01-09 Thread Mateusz Guzik
  return (filemon);
>}
>}

But p_pid should be constant for 'struct proc' lifetime, so locking this
early is wasteful.

>lp = p;
>p = p->p_pptr;
>rw_exit(&lp->p_reflock);

I guess the lock is needed to read the pointer to the parent.

Nothing was done to keep the parent around, which means it could have
exited and be freed by now. Which in turn makes the second loop
iteration a use-after-free.

The first iteration is safe if the argument is curproc (which it is), as
it clearly cannot disappear.

Turns out this traversal is even more wrong since p_pptr does not have
to be the process you are looking for - ptrace is reparenting tracees.

>}
>}
>return (NULL);
>}

-- 
Mateusz Guzik 



Re: In-kernel process exit hooks?

2016-01-09 Thread Mateusz Guzik
On Sat, Jan 09, 2016 at 02:25:02PM +0800, Paul Goyette wrote:
> On Sat, 9 Jan 2016, Mateusz Guzik wrote:
> 
> >While I don't know all the details, it is clear that the purpose would
> >be much better served by ktrace and I would argue efforts should be
> >spent there.
> 
> filemon's purpose is somewhat different than that of ktrace.  There
> is a limited set of events that filemon captures, and it only
> captures successful operations.  Its output is quite simple and easy
> to parse by a human or shell script.
> 

ktrace already has limited support for specifying what events are
wanted. One can add a special filemon-like mode.

The output of kdump is definitely less friendly for scripts, but it is
way easier to write a script formatting it than it is to fix filemon.

> >From a usability perspective what seems to be needed are a few hacks
> >to ktrace to only log the stuff make is interested in, and parsing code
> >for bmake to translate that to what looks like the current filemon
> >format for the aforementioned copying.
> 
> Perhaps.  But filemon is also a run-time loadable module (and it is
> preferred that it NOT be built-in to any kernels).  If I remember
> correctly, ktrace has hooks all over the system and does not readily
> lend itself to modularization.
> 

I doubt that compiling in ktrace on a kernel which can handle filemon is
an issue, and even then it should outweigh the hoops around filemon.

For a correct, minimalistic and safe filemon module the kernel would
already have to grow hooks in a few places (see below).


> >Filemon monitors stuff at syscall level, stores the fd and remembers
> >pids.
> 
> The fd is no longer being stored...  :)  But yes, it does still
> remember pids, so it can determine if an event belongs to a monitored
> process tree.  Again IIRC, ktrace is somewhat more intrusive, in that
> it manipulates various process flags which can affect behavior of the
> target (monitored) process.

It has a dedicated field in struct proc which is the only sensible way
of implementing this I can think of.

> >>   error = kauth_authorize_process(curproc->p_cred,
> >>   KAUTH_PROCESS_CANSEE, tp,
> >>   KAUTH_ARG(KAUTH_REQ_PROCESS_CANSEE_ENTRY), NULL, NULL);
> >
> >Now that the check is passed the process could have executed something
> >setuid.
> 
> Which process could have executed something setuid?  How would that
> matter?  (I will confess I haven't checked, but I would be surprised
> if "executing something setuid" would change the result of kauth()
> check. But I could be wrong here.)
> 

You start monitoring a given process which then proceeds to execve a
setuid file, which gives you information about what it is doing, even
though the kauth_authorize_process check would now fail. This is
basically what you yourself stated below about pid reuse.

So this either needs to prevent processes from changing credentials when
traced or stop tracing when such a change occurs. The former seems like
a rather bad idea and the latter is already done by ktrace.

> >I'm somewhat surprised this code is here. Should not
> >kauth_authorize_process assert that tp has stable credentials in some
> >way, in effect just crashing filemon consumers?
> >
> >>   if (!error) {
> >>   filemon->fm_pid = tp->p_pid;
> >>   }
> 
> Why does the target process need stable credentials?  It's not doing
> anything for filemon.
> 

See below.

> >Pid is stored, but tp is not held. What if the 'traced' process exits?
> >grep only reveals the following bit in filemon_wrapper_sys_exit:
> >
> >>   if (filemon->fm_pid == curproc->p_pid)
> >>   filemon_printf(filemon, "# Bye bye\n");
> >
> >So filemon will continue tracing anything which happened to reuse the
> >pid.
> 
> Yes, this is an outstanding issue.  More particularly, the new user
> of the pid might not be accessible by the monitoring process (ie, it
> would fail the earlier kauth() check).  But holding the pointer to
> the proc isn't really the answer either - although less likely it is
> entirely possible that the address could have been reused.
> 

What you need is the ability to clear tracing bits on process exit.
ktrace definitely has relevant bits in place already.

> (It's too bad we don't have "generation numbers" on the pid, which
> would easily detect reuse.)

That would require struct proc to be type stable, which seems wasteful.

> >>   lp = p;
> >>   p = p->p_pptr;
> >>   rw_exit(&lp->p_reflock);
> >
> >I guess the lock is needed