Re: Unexpected out of memory kills when running parallel find instances over millions of files
On Thu, Oct 19, 2023 at 10:49:37AM +, Michael van Elst wrote:
> mjgu...@gmail.com (Mateusz Guzik) writes:
>
> >Running 20 find(1) instances, where each has a "private" tree with
> >millions of files, runs into trouble with the kernel killing them (and
> >others):
> >[ 785.194378] UVM: pid 1998.1998 (find), uid 0 killed: out of swap
> >
> >This should not be happening -- there is tons of reusable RAM, as
> >virtually all of the vnodes getting here are immediately recyclable.
>
> While vnodes would be recyclable, they hardly get recycled unless
> a filesystem object is deleted or the filesystem is unmounted.

They get recycled all the time by the vdrain thread if numvnodes goes above desiredvnodes, as it does in this test. Part of the problem is that with 20 processes walking the filesystem the thread gets outpaced (20:1) and has no means of stopping them, while the memory allocator just keeps handing objects out.

> >Specs are 24 cores, 24G of RAM and ufs2 with noatime. swap is *not*
> >configured.
>
> Without swap, the kernel also has no chance to evict process pages
> to grow the vnode cache further.

It should not be trying to grow the vnode cache. If anything, it should stop the cache from blowing out of proportion, and it definitely should not kill processes in the presence of swaths of immediately freeable vnodes. As noted above there is code that tries to do this, but it is not sufficient.

I tested several systems (Linux, all the BSDs and even Illumos) and only NetBSD fails to complete the run. That is to say, even OpenBSD chugs along no problem. This is definitely a reliability problem in the kernel.

Traditionally vnode allocation would recycle something from the "free" list if need be. Perhaps restoring this behavior is the easiest way out for the time being.
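The "recycle from the free list at allocation time" behavior mentioned above can be sketched in userspace. This is a hypothetical toy model (the names `tvnode`, `desiredvnodes` etc. are illustrative stand-ins, not the kernel's actual code): once the cap is reached, allocation reuses the oldest free entry inline, so any number of concurrent walkers cannot outpace reclamation the way they outpace an asynchronous vdrain thread.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of inline free-list recycling. Allocation reuses the oldest
 * unreferenced entry once the cap is hit instead of growing without bound
 * and relying on a separate drain thread to keep up.
 */
struct tvnode {
	struct tvnode *t_next;	/* freelist linkage */
	int t_ino;
};

static struct tvnode *free_head, *free_tail;	/* FIFO freelist */
static int numvnodes;
static const int desiredvnodes = 4;		/* tiny cap for illustration */

static struct tvnode *
tvnode_alloc(int ino)
{
	struct tvnode *vp;

	if (numvnodes >= desiredvnodes && free_head != NULL) {
		/* Over the cap: recycle the oldest free vnode in place. */
		vp = free_head;
		free_head = vp->t_next;
		if (free_head == NULL)
			free_tail = NULL;
	} else {
		/* Under the cap (or nothing recyclable): really allocate. */
		vp = malloc(sizeof(*vp));
		numvnodes++;
	}
	vp->t_ino = ino;
	return vp;
}

static void
tvnode_release(struct tvnode *vp)
{
	/* Last reference dropped: append to freelist, become recyclable. */
	vp->t_next = NULL;
	if (free_tail != NULL)
		free_tail->t_next = vp;
	else
		free_head = vp;
	free_tail = vp;
}
```

With this shape, a tree walk touching millions of files keeps numvnodes pinned at the cap rather than triggering out-of-memory kills.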
Unexpected out of memory kills when running parallel find instances over millions of files
Running 20 find(1) instances, where each has a "private" tree with millions of files, runs into trouble with the kernel killing them (and others):

[ 785.194378] UVM: pid 1998.1998 (find), uid 0 killed: out of swap
[ 785.194378] UVM: pid 2010.2010 (find), uid 0 killed: out of swap
[ 785.224675] UVM: pid 1771.1771 (top), uid 0 killed: out of swap
[ 785.285291] UVM: pid 1960.1960 (zsh), uid 0 killed: out of swap
[ 785.376172] UVM: pid 2013.2013 (find), uid 0 killed: out of swap
[ 785.416572] UVM: pid 1760.1760 (find), uid 0 killed: out of swap
[ 785.416572] UVM: pid 1683.1683 (tmux), uid 0 killed: out of swap

This should not be happening -- there is tons of reusable RAM, as virtually all of the vnodes getting here are immediately recyclable.

$elsewhere I got a report of a workload with hundreds of millions of files which get walked in parallel -- a number high enough that it does not fit in RAM on the boxes which run it. Out of curiosity I figured I would check how others are doing on this front, but the key point is that this is not a made-up problem.

I'm running NetBSD 10, kernel built from this commit at the top of the tree:

Author: andvar
Date: Sat Oct 14 08:05:25 2023 +

    fix various typos in comments and documentation, mainly in word "between".

Specs are 24 cores, 24G of RAM and ufs2 with noatime. Swap is *not* configured.

The test generates 20 separate trees, each with 1000 directories of 1000 files each (20 million files in total, plus some directories). Repro instructions are here:

https://people.freebsd.org/~mjg/.junk/fstree.tgz

Note that parallel creation of these trees is dog slow; it took over 40 minutes for me. I had to pass extra flags to newfs for the target fs to even fit this inode count:

newfs -n 22000 -O 2 /dev/wd1e

So the expected outcome is that this finishes (extra points for reasonable time) instead of userspace getting killed.

I don't know what kind of diagnostic info would be best here, but given the repro steps above I don't think I need to look for anything. Have fun.
:) -- Mateusz Guzik
Re: tmpfs vs VV_LOCKSWORK
Now that I sent the e-mail I had a look at access(). The benchmark at hand is custom code added to will-it-scale, pasted at the bottom.

On 8/20/20, Mateusz Guzik wrote:
> tmpfs does *NOT* set the flag. While I can't be arsed to verify
> semantics, I suspect it is more than eligible.
>
> This came up as a slowdown in an "access" microbenchmark with the target
> file on tmpfs, stemming from:
>
> commit 5645873e2da675f475c8b8d61260c8ffe9d411ab
> Author: riastradh
> Date: Tue Aug 4 03:00:10 2020 +
>
>     Fix bogus fast path in vput.
>
> I wrote a trivial patch for it (see below).
> Results are (ops/s):
> stock: 1721941
> aug 4: 1592518
> patch: 1664363
>
> The difference stems from calling VOP_ISLOCKED.

I meant that not all of the performance is recovered because of the newly added VOP_ISLOCKED call. However, looking closer, access() passes a share-locked vnode, so the Aug 4 change fixed a serious bug for tmpfs handling.

As for recovering the perf, I once more suggest VOP_NEED_INACTIVE, which is the easiest way to avoid the read -> write lock transition in the common case. Preferably the need to perform this processing would be expressed with a bit in the usecount or similar, but that's rather invasive.

> The patch is a PoC, I don't claim much correctness. Basically
> someone(tm) will have to take care of the problem.
> diff --git a/sys/fs/tmpfs/tmpfs_subr.c b/sys/fs/tmpfs/tmpfs_subr.c
> index 92a0eedf4443..394320fe71ce 100644
> --- a/sys/fs/tmpfs/tmpfs_subr.c
> +++ b/sys/fs/tmpfs/tmpfs_subr.c
> @@ -136,6 +136,8 @@ tmpfs_init_vnode(struct vnode *vp, tmpfs_node_t *node)
>  		/* FALLTHROUGH */
>  	case VLNK:
>  	case VREG:
> +		vp->v_vflag |= VV_LOCKSWORK;
> +		/* FALLTHROUGH */
>  	case VSOCK:
>  		vp->v_op = tmpfs_vnodeop_p;
>  		break;
>
> --
> Mateusz Guzik

[the benchmark; the header names were eaten by formatting, these are the ones the code needs]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>

char *testcase_description = "Separate file access(2)";

void testcase(unsigned long long *iterations, unsigned long nr)
{
	char tmpfile[] = "/tmp/willitscale.XXXXXX";
	int fd = mkstemp(tmpfile);

	assert(fd >= 0);
	close(fd);

	while (1) {
		int error = access(tmpfile, R_OK);
		assert(error == 0);

		(*iterations)++;
	}

	unlink(tmpfile);
}

-- 
Mateusz Guzik
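The VOP_NEED_INACTIVE suggestion above can be illustrated with a userspace toy model. Everything here (`xvnode`, `xvrele`, the counters) is a hypothetical stand-in, not NetBSD's code, and a real version would use atomics; the point is only the shape of the fast path: ask the filesystem whether inactive processing is needed *before* paying for the exclusive lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy vnode: a usecount plus a stand-in for "the inode has work to do". */
struct xvnode {
	int v_usecount;
	bool v_unlinked;
};

static int shared_acquires;	/* cheap: many holders at once */
static int excl_acquires;	/* expensive: the transition we want to avoid */

static bool
xvop_need_inactive(struct xvnode *vp)
{
	/* Almost always false: e.g. only unlinked files need processing. */
	return vp->v_unlinked;
}

static void
xvrele(struct xvnode *vp)
{
	shared_acquires++;		/* "take the vnode lock shared" */
	if (--vp->v_usecount > 0 || !xvop_need_inactive(vp)) {
		return;			/* common case: no upgrade needed */
	}
	/* Rare case: relock exclusively and do the inactive processing. */
	excl_acquires++;
	/* ... VOP_INACTIVE() equivalent would run here ... */
}
```

In an access()-style loop the predicate returns false every time, so the release path never leaves the shared-lock regime, which is where the recovered throughput would come from.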
tmpfs vs VV_LOCKSWORK
tmpfs does *NOT* set the flag. While I can't be arsed to verify semantics, I suspect it is more than eligible.

This came up as a slowdown in an "access" microbenchmark with the target file on tmpfs, stemming from:

commit 5645873e2da675f475c8b8d61260c8ffe9d411ab
Author: riastradh
Date: Tue Aug 4 03:00:10 2020 +

    Fix bogus fast path in vput.

I wrote a trivial patch for it (see below). Results are (ops/s):

stock: 1721941
aug 4: 1592518
patch: 1664363

The difference stems from calling VOP_ISLOCKED.

The patch is a PoC, I don't claim much correctness. Basically someone(tm) will have to take care of the problem.

diff --git a/sys/fs/tmpfs/tmpfs_subr.c b/sys/fs/tmpfs/tmpfs_subr.c
index 92a0eedf4443..394320fe71ce 100644
--- a/sys/fs/tmpfs/tmpfs_subr.c
+++ b/sys/fs/tmpfs/tmpfs_subr.c
@@ -136,6 +136,8 @@ tmpfs_init_vnode(struct vnode *vp, tmpfs_node_t *node)
 		/* FALLTHROUGH */
 	case VLNK:
 	case VREG:
+		vp->v_vflag |= VV_LOCKSWORK;
+		/* FALLTHROUGH */
 	case VSOCK:
 		vp->v_op = tmpfs_vnodeop_p;
 		break;

-- 
Mateusz Guzik
Re: notes from running will-it-scale
There is:

Built-in Function: void * __builtin_assume_aligned (const void *exp, size_t align, ...)

but I don't know how it works out in practice. Note both the direct map arg and the address should be aligned.

However, an argument for dedicated routines is that on CPUs with ERMS it is faster to rep stosb/movsb than stosq/movsq (and the other way around without said extension). If a rep-using memset is inlined you are stuck with stosq by default.

On 7/19/20, Jaromír Doleček wrote:
> Very interesting, particularly the outrageous assembly for
> pmap_{zero,copy}_page().
>
> Is there some way to tell the compiler that the address is already
> 4096-aligned and avoid the conditionals? Failing that, we could just
> adopt the FreeBSD assembly for this.
>
> Does anyone see a problem with introducing a vfs.timestamp_precision
> knob to avoid the rdtscp?
>
> Jaromir
>
> On Sun, Jul 19, 2020 at 13:21, Mateusz Guzik wrote:
>>
>> Hello,
>>
>> I recently took an opportunity to run cross-system microbenchmarks
>> with will-it-scale and included NetBSD (amd64).
>>
>> https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
>> [no linux in this doc, I will probably create a new one soon(tm)]
>>
>> The system has a lot of problems in the vfs layer; vm is a mixed bag,
>> with multithreaded cases lagging behind and some singlethreaded ones
>> being pretty good (and at least one winning against the other systems).
>>
>> Notes:
>> - rdtscp is very expensive in vms, yet the kernel unconditionally
>> performs it by calling vfs_timestamp. Both FreeBSD and DragonflyBSD have
>> a knob to change the resolution (and consequently avoid the
>> instruction); I think you should introduce it and default to less
>> accuracy on vms. Sample results:
>> stock pipe1:   2413901
>> patched pipe1: 3147312
>> stock vfsmix:    13889
>> patched vfsmix:  73477
>> - sched_yield is apparently a nop when the binary is not linked with
>> pthread. This does not match other systems and is probably a bug.
>> - pmap_zero_page/pmap_copy_page compile to atrocious code which keeps
>> checking for alignment. The compiler does not know what values can be
>> assigned to pmap_direct_base and improvises.
>>
>>    0x805200c3 <+0>:	add    0xf93b46(%rip),%rdi	# 0x814b3c10
>>    0x805200ca <+7>:	mov    $0x1000,%edx
>>    0x805200cf <+12>:	xor    %eax,%eax
>>    0x805200d1 <+14>:	test   $0x1,%dil
>>    0x805200d5 <+18>:	jne    0x805200ff
>>    0x805200d7 <+20>:	test   $0x2,%dil
>>    0x805200db <+24>:	jne    0x8052010b
>>    0x805200dd <+26>:	test   $0x4,%dil
>>    0x805200e1 <+30>:	jne    0x80520116
>>    0x805200e3 <+32>:	mov    %edx,%ecx
>>    0x805200e5 <+34>:	shr    $0x3,%ecx
>>    0x805200e8 <+37>:	rep stos %rax,%es:(%rdi)
>>    0x805200eb <+40>:	test   $0x4,%dl
>>    0x805200ee <+43>:	je     0x805200f1
>>    0x805200f0 <+45>:	stos   %eax,%es:(%rdi)
>>    0x805200f1 <+46>:	test   $0x2,%dl
>>    0x805200f4 <+49>:	je     0x805200f8
>>    0x805200f6 <+51>:	stos   %ax,%es:(%rdi)
>>    0x805200f8 <+53>:	and    $0x1,%edx
>>    0x805200fb <+56>:	je     0x805200fe
>>    0x805200fd <+58>:	stos   %al,%es:(%rdi)
>>    0x805200fe <+59>:	retq
>>    0x805200ff <+60>:	stos   %al,%es:(%rdi)
>>    0x80520100 <+61>:	mov    $0xfff,%edx
>>    0x80520105 <+66>:	test   $0x2,%dil
>>    0x80520109 <+70>:	je     0x805200dd
>>    0x8052010b <+72>:	stos   %ax,%es:(%rdi)
>>    0x8052010d <+74>:	sub    $0x2,%edx
>>    0x80520110 <+77>:	test   $0x4,%dil
>>    0x80520114 <+81>:	je     0x805200e3
>>    0x80520116 <+83>:	stos   %eax,%es:(%rdi)
>>    0x80520117 <+84>:	sub    $0x4,%edx
>>    0xffff8052011a <+87>:	jmp    0x805200e3
>>
>> The thing to do in my opinion is to just provide dedicated asm funcs.
>> This is the equivalent on FreeBSD (ifunc'ed):
>>
>> ENTRY(pagezero_std)
>> 	PUSH_FRAME_POINTER
>> 	movl	$PAGE_SIZE/8,%ecx
>> 	xorl	%eax,%eax
>> 	rep
>> 	stosq
>> 	POP_FRAME_POINTER
>> 	ret
>> END(pagezero_std)
>>
>> ENTRY(pagezero_erms)
>> 	PUSH_FRAME_POINTER
>> 	movl	$PAGE_SIZE,%ecx
>> 	xorl	%eax,%eax
>> 	rep
>> 	stosb
>> 	POP_FRAME_POINTER
>> 	ret
>> END(pagezero_erms)
>>
>> --
>> Mateusz Guzik
>
-- 
Mateusz Guzik
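For reference, the builtin quoted at the top of this message is used like the sketch below (the `page_zero` wrapper and `TPAGE_SIZE` are illustrative, not kernel code). With the alignment hint the compiler is *allowed* to drop the alignment-dispatch prologue seen in the disassembly and emit straight wide stores; whether it actually does is compiler- and flag-dependent, which is the "don't know how it works out in practice" caveat.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TPAGE_SIZE 4096

/*
 * Zero a page through a pointer the compiler is told is 4096-aligned.
 * Behavior is identical to plain memset; only codegen may differ.
 */
static void
page_zero(void *va)
{
	void *p = __builtin_assume_aligned(va, TPAGE_SIZE);

	memset(p, 0, TPAGE_SIZE);
}
```

The promise is the caller's responsibility: passing a misaligned pointer here is undefined behavior, which matches the note above that both the direct map base and the offset must actually be aligned.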
notes from running will-it-scale
Hello,

I recently took an opportunity to run cross-system microbenchmarks with will-it-scale and included NetBSD (amd64).

https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
[no linux in this doc, I will probably create a new one soon(tm)]

The system has a lot of problems in the vfs layer; vm is a mixed bag, with multithreaded cases lagging behind and some singlethreaded ones being pretty good (and at least one winning against the other systems).

Notes:
- rdtscp is very expensive in vms, yet the kernel unconditionally performs it by calling vfs_timestamp. Both FreeBSD and DragonflyBSD have a knob to change the resolution (and consequently avoid the instruction); I think you should introduce it and default to less accuracy on vms. Sample results:
  stock pipe1:   2413901
  patched pipe1: 3147312
  stock vfsmix:    13889
  patched vfsmix:  73477
- sched_yield is apparently a nop when the binary is not linked with pthread. This does not match other systems and is probably a bug.
- pmap_zero_page/pmap_copy_page compile to atrocious code which keeps checking for alignment. The compiler does not know what values can be assigned to pmap_direct_base and improvises.
   0x805200c3 <+0>:	add    0xf93b46(%rip),%rdi	# 0x814b3c10
   0x805200ca <+7>:	mov    $0x1000,%edx
   0x805200cf <+12>:	xor    %eax,%eax
   0x805200d1 <+14>:	test   $0x1,%dil
   0x805200d5 <+18>:	jne    0x805200ff
   0x805200d7 <+20>:	test   $0x2,%dil
   0x805200db <+24>:	jne    0x8052010b
   0x805200dd <+26>:	test   $0x4,%dil
   0x805200e1 <+30>:	jne    0x80520116
   0x805200e3 <+32>:	mov    %edx,%ecx
   0x805200e5 <+34>:	shr    $0x3,%ecx
   0x805200e8 <+37>:	rep stos %rax,%es:(%rdi)
   0x805200eb <+40>:	test   $0x4,%dl
   0x805200ee <+43>:	je     0x805200f1
   0x805200f0 <+45>:	stos   %eax,%es:(%rdi)
   0x805200f1 <+46>:	test   $0x2,%dl
   0x805200f4 <+49>:	je     0x805200f8
   0x805200f6 <+51>:	stos   %ax,%es:(%rdi)
   0x805200f8 <+53>:	and    $0x1,%edx
   0x805200fb <+56>:	je     0x805200fe
   0x805200fd <+58>:	stos   %al,%es:(%rdi)
   0x805200fe <+59>:	retq
   0x805200ff <+60>:	stos   %al,%es:(%rdi)
   0x80520100 <+61>:	mov    $0xfff,%edx
   0x80520105 <+66>:	test   $0x2,%dil
   0x80520109 <+70>:	je     0x805200dd
   0x8052010b <+72>:	stos   %ax,%es:(%rdi)
   0x8052010d <+74>:	sub    $0x2,%edx
   0x80520110 <+77>:	test   $0x4,%dil
   0x80520114 <+81>:	je     0x805200e3
   0x80520116 <+83>:	stos   %eax,%es:(%rdi)
   0x80520117 <+84>:	sub    $0x4,%edx
   0x8052011a <+87>:	jmp    0x805200e3

The thing to do in my opinion is to just provide dedicated asm funcs. This is the equivalent on FreeBSD (ifunc'ed):

ENTRY(pagezero_std)
	PUSH_FRAME_POINTER
	movl	$PAGE_SIZE/8,%ecx
	xorl	%eax,%eax
	rep
	stosq
	POP_FRAME_POINTER
	ret
END(pagezero_std)

ENTRY(pagezero_erms)
	PUSH_FRAME_POINTER
	movl	$PAGE_SIZE,%ecx
	xorl	%eax,%eax
	rep
	stosb
	POP_FRAME_POINTER
	ret
END(pagezero_erms)

-- 
Mateusz Guzik
Re: Please review: lookup changes
>> locks and not hashing itself. Note that at the time vnode interlock and
>> vm object lock were the same thing and in this workload the lock is
>> under a lot of contention. Should the lock owner get preempted, anyone
>> doing a lookup of the affected vnode (e.g., libc) will be holding
>> the relevant per-cpu lock and will block on a turnstile. Whoever ends
>> up running on the affected cpu is likely to do a lookup of their own,
>> but the relevant per-cpu lock is taken, so they go off cpu. The same
>> thing happening on more than one cpu at a time could easily result in a
>> cascading failure, which I strongly suspect is precisely what happened.
>>
>> That is, the win does not stem from rb trees but from finer-grained
>> locking which does not block other threads which look up something else
>> on the same cpu.
>
> Not on NetBSD. Kernel preemption is possible and allowed (mainly for real
> time applications), but happens infrequently during normal operation.
> There are a number of pieces of code that take advantage of that fact and
> are "optimistically per-CPU", and they work very well as preemption is
> rarely observed. Further, the blocking case on v_interlock in
> cache_lookup() is rare. That's not to say it doesn't happen, it does, but
> I don't think it is enough to explain the performance differences.

I noted suspected preemption was occurring because of contention on the vm side. It should only take one preemption to start the cascade.

>> As mentioned earlier I think rb trees instead of a hash are pessimal
>> here.
>>
>> First, a little step back. The lookup starts with securing vnodes from
>> cwdinfo. This represents a de facto global serialisation point (times
>> two, since the work has to be reverted later). In FreeBSD I implemented
>> an equivalent with copy-on-write semantics. The easy take on it is that
>> I take an rwlock-equivalent and then grab a reference on the found
>> struct. This provides me with an implicit reference on the root and
>> current working directory vnodes.
>> If the struct is unshared on fork, the aforementioned serialisation
>> point becomes localized to the process.
>
> Interesting. Sounds somewhat like what both NetBSD and FreeBSD do for
> process credentials.

I was thinking about doing precisely that, but I found it iffy to have permanently stored references per-thread. With the proposal they get "gained" around the actual lookup; otherwise this is very similar to what mountcheckdirs is dealing with right now.

>> In my tests, even with lookups which share most path components, the
>> last one tends to be different. Using a hash means this typically
>> results in grabbing different locks for that case and consequently
>> fewer cache-line ping pongs.
>
> My experience has been different. What I've observed is that shared hash
> tables usually generate huge cache pressure unless they're small and
> rarely updated. If the hash were small, matching the access pattern
> (e.g. per-dir), then I think it would have the opportunity to make
> maximum use of the cache. That could be a great win and certainly better
> than rbtree.

Well, in my tests this is all heavily dominated by SMP effects, which I expect to be exacerbated by having just one lock.

A side note is that I had a look at your vput. The pre-read + VOP_UNLOCK + actual loop to drop the ref definitely slows things down, if only a little bit, as this can force a shared cacheline transition from under someone cmpxchg-ing.

That said, can you generate a flamegraph from a fully patched kernel? Curious where the time is spent now; my bet is spinning on vnode locks.

-- 
Mateusz Guzik
Re: Please review: lookup changes
Something ate a huge chunk of my e-mail, resending.

On 3/8/20, Andrew Doran wrote:
> On Sat, Mar 07, 2020 at 12:14:05AM +0100, Mateusz Guzik wrote:
>> I believe it is always legal to just upgrade the lock in getattr if
>> necessary. Since itimes updates are typically not needed, this would
>> mean the common case would get away with shared locking. This would run
>> into issues in vput, for which see below.
>
> Hmm. With NetBSD upgrading a lock in the middle of a VOP could have
> consequences for fstrans. I'm not sure.

I don't know how you do upgrades specifically. "Mere" upgrades can of course fail in the presence of additional readers. In FreeBSD there is an option to release the lock entirely and relock from scratch, in particular meaning you can undo the fstrans enter (if necessary). While I don't know the specifics of your VFS layer, the v_usecount which is held should postpone freeing of the vnode. Should it fall victim to a forced unmount or some other state change, you can just return an error, pretending getattr never got this far.

>> Assuming this is fixed, fstat could also shared lock. I think this is a
>> simple change to make and most certainly worth doing.
>>
>> > vfs_vnode.c:
>> >
>> > 	Namecache related changes, and the change to not block new vnode
>> > 	references across VOP_INACTIVE() mentioned on tech-kern. I don't
>> > 	think this blocking is needed at least for well behaved FSes like
>> > 	ffs & tmpfs.
>>
>> I think this is very hairy and the entire thing is avoidable.
>>
>> I believe I mentioned some other time that FreeBSD got
>> VOP_NEED_INACTIVE. You should be able to trivially implement an
>> equivalent which would be called if the vnode is only shared locked.
>> Supporting an unlocked case comes with extra woes which may not be
>> worth it at this stage.
>>
>> Since inactive processing is almost never needed this would avoid a lot
>> of lock contention.
>> With all this in place you would also not run into the problematic
>> state in the fast path in the common case.
>>
>> This is a cheap cop out; the real solution would encode the need for
>> inactive processing in usecount as a flag, but that's a lot of work.
>
> Two separate issues - I agree on avoiding VOP_INACTIVE(), but the issue
> here is peculiar to NetBSD I think. NetBSD sets the equivalent of the
> 4.4BSD VXLOCK flag to block vget() across VOP_INACTIVE(). Should only be
> needed for revoke/clean. At one point we did have VXLOCK embedded in
> v_usecount which allowed for vget() with one atomic op, but that was
> undone at some point, I don't know why.

If you have a dedicated spot which is guaranteed to be executed for exclusion against vget, then things become easier in terms of effort required. A dedicated vget variant for the namecache can xadd into the counter and, if the bit is set, backpedal from it. New arrivals should be eliminated by purging nc entries beforehand. Everyone else can cmpxchg-loop with the interlock like right now.

>> > Performance:
>> >
>> > 	During a kernel build on my test system it yields about a 15% drop
>> > 	in system time:
>> >
>> > 	HEAD     79.16s real  1602.97s user  419.54s system
>> > 	changes  77.26s real  1564.38s user  368.41s system
>> >
>> > 	The rbtree lookup can be assumed to be slightly slower than a hash
>> > 	table lookup and microbenchmarking shows this sometimes in the
>> > 	single CPU case, but as part of namei() it ends up being
>> > 	ameliorated somewhat by the reduction in locking costs, the
>> > 	absence of hash bucket collisions with rbtrees, and better L2/L3
>> > 	cache use on hot directories. I don't have numbers but I know
>> > 	joerg@ saw significant speedups on pbulk builds with an earlier
>> > 	version of this code with little changed from HEAD other than
>> > 	hash -> rbtree. Anyway rbtree isn't set in stone as the data
>> > 	structure, it's just what we have in tree that's most suitable
>> > 	right now.
>> > 	It can be changed.
>>
>> I strongly suspect the win observed with pbulk had to do with per-cpu
>> locks and not hashing itself. Note that at the time vnode interlock and
>> vm object lock were the same thing and in this workload the lock is
>> under a lot of contention. Should the lock owner get preempted, anyone
>> doing a lookup of the affected vnode (e.g., libc) will be holding
>> the relevant per-cp
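The xadd-based vget variant sketched in the reply above can be modeled in userspace. The encoding here (`VC_XLOCK` in the counter's high bit, the `cvnode`/`cvget` names) is purely illustrative, not NetBSD's actual v_usecount layout: gain the reference optimistically with a single fetch-add, and backpedal if the reclaim bit turns out to be set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy model: the high bit of the usecount doubles as a VXLOCK-style
 * "being cleaned out" flag, so the common case is one atomic op.
 */
#define VC_XLOCK	(1u << 31)	/* vnode is being reclaimed */
#define VC_MASK		(VC_XLOCK - 1)	/* the actual reference count */

struct cvnode {
	atomic_uint v_usecount;
};

/* Optimistically gain a reference; back out if reclamation won the race. */
static bool
cvget(struct cvnode *vp)
{
	unsigned old = atomic_fetch_add(&vp->v_usecount, 1);

	if (old & VC_XLOCK) {
		/* Backpedal: undo the add and report failure. */
		atomic_fetch_sub(&vp->v_usecount, 1);
		return false;
	}
	return true;
}
```

The reclaim side would set VC_XLOCK (after purging namecache entries so no new optimistic arrivals appear) and then wait for the count to drain; slow-path callers keep using the interlocked cmpxchg loop as today.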
Re: Adaptive RW locks
On 3/10/20, Andrew Doran wrote:
> Following up again.. It turns out the deadlocks I saw with this were down
> to a problem with kernel_lock that has since been solved.
>
> I made a couple more tweaks to it: up the max number of tracked holds to
> 8 because vmobjlock became a rwlock, and because aiodoned is gone now be
> more selective about giving up and going to sleep in the softint case
> (walk down the list of pinned LWPs on curcpu and see if any of them holds
> the lock that's wanted; if not held on curcpu then some other CPU has it,
> and it's OK to spin).

I think the proposed patch avoidably adds overhead to the most common use case to cover a corner case. Sensibly bounded spinning on readers can be done in a speculative fashion, i.e. without actively tracking whether there are threads off cpu.

FreeBSD is not completely there yet, but the general idea is pretty trivial: once a writer shows up it sets a bit (RW_SPINNER or so), then it can observe whether read owners are draining, with some pre-defined upper bound on how long it is willing to wait. There is no spinning if there are blocked to-be writers already. Thus this might waste some cycles initially, but it is bounded.

A side note is that waiting for threads to drain can use the reader count to estimate how long to wait before re-reading the lock, instead of doing it every pause.

> void
> +rw_enter(krwlock_t *rw, const krw_t op)
> +{
> [..]
> +
> +	/*
> +	 * Read the lock owner field. If the need-to-wait
> +	 * indicator is clear, then try to acquire the lock.
> +	 */
> +	owner = rw->rw_owner;
> +	if (__predict_true((owner & need_wait) == 0)) {
> +		next = rw_cas(rw, owner, (owner + incr) & mask);
> +		if (__predict_true(next == owner)) {
> +			/* Got it! */
> +			RW_LOCKED(rw, op);
> +			KPREEMPT_ENABLE(l);
> +			RW_MEMBAR_ENTER();
> +			return;
> +		}
> +	}

Both this and the unlock path lose on throughput if the lock is heavily used by readers.
The original amd64 fast path has a tight loop which immediately retries cmpxchg, while the patched variant aborts immediately. Assume the lock is only used by readers. The patched rw_enter will always conclude it can rw_cas on it. However, this can very easily fail when racing against another rw_enter or rw_exit. In that case it falls to the slow path, which starts with a re-read and then performs a tight loop.

Whether a strict tight loop is a good idea is highly dependent; I believe it is the best thing to do on amd64 by default. Other architectures may want some form of pause-equivalent. Nonetheless, the fallback to a re-read is definitely pessimal and will keep happening under load.

I don't know if it is worth the effort at this stage, but this will eventually have to morph into an xadd-based lock. As noted elsewhere, cmpxchg-based loops degrade under load to a significantly higher degree (of course everything is already pretty bad if you have a heavily bounced lock in the first place). Note that the conversion would also take care of the pause problem.

I was looking at doing it in FreeBSD but there are some constraints which pose important design questions, and to my understanding your code shares most of them (if not all). Blindly adding into the lock word means there is no space to store the entire thread pointer (since the add could partially overwrite the address), but the owner needs to be findable in some manner in order to lend it priority.

Assuming 32-bit architectures can stick to the current code, there are some options which avoid growing the lock. One variant is "compressing" the pointer -- if all threads are guaranteed to come from a pre-defined VA range, the lock can store only the relevant portion of the address, which can be pretty small. Alternatively there can be a lockless hash of threads keyed by thread id. Or this can use a variant of the scheme you implemented here for storing the lock, except done by the writer.
It can denote in the lock word that it went off cpu with the relevant turnstile lock held. Then everyone waiting can look it up based on the lock. Of course one can also "give up" and add another word for the pointer; then this mostly degrades to regular read-write lock problems.

-- 
Mateusz Guzik
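The speculative writer-spin idea described above can be sketched with a toy lock word. Everything here is hypothetical (`RW_SPINNER`, the `srwlock` encoding -- not NetBSD's rwlock layout, and single-threaded try-operations rather than real blocking): the word holds a reader count plus a writer-waiting bit; the writer announces itself, then waits a bounded number of iterations for readers to drain, so the cost of the speculation is capped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RW_SPINNER	(1u << 31)		/* a writer is waiting */
#define RW_READERS(x)	((x) & ~RW_SPINNER)	/* reader count */

struct srwlock {
	atomic_uint rw_word;
};

/* New readers are refused while a writer is spinning. */
static bool
srw_tryrdlock(struct srwlock *rw)
{
	unsigned old = atomic_load(&rw->rw_word);

	while (!(old & RW_SPINNER)) {
		/* On failure 'old' is reloaded and the bit rechecked. */
		if (atomic_compare_exchange_weak(&rw->rw_word, &old, old + 1))
			return true;
	}
	return false;
}

static void
srw_rdunlock(struct srwlock *rw)
{
	atomic_fetch_sub(&rw->rw_word, 1);
}

/* Writer: announce intent, then spin a bounded time for readers to drain. */
static bool
srw_trywrlock(struct srwlock *rw, int max_spins)
{
	atomic_fetch_or(&rw->rw_word, RW_SPINNER);
	for (int i = 0; i < max_spins; i++) {
		if (RW_READERS(atomic_load(&rw->rw_word)) == 0)
			return true;	/* bit stays set, marking write-held */
	}
	atomic_fetch_and(&rw->rw_word, ~RW_SPINNER);	/* give up: would block */
	return false;
}
```

A real lock would park on a turnstile instead of returning false, and (per the note above) could scale `max_spins` with the observed reader count rather than re-reading the word every pause.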
Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded
On 1/22/20, Andrew Doran wrote:
> On Tue, Jan 21, 2020 at 05:30:55PM -0500, Thor Lancelot Simon wrote:
>
>> I think there's probably a generic bucket-locked hashtable
>> implementation in one of the code contributions Coyote Point made long
>> ago. That said, we're talking about a screenful of code, so it's not
>> like it'd be hard to rewrite, either. Do we want such a thing? Or do
>> we want to press on towards more modern "that didn't work, try again"
>> approaches like pserialize or the COW scheme you describe?
>
> The way I see it, the original idea of the namecache was to have a thing
> that allowed you to avoid doing expensive hardware and file system stuff
> a second time, because you'd already done it once before. In this
> context MP locking also looks a lot like expensive hardware and file
> system stuff.
>
> Making the namecache's index per-directory matches the pattern in which
> the data is accessed, and hey presto, all of a sudden there are no big
> concurrency problems, any locking or whatever becomes trivial and
> follows the pattern of access closely, things like pserialize/RCU start
> to look feasible, and one can get back to the important business of
> avoiding expensive work instead of generating it! That's my thinking on
> it anyway.
>
> The rbtrees are just what we have at hand. I think a dynamically sized
> hash, like a Robin Hood hash, looks very interesting though, because we
> can afford to resize occasionally on entry/removal, and modern pipelined
> CPUs are great at that kind of thing. I'm sure other people have better
> ideas than I do here, though. I just want to try out the directory
> approach, and if it works well, get the basis in place.

But a hash which has collisions taken care of with linked lists is the easiest data structure to scale in this regard. Additions and removals in the face of concurrent lookups are easy to handle (and in fact the current code already does this to some extent).
With no safe memory reclamation in use this still provides a win, as contention is spread out across different buckets.

-- 
Mateusz Guzik
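The bucket-locked chained hash being discussed can be sketched as below. This is a minimal illustrative model (names, sizes, and the djb2-style hash are assumptions, not the namecache's actual code): each bucket carries its own lock, so lookups of names hashing to different buckets never touch the same lock or cacheline, and concurrent insert-vs-lookup only serializes within one bucket.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define NC_NBUCKETS 64

struct ncentry {
	struct ncentry *ne_next;	/* collision chain */
	char ne_name[64];
	int ne_val;
};

struct ncbucket {
	atomic_flag nb_lock;		/* per-bucket spinlock */
	struct ncentry *nb_head;
};

static struct ncbucket ncbuckets[NC_NBUCKETS];

static void
ncb_lock(struct ncbucket *b)
{
	while (atomic_flag_test_and_set(&b->nb_lock))
		;			/* spin; a real kernel would not busy-wait blindly */
}

static void
ncb_unlock(struct ncbucket *b)
{
	atomic_flag_clear(&b->nb_lock);
}

static unsigned
nc_hash(const char *name)
{
	unsigned h = 5381;

	while (*name != '\0')
		h = h * 33 + (unsigned char)*name++;
	return h % NC_NBUCKETS;
}

static void
nc_insert(const char *name, int val)
{
	struct ncbucket *b = &ncbuckets[nc_hash(name)];
	struct ncentry *ne = calloc(1, sizeof(*ne));

	assert(ne != NULL);
	strncpy(ne->ne_name, name, sizeof(ne->ne_name) - 1);
	ne->ne_val = val;
	ncb_lock(b);			/* only this bucket is held */
	ne->ne_next = b->nb_head;
	b->nb_head = ne;
	ncb_unlock(b);
}

static int
nc_lookup(const char *name, int *valp)
{
	struct ncbucket *b = &ncbuckets[nc_hash(name)];
	int error = -1;

	ncb_lock(b);
	for (struct ncentry *ne = b->nb_head; ne != NULL; ne = ne->ne_next) {
		if (strcmp(ne->ne_name, name) == 0) {
			*valp = ne->ne_val;
			error = 0;
			break;
		}
	}
	ncb_unlock(b);
	return error;
}
```

Head insertion also has the property mentioned above: a chain can be extended while concurrent lookups walk it, which is what makes this layout a natural stepping stone toward lockless readers later.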
Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded
On 1/21/20, Andrew Doran wrote:
> On Thu, Jan 16, 2020 at 04:51:44AM +0100, Mateusz Guzik wrote:
>>
>> I'm assuming the goal for the foreseeable future is to achieve path
>> lookup with shared vnode locks.
>
> Good guess. There is a prototype of LK_SHARED lookup on the ad-namecache
> branch, along with lookup using the namecache only that takes as few
> vnode locks and refcounts as possible. In the easiest case only the root
> vnode and the leaf vnode are touched (although whether the root vnode
> ever actually needs a reference is another question). There are many
> "non-easy" cases though.

As in, you can get away without ever accessing it if there are no long enough ".." chains and no absolute paths (along with symlinks)?

My solution to the problem (not implemented yet) is to introduce a copy-on-write structure which holds the ref to the rootvnode. Then lwps can just use it, at worst refing/unrefing that struct but never the vnode. This also covers the current working directory.

I have seen the branch. I think the use of rb instead of a hash is pessimal. My speculation is that the wins over the current code come from the per-cpu locks being eliminated. Specifically, since v_interlock is a lock from bufobj and is heavily used in vm, the likelihood of getting preempted while holding it increases. Say CPU0 takes its own pcpu lock and proceeds to take v_interlock. Now the thread is taken off to wait. If the new thread running on CPU0 performs *any* path lookup it will also block, instead of only blocking if it was looking up the affected vnode. iow there is huge potential to stall *all* lookups on a given CPU. This does not happen with rb trees, and would not happen with the hash table if there was bucket locking instead of per-cpu locking.

I would stick to hash tables since they are easier to scale (both with and without locks).
For instance, if 2 threads look up "/bin" and "/usr" and there is no hash collision, they lock *separate* buckets and suffer less in terms of cacheline bouncing. In comparison, with rb trees they will take the same lock. Of course this similarly does not scale if they are looking up the same path.

-- 
Mateusz Guzik
Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded
On 1/15/20, Andrew Doran wrote:
> I don't understand why we do this. I don't think it's needed at all,
> because if the file really is deleted and the inode is getting cleaned
> out, then it shouldn't be getting new refs in any of the usual ways and
> we don't need to worry about it. And in any case the vnode lock prevents
> anyone from messing with the inode during VOP_INACTIVE().
>
> I want to change this because it causes nasty problems on MP. For
> example I see threads very regularly thrown out of the namecache by
> vcache_tryvget(), and when they finally get the vnode ref'd and locked
> they do it by digging through the file system metadata (when everything
> is already in cache).
>
> http://www.netbsd.org/~ad/inactive.diff

I think looking at the bigger picture suggests a different solution, which will also keep the current blocked <-> loaded transition for consideration later.

I'm assuming the goal for the foreseeable future is to achieve path lookup with shared vnode locks. Since the ->v_usecount transition to 0 happens all the time and VOP_INACTIVE processing is almost never needed, the current requirement to hold an exclusive lock just to find out there is nothing to do should be lifted. Apart from getting away with only a shared lock (or no locks), this would also naturally eliminate vnode state changes for the common case.

Preferably vfs would provide a method for filesystems to call to signify they want VOP_INACTIVE to be called for a given vnode; then vrelel could just test a bit. However, this is probably not worth the effort right now.

Said locking was also a problem in FreeBSD. It got temporarily sorted out with the introduction of VOP_NEED_INACTIVE -- by default it returns 1, but filesystems which provide their own method easily avoid the problem. I think starting with requiring at least a shared lock would do the trick just fine while providing race-free methods at least for ufs and tmpfs.
Then vrelel could stick to:

diff --git a/sys/kern/vfs_vnode.c b/sys/kern/vfs_vnode.c
index d05d5180f730..45d80dab4a2c 100644
--- a/sys/kern/vfs_vnode.c
+++ b/sys/kern/vfs_vnode.c
@@ -774,7 +774,7 @@ vrelel(vnode_t *vp, int flags, int lktype)
 	 * If not clean, deactivate the vnode, but preserve
 	 * our reference across the call to VOP_INACTIVE().
 	 */
-	if (VSTATE_GET(vp) == VS_RECLAIMED) {
+	if (VSTATE_GET(vp) == VS_RECLAIMED || !VOP_NEED_INACTIVE(vp)) {
 		VOP_UNLOCK(vp);
 	} else {
 		VSTATE_CHANGE(vp, VS_LOADED, VS_BLOCKED);

which even with exclusive locks already avoids the blocked state problem. Another option is to move some of the current body into smaller helpers and then let filesystems piece together their own VOP_VRELE. -- Mateusz Guzik
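To make the shape of the proposal concrete, here is a small userspace model (all names invented; these are not the real NetBSD structures or vops): a filesystem-supplied predicate lets the release path skip the exclusive-lock/VOP_INACTIVE dance whenever the vnode plainly has no inactive work pending, while a default predicate preserves today's always-call behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the VOP_NEED_INACTIVE idea; fields are illustrative. */
struct model_vnode {
	int links;	/* on-disk link count; 0 => file was unlinked */
	bool dirty;	/* pending metadata updates to flush */
};

/* Default: always request VOP_INACTIVE (mirrors the current behavior). */
static bool
need_inactive_default(struct model_vnode *vp)
{
	(void)vp;
	return true;
}

/* A ufs/tmpfs-style override: inactive processing is only needed when
 * the file was deleted or there is dirty state to deal with. */
static bool
need_inactive_fs(struct model_vnode *vp)
{
	return vp->links == 0 || vp->dirty;
}

/* vrele-like path: true means the expensive route (exclusive lock plus
 * loaded -> blocked state change) would have been taken. */
static bool
vrele_needs_slow_path(struct model_vnode *vp,
    bool (*need_inactive)(struct model_vnode *))
{
	return need_inactive(vp);
}
```

With the filesystem-provided predicate, the common "last reference dropped on a live, clean file" case stays on the cheap path.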
Re: fcntl(F_GETPATH) support
On 9/15/19, Jason Thorpe wrote:
>
>> On Sep 15, 2019, at 11:03 AM, Christos Zoulas wrote:
>>
>> I think it is quite reliable because all the file descriptors would be
>> recently opened and therefore be in the cache. One would need to DoS the
>> cache to cause eviction. If that turns out to be false, we can make the
>> namecache reliable, or withdraw it.
>
> If we had a way to ask the file system for the / a name+parent in the event
> we don't find an entry in the name cache, then we don't have to make the
> name cache "reliable".

See my other e-mail about turning realpath into a syscall. Then you have everything to work with and the code should be trivial unless you want to handle hardlinks. In the most simplistic implementation you do the lookup and then do the reverse lookup of what you found. Things can still get evicted in the time window in-between, but that's probably tolerable.

> FWIW, I don't think making the name cache "reliable" is necessarily a good
> idea -- it could be abused to consume kernel memory by creating lots of
> long-named files / directories.

I don't think it's a material change in terms of memory use. You can already force this scenario by creating files and stating them. Getting rid of namecache entries is not going to save you much compared to all the vnodes which need to be dropped/recycled (which will also remove the nc entries).

> In the event that the name really can't be determined (perhaps it's an
> open-unlinked file?), then I think it's perfectly reasonable to return
> ENOENT for F_GETPATH; callers need to be capable of handling that.

A corner case of the sort is probably not accounted for either and not expected to be run into -- if you remove the files from under the compiler you are tampering with the build and it is free to fail. On the other hand name eviction is not a problem you expect to run into.
I can't easily test right now for sure, but historically BSDs have not been adding newly created files to the namecache and perhaps NetBSD is not doing it either. In which case F_GETPATH on newly created files is guaranteed to fail unless someone looked them up separately. Whether this should be modified is imho a separate discussion, but is relevant to points made earlier about memory use and DoSability. Imho it should be fine to change the kernel to always add these. -- Mateusz Guzik
Re: fcntl(F_GETPATH) support
On 9/15/19, Mateusz Guzik wrote:
> On 9/14/19, Christos Zoulas wrote:
>>
>> Comments?
>>
>> + error = vnode_to_path(kpath, MAXPATHLEN, fp->f_vnode, l, l->l_proc);
>
> What motivates this change?
>
> I think it is a little problematic in that namecache resolution is not
> guaranteed to work. If the consumer does not check for failure or fall back
> to something else the user is going to experience "once in the blue moon"
> failures.
>
> For example clang likes to get full paths to names. On Linux it can
> trivially do it thanks to the dentry cache. On other systems there is a
> compilation check for F_GETPATH. If none of this is present it goes for
> realpath, which is slower but basically works. If F_GETPATH is present,
> it is always used and there is no fallback if it fails. Then you risk
> setting yourself up for a weird compilation failure, which may not even
> repeat itself after you retry.
>
> All in all I don't think this should be added unless the namecache becomes
> reliable. The above reasoning is why I did not add it to FreeBSD.

Now that I wrote this I suspect a kernel-based realpath could do the trick here. Will have to take a closer look. -- Mateusz Guzik
Re: NCHNAMLEN vnode cache limitation removal
On 9/14/19, Mouse wrote:
>> [...], because fexecve is causing rumbles about doing significantly
>> more reverse lookups.
>
> Why is everyone so concerned about finding "the" name of an inode, or
> indeed any name for an inode? As far as I can tell there has never
> been any promise that any given inode has any names pointing to it, at
> least not unless it's got no other references to it (in which case it
> would be freed if it had no names).
>
> Given the list of smart people I've seen discussing this, I'm
> presumably just missing something, but what?

I think always-working name resolution is nicer to users when it comes to diagnosing issues in userspace. In particular if someone spawns a new program and that program performs a lot of ops on a given fd, you can check what file it is if name resolution works (or you have a facility similar to linux's /proc//fd). Otherwise you are left with dirty hacks.

There is a very different angle to this though and it has to do with performance, especially in the face of concurrent access. When doing SMP you want to avoid writes to shared areas as they induce cache misses on next access by other CPUs. If everyone accesses read-only they can keep doing it without interfering with others. If everyone writes to it, everyone else (minus 1) stalls waiting for their turn.

The vast majority of all lookups only care about the last path component. Currently the kernel leapfrogs through all of them, e.g. consider the lookup of /foo/bar/baz/qux. Along the way it refs foo, locks foo, finds bar, references bar, locks bar, unlocks and unrefs foo... repeat that with s/bar/baz/ and s/foo/bar/. Note that's very simplified (e.g. I skipped the beginning of the lookup and other work like permission checking). Crossing filesystems is another hurdle. This is only to highlight the point below.

That's a lot of work single-threaded which can be partially avoided in the common case, provided all the info is structured for it.
But most importantly that's a lot of atomic ops on shared areas which severely limit your performance in the case of concurrent access (especially on multi-socket systems). It's a long way to get to a point where you can do this in a write-free manner and FreeBSD is very early in there. If everything is cached and you can roll forward in the common case without refing/locking anything but the last component, you win big time. (Of course you can't "just" do it, the leapfrogging is there for a reason, but it can be worked out with tracking the state and having safe memory reclamation.)

TL;DR it's a minor quality-of-life improvement for users and a de facto mandatory part of VFS if it is to scale on big systems. -- Mateusz Guzik
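The cost argument above can be reduced to simple arithmetic. A toy model (purely illustrative, not VFS code) counting the reference-count writes a lookup performs: the leapfrog walk refs and unrefs every intermediate component, while a lockless walk under safe memory reclamation only references the final one.

```c
#include <assert.h>

/* Count refcount writes for a lookup of ncomponents path components,
 * e.g. /foo/bar/baz/qux has ncomponents == 4. */

/* Leapfrog: ref each intermediate component, unref it once the next
 * one is held, and ref the leaf. */
static int
leapfrog_atomics(int ncomponents)
{
	return 2 * (ncomponents - 1) + 1;
}

/* Lockless forward roll: only the last component is referenced. */
static int
lockless_atomics(int ncomponents)
{
	(void)ncomponents;
	return 1;
}
```

For /foo/bar/baz/qux that is 7 contended atomic ops versus 1, and the gap grows linearly with path depth -- which is the "win big time" above.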
Re: fcntl(F_GETPATH) support
On 9/14/19, Christos Zoulas wrote:
>
> Comments?
>
> + error = vnode_to_path(kpath, MAXPATHLEN, fp->f_vnode, l, l->l_proc);

What motivates this change?

I think it is a little problematic in that namecache resolution is not guaranteed to work. If the consumer does not check for failure or fall back to something else the user is going to experience "once in the blue moon" failures.

For example clang likes to get full paths to names. On Linux it can trivially do it thanks to the dentry cache. On other systems there is a compilation check for F_GETPATH. If none of this is present it goes for realpath, which is slower but basically works. If F_GETPATH is present, it is always used and there is no fallback if it fails. Then you risk setting yourself up for a weird compilation failure, which may not even repeat itself after you retry.

All in all I don't think this should be added unless the namecache becomes reliable. The above reasoning is why I did not add it to FreeBSD. -- Mateusz Guzik
Re: mutex vs turnstile
On Thu, Jan 18, 2018 at 10:38:02AM +, Nick Hudson wrote:
> On 01/09/18 03:30, Mateusz Guzik wrote:
> > Some time ago I wrote about performance problems when doing high -j
> > build.sh and made a few remarks about mutex implementation.
> >
> > TL;DR for that one was mostly that cas returns the found value, so
> > explicit re-reads can be avoided. There are also avoidable explicit
> > barriers (yes, I know about the old Opteron errata).
> >
> > I had another look and have a remark definitely worth looking at (and
> > easy to fix). Unfortunately I lost my test jig so can't benchmark.
> >
> > That said, here it is:
> >
> > After it is concluded that lock owner sleeps:
>
> itym... after it is concluded that the thread should sleep as the lock is
> owned (by another lwp) and the owner is not on cpu.

I don't think this distinction makes a difference here.

> > 	ts = turnstile_lookup(mtx);
> > 	/*
> > 	 * Once we have the turnstile chain interlock, mark the
> > 	 * mutex has having waiters. If that fails, spin again:
> > 	 * chances are that the mutex has been released.
> > 	 */
> > 	if (!MUTEX_SET_WAITERS(mtx, owner)) {
> > 		turnstile_exit(mtx);
> > 		owner = mtx->mtx_owner;
> > 		continue;
> > 	}
> >
> > Note that the lock apart from being free, can be:
> > 1. owned by the same owner, which is now running
> >
> > In this case the bit is set spuriously and triggers slow path
> > unlock.
>
> Recursive locks aren't allowed so I don't follow what you mean here.
>
> 544 if (__predict_false(MUTEX_OWNER(owner) == curthread)) {
> 545 MUTEX_ABORT(mtx, "locking against myself");
> 546
>
> > 2. owned by a different owner, which is NOT running
>
> Is this still a possibility given the above?

This is not about recursive locking. I explicitly noted 2 scenarios which are handled suboptimally:

The lock owner is remembered across turnstile_lookup. MUTEX_SET_WAITERS does:

	rv = MUTEX_CAS(&mtx->mtx_owner, owner, owner | MUTEX_BIT_WAITERS);

If the lock owner is the same and sleeps, it's fine.
If the lock owner is the same but does not sleep, the bit is set for no reason. If there is a different owner (or the lock is free), cas fails and turnstile_exit is called. If the lock is free or said different owner does not sleep, it's fine. If said different owner sleeps, turnstile_exit is called for no good reason and the code circles back. -- Mateusz Guzik
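The two suboptimal cases can be modeled in a few lines of userspace C11 (names invented; the real MUTEX_CAS operates on the kernel's lock word). The CAS installs the waiters bit only if the remembered owner is still in place, so a changed owner makes it fail and forces the extra turnstile round trip, while an unchanged-but-running owner gets the bit set spuriously.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define WAITERS_BIT ((uintptr_t)1)	/* stands in for MUTEX_BIT_WAITERS */

/* Model of MUTEX_SET_WAITERS: CAS the owner value we saw earlier into
 * owner|WAITERS_BIT. Fails if ownership changed in the meantime. */
static bool
set_waiters(_Atomic uintptr_t *lock, uintptr_t seen_owner)
{
	uintptr_t expected = seen_owner;

	return atomic_compare_exchange_strong(lock, &expected,
	    seen_owner | WAITERS_BIT);
}

/* Owner unchanged: the CAS succeeds and the bit lands -- spuriously so
 * if that owner is actually running. */
static bool
demo_same_owner(void)
{
	_Atomic uintptr_t lk = 0x1000;

	return set_waiters(&lk, 0x1000) &&
	    atomic_load(&lk) == (0x1000 | WAITERS_BIT);
}

/* Ownership changed since we looked: the CAS fails, costing a
 * turnstile_exit and another trip around the loop. */
static bool
demo_new_owner(void)
{
	_Atomic uintptr_t lk = 0x2000;

	return set_waiters(&lk, 0x1000);
}
```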
Re: Race condition between an LWP migration and curlwp_bind
On Tue, Feb 13, 2018 at 05:14:38PM +0900, Ryota Ozaki wrote:
> panic: kernel diagnostic assertion "(psref->psref_cpu == curcpu())"
> failed: file "/(hidden)/sys/kern/subr_psref.c", line 317 passive
> reference transferred from CPU 0 to CPU 3
>
> I first thought that something went wrong in an ioctl handler
> for example curlwp_bindx was called doubly and LP_BOUND was unset
> then the LWP was migrated to another CPU. However, this kind of
> assumptions was denied by KASSERTs in psref_release. So I doubted
> of LP_BOUND and found there is a race condition on LWP migrations.
>
> curlwp_bind sets LP_BOUND to l_pflags of curlwp and that prevents
> curlwp from migrating to another CPU until curlwp_bindx is called.

The bug you found (and I trimmed) looks like the culprit, but there is an extra problem which probably happens to not manifest itself in terms of code generation: the bind/unbind inlines lack compiler barriers. See the KPREEMPT_* inlines for comparison. The diff is definitely trivial:

diff --git a/sys/sys/lwp.h b/sys/sys/lwp.h
index 47d162271f9c..f18b76b984e4 100644
--- a/sys/sys/lwp.h
+++ b/sys/sys/lwp.h
@@ -536,6 +536,7 @@ curlwp_bind(void)
 
 	bound = curlwp->l_pflag & LP_BOUND;
 	curlwp->l_pflag |= LP_BOUND;
+	__insn_barrier();
 
 	return bound;
 }
@@ -545,6 +546,7 @@ curlwp_bindx(int bound)
 {
 
 	KASSERT(curlwp->l_pflag & LP_BOUND);
+	__insn_barrier();
 	curlwp->l_pflag ^= bound ^ LP_BOUND;
 }

-- Mateusz Guzik
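For readers unfamiliar with __insn_barrier(): it is a compiler-only fence (no CPU instruction is emitted) that stops the compiler from reordering memory accesses across it. A portable userspace model of the pattern above, with invented names (bind_flag stands in for the LP_BOUND bit in l_pflag):

```c
#include <assert.h>

/* Compiler-only fence, equivalent in spirit to NetBSD's __insn_barrier(). */
#define insn_barrier()	__asm__ __volatile__("" ::: "memory")

static int bind_flag;	/* stands in for curlwp->l_pflag & LP_BOUND */

/* Model of curlwp_bind(): set the flag, then fence so the compiler
 * cannot sink the flag store below code that relies on being bound. */
static int
model_bind(void)
{
	int bound = bind_flag;

	bind_flag = 1;
	insn_barrier();
	return bound;
}

/* Model of curlwp_bindx(): fence first so the compiler cannot hoist
 * the flag restore above code that still relies on being bound. */
static void
model_bindx(int bound)
{
	insn_barrier();
	bind_flag ^= bound ^ 1;	/* mirrors l_pflag ^= bound ^ LP_BOUND */
}
```

The XOR restores the flag to whatever model_bind() returned, so nested bind/bindx pairs compose correctly.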
mutex vs turnstile
Some time ago I wrote about performance problems when doing high -j build.sh and made a few remarks about mutex implementation.

TL;DR for that one was mostly that cas returns the found value, so explicit re-reads can be avoided. There are also avoidable explicit barriers (yes, I know about the old Opteron errata).

I had another look and have a remark definitely worth looking at (and easy to fix). Unfortunately I lost my test jig so can't benchmark.

That said, here it is:

After it is concluded that lock owner sleeps:

	ts = turnstile_lookup(mtx);
	/*
	 * Once we have the turnstile chain interlock, mark the
	 * mutex has having waiters. If that fails, spin again:
	 * chances are that the mutex has been released.
	 */
	if (!MUTEX_SET_WAITERS(mtx, owner)) {
		turnstile_exit(mtx);
		owner = mtx->mtx_owner;
		continue;
	}

Note that the lock, apart from being free, can be:

1. owned by the same owner, which is now running

In this case the bit is set spuriously and triggers slow path unlock.

2. owned by a different owner, which is NOT running

Assuming the owner still has the lock and is not running after turnstile_exit, this causes an extra trip.

That said, the proposed change is trivial and follows what FreeBSD is doing: re-read and abort the sleep unless the lock is owned by an off-cpu thread. Note there is a minor optimisation of not setting the bit if already seen. That said, the minimal fix would look like this (untested):

	ts = turnstile_lookup(mtx);
	owner = mtx->mtx_owner;
	if (mutex_oncpu(owner) || !MUTEX_SET_WAITERS(mtx, owner)) {
		turnstile_exit(mtx);
		owner = mtx->mtx_owner;
		continue;
	}

If the value from cas was to be preserved it would definitely make sense to re-test a failed MUTEX_SET_WAITERS while still holding the turnstile lock. It looks like rw locks can use a similar treatment.
A fun fact is that the implementation on Illumos behaves worse in this regard: it sets the waiters bit regardless of *who* owns the lock (or whether they are running), but then only goes to sleep if the *original* owner has the lock. -- Mateusz Guzik
Re: performance issues during build.sh -j 40 kernel
On Sat, Sep 09, 2017 at 08:48:19PM +0200, Mateusz Guzik wrote:
>
> Here is a bunch of "./build.sh -j 40 kernel=MYCONF > /dev/null" on stock
> kernel:
> 618.65s user 1097.80s system 2502% cpu 1:08.60 total
[..]
>
> And on kernel with total hacks:
> 594.08s user 693.11s system 2459% cpu 52.331 total
[..]
>
> ==
>
> Here is a flamegraph from a fully patched kernel:
> https://people.freebsd.org/~mjg/netbsd/build-kernel-j40.svg
>
> And here are top mutex spinners:
> 59.42  1560022  184255.00  e40138351180
> 57.52  1538978  178356.84  e40138351180  uvm_fault_internal+7e0
>  1.23     8884    3819.43  e40138351180  uvm_unmap_remove+101
>  0.67    12159    2078.61  e40138351180  cache_lookup+97
>
> (see https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-lockstat.txt )

So I added PoC batching to uvm_fault_lower_lookup and uvm_anon_dispose. While real time barely moved and %sys is somewhat floating around 630, I'm happy to inform that wait time on global locks dropped significantly:

46.03  1162651  85410.88  e40127167040
43.80  1146153  81273.38  e40127167040  uvm_fault_internal+7c0
 1.52     7112   2827.06  e40127167040  uvm_unmap_remove+101
 0.71     9385   1310.42  e40127167040  cache_lookup+a5
 0.00        1      0.01  e40127167040  vfs_vnode_iterator_next1+87

https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-hacks2.svg
https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-hacks2-lockstat.txt

You can see on the flamegraph that the entire time spent in the page fault handler dropped and the non-user time shifted to syscall handling. Specifically, genfs_lock is now a more significant player, accounting for about 8.7% total time (6.6% previously).

Batching can be enabled with:
sysctl -w use_anon_dispose_pagelocked=1
sysctl -w uvm_fault_batch_requeue=1

Mix of total hackery is here: https://people.freebsd.org/~mjg/netbsd/hacks2.diff

I'm quite certain there are other trivial wins in the handler. I also noted that the mutex_spin_retry routine never lowered the spl.
I added total crap support for changing it, but did not measure any difference.

There is also a currently wrong hack for the namecache: instead of taking the interlock, first check if the usecount is 0 and if not, try to bump it by 1. This races with possible transitions to VS_BLOCKED.

I think the general idea will work fine if the prohibited state gets embedded into the top bits of v_usecount. Regular bumps will be unaffected, while a cmpxchg like here will automagically fail. There is only a problem of code reading the count "by hand", which would have to be updated to mask the bit. The reason for the hack is that the interlock is in fact the vm obj lock and it adds a tad bit of contention.

Probably with a few more fixes of the sort, cranking up backoff will be beneficial. -- Mateusz Guzik Swearing Maintenance Engineer
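The "embed the prohibited state into the top bits of the count" idea can be sketched in userspace C11 (the bit name and layout are invented for illustration): regular holders keep using plain increments, while the cmpxchg-based try-get automatically fails the moment the flag bit appears in the word, with no interlock taken on the fast path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VC_BLOCKED	(1u << 31)	/* vnode is changing state; name invented */
#define VC_MASK		(VC_BLOCKED - 1)	/* mask for reading the count "by hand" */

/* Try to take a reference: fails if the count is 0 (racing with the
 * final release) or if the blocked bit is set. */
static bool
tryvget(_Atomic unsigned *usecount)
{
	unsigned cur = atomic_load(usecount);

	for (;;) {
		if (cur & VC_BLOCKED)
			return false;
		if (cur == 0)
			return false;
		/* On failure cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(usecount, &cur, cur + 1))
			return true;
	}
}

static bool
demo_live(void)
{
	_Atomic unsigned uc = 2;

	return tryvget(&uc) && (atomic_load(&uc) & VC_MASK) == 3;
}

static bool
demo_blocked(void)
{
	_Atomic unsigned uc = 2 | VC_BLOCKED;

	return tryvget(&uc);
}

static bool
demo_zero(void)
{
	_Atomic unsigned uc = 0;

	return tryvget(&uc);
}
```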
Re: performance issues during build.sh -j 40 kernel
On Sun, Sep 10, 2017 at 06:51:31PM +0100, Mindaugas Rasiukevicius wrote: > Mateusz Guzik <mjgu...@gmail.com> wrote: > > 1. exclusive vnode locking (genfs_lock) > > > > ... > > > > 2. uvm_fault_internal > > > > ... > > > > 4. vm locks in general > > > > We know these points of lock contention, but they are not really that > trivial to fix. Breaking down the UVM pagequeue locks would generally > be a major project, as it would be the first step towards NUMA support. > In any case, patches are welcome. :) > Breaking locks is of course the preferred long term solution, but also time consuming. On the other hand there are most likely reasonably easy fixes consisting of collapsing lock/unlock cycles into just one lock/unlock etc. FreeBSD is no saint here either with one global lock for free pages, yet it manages to work OK-ish with 80 hardware threads and is quite nice with 40. That said, I had enough problems $elsewhere to not be interested in looking too hard here. :> > > 3. pmap > > > > It seems most issues stem from slow pmap handling. Chances are there are > > perfectly avoidable shootdowns and in fact cases where there is no need > > to alter KVA in the first place. > > At least x86 pmap already performs batching and has quite efficient > synchronisation logic. You are right that there are some key places > where avoiding KVA map/unmap would have a major performance improvement, > e.g. UBC and mbuf zero-copy mechanisms (it could operate on physical > pages for I/O). However, these changes are not really related to pmap. > Some subsystems just need an alternative to temporary KVA mappings. > I was predominantly looking at teardown of ubc mappings. The flamegraph suggests overly high cost there. > > > > I would like to add a remark about locking primitives. > > > > Today the rage is with MCS locks, which are fine but not trivial to > > integrate with sleepable locks like your mutexes. Even so, the current > > implementation is significantly slower than it has to be. 
> >
> > ...
> >
> > Spinning mutexes should probably be handled by a different routine.
> >
> > ...
>
> I disagree, because this is a wrong approach to the problem. Instead of
> marginally optimising the slow-path (and the more contended is the lock,
> the less impact these micro-optimisations have), the subsystems should be
> refactored to eliminate the lock contention in the first place. Yes, it
> is much more work, but it is the long term fix. Having said that, I can
> see some use cases where MCS locks could be useful, but it is really a low
> priority in the big picture.

Locks are fundamentally about damage control. As noted earlier, a spurious bus transaction due to an avoidable read makes performance unnecessarily worse. That was minor anyway; the more important bit was the backoff. Even on systems modest by today's standards, the quality of locking primitives can be the difference between a system which is slower than ideal but perfectly usable and one which is just dog slow.

That said, making backoff parameters autoscale with the cpu count, with some kind of upper cap, is definitely warranted. -- Mateusz Guzik Swearing Maintenance Engineer
Re: performance issues during build.sh -j 40 kernel
On Sun, Sep 10, 2017 at 07:29:11PM +0200, Maxime Villard wrote:
> On 09/09/2017 at 20:48, Mateusz Guzik wrote:
> > [...]
> > I installed the 7.1 release, downloaded a recent git snapshot and built the
> > trunk kernel while using a config stolen from the release (had to edit out
> > something about 3g modems to make it compile). I presume this is enough
> > to not have debug of any sort enabled.
>
> Not sure I understand; did you test a kernel from the netbsd-7.1 branch, or
> from netbsd-current? You might want to test netbsd-current, I know that several
> performance-related improvements were made.

I noted it's a current kernel. The 7.1 release bits were there to ensure I don't run into userspace/kernel debug.

> > 3. pmap
> >
> > It seems most issues stem from slow pmap handling. Chances are there are
> > perfectly avoidable shootdowns and in fact cases where there is no need
> > to alter KVA in the first place.
>
> This seems rather surprising to me. I tried to reduce the number of shootdowns
> some time ago, but they were already optimized, and my attempts just made them
> slower to process. The only related thing I fixed was making sure there is no
> kernel page that gets flushed under a local shootdown, but as far as I
> remember, it didn't significantly improve performance (on somewhat old
> hardware, I must admit).

Note this was tested on kvm, where shootdowns are more expensive than on bare metal, so the result is probably worsened compared to bare metal (still, kvm is a perfectly fine production vm deployment, so I don't feel bad for testing on it).

I did not investigate in detail (I'll have to), but I believe dragonflybsd went to great lengths to reduce/eliminate IPIs in general. Most definitely worth looking at. -- Mateusz Guzik Swearing Maintenance Engineer
performance issues during build.sh -j 40 kernel
fe40138351180 cache_lookup+97

(see https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-lockstat.txt )

Note that netbsd`0x802249ba is x86_pause. Since the function does not push a frame pointer it is shown next to the actual caller, as opposed to above it. Sometimes called functions get misplaced anyway, I don't know why.

1. exclusive vnode locking (genfs_lock)

It is used even for path lookup, which as can be seen leads to avoidable contention. From what I'm told the primary reason is ufs constructing some state just in case it has to create an inode at the end of the lookup. However, since most lookups are not intended to create anything, this behavior can be made conditional. I don't know the details, but ufs on FreeBSD most certainly uses shared locking for common case lookups.

2. uvm_fault_internal

It's shown as the main waiter for a vm obj lock. The flamegraph hints the real problem is with uvm_pageqlock & friends taken elsewhere: most page fault handlers serialize on the vm obj lock, while the holder waits for uvm_pageqlock.

3. pmap

It seems most issues stem from slow pmap handling. Chances are there are perfectly avoidable shootdowns and in fact cases where there is no need to alter KVA in the first place.

4. vm locks in general

Most likely there are trivial cases where operations can be batched, especially so on process exit where there are multiple pages to operate on.

==

I would like to add a remark about locking primitives.

Today the rage is with MCS locks, which are fine but not trivial to integrate with sleepable locks like your mutexes. Even so, the current implementation is significantly slower than it has to be.

First, the lock word is read twice on entry to mutex_vector_enter - once to determine the lock type and then to read the owner. Spinning mutexes should probably be handled by a different routine.

lock cmpxchg already returns the found value (the owner). It can be passed by the assembly routine to the slow path.
This allows making an initial pass at backoff without accessing the lock in the meantime. In the face of contention the cacheline could have changed ownership by the time you get to the read, thus using the value we already saw avoids spurious bus transactions. Given the low initial spin count it should not have negative effects.

The backoff parameters were hardcoded last decade and are really off even when looking at today's modest servers. For kicks I changed the max spin count to 1024 and in a trivial microbenchmark of doing dup2 + close in 40 threads I got almost double the throughput. Interestingly this change caused a regression for the kernel build. I did not investigate; I suspect the cause was that the vm obj lock holder was now less aggressive in trying to grab the lock and that caused problems for everyone else waiting on the vm obj lock.

The spin loop itself is weird in the sense that instead of just having the pause instruction embedded it calls a function. This is probably less power/SMT friendly than it needs to be.

Cheers, -- Mateusz Guzik
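The "autoscale with an upper cap" idea from the thread amounts to two small functions. A sketch with invented constants (the 32-per-cpu scale and the 1024 ceiling are illustrative, not tuned values): the cap grows with the machine so big boxes back off harder, but is bounded so waiters never sleep-spin absurdly long, and each contended attempt doubles the current delay up to that cap.

```c
#include <assert.h>

/* Upper bound for the spin/backoff count, scaled by CPU count. */
static unsigned
backoff_cap(unsigned ncpu)
{
	unsigned cap = ncpu * 32;	/* scale with the machine... */

	return cap > 1024 ? 1024 : cap;	/* ...but bound the worst case */
}

/* Exponential growth of the per-attempt delay, clamped at the cap. */
static unsigned
backoff_next(unsigned cur, unsigned cap)
{
	unsigned next = cur * 2;

	return next > cap ? cap : next;
}
```

A spin loop would start at some small count, call backoff_next() after each failed acquisition attempt, and execute that many pause instructions before re-reading the lock word.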
unsafe ->p_cwdi access in mount_checkdirs
In mount_checkdirs you can find a loop:

	mutex_enter(proc_lock);
	PROCLIST_FOREACH(p, &allproc) {
		if ((cwdi = p->p_cwdi) == NULL)
			continue;
		if (cwdi->cwdi_cdir != olddp && cwdi->cwdi_rdir != olddp)
			continue;
		retry = true;
		rele1 = NULL;
		rele2 = NULL;
		atomic_inc_uint(&cwdi->cwdi_refcnt);
		mutex_exit(proc_lock);
		rw_enter(&cwdi->cwdi_lock, RW_WRITER);

The problem is that p_cwdi IS NOT protected by proc_lock. The exit path uses reflock to destroy the object.

Since this function is not performance critical, my suggested fix is to unconditionally take the p_lock and ref cwdi only if the process is alive and constructed.

Alternatively, cwdi freeing can be split into 2 parts where the actual freeing of the cwdi object itself is postponed until after a proc_lock lock/unlock dance. Pseudo-code:

struct cwdinfo *
cwddestroy(struct cwdinfo *cwdi)
{

	if (atomic_dec_uint_nv(&cwdi->cwdi_refcnt) > 0)
		return NULL;

	vrele(cwdi->cwdi_cdir);
	if (cwdi->cwdi_rdir)
		vrele(cwdi->cwdi_rdir);
	if (cwdi->cwdi_edir)
		vrele(cwdi->cwdi_edir);
	return cwdi;
}

void
cwdfree(struct cwdinfo *cwdi)
{

	pool_cache_put(cwdi_cache, cwdi);
}

Then the exit path would:

	cwdi = cwddestroy(p->p_cwdi);
	p->p_cwdi = NULL;
	mutex_enter(proc_lock);
	mutex_exit(proc_lock);
	if (cwdi != NULL)
		cwdfree(cwdi);

mount_checkdirs would then atomic_inc_not_zero (or however you have this named). With the above, if cwdi was spotted, it is guaranteed to not get freed until after proc_lock is dropped. A successful non-zero -> non-zero + 1 refcount bump guarantees it won't get freed and the content will remain valid. -- Mateusz Guzik
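The atomic_inc_not_zero mentioned at the end is a standard pattern; a userspace C11 sketch (the function name is borrowed from the mail, not from the NetBSD API): bump the refcount only if it is still non-zero, so a racing observer can never resurrect an object whose last reference is already gone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if at least one still exists; the non-zero ->
 * non-zero + 1 transition is what makes the access safe. */
static bool
atomic_inc_not_zero(_Atomic unsigned *cnt)
{
	unsigned cur = atomic_load(cnt);

	while (cur != 0) {
		/* On failure cur is reloaded; retry unless it dropped to 0. */
		if (atomic_compare_exchange_weak(cnt, &cur, cur + 1))
			return true;
	}
	return false;
}

static bool
demo_alive(void)
{
	_Atomic unsigned refcnt = 1;

	return atomic_inc_not_zero(&refcnt) && atomic_load(&refcnt) == 2;
}

static bool
demo_dead(void)
{
	_Atomic unsigned refcnt = 0;	/* last reference already dropped */

	return atomic_inc_not_zero(&refcnt);
}
```

In the mount_checkdirs scenario, a false return means the exit path already ran cwddestroy and the loop should just skip that process.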
fork1 use-after-free of the child process
The fork1 routine can wait for the child to exit (if vforked) and/or return the pointer to the child. Neither case guarantees the safety of said operation.

The key is that the parent can be ignoring SIGCHLD, which results in autoreaping the child, and the child itself is made runnable. Thus in particular it can exit and get its struct proc freed. No prevention measures are taken AFAICS. I did not test it on the NetBSD kernel, but it's a standard problem and the code definitely looks buggy.

1. returning the pointer

The last parameter to fork1 is struct proc **rnewprocp. I found only 2 uses:

- linux_sys_clone - the parameter is populated from a local variable which is not used afterwards, i.e. the func can as well start passing NULL instead
- creation of initproc - since this becomes the only remaining use of the parameter I think it makes sense to get rid of it altogether. The simplest hack which comes to mind would simply proc_find_raw the process after fork1 returns.

2. vfork

By the end there is:

	while (p2->p_lflag & PL_PPWAIT)
		cv_wait(&p2->p_waitcv, proc_lock);

Retesting after wake up triggers the bug.

The cleanest solution I'm aware of introduces a dedicated condvar for use by vfork. The parent uses a local variable for this purpose. Since the condvar is only used by vfork to signal exec/exit, there is no need to access the child after waking up. The downside is extension of struct proc with a pointer. Interestingly this is what they do in Linux.

FreeBSD does the obvious thing of grabbing a reference, preventing the destruction of struct proc from going through.

Solaris surprisingly has a total hack: they look up the pid and see if the found process (if any) has to be waited on. Apart from general ugliness it's also a minor scalability problem due to extra lock/unlock of the global process list lock. -- Mateusz Guzik
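The "dedicated condvar on the parent's stack" scheme can be modeled with pthreads (all names invented; the thread stands in for the vforked child). The child receives a pointer to the waiter at creation time and signals exec/exit through it, so the parent's wait loop never dereferences the child's struct proc after waking up -- removing the use-after-free window entirely.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Lives on the parent's stack for the duration of the vfork. */
struct vfork_waiter {
	pthread_mutex_t	mtx;
	pthread_cond_t	cv;
	bool		done;	/* child has execed or exited */
};

/* Stands in for the vforked child: do work, then signal the parent. */
static void *
child_main(void *arg)
{
	struct vfork_waiter *w = arg;

	pthread_mutex_lock(&w->mtx);
	w->done = true;
	pthread_cond_signal(&w->cv);
	pthread_mutex_unlock(&w->mtx);
	return NULL;
}

static bool
parent_vfork_wait(void)
{
	struct vfork_waiter w = {
		PTHREAD_MUTEX_INITIALIZER,
		PTHREAD_COND_INITIALIZER,
		false
	};
	pthread_t child;

	if (pthread_create(&child, NULL, child_main, &w) != 0)
		return false;
	/* Wait on our own condvar; no pointer to the child's state needed. */
	pthread_mutex_lock(&w.mtx);
	while (!w.done)
		pthread_cond_wait(&w.cv, &w.mtx);
	pthread_mutex_unlock(&w.mtx);
	pthread_join(child, NULL);
	return true;
}
```

The cost in the kernel version is the extra pointer in struct proc so the child can find the waiter, as noted above.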
Re: In-kernel process exit hooks?
> 			return (filemon);
> 		}
> 	}

But p_pid should be constant for 'struct proc' lifetime, so locking this early is wasteful.

> 	lp = p;
> 	p = p->p_pptr;
> 	rw_exit(&lp->p_reflock);

I guess the lock is needed to read the pointer to the parent. Nothing was done to keep the parent around, which means it could have exited and been freed by now. Which in turn makes the second loop iteration a use-after-free.

The first iteration is safe if the argument is curproc (which it is), as it clearly cannot disappear.

Turns out this traversal is even more wrong, since p_pptr does not have to be the process you are looking for - ptrace is reparenting tracees.

> 		}
> 	}
> 	return (NULL);
> }

-- Mateusz Guzik
Re: In-kernel process exit hooks?
Re: In-kernel process exit hooks?
On Sat, Jan 09, 2016 at 02:25:02PM +0800, Paul Goyette wrote:
> On Sat, 9 Jan 2016, Mateusz Guzik wrote:
>
> > While I don't know all the details, it is clear that the purpose would
> > be much better served by ktrace and I would argue efforts should be
> > spent there.
>
> filemon's purpose is somewhat different than that of ktrace.  There
> is a limited set of events that filemon captures, and it only
> captures successful operations.  Its output is quite simple and easy
> to parse by a human or shell script.
>

ktrace already has limited support for specifying what events are
wanted.  One can add a special filemon-like mode.

The output of kdump is definitely less friendly for scripts, but it is
way easier to write a script reformatting it than it is to fix filemon.

> > From a usability perspective what seems to be needed is a few hacks to
> > ktrace to only log the stuff make is interested in, and parsing code
> > for bmake to translate that into what looks like the current filemon
> > format for the aforementioned copying.
>
> Perhaps.  But filemon is also a run-time loadable module (and it is
> preferred that it NOT be built-in to any kernels).  If I remember
> correctly, ktrace has hooks all over the system and does not readily
> lend itself to modularization.
>

I doubt that compiling ktrace into a kernel which can handle filemon is
an issue, and even then the benefit should outweigh the hoops needed to
keep filemon working.

For correct operation of even a minimalistic, safe filemon module the
kernel would already have to grow hooks in a few places (see below).

> > Filemon monitors stuff at syscall level, stores the fd and remembers
> > pids.
>
> The fd is no longer being stored... :)  But yes, it does still
> remember pids, so it can determine if an event belongs to a monitored
> process tree.  Again IIRC, ktrace is somewhat more intrusive, in that
> it manipulates various process flags which can affect behavior of the
> target (monitored) process.
It has a dedicated field in struct proc, which is the only sensible way
of implementing this I can think of.

> >> 	error = kauth_authorize_process(curproc->p_cred,
> >> 	    KAUTH_PROCESS_CANSEE, tp,
> >> 	    KAUTH_ARG(KAUTH_REQ_PROCESS_CANSEE_ENTRY), NULL, NULL);
> >
> > Now that the check has passed, the process could have executed
> > something setuid.
>
> Which process could have executed something setuid?  How would that
> matter?  (I will confess I haven't checked, but I would be surprised
> if "executing something setuid" would change the result of the kauth()
> check.  But I could be wrong here.)
>

You start monitoring a given process and it proceeds to execve a setuid
file; tracing then reveals what that file is doing, even though a fresh
kauth_authorize_process check would fail.  This is basically what you
yourself stated below about pid reuse.

So this either needs to prevent processes from changing credentials
while traced, or stop tracing when such a change occurs.  The former
seems like a rather bad idea and the latter is already done by ktrace.

> > I'm somewhat surprised this code is here.  Should not
> > kauth_authorize_process assert that tp has stable credentials in some
> > way, in effect just crashing filemon consumers?
> >
> >> 	if (!error) {
> >> 		filemon->fm_pid = tp->p_pid;
> >> 	}
>
> Why does the target process need stable credentials?  It's not doing
> anything for filemon.
>

See below.

> > Pid is stored, but tp is not held.  What if the 'traced' process
> > exits?  grep only reveals the following bit in
> > filemon_wrapper_sys_exit:
> >
> >> 	if (filemon->fm_pid == curproc->p_pid)
> >> 		filemon_printf(filemon, "# Bye bye\n");
> >
> > So filemon will continue tracing anything which happened to reuse the
> > pid.
>
> Yes, this is an outstanding issue.  More particularly, the new user
> of the pid might not be accessible by the monitoring process (ie, it
> would fail the earlier kauth() check).
> But holding the pointer to
> the proc isn't really the answer either - although less likely, it is
> entirely possible that the address could have been reused.
>

What you need is the ability to clear tracing bits on process exit.
ktrace definitely has the relevant bits in place already.

> (It's too bad we don't have "generation numbers" on the pid, which
> would easily detect reuse.)

That would require struct proc to be type stable, which seems wasteful.

> >> 	lp = p;
> >> 	p = p->p_pptr;
> >> 	rw_exit(&lp->p_reflock);
> >
> > I guess the lock is needed