> Date: Thu, 17 Nov 2022 20:23:37 +0100
> From: Mark Kettenis <mark.kette...@xs4all.nl>
> 
> > From: Jeremie Courreges-Anglas <j...@wxcvbn.org>
> > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > 
> > On Tue, Nov 15 2022, Martin Pieuchot <m...@openbsd.org> wrote:
> > > UVM vnode objects include a reference count to keep track of the number
> > > of processes that have the corresponding pages mapped in their VM space.
> > >
> > > When the last process referencing a given library or executable dies,
> > > the reaper will munmap this object on its behalf.  When this happens it
> > > doesn't free the associated pages, in order to speed up possible re-use
> > > of the file.  Instead the pages are placed on the inactive list but stay
> > > ready to be pmap_enter()'d without requiring I/O as soon as a new
> > > process needs to access them.
> > >
> > > The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
> > > doesn't work well with swapping [0].  For some reason, when the page
> > > daemon wants to free pages on the inactive list, it tries to flush the
> > > pages to disk and calls panic(9) because it needs a valid reference to
> > > the vnode to do so.
> > >
> > > This indicates that the mechanism described above, which seems to work
> > > fine for RO mappings, is currently buggy in more complex situations.
> > > Flushing the pages when the last reference to the UVM object is dropped
> > > also doesn't seem to be enough, as bluhm@ reported [1].
> > >
> > > The diff below, which has already been committed and reverted, gets rid
> > > of the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
> > > the arm64 caching bug has been found and fixed.
> > >
> > > Getting rid of this logic means more I/O will be generated and pages
> > > might have a faster reuse cycle.  I'm aware this might introduce a small
> > > slowdown,
> > 
> > Numbers for my usual make -j4 in libc,
> > on an Unmatched riscv64 box, now:
> > 16m32.65s real    21m36.79s user    30m53.45s system
> > 16m32.37s real    21m33.40s user    31m17.98s system
> > 16m32.63s real    21m35.74s user    31m12.01s system
> > 16m32.13s real    21m36.12s user    31m06.92s system
> > After:
> > 19m14.15s real    21m09.39s user    36m51.33s system
> > 19m19.11s real    21m02.61s user    36m58.46s system
> > 19m21.77s real    21m09.23s user    37m03.85s system
> > 19m09.39s real    21m08.96s user    36m36.00s system
> > 
> > 4-core amd64 VM, before (-current plus another diff):
> > 1m54.31s real     2m47.36s user     4m24.70s system
> > 1m52.64s real     2m45.68s user     4m23.46s system
> > 1m53.47s real     2m43.59s user     4m27.60s system
> > After:
> > 2m34.12s real     2m51.15s user     6m20.91s system
> > 2m34.30s real     2m48.48s user     6m23.34s system
> > 2m37.07s real     2m49.60s user     6m31.53s system
> > 
> > > however I believe we should work towards loading files from the
> > > buffer cache to save I/O cycles instead of having another layer of cache.
> > > Such work isn't trivial and making sure the vnode <-> UVM relation is
> > > simple and well understood is the first step in this direction.
> > >
> > > I'd appreciate it if the diff below could be tested on many
> > > architectures, including the offending rpi4.
> > 
> > Mike has already tested a make build on a riscv64 Unmatched.  I have
> > also run regress in sys, lib/libc and lib/libpthread on that arch.  As
> > far as I can see this looks stable on my machine, but what I really care
> > about is the riscv64 bulk build cluster (I'm going to start another
> > bulk build soon).
> > 
> > > Comments?  Oks?
> > 
> > The performance drop in my microbenchmark kinda worries me but it's only
> > a microbenchmark...
> 
> I wouldn't call this a microbenchmark.  I fear this is typical for
> builds of anything on clang architectures.  And I expect it to be
> worse on a single-processor machine where *every* time we execute clang
> or lld all the pages are thrown away upon exit and will need to be
> read back from disk again for the next bit of C code we compile.
> 
> Having mmap'ed reads go through the buffer cache should indeed help
> here, but that smells like UBC and even if it isn't, it is something
> we don't have and will be tricky to get right.
> 
> The discussions we had during h2k22 made me understand this thing a
> bit better.  Is there any reason why we can't keep a reference to the
> vnode and have the pagedaemon drop it when it drops the pages?  Is
> there a chance that we run out of vnodes this way?  I vaguely remember
> beck@ saying that we can have tons of vnodes now.
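For readers who haven't looked at uvm_vnode.c in a while, the persistence
logic under discussion boils down to roughly the sketch below.  This is a
simplified illustration rather than the actual code: the helper name is made
up, and locking, error handling and the rest of the vnode state bookkeeping
are omitted.

/*
 * Simplified sketch (not the actual uvm_vnode.c code) of what happens
 * when the last mapping of a vnode-backed object goes away.  With
 * UVM_VNODE_CANPERSIST set the pages are merely deactivated and kept
 * cached, so a later exec/mmap of the same file can pmap_enter() them
 * again without doing I/O; without the flag they are flushed and freed
 * right away.
 */
static void
uvn_last_detach_sketch(struct uvm_object *uobj)
{
	struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;

	if (uvn->u_flags & UVM_VNODE_CANPERSIST) {
		/* Keep the pages; just move them to the inactive list. */
		uvn_flush(uobj, 0, 0, PGO_DEACTIVATE | PGO_ALLPAGES);
		return;
	}

	/* No persistence: write back dirty pages and free everything. */
	uvn_flush(uobj, 0, 0, PGO_CLEANIT | PGO_FREE | PGO_ALLPAGES);
}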
Maybe keeping a vnode reference around isn't the best idea.  So here is
what may be an actual fix for the "macppc panic: vref used where vget
required" problem we're seeing that prompted us to look at
UVM_VNODE_CANPERSIST.

If we look back at the trace that bluhm@ posted:

panic: vref used where vget required
Stopped at      db_enter+0x24:  lwz r11,12(r1)
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 192060  78628     21         0x2          0    0  c++
*132472  74971      0     0x14000      0x200   1K  pagedaemon
db_enter() at db_enter+0x20
panic(91373c) at panic+0x158
vref(23b8fa20) at vref+0xac
uvm_vnp_uncache(e7eb7c50) at uvm_vnp_uncache+0x88
ffs_write(7ed2e423) at ffs_write+0x3b0
VOP_WRITE(23b8fa20,e7eb7c50,40,1ff3f60) at VOP_WRITE+0x48
uvn_io(7ed2e423,a93d0c,a93814,ffffffff,e4010000) at uvn_io+0x264
uvn_put(414b673a,e7eb7dd4,24f00070,5326e90) at uvn_put+0x64
uvm_pager_put(0,0,e7eb7d70,6ee0b8,2000000,80000000,0) at uvm_pager_put+0x15c
uvmpd_scan_inactive(0) at uvmpd_scan_inactive+0x224
uvmpd_scan() at uvmpd_scan+0x158
uvm_pageout(7e932633) at uvm_pageout+0x398
fork_trampoline() at fork_trampoline+0x14
end trace frame: 0x0, count: 2

I don't think there is anything complicated going on.  The pagedaemon
is paging out a page associated with a persisting vnode.  That page
isn't mapped anymore, so nobody holds a reference to the vnode, and it
sits happily on the freelist waiting to be reused or recycled.

The code responsible for paging out this page is uvn_put(), which calls
uvn_io() and eventually ffs_write() *without* holding a reference to
the vnode.  I believe that is wrong; ffs_write() clearly assumes that
its caller holds a reference to the vnode.  But nothing checks that
until uvm_vnp_uncache() attempts to grab another reference, which
triggers the diagnostic panic we're seeing.

So I think uvn_put() should make sure it holds a reference to the
vnode.  And since it won't hold one if called by the pagedaemon, this
means we should use vget() to get one.

Now there may be a risk of a deadlock here.  The pagedaemon may be
trying to page out a page that is associated with a vnode that is
already in the process of being recycled.  But cleaning the vnode as
part of the recycling process requires the vmobjlock that we're
holding.  So use LK_NOWAIT and trust that the page will be written to
disk and freed as the associated vnode gets cleaned.  That means the
pagedaemon can simply skip the page, so we return VM_PAGER_PEND.

Thoughts?  If this works, I probably should add some comments
explaining what's going on here.


Index: uvm/uvm_vnode.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.130
diff -u -p -r1.130 uvm_vnode.c
--- uvm/uvm_vnode.c	20 Oct 2022 13:31:52 -0000	1.130
+++ uvm/uvm_vnode.c	18 Nov 2022 19:56:51 -0000
@@ -899,11 +899,18 @@ uvn_cluster(struct uvm_object *uobj, vof
 int
 uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
 {
+	struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
 	int retval;
 
 	KASSERT(rw_write_held(uobj->vmobjlock));
+	KASSERT(uvn->u_flags & UVM_VNODE_VALID);
 
+	if (vget(uvn->u_vnode, LK_NOWAIT))
+		return VM_PAGER_PEND;
+
 	retval = uvn_io((struct uvm_vnode*)uobj, pps, npages, flags, UIO_WRITE);
+
+	vrele(uvn->u_vnode);
 
 	return retval;
 }
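In case it helps review, this is how uvn_put() reads with the diff applied.
The comments are mine and only restate the reasoning above; they are not part
of the diff.

int
uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
{
	struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
	int retval;

	KASSERT(rw_write_held(uobj->vmobjlock));
	KASSERT(uvn->u_flags & UVM_VNODE_VALID);

	/*
	 * The pagedaemon does not necessarily hold a reference to the
	 * vnode, so grab one for the duration of the I/O.  LK_NOWAIT
	 * avoids deadlocking against a vnode that is already being
	 * recycled while we hold the vmobjlock: if vget() fails, the
	 * page will be written out and freed when the vnode is cleaned,
	 * so report the pageout as pending and let the pagedaemon skip
	 * the page.
	 */
	if (vget(uvn->u_vnode, LK_NOWAIT))
		return VM_PAGER_PEND;

	retval = uvn_io((struct uvm_vnode *)uobj, pps, npages, flags, UIO_WRITE);

	/* Drop the reference taken above. */
	vrele(uvn->u_vnode);

	return retval;
}

The vrele() drops the reference again as soon as the I/O has been done, so
the pagedaemon never keeps a vnode alive beyond a single pageout.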