On 18/11/22(Fri) 21:33, Mark Kettenis wrote:
> > Date: Thu, 17 Nov 2022 20:23:37 +0100
> > From: Mark Kettenis <mark.kette...@xs4all.nl>
> > 
> > > From: Jeremie Courreges-Anglas <j...@wxcvbn.org>
> > > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > > 
> > > On Tue, Nov 15 2022, Martin Pieuchot <m...@openbsd.org> wrote:
> > > > UVM vnode objects include a reference count to keep track of the
> > > > number of processes that have the corresponding pages mapped in
> > > > their VM space.
> > > > 
> > > > When the last process referencing a given library or executable
> > > > dies, the reaper will munmap this object on its behalf.  When this
> > > > happens it doesn't free the associated pages, to speed up possible
> > > > re-use of the file.  Instead the pages are placed on the inactive
> > > > list but stay ready to be pmap_enter()'d without requiring I/O as
> > > > soon as a new process needs to access them.
> > > > 
> > > > The mechanism to keep pages populated, known as
> > > > UVM_VNODE_CANPERSIST, doesn't work well with swapping [0].  For
> > > > some reason when the page daemon wants to free pages on the
> > > > inactive list it tries to flush the pages to disk and calls
> > > > panic(9) because it needs a valid reference to the vnode to do so.
> > > > 
> > > > This indicates that the mechanism described above, which seems to
> > > > work fine for RO mappings, is currently buggy in more complex
> > > > situations.  Flushing the pages when the last reference of the UVM
> > > > object is dropped also doesn't seem to be enough, as bluhm@
> > > > reported [1].
> > > > 
> > > > The diff below, which has already been committed and reverted,
> > > > gets rid of the UVM_VNODE_CANPERSIST logic.  I'd like to commit it
> > > > again now that the arm64 caching bug has been found and fixed.
> > > > 
> > > > Getting rid of this logic means more I/O will be generated and
> > > > pages might have a faster reuse cycle.  I'm aware this might
> > > > introduce a small slowdown,
> > > 
> > > Numbers for my usual make -j4 in libc, on an Unmatched riscv64 box,
> > > now:
> > > 16m32.65s real 21m36.79s user 30m53.45s system
> > > 16m32.37s real 21m33.40s user 31m17.98s system
> > > 16m32.63s real 21m35.74s user 31m12.01s system
> > > 16m32.13s real 21m36.12s user 31m06.92s system
> > > After:
> > > 19m14.15s real 21m09.39s user 36m51.33s system
> > > 19m19.11s real 21m02.61s user 36m58.46s system
> > > 19m21.77s real 21m09.23s user 37m03.85s system
> > > 19m09.39s real 21m08.96s user 36m36.00s system
> > > 
> > > 4 cores amd64 VM, before (-current plus another diff):
> > > 1m54.31s real 2m47.36s user 4m24.70s system
> > > 1m52.64s real 2m45.68s user 4m23.46s system
> > > 1m53.47s real 2m43.59s user 4m27.60s system
> > > After:
> > > 2m34.12s real 2m51.15s user 6m20.91s system
> > > 2m34.30s real 2m48.48s user 6m23.34s system
> > > 2m37.07s real 2m49.60s user 6m31.53s system
> > > 
> > > > however I believe we should work towards loading files from the
> > > > buffer cache to save I/O cycles instead of having another layer of
> > > > cache.  Such work isn't trivial and making sure the vnode <-> UVM
> > > > relation is simple and well understood is the first step in this
> > > > direction.
> > > > 
> > > > I'd appreciate it if the diff below could be tested on many
> > > > architectures, including the offending rpi4.
> > > 
> > > Mike has already tested a make build on a riscv64 Unmatched.  I have
> > > also run regress in sys, lib/libc and lib/libpthread on that arch.
> > > As far as I can see this looks stable on my machine, but what I
> > > really care about is the riscv64 bulk build cluster (I'm going to
> > > start another bulk build soon).
> > > 
> > > > Comments?  Oks?
> > > 
> > > The performance drop in my microbenchmark kinda worries me but it's
> > > only a microbenchmark...
> > 
> > I wouldn't call this a microbenchmark.
> > I fear this is typical for builds of anything on clang architectures.
> > And I expect it to be worse on a single-processor machine where
> > *every* time we execute clang or lld all the pages are thrown away
> > upon exit and will need to be read back from disk again for the next
> > bit of C code we compile.
> > 
> > Having mmap'ed reads go through the buffer cache should indeed help
> > here, but that smells like UBC and even if it isn't, it is something
> > we don't have and will be tricky to get right.
> > 
> > The discussions we had during h2k22 made me understand this thing a
> > bit better.  Is there any reason why we can't keep a reference to the
> > vnode and have the pagedaemon drop it when it drops the pages?  Is
> > there a chance that we run out of vnodes this way?  I vaguely remember
> > beck@ saying that we can have tons of vnodes now.
> 
> Maybe that isn't the best idea.  So here is what may be an actual fix
> for the "macppc panic: vref used where vget required" problem we're
> seeing that prompted us to look at UVM_VNODE_CANPERSIST.
> If we look back at the race that bluhm@ posted:
> 
> panic: vref used where vget required
> Stopped at      db_enter+0x24:  lwz r11,12(r1)
>     TID     PID     UID  PRFLAGS  PFLAGS  CPU  COMMAND
>  192060   78628      21      0x2       0    0  c++
> *132472   74971       0  0x14000   0x200   1K  pagedaemon
> db_enter() at db_enter+0x20
> panic(91373c) at panic+0x158
> vref(23b8fa20) at vref+0xac
> uvm_vnp_uncache(e7eb7c50) at uvm_vnp_uncache+0x88
> ffs_write(7ed2e423) at ffs_write+0x3b0
> VOP_WRITE(23b8fa20,e7eb7c50,40,1ff3f60) at VOP_WRITE+0x48
> uvn_io(7ed2e423,a93d0c,a93814,ffffffff,e4010000) at uvn_io+0x264
> uvn_put(414b673a,e7eb7dd4,24f00070,5326e90) at uvn_put+0x64
> uvm_pager_put(0,0,e7eb7d70,6ee0b8,2000000,80000000,0) at uvm_pager_put+0x15c
> uvmpd_scan_inactive(0) at uvmpd_scan_inactive+0x224
> uvmpd_scan() at uvmpd_scan+0x158
> uvm_pageout(7e932633) at uvm_pageout+0x398
> fork_trampoline() at fork_trampoline+0x14
> end trace frame: 0x0, count: 2
> 
> I don't think there is anything complicated going on.  The pagedaemon
> is paging out a page associated with a persisting vnode.  That page
> isn't mapped anymore so nobody holds a reference to the vnode anymore
> and it sits happily on the freelist to either be reused or recycled.
> 
> The code responsible for paging out this page is uvn_put(), which
> calls uvn_io() and eventually ffs_write() *without* holding a
> reference to the vnode.  I believe that is wrong; ffs_write() clearly
> assumes that its caller holds a reference to the vnode.  But it
> doesn't check until we call uvm_vnp_uncache(), which attempts to grab
> another reference, and that triggers the diagnostic panic we're
> seeing.
> 
> So I think uvn_put() should make sure it holds a reference to the
> vnode.  And since it won't hold one if called by the pagedaemon, this
> means we should use vget() to get one.  Now there may be a risk of a
> deadlock here.  The pagedaemon may be trying to page out a page that
> is associated with a vnode that is already in the process of being
> recycled.  But cleaning the vnode as part of the recycling process
> requires the vmobjlock that we're holding.  So use LK_NOWAIT and trust
> that the page will be written to disk and freed as the associated
> vnode gets cleaned.  Which means the pagedaemon can simply skip the
> page, so we return VM_PAGER_PEND.
> 
> Thoughts?  If this works, I probably should add some comments
> explaining what's going on here.
I like the idea and the diff.

I don't like the use of VM_PAGER_PEND.  This return value is used to
indicate that I/O has been started asynchronously.  That currently
never happens with vnodes, so we'd be introducing a new, tricky error
code path.  Plus the page hasn't been flushed to disk when
uvm_pager_put() returns.

I'd rather return VM_PAGER_AGAIN, which would leave the page on the
queue; once the page daemon unlocks `vmobjlock' the racing thread will
be able to proceed.

> Index: uvm/uvm_vnode.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
> retrieving revision 1.130
> diff -u -p -r1.130 uvm_vnode.c
> --- uvm/uvm_vnode.c	20 Oct 2022 13:31:52 -0000	1.130
> +++ uvm/uvm_vnode.c	18 Nov 2022 19:56:51 -0000
> @@ -899,11 +899,18 @@ uvn_cluster(struct uvm_object *uobj, vof
>  int
>  uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
>  {
> +	struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
>  	int retval;
>  
>  	KASSERT(rw_write_held(uobj->vmobjlock));
>  
> +	KASSERT(uvn->u_flags & UVM_VNODE_VALID);
> +	if (vget(uvn->u_vnode, LK_NOWAIT))
> +		return VM_PAGER_PEND;
> +
>  	retval = uvn_io((struct uvm_vnode*)uobj, pps, npages, flags, UIO_WRITE);
> +
> +	vrele(uvn->u_vnode);
>  
>  	return retval;
>  }