> Date: Thu, 17 Nov 2022 20:23:37 +0100
> From: Mark Kettenis <mark.kette...@xs4all.nl>
> 
> > From: Jeremie Courreges-Anglas <j...@wxcvbn.org>
> > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > 
> > On Tue, Nov 15 2022, Martin Pieuchot <m...@openbsd.org> wrote:
> > > UVM vnode objects include a reference count to keep track of the number
> > > of processes that have the corresponding pages mapped in their VM space.
> > >
> > > When the last process referencing a given library or executable dies,
> > > the reaper will munmap this object on its behalf.  When this happens it
> > > doesn't free the associated pages, in order to speed up possible re-use
> > > of the file.  Instead the pages are placed on the inactive list but stay
> > > ready to be pmap_enter()'d without requiring I/O as soon as a new
> > > process needs to access them.
> > >
> > > The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
> > > doesn't work well with swapping [0].  For some reason, when the page
> > > daemon wants to free pages on the inactive list, it tries to flush the
> > > pages to disk and calls panic(9) because it needs a valid reference to
> > > the vnode to do so.
> > >
> > > This indicates that the mechanism described above, which seems to work
> > > fine for RO mappings, is currently buggy in more complex situations.
> > > Flushing the pages when the last reference to the UVM object is dropped
> > > also doesn't seem to be enough, as bluhm@ reported [1].
> > >
> > > The diff below, which has already been committed and reverted, gets rid
> > > of the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
> > > the arm64 caching bug has been found and fixed.
> > >
> > > Getting rid of this logic means more I/O will be generated and pages
> > > might have a faster reuse cycle.  I'm aware this might introduce a small
> > > slowdown,
> > 
> > Numbers for my usual make -j4 in libc,
> > on an Unmatched riscv64 box, now:
> > 16m32.65s real    21m36.79s user    30m53.45s system
> > 16m32.37s real    21m33.40s user    31m17.98s system
> > 16m32.63s real    21m35.74s user    31m12.01s system
> > 16m32.13s real    21m36.12s user    31m06.92s system
> > After:
> > 19m14.15s real    21m09.39s user    36m51.33s system
> > 19m19.11s real    21m02.61s user    36m58.46s system
> > 19m21.77s real    21m09.23s user    37m03.85s system
> > 19m09.39s real    21m08.96s user    36m36.00s system
> > 
> > 4-core amd64 VM, before (-current plus another diff):
> > 1m54.31s real     2m47.36s user     4m24.70s system
> > 1m52.64s real     2m45.68s user     4m23.46s system
> > 1m53.47s real     2m43.59s user     4m27.60s system
> > After:
> > 2m34.12s real     2m51.15s user     6m20.91s system
> > 2m34.30s real     2m48.48s user     6m23.34s system
> > 2m37.07s real     2m49.60s user     6m31.53s system
> > 
> > > however I believe we should work towards loading files from the
> > > buffer cache to save I/O cycles instead of having another layer of cache.
> > > Such work isn't trivial and making sure the vnode <-> UVM relation is
> > > simple and well understood is the first step in this direction.
> > >
> > > I'd appreciate it if the diff below could be tested on many
> > > architectures, including the offending rpi4.
> > 
> > Mike has already tested a make build on a riscv64 Unmatched.  I have
> > also run regress in sys, lib/libc and lib/libpthread on that arch.  As
> > far as I can see this looks stable on my machine, but what I really care
> > about is the riscv64 bulk build cluster (I'm going to start another
> > bulk build soon).
> > 
> > > Comments?  Oks?
> > 
> > The performance drop in my microbenchmark kinda worries me but it's only
> > a microbenchmark...
> 
> I wouldn't call this a microbenchmark.  I fear this is typical for
> builds of anything on clang architectures.  And I expect it to be
> worse on a single-processor machine where *every* time we execute clang
> or lld all the pages are thrown away upon exit and will need to be
> read back from disk again for the next bit of C code we compile.
> 
> Having mmap'ed reads go through the buffer cache should indeed help
> here, but that smells like UBC and even if it isn't, it is something
> we don't have and will be tricky to get right.
> 
> The discussions we had during h2k22 made me understand this thing a
> bit better.  Is there any reason why we can't keep a reference to the
> vnode and have the pagedaemon drop it when it drops the pages?  Is
> there a chance that we run out of vnodes this way?  I vaguely remember
> beck@ saying that we can have tons of vnodes now.
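For readers who haven't looked at uvm_vnode.c in a while, the persistence
logic under discussion boils down to roughly the sketch below.  This is a
simplified illustration rather than the actual code: the helper name is made
up, and locking, error handling and the rest of the vnode state bookkeeping
are omitted.

/*
 * Simplified sketch (not the actual uvm_vnode.c code) of what happens
 * when the last mapping of a vnode-backed object goes away.  With
 * UVM_VNODE_CANPERSIST set the pages are merely deactivated and kept
 * cached, so a later exec/mmap of the same file can pmap_enter() them
 * again without doing I/O; without the flag they are flushed and freed
 * right away.
 */
static void
uvn_last_detach_sketch(struct uvm_object *uobj)
{
	struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;

	if (uvn->u_flags & UVM_VNODE_CANPERSIST) {
		/* Keep the pages; just move them to the inactive list. */
		uvn_flush(uobj, 0, 0, PGO_DEACTIVATE | PGO_ALLPAGES);
		return;
	}

	/* No persistence: write back dirty pages and free everything. */
	uvn_flush(uobj, 0, 0, PGO_CLEANIT | PGO_FREE | PGO_ALLPAGES);
}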
Maybe keeping a vnode reference around isn't the best idea.  So here is
what may be an actual fix for the "macppc panic: vref used where vget
required" problem we're seeing that prompted us to look at
UVM_VNODE_CANPERSIST.

If we look back at the trace that bluhm@ posted:

panic: vref used where vget required
Stopped at      db_enter+0x24:  lwz r11,12(r1)
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 192060  78628     21         0x2          0    0  c++
*132472  74971      0     0x14000      0x200   1K  pagedaemon
db_enter() at db_enter+0x20
panic(91373c) at panic+0x158
vref(23b8fa20) at vref+0xac
uvm_vnp_uncache(e7eb7c50) at uvm_vnp_uncache+0x88
ffs_write(7ed2e423) at ffs_write+0x3b0
VOP_WRITE(23b8fa20,e7eb7c50,40,1ff3f60) at VOP_WRITE+0x48
uvn_io(7ed2e423,a93d0c,a93814,ffffffff,e4010000) at uvn_io+0x264
uvn_put(414b673a,e7eb7dd4,24f00070,5326e90) at uvn_put+0x64
uvm_pager_put(0,0,e7eb7d70,6ee0b8,2000000,80000000,0) at uvm_pager_put+0x15c
uvmpd_scan_inactive(0) at uvmpd_scan_inactive+0x224
uvmpd_scan() at uvmpd_scan+0x158
uvm_pageout(7e932633) at uvm_pageout+0x398
fork_trampoline() at fork_trampoline+0x14
end trace frame: 0x0, count: 2

I don't think there is anything complicated going on.  The pagedaemon
is paging out a page associated with a persisting vnode.  That page
isn't mapped anymore, so nobody holds a reference to the vnode, and it
sits happily on the freelist waiting to be reused or recycled.

The code responsible for paging out this page is uvn_put(), which calls
uvn_io() and eventually ffs_write() *without* holding a reference to
the vnode.  I believe that is wrong; ffs_write() clearly assumes that
its caller holds a reference to the vnode.  But nothing checks that
until uvm_vnp_uncache() attempts to grab another reference, which
triggers the diagnostic panic we're seeing.

So I think uvn_put() should make sure it holds a reference to the
vnode.  And since it won't hold one if called by the pagedaemon, this
means we should use vget() to get one.

Now there may be a risk of a deadlock here.  The pagedaemon may be
trying to page out a page that is associated with a vnode that is
already in the process of being recycled.  But cleaning the vnode as
part of the recycling process requires the vmobjlock that we're
holding.  So use LK_NOWAIT and trust that the page will be written to
disk and freed as the associated vnode gets cleaned.  That means the
pagedaemon can simply skip the page, so we return VM_PAGER_PEND.

Thoughts?  If this works, I probably should add some comments
explaining what's going on here.


Index: uvm/uvm_vnode.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.130
diff -u -p -r1.130 uvm_vnode.c
--- uvm/uvm_vnode.c	20 Oct 2022 13:31:52 -0000	1.130
+++ uvm/uvm_vnode.c	18 Nov 2022 19:56:51 -0000
@@ -899,11 +899,18 @@ uvn_cluster(struct uvm_object *uobj, vof
 int
 uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
 {
+	struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
 	int retval;
 
 	KASSERT(rw_write_held(uobj->vmobjlock));
+	KASSERT(uvn->u_flags & UVM_VNODE_VALID);
 
+	if (vget(uvn->u_vnode, LK_NOWAIT))
+		return VM_PAGER_PEND;
+
 	retval = uvn_io((struct uvm_vnode*)uobj, pps, npages, flags, UIO_WRITE);
+
+	vrele(uvn->u_vnode);
 
 	return retval;
 }
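In case it helps review, this is how uvn_put() reads with the diff applied.
The comments are mine and only restate the reasoning above; they are not part
of the diff.

int
uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
{
	struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
	int retval;

	KASSERT(rw_write_held(uobj->vmobjlock));
	KASSERT(uvn->u_flags & UVM_VNODE_VALID);

	/*
	 * The pagedaemon does not necessarily hold a reference to the
	 * vnode, so grab one for the duration of the I/O.  LK_NOWAIT
	 * avoids deadlocking against a vnode that is already being
	 * recycled while we hold the vmobjlock: if vget() fails, the
	 * page will be written out and freed when the vnode is cleaned,
	 * so report the pageout as pending and let the pagedaemon skip
	 * the page.
	 */
	if (vget(uvn->u_vnode, LK_NOWAIT))
		return VM_PAGER_PEND;

	retval = uvn_io((struct uvm_vnode *)uobj, pps, npages, flags, UIO_WRITE);

	/* Drop the reference taken above. */
	vrele(uvn->u_vnode);

	return retval;
}

The vrele() drops the reference again as soon as the I/O has been done, so
the pagedaemon never keeps a vnode alive beyond a single pageout.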