On 18/11/22(Fri) 21:33, Mark Kettenis wrote:
> > Date: Thu, 17 Nov 2022 20:23:37 +0100
> > From: Mark Kettenis <mark.kette...@xs4all.nl>
> > 
> > > From: Jeremie Courreges-Anglas <j...@wxcvbn.org>
> > > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > > 
> > > On Tue, Nov 15 2022, Martin Pieuchot <m...@openbsd.org> wrote:
> > > > UVM vnode objects include a reference count to keep track of the number
> > > > of processes that have the corresponding pages mapped in their VM space.
> > > >
> > > > When the last process referencing a given library or executable dies,
> > > > the reaper will munmap this object on its behalf.  When this happens it
> > > > doesn't free the associated pages to speed-up possible re-use of the
> > > > file.  Instead the pages are placed on the inactive list but stay ready
> > > > to be pmap_enter()'d without requiring I/O as soon as a new process
> > > > needs to access them.
> > > >
> > > > The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
> > > > doesn't work well with swapping [0].  For some reason, when the page
> > > > daemon wants to free pages on the inactive list it tries to flush the
> > > > pages to disk and calls panic(9) because it needs a valid reference to
> > > > the vnode to do so.
> > > >
> > > > This indicates that the mechanism described above, which seems to work
> > > > fine for RO mappings, is currently buggy in more complex situations.
> > > > Flushing the pages when the last reference of the UVM object is dropped
> > > > also doesn't seem to be enough as bluhm@ reported [1].
> > > >
> > > > The diff below, which has already been committed and reverted, gets rid of
> > > > the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
> > > > the arm64 caching bug has been found and fixed.
> > > >
> > > > Getting rid of this logic means more I/O will be generated and pages
> > > > might have a faster reuse cycle.  I'm aware this might introduce a small
> > > > slowdown,
> > > 
> > > Numbers for my usual make -j4 in libc,
> > > on an Unmatched riscv64 box, now:
> > >    16m32.65s real    21m36.79s user    30m53.45s system
> > >    16m32.37s real    21m33.40s user    31m17.98s system
> > >    16m32.63s real    21m35.74s user    31m12.01s system
> > >    16m32.13s real    21m36.12s user    31m06.92s system
> > > After:
> > >    19m14.15s real    21m09.39s user    36m51.33s system
> > >    19m19.11s real    21m02.61s user    36m58.46s system
> > >    19m21.77s real    21m09.23s user    37m03.85s system
> > >    19m09.39s real    21m08.96s user    36m36.00s system
> > > 
> > > 4 cores amd64 VM, before (-current plus another diff):
> > >    1m54.31s real     2m47.36s user     4m24.70s system
> > >    1m52.64s real     2m45.68s user     4m23.46s system
> > >    1m53.47s real     2m43.59s user     4m27.60s system
> > > After:
> > >    2m34.12s real     2m51.15s user     6m20.91s system
> > >    2m34.30s real     2m48.48s user     6m23.34s system
> > >    2m37.07s real     2m49.60s user     6m31.53s system
> > > 
> > > > however I believe we should work towards loading files from the
> > > > buffer cache to save I/O cycles instead of having another layer of 
> > > > cache.
> > > > Such work isn't trivial and making sure the vnode <-> UVM relation is
> > > > simple and well understood is the first step in this direction.
> > > >
> > > > I'd appreciate it if the diff below could be tested on many architectures,
> > > > including the offending rpi4.
> > > 
> > > Mike has already tested a make build on a riscv64 Unmatched.  I have
> > > also run regress in sys, lib/libc and lib/libpthread on that arch.  As
> > > far as I can see this looks stable on my machine, but what I really care
> > > about is the riscv64 bulk build cluster (I'm going to start another
> > > bulk build soon).
> > > 
> > > > Comments?  Oks?
> > > 
> > > The performance drop in my microbenchmark kinda worries me but it's only
> > > a microbenchmark...
> > 
> > I wouldn't call this a microbenchmark.  I fear this is typical for
> > builds of anything on clang architectures.  And I expect it to be
> > worse on a single-processor machine where *every* time we execute clang
> > or lld all the pages are thrown away upon exit and will need to be
> > read back from disk again for the next bit of C code we compile.
> > 
> > Having mmap'ed reads go through the buffer cache should indeed help
> > here, but that smells like UBC and even if it isn't, it is something
> > we don't have and will be tricky to get right.
> > 
> > The discussions we had during h2k22 made me understand this thing a
> > bit better.  Is there any reason why we can't keep a reference to the
> > vnode and have the pagedaemon drop it when it drops the pages?  Is
> > there a chance that we run out of vnodes this way?  I vaguely remember
> > beck@ saying that we can have tons of vnodes now.
> 
> Maybe that isn't the best idea.  So here is what may be an actual fix
> for the "macppc panic: vref used where vget required" problem we're
> seeing that prompted us to look at UVM_VNODE_CANPERSIST.
> 
> If we look back at the race that bluhm@ posted:
> 
> panic: vref used where vget required
> Stopped at      db_enter+0x24:  lwz r11,12(r1)
>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>  192060  78628     21         0x2          0    0  c++
> *132472  74971      0     0x14000      0x200    1K pagedaemon
> db_enter() at db_enter+0x20
> panic(91373c) at panic+0x158
> vref(23b8fa20) at vref+0xac
> uvm_vnp_uncache(e7eb7c50) at uvm_vnp_uncache+0x88
> ffs_write(7ed2e423) at ffs_write+0x3b0
> VOP_WRITE(23b8fa20,e7eb7c50,40,1ff3f60) at VOP_WRITE+0x48
> uvn_io(7ed2e423,a93d0c,a93814,ffffffff,e4010000) at uvn_io+0x264
> uvn_put(414b673a,e7eb7dd4,24f00070,5326e90) at uvn_put+0x64
> uvm_pager_put(0,0,e7eb7d70,6ee0b8,2000000,80000000,0) at uvm_pager_put+0x15c
> uvmpd_scan_inactive(0) at uvmpd_scan_inactive+0x224
> uvmpd_scan() at uvmpd_scan+0x158
> uvm_pageout(7e932633) at uvm_pageout+0x398
> fork_trampoline() at fork_trampoline+0x14
> end trace frame: 0x0, count: 2
> 
> I don't think there is anything complicated going on.  The pagedaemon
> is paging out a page associated with a persisting vnode.  That page
> isn't mapped anymore so nobody holds a reference to the vnode anymore
> and it sits happily on the freelist to either be reused or recycled.
> 
> The code responsible for paging out this page is uvn_put(), which
> calls uvn_io() and eventually ffs_write() *without* holding a
> reference to the vnode.  I believe that is wrong; ffs_write() clearly
> assumes that its caller holds a reference to the vnode.  But that isn't
> checked until we call uvm_vnp_uncache(), which attempts to grab another
> reference and triggers the diagnostic panic we're seeing.
> 
> So I think uvn_put() should make sure it holds a reference to the
> vnode.  And since it won't hold one if called by the pagedaemon, this
> means we should use vget() to get one.  Now there may be a risk of a
> deadlock here.  The pagedaemon may be trying to page out a page that
> is associated with a vnode that is already in the process of being
> recycled.  But cleaning the vnode as part of the recycling process
> requires the vmobjlock that we're holding.  So use LK_NOWAIT and trust
> that the page will be written to disk and freed as the associated
> vnode gets cleaned.  Which means the pagedaemon can simply skip the
> page so we return VM_PAGER_PEND.
> 
> Thoughts?  If this works, I probably should add some comments
> explaining what's going on here.

I like the idea and the diff.

I don't like the use of VM_PAGER_PEND.  This return value is used to
indicate that I/O has been started asynchronously.  This currently never
happens with vnodes, so we're introducing a new tricky error code path.

Plus the page hasn't been flushed to disk when uvm_pager_put() returns.
I'd rather return VM_PAGER_AGAIN, which would leave the page on the queue;
once the page daemon unlocks `vmobjlock' the racing thread will be able
to proceed.
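
Concretely, that would only change the return value in the hunk below,
i.e. something like this (untested sketch, not a full diff):

	KASSERT(uvn->u_flags & UVM_VNODE_VALID);

	/*
	 * If the vnode is being cleaned we can't take a reference
	 * without sleeping: leave the page on the inactive queue and
	 * let the page daemon retry after it has dropped `vmobjlock'.
	 */
	if (vget(uvn->u_vnode, LK_NOWAIT))
		return VM_PAGER_AGAIN;

	retval = uvn_io((struct uvm_vnode *)uobj, pps, npages, flags, UIO_WRITE);

	vrele(uvn->u_vnode);

	return retval;

That way no caller ever sees a "pending" write that was never actually
started.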

> Index: uvm/uvm_vnode.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
> retrieving revision 1.130
> diff -u -p -r1.130 uvm_vnode.c
> --- uvm/uvm_vnode.c   20 Oct 2022 13:31:52 -0000      1.130
> +++ uvm/uvm_vnode.c   18 Nov 2022 19:56:51 -0000
> @@ -899,11 +899,18 @@ uvn_cluster(struct uvm_object *uobj, vof
>  int
>  uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
>  {
> +     struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
>       int retval;
>  
>       KASSERT(rw_write_held(uobj->vmobjlock));
>  
> +     KASSERT(uvn->u_flags & UVM_VNODE_VALID);
> +     if (vget(uvn->u_vnode, LK_NOWAIT))
> +             return VM_PAGER_PEND;
> +
>       retval = uvn_io((struct uvm_vnode*)uobj, pps, npages, flags, UIO_WRITE);
> +
> +     vrele(uvn->u_vnode);
>  
>       return retval;
>  }
