What happens next?


Mark Kettenis <[email protected]> wrote:

> > Date: Fri, 10 Jan 2025 09:07:44 +0100
> > From: Martin Pieuchot <[email protected]>
> > 
> > Hello,
> > 
> > On 09/01/25(Thu) 22:49, Mark Kettenis wrote:
> > > > > > [...] 
> > > > > > Any proposal on how we could proceed to find a solution for this 
> > > > > > issue?
> > > > > 
> > > > > The following hack fixes the issue for me.  I don't think this is a
> > > > > proper solution, but it may be a starting point.  Or a temporary fix.
> > > > > 
> > > > > The issue really is that we can't tell whether pmap_enter(9) failed
> > > > > because we're out of physical memory, or if it failed for some other
> > > > > reason.  In the case at hand we fail because of contention on the
> > > > > kernel map lock.  But we could also be failing because we have 
> > > > > completely run out of KVA.
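
For context, the error path in uvm_fault_upper() that ends up handling such a
failure looks roughly like this (a paraphrase of the code touched by the diff
below, not a verbatim quote; the out-of-swap check shown is approximate):

	if (pmap_enter(ufi->orig_map->pmap, ufi->orig_rvaddr,
	    VM_PAGE_TO_PHYS(pg), access_type,
	    access_type | PMAP_CANFAIL) != 0) {
		uvmfault_unlockall(ufi, amap, NULL);
		if (uvm_swapisfull()) {
			/* treated as a hard out-of-memory condition */
			return ENOMEM;
		}
		/*
		 * Treated as "wait for free pages and refault", even
		 * though the real cause may be KVA exhaustion or
		 * contention on the kernel map lock.
		 */
		uvm_wait("flt_pmfail1");
		return ERESTART;
	}
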
> > > > 
> > > > Works for me as well!
> > > >  
> > > > > We can't sleep while holding all those uvm locks.  I'm not sure the
> > > > > free memory check this does is right.  Or whether we want such a check
> > > > > at all.  The vm_map_lock()/vm_map_unlock() dance is necessary to make
> > > > > sure we don't spin too quickly if the kernel map lock is contended.
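
The diff for that hack is not quoted in this thread; as a rough guess at its
shape based on the description above (not the actual code), the fault path
would do something like:

	/*
	 * Guess at the shape of the hack described above: if we are not
	 * actually short on physical memory, do not sleep in uvm_wait();
	 * instead block briefly on the kernel map lock so the refault
	 * does not spin while the lock is contended.
	 */
	if (uvmexp.free > uvmexp.freemin) {
		vm_map_lock(kernel_map);
		vm_map_unlock(kernel_map);
		return ERESTART;
	}
	uvm_wait("flt_pmfail1");
	return ERESTART;
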
> > > > > 
> > > > > A better fix would perhaps be to have a new pmap function that we
> > > > > could call at this spot that would sleep until the necessary resources
> > > > > are available.  On arm64 this would populate the page tables using
> > > > > pool allocations that use PR_WAITOK, but would not actually enter a
> > > > > valid mapping.  I'm going to explore that idea a bit.
> > > 
> > > This seems to work.  It even works a little better, I think, as the number
> > > of vp pool failures seems to be a bit smaller with this diff.
> > > 
> > > Thoughts?
> > 
> > Thanks a lot for the diagnosis and the diff.
> > 
> > The fact that pmaps use pools with NOWAIT, while pools can only be
> > refilled from a sleeping context, is very concerning.
> > 
> > There are multiple ways to fix the issue.  I'd prefer if we can work on a
> > fix that avoids re-faulting:
> > 
> > dlg@ Is it possible to make the pools used by pmaps special such that they
> > act as a thin wrapper around uvm_pagealloc(9)/uvm_pglistalloc(9)?  This
> > would help in two ways:
> > - ensure pmap_enter(9) only returns an error if OOM
> > - benefit of the per-CPU page cache for all pmaps
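
A minimal sketch of what such a thin wrapper might look like, assuming a
__HAVE_PMAP_DIRECT architecture and single-page pool items (the function
names are hypothetical, and the 8K "vp" items discussed below would need
uvm_pglistalloc(9) plus a contiguous direct mapping instead):

	void *
	pmap_pool_page_alloc(struct pool *pp, int flags, int *slowdown)
	{
		struct vm_page *pg;

		/*
		 * Go straight to the physical page allocator and map the
		 * page through the direct map: no KVA is consumed, so the
		 * only possible failure is a genuine physical memory
		 * shortage.  A real version would honour PR_WAITOK by
		 * looping over uvm_wait().
		 */
		*slowdown = 0;
		pg = uvm_pagealloc(NULL, 0, NULL, UVM_PGA_ZERO);
		if (pg == NULL)
			return NULL;
		return (void *)pmap_map_direct(pg);
	}

	void
	pmap_pool_page_free(struct pool *pp, void *v)
	{
		uvm_pagefree(pmap_unmap_direct((vaddr_t)v));
	}
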
> 
> Not really.  We still have plenty of physical memory, and allocating
> physical pages succeeds just fine.  The problem here is allocating
> virtual address space.
> 
> Here we have two options.  Most pools that are used with PR_NOWAIT use
> the "interrupt safe" kernel map.  When locking that map, we use
> mutexes, so it is safe to do so if we're in a context where we can't
> sleep.  But the address space of the "interrupt safe" kernel map is
> small.  So allocations may still fail if we run out of address space.
> 
> Also, the "vp" items are fairly large (8K) and we use quite a lot of
> them building the page tables for all our processes.  We would quickly
> fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> the normal kernel map.  The normal map uses rwlocks to do the locking.
> So to avoid sleeping we opportunistically try to lock the map and fail
> if we don't succeed.
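
The "opportunistic" attempt described here is, roughly, the following in the
vp pool's backend allocator (a sketch following the usual km_alloc(9) pattern;
the exact arm64 code may differ slightly):

	void *
	pmap_vp_page_alloc(struct pool *pp, int flags, int *slowdown)
	{
		struct kmem_dyn_mode kd = KMEM_DYN_INITIALIZER;

		kd.kd_waitok = ISSET(flags, PR_WAITOK);
		kd.kd_trylock = ISSET(flags, PR_NOWAIT);
		kd.kd_slowdown = slowdown;

		/*
		 * With kd_trylock set, the kernel map's rwlock is taken
		 * with a non-sleeping try-lock and km_alloc() fails
		 * immediately if the lock is contended.
		 */
		return km_alloc(pp->pr_pgsize, &kv_any, &kp_dirty, &kd);
	}
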
> 
> This is a fundamental problem with all our kernel memory allocators.
> We have some mitigations in the form of the single page allocator (the
> uvm_km_pages stuff) and __HAVE_PMAP_DIRECT.  But those are not
> guaranteed to work and virtual address allocations can still fail.
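
A conceptual sketch of the single page allocator idea mentioned here (not the
real uvm_km_pages data structures, just the principle): keep a small stock of
pages whose KVA was set up ahead of time from a sleepable context, so NOWAIT
callers never have to take the kernel map lock themselves.

	struct km_page_stock {
		struct mutex	lock;
		vaddr_t		pages[64];	/* pre-mapped single pages */
		int		nfree;
	};

	vaddr_t
	km_stock_get(struct km_page_stock *ks)
	{
		vaddr_t va = 0;

		mtx_enter(&ks->lock);
		if (ks->nfree > 0)
			va = ks->pages[--ks->nfree];
		mtx_leave(&ks->lock);

		/*
		 * A kernel thread refills ks->pages from process context;
		 * if the stock is empty the caller still fails, which is
		 * why this is only a mitigation, not a guarantee.
		 */
		return va;
	}
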
> 
> > Alternatively, pool_get() shortages could be integrated into the page daemon:
> > instead of adding pmap_populate() we could simply have the page daemon check
> > for pools that need a refill.  The same already applies to uvm_anwait().
> 
> I don't think making the page daemon responsible for refilling the
> pools is a good idea.  It would mean the page daemon could block when
> there is a KVA shortage.  And isn't allocating more memory to fill the
> pools the last thing you'd want to do when waking up the page daemon?
> 
> And yes, the uvm_anwait() thing is very similar to this case.  In fact
> my initial idea was for pmap_populate() to just refill the pools.  But
> then I realized that it made more sense to just populate the page
> tables for the faulting address because we're going to refault on the
> same address anyway.
> 
> > But IMHO fewer uses of pools in UVM and the pmaps is the best way forward.
> 
> I used to think that as well, but these days the pools have some
> performance-critical optimizations that are very relevant for our
> pmaps.
> 
> > > Index: arch/arm64/arm64/pmap.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
> > > diff -u -p -r1.105 pmap.c
> > > --- arch/arm64/arm64/pmap.c       9 Nov 2024 12:58:29 -0000       1.105
> > > +++ arch/arm64/arm64/pmap.c       9 Jan 2025 21:44:41 -0000
> > > @@ -443,6 +443,67 @@ pmap_vp_enter(pmap_t pm, vaddr_t va, str
> > >   return 0;
> > >  }
> > >  
> > > +void
> > > +pmap_vp_populate(pmap_t pm, vaddr_t va)
> > > +{
> > > + struct pte_desc *pted;
> > > + struct pmapvp1 *vp1;
> > > + struct pmapvp2 *vp2;
> > > + struct pmapvp3 *vp3;
> > > + void *vp;
> > > +
> > > + pted = pool_get(&pmap_pted_pool, PR_WAITOK | PR_ZERO);
> > > + vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > > +
> > > + pmap_lock(pm);
> > > +
> > > + if (pm->have_4_level_pt) {
> > > +         vp1 = pm->pm_vp.l0->vp[VP_IDX0(va)];
> > > +         if (vp1 == NULL) {
> > > +                 vp1 = vp; vp = NULL;
> > > +                 pmap_set_l1(pm, va, vp1);
> > > +         }
> > > + } else {
> > > +         vp1 = pm->pm_vp.l1;
> > > + }
> > > +
> > > + if (vp == NULL) {
> > > +         pmap_unlock(pm);
> > > +         vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > > +         pmap_lock(pm);
> > > + }
> > > +
> > > + vp2 = vp1->vp[VP_IDX1(va)];
> > > + if (vp2 == NULL) {
> > > +         vp2 = vp; vp = NULL;
> > > +         pmap_set_l2(pm, va, vp1, vp2);
> > > + }
> > > + 
> > > + if (vp == NULL) {
> > > +         pmap_unlock(pm);
> > > +         vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > > +         pmap_lock(pm);
> > > + }
> > > +
> > > + vp3 = vp2->vp[VP_IDX2(va)];
> > > + if (vp3 == NULL) {
> > > +         vp3 = vp; vp = NULL;
> > > +         pmap_set_l3(pm, va, vp2, vp3);
> > > + }
> > > +
> > > + if (vp3->vp[VP_IDX3(va)] == NULL) {
> > > +         vp3->vp[VP_IDX3(va)] = pted;
> > > +         pted = NULL;
> > > + }
> > > +
> > > + pmap_unlock(pm);
> > > +
> > > + if (vp)
> > > +         pool_put(&pmap_vp_pool, vp);
> > > + if (pted)
> > > +         pool_put(&pmap_pted_pool, pted);
> > > +}
> > > +
> > >  void *
> > >  pmap_vp_page_alloc(struct pool *pp, int flags, int *slowdown)
> > >  {
> > > @@ -616,6 +677,11 @@ out:
> > >   return error;
> > >  }
> > >  
> > > +void
> > > +pmap_populate(pmap_t pm, vaddr_t va)
> > > +{
> > > + pmap_vp_populate(pm, va);
> > > +}
> > >  
> > >  /*
> > >   * Remove the given range of mapping entries.
> > > Index: arch/arm64/include/pmap.h
> > > ===================================================================
> > > RCS file: /cvs/src/sys/arch/arm64/include/pmap.h,v
> > > diff -u -p -r1.25 pmap.h
> > > --- arch/arm64/include/pmap.h     11 Dec 2023 22:12:53 -0000      1.25
> > > +++ arch/arm64/include/pmap.h     9 Jan 2025 21:44:41 -0000
> > > @@ -123,6 +123,7 @@ struct pv_entry;
> > >  int      pmap_fault_fixup(pmap_t, vaddr_t, vm_prot_t);
> > >  
> > >  #define __HAVE_PMAP_MPSAFE_ENTER_COW
> > > +#define __HAVE_PMAP_POPULATE
> > >  
> > >  #endif /* _KERNEL && !_LOCORE */
> > >  
> > > Index: uvm/uvm_fault.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
> > > diff -u -p -r1.159 uvm_fault.c
> > > --- uvm/uvm_fault.c       3 Jan 2025 15:31:48 -0000       1.159
> > > +++ uvm/uvm_fault.c       9 Jan 2025 21:44:41 -0000
> > > @@ -1105,8 +1105,12 @@ uvm_fault_upper(struct uvm_faultinfo *uf
> > >                   /* XXX instrumentation */
> > >                   return ENOMEM;
> > >           }
> > > +#ifdef __HAVE_PMAP_POPULATE
> > > +         pmap_populate(ufi->orig_map->pmap, ufi->orig_rvaddr);
> > > +#else
> > >           /* XXX instrumentation */
> > >           uvm_wait("flt_pmfail1");
> > > +#endif
> > >           return ERESTART;
> > >   }
> > >  
> > > @@ -1457,8 +1461,12 @@ uvm_fault_lower(struct uvm_faultinfo *uf
> > >                   /* XXX instrumentation */
> > >                   return (ENOMEM);
> > >           }
> > > +#ifdef __HAVE_PMAP_POPULATE
> > > +         pmap_populate(ufi->orig_map->pmap, ufi->orig_rvaddr);
> > > +#else
> > >           /* XXX instrumentation */
> > >           uvm_wait("flt_pmfail2");
> > > +#endif
> > >           return ERESTART;
> > >   }
> > >  
> > > Index: uvm/uvm_pmap.h
> > > ===================================================================
> > > RCS file: /cvs/src/sys/uvm/uvm_pmap.h,v
> > > diff -u -p -r1.34 uvm_pmap.h
> > > --- uvm/uvm_pmap.h        3 Apr 2024 18:43:32 -0000       1.34
> > > +++ uvm/uvm_pmap.h        9 Jan 2025 21:44:41 -0000
> > > @@ -179,6 +179,10 @@ vaddr_t               pmap_steal_memory(vsize_t, vad
> > >  void              pmap_virtual_space(vaddr_t *, vaddr_t *);
> > >  #endif
> > >  
> > > +#if defined(__HAVE_PMAP_POPULATE)
> > > +void             pmap_populate(pmap_t, vaddr_t);
> > > +#endif
> > > +
> > >  /* nested pmaps are used in i386/amd64 vmm */
> > >  #ifndef pmap_nested
> > >  #define pmap_nested(pm) 0
> > 
> > 
> > 
> 
