> Date: Fri, 10 Jan 2025 09:07:44 +0100
> From: Martin Pieuchot <[email protected]>
>
> Hello,
>
> On 09/01/25(Thu) 22:49, Mark Kettenis wrote:
> > > > > [...]
> > > > > Any proposal on how we could proceed to find a solution for this
> > > > > issue?
> > > >
> > > > The following hack fixes the issue for me. I don't think this is a
> > > > proper solution, but it may be a starting point. Or a temporary fix.
> > > >
> > > > The issue really is that we can't tell whether pmap_enter(9) failed
> > > > because we're out of physical memory, or if it failed for some other
> > > > > reason. In the case at hand we fail because of contention on the
> > > > kernel map lock. But we could also be failing because we have
> > > > completely run out of KVA.
> > >
> > > Works for me as well!
> > >
> > > > We can't sleep while holding all those uvm locks. I'm not sure the
> > > > free memory check this does is right. Or whether we want such a check
> > > > at all. The vm_map_lock()/vm_map_unlock() dance is necessary to make
> > > > sure we don't spin too quickly if the kernel map lock is contended.
> > > >
> > > > A better fix would perhaps be to have a new pmap function that we
> > > > could call at this spot that would sleep until the necessary resources
> > > > are available. On arm64 this would populate the page tables using
> > > > pool allocations that use PR_WAITOK, but would not actually enter a
> > > > valid mapping. I'm going to explore that idea a bit.
> >
> > This seems to work. Even a little bit better, I think, as the number of
> > vp pool failures seems to be a bit smaller with this diff.
> >
> > Thoughts?
>
> Thanks a lot for the diagnostic and diff.
>
> The fact that pools are used by pmaps with NOWAIT and that pools can only
> be refilled from a sleeping context is very concerning.
>
> There are multiple ways to fix the issue. I'd prefer if we could work on a
> fix that avoids re-faulting:
>
> dlg@ Is it possible to make pools used by pmaps special such that they
> act as a thin wrapper around uvm_pagealloc(9)/uvm_pglistalloc(9)? This
> would help in two ways:
> - ensure pmap_enter(9) only returns an error if OOM
> - benefit of the per-CPU page cache for all pmaps
Not really. We still have plenty of physical memory, and allocating
physical pages succeeds just fine. The problem here is allocating
virtual address space.
Here we have two options. Most pools that are used with PR_NOWAIT use
the "interrupt safe" kernel map. When locking that map, we use
mutexes, so it is safe to do so if we're in a context where we can't
sleep. But the address space of the "interrupt safe" kernel map is
small. So allocations may still fail if we run out of address space.
Also, the "vp" items are fairly large (8K) and we use quite a lot of
them building the page tables for all our processes. We would quickly
fill the "interrupt safe" kernel map, so instead the "vp" pool uses
the normal kernel map. The normal map uses rwlocks to do the locking.
So to avoid sleeping we opportunistically try to lock the map and fail
if we don't succeed.
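For reference, the backend of the vp pool looks roughly like this (quoting
from memory, so the details may be slightly off); the kd_trylock flag is
what turns the PR_NOWAIT case into an opportunistic try-lock of the kernel
map instead of a sleeping lock:

void *
pmap_vp_page_alloc(struct pool *pp, int flags, int *slowdown)
{
	struct kmem_dyn_mode kd = KMEM_DYN_INITIALIZER;

	/* Sleep for KVA only if the pool caller said PR_WAITOK. */
	kd.kd_waitok = ISSET(flags, PR_WAITOK);
	/* For PR_NOWAIT, try-lock the kernel map and fail on contention. */
	kd.kd_trylock = ISSET(flags, PR_NOWAIT);
	kd.kd_slowdown = slowdown;

	return km_alloc(pp->pr_pgsize, &kv_any, &kp_dirty, &kd);
}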
This is a fundamental problem with all our kernel memory allocators.
We have some mitigations in the form of the single page allocator (the
uvm_km_pages stuff) and __HAVE_PMAP_DIRECT. But those are not
guaranteed to work and virtual address allocations can still fail.
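Just to illustrate the __HAVE_PMAP_DIRECT case (a sketch, not part of the
diff): on architectures that have it, a single physical page can be mapped
through the direct map without touching the kernel map at all, so that path
never needs to allocate kernel virtual address space:

#ifdef __HAVE_PMAP_DIRECT
	struct vm_page *pg;
	vaddr_t va;

	/* Grab a physical page; this can still fail if we are truly OOM. */
	pg = uvm_pagealloc(NULL, 0, NULL, UVM_PGA_ZERO);
	if (pg != NULL) {
		/* No kernel map entry, no map lock: use the direct map. */
		va = pmap_map_direct(pg);
		/* ... use the page at va ... */
		uvm_pagefree(pmap_unmap_direct(va));
	}
#endif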
> Alternatively, pool_get() shortages could be integrated into the page daemon.
> Instead of adding pmap_populate() we could simply check for pools needing a
> refill in the page daemon. The same approach already applies to uvm_anwait().
I don't think making the page daemon responsible for refilling the
pools is a good idea. It would mean the page daemon could block when
there is a KVA shortage. And isn't allocating more memory to fill the
pools the last thing you'd want to do when waking up the page daemon?
And yes, the uvm_anwait() thing is very similar to this case. In fact
my initial idea was for pmap_populate() to just refill the pools. But
then I realized that it made more sense to just populate the page
tables for the faulting address because we're going to refault on the
same address anyway.
> But IMHO fewer usages of pools in UVM and pmaps is the best way forward.
I used to think that as well, but these days the pools have some
performance-critical optimizations that are very relevant for our
pmaps.
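The per-CPU item cache is the main one here.  A pmap pool opts into it with
something like the following (just a sketch; the names match the pted pool
from the diff, but the IPL and flags are illustrative):

	pool_init(&pmap_pted_pool, sizeof(struct pte_desc), 0, IPL_VM, 0,
	    "pted", NULL);
	/* Enable the per-CPU cache so pool_get/pool_put can usually
	   avoid taking the pool mutex. */
	pool_cache_init(&pmap_pted_pool);

Without that cache every pool_get(9)/pool_put(9) in the hot pmap paths
would contend on the pool mutex.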
> > Index: arch/arm64/arm64/pmap.c
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
> > diff -u -p -r1.105 pmap.c
> > --- arch/arm64/arm64/pmap.c 9 Nov 2024 12:58:29 -0000 1.105
> > +++ arch/arm64/arm64/pmap.c 9 Jan 2025 21:44:41 -0000
> > @@ -443,6 +443,67 @@ pmap_vp_enter(pmap_t pm, vaddr_t va, str
> > return 0;
> > }
> >
> > +void
> > +pmap_vp_populate(pmap_t pm, vaddr_t va)
> > +{
> > + struct pte_desc *pted;
> > + struct pmapvp1 *vp1;
> > + struct pmapvp2 *vp2;
> > + struct pmapvp3 *vp3;
> > + void *vp;
> > +
> > + pted = pool_get(&pmap_pted_pool, PR_WAITOK | PR_ZERO);
> > + vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > +
> > + pmap_lock(pm);
> > +
> > + if (pm->have_4_level_pt) {
> > + vp1 = pm->pm_vp.l0->vp[VP_IDX0(va)];
> > + if (vp1 == NULL) {
> > + vp1 = vp; vp = NULL;
> > + pmap_set_l1(pm, va, vp1);
> > + }
> > + } else {
> > + vp1 = pm->pm_vp.l1;
> > + }
> > +
> > + if (vp == NULL) {
> > + pmap_unlock(pm);
> > + vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > + pmap_lock(pm);
> > + }
> > +
> > + vp2 = vp1->vp[VP_IDX1(va)];
> > + if (vp2 == NULL) {
> > + vp2 = vp; vp = NULL;
> > + pmap_set_l2(pm, va, vp1, vp2);
> > + }
> > +
> > + if (vp == NULL) {
> > + pmap_unlock(pm);
> > + vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > + pmap_lock(pm);
> > + }
> > +
> > + vp3 = vp2->vp[VP_IDX2(va)];
> > + if (vp3 == NULL) {
> > + vp3 = vp; vp = NULL;
> > + pmap_set_l3(pm, va, vp2, vp3);
> > + }
> > +
> > + if (vp3->vp[VP_IDX3(va)] == NULL) {
> > + vp3->vp[VP_IDX3(va)] = pted;
> > + pted = NULL;
> > + }
> > +
> > + pmap_unlock(pm);
> > +
> > + if (vp)
> > + pool_put(&pmap_vp_pool, vp);
> > + if (pted)
> > + pool_put(&pmap_pted_pool, pted);
> > +}
> > +
> > void *
> > pmap_vp_page_alloc(struct pool *pp, int flags, int *slowdown)
> > {
> > @@ -616,6 +677,11 @@ out:
> > return error;
> > }
> >
> > +void
> > +pmap_populate(pmap_t pm, vaddr_t va)
> > +{
> > + pmap_vp_populate(pm, va);
> > +}
> >
> > /*
> > * Remove the given range of mapping entries.
> > Index: arch/arm64/include/pmap.h
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/arm64/include/pmap.h,v
> > diff -u -p -r1.25 pmap.h
> > --- arch/arm64/include/pmap.h 11 Dec 2023 22:12:53 -0000 1.25
> > +++ arch/arm64/include/pmap.h 9 Jan 2025 21:44:41 -0000
> > @@ -123,6 +123,7 @@ struct pv_entry;
> > int pmap_fault_fixup(pmap_t, vaddr_t, vm_prot_t);
> >
> > #define __HAVE_PMAP_MPSAFE_ENTER_COW
> > +#define __HAVE_PMAP_POPULATE
> >
> > #endif /* _KERNEL && !_LOCORE */
> >
> > Index: uvm/uvm_fault.c
> > ===================================================================
> > RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
> > diff -u -p -r1.159 uvm_fault.c
> > --- uvm/uvm_fault.c 3 Jan 2025 15:31:48 -0000 1.159
> > +++ uvm/uvm_fault.c 9 Jan 2025 21:44:41 -0000
> > @@ -1105,8 +1105,12 @@ uvm_fault_upper(struct uvm_faultinfo *uf
> > /* XXX instrumentation */
> > return ENOMEM;
> > }
> > +#ifdef __HAVE_PMAP_POPULATE
> > + pmap_populate(ufi->orig_map->pmap, ufi->orig_rvaddr);
> > +#else
> > /* XXX instrumentation */
> > uvm_wait("flt_pmfail1");
> > +#endif
> > return ERESTART;
> > }
> >
> > @@ -1457,8 +1461,12 @@ uvm_fault_lower(struct uvm_faultinfo *uf
> > /* XXX instrumentation */
> > return (ENOMEM);
> > }
> > +#ifdef __HAVE_PMAP_POPULATE
> > + pmap_populate(ufi->orig_map->pmap, ufi->orig_rvaddr);
> > +#else
> > /* XXX instrumentation */
> > uvm_wait("flt_pmfail2");
> > +#endif
> > return ERESTART;
> > }
> >
> > Index: uvm/uvm_pmap.h
> > ===================================================================
> > RCS file: /cvs/src/sys/uvm/uvm_pmap.h,v
> > diff -u -p -r1.34 uvm_pmap.h
> > --- uvm/uvm_pmap.h 3 Apr 2024 18:43:32 -0000 1.34
> > +++ uvm/uvm_pmap.h 9 Jan 2025 21:44:41 -0000
> > @@ -179,6 +179,10 @@ vaddr_t pmap_steal_memory(vsize_t, vad
> > void pmap_virtual_space(vaddr_t *, vaddr_t *);
> > #endif
> >
> > +#if defined(__HAVE_PMAP_POPULATE)
> > +void pmap_populate(pmap_t, vaddr_t);
> > +#endif
> > +
> > /* nested pmaps are used in i386/amd64 vmm */
> > #ifndef pmap_nested
> > #define pmap_nested(pm) 0