What happens next?
Mark Kettenis <[email protected]> wrote:
> > Date: Fri, 10 Jan 2025 09:07:44 +0100
> > From: Martin Pieuchot <[email protected]>
> >
> > Hello,
> >
> > On 09/01/25(Thu) 22:49, Mark Kettenis wrote:
> > > > > > [...]
> > > > > > Any proposal on how we could proceed to find a solution for this
> > > > > > issue?
> > > > >
> > > > > The following hack fixes the issue for me. I don't think this is a
> > > > > proper solution, but it may be a starting point. Or a temporary fix.
> > > > >
> > > > > The issue really is that we can't tell whether pmap_enter(9) failed
> > > > > because we're out of physical memory, or if it failed for some other
> > > > > reason. In the case at hand we fail because of contention on the
> > > > > kernel map lock. But we could also be failing because we have
> > > > > completely run out of KVA.
> > > >
> > > > Works for me as well!
> > > >
> > > > > We can't sleep while holding all those uvm locks. I'm not sure the
> > > > > free memory check this does is right. Or whether we want such a check
> > > > > at all. The vm_map_lock()/vm_map_unlock() dance is necessary to make
> > > > > sure we don't spin too quickly if the kernel map lock is contended.
> > > > >
> > > > > A better fix would perhaps be to have a new pmap function that we
> > > > > could call at this spot that would sleep until the necessary resources
> > > > > are available. On arm64 this would populate the page tables using
> > > > > pool allocations that use PR_WAITOK, but would not actually enter a
> > > > > valid mapping. I'm going to explore that idea a bit.
> > >
> > > This seems to work. Even a little bit better I think as the number of
> > > vp pool fails seems to be a bit smaller with this diff.
> > >
> > > Thoughts?
> >
> > Thanks a lot for the diagnostic and diff.
> >
> > The fact that pools are used by pmaps with NOWAIT and that pools can only
> > be refilled with a sleeping context is very concerning.
> >
> > There are multiple ways to fix the issue. I'd prefer if we can work on a
> > fix that avoids re-faulting:
> >
> > dlg@ Is it possible to make pools used by pmaps special such that they
> > act as a thin wrapper around uvm_pagealloc(9)/uvm_pglistalloc(9)? This
> > would help in two ways:
> > - ensure pmap_enter(9) only returns an error if OOM
> > - benefit of the per-CPU page cache for all pmaps
>
> Not really. We still have plenty of physical memory, and allocating
> physical pages succeeds just fine. The problem here is allocating
> virtual address space.
>
> Here we have two options. Most pools that are used with PR_NOWAIT use
> the "interrupt safe" kernel map. When locking that map, we use
> mutexes, so it is safe to do so if we're in a context where we can't
> sleep. But the address space of the "interrupt safe" kernel map is
> small. So allocations may still fail if we run out of address space.
>
> Also, the "vp" items are fairly large (8K) and we use quite a lot of
> them building the page tables for all our processes. We would quickly
> fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> the normal kernel map. The normal map uses rwlocks to do the locking.
> So to avoid sleeping we opportunistically try to lock the map and fail
> if we don't succeed.
>
> This is a fundamental problem with all our kernel memory allocators.
> We have some mitigations in the form of the single page allocator (the
> uvm_km_pages stuff) and __HAVE_PMAP_DIRECT. But those are not
> guaranteed to work and virtual address allocations can still fail.
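[Editor's note: to make the failure mode described above concrete, a pool whose item pages come from the normal kernel map typically supplies a page allocator along the lines of the sketch below; with PR_NOWAIT it may only try-lock the rwlock-protected map, so pool_get() can return NULL under contention even though free physical pages exist. This is an illustrative sketch written against the km_alloc(9) interface; example_vp_page_alloc is a placeholder name, not the pmap_vp_page_alloc() in the tree.]

#include <sys/param.h>
#include <sys/pool.h>
#include <uvm/uvm_extern.h>

/*
 * Illustrative sketch only: a backing page allocator for a pool whose
 * items live in the normal (rwlock-protected) kernel map.
 */
void *
example_vp_page_alloc(struct pool *pp, int flags, int *slowdown)
{
        struct kmem_dyn_mode kd = KMEM_DYN_INITIALIZER;

        /* Sleep for KVA only if the caller may sleep (PR_WAITOK)... */
        kd.kd_waitok = ISSET(flags, PR_WAITOK);
        /* ...otherwise only try-lock the kernel map and give up on contention. */
        kd.kd_trylock = ISSET(flags, PR_NOWAIT);
        kd.kd_slowdown = slowdown;

        return km_alloc(pp->pr_pgsize, &kv_any, &kp_dirty, &kd);
}

[With such an allocator, a pool_get(..., PR_NOWAIT) that arrives while another CPU holds the kernel map lock comes back NULL, and pmap_enter(9) has no way to tell that apart from genuine memory exhaustion.]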
>
> > Alternatively, pool_get() shortages could be integrated into the page daemon.
> > Instead of adding pmap_populate() we could simply check for pools needing a
> > refill in the page daemon. The same already applies to uvm_anwait().
>
> I don't think making the page daemon responsible for refilling the
> pools is a good idea. It would mean the page daemon could block when
> there is a KVA shortage. And isn't allocating more memory to fill the
> pools the last thing you'd want to do when waking up the page daemon?
>
> And yes, the uvm_anwait() thing is very similar to this case. In fact
> my initial idea was for pmap_populate() to just refill the pools. But
> then I realized that it made more sense to just populate the page
> tables for the faulting address because we're going to refault on the
> same address anyway.
>
> > But IMHO fewer usages of pools in UVM and pmaps is the best way forward.
>
> I used to think that as well, but these days the pools have some
> performance-critical optimizations that are very relevant for our
> pmaps.
>
> > > Index: arch/arm64/arm64/pmap.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
> > > diff -u -p -r1.105 pmap.c
> > > --- arch/arm64/arm64/pmap.c 9 Nov 2024 12:58:29 -0000 1.105
> > > +++ arch/arm64/arm64/pmap.c 9 Jan 2025 21:44:41 -0000
> > > @@ -443,6 +443,67 @@ pmap_vp_enter(pmap_t pm, vaddr_t va, str
> > >          return 0;
> > >  }
> > >
> > > +void
> > > +pmap_vp_populate(pmap_t pm, vaddr_t va)
> > > +{
> > > +        struct pte_desc *pted;
> > > +        struct pmapvp1 *vp1;
> > > +        struct pmapvp2 *vp2;
> > > +        struct pmapvp3 *vp3;
> > > +        void *vp;
> > > +
> > > +        pted = pool_get(&pmap_pted_pool, PR_WAITOK | PR_ZERO);
> > > +        vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > > +
> > > +        pmap_lock(pm);
> > > +
> > > +        if (pm->have_4_level_pt) {
> > > +                vp1 = pm->pm_vp.l0->vp[VP_IDX0(va)];
> > > +                if (vp1 == NULL) {
> > > +                        vp1 = vp; vp = NULL;
> > > +                        pmap_set_l1(pm, va, vp1);
> > > +                }
> > > +        } else {
> > > +                vp1 = pm->pm_vp.l1;
> > > +        }
> > > +
> > > +        if (vp == NULL) {
> > > +                pmap_unlock(pm);
> > > +                vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > > +                pmap_lock(pm);
> > > +        }
> > > +
> > > +        vp2 = vp1->vp[VP_IDX1(va)];
> > > +        if (vp2 == NULL) {
> > > +                vp2 = vp; vp = NULL;
> > > +                pmap_set_l2(pm, va, vp1, vp2);
> > > +        }
> > > +
> > > +        if (vp == NULL) {
> > > +                pmap_unlock(pm);
> > > +                vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
> > > +                pmap_lock(pm);
> > > +        }
> > > +
> > > +        vp3 = vp2->vp[VP_IDX2(va)];
> > > +        if (vp3 == NULL) {
> > > +                vp3 = vp; vp = NULL;
> > > +                pmap_set_l3(pm, va, vp2, vp3);
> > > +        }
> > > +
> > > +        if (vp3->vp[VP_IDX3(va)] == NULL) {
> > > +                vp3->vp[VP_IDX3(va)] = pted;
> > > +                pted = NULL;
> > > +        }
> > > +
> > > +        pmap_unlock(pm);
> > > +
> > > +        if (vp)
> > > +                pool_put(&pmap_vp_pool, vp);
> > > +        if (pted)
> > > +                pool_put(&pmap_pted_pool, pted);
> > > +}
> > > +
> > >  void *
> > >  pmap_vp_page_alloc(struct pool *pp, int flags, int *slowdown)
> > >  {
> > > @@ -616,6 +677,11 @@ out:
> > >          return error;
> > >  }
> > >
> > > +void
> > > +pmap_populate(pmap_t pm, vaddr_t va)
> > > +{
> > > +        pmap_vp_populate(pm, va);
> > > +}
> > >
> > >  /*
> > >   * Remove the given range of mapping entries.
> > > Index: arch/arm64/include/pmap.h
> > > ===================================================================
> > > RCS file: /cvs/src/sys/arch/arm64/include/pmap.h,v
> > > diff -u -p -r1.25 pmap.h
> > > --- arch/arm64/include/pmap.h 11 Dec 2023 22:12:53 -0000 1.25
> > > +++ arch/arm64/include/pmap.h 9 Jan 2025 21:44:41 -0000
> > > @@ -123,6 +123,7 @@ struct pv_entry;
> > >  int pmap_fault_fixup(pmap_t, vaddr_t, vm_prot_t);
> > >
> > >  #define __HAVE_PMAP_MPSAFE_ENTER_COW
> > > +#define __HAVE_PMAP_POPULATE
> > >
> > >  #endif /* _KERNEL && !_LOCORE */
> > >
> > > Index: uvm/uvm_fault.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
> > > diff -u -p -r1.159 uvm_fault.c
> > > --- uvm/uvm_fault.c 3 Jan 2025 15:31:48 -0000 1.159
> > > +++ uvm/uvm_fault.c 9 Jan 2025 21:44:41 -0000
> > > @@ -1105,8 +1105,12 @@ uvm_fault_upper(struct uvm_faultinfo *uf
> > >                          /* XXX instrumentation */
> > >                          return ENOMEM;
> > >                  }
> > > +#ifdef __HAVE_PMAP_POPULATE
> > > +                pmap_populate(ufi->orig_map->pmap, ufi->orig_rvaddr);
> > > +#else
> > >                  /* XXX instrumentation */
> > >                  uvm_wait("flt_pmfail1");
> > > +#endif
> > >                  return ERESTART;
> > >          }
> > >
> > > @@ -1457,8 +1461,12 @@ uvm_fault_lower(struct uvm_faultinfo *uf
> > >                          /* XXX instrumentation */
> > >                          return (ENOMEM);
> > >                  }
> > > +#ifdef __HAVE_PMAP_POPULATE
> > > +                pmap_populate(ufi->orig_map->pmap, ufi->orig_rvaddr);
> > > +#else
> > >                  /* XXX instrumentation */
> > >                  uvm_wait("flt_pmfail2");
> > > +#endif
> > >                  return ERESTART;
> > >          }
> > >
> > > Index: uvm/uvm_pmap.h
> > > ===================================================================
> > > RCS file: /cvs/src/sys/uvm/uvm_pmap.h,v
> > > diff -u -p -r1.34 uvm_pmap.h
> > > --- uvm/uvm_pmap.h 3 Apr 2024 18:43:32 -0000 1.34
> > > +++ uvm/uvm_pmap.h 9 Jan 2025 21:44:41 -0000
> > > @@ -179,6 +179,10 @@ vaddr_t pmap_steal_memory(vsize_t, vad
> > >  void pmap_virtual_space(vaddr_t *, vaddr_t *);
> > >  #endif
> > >
> > > +#if defined(__HAVE_PMAP_POPULATE)
> > > +void pmap_populate(pmap_t, vaddr_t);
> > > +#endif
> > > +
> > >  /* nested pmaps are used in i386/amd64 vmm */
> > >  #ifndef pmap_nested
> > >  #define pmap_nested(pm) 0
> > >
> >
>
>
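[Editor's note: the core of pmap_vp_populate() in the diff above is the usual allocate-outside-the-lock pattern: each item is allocated with PR_WAITOK while the pmap lock is not held, installed under the lock only if the slot is still empty, and returned to the pool otherwise. A minimal sketch of that pattern follows; example_pool, example_mtx and example_slot are placeholders, not kernel identifiers.]

#include <sys/param.h>
#include <sys/mutex.h>
#include <sys/pool.h>

struct pool example_pool;       /* placeholder; assume pool_init() done elsewhere */
struct mutex example_mtx = MUTEX_INITIALIZER(IPL_VM);
void *example_slot;             /* stands in for a vp1/vp2/vp3/pted slot */

void
example_populate(void)
{
        void *spare;

        /* Allocate while sleeping is still allowed. */
        spare = pool_get(&example_pool, PR_WAITOK | PR_ZERO);

        mtx_enter(&example_mtx);
        if (example_slot == NULL) {
                /* We won the race: install the item. */
                example_slot = spare;
                spare = NULL;
        }
        mtx_leave(&example_mtx);

        /* Someone else populated the slot first: give the spare back. */
        if (spare != NULL)
                pool_put(&example_pool, spare);
}

[pmap_vp_populate() repeats this dance per page-table level; since only one spare "vp" item is carried at a time, it drops pmap_lock(), refills with PR_WAITOK and re-takes the lock between levels, and whatever is left over at the end is simply pool_put() back.]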
