> Date: Sat, 18 Jan 2025 16:44:38 +0100
> From: Martin Pieuchot <[email protected]>
> 
> On 18/01/25(Sat) 13:44, Mark Kettenis wrote:
> > > Date: Fri, 17 Jan 2025 16:39:12 +0100
> > > From: Martin Pieuchot <[email protected]>
> > > 
> > > Thanks for your answer Mark!
> > > 
> > > On 17/01/25(Fri) 15:12, Mark Kettenis wrote:
> > > > >  [...]
> > > > > > Not really.  We still have plenty of physical memory, and allocating
> > > > > > physical pages succeeds just fine.  The problem here is allocating
> > > > > > virtual address space.
> > > > > 
> > > > > So is it potentially a problem on all architectures without direct 
> > > > > map?
> > > > 
> > > > Yes.  And maybe even on architectures *with* direct map as the direct
> > > > map can't be used to map multiple pages.
> > > 
> > > Do we have a pmap_enter() that fits this description?
> > 
> > Several pmaps use pools that don't explicitly ask for the single-page
> > pool page allocator.  But the size of the pool items is probably small
> > enough that they still allocate single pages.
> > 
> > > > > > Here we have two options.  Most pools that are used with PR_NOWAIT 
> > > > > > use
> > > > > > the "interrupt safe" kernel map.  When locking that map, we use
> > > > > > mutexes, so it is safe to do so if we're in a context where we can't
> > > > > > sleep.  But the address space of the "interrupt safe" kernel map is
> > > > > > small.  So allocations may still fail if we run out of address 
> > > > > > space.
> > > > > 
> > > > > The way I understand the problem is that pmap implementations with 
> > > > > direct
> > > > > map only need (physical) page allocations.  The current fault handler
> > > > > works with them.
> > > > 
> > > > Yes, but note that even the amd64 pmap uses pools and needs to use
> > > > pool_allocator_single to make sure that pool pages can indeed be
> > > > mapped through the direct map.
> > > 
> > > Is there a problem with that?
> > 
> > Nope that's fine.  Other pmaps should probably do the same.
> > 
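For context, the amd64 pmap pins its pool pages to single pages by passing the
single-page back-end allocator at pool creation time.  A minimal sketch,
assuming the usual pool_init(9) signature; the pool and item names are just
borrowed from the pmaps discussed in this thread:

        /*
         * Request the single-page back-end so that every pool page is a
         * single physical page, mappable through the direct map (or the
         * uvm_km_pages free list on architectures without one).
         */
        extern struct pool_allocator pool_allocator_single;

        pool_init(&pmap_pted_pool, sizeof(struct pte_desc), 0, IPL_VM, 0,
            "pted", &pool_allocator_single);
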
> > > > > Without a direct map the physical memory needs to be mapped in the 
> > > > > kernel
> > > > > space, and there is currently no guarantee that this can be done without
> > > > > sleeping.
> > > > 
> > > > Right.  And such a guarantee would be very hard to implement as there
> > > > is always the risk of simply running out of kernel virtual address
> > > > space.
> > > 
> > > What happens in such a case?  Is recovery possible?
> > 
> > With the current code that calls uvm_wait(), you'll probably end up in
> > the scenario where you continuously wake up the page daemon.
> > 
> > With the pmap_populate() approach, I believe we'll end up waiting in
> > pool_get() until some other process ends up returning items to the
> > pool.  That probably gives us a better chance to recover.
> > 
> > > > > > Also, the "vp" items are fairly large (8K) and we use quite a lot of
> > > > > > them building the page tables for all our processes.  We would 
> > > > > > quickly
> > > > > > fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> > > > > > the normal kernel map.  The normal map uses rwlocks to do the 
> > > > > > locking.
> > > > > > So to avoid sleeping we opportunistically try to lock the map and 
> > > > > > fail
> > > > > > if we don't succeed.
> > > > > 
> > > > > That's a nice hack.
> > > > > 
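For context, the opportunistic locking lives in the pool's page allocator: when
pool_get() is called with PR_NOWAIT, the back-end asks km_alloc(9) to only
try-lock the kernel map instead of sleeping on it.  A sketch of such a back-end,
mirroring what the arm64 pmap does; details may differ from the actual source:

        void *
        pmap_vp_page_alloc(struct pool *pp, int flags, int *slowdown)
        {
                struct kmem_dyn_mode kd = KMEM_DYN_INITIALIZER;

                /* Sleep only if the pool_get() caller is allowed to. */
                kd.kd_waitok = ISSET(flags, PR_WAITOK);
                /* Otherwise just try to take the kernel map's rwlock. */
                kd.kd_trylock = ISSET(flags, PR_NOWAIT);
                kd.kd_slowdown = slowdown;

                return km_alloc(pp->pr_pgsize, &kv_any, &kp_dirty, &kd);
        }
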
> > > > > > This is a fundamental problem with all our kernel memory allocators.
> > > > > 
> > > > > Maybe.  I don't know.  However here we're interested in a very special
> > > > > case of allocations.  Currently it seems to be only a problem on 
> > > > > arm64.
> > > > > Is there another pmap_enter() that allocates more than one physical 
> > > > > page?
> > > > > If not, are we currently saved by km_alloc(9)'s free list?
> > > > 
> > > > We do the same thing on riscv64.  But even on architectures that only
> > > > allocate a single page km_alloc(9)'s free list doesn't really save us.
> > > > There is only a limited number of addresses on the free list, so you
> > > > can run out of those.  And then you need to poke the thread to
> > > > allocate more.
> > > 
> > > Are you suggesting we add a pmap_populate() for all non-direct-map
> > > architectures?
> > 
> > Maybe we should even add it to all our architectures.
> 
> I agree.
> 
> > > > > > We have some mitigations in the form of the single page allocator 
> > > > > > (the
> > > > > > uvm_km_pages stuff) and __HAVE_PMAP_DIRECT.  But those are not
> > > > > > guaranteed to work and virtual address allocations can still fail.
> > > > > 
> > > > > I understand.  However I'm not convinced we should embrace the 
> > > > > existing
> > > > > behavior.  Currently pmap_enter(9) can fail because it can't map 
> > > > > pages
> > > > > in kernel space.  Can we fix this?  Should we fix it?
> > > > 
> > > > As I said, this is a fundamental problem.  We can add more hacks to
> > > > reduce the risk of failing because we can't map pages in kernel space.
> > > > But we can never eliminate it completely.
> > > 
> > > Fine.
> > > 
> > > > > The alternative is to add another error case to pmap_enter(9) from
> > > > > which we could recover.  Could this lead to some starvation?
> > > > 
> > > > Not easily since our kernel allocators don't tell us whether
> > > > allocation failed because we couldn't allocate physical memory or if
> > > > we couldn't map that physical memory.
> > > > 
> > > > > Could this explain the contention on the kernel object rwlock I am
> > > > > seeing on arm64?
> > > > 
> > > > Shouldn't be.  The kernel object lock only comes into play when we
> > > > allocate pageable kernel memory.  And if our pmaps would be allocating
> > > > pageable memory we'd have a big problem...
> > > > 
> > > > > > > Alternatively we can see pool_get() shortages integrated in the 
> > > > > > > page daemon.
> > > > > > > Instead of adding pmap_populate() we could simply check for pools 
> > > > > > > needing a
> > > > > > > refill in the page daemon.  The same already applies to 
> > > > > > > uvm_anwait().
> > > > > > 
> > > > > > I don't think making the page daemon responsible for refilling the
> > > > > > pools is a good idea.  It would mean the page daemon could block 
> > > > > > when
> > > > > > there is a KVA shortage.  And isn't allocating more memory to fill 
> > > > > > the
> > > > > > pools the last thing you'd want to do when waking up the page 
> > > > > > daemon?
> > > > > 
> > > > > This needs more explanation from my side.  First of all, the "page 
> > > > > daemon"
> > > > > is now more of a "shortage thread".  It might free buffers as well.
> > > > 
> > > > The only reason it frees buffers is to free up physical memory!  I
> > > > don't think it even frees up KVA since the buffer cache uses its own
> > > > submap to map buffers.
> > > > 
> > > > > The logic behind the fault handler is that it calls uvm_wait() when
> > > > > it encounters an error which corresponds to a shortage of memory.
> > > > > Nowadays there are more errors than a shortage of memory, and memory
> > > > > can be freed not just by swapping pages.  So I'm wondering if we
> > > > > shouldn't simply look at which pools failed to allocate in the same
> > > > > context.  Doing so would allow us to get rid of uvm_anwait() as well
> > > > > since uvm_wait() would wake up the daemon which would tell whatever
> > > > > pool mechanism to refill itself.
> > > > 
> > > > Well, uvm_wait() is a bit of a dumb hammer.  You poke the page daemon
> > > > and hope it frees up the right resources.  And by the time you retry
> > > > something else may already have consumed those resources.
> > > > 
> > > > My approach is much more robust.  It allocates the specific resources
> > > > that are needed to handle the page fault and populates the page
> > > > tables.  So we can be sure by the time we retry the resources are
> > > > there.
> > > > 
> > > > > I don't like the fragmentation of memory management and the growing 
> > > > > complexity of error code paths.  I wish we could refrain from adding
> > > > > more layers to the allocations inside UVM.
> > > > 
> > > > I would not call this another layer of memory management.
> > > > 
> > > > > > And yes, the uvm_anwait() thing is very similar to this case.  In 
> > > > > > fact
> > > > > > my initial idea was for pmap_populate() to just refill the pools.  
> > > > > > But
> > > > > > then I realized that it made more sense to just populate the page
> > > > > > tables for the faulting address because we're going to refault on 
> > > > > > the
> > > > > > same address anyway.
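For context, a rough sketch of the fault-handler side being described here:
when pmap_enter(9) fails with PMAP_CANFAIL, the handler either pre-populates
the page tables for the faulting address or pokes the page daemon, and then
restarts the fault.  Variable names and the wait channel are illustrative, not
the actual uvm_fault() code:

        error = pmap_enter(map->pmap, va, VM_PAGE_TO_PHYS(pg),
            access_type, access_type | PMAP_CANFAIL);
        if (error != 0) {
                /* ... drop the fault locks before sleeping ... */
#ifdef __HAVE_PMAP_POPULATE
                /*
                 * Allocate the page-table pages for 'va' now, sleeping if
                 * necessary, so the refault is guaranteed to find them.
                 */
                pmap_populate(map->pmap, va);
#else
                if (uvm_swapisfull())
                        return ENOMEM;
                /* Poke the page daemon and hope it frees the right thing. */
                uvm_wait("flt_pmfail");
#endif
                return ERESTART;        /* refault the same address */
        }
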
> > > > > 
> > > > > This is a clever hack.  I'm not opposed to it.  I wonder if with the
> > > > > current understanding of the different pmaps this is our best shot.
> > > > > 
> > > > > > > But IMHO fewer usages of pools in UVM and pmaps is the best way 
> > > > > > > forward.
> > > > > > 
> > > > > > I used to think that as well, but these days the pools have some
> > > > > > performance-critical optimizations that are very relevant for our
> > > > > > pmaps.
> > > > > 
> > > > > Maybe, but I wouldn't be so sure.  My observations with more parts of
> > > > > UVM running in parallel show a lot of contention on pmap pools.  There
> > > > > are definitely some nice optimizations, but I believe this is just the
> > > > > beginning of some non-trivial work in this area.
> > > > 
> > > > Ultimately we will need to make the address allocators run in parallel
> > > > in some way.
> > > > 
> > > > > To summarize, arm64 is doing 8K allocations inside pmap_enter(9).  It 
> > > > > uses
> > > > > uvm_map(9) via km_alloc(9) via pool_get(9) to map these pages in 
> > > > > kernel
> > > > > space.  The larger kernel_map is protected via a rw_lock and is 
> > > > > apparently
> > > > > contended, which prevents pmap_enter(9) from succeeding.
> > > > > 
> > > > > - Can we tweak the `uvm_km_pages' mechanism to be able to work with 8K
> > > > > allocations on arm64?
> > > > 
> > > > Yes, but it still wouldn't solve the fundamental problem and we'd
> > > > probably waste resources.
> > > > 
> > > > > - Or can we change the arm64 pmap to not do allocations bigger than a
> > > > > page?
> > > > 
> > > > Yes, but it still wouldn't solve the fundamental problem.
> > > > 
> > > > > Either of these two would reduce the differences between non-direct
> > > > > and direct map pmaps.  If we could ensure uvm_map() is not called from
> > > > > within pmap_enter(), that would help me sleep better at night.
> > > > 
> > > > I don't think that is achievable.
> > > 
> > > Well, it is if we use `uvm_km_pages' or limit pmap_enter() allocations to
> > > page-sized ones.
> > 
> > True.  And this may help us avoid contention on the kernel map lock.
> > And it might be relatively easy to turn uvm_km_pages into some sort of
> > a per-cpu cache.  I'll have another look at what needs to change in
> > the arm64 pmap to achieve this.
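To make that idea concrete, a purely hypothetical sketch of a per-CPU cache of
pre-mapped single pages (all names are invented for illustration; this is not
existing code):

        /* Hypothetical per-CPU stash of already-mapped kernel pages. */
        struct km_percpu_cache {
                struct mutex    kpc_mtx;   /* a true per-CPU cache may not need this */
                int             kpc_nfree;
                vaddr_t         kpc_page[8];
        };

        vaddr_t
        km_percpu_getpage(struct km_percpu_cache *kpc)
        {
                vaddr_t va = 0;

                mtx_enter(&kpc->kpc_mtx);
                if (kpc->kpc_nfree > 0)
                        va = kpc->kpc_page[--kpc->kpc_nfree];
                mtx_leave(&kpc->kpc_mtx);

                /*
                 * On a miss (va == 0) the caller falls back to km_alloc(9)
                 * or wakes a thread to refill the cache.
                 */
                return va;
        }
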
> 
> I came to the same conclusion.  I'd appreciate seeing this happening.
> 
> > > > > Which other pmaps could suffer from a shortage of `uvm_km_pages'?
> > > > 
> > > > armv7, arm64, powerpc64, riscv64, sparc64
> > > 
> > > So I now believe your fix is what we want.  One thing we did not
> > > consider is whether uvm_map() failed because of some contention.  But I
> > > believe we can wait until this becomes a bottleneck.
> > > 
> > > I'd go even further, remove the #ifdef and replace uvm_swapisfull() +
> > > uvm_wait() by pmap_populate().
> > > 
> > > Thanks for the fix and sharing your knowledge with me.
> > 
> > is that an ok for the diff?
> 
> It is, ok mpi@

Since riscv64 suffers from the same issue, here is a diff that
implements pmap_populate() there.

ok?

Index: arch/riscv64/include/pmap.h
===================================================================
RCS file: /cvs/src/sys/arch/riscv64/include/pmap.h,v
diff -u -p -r1.12 pmap.h
--- arch/riscv64/include/pmap.h 6 Apr 2024 18:33:54 -0000       1.12
+++ arch/riscv64/include/pmap.h 18 Jan 2025 18:52:44 -0000
@@ -114,6 +114,8 @@ int pmap_fault_fixup(pmap_t, vaddr_t, vm
 void   pmap_postinit(void);
 void   pmap_init_percpu(void);
 
+#define __HAVE_PMAP_POPULATE
+
 #endif /* _KERNEL && !_LOCORE */
 
 #ifndef _LOCORE
Index: arch/riscv64/riscv64/pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/riscv64/riscv64/pmap.c,v
diff -u -p -r1.43 pmap.c
--- arch/riscv64/riscv64/pmap.c 20 Nov 2024 22:25:38 -0000      1.43
+++ arch/riscv64/riscv64/pmap.c 18 Jan 2025 18:52:44 -0000
@@ -390,6 +390,53 @@ pmap_vp_enter(pmap_t pm, vaddr_t va, str
        return 0;
 }
 
+void
+pmap_vp_populate(pmap_t pm, vaddr_t va)
+{
+       struct pte_desc *pted;
+       struct pmapvp1 *vp1;
+       struct pmapvp2 *vp2;
+       struct pmapvp3 *vp3;
+       void *vp;
+
+       pted = pool_get(&pmap_pted_pool, PR_WAITOK | PR_ZERO);
+       vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
+
+       pmap_lock(pm);
+
+       vp1 = pm->pm_vp.l1;
+
+       vp2 = vp1->vp[VP_IDX1(va)];
+       if (vp2 == NULL) {
+               vp2 = vp; vp = NULL;
+               pmap_set_l2(pm, va, vp2, 0);
+       }
+       
+       if (vp == NULL) {
+               pmap_unlock(pm);
+               vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);
+               pmap_lock(pm);
+       }
+
+       vp3 = vp2->vp[VP_IDX2(va)];
+       if (vp3 == NULL) {
+               vp3 = vp; vp = NULL;
+               pmap_set_l3(pm, va, vp3, 0);
+       }
+
+       if (vp3->vp[VP_IDX3(va)] == NULL) {
+               vp3->vp[VP_IDX3(va)] = pted;
+               pted = NULL;
+       }
+
+       pmap_unlock(pm);
+
+       if (vp)
+               pool_put(&pmap_vp_pool, vp);
+       if (pted)
+               pool_put(&pmap_pted_pool, pted);
+}
+
 void *
 pmap_vp_page_alloc(struct pool *pp, int flags, int *slowdown)
 {
@@ -563,6 +610,11 @@ out:
        return error;
 }
 
+void
+pmap_populate(pmap_t pm, vaddr_t va)
+{
+       pmap_vp_populate(pm, va);
+}
 
 /*
  * Remove the given range of mapping entries.
