> Date: Fri, 17 Jan 2025 16:39:12 +0100
> From: Martin Pieuchot <[email protected]>
> 
> Thanks for your answer Mark!
> 
> On 17/01/25(Fri) 15:12, Mark Kettenis wrote:
> > >  [...]
> > > > Not really.  We still have plenty of physical memory, and allocating
> > > > physical pages succeeds just fine.  The problem here is allocating
> > > > virtual address space.
> > > 
> > > So is it potentially a problem on all architectures without direct map?
> > 
> > Yes.  And maybe even on architectures *with* direct map as the direct
> > map can't be used to map multiple pages.
> 
> Do we have a pmap_enter() that fits this description?

Several pmaps use pools that don't explicitly ask for the single-page
pool page allocator.  But the size of the pool items is probably small
enough that they still allocate single pages.

> > > > Here we have two options.  Most pools that are used with PR_NOWAIT use
> > > > the "interrupt safe" kernel map.  When locking that map, we use
> > > > mutexes, so it is safe to do so if we're in a context where we can't
> > > > sleep.  But the address space of the "interrupt safe" kernel map is
> > > > small.  So allocations may still fail if we run out of address space.
> > > 
> > > The way I understand the problem is that pmap implementations with a
> > > direct map only need (physical) page allocations.  The current fault
> > > handler works with them.
> > 
> > Yes, but note that even the amd64 pmap uses pools and needs to use
> > pool_allocator_single to make sure that pool pages can indeed be
> > mapped through the direct map.
> 
> Is there a problem with that?

Nope that's fine.  Other pmaps should probably do the same.
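
Roughly like this, for the record (an untested sketch; the pool name,
item type and ipl are placeholders, each pmap would use its own):

	#include <sys/param.h>
	#include <sys/pool.h>

	struct pool pmap_pted_pool;

	void
	pmap_init_pools(void)
	{
		/*
		 * Passing &pool_allocator_single makes the pool back its
		 * items with single-page allocations, which a direct-map
		 * pmap can reach without touching the kernel map.
		 */
		pool_init(&pmap_pted_pool, sizeof(struct pte_desc), 0,
		    IPL_VM, 0, "pted", &pool_allocator_single);
	}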

> > > Without a direct map the physical memory needs to be mapped into
> > > kernel space and there is currently no guarantee that this can be
> > > done without sleeping.
> > 
> > Right.  And such a guarantee would be very hard to implement as there
> > is always the risk of simply running out of kernel virtual address
> > space.
> 
> What happens in such a case?  Is recovery possible?

With the current code that calls uvm_wait(), you'll probably end up in
the scenario where you continuously wake up the page daemon.

With the pmap_populate() approach, I believe we'll end up waiting in
pool_get() until some other process ends up returning items to the
pool.  That probably gives us a better chance to recover.
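
For reference, the shape of the fault handler error path we are talking
about is roughly this (simplified, not the literal diff; variable names
approximate the surrounding uvm_fault() code, and I'm writing the feature
test as __HAVE_PMAP_POPULATE):

	error = pmap_enter(ufi->orig_map->pmap, ufi->orig_rvaddr,
	    VM_PAGE_TO_PHYS(pg), prot, access_type | PMAP_CANFAIL);
	if (error) {
	#ifdef __HAVE_PMAP_POPULATE
		/*
		 * Allocate the page table pages for the faulting address
		 * now, sleeping in pool_get() if needed, so that the
		 * refault cannot fail for the same reason.
		 */
		pmap_populate(ufi->orig_map->pmap, ufi->orig_rvaddr);
	#else
		/* Poke the page daemon and hope it frees the right thing. */
		if (uvm_swapisfull())
			return ENOMEM;
		uvm_wait("flt_pmfail");
	#endif
		return ERESTART;
	}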

> > > > Also, the "vp" items are fairly large (8K) and we use quite a lot of
> > > > them building the page tables for all our processes.  We would quickly
> > > > fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> > > > the normal kernel map.  The normal map uses rwlocks to do the locking.
> > > > So to avoid sleeping we opportunistically try to lock the map and fail
> > > > if we don't succeed.
> > > 
> > > That's a nice hack.
> > > 
> > > > This is a fundamental problem with all our kernel memory allocators.
> > > 
> > > Maybe.  I don't know.  However here we're interested in a very special
> > > case of allocations.  Currently it seems to be a problem only on arm64.
> > > Is there another pmap_enter() that allocates more than one physical page?
> > > If not, are we currently saved by km_alloc(9)'s free list?
> > 
> > We do the same thing on riscv64.  But even on architectures that only
> > allocate a single page km_alloc(9)'s free list doesn't really save us.
> > There is only a limited number of addresses on the free list, so you
> > can run out of those.  And then you need to poke the thread to
> > allocate more.
> 
> Are you suggesting we add a pmap_populate() for all non direct map
> architectures?

Maybe we should even add it to all our architectures.
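
For a pmap that gets its page table pages from a pool, the core of it
could be as small as something like this (pmap_vp_populate() is a made-up
helper name; each pmap would walk and fill its own intermediate levels):

	void
	pmap_populate(pmap_t pm, vaddr_t va)
	{
		/*
		 * Allocate whatever intermediate page table pages are
		 * still missing for va, this time with PR_WAITOK, so
		 * that pmap_enter() on the refault finds them in place.
		 */
		pmap_vp_populate(pm, va, PR_WAITOK);
	}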

> > > > We have some mitigations in the form of the single page allocator (the
> > > > uvm_km_pages stuff) and __HAVE_PMAP_DIRECT.  But those are not
> > > > guaranteed to work and virtual address allocations can still fail.
> > > 
> > > I understand.  However I'm not convinced we should embrace the existing
> > > behavior.  Currently  pmap_enter(9) can fail because it can't map pages
> > > in kernel space.  Can we fix this?   Should we fix it?
> > 
> > As I said, this is a fundamental problem.  We can add more hacks to
> > reduce the risk of failing because we can't map pages in kernel space.
> > But we can never eliminate it completely.
> 
> Fine.
> 
> > > The alternative is to add another error case to pmap_enter(9) that
> > > could be recovered from.  Could this lead to some starvation?
> > 
> > Not easily since our kernel allocators don't tell us whether
> > allocation failed because we couldn't allocate physical memory or if
> > we couldn't map that physical memory.
> > 
> > > Could this explain the contention on the kernel object rwlock I am
> > > seeing on arm64?
> > 
> > Shouldn't be.  The kernel object lock only comes into play when we
> > allocate pageable kernel memory.  And if our pmaps were allocating
> > pageable memory we'd have a big problem...
> > 
> > > > > Alternatively we could integrate pool_get() shortages into the
> > > > > page daemon.  Instead of adding pmap_populate() we could simply
> > > > > check for pools needing a refill in the page daemon.  The same
> > > > > already applies to uvm_anwait().
> > > > 
> > > > I don't think making the page daemon responsible for refilling the
> > > > pools is a good idea.  It would mean the page daemon could block when
> > > > there is a KVA shortage.  And isn't allocating more memory to fill the
> > > > pools the last thing you'd want to do when waking up the page daemon?
> > > 
> > > This needs more explanation from my side.  First of all, the "page daemon"
> > > is now more of a "shortage thread".  It might free buffers as well.
> > 
> > The only reason it frees buffers is to free up physical memory!  I
> > don't think it even frees up KVA since the buffer cache uses its own
> > submap to map buffers.
> > 
> > > The logic behind the fault handler is that it calls uvm_wait() when
> > > it encounters an error that corresponds to a shortage of memory.
> > > Nowadays there are more causes of error than a shortage of memory,
> > > and memory can be freed by more than just swapping pages.  So I'm
> > > wondering if we shouldn't simply look at which pools failed to
> > > allocate in the same context.  Doing so would also allow us to get
> > > rid of uvm_anwait(), since uvm_wait() would wake up the daemon, which
> > > would tell whatever pool mechanism to refill itself.
> > 
> > Well, uvm_wait() is a bit of a dumb hammer.  You poke the page daemon
> > and hope it frees up the right resources.  And by the time you retry
> > something else may already have consumed those resources.
> > 
> > My approach is much more robust.  It allocates the specific resources
> > that are needed to handle the page fault and populates the page
> > tables.  So we can be sure by the time we retry the resources are
> > there.
> > 
> > > I don't like the fragmentation of memory management and the growing
> > > complexity of the error code paths.  I wish we could refrain from
> > > adding more layers to the allocations inside UVM.
> > 
> > I would not call this another layer of memory management.
> > 
> > > > And yes, the uvm_anwait() thing is very similar to this case.  In fact
> > > > my initial idea was for pmap_populate() to just refill the pools.  But
> > > > then I realized that it made more sense to just populate the page
> > > > tables for the faulting address because we're going to refault on the
> > > > same address anyway.
> > > 
> > > This is a clever hack.  I'm not opposed to it.  I wonder if with the
> > > current understanding of the different pmaps this is our best shot.
> > > 
> > > > > But IMHO fewer usages of pools in UVM and pmaps is the best way 
> > > > > forward.
> > > > 
> > > > I used to think that as well, but these days the pools have some
> > > > performance-critical optimizations that are very relevant for our
> > > > pmaps.
> > > 
> > > Maybe, but I wouldn't be so sure.  My observations with more parts of
> > > UVM running in parallel show a lot of contention on the pmap pools.
> > > There are definitely some nice optimizations, but I believe this is
> > > just the beginning of some non-trivial work in this area.
> > 
> > Ultimately we will need to make the address allocators run in parallel
> > in some way.
> > 
> > > To summarize, arm64 is doing 8K allocations inside pmap_enter(9).  It
> > > reaches uvm_map(9) via km_alloc(9) via pool_get(9) to map these pages
> > > in kernel space.  The larger kernel_map is protected by a rwlock and is
> > > apparently contended, which prevents pmap_enter(9) from succeeding.
> > > 
> > > - Can we tweak the `uvm_km_pages' mechanism to be able to work with 8K
> > > allocations on arm64?
> > 
> > Yes, but it still wouldn't solve the fundamental problem and we'd
> > probably waste resources.
> > 
> > > - Or can we change the arm64 pmap to not do allocations bigger than a
> > > page?
> > 
> > Yes, but it still wouldn't solve the fundamental problem.
> > 
> > > Either of these two would reduce the differences between non-direct-map
> > > and direct-map pmaps.  If we could ensure uvm_map() is not called from
> > > within pmap_enter() that would help me sleep better at night.
> > 
> > I don't think that is achievable.
> 
> Well, it is if we use `uvm_km_pages' or limit pmap_enter() allocations
> to page-sized ones.

True.  And this may help us avoid contention on the kernel map lock.
And it might be relatively easy to turn uvm_km_pages into some sort of
a per-cpu cache.  I'll have another look at what needs to change in
the arm64 pmap to achieve this.
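
Just to sketch the per-CPU idea (all names made up; today uvm_km_pages
is a single global list of pre-mapped single-page VAs that a kernel
thread refills, so the change is mostly splitting that list up):

	/* Small per-CPU stash of pre-mapped single-page kernel VAs. */
	struct uvm_km_cpu_cache {
		struct mutex	kc_mtx;
		vaddr_t		kc_page[8];
		int		kc_free;
	};

	vaddr_t
	uvm_km_getpage_cpu(struct uvm_km_cpu_cache *kc)
	{
		vaddr_t va = 0;

		mtx_enter(&kc->kc_mtx);
		if (kc->kc_free > 0)
			va = kc->kc_page[--kc->kc_free];
		mtx_leave(&kc->kc_mtx);

		/*
		 * On a miss the caller still falls back to the global
		 * list and pokes the thread to refill, as today.
		 */
		return va;
	}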

> > > Which other pmaps could suffer from a shortage of `uvm_km_pages'?
> > 
> > armv7, arm64, powerpc64, riscv64, sparc64
> 
> So I now believe your fix is what we want.  One thing we did not
> consider is whether uvm_map() might fail because of some contention.
> But I believe we can wait until this becomes a bottleneck.
> 
> I'd go even further: remove the #ifdef and replace uvm_swapisfull() +
> uvm_wait() with pmap_populate().
> 
> Thanks for the fix and sharing your knowledge with me.

Is that an ok for the diff?
