On 18/01/25(Sat) 13:44, Mark Kettenis wrote:
> > Date: Fri, 17 Jan 2025 16:39:12 +0100
> > From: Martin Pieuchot <[email protected]>
> > 
> > Thanks for your answer Mark!
> > 
> > On 17/01/25(Fri) 15:12, Mark Kettenis wrote:
> > > >  [...]
> > > > > Not really.  We still have plenty of physical memory, and allocating
> > > > > physical pages succeeds just fine.  The problem here is allocating
> > > > > virtual address space.
> > > > 
> > > > So is it potentially a problem on all architectures without direct map?
> > > 
> > > Yes.  And maybe even on architectures *with* direct map as the direct
> > > map can't be used to map multiple pages.
> > 
> > Do we have a pmap_enter() that fits this description?
> 
> Several pmaps use pools that don't explicitly ask for the single-page
> pool page allocator.  But the size of the pool items is probably small
> enough that they still allocate single pages.
> 
> > > > > Here we have two options.  Most pools that are used with PR_NOWAIT use
> > > > > the "interrupt safe" kernel map.  When locking that map, we use
> > > > > mutexes, so it is safe to do so if we're in a context where we can't
> > > > > sleep.  But the address space of the "interrupt safe" kernel map is
> > > > > small.  So allocations may still fail if we run out of address space.
> > > > 
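
For illustration, a non-sleeping allocation out of the interrupt-safe
map looks roughly like this (a sketch, not the actual pool page
allocator code; the function name is made up):

void *
intrsafe_page_alloc(void)
{
        /*
         * kd_nowait: never sleep.  This can fail once the intrsafe
         * map has run out of address space, even if plenty of
         * physical memory is still available.
         */
        return (km_alloc(PAGE_SIZE, &kv_intrsafe, &kp_dirty, &kd_nowait));
}
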
> > > > The way I understand the problem is that pmap implementations with a
> > > > direct map only need (physical) page allocations.  The current fault
> > > > handler works with them.
> > > 
> > > Yes, but note that even the amd64 pmap uses pools and needs to use
> > > pool_allocator_single to make sure that pool pages can indeed be
> > > mapped through the direct map.
> > 
> > Is there a problem with that?
> 
> Nope that's fine.  Other pmaps should probably do the same.
> 
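
For reference, a pool set up that way looks roughly like this (a
sketch, not the actual amd64 code; the pool name and item size are
made up):

#include <sys/pool.h>

struct pool pmap_pv_pool;

void
pmap_pool_init(void)    /* e.g. called from pmap_init() */
{
        /*
         * Items are well below PAGE_SIZE, so single pages are enough,
         * and pool_allocator_single keeps every pool page mappable
         * through the direct map.
         */
        pool_init(&pmap_pv_pool, 32 /* item size */, 0, IPL_VM, 0,
            "pvpl", &pool_allocator_single);
}
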
> > > > Without a direct map the physical memory needs to be mapped in kernel
> > > > space, and there is currently no guarantee that this can be done
> > > > without sleeping.
> > > 
> > > Right.  And such a guarantee would be very hard to implement as there
> > > is always the risk of simply running out of kernel virtual address
> > > space.
> > 
> > What happens in such a case?  Is recovery possible?
> 
> With the current code that calls uvm_wait(), you'll probably end up in
> the scenario where you continuously wake up the page daemon.
> 
> With the pmap_populate() approach, I believe we'll end up waiting in
> pool_get() until some other process ends up returning items to the
> pool.  That probably gives us a better chance to recover.
> 
> > > > > Also, the "vp" items are fairly large (8K) and we use quite a lot of
> > > > > them building the page tables for all our processes.  We would quickly
> > > > > fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> > > > > the normal kernel map.  The normal map uses rwlocks to do the locking.
> > > > > So to avoid sleeping we opportunistically try to lock the map and fail
> > > > > if we don't succeed.
> > > > 
> > > > That's a nice hack.
> > > > 
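
The trick is roughly the following (a sketch, not the actual
uvm_map.c code):

        /*
         * When the caller cannot sleep, only try-lock the map and
         * report failure instead of waiting on the rwlock.
         */
        if (flags & UVM_FLAG_TRYLOCK) {
                if (rw_enter(&map->lock, RW_WRITE | RW_NOSLEEP) != 0)
                        return (ENOMEM);
        } else
                vm_map_lock(map);
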
> > > > > This is a fundamental problem with all our kernel memory allocators.
> > > > 
> > > > Maybe.  I don't know.  However here we're interested in a very special
> > > > case of allocations.  Currently it seems to be a problem only on arm64.
> > > > Are there other pmap_enter() implementations that allocate more than one
> > > > physical page?  If not, are we currently saved by km_alloc(9)'s free
> > > > list?
> > > 
> > > We do the same thing on riscv64.  But even on architectures that only
> > > allocate a single page km_alloc(9)'s free list doesn't really save us.
> > > There is only a limited number of addresses on the free list, so you
> > > can run out of those.  And then you need to poke the thread to
> > > allocate more.
> > 
> > Are you suggesting we add a pmap_populate() for all non-direct-map
> > architectures?
> 
> Maybe we should even add it to all our architectures.

I agree.

> > > > > We have some mitigations in the form of the single page allocator (the
> > > > > uvm_km_pages stuff) and __HAVE_PMAP_DIRECT.  But those are not
> > > > > guaranteed to work and virtual address allocations can still fail.
> > > > 
> > > > I understand.  However I'm not convinced we should embrace the existing
> > > > behavior.  Currently  pmap_enter(9) can fail because it can't map pages
> > > > in kernel space.  Can we fix this?   Should we fix it?
> > > 
> > > As I said, this is a fundamental problem.  We can add more hacks to
> > > reduce the risk of failing because we can't map pages in kernel space.
> > > But we can never eliminate it completely.
> > 
> > Fine.
> > 
> > > > The alternative is to add another error case to pmap_enter(9) that
> > > > could be recovered from.  Could this lead to some starvation?
> > > 
> > > Not easily since our kernel allocators don't tell us whether
> > > allocation failed because we couldn't allocate physical memory or if
> > > we couldn't map that physical memory.
> > > 
> > > > Could this explain the contention on the kernel object rwlock I am
> > > > seeing on arm64?
> > > 
> > > Shouldn't be.  The kernel object lock only comes into play when we
> > > allocate pageable kernel memory.  And if our pmaps would be allocating
> > > pageable memory we'd have a big problem...
> > > 
> > > > > > Alternatively we can see pool_get() shortages integrated in the
> > > > > > page daemon.  Instead of adding pmap_populate() we could simply
> > > > > > check for pools needing a refill in the page daemon.  The same
> > > > > > already applies to uvm_anwait().
> > > > > 
> > > > > I don't think making the page daemon responsible for refilling the
> > > > > pools is a good idea.  It would mean the page daemon could block when
> > > > > there is a KVA shortage.  And isn't allocating more memory to fill the
> > > > > pools the last thing you'd want to do when waking up the page daemon?
> > > > 
> > > > This needs more explanation from my side.  First of all the "page
> > > > daemon" is now more of a "shortage thread".  It might free buffers as
> > > > well.
> > > 
> > > The only reason it frees buffers is to free up physical memory!  I
> > > don't think it even frees up KVA since the buffer cache uses its own
> > > submap to map buffers.
> > > 
> > > > The logic behind the fault handler is that it calls uvm_wait() when
> > > > it encounters an error that corresponds to a shortage of memory.
> > > > Nowadays there are more errors than a shortage of memory, and memory
> > > > can be freed not just by swapping pages.  So I'm wondering if we
> > > > shouldn't simply look at which pools failed to allocate in the same
> > > > context.  Doing so would allow us to get rid of uvm_anwait() as well,
> > > > since uvm_wait() would wake up the daemon, which would tell whatever
> > > > pool mechanism to refill itself.
> > > 
> > > Well, uvm_wait() is a bit of a dumb hammer.  You poke the page daemon
> > > and hope it frees up the right resources.  And by the time you retry
> > > something else may already have consumed those resources.
> > > 
> > > My approach is much more robust.  It allocates the specific resources
> > > that are needed to handle the page fault and populates the page
> > > tables.  So we can be sure by the time we retry the resources are
> > > there.
> > > 
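
To make that concrete, a minimal sketch of what such a pmap_populate()
could look like; pmap_vp_pool, pm_mtx, pmap_vp_lookup() and
pmap_vp_enter() are placeholders for whatever page table pool and
helpers a pmap already has, only one level is shown, and this is not
the actual diff:

void
pmap_populate(pmap_t pm, vaddr_t va)
{
        void *vp;

        /*
         * Called from the fault path before the retry, in a context
         * where sleeping is allowed, so pool_get() may wait for KVA
         * instead of failing.
         */
        vp = pool_get(&pmap_vp_pool, PR_WAITOK | PR_ZERO);

        mtx_enter(&pm->pm_mtx);
        if (pmap_vp_lookup(pm, va) != NULL) {
                /* somebody else populated it while we slept */
                mtx_leave(&pm->pm_mtx);
                pool_put(&pmap_vp_pool, vp);
                return;
        }
        pmap_vp_enter(pm, va, vp);
        mtx_leave(&pm->pm_mtx);
}
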
> > > > I don't like the fragmentation of memory management and the growing
> > > > complexity of the error code paths.  I wish we could refrain from
> > > > adding more layers to the allocations inside UVM.
> > > 
> > > I would not call this another layer of memory management.
> > > 
> > > > > And yes, the uvm_anwait() thing is very similar to this case.  In fact
> > > > > my initial idea was for pmap_populate() to just refill the pools.  But
> > > > > then I realized that it made more sense to just populate the page
> > > > > tables for the faulting address because we're going to refault on the
> > > > > same address anyway.
> > > > 
> > > > This is a clever hack.  I'm not opposed to it.  I wonder if with the
> > > > current understanding of the different pmaps this is our best shot.
> > > > 
> > > > > > But IMHO fewer usages of pools in UVM and pmaps is the best way 
> > > > > > forward.
> > > > > 
> > > > > I used to think that as well, but these days the pools have some
> > > > > performance-critical optimizations that are very relevant for our
> > > > > pmaps.
> > > > 
> > > > Maybe, but I wouldn't be so sure.  My observations with more parts of
> > > > UVM running in parallel show a lot of contention on the pmap pools.
> > > > There are definitely some nice optimizations, but I believe this is
> > > > just the beginning of some non-trivial work in this area.
> > > 
> > > Ultimately we will need to make the address allocators run in parallel
> > > in some way.
> > > 
> > > > To summarize, arm64 does 8K allocations inside pmap_enter(9).  It uses
> > > > uvm_map(9) via km_alloc(9) via pool_get(9) to map these pages in kernel
> > > > space.  The larger kernel_map is protected by a rwlock and is
> > > > apparently contended, which prevents pmap_enter(9) from succeeding.
> > > > 
> > > > - Can we tweak the `uvm_km_pages' mechanism to be able to work with 8K
> > > > allocations on arm64?
> > > 
> > > Yes, but it still wouldn't solve the fundamental problem and we'd
> > > probably waste resources.
> > > 
> > > > - Or can we change the arm64 pmap to not do allocations bigger than a
> > > > page?
> > > 
> > > Yes, but it still wouldn't solve the fundamental problem.
> > > 
> > > > Either of these two would reduce the differences between non-direct-map
> > > > and direct-map pmaps.  If we could ensure uvm_map() is not called from
> > > > within pmap_enter(), that would help me sleep better at night.
> > > 
> > > I don't think that is achievable.
> > 
> > Well it is if we use `uvm_km_pages' or limit pmap_enter() allocations to
> > page-sized ones.
> 
> True.  And this may help us avoid contention on the kernel map lock.
> And it might be relatively easy to turn uvm_km_pages into some sort of
> a per-cpu cache.  I'll have another look at what needs to change in
> the arm64 pmap to achieve this.

I came to the same conclusion.  I'd appreciate seeing this happen.
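
Something like a small per-CPU stash of pre-mapped single pages,
refilled by the existing thread, is what I imagine.  A very rough
sketch, with made-up names:

struct km_cpu_cache {
        struct mutex    kc_mtx;
        int             kc_nfree;
        vaddr_t         kc_page[8];     /* pre-mapped single pages */
} km_cpu_cache[MAXCPUS];

vaddr_t
km_cpu_get(void)
{
        struct km_cpu_cache *kc = &km_cpu_cache[cpu_number()];
        vaddr_t va = 0;

        mtx_enter(&kc->kc_mtx);
        if (kc->kc_nfree > 0)
                va = kc->kc_page[--kc->kc_nfree];
        mtx_leave(&kc->kc_mtx);
        if (va == 0)
                wakeup(&km_cpu_cache);  /* poke the refill thread */
        return (va);
}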

> > > > Which other pmaps could suffer from a shortage of `uvm_km_pages'?
> > > 
> > > armv7, arm64, powerpc64, riscv64, sparc64
> > 
> > So I now believe your fix is what we want.  One thing we did not
> > consider is whether uvm_map() could fail because of contention.  But I
> > believe we can wait until this becomes a bottleneck.
> > 
> > I'd go even further: remove the #ifdef and replace uvm_swapisfull() +
> > uvm_wait() with pmap_populate().
> > 
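
Roughly, in the MD trap handlers, something like this (untested
sketch, error handling elided):

        error = uvm_fault(map, va, 0, access_type);
        if (error == ENOMEM) {
                /*
                 * Instead of uvm_swapisfull()/uvm_wait(): make sure
                 * the page table pages for va are there, then let the
                 * process refault on the same address.
                 */
                pmap_populate(map->pmap, va);
                return;
        }
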
> > Thanks for the fix and sharing your knowledge with me.
> 
> is that an ok for the diff?

It is, ok mpi@

