> Date: Fri, 17 Jan 2025 12:59:46 +0100
> From: Martin Pieuchot <[email protected]>
> 
> On 10/01/25(Fri) 21:04, Mark Kettenis wrote:
> > > Date: Fri, 10 Jan 2025 09:07:44 +0100
> > > From: Martin Pieuchot <[email protected]>
> > > 
> > > Hello,
> > > 
> > > On 09/01/25(Thu) 22:49, Mark Kettenis wrote:
> > > > > > > [...] 
> > > > > > > Any proposal on how we could proceed to find a solution for this 
> > > > > > > issue?
> > > > > > 
> > > > > > The following hack fixes the issue for me.  I don't think this is a
> > > > > > proper solution, but it may be a starting point.  Or a temporary fix.
> > > > > > 
> > > > > > The issue really is that we can't tell whether pmap_enter(9) failed
> > > > > > because we're out of physical memory, or if it failed for some other
> > > > > > reason.  In the case at hand we fail because of contention on the
> > > > > > kernel map lock.  But we could also be failing because we have
> > > > > > completely run out of KVA.
> > > > > 
> > > > > Works for me as well!
> > > > >  
> > > > > > We can't sleep while holding all those uvm locks.  I'm not sure the
> > > > > > free memory check this does is right.  Or whether we want such a
> > > > > > check at all.  The vm_map_lock()/vm_map_unlock() dance is necessary
> > > > > > to make sure we don't spin too quickly if the kernel map lock is
> > > > > > contended.
> > > > > > 
> > > > > > A better fix would perhaps be to have a new pmap function that we
> > > > > > could call at this spot that would sleep until the necessary
> > > > > > resources are available.  On arm64 this would populate the page
> > > > > > tables using pool allocations that use PR_WAITOK, but would not
> > > > > > actually enter a valid mapping.  I'm going to explore that idea a
> > > > > > bit.
> > > > 
> > > > This seems to work.  Even a little bit better I think as the number of
> > > > vp pool fails seems to be a bit smaller with this diff.
> > > > 
> > > > Thoughts?
> > > 
> > > Thanks a lot for the diagnosis and the diff.
> > > 
> > > The fact that pools are used by pmaps with NOWAIT and that pools can only
> > > be refilled from a context that can sleep is very concerning.
> > > 
> > > There are multiple ways to fix the issue.  I'd prefer if we can work on a
> > > fix that avoids re-faulting:
> > > 
> > > dlg@ Is it possible to make the pools used by pmaps special such that they
> > > act as a thin wrapper around uvm_pagealloc(9)/uvm_pglistalloc(9)?  This
> > > would help in two ways:
> > > - ensure pmap_enter(9) only returns an error if OOM
> > > - let all pmaps benefit from the per-CPU page cache
> > 
> > Not really.  We still have plenty of physical memory, and allocating
> > physical pages succeeds just fine.  The problem here is allocating
> > virtual address space.
> 
> So is it potentially a problem on all architectures without a direct map?

Yes.  And maybe even on architectures *with* a direct map, as the direct
map can't be used to map a multi-page item (the physical pages would
have to be contiguous).
 
> > Here we have two options.  Most pools that are used with PR_NOWAIT use
> > the "interrupt safe" kernel map.  When locking that map, we use
> > mutexes, so it is safe to do so if we're in a context where we can't
> > sleep.  But the address space of the "interrupt safe" kernel map is
> > small.  So allocations may still fail if we run out of address space.
> 
> The way I understand the problem is that pmap implementations with a direct
> map only need (physical) page allocations.  The current fault handler
> works for them.

Yes, but note that even the amd64 pmap uses pools and needs to use
pool_allocator_single to make sure that pool pages can indeed be
mapped through the direct map.
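
For illustration, the interesting part is just the pool setup; a minimal
sketch (the pool name, item size and init function are made up here, this
is not the actual amd64 code):

	#include <sys/param.h>
	#include <sys/pool.h>

	struct pool pmap_example_pool;		/* hypothetical pool */

	void
	pmap_example_pool_init(void)		/* hypothetical init hook */
	{
		/*
		 * pool_allocator_single backs the pool with single pages
		 * from km_alloc(9), so every pool page also stays
		 * reachable through the direct map.
		 */
		pool_init(&pmap_example_pool, PAGE_SIZE, 0, IPL_VM, 0,
		    "pmexpl", &pool_allocator_single);
	}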

> Without a direct map the physical memory needs to be mapped into kernel
> space, and there is currently no guarantee that this can be done without
> sleeping.

Right.  And such a guarantee would be very hard to implement as there
is always the risk of simply running out of kernel virtual address
space.

> > Also, the "vp" items are fairly large (8K) and we use quite a lot of
> > them building the page tables for all our processes.  We would quickly
> > fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> > the normal kernel map.  The normal map uses rwlocks to do the locking.
> > So to avoid sleeping we opportunistically try to lock the map and fail
> > if we don't succeed.
> 
> That's a nice hack.
> 
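For completeness, the opportunistic locking boils down to something like
this (a sketch of the idea only; the real path goes pool_get(9) ->
km_alloc(9) -> uvm_map(9) with a trylock flag, the helper name is made up
and error handling is elided):

	#include <sys/param.h>
	#include <uvm/uvm.h>

	int
	vp_kva_reserve(vsize_t sz)	/* hypothetical helper */
	{
		/*
		 * Try to take the kernel map's rwlock without sleeping.
		 * If it is contended, give up so that pmap_enter(9)
		 * returns ENOMEM instead of sleeping with the uvm fault
		 * locks held.
		 */
		if (vm_map_lock_try(kernel_map) == FALSE)
			return ENOMEM;	/* contended; caller must cope */

		/* ... carve sz bytes of KVA out of kernel_map ... */

		vm_map_unlock(kernel_map);
		return 0;
	}
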
> > This is a fundamental problem with all our kernel memory allocators.
> 
> Maybe.  I don't know.  However here we're interested in a very special
> case of allocations.  Currently it seems to be a problem only on arm64.
> Are there other pmap_enter() implementations that allocate more than one
> physical page?  If not, are we currently saved by km_alloc(9)'s free list?

We do the same thing on riscv64.  But even on architectures that only
allocate a single page, km_alloc(9)'s free list doesn't really save us.
There is only a limited number of addresses on the free list, so you
can run out of those.  And then you need to poke the thread to
allocate more.
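
To illustrate why that free list only postpones the problem, the
single-page path amounts to roughly the following (a simplified model of
the mechanism, not the real uvm_km.c declarations; names are made up):

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/mutex.h>

	/* Simplified model of the pre-mapped single-page free list. */
	struct km_free_list {
		struct mutex	mtx;
		int		lowat;		/* wake the thread below this */
		int		free;		/* entries left on the list */
		vaddr_t		page[32];	/* pre-mapped single-page VAs */
		void		*thread_ident;	/* wakeup(9) channel */
	} km_free_list;

	vaddr_t
	km_take_free_page(void)
	{
		vaddr_t va = 0;

		mtx_enter(&km_free_list.mtx);
		if (km_free_list.free > 0)
			va = km_free_list.page[--km_free_list.free];
		/* Running low: poke the thread that maps more pages... */
		if (km_free_list.free < km_free_list.lowat)
			wakeup(&km_free_list.thread_ident);
		mtx_leave(&km_free_list.mtx);

		/* ...but until it runs, a 0 return means the caller fails. */
		return va;
	}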

> > We have some mitigations in the form of the single page allocator (the
> > uvm_km_pages stuff) and __HAVE_PMAP_DIRECT.  But those are not
> > guaranteed to work and virtual address allocations can still fail.
> 
> I understand.  However I'm not convinced we should embrace the existing
> behavior.  Currently pmap_enter(9) can fail because it can't map pages
> in kernel space.  Can we fix this?  Should we fix it?

As I said, this is a fundamental problem.  We can add more hacks to
reduce the risk of failing because we can't map pages in kernel space.
But we can never eliminate it completely.

> The alternative is to add another error case to pmap_enter(9) from which
> we could recover.  Could this lead to some starvation?

Not easily since our kernel allocators don't tell us whether
allocation failed because we couldn't allocate physical memory or if
we couldn't map that physical memory.

> Could this explain the contention on the kernel object rwlock I am
> seeing on arm64?

Shouldn't be.  The kernel object lock only comes into play when we
allocate pageable kernel memory.  And if our pmaps were allocating
pageable memory we'd have a big problem...

> > > Alternatively we can see pool_get() shortages integrated in the page
> > > daemon.  Instead of adding pmap_populate() we could simply check for
> > > pools needing a refill in the page daemon.  The same already applies
> > > to uvm_anwait().
> > 
> > I don't think making the page daemon responsible for refilling the
> > pools is a good idea.  It would mean the page daemon could block when
> > there is a KVA shortage.  And isn't allocating more memory to fill the
> > pools the last thing you'd want to do when waking up the page daemon?
> 
> This needs more explanation from my side.  First of all the "page daemon"
> is now more of a "shortage thread".  It might free buffers as well.

The only reason it frees buffers is to free up physical memory!  I
don't think it even frees up KVA since the buffer cache uses its own
submap to map buffers.

> The logic behind the fault handler is that it calls uvm_wait() when
> it encounters an error that corresponds to a shortage of memory.
> Nowadays there are more errors than a shortage of memory, and memory
> can be freed not just by swapping pages.  So I'm wondering if we
> shouldn't simply look at which pools failed to allocate in the same
> context.  Doing so would allow us to get rid of uvm_anwait() as well,
> since uvm_wait() would wake up the daemon, which would tell whatever
> pool mechanism to refill itself.

Well, uvm_wait() is a bit of a dumb hammer.  You poke the page daemon
and hope it frees up the right resources.  And by the time you retry,
something else may already have consumed those resources.

My approach is much more robust.  It allocates the specific resources
that are needed to handle the page fault and populates the page
tables.  So we can be sure by the time we retry the resources are
there.
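
Concretely, the fault-handler side of it would look roughly like this
(a sketch only: "map", "va", "pg" and "access_type" stand for the
faulting map, the faulting address, the freshly allocated page and the
fault's access type, the unlocking is elided, and the final shape of
pmap_populate() is of course still up for discussion):

	error = pmap_enter(map->pmap, va, VM_PAGE_TO_PHYS(pg),
	    access_type, access_type | PMAP_CANFAIL);
	if (error != 0) {
		/* ... drop the amap/object/map locks here (elided) ... */

		/*
		 * Pre-build the page tables for this address in a
		 * context that is allowed to sleep, without entering
		 * a valid mapping yet.
		 */
		pmap_populate(map->pmap, va);

		/* Refault the same address; the resources are now there. */
		return ERESTART;
	}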

> I don't like the fragmentation of memory management and the growing
> complexity of the error code paths.  I wish we could refrain from adding
> more layers to the allocations inside UVM.

I would not call this another layer of memory management.

> > And yes, the uvm_anwait() thing is very similar to this case.  In fact
> > my initial idea was for pmap_populate() to just refill the pools.  But
> > then I realized that it made more sense to just populate the page
> > tables for the faulting address because we're going to refault on the
> > same address anyway.
> 
> This is a clever hack.  I'm not opposed to it.  I wonder if with the
> current understanding of the different pmaps this is our best shot.
> 
> > > But IMHO fewer usages of pools in UVM and pmaps is the best way forward.
> > 
> > I used to think that as well, but these days the pools have some
> > performance-critical optimizations that are very relevant for our
> > pmaps.
> 
> Maybe, but I wouldn't be so sure.  My observations with more parts of UVM
> running in parallel show a lot of contention on pmap pools.  There are
> definitely some nice optimizations, but I believe this is just the
> beginning of some non-trivial work in this area.

Ultimately we will need to make the address allocators run in parallel
in some way.

> To summarize, arm64 is doing 8K allocations inside pmap_enter(9).  It uses
> pool_get(9), which goes through km_alloc(9) and ultimately uvm_map(9), to
> map these pages in kernel space.  The larger kernel_map is protected by a
> rwlock and is apparently contended, which prevents pmap_enter(9) from
> succeeding.
> 
> - Can we tweak the `uvm_km_pages' mechanism to be able to work with 8K
> allocations on arm64?

Yes, but it still wouldn't solve the fundamental problem and we'd
probably waste resources.

> - Or can we change the arm64 pmap to not do allocations bigger than a
> page?

Yes, but it still wouldn't solve the fundamental problem.

> Either of these two would reduce the differences between non-direct-map
> and direct-map pmaps.  If we could ensure uvm_map() is not called from
> within pmap_enter() that would help me sleep better at night.

I don't think that is achievable.

> Which other pmaps could suffer from a shortage of `uvm_km_pages'?

armv7, arm64, powerpc64, riscv64, sparc64
