Thanks for your answer Mark!

On 17/01/25(Fri) 15:12, Mark Kettenis wrote:
> >  [...]
> > > Not really.  We still have plenty of physical memory, and allocating
> > > physical pages succeeds just fine.  The problem here is allocating
> > > virtual address space.
> > 
> > So is it potentially a problem on all architectures without direct map?
> 
> Yes.  And maybe even on architectures *with* direct map as the direct
> map can't be used to map multiple pages.

Do we have a pmap_enter() that fits this description?
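
Just to make sure I follow, here is how I picture the two situations.
A rough sketch only, reusing the uvm_pagealloc(9)/km_alloc(9)
interfaces, error handling mostly omitted:

struct vm_page *pg;
vaddr_t va;
void *vp;

/* One page: no KVA needed on __HAVE_PMAP_DIRECT architectures. */
pg = uvm_pagealloc(NULL, 0, NULL, UVM_PGA_ZERO);
if (pg == NULL)
        return ENOMEM;          /* out of physical memory */
va = pmap_map_direct(pg);       /* no KVA allocation, never sleeps */

/*
 * Two pages (the arm64 "vp" case): fresh KVA is needed, so the
 * allocation can fail even with plenty of free physical memory.
 */
vp = km_alloc(2 * PAGE_SIZE, &kv_any, &kp_zero, &kd_nowait);
if (vp == NULL)
        return ENOMEM;          /* no RAM *or* no KVA, we can't tell */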

> > > Here we have two options.  Most pools that are used with PR_NOWAIT use
> > > the "interrupt safe" kernel map.  When locking that map, we use
> > > mutexes, so it is safe to do so if we're in a context where we can't
> > > sleep.  But the address space of the "interrupt safe" kernel map is
> > > small.  So allocations may still fail if we run out of address space.
> > 
> > The way I understand the problem is that pmap implementations with a
> > direct map only need (physical) page allocations.  The current fault
> > handler works with them.
> 
> Yes, but note that even the amd64 pmap uses pools and needs to use
> pool_allocator_single to make sure that pool pages can indeed be
> mapped through the direct map.

Is there a problem with that?
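
Just to check I read it right: the pool is pinned to page-sized items so
its backend always hands back a page reachable through the direct map.
Roughly, with a hypothetical pool name:

struct pool pmap_example_pool;  /* hypothetical, not a real pool */
void *item;

pool_init(&pmap_example_pool, PAGE_SIZE, 0, IPL_VM, 0, "pmapex",
    &pool_allocator_single);

/*
 * Items never span pool pages, so pool_get(9) with PR_NOWAIT does
 * not need fresh KVA on __HAVE_PMAP_DIRECT architectures.
 */
item = pool_get(&pmap_example_pool, PR_NOWAIT | PR_ZERO);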

> > Without direct map the physical memory needs to be mapped in the kernel
> > space and there is currently no guarantee to be able to do it without
> > sleeping.
> 
> Right.  And such a guarantee would be very hard to implement as there
> is always the risk of simply running out of kernel virtual address
> space.

What happens in such a case?  Is recovery possible?
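
The only recovery I can see in the tree today is the PMAP_CANFAIL dance
in the fault handler: pmap_enter(9) reports failure, we poke the page
daemon and refault.  A simplified sketch, not the actual code, with
pmap/va/pa standing in for the real arguments:

if (pmap_enter(pmap, va, pa, prot, access_type | PMAP_CANFAIL) != 0) {
        /* The failure may mean "no RAM" or "no KVA"; we can't tell. */
        if (uvm_swapisfull())
                return ENOMEM;          /* nothing left to reclaim */
        uvm_wait("pmapfail");           /* poke the page daemon... */
        return ERESTART;                /* ...and refault */
}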

> > > Also, the "vp" items are fairly large (8K) and we use quite a lot of
> > > them building the page tables for all our processes.  We would quickly
> > > fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> > > the normal kernel map.  The normal map uses rwlocks to do the locking.
> > > So to avoid sleeping we opportunistically try to lock the map and fail
> > > if we don't succeed.
> > 
> > That's a nice hack.
> > 
> > > This is a fundamental problem with all our kernel memory allocators.
> > 
> > Maybe.  I don't know.  However, here we're interested in a very special
> > case of allocations.  Currently it seems to be a problem only on arm64.
> > Are there other pmap_enter() implementations that allocate more than one
> > physical page?  If not, are we currently saved by km_alloc(9)'s free list?
> 
> We do the same thing on riscv64.  But even on architectures that only
> allocate a single page km_alloc(9)'s free list doesn't really save us.
> There is only a limited number of addresses on the free list, so you
> can run out of those.  And then you need to poke the thread to
> allocate more.

Are you suggesting we add a pmap_populate() for all non-direct-map
architectures?

> > > We have some mitigations in the form of the single page allocator (the
> > > uvm_km_pages stuff) and __HAVE_PMAP_DIRECT.  But those are not
> > > guaranteed to work and virtual address allocations can still fail.
> > 
> > I understand.  However I'm not convinced we should embrace the existing
> > behavior.  Currently  pmap_enter(9) can fail because it can't map pages
> > in kernel space.  Can we fix this?   Should we fix it?
> 
> As I said, this is a fundamental problem.  We can add more hacks to
> reduce the risk of failing because we can't map pages in kernel space.
> But we can never eliminate it completely.

Fine.
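
For my own notes: the uvm_km_pages mitigation is essentially a small
stash of pre-mapped single pages handed out under a mutex, with a
kernel thread refilling it below a low-water mark.  Conceptually
something like the sketch below; the structure and field names are
made up, not the real ones:

struct km_page_stash {                  /* hypothetical, for illustration */
        struct mutex    mtx;
        int             free;           /* pages currently stashed */
        int             lowat;          /* refill threshold */
        vaddr_t         va[64];         /* pre-mapped single-page KVAs */
};

vaddr_t
stash_get(struct km_page_stash *s)
{
        vaddr_t va = 0;

        mtx_enter(&s->mtx);
        if (s->free > 0)
                va = s->va[--s->free];  /* grab without sleeping */
        if (s->free < s->lowat)
                wakeup(s);              /* poke the refill thread */
        mtx_leave(&s->mtx);

        return va;                      /* 0 means the stash ran dry */
}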

> > The alternative is to add another error case to pmap_enter(9) which
> > could be recovered from.  Could this lead to some starvation?
> 
> Not easily since our kernel allocators don't tell us whether
> allocation failed because we couldn't allocate physical memory or if
> we couldn't map that physical memory.
> 
> > Could this explain the contention on the kernel object rwlock I am
> > seeing on arm64?
> 
> Shouldn't be.  The kernel object lock only comes into play when we
> allocate pageable kernel memory.  And if our pmaps would be allocating
> pageable memory we'd have a big problem...
> 
> > > > Alternatively we can see pool_get() shortages integrated in the page
> > > > daemon.  Instead of adding pmap_populate() we could simply check for
> > > > pools needing a refill in the page daemon.  The same already applies
> > > > to uvm_anwait().
> > > 
> > > I don't think making the page daemon responsible for refilling the
> > > pools is a good idea.  It would mean the page daemon could block when
> > > there is a KVA shortage.  And isn't allocating more memory to fill the
> > > pools the last thing you'd want to do when waking up the page daemon?
> > 
> > This needs more explanation from my side.  First of all, the "page daemon"
> > is now more of a "shortage thread".  It might free buffers as well.
> 
> The only reason it frees buffers is to free up physical memory!  I
> don't think it even frees up KVA since the buffer cache uses its own
> submap to map buffers.
> 
> > The logic behind the fault handler is that it calls uvm_wait() when
> > it encounters an error which corresponds to a shortage of memory.
> > Nowadays there are more errors than a shortage of memory, and memory
> > can be freed not just by swapping pages.  So I'm wondering if we
> > shouldn't simply look at which pools failed to allocate in the same
> > context.  Doing so would allow us to get rid of uvm_anwait() as well,
> > since uvm_wait() would wake up the daemon which would tell whatever
> > pool mechanism to refill itself.
> 
> Well, uvm_wait() is a bit of dumb hammer.  You poke the page daemon
> and hope it frees up the right resources.  And by the time you retry
> something else may already have consumed those resources.
> 
> My approach is much more robust.  It allocates the specific resources
> that are needed to handle the page fault and populates the page
> tables.  So we can be sure by the time we retry the resources are
> there.
> 
> > I don't like the fragmentation of memory management and the growing
> > complexity of error code paths.  I wish we could refrain from adding
> > more layers to the allocations inside UVM.
> 
> I would not call this another layer of memory management.
> 
> > > And yes, the uvm_anwait() thing is very similar to this case.  In fact
> > > my initial idea was for pmap_populate() to just refill the pools.  But
> > > then I realized that it made more sense to just populate the page
> > > tables for the faulting address because we're going to refault on the
> > > same address anyway.
> > 
> > This is a clever hack.  I'm not opposed to it.  I wonder whether, given
> > our current understanding of the different pmaps, this is our best shot.
> > 
> > > > But IMHO fewer usages of pools in UVM and pmaps is the best way forward.
> > > 
> > > I used to think that as well, but these days the pools have some
> > > performance-critical optimizations that are very relevant for our
> > > pmaps.
> > 
> > Maybe, but I wouldn't be so sure.  My observations with more parts of UVM
> > running in parallel show a lot of contention on pmap pools.  There are
> > definitely some nice optimizations, but I believe this is just the
> > beginning of some non-trivial work in this area.
> 
> Ultimately we will need to make the address allocators run in parallel
> in some way.
> 
> > To summarize: arm64 is doing 8K allocations inside pmap_enter(9).  It uses
> > uvm_map(9) via km_alloc(9) via pool_get(9) to map these pages in kernel
> > space.  The larger kernel_map is protected by a rwlock and is apparently
> > contended, which prevents pmap_enter(9) from succeeding.
> > 
> > - Can we tweak the `uvm_km_pages' mechanism to be able to work with 8K
> > allocations on arm64?
> 
> Yes, but it still wouldn't solve the fundamental problem and we'd
> probably waste resources.
> 
> > - Or can we change the arm64 pmap to not do allocations bigger than a
> > page?
> 
> Yes, but it still wouldn't solve the fundamental problem.
> 
> > Either of these would reduce the differences between non-direct-map
> > and direct-map pmaps.  If we could ensure uvm_map() is not called from
> > within pmap_enter(), that would help me sleep better at night.
> 
> I don't think that is achievable.

Well, it is if we use `uvm_km_pages' or limit pmap_enter() allocations
to page-sized ones.
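
As I understand the arm64 code, the chain I would like to avoid looks
roughly like this (structure name simplified, not a diff):

struct pmapvp *vp;                      /* the 8K page table chunk */

vp = pool_get(&pmap_vp_pool, PR_NOWAIT | PR_ZERO);
if (vp == NULL) {
        /*
         * The pool was empty and its backend went
         * km_alloc(8192, ...) -> uvm_map(kernel_map, ...); the
         * trylock on the contended kernel_map rwlock failed even
         * though free physical memory was available.
         */
        return ENOMEM;
}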

> > Which other pmaps could suffer from a shortage of `uvm_km_pages'?
> 
> armv7, arm64, powerpc64, riscv64, sparc64

So I now believe your fix is what we want.  One thing we did not
consider is whether uvm_map() could fail simply because of contention.
But I believe we can wait until this becomes a bottleneck.
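
To spell the worry out: with the opportunistic locking you described,
the allocation path fails as soon as someone else holds the kernel map,
not only when KVA is truly exhausted.  A sketch of the try-lock pattern:

/*
 * With UVM_FLAG_TRYLOCK, contention is indistinguishable from a
 * genuine KVA shortage for the caller.
 */
if (vm_map_lock_try(map) == FALSE)
        return ENOMEM;          /* someone else holds the map */
/* ... allocate the address range ... */
vm_map_unlock(map);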

I'd go even further: remove the #ifdef and replace the uvm_swapisfull() +
uvm_wait() combination with pmap_populate().
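
Concretely, the shape I have in mind for the fault handler error path,
just a sketch assuming the pmap_populate(pm, va) prototype from your
diff, lock handling elided:

if (pmap_enter(pmap, va, pa, prot, access_type | PMAP_CANFAIL) != 0) {
        /* drop the fault locks here, then ... */
        pmap_populate(pmap, va);        /* may sleep; pre-allocates the
                                           page table pages for va */
        return ERESTART;                /* refault the same address */
}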

Thanks for the fix and sharing your knowledge with me.

Cheers,
Martin

