On 10/01/25(Fri) 21:04, Mark Kettenis wrote:
> > Date: Fri, 10 Jan 2025 09:07:44 +0100
> > From: Martin Pieuchot <[email protected]>
> > 
> > Hello,
> > 
> > On 09/01/25(Thu) 22:49, Mark Kettenis wrote:
> > > > > > [...]
> > > > > > Any proposal on how we could proceed to find a solution for this
> > > > > > issue?
> > > > > 
> > > > > The following hack fixes the issue for me. I don't think this is a
> > > > > proper solution, but it may be a starting point. Or a temporary fix.
> > > > > 
> > > > > The issue really is that we can't tell whether pmap_enter(9) failed
> > > > > because we're out of physical memory, or if it failed for some other
> > > > > reason. In the case at hand we fail because of contention on the
> > > > > kernel map lock. But we could also be failing because we have
> > > > > completely run out of KVA.
> > > > 
> > > > Works for me as well!
> > > > 
> > > > > We can't sleep while holding all those uvm locks. I'm not sure the
> > > > > free memory check this does is right. Or whether we want such a check
> > > > > at all. The vm_map_lock()/vm_map_unlock() dance is necessary to make
> > > > > sure we don't spin too quickly if the kernel map lock is contended.
> > > > > 
> > > > > A better fix would perhaps be to have a new pmap function that we
> > > > > could call at this spot that would sleep until the necessary resources
> > > > > are available. On arm64 this would populate the page tables using
> > > > > pool allocations that use PR_WAITOK, but would not actually enter a
> > > > > valid mapping. I'm going to explore that idea a bit.
> > > 
> > > This seems to work. Even a little bit better I think as the number of
> > > vp pool fails seems to be a bit smaller with this diff.
> > > 
> > > Thoughts?
> > 
> > Thanks a lot for the diagnostic and diff.
> > 
> > The fact that pools are used by pmaps with NOWAIT and that pools can only
> > be refilled with a sleeping context is very concerning.
> > 
> > There are multiple ways to fix the issue. I'd prefer if we can work on a
> > fix that avoids re-faulting:
> > 
> > dlg@ Is it possible to make the pools used by pmaps special such that they
> > act as a thin wrapper around uvm_pagealloc(9)/uvm_pglistalloc(9)? This
> > would help in two ways:
> > - ensure pmap_enter(9) only returns an error if OOM
> > - benefit from the per-CPU page cache for all pmaps
> 
> Not really. We still have plenty of physical memory, and allocating
> physical pages succeeds just fine. The problem here is allocating
> virtual address space.

So is it potentially a problem on all architectures without a direct map?

> Here we have two options. Most pools that are used with PR_NOWAIT use
> the "interrupt safe" kernel map. When locking that map, we use
> mutexes, so it is safe to do so if we're in a context where we can't
> sleep. But the address space of the "interrupt safe" kernel map is
> small. So allocations may still fail if we run out of address space.

The way I understand the problem is that pmap implementations with a
direct map only need physical page allocations, and the current fault
handler works with them. Without a direct map the physical memory needs
to be mapped into kernel space, and there is currently no guarantee that
this can be done without sleeping.
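Put differently, the contrast I have in mind is the following. This is
only an illustration: the two functions are made up and not lifted from
any of our pmaps.

#include <sys/param.h>
#include <sys/systm.h>
#include <uvm/uvm.h>

/*
 * Made-up illustration of how a pmap gets memory for its page table
 * pages with and without a direct map.
 */
struct vm_page *
ptp_alloc_direct(void)
{
	/*
	 * With __HAVE_PMAP_DIRECT a physical page is enough; no KVA
	 * and no kernel map are involved.
	 */
	return (uvm_pagealloc(NULL, 0, NULL, UVM_PGA_ZERO));
}

void *
ptp_alloc_mapped(void)
{
	/*
	 * Without a direct map we also need kernel virtual address
	 * space.  kd_trylock makes us give up instead of sleeping on
	 * the kernel map lock, which is what ends up failing under
	 * contention.
	 */
	return (km_alloc(PAGE_SIZE, &kv_any, &kp_zero, &kd_trylock));
}

The first variant can only fail when we are really out of physical
memory, which is the case the fault handler already knows how to handle.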
> Also, the "vp" items are fairly large (8K) and we use quite a lot of
> them building the page tables for all our processes. We would quickly
> fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> the normal kernel map. The normal map uses rwlocks to do the locking.
> So to avoid sleeping we opportunistically try to lock the map and fail
> if we don't succeed.

That's a nice hack.

> This is a fundamental problem with all our kernel memory allocators.

Maybe. I don't know. However, here we're interested in a very special
case of allocations. Currently it seems to be a problem only on arm64.
Are there other pmap_enter() implementations that allocate more than
one physical page? If not, are we currently saved by km_alloc(9)'s
free list?

> We have some mitigations in the form of the single page allocator (the
> uvm_km_pages stuff) and __HAVE_PMAP_DIRECT. But those are not
> guaranteed to work and virtual address allocations can still fail.

I understand. However I'm not convinced we should embrace the existing
behavior. Currently pmap_enter(9) can fail because it can't map pages
in kernel space. Can we fix this? Should we fix it? The alternative is
to add another error case to pmap_enter(9) from which we could recover.
Could this lead to some starvation? Could this explain the contention
on the kernel object rwlock I am seeing on arm64?

> > Alternatively we can see pool_get() shortages integrated in the page
> > daemon. Instead of adding pmap_populate() we could simply check for
> > pools needing a refill in the page daemon. The same already applies
> > to uvm_anwait().
> 
> I don't think making the page daemon responsible for refilling the
> pools is a good idea. It would mean the page daemon could block when
> there is a KVA shortage. And isn't allocating more memory to fill the
> pools the last thing you'd want to do when waking up the page daemon?

This needs more explanation from my side. First of all, the "page
daemon" is now more of a "shortage thread": it might free buffers as
well. The logic behind the fault handler is that it calls uvm_wait()
when it encounters an error that corresponds to a shortage of memory.
Nowadays there are more errors than just a shortage of memory, and
memory can be freed by more than just swapping pages out. So I'm
wondering if we shouldn't simply look, in that same context, at which
pools failed to allocate. Doing so would also allow us to get rid of
uvm_anwait(), since uvm_wait() would wake up the daemon, which would
tell whatever pool mechanism to refill itself. I don't like the
fragmentation of memory management and the growing complexity of the
error code paths. I wish we could refrain from adding more layers to
the allocations inside UVM.
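Very roughly, and with a completely made-up function name, the page
daemon side could be as dumb as the sketch below; how it learns which
pool reported a PR_NOWAIT failure is hand-waved away.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/pool.h>

/*
 * Made-up helper: prime a pool that reported a PR_NOWAIT failure.
 * The page daemon has a sleeping context, so a waiting get followed
 * by a put leaves an item (and its backing page) on the pool's free
 * list for the retrying fault path.
 */
void
uvmpd_prime_pool(struct pool *pp)
{
	void *item;

	item = pool_get(pp, PR_WAITOK);
	pool_put(pp, item);
}

Whether one item is enough and how the shortage gets reported are open
questions, but it shows why I think the refill logic could live next to
uvm_wait() instead of growing a new pmap hook.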
> And yes, the uvm_anwait() thing is very similar to this case. In fact
> my initial idea was for pmap_populate() to just refill the pools. But
> then I realized that it made more sense to just populate the page
> tables for the faulting address because we're going to refault on the
> same address anyway.

This is a clever hack. I'm not opposed to it. I wonder if, with our
current understanding of the different pmaps, this is our best shot.

> > But IMHO fewer usages of pools in UVM and pmaps is the best way forward.
> 
> I used to think that as well, but these days the pools have some
> performance-critical optimizations that are very relevant for our
> pmaps.

Maybe, but I wouldn't be so sure. My observations with more parts of
UVM running in parallel show a lot of contention on the pmap pools.
There are definitely some nice optimizations, but I believe this is
just the beginning of some non-trivial work in this area.

To summarize, arm64 does 8K allocations inside pmap_enter(9). It
reaches uvm_map(9), via km_alloc(9), via pool_get(9), to map these
pages in kernel space. The larger kernel_map is protected by a rwlock
and is apparently contended, which prevents pmap_enter(9) from
succeeding.

- Can we tweak the `uvm_km_pages' mechanism to be able to work with 8K
  allocations on arm64?
- Or can we change the arm64 pmap to not do allocations bigger than a
  page? (A very rough sketch of what I mean is at the end of this
  mail.)

Either of these two would reduce the differences between non-direct
and direct map pmaps. If we could ensure uvm_map() is not called from
within pmap_enter() that would help me sleep better at night.

Which other pmaps could suffer from a shortage of `uvm_km_pages'?
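For the second option, here is the kind of thing I have in mind. It is
only a naive illustration: the structure and function names are made
up, the real arm64 "vp" structures would need to be reorganized, and
I'm assuming km_alloc(9) with kv_page can be served from the
`uvm_km_pages' reserve when there is no direct map.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <uvm/uvm.h>

/*
 * Made-up sketch: replace one 8K allocation by two PAGE_SIZE ones so
 * that the single page reserve can serve them.  "vp_half"/"vp_split"
 * do not exist, they only stand in for a reorganized arm64 vp
 * structure.
 */
struct vp_half {
	uint64_t	slot[PAGE_SIZE / sizeof(uint64_t)];
};

struct vp_split {
	struct vp_half	*lo;
	struct vp_half	*hi;
};

int
vp_split_alloc(struct vp_split *vp, int pmap_flags)
{
	/* PMAP_CANFAIL means the fault handler can retry, so don't sleep. */
	const struct kmem_dyn_mode *kd =
	    (pmap_flags & PMAP_CANFAIL) ? &kd_nowait : &kd_waitok;

	vp->lo = km_alloc(PAGE_SIZE, &kv_page, &kp_zero, kd);
	if (vp->lo == NULL)
		return (ENOMEM);
	vp->hi = km_alloc(PAGE_SIZE, &kv_page, &kp_zero, kd);
	if (vp->hi == NULL) {
		km_free(vp->lo, PAGE_SIZE, &kv_page, &kp_zero);
		vp->lo = NULL;
		return (ENOMEM);
	}
	return (0);
}

The two halves are no longer physically contiguous nor virtually
adjacent, so the pmap would have to index them separately, but this
path should not need to take the kernel map lock anymore.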
