Thanks for your answer, Mark!

On 17/01/25(Fri) 15:12, Mark Kettenis wrote:
> > [...]
> > > Not really. We still have plenty of physical memory, and
> > > allocating physical pages succeeds just fine. The problem here is
> > > allocating virtual address space.
> >
> > So is it potentially a problem on all architectures without direct map?
>
> Yes. And maybe even on architectures *with* direct map as the direct
> map can't be used to map multiple pages.
Do we have a pmap_enter() that fits this description?

> > > Here we have two options. Most pools that are used with PR_NOWAIT
> > > use the "interrupt safe" kernel map. When locking that map, we use
> > > mutexes, so it is safe to do so if we're in a context where we
> > > can't sleep. But the address space of the "interrupt safe" kernel
> > > map is small. So allocations may still fail if we run out of
> > > address space.
> >
> > The way I understand the problem is that pmap implementations with
> > direct map only need (physical) page allocations. The current fault
> > handler works with them.
>
> Yes, but note that even the amd64 pmap uses pools and needs to use
> pool_allocator_single to make sure that pool pages can indeed be
> mapped through the direct map.

Is there a problem with that?

> > Without direct map the physical memory needs to be mapped in the
> > kernel space and there is currently no guarantee to be able to do it
> > without sleeping.
>
> Right. And such a guarantee would be very hard to implement as there
> is always the risk of simply running out of kernel virtual address
> space.

What happens in such a case? Is recovery possible?

> > > Also, the "vp" items are fairly large (8K) and we use quite a lot
> > > of them building the page tables for all our processes. We would
> > > quickly fill the "interrupt safe" kernel map, so instead the "vp"
> > > pool uses the normal kernel map. The normal map uses rwlocks to do
> > > the locking. So to avoid sleeping we opportunistically try to lock
> > > the map and fail if we don't succeed.
> >
> > That's a nice hack.
> >
> > > This is a fundamental problem with all our kernel memory allocators.
> >
> > Maybe. I don't know. However here we're interested in a very special
> > case of allocations. Currently it seems to be only a problem on
> > arm64. Is there any other pmap_enter() that allocates more than one
> > physical page? If not we're currently saved by km_alloc(9)'s free
> > list?
>
> We do the same thing on riscv64. But even on architectures that only
> allocate a single page km_alloc(9)'s free list doesn't really save us.
> There is only a limited number of addresses on the free list, so you
> can run out of those. And then you need to poke the thread to
> allocate more.

Are you suggesting we add a pmap_populate() for all non-direct-map
architectures?

> > > We have some mitigations in the form of the single page allocator
> > > (the uvm_km_pages stuff) and __HAVE_PMAP_DIRECT. But those are not
> > > guaranteed to work and virtual address allocations can still fail.
> >
> > I understand. However I'm not convinced we should embrace the
> > existing behavior. Currently pmap_enter(9) can fail because it can't
> > map pages in kernel space. Can we fix this? Should we fix it?
>
> As I said, this is a fundamental problem. We can add more hacks to
> reduce the risk of failing because we can't map pages in kernel space.
> But we can never eliminate it completely.

Fine.

> > The alternative is to add another error case to pmap_enter(9) which
> > could be recovered. Could this lead to some starvation?
>
> Not easily since our kernel allocators don't tell us whether
> allocation failed because we couldn't allocate physical memory or if
> we couldn't map that physical memory.
>
> > Could this explain the contention on the kernel object rwlock I am
> > seeing on arm64?
>
> Shouldn't be. The kernel object lock only comes into play when we
> allocate pageable kernel memory. And if our pmaps would be allocating
> pageable memory we'd have a big problem...
>
> > > > Alternatively we can see pool_get() shortages integrated in the
> > > > page daemon. Instead of adding pmap_populate() we could simply
> > > > check for pools needing a refill in the page daemon. The same
> > > > already applies to uvm_anwait().
> > >
> > > I don't think making the page daemon responsible for refilling the
> > > pools is a good idea.
> > > It would mean the page daemon could block when there is a KVA
> > > shortage. And isn't allocating more memory to fill the pools the
> > > last thing you'd want to do when waking up the page daemon?
> >
> > This needs more explanation from my side. First of all the "page
> > daemon" is now more of a "shortage thread". It might free buffers as
> > well.
>
> The only reason it frees buffers is to free up physical memory! I
> don't think it even frees up KVA since the buffer cache uses its own
> submap to map buffers.
>
> > The logic behind the fault handler is that it calls uvm_wait() when
> > it encounters an error which corresponds to a shortage of memory.
> > Nowadays there are more errors than a shortage of memory, and memory
> > can be freed not just by swapping pages. So I'm wondering if we
> > shouldn't simply look at which pools failed to allocate in the same
> > context. Doing so allows us to get rid of uvm_anwait() as well,
> > since uvm_wait() would wake up the daemon which would tell whatever
> > pool mechanism to refill itself.
>
> Well, uvm_wait() is a bit of a dumb hammer. You poke the page daemon
> and hope it frees up the right resources. And by the time you retry
> something else may already have consumed those resources.
>
> My approach is much more robust. It allocates the specific resources
> that are needed to handle the page fault and populates the page
> tables. So we can be sure by the time we retry the resources are
> there.
>
> > I don't like the fragmentation of memory management and the growing
> > complexity of error code paths. I wish we could refrain from adding
> > more layers of allocations inside UVM.
>
> I would not call this another layer of memory management.
>
> > > And yes, the uvm_anwait() thing is very similar to this case. In
> > > fact my initial idea was for pmap_populate() to just refill the
> > > pools.
> > > But then I realized that it made more sense to just populate the
> > > page tables for the faulting address because we're going to
> > > refault on the same address anyway.
> >
> > This is a clever hack. I'm not opposed to it. I wonder if, with the
> > current understanding of the different pmaps, this is our best shot.
> >
> > > > But IMHO fewer usages of pools in UVM and pmaps is the best way
> > > > forward.
> > >
> > > I used to think that as well, but these days the pools have some
> > > performance-critical optimizations that are very relevant for our
> > > pmaps.
> >
> > Maybe, but I wouldn't be so sure. My observations with more parts of
> > UVM running in parallel show a lot of contention on pmap pools.
> > There are definitely some nice optimizations, but I believe this is
> > just the beginning of some non-trivial work in this area.
>
> Ultimately we will need to make the address allocators run in parallel
> in some way.
>
> > To summarize, arm64 does 8K allocations inside pmap_enter(9). It
> > uses uvm_map(9) via km_alloc(9) via pool_get(9) to map these pages
> > in kernel space. The larger kernel_map is protected by a rwlock and
> > is apparently contended, which prevents pmap_enter(9) from
> > succeeding.
> >
> > - Can we tweak the `uvm_km_pages' mechanism to be able to work with
> >   8K allocations on arm64?
>
> Yes, but it still wouldn't solve the fundamental problem and we'd
> probably waste resources.
>
> > - Or can we change the arm64 pmap to not do allocations bigger than
> >   a page?
>
> Yes, but it still wouldn't solve the fundamental problem.
>
> > Either of these two would reduce the differences between non-direct
> > and direct map pmaps. If we could ensure uvm_map() is not called
> > from within pmap_enter() that would help me sleep better at night.
>
> I don't think that is achievable.

Well, it is if we use `uvm_km_pages' or limit pmap_enter() allocations
to page-sized ones.

> > Which other pmaps could suffer from a shortage of `uvm_km_pages'?
>
> armv7, arm64, powerpc64, riscv64, sparc64

So I now believe your fix is what we want. One thing we did not
consider is whether uvm_map() can fail because of some contention. But
I believe we can wait until this becomes a bottleneck.

I'd go even further: remove the #ifdef and replace uvm_swapisfull() +
uvm_wait() with pmap_populate().

Thanks for the fix and for sharing your knowledge with me.

Cheers,
Martin
