> Date: Fri, 17 Jan 2025 16:39:12 +0100
> From: Martin Pieuchot <[email protected]>
>
> Thanks for your answer Mark!
>
> On 17/01/25(Fri) 15:12, Mark Kettenis wrote:
> > > [...]
> > > > Not really.  We still have plenty of physical memory, and allocating
> > > > physical pages succeeds just fine.  The problem here is allocating
> > > > virtual address space.
> > >
> > > So is it potentially a problem on all architectures without direct map?
> >
> > Yes.  And maybe even on architectures *with* direct map as the direct
> > map can't be used to map multiple pages.
>
> Do we have a pmap_enter() that fits in this description?

Several pmaps use pools that don't explicitly ask for the single-page
pool page allocator.  But the size of the pool items is probably small
enough that they still allocate single pages.
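For reference, explicitly asking for the single-page page allocator is
just a matter of handing pool_allocator_single to pool_init().  A
minimal sketch, with a made-up pool and item type; this of course only
works for items that fit in a single page:

#include <sys/param.h>
#include <sys/pool.h>

/* Made-up metadata type; the point is that it fits in one page. */
struct pmap_md {
	uint64_t	pm_entries[8];
};

struct pool pmap_md_pool;

void
pmap_md_pool_init(void)
{
	/*
	 * pool_allocator_single forces one-page pool pages, which can
	 * be reached through the direct map where one exists.
	 */
	pool_init(&pmap_md_pool, sizeof(struct pmap_md), 0, IPL_VM, 0,
	    "pmapmd", &pool_allocator_single);
}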
> > > > Here we have two options.  Most pools that are used with PR_NOWAIT use
> > > > the "interrupt safe" kernel map.  When locking that map, we use
> > > > mutexes, so it is safe to do so if we're in a context where we can't
> > > > sleep.  But the address space of the "interrupt safe" kernel map is
> > > > small.  So allocations may still fail if we run out of address space.
> > >
> > > The way I understand the problem is that pmap implementations with direct
> > > map only need (physical) page allocations.  The current fault handler
> > > works with them.
> >
> > Yes, but note that even the amd64 pmap uses pools and needs to use
> > pool_allocator_single to make sure that pool pages can indeed be
> > mapped through the direct map.
>
> Is there a problem with that?

Nope, that's fine.  Other pmaps should probably do the same.

> > > Without direct map the physical memory needs to be mapped in the kernel
> > > space and there is currently no guarantee to be able to do it without
> > > sleeping.
> >
> > Right.  And such a guarantee would be very hard to implement as there
> > is always the risk of simply running out of kernel virtual address
> > space.
>
> What happens in such a case?  Is recovery possible?

With the current code that calls uvm_wait(), you'll probably end up in
the scenario where you continuously wake up the page daemon.  With the
pmap_populate() approach, I believe we'll end up waiting in pool_get()
until some other process ends up returning items to the pool.  That
probably gives us a better chance to recover.

> > > > Also, the "vp" items are fairly large (8K) and we use quite a lot of
> > > > them building the page tables for all our processes.  We would quickly
> > > > fill the "interrupt safe" kernel map, so instead the "vp" pool uses
> > > > the normal kernel map.  The normal map uses rwlocks to do the locking.
> > > > So to avoid sleeping we opportunistically try to lock the map and fail
> > > > if we don't succeed.
> > >
> > > That's a nice hack.
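To make the "try to lock and fail" part concrete: the trick boils down
to taking the map's rwlock with RW_NOSLEEP, so a PR_NOWAIT allocation
bails out instead of sleeping.  A rough sketch only; the function and
the lock argument are illustrative, the real logic sits behind
km_alloc(9) and the map locking try-variants:

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/rwlock.h>

/*
 * Sketch only: opportunistically grab a map lock from a context that
 * must not sleep.  rw_enter() with RW_NOSLEEP returns EBUSY instead
 * of sleeping, and the caller simply reports allocation failure.
 */
int
try_alloc_va_sketch(struct rwlock *map_lock)
{
	if (rw_enter(map_lock, RW_WRITE | RW_NOSLEEP) != 0)
		return EBUSY;		/* would have slept; give up */

	/* ... carve out virtual address space here ... */

	rw_exit(map_lock);
	return 0;
}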
> > > > This is a fundamental problem with all our kernel memory allocators.
> > >
> > > Maybe.  I don't know.  However here we're interested in a very special
> > > case of allocations.  Currently it seems to be only a problem on arm64.
> > > Is there another pmap_enter() that allocates more than one physical page?
> > > If not, are we currently saved by km_alloc(9)'s free list?
> >
> > We do the same thing on riscv64.  But even on architectures that only
> > allocate a single page km_alloc(9)'s free list doesn't really save us.
> > There is only a limited number of addresses on the free list, so you
> > can run out of those.  And then you need to poke the thread to
> > allocate more.
>
> Are you suggesting we add a pmap_populate() for all non-direct map
> architectures?

Maybe we should even add it to all our architectures.

> > > > We have some mitigations in the form of the single page allocator (the
> > > > uvm_km_pages stuff) and __HAVE_PMAP_DIRECT.  But those are not
> > > > guaranteed to work and virtual address allocations can still fail.
> > >
> > > I understand.  However I'm not convinced we should embrace the existing
> > > behavior.  Currently pmap_enter(9) can fail because it can't map pages
> > > in kernel space.  Can we fix this?  Should we fix it?
> >
> > As I said, this is a fundamental problem.  We can add more hacks to
> > reduce the risk of failing because we can't map pages in kernel space.
> > But we can never eliminate it completely.
>
> Fine.
>
> > > The alternative is to add another error case to pmap_enter(9) which
> > > could be recovered from.  Could this lead to some starvation?
> >
> > Not easily since our kernel allocators don't tell us whether
> > allocation failed because we couldn't allocate physical memory or if
> > we couldn't map that physical memory.
> >
> > > Could this explain the contention on the kernel object rwlock I am
> > > seeing on arm64?
> >
> > Shouldn't be.  The kernel object lock only comes into play when we
> > allocate pageable kernel memory.  And if our pmaps were allocating
> > pageable memory we'd have a big problem...
> >
> > > > > Alternatively we could see pool_get() shortages integrated into the
> > > > > page daemon.  Instead of adding pmap_populate() we could simply check
> > > > > for pools needing a refill in the page daemon.  The same already
> > > > > applies to uvm_anwait().
> > > >
> > > > I don't think making the page daemon responsible for refilling the
> > > > pools is a good idea.  It would mean the page daemon could block when
> > > > there is a KVA shortage.  And isn't allocating more memory to fill the
> > > > pools the last thing you'd want to do when waking up the page daemon?
> > >
> > > This needs more explanation from my side.  First of all, the "page daemon"
> > > is now more of a "shortage thread".  It might free buffers as well.
> >
> > The only reason it frees buffers is to free up physical memory!  I
> > don't think it even frees up KVA since the buffer cache uses its own
> > submap to map buffers.
> >
> > > The logic behind the fault handler is that it calls uvm_wait() when
> > > it encounters an error which corresponds to a shortage of memory.
> > > Nowadays there are more errors than a shortage of memory and memory
> > > can be freed not just by swapping pages.  So I'm wondering if we
> > > shouldn't simply look at which pools failed to allocate in the same
> > > context.  Doing so would allow us to get rid of uvm_anwait() as well
> > > since uvm_wait() would wake up the daemon which would tell whatever
> > > pool mechanism to refill itself.
> >
> > Well, uvm_wait() is a bit of a dumb hammer.  You poke the page daemon
> > and hope it frees up the right resources.  And by the time you retry
> > something else may already have consumed those resources.
> >
> > My approach is much more robust.  It allocates the specific resources
> > that are needed to handle the page fault and populates the page
> > tables.  So we can be sure by the time we retry the resources are
> > there.
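In fault-handler terms the difference looks roughly like this.  A
sketch only: the helper name is made up, the pmap_populate() prototype
is assumed from the discussion, and the locking and flag handling of
the real uvm_fault() code is elided:

#include <sys/param.h>
#include <sys/errno.h>
#include <uvm/uvm_extern.h>

/*
 * Sketch only: if pmap_enter() fails with PMAP_CANFAIL because the
 * pmap couldn't allocate its metadata, pre-allocate the page table
 * pages for the faulting address (sleeping in pool_get() if needed)
 * and refault, instead of waking the page daemon with uvm_wait() and
 * hoping the right resources come back.
 */
int
fault_enter_sketch(struct pmap *pm, vaddr_t va, struct vm_page *pg,
    vm_prot_t prot, int flags)
{
	if (pmap_enter(pm, va, VM_PAGE_TO_PHYS(pg), prot,
	    flags | PMAP_CANFAIL) == 0)
		return 0;

	/*
	 * We will refault on the same address anyway, so populate its
	 * page tables now; by the time we retry they are guaranteed
	 * to be there.
	 */
	pmap_populate(pm, va);
	return ERESTART;
}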
> > > I don't like the fragmentation of memory management and the growing
> > > complexity of error code paths.  I wish we could refrain from adding
> > > more layers in the allocations inside UVM.
> >
> > I would not call this another layer of memory management.
> >
> > > > And yes, the uvm_anwait() thing is very similar to this case.  In fact
> > > > my initial idea was for pmap_populate() to just refill the pools.  But
> > > > then I realized that it made more sense to just populate the page
> > > > tables for the faulting address because we're going to refault on the
> > > > same address anyway.
> > >
> > > This is a clever hack.  I'm not opposed to it.  I wonder if with the
> > > current understanding of the different pmaps this is our best shot.
> > >
> > > > > But IMHO fewer usages of pools in UVM and pmaps is the best way
> > > > > forward.
> > > >
> > > > I used to think that as well, but these days the pools have some
> > > > performance-critical optimizations that are very relevant for our
> > > > pmaps.
> > >
> > > Maybe, but I wouldn't be so sure.  My observations with more parts of
> > > UVM running in parallel show a lot of contention on pmap pools.  There
> > > are definitely some nice optimizations, but I believe this is just the
> > > beginning of some non-trivial work in this area.
> >
> > Ultimately we will need to make the address allocators run in parallel
> > in some way.
> >
> > > To summarize, arm64 is doing 8K allocations inside pmap_enter(9).  It
> > > uses uvm_map(9) via km_alloc(9) via pool_get(9) to map these pages in
> > > kernel space.  The larger kernel_map is protected by an rwlock and is
> > > apparently contended, which prevents pmap_enter(9) from succeeding.
> > >
> > > - Can we tweak the `uvm_km_pages' mechanism to be able to work with 8K
> > >   allocations on arm64?
> >
> > Yes, but it still wouldn't solve the fundamental problem and we'd
> > probably waste resources.
> >
> > > - Or can we change the arm64 pmap to not do allocations bigger than a
> > >   page?
> >
> > Yes, but it still wouldn't solve the fundamental problem.
> >
> > > Either of these two would reduce the differences between non-direct
> > > and direct map pmaps.  If we could ensure uvm_map() is not called from
> > > within pmap_enter() that would help me sleep better at night.
> >
> > I don't think that is achievable.
>
> Well it is if we use `uvm_km_pages' or limit pmap_enter() allocations to
> page-sized ones.

True.  And this may help us avoid contention on the kernel map lock.
And it might be relatively easy to turn uvm_km_pages into some sort of
a per-cpu cache.  I'll have another look at what needs to change in the
arm64 pmap to achieve this.

> > > Which other pmaps could suffer from a shortage of `uvm_km_pages'?
> >
> > armv7, arm64, powerpc64, riscv64, sparc64
>
> So I now believe your fix is what we want.  One thing we did not
> consider is whether uvm_map() failed because of some contention.  But I
> believe we can wait until this becomes a bottleneck.
>
> I'd go even further: remove the #ifdef and replace uvm_swapisfull() +
> uvm_wait() with pmap_populate().
>
> Thanks for the fix and sharing your knowledge with me.

Is that an ok for the diff?
