Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On 31.01.24 15:08, Michal Hocko wrote:
> On Wed 31-01-24 10:26:13, Ryan Roberts wrote:
>> IIRC there is an option to zero memory when it is freed back to the buddy? So
>> that could be a place where time is proportional to size rather than
>> proportional to folio count? But I think that option is intended for debug
>> only? So perhaps not a problem in practice?
>
> init_on_free is considered a security/hardening feature more than a
> debugging one. It will surely add overhead, and I guess people who use
> it know about that.
>
> The batch size limit is a latency reduction feature for !PREEMPT kernels,
> but by no means should it be considered a low-latency guarantee. A lot has
> changed since the limit was introduced, and the current latency numbers
> will surely differ from those back then. As long as soft lockups do not
> trigger again, this should be acceptable IMHO.

It could now be zeroing out ~512 MiB. That shouldn't take double-digit
seconds unless we are running in a very problematic environment
(over-committed VM). But then, we might have different problems already.

I'll do some sanity checks with an extremely large process (as much as I
can fit on my machines), with a !CONFIG_PREEMPT kernel and init_on_free,
to see if anything pops up.

Thanks Michal!

--
Cheers,

David / dhildenb
Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On Wed 31-01-24 11:16:01, David Hildenbrand wrote:
[...]
> This 10k pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to handle
> soft-lockups.

AFAIR at the time of this patch this was mostly just to put some cap on
the number of batches to collect and free at once. If there is a lot of
free memory and a large process exiting, this could grow really high. Now
that those pages^Wfolios can represent larger memory chunks, it could mean
more physical memory being freed at once, which might make the operation
take longer but still far from triggering a soft lockup.

Now latency might suck on !PREEMPT kernels with too many pages to free in
a single batch, but I guess this is somewhat expected for this preemption
model. The soft lockup has to be avoided because it can panic the machine
in some configurations.
--
Michal Hocko
SUSE Labs
Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On Wed 31-01-24 10:26:13, Ryan Roberts wrote:
> IIRC there is an option to zero memory when it is freed back to the buddy? So
> that could be a place where time is proportional to size rather than
> proportional to folio count? But I think that option is intended for debug
> only? So perhaps not a problem in practice?

init_on_free is considered a security/hardening feature more than a
debugging one. It will surely add overhead, and I guess people who use it
know about that.

The batch size limit is a latency reduction feature for !PREEMPT kernels,
but by no means should it be considered a low-latency guarantee. A lot has
changed since the limit was introduced, and the current latency numbers
will surely differ from those back then. As long as soft lockups do not
trigger again, this should be acceptable IMHO.
--
Michal Hocko
SUSE Labs
Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On 31.01.24 03:20, Yin Fengwei wrote:
> On 1/29/24 22:32, David Hildenbrand wrote:
>> This series is based on [1] and must be applied on top of it.
>> Similar to what we did with fork(), let's implement PTE batching
>> during unmap/zap when processing PTE-mapped THPs.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>> entry removal once per batch.
>>
>> Ryan was previously working on this in the context of cont-pte for
>> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
>> This series implements the optimization for all architectures, independent
>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>> large-folio-pages batches as well, and makes use of our new rmap batching
>> function when removing the rmap.
>>
>> To achieve that, we have to enlighten MMU gather / page freeing code
>> (i.e., everything that consumes encoded_page) to process unmapping
>> of consecutive pages that all belong to the same large folio. I'm being
>> very careful to not degrade order-0 performance, and it looks like I
>> managed to achieve that.
>
> One possible scenario:
> If all folios are 2M in size, then one full batch could hold 510M of
> memory. Is that too much, given that one full batch could previously hold
> only (2M - 4096 * 2) of memory?

Good point, we do have CONFIG_INIT_ON_FREE_DEFAULT_ON. I don't remember
whether init_on_free or init_on_alloc is used in production systems. In
tlb_batch_pages_flush(), there is a cond_resched() to limit the number of
entries we process.

So if that is actually problematic, we'd run into a soft-lockup and need
another cond_resched() [I have some faint recollection that people are
working on removing cond_resched() completely].

One could do some counting in free_pages_and_swap_cache() (where we
iterate all entries already) and insert cond_resched+release_pages() for
every (e.g., 512) pages.

--
Cheers,

David / dhildenb
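The counting idea in the last paragraph can be modelled in user space. The sketch below is only an illustration under stated assumptions: struct model_fragment and model_free_with_budget() are hypothetical names, not kernel code, and the "pause" counter stands in for the point where the kernel would call release_pages() followed by cond_resched().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one "page+nr_pages" entry in a batch. */
struct model_fragment {
	size_t nr_pages;
};

/*
 * Walk all fragments, accumulate the number of pages seen, and "pause"
 * whenever at least budget_pages pages have accumulated since the last
 * pause. Returns the number of pauses; *pages_freed receives the total.
 */
static size_t model_free_with_budget(const struct model_fragment *frags,
				     size_t n, size_t budget_pages,
				     size_t *pages_freed)
{
	size_t pending = 0, pauses = 0, total = 0;

	assert(budget_pages > 0);
	for (size_t i = 0; i < n; i++) {
		pending += frags[i].nr_pages;
		total += frags[i].nr_pages;
		if (pending >= budget_pages) {
			/* Real code would release the collected pages and
			 * then offer to reschedule at this point. */
			pauses++;
			pending = 0;
		}
	}
	*pages_freed = total;
	return pauses;
}
```

With a budget of 512 pages, a batch of 2M fragments pauses after every fragment, while a batch of order-0 pages pauses once per 512 entries, so the time between pauses is bounded by pages rather than by entries.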
Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On 31/01/2024 10:16, David Hildenbrand wrote:
> On 31.01.24 03:20, Yin Fengwei wrote:
>> On 1/29/24 22:32, David Hildenbrand wrote:
>>> This series is based on [1] and must be applied on top of it.
>>> Similar to what we did with fork(), let's implement PTE batching
>>> during unmap/zap when processing PTE-mapped THPs.
>>>
>>> We collect consecutive PTEs that map consecutive pages of the same large
>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>> the refcount only once per batch, (b) call rmap handling functions only
>>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>>> entry removal once per batch.
>>>
>>> Ryan was previously working on this in the context of cont-pte for
>>> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
>>> This series implements the optimization for all architectures, independent
>>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>>> large-folio-pages batches as well, and makes use of our new rmap batching
>>> function when removing the rmap.
>>>
>>> To achieve that, we have to enlighten MMU gather / page freeing code
>>> (i.e., everything that consumes encoded_page) to process unmapping
>>> of consecutive pages that all belong to the same large folio. I'm being
>>> very careful to not degrade order-0 performance, and it looks like I
>>> managed to achieve that.
>>
>
> Let's CC Linus and Michal to make sure I'm not daydreaming.
>
> Relevant patch:
>   https://lkml.kernel.org/r/20240129143221.263763-8-da...@redhat.com
>
> Context: I'm adjusting MMU gather code to support batching of consecutive
> pages that belong to the same large folio, when unmapping/zapping PTEs.
>
> For small folios, there is no (relevant) change.
>
> Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512
> PTEs: Instead of adding 512 individual encoded_page entries, we add a
> combined entry that expresses "page+nr_pages". That allows for "easily"
> adding various other per-folio batching (refcount, rmap, swap freeing).
>
> The implication is that we can now batch effectively more pages with large
> folios, exceeding the old 10k limit. The number of involved *folios* does
> not increase, though.
>
>> One possible scenario:
>> If all folios are 2M in size, then one full batch could hold 510M of
>> memory. Is that too much, given that one full batch could previously hold
>> only (2M - 4096 * 2) of memory?
>
> Excellent point, I think there are three parts to it:
>
> (1) Batch pages / folio fragments per batch page
>
> Before this change (and with 4k folios) we have exactly one page (4k) per
> encoded_page entry in the batch. Now, we can have (with 2M folios), 512 pages
> for every two encoded_page entries (page+nr_pages) in a batch page. So an
> average ~256 pages per encoded_page entry.
>
> So one batch page can now store in the worst case ~256 times the number of
> pages, but the number of folio fragments ("pages+nr_pages") would not
> increase.
>
> The time it takes to perform the actual page freeing of a batch will not be
> 256 times higher -- the time is expected to be much closer to the old time
> (i.e., not freeing more folios).

IIRC there is an option to zero memory when it is freed back to the buddy?
So that could be a place where time is proportional to size rather than
proportional to folio count? But I think that option is intended for debug
only? So perhaps not a problem in practice?

>
> (2) Delayed rmap handling
>
> We limit batching early (see tlb_next_batch()) when we have delayed rmap
> pending. Reason being, that we don't want to check for many entries if they
> require delayed rmap handling, while still holding the page table lock (see
> tlb_flush_rmaps()), because we have to remove the rmap before dropping the
> PTL.
>
> Note that we perform the check whether we need delayed rmap handling per
> page+nr_pages entry, not per page. So we won't perform more such checks.
>
> Once we set tlb->delayed_rmap (because we add one entry that requires it), we
> already force a flush before dropping the PT lock. So once we get a single
> delayed rmap entry in there, we will not batch more than we could have in the
> same page table: so not more than 512 entries (x86-64) in the worst case. So
> it will still be bounded, and not significantly more than what we had before.
>
> So regarding delayed rmap handling I think this should be fine.
>
> (3) Total batched pages
>
> MAX_GATHER_BATCH_COUNT effectively limits the number of pages we allocate
> (full batches), and thereby limits the number of pages we were able to batch.
>
> The old limit was ~10k pages, now we could batch ~5000 folio fragments
> (page+nr_pages), resulting in the "times 256" increase in the worst case on
> x86-64 as you point out.
>
> This 10k pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to
Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On 31.01.24 03:20, Yin Fengwei wrote:
> On 1/29/24 22:32, David Hildenbrand wrote:
>> This series is based on [1] and must be applied on top of it.
>> Similar to what we did with fork(), let's implement PTE batching
>> during unmap/zap when processing PTE-mapped THPs.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>> entry removal once per batch.
>>
>> Ryan was previously working on this in the context of cont-pte for
>> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
>> This series implements the optimization for all architectures, independent
>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>> large-folio-pages batches as well, and makes use of our new rmap batching
>> function when removing the rmap.
>>
>> To achieve that, we have to enlighten MMU gather / page freeing code
>> (i.e., everything that consumes encoded_page) to process unmapping
>> of consecutive pages that all belong to the same large folio. I'm being
>> very careful to not degrade order-0 performance, and it looks like I
>> managed to achieve that.

Let's CC Linus and Michal to make sure I'm not daydreaming.

Relevant patch:
  https://lkml.kernel.org/r/20240129143221.263763-8-da...@redhat.com

Context: I'm adjusting MMU gather code to support batching of consecutive
pages that belong to the same large folio, when unmapping/zapping PTEs.

For small folios, there is no (relevant) change.

Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512
PTEs: Instead of adding 512 individual encoded_page entries, we add a
combined entry that expresses "page+nr_pages". That allows for "easily"
adding various other per-folio batching (refcount, rmap, swap freeing).

The implication is that we can now batch effectively more pages with large
folios, exceeding the old 10k limit. The number of involved *folios* does
not increase, though.

> One possible scenario:
> If all folios are 2M in size, then one full batch could hold 510M of
> memory. Is that too much, given that one full batch could previously hold
> only (2M - 4096 * 2) of memory?

Excellent point, I think there are three parts to it:

(1) Batch pages / folio fragments per batch page

Before this change (and with 4k folios) we have exactly one page (4k) per
encoded_page entry in the batch. Now, we can have (with 2M folios), 512
pages for every two encoded_page entries (page+nr_pages) in a batch page.
So an average ~256 pages per encoded_page entry.

So one batch page can now store in the worst case ~256 times the number of
pages, but the number of folio fragments ("pages+nr_pages") would not
increase.

The time it takes to perform the actual page freeing of a batch will not
be 256 times higher -- the time is expected to be much closer to the old
time (i.e., not freeing more folios).

(2) Delayed rmap handling

We limit batching early (see tlb_next_batch()) when we have delayed rmap
pending. Reason being, that we don't want to check for many entries if
they require delayed rmap handling, while still holding the page table
lock (see tlb_flush_rmaps()), because we have to remove the rmap before
dropping the PTL.

Note that we perform the check whether we need delayed rmap handling per
page+nr_pages entry, not per page. So we won't perform more such checks.

Once we set tlb->delayed_rmap (because we add one entry that requires it),
we already force a flush before dropping the PT lock. So once we get a
single delayed rmap entry in there, we will not batch more than we could
have in the same page table: so not more than 512 entries (x86-64) in the
worst case. So it will still be bounded, and not significantly more than
what we had before.

So regarding delayed rmap handling I think this should be fine.

(3) Total batched pages

MAX_GATHER_BATCH_COUNT effectively limits the number of pages we allocate
(full batches), and thereby limits the number of pages we were able to
batch.

The old limit was ~10k pages, now we could batch ~5000 folio fragments
(page+nr_pages), resulting in the "times 256" increase in the worst case
on x86-64 as you point out.

This 10k pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to
handle soft-lockups.

As the number of effective folios we are freeing does not increase, I
*think* this should be fine.

If any of that is a problem, we would have to keep track of the total
number of pages in our batch, and stop as soon as we hit our 10k limit --
independent of page vs. folio fragment. Something I would like to avoid if
possible.

--
Cheers,

David / dhildenb
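To make the "page+nr_pages" bookkeeping from part (1) concrete, here is a small user-space model. It is a sketch under assumptions, not the kernel's encoded_page implementation: model_entry, MODEL_FLAG_NR_PAGES_NEXT, model_add_fragment() and model_count() are hypothetical names. A single page takes one slot; a larger fragment takes two slots, with a flag in the low bit of the first slot announcing that the next slot carries nr_pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: the following slot holds nr_pages, not a page. */
#define MODEL_FLAG_NR_PAGES_NEXT ((uintptr_t)0x1)

typedef uintptr_t model_entry;

/* Append one fragment at slots[pos]; returns slots used (0 if full). */
static size_t model_add_fragment(model_entry *slots, size_t pos, size_t cap,
				 uintptr_t page_addr, size_t nr_pages)
{
	/* The low bit must be free so it can carry the flag. */
	assert((page_addr & MODEL_FLAG_NR_PAGES_NEXT) == 0);
	assert(nr_pages >= 1);

	if (nr_pages == 1) {
		if (pos + 1 > cap)
			return 0;
		slots[pos] = page_addr;
		return 1;
	}
	if (pos + 2 > cap)
		return 0;
	slots[pos] = page_addr | MODEL_FLAG_NR_PAGES_NEXT;
	slots[pos + 1] = (model_entry)nr_pages;
	return 2;
}

/* Count fragments and total pages described by the first 'used' slots. */
static void model_count(const model_entry *slots, size_t used,
			size_t *fragments, size_t *pages)
{
	*fragments = 0;
	*pages = 0;
	for (size_t i = 0; i < used; i++) {
		(*fragments)++;
		if (slots[i] & MODEL_FLAG_NR_PAGES_NEXT) {
			assert(i + 1 < used);
			*pages += (size_t)slots[++i];
		} else {
			*pages += 1;
		}
	}
}
```

In this model, one order-0 page plus one fully zapped 2M folio occupy three slots yet describe 513 pages in two fragments, which is the sense in which the page count can grow while the number of folio fragments does not.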
Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On 1/29/24 22:32, David Hildenbrand wrote:
> This series is based on [1] and must be applied on top of it.
> Similar to what we did with fork(), let's implement PTE batching
> during unmap/zap when processing PTE-mapped THPs.
>
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
> entry removal once per batch.
>
> Ryan was previously working on this in the context of cont-pte for
> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
> This series implements the optimization for all architectures, independent
> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
> large-folio-pages batches as well, and makes use of our new rmap batching
> function when removing the rmap.
>
> To achieve that, we have to enlighten MMU gather / page freeing code
> (i.e., everything that consumes encoded_page) to process unmapping
> of consecutive pages that all belong to the same large folio. I'm being
> very careful to not degrade order-0 performance, and it looks like I
> managed to achieve that.

One possible scenario:
If all folios are 2M in size, then one full batch could hold 510M of
memory. Is that too much, given that one full batch could previously hold
only (2M - 4096 * 2) of memory?

Regards
Yin, Fengwei
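The two figures in the scenario above can be reproduced with a few lines of arithmetic. This is a worked example under assumptions: 4 KiB base pages, and 510 usable entries per batch page, a number inferred here from the quoted (2M - 4096 * 2) figure rather than read from the kernel source. The second function additionally assumes that each large folio occupies two entries, as described elsewhere in the thread. All names are hypothetical.

```c
#include <stddef.h>

/* Assumed values, see the text above: (2 MiB - 8 KiB) / 4 KiB = 510. */
enum {
	MODEL_PAGE_SIZE = 4096,
	MODEL_ENTRIES_PER_BATCH_PAGE = 510
};

/* Memory described by one full batch page of order-0 (4 KiB) entries. */
static size_t model_batch_bytes_small(void)
{
	return (size_t)MODEL_ENTRIES_PER_BATCH_PAGE * MODEL_PAGE_SIZE;
}

/*
 * Memory described by one full batch page when every folio consumes
 * two entries (page + nr_pages) and is folio_bytes in size.
 */
static size_t model_batch_bytes_large(size_t folio_bytes)
{
	size_t folios = MODEL_ENTRIES_PER_BATCH_PAGE / 2;

	return folios * folio_bytes;
}
```

Under these assumptions the first function yields 2 MiB minus 8 KiB, and the second yields 510 MiB for 2 MiB folios (255 folios per batch page), a factor of roughly 256.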
[PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.

While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64[2], it's one of the requirements for maintaining
a total mapcount[3] for large folios with minimal added overhead and
further changes[4] that build up on top of the total mapcount.

Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable |      New | Change
---------------------------------------------
      4KiB |    0.058110 | 0.057715 |   - 1%
     16KiB |    0.044198 | 0.035469 |   -20%
     32KiB |    0.034216 | 0.023522 |   -31%
     64KiB |    0.029207 | 0.018434 |   -37%
    128KiB |    0.026579 | 0.014026 |   -47%
    256KiB |    0.025130 | 0.011756 |   -53%
    512KiB |    0.024292 | 0.010703 |   -56%
   1024KiB |    0.023812 | 0.010294 |   -57%
   2048KiB |    0.023785 | 0.009910 |   -58%

CCing especially s390x folks, because they have TLB freeing hooks that
need adjustment. Only tested on x86-64 for now, will have to do some more
stress testing. Compile-tested on most other architectures. The PPC
change is negligible and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/20240129124649.189745-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.robe...@arm.com

Cc: Andrew Morton
Cc: Matthew Wilcox (Oracle)
Cc: Ryan Roberts
Cc: Catalin Marinas
Cc: Will Deacon
Cc: "Aneesh Kumar K.V"
Cc: Nick Piggin
Cc: Peter Zijlstra
Cc: Michael Ellerman
Cc: Christophe Leroy
Cc: "Naveen N. Rao"
Cc: Heiko Carstens
Cc: Vasily Gorbik
Cc: Alexander Gordeev
Cc: Christian Borntraeger
Cc: Sven Schnelle
Cc: Arnd Bergmann
Cc: linux-a...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org

David Hildenbrand (9):
  mm/memory: factor out zapping of present pte into zap_present_pte()
  mm/memory: handle !page case in zap_present_pte() separately
  mm/memory: further separate anon and pagecache folio handling in
    zap_present_pte()
  mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  mm/mmu_gather: pass "delay_rmap" instead of encoded page to
    __tlb_remove_page_size()
  mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  mm/mmu_gather: add __tlb_remove_folio_pages()
  mm/mmu_gather: add tlb_remove_tlb_entries()
  mm/memory: optimize unmap/zap with PTE-mapped THP

 arch/powerpc/include/asm/tlb.h |   2 +
 arch/s390/include/asm/tlb.h    |  30 --
 include/asm-generic/tlb.h      |  40 ++--
 include/linux/mm_types.h       |  37 ++--
 include/linux/pgtable.h        |  66 +
 mm/memory.c                    | 167 +++--
 mm/mmu_gather.c                |  63 +++--
 mm/swap.c                      |  12 ++-
 mm/swap_state.c                |  12 ++-
 9 files changed, 347 insertions(+), 82 deletions(-)

--
2.43.0
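The batching rule from the cover letter ("consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible") can be illustrated with a small user-space model. This is only a sketch under assumptions: struct model_pte and model_pte_batch() are hypothetical, a PTE is reduced to a present bit, a PFN and an opaque flags word, and "compatible" is simplified to "identical flags". The series' actual helpers and their rules for which bits may differ are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified page table entry. */
struct model_pte {
	bool present;
	uint64_t pfn;
	uint32_t flags;
};

/*
 * Starting at ptes[0], count how many consecutive entries map
 * consecutive PFNs inside the folio [folio_start_pfn,
 * folio_start_pfn + folio_nr_pages) with identical flags.
 * Returns 0 if ptes[0] is not present or not part of the folio.
 */
static size_t model_pte_batch(const struct model_pte *ptes, size_t max_nr,
			      uint64_t folio_start_pfn,
			      uint64_t folio_nr_pages)
{
	uint64_t folio_end_pfn = folio_start_pfn + folio_nr_pages;
	size_t nr;

	if (max_nr == 0 || !ptes[0].present)
		return 0;
	if (ptes[0].pfn < folio_start_pfn || ptes[0].pfn >= folio_end_pfn)
		return 0;

	for (nr = 1; nr < max_nr; nr++) {
		if (!ptes[nr].present)
			break;
		if (ptes[nr].pfn != ptes[0].pfn + nr)
			break;		/* not physically consecutive */
		if (ptes[nr].pfn >= folio_end_pfn)
			break;		/* ran past the end of the folio */
		if (ptes[nr].flags != ptes[0].flags)
			break;		/* "other PTE bits" differ */
	}
	return nr;
}
```

A caller would then perform steps (a) to (d) once for the returned count instead of once per PTE, and continue scanning at the first entry that broke the batch.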