Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On 1/31/2024 6:30 PM, David Hildenbrand wrote:
> On 31.01.24 03:30, Yin Fengwei wrote:
>> On 1/29/24 22:32, David Hildenbrand wrote:
>>> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
>>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>>> +{
>>> +	pte_t pte, tmp_pte;
>>> +
>>> +	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>>> +	while (--nr) {
>>> +		ptep++;
>>> +		addr += PAGE_SIZE;
>>> +		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>>> +		if (pte_dirty(tmp_pte))
>>> +			pte = pte_mkdirty(pte);
>>> +		if (pte_young(tmp_pte))
>>> +			pte = pte_mkyoung(pte);
>>
>> I am wondering whether it's worth moving the pte_mkdirty() and
>> pte_mkyoung() out of the loop and doing them just once if needed.
>> In the worst case they are called nr - 1 times. Or is that just
>> too micro?
>
> I also thought about just indicating "any_accessed" or "any_dirty"
> using flags to the caller, to avoid the PTE modifications completely.
> Felt a bit micro-optimized.
>
> Regarding your proposal: I thought about that as well, but my
> assumption was that dirty+young are "cheap" to set.
>
> On x86, pte_mkyoung() sets _PAGE_ACCESSED. pte_mkdirty() sets
> _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has to handle the
> SavedDirty bit, using some bit trickery.
>
> So at least for pte_mkyoung() there would be no real benefit as far
> as I can see (it might even be worse). For pte_mkdirty() there might
> be a small benefit.
>
> Is it going to be measurable? Likely not.

Yeah. We can do more investigation if performance profiling calls
this out.

Regards
Yin, Fengwei

> Am I missing something?
>
> Thanks!
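[Editor's note: for context, a minimal sketch of the "flags to the
caller" alternative mentioned above. The function name and the
any_young/any_dirty out-parameters are hypothetical, purely for
illustration; this is not code from the posted series.]

/*
 * Hypothetical variant: accumulate accessed/dirty into caller-provided
 * flags instead of folding them into the returned PTE, so that no
 * pte_mk*() calls happen inside the loop at all.
 */
static inline pte_t get_and_clear_full_ptes_flags(struct mm_struct *mm,
		unsigned long addr, pte_t *ptep, unsigned int nr, int full,
		bool *any_young, bool *any_dirty)
{
	pte_t pte, tmp_pte;

	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
	*any_young = pte_young(pte);
	*any_dirty = pte_dirty(pte);
	while (--nr) {
		ptep++;
		addr += PAGE_SIZE;
		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		*any_young |= pte_young(tmp_pte);
		*any_dirty |= pte_dirty(tmp_pte);
	}
	return pte;
}

The caller would then decide whether to mark the folio accessed/dirty
once per batch, rather than fixing up the returned PTE value per page.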
Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On 1/29/24 22:32, David Hildenbrand wrote:
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);

I am wondering whether it's worth moving the pte_mkdirty() and
pte_mkyoung() out of the loop and doing them just once if needed. In
the worst case they are called nr - 1 times. Or is that just too
micro?

Regards
Yin, Fengwei

> +	}
> +	return pte;
> +}
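[Editor's note: as an illustration only, the hoisting suggested above
might look roughly like the following. This is a sketch against the
quoted code, not a posted patch.]

	pte_t pte, tmp_pte;
	bool dirty = false, young = false;

	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
	while (--nr) {
		ptep++;
		addr += PAGE_SIZE;
		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		/* Only record the bits inside the loop ... */
		dirty |= pte_dirty(tmp_pte);
		young |= pte_young(tmp_pte);
	}
	/* ... and set them at most once per batch. */
	if (dirty)
		pte = pte_mkdirty(pte);
	if (young)
		pte = pte_mkyoung(pte);
	return pte;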
Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
On 1/29/24 22:32, David Hildenbrand wrote:
> This series is based on [1] and must be applied on top of it.
> Similar to what we did with fork(), let's implement PTE batching
> during unmap/zap when processing PTE-mapped THPs.
>
> We collect consecutive PTEs that map consecutive pages of the same
> large folio, making sure that the other PTE bits are compatible, and
> (a) adjust the refcount only once per batch, (b) call rmap handling
> functions only once per batch, (c) perform batch PTE setting/updates
> and (d) perform TLB entry removal once per batch.
>
> Ryan was previously working on this in the context of cont-pte for
> arm64, in the latest iteration [2] with a focus on arm64 with
> cont-pte only. This series implements the optimization for all
> architectures, independent of such PTE bits, teaches MMU gather/TLB
> code to be fully aware of such large-folio page batches as well, and
> makes use of our new rmap batching function when removing the rmap.
>
> To achieve that, we have to enlighten MMU gather / page freeing code
> (i.e., everything that consumes encoded_page) to process unmapping
> of consecutive pages that all belong to the same large folio. I'm
> being very careful to not degrade order-0 performance, and it looks
> like I managed to achieve that.

One possible scenario: if every folio is a 2M folio, one full batch
could hold 510M of memory. Is that too much, given that before this
change one full batch could only hold (2M - 4096 * 2) of memory?

Regards
Yin, Fengwei
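[Editor's note: to make the batching idea concrete, a deliberately
simplified sketch of how one might count consecutive PTEs mapping
consecutive pages of the same large folio. This is not the series'
actual helper; the real code also has to check that the other PTE bits
are compatible, as the cover letter notes, and the name is
illustrative.]

/*
 * Simplified sketch: count how many of the next max_nr PTEs map
 * consecutive pages of the given large folio. Only presence and PFN
 * contiguity are checked here; the real series additionally compares
 * the remaining PTE bits for compatibility.
 */
static inline int count_consecutive_ptes(struct folio *folio,
		pte_t *ptep, pte_t pte, int max_nr)
{
	unsigned long expected_pfn = pte_pfn(pte) + 1;
	int nr = 1;

	while (nr < max_nr) {
		pte_t next = ptep_get(ptep + nr);

		if (!pte_present(next) || pte_pfn(next) != expected_pfn)
			break;
		/* Stop at the folio boundary. */
		if (pfn_folio(expected_pfn) != folio)
			break;
		expected_pfn++;
		nr++;
	}
	return nr;
}

With such a count in hand, the caller can drop the folio reference
count once for the whole batch and issue a single rmap removal,
instead of doing per-page calls.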
Re: API for setting multiple PTEs at once
On 2/8/2023 7:23 PM, Alexandre Ghiti wrote:
> Hi Matthew,
>
> On 2/7/23 21:27, Matthew Wilcox wrote:
>> On Thu, Feb 02, 2023 at 09:14:23PM +, Matthew Wilcox wrote:
>>> For those of you not subscribed, linux-mm is currently discussing
>>> how best to handle page faults on large folios. I simply made it
>>> work when adding large folio support. Now Yin Fengwei is working
>>> on making it fast.
>>
>> OK, here's an actual implementation:
>>
>> https://lore.kernel.org/linux-mm/20230207194937.122543-3-wi...@infradead.org/
>>
>> It survives a run of xfstests. If your architecture doesn't store
>> its PFNs at PAGE_SHIFT, you're going to want to implement your own
>> set_ptes(),
>
> riscv stores its PFNs at PAGE_PFN_SHIFT instead of PAGE_SHIFT, so we
> need to reimplement set_ptes. But I have been playing with your
> patchset and we never fall into the case where set_ptes is called
> with nr > 1; any idea why? I booted a large ubuntu defconfig and
> launched will_it_scale.page_fault4.

You need to use the xfs filesystem to get large folios for file
mappings. Other filesystems may also work, but I have only tried xfs.

Thanks.

Regards
Yin, Fengwei

> I'll come up with a proper implementation of set_ptes anyway soon.
>
> Thanks,
>
> Alex
>
>> or you'll see entirely the wrong pages mapped into userspace. You
>> may also wish to implement set_ptes() if it can be done more
>> efficiently than __pte(pte_val(pte) + PAGE_SIZE).
>>
>> Architectures that implement things like flush_icache_page() and
>> update_mmu_cache() may want to propose batched versions of those.
>> That's alpha, csky, m68k, mips, nios2, parisc, sh, arm, loongarch,
>> openrisc, powerpc, riscv, sparc and xtensa. Maintainers BCC'd,
>> mailing lists CC'd.
>>
>> I'm happy to collect implementations and submit them as part of
>> a v6.
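[Editor's note: to make the PAGE_SHIFT remark concrete, a rough sketch
of a generic-style set_ptes() loop. It assumes the architecture stores
the PFN at PAGE_SHIFT and is illustrative rather than the merged
implementation, which also handles lazy-MMU batching and page-table
checks.]

/*
 * Illustrative generic-style set_ptes(): when the PFN is stored at
 * PAGE_SHIFT, adding PAGE_SIZE to the raw PTE value advances the
 * mapping by exactly one page.
 */
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
		pte_t *ptep, pte_t pte, unsigned int nr)
{
	for (;;) {
		set_pte_at(mm, addr, ptep, pte);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
		pte = __pte(pte_val(pte) + PAGE_SIZE);
	}
}

An architecture like riscv, whose PFN sits at a different bit position,
would instead advance the raw value by one PFN step, conceptually
pte_val(pte) + (1UL << PAGE_PFN_SHIFT) using the shift named in the
mail above, which is why it needs its own set_ptes().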