Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Yin, Fengwei




On 1/31/2024 6:30 PM, David Hildenbrand wrote:

On 31.01.24 03:30, Yin Fengwei wrote:



On 1/29/24 22:32, David Hildenbrand wrote:

+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+	pte_t pte, tmp_pte;
+
+	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
I am wondering whether it's worthwhile to move the pte_mkdirty() and
pte_mkyoung() out of the loop and just do them one time if needed. The
worst case is that they are called nr - 1 times. Or is it just too micro?


I also thought about just indicating "any_accessed" or "any_dirty" using 
flags to the caller, to avoid the PTE modifications completely. Felt a 
bit micro-optimized.


Regarding your proposal: I thought about that as well, but my assumption
was that dirty+young are "cheap" to set.


On x86, pte_mkyoung() is setting _PAGE_ACCESSED.
pte_mkdirty() is setting _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has
to handle the SavedDirty logic, using some bit trickery.


So at least for pte_mkyoung() there would be no real benefit as far as I 
can see (might be even worse). For pte_mkdirty() there might be a small 
benefit.


Is it going to be measurable? Likely not. Am I missing something? Thanks!

Yeah. We can do more investigation when performance profiling calls this
out.


Regards
Yin, Fengwei
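
(For reference, a minimal sketch of the variant Fengwei suggests above,
with the dirty/young folding moved out of the loop; this is illustrative
only, not part of the posted patch:)

static inline pte_t get_and_clear_full_ptes_alt(struct mm_struct *mm,
		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
{
	bool any_dirty = false, any_young = false;
	pte_t pte, tmp_pte;

	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
	while (--nr) {
		ptep++;
		addr += PAGE_SIZE;
		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		/* Only accumulate the flags inside the loop ... */
		any_dirty |= pte_dirty(tmp_pte);
		any_young |= pte_young(tmp_pte);
	}
	/* ... and fold each of them into the returned PTE at most once. */
	if (any_dirty)
		pte = pte_mkdirty(pte);
	if (any_young)
		pte = pte_mkyoung(pte);
	return pte;
}

As David notes, on x86 this would at best save the repeated SavedDirty
handling in pte_mkdirty(); pte_mkyoung() is already a single flag set, so
no benefit is expected there.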






Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-30 Thread Yin Fengwei



On 1/29/24 22:32, David Hildenbrand wrote:
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
I am wondering whether it's worthwhile to move the pte_mkdirty() and pte_mkyoung()
out of the loop and just do them one time if needed. The worst case is that they
are called nr - 1 times. Or is it just too micro?


Regards
Yin, Fengwei

> +	}
> +	return pte;
> +}


Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-30 Thread Yin Fengwei
On 1/29/24 22:32, David Hildenbrand wrote:
> This series is based on [1] and must be applied on top of it.
> Similar to what we did with fork(), let's implement PTE batching
> during unmap/zap when processing PTE-mapped THPs.
> 
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
> entry removal once per batch.
> 
> Ryan was previously working on this in the context of cont-pte for
> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
> This series implements the optimization for all architectures, independent
> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
> large-folio page batches as well, and makes use of our new rmap batching
> function when removing the rmap.
> 
> To achieve that, we have to enlighten MMU gather / page freeing code
> (i.e., everything that consumes encoded_page) to process unmapping
> of consecutive pages that all belong to the same large folio. I'm being
> very careful to not degrade order-0 performance, and it looks like I
> managed to achieve that.
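
(For illustration, a rough sketch of the per-batch flow described above;
helper names follow this series and its prerequisite batching series [1],
but signatures are simplified, so treat this as a sketch rather than the
actual patch:)

static void zap_folio_pte_batch_sketch(struct mmu_gather *tlb,
		struct vm_area_struct *vma, struct folio *folio,
		struct page *page, pte_t *pte, pte_t ptent,
		unsigned long addr, unsigned int max_nr)
{
	struct mm_struct *mm = tlb->mm;
	unsigned int nr;

	/* Collect consecutive PTEs mapping consecutive pages of the folio. */
	nr = folio_pte_batch(folio, addr, pte, ptent, max_nr);

	/* (c) Perform the PTE clearing for the whole batch at once ... */
	ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
	if (pte_dirty(ptent))
		folio_mark_dirty(folio);

	/* (d) ... and queue the TLB entry removal once per batch. */
	tlb_remove_tlb_entries(tlb, pte, nr, addr);

	/* (b) Call the rmap handling function only once per batch. */
	folio_remove_rmap_ptes(folio, page, nr, vma);

	/*
	 * (a) Hand all nr pages to the MMU gather in one go; the folio
	 * refcount is adjusted once per batch. Flush if the batch is full.
	 */
	if (__tlb_remove_folio_pages(tlb, page, nr, false))
		tlb_flush_mmu(tlb);
}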

One possible scenario:
If all the folios are 2M folios, then one full batch could hold 510M of memory.
Is that too much, given that before this change one full batch could only hold
(2M - 4096 * 2) of memory?
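
(These numbers follow if one assumes MAX_GATHER_BATCH =
PAGE_SIZE / sizeof(void *) - 2 = 510 entries per batch page on 64-bit with
4K pages, and that each large folio consumes two batch entries, the encoded
page pointer plus the encoded nr_pages:

	before: 510 entries * 4K per page  = 2M - 8K, i.e. "2M - 4096 * 2"
	after:  255 folios  * 2M per folio = 510M)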


Regards
Yin, Fengwei



Re: API for setting multiple PTEs at once

2023-02-08 Thread Yin, Fengwei



On 2/8/2023 7:23 PM, Alexandre Ghiti wrote:
> Hi Matthew,
> 
> On 2/7/23 21:27, Matthew Wilcox wrote:
>> On Thu, Feb 02, 2023 at 09:14:23PM +, Matthew Wilcox wrote:
>>> For those of you not subscribed, linux-mm is currently discussing
>>> how best to handle page faults on large folios.  I simply made it work
>>> when adding large folio support.  Now Yin Fengwei is working on
>>> making it fast.
>> OK, here's an actual implementation:
>>
>> https://lore.kernel.org/linux-mm/20230207194937.122543-3-wi...@infradead.org/
>>
>> It survives a run of xfstests.  If your architecture doesn't store its
>> PFNs at PAGE_SHIFT, you're going to want to implement your own set_ptes(),
> 
> 
> riscv stores its pfn at PAGE_PFN_SHIFT instead of PAGE_SHIFT, so we need to 
> reimplement set_ptes. But I have been playing with your patchset and we never 
> fall into the case where set_ptes is called with nr > 1, any idea why? I 
> booted a large ubuntu defconfig and launched will_it_scale.page_fault4.
You need to use the xfs filesystem to get large folios for file mappings.
Other filesystems may also work, but I have only tried xfs. Thanks.


Regards
Yin, Fengwei

> 
> I'll come up with the proper implementation of set_ptes anyway soon.
> 
> Thanks,
> 
> Alex
> 
> 
>> or you'll see entirely the wrong pages mapped into userspace.  You may
>> also wish to implement set_ptes() if it can be done more efficiently
>> than __pte(pte_val(pte) + PAGE_SIZE).
>>
>> Architectures that implement things like flush_icache_page() and
>> update_mmu_cache() may want to propose batched versions of those.
>> That's alpha, csky, m68k, mips, nios2, parisc, sh,
>> arm, loongarch, openrisc, powerpc, riscv, sparc and xtensa.
>> Maintainers BCC'd, mailing lists CC'd.
>>
>> I'm happy to collect implementations and submit them as part of a v6.
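
(For illustration, a minimal sketch of the default behavior described
above, assuming the PFN lives at PAGE_SHIFT in the raw PTE; simplified
relative to the patch linked above, so details may differ:)

static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
		pte_t *ptep, pte_t pte, unsigned int nr)
{
	for (;;) {
		set_pte(ptep, pte);	/* write one entry */
		if (--nr == 0)
			break;
		ptep++;
		/*
		 * Advance the PFN by one page. This is only correct when
		 * the PFN is stored at PAGE_SHIFT; riscv stores it at
		 * PAGE_PFN_SHIFT, hence the need for its own set_ptes().
		 */
		pte = __pte(pte_val(pte) + PAGE_SIZE);
	}
}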
>>