If a process (qemu) using a lot of CPUs (128) tries to munmap() a large chunk of memory (496GB) mapped with THP, the call takes 275 seconds on average, which can cause serious problems for the workload (in qemu's case, the guest locks up for that whole time).

While looking for the source of this problem, I found that most of this time is spent in serialize_against_pte_lookup(). This function spends a long time in smp_call_function_many() if more than a couple of CPUs are running the user process. Since it has to be called for every THP mapped, it takes a very long time for large amounts of memory.

According to the docs, serialize_against_pte_lookup() is needed to prevent the pmd_t to pte_t casting inside find_current_mm_pte() from happening concurrently with the next part of the functions it is called in. It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[].

So, as far as I understand, if no find_current_mm_pte() is running, there is no need to call serialize_against_pte_lookup().

To avoid the cost of running serialize_against_pte_lookup(), I propose a counter that keeps track of how many find_current_mm_pte() calls are currently running. If there are none, smp_call_function_many() is simply skipped.

On my workload (qemu), munmap() time dropped from 275 seconds to 418ms.

Signed-off-by: Leonardo Bras <leona...@linux.ibm.com>
---
I need help from more experienced people to understand whether this is really a valid improvement, and whether mm_struct is the best place to put such a counter.

Thanks!
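
To show the idea in isolation, here is a minimal userspace sketch in C11. It is not the kernel code: struct mm_sketch, lookup_begin(), lookup_end() and serialize_sketch() are hypothetical names made up for illustration, and the broadcasts field only counts the points where smp_call_function_many() would have been called.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counter this patch adds to mm_struct. */
struct mm_sketch {
	atomic_llong lockless_ptlookup_count;
	long broadcasts;	/* simulated smp_call_function_many() calls */
};

/* Mirrors the atomic64_inc() at the start of find_current_mm_pte(). */
static void lookup_begin(struct mm_sketch *mm)
{
	atomic_fetch_add(&mm->lockless_ptlookup_count, 1);
}

/* Mirrors the atomic64_dec() just before find_current_mm_pte() returns. */
static void lookup_end(struct mm_sketch *mm)
{
	atomic_fetch_sub(&mm->lockless_ptlookup_count, 1);
}

/* Returns true when the expensive cross-CPU broadcast was needed. */
static bool serialize_sketch(struct mm_sketch *mm)
{
	/* Full fence, playing the role of smp_mb() in the kernel function. */
	atomic_thread_fence(memory_order_seq_cst);

	/* Fast path: no lockless lookup in flight, so skip the broadcast. */
	if (atomic_load(&mm->lockless_ptlookup_count) == 0)
		return false;

	mm->broadcasts++;	/* stands in for smp_call_function_many() */
	return true;
}
```

In this sketch the counter covers only the lookup itself, not whatever the caller later does with the returned pointer. Whether that window matters is part of what I am asking about above.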
---
 arch/powerpc/include/asm/pte-walk.h | 3 +++
 arch/powerpc/mm/book3s64/pgtable.c  | 2 ++
 include/linux/mm_types.h            | 1 +
 3 files changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
index 33fa5dd8ee6a..3b82cb3bd563 100644
--- a/arch/powerpc/include/asm/pte-walk.h
+++ b/arch/powerpc/include/asm/pte-walk.h
@@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
 {
 	pte_t *pte;
 
+	atomic64_inc(&current->mm->lockless_ptlookup_count);
+
 	VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", __func__);
 	VM_WARN(pgdir != current->mm->pgd,
 		"%s lock less page table lookup called on wrong mm\n", __func__);
@@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
 	if (hshift)
 		WARN_ON(*hshift);
 #endif
+	atomic64_dec(&current->mm->lockless_ptlookup_count);
 	return pte;
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..8f6fc2f80071 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,6 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
 	smp_mb();
+	if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
+		return;
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..97fb2545e967 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -518,6 +518,7 @@ struct mm_struct {
 #endif
 	} __randomize_layout;
 
+	atomic64_t lockless_ptlookup_count;
 	/*
 	 * The mm_cpumask needs to be at the end of mm_struct, because it
 	 * is dynamically sized based on nr_cpu_ids.
-- 
2.20.1