It seems Intel SMT siblings still share the TLB pool, so flushing the TLBs
of both siblings just causes an extra, useless IPI and an extra flush. The
extra flush evicts TLB entries that the other sibling just loaded.
That's a double waste.

The micro benchmark shows memory access can save about 25% of its time on
my Haswell i7 desktop.
munmap source code is here: https://lkml.org/lkml/2012/5/17/59

test result on Kernel v4.5.0:
$ /home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap uses 57ms, 14072 ns/call; memory access runs 48356 times/thread/ms, costing 20 ns/access

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        18,739,808      dTLB-load-misses          #    2.47% of all dTLB cache hits   (43.05%)
       757,380,911      dTLB-loads                                                    (34.34%)
         2,125,275      dTLB-store-misses                                             (32.23%)
       318,307,759      dTLB-stores                                                   (46.32%)
            32,765      iTLB-load-misses          #    2.03% of all iTLB cache hits   (56.90%)
         1,616,237      iTLB-loads                                                    (44.47%)
            41,476      tlb:tlb_flush

       1.443484546 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 32262

test result on Kernel v4.5.0 + this patch:
$ /home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap uses 48ms, 11933 ns/call; memory access runs 59966 times/thread/ms, costing 16 ns/access

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        15,984,772      dTLB-load-misses          #    1.89% of all dTLB cache hits   (41.72%)
       844,099,241      dTLB-loads                                                    (33.30%)
         1,328,102      dTLB-store-misses                                             (52.13%)
       280,902,875      dTLB-stores                                                   (52.03%)
            27,678      iTLB-load-misses          #    1.67% of all iTLB cache hits   (35.35%)
         1,659,550      iTLB-loads                                                    (38.38%)
            25,137      tlb:tlb_flush

       1.428880301 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 15912

BTW, TLB sharing between SMT siblings isn't architecturally guaranteed, so
this is an empirical optimization for current parts.

Signed-off-by: Alex Shi <alex....@linaro.org>
Cc: Andrew Morton <a...@linux-foundation.org>
To: linux-kernel@vger.kernel.org
To: Mel Gorman <mgor...@suse.de>
To: x...@kernel.org
To: "H. Peter Anvin" <h...@zytor.com>
To: Thomas Gleixner <t...@linutronix.de>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Rik van Riel <r...@redhat.com>
Cc: Alex Shi <alex....@linaro.org>
---
 arch/x86/mm/tlb.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8f4cc3d..6510316 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -134,7 +134,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
                                 struct mm_struct *mm, unsigned long start,
                                 unsigned long end)
 {
+       int cpu;
        struct flush_tlb_info info;
+       cpumask_t flush_mask, *sblmask;
+
        info.flush_mm = mm;
        info.flush_start = start;
        info.flush_end = end;
@@ -151,7 +154,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
                                                                &info, 1);
                return;
        }
-       smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+
+       if (unlikely(smp_num_siblings <= 1)) {
+               smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+               return;
+       }
+
+       /* Only one flush is needed for both siblings of an SMT pair */
+       cpumask_copy(&flush_mask, cpumask);
+       for_each_cpu(cpu, &flush_mask) {
+               sblmask = topology_sibling_cpumask(cpu);
+               if (!cpumask_subset(sblmask, &flush_mask))
+                       continue;
+
+               cpumask_clear_cpu(cpumask_next(cpu, sblmask), &flush_mask);
+       }
+
+       smp_call_function_many(&flush_mask, flush_tlb_func, &info, 1);
 }
 
 void flush_tlb_current_task(void)
-- 
2.7.2.333.g70bd996
