Sorry for the delay, I was eaten by a grue. It turns out my initial study did not actually measure the number of TLB shootdown IPIs sent per TLB shootdown: my original use of probe points was incorrect, so I wasn't observing what I thought I was, even though I believe the intuition was right. However, after fixing my methodology, I'm having trouble proving that the existing lazy TLB mode is working properly.
I've spent some time trying to reproduce this in a microbenchmark. One thread does mmap, touch page, munmap, while other threads in the same process are configured either to busy-spin or to busy-spin and yield. Each thread sets its own affinity to a unique CPU, and the system is otherwise idle. I look at the per-CPU delta of the TLB and CAL lines of /proc/interrupts over the run of the microbenchmark.

Say I have 4 spin threads that never yield, and the mmap thread does N unmaps. I observe that each spin-thread core receives N (+/- small noise) TLB shootdown interrupts, and the total TLB interrupt count is 4N (+/- small noise). This is the expected behavior.

Then I add some synchronization: the unmap thread rendezvouses with all the spinners, and once they are all ready, the spinners busy-spin for D milliseconds and then yield (pthread_yield and sched_yield produce identical results, though I'm not confident this is the right kind of yield). Meanwhile, the unmap thread busy-spins for D+E milliseconds and then does M map/touch/unmap iterations (D and E are single-digit milliseconds). The idea is that the unmaps happen a little while after the spinners have yielded: the kernel on each spinner CPU should still be in the user process's mm, but lazy TLB mode should defer the flushes. Lazy mode on each CPU should take one interrupt and then suppress subsequent ones, so I expect lazy TLB invalidation to cost 1 interrupt per spinner CPU per rendezvous sequence, and I expect Rik's extra-lazy version to cost 0. Instead I see M interrupts per spinner CPU in all cases. This leads me to wonder whether I'm failing to trigger lazy TLB invalidation, or whether lazy TLB invalidation is not working as intended. (A stripped-down sketch of the benchmark is appended below the quoted mail.)

I get similar results using perf record on probe points: I filter by CPU number and count the number of IPIs sent between each pair of probe points in the TLB flush routines. I put probe points on flush_tlb_mm_range and flush_tlb_mm_range%return. To count the IPIs themselves: in a VM that uses x2APIC physical mode, it is usually convenient to probe native_x2apic_icr_write or __x2apic_send_IPI_dest, provided the latter doesn't get inlined away (which sometimes happens), since it is called once per target CPU in the cpumask handed to __x2apic_send_IPI_mask. I then filter the perf script output to look at the distribution of CPUs targeted per TLB shootdown.

Rik's patch definitely looks correct, but I can't yet demonstrate the gains.

Thanks!

Ben

On Wed, Sep 7, 2016 at 11:56 PM, Ingo Molnar <mi...@kernel.org> wrote:
>
> * Rik van Riel <r...@redhat.com> wrote:
>
>> On Sat, 27 Aug 2016 16:02:25 -0700
>> Linus Torvalds <torva...@linux-foundation.org> wrote:
>>
>> > Yeah, with those small fixes from Ingo, I definitely don't think this
>> > looks hacky at all. This all seems to be exactly what we should always
>> > have done.
>>
>> OK, so I was too tired yesterday to do kernel hacking, and
>> missed yet another bit (xen_flush_tlb_others). Sigh.
>>
>> Otherwise, the patch is identical.
>>
>> Looking forward to Ben's test results.
>
> Gentle ping to Ben.
>
> I can also apply this without waiting for the test result, the patch
> looks sane enough to me.
>
> Thanks,
>
> Ingo
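
For reference, here's a stripped-down sketch of the rendezvous benchmark along the lines described above. This is a minimal reconstruction for illustration only: the constants (NSPIN, ROUNDS, D_MS, E_MS, M) and helper names are placeholders I picked, not the exact harness I ran. Build with gcc -O2 -pthread, and diff the TLB line of /proc/interrupts before and after a run.

/* Spinners rendezvous with the unmapper, spin D ms, yield, and block on
 * the next barrier; the unmapper spins D+E ms and then does M
 * map/touch/unmap iterations while the spinner CPUs should be idle.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <time.h>

#define NSPIN   4       /* spinner threads, pinned to CPUs 1..NSPIN */
#define ROUNDS  100     /* rendezvous sequences */
#define D_MS    2       /* spinners busy-spin this long, then yield */
#define E_MS    2       /* unmapper spins D+E ms before unmapping */
#define M       16      /* map/touch/unmap iterations per round */

static pthread_barrier_t barrier;

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void spin_ms(long ms)
{
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                 (now.tv_nsec - start.tv_nsec) < ms * 1000000L);
}

static void *spinner(void *arg)
{
        pin_to_cpu((int)(long)arg);
        for (int r = 0; r < ROUNDS; r++) {
                pthread_barrier_wait(&barrier);
                spin_ms(D_MS);
                sched_yield();
                /* next barrier_wait blocks until the unmapper finishes
                 * its round, so this CPU should go idle -> lazy TLB */
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NSPIN];

        pthread_barrier_init(&barrier, NULL, NSPIN + 1);
        pin_to_cpu(0);  /* unmap thread on CPU 0 */
        for (long i = 0; i < NSPIN; i++)
                pthread_create(&tid[i], NULL, spinner, (void *)(i + 1));

        for (int r = 0; r < ROUNDS; r++) {
                pthread_barrier_wait(&barrier);
                spin_ms(D_MS + E_MS);   /* let the spinners yield first */
                for (int i = 0; i < M; i++) {
                        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                        if (p == MAP_FAILED)
                                return 1;
                        p[0] = 1;               /* touch: populate the PTE */
                        munmap(p, 4096);        /* triggers the shootdown */
                }
        }
        for (int i = 0; i < NSPIN; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

If lazy TLB behaved the way I expect, each spinner CPU's TLB count in /proc/interrupts would grow by roughly ROUNDS (one interrupt per rendezvous), rather than the ROUNDS * M I'm actually seeing.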