> On Dec 1, 2020, at 7:47 PM, Nicholas Piggin <npig...@gmail.com> wrote:
>
> Excerpts from Andy Lutomirski's message of December 1, 2020 4:31 am:
>> other arch folk: there's some background here:
>>
>> https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
>>
>>> On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski <l...@kernel.org> wrote:
>>>
>>> On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski <l...@kernel.org> wrote:
>>>>
>>>> On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <npig...@gmail.com> wrote:
>>>>>
>>>>> On big systems, the mm refcount can become highly contended when doing
>>>>> a lot of context switching with threaded applications (particularly
>>>>> switching between the idle thread and an application thread).
>>>>>
>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>> any remaining lazy ones.
>>>>>
>>>>> Shootdown IPIs are some concern, but they have not been observed to be
>>>>> a big problem with this scheme (the powerpc implementation generated
>>>>> 314 additional interrupts on a 144 CPU system during a kernel compile).
>>>>> There are a number of strategies that could be employed to reduce IPIs
>>>>> if they turn out to be a problem for some workload.
>>>>
>>>> I'm still wondering whether we can do even better.
>>>>
>>>
>>> Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
>>> the TLB. On x86, this will shoot down all lazies as long as even a
>>> single pagetable was freed. (Or at least it will if we don't have a
>>> serious bug, but the code seems okay. We'll hit pmd_free_tlb, which
>>> sets tlb->freed_tables, which will trigger the IPI.) So, on
>>> architectures like x86, the shootdown approach should be free. The
>>> only way it ought to have any excess IPIs is if we have CPUs in
>>> mm_cpumask() that don't need IPI to free pagetables, which could
>>> happen on paravirt.
>>
>> Indeed, on x86, we do this:
>>
>> [ 11.558844] flush_tlb_mm_range.cold+0x18/0x1d
>> [ 11.559905] tlb_finish_mmu+0x10e/0x1a0
>> [ 11.561068] exit_mmap+0xc8/0x1a0
>> [ 11.561932] mmput+0x29/0xd0
>> [ 11.562688] do_exit+0x316/0xa90
>> [ 11.563588] do_group_exit+0x34/0xb0
>> [ 11.564476] __x64_sys_exit_group+0xf/0x10
>> [ 11.565512] do_syscall_64+0x34/0x50
>>
>> and we have info->freed_tables set.
>>
>> What are the architectures that have large systems like?
>>
>> x86: we already zap lazies, so it should cost basically nothing to do
>
> This is not zapping lazies, this is freeing the user page tables.
>
> "lazy mm" is where a switch to a kernel thread takes on the
> previous mm for its kernel mapping rather than switch to init_mm.

The intent of the code is to flush the TLB after freeing user page tables, but, on bare metal, lazies get zapped as a side effect.

Anyway, I'm going to send out a mockup of an alternative approach shortly.
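
To make the side effect concrete, here is a toy userspace model in C. It is only an illustrative sketch: it is not kernel code and not the mockup mentioned above. Every name in it (toy_mm, toy_cpu, toy_flush_after_free and so on) is hypothetical, and the behaviour is a simplified reading of the x86 flush path discussed in this thread: a flush with freed_tables set IPIs every CPU in the mm's cpumask, and a CPU that is only lazily using the mm responds by switching to init_mm rather than flushing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 8

/* Toy stand-ins; none of these are the kernel's real types. */
struct toy_mm {
    uint32_t cpumask;          /* bit n set: CPU n may still reference this mm */
};

struct toy_cpu {
    struct toy_mm *loaded_mm;  /* mm whose page tables this CPU has loaded */
    bool is_lazy;              /* running a kernel thread on a borrowed mm */
};

static struct toy_mm toy_init_mm;
static struct toy_cpu cpus[NCPUS];

/* CPU starts running a user thread that belongs to mm. */
void toy_run_user(int cpu, struct toy_mm *mm)
{
    assert(cpu >= 0 && cpu < NCPUS);
    cpus[cpu].loaded_mm = mm;
    cpus[cpu].is_lazy = false;
    mm->cpumask |= 1u << cpu;
}

/* CPU switches to a kernel thread (e.g. idle) and keeps the old mm loaded. */
void toy_enter_lazy(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    cpus[cpu].is_lazy = true;
}

/* What a CPU does when the flush IPI arrives (assumed, simplified). */
void toy_flush_ipi(int cpu, struct toy_mm *mm)
{
    if (cpus[cpu].loaded_mm != mm)
        return;
    if (cpus[cpu].is_lazy) {
        /* Lazy user: drop the borrowed mm instead of flushing. */
        cpus[cpu].loaded_mm = &toy_init_mm;
        cpus[cpu].is_lazy = false;
        mm->cpumask &= ~(1u << cpu);
    }
    /* A non-lazy CPU would flush its TLB entries here. */
}

/* Flush after unmapping; returns the number of IPIs sent. */
int toy_flush_after_free(struct toy_mm *mm, bool freed_tables)
{
    uint32_t mask = mm->cpumask;   /* snapshot: handlers modify the mask */
    int ipis = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (!(mask & (1u << cpu)))
            continue;
        /* Without freed page tables, lazy CPUs can be skipped. */
        if (!freed_tables && cpus[cpu].is_lazy)
            continue;
        toy_flush_ipi(cpu, mm);
        ipis++;
    }
    return ipis;
}

/* Number of CPUs still lazily holding mm. */
int toy_count_lazy(const struct toy_mm *mm)
{
    int n = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (cpus[cpu].loaded_mm == mm && cpus[cpu].is_lazy)
            n++;
    return n;
}
```

In this model, calling toy_flush_after_free() with freed_tables set leaves toy_count_lazy() at zero even though nothing in the flush path set out to remove lazy users. That is the "side effect" reading. The paravirt case raised earlier, where some CPUs in the mask do not need an IPI, is deliberately not modelled.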