Excerpts from Andy Lutomirski's message of December 4, 2020 3:26 pm:
> The core scheduler isn't a great place for
> membarrier_mm_sync_core_before_usermode() -- the core scheduler doesn't
> actually know whether we are lazy.  With the old code, if a CPU is
> running a membarrier-registered task, goes idle, gets unlazied via a TLB
> shootdown IPI, and switches back to the membarrier-registered task, it
> will do an unnecessary core sync.
>
> Conveniently, x86 is the only architecture that does anything in this
> hook, so we can just move the code.
This should go on top of my series that adds the exit_lazy_mm call and
switches x86 over, at least.

> XXX: there are some comments in switch_mm_irqs_off() that seem to be
> trying to document what barriers are expected, and it's not clear to me
> that these barriers are actually present in all paths through the
> code.  So I think this change makes the code more comprehensible and
> has no effect on the code's correctness, but I'm not at all convinced
> that the code is correct.
>
> Signed-off-by: Andy Lutomirski <l...@kernel.org>
> ---
>  arch/x86/mm/tlb.c   | 17 ++++++++++++++++-
>  kernel/sched/core.c | 14 +++++++-------
>  2 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 3338a1feccf9..23df035b80e8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -8,6 +8,7 @@
>  #include <linux/export.h>
>  #include <linux/cpu.h>
>  #include <linux/debugfs.h>
> +#include <linux/sched/mm.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
> @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	 * from one thread in a process to another thread in the same
>  	 * process. No TLB flush required.
>  	 */
> +
> +	// XXX: why is this okay wrt membarrier?
>  	if (!was_lazy)
>  		return;
>
> @@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	smp_mb();
>  	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>  	if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> -			next_tlb_gen)
> +			next_tlb_gen) {
> +		/*
> +		 * We're reactivating an mm, and membarrier might
> +		 * need to serialize.  Tell membarrier.
> +		 */
> +
> +		// XXX: I can't understand the logic in
> +		// membarrier_mm_sync_core_before_usermode().  What's
> +		// the mm check for?

Writing CR3 is serializing, apparently.  Another x86ism that gets
commented and moved into arch/x86 with my patch.
> +		membarrier_mm_sync_core_before_usermode(next);
>  		return;
> +	}
>
>  	/*
>  	 * TLB contents went out of date while we were in lazy
>  	 * mode. Fall through to the TLB switching code below.
> +	 * No need for an explicit membarrier invocation -- the CR3
> +	 * write will serialize.
>  	 */
>  	new_asid = prev_asid;
>  	need_flush = true;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d95dc3f4644..6c4b76147166 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3619,22 +3619,22 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	kcov_finish_switch(current);
>
>  	fire_sched_in_preempt_notifiers(current);
> +
>  	/*
>  	 * When switching through a kernel thread, the loop in
>  	 * membarrier_{private,global}_expedited() may have observed that
>  	 * kernel thread and not issued an IPI. It is therefore possible to
>  	 * schedule between user->kernel->user threads without passing through
>  	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, so provide them here:
> +	 * rq->curr, before returning to userspace, and mmdrop() provides
> +	 * this barrier.
>  	 *
> -	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -	 *   provided by mmdrop(),
> -	 * - a sync_core for SYNC_CORE.
> +	 * XXX: I don't think mmdrop() actually does this.  There's no
> +	 * smp_mb__before/after_atomic() in there.

mmdrop() definitely does provide a full barrier: its
atomic_dec_and_test() is a value-returning atomic RMW, and those are
fully ordered, so no smp_mb__before/after_atomic() is needed.

>  	 */
> -	if (mm) {
> -		membarrier_mm_sync_core_before_usermode(mm);
> +	if (mm)
>  		mmdrop(mm);
> -	}
> +
>  	if (unlikely(prev_state == TASK_DEAD)) {
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
> --
> 2.28.0
>