----- Original Message -----
> From: "Peter Zijlstra" <pet...@infradead.org>
> To: "Mathieu Desnoyers" <mathieu.desnoy...@efficios.com>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motoh...@jp.fujitsu.com>,
>     "Steven Rostedt" <rost...@goodmis.org>, "Paul E. McKenney" <paul...@linux.vnet.ibm.com>,
>     "Nicholas Miell" <nmi...@comcast.net>, "Linus Torvalds" <torva...@linux-foundation.org>,
>     "Ingo Molnar" <mi...@redhat.com>, "Alan Cox" <gno...@lxorguk.ukuu.org.uk>,
>     "Lai Jiangshan" <la...@cn.fujitsu.com>, "Stephen Hemminger" <step...@networkplumber.org>,
>     "Andrew Morton" <a...@linux-foundation.org>, "Josh Triplett" <j...@joshtriplett.org>,
>     "Thomas Gleixner" <t...@linutronix.de>, "David Howells" <dhowe...@redhat.com>,
>     "Nick Piggin" <npig...@kernel.dk>
> Sent: Monday, March 16, 2015 10:19:39 AM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
>
> On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
>
> TL;DR
>
> > --- a/arch/x86/include/asm/mmu_context.h
> > +++ b/arch/x86/include/asm/mmu_context.h
> > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >  #endif
> >  	cpumask_set_cpu(cpu, mm_cpumask(next));
> >
> > +	/*
> > +	 * smp_mb() between mm_cpumask set and following memory
> > +	 * accesses to user-space addresses is required by
> > +	 * sys_membarrier(). A smp_mb() is also needed between
> > +	 * prior memory accesses and mm_cpumask clear. This
> > +	 * ensures that all user-space address memory accesses
> > +	 * performed by the current thread are in program order
> > +	 * when the mm_cpumask is set. Implied by load_cr3.
> > +	 */
> > +
> >  	/* Re-load page tables */
> >  	load_cr3(next->pgd);
> >  	trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >  		 * We were in lazy tlb mode and leave_mm disabled
> >  		 * tlb flush IPI delivery. We must reload CR3
> >  		 * to make sure to use no freed page tables.
> > +		 *
> > +		 * smp_mb() between mm_cpumask set and memory accesses
> > +		 * to user-space addresses is required by
> > +		 * sys_membarrier(). This ensures that all user-space
> > +		 * address memory accesses performed by the current
> > +		 * thread are in program order when the mm_cpumask is
> > +		 * set. Implied by load_cr3.
> >  		 */
> >  		load_cr3(next->pgd);
> >  		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>
> In both cases the cpumask_set_cpu() will also imply a MB.

I'm probably missing what exactly in cpumask_set_cpu() implies this
guarantee. cpumask_set_cpu() uses set_bit(). On x86, set_bit is indeed
implemented with a lock-prefixed orb or bts. However, the comment above
set_bit() states:

 * Note: there are no guarantees that this function will not be reordered
 * on non x86 architectures, so if you are writing portable code,
 * make sure not to rely on its reordering guarantees.

And it states nothing about memory barriers. Typically, atomic ops that
imply memory barriers always return something (xchg, cmpxchg, add_return).
Ops like atomic_add do not imply barriers.
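If we end up not relying on set_bit() (nor on load_cr3()) ordering here, I
would expect the pairing in switch_mm() to be spelled out explicitly,
something along these lines (untested sketch only; it assumes
smp_mb__after_atomic() gives full-barrier semantics after the cpumask
update, and on x86 it should collapse to a compiler barrier anyway given
the lock prefix):

	cpumask_set_cpu(cpu, mm_cpumask(next));
	/*
	 * Explicitly order the mm_cpumask update before the following
	 * memory accesses to user-space addresses, as required by
	 * sys_membarrier(), instead of relying on set_bit() being a
	 * full barrier on all architectures.
	 */
	smp_mb__after_atomic();

	/* Re-load page tables */
	load_cr3(next->pgd);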
>
> > +enum {
> > +	/*
> > +	 * Private flag set: only synchronize across a single process. If this
> > +	 * flag is not set, it means "shared": synchronize across multiple
> > +	 * processes. The shared mode is useful for shared memory mappings
> > +	 * across processes.
> > +	 */
> > +	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> > +
> > +	/*
> > +	 * Expedited flag set: adds some overhead, fast execution (few
> > +	 * microseconds). If this flag is not set, it means "delayed": low
> > +	 * overhead, but slow execution (few milliseconds).
> > +	 */
> > +	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
>
> I suppose this is an unprivileged syscall; so what do we do about:
>
> 	for (;;)
> 		sys_membar(EXPEDITED);
>
> Which would spray the entire system with IPIs at break neck speed.

Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL. Therefore,
if someone cares about issuing barriers on the entire system, the only
option is to use non-EXPEDITED, which relies on synchronize_rcu(). The only
way to invoke expedited barriers in a loop is:

	for (;;)
		sys_membarrier(MEMBARRIER_EXPEDITED | MEMBARRIER_PRIVATE);

which will only send IPIs to the CPUs running threads of the same process.

>
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	/* Order memory accesses with respects to sys_membarrier caller. */
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * Handle out-of-memory by sending per-cpu IPIs instead.
> > + */
> > +static void membarrier_fallback(void)
> > +{
> > +	struct mm_struct *mm;
> > +	int cpu;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm == mm)
> > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > +	}
> > +}
>
> > +static void membarrier_expedited(void)
> > +{
> > +	struct mm_struct *mm;
> > +	cpumask_var_t tmpmask;
> > +	int cpu;
> > +
> > +	/*
> > +	 * Memory barrier on the caller thread between previous memory accesses
> > +	 * to user-space addresses and sending memory-barrier IPIs. Orders all
> > +	 * user-space address memory accesses prior to sys_membarrier() before
> > +	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> > +	 * with memory barriers in:
> > +	 * - membarrier_ipi() (for each running threads of the current process)
> > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > +	 *   accesses to user-space addresses)
> > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > +	 *   lock is held.
> > +	 */
> > +	smp_mb();
> > +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> > +		membarrier_fallback();
> > +		goto out;
> > +	}
> > +	cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > +	preempt_disable();
> > +	cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > +	for_each_cpu(cpu, tmpmask) {
> > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm != mm)
> > +			cpumask_clear_cpu(cpu, tmpmask);
> > +	}
> > +	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > +	preempt_enable();
> > +	free_cpumask_var(tmpmask);
> > +out:
> > +	/*
> > +	 * Memory barrier on the caller thread between sending & waiting for
> > +	 * memory-barrier IPIs and following memory accesses to user-space
> > +	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
> > +	 * before all user-space address memory accesses following
> > +	 * sys_membarrier(). This barrier is paired with memory barriers in:
> > +	 * - membarrier_ipi() (for each running threads of the current process)
> > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > +	 *   accesses to user-space addresses)
> > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > +	 *   lock is held.
> > +	 */
> > +	smp_mb();
> > +}
>
> Did you just write:
>
> bool membar_cpu_is_mm(int cpu, void *info)
> {
> 	struct mm_struct *mm = info;
> 	struct rq *rq = cpu_rq(cpu);
> 	bool ret;
>
> 	raw_spin_lock_irq(&rq->lock);
> 	ret = rq->curr->mm == mm;
> 	raw_spin_unlock_irq(&rq->lock);
>
> 	return ret;
> }
>
> on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT);
>

It is very similar indeed! The main difference is that my implementation
starts from a copy of mm_cpumask(current->mm) and clears the CPUs for which
a TLB shootdown is simply pending (confirmed by taking the rq lock and
checking cpu_curr(cpu)->mm against current->mm).

Now that you mention this, I think we don't really need to use
mm_cpumask(current->mm) at all. Just iterating on each cpu, taking the rq
lock, and comparing the mm should be enough. This would remove the need to
rely on having extra memory barriers around set/clear of the mm cpumask.

The main reason why I did not use on_each_cpu_cond() was that it did not
exist back in 2010. ;-)

> On which; I absolutely hate that rq->lock thing in there. What is
> 'wrong' with doing a lockless compare there? Other than not actually
> being able to deref rq->curr of course, but we need to fix that anyhow.

If we can make sure the rq->curr dereference can be done without holding
the rq lock, then I think all we would need is to ensure that updates to
rq->curr are surrounded by memory barriers. Therefore, we would have the
following:

* When a thread is scheduled out, a memory barrier would be issued before
  rq->curr is updated to point to the next thread's task_struct.

* Before a thread is scheduled in, a memory barrier needs to be issued
  after rq->curr is updated to point to the incoming thread.

In order to be able to dereference rq->curr->mm without holding the
rq->lock, do you envision we should protect task reclaim with RCU-sched?
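To make sure we are picturing the same thing, here is roughly what I would
expect the lockless variant to look like (completely untested sketch; it
assumes the rq->curr dereference is indeed made safe by protecting task
reclaim with RCU-sched, which is exactly the question above, and that the
scheduler issues the memory barriers around rq->curr updates described
above):

static bool membarrier_cpu_runs_mm(int cpu, void *info)
{
	struct mm_struct *mm = info;
	struct task_struct *p;
	bool ret;

	/*
	 * Lockless peek at the remote CPU's current task. Relies on
	 * task_struct reclaim being delayed by RCU-sched so that p
	 * cannot be freed under us.
	 */
	rcu_read_lock_sched();
	p = READ_ONCE(cpu_rq(cpu)->curr);
	ret = p && p->mm == mm;
	rcu_read_unlock_sched();

	return ret;
}

static void membarrier_expedited(void)
{
	/* Order prior user-space accesses before the mm checks and IPIs. */
	smp_mb();
	on_each_cpu_cond(membarrier_cpu_runs_mm, membarrier_ipi,
			 current->mm, 1, GFP_NOWAIT);
	/* Order the IPIs before following user-space accesses. */
	smp_mb();
}

The pairing argument would stay the same as in the comments quoted above: a
CPU we skip is either not running our mm, or is concurrently switching
to/from it, in which case the barriers around the rq->curr update provide
the ordering.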
Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com