----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoy...@efficios.com>
> To: "Peter Zijlstra" <pet...@infradead.org>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motoh...@jp.fujitsu.com>,
>     "Steven Rostedt" <rost...@goodmis.org>, "Paul E. McKenney" <paul...@linux.vnet.ibm.com>,
>     "Nicholas Miell" <nmi...@comcast.net>, "Linus Torvalds" <torva...@linux-foundation.org>,
>     "Ingo Molnar" <mi...@redhat.com>, "Alan Cox" <gno...@lxorguk.ukuu.org.uk>,
>     "Lai Jiangshan" <la...@cn.fujitsu.com>, "Stephen Hemminger" <step...@networkplumber.org>,
>     "Andrew Morton" <a...@linux-foundation.org>, "Josh Triplett" <j...@joshtriplett.org>,
>     "Thomas Gleixner" <t...@linutronix.de>, "David Howells" <dhowe...@redhat.com>,
>     "Nick Piggin" <npig...@kernel.dk>
> Sent: Monday, March 16, 2015 11:43:56 AM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
>
> ----- Original Message -----
> > From: "Peter Zijlstra" <pet...@infradead.org>
> > To: "Mathieu Desnoyers" <mathieu.desnoy...@efficios.com>
> > Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motoh...@jp.fujitsu.com>,
> >     "Steven Rostedt" <rost...@goodmis.org>, "Paul E. McKenney" <paul...@linux.vnet.ibm.com>,
> >     "Nicholas Miell" <nmi...@comcast.net>, "Linus Torvalds" <torva...@linux-foundation.org>,
> >     "Ingo Molnar" <mi...@redhat.com>, "Alan Cox" <gno...@lxorguk.ukuu.org.uk>,
> >     "Lai Jiangshan" <la...@cn.fujitsu.com>, "Stephen Hemminger" <step...@networkplumber.org>,
> >     "Andrew Morton" <a...@linux-foundation.org>, "Josh Triplett" <j...@joshtriplett.org>,
> >     "Thomas Gleixner" <t...@linutronix.de>, "David Howells" <dhowe...@redhat.com>,
> >     "Nick Piggin" <npig...@kernel.dk>
> > Sent: Monday, March 16, 2015 10:19:39 AM
> > Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> >
> > On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> >
> > TL;DR
> >
> > > --- a/arch/x86/include/asm/mmu_context.h
> > > +++ b/arch/x86/include/asm/mmu_context.h
> > > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > >  #endif
> > >      cpumask_set_cpu(cpu, mm_cpumask(next));
> > >
> > > +    /*
> > > +     * smp_mb() between mm_cpumask set and following memory
> > > +     * accesses to user-space addresses is required by
> > > +     * sys_membarrier(). A smp_mb() is also needed between
> > > +     * prior memory accesses and mm_cpumask clear. This
> > > +     * ensures that all user-space address memory accesses
> > > +     * performed by the current thread are in program order
> > > +     * when the mm_cpumask is set. Implied by load_cr3.
> > > +     */
> > > +
> > >      /* Re-load page tables */
> > >      load_cr3(next->pgd);
> > >      trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> > > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > >           * We were in lazy tlb mode and leave_mm disabled
> > >           * tlb flush IPI delivery. We must reload CR3
> > >           * to make sure to use no freed page tables.
> > > +         *
> > > +         * smp_mb() between mm_cpumask set and memory accesses
> > > +         * to user-space addresses is required by
> > > +         * sys_membarrier(). This ensures that all user-space
> > > +         * address memory accesses performed by the current
> > > +         * thread are in program order when the mm_cpumask is
> > > +         * set. Implied by load_cr3.
> > >          */
> > >          load_cr3(next->pgd);
> > >          trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> >
> > In both cases the cpumask_set_cpu() will also imply a MB.
>
> I'm probably missing what exactly in cpumask_set_cpu() implies this
> guarantee. cpumask_set_cpu() uses set_bit(). On x86, set_bit() is indeed
> implemented with a lock-prefixed orb or bts. However, the comment above
> set_bit() states:
>
>  * Note: there are no guarantees that this function will not be reordered
>  * on non x86 architectures, so if you are writing portable code,
>  * make sure not to rely on its reordering guarantees.
>
> And it states nothing about memory barriers. Typically, atomic ops that
> imply memory barriers always return something (xchg, cmpxchg, add_return).
> Ops like atomic_add do not imply barriers.
>
> > > +enum {
> > > +    /*
> > > +     * Private flag set: only synchronize across a single process. If this
> > > +     * flag is not set, it means "shared": synchronize across multiple
> > > +     * processes. The shared mode is useful for shared memory mappings
> > > +     * across processes.
> > > +     */
> > > +    MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> > > +
> > > +    /*
> > > +     * Expedited flag set: adds some overhead, fast execution (few
> > > +     * microseconds). If this flag is not set, it means "delayed": low
> > > +     * overhead, but slow execution (few milliseconds).
> > > +     */
> > > +    MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
> >
> > I suppose this is an unprivileged syscall; so what do we do about:
> >
> >     for (;;)
> >         sys_membar(EXPEDITED);
> >
> > Which would spray the entire system with IPIs at breakneck speed.
>
> Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL. Therefore,
> if someone cares about issuing barriers on the entire system, the only
> option is to use non-EXPEDITED, which relies on synchronize_rcu().
Sorry, I meant "synchronize_sched()".

Thanks,

Mathieu

> The only way to invoke expedited barriers in a loop is:
>
>     for (;;)
>         sys_membarrier(MEMBARRIER_EXPEDITED | MEMBARRIER_PRIVATE);
>
> Which will only send IPIs to the CPUs running threads from the same
> process.
>
> > > +static void membarrier_ipi(void *unused)
> > > +{
> > > +    /* Order memory accesses with respect to sys_membarrier caller. */
> > > +    smp_mb();
> > > +}
> > > +
> > > +/*
> > > + * Handle out-of-memory by sending per-cpu IPIs instead.
> > > + */
> > > +static void membarrier_fallback(void)
> > > +{
> > > +    struct mm_struct *mm;
> > > +    int cpu;
> > > +
> > > +    for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > > +        raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > > +        mm = cpu_curr(cpu)->mm;
> > > +        raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > +        if (current->mm == mm)
> > > +            smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > > +    }
> > > +}
> >
> > > +static void membarrier_expedited(void)
> > > +{
> > > +    struct mm_struct *mm;
> > > +    cpumask_var_t tmpmask;
> > > +    int cpu;
> > > +
> > > +    /*
> > > +     * Memory barrier on the caller thread between previous memory accesses
> > > +     * to user-space addresses and sending memory-barrier IPIs. Orders all
> > > +     * user-space address memory accesses prior to sys_membarrier() before
> > > +     * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> > > +     * with memory barriers in:
> > > +     * - membarrier_ipi() (for each running thread of the current process)
> > > +     * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > > +     *   accesses to user-space addresses)
> > > +     * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > > +     *   A memory barrier is issued each time ->mm is changed while the rq
> > > +     *   lock is held.
> > > +     */
> > > +    smp_mb();
> > > +    if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> > > +        membarrier_fallback();
> > > +        goto out;
> > > +    }
> > > +    cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > > +    preempt_disable();
> > > +    cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > > +    for_each_cpu(cpu, tmpmask) {
> > > +        raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > > +        mm = cpu_curr(cpu)->mm;
> > > +        raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > +        if (current->mm != mm)
> > > +            cpumask_clear_cpu(cpu, tmpmask);
> > > +    }
> > > +    smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > > +    preempt_enable();
> > > +    free_cpumask_var(tmpmask);
> > > +out:
> > > +    /*
> > > +     * Memory barrier on the caller thread between sending & waiting for
> > > +     * memory-barrier IPIs and following memory accesses to user-space
> > > +     * addresses. Orders mm_cpumask read and membarrier_ipi executions
> > > +     * before all user-space address memory accesses following
> > > +     * sys_membarrier(). This barrier is paired with memory barriers in:
> > > +     * - membarrier_ipi() (for each running thread of the current process)
> > > +     * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > > +     *   accesses to user-space addresses)
> > > +     * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > > +     *   A memory barrier is issued each time ->mm is changed while the rq
> > > +     *   lock is held.
> > > +     */
> > > +    smp_mb();
> > > +}
> >
> > Did you just write:
> >
> > bool membar_cpu_is_mm(int cpu, void *info)
> > {
> >     struct mm_struct *mm = info;
> >     struct rq *rq = cpu_rq(cpu);
> >     bool ret;
> >
> >     raw_spin_lock_irq(&rq->lock);
> >     ret = rq->curr->mm == mm;
> >     raw_spin_unlock_irq(&rq->lock);
> >
> >     return ret;
> > }
> >
> > on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT);
>
> It is very similar indeed! The main difference is that my implementation
> started from a copy of mm_cpumask(current->mm) and cleared the CPUs for
> which a TLB shootdown is merely pending (confirmed by taking the rq lock
> and checking cpu_curr(cpu)->mm against current->mm).
>
> Now that you mention this, I think we don't really need to use
> mm_cpumask(current->mm) at all. Just iterating over each CPU, taking the
> rq lock, and comparing the mm should be enough. This would remove the need
> to rely on having extra memory barriers around set/clear of the mm cpumask.
>
> The main reason why I did not use on_each_cpu_cond() was that it did not
> exist back in 2010. ;-)
>
> > On which; I absolutely hate that rq->lock thing in there. What is
> > 'wrong' with doing a lockless compare there? Other than not actually
> > being able to deref rq->curr of course, but we need to fix that anyhow.
>
> If we can make sure the rq->curr dereference can be done without holding
> the rq lock, then I think all we would need is to ensure that updates to
> rq->curr are surrounded by memory barriers. Therefore, we would have the
> following:
>
> * When a thread is scheduled out, a memory barrier would be issued before
>   rq->curr is updated to the next thread's task_struct.
>
> * Before a thread is scheduled in, a memory barrier needs to be issued
>   after rq->curr is updated to the incoming thread.
>
> In order to be able to dereference rq->curr->mm without holding the
> rq->lock, do you envision we should protect task reclaim with RCU-sched?
>
> Thanks!
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
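
To make the lockless direction discussed at the end of the thread concrete, here is a
minimal sketch of an expedited private barrier built on the five-argument
on_each_cpu_cond() form shown in Peter's snippet, with a lockless rq->curr compare.
It is illustrative only, not the posted patch: it assumes task_struct reclaim is
protected by RCU-sched (so rq->curr stays valid inside rcu_read_lock_sched()) and
that the scheduler issues full barriers around rq->curr updates, which are exactly
the two open points above. The names membarrier_cpu_runs_mm() and
membarrier_private_expedited() are made up for the sketch.

    /*
     * Illustrative sketch only (not the posted patch).  Would conceptually
     * live in kernel/sched/ so that cpu_rq() from kernel/sched/sched.h is
     * visible.  Assumes task reclaim is RCU-sched protected and that the
     * scheduler orders rq->curr updates with full barriers, as discussed.
     */
    #include <linux/smp.h>
    #include <linux/rcupdate.h>
    #include <linux/sched.h>
    #include "sched.h"      /* cpu_rq() */

    static void membarrier_ipi(void *unused)
    {
        /* Order memory accesses with respect to the sys_membarrier() caller. */
        smp_mb();
    }

    static bool membarrier_cpu_runs_mm(int cpu, void *info)
    {
        struct mm_struct *mm = info;
        struct task_struct *p;
        bool ret;

        /*
         * Lockless compare: no rq->lock taken.  RCU-sched keeps the
         * task_struct from being freed while we look at ->mm.
         */
        rcu_read_lock_sched();
        p = READ_ONCE(cpu_rq(cpu)->curr);
        ret = p->mm == mm;
        rcu_read_unlock_sched();

        return ret;
    }

    static void membarrier_private_expedited(void)
    {
        /*
         * Pairs with the smp_mb() in membarrier_ipi() and with the barriers
         * the scheduler would issue around rq->curr updates.
         */
        smp_mb();
        on_each_cpu_cond(membarrier_cpu_runs_mm, membarrier_ipi,
                         current->mm, 1, GFP_NOWAIT);
        smp_mb();
    }

Compared with the posted membarrier_expedited(), this drops both the mm_cpumask()
copy and the per-CPU rq->lock round trips; whether the barrier and RCU-sched
requirements on rq->curr are acceptable is the question left open in the thread.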
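
For reference, the user-space calling convention under discussion (expedited barriers
restricted to the private scope, everything else taking the synchronize_sched() slow
path) would be exercised roughly as follows. This is a hypothetical sketch against the
v12 RFC interface only: the membarrier() wrapper is not a real libc function, the flag
values are copied from the enum quoted above, and __NR_membarrier stands for whatever
syscall number the patch assigns, which is not shown in the quoted excerpt.

    /* Hypothetical user-space sketch against the v12 RFC interface. */
    #include <stdio.h>
    #include <errno.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Flag values taken from the enum quoted above. */
    #define MEMBARRIER_PRIVATE_FLAG     (1 << 0)
    #define MEMBARRIER_EXPEDITED_FLAG   (1 << 1)

    #ifndef __NR_membarrier
    #error "__NR_membarrier comes from the RFC patch's headers; fill in the number it assigns"
    #endif

    static int membarrier(int flags)
    {
        return syscall(__NR_membarrier, flags);
    }

    int main(void)
    {
        /* Fast path: IPIs only to CPUs running threads of this process. */
        if (membarrier(MEMBARRIER_EXPEDITED_FLAG | MEMBARRIER_PRIVATE_FLAG))
            perror("expedited private membarrier");

        /* Shared, non-expedited: system-wide, waits for synchronize_sched(). */
        if (membarrier(0))
            perror("shared membarrier");

        /* Expedited + shared is rejected in this version. */
        if (membarrier(MEMBARRIER_EXPEDITED_FLAG) == -1 && errno == EINVAL)
            fprintf(stderr, "expedited shared correctly rejected\n");

        return 0;
    }

The EINVAL check mirrors the rule stated above: EXPEDITED combined with the shared
(non-PRIVATE) mode is refused, so expedited IPIs can only ever target CPUs running
the caller's own process.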