On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote: > Here is an implementation of a new system call, sys_membarrier(), which > either executes a memory barrier on all running threads of the current > process (MEMBARRIER_PRIVATE_FLAG) or calls synchronize_sched() to issue > a memory barrier on all threads running on the system. It can be used to > distribute the cost of user-space memory barriers asymmetrically by > transforming pairs of memory barriers into pairs consisting of > sys_membarrier() and a compiler barrier. For synchronization primitives > that distinguish between read-side and write-side (e.g. userspace RCU, > rwlocks), the read-side can be accelerated significantly by moving the > bulk of the memory barrier overhead to the write-side.
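To make the read-side/write-side transformation concrete, here is a minimal
user-space sketch. The flag values are copied from the proposed
include/uapi/linux/membarrier.h below, and __NR_membarrier is assumed to match
the syscall numbers this patch reserves; nothing here exists in released
kernels or libc headers yet, so treat it as illustrative only:

#include <unistd.h>
#include <sys/syscall.h>

/* Flag values mirrored from the proposed uapi header (assumption). */
#define MEMBARRIER_PRIVATE_FLAG		(1 << 0)
#define MEMBARRIER_EXPEDITED_FLAG	(1 << 1)

/* Assumed wrapper; __NR_membarrier comes from the patched syscall tables. */
static inline int membarrier(int flags)
{
	return syscall(__NR_membarrier, flags);
}

/* Read-side (frequent): the former smp_mb() becomes a compiler barrier. */
#define barrier()	__asm__ __volatile__("" : : : "memory")

/* Write-side (rare): the former smp_mb() becomes the system call. */
static void writer_fence(void)
{
	membarrier(MEMBARRIER_PRIVATE_FLAG | MEMBARRIER_EXPEDITED_FLAG);
}

The frequent read-side path thus pays only a compiler barrier, while the rare
write-side path absorbs the full ordering cost: exactly the asymmetric
distribution described above.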
Acked-by: Paul E. McKenney <paul...@linux.vnet.ibm.com> > The first user of this system call is the "liburcu" Userspace RCU > implementation [1]. It aims at greatly simplifying and enhancing the > current implementation, which uses a scheme similar to > sys_membarrier(), but based on signals sent to each reader thread. > Liburcu is currently packaged in all major Linux distributions. > > One user of this library is the LTTng-UST (Userspace Tracing) library > [2]. The impact of two additional memory barriers on the LTTng tracer > fast path has been benchmarked at 35ns on our reference Intel Core Xeon > 2.0 GHz, which adds up to about 25% performance degradation. > > This patch mostly sits in kernel/sched/core.c (it needs to access struct rq). > It is based on kernel v3.19, and also applies fine to master. I am > submitting it as an RFC. > > Alternative approach: signals. A signal-based alternative proposed by > Ingo would lead to important scheduler modifications, which involve > adding context-switch in/out overhead (calling user-space upon scheduler > switch in and out). In addition to the overhead issue, I am also > reluctant to base a synchronization primitive on signals, which, to > quote Linus, are "already one of our more "exciting" layers out there", > which does not give me the warm feeling of rock-solidness that's usually > expected from synchronization primitives. > > Changes since v11: > - 5 years have passed. > - Rebase on v3.19 kernel. > - Add futex-alike PRIVATE vs SHARED semantic: private for per-process > barriers, non-private for memory mappings shared between processes. > - Simplify user API. > - Code refactoring. > > Changes since v10: > - Apply Randy's comments. > - Rebase on 2.6.34-rc4 -tip. > > Changes since v9: > - Clean up #ifdef CONFIG_SMP. > > Changes since v8: > - Go back to rq spin locks taken by sys_membarrier() rather than adding > memory barriers to the scheduler. It implies a potential RoS > (reduction of service) if sys_membarrier() is executed in a busy-loop > by a user, but nothing more than what is already possible with other > existing system calls; in exchange, it saves memory barriers in the > scheduler fast path. > - re-add the memory barrier comments to x86 switch_mm() as an example to > other architectures. > - Update documentation of the memory barriers in sys_membarrier and > switch_mm(). > - Append execution scenarios to the changelog showing the purpose of > each memory barrier. > > Changes since v7: > - Move spinlock-mb and scheduler related changes to separate patches. > - Add support for sys_membarrier on x86_32. > - Only x86 32/64 system calls are reserved in this patch. It is planned > to incrementally reserve syscall IDs on other architectures as these > are tested. > > Changes since v6: > - Remove some unlikely() not so unlikely. > - Add the proper scheduler memory barriers needed to only use the RCU > read lock in sys_membarrier rather than take each runqueue spinlock: > - Move memory barriers from per-architecture switch_mm() to schedule() > and finish_lock_switch(), where they clearly document that all data > protected by the rq lock is guaranteed to have memory barriers issued > between the scheduler update and the task execution. Replacing the > spin lock acquire/release barriers with these memory barriers implies > either no overhead (x86 spinlock atomic instruction already implies a > full mb) or some hopefully small overhead caused by the upgrade of the > spinlock acquire/release barriers to more heavyweight smp_mb(). 
> - The "generic" version of spinlock-mb.h declares both a mapping to > standard spinlocks and full memory barriers. Each architecture can > specialize this header following its own needs and declare > CONFIG_HAVE_SPINLOCK_MB to use its own spinlock-mb.h. > - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h > implementations on a wide range of architectures would be welcome. > > Changes since v5: > - Plan ahead for extensibility by introducing mandatory/optional masks > to the "flags" system call parameter. Past experience with accept4(), > signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and > inotify_init1() indicates that this is the kind of thing we want to > plan for. Return -EINVAL if the mandatory flags received are unknown. > - Create include/linux/membarrier.h to define these flags. > - Add MEMBARRIER_QUERY optional flag. > > Changes since v4: > - Add "int expedited" parameter, use synchronize_sched() in the > non-expedited case. Thanks to Lai Jiangshan for making us consider > seriously using synchronize_sched() to provide the low-overhead > membarrier scheme. > - Check num_online_cpus() == 1, quickly return without doing anything. > > Changes since v3a: > - Confirm that each CPU indeed runs the current task's ->mm before > sending an IPI. Ensures that we do not disturb RT tasks in the > presence of lazy TLB shootdown. > - Document memory barriers needed in switch_mm(). > - Surround helper functions with #ifdef CONFIG_SMP. > > Changes since v2: > - simply send-to-many to the mm_cpumask. It contains the list of > processors we have to IPI to (which use the mm), and this mask is > updated atomically. > > Changes since v1: > - Only perform the IPI in CONFIG_SMP. > - Only perform the IPI if the process has more than one thread. > - Only send IPIs to CPUs involved with threads belonging to our process. > - Adaptive IPI scheme (single vs many IPI with threshold). > - Issue smp_mb() at the beginning and end of the system call. > > To explain the benefit of this scheme, let's introduce two example threads: > > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu()) > Thread B (frequent, e.g. executing liburcu > rcu_read_lock()/rcu_read_unlock()) > > In a scheme where all smp_mb() in thread A are ordering memory accesses > with respect to smp_mb() present in Thread B, we can change each > smp_mb() within Thread A into calls to sys_membarrier() and each > smp_mb() within Thread B into compiler barriers "barrier()". > > Before the change, we had, for each smp_mb() pair: > > Thread A Thread B > previous mem accesses previous mem accesses > smp_mb() smp_mb() > following mem accesses following mem accesses > > After the change, these pairs become: > > Thread A Thread B > prev mem accesses prev mem accesses > sys_membarrier() barrier() > follow mem accesses follow mem accesses > > As we can see, there are two possible scenarios: either Thread B memory > accesses do not happen concurrently with Thread A accesses (1), or they > do (2). > > 1) Non-concurrent Thread A vs Thread B accesses: > > Thread A Thread B > prev mem accesses > sys_membarrier() > follow mem accesses > prev mem accesses > barrier() > follow mem accesses > > In this case, thread B accesses will be weakly ordered. This is OK, > because at that point, thread A is not particularly interested in > ordering them with respect to its own accesses. 
> > 2) Concurrent Thread A vs Thread B accesses > > Thread A Thread B > prev mem accesses prev mem accesses > sys_membarrier() barrier() > follow mem accesses follow mem accesses > > In this case, thread B accesses, which are ensured to be in program > order thanks to the compiler barrier, will be "upgraded" to full > smp_mb() by the IPIs executing memory barriers on each active > thread of the system. The non-running threads of the process are intrinsically > serialized by the scheduler. > > * Benchmarks > > On Intel Xeon E5405 (8 cores) > (one thread is calling sys_membarrier, the other 7 threads are busy > looping) > > - expedited > > 10,000,000 sys_membarrier calls in 43s = 4.3 microseconds/call. > > - non-expedited > > 1000 sys_membarrier calls in 33s = 33 milliseconds/call. > > Expedited is 7600 times faster than non-expedited. > > * User-space user of this system call: Userspace RCU library > > Both the signal-based and the sys_membarrier userspace RCU schemes > permit us to remove the memory barrier from the userspace RCU > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly > accelerating them. These memory barriers are replaced by compiler > barriers on the read-side, and all matching memory barriers on the > write-side are turned into an invocation of a memory barrier on all > active threads in the process. By letting the kernel perform this > synchronization rather than dumbly sending a signal to every process > thread (as we currently do), we diminish the number of unnecessary > wake-ups and only issue the memory barriers on active threads. Non-running > threads do not need to execute such a barrier anyway, because these are > implied by the scheduler context switches. > > Results in liburcu: > > Operations in 10s, 6 readers, 2 writers: > > memory barriers in reader: 1701557485 reads, 3129842 writes > signal-based scheme: 9825306874 reads, 5386 writes > sys_membarrier expedited: 6637539697 reads, 852129 writes > sys_membarrier non-expedited: 7992076602 reads, 220 writes > > The dynamic sys_membarrier availability check adds some overhead to > the read-side compared to the signal-based scheme, but besides that, > with the expedited scheme, we can see that we are close to the read-side > performance of the signal-based scheme and also at 1/4 of the > memory-barrier write-side performance. We have a write-side speedup of > 158:1 over the signal-based scheme by using the sys_membarrier system > call. This allows a 3.9:1 read-side speedup over the pre-existing memory > barrier scheme. > > The non-expedited scheme indeed adds a much lower overhead on the > read-side, both because we do not send IPIs and because we perform fewer > updates, which in turn generates fewer cache-line exchanges. The > write-side latency becomes even higher than with the signal-based > scheme. The advantage of sys_membarrier() over the signal-based scheme is > that it does not require waking up all the process threads. > > The non-expedited sys_membarrier scheme can be useful to a userspace RCU > flavor that encompasses all processes on a system, which may share memory > mappings. 
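To illustrate where these barriers sit in a urcu-like flavor, here is a
deliberately simplified sketch built on the membarrier() wrapper and barrier()
macro from the earlier snippet. This is not the actual liburcu implementation
(its grace-period tracking is more involved); it only shows which smp_mb()
calls become barrier() on the read-side and sys_membarrier() on the
write-side:

/* Hypothetical, simplified urcu-style reader/writer pairing. */
static volatile unsigned long gp_ctr;		  /* global grace-period counter */
static __thread volatile unsigned long reader_gp; /* per-reader snapshot */

static inline void sketch_rcu_read_lock(void)
{
	reader_gp = gp_ctr;	/* announce the grace period we read under */
	barrier();		/* was smp_mb(); upgraded by the writer's IPIs */
}

static inline void sketch_rcu_read_unlock(void)
{
	barrier();		/* was smp_mb(); upgraded by the writer's IPIs */
	reader_gp = 0;		/* announce quiescence */
}

static void sketch_synchronize_rcu(void)
{
	/* Was smp_mb(): order the writer's prior updates first... */
	membarrier(MEMBARRIER_PRIVATE_FLAG | MEMBARRIER_EXPEDITED_FLAG);
	/* ...then flip gp_ctr and wait for readers to quiesce (elided). */
	membarrier(MEMBARRIER_PRIVATE_FLAG | MEMBARRIER_EXPEDITED_FLAG);
}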
> > * More information about memory barriers in: > > - sys_membarrier() > - membarrier_ipi() > - switch_mm() > - issued with ->mm update while the rq lock is held > > The goal of these memory barriers is to ensure that all memory accesses > to user-space addresses performed by every processor which executes > threads belonging to the current process are observed to be in program > order at least once between the two memory barriers surrounding > sys_membarrier(). > > If we were to simply broadcast an IPI to all processors between the two > smp_mb() in sys_membarrier(), membarrier_ipi() would execute on each > processor, and waiting for these handlers to complete execution > guarantees that each running processor passed through a state where > user-space memory address accesses were in program order. > > However, this "big hammer" approach does not please people concerned > with real-time behavior. It would let a non-RT task disturb real-time tasks > by sending useless IPIs to processors not concerned with the current > process's memory. > > This is why we iterate on the mm_cpumask, which is a superset of the > processors concerned by the process memory map, and check each processor's > ->mm with the rq lock held to confirm that the processor is indeed > running a thread concerned with our mm (and not just part of the > mm_cpumask due to lazy TLB shootdown). > > User-space memory address accesses must be in program order when > mm_cpumask is set or cleared. (More details in the x86 switch_mm() > comments.) > > The verification, for each CPU in the mm_cpumask, that the rq ->mm > indeed matches the current ->mm needs to be done with the rq lock > held. This ensures that each time an rq ->mm is modified, a memory > barrier (typically implied by the change of memory mapping) is also > issued. The ->mm update and memory barrier are made atomic by the rq > spinlock. > > The execution scenario (1) shows the behavior of the sys_membarrier() > system call executed on Thread A while Thread B executes memory accesses > that need to be ordered. Thread B is running. Memory accesses in Thread > B are in program order (e.g. separated by a compiler barrier()). > > 1) Thread B running, ordering ensured by the membarrier_ipi(): > > Thread A Thread B > ------------------------------------------------------------------------- > prev accesses to userspace addr. prev accesses to userspace addr. > sys_membarrier > smp_mb > IPI ------------------------------> membarrier_ipi() > smp_mb > return > smp_mb > following accesses to userspace addr. following accesses to userspace > addr. > > The execution scenarios (2-3-4-5) show the same setup as (1), but Thread > B is not running while sys_membarrier() is called. Thanks to the memory > barriers implied by load_cr3 in switch_mm(), Thread B user-space address > memory accesses are already in program order when sys_membarrier finds > out that either the mm_cpumask does not contain Thread B's CPU or that > that CPU is not running the current process's mm. > > 2) Context switch in, showing rq spin lock synchronization: > > Thread A Thread B > ------------------------------------------------------------------------- > <prev accesses to userspace addr. > saved on stack> > prev accesses to userspace addr. > sys_membarrier > smp_mb > for each cpu in mm_cpumask > <Thread B CPU is present e.g. due > to lazy TLB shootdown> > spin lock cpu rq > mm = cpu rq mm > spin unlock cpu rq > context switch in > <spin lock cpu rq by other thread> > load_cr3 (or equiv. mem. 
barrier) > spin unlock cpu rq > following accesses to userspace > addr. > if (mm == current rq mm) > <false> > smp_mb > following accesses to userspace addr. > Here, the important point is that Thread B has passed through a point > where all its userspace memory address accesses were in program order > between the two smp_mb() in sys_membarrier. > > 3) Context switch out, showing rq spin lock synchronization: > > Thread A Thread B > ------------------------------------------------------------------------- > prev accesses to userspace addr. > prev accesses to userspace addr. > sys_membarrier > smp_mb > for each cpu in mm_cpumask > context switch out > spin lock cpu rq > load_cr3 (or equiv. mem. barrier) > <spin unlock cpu rq by other thread> > <following accesses to userspace > addr. will happen when rescheduled> > spin lock cpu rq > mm = cpu rq mm > spin unlock cpu rq > if (mm == current rq mm) > <false> > smp_mb > following accesses to userspace addr. > Same as (2): the important point is that Thread B has passed through a > point where all its userspace memory address accesses were in program > order between the two smp_mb() in sys_membarrier. > > 4) Context switch in, showing mm_cpumask synchronization: > > Thread A Thread B > ------------------------------------------------------------------------- > <prev accesses to userspace addr. > saved on stack> > prev accesses to userspace addr. > sys_membarrier > smp_mb > for each cpu in mm_cpumask > <Thread B CPU not in mask> > context switch in > set cpu bit in mm_cpumask > load_cr3 (or equiv. mem. barrier) > following accesses to userspace > addr. > smp_mb > following accesses to userspace addr. > Same as 2-3: Thread B is passing through a point where userspace memory > address accesses are in program order between the two smp_mb() in > sys_membarrier(). > > 5) Context switch out, showing mm_cpumask synchronization: > > Thread A Thread B > ------------------------------------------------------------------------- > prev accesses to userspace addr. > prev accesses to userspace addr. > sys_membarrier > smp_mb > context switch out > load_cr3 (or equiv. mem. barrier) > clear cpu bit in mm_cpumask > <following accesses to userspace > addr. will happen when rescheduled> > for each cpu in mm_cpumask > <Thread B CPU not in mask> > smp_mb > following accesses to userspace addr. > Same as 2-3-4: Thread B is passing through a point where userspace > memory address accesses are in program order between the two smp_mb() in > sys_membarrier(). > > This patch adds the system calls only to x86. See the sys_membarrier() > comments for the memory barrier requirements in switch_mm() needed to port to other > architectures. > > [1] http://urcu.so > [2] http://lttng.org > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com> > CC: KOSAKI Motohiro <kosaki.motoh...@jp.fujitsu.com> > CC: Steven Rostedt <rost...@goodmis.org> > CC: Paul E. 
McKenney <paul...@linux.vnet.ibm.com> > CC: Nicholas Miell <nmi...@comcast.net> > CC: Linus Torvalds <torva...@linux-foundation.org> > CC: Ingo Molnar <mi...@redhat.com> > CC: Alan Cox <gno...@lxorguk.ukuu.org.uk> > CC: Lai Jiangshan <la...@cn.fujitsu.com> > CC: Stephen Hemminger <step...@networkplumber.org> > CC: Andrew Morton <a...@linux-foundation.org> > CC: Josh Triplett <j...@joshtriplett.org> > CC: Thomas Gleixner <t...@linutronix.de> > CC: Peter Zijlstra <pet...@infradead.org> > CC: David Howells <dhowe...@redhat.com> > CC: Nick Piggin <npig...@kernel.dk> > --- > arch/x86/include/asm/mmu_context.h | 17 +++ > arch/x86/syscalls/syscall_32.tbl | 1 + > arch/x86/syscalls/syscall_64.tbl | 1 + > include/linux/syscalls.h | 2 + > include/uapi/asm-generic/unistd.h | 4 +- > include/uapi/linux/Kbuild | 1 + > include/uapi/linux/membarrier.h | 75 +++++++++++++ > kernel/sched/core.c | 208 ++++++++++++++++++++++++++++++++++++ > kernel/sys_ni.c | 3 + > 9 files changed, 311 insertions(+), 1 deletions(-) > create mode 100644 include/uapi/linux/membarrier.h > > diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h > index 4b75d59..a30a63d 100644 > --- a/arch/x86/include/asm/mmu_context.h > +++ b/arch/x86/include/asm/mmu_context.h > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, > #endif > cpumask_set_cpu(cpu, mm_cpumask(next)); > > + /* > + * smp_mb() between mm_cpumask set and following memory > + * accesses to user-space addresses is required by > + * sys_membarrier(). A smp_mb() is also needed between > + * prior memory accesses and mm_cpumask clear. This > + * ensures that all user-space address memory accesses > + * performed by the current thread are in program order > + * when the mm_cpumask is set. Implied by load_cr3. > + */ > + > /* Re-load page tables */ > load_cr3(next->pgd); > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, > * We were in lazy tlb mode and leave_mm disabled > * tlb flush IPI delivery. We must reload CR3 > * to make sure to use no freed page tables. > + * > + * smp_mb() between mm_cpumask set and memory accesses > + * to user-space addresses is required by > + * sys_membarrier(). This ensures that all user-space > + * address memory accesses performed by the current > + * thread are in program order when the mm_cpumask is > + * set. Implied by load_cr3. 
> */ > load_cr3(next->pgd); > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); > diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl > index b3560ec..439415f 100644 > --- a/arch/x86/syscalls/syscall_32.tbl > +++ b/arch/x86/syscalls/syscall_32.tbl > @@ -365,3 +365,4 @@ > 356 i386 memfd_create sys_memfd_create > 357 i386 bpf sys_bpf > 358 i386 execveat sys_execveat stub32_execveat > +359 i386 membarrier sys_membarrier > diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl > index 8d656fb..823130d 100644 > --- a/arch/x86/syscalls/syscall_64.tbl > +++ b/arch/x86/syscalls/syscall_64.tbl > @@ -329,6 +329,7 @@ > 320 common kexec_file_load sys_kexec_file_load > 321 common bpf sys_bpf > 322 64 execveat stub_execveat > +323 common membarrier sys_membarrier > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 85893d7..058ec0a 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -882,4 +882,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename, > const char __user *const __user *argv, > const char __user *const __user *envp, int flags); > > +asmlinkage long sys_membarrier(int flags); > + > #endif > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index e016bd9..8da542a 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create) > __SYSCALL(__NR_bpf, sys_bpf) > #define __NR_execveat 281 > __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat) > +#define __NR_membarrier 282 > +__SYSCALL(__NR_membarrier, sys_membarrier) > > #undef __NR_syscalls > -#define __NR_syscalls 282 > +#define __NR_syscalls 283 > > /* > * All syscalls below here should go away really, > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild > index 00b10002..c5b0dbf 100644 > --- a/include/uapi/linux/Kbuild > +++ b/include/uapi/linux/Kbuild > @@ -248,6 +248,7 @@ header-y += mdio.h > header-y += media.h > header-y += media-bus-format.h > header-y += mei.h > +header-y += membarrier.h > header-y += memfd.h > header-y += mempolicy.h > header-y += meye.h > diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h > new file mode 100644 > index 0000000..928b0d5a > --- /dev/null > +++ b/include/uapi/linux/membarrier.h > @@ -0,0 +1,75 @@ > +#ifndef _UAPI_LINUX_MEMBARRIER_H > +#define _UAPI_LINUX_MEMBARRIER_H > + > +/* > + * linux/membarrier.h > + * > + * membarrier system call API > + * > + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoy...@efficios.com> > + * > + * Permission is hereby granted, free of charge, to any person obtaining a copy > + * of this software and associated documentation files (the "Software"), to deal > + * in the Software without restriction, including without limitation the rights > + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell > + * copies of the Software, and to permit persons to whom the Software is > + * furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice shall be included in > + * all copies or substantial portions of the Software. 
> + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE > + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, > + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +/* > + * All memory accesses performed in program order from each targeted thread are > + * guaranteed to be ordered with respect to sys_membarrier(). If we use the > + * semantic "barrier()" to represent a compiler barrier forcing memory accesses > + * to be performed in program order across the barrier, and smp_mb() to > + * represent explicit memory barriers forcing full memory ordering across the > + * barrier, we have the following ordering table for each pair of barrier(), > + * sys_membarrier() and smp_mb(): > + * > + * The pair ordering is detailed as (O: ordered, X: not ordered): > + * > + * barrier() smp_mb() sys_membarrier() > + * barrier() X X O > + * smp_mb() X O O > + * sys_membarrier() O O O > + * > + * If the private flag is set, only running threads belonging to the same > + * process are targeted. Else, all running threads, including those belonging to > + * other processes, are targeted. > + */ > + > +/* System call membarrier "flags" argument. */ > +enum { > + /* > + * Private flag set: only synchronize across a single process. If this > + * flag is not set, it means "shared": synchronize across multiple > + * processes. The shared mode is useful for shared memory mappings > + * across processes. > + */ > + MEMBARRIER_PRIVATE_FLAG = (1 << 0), > + > + /* > + * Expedited flag set: adds some overhead, fast execution (few > + * microseconds). If this flag is not set, it means "delayed": low > + * overhead, but slow execution (few milliseconds). > + */ > + MEMBARRIER_EXPEDITED_FLAG = (1 << 1), > + > + /* > + * Query whether the rest of the specified flags are supported, without > + * performing synchronization. > + */ > + MEMBARRIER_QUERY_FLAG = (1 << 31), > +}; > + > +#endif /* _UAPI_LINUX_MEMBARRIER_H */ > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 5eab11d..8b33728 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -74,6 +74,7 @@ > #include <linux/binfmts.h> > #include <linux/context_tracking.h> > #include <linux/compiler.h> > +#include <linux/membarrier.h> > > #include <asm/switch_to.h> > #include <asm/tlb.h> > @@ -8402,3 +8403,210 @@ void dump_cpu_task(int cpu) > pr_info("Task dump for CPU %d:\n", cpu); > sched_show_task(cpu_curr(cpu)); > } > + > +#ifdef CONFIG_SMP > + > +/* > + * Execute a memory barrier on all active threads from the current process > + * on SMP systems. In order to keep this code independent of implementation > + * details of IPI handlers, do not rely on implicit barriers within IPI handler > + * execution. This is not the bulk of the overhead anyway, so let's stay on the > + * safe side. > + */ > +static void membarrier_ipi(void *unused) > +{ > + /* Order memory accesses with respect to the sys_membarrier caller. */ > + smp_mb(); > +} > + > +/* > + * Handle out-of-memory by sending per-cpu IPIs instead. 
> + */ > +static void membarrier_fallback(void) > +{ > + struct mm_struct *mm; > + int cpu; > + > + for_each_cpu(cpu, mm_cpumask(current->mm)) { > + raw_spin_lock_irq(&cpu_rq(cpu)->lock); > + mm = cpu_curr(cpu)->mm; > + raw_spin_unlock_irq(&cpu_rq(cpu)->lock); > + if (current->mm == mm) > + smp_call_function_single(cpu, membarrier_ipi, NULL, 1); > + } > +} > + > +static int membarrier_validate_flags(int flags) > +{ > + /* Check for unrecognized flag. */ > + if (flags & ~(MEMBARRIER_PRIVATE_FLAG | MEMBARRIER_EXPEDITED_FLAG > + | MEMBARRIER_QUERY_FLAG)) > + return -EINVAL; > + /* Check for unsupported flag combination. */ > + if ((flags & MEMBARRIER_EXPEDITED_FLAG) > + && !(flags & MEMBARRIER_PRIVATE_FLAG)) > + return -EINVAL; > + return 0; > +} > + > +static void membarrier_expedited(void) > +{ > + struct mm_struct *mm; > + cpumask_var_t tmpmask; > + int cpu; > + > + /* > + * Memory barrier on the caller thread between previous memory accesses > + * to user-space addresses and sending memory-barrier IPIs. Orders all > + * user-space address memory accesses prior to sys_membarrier() before > + * mm_cpumask read and membarrier_ipi executions. This barrier is paired > + * with memory barriers in: > + * - membarrier_ipi() (for each running thread of the current process) > + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory > + * accesses to user-space addresses) > + * - Each CPU ->mm update performed with rq lock held by the scheduler. > + * A memory barrier is issued each time ->mm is changed while the rq > + * lock is held. > + */ > + smp_mb(); > + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { > + membarrier_fallback(); > + goto out; > + } > + cpumask_copy(tmpmask, mm_cpumask(current->mm)); > + preempt_disable(); > + cpumask_clear_cpu(smp_processor_id(), tmpmask); > + for_each_cpu(cpu, tmpmask) { > + raw_spin_lock_irq(&cpu_rq(cpu)->lock); > + mm = cpu_curr(cpu)->mm; > + raw_spin_unlock_irq(&cpu_rq(cpu)->lock); > + if (current->mm != mm) > + cpumask_clear_cpu(cpu, tmpmask); > + } > + smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1); > + preempt_enable(); > + free_cpumask_var(tmpmask); > +out: > + /* > + * Memory barrier on the caller thread between sending & waiting for > + * memory-barrier IPIs and following memory accesses to user-space > + * addresses. Orders mm_cpumask read and membarrier_ipi executions > + * before all user-space address memory accesses following > + * sys_membarrier(). This barrier is paired with memory barriers in: > + * - membarrier_ipi() (for each running thread of the current process) > + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory > + * accesses to user-space addresses) > + * - Each CPU ->mm update performed with rq lock held by the scheduler. > + * A memory barrier is issued each time ->mm is changed while the rq > + * lock is held. > + */ > + smp_mb(); > +} > + > +/* > + * sys_membarrier - issue memory barrier on target threads > + * @flags: MEMBARRIER_PRIVATE_FLAG: > + * Private flag set: only synchronize across a single process. If > + * this flag is not set, it means "shared": synchronize across > + * multiple processes. The shared mode is useful for shared memory > + * mappings across processes. > + * MEMBARRIER_EXPEDITED_FLAG: > + * Expedited flag set: adds some overhead, fast execution (few > + * microseconds). If this flag is not set, it means "delayed": low > + * overhead, but slow execution (few milliseconds). 
> + * MEMBARRIER_QUERY_FLAG: > + * Query whether the rest of the specified flags are supported, > + * without performing synchronization. > + * > + * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel > + * sys_membarrier support can be done by checking for an -ENOSYS return value. > + * A return value of 0 indicates success. For a given set of flags on a given > + * kernel, this system call will always return the same value. It is therefore > + * correct to check the return value only once during a process lifetime, > + * setting MEMBARRIER_QUERY_FLAG to only check if the flags are supported, > + * without performing any synchronization. > + * > + * This system call executes a memory barrier on all targeted threads. > + * If the private flag is set, only running threads belonging to the same > + * process are targeted. Else, all running threads, including those belonging to > + * other processes, are targeted. Upon completion, the caller thread is guaranteed > + * that all targeted running threads have passed through a state where all > + * memory accesses to user-space addresses are in program order. (Non-running > + * threads are de facto in such a state.) > + * > + * Using the non-expedited mode is recommended for applications which can > + * afford leaving the caller thread waiting for a few milliseconds. A good > + * example would be a thread dedicated to executing RCU callbacks, which spends > + * most of its time waiting for callbacks to be enqueued anyway. > + * > + * The expedited mode is recommended whenever the application needs to have > + * control returning to the caller thread as quickly as possible. An example > + * of such an application would be one which uses the same thread to perform > + * data structure updates and issue the RCU synchronization. > + * > + * It is perfectly safe to call both expedited and non-expedited > + * sys_membarrier() in a process. > + * > + * The combination of expedited mode (MEMBARRIER_EXPEDITED_FLAG) and non-private > + * (shared) (~MEMBARRIER_PRIVATE_FLAG) flags is currently unimplemented. Using > + * this combination returns -EINVAL. > + * > + * mm_cpumask is used as an approximation of the processors which run threads > + * belonging to the current process. It is a superset of the cpumask to which we > + * must send IPIs, mainly due to lazy TLB shootdown. Therefore, for each CPU in > + * the mm_cpumask, we check each runqueue with the rq lock held to make sure > + * they are indeed running our ->mm. The rq lock ensures that a memory barrier is > + * issued each time the rq current task is changed. This reduces the risk of > + * disturbing an RT task by sending unnecessary IPIs. There is still a slight > + * chance of disturbing an unrelated task, because we do not lock the runqueues > + * while sending IPIs, but the real-time impact of such heavy locking would be > + * worse than the comparatively small disruption of an IPI. > + * > + * RED PEN: before assigning a system call number for sys_membarrier() to an > + * architecture, we must ensure that switch_mm issues full memory barriers > + * (or a synchronizing instruction having the same effect) between: > + * - memory accesses to user-space addresses and clear mm_cpumask. > + * - set mm_cpumask and memory accesses to user-space addresses. > + * > + * The reason why these memory barriers are required is that mm_cpumask updates, > + * as well as iteration on the mm_cpumask, offer no ordering guarantees. 
> + * These added memory barriers ensure that any thread modifying the mm_cpumask > + * is in a state where all memory accesses to user-space addresses are > + * guaranteed to be in program order. > + * > + * In some cases adding a comment to this effect will suffice, in others we > + * will need to add explicit memory barriers. These barriers are required to > + * ensure we do not _miss_ a CPU that needs to receive an IPI, which would be a > + * bug. > + * > + * On uniprocessor systems, this system call simply returns 0 after validating > + * the arguments, so user-space knows it is implemented. > + */ > +SYSCALL_DEFINE1(membarrier, int, flags) > +{ > + int retval; > + > + retval = membarrier_validate_flags(flags); > + if (retval) > + goto end; > + if (unlikely((flags & MEMBARRIER_QUERY_FLAG) > + || ((flags & MEMBARRIER_PRIVATE_FLAG) > + && thread_group_empty(current))) > + || num_online_cpus() == 1) > + goto end; > + if (flags & MEMBARRIER_EXPEDITED_FLAG) > + membarrier_expedited(); > + else > + synchronize_sched(); > +end: > + return retval; > +} > + > +#else /* !CONFIG_SMP */ > + > +SYSCALL_DEFINE1(membarrier, int, flags) > +{ > + return membarrier_validate_flags(flags); > +} > + > +#endif /* CONFIG_SMP */ > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index 5adcb0a..5913b84 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -229,3 +229,6 @@ cond_syscall(sys_bpf); > > /* execveat */ > cond_syscall(sys_execveat); > + > +/* membarrier */ > +cond_syscall(sys_membarrier); > -- > 1.7.7.3
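As the sys_membarrier() kernel-doc above recommends, a process can check flag
support once at startup with MEMBARRIER_QUERY_FLAG and fall back to the
signal-based scheme otherwise. A minimal sketch, reusing the assumed
membarrier() wrapper from the first snippet (the syscall(2) wrapper reports
-ENOSYS and -EINVAL through errno):

#include <errno.h>

/* Mirrored from the proposed uapi header (assumption). */
#define MEMBARRIER_QUERY_FLAG	(1 << 31)

static int have_expedited_membarrier(void)
{
	if (membarrier(MEMBARRIER_QUERY_FLAG | MEMBARRIER_PRIVATE_FLAG
			| MEMBARRIER_EXPEDITED_FLAG) == 0)
		return 1;	/* supported; stable for the process lifetime */
	/* errno == ENOSYS: no sys_membarrier; EINVAL: flags unsupported. */
	return 0;
}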