Here is an implementation of a new system call, sys_membarrier(), which executes a memory barrier either on all running threads of the current process (MEMBARRIER_PRIVATE_FLAG) or, by calling synchronize_sched(), on all threads running on the system. It can be used to distribute the cost of user-space memory barriers asymmetrically by transforming pairs of memory barriers into pairs consisting of sys_membarrier() and a compiler barrier. For synchronization primitives that distinguish between read-side and write-side (e.g. userspace RCU, rwlocks), the read-side can be accelerated significantly by moving the bulk of the memory barrier overhead to the write-side.
The first user of this system call is the "liburcu" Userspace RCU implementation [1]. It aims at greatly simplifying and enhancing the current implementation, which uses a scheme similar to sys_membarrier(), but based on signals sent to each reader thread. Liburcu is currently packaged in all major Linux distributions. One user of this library is the LTTng-UST (Userspace Tracing) library [2]. The impact of two additional memory barriers on the LTTng tracer fast path has been benchmarked at 35 ns on our reference Intel Xeon 2.0 GHz, which adds up to about 25% performance degradation.

This patch mostly sits in kernel/sched.c (it needs to access struct rq). It is based on kernel v3.19, and also applies fine to master. I am submitting it as RFC.

Alternative approach: signals. A signal-based alternative proposed by Ingo would require significant scheduler modifications, adding context-switch in/out overhead (calling user-space upon scheduler switch in and out). Besides the overhead issue, I am also reluctant to base a synchronization primitive on signals, which, to quote Linus, are "already one of our more "exciting" layers out there", and which do not give me the warm feeling of rock-solidness usually expected from synchronization primitives.

Changes since v11:
- 5 years have passed.
- Rebase on v3.19 kernel.
- Add futex-alike PRIVATE vs SHARED semantic: private for per-process barriers, non-private for memory mappings shared between processes.
- Simplify user API.
- Code refactoring.

Changes since v10:
- Apply Randy's comments.
- Rebase on 2.6.34-rc4 -tip.

Changes since v9:
- Clean up #ifdef CONFIG_SMP.

Changes since v8:
- Go back to rq spin locks taken by sys_membarrier() rather than adding memory barriers to the scheduler.
  It implies a potential RoS (reduction of service) if sys_membarrier() is executed in a busy-loop by a user, but nothing more than what is already possible with other existing system calls; in exchange, it saves memory barriers in the scheduler fast path.
- Re-add the memory barrier comments to x86 switch_mm() as an example to other architectures.
- Update documentation of the memory barriers in sys_membarrier() and switch_mm().
- Append execution scenarios to the changelog showing the purpose of each memory barrier.

Changes since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned to incrementally reserve syscall IDs on other architectures as these are tested.

Changes since v6:
- Remove some unlikely() that were not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU read lock in sys_membarrier() rather than take each runqueue spinlock:
  - Move memory barriers from per-architecture switch_mm() to schedule() and finish_lock_switch(), where they clearly document that all data protected by the rq lock is guaranteed to have memory barriers issued between the scheduler update and the task execution. Replacing the spinlock acquire/release barriers with these memory barriers implies either no overhead (the x86 spinlock atomic instruction already implies a full mb) or some hopefully small overhead caused by upgrading the spinlock acquire/release barriers to the heavier-weight smp_mb().
  - The "generic" version of spinlock-mb.h declares both a mapping to standard spinlocks and full memory barriers. Each architecture can specialize this header following its own needs and declare CONFIG_HAVE_SPINLOCK_MB to use its own spinlock-mb.h.
  - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h implementations on a wide range of architectures would be welcome.
Changes since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks to the "flags" system call parameter. Past experience with accept4(), signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and inotify_init1() indicates that this is the kind of thing we want to plan for. Return -EINVAL if the mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.

Changes since v4:
- Add "int expedited" parameter; use synchronize_sched() in the non-expedited case. Thanks to Lai Jiangshan for making us seriously consider using synchronize_sched() to provide the low-overhead membarrier scheme.
- Check num_online_cpus() == 1; quickly return without doing anything.

Changes since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before sending an IPI. Ensures that we do not disturb RT tasks in the presence of lazy TLB shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changes since v2:
- Simply send-to-many to the mm_cpumask. It contains the list of processors we have to IPI (those which use the mm), and this mask is updated atomically.

Changes since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g.
executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in Thread A are ordering memory accesses with respect to smp_mb() present in Thread B, we can change each smp_mb() within Thread A into calls to sys_membarrier() and each smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory accesses do not happen concurrently with Thread A accesses (1), or they do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK, because at that point, thread A is not particularly interested in ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program order thanks to the compiler barrier, will be "upgraded" to full smp_mb() by the IPIs executing memory barriers on each active thread of the process. Non-running threads are intrinsically serialized by the scheduler.

* Benchmarks

On Intel Xeon E5405 (8 cores), one thread calling sys_membarrier while the other 7 threads are busy looping:

- expedited: 10,000,000 sys_membarrier calls in 43s = 4.3 microseconds/call
- non-expedited: 1000 sys_membarrier calls in 33s = 33 milliseconds/call

Expedited is 7600 times faster than non-expedited.
* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every thread of the process (as we currently do), we diminish the number of unnecessary wake-ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such a barrier anyway, because it is implied by the scheduler context switches.

Results in liburcu, operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 3129842 writes
signal-based scheme:          9825306874 reads,    5386 writes
sys_membarrier expedited:     6637539697 reads,  852129 writes
sys_membarrier non-expedited: 7992076602 reads,     220 writes

The dynamic sys_membarrier availability check adds some overhead to the read-side compared to the signal-based scheme, but besides that, with the expedited scheme, we can see that we are close to the read-side performance of the signal-based scheme and at about 1/4 of the memory-barrier scheme's write-side performance. We have a write-side speedup of 158:1 over the signal-based scheme by using the sys_membarrier system call. This allows a 3.9:1 read-side speedup over the pre-existing memory-barrier scheme. The non-expedited scheme indeed adds a much lower overhead on the read-side, both because we do not send IPIs and because we perform fewer updates, which in turn generates fewer cache-line exchanges. Its write-side latency, however, becomes even higher than with the signal-based scheme.
The advantage of sys_membarrier() over the signal-based scheme is that it does not require waking up all the process threads. The non-expedited sys_membarrier scheme can be useful to a userspace RCU flavor that encompasses all processes on a system, which may share memory mappings.

* More information about memory barriers in:
- sys_membarrier()
- membarrier_ipi()
- switch_mm()
- ->mm updates issued while the rq lock is held

The goal of these memory barriers is to ensure that all memory accesses to user-space addresses performed by every processor which executes threads belonging to the current process are observed to be in program order at least once between the two memory barriers surrounding sys_membarrier().

If we were to simply broadcast an IPI to all processors between the two smp_mb() in sys_membarrier(), membarrier_ipi() would execute on each processor, and waiting for these handlers to complete execution would guarantee that each running processor passed through a state where user-space memory address accesses were in program order. However, this "big hammer" approach does not please the real-time concerned people: it would let a non-RT task disturb real-time tasks by sending useless IPIs to processors not concerned with the memory of the current process.

This is why we iterate on the mm_cpumask, which is a superset of the processors concerned by the process memory map, and check each processor's ->mm with the rq lock held to confirm that the processor is indeed running a thread concerned with our mm (and not merely part of the mm_cpumask due to lazy TLB shootdown). User-space memory address accesses must be in program order when mm_cpumask is set or cleared (more details in the x86 switch_mm() comments).

The verification, for each cpu part of the mm_cpumask, that the rq ->mm is indeed part of the current ->mm needs to be done with the rq lock held.
This ensures that each time a rq ->mm is modified, a memory barrier (typically implied by the change of memory mapping) is also issued. The ->mm update and the memory barrier are made atomic by the rq spinlock.

Execution scenario (1) shows the behavior of the sys_membarrier() system call executed on Thread A while Thread B executes memory accesses that need to be ordered. Thread B is running. Memory accesses in Thread B are in program order (e.g. separated by a compiler barrier()).

1) Thread B running, ordering ensured by the membarrier_ipi():

Thread A                                Thread B
-------------------------------------------------------------------------
prev accesses to userspace addr.        prev accesses to userspace addr.
sys_membarrier
  smp_mb
  IPI  ------------------------------>  membarrier_ipi()
                                          smp_mb
       <------------------------------  return
  smp_mb
following accesses to userspace addr.   following accesses to userspace addr.

Execution scenarios (2-3-4-5) show the same setup as (1), but Thread B is not running while sys_membarrier() is called. Thanks to the memory barriers implied by load_cr3 in switch_mm(), Thread B user-space address memory accesses are already in program order when sys_membarrier() finds out that either the mm_cpumask does not contain Thread B's CPU or that that CPU's ->mm is not running the current process mm.

2) Context switch in, showing rq spin lock synchronization:

Thread A                                Thread B
-------------------------------------------------------------------------
                                        <prev accesses to userspace addr.
                                         saved on stack>
prev accesses to userspace addr.
sys_membarrier
  smp_mb
  for each cpu in mm_cpumask
    <Thread B CPU is present,
     e.g. due to lazy TLB shootdown>
    spin lock cpu rq
    mm = cpu rq mm
    spin unlock cpu rq
                                        context switch in
                                        <spin lock cpu rq by other thread>
                                        load_cr3 (or equiv. mem. barrier)
                                        spin unlock cpu rq
                                        following accesses to userspace addr.
    if (mm == current rq mm) <false>
  smp_mb
following accesses to userspace addr.
Here, the important point is that Thread B has passed through a point where all its userspace memory address accesses were in program order between the two smp_mb() in sys_membarrier().

3) Context switch out, showing rq spin lock synchronization:

Thread A                                Thread B
-------------------------------------------------------------------------
prev accesses to userspace addr.        prev accesses to userspace addr.
sys_membarrier
  smp_mb
  for each cpu in mm_cpumask
                                        context switch out
                                        spin lock cpu rq
                                        load_cr3 (or equiv. mem. barrier)
                                        <spin unlock cpu rq by other thread>
                                        <following accesses to userspace
                                         addr. will happen when rescheduled>
    spin lock cpu rq
    mm = cpu rq mm
    spin unlock cpu rq
    if (mm == current rq mm) <false>
  smp_mb
following accesses to userspace addr.

Same as (2): the important point is that Thread B has passed through a point where all its userspace memory address accesses were in program order between the two smp_mb() in sys_membarrier().

4) Context switch in, showing mm_cpumask synchronization:

Thread A                                Thread B
-------------------------------------------------------------------------
                                        <prev accesses to userspace addr.
                                         saved on stack>
prev accesses to userspace addr.
sys_membarrier
  smp_mb
  for each cpu in mm_cpumask
    <Thread B CPU not in mask>
                                        context switch in
                                        set cpu bit in mm_cpumask
                                        load_cr3 (or equiv. mem. barrier)
                                        following accesses to userspace addr.
  smp_mb
following accesses to userspace addr.

Same as 2-3: Thread B is passing through a point where userspace memory address accesses are in program order between the two smp_mb() in sys_membarrier().

5) Context switch out, showing mm_cpumask synchronization:

Thread A                                Thread B
-------------------------------------------------------------------------
prev accesses to userspace addr.        prev accesses to userspace addr.
sys_membarrier
  smp_mb
                                        context switch out
                                        load_cr3 (or equiv. mem. barrier)
                                        clear cpu bit in mm_cpumask
                                        <following accesses to userspace addr.
                                         will happen when rescheduled>
  for each cpu in mm_cpumask
    <Thread B CPU not in mask>
  smp_mb
following accesses to userspace addr.

Same as 2-3-4: Thread B is passing through a point where userspace memory address accesses are in program order between the two smp_mb() in sys_membarrier().

This patch only adds the system calls to x86. See the sys_membarrier() comments for the memory barrier requirements in switch_mm() when porting to other architectures.

[1] http://urcu.so
[2] http://lttng.org

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: KOSAKI Motohiro <kosaki.motoh...@jp.fujitsu.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Nicholas Miell <nmi...@comcast.net>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alan Cox <gno...@lxorguk.ukuu.org.uk>
CC: Lai Jiangshan <la...@cn.fujitsu.com>
CC: Stephen Hemminger <step...@networkplumber.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Peter Zijlstra <pet...@infradead.org>
CC: David Howells <dhowe...@redhat.com>
CC: Nick Piggin <npig...@kernel.dk>
---
 arch/x86/include/asm/mmu_context.h |  17 +++
 arch/x86/syscalls/syscall_32.tbl   |   1 +
 arch/x86/syscalls/syscall_64.tbl   |   1 +
 include/linux/syscalls.h           |   2 +
 include/uapi/asm-generic/unistd.h  |   4 +-
 include/uapi/linux/Kbuild          |   1 +
 include/uapi/linux/membarrier.h    |  75 +++++++++++++
 kernel/sched/core.c                | 208 ++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                    |   3 +
 9 files changed, 311 insertions(+), 1 deletions(-)
 create mode 100644 include/uapi/linux/membarrier.h

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 4b75d59..a30a63d 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #endif
 		cpumask_set_cpu(cpu,
				mm_cpumask(next));
+
+		/*
+		 * smp_mb() between mm_cpumask set and following memory
+		 * accesses to user-space addresses is required by
+		 * sys_membarrier(). A smp_mb() is also needed between
+		 * prior memory accesses and mm_cpumask clear. This
+		 * ensures that all user-space address memory accesses
+		 * performed by the current thread are in program order
+		 * when the mm_cpumask is set. Implied by load_cr3.
+		 */
+
 		/* Re-load page tables */
 		load_cr3(next->pgd);
 		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
@@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			 * We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
+			 *
+			 * smp_mb() between mm_cpumask set and memory accesses
+			 * to user-space addresses is required by
+			 * sys_membarrier(). This ensures that all user-space
+			 * address memory accesses performed by the current
+			 * thread are in program order when the mm_cpumask is
+			 * set. Implied by load_cr3.
 			 */
 			load_cr3(next->pgd);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..439415f 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat		stub32_execveat
+359	i386	membarrier		sys_membarrier
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..823130d 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	membarrier		sys_membarrier
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..058ec0a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -882,4 +882,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *argv,
 			const char __user *const __user *envp, int flags);
 
+asmlinkage long sys_membarrier(int flags);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..8da542a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_membarrier 282
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 00b10002..c5b0dbf 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -248,6 +248,7 @@
 header-y += mdio.h
 header-y += media.h
 header-y += media-bus-format.h
 header-y += mei.h
+header-y += membarrier.h
 header-y += memfd.h
 header-y += mempolicy.h
 header-y += meye.h
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
new file mode 100644
index 0000000..928b0d5a
--- /dev/null
+++ b/include/uapi/linux/membarrier.h
@@ -0,0 +1,75 @@
+#ifndef _UAPI_LINUX_MEMBARRIER_H
+#define _UAPI_LINUX_MEMBARRIER_H
+
+/*
+ * linux/membarrier.h
+ *
+ * membarrier system call API
+ *
+ * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/*
+ * All memory accesses performed in program order from each targeted thread are
+ * guaranteed to be ordered with respect to sys_membarrier().
+ * If we use the semantic "barrier()" to represent a compiler barrier forcing
+ * memory accesses to be performed in program order across the barrier, and
+ * smp_mb() to represent explicit memory barriers forcing full memory ordering
+ * across the barrier, we have the following ordering table for each pair of
+ * barrier(), sys_membarrier() and smp_mb():
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                        barrier()   smp_mb()   sys_membarrier()
+ *        barrier()          X           X            O
+ *        smp_mb()           X           O            O
+ *        sys_membarrier()   O           O            O
+ *
+ * If the private flag is set, only running threads belonging to the same
+ * process are targeted. Else, all running threads, including those belonging
+ * to other processes, are targeted.
+ */
+
+/* System call membarrier "flags" argument. */
+enum {
+	/*
+	 * Private flag set: only synchronize across a single process. If this
+	 * flag is not set, it means "shared": synchronize across multiple
+	 * processes. The shared mode is useful for shared memory mappings
+	 * across processes.
+	 */
+	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
+
+	/*
+	 * Expedited flag set: adds some overhead, fast execution (few
+	 * microseconds). If this flag is not set, it means "delayed": low
+	 * overhead, but slow execution (few milliseconds).
+	 */
+	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
+
+	/*
+	 * Query whether the rest of the specified flags are supported, without
+	 * performing synchronization.
+	 */
+	MEMBARRIER_QUERY_FLAG = (1 << 31),
+};
+
+#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5eab11d..8b33728 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/membarrier.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -8402,3 +8403,210 @@ void dump_cpu_task(int cpu)
 	pr_info("Task dump for CPU %d:\n", cpu);
 	sched_show_task(cpu_curr(cpu));
 }
+
+#ifdef CONFIG_SMP
+
+/*
+ * Execute a memory barrier on all active threads from the current process
+ * on SMP systems. In order to keep this code independent of implementation
+ * details of IPI handlers, do not rely on implicit barriers within IPI handler
+ * execution. This is not the bulk of the overhead anyway, so let's stay on the
+ * safe side.
+ */
+static void membarrier_ipi(void *unused)
+{
+	/* Order memory accesses with respect to the sys_membarrier() caller. */
+	smp_mb();
+}
+
+/*
+ * Handle out-of-memory by sending per-cpu IPIs instead.
+ */
+static void membarrier_fallback(void)
+{
+	struct mm_struct *mm;
+	int cpu;
+
+	for_each_cpu(cpu, mm_cpumask(current->mm)) {
+		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm == mm)
+			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+	}
+}
+
+static int membarrier_validate_flags(int flags)
+{
+	/* Check for unrecognized flags. */
+	if (flags & ~(MEMBARRIER_PRIVATE_FLAG | MEMBARRIER_EXPEDITED_FLAG
+			| MEMBARRIER_QUERY_FLAG))
+		return -EINVAL;
+	/* Check for unsupported flag combination.
+	 */
+	if ((flags & MEMBARRIER_EXPEDITED_FLAG)
+			&& !(flags & MEMBARRIER_PRIVATE_FLAG))
+		return -EINVAL;
+	return 0;
+}
+
+static void membarrier_expedited(void)
+{
+	struct mm_struct *mm;
+	cpumask_var_t tmpmask;
+	int cpu;
+
+	/*
+	 * Memory barrier on the caller thread between previous memory accesses
+	 * to user-space addresses and sending memory-barrier IPIs. Orders all
+	 * user-space address memory accesses prior to sys_membarrier() before
+	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
+	 * with memory barriers in:
+	 * - membarrier_ipi() (for each running thread of the current process)
+	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
+	 *   accesses to user-space addresses)
+	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
+	 *   A memory barrier is issued each time ->mm is changed while the rq
+	 *   lock is held.
+	 */
+	smp_mb();
+	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		membarrier_fallback();
+		goto out;
+	}
+	cpumask_copy(tmpmask, mm_cpumask(current->mm));
+	preempt_disable();
+	cpumask_clear_cpu(smp_processor_id(), tmpmask);
+	for_each_cpu(cpu, tmpmask) {
+		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm != mm)
+			cpumask_clear_cpu(cpu, tmpmask);
+	}
+	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+	preempt_enable();
+	free_cpumask_var(tmpmask);
+out:
+	/*
+	 * Memory barrier on the caller thread between sending & waiting for
+	 * memory-barrier IPIs and following memory accesses to user-space
+	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
+	 * before all user-space address memory accesses following
+	 * sys_membarrier().
+	 * This barrier is paired with memory barriers in:
+	 * - membarrier_ipi() (for each running thread of the current process)
+	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
+	 *   accesses to user-space addresses)
+	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
+	 *   A memory barrier is issued each time ->mm is changed while the rq
+	 *   lock is held.
+	 */
+	smp_mb();
+}
+
+/*
+ * sys_membarrier - issue memory barrier on target threads
+ * @flags: MEMBARRIER_PRIVATE_FLAG:
+ *             Private flag set: only synchronize across a single process. If
+ *             this flag is not set, it means "shared": synchronize across
+ *             multiple processes. The shared mode is useful for shared memory
+ *             mappings across processes.
+ *         MEMBARRIER_EXPEDITED_FLAG:
+ *             Expedited flag set: adds some overhead, fast execution (few
+ *             microseconds). If this flag is not set, it means "delayed": low
+ *             overhead, but slow execution (few milliseconds).
+ *         MEMBARRIER_QUERY_FLAG:
+ *             Query whether the rest of the specified flags are supported,
+ *             without performing synchronization.
+ *
+ * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel
+ * sys_membarrier support can be done by checking for the -ENOSYS return value.
+ * A return value of 0 indicates success. For a given set of flags on a given
+ * kernel, this system call will always return the same value. It is therefore
+ * correct to check the return value only once during a process lifetime,
+ * setting MEMBARRIER_QUERY_FLAG to only check if the flags are supported,
+ * without performing any synchronization.
+ *
+ * This system call executes a memory barrier on all targeted threads.
+ * If the private flag is set, only running threads belonging to the same
+ * process are targeted. Else, all running threads, including those belonging
+ * to other processes, are targeted.
+ * Upon completion, the caller thread is ensured that all targeted running
+ * threads have passed through a state where all memory accesses to user-space
+ * addresses match program order. (Non-running threads are de facto in such a
+ * state.)
+ *
+ * Using the non-expedited mode is recommended for applications which can
+ * afford leaving the caller thread waiting for a few milliseconds. A good
+ * example would be a thread dedicated to executing RCU callbacks, which waits
+ * for callbacks to enqueue most of the time anyway.
+ *
+ * The expedited mode is recommended whenever the application needs to have
+ * control returning to the caller thread as quickly as possible. An example
+ * of such an application would be one which uses the same thread to perform
+ * data structure updates and issue the RCU synchronization.
+ *
+ * It is perfectly safe to call both expedited and non-expedited
+ * sys_membarrier() in a process.
+ *
+ * The combination of expedited mode (MEMBARRIER_EXPEDITED_FLAG) and non-private
+ * (shared) (~MEMBARRIER_PRIVATE_FLAG) flags is currently unimplemented. Using
+ * this combination returns -EINVAL.
+ *
+ * mm_cpumask is used as an approximation of the processors which run threads
+ * belonging to the current process. It is a superset of the cpumask to which we
+ * must send IPIs, mainly due to lazy TLB shootdown. Therefore, for each CPU in
+ * the mm_cpumask, we check each runqueue with the rq lock held to make sure our
+ * ->mm is indeed running on it. The rq lock ensures that a memory barrier is
+ * issued each time the rq current task is changed. This reduces the risk of
+ * disturbing an RT task by sending unnecessary IPIs. There is still a slight
+ * chance to disturb an unrelated task, because we do not lock the runqueues
+ * while sending IPIs, but the real-time effect of this heavy locking would be
+ * worse than the comparatively small disruption of an IPI.
+ *
+ * RED PEN: before assigning a system call number for sys_membarrier() to an
+ * architecture, we must ensure that switch_mm issues full memory barriers
+ * (or a synchronizing instruction having the same effect) between:
+ * - memory accesses to user-space addresses and clear mm_cpumask.
+ * - set mm_cpumask and memory accesses to user-space addresses.
+ *
+ * The reason why these memory barriers are required is that mm_cpumask updates,
+ * as well as iteration on the mm_cpumask, offer no ordering guarantees.
+ * These added memory barriers ensure that any thread modifying the mm_cpumask
+ * is in a state where all memory accesses to user-space addresses are
+ * guaranteed to be in program order.
+ *
+ * In some cases adding a comment to this effect will suffice, in others we
+ * will need to add explicit memory barriers. These barriers are required to
+ * ensure we do not _miss_ a CPU that needs to receive an IPI, which would be
+ * a bug.
+ *
+ * On uniprocessor systems, this system call simply returns 0 after validating
+ * the arguments, so user-space knows it is implemented.
+ */
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+	int retval;
+
+	retval = membarrier_validate_flags(flags);
+	if (retval)
+		goto end;
+	if (unlikely((flags & MEMBARRIER_QUERY_FLAG)
+			|| ((flags & MEMBARRIER_PRIVATE_FLAG)
+				&& thread_group_empty(current)))
+			|| num_online_cpus() == 1)
+		goto end;
+	if (flags & MEMBARRIER_EXPEDITED_FLAG)
+		membarrier_expedited();
+	else
+		synchronize_sched();
+end:
+	return retval;
+}
+
+#else /* !CONFIG_SMP */
+
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+	return membarrier_validate_flags(flags);
+}
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..5913b84 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* membarrier */
+cond_syscall(sys_membarrier);
-- 
1.7.7.3