Re: [PATCH for v4.2 v18 1/3] sys_membarrier(): system-wide memory barrier (generic, x86)
On May 30, 2015, at 12:40 AM, Andrew Morton a...@linux-foundation.org wrote:

> On Sat, 16 May 2015 19:48:18 -0400 Mathieu Desnoyers wrote:
>
>> Here is an implementation of a new system call, sys_membarrier(), which
>> executes a memory barrier on all threads running on the system. It is
>> implemented by calling synchronize_sched(). It can be used to distribute
>> the cost of user-space memory barriers asymmetrically by transforming
>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>> compiler barrier. For synchronization primitives that distinguish
>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>> read-side can be accelerated significantly by moving the bulk of the
>> memory barrier overhead to the write-side.
>>
>> ...
>>
>
> It would be nice to hear about the real world value of this syscall to
> our users. I'm seeing test results for a microbenchmark but so what.
> What actual applications or application classes are calling for this and
> what results can they expect to see?

AFAIK, the existing open source applications that would be improved by
this system call are as follows:

* Through Userspace RCU library (http://urcu.so)
  - DNS server (Knot DNS) https://www.knot-dns.cz/
  - Network sniffer (http://netsniff-ng.org/)
  - Distributed object storage (https://sheepdog.github.io/sheepdog/)
  - User-space tracing (http://lttng.org)
  - Network storage system (https://www.gluster.org/)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().
* Direct users of sys_membarrier
  - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.

>
>>
>> membarrier(2) man page:
>> --- snip ---
>> MEMBARRIER(2)         Linux Programmer's Manual         MEMBARRIER(2)
>>
>> NAME
>>        membarrier - issue memory barriers on a set of threads
>>
>> SYNOPSIS
>>        #include
>>
>>        int membarrier(int cmd, int flags);
>>
>> DESCRIPTION
>>        The cmd argument is one of the following:
>>
>>        MEMBARRIER_CMD_QUERY
>>               Query the set of supported commands. It returns a bitmask
>>               of supported commands.
>>
>>        MEMBARRIER_CMD_SHARED
>>               Execute a memory barrier on all threads running on the
>>               system. Upon return from system call, the caller thread is
>>               ensured that all running threads have passed through a
>>               state where all memory accesses to user-space addresses
>>               match program order between entry to and return from the
>>               system call (non-running threads are de facto in such a
>>               state). This covers threads from all processes running on
>>               the system. This command returns 0.
>>
>>        The flags argument needs to be 0. For future extensions.
>>
>>        All memory accesses performed in program order from each targeted
>>        thread is guaranteed to be ordered with respect to
>>        sys_membarrier(). If we use the semantic "barrier()" to represent
>>        a compiler barrier forcing memory accesses to be performed in
>>        program order across the barrier, and smp_mb() to represent
>>        explicit memory barriers forcing full memory ordering across the
>>        barrier, we have the following ordering table for each pair of
>>        barrier(), sys_membarrier() and smp_mb():
>>
>>        The pair ordering is detailed as (O: ordered, X: not ordered):
>>
>>                               barrier()   smp_mb()   sys_membarrier()
>>               barrier()           X           X             O
>>               smp_mb()            X           O             O
>>               sys_membarrier()    O           O             O
>>
>> RETURN VALUE
>>        On success, these system calls return zero. On error, -1 is
>>        returned, and errno is set appropriately. For a given command,
>>        with flags argument set to 0, this system call is guaranteed to
>>        always return the same value until reboot.
>
> I suggest "with flags argument set to MEMBARRIER_CMD_QUERY" here.

No, the enum is for the "cmd" argument (see above) not the flags
argument. We really mean flags = 0 (the value) here.

>
>>
>> ERRORS
>>        ENOSYS System call is not implemented.
>>
>>        EINVAL Invalid arguments.
>>
>> ...
>>
>> +SYSCALL_DEFINE2(membarr
Re: [PATCH for v4.2 v18 1/3] sys_membarrier(): system-wide memory barrier (generic, x86)
On Sat, 16 May 2015 19:48:18 -0400 Mathieu Desnoyers wrote:

> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads running on the system. It is
> implemented by calling synchronize_sched(). It can be used to distribute
> the cost of user-space memory barriers asymmetrically by transforming
> pairs of memory barriers into pairs consisting of sys_membarrier() and a
> compiler barrier. For synchronization primitives that distinguish
> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
> read-side can be accelerated significantly by moving the bulk of the
> memory barrier overhead to the write-side.
>
> ...
>

It would be nice to hear about the real world value of this syscall to
our users. I'm seeing test results for a microbenchmark but so what.
What actual applications or application classes are calling for this and
what results can they expect to see?

>
> membarrier(2) man page:
> --- snip ---
> MEMBARRIER(2)         Linux Programmer's Manual         MEMBARRIER(2)
>
> NAME
>        membarrier - issue memory barriers on a set of threads
>
> SYNOPSIS
>        #include
>
>        int membarrier(int cmd, int flags);
>
> DESCRIPTION
>        The cmd argument is one of the following:
>
>        MEMBARRIER_CMD_QUERY
>               Query the set of supported commands. It returns a bitmask
>               of supported commands.
>
>        MEMBARRIER_CMD_SHARED
>               Execute a memory barrier on all threads running on the
>               system. Upon return from system call, the caller thread is
>               ensured that all running threads have passed through a
>               state where all memory accesses to user-space addresses
>               match program order between entry to and return from the
>               system call (non-running threads are de facto in such a
>               state). This covers threads from all processes running on
>               the system. This command returns 0.
>
>        The flags argument needs to be 0. For future extensions.
>
>        All memory accesses performed in program order from each targeted
>        thread is guaranteed to be ordered with respect to
>        sys_membarrier(). If we use the semantic "barrier()" to represent
>        a compiler barrier forcing memory accesses to be performed in
>        program order across the barrier, and smp_mb() to represent
>        explicit memory barriers forcing full memory ordering across the
>        barrier, we have the following ordering table for each pair of
>        barrier(), sys_membarrier() and smp_mb():
>
>        The pair ordering is detailed as (O: ordered, X: not ordered):
>
>                               barrier()   smp_mb()   sys_membarrier()
>               barrier()           X           X             O
>               smp_mb()            X           O             O
>               sys_membarrier()    O           O             O
>
> RETURN VALUE
>        On success, these system calls return zero. On error, -1 is
>        returned, and errno is set appropriately. For a given command,
>        with flags argument set to 0, this system call is guaranteed to
>        always return the same value until reboot.

I suggest "with flags argument set to MEMBARRIER_CMD_QUERY" here.

>
> ERRORS
>        ENOSYS System call is not implemented.
>
>        EINVAL Invalid arguments.
>
> ...
>
> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> +{
> +	if (flags)
> +		return -EINVAL;

I'm not a huge fan of this "add a flags arg to syscalls" rule. Is there
any realistic expectation that we'll ever *use* this thing? If not, why
add it?

You may as well put an unlikely() in there btw.

> +	switch (cmd) {
> +	case MEMBARRIER_CMD_QUERY:
> +		return MEMBARRIER_CMD_BITMASK;
> +	case MEMBARRIER_CMD_SHARED:
> +		if (num_online_cpus() > 1)
> +			synchronize_sched();
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +}
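For readers following the quoted man page, the interface could be
exercised from user space roughly as follows. This is a minimal sketch
and not part of the patch: the sketch_* names are hypothetical, the raw
syscall(2) interface is used on the assumption that no libc wrapper is
available, and the command values (MEMBARRIER_CMD_QUERY = 0,
MEMBARRIER_CMD_SHARED = 1 << 0) are assumptions defined locally so the
example builds without the patch's uapi header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed command values (hypothetical local names, see lead-in). */
enum {
	SKETCH_MEMBARRIER_CMD_QUERY  = 0,
	SKETCH_MEMBARRIER_CMD_SHARED = 1 << 0,
};

/*
 * Thin wrapper around the raw system call. Returns the kernel's result,
 * or -1 with errno set (ENOSYS when the syscall number is unknown at
 * build time or the running kernel does not implement it).
 */
static int sketch_membarrier(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int) syscall(__NR_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Probe with the QUERY command and flags = 0, as the man page requires.
 * Returns 1 if the SHARED command is reported in the bitmask, else 0.
 */
static int sketch_membarrier_shared_available(void)
{
	int mask = sketch_membarrier(SKETCH_MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0)
		return 0;	/* error, e.g. ENOSYS: treat as unavailable */
	return (mask & SKETCH_MEMBARRIER_CMD_SHARED) != 0;
}
```

Since the man page promises a stable return value for a given command
with flags set to 0 until reboot, a caller would typically probe once at
startup and cache the answer rather than query on every use.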
[PATCH for v4.2 v18 1/3] sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to distribute
the cost of user-space memory barriers asymmetrically by transforming
pairs of memory barriers into pairs consisting of sys_membarrier() and a
compiler barrier. For synchronization primitives that distinguish
between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
read-side can be accelerated significantly by moving the bulk of the
memory barrier overhead to the write-side.

It is based on kernel v4.1-rc2.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each smp_mb()
within Thread A into calls to sys_membarrier() and each smp_mb() within
Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.
2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we diminish the number of unnecessary
wake-ups and only issue the memory barriers on active threads.
Non-running threads do not need to execute such barriers anyway, because
these are implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 2202847 writes
signal-based scheme:          9830061167 reads,    6700 writes
sys_membarrier:               9952759104 reads,     425 writes
sys_membarrier (dyn. check):  7970328887 reads,     425 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.
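The Thread A / Thread B pairing and the "dyn. check" variant can be
illustrated with a small user-space sketch. It is an illustration under
stated assumptions, not liburcu's actual code: all sketch_* names are
hypothetical, the command values (QUERY = 0, SHARED = 1 << 0) are
assumed, barrier() is written as a GCC/Clang inline-asm compiler barrier,
and __sync_synchronize() stands in for smp_mb().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Compiler barrier: forbids compiler reordering, emits no instruction. */
#define barrier()	__asm__ __volatile__("" ::: "memory")
/* Full memory barrier, standing in for smp_mb() (GCC/Clang builtin). */
#define smp_mb()	__sync_synchronize()

/* Assumed command values; the SKETCH_ names are hypothetical. */
enum { SKETCH_CMD_QUERY = 0, SKETCH_CMD_SHARED = 1 << 0 };

/* Nonzero once sketch_init() has found a usable sys_membarrier(). */
static int sketch_has_membarrier;

static int sketch_membarrier(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int) syscall(__NR_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;
	return -1;
#endif
}

/* Call once, before any reader or writer thread starts. */
static void sketch_init(void)
{
	int mask = sketch_membarrier(SKETCH_CMD_QUERY, 0);

	sketch_has_membarrier = mask >= 0 && (mask & SKETCH_CMD_SHARED) != 0;
}

/*
 * Thread B (frequent). Testing the flag on every call is the dynamic
 * availability check; when the syscall exists, only a compiler barrier
 * is paid here.
 */
static void sketch_reader_barrier(void)
{
	if (sketch_has_membarrier)
		barrier();
	else
		smp_mb();	/* fallback: pairs with the writer's smp_mb() */
}

/*
 * Thread A (infrequent). Returns 0 on success, -1 with errno set.
 * The system call "upgrades" the readers' compiler barriers.
 */
static int sketch_writer_barrier(void)
{
	if (sketch_has_membarrier)
		return sketch_membarrier(SKETCH_CMD_SHARED, 0);
	smp_mb();
	return 0;
}
```

Both sides must agree on the scheme, which is why the flag is set once
before any thread runs: a reader using only barrier() is correct solely
when every writer issues sys_membarrier().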
Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch adds the system call to x86 and to asm-generic.

[1] http://urcu.so

Signed-off-by: Mathieu Desnoyers
Reviewed-by: Paul E. McKenney
Reviewed-by: Josh Triplett
CC: KOSAKI Motohiro
CC: Steven Ro