Re: Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme
On Thu, 2017-01-12 at 10:39 +0100, Torvald Riegel wrote:
> On Sat, 2016-12-24 at 17:01 +0100, Torvald Riegel wrote:
> > === Robust mutexes have bugs, in both glibc and the kernel
> >
> > I've been reviewing the implementation of robust mutexes in both glibc
> > and the kernel support code recently because there are several bug
> > reports, for example:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1401665
> > https://sourceware.org/bugzilla/show_bug.cgi?id=19402
> > https://sourceware.org/bugzilla/show_bug.cgi?id=14485
> >
> > This review revealed a bunch of bugs. I have committed/proposed patches
> > that fix all glibc-only bugs that I am aware of:
> > https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
> > https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
> > https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
> > https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html

These patches are now all committed to glibc master. Thus, you can start
fixing/testing now and have a high probability that you won't run into
glibc bugs.
Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme
TLDR:
(1) The Linux kernel fails to wake futex waiters when a userspace thread
that releases a mutex is killed between modifying the futex word and the
subsequent FUTEX_WAKE syscall.
(2) POSIX makes certain requirements regarding when destruction of a
mutex is safe that are violated by the current glibc/kernel
synchronization scheme, which can lead to corruption of userspace
memory; robust FUTEX_UNLOCK_PI must be changed slightly, and the non-PI
robust mutexes must either use a FUTEX_UNLOCK-like syscall to release
the mutex, try to detect this by interpreting (the current position in)
userspace code, or we must add a new futex recovery syscall if we want
to preserve the userspace-only mutex-release fastpath.

[Please keep me in CC because I'm not subscribed to LKML.]

=== Robust mutexes have bugs, in both glibc and the kernel

I've been reviewing the implementation of robust mutexes in both glibc
and the kernel support code recently because there are several bug
reports, for example:
https://bugzilla.redhat.com/show_bug.cgi?id=1401665
https://sourceware.org/bugzilla/show_bug.cgi?id=19402
https://sourceware.org/bugzilla/show_bug.cgi?id=14485

This review revealed a bunch of bugs. I have committed/proposed patches
that fix all glibc-only bugs that I am aware of:
https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html

These fixes are meant to be easy to backport, so they don't contain as
much cleanup and explanation of the glibc code as would be ideal.

I plan to follow these up with more cleanup and added documentation,
which should also describe the glibc/kernel synchronization scheme in
more detail.
I believe I have found all the bugs in robust mutexes (when including
the kernel-side bugs described in this email), but I'll make another
review pass when preparing this cleanup+documentation patch.

=== Background

A brief reminder of how glibc and the kernel attempt to synchronize
with each other: Each thread maintains a list of robust mutexes that it
has acquired, and a single pointer to a futex word that it is about to
acquire or release (called op_pending). The list is a concurrent list
(to some extent) because userspace can be interrupted by the post-crash
cleanup code run by the kernel (ie, this is similar to how one would
synchronize with a signal handler).

To release a mutex without using FUTEX_UNLOCK_PI, userspace basically
does the following:
(L1) atomic_store (&op_pending, &mutex->futex, memory_order_relaxed);
(L2) <"thread-safe" dequeue of mutex->futex from robust mutex list>
(L3) fw = atomic_exchange (&mutex->futex, 0, memory_order_release);
(L4) if (fw & FUTEX_WAITERS) futex_wake (&mutex->futex, 1);
(L5) atomic_store (&op_pending, NULL, memory_order_relaxed);
(I'm not showing the signal fences required; for simplicity, assume
that there is a signal fence after each line.)

In the case of a robust PI mutex, L3 and L4 are done by the kernel in
FUTEX_UNLOCK_PI.

=== Lost wakeups caused by the kernel

If a thread happens to be killed between L3 and L4 and FUTEX_WAITERS is
set, then userspace will have set the futex word to 0 but not yet have
run FUTEX_WAKE. op_pending still points to the futex word, but
handle_futex_death in kernel/futex.c only calls futex_wake if the TID
bits in the futex word are equal to the TID of the killed thread. The
result of that is a lost wake-up for one of the threads that set or
shared FUTEX_WAITERS.

This can be fixed by running futex_wake conservatively, even if the
value of *op_pending does not match the TID of the killed thread.
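To make the proposed fix concrete, here is a userspace model of the
recovery step for one futex word; the names (handle_futex_death_model,
futex_wake_stub, wake_calls) are illustrative stand-ins and simplify
the real kernel/futex.c code (which, for instance, uses a CAS loop
rather than a plain store):

```c
#include <stdatomic.h>
#include <stdint.h>

#define FUTEX_TID_MASK    0x3fffffffu
#define FUTEX_OWNER_DIED  0x40000000u
#define FUTEX_WAITERS     0x80000000u

static int wake_calls;  /* counts stubbed FUTEX_WAKE calls, for illustration */

/* Stub: the kernel would wake one waiter blocked on this futex word. */
static void futex_wake_stub(_Atomic uint32_t *word) { (void)word; wake_calls++; }

/* Model of post-crash recovery for one futex word (either a robust-list
   entry or the op_pending word of the dead thread). */
static void handle_futex_death_model(_Atomic uint32_t *word, uint32_t dead_tid)
{
    uint32_t v = atomic_load_explicit(word, memory_order_relaxed);
    if ((v & FUTEX_TID_MASK) == dead_tid) {
        /* The dead thread still owned the futex: mark the owner as dead,
           clear the TID, keep FUTEX_WAITERS, and wake a waiter. */
        uint32_t nv = (v & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
        atomic_store_explicit(word, nv, memory_order_release);
        if (v & FUTEX_WAITERS)
            futex_wake_stub(word);
    } else {
        /* Proposed fix: the TID does not match (eg, the thread died
           between L3 and L4 above), so wake conservatively.  The wake-up
           may be spurious, but it avoids the lost wakeup. */
        futex_wake_stub(word);
    }
}
```

With this, a thread that dies between L3 and L4 (futex word already 0,
op_pending still set) causes one possibly-spurious wake instead of none.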
This can lead to spurious wake-ups (eg, if the thread was killed between
L4 and L5), but userspace code has to, in the general case, be prepared
for spurious futex wake-ups anyway if using POSIX mutexes:
https://lkml.org/lkml/2014/11/27/472
Also, the same spurious wake-ups can happen under slightly different
executions without a crash (eg, if FUTEX_WAITERS is set spuriously due
to timeouts on another thread that set it).

=== POSIX' mutex destruction requirements are violated

In a nutshell, the mutex destruction requirements mean that there must
be a single atomic action that releases a mutex; after a mutex is
released, the thread releasing the mutex must not access the mutex'
memory anymore (but it is still allowed to run a FUTEX_WAKE on it); see
https://lkml.org/lkml/2014/11/27/472 for additional background.

Currently, the kernel can potentially modify the mutex' memory until L5
has executed. This means that getting killed while releasing a robust
mutex is not guaranteed to be a single atomic action. For example,
assume that Thread1 has acquired the mutex and releases it, but is
killed right before L5; if the program is built in such a way that only
Thread2 will acquire the mutex after Thread1 has released it, then
POSIX allows Thread2 to acquire the mutex.
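For reference, the userspace release fast path (steps L1-L5 above) can
be sketched in compilable C11; this is a simplified model, not glibc's
actual code — futex_wake is stubbed out (a real implementation issues
the FUTEX_WAKE syscall), and the robust-list dequeue and signal fences
are elided:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FUTEX_WAITERS 0x80000000u

/* Illustrative per-thread pointer to the futex word being operated on;
   in glibc this lives in the robust-list header registered with the
   kernel. */
static _Atomic(_Atomic uint32_t *) op_pending;

/* Stub: a real implementation would call
   syscall(SYS_futex, word, FUTEX_WAKE, 1, ...). */
static void futex_wake_one(_Atomic uint32_t *word) { (void)word; }

static void robust_mutex_release(_Atomic uint32_t *futex_word)
{
    /* L1: announce the pending operation to the kernel. */
    atomic_store_explicit(&op_pending, futex_word, memory_order_relaxed);
    /* L2: the "thread-safe" dequeue from the robust list goes here. */
    /* L3: release the mutex; the release ordering publishes the critical
       section to the next owner. */
    uint32_t fw = atomic_exchange_explicit(futex_word, 0,
                                           memory_order_release);
    /* L4: wake one waiter if any were recorded in the futex word. */
    if (fw & FUTEX_WAITERS)
        futex_wake_one(futex_word);
    /* L5: clear the pending-operation pointer. */
    atomic_store_explicit(&op_pending, NULL, memory_order_relaxed);
}
```

The lost-wakeup and destruction problems discussed above are exactly
about a thread dying between L3 and L4, or between L4 and L5, in this
sequence.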
Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote:
> On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds wrote:
> > On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds wrote:
> > >
> > > hash_long/ptr really shouldn't care about the low bits being zero at
> > > all, because it should mix in all the bits (using a prime multiplier
> > > and taking the high bits).
> >
> > Looking at this assertion, it doesn't actually pan out very well at all.
> >
> > The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.
> >
> > The 64-bit hash seems to be quite horribly bad with lots of values.
>
> Ok, I have tried to research this a bit more. There's a lot of
> confusion here that causes the fact that hash_64() sucks donkey balls.
>
> The basic method for the hashing method by multiplication is fairly
> sane. It's well-documented, and attributed to Knuth as the comment
> above it says.

Section 2.2 of this article might also be of interest:

M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A
Reliable Randomized Algorithm for the Closest-Pair Problem. Journal of
Algorithms, 25(1):19-51, 1997.

I don't know much about hash functions, but my understanding of this is
that one can do good hashing of words by multiplying with a random
number and using the most-significant bits of the lower-significant
word of the result. Different random numbers may work better than
others for the input data (and some might be really awful), but the
paper seems to claim that one *can* find a good random number for all
input data. In practice, this might mean that re-hashing with a
different random number after seeing too many conflicts in a hash table
should eventually lead to a good hashing (ie, because the *class* of
such hash functions is universal).
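The multiply-and-take-high-bits scheme under discussion can be sketched
as follows. The constant here is the odd 64-bit golden-ratio multiplier
commonly used for this; per the universal-hashing argument above, treat
it as just one possible "random" odd multiplier, not one guaranteed to
be good for all inputs:

```c
#include <stdint.h>

/* Multiplicative hashing: multiply by an odd constant and keep the top
   `bits` bits of the 64-bit product (1 <= bits <= 64).  A hash table
   seeing too many collisions could, per the universal-hashing argument,
   rehash with a different random odd multiplier. */
static inline uint64_t mult_hash_64(uint64_t val, unsigned bits)
{
    const uint64_t m = 0x9e3779b97f4a7c15ull; /* ~2^64 / golden ratio, odd */
    return (val * m) >> (64 - bits);
}
```

Because the high bits of the product mix in all input bits, inputs whose
low bits are all zero (eg, aligned pointers) still spread across the
table.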
Re: [RFC patch 4/7] futex: Add support for attached futexes
On Sun, 2016-04-03 at 06:30 -0500, Linus Torvalds wrote:
> On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar wrote:
> >
> > So an ABI distinction and offloading the decision to every single
> > application that wants to use it and hardcode it into actual
> > application source code via an ABI is pretty much the _WORST_ way to
> > go about it IMHO...
> >
> > So how about this: don't add any ABI details, but make futexes
> > auto-attached on NUMA systems (and obviously PREEMPT_RT systems)?
>
> I agree.
>
> Do *not* make this a visible new ABI.

From a glibc perspective, I agree that this shouldn't require an
extension of the ABI unless it's really the only possible way to solve
this. For "special" mutex kinds such as PI mutexes, the change in the
interface might be justifiable -- but for ordinary mutexes, there's no
good place to add the attach/detach calls in each thread: An
implementation of, say, C11 mutexes cannot easily estimate whether it
should use attached futexes, and it would have to track whether a
particular mutex has been attached to by the current thread; this might
just move the overhead of tracking and caching associations from the
kernel to userspace.

> You will find that people will make exactly the wrong choices - either
> not using it (because the futex is deep in a standard library!) when
> they want to, or using it when they shouldn't (because the futex is
> deep in a standard library, and the library writer knows *his* code is
> so important that it should get a special faster futex).

There are cases where userspace might know that it should use a
"special" futex. Consider an MCS-lock implementation in glibc by which
all pthreads, C, and C++ mutexes are backed: the lock nodes that
threads would be spinning on would be per-thread and exist for the
whole lifetime of the thread; attach and detach would be simple to do,
and there would be a limited number of these in the system. A Java VM's
park/unpark implementation might be another candidate.
However, these use cases are pretty specific (eg, there's only a single waiter), so any kernel support for these might be more special too.
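For readers unfamiliar with the MCS-lock idea referenced above, here is
a minimal C11 sketch (spinning only; no futex blocking). It is purely
illustrative — glibc does not implement this — but it shows why the
scheme fits attachment: each thread owns exactly one queue node for its
lifetime, and that per-thread node is the only word the thread ever
blocks on:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-thread queue node; the natural unit for a futex attach/detach. */
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;
};

static void mcs_lock_acquire(struct mcs_lock *l, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    struct mcs_node *prev =
        atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev != NULL) {
        atomic_store_explicit(&prev->next, me, memory_order_release);
        /* Spin on our own per-thread node; a real implementation would
           block here with a futex on &me->locked (there is only ever a
           single waiter per node). */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }
}

static void mcs_lock_release(struct mcs_lock *l, struct mcs_node *me)
{
    struct mcs_node *next =
        atomic_load_explicit(&me->next, memory_order_acquire);
    if (next == NULL) {
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_relaxed))
            return;  /* no successor */
        /* A successor is enqueueing; wait for its next-pointer store. */
        while ((next = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&next->locked, false, memory_order_release);
}
```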
Re: futex(3) man page, final draft for pre-release review
On Tue, 2015-12-15 at 14:41 -0800, Davidlohr Bueso wrote:
> On Tue, 15 Dec 2015, Michael Kerrisk (man-pages) wrote:
>
> > When executing a futex operation that requests to block a thread,
> > the kernel will block only if the futex word has the value that
> > the calling thread supplied (as one of the arguments of the
> > futex() call) as the expected value of the futex word. The load-
> > ing of the futex word's value, the comparison of that value with
> > the expected value, and the actual blocking will happen atomi-
> >
> > FIXME: for next line, it would be good to have an explanation of
> > "totally ordered" somewhere around here.
> >
> > cally and totally ordered with respect to concurrently executing
> > futex operations on the same futex word.
>
> So there are two things here regarding ordering. One is the most
> obvious which is ordered due to the taking/dropping the hb spinlock.

I suppose that this means what is described in the manpage already?
That is, that futex operations (ie, the syscalls) are atomic wrt each
other and in a strict total order?

> Secondly, its the cases which Peter brought up a while ago that
> involves atomic futex ops futex_atomic_*(), which do not have clearly
> defined semantics, and you get inconsistencies with certain archs
> (tile being the worst iirc).

OK. So, from a user's POV, this is about the semantics of the kernel's
accesses to the futex word. I agree that specifying this more clearly
would be helpful.

First, there are the comparisons of the futex words used in, for
example, FUTEX_WAIT. They should use an atomic load within the
conceptual critical sections that make up futex operations. This load
itself doesn't need to establish any ordering, so it can be equivalent
to a C11 memory_order_relaxed load. Are there any objections to that?

Second, we have the write accesses in FUTEX_[TRY]LOCK_PI and
FUTEX_UNLOCK_PI. We already specify those as atomic and within the
conceptual critical sections of the futex operation.
In addition, they should establish ordering themselves, so in C11 terms
they should have memory_order_acquire / memory_order_release semantics.
Specifying this would be good. Any objections to these semantics?

Third, we have the atomic read-modify-write operation that is part of
FUTEX_WAKE_OP (ie, AFAIU, the case you pointed at specifically). I
don't have a strong opinion on what it should be, because I think
userspace can enforce the orderings it needs on its own (eg, if I
interpret Peter Zijlstra's example correctly, userspace can add
appropriate fences before the CPU0/futex_unlock and after the
CPU2/futex_load calls). FUTEX_WAKE_OP accesses no other userspace
memory location, so there's no ordering relation to other accesses to
userspace memory that userspace cannot affect. OTOH, legacy userspace
may have assumed strong semantics, so making the read-modify-write have
memory_order_seq_cst semantics is probably a safe bet. Futex operations
typically shouldn't be on the fast paths anyway.

> But anyway, the important thing users need to know about is that the
> atomic futex operation must be totally ordered wrt any other user
> tasks that are trying to access that address.

I'm not sure what you mean precisely. One can't order the whole futex
operations totally wrt memory accesses by userspace because they'd need
to synchronize to do that, and thus userspace would have to either hook
into the kernel's synchronization or use HTM or such.

> This is not necessarily the case for kernel ops. Peter illustrates
> this nicely with the lock stealing example;
> (see https://lkml.org/lkml/2015/8/26/596).
>
> Internally, I believe we decided that making it fully ordered (as
> opposed to making use of implicit barriers for ACQUIRE/RELEASE), so
> you'd end up having an MB ll/sc MB kind of setup.

OK. So, any objections to documenting that the read-modify-write op in
FUTEX_WAKE_OP has memory_order_seq_cst semantics?
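As an illustration of the acquire/release pairing proposed for the lock
and unlock accesses, here is the userspace analogue in C11 — a toy
spinlock, not futex code; the point is only which memory orders the
accesses to the lock word need:

```c
#include <stdatomic.h>

static atomic_int lock_word;   /* 0 = free, 1 = owned */
static int shared_data;        /* protected by lock_word */

static void toy_lock(void)
{
    /* The read-modify-write that acquires ownership needs acquire
       semantics so that the critical section cannot be reordered
       before it. */
    int expected;
    do {
        expected = 0;
    } while (!atomic_compare_exchange_weak_explicit(
                 &lock_word, &expected, 1,
                 memory_order_acquire, memory_order_relaxed));
}

static void toy_unlock(void)
{
    /* The releasing store needs release semantics so that writes to
       shared_data become visible to the next thread that acquires the
       lock. */
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}
```

FUTEX_[TRY]LOCK_PI / FUTEX_UNLOCK_PI would play the roles of toy_lock /
toy_unlock here, with the futex word as lock_word.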
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: futex(3) man page, final draft for pre-release review
On Tue, 2015-12-15 at 13:18 -0800, Darren Hart wrote:
> On Tue, Dec 15, 2015 at 02:43:50PM +0100, Michael Kerrisk (man-pages) wrote:
> >
> > When executing a futex operation that requests to block a thread,
> > the kernel will block only if the futex word has the value that
> > the calling thread supplied (as one of the arguments of the
> > futex() call) as the expected value of the futex word. The load‐
> > ing of the futex word's value, the comparison of that value with
> > the expected value, and the actual blocking will happen atomi‐
> >
> > FIXME: for next line, it would be good to have an explanation of
> > "totally ordered" somewhere around here.
> >
> > cally and totally ordered with respect to concurrently executing
>
> Totally ordered with respect to futex operations refers to semantics
> of the ACQUIRE/RELEASE operations and how they impact ordering of
> memory reads and writes. The kernel futex operations are protected by
> spinlocks, which ensure that all operations are serialized with
> respect to one another.
>
> This is a lot to attempt to define in this document. Perhaps a
> reference to linux/Documentation/memory-barriers.txt as a footnote
> would be sufficient? Or perhaps for this manual, "serialized" would be
> sufficient, with a footnote regarding "totally ordered" and a pointer
> to the memory-barrier documentation?

I'd strongly prefer to document the semantics for users here. And I
don't think users use the kernel's memory model -- instead, if we
assume that most users will call futex ops from C or C++, then the best
we have is the C11 / C++11 memory model. Therefore, if we want to
expand that, we should specify semantics in terms of as-if equivalence
to C11 pseudo code.

I had proposed that in the past but, IIRC, Michael didn't want to add a
C11 "dependency" in the semantics back then, at least for the initial
release.
Here's what I wrote back then (atomic_*_relaxed() is like C11
atomic_*(..., memory_order_relaxed), lock/unlock have normal C11 mutex
semantics): For example, we could say that futex_wait is, in terms of
synchronization semantics, *as if* we'd execute a piece of C11 code.
Here's a part of the docs for a glibc-internal futex wrapper that I'm
working on; this is futex_wait:

  /* Atomically wrt other futex operations, this blocks iff the value at
     *FUTEX matches the expected value.  This is semantically equivalent
     to:
       l = <lock associated with FUTEX>;
       wait_flag = <wait flag associated with FUTEX>;
       lock (l);
       val = atomic_load_relaxed (FUTEX);
       if (val != expected) { unlock (l); return EAGAIN; }
       atomic_store_relaxed (wait_flag, 1);
       unlock (l);
       // Now block; can time out in futex_time_wait (see below)
       while (atomic_load_relaxed (wait_flag));

Note that no guarantee of a happens-before relation between a woken
futex_wait and a futex_wake is documented; however, this does not
matter in practice because we have to consider spurious wake-ups (see
below), and thus would not be able to reason which futex_wake woke us
anyway.

... and this is futex_wake:

  /* Atomically wrt other futex operations, this unblocks the specified
     number of processes, or all processes blocked on this futex if
     there are fewer than the specified number.  Semantically, this is
     equivalent to:
       l = <lock associated with the futex>;
       lock (l);
       for (res = 0; processes_to_wake > 0; processes_to_wake--, res++)
         {
           if (<no process blocked on the futex>) break;
           wf = <wait flag of a process blocked on the futex>;
           // No happens-before guarantee with woken futex_wait (see above)
           atomic_store_relaxed (wf, 0);
         }
       return res;

This allows a programmer to really infer the guarantees he/she can get
from a futex in terms of synchronization, without the docs having to
use prose to describe that. This should also not constrain the kernel
in terms of how to implement it, because it is a conceptual as-if
relation (e.g., the kernel won't spin-wait the whole time, and we might
want to make this clear for the PI case).
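The as-if description above can be made compilable. The following is a
hypothetical model, not glibc's wrapper: the per-futex internal lock is
stood in for by a C11 atomic_flag spinlock, and the wait-flag handling
is simplified to a single conceptual waiter:

```c
#include <errno.h>
#include <stdatomic.h>

struct model_futex {
    atomic_int  word;       /* the futex word itself */
    atomic_flag lock;       /* stands in for <lock associated with FUTEX> */
    atomic_int  wait_flag;  /* simplified: a single conceptual waiter */
};

static void model_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* spin */
}

static void model_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* As-if model of futex_wait: block iff *word == expected, atomically
   wrt other operations on the same model_futex. */
static int model_futex_wait(struct model_futex *f, int expected)
{
    model_lock(&f->lock);
    int val = atomic_load_explicit(&f->word, memory_order_relaxed);
    if (val != expected) {
        model_unlock(&f->lock);
        return EAGAIN;
    }
    atomic_store_explicit(&f->wait_flag, 1, memory_order_relaxed);
    model_unlock(&f->lock);
    /* Conceptual blocking: spin until a wake clears the flag. */
    while (atomic_load_explicit(&f->wait_flag, memory_order_relaxed))
        ;
    return 0;
}

/* As-if model of futex_wake for at most one waiter; returns the number
   of waiters woken. */
static int model_futex_wake(struct model_futex *f)
{
    int res = 0;
    model_lock(&f->lock);
    if (atomic_load_explicit(&f->wait_flag, memory_order_relaxed)) {
        /* No happens-before with the woken futex_wait is implied. */
        atomic_store_explicit(&f->wait_flag, 0, memory_order_relaxed);
        res = 1;
    }
    model_unlock(&f->lock);
    return res;
}
```

As in the pseudocode, the value check and the transition to the blocked
state happen inside one critical section, which is what makes the
compare-and-block atomic wrt concurrent wakes.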
Of course, there are several as-if representations we could use, and we
might want to be a bit more pseudo-code-ish to make this also easy to
understand for people not familiar with C11 (e.g., using mutex +
condvar with some relaxation of condvar guarantees).

I will go through the discussion pointed out by Davidlohr next.
Re: futex(3) man page, final draft for pre-release review
On Wed, 2015-12-16 at 16:54 +0100, Michael Kerrisk (man-pages) wrote:
> Hello Darren,
>
> On 12/15/2015 10:18 PM, Darren Hart wrote:
> > On Tue, Dec 15, 2015 at 02:43:50PM +0100, Michael Kerrisk (man-pages) wrote:
>
> [...]
>
> > This is a lot to attempt to define in this document. Perhaps a
> > reference to linux/Documentation/memory-barriers.txt as a footnote
> > would be sufficient? Or perhaps for this manual, "serialized" would
> > be sufficient, with a footnote regarding "totally ordered" and a
> > pointer to the memory-barrier documentation?
>
> I think I'll just settle for writing serialized in the man page, and
> be done with it :-).

I'd prefer if you'd not just use "serialized" :) Eventually, I'd prefer
if we can explain the semantics for the user in terms of the
terminology and semantics of the memory model of the programming
language that users will likely use to call futex ops (ie, C11 /
C++11).
Re: futex(3) man page, final draft for pre-release review
On Tue, 2015-12-15 at 14:43 +0100, Michael Kerrisk (man-pages) wrote:
> Hello all,
>
> After much too long a time, the revised futex man page *will*
> go out in the next man pages release (it has been merged
> into master).
>
> There are various places where the page could still be improved,
> but it is much better (and more than 5 times longer) than the
> existing page.

This looks good to me; I just saw minor things (see below). Thank you
for all the work you put into this (and to everybody who contributed)!

> When executing a futex operation that requests to block a thread,
> the kernel will block only if the futex word has the value that
> the calling thread supplied (as one of the arguments of the
> futex() call) as the expected value of the futex word. The load‐
> ing of the futex word's value, the comparison of that value with
> the expected value, and the actual blocking will happen atomi‐
>
> FIXME: for next line, it would be good to have an explanation of
> "totally ordered" somewhere around here.
>
> cally and totally ordered with respect to concurrently executing
> futex operations on the same futex word. Thus, the futex word is
> used to connect the synchronization in user space with the imple‐
> mentation of blocking by the kernel. Analogously to an atomic
> compare-and-exchange operation that potentially changes shared
> memory, blocking via a futex is an atomic compare-and-block oper‐
> ation.

Maybe -- should we just say that it refers to the mathematical notion
of a total order (or, technically, a strict total order in this case)?
Though I would hope that everyone using futexes is roughly aware of the
differences between partial and total orders.

> FUTEX_TRYLOCK_PI (since Linux 2.6.18)
>     This operation tries to acquire the futex at uaddr.

s/futex/lock/ to make it consistent with FUTEX_LOCK.

>     It is invoked when a user-space atomic acquire did not succeed
>     because the futex word was not 0.
> FIXME(Next sentence) The wording "The trylock in kernel" below
> needs clarification. Suggestions?
>
>     The trylock in kernel might succeed because the futex word
>     contains stale state (FUTEX_WAITERS and/or FUTEX_OWNER_DIED).
>     This can happen when the owner of the futex died. User space
>     cannot handle this condition in a race-free manner, but the
>     kernel can fix this up and acquire the futex.
>
>     The uaddr2, val, timeout, and val3 arguments are ignored.

What about "The acquisition of the lock might succeed if performed by
the kernel in cases when the futex word contains stale state..."?

> FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
>     Wait on a non-PI futex at uaddr and potentially be requeued (via
>     a FUTEX_CMP_REQUEUE_PI operation in another task) onto a PI
>     futex at uaddr2. The wait operation on uaddr is the same as for
>     FUTEX_WAIT.
>
>     The waiter can be removed from the wait on uaddr without
>     requeueing on uaddr2 via a FUTEX_WAKE operation in another task.
>     In this case, the FUTEX_WAIT_REQUEUE_PI operation returns with
>     the error EWOULDBLOCK.

This should be EAGAIN, I suppose, or the enumeration of errors should
include EWOULDBLOCK.

Torvald
Re: [c++std-parallel-1641] Re: Compilers and RCU readers: Once more unto the breach!
On Thu, 2015-05-21 at 13:42 -0700, Linus Torvalds wrote:
> On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney wrote:
> >
> > The compiler can (and does) speculate non-atomic non-volatile writes
> > in some cases, but I do not believe that it is permitted to speculate
> > either volatile or atomic writes.
>
> I do *not* believe that a compiler is ever allowed to speculate *any*
> writes - volatile or not - unless the compiler can prove that the end
> result is either single-threaded, or the write in question is
> guaranteed to only be visible in that thread (ie local stack variable
> etc).

It must not speculate volatile accesses. It could speculate
non-volatiles even if those are atomic and observable by other threads,
but that would require further work/checks on all potential observers
of those (ie, to still satisfy as-if). Thus, compilers are unlikely to
do such speculation, I'd say.

The as-if rule (ie, equality of observable behavior (ie, volatiles,
...) to the abstract machine) makes all this clear.

> Also, I do think that the whole "consume" read should be explained
> better to compiler writers. Right now the language (including very
> much in the "restricted dependency" model) is described in very
> abstract terms. Yet those abstract terms are actually very subtle and
> complex, and very opaque to a compiler writer.

I believe the issues around the existing specification of mo_consume
were pointed out by compiler folks. It's a complex problem, and I'm all
for more explanations, but I did get the impression that the compiler
writers in ISO C++ Study Group 1 do have a good understanding of the
problem.

> I personally think the whole "abstract machine" model of the C
> language is a mistake. It would be much better to talk about things in
> terms of actual code generation and actual issues. Make all the
> problems much more concrete, with actual examples of how memory
> ordering matters on different architectures.
As someone working for a toolchain team, I don't see how the
abstract-machine-based specification is a problem at all, nor have I
seen compiler writers struggling with it. It does give precise rules
for code generation.

The level of abstraction is a good thing for most programs because for
those, the primary concern is that the observable behavior and end
result is computed -- it's secondary and QoI how that happens. In
contrast, if you specify on the level of code generation, you'd have to
foresee how code generation might look in the future, including
predicting future optimizations and all that. That doesn't look
future-proof to me.

I do realize that this may be less than ideal for cases when one would
want to use a C compiler more like a convenient assembler. But that
case isn't the 99%, I believe.
Re: [c++std-parallel-1611] Compilers and RCU readers: Once more unto the breach!
On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote:
> http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

I have been discussing Section 7.9 with Paul during the last week.  While I think that 7.9 helps narrow down the problem somewhat, I'm still concerned that it effectively requires compilers to either track dependencies or conservatively prevent optimizations like value speculation and specialization based on that.  Neither is a good thing for a compiler.

7.9 adds requirements that dependency chains stop if the program itself informs the compiler about the value of something in the dependency chain (e.g., as shown in Figure 33).  Also, if a correct program that does not have undefined behavior must use a particular value, this is also considered as "informing" the compiler about that value.  For example:

  int arr[2];
  int* x = foo.load(mo_consume);
  if (x > arr)     // implies same object/array, so x is in arr[]
    int r1 = *x;   // compiler knows x == arr + 1

The program, assuming no undefined behavior, first tells the compiler that x should be within arr, and then the comparison tells the compiler that x!=arr, so x==arr+1 must hold because there are just two elements in the array.

Having these requirements is generally good, but we don't yet know how to specify this properly.  For example, I suppose we'd need to also say that the compiler cannot assume to know anything about a value returned from an mo_consume load; otherwise, nothing prevents a compiler from using knowledge about the stores that the mo_consume load can read from (see Section 7.2).

Also, a compiler is still required to not do value speculation or optimizations based on that.  For example, this program:

  void op(type *p) { foo /= p->a; bar = p->b; }
  void bar() {
    pointer = ppp.load(mo_consume);
    op(pointer);
  }
  ...
must not be transformed into this program, even if the compiler knows that global_var->a == 1:

  void op(type *p) { /* unchanged */ }
  void bar() {
    pointer = ppp.load(mo_consume);
    if (pointer != global_var) {
      op(pointer);
    } else { // specialization for global_var
      // compiler knows global_var->a==1;
      // compiler uses global_var directly, inlines, optimizes:
      bar = global_var->b;
    }
  }

The compiler could optimize out the division if pointer==global_var, but it must not access field b directly through global_var.  This would be pretty awkward; the compiler may work based on an optimized expression in the specialization (ie, create code that assumes global_var instead of pointer), but it would still have to carry around and use the non-optimized expression.

This wouldn't be as bad if it were easily constrained to code sequences that really need the dependencies.  However, 7.9 does not effectively contain dependencies to only the code that really needs them, IMO.  Unless the compiler proves otherwise, it has to assume that a load from a pointer carries a dependency.  Proving that is often very hard because it requires points-to analysis; 7.9 restricts this to intra-thread analysis, but that is still nontrivial.  Michael Matz had a similar concern (in terms of what it results in).

Given that mo_consume is useful but a very specialized feature, I wouldn't be optimistic that 7.9 would actually be supported by many compilers.  The trade-off between having to track dependencies and having to disallow optimizations is a bad one to make.  The simple way out for a compiler would be to just emit mo_acquire instead of mo_consume and be done with it all -- and this might be the most practical decision overall, or the default general-purpose implementation.  At least I haven't heard any compiler implementer say that they think it's obviously worth implementing.  I also don't think 7.9 is ready for ISO standardization yet (or any of the other alternatives mentioned in the paper).
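For illustration, the "just emit mo_acquire instead of mo_consume" fallback mentioned above can be sketched in C11 (the function name is made up; this is not a proposed interface):

```c
#include <stdatomic.h>

/* A consume-style load implemented by promoting it to acquire.  This gives
 * up the cheaper dependency-based ordering that mo_consume could in
 * principle provide on weakly ordered hardware, but it is always correct
 * and requires no dependency tracking in the compiler. */
static inline void *consume_load(_Atomic(void *) *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}
```

On strongly ordered hardware (e.g., x86) the two orderings typically compile to the same instruction anyway; the cost difference shows up mainly on architectures like ARM or PowerPC.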
Standardizing a feature that we're not sure will actually be implemented is not a good thing to do; it's too costly for all involved parties (compiler writers *and* users).

IMO, the approach outlined in Section 7.7 is still the most promising contender in the long run.  It avoids the perhaps more pervasive changes that a type-system-based approach such as the one in Section 7.2 might result in, yet still informs the compiler where dependencies are actually used and which chain of expressions would be involved in that.  Tracking is probably simplified, as dependencies are never open-ended and potentially leaking into various other regions of code.  It seems easier to specify in a standard because we just need the programmer to annotate the intent and the rest is compiler QoI.  It would require users to annotate their use of dependencies, but they don't need to follow further rules; performance tuning of the code so it actually makes use of dependencies is mostly a compiler QoI thing, and if the compiler can't maintain a dependency, it can issue warnings and thus make the tuning interactive for the user.  Of
Re: Revised futex(2) man page for review
On Tue, 2015-04-14 at 23:40 +0200, Thomas Gleixner wrote:
> On Sat, 28 Mar 2015, Peter Zijlstra wrote:
> > On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) wrote:
> > > So, please take a look at the page below. At this point,
> > > I would most especially appreciate help with the FIXMEs.
> >
> > For people who cannot read that troff gibberish (me)..
>
> Ditto :)
>
> > NOTES
> >     Glibc does not provide a wrapper for this system call; call it
> >     using syscall(2).
>
> You might mention that pthread_mutex, pthread_condvar interfaces are
> high level wrappers for the syscall and recommended to be used for
> normal use cases. IIRC unnamed semaphores are implemented with futexes
> as well.

If we add this, I'd rephrase it to say that there are high-level programming abstractions such as the pthread_condvar interfaces or semaphores that are implemented using the syscall and that are typically a better fit for normal use cases.  I'd consider only the condvars as something like a wrapper, or targeting a similar use case.
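To illustrate the note about calling the syscall via syscall(2), here is a minimal sketch (not glibc code; the private-futex flag and timeout handling are omitted for brevity):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block while *uaddr still equals 'expected'; fails immediately with
 * EAGAIN (== EWOULDBLOCK) if the futex word already differs. */
static long futex_wait(int *uaddr, int expected)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to 'nwake' waiters blocked on uaddr; returns the number woken. */
static long futex_wake(int *uaddr, int nwake)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, nwake, NULL, NULL, 0);
}
```

As the quoted text says, the pthread and semaphore abstractions built on top of this are normally the better choice for application code.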
Re: futex(2) man page update help request
On Sat, 2015-01-24 at 12:35 +0100, Thomas Gleixner wrote:
> So we should never see -EINTR in the case of a spurious wakeup here.
>
> But, here is the not so good news:
>
> I did some archaeology. The restart handling of futex_wait() got
> introduced in kernel 2.6.22, so anything older than that will have
> the spurious -EINTR issues.
>
> futex_wait_pi() always had the restart handling and glibc folks back
> then (2006) requested that it should never return -EINTR, so it
> unconditionally restarts the syscall whether a signal had been
> delivered or not.
>
> So kernels >= 2.6.22 should never return -EINTR spuriously. If that
> happens it's a bug and needs to be fixed.

Thanks for looking into this.

Michael, can you include the above in the documentation please?  This is useful for userspace code like glibc that expects a minimum kernel version.

Thanks!
Re: futex(2) man page update help request
On Sat, 2015-01-24 at 11:05 +0100, Thomas Gleixner wrote:
> On Fri, 23 Jan 2015, Torvald Riegel wrote:
> > On Fri, 2015-01-16 at 16:46 -0800, Darren Hart wrote:
> > > On 1/16/15, 12:54 PM, "Michael Kerrisk (man-pages)" wrote:
> > > > Color me stupid, but I can't see this in futex_requeue(). Where is that
> > > > check that is "independent of the requeue type (normal/pi)"?
> > > >
> > > > When I look through futex_requeue(), all the likely looking sources
> > > > of EINVAL are governed by a check on the 'requeue_pi' argument.
> > >
> > > Right, in the non-PI case, I believe there are valid use cases: move to
> > > the back of the FIFO, for example (OK, maybe the only example?).
> >
> > But we never guarantee a futex is a FIFO, or do we?  If we don't, then
> > such a requeue could be implemented as a no-op by the kernel, which
> > would sort of invalidate the use case.
> >
> > (And I guess we don't want to guarantee FIFO behavior for futexes.)
>
> The (current) behaviour is:
>
>   real-time threads:   FIFO per priority level
>   sched-other threads: FIFO independent of nice level
>
> The wakeup is priority ordered.  Highest priority level first.

OK.  But, just to be clear, do I correctly understand that you do not want to guarantee FIFO behavior in the specified futex semantics?

I think there are cases where being able to *rely* on FIFO (now and on all future kernels) would be helpful for users (e.g., for POSIX/C++11 condvars, and I assume in other ordered-wakeup cases too) -- but at the same time, this would constrain future futex implementations.
Re: futex(2) man page update help request
On Thu, 2015-01-15 at 16:10 +0100, Michael Kerrisk (man-pages) wrote:
> [Adding a few people to CC that have expressed interest in the
> progress of the updates of this page, or who may be able to
> provide review feedback. Eventually, you'll all get CCed on
> the new draft of the page.]
>
> Hello Thomas,
>
> On 05/15/2014 04:14 PM, Thomas Gleixner wrote:
> > On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:
> > > And that universe would love to have your documentation of
> > > FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),
> >
> > I give you almost the full treatment, but I leave REQUEUE_PI to
> > Darren and FUTEX_WAKE_OP to Jakub. :)
>
> Thank you for the great effort you put into compiling the
> text below, and apologies for my long delay in following up.
>
> I've integrated almost all of your suggestions into the
> manual page.  I will shortly send out a new draft of the
> page that contains various FIXMEs for points that remain
> unclear.

Michael, thanks for working on the draft!  I'll review the draft closely once you've sent it (or have I missed it?).  There are a few things that I'd like to see covered.

First, we should discuss that users, unless they control all code in the respective process, need to expect futexes to be affected by spurious futex_wake calls; see https://lkml.org/lkml/2014/11/27/472 for background and Linus' choice (AFAIU) to just document this.  So, for example, a return code of 0 for FUTEX_WAIT can mean either being woken up by a FUTEX_WAKE intended for this futex, or by a stale one intended for another futex used by, for example, glibc internally.  (Note that as explained in this thread, this isn't just a glibc artifact, but a result of the general futex design mixed with destruction requirements for mutexes and other constructs in C++11 and POSIX.)  It might also be necessary to further consider this when documenting the errors, because it does affect how to handle them.
See this for a glibc perspective:
https://sourceware.org/ml/libc-alpha/2014-09/msg00381.html

Second, the current documentation for EINTR is that it can happen due to receiving a signal *or* due to a spurious wake-up.  This is difficult to handle when implementing POSIX semaphores, because they require that EINTR is returned from sem_wait if and only if the interruption was due to a signal.  Thus, if FUTEX_WAIT can return EINTR spuriously, the semaphore implementation can't simply return EINTR from sem_wait; see this for more comments, including some discussion of why use cases relying on the POSIX requirement around EINTR are borderline timing-dependent:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/sem_waitcommon.c;h=96848d7ac5b6f8f1f3099b422deacc09323c796a;hb=HEAD#l282
Others have commented that aio_suspend has a similar issue; if EINTR weren't in fact returned spuriously, the POSIX implementation side would get easier.

Third, I think it would be useful to -- somewhere -- explain which behavior the futex operations would have conceptually when expressed by C11 code.  We currently say that they wake up, sleep, etc., and which values they return.  But we never say how to properly synchronize with them on the userspace side.  The C11 memory model is probably the best model to use on the userspace side, so that's why I'm arguing for this.  Basically, I think we need to (1) tell people that they should use memory_order_relaxed accesses to the futex variable (ie, the memory location associated with the whole futex construct on the kernel side -- or do we have another name for this?), and (2) give some conceptual guarantees for the kernel-side synchronization so that one can use these to derive how to use the futex operations correctly in userspace.  The man pages might not be the right place for this, and maybe we just need a revision of "Futexes are tricky".  If you have other suggestions for where to document this, or on the content, let me know.  (I'm also willing to spend time on this :) ).
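Putting the first and third points together, a FUTEX_WAIT loop that is robust against both spurious wake-ups (return value 0) and spurious EINTR could be sketched like this (names are illustrative; fatal-error handling is elided):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wait until the futex word no longer equals blocked_val.  The relaxed
 * re-load after every return makes spurious wake-ups and spurious EINTR
 * harmless: we simply check the word again and block once more. */
static void wait_while_equal(atomic_int *futex_word, int blocked_val)
{
    while (atomic_load_explicit(futex_word, memory_order_relaxed)
           == blocked_val) {
        long r = syscall(SYS_futex, futex_word, FUTEX_WAIT, blocked_val,
                         NULL, NULL, 0);
        if (r == -1 && errno != EAGAIN && errno != EINTR)
            break;  /* genuine error (ENOSYS, EINVAL, ...); report it */
    }
}
```

The key design point is that correctness never depends on *why* FUTEX_WAIT returned; only the value of the futex word decides whether to keep waiting.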
Torvald
Re: futex(2) man page update help request
On Fri, 2015-01-16 at 16:46 -0800, Darren Hart wrote:
> On 1/16/15, 12:54 PM, "Michael Kerrisk (man-pages)" wrote:
> > Color me stupid, but I can't see this in futex_requeue(). Where is that
> > check that is "independent of the requeue type (normal/pi)"?
> >
> > When I look through futex_requeue(), all the likely looking sources
> > of EINVAL are governed by a check on the 'requeue_pi' argument.
>
> Right, in the non-PI case, I believe there are valid use cases: move to
> the back of the FIFO, for example (OK, maybe the only example?).

But we never guarantee a futex is a FIFO, or do we?  If we don't, then such a requeue could be implemented as a no-op by the kernel, which would sort of invalidate the use case.

(And I guess we don't want to guarantee FIFO behavior for futexes.)
Re: POSIX mutex destruction requirements vs. futexes
On Mon, 2014-12-01 at 10:31 -0800, Linus Torvalds wrote:
> On Mon, Dec 1, 2014 at 4:05 AM, Torvald Riegel wrote:
> > On Thu, 2014-11-27 at 11:38 -0800, Linus Torvalds wrote:
> > >
> > > > (1) Allow spurious wake-ups from FUTEX_WAIT.
> > >
> > > because afaik that is what we actually *do* today (we'll wake up
> > > whoever re-used that location in another thread), and it's mainly
> > > about the whole documentation issue. No?
> >
> > If that's what the kernel prefers to do, this would be fine with me.
>
> I think it's more of a "can we even do anything else"?
>
> The kernel doesn't even see the reuse of the futex, or the fast path
> parts. Not seeing the fast path is obviously by design, and not seeing
> the reuse is partly due to pthreads interfaces (I guess people should
> generally call mutex_destroy, but I doubt people really do, and even
> if they did, how would user space actually handle the nasty race
> between "pthread_unlock -> stale futex_wake" "pthread_mutex_destroy()"
> anyway?).

User space could count stale (or pending) futex_wake calls and spin in pthread_mutex_destroy() until this count is zero.  However, that would increase contention significantly, and we must spin, not block, in pthread_mutex_destroy(), at least for process-shared mutexes, because there's no place to put a futex for this destruction-time blocking that is not subject to the same issue again.  (For process-private futexes, this could be a memory location that is only ever used by glibc.)  There might even be more issues related to unmapping memory of process-shared mutexes based on reference counting in the critical sections.  The additional contention of counting stale futex_wake's worries me most.
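The counting scheme just described could look roughly like this (purely a hypothetical sketch -- glibc does not do this, and all names are made up):

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int pending_wakes;  /* slow-path wake calls still in flight */

/* Slow-path part of unlock: announce the pending wake before issuing it. */
static void unlock_slowpath_wake(int *futex_word)
{
    atomic_fetch_add(&pending_wakes, 1);
    syscall(SYS_futex, futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
    atomic_fetch_sub(&pending_wakes, 1);
}

/* Destruction must spin, not block: there is no safe memory location
 * left to block on (see the surrounding discussion). */
static void destroy_drain(void)
{
    while (atomic_load(&pending_wakes) != 0)
        ;
}
```

Every slow-path unlock now writes twice to a shared counter, which is exactly the added contention worried about above.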
You only need to count when threads actually use futexes to block, and perhaps glibc's mutex implementation could be changed to spin-wait aggressively, and perhaps we could add some explicit handover to other lock owners or something similar to avoid issuing a futex_wake unless really necessary -- but this seems like quite a lot of hoops to jump through to work around a relatively small aspect of the current futexes.

> So the thing is, I don't see how we could possibly change the existing
> FUTEX_WAKE behavior.
>
> And introducing a new "debug mode" that explicitly adds spurious
> events might as well be done in user space with a wrapper about
> FUTEX_WAKE or something.

User space could introduce a wrapper (e.g., glibc could provide a futex wrapper that allows spurious wake-ups on a return of 0) -- but glibc can't prevent users from using futexes directly rather than through the wrapper.  Or should it try to intercept direct, non-wrapper uses of the futex syscall in some way?

That's why I included a "debug mode" -- I'd rather call it a "new" mode than a debug mode, because it's not really about debugging -- in the list of options (ie, options (2) and (3)).  This would allow glibc (and other libraries) to use the futex variants with the new semantics in the most natural way, and yet it would not create situations in which calls to the old futex variant *appear* to allow spurious wake-ups (due to glibc's or other libraries' fault -- libstdc++ is in the same boat, for example).

> Because as far as the *kernel* is concerned, there is no "spurious
> wake" event. It's not spurious. It's real and exists, and the wakeup
> was really done by the user. The fact that it's really a stale wakeup
> for a previous allocation of a pthread mutex data structure is
> something that is completely and fundamentally invisible to the
> kernel.
>
> No?

I agree, and that's why I mentioned that it may seem odd to fix this on the level of the kernel interface.
However, it just seems the best approach when considering practice in kernel and user space, the by-design futex properties, and the realities of what POSIX requires.

> So even from a documentation standpoint, it's really not about "kernel
> interfaces" being incorrectly documented, as about a big honking
> warning about internal pthread_mutex() implementation in a library,
> and the impact that library choice has on allocation re-use, and how
> that means that even if the FUTEX_WAKE is guaranteed to not be
> spurious from a *kernel* standpoint, certain use models will create
> their own spurious wake events in very subtle ways.

I agree it's not the kernel's fault, but that doesn't solve the dilemma.  It's one aspect of the futex design -- whether spurious wake-ups are allowed for a return of 0 -- that makes it hard to use futexes for POSIX, C++11 (and C11, most likely) synchronization primitives without user space violating the k
Re: POSIX mutex destruction requirements vs. futexes
On Thu, 2014-11-27 at 11:38 -0800, Linus Torvalds wrote: > On Thu, Nov 27, 2014 at 6:27 AM, Torvald Riegel wrote: > > This means that in the second example above, futex_wake can be > > concurrent with whatever happens on the mutex' memory location after the > > mutex has been destroyed. Examples are: > > * The memory is unmapped. futex_wake will return an error. OK. > > * The memory is reused, but not for a futex. No thread will get > > woken. OK. > > * The memory is reused for another glibc mutex. The slow-path > > futex wake will now hit another, unrelated futex -- but the > > mutex implementation is robust to such spurious wake-ups anyway, > > because it can always happen when a mutex is acquired and > > released more than once. OK. > > * The memory is reused for another futex in some custom data > > structure that expects there is just one wait/wake cycle, and > > relies on FUTEX_WAIT returning 0 to mean that this is caused by > > the matching FUTEX_WAKE call by *this* data structure. Not OK, > > because now the delayed slow-path wake-up introduces a spurious > > wake-up in an unrelated futex. > > > > Thus, introducing spurious wake-ups is the core issue. > > So my gut feeling is that we should just try to see if we can live > with spurious wakeups, ie your: > > > (1) Allow spurious wake-ups from FUTEX_WAIT. > > because afaik that is what we actually *do* today (we'll wake up > whoever re-used that location in another thread), and it's mainly > about the whole documentation issue. No? If that's what the kernel prefers to do, this would be fine with me. In this case, I'd follow up with Michael Kerrisk to get this documented. I guess something along the lines of this would work: "Return 0 if the process was woken by a FUTEX_WAKE call, or spuriously. 
(Note that these spurious wake-ups can result from futex operations by other threads or processes.)"

A more detailed explanation outside of the manpages (ie, the rationale for this wording) would also be useful for users, I suppose.

However, I've heard others being more concerned over this change in semantics.  Therefore, I'll describe the pros/cons that I see, just so that we're all on the same page on what the decision for (1) would mean.

The change in return-value-of-0 semantics that (1) is about can violate the futex semantics that existing code like the following example might assume:

  Initially f==0

  Thread 1:
    atomic_store_relaxed(&f, 1);
    futex_wake(&f, 1, ...);

  Thread 2:
    do {
      err = futex_wait(&f, 0, NULL);
      // handle fatal errors like ENOSYS, EINVAL, ...
    } while (err == EINTR);
    // err must have been EWOULDBLOCK or 0

For this to work according to the specs, the program itself must not have pending FUTEX_WAKE's to f (e.g., by having a barrier after the operations by both threads).  The program can use glibc pthreads synchronization operations, because they don't specify anywhere that they cause concurrent FUTEX_WAKE's after the synchronization data structures have been destroyed and their memory has been reused.  For this to work in practice, the program would need to not use reference counting to trigger destruction of pthread synchronization data structures (or do something similar that causes a mutex unlock that has not returned to the caller to be concurrent with destruction).

Thus, there can be correct programs out there that never are affected by spurious wake-ups, but that would now have to change if they want to be correct after the change in documentation.  I would bet that those cases are the minority, but I don't have any data to back this up or even make a guess about how small a minority they would be.
OTOH, in practice, they would *not* become buggy after the documentation change, *unless* the kernel should decide to, in the future, also introduce spurious wake-ups due to other reasons and return 0 instead of EINTR.  (That's why I put the note in the suggested docs change above.)

One could also argue that the current spec actually allows the spurious wake-ups we are concerned with.  The manpage currently states: "Returns 0 if the process was woken by a FUTEX_WAKE call."  The spurious wake-ups *are* due to other FUTEX_WAKE calls, just not due to the calls that the piece of code assumed they might come from.  But that's probably not quite obvious :)

It also seems a little odd that consequences of the design of how futexes are used in practice (ie, the fast/slow path split) and the POSIX/C++11 requirements on destruction bleed into the semantics of kernel operations.  However, allowing spurious wake-ups for a return value of 0 might be the simplest way to
POSIX mutex destruction requirements vs. futexes
TLDR: POSIX and C++11 make implementation requirements for the destruction of mutexes that, if we want to continue to use the userspace-fastpath / kernel-slowpath approach of non-PI futexes, conflict with spurious wake-ups not being allowed for FUTEX_WAIT calls that return 0.  The best solutions seem to be to either allow spurious wake-ups, or create new variants of FUTEX_WAIT and/or FUTEX_WAKE that allow spurious wake-ups.

Using reference counting in critical sections to decide when the mutex protecting the critical section can be destroyed has been recently discussed on LKML.  For example, something like this is supposed to work:

  int free = 0;

  mutex_lock(&s->lock);
  if (--s->refcount == 0)
    free = 1;
  mutex_unlock(&s->lock);
  if (free)
    kfree(s);

The Austin Group has clarified that pthreads mutexes have to support such uses:
http://austingroupbugs.net/view.php?id=811
C++11 makes essentially the same requirement (see 30.4.1.2.1p5; destruction is allowed if no thread owns the mutex; it is not required that all unlock calls have returned before destruction starts).  C11 is silent on this particular aspect of mutex semantics, but it usually follows POSIX and/or C++ semantics.

The destruction requirement doesn't just apply in cases of explicit reference counting.  Another example would be:

  Thread1:
    pthread_mutex_lock(&m);
    pthread_create(..., Thread2, ...);
    pthread_mutex_unlock(&m);

  Thread2:
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    // reuse the memory of mutex m for something else;
    // a new, unrelated futex happens to reside at the same address as
    // m's internal futex

Here, the program has ensured in a different way that no thread owns the mutex when it is destructed.
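A compilable userspace version of the reference-counting pattern above, using pthreads (the struct and function names are made up for illustration):

```c
#include <pthread.h>
#include <stdlib.h>

struct shared {
    pthread_mutex_t lock;
    int refcount;
};

/* Drop one reference; returns 1 if this call destroyed and freed s. */
static int put(struct shared *s)
{
    int last = 0;

    pthread_mutex_lock(&s->lock);
    if (--s->refcount == 0)
        last = 1;
    pthread_mutex_unlock(&s->lock);
    if (last) {
        /* No thread owns the mutex here, so POSIX permits destruction,
         * even though another thread's unlock call may not yet have
         * returned to its caller. */
        pthread_mutex_destroy(&s->lock);
        free(s);
    }
    return last;
}
```

It is exactly the window between another thread's fast-path unlock and its slow-path futex_wake that makes this pattern hard to support in the implementation.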
This requirement is tough to implement for glibc -- or with futexes in general -- because what one would like to do in a mutex unlock implementation based on futexes is the following, roughly:

  lock():
    while (1) {
      // fast path: assume uncontended lock
      if (atomic_compare_exchange_acquire(&futex,
          NOT_ACQUIRED, ACQUIRED) == SUCCESS)
        return;
      // slow path: signal that there is a slow-path waiter and block
      prev = atomic_exchange(&futex, ACQUIRED_AND_WAITERS);
      if (prev == NOT_ACQUIRED)
        return;
      futex_wait(&futex, ACQUIRED_AND_WAITERS, ...);
    }

  unlock():
    // fast path unlock
    prev = atomic_exchange_release(&futex, NOT_ACQUIRED);
    // slow path unlock
    if (prev == ACQUIRED_AND_WAITERS)
      futex_wake(&futex, ...);

Having the fast path is good for performance.  It enables spinning on acquisition, and avoids a syscall when releasing the mutex.  The latter is good in the uncontended case but also when there is contention, because another spinning thread may be able to acquire the mutex more quickly than if we'd require the kernel to wake a blocked thread -- thus increasing scalability.

However, because another thread can acquire the mutex right after the fast path of the unlock(), the next steps of this thread can then be concurrent with the slow path of the unlock(), in particular the futex_wake call.  All we need to make such a pending, concurrent futex_wake call possible is that at some time in the past, we were in the ACQUIRED_AND_WAITERS state.

This means that in the second example above, futex_wake can be concurrent with whatever happens on the mutex's memory location after the mutex has been destroyed.  Examples are:

  * The memory is unmapped.  futex_wake will return an error.  OK.

  * The memory is reused, but not for a futex.  No thread will get
    woken.  OK.

  * The memory is reused for another glibc mutex.
    The slow-path futex wake will now hit another, unrelated futex -- but
    the mutex implementation is robust to such spurious wake-ups anyway,
    because it can always happen when a mutex is acquired and released
    more than once.  OK.

  * The memory is reused for another futex in some custom data structure
    that expects there is just one wait/wake cycle, and relies on
    FUTEX_WAIT returning 0 to mean that this is caused by the matching
    FUTEX_WAKE call by *this* data structure.  Not OK, because now the
    delayed slow-path wake-up introduces a spurious wake-up in an
    unrelated futex.

Thus, introducing spurious wake-ups is the core issue.

This is not restricted to pthreads mutexes (https://sourceware.org/bugzilla/show_bug.cgi?id=13690) but also is an issue for semaphores (https://sourceware.org/bugzilla/show_bug.cgi?id=12674; I have a patch that I'll send upstream soon that fixes the memory access issues under concurrent destruction -- the futex issue remains) and barriers (https://sourceware.org/bugzilla/show_bug.cgi?id=13065).  In general, this is an issue for all futex uses that rely on this same fast-path / slow-path pattern and a similar destruction scheme.

There are ways to solve this in userspace only and with the existing
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote: > On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: > > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > > +o Do not use the results from the boolean "&&" and "||" > > > > > > > > when > > > > > > > > + dereferencing. For example, the following (rather > > > > > > > > improbable) > > > > > > > > + code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int force_zero_index = 1; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > > + > > > > > > > > + The reason this is buggy is that "&&" and "||" are > > > > > > > > often compiled > > > > > > > > + using branches. While weak-memory machines such as ARM > > > > > > > > or PowerPC > > > > > > > > + do order stores after such branches, they can speculate > > > > > > > > loads, > > > > > > > > + which can result in misordering bugs. > > > > > > > > + > > > > > > > > +o Do not use the results from relational operators ("==", > > > > > > > > "!=", > > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. 
For > > > > > > > > example, > > > > > > > > + the following (quite strange) code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int flip_index = 0; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > > + > > > > > > > > + As before, the reason this is buggy is that relational > > > > > > > > operators > > > > > > > > + are often compiled using branches. And as before, > > > > > > > > although > > > > > > > > + weak-memory machines such as ARM or PowerPC do order > > > > > > > > stores > > > > > > > > + after such branches, but can speculate loads, which can > > > > > > > > again > > > > > > > > + result in misordering bugs. > > > > > > > > > > > > > > Those two would be allowed by the wording I have recently > > > > > > > proposed, > > > > > > > AFAICS. r1 != flip_index would result in two possible values > > > > > > > (unless > > > > > > > there are further constraints due to the type of r1 and the > > > > > > > values that > > > > > > > flip_index can have). > > > &g
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote: > On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: > > xagsmtp3.20140305162928.8...@uk1vsc.vnet.ibm.com > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC) > > > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > +oDo not use the results from the boolean "&&" and "||" > > > > > > > when > > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > > + code is buggy: > > > > > > > + > > > > > > > + int a[2]; > > > > > > > + int index; > > > > > > > + int force_zero_index = 1; > > > > > > > + > > > > > > > + ... > > > > > > > + > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > + > > > > > > > + The reason this is buggy is that "&&" and "||" are often > > > > > > > compiled > > > > > > > + using branches. While weak-memory machines such as ARM or > > > > > > > PowerPC > > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > > + which can result in misordering bugs. > > > > > > > + > > > > > > > +oDo not use the results from relational operators ("==", > > > > > > > "!=", > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. 
For example, > > > > > > > + the following (quite strange) code is buggy: > > > > > > > + > > > > > > > + int a[2]; > > > > > > > + int index; > > > > > > > + int flip_index = 0; > > > > > > > + > > > > > > > + ... > > > > > > > + > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > + > > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > > + are often compiled using branches. And as before, although > > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > > + after such branches, but can speculate loads, which can again > > > > > > > + result in misordering bugs. > > > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > > AFAICS. r1 != flip_index would result in two possible values > > > > > > (unless > > > > > > there are further constraints due to the type of r1 and the values > > > > > > that > > > > > > flip_index can have). > > > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > dependency to be tracked through a comparison operator. What does > > > > > the developer do in that case? (The reason I ask is that I have > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > dependency to be tracked through a comparison.) > > > > > > > > Hmm. I suppose use an explic
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote: > On 3 March 2014 20:44, Torvald Riegel wrote: > > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: > >> On 1 March 2014 08:03, Paul E. McKenney wrote: > >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> >> Hi Paul, > >> >> > >> >> On 28 February 2014 18:50, Paul E. McKenney > >> >> wrote: > >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> >> > wrote: > >> >> >> > > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> >> > > where that other pointer was properly fetched using one > >> >> >> > > of the RCU primitives. Here it doesn't matter which > >> >> >> > > pointer > >> >> >> > > you use. At least as long as the rcu_assign_pointer() > >> >> >> > > for > >> >> >> > > that other pointer happened after the last update to the > >> >> >> > > pointed-to structure. > >> >> >> > > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? > >> >> >> > > >> >> >> > I think that it might be worth pointing out as an example, and > >> >> >> > saying > >> >> >> > that code like > >> >> >> > > >> >> >> >p = atomic_read(consume); > >> >> >> >X; > >> >> >> >q = atomic_read(consume); > >> >> >> >Y; > >> >> >> >if (p == q) > >> >> >> > data = p->val; > >> >> >> > > >> >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> >> > *either* p or q, but you can't really tell which, since the > >> >> >> > compiler > >> >> >> > can decide that the values are interchangeable. > >> >> >> > > >> >> >> > I cannot for the life of me come up with a situation where this > >> >> >> > would > >> >> >> > matter, though. If "X" contains a fence, then that fence will be a > >> >> >> > stronger ordering than anything the consume through "p" would > >> >> >> > guarantee anyway. 
And if "X" does *not* contain a fence, then the > >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> >> >> > ordering to the access through "p" is through p or q is kind of > >> >> >> > irrelevant. No? > >> >> >> > >> >> >> I can make a contrived litmus test for it, but you are right, the > >> >> >> only > >> >> >> time you can see it happen is when X has no barriers, in which case > >> >> >> you don't have any ordering anyway -- both the compiler and the CPU > >> >> >> can > >> >> >> reorder the loads into p and q, and the read from p->val can, as you > >> >> >> say, > >> >> >> come from either pointer. > >> >> >> > >> >> >> For whatever it is worth, hear is the litmus test: > >> >> >> > >> >> >> T1: p = kmalloc(...); > >> >> >> if (p == NULL) > >> >> >> deal_with_it(); > >> >> >> p->a = 42; /* Each field in its own cache line. */ > >> >> >> p->b = 43; > >> >> >> p->c = 44; > >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); > >> >> >> p->b = 143; > >> >> >> p->c = 144; > >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); > >> >> >> > >> >> >> T2: p = atomic_load_explicit(&gp2, memory_or
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > + code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int force_zero_index = 1; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > + > > > > > > + The reason this is buggy is that "&&" and "||" are often > > > > > > compiled > > > > > > + using branches. While weak-memory machines such as ARM or > > > > > > PowerPC > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > + which can result in misordering bugs. > > > > > > + > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > > + the following (quite strange) code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int flip_index = 0; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! 
*/ > > > > > > + > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > + are often compiled using branches. And as before, although > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > + after such branches, but can speculate loads, which can again > > > > > > + result in misordering bugs. > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > there are further constraints due to the type of r1 and the values > > > > > that > > > > > flip_index can have). > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > guarantees than we get by default from current compilers. > > > > > > > > One question, though. Suppose that the code did not want a value > > > > dependency to be tracked through a comparison operator. What does > > > > the developer do in that case? (The reason I ask is that I have > > > > not yet found a use case in the Linux kernel that expects a value > > > > dependency to be tracked through a comparison.) > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > comparison? > > > > That should work well assuming that things like "if", "while", and "?:" > > conditions are happy to take a vdp. This assumes that p->a only returns > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > through the program. ;-) > > > > The other thing that can happen is that a vdp can get handed off to > > another synchronization mechanism, for example, to reference counting: > > > >
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > +oDo not use the results from the boolean "&&" and "||" when > > > > > + dereferencing. For example, the following (rather improbable) > > > > > + code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int force_zero_index = 1; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > + > > > > > + The reason this is buggy is that "&&" and "||" are often > > > > > compiled > > > > > + using branches. While weak-memory machines such as ARM or > > > > > PowerPC > > > > > + do order stores after such branches, they can speculate loads, > > > > > + which can result in misordering bugs. > > > > > + > > > > > +oDo not use the results from relational operators ("==", "!=", > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > + the following (quite strange) code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int flip_index = 0; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > + > > > > > + As before, the reason this is buggy is that relational operators > > > > > + are often compiled using branches. 
And as before, although > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > + after such branches, but can speculate loads, which can again > > > > > + result in misordering bugs. > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > there are further constraints due to the type of r1 and the values that > > > > flip_index can have). > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > guarantees than we get by default from current compilers. > > > > > > One question, though. Suppose that the code did not want a value > > > dependency to be tracked through a comparison operator. What does > > > the developer do in that case? (The reason I ask is that I have > > > not yet found a use case in the Linux kernel that expects a value > > > dependency to be tracked through a comparison.) > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > comparison? > > That should work well assuming that things like "if", "while", and "?:" > conditions are happy to take a vdp. I currently don't see a reason why that should be disallowed. If we have allowed an implicit conversion to non-vdp, I believe that should follow. ?: could be somewhat special, in that the type depends on the 2nd and 3rd operand. Thus, "vdp x = non-vdp ? vdp : vdp;" should be allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be disallowed if we don't provide for implicit casts from non-vdp to vdp. > This assumes that p->a only returns > vdp if field "a" is declared vdp, otherwise we have vdps running wild > through the program. ;-) That's a good question. For the scheme I had in mind, I'm not concerned about vdps running wild because one needs t
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > +oDo not use the results from the boolean "&&" and "||" when > > > + dereferencing. For example, the following (rather improbable) > > > + code is buggy: > > > + > > > + int a[2]; > > > + int index; > > > + int force_zero_index = 1; > > > + > > > + ... > > > + > > > + r1 = rcu_dereference(i1) > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > + > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > + do order stores after such branches, they can speculate loads, > > > + which can result in misordering bugs. > > > + > > > +oDo not use the results from relational operators ("==", "!=", > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > + the following (quite strange) code is buggy: > > > + > > > + int a[2]; > > > + int index; > > > + int flip_index = 0; > > > + > > > + ... > > > + > > > + r1 = rcu_dereference(i1) > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > + > > > + As before, the reason this is buggy is that relational operators > > > + are often compiled using branches. And as before, although > > > + weak-memory machines such as ARM or PowerPC do order stores > > > + after such branches, but can speculate loads, which can again > > > + result in misordering bugs. > > > > Those two would be allowed by the wording I have recently proposed, > > AFAICS. r1 != flip_index would result in two possible values (unless > > there are further constraints due to the type of r1 and the values that > > flip_index can have). 
> > And I am OK with the value_dep_preserving type providing more/better > guarantees than we get by default from current compilers. > > One question, though. Suppose that the code did not want a value > dependency to be tracked through a comparison operator. What does > the developer do in that case? (The reason I ask is that I have > not yet found a use case in the Linux kernel that expects a value > dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: > On 1 March 2014 08:03, Paul E. McKenney wrote: > > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> Hi Paul, > >> > >> On 28 February 2014 18:50, Paul E. McKenney > >> wrote: > >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> > wrote: > >> >> > > > >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> > > where that other pointer was properly fetched using one > >> >> > > of the RCU primitives. Here it doesn't matter which pointer > >> >> > > you use. At least as long as the rcu_assign_pointer() for > >> >> > > that other pointer happened after the last update to the > >> >> > > pointed-to structure. > >> >> > > > >> >> > > I am a bit nervous about #3. Any thoughts on it? > >> >> > > >> >> > I think that it might be worth pointing out as an example, and saying > >> >> > that code like > >> >> > > >> >> >p = atomic_read(consume); > >> >> >X; > >> >> >q = atomic_read(consume); > >> >> >Y; > >> >> >if (p == q) > >> >> > data = p->val; > >> >> > > >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> > *either* p or q, but you can't really tell which, since the compiler > >> >> > can decide that the values are interchangeable. > >> >> > > >> >> > I cannot for the life of me come up with a situation where this would > >> >> > matter, though. If "X" contains a fence, then that fence will be a > >> >> > stronger ordering than anything the consume through "p" would > >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the > >> >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> >> > ordering to the access through "p" is through p or q is kind of > >> >> > irrelevant. No? 
> >> >> > >> >> I can make a contrived litmus test for it, but you are right, the only > >> >> time you can see it happen is when X has no barriers, in which case > >> >> you don't have any ordering anyway -- both the compiler and the CPU can > >> >> reorder the loads into p and q, and the read from p->val can, as you > >> >> say, > >> >> come from either pointer. > >> >> > >> >> For whatever it is worth, hear is the litmus test: > >> >> > >> >> T1: p = kmalloc(...); > >> >> if (p == NULL) > >> >> deal_with_it(); > >> >> p->a = 42; /* Each field in its own cache line. */ > >> >> p->b = 43; > >> >> p->c = 44; > >> >> atomic_store_explicit(&gp1, p, memory_order_release); > >> >> p->b = 143; > >> >> p->c = 144; > >> >> atomic_store_explicit(&gp2, p, memory_order_release); > >> >> > >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); > >> >> r1 = p->b; /* Guaranteed to get 143. */ > >> >> q = atomic_load_explicit(&gp1, memory_order_consume); > >> >> if (p == q) { > >> >> /* The compiler decides that q->c is same as p->c. */ > >> >> r2 = p->c; /* Could get 44 on weakly order system. */ > >> >> } > >> >> > >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what > >> >> you get. > >> >> > >> >> And publishing a structure via one RCU-protected pointer, updating it, > >> >> then publishing it via another pointer seems to me to be asking for > >> >> trouble anyway. If you really want to do something like that and still > >> >> see consistency across all the fields in the structure, please put a > >> >> lock > >> >> in the structure and use it to guard updates and accesses to those > >> >> fields. > >> > > >> > And here is a patch documenting the restrictions for the current Linux > >> > kernel. The rules change a bit due to rcu_dereference() acting a bit > >> > differently than atomic_load_explicit(&p, memory_order_consume). > >> > > >> > Thoughts? 
> >> > >> That might serve as informal documentation for linux kernel > >> programmers about the bounds on the optimisations that you expect > >> compilers to do for common-case RCU code - and I guess that's what you > >> intend it to be for. But I don't see how one can make it precise > >> enough to serve as a language definition, so that compiler people > >> could confidently say "yes, we respect that", which I guess is what > >> you really need. As a useful criterion, we should aim for something > >> precise enough that in a verified-compiler context you can > >> mathematically prove that the compiler will satisfy it (even though > >> that won't happen anytime soon for GCC), and that analysis tool > >> authors can actually know what they're working with. All this stuff > >> about "you should avoid cancellation", and "avoid masking with just a > >> small number of bits" is just too vague. > > > > Understood, and yes, this is intended to docum
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > +oDo not use the results from the boolean "&&" and "||" when > + dereferencing. For example, the following (rather improbable) > + code is buggy: > + > + int a[2]; > + int index; > + int force_zero_index = 1; > + > + ... > + > + r1 = rcu_dereference(i1) > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > + > + The reason this is buggy is that "&&" and "||" are often compiled > + using branches. While weak-memory machines such as ARM or PowerPC > + do order stores after such branches, they can speculate loads, > + which can result in misordering bugs. > + > +oDo not use the results from relational operators ("==", "!=", > + ">", ">=", "<", or "<=") when dereferencing. For example, > + the following (quite strange) code is buggy: > + > + int a[2]; > + int index; > + int flip_index = 0; > + > + ... > + > + r1 = rcu_dereference(i1) > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > + > + As before, the reason this is buggy is that relational operators > + are often compiled using branches. And as before, although > + weak-memory machines such as ARM or PowerPC do order stores > + after such branches, but can speculate loads, which can again > + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). I don't think the wording is flawed. We could raise the requirement of having more than one value left for r1 to having more than N with N > 1 values left, but the fundamental problem remains in that a compiler could try to generate a (big) switch statement. Instead, I think that this indicates that the value_dep_preserving type modifier would be useful: It would tell the compiler that it shouldn't transform this into a branch in this case, yet allow that optimization for all other code. 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-27 at 09:50 -0800, Paul E. McKenney wrote: > Your proposal looks quite promising at first glance. But rather than > try and comment on it immediately, I am going to take a number of uses of > RCU from the Linux kernel and apply your proposal to them, then respond > with the results > > Fair enough? Sure. Thanks for doing the cross-check!
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-27 at 11:47 -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > wrote: > > > > 3. The comparison was against another RCU-protected pointer, > > where that other pointer was properly fetched using one > > of the RCU primitives. Here it doesn't matter which pointer > > you use. At least as long as the rcu_assign_pointer() for > > that other pointer happened after the last update to the > > pointed-to structure. > > > > I am a bit nervous about #3. Any thoughts on it? > > I think that it might be worth pointing out as an example, and saying > that code like > > p = atomic_read(consume); > X; > q = atomic_read(consume); > Y; > if (p == q) > data = p->val; > > then the access of "p->val" is constrained to be data-dependent on > *either* p or q, but you can't really tell which, since the compiler > can decide that the values are interchangeable. The wording I proposed would make the p dereference have a value dependency unless X and Y would somehow restrict p and q. The reasoning is that if the atomic loads return potentially more than one value, then even if we find out that two such loads did return the same value, we still don't know what the exact value was.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-27 at 09:01 -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel wrote: > > Regarding the latter, we make a fresh start at each mo_consume load (ie, > > we assume we know nothing -- L could have returned any possible value); > > I believe this is easier to reason about than other scopes like function > > granularities (what happens on inlining?), or translation units. It > > should also be simple to implement for compilers, and would hopefully > > not constrain optimization too much. > > > > [...] > > > > Paul's litmus test would work, because we guarantee to the programmer > > that it can assume that the mo_consume load would return any value > > allowed by the type; effectively, this forbids the compiler analysis > > Paul thought about: > > So realistically, since with the new wording we can ignore the silly > cases (ie "p-p") and we can ignore the trivial-to-optimize compiler > cases ("if (p == &variable) .. use p"), and you would forbid the > "global value range optimization case" that Paul bright up, what > remains would seem to be just really subtle compiler transformations > of data dependencies to control dependencies. > > And the only such thing I can think of is basically compiler-initiated > value-prediction, presumably directed by PGO (since now if the value > prediction is in the source code, it's considered to break the value > chain). The other example that comes to mind would be feedback-directed JIT compilation. I don't think that's widely used today, and it might never be for the kernel -- but *in the standard*, we at least have to consider what the future might bring. > The good thing is that afaik, value-prediction is largely not used in > real life, afaik. There are lots of papers on it, but I don't think > anybody actually does it (although I can easily see some > specint-specific optimization pattern that is build up around it). 
> > And even value prediction is actually fine, as long as the compiler > can see the memory *source* of the value prediction (and it isn't a > mo_consume). So it really ends up limiting your value prediction in > very simple ways: you cannot do it to function arguments if they are > registers. But you can still do value prediction on values you loaded > from memory, if you can actually *see* that memory op. I think one would need to show that the source is *not even indirectly* a mo_consume load. With the wording I proposed, value dependencies don't break when storing to / loading from memory locations. Thus, if a compiler ends up at a memory load after waling SSA, it needs to prove that the load cannot read a value that (1) was produced by a store sequenced-before the load and (2) might carry a value dependency (e.g., by being a mo_consume load) that the value prediction in question would break. This, in general, requires alias analysis. Deciding whether a prediction would break a value dependency has to consider what later stages in a compiler would be doing, including LTO or further rounds of inlining/optimizations. OTOH, if the compiler can treat an mo_consume load as returning all possible values (eg, by ignoring all knowledge about it), then it can certainly do so with other memory loads too. So, I think that the constraints due to value dependencies can matter in practice. However, the impact on optimizations on non-mo_consume-related code are hard to estimate -- I don't see a huge amount of impact right now, but I also wouldn't want to predict that this can't change in the future. > Of course, on more strongly ordered CPU's, even that "register > argument" limitation goes away. > > So I agree that there is basically no real optimization constraint. > Value-prediction is of dubious value to begin with, and the actual > constraint on its use if some compiler writer really wants to is not > onerous. 
> > > What I have in mind is roughly the following (totally made-up syntax -- > > suggestions for how to do this properly are very welcome): > > * Have a type modifier (eg, like restrict), that specifies that > > operations on data of this type are preserving value dependencies: > > So I'm not violently opposed, but I think the upsides are not great. > Note that my earlier suggestion to use "restrict" wasn't because I > believed the annotation itself would be visible, but basically just as > a legalistic promise to the compiler that *if* it found an alias, then > it didn't need to worry about ordering. So to me, that type modifier > was about conceptual guarantees, not about actual value chains. > > Anyway, the reason I don't believe any type m
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney > wrote: > > > > Good points. How about the following replacements? > > > > 3. Adding or subtracting an integer to/from a chained pointer > > results in another chained pointer in that same pointer chain. > > The results of addition and subtraction operations that cancel > > the chained pointer's value (for example, "p-(long)p" where "p" > > is a pointer to char) are implementation defined. > > > > 4. Bitwise operators ("&", "|", "^", and I suppose also "~") > > applied to a chained pointer and an integer for the purposes > > of alignment and pointer translation results in another > > chained pointer in that same pointer chain. Other uses > > of bitwise operators on chained pointers (for example, > > "p|~0") are implementation defined. > > Quite frankly, I think all of this language that is about the actual > operations is irrelevant and wrong. > > It's not going to help compiler writers, and it sure isn't going to > help users that read this. > > Why not just talk about "value chains" and that any operations that > restrict the value range severely end up breaking the chain. There is > no point in listing the operations individually, because every single > operation *can* restrict things. Listing individual operations and > depdendencies is just fundamentally wrong. [...] > The *only* thing that matters for all of them is whether they are > "value-preserving", or whether they drop so much information that the > compiler might decide to use a control dependency instead. That's true > for every single one of them. > > Similarly, actual true control dependencies that limit the problem > space sufficiently that the actual pointer value no longer has > significant information in it (see the above example) are also things > that remove information to the point that only a control dependency > remains. Even when the value itself is not modified in any way at all. 
I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a "semantic" dependency due to a value being "necessary" to compute a result seems better. However, whether a value is "necessary" might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is "necessary". I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. What we'd like to capture is that a value originating from a mo_consume load is "necessary" for a computation (e.g., it "cannot" be replaced with value predictions and/or control dependencies); if that's the case in the program, we can reasonably assume that a compiler implementation will transform this into a data dependency, which will then lead to ordering guarantees by the HW. However, we need to specify when a value is "necessary". We could say that this is implementation-defined, and use a set of litmus tests (e.g., like those discussed in the thread) to roughly carve out what a programmer could expect. This may even be practical for a project like the Linux kernel that follows strict project-internal rules and pays a lot of attention to what the particular implementations of compilers expected to compile the kernel are doing. However, I think this approach would be too vague for the standard and for many other programs/projects. One way to understand "necessary" would be to say that if a mo_consume load can result in more than V different values, then the actual value is "unknown", and thus "necessary" to compute anything based on it. (But this is flawed, as discussed below.) However, how big should V be? 
If it's larger than 1, atomic bool cannot be used with mo_consume, which seems weird. If V is 1, then Linus' litmus tests work (but Paul's doesn't; see below), but the compiler must not try to predict more than one value. This holds for any choice of V, so there always is an *additional* constraint on code generation for operations that are meant to take part in such "value dependencies". The bigger V might be, the less likely it should be for this to actually constrain a particular compiler's optimizations (e.g., while it might be meaningful to use value prediction for two or three values, it's probably not for 1000s). Nonetheless, if we don't want to predict the future, we need to specify V. Given that we always have some constraint for code generation anyway, and given that V > 1 might be an arbitrary-looking constraint and disallows use on atomic bool, I believe V should be 1. Furthermore, there is a problem in saying "a load can
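To make the value-prediction concern concrete, here is a minimal C++ sketch (hypothetical code, not from the thread): the first reader uses the consume-loaded pointer as an address, which is the pattern the hardware orders via data dependencies; the second shows the shape a value-predicting compiler could produce, where the early dereference of a guessed pointer no longer depends on the load, so only a control dependency remains. Both functions compute the same result in a single thread; the difference is only in the cross-thread ordering they would preserve.

```cpp
#include <atomic>

// As written: the address of the second load is computed from the
// consume-loaded value, so ARM/Power order the two loads in hardware.
int reader_as_written(std::atomic<int*>& p) {
    int* q = p.load(std::memory_order_consume);
    return q ? *q : -1;
}

// As a value-predicting compiler might transform it: dereference a
// predicted pointer first, then verify the guess.  The speculative
// load's address no longer depends on the consume load; the data
// dependency has become a control dependency, which the hardware
// does not order.
int reader_value_predicted(std::atomic<int*>& p, int* predicted) {
    int speculative = predicted ? *predicted : -1;  // early, unordered load
    int* q = p.load(std::memory_order_consume);
    if (q == predicted)      // prediction verified: use the early value
        return speculative;
    return q ? *q : -1;      // misprediction: fall back to the real chain
}
```

Single-threaded, both readers agree on the result; the transformation is only visible as a lost ordering guarantee under concurrency.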
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-26 at 18:43 +, Joseph S. Myers wrote: > On Wed, 26 Feb 2014, Torvald Riegel wrote: > > > On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote: > > > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > > > > > This needs to be as follows: > > > > > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > > > { > > > > return i - i; > > > > } > > > > > > > > Otherwise dependencies won't get carried through it. > > > > > > C11 doesn't have attributes at all (and no specification regarding calls > > > and dependencies that I can see). And the way I read the C++11 > > > specification of carries_dependency is that specifying carries_dependency > > > is purely about increasing optimization of the caller: that if it isn't > > > specified, then the caller doesn't know what dependencies might be > > > carried. "Note: The carries_dependency attribute does not change the > > > meaning of the program, but may result in generation of more efficient > > > code. - end note". > > > > I think that this last sentence can be kind of misleading, especially > > when looking at it from an implementation point of view. How > > dependencies are handled (ie, preserving the syntactic dependencies vs. > > emitting barriers) must be part of the ABI, or things like > > [[carries_dependency]] won't work as expected (or lead to inefficient > > code). Thus, in practice, all compiler vendors on a platform would have > > to agree to a particular handling, which might end up in selecting the > > easy-but-conservative implementation option (ie, always emitting > > mo_acquire when the source uses mo_consume). > > Regardless of the ABI, my point is that if a program is valid, it is also > valid when all uses of [[carries_dependency]] are removed. If a function > doesn't use [[carries_dependency]], that means "dependencies may or may > not be carried through this function". 
If a function uses > [[carries_dependency]], that means that certain dependencies *are* carried > through the function (and the ABI should then specify what this means the > caller can rely on, in terms of the architecture's memory model). (This > may or may not be useful, but it's how I understand C++11.) I agree. What I tried to point out is that it's not the case that an *implementation* can just ignore [[carries_dependency]]. So from an implementation perspective, the attribute does have semantics. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
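The point about [[carries_dependency]] being meaningful to the implementation can be illustrated with a small C++11 sketch (hypothetical function names): removing the attribute leaves the program valid, but a dependency-tracking implementation may then have to be conservative at the call boundary.

```cpp
#include <atomic>

struct node { int payload; };

// No attribute: still a valid program, but an implementation that
// tracks dependencies cannot see whether the dependency survives the
// callee, so it may conservatively emit an acquire fence at the call.
int payload_opaque(node* n) { return n->payload; }

// With [[carries_dependency]], caller and callee agree (as part of the
// ABI) that the dependency starting at a memory_order_consume load is
// carried through the parameter, so no extra fence should be needed.
int payload_dep(node* n [[carries_dependency]]) { return n->payload; }

int consume_and_call(std::atomic<node*>& head) {
    node* n = head.load(std::memory_order_consume);
    return n ? payload_dep(n) : -1;
}
```

(Current compilers that implement mo_consume as mo_acquire may simply ignore the attribute, which is exactly the "easy-but-conservative" option discussed above.)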
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-24 at 09:28 -0800, Paul E. McKenney wrote: > On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote: > > Hi, > > > > On Mon, 24 Feb 2014, Linus Torvalds wrote: > > > > > > To me that reads like > > > > > > > > int i; > > > > int *q = &i; > > > > int **p = &q; > > > > > > > > atomic_XXX (p, CONSUME); > > > > > > > > orders against accesses '*p', '**p', '*q' and 'i'. Thus it seems they > > > > want to say that it orders against aliased storage - but then go further > > > > and include "indirectly through a chain of pointers"?! Thus an > > > > atomic read of a int * orders against any 'int' memory operation but > > > > not against 'float' memory operations? > > > > > > No, it's not about type at all, and the "chain of pointers" can be > > > much more complex than that, since the "int *" can point to within an > > > object that contains other things than just that "int" (the "int" can > > > be part of a structure that then has pointers to other structures > > > etc). > > > > So, let me try to poke holes into your definition or increase my > > understanding :) . You said "chain of pointers"(dereferences I assume), > > e.g. if p is result of consume load, then access to > > p->here->there->next->prev->stuff is supposed to be ordered with that load > > (or only when that last load/store itself is also an atomic load or > > store?). > > > > So, what happens if the pointer deref chain is partly hidden in some > > functions: > > > > A * adjustptr (B *ptr) { return &ptr->here->there->next; } > > B * p = atomic_XXX (&somewhere, consume); > > adjustptr(p)->prev->stuff = bla; > > > > As far as I understood you, this whole ptrderef chain business would be > > only an optimization opportunity, right? So if the compiler can't be sure > > how p is actually used (as in my function-using case, assume adjustptr is > > defined in another unit), then the consume load would simply be > > transformed into an acquire (or whatever, with some barrier I mean)? 
Only > > _if_ the compiler sees all obvious uses of p (indirectly through pointer > > derefs) can it, yeah, do what with the consume load? > > Good point, I left that out of my list. Adding it: > > 13. By default, pointer chains do not propagate into or out of functions. > In implementations having attributes, a [[carries_dependency]] > may be used to mark a function argument or return as passing > a pointer chain into or out of that function. > > If a function does not contain memory_order_consume loads and > also does not contain [[carries_dependency]] attributes, then > that function may be compiled using any desired dependency-breaking > optimizations. > > The ordering effects are implementation defined when a given > pointer chain passes into or out of a function through a parameter > or return not marked with a [[carries_dependency]] attribute. > > Note that this last paragraph differs from the current standard, which > would require ordering regardless. I would prefer if we could get rid of [[carries_dependency]] as well; currently, it's a hint whose effectiveness really depends on how the particular implementation handles this attribute. If we still need something like it in the future, it would be good if it had a clearer use and performance effects.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-24 at 09:38 -0800, Linus Torvalds wrote: > On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz wrote: > > > > So, let me try to poke holes into your definition or increase my > > understanding :) . You said "chain of pointers"(dereferences I assume), > > e.g. if p is result of consume load, then access to > > p->here->there->next->prev->stuff is supposed to be ordered with that load > > (or only when that last load/store itself is also an atomic load or > > store?). > > It's supposed to be ordered wrt the first load (the consuming one), yes. > > > So, what happens if the pointer deref chain is partly hidden in some > > functions: > > No problem. > > The thing is, the ordering is actually handled by the CPU in all > relevant cases. So the compiler doesn't actually need to *do* > anything. All this legalistic stuff is just to describe the semantics > and the guarantees. > > The problem is two cases: > > (a) alpha (which doesn't really order any accesses at all, not even > dependent loads), but for a compiler alpha is actually trivial: just > add a "rmb" instruction after the load, and you can't really do > anything else (there's a few optimizations you can do wrt the rmb, but > they are very specific and simple). > > So (a) is a "problem", but the solution is actually really simple, and > gives very *strong* guarantees: on alpha, a "consume" ends up being > basically the same as a read barrier after the load, with only very > minimal room for optimization. > > (b) ARM and powerpc and similar architectures, that guarantee the > data dependency as long as it is an *actual* data dependency, and > never becomes a control dependency. > > On ARM and powerpc, control dependencies do *not* order accesses (the > reasons boil down to essentially: branch prediction breaks the > dependency, and instructions that come after the branch can be happily > executed before the branch). 
But it's almost impossible to describe > that in the standard, since compilers can (and very much do) turn a > control dependency into a data dependency and vice versa. > > So the current standard tries to describe that "control vs data" > dependency, and tries to limit it to a data dependency. It fails. It > fails for multiple reasons - it doesn't allow for trivial > optimizations that just remove the data dependency, and it also > doesn't allow for various trivial cases where the compiler *does* turn > the data dependency into a control dependency. > > So I really really think that the current C standard language is > broken. Unfixably broken. > > I'm trying to change the "syntactic data dependency" that the current > standard uses into something that is clearer and correct. > > The "chain of pointers" thing is still obviously a data dependency, > but by limiting it to pointers, it simplifies the language, clarifies > the meaning, avoids all syntactic tricks (ie "p-p" is clearly a > syntactic dependency on "p", but does *not* involve in any way > following the pointer) and makes it basically impossible for the > compiler to break the dependency without doing value prediction, and > since value prediction has to be disallowed anyway, that's a feature, > not a bug. AFAIU, Michael is wondering about how we can separate non-synchronizing code (ie, in this case, not taking part in any "chain of pointers" used with mo_consume loads) from code that does. If we cannot, then we prevent value prediction *everywhere*, unless the compiler can prove that the code is never part of such a chain (which is hard due to alias analysis being hard, etc.). (We can probably argue to which extent value prediction is necessary for generation of efficient code, but it obviously does work in non-synchronizing code (or even with acquire barriers with some care) -- so forbidding it entirely might be bad.) 
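Linus' "p-p" point can be made concrete with a small sketch (hypothetical code): both functions syntactically mention the consume-loaded value in the second access, but only one of them actually needs the value to form the address.

```cpp
#include <atomic>

int table[2] = {10, 20};

// "i - i" syntactically depends on the loaded value but is always
// zero: no information from the load reaches the second access, so any
// compiler will fold the index to a constant and no hardware data
// dependency survives.  Under a "value chain" rule, this chain is dead.
int dead_chain(std::atomic<int>& a) {
    int i = a.load(std::memory_order_consume);
    return table[i - i];           // really table[0]
}

// Here the loaded value selects the slot: the address of the second
// load is computed from the value, which is the case the
// "chain of pointers" / "value chain" wording is meant to guarantee.
int live_chain(std::atomic<int>& a) {
    int i = a.load(std::memory_order_consume);
    return table[i & 1];
}
```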
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote: > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > This needs to be as follows: > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > { > > return i - i; > > } > > > > Otherwise dependencies won't get carried through it. > > C11 doesn't have attributes at all (and no specification regarding calls > and dependencies that I can see). And the way I read the C++11 > specification of carries_dependency is that specifying carries_dependency > is purely about increasing optimization of the caller: that if it isn't > specified, then the caller doesn't know what dependencies might be > carried. "Note: The carries_dependency attribute does not change the > meaning of the program, but may result in generation of more efficient > code. - end note". I think that this last sentence can be kind of misleading, especially when looking at it from an implementation point of view. How dependencies are handled (ie, preserving the syntactic dependencies vs. emitting barriers) must be part of the ABI, or things like [[carries_dependency]] won't work as expected (or lead to inefficient code). Thus, in practice, all compiler vendors on a platform would have to agree to a particular handling, which might end up in selecting the easy-but-conservative implementation option (ie, always emitting mo_acquire when the source uses mo_consume).
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote: > On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: > > Hi, > > > > On Thu, 20 Feb 2014, Linus Torvalds wrote: > > > > > But I'm pretty sure that any compiler guy must *hate* that current odd > > > dependency-generation part, and if I was a gcc person, seeing that > > > bugzilla entry Torvald pointed at, I would personally want to > > > dismember somebody with a rusty spoon.. > > > > Yes. Defining dependency chains in the way the standard currently seems > > to do must come from people not writing compilers. There's simply no > > sensible way to implement it without being really conservative, because > > the depchains can contain arbitrary constructs including stores, > > loads and function calls but must still be observed. > > > > And with conservative I mean "everything is a source of a dependency, and > > hence can't be removed, reordered or otherwise fiddled with", and that > > includes code sequences where no atomic objects are anywhere in sight [1]. > > In the light of that the only realistic way (meaning to not have to > > disable optimization everywhere) to implement consume as currently > > specified is to map it to acquire. At which point it becomes pointless. > > No, only memory_order_consume loads and [[carries_dependency]] > function arguments are sources of dependency chains. However, that is, given how the standard specifies things, just one of the possible ways for how an implementation can handle this. Treating [[carries_dependency]] as a "necessary" annotation to make exploiting mo_consume work in practice is possible, but it's not required by the standard. Also, dependencies are specified to flow through loads and stores (restricted to scalar objects and bitfields), so any load that might load from a dependency-carrying store can also be a source (and that doesn't seem to be restricted by [[carries_dependency]]). 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-20 at 11:09 -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 10:53 AM, Torvald Riegel wrote: > > On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote: > >> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney > >> wrote: > >> > > >> > You really need that "consume" to be "acquire". > >> > >> So I think we now all agree that that is what the standard is saying. > > > > Huh? > > > > The standard says that there are two separate things (among many more): > > mo_acquire and mo_consume. They both influence happens-before in > > different (and independent!) ways. > > > > What Paul is saying is that *you* should have used *acquire* in that > > example. > > I understand. > > And I disagree. I think the standard is wrong, and what I *should* be > doing is point out the fact very loudly, and just tell people to NEVER > EVER use "consume" as long as it's not reliable and has insane > semantics. Stating that (1) "the standard is wrong" and (2) that you think that mo_consume semantics are not good are two different things. Making bold statements without a proper context isn't helpful in making this discussion constructive. It's simply not efficient if I (or anybody else reading this) have to wonder whether you actually mean what you said (even if, when reading it literally, it is arguably not consistent with the arguments brought up in the discussion) or whether those statements just have to be interpreted in some other way. > So what I "should do" is to not accept any C11 atomics use in the > kernel. You're obviously free to do that. > Because with the "acquire", it generates worse code than what > we already have, I would argue that this is still under debate. At least I haven't seen a definition of what you want that is complete and based on the standard (e.g., an example of what a compiler might do in a specific case isn't a definition). From what I've seen, it's not inconceivable that what you want is just an optimized acquire. 
I'll bring this question up again elsewhere in the thread (where it hopefully fits better).
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-20 at 10:18 -0800, Paul E. McKenney wrote: > On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote: > > On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote: > > > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: > > > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel > > > > wrote: > > > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > > > > >> > > > > >> Can you point to it? Because I can find a draft standard, and it sure > > > > >> as hell does *not* contain any clarity of the model. It has a *lot* > > > > >> of > > > > >> verbiage, but it's pretty much impossible to actually understand, > > > > >> even > > > > >> for somebody who really understands memory ordering. > > > > > > > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > > This has an explanation of the model up front, and then the detailed > > > > > formulae in Section 6. This is from 2010, and there might have been > > > > > smaller changes since then, but I'm not aware of any bigger ones. > > > > > > > > Ahh, this is different from what others pointed at. Same people, > > > > similar name, but not the same paper. > > > > > > > > I will read this version too, but from reading the other one and the > > > > standard in parallel and trying to make sense of it, it seems that I > > > > may have originally misunderstood part of the whole control dependency > > > > chain. > > > > > > > > The fact that the left side of "? :", "&&" and "||" breaks data > > > > dependencies made me originally think that the standard tried very > > > > hard to break any control dependencies. Which I felt was insane, when > > > > then some of the examples literally were about the testing of the > > > > value of an atomic read. The data dependency matters quite a bit. 
The > > > > fact that the other "Mathematical" paper then very much talked about > > > > consume only in the sense of following a pointer made me think so even > > > > more. > > > > > > > > But reading it some more, I now think that the whole "data dependency" > > > > logic (which is where the special left-hand side rule of the ternary > > > > and logical operators come in) are basically an exception to the rule > > > > that sequence points end up being also meaningful for ordering (ok, so > > > > C11 seems to have renamed "sequence points" to "sequenced before"). > > > > > > > > So while an expression like > > > > > > > > atomic_read(p, consume) ? a : b; > > > > > > > > doesn't have a data dependency from the atomic read that forces > > > > serialization, writing > > > > > > > >if (atomic_read(p, consume)) > > > > a; > > > >else > > > > b; > > > > > > > > the standard *does* imply that the atomic read is "happens-before" wrt > > > > "a", and I'm hoping that there is no question that the control > > > > dependency still acts as an ordering point. > > > > > > The control dependency should order subsequent stores, at least assuming > > > that "a" and "b" don't start off with identical stores that the compiler > > > could pull out of the "if" and merge. The same might also be true for ?: > > > for all I know. (But see below) > > > > I don't think this is quite true. I agree that a conditional store will > > not be executed speculatively (note that if it would happen in both the > > then and the else branch, it's not conditional); so, the store in > > "a;" (assuming it would be a store) won't happen unless the thread can > > really observe a true value for p. However, this is *this thread's* > > view of the world, but not guaranteed to constrain how any other thread > > sees the state. mo_consume does not contribute to > > inter-thread-happens-before in the same way that mo_acquire does (which > > *does* put a constraint on i-t-h-b, and thus enforces a global &g
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney > wrote: > > > > You really need that "consume" to be "acquire". > > So I think we now all agree that that is what the standard is saying. Huh? The standard says that there are two separate things (among many more): mo_acquire and mo_consume. They both influence happens-before in different (and independent!) ways. What Paul is saying is that *you* should have used *acquire* in that example. > And I'm saying that that is wrong, that the standard is badly written, > and should be fixed. > > Because before the standard is fixed, I claim that "consume" is > unusable. We cannot trust it. End of story. Then we still have all the rest. Let's just ignore mo_consume for now, and look at mo_acquire, I suggest. > The fact that apparently gcc is currently buggy because it got the > dependency calculations *wrong* just reinforces my point. Well, I'm pretty sure nobody actually worked on trying to preserve the dependencies at all. IOW, I suspect this fell through the cracks. We can ask the person working on this if you really want to know. > The gcc bug Torvald pointed at is exactly because the current C > standard is illogical unreadable CRAP. It's obviously logically consistent to the extent that it can be represented by a formal specification such as the one by the Cambridge group. Makes sense, or not? > I can guarantee that what > happened is: > > - the compiler saw that the result of the read was used as the left > hand expression of the ternary "? :" operator > > - as a result, the compiler decided that there's no dependency > > - the compiler didn't think about the dependency that comes from the > result of the load *also* being used as the middle part of the ternary > expression, because it had optimized it away, despite the standard not > talking about that at all. 
> > - so the compiler never saw the dependency that the standard talks about > > BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT. Please, be specific. Right now you're saying that all of it is useless. Which is arguably not true. > My suggested language never had any of these problems, because *my* > suggested semantics are clear, logical, and don't have these kinds of > idiotic pit-falls. Have you looked at and understood the semantics of the memory model (e.g. in the formalized form) with mo_consume and related being ignored (ie, just ignore 6.13 and 6.14 in n3132)?
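The ternary-operator case under discussion fits a small litmus sketch (hypothetical code): per the C++11 carries-a-dependency rules, the value used as the first operand of "? :" does not carry a dependency into the result, so the ternary form gets no ordering from mo_consume; a compiler following the letter of the standard may treat it like a relaxed load. With mo_acquire, the ordering holds regardless of how the compiler shapes the selection.

```cpp
#include <atomic>

int a = 1, b = 2;

// The consume-loaded value is only the selector: by the standard's
// carries-a-dependency rules, no dependency flows from the first
// operand of ?: into the result, so nothing orders the reads of a
// and b relative to the load.
int ternary_form(std::atomic<int>& flag) {
    return flag.load(std::memory_order_consume) ? a : b;
}

// The same logic with mo_acquire: the load synchronizes with a
// release store, so the subsequent reads are ordered no matter how
// the compiler lowers the branch.
int acquire_form(std::atomic<int>& flag) {
    if (flag.load(std::memory_order_acquire))
        return a;
    return b;
}
```

Both return the same values single-threaded; the difference is only which cross-thread ordering the standard actually promises.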
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-20 at 10:11 -0800, Paul E. McKenney wrote: > But yes, the compiler guys would be extremely happy to simply drop > memory_order_consume from the standard, as it is the memory order > that they most love to hate. > > Getting them to agree to any sort of peep-hole optimization semantics > for memory_order_consume is likely problematic. I wouldn't be so pessimistic about that. If the transformations can be shown to be always correct in terms of the semantics specified in the standard, and if the performance win is sufficiently large, why not? Of course, somebody has to volunteer to actually implement it :)
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-20 at 09:01 -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney > wrote: > >> > >> So lets make this really simple: if you have a consume->cmp->read, is > >> the ordering of the two reads guaranteed? > > > > Not as far as I know. Also, as far as I know, there is no difference > > between consume and relaxed in the consume->cmp->read case. > > Ok, quite frankly, I think that means that "consume" is misdesigned. > > > The above example can have a return value of 0 if translated > > straightforwardly into either ARM or Power, right? > > Correct. And I think that is too subtle. It's dangerous, it makes code > that *looks* correct work incorrectly, and it actually happens to work > on x86 since x86 doesn't have crap-for-brains memory ordering > semantics. > > > So, if you make one of two changes to your example, then I will agree > > with you. > > No. We're not playing games here. I'm fed up with complex examples > that make no sense. Note that Paul's second suggestion for a change was to just use mo_acquire; that's a simple change, and the easiest option, so it should be just fine. > Nobody sane writes code that does that pointer comparison, and it is > entirely immaterial what the compiler can do behind our backs. The C > standard semantics need to make sense to the *user* (ie programmer), > not to a CPU and not to a compiler. The CPU and compiler are "tools". > They don't matter. Their only job is to make the code *work*, dammit. > > So no idiotic made-up examples that involve code that nobody will ever > write and that have subtle issues. > > So the starting point is that (same example as before, but with even > clearer naming): > > Initialization state: > initialized = 0; > value = 0; > > Consumer: > > return atomic_read(&initialized, consume) ? 
value : -1; > > Writer: > value = 42; > atomic_write(&initialized, 1, release); > > and because the C memory ordering standard is written in such a way > that this is subtly buggy (and can return 0, which is *not* logically > a valid value), then I think the C memory ordering standard is broken. > > That "consumer" memory ordering is dangerous as hell, and it is > dangerous FOR NO GOOD REASON. > > The trivial "fix" to the standard would be to get rid of all the > "carries a dependency" crap, and just say that *anything* that depends > on it is ordered wrt it. > > That just means that on alpha, "consume" implies an unconditional read > barrier (well, unless the value is never used and is loaded just > because it is also volatile), on x86, "consume" is the same as > "acquire" which is just a plain load with ordering guarantees, and on > ARM or power you can still avoid the extra synchronization *if* the > value is used just for computation and for following pointers, but if > the value is used for a comparison, there needs to be a > synchronization barrier. > > Notice? Getting rid of the stupid "carries-dependency" crap from the > standard actually > (a) simplifies the standard Agreed, although it's easy to ignore the parts related to mo_consume, I think. > (b) means that the above obvious example *works* > (c) does not in *any* way make for any less efficient code generation > for the cases that "consume" works correctly for in the current > mis-designed standard. > (d) is actually a hell of a lot easier to explain to a compiler > writer, and I can guarantee that it is simpler to implement too. mo_acquire is certainly easier to implement in a compiler. 
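Linus' initialized/value litmus test, rewritten with release/acquire, can never return 0; this is a sketch of that variant (sequential here so the outcome is deterministic; a real litmus run would put writer and consumer in separate threads).

```cpp
#include <atomic>

std::atomic<int> initialized{0};
int value = 0;

// Writer: publish the data, then set the flag with release so that
// everything sequenced before the store becomes visible to any thread
// that observes the flag with an acquire load.
void writer() {
    value = 42;
    initialized.store(1, std::memory_order_release);
}

// Consumer: the acquire load orders the subsequent plain read of
// value.  If the flag reads 1, the result is 42 -- never the stale 0
// that the consume-based version of this litmus test permits, because
// the flag is used in a comparison, not followed as a pointer.
int consumer() {
    return initialized.load(std::memory_order_acquire) ? value : -1;
}
```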
> Why do I claim (d) "it is simpler to implement" - because on ARM/power > you can implement it *exactly* as a special "acquire", with just a > trivial peep-hole special case that follows the use chain of the > acquire op to the consume, and then just drop the acquire bit if the > only use is that compute-to-load chain. That's similar to the way I saw it and described in my reply to your other email (before getting to this email here). It seems that this indeed might be doable transparently in the compiler, without requiring a special mo_acquire variant visible to programmers. > In fact, realistically, the *only* thing you need to actually care > about for the intended use case of "consume" is the question "is the > consuming load immediately consumed as an address (with offset) of a > memory operation. So you don't even need to follow any complicated > computation chain in a compiler - the only case that matters for the > barrier removal optimization is the "oh, I can see that it is only > used as an address to a dereference". To make this mo_acquire optimization apply often, a compiler might have to try to filter out accesses that don't synchronize (e.g., so that an access to a non-shared temporary variable doesn't prevent the optimization). > Seriously. The current standard is broken. Please, let's be precise in such statement, so that everyone actually knows what's meant. The rest of the memory model can be perfectly
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-20 at 09:34 -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel wrote: > >> > >> So the clarification is basically to the statement that the "if > >> (consume(p)) a" version *would* have an ordering guarantee between the > >> read of "p" and "a", but the "consume(p) ? a : b" would *not* have > >> such an ordering guarantee. Yes? > > > > Not as I understand it. If my reply above wasn't clear, let me know and > > I'll try to rephrase it into something that is. > > Yeah, so you and Paul agree. And as I mentioned in the email that > crossed with yours, I think that means that the standard is overly > complex, hard to understand, fragile, and all *pointlessly* so. Let's step back a little here and distinguish different things: 1) AFAICT, mo_acquire provides all the ordering guarantees you want. Thus, I suggest focusing on mo_acquire for now, especially when reading the model. Are you fine with the mo_acquire semantics? 2) mo_acquire is independent of carries-a-dependency and similar definitions/relations, so you can ignore all those to reason about mo_acquire. 3) Do you have concerns over the runtime costs of barriers with mo_acquire semantics? If so, that's a valid discussion point, and we can certainly dig deeper into this topic to see how we can possibly use weaker HW barriers by exploiting things the compiler sees and can potentially preserve (e.g., control dependencies). There might be some stuff the compiler can do without needing further input from the programmer. 4) mo_consume is kind of the difficult special-case variant of mo_acquire. We should discuss it (including whether it's helpful) separately from the memory model, because it's not essential. > Btw, there are many ways that "use a consume as an input to a > conditional" can happen. 
In particular, even if the result is actually > *used* like a pointer as far as the programmer is concerned, tricks > like pointer compression etc can well mean that the "pointer" is > actually at least partly implemented using conditionals, so that some > paths end up being only dependent through a comparison of the pointer > value. AFAIU, this is similar to my concerns about how a compiler can reasonably implement the ordering guarantees: The mo_consume value may be used like a pointer in source code, but how this looks in the generated code may be different after reasonable transformations (including transformations to control dependencies, so the compiler would have to avoid those). Or did I misunderstand? > So I very much did *not* want to complicate the "litmus test" code > snippet when Paul tried to make it more complex, but I do think that > there are cases where code that "looks" like pure pointer chasing > actually is not for some cases, and then can become basically that > litmus test for some path. > > Just to give you an example: in simple list handling it is not at all > unusual to have a special node that is discovered by comparing the > address, not by just loading from the pointer and following the list > itself. Examples of that would be a HEAD node in a doubly linked list > (Linux uses this concept quite widely, it's our default list > implementation), or it could be a place-marker ("cursor" entry) in the > list for safe traversal in the presence of concurrent deletes etc. > > And obviously there is the already much earlier mentioned > compiler-induced compare, due to value speculation, that can basically > create such sequences even where they did not originally exist in the > source code itself. > > So even if you work with "pointer dereferences", and thus match that > particular consume pattern, I really don't see why anybody would think > that "hey, we can ignore any control dependencies" is a good idea. > It's a *TERRIBLE* idea.
> > And as mentioned, it's a terrible idea with no upsides. It doesn't > help compiler optimizations for the case it's *intended* to help, > since those optimizations can still be done without the horribly > broken semantics. It doesn't help compiler writers, it just confuses > them. I'm worried about how compilers can implement mo_consume without prohibiting lots of optimizations on the code. Your thoughts seem to point in a similar direction. I think we should continue by discussing mo_acquire first. It has easier semantics and allows a relatively simple implementation in compilers (although there might be not-quite-so-simple optimizations). It's unfortunate we started the discussion with the tricky special case first; maybe that's what contributed to the confusion. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-20 at 00:30 -0800, Paul E. McKenney wrote: > Well, all the compilers currently convert consume to acquire, so you have > your wish there. Of course, that also means that they generate actual > unneeded memory-barrier instructions, which seems extremely sub-optimal > to me. GCC doesn't currently, but it also doesn't seem to track the dependencies; that's a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-19 at 20:43 -0800, Linus Torvalds wrote: [Paul has already answered many of your questions, and my reply to your previous email should also answer some.] > If the consumer of an atomic load isn't a pointer chasing operation, > then the consume should be defined to be the same as acquire. None of > this "conditionals break consumers". No, conditionals on the > dependency path should turn consumers into acquire, because otherwise > the "consume" load is dangerous as hell. Yes, mo_consume is more tricky than mo_acquire. However, that has an advantage because you can avoid getting stronger barriers if you don't need them (ie, you can avoid the "auto-update to acquire" you seem to have in mind). The auto-upgrade would be a possible semantics, I agree. Another option may be to let an implementation optimize the HW barriers that it uses for mo_acquire. That is, if the compiler sees that (1) the result of an mo_acquire load is used on certain code paths *only* for consumers that carry the dependency *and* (2) there are no other operations on that code path that can possibly rely on the mo_acquire ordering guarantees, then the compiler can use a weaker HW barrier on archs such as PowerPC or ARM. That is similar to the rules for mo_consume you seem to have in mind, but makes it a compiler optimization on mo_acquire. However, the compiler has to be conservative here, so having a mo_consume that is trickier to use but doesn't ever silently introduce stronger HW barriers seems to be useful (modulo the difficulties regarding to how it's currently specified in the standard). > And if the definition of acquire doesn't include the control > dependency either, then the C atomic memory model is just completely > and utterly broken, since the above *trivial* and clearly useful > example is broken. 
In terms of the model, if you establish a synchronizes-with using a reads-from that has (or crosses) a release/acquire memory-order pair, then this synchronizes-with will also order other operations that it's sequenced-before (see the composition of synchronizes-with and sequenced-before in the inter-thread-happens-before definition in n3132 6.15). So yes, mo_acquire does take the logical "control dependencies" / sequenced-before into account.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote: > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel wrote: > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > > >> > > >> Can you point to it? Because I can find a draft standard, and it sure > > >> as hell does *not* contain any clarity of the model. It has a *lot* of > > >> verbiage, but it's pretty much impossible to actually understand, even > > >> for somebody who really understands memory ordering. > > > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > This has an explanation of the model up front, and then the detailed > > > formulae in Section 6. This is from 2010, and there might have been > > > smaller changes since then, but I'm not aware of any bigger ones. > > > > Ahh, this is different from what others pointed at. Same people, > > similar name, but not the same paper. > > > > I will read this version too, but from reading the other one and the > > standard in parallel and trying to make sense of it, it seems that I > > may have originally misunderstood part of the whole control dependency > > chain. > > > > The fact that the left side of "? :", "&&" and "||" breaks data > > dependencies made me originally think that the standard tried very > > hard to break any control dependencies. Which I felt was insane, when > > then some of the examples literally were about the testing of the > > value of an atomic read. The data dependency matters quite a bit. The > > fact that the other "Mathematical" paper then very much talked about > > consume only in the sense of following a pointer made me think so even > > more. 
> > > > But reading it some more, I now think that the whole "data dependency" > > logic (which is where the special left-hand side rule of the ternary > > and logical operators come in) are basically an exception to the rule > > that sequence points end up being also meaningful for ordering (ok, so > > C11 seems to have renamed "sequence points" to "sequenced before"). > > > > So while an expression like > > > > atomic_read(p, consume) ? a : b; > > > > doesn't have a data dependency from the atomic read that forces > > serialization, writing > > > >if (atomic_read(p, consume)) > > a; > >else > > b; > > > > the standard *does* imply that the atomic read is "happens-before" wrt > > "a", and I'm hoping that there is no question that the control > > dependency still acts as an ordering point. > > The control dependency should order subsequent stores, at least assuming > that "a" and "b" don't start off with identical stores that the compiler > could pull out of the "if" and merge. The same might also be true for ?: > for all I know. (But see below) I don't think this is quite true. I agree that a conditional store will not be executed speculatively (note that if it would happen in both the then and the else branch, it's not conditional); so, the store in "a;" (assuming it would be a store) won't happen unless the thread can really observe a true value for p. However, this is *this thread's* view of the world, but not guaranteed to constrain how any other thread sees the state. mo_consume does not contribute to inter-thread-happens-before in the same way that mo_acquire does (which *does* put a constraint on i-t-h-b, and thus enforces a global constraint that all threads have to respect). Is it clear which distinction I'm trying to show here? > That said, in this case, you could substitute relaxed for consume and get > the same effect. 
The return value from atomic_read() gets absorbed into > the "if" condition, so there is no dependency-ordered-before relationship, > so nothing for consume to do. > > One caution... The happens-before relationship requires you to trace a > full path between the two operations of interest. This is illustrated > by the following example, with both x and y initially zero: > > T1: atomic_store_explicit(&x, 1, memory_order_relaxed); > r1 = atomic_load_explicit(&y, memory_order_relaxed); > > T2: atomic_store_explicit(&y, 1, memory_order_relaxed); > r2 = atomic_load_explicit(&x, memory_order_relaxed); > > There is a happens-before relationship between T1's load and store, > and another happens-before relationship between T2
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-19 at 16:53 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel wrote: > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > >> > >> Can you point to it? Because I can find a draft standard, and it sure > >> as hell does *not* contain any clarity of the model. It has a *lot* of > >> verbiage, but it's pretty much impossible to actually understand, even > >> for somebody who really understands memory ordering. > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > This has an explanation of the model up front, and then the detailed > > formulae in Section 6. This is from 2010, and there might have been > > smaller changes since then, but I'm not aware of any bigger ones. > > Ahh, this is different from what others pointed at. Same people, > similar name, but not the same paper. > > I will read this version too, but from reading the other one and the > standard in parallel and trying to make sense of it, it seems that I > may have originally misunderstood part of the whole control dependency > chain. > > The fact that the left side of "? :", "&&" and "||" breaks data > dependencies made me originally think that the standard tried very > hard to break any control dependencies. Hmm... (I'm not quite sure how to express this properly, so bear with me.) The control dependencies as you seem to consider aren't there *as is*, and removing "? :", "&&", and "||" from what's called Carries-a-dependency-to in n3132 is basically removing control dependences in expressions from that relation (ie, so it's just data dependences). However, control dependences in the program are not ignored, because they control what steps the abstract machine would do. If we have candidate executions with a particular choice for the reads-from relation, they must make sense in terms of program logic (which does include conditionals and such), or a candidate execution will not be consistent. 
The difference is in how reads-from is constrained by different memory orders. I hope what I mean gets clearer below when looking at the example... Mark Batty / Peter Sewell: Do you have a better explanation? (And please correct anything wrong in what I wrote ;) > Which I felt was insane, when > then some of the examples literally were about the testing of the > value of an atomic read. The data dependency matters quite a bit. The > fact that the other "Mathematical" paper then very much talked about > consume only in the sense of following a pointer made me think so even > more. > > But reading it some more, I now think that the whole "data dependency" > logic (which is where the special left-hand side rule of the ternary > and logical operators come in) are basically an exception to the rule > that sequence points end up being also meaningful for ordering (ok, so > C11 seems to have renamed "sequence points" to "sequenced before"). I'm not quite sure what you mean, but I think that this isn't quite the case. mo_consume and mo_acquire combine with sequenced-before differently. Have a look at the definition of inter-thread-happens-before (6.15 in n3132): for synchronizes-with, the composition with sequenced-before is used (ie, what you'd get with mo_acquire), whereas for mo_consume, we just have dependency_ordered_before (so, data dependencies only). > So while an expression like > > atomic_read(p, consume) ? a : b; > > doesn't have a data dependency from the atomic read that forces > serialization, writing > >if (atomic_read(p, consume)) > a; >else > b; > > the standard *does* imply that the atomic read is "happens-before" wrt > "a", and I'm hoping that there is no question that the control > dependency still acts as an ordering point. I don't think it does for mo_consume (but does for mo_acquire).
Both examples do basically the same in the abstract machine, and I think it would be surprising if rewriting the code into something seemingly equivalent could make a semantic difference. AFAICT, "? :" etc. is removed to allow the standard to define data dependencies as being based on expressions that take the mo_consume value as input, without also pulling in control dependencies. To look at this example in the cppmem tool, we can transform your example above into:

  int main() {
    atomic_int x = 0;
    int y = 0;
    {{{ { y = 1;
          x.store(1, memory_order_release); }
    ||| { if (x.load(memory_order_consume))
            r2 = y; }
    }}};
    return 0;
  }

This has 2 threads, one is doing the store to y (the data) and the release operation, the other is using consume
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-19 at 07:23 -0800, David Lang wrote: > On Tue, 18 Feb 2014, Torvald Riegel wrote: > > > On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote: > >> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > >>> Well, that's how atomics that aren't volatile are defined in the > >>> standard. I can see that you want something else too, but that doesn't > >>> mean that the other thing is broken. > >> > >> Well that other thing depends on being able to see the entire program at > >> compile time. PaulMck already listed various ways in which this is > >> not feasible even for normal userspace code. > >> > >> In particular; DSOs and JITs were mentioned. > > > > No it doesn't depend on whole-program analysis being possible. Because > > if it isn't, then a correct compiler will just not do certain > > optimizations simply because it can't prove properties required for the > > optimization to hold. With the exception of access to objects via magic > > numbers (e.g., fixed and known addresses (see my reply to Paul), which > > are outside of the semantics specified in the standard), I don't see a > > correctness problem here. > > Are you really sure that the compiler can figure out every possible thing > that a > loadable module or JITed code can access? That seems like a pretty strong > claim. If the other code can be produced by a C translation unit that is valid to be linked with the rest of the program, then I'm pretty sure the compiler has a well-defined notion of whether it does or does not see all other potential accesses. IOW, if the C compiler is dealing with C semantics and mechanisms only (including the C mechanisms for sharing with non-C code!), then it will know what to do. If you're playing tricks behind the C compiler's back using implementation-defined stuff outside of the C specification, then there's nothing the compiler really can do. 
For example, if you're trying to access a variable on a function's stack from some other function, you better know how the register allocator of the compiler operates. In contrast, if you let this function simply export the address of the variable to some external place, all will be fine. The documentation of GCC's -fwhole-program and -flto might also be interesting for you. GCC wouldn't need to have -fwhole-program if it weren't conservative by default (correctly so).
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote: > On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote: > > On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote: > > > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote: > > > > xagsmtp4.20140218214207.8...@vmsdvm9.vnet.ibm.com > > > > X-Xagent-Gateway: vmsdvm9.vnet.ibm.com (XAGSMTP4 at VMSDVM9) > > > > > > > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > > > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel > > > > > > wrote: > > > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > > > > > >> And exactly because I know enough, I would *really* like atomics > > > > > > >> to be > > > > > > >> well-defined, and have very clear - and *local* - rules about > > > > > > >> how they > > > > > > >> can be combined and optimized. > > > > > > > > > > > > > > "Local"? > > > > > > > > > > > > Yes. > > > > > > > > > > > > So I think that one of the big advantages of atomics over volatile > > > > > > is > > > > > > that they *can* be optimized, and as such I'm not at all against > > > > > > trying to generate much better code than for volatile accesses. > > > > > > > > > > > > But at the same time, that can go too far. For example, one of the > > > > > > things we'd want to use atomics for is page table accesses, where it > > > > > > is very important that we don't generate multiple accesses to the > > > > > > values, because parts of the values can be change *by*hardware* (ie > > > > > > accessed and dirty bits). > > > > > > > > > > > > So imagine that you have some clever global optimizer that sees that > > > > > > the program never ever actually sets the dirty bit at all in any > > > > > > thread, and then uses that kind of non-local knowledge to make > > > > > > optimization decisions. THAT WOULD BE BAD. 
> > > > > > > > > > Might as well list other reasons why value proofs via whole-program > > > > > analysis are unreliable for the Linux kernel: > > > > > > > > > > 1.As Linus said, changes from hardware. > > > > > > > > This is what's volatile is for, right? (Or the weak-volatile idea I > > > > mentioned). > > > > > > > > Compilers won't be able to prove something about the values of such > > > > variables, if marked (weak-)volatile. > > > > > > Yep. > > > > > > > > 2.Assembly code that is not visible to the compiler. > > > > > Inline asms will -normally- let the compiler know what > > > > > memory they change, but some just use the "memory" tag. > > > > > Worse yet, I suspect that most compilers don't look all > > > > > that carefully at .S files. > > > > > > > > > > Any number of other programs contain assembly files. > > > > > > > > Are the annotations of changed memory really a problem? If the "memory" > > > > tag exists, isn't that supposed to mean all memory? > > > > > > > > To make a proof about a program for location X, the compiler has to > > > > analyze all uses of X. Thus, as soon as X escapes into an .S file, then > > > > the compiler will simply not be able to prove a thing (except maybe due > > > > to the data-race-free requirement for non-atomics). The attempt to > > > > prove something isn't unreliable, simply because a correct compiler > > > > won't claim to be able to "prove" something. > > > > > > I am indeed less worried about inline assembler than I am about files > > > full of assembly. Or files full of other languages. > > > > > > > One reason that could corrupt this is that if program addresses objects > > > > other than through the mechanisms defined in the language. For example, > > > > if one thread lays out a data structure at a constant fixed memory > > &
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 14:14 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel wrote: > >> > >> So imagine that you have some clever global optimizer that sees that > >> the program never ever actually sets the dirty bit at all in any > >> thread, and then uses that kind of non-local knowledge to make > >> optimization decisions. THAT WOULD BE BAD. > >> > >> Do you see what I'm aiming for? > > > > Yes, I do. But that seems to be "volatile" territory. It crosses the > > boundaries of the abstract machine, and thus is input/output. Which > > fraction of your atomic accesses can read values produced by hardware? > > I would still suppose that lots of synchronization is not affected by > > this. > > The "hardware can change things" case is indeed pretty rare. > > But quite frankly, even when it isn't hardware, as far as the compiler > is concerned you have the exact same issue - you have TLB faults > happening on other CPU's that do the same thing asynchronously using > software TLB fault handlers. So *semantically*, it really doesn't make > any difference what-so-ever if it's a software TLB handler on another > CPU, a microcoded TLB fault, or an actual hardware path. I think there are a few semantic differences: * If a SW handler uses the C11 memory model, it will synchronize like any other thread. HW might do something else entirely, including synchronizing differently, not using atomic accesses, etc. (At least that's the constraints I had in mind). * If we can treat any interrupt handler like Just Another Thread, then the next question is whether the compiler will be aware that there is another thread. I think that in practice it will be: You'll set up the handler in some way by calling a function the compiler can't analyze, so the compiler will know that stuff accessible to the handler (e.g., global variables) will potentially be accessed by other threads. 
* Similarly, if the C code is called from some external thing, it also has to assume the presence of other threads. (Perhaps this is what the compiler has to assume in a freestanding implementation anyway...) However, accessibility will be different for, say, stack variables that haven't been shared with other functions yet; those are arguably not reachable by other things, at least not through mechanisms defined by the C standard. So optimizing these should be possible with the assumption that there is no other thread (at least as default -- I'm not saying that this is the only reasonable semantics). > So if the answer for all of the above is "use volatile", then I think > that means that the C11 atomics are badly designed. > > The whole *point* of atomic accesses is that stuff like above should > "JustWork(tm)" I think that it should in the majority of cases. If the other thing potentially accessing the data can do as much as a valid C11 thread can do, the synchronization itself will work just fine. In most cases except the (void*)0x123 example (or linker scripts etc.) the compiler is aware when data is made visible to other threads or other non-analyzable functions that may spawn other threads (or just by being a plain global variable accessible to other (potentially .S) translation units).
> > So we obviously use "volatile" for this currently, and generally the > semantics we really want are: > > - the load or store is done as a single access ("atomic") > > - the compiler must not try to re-materialize the value by reloading > it from memory (this is the "at most once" part) In the presence of other threads performing operations unknown to the compiler, that's what you should get even if the compiler is trying to optimize C11 atomics. The first requirement is clear, and the "at most once" follows from another thread potentially writing to the variable. The only difference I can see right now is that a compiler may be able to *prove* that it doesn't matter whether it reloaded the value or not. But this seems very
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 22:47 +0100, Peter Zijlstra wrote: > On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > > Yes, I do. But that seems to be "volatile" territory. It crosses the > > boundaries of the abstract machine, and thus is input/output. Which > > fraction of your atomic accesses can read values produced by hardware? > > I would still suppose that lots of synchronization is not affected by > > this. > > Its not only hardware; also the kernel/user boundary has this same > problem. We cannot a-priory say what userspace will do; in fact, because > we're a general purpose OS, we must assume it will willfully try its > bestest to wreck whatever assumptions we make about its behaviour. That's a good note, and I think a distinct case from those below, because here you're saying that you can't assume that the userspace code follows the C11 semantics ... > We also have loadable modules -- much like regular userspace DSOs -- so > there too we cannot say what will or will not happen. > > We also have JITs that generate code on demand. .. whereas for those, you might assume that the other code follows C11 semantics and the same ABI, which makes this just a normal case that is already handled (see my other replies nearby in this thread). > And I'm absolutely sure (with the exception of the JITs, its not an area > I've worked on) that we have atomic usage across all those boundaries. That would be fine as long as all involved parties use the same memory model and ABI to implement it. (Of course, I'm assuming here that the compiler is aware of sharing with other entities, which is always the case except in those corner cases like accesses to (void*)0x123 magically aliasing with something else).
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote: > > xagsmtp4.20140218214207.8...@vmsdvm9.vnet.ibm.com > > X-Xagent-Gateway: vmsdvm9.vnet.ibm.com (XAGSMTP4 at VMSDVM9) > > > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel > > > > wrote: > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > > > >> And exactly because I know enough, I would *really* like atomics to > > > > >> be > > > > >> well-defined, and have very clear - and *local* - rules about how > > > > >> they > > > > >> can be combined and optimized. > > > > > > > > > > "Local"? > > > > > > > > Yes. > > > > > > > > So I think that one of the big advantages of atomics over volatile is > > > > that they *can* be optimized, and as such I'm not at all against > > > > trying to generate much better code than for volatile accesses. > > > > > > > > But at the same time, that can go too far. For example, one of the > > > > things we'd want to use atomics for is page table accesses, where it > > > > is very important that we don't generate multiple accesses to the > > > > values, because parts of the values can be change *by*hardware* (ie > > > > accessed and dirty bits). > > > > > > > > So imagine that you have some clever global optimizer that sees that > > > > the program never ever actually sets the dirty bit at all in any > > > > thread, and then uses that kind of non-local knowledge to make > > > > optimization decisions. THAT WOULD BE BAD. > > > > > > Might as well list other reasons why value proofs via whole-program > > > analysis are unreliable for the Linux kernel: > > > > > > 1.As Linus said, changes from hardware. > > > > This is what's volatile is for, right? (Or the weak-volatile idea I > > mentioned). 
> > > > Compilers won't be able to prove something about the values of such > > variables, if marked (weak-)volatile. > > Yep. > > > > 2.Assembly code that is not visible to the compiler. > > > Inline asms will -normally- let the compiler know what > > > memory they change, but some just use the "memory" tag. > > > Worse yet, I suspect that most compilers don't look all > > > that carefully at .S files. > > > > > > Any number of other programs contain assembly files. > > > > Are the annotations of changed memory really a problem? If the "memory" > > tag exists, isn't that supposed to mean all memory? > > > > To make a proof about a program for location X, the compiler has to > > analyze all uses of X. Thus, as soon as X escapes into an .S file, then > > the compiler will simply not be able to prove a thing (except maybe due > > to the data-race-free requirement for non-atomics). The attempt to > > prove something isn't unreliable, simply because a correct compiler > > won't claim to be able to "prove" something. > > I am indeed less worried about inline assembler than I am about files > full of assembly. Or files full of other languages. > > > One reason that could corrupt this is that if program addresses objects > > other than through the mechanisms defined in the language. For example, > > if one thread lays out a data structure at a constant fixed memory > > address, and another one then uses the fixed memory address to get > > access to the object with a cast (e.g., (void*)0x123). > > Or if the program uses gcc linker scripts to get the same effect. > > > > 3.Kernel modules that have not yet been written. Now, the > > > compiler could refrain from trying to prove anything about > > > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > > > is currently no way to communicate this information to the > > > compiler other than marking the variable "volatile". 
> > Even if the variable is just externally accessible, then the compiler > knows that it can't do whole-program analysis about it. > > It is true that whole-program analysis will not be applicable in this > case, but it will not be unreliable. I think that's an important difference.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 22:52 +0100, Peter Zijlstra wrote: > > > 4.Some drivers allow user-mode code to mmap() some of their > > > state. Any changes undertaken by the user-mode code would > > > be invisible to the compiler. > > > > A good point, but a compiler that doesn't try to (incorrectly) assume > > something about the semantics of mmap will simply see that the mmap'ed > > data will escape to stuff if can't analyze, so it will not be able to > > make a proof. > > > > This is different from, for example, malloc(), which is guaranteed to > > return "fresh" nonaliasing memory. > > The kernel side of this is different.. it looks like 'normal' memory, we > just happen to allow it to end up in userspace too. > > But on that point; how do you tell the compiler the difference between > malloc() and mmap()? Is that some function attribute? Yes: malloc The malloc attribute is used to tell the compiler that a function may be treated as if any non-NULL pointer it returns cannot alias any other pointer valid when the function returns and that the memory has undefined content. This often improves optimization. Standard functions with this property include malloc and calloc. realloc-like functions do not have this property as the memory pointed to does not have undefined content. I'm not quite sure whether GCC assumes malloc() to be indeed C's malloc even if the function attribute isn't used, and/or whether that is different for freestanding environments. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
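The attribute Torvald quotes can be sketched as follows. This is a minimal, hypothetical example (the names `my_alloc` and `caller` are not from the thread): the attribute lets the compiler assume the returned pointer aliases nothing else, so the reload of `*p` after the store through `q` can be folded away.

```c
#include <stdlib.h>

/* Hypothetical allocator wrapper: __attribute__((malloc)) tells GCC
 * that a non-NULL return value does not alias any other pointer valid
 * at the time the function returns, and that the pointed-to memory
 * has undefined (fresh) content, as in the documentation quoted above. */
__attribute__((malloc))
void *my_alloc(size_t n)
{
    return malloc(n);
}

int caller(int *p)
{
    int *q = my_alloc(sizeof *q); /* compiler may assume q != p */
    if (q == NULL)
        return -1;
    *p = 1;
    *q = 2;
    int r = *p;                   /* may be folded to the constant 1 */
    free(q);
    return r;
}
```

As for the open question at the end of the mail: in a hosted environment GCC additionally recognizes `malloc()` itself as a built-in with these semantics unless told otherwise (e.g. via `-fno-builtin-malloc` or `-ffreestanding`), which is why custom allocators typically disable the built-in.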
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 23:48 +, Peter Sewell wrote: > On 18 February 2014 20:43, Torvald Riegel wrote: > > On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote: > >> Several of you have said that the standard and compiler should not > >> permit speculative writes of atomics, or (effectively) that the > >> compiler should preserve dependencies. In simple examples it's easy > >> to see what that means, but in general it's not so clear what the > >> language should guarantee, because dependencies may go via non-atomic > >> code in other compilation units, and we have to consider the extent to > >> which it's desirable to limit optimisation there. > > > > [...] > > > >> 2) otherwise, the language definition should prohibit it but the > >>compiler would have to preserve dependencies even in compilation > >>units that have no mention of atomics. It's unclear what the > >>(runtime and compiler development) cost of that would be in > >>practice - perhaps Torvald could comment? > > > > If I'm reading the standard correctly, it requires that data > > dependencies are preserved through loads and stores, including nonatomic > > ones. That sounds convenient because it allows programmers to use > > temporary storage. > > The standard only needs this for consume chains, That's right, and the runtime cost / implementation problems of mo_consume were what I was making statements about. Sorry if that wasn't clear.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote: > On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > > Well, that's how atomics that aren't volatile are defined in the > > standard. I can see that you want something else too, but that doesn't > > mean that the other thing is broken. > > Well that other thing depends on being able to see the entire program at > compile time. PaulMck already listed various ways in which this is > not feasible even for normal userspace code. > > In particular; DSOs and JITs were mentioned. No it doesn't depend on whole-program analysis being possible. Because if it isn't, then a correct compiler will just not do certain optimizations simply because it can't prove properties required for the optimization to hold. With the exception of access to objects via magic numbers (e.g., fixed and known addresses (see my reply to Paul), which are outside of the semantics specified in the standard), I don't see a correctness problem here.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote: > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > >> And exactly because I know enough, I would *really* like atomics to be > > >> well-defined, and have very clear - and *local* - rules about how they > > >> can be combined and optimized. > > > > > > "Local"? > > > > Yes. > > > > So I think that one of the big advantages of atomics over volatile is > > that they *can* be optimized, and as such I'm not at all against > > trying to generate much better code than for volatile accesses. > > > > But at the same time, that can go too far. For example, one of the > > things we'd want to use atomics for is page table accesses, where it > > is very important that we don't generate multiple accesses to the > > values, because parts of the values can be changed *by hardware* (ie > > accessed and dirty bits). > > > > So imagine that you have some clever global optimizer that sees that > > the program never ever actually sets the dirty bit at all in any > > thread, and then uses that kind of non-local knowledge to make > > optimization decisions. THAT WOULD BE BAD. > > Might as well list other reasons why value proofs via whole-program > analysis are unreliable for the Linux kernel: > > 1. As Linus said, changes from hardware. This is what volatile is for, right? (Or the weak-volatile idea I mentioned). Compilers won't be able to prove something about the values of such variables, if marked (weak-)volatile. > 2. Assembly code that is not visible to the compiler. > Inline asms will -normally- let the compiler know what > memory they change, but some just use the "memory" tag. > Worse yet, I suspect that most compilers don't look all > that carefully at .S files. > > Any number of other programs contain assembly files. Are the annotations of changed memory really a problem? 
If the "memory" tag exists, isn't that supposed to mean all memory? To make a proof about a program for location X, the compiler has to analyze all uses of X. Thus, as soon as X escapes into an .S file, then the compiler will simply not be able to prove a thing (except maybe due to the data-race-free requirement for non-atomics). The attempt to prove something isn't unreliable, simply because a correct compiler won't claim to be able to "prove" something. One reason that could corrupt this is if the program addresses objects other than through the mechanisms defined in the language. For example, if one thread lays out a data structure at a constant fixed memory address, and another one then uses the fixed memory address to get access to the object with a cast (e.g., (void*)0x123). > 3. Kernel modules that have not yet been written. Now, the > compiler could refrain from trying to prove anything about > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > is currently no way to communicate this information to the > compiler other than marking the variable "volatile". Even if the variable is just externally accessible, then the compiler knows that it can't do whole-program analysis about it. It is true that whole-program analysis will not be applicable in this case, but it will not be unreliable. I think that's an important difference. > Other programs have similar issues, e.g., via dlopen(). > > 4. Some drivers allow user-mode code to mmap() some of their > state. Any changes undertaken by the user-mode code would > be invisible to the compiler. A good point, but a compiler that doesn't try to (incorrectly) assume something about the semantics of mmap will simply see that the mmap'ed data will escape to stuff it can't analyze, so it will not be able to make a proof. This is different from, for example, malloc(), which is guaranteed to return "fresh" nonaliasing memory. 
> 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ This might be special, or not, depending on how the JITed code gets access to data. If this is via fixed addresses (e.g., (void*)0x123), then see above. If this is through function calls that the compiler can't analyze, then this is like 4.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 08:49 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > >> And exactly because I know enough, I would *really* like atomics to be > >> well-defined, and have very clear - and *local* - rules about how they > >> can be combined and optimized. > > > > "Local"? > > Yes. > > So I think that one of the big advantages of atomics over volatile is > that they *can* be optimized, and as such I'm not at all against > trying to generate much better code than for volatile accesses. > > But at the same time, that can go too far. For example, one of the > things we'd want to use atomics for is page table accesses, where it > is very important that we don't generate multiple accesses to the > values, because parts of the values can be changed *by hardware* (ie > accessed and dirty bits). > > So imagine that you have some clever global optimizer that sees that > the program never ever actually sets the dirty bit at all in any > thread, and then uses that kind of non-local knowledge to make > optimization decisions. THAT WOULD BE BAD. > > Do you see what I'm aiming for? Yes, I do. But that seems to be "volatile" territory. It crosses the boundaries of the abstract machine, and thus is input/output. Which fraction of your atomic accesses can read values produced by hardware? I would still suppose that lots of synchronization is not affected by this. Do you perhaps want a weaker form of volatile? That is, one that, for example, allows combining of two adjacent loads of the dirty bits, but will make sure that this is treated as if there is some imaginary external thread that it cannot analyze and that may write? I'm trying to be careful here in distinguishing between volatile and synchronization because I believe those are orthogonal, and this is a good thing because it allows for more-than-conservatively optimized code. 
> Any optimization that tries to prove > anything from more than local state is by definition broken, because > it assumes that everything is described by the program. Well, that's how atomics that aren't volatile are defined in the standard. I can see that you want something else too, but that doesn't mean that the other thing is broken. > But *local* optimizations are fine, as long as they follow the obvious > rule of not actually making changes that are semantically visible. If we assume that there is this imaginary thread called hardware that can write/read to/from such weak-volatile atomics, I believe this should restrict optimizations sufficiently even in the model as specified in the standard. For example, it would prevent a compiler from proving that there is no access by another thread to a variable, so it would prevent the cases in our discussion that you didn't want to get optimized. Yet, I believe the model itself could stay unmodified. Thus, with this "weak volatile", we could let programmers request precisely the semantics that they want, without using volatile too much and without preventing optimizations more than necessary. Thoughts? > (In practice, I'd be impressed as hell for that particular example, > and we actually do end up setting the dirty bit by hand in some > situations, so the example is slightly made up, but there are other > cases that might be more realistic in that sometimes we do things that > are hidden from the compiler - in assembly etc - and the compiler > might *think* it knows what is going on, but it doesn't actually see > all accesses). If something is visible to assembly, then the compiler should see this (or at least be aware that it can't prove anything about such data). Or am I missing anything? > > Sorry, but the rules *are* very clear. I *really* suggest to look at > > the formalization by Batty et al. 
And in these rules, proving that a > > read will always return value X has a well-defined meaning, and you can > > use it. That simply follows from how the model is built. > > What "model"? The C/C++ memory model was what I was referring to. > That's the thing. I have tried to figure out whether the model is some > abstract C model, or a model based on the actual hardware that the > compiler is compiling for, and whether the model is one that assumes > the compiler has complete knowledge of the system (see the example > above). It's a model as specified in the standard. It's not parametrized by the hardware the program will eventually run on (ignoring implementation-defined behavior, timing, etc.). The compiler has complete knowledge of the system except for "volatiles" and things coming in from the external world.
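The "weak volatile" idea floated in this exchange does not exist in the language, but the two existing extremes it sits between can be shown in C11. This sketch (hypothetical variable names) contrasts plain relaxed atomics, whose adjacent loads the compiler may legally merge, with volatile atomics, where every access must be performed because the value may change outside the abstract machine:

```c
#include <stdatomic.h>

_Atomic unsigned flags;              /* plain atomic: adjacent relaxed
                                        loads may legally be combined  */
volatile _Atomic unsigned hw_flags;  /* volatile atomic: every access
                                        must actually be performed     */

unsigned read_twice_atomic(void)
{
    unsigned a = atomic_load_explicit(&flags, memory_order_relaxed);
    unsigned b = atomic_load_explicit(&flags, memory_order_relaxed);
    return a + b;   /* a compiler may emit a single load and double it */
}

unsigned read_twice_volatile(void)
{
    unsigned a = hw_flags;   /* each read is an observable access, so a */
    unsigned b = hw_flags;   /* hardware-set bit appearing between the  */
    return a + b;            /* two loads must not be missed            */
}
```

The hypothetical weak-volatile qualifier would allow the first function's load-combining while still forbidding the compiler from proving that no "imaginary external thread" ever writes the variable.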
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 18:21 +, Peter Sewell wrote: > On 18 February 2014 17:38, Linus Torvalds > wrote: > > On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell > > wrote: > >> > >> For example, suppose we have, in one compilation unit: > >> > >> void f(int ra, int*rb) { > >> if (ra==42) > >> *rb=42; > >> else > >> *rb=42; > >> } > > > > So this is a great example, and in general I really like your page at: > > > >> For more context, this example is taken from a summary of the thin-air > >> problem by Mark Batty and myself, > >> , and the problem with > >> dependencies via other compilation units was AFAIK first pointed out > >> by Hans Boehm. > > > > and the reason I like your page is that it really talks about the > > problem by pointing to the "unoptimized" code, and what hardware would > > do. > > Thanks. It's certainly necessary to separately understand what compiler > optimisation and the hardware might do, to get anywhere here. But... > > > As mentioned, I think that's actually the *correct* way to think about > > the problem space, because it allows the programmer to take hardware > > characteristics into account, without having to try to "describe" them > > at a source level. > > ...to be clear, I am ultimately after a decent source-level description of > what > programmers can depend on, and we (Mark and I) view that page as > identifying constraints on what that description can say. There are too > many compiler optimisations for people to reason directly in terms of > the set of all transformations that they do, so we need some more > concise and comprehensible envelope identifying what is allowed, > as an interface between compiler writers and users. AIUI that's basically > what Torvald is arguing. Yes, that's one reason. Another one is that if a programmer would actually want to use atomics in a machine-independent / portable way, he/she does also not want to reason about how all those transformations might interact with the machine's memory model. 
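Sewell's example, made compilable and annotated with the point it illustrates: the value stored through rb is 42 on both paths, so the control dependency on ra can be erased.

```c
/* Both branches store the same value, so a compiler may collapse the
 * conditional into an unconditional "*rb = 42;", erasing the control
 * dependency on ra.  If ra originated from a memory_order_consume load
 * in another compilation unit, the ordering that the dependency would
 * have provided on hardware (e.g. ARM or Power) is silently lost,
 * which is exactly the cross-compilation-unit problem discussed here. */
void f(int ra, int *rb)
{
    if (ra == 42)
        *rb = 42;
    else
        *rb = 42;
}
```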
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote: > Several of you have said that the standard and compiler should not > permit speculative writes of atomics, or (effectively) that the > compiler should preserve dependencies. In simple examples it's easy > to see what that means, but in general it's not so clear what the > language should guarantee, because dependencies may go via non-atomic > code in other compilation units, and we have to consider the extent to > which it's desirable to limit optimisation there. [...] > 2) otherwise, the language definition should prohibit it but the >compiler would have to preserve dependencies even in compilation >units that have no mention of atomics. It's unclear what the >(runtime and compiler development) cost of that would be in >practice - perhaps Torvald could comment? If I'm reading the standard correctly, it requires that data dependencies are preserved through loads and stores, including nonatomic ones. That sounds convenient because it allows programmers to use temporary storage. However, what happens if a dependency "arrives" at a store for which the alias set isn't completely known? Then we either have to add a barrier to enforce the ordering at this point, or we have to assume that all other potentially aliasing memory locations would also have to start carrying dependencies (which might be in other functions in other compilation units). Neither option is good. The first might introduce barriers in places in which they might not be required (or the programmer has to use kill_dependency() quite often to avoid all these). The second is bad because points-to analysis is hard, so in practice the points-to set will not be precisely known for a lot of pointers. So this might not just creep into other functions via calls of [[carries_dependency]] functions, but also through normal loads and stores, likely prohibiting many optimizations. 
Furthermore, the dependency tracking can currently only be "disabled/enabled" on a function granularity (via [[carries_dependency]]). Thus, if we have big functions, then dependency tracking may slow down a lot of code in the big function. If we have small functions, there's a lot of attributes to be added. If a function may only carry a dependency but doesn't necessarily (eg, depending on input parameters), then the programmer has to make a trade-off whether he/she wants to benefit from mo_consume but slow down other calls due to additional barriers (ie, when this function is called from non-[[carries_dependency]] functions), or vice versa. (IOW, because of the function granularity, other code's performance is affected.) If a compiler wants to implement dependency tracking just for a few constructs (e.g., operators -> + ...) and use barriers otherwise, then this decision must be compatible with how all this is handled in other compilation units. Thus, compiler optimizations effectively become part of the ABI, which doesn't seem right. I hope these examples illustrate my concerns about the implementability in practice of this. It's also why I've suggested to move from an opt-out approach as in the current standard (ie, with kill_dependency()) to an opt-in approach for conservative dependency tracking (e.g., with a preserve_dependencies(exp) call, where exp will not be optimized in a way that removes any dependencies). This wouldn't help with many optimizations being prevented, but it should at least help programmers contain the problem to smaller regions of code. I'm not aware of any implementation that tries to track dependencies, so I can't give any real performance numbers. This could perhaps be simulated, but I'm not sure whether a realistic case would be made without at least supporting [[carries_dependency]] properly in the compiler, which would be some work. 
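For reference, the opt-out scheme being criticized looks like this in C11 (a sketch with hypothetical names; `kill_dependency()` is the standard's opt-out macro from `<stdatomic.h>`):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int a; int b; };

_Atomic(struct node *) gp;

/* Loads that carry a dependency from the mo_consume load are ordered
 * after it; kill_dependency() explicitly ends the dependency chain.
 * Note this shows the specified model, not typical generated code:
 * shipping compilers generally implement mo_consume by promoting it
 * to mo_acquire rather than tracking dependencies. */
int reader(void)
{
    struct node *p = atomic_load_explicit(&gp, memory_order_consume);
    if (p == NULL)
        return -1;
    int x = p->a;                  /* carries a dependency from gp   */
    int y = kill_dependency(p)->b; /* dependency deliberately killed */
    return x + y;
}
```

The proposed opt-in alternative would invert this: no dependency tracking by default, with something like a hypothetical `preserve_dependencies(p)` marking the few expressions whose dependencies must survive.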
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 08:55 -0800, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: > > > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel > > > wrote: > > > > > > > > There's an underlying problem here that's independent from the actual > > > > instance that you're worried about here: "no sense" is a ultimately a > > > > matter of taste/objectives/priorities as long as the respective > > > > specification is logically consistent. > > > > > > Yes. But I don't think it's "independent". > > > > > > Exactly *because* some people will read standards without applying > > > "does the resulting code generation actually make sense for the > > > programmer that wrote the code", the standard has to be pretty clear. > > > > > > The standard often *isn't* pretty clear. It wasn't clear enough when > > > it came to "volatile", and yet that was a *much* simpler concept than > > > atomic accesses and memory ordering. > > > > > > And most of the time it's not a big deal. But because the C standard > > > generally tries to be very portable, and cover different machines, > > > there tends to be a mindset that anything inherently unportable is > > > "undefined" or "implementation defined", and then the compiler writer > > > is basically given free reign to do anything they want (with > > > "implementation defined" at least requiring that it is reliably the > > > same thing). > > > > Yes, that's how it works in general. And this makes sense, because all > > optimizations rely on that. Second, you can't keep something consistent > > (eg, between compilers) if it isn't specified. So if we want stricter > > rules, those need to be specified somewhere. > > > > > And when it comes to memory ordering, *everything* is basically > > > non-portable, because different CPU's very much have different rules. 
> > > > Well, the current set of memory orders (and the memory model as a whole) > > is portable, even though it might not allow to exploit all hardware > > properties, and thus might perform sub-optimally in some cases. > > > > > I worry that that means that the standard then takes the stance that > > > "well, compiler re-ordering is no worse than CPU re-ordering, so we > > > let the compiler do anything". And then we have to either add > > > "volatile" to make sure the compiler doesn't do that, or use an overly > > > strict memory model at the compiler level that makes it all pointless. > > > > Using "volatile" is not a good option, I think, because synchronization > > between threads should be orthogonal to observable output of the > > abstract machine. > > Are you thinking of "volatile" -instead- of atomics? My belief is that > given the current standard there will be times that we need to use > "volatile" -in- -addition- to atomics. No, I was talking about having to use "volatile" in addition to atomics to get synchronization properties for the atomics. ISTM that this would be unfortunate because the objective for "volatile" (ie, declaring what is output of the abstract machine) is orthogonal to synchronization between threads. So if we want to preserve control dependencies and get ordering guarantees through that, I wouldn't prefer it through hacks based on volatile. For example, if we have a macro or helper function that contains synchronization code, but the compiler can prove that the macro/function is used just on data only accessible to a single thread, then the "volatile" notation on it would prevent optimizations. > > The current memory model might not allow to exploit all hardware > > properties, I agree. > > > > But then why don't we work on how to extend it to do so? We need to > > specify the behavior we want anyway, and this can't be independent of > > the language semantics, so it has to be conceptually integrated with the > > standard anyway. 
> > > > > So I really really hope that the standard doesn't give compiler > > > writers free hands to do anything that they can prove is "equivalent" > > > in the virtual C machine model. > > > > It does, but it also doesn't mean this can't be extended. So let's > > focus on whether we can find an extension. > > > > > That's not how you get reliable > > > results. > > > > In this general form, that's obviously a false claim. > > These two sentences starkly illustrate the difference in perspective > between you two. You are talking past each other. Not sure how to fix > this at the moment, but what else is new? ;-) Well, we're still talking, and we are making little steps in clarifying things, so there is progress :)
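The macro/helper argument made earlier in this mail (volatile would forbid optimizing synchronization code even when the compiler can prove the data is thread-private) can be sketched as follows; `bump` and `local_use` are hypothetical names:

```c
#include <stdatomic.h>

/* Hypothetical helper containing synchronization code. */
static inline void bump(_Atomic int *c)
{
    atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}

int local_use(void)
{
    _Atomic int c = 0;  /* never escapes: provably thread-private */
    bump(&c);
    bump(&c);
    bump(&c);
    /* Because c cannot be observed by any other thread, a compiler
     * may reduce this whole function to "return 3".  If bump() were
     * required to use volatile accesses, every read-modify-write
     * would have to be emitted, even though nothing can observe them. */
    return atomic_load_explicit(&c, memory_order_relaxed);
}
```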
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel wrote: > > The standard is clear on what's required. I strongly suggest reading > > the formalization of the memory model by Batty et al. > > Can you point to it? Because I can find a draft standard, and it sure > as hell does *not* contain any clarity of the model. It has a *lot* of > verbiage, but it's pretty much impossible to actually understand, even > for somebody who really understands memory ordering. http://www.cl.cam.ac.uk/~mjb220/n3132.pdf This has an explanation of the model up front, and then the detailed formulae in Section 6. This is from 2010, and there might have been smaller changes since then, but I'm not aware of any bigger ones. The cppmem tool is based on this, so if you want to play around with a few code examples, it's pretty nice because it shows you all allowed executions, including graphs like in the paper (see elsewhere in this thread for CAS syntax): http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 16:09 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 14:32 -0800, > > > >> Stop claiming it "can return 1".. It *never* returns 1 unless you do > >> the load and *verify* it, or unless the load itself can be made to go > >> away. And with the code sequence given, that just doesn't happen. END > >> OF STORY. > > > > void foo(); > > { > > atomic x = 1; > > if (atomic_load(&x, mo_relaxed) == 1) > > atomic_store(&y, 3, mo_relaxed)); > > } > > This is the very example I gave, where the real issue is not that "you > prove that load returns 1", you instead say "store followed by a load > can be combined". > > I (in another email I just wrote) tried to show why the "prove > something is true" is a very dangerous model. Seriously, it's pure > crap. It's broken. I don't see anything dangerous in the example above with the language semantics as specified: It's a well-defined situation, given the rules of the language. I replied to the other email you wrote with my viewpoint on why the above is useful, how it compares to what you seem to what, and where I think we need to start to bridge the gap. > If the C standard defines atomics in terms of "provable equivalence", > it's broken. Exactly because on a *virtual* machine you can prove > things that are not actually true in a *real* machine. For the control dependencies you have in mind, it's actually the other way around. You expect the real machine's properties in a program whose semantics only give you the virtual machine's properties. Anything you prove on the virtual machine will be true on the real machine (in a correct implementation) -- but you can't expect to have real-machine properties on language that's based on the virtual machine. > I have the > example of value speculation changing the memory ordering model of the > actual machine. This example is not true for the language as specified. 
It is true for a modified language that you have in mind, but for this one I've just seen pretty rough rules so far. Please see my other reply.
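The example quoted in this mail has a stray semicolon after `foo()` and an unbalanced parenthesis; as valid C11 it reads as follows (with `y` made a parameter here so the sketch is self-contained):

```c
#include <stdatomic.h>

/* x is local and never escapes, so the compiler may prove that the
 * relaxed load returns 1 and emit the store to *y unconditionally.
 * Under the model this is a well-defined "store followed by load can
 * be combined" transformation, which is the point being argued. */
void foo(_Atomic int *y)
{
    _Atomic int x = 1;
    if (atomic_load_explicit(&x, memory_order_relaxed) == 1)
        atomic_store_explicit(y, 3, memory_order_relaxed);
}
```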
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 19:42 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds > wrote: > > > > As far as I can tell, the intent is that you can't do value > > speculation (except perhaps for the "relaxed", which quite frankly > > sounds largely useless). > > Hmm. The language I see for "consume" is not obvious: > > "Consume operation: no reads in the current thread dependent on the > value currently loaded can be reordered before this load" I can't remember seeing that language in the standard (ie, C or C++). Where is this from? > and it could make a compiler writer say that value speculation is > still valid, if you do it like this (with "ptr" being the atomic > variable): > > value = ptr->val; I assume the load from ptr has mo_consume ordering? > into > > tmp = ptr; > value = speculated.value; > if (unlikely(tmp != &speculated)) > value = tmp->value; > > which is still bogus. The load of "ptr" does happen before the load of > "value = speculated->value" in the instruction stream, but it would > still result in the CPU possibly moving the value read before the > pointer read at least on ARM and power. And surprise, in the C/C++ model the load from ptr is sequenced-before the load from speculated, but there's no ordering constraint on the reads-from relation for the value load if you use mo_consume on the ptr load. Thus, the transformed code has less ordering constraints than the original code, and we arrive at the same outcome. > So if you're a compiler person, you think you followed the letter of > the spec - as far as *you* were concerned, no load dependent on the > value of the atomic load moved to before the atomic load. No. Because the wobbly sentence you cited(?) above is not what the standard says. Would you please stop making claims about what compiler writers would do or not if you seemingly aren't even familiar with the model that compiler writers would use to reason about transformations? Seriously? 
> You go home, > happy, knowing you've done your job. Never mind that you generated > code that doesn't actually work. > > I dread having to explain to the compiler person that he may be right > in some theoretical virtual machine, but the code is subtly broken and > nobody will ever understand why (and likely not be able to create a > test-case showing the breakage). > > But maybe the full standard makes it clear that "reordered before this > load" actually means on the real hardware, not just in the generated > instruction stream. Do you think everyone else is stupid? If there's an ordering constraint in the virtual machine, it better be present when executing in the real machine unless it provably cannot result in different output as specified by the language's semantics. > Reading it with understanding of the *intent* and > understanding all the different memory models that requirement should > be obvious (on alpha, you need an "rmb" instruction after the load), > but ... The standard is clear on what's required. I strongly suggest reading the formalization of the memory model by Batty et al.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote: > On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote: > > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel wrote: > > > > > > Which example do you have in mind here? Haven't we resolved all the > > > debated examples, or did I miss any? > > > > Well, Paul seems to still think that the standard possibly allows > > speculative writes or possibly value speculation in ways that break > > the hardware-guaranteed orderings. > > It is not that I know of any specific problems, but rather that I > know I haven't looked under all the rocks. Plus my impression from > my few years on the committee is that the standard will be pushed to > the limit when it comes time to add optimizations. > > One example that I learned about last week uses the branch-prediction > hardware to validate value speculation. And no, I am not at all a fan > of value speculation, in case you were curious. However, it is still > an educational example. > > This is where you start: > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > do_something(p->a, p->b, p->c); > p->d = 1; I assume that's the source code. > Then you leverage branch-prediction hardware as follows: > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > if (p == GUESS) { > do_something(GUESS->a, GUESS->b, GUESS->c); > GUESS->d = 1; > } else { > do_something(p->a, p->b, p->c); > p->d = 1; > } I assume that this is a potential transformation by a compiler. > The CPU's branch-prediction hardware squashes speculation in the case where > the guess was wrong, and this prevents the speculative store to ->d from > ever being visible. However, the then-clause breaks dependencies, which > means that the loads -could- be speculated, so that do_something() gets > passed pre-initialization values. > > Now, I hope and expect that the wording in the standard about dependency > ordering prohibits this sort of thing. 
> But I do not yet know for certain. The transformation would be incorrect. p->a in the source code carries a dependency, and as you say, the transformed code wouldn't have that dependency any more. So the transformed code would lose ordering constraints that it has in the virtual machine; in the absence of other proofs of correctness based on properties not shown in the example, the transformed code would not result in the same behavior as allowed by the abstract machine. If the transformation were actually made by a programmer, then this wouldn't do the same as the first example because mo_consume doesn't work through the if statement. Are there other specific concerns that you have regarding this example?
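Paul's branch-prediction example, rendered as a compilable C11 sketch (hypothetical definitions for `gp`, `do_something`, and the predicted value `GUESS`):

```c
#include <stdatomic.h>

struct obj { int a, b, c, d; };

_Atomic(struct obj *) gp;
struct obj *GUESS;              /* hypothetical: the predicted pointer value */

static void do_something(int a, int b, int c) { (void)a; (void)b; (void)c; }

void source_version(void)
{
    struct obj *p = atomic_load_explicit(&gp, memory_order_consume);
    do_something(p->a, p->b, p->c);     /* these loads carry a dependency from p */
    p->d = 1;
}

void transformed_version(void)
{
    struct obj *p = atomic_load_explicit(&gp, memory_order_consume);
    if (p == GUESS) {
        /* dependency chain broken: these accesses use GUESS, not p, so the
         * hardware may speculate the loads before the consume load completes */
        do_something(GUESS->a, GUESS->b, GUESS->c);
        GUESS->d = 1;
    } else {
        do_something(p->a, p->b, p->c);
        p->d = 1;
    }
}
```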
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel wrote: > > > > There's an underlying problem here that's independent from the actual > > instance that you're worried about here: "no sense" is ultimately a > > matter of taste/objectives/priorities as long as the respective > > specification is logically consistent. > > Yes. But I don't think it's "independent". > > Exactly *because* some people will read standards without applying > "does the resulting code generation actually make sense for the > programmer that wrote the code", the standard has to be pretty clear. > > The standard often *isn't* pretty clear. It wasn't clear enough when > it came to "volatile", and yet that was a *much* simpler concept than > atomic accesses and memory ordering. > > And most of the time it's not a big deal. But because the C standard > generally tries to be very portable, and cover different machines, > there tends to be a mindset that anything inherently unportable is > "undefined" or "implementation defined", and then the compiler writer > is basically given free reign to do anything they want (with > "implementation defined" at least requiring that it is reliably the > same thing). Yes, that's how it works in general. And this makes sense, because all optimizations rely on that. Second, you can't keep something consistent (eg, between compilers) if it isn't specified. So if we want stricter rules, those need to be specified somewhere. > And when it comes to memory ordering, *everything* is basically > non-portable, because different CPU's very much have different rules. Well, the current set of memory orders (and the memory model as a whole) is portable, even though it might not allow exploiting all hardware properties, and thus might perform sub-optimally in some cases.
> I worry that that means that the standard then takes the stance that > "well, compiler re-ordering is no worse than CPU re-ordering, so we > let the compiler do anything". And then we have to either add > "volatile" to make sure the compiler doesn't do that, or use an overly > strict memory model at the compiler level that makes it all pointless. Using "volatile" is not a good option, I think, because synchronization between threads should be orthogonal to observable output of the abstract machine. The current memory model might not allow exploiting all hardware properties, I agree. But then why don't we work on how to extend it to do so? We need to specify the behavior we want anyway, and this can't be independent of the language semantics, so it has to be conceptually integrated with the standard anyway. > So I really really hope that the standard doesn't give compiler > writers free hands to do anything that they can prove is "equivalent" > in the virtual C machine model. It does, but it also doesn't mean this can't be extended. So let's focus on whether we can find an extension. > That's not how you get reliable > results. In this general form, that's obviously a false claim.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > And exactly because I know enough, I would *really* like atomics to be > well-defined, and have very clear - and *local* - rules about how they > can be combined and optimized. "Local"? > None of this "if you can prove that the read has value X" stuff. And > things like value speculation should simply not be allowed, because > that actually breaks the dependency chain that the CPU architects give > guarantees for. Instead, make the rules be very clear, and very > simple, like my suggestion. You can never remove a load because you > can "prove" it has some value, but you can combine two consecutive > atomic accesses. Sorry, but the rules *are* very clear. I *really* suggest looking at the formalization by Batty et al. And in these rules, proving that a read will always return value X has a well-defined meaning, and you can use it. That simply follows from how the model is built. What you seem to want just isn't covered by the model as it is today -- you can't infer from that that the model itself would be wrong. The dependency chains aren't modeled in the way you envision it (except in what consume_mo tries, but that seems to be hard to implement); they are there on the level of the program logic as modeled by the abstract machine and the respective execution/output rules, but they are not built to represent those specific ordering guarantees the HW gives you. I would also be cautious claiming that the rules you suggested would be very clear and very simple. I haven't seen a memory model spec from you that would be viable as the standard model for C/C++, nor have I seen proof that this would actually be easier to understand for programmers in general. > For example, CPU people actually do tend to give guarantees for > certain things, like stores that are causally related being visible in > a particular order. Yes, but that's not part of the model so far.
If you want to exploit this, please make a suggestion for how to extend the model to cover this. You certainly expect compilers to actually optimize code, which usually removes or reorders stuff. Now, we could say that we just don't want that for atomic accesses, but even this isn't as clear-cut: How do we actually distinguish (1) a dependency in non-atomic code that the programmer intends to be preserved as a dependency in the generated code from (2) a dependency that is just program logic? IOW, consider the consume_mo example "*(p + flag - flag)", where flag is coming from a consume_mo load: Can you give a complete set of rules (for a full program) for when the compiler is allowed to optimize out flag-flag and when not? (And it should be practically implementable in a compiler, and not prevent optimizations where people expect them.) > If the compiler starts doing value speculation on > atomic accesses, you are quite possibly breaking things like that. > It's just not a good idea. Don't do it. Write the standard so that it > clearly is disallowed. You never want value speculation for anything possibly originating from an atomic load? Or what are the concrete rules you have in mind? > Because you may think that a C standard is machine-independent, but > that isn't really the case. The people who write code still write code > for a particular machine. Our code works (in the general case) on > different byte orderings, different register sizes, different memory > ordering models. But in each *instance* we still end up actually > coding for each machine. That's how *you* do it. I'm sure you are aware that for lots of programmers, having a machine-dependent standard is not helpful at all. As the standard is written, code is portable. It can certainly be beneficial for programmers to optimize for different machines, but the core of the memory model will always remain portable, because that's what arguably most programmers need.
That doesn't prevent machine-dependent extensions, but those then should be well integrated with the rest of the model. > So the rules for atomics should be simple and *specific* enough that > when you write code for a particular architecture, you can take the > architecture memory ordering *and* the C atomics orderings into > account, and do the right thing for that architecture. That would be an extension, but not viable as a *general* requirement as far as the standard is concerned. > And that very much means that doing things like value speculation MUST > NOT HAPPEN. See? Even if you can prove that your code is "equivalent", > it isn't. I'm kind of puzzled why you keep stating generalizations of assertions that clearly aren't well-founded (ie, which might be true for the kernel or your wishes for how the world should be, but aren't true in general or for the objectives of others). It's not helpful to just pretend other people's opinions or worlds didn't exist in discussions such as this one. It's clear that the standard
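The `*(p + flag - flag)` idiom mentioned above, written out as a compilable sketch (hypothetical globals; `flag` comes from a consume load and is folded into the address computation solely to carry the dependency):

```c
#include <stdatomic.h>

int data[2] = {10, 20};
_Atomic int flag_var;

int read_dependent(void)
{
    int flag = atomic_load_explicit(&flag_var, memory_order_consume);
    int *p = data;
    /* flag - flag is always 0, but syntactically the address computation
     * depends on the consume load; the open question in the text is under
     * which rules a compiler may optimize "+ flag - flag" away and thereby
     * drop the dependency ordering. */
    return *(p + flag - flag);
}
```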
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 14:47 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 2:25 PM, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote: > >> > >> The argument was that an lvalue doesn't actually "access" the memory > >> (an rvalue does), so this: > >> > >>volatile int *p = ...; > >> > >>*p; > >> > >> doesn't need to generate a load from memory, because "*p" is still an > >> lvalue (since you could assign things to it). > >> > >> This isn't an issue in C, because in C, expression statements are > >> always rvalues, but C++ changed that. > > > > Huhh. I can see the problems that this creates in terms of C/C++ > > compatibility. > > That's not the biggest problem. > > The biggest problem is that you have compiler writers that don't care > about sane *use* of the features they write a compiler for, they just > care about the standard. > > So they don't care about C vs C++ compatibility. Even more > importantly, they don't care about the *user* that uses only C++ and > the fact that their reading of the standard results in *meaningless* > behavior. They point to the standard and say "that's what the standard > says, suck it", and silently generate code (or in this case, avoid > generating code) that makes no sense. There's an underlying problem here that's independent from the actual instance that you're worried about here: "no sense" is ultimately a matter of taste/objectives/priorities as long as the respective specification is logically consistent. If you want to be independent of your sanity being different from other people's sanity (e.g., compiler writers), you need to make sure that the specification is precise and says what you want. IOW, think about the specification being the program, and the people being computers; you better want a well-defined program in this case.
> So it's not about C++ being incompatible with C, it's about C++ having > insane and bad semantics unless you just admit that "oh, ok, I need to > not just read the standard, I also need to use my brain, and admit > that a C++ statement expression needs to act as if it is an "access" > wrt volatile variables". 1) I agree that (IMO) a good standard strives for being easy to understand. 2) In practice, there is a trade-off between "easy to understand" and actually producing a specification. A standard is not a tutorial. And that's for good reason, because (a) there might be more than one way to teach something and that should be allowed, and (b) the standard should carry the full precision but still be compact enough to be manageable. 3) Implementations can try to be nice to users by helping them avoid error-prone corner cases and the like. A warning for common problems is such a case. But an implementation has to draw a line somewhere, demarcating cases where it fully exploits what the standard says (eg, to allow optimizations) from cases where it is more conservative and does what the standard allows but in a potentially more intuitive way. That's especially the case if it's being asked to produce high-performance code. 4) There will be arguments for where the line actually is, simply because different users will have different goals. 5) The way to reduce 4) is to either make the standard more specific, or to provide better user documentation. If the standard has strict requirements, then there will be less misunderstanding. 6) To achieve 5), one way is to get involved in the standards process.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 14:32 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 2:09 PM, Torvald Riegel wrote: > > On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote: > >> > > >> > if (atomic_load(&x, mo_relaxed) == 1) > >> > atomic_store(&y, 3, mo_relaxed)); > >> > >> No, please don't use this idiotic example. It is wrong. > > > > It won't be useful in practice in a lot of cases, but that doesn't mean > > it's wrong. It's clearly not illegal code. It also serves a purpose: a > > simple example to reason about a few aspects of the memory model. > > It's not illegal code, but if you claim that you can make that store > unconditional, it's a pointless and wrong example. > > >> The fact is, if a compiler generates anything but the obvious sequence > >> (read/cmp/branch/store - where branch/store might obviously be done > >> with some other machine conditional like a predicate), the compiler is > >> wrong. > > > > Why? I've reasoned why (1) to (3) above allow in certain cases (i.e., > > the first load always returning 1) for the branch (or other machine > > conditional) to not be emitted. So please either poke holes into this > > reasoning, or clarify that you don't in fact, contrary to what you wrote > > above, agree with (1) to (3). > > The thing is, the first load DOES NOT RETURN 1. It returns whatever > that memory location contains. End of story. The memory location is just an abstraction for state, if it's not volatile. > Stop claiming it "can return 1".. It *never* returns 1 unless you do > the load and *verify* it, or unless the load itself can be made to go > away. And with the code sequence given, that just doesn't happen. END > OF STORY. void foo() { atomic x = 1; if (atomic_load(&x, mo_relaxed) == 1) atomic_store(&y, 3, mo_relaxed); } This is a counterexample to your claim, and yes, the compiler has proof that x is 1. It's deliberately simple, but I can replace this with other more advanced situations.
For example, if x comes out of malloc (or, on the kernel side, something else that returns non-aliasing memory) and hasn't provably escaped to other threads yet. I haven't posted this full example, but I've *clearly* said that *if* the compiler can prove that the load would always return 1, it can remove it. And it's simple to see why that's the case: If this holds, then in all allowed executions it would load from a known store, the relaxed_mo gives no further ordering guarantees, so we can just take the value, and we're good. > So your argument is *shit*. Why do you continue to argue it? Maybe because it isn't? Maybe you should try to at least trust that my intentions are good, even if distrusting my ability to reason. > I told you how that load can go away, and you agreed. But IT CANNOT GO > AWAY any other way. You cannot claim "the compiler knows". The > compiler doesn't know. It's that simple. Oh yes it can. Because of the same rules that allow you to perform the other transformations. Please try to see the similarities here. You previously said you don't want to mix volatile semantics and atomics. This is something that's being applied in this example.
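A compilable rendering of the malloc-based variant mentioned above (a sketch; `y` is assumed to be a global atomic):

```c
#include <stdatomic.h>
#include <stdlib.h>

_Atomic int y;

void foo(void)
{
    /* x is freshly allocated and has not escaped to any other thread, so
     * the compiler can prove that the relaxed load below always returns 1
     * and may therefore drop the branch and make the store unconditional. */
    _Atomic int *x = malloc(sizeof *x);
    atomic_init(x, 1);
    if (atomic_load_explicit(x, memory_order_relaxed) == 1)
        atomic_store_explicit(&y, 3, memory_order_relaxed);
    free(x);
}
```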
The important part is that it has left a > "trace" for the programmer: because 'a' contains the value, the > programmer can now look at the value later and say "oh, we know we did > a store iff a was 1" > > >> This sequence: > >> > >> atomic_store(&x, 1, mo_relaxed); > >> a = atomic_load(&x, mo_relaxed); > >> atomic_store(&y, 3, mo_relaxed); > >> > >> is actually - and very seriously - buggy. > >> > >> Why? Because you have effectively split the atomic_load into two loads > >> - one for the value of 'a', and one for your '
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 1:21 PM, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote: > >> and then it is read by people (compiler writers) that intentionally > >> try to mis-use the words and do language-lawyering ("that depends on > >> what the meaning of 'is' is"). > > > > That assumption about people working on compilers is a little too broad, > > don't you think? > > Let's just say that *some* are that way, and those are the ones that I > end up butting heads with. > > The sane ones I never have to argue with - point them at a bug, and > they just say "yup, bug". The insane ones say "we don't need to fix > that, because if you read this copy of the standards that have been > translated to chinese and back, it clearly says that this is > acceptable". > > >> The whole "lvalue vs rvalue expression > >> vs 'what is a volatile access'" thing for C++ was/is a great example > >> of that. > > > > I'm not aware of the details of this. > > The argument was that an lvalue doesn't actually "access" the memory > (an rvalue does), so this: > >volatile int *p = ...; > >*p; > > doesn't need to generate a load from memory, because "*p" is still an > lvalue (since you could assign things to it). > > This isn't an issue in C, because in C, expression statements are > always rvalues, but C++ changed that. Huhh. I can see the problems that this creates in terms of C/C++ compatibility. > The people involved with the C++ > standards have generally been totally clueless about their subtle > changes. This isn't a fair characterization. There are many people that do care, and certainly not all are clueless. But it's a limited set of people, bugs happen, and not all of them will have the same goals. I think one way to prevent such problems in the future could be to have someone in the kernel community volunteer to look through standard revisions before they are published. 
The standard needs to be fixed, because compilers need to conform to the standard (e.g., a compiler's extension "fixing" the above wouldn't be conforming anymore because it emits more volatile reads than specified). Or maybe those of us working on the standard need to flag potential changes of interest to the kernel folks. But that may be less reliable than someone from the kernel side looking at them; I don't know.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 14:14 -0800, Paul E. McKenney wrote: > On Mon, Feb 17, 2014 at 09:39:54PM +0100, Richard Biener wrote: > > On February 17, 2014 7:18:15 PM GMT+01:00, "Paul E. McKenney" > > wrote: > > >On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote: > > >> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote: > > >> > You need volatile semantics to force the compiler to ignore any > > >proofs > > >> > it might otherwise attempt to construct. Hence all the > > >ACCESS_ONCE() > > >> > calls in my email to Torvald. (Hopefully I translated your example > > >> > reasonably.) > > >> > > >> My brain gave out for today; but it did appear to have the right > > >> structure. > > > > > >I can relate. ;-) > > > > > >> I would prefer it C11 would not require the volatile casts. It should > > >> simply _never_ speculate with atomic writes, volatile or not. > > > > > >I agree with not needing volatiles to prevent speculated writes. > > >However, > > >they will sometimes be needed to prevent excessive load/store > > >combining. > > >The compiler doesn't have the runtime feedback mechanisms that the > > >hardware has, and thus will need help from the developer from time > > >to time. > > > > > >Or maybe the Linux kernel simply waits to transition to C11 relaxed > > >atomics > > >until the compiler has learned to be sufficiently conservative in its > > >load-store combining decisions. > > > > Sounds backwards. Currently the compiler does nothing to the atomics. I'm > > sure we'll eventually add something. But if testing coverage is zero > > outside then surely things get worse, not better with time. > > Perhaps we solve this chicken-and-egg problem by creating a test suite? Perhaps. The test suite might also be a good set of examples showing which cases we expect to be optimized in a certain way, and which not. I suppose the uses of (the equivalent of) atomics in the kernel would be a good start.
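A minimal sketch of what one entry in such a test suite might look like (hypothetical, not from any actual suite): a message-passing test in which a release store must make the relaxed data store visible to an acquire load:

```c
#include <stdatomic.h>
#include <pthread.h>

static _Atomic int data, flag;

static void *producer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    /* spin until the release store to flag becomes visible ... */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    /* ... after which the relaxed load of data must observe 42 */
    *(int *)arg = atomic_load_explicit(&data, memory_order_relaxed);
    return NULL;
}

int run_once(void)
{
    int seen = 0;
    pthread_t p, c;
    atomic_store(&data, 0);
    atomic_store(&flag, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

A conforming compiler must not combine or reorder the two stores in `producer()` in a way that lets `consumer()` observe `flag == 1` but `data == 0`.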
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote: > On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel wrote: > > > > I think the example is easy to misunderstand, because the context isn't > > clear. Therefore, let me first try to clarify the background. > > > > (1) The abstract machine does not write speculatively. > > (2) Emitting a branch instruction and executing a branch at runtime is > > not part of the specified behavior of the abstract machine. Of course, > > the abstract machine performs conditional execution, but that just > > specifies the output / side effects that it must produce (e.g., volatile > > stores) -- not with which hardware instructions it is producing this. > > (3) A compiled program must produce the same output as if executed by > > the abstract machine. > > Ok, I'm fine with that. > > > Thus, we need to be careful what "speculative store" is meant to refer > > to. A few examples: > > > > if (atomic_load(&x, mo_relaxed) == 1) > > atomic_store(&y, 3, mo_relaxed)); > > No, please don't use this idiotic example. It is wrong. It won't be useful in practice in a lot of cases, but that doesn't mean it's wrong. It's clearly not illegal code. It also serves a purpose: a simple example to reason about a few aspects of the memory model. > The fact is, if a compiler generates anything but the obvious sequence > (read/cmp/branch/store - where branch/store might obviously be done > with some other machine conditional like a predicate), the compiler is > wrong. Why? I've reasoned why (1) to (3) above allow in certain cases (i.e., the first load always returning 1) for the branch (or other machine conditional) to not be emitted. So please either poke holes into this reasoning, or clarify that you don't in fact, contrary to what you wrote above, agree with (1) to (3). > Anybody who argues anything else is wrong, or confused, or confusing. I appreciate your opinion, and maybe I'm just one of the three things above (my vote is on "confusing"). 
But without you saying why doesn't help me see what's the misunderstanding here. > Instead, argue about *other* sequences where the compiler can do something. I'd prefer if we could clarify the misunderstanding for the simple case first that doesn't involve stronger ordering requirements in the form of non-relaxed MOs. > For example, this sequence: > >atomic_store(&x, a, mo_relaxed); >b = atomic_load(&x, mo_relaxed); > > can validly be transformed to > >atomic_store(&x, a, mo_relaxed); >b = (typeof(x)) a; > > and I think everybody agrees about that. In fact, that optimization > can be done even for mo_strict. Yes. > But even that "obvious" optimization has subtle cases. What if the > store is relaxed, but the load is strict? You can't do the > optimization without a lot of though, because dropping the strict load > would drop an ordering point. So even the "store followed by exact > same load" case has subtle issues. Yes if a compiler wants to optimize that, it has to give it more thought. My gut feeling is that either the store should get the stronger ordering, or the accesses should be merged. But I'd have to think more about that one (which I can do on request). > With similar caveats, it is perfectly valid to merge two consecutive > loads, and to merge two consecutive stores. > > Now that means that the sequence > > atomic_store(&x, 1, mo_relaxed); > if (atomic_load(&x, mo_relaxed) == 1) > atomic_store(&y, 3, mo_relaxed); > > can first be optimized to > > atomic_store(&x, 1, mo_relaxed); > if (1 == 1) > atomic_store(&y, 3, mo_relaxed); > > and then you get the end result that you wanted in the first place > (including the ability to re-order the two stores due to the relaxed > ordering, assuming they can be proven to not alias - and please don't > use the idiotic type-based aliasing rules). > > Bringing up your first example is pure and utter confusion. Sorry if it was confusing. 
But then maybe we need to talk about it more, because it shouldn't be confusing if we agree on what the memory model allows and what not. I had originally picked the example because it was related to the example Paul/Peter brought up. > Don't do > it. Instead, show what are obvious and valid transformations, and then > you can bring up these kinds of combinations as "look, this is > obviously also correct". I have my doubts whether the best way to reason about the memory model is by thinking about specific compiler transformations. YM
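The store-followed-by-load merging discussed above, as a compilable sketch (hypothetical function names for the two forms):

```c
#include <stdatomic.h>

_Atomic int x;

int before(int a)
{
    atomic_store_explicit(&x, a, memory_order_relaxed);
    return atomic_load_explicit(&x, memory_order_relaxed);
}

int after(int a)
{
    /* the load has been forwarded from the store; under relaxed ordering
     * no other thread can distinguish this from before() */
    atomic_store_explicit(&x, a, memory_order_relaxed);
    return a;
}
```

As discussed in the thread, the transformation needs more care if the load uses a stronger ordering than the store, because dropping the load would drop an ordering point.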
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel wrote: > > > > Which example do you have in mind here? Haven't we resolved all the > > debated examples, or did I miss any? > > Well, Paul seems to still think that the standard possibly allows > speculative writes or possibly value speculation in ways that break > the hardware-guaranteed orderings. That's true, I just didn't see any specific examples so far. > And personally, I can't read standards paperwork. It is invariably > written in some basically impossible-to-understand lawyeristic mode, Yeah, it's not the most intuitive form for things like the memory model. > and then it is read by people (compiler writers) that intentionally > try to mis-use the words and do language-lawyering ("that depends on > what the meaning of 'is' is"). That assumption about people working on compilers is a little too broad, don't you think? I think that it is important to stick to a specification, in the same way that one wouldn't expect a compiler to magically make sense of a program in cases where its behavior is undefined. However, that of course doesn't include trying to exploit weasel-wording (BTW, both users and compiler writers try to do it). IMHO, weasel-wording in a standard is a problem in itself even if not exploited, and often it indicates that there is a real issue. There might be reasons to have weasel-wording (e.g., because there's no known better way to express it, as in the case of the not really precise no-out-of-thin-air rule today), but nonetheless those aren't ideal. > The whole "lvalue vs rvalue expression > vs 'what is a volatile access'" thing for C++ was/is a great example > of that. I'm not aware of the details of this. > So quite frankly, as a result I refuse to have anything to do with the > process directly. That's unfortunate. Then please work with somebody that isn't uncomfortable with participating directly in the process.
But be warned, it may very well be a person working on compilers :) Have you looked at the formalization of the model by Batty et al.? The overview of this is prose, but the formalized model itself is all formal relations and logic. So there should be no language-lawyering issues with that form. (For me, the formalized model is much easier to reason about.)
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 12:23 -0800, Paul E. McKenney wrote: > On Mon, Feb 17, 2014 at 08:55:47PM +0100, Torvald Riegel wrote: > > On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote: > > > On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel > > > wrote: > > > > > > > > I think a major benefit of C11's memory model is that it gives a > > > > *precise* specification for how a compiler is allowed to optimize. > > > > > > Clearly it does *not*. This whole discussion is proof of that. It's > > > not at all clear, > > > > It might not be an easy-to-understand specification, but as far as I'm > > aware it is precise. The Cambridge group's formalization certainly is > > precise. From that, one can derive (together with the usual rules for > > as-if etc.) what a compiler is allowed to do (assuming that the standard > > is indeed precise). My replies in this discussion have been based on > > reasoning about the standard, and not secret knowledge (with the > > exception of no-out-of-thin-air, which is required in the standard's > > prose but not yet formalized). > > > > I agree that I'm using the formalization as a kind of placeholder for > > the standard's prose (which isn't all that easy to follow for me > > either), but I guess there's no way around an ISO standard using prose. > > > > If you see a case in which the standard isn't precise, please bring it > > up or open a C++ CWG issue for it. > > I suggest that I go through the Linux kernel's requirements for atomics > and memory barriers and see how they map to C11 atomics. With that done, > we would have very specific examples to go over. Without that done, the > discussion won't converge very well. > > Seem reasonable? Sounds good!
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote: > On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel wrote: > > > > I think a major benefit of C11's memory model is that it gives a > > *precise* specification for how a compiler is allowed to optimize. > > Clearly it does *not*. This whole discussion is proof of that. It's > not at all clear, It might not be an easy-to-understand specification, but as far as I'm aware it is precise. The Cambridge group's formalization certainly is precise. From that, one can derive (together with the usual rules for as-if etc.) what a compiler is allowed to do (assuming that the standard is indeed precise). My replies in this discussion have been based on reasoning about the standard, and not secret knowledge (with the exception of no-out-of-thin-air, which is required in the standard's prose but not yet formalized). I agree that I'm using the formalization as a kind of placeholder for the standard's prose (which isn't all that easy to follow for me either), but I guess there's no way around an ISO standard using prose. If you see a case in which the standard isn't precise, please bring it up or open a C++ CWG issue for it. > and the standard apparently is at least debatably > allowing things that shouldn't be allowed. Which example do you have in mind here? Haven't we resolved all the debated examples, or did I miss any? > It's also a whole lot more > complicated than "volatile", so the likelihood of a compiler writer > actually getting it right - even if the standard does - is lower. It's not easy, that's for sure, but none of the high-performance alternatives are easy either. There are testing tools out there based on the formalization of the model, and we've found bugs with them. 
And the alternative of using something not specified by the standard is even worse, I think, because then you have to guess what a compiler might do, without having any constraints; IOW, one is resorting to "no sane compiler would do that", and that doesn't seem very robust either.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 18:59 +, Joseph S. Myers wrote: > On Sat, 15 Feb 2014, Torvald Riegel wrote: > > > glibc is a counterexample that comes to mind, although it's a smaller > > code base. (It's currently not using C11 atomics, but transitioning > > there makes sense, and something I want to get to eventually.) > > glibc is using C11 atomics (GCC builtins rather than _Atomic / > <stdatomic.h>, but using __atomic_* with explicitly specified memory model > rather than the older __sync_*) on AArch64, plus in certain cases on ARM > and MIPS. I think the major steps remaining are moving the other architectures over, and rechecking concurrent code (e.g., for the code that I have seen, it was either asm variants (e.g., on x86) or built before C11; ARM pthread_once was lacking memory barriers (see the "pthread_once unification" patches I posted)). We also need/should move towards using relaxed-MO atomic loads instead of plain loads.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 18:44 -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney > wrote: > > > > One way of looking at the discussion between Torvald and myself would be > > as a seller (Torvald) and a buyer (me) haggling over the fine print in > > a proposed contract (the standard). Whether that makes you feel better > > or worse about the situation I cannot say. ;-) > > Oh, I'm perfectly fine with that. But we're not desperate to buy, and > I actually think the C11 people are - or at least should be - *way* > more desperate to sell their model to us than we are to take it. > > Which means that as a buyer you should say "this is what we want, if > you don't give us this, we'll just walk away". Not try to see how much > we can pay for it. Because there is very little upside for us, and > _unless_ the C11 standard gets things right it's just extra complexity > for us, coupled with new compiler fragility and years of code > generation bugs. I think there is an upside to you, mainly in that it allows compiler testing tools, potentially verification tools for atomics, and tools like cppmem that show allowed executions of code. I agree that the Linux community has been working well without this, and it's big enough to make running its own show viable. This will be different for smaller projects, though. > Why would we want that extra complexity and inevitable compiler bugs? > If we then have to fight compiler writers that point to the standard > and say "..but look, the standard says we can do this", then at that > point it went from "extra complexity and compiler bugs" to a whole > 'nother level of frustration and pain. I see your point, but the flip side of the coin is that if you get the standard to say what you want, then you can tell the compiler writers to look at the standard. Or show them bugs revealed by fuzz testing and such. > So just walk away unless the C11 standard gives us exactly what we > want.
Not "something kind of like what we'd use". EXACTLY. Because I'm > not in the least interested in fighting compiler people that have a > crappy standard they can point to. Been there, done that, got the > T-shirt and learnt my lesson. > > And the thing is, I suspect that the Linux kernel is the most complete > - and most serious - user of true atomics that the C11 people can sell > their solution to. I agree, but there are likely also other big projects that could make use of C11 atomics on the userspace side (e.g., certain databases, ...). > If we don't buy it, they have no serious user. I disagree with that. That obviously depends on one's definition of "serious", but if you combine all C/C++ programs that use low-level atomics, then this is serious use as well. There's lots of shared-memory synchronization in userspace as well. > Sure, they'll have lots > of random other one-off users for their atomics, where each user wants > one particular thing, but I suspect that we'll have the only really > unified portable code base glibc is a counterexample that comes to mind, although it's a smaller code base. (It's currently not using C11 atomics, but transitioning there makes sense, and something I want to get to eventually.) > that handles pretty much *all* the serious > odd cases that the C11 atomics can actually talk about to each other. You certainly have lots of odd cases, but I would disagree with the assumption that only the Linux kernel will do full "testing" of the implementations. If you have plenty of userspace programs using the atomics, that's a pretty big test suite, and one that should help bring the compilers up to speed. So that might be a benefit even to the Linux kernel if it would use the C11 atomics.
> Oh, they'll push things through with or without us, and it will be a > collection of random stuff, where they tried to please everybody, with > particularly compiler/architecture people who have no f*cking clue > about how their stuff is used pushing to make it easy/efficient for > their particular compiler/architecture. I'll ignore this... :) > But we have real optimized uses of pretty much all relevant cases that > people actually care about. You certainly cover a lot of cases. Finding out whether you cover all that "people care about" would require you to actually ask all people, which I'm sure you've done ;) > We can walk away from them, and not really lose anything but a small > convenience (and it's a convenience *only* if the standard gets things > right). > > And conversely, the C11 people can walk away from us too. But if they > can't make us happy (and by "make us happy", I really mean no stupid > games on our part) I personally think they'll have a stronger > standard, and a real use case, and real arguments. I'm assuming they > want that. I agree. > That's why I complain when you talk about things like marking control > dependencies explicitly. That's *us* bending over backwards. And as a > buyer, we have absolutely z
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 12:02 -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds > wrote: > > > > Why are we still discussing this idiocy? It's irrelevant. If the > > standard really allows random store speculation, the standard doesn't > > matter, and sane people shouldn't waste their time arguing about it. > > Btw, the other part of this coin is that our manual types (using > volatile and various architecture-specific stuff) and our manual > barriers and inline asm accesses are generally *fine*. AFAICT, it does work for you, but hasn't been exactly pain-free. I think a major benefit of C11's memory model is that it gives a *precise* specification for how a compiler is allowed to optimize. There is a formalization of the model, which allows things like the cppmem tool by the Cambridge group. It also allows meaningful fuzz testing: http://www.di.ens.fr/~zappa/projects/cmmtest/ ; this did reveal several GCC compiler bugs. I also think that reasoning about this model is easier than reasoning about how lots of different, concrete compiler optimizations would interact. > The C11 stuff doesn't buy us anything. The argument that "new > architectures might want to use it" is pure and utter bollocks, since > unless the standard gets the thing *right*, nobody sane would ever use > it for some new architecture, when the sane thing to do is to just > fill in the normal barriers and inline asms. > > So I'm very very serious: either the compiler and the standard gets > things right, or we don't use it. There is no middle ground where "we > might use it for one or two architectures and add random hints". > That's just stupid. > > The only "middle ground" is about which compiler version we end up > trusting _if_ it turns out that the compiler and standard do get > things right.
From Torvald's explanations (once I don't mis-read them > ;), my take-away so far has actually been that the standard *does* get > things right, but I do know from over-long personal experience that > compiler people sometimes want to be legalistic and twist the > documentation to the breaking point, at which point we just go "we'd > be crazy do use that". I agree that compilers want to optimize, and sometimes there's probably a little too much emphasis on applying an optimization vs. not surprising users. But we have to draw a line (e.g., what is undefined behavior and what is not), because we need this to actually be able to optimize. Therefore, we need to get the rules into a shape that both allows optimizations and isn't full of surprising corner cases. The rules are the standard, so it's the standard we have to get right. According to my experience, a lot of thought goes into how to design the standard's language and library so that they are intuitive yet efficient. If you see issues in the standard, please bring them up. Either report the defects directly and get involved yourself, or reach out to somebody that is participating in the standards process. The standard certainly isn't perfect, so there is room to contribute.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 11:50 -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney > wrote: > > > > Linus, Peter, any objections to marking places where we are relying on > > ordering from control dependencies against later stores? This approach > > seems to me to have significant documentation benefits. > > Quite frankly, I think it's stupid, and the "documentation" is not a > benefit, it's just wrong. I think the example is easy to misunderstand, because the context isn't clear. Therefore, let me first try to clarify the background. (1) The abstract machine does not write speculatively. (2) Emitting a branch instruction and executing a branch at runtime is not part of the specified behavior of the abstract machine. Of course, the abstract machine performs conditional execution, but that just specifies the output / side effects that it must produce (e.g., volatile stores) -- not with which hardware instructions it is producing this. (3) A compiled program must produce the same output as if executed by the abstract machine. Thus, we need to be careful what "speculative store" is meant to refer to. A few examples: if (atomic_load(&x, mo_relaxed) == 1) atomic_store(&y, 3, mo_relaxed); Here, the requirement is that in terms of program logic, y is assigned 3 if x equals 1. It's not specified how an implementation does that. * If the compiler can prove that x is always 1, then it can remove the branch. This is because of (2). Because of the proof, (1) is not violated. * If the compiler can prove that the store to y is never observed or does not change the program's output, the store can be removed. if (atomic_load(&x, mo_relaxed) == 1) { atomic_store(&y, 3, mo_relaxed); other_a(); } else { atomic_store(&y, 3, mo_relaxed); other_b(); } Here, y will be assigned to regardless of the value of x. * The compiler can hoist the store out of the two branches.
This is because the store and the branch instruction aren't observable outcomes of the abstract machine. * The compiler can even move the store to y before the load from x (unless this affects logical program order of this thread in some way.) This is because the load/store are ordered by sequenced-before (intra-thread), but mo_relaxed allows the hardware to reorder, so the compiler can do it as well (IOW, other threads can't expect a particular order). if (atomic_load(&x, mo_acquire) == 1) atomic_store(&y, 3, mo_relaxed); This is similar to the first case, but with stronger memory order. * If the compiler proves that x is always 1, then it does so by showing that the load will always be able to read from a particular store (or several of them) that (all) assigned 1 to x -- as specified by the abstract machine and taking the forward progress guarantees into account. In general, it still has to establish the synchronized-with edge if any of those stores used release_mo (or other fences resulting in the same situation), so it can't just get rid of the acquire "fence" in this case. (There are probably situations in which this can be done, but I can't characterize them easily at the moment.) These examples all rely on the abstract machine as specified in the current standard. In contrast, the example that Paul (and Peter, I assume) were looking at is not currently modeled by the standard. AFAIU, they want to exploit that control dependencies, when encountered in binary code, can result in the hardware giving certain ordering guarantees. This is vaguely similar to mo_consume which is about data dependencies. mo_consume is, partially due to how it's specified, pretty hard to implement for compilers in a way that actually exploits and preserves data dependencies and not just substitutes mo_consume for a stronger memory order.
Part of this problem is that the standard takes an opt-out approach regarding the code that should track dependencies (e.g., certain operators are specified as not preserving them), instead of cleanly carving out meaningful operators where one can track dependencies without obstructing generally useful compiler optimizations (i.e., "opt-in"). This leads to cases such as "*(p + f - f)", where the compiler either has to keep f - f or emit a stronger fence if f is originating from a mo_consume load. Furthermore, dependencies are supposed to be tracked across any load and store, so the compiler needs to do points-to analysis if it wants to optimize this as much as possible. Paul and I have been thinking about alternatives, and one of them was doing the opt-in by demarcating code that needs explicit dependency tracking because it wants to exploit mo_consume. Back to HW control dependencies, this led to the idea of marking the "control dependencies" in the source code (ie, on the abstract machine level), that need to be preserved in the generated binary code, even if they have no semantic meaning on the abstract machine level. So, this is something extra that isn't modeled in the standard curr
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 09:29 -0800, Paul E. McKenney wrote: > On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote: > > On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: > > [ . . . ] > > > > Another option would be to flag the conditional expression, prohibiting > > > the compiler from optimizing out any conditional branches. Perhaps > > > something like this: > > > > > > r1 = atomic_load(x, memory_order_control); > > > if (control_dependency(r1)) > > > atomic_store(y, memory_order_relaxed); > > > > That's the one I had in mind and talked to you about earlier today. My > > gut feeling is that this is preferable over the other because it "marks" > > the if-statement, so the compiler knows exactly which branches matter. > > I'm not sure one would need the other memory order for that, if indeed > > all you want is relaxed -> branch -> relaxed. But maybe there are > > corner cases (see the weaker-than-relaxed discussion in SG1 today). > > Linus, Peter, any objections to marking places where we are relying on > ordering from control dependencies against later stores? This approach > seems to me to have significant documentation benefits. Let me note that at least as I'm concerned, that's just a quick idea. At least I haven't looked at (1) how to properly specify the semantics of this, (2) whether it has any bad effects on unrelated code, (3) and whether there are pitfalls for compiler implementations. It looks not too bad at first glance, though.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 10:50 +0100, Peter Zijlstra wrote: > On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote: > > That depends on what your goal is. First, I don't know why you quoted that, but without the context, quoting it doesn't make sense. Let me repeat the point. The standard is the rule set for the compiler. Period. The compiler does not just serve the semantics that you might have in your head. It does have to do something meaningful for all of its users. Thus, the goal for the compiler is to properly compile programs in the language as specified. If there is a deficiency in the standard (bug or missing feature) -- and thus the specification, we need to have a patch for the standard that fixes this deficiency. If you think that this is the case, that's where you fix it. If your goal is to do wishful thinking, imagine some kind of semantics in your head, and then assume that magically, implementations will do just that, then that's bound to fail. > A compiler that we don't need to fight in order to generate sane code > would be nice. But as Linus said; we can continue to ignore you lot and > go on as we've done. I don't see why it's so hard to understand that you need to specify semantics, and the place (or at least the base) for that is the standard. Aren't you guys the ones replying "send a patch"? :) This isn't any different. If you're uncomfortable working with the standard, then say so, and reach out to people that aren't. You can surely ignore the specification of the language(s) that you are depending on. But that won't help you. If you want a change, get involved. (Oh, and claiming that the other side doesn't get it doesn't count as getting involved.) There's no fight between people here. It's just a technical problem that we have to solve in the right way. 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-12 at 10:19 +0100, Peter Zijlstra wrote: > > I don't know the specifics of your example, but from how I understand > > it, I don't see a problem if the compiler can prove that the store will > > always happen. > > > > To be more specific, if the compiler can prove that the store will > > happen anyway, and the region of code can be assumed to always run > > atomically (e.g., there's no loop or such in there), then it is known > > that we have one atomic region of code that will always perform the > > store, so we might as well do the stuff in the region in some order. > > > > Now, if any of the memory accesses are atomic, then the whole region of > > code containing those accesses is often not atomic because other threads > > might observe intermediate results in a data-race-free way. > > > > (I know that this isn't a very precise formulation, but I hope it brings > > my line of reasoning across.) > > So given something like: > > if (x) > y = 3; > > assuming both x and y are atomic (so don't gimme crap for not knowing > the C11 atomic incantations); and you can prove x is always true; you > don't see a problem with not emitting the conditional? That depends on what your goal is. It would be correct as far as the standard is specified; this makes sense if all you want is indeed a program that does what the abstract machine might do, and produces the same output / side effects. If you're trying to preserve the branch in the code emitted / executed by the implementation, then it would not be correct. But those branches aren't specified as being part of the observable side effects. In the common case, this makes sense because it enables optimizations that are useful; this line of reasoning also allows the compiler to merge some atomic accesses in the way that Linus would like to see it. > Avoiding the conditional changes the result; see that control dependency > email from earlier. It does not, regarding how the standard defines "result".
> In the above example the load of X and the store to > Y are strictly ordered, due to control dependencies. Not emitting the > condition and maybe not even emitting the load completely wrecks this. I think you're trying to solve this backwards. You are looking at this with an implicit wishlist of what the compiler should do (or how you want to use the hardware), but this is not a viable specification that one can write a compiler against. We do need clear rules for what the compiler is allowed to do or not (e.g., a memory model that models multi-threaded executions). Otherwise it's all hand-waving, and we're getting nowhere. Thus, the way to approach this is to propose a feature or change to the standard, make sure that this is consistent and has no unintended side effects for other aspects of compilation or other code, and then ask the compiler to implement it. IOW, we need a patch for where this all starts: in the rules and requirements for compilation. Paul and I are at the C++ meeting currently, and we had sessions in which the concurrency study group talked about memory model issues like dependency tracking and memory_order_consume. Paul shared uses of atomics (or likewise) in the kernel, and we discussed how the memory model currently handles various cases and why, how one could express other requirements consistently, and what is actually implementable in practice. I can't speak for Paul, but I thought those discussions were productive. > Its therefore an invalid optimization to take out the conditional or > speculate the store, since it takes out the dependency.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: > On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote: > > On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: > > > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > > > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > > > > wrote: > > > > > > > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > > > > combination with C11 atomics in most cases. The C11 atomics do cover > > > > > some of the reasons we currently use ACCESS_ONCE(), but not all of > > > > > them -- > > > > > in particular, it allows load/store merging. > > > > > > > > I really disagree with the "will need to use volatile". > > > > > > > > We should never need to use volatile (outside of whatever MMIO we do > > > > using C) if C11 defines atomics correctly. > > > > > > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > > > > it's called a cache - and there's no actual reason to think that > > > > "ACCESS_ONCE()" has to mean our current "volatile". > > > > > > > > Now, it's possible that the C standards simply get atomics _wrong_, so > > > > that they create visible semantics that are different from what a CPU > > > > cache already does, but that's a plain bug in the standard if so. > > > > > > > > But merging loads and stores is fine. And I *guarantee* it is fine, > > > > exactly because CPU's already do it, so claiming that the compiler > > > > couldn't do it is just insanity. > > > > > > Agreed, both CPUs and compilers can merge loads and stores. But CPUs > > > normally get their stores pushed through the store buffer in reasonable > > > time, and CPUs also use things like invalidations to ensure that a > > > store is seen in reasonable time by readers. Compilers don't always > > > have these two properties, so we do need to be more careful of load > > > and store merging by compilers. 
> > > > The standard's _wording_ is a little vague about forward-progress > > guarantees, but I believe the vast majority of the people involved do > > want compilers to not prevent forward progress. There is of course a > > difference whether a compiler establishes _eventual_ forward progress in > > the sense of after 10 years or forward progress in a small bounded > > interval of time, but this is a QoI issue, and good compilers won't want > > to introduce unnecessary latencies. I believe that it is fine if the > > standard merely talks about eventual forward progress. > > The compiler will need to earn my trust on this one. ;-) > > > > > Now, there are things that are *not* fine, like speculative stores > > > > that could be visible to other threads. Those are *bugs* (either in > > > > the compiler or in the standard), and anybody who claims otherwise is > > > > not worth discussing with. > > > > > > And as near as I can tell, volatile semantics are required in C11 to > > > avoid speculative stores. I might be wrong about this, and hope that > > > I am wrong. But I am currently not seeing it in the current standard. > > > (Though I expect that most compilers would avoid speculating stores, > > > especially in the near term. > > > > This really depends on how we define speculative stores. The memory > > model is absolutely clear that programs have to behave as if executed by > > the virtual machine, and that rules out speculative stores to volatiles > > and other locations. Under certain circumstances, there will be > > "speculative" stores in the sense that they will happen at different > > times as if you had a trivial implementation of the abstract machine. > > But to be allowed to do that, the compiler has to prove that such a > > transformation still fulfills the as-if rule. > > Agreed, although the as-if rule would ignore control dependencies, since > these are not yet part of the standard (as you in fact note below). 
> I nevertheless consider myself at least somewhat reassured that current > C11 won't speculate stores. My remaining concerns involve the compiler > proving to itself that a given branch is always taken, thus motivating > it to optimize the branch away -- though this is more properly a > control-dependency concern. > > > IOW,
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > > wrote: > > > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > > combination with C11 atomics in most cases. The C11 atomics do cover > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > > > in particular, it allows load/store merging. > > > > I really disagree with the "will need to use volatile". > > > > We should never need to use volatile (outside of whatever MMIO we do > > using C) if C11 defines atomics correctly. > > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > > it's called a cache - and there's no actual reason to think that > > "ACCESS_ONCE()" has to mean our current "volatile". > > > > Now, it's possible that the C standards simply get atomics _wrong_, so > > that they create visible semantics that are different from what a CPU > > cache already does, but that's a plain bug in the standard if so. > > > > But merging loads and stores is fine. And I *guarantee* it is fine, > > exactly because CPU's already do it, so claiming that the compiler > > couldn't do it is just insanity. > > Agreed, both CPUs and compilers can merge loads and stores. But CPUs > normally get their stores pushed through the store buffer in reasonable > time, and CPUs also use things like invalidations to ensure that a > store is seen in reasonable time by readers. Compilers don't always > have these two properties, so we do need to be more careful of load > and store merging by compilers. The standard's _wording_ is a little vague about forward-progress guarantees, but I believe the vast majority of the people involved do want compilers to not prevent forward progress. 
There is of course a difference whether a compiler establishes _eventual_ forward progress in the sense of after 10 years or forward progress in a small bounded interval of time, but this is a QoI issue, and good compilers won't want to introduce unnecessary latencies. I believe that it is fine if the standard merely talks about eventual forward progress. > > Now, there are things that are *not* fine, like speculative stores > > that could be visible to other threads. Those are *bugs* (either in > > the compiler or in the standard), and anybody who claims otherwise is > > not worth discussing with. > > And as near as I can tell, volatile semantics are required in C11 to > avoid speculative stores. I might be wrong about this, and hope that > I am wrong. But I am currently not seeing it in the current standard. > (Though I expect that most compilers would avoid speculating stores, > especially in the near term. This really depends on how we define speculative stores. The memory model is absolutely clear that programs have to behave as if executed by the virtual machine, and that rules out speculative stores to volatiles and other locations. Under certain circumstances, there will be "speculative" stores in the sense that they will happen at different times as if you had a trivial implementation of the abstract machine. But to be allowed to do that, the compiler has to prove that such a transformation still fulfills the as-if rule. IOW, the abstract machine is what currently defines disallowed speculative stores. If you want to put *further* constraints on what implementations are allowed to do, I suppose it is best to talk about those and see how we can add rules that allow programmers to express those constraints. For example, control dependencies might be such a case. I don't have a specific suggestion -- maybe the control dependencies are best tackled similar to consume dependencies (even though we don't have a good solution for those yet).
But using volatile accesses for that seems to be a big hammer, or even the wrong one. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote: > On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote: > > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote: > > > > > > Intuitively, this is wrong because this let's the program take a step > > > the abstract machine wouldn't do. This is different to the sequential > > > code that Peter posted because it uses atomics, and thus one can't > > > easily assume that the difference is not observable. > > > > Btw, what is the definition of "observable" for the atomics? > > > > Because I'm hoping that it's not the same as for volatiles, where > > "observable" is about the virtual machine itself, and as such volatile > > accesses cannot be combined or optimized at all. > > > > Now, I claim that atomic accesses cannot be done speculatively for > > writes, and not re-done for reads (because the value could change), > > but *combining* them would be possible and good. > > > > For example, we often have multiple independent atomic accesses that > > could certainly be combined: testing the individual bits of an atomic > > value with helper functions, causing things like "load atomic, test > > bit, load same atomic, test another bit". The two atomic loads could > > be done as a single load without possibly changing semantics on a real > > machine, but if "visibility" is defined in the same way it is for > > "volatile", that wouldn't be a valid transformation. Right now we use > > "volatile" semantics for these kinds of things, and they really can > > hurt. > > > > Same goes for multiple writes (possibly due to setting bits): > > combining multiple accesses into a single one is generally fine, it's > > *adding* write accesses speculatively that is broken by design.. > > > > At the same time, you can't combine atomic loads or stores infinitely > > - "visibility" on a real machine definitely is about timeliness. 
> > Removing all but the last write when there are multiple consecutive > > writes is generally fine, even if you unroll a loop to generate those > > writes. But if what remains is a loop, it might be a busy-loop > > basically waiting for something, so it would be wrong ("untimely") to > > hoist a store in a loop entirely past the end of the loop, or hoist a > > load in a loop to before the loop. > > > > Does the standard allow for that kind of behavior? > > You asked! ;-) > > So the current standard allows merging of both loads and stores, unless of > course ordring constraints prevent the merging. Volatile semantics may be > used to prevent this merging, if desired, for example, for real-time code. Agreed. > Infinite merging is intended to be prohibited, but I am not certain that > the current wording is bullet-proof (1.10p24 and 1.10p25). Yeah, maybe not. But it at least seems to rather clearly indicate the intent ;) > The only prohibition against speculative stores that I can see is in a > non-normative note, and it can be argued to apply only to things that are > not atomics (1.10p22). I think this one is specifically about speculative stores that would affect memory locations that the abstract machine would not write to, and that might be observable or create data races. While a compiler could potentially prove that such stores aren't leading to a difference in the behavior of the program (e.g., by proving that there are no observers anywhere and this isn't overlapping with any volatile locations), I think that this is hard in general and most compilers will just not do such things. In GCC, bugs in that category were fixed after researchers doing fuzz-testing found them (IIRC, speculative stores by loops). > I don't see any prohibition against reordering > a store to precede a load preceding a conditional branch -- which would > not be speculative if the branch was know to be taken and the load > hit in the store buffer. 
In a system where stores could be reordered, > some other CPU might perceive the store as happening before the load > that controlled the conditional branch. This needs to be addressed. I don't know the specifics of your example, but from how I understand it, I don't see a problem if the compiler can prove that the store will always happen. To be more specific, if the compiler can prove that the store will happen anyway, and the region of code can be assumed to always run atomically (e.g., there's no loop or such in there), then it is known that we have one atomic region of code that will always perform the store, so we might as well do
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote: > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote: > > > > Intuitively, this is wrong because this let's the program take a step > > the abstract machine wouldn't do. This is different to the sequential > > code that Peter posted because it uses atomics, and thus one can't > > easily assume that the difference is not observable. > > Btw, what is the definition of "observable" for the atomics? > > Because I'm hoping that it's not the same as for volatiles, where > "observable" is about the virtual machine itself, and as such volatile > accesses cannot be combined or optimized at all. No, atomics aren't an observable behavior of the abstract machine (unless they are volatile). See 1.8.p8 (citing the C++ standard). > Now, I claim that atomic accesses cannot be done speculatively for > writes, and not re-done for reads (because the value could change), Agreed, unless the compiler can prove that this doesn't make a difference in the program at hand and it's not volatile atomics. In general, that will be hard and thus won't happen often I suppose, but if correctly proved it would fall under the as-if rule I think. > but *combining* them would be possible and good. Agreed. > For example, we often have multiple independent atomic accesses that > could certainly be combined: testing the individual bits of an atomic > value with helper functions, causing things like "load atomic, test > bit, load same atomic, test another bit". The two atomic loads could > be done as a single load without possibly changing semantics on a real > machine, but if "visibility" is defined in the same way it is for > "volatile", that wouldn't be a valid transformation. Right now we use > "volatile" semantics for these kinds of things, and they really can > hurt. Agreed. 
In your example, the compiler would have to prove that the abstract machine would always be able to run the two loads atomically (ie, as one load) without running into impossible/disallowed behavior of the program. But if there's no loop or branch or such in-between, this should be straight-forward because any hardware oddity or similar could merge those loads and it wouldn't be disallowed by the standard (considering that we're talking about a finite number of loads), so the compiler would be allowed to do it as well. > Same goes for multiple writes (possibly due to setting bits): > combining multiple accesses into a single one is generally fine, it's > *adding* write accesses speculatively that is broken by design.. Agreed. As Paul points out, this being correct assumes that there are no other ordering guarantees or memory accesses "interfering", but if the stores are to the same memory location and adjacent to each other in the program, then I don't see a reason why they wouldn't be combinable. > At the same time, you can't combine atomic loads or stores infinitely > - "visibility" on a real machine definitely is about timeliness. > Removing all but the last write when there are multiple consecutive > writes is generally fine, even if you unroll a loop to generate those > writes. But if what remains is a loop, it might be a busy-loop > basically waiting for something, so it would be wrong ("untimely") to > hoist a store in a loop entirely past the end of the loop, or hoist a > load in a loop to before the loop. Agreed. That's what 1.10p24 and 1.10p25 are meant to specify for loads, although those might not be bullet-proof as Paul points out. Forward progress is rather vaguely specified in the standard, but at least parts of the committee (and people in ISO C++ SG1, in particular) are working on trying to improve this. > Does the standard allow for that kind of behavior? 
I think the standard requires (or intends to require) the behavior that you (and I) seem to prefer in these examples. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote: > On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: > > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote: > > > > > > > > There are also so many ways to blow your head off it's untrue. > > > > > > > > For example, > > > > > > > > cmpxchg takes a separate memory model parameter for failure and > > > > > > > > success, but > > > > > > > > then there are restrictions on the sets you can use for each. > > > > > > > > It's not hard > > > > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". > > > > > > > > Then there's > > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely > > > > > > > > ignores consume > > > > > > > > atm and optimises all of the data dependencies away) as well as > > > > > > > > the definition > > > > > > > > of "data races", which seem to be used as an excuse to > > > > > > > > miscompile a program > > > > > > > > at the earliest opportunity. > > > > > > > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > > > > > memory_order_consume until the compilers implement it both > > > > > > > correctly and > > > > > > > efficiently. They are not there yet, and there is currently no > > > > > > > shortage > > > > > > > of compiler writers who would prefer to ignore > > > > > > > memory_order_consume. > > > > > > > > > > > > Do you have any input on > > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? 
In particular, > > > > > > the > > > > > > language standard's definition of dependencies? > > > > > > > > > > Let's see... 1.10p9 says that a dependency must be carried unless: > > > > > > > > > > — B is an invocation of any specialization of std::kill_dependency > > > > > (29.3), or > > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or > > > > > logical OR (||, see 5.15) operator, > > > > > or > > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > > > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > > > > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > > > > > after the "?" will carry a dependency, so the code fragment in 59448 > > > > > needs to do the ordering rather than just optimizing "flag - flag" out > > > > > of existence. One way to do that on both ARM and Power is to actually > > > > > emit code for "flag - flag", but there are a number of other ways to > > > > > make that work. > > > > > > > > And that's what would concern me, considering that these requirements > > > > seem to be able to creep out easily. Also, whereas the other atomics > > > > just constrain compilers wrt. reordering across atomic accesses or > > > > changes to the atomic accesses themselves, the dependencies are new > > > > requirements on pieces of otherwise non-synchronizing code. The latter > > > > seems far more involved to me. > > > > > > Well, the wording of 1.10p9 is pretty explicit on this point. > > > There are only a few exceptions to the rule that dependencies from > > > memory_order_consume loads must be tracked. And to your point about > > > requirements being placed on pieces of otherwise non-synchronizing code, > > > we already have that with plain old load
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote: > On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel wrote: > > > > (a) seems to say that you don't like requiring programmers to mark > > atomic accesses specially. Is that the case? > > In Paul's example, they were marked specially. > > And you seemed to argue that Paul's example could possibly return > anything but 0/0. Just read my reply to Paul again. Here's an excerpt: > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > atomic_store_explicit(42, y, memory_order_relaxed); > if (r1 != 42) > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); Intuitively, this is wrong because this let's the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. IOW, I wrote that such a compiler transformation would be wrong in my opinion. Thus, it should *not* return 42. If you still see how what I wrote could be misunderstood, please let me know because I really don't see it. Yes, I don't do so by swearing or calling other insane, and I try to see the reasoning that those compilers' authors might have had, even if I don't agree with it. In my personal opinion, that's a good thing. > If you really think that, I hope to God that you have nothing to do > with the C standard or any actual compiler I ever use. > > Because such a standard or compiler would be shit. It's sadly not too > uncommon. Thanks for the kind words. Perhaps writing that was quicker than reading what I actually wrote. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote: > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote: > > > > I wouldn't characterize the situation like this (although I can't speak > > for others, obviously). IMHO, it's perfectly fine on sequential / > > non-synchronizing code, because we know the difference isn't observable > > by a correct program. > > What BS is that? If you use an "atomic_store_explicit()", by > definition you're either > > (a) f*cking insane > (b) not doing sequential non-synchronizing code > > and a compiler that assumes that the programmer is insane may actually > be correct more often than not, but it's still a shit compiler. > Agreed? Due to all the expletives, I can't really understand what you are saying. Nor does what I guess it might mean seem to relate to what I said. (a) seems to say that you don't like requiring programmers to mark atomic accesses specially. Is that the case? If so, I have to disagree. If you're writing concurrent code, marking the atomic accesses is the least of your problem. Instead, for the concurrent code I've written, it rather improved readability; it made code more verbose, but in turn it made the synchronization easier to see. Beside this question of taste (and I don't care what the Kernel style guides are), there is a real technical argument here, though: Requiring all synchronizing memory accesses to be annotated makes the compiler aware what is sequential code, and what is not. Without it, one can't exploit data-race-freedom. So unless we want to slow down optimization of sequential code, we need to tell the compiler what is what. If every access is potentially synchronizing, then this doesn't just prevent speculative stores. (b) seems as if you are saying that if there is a specially annotated atomic access in the code, that this isn't sequential/non-synchronizing code anymore. I don't agree with that, obviously. > So I don't see how any sane person can say that speculative writes are > ok. 
They are clearly not ok. We are discussing programs written against the C11/C++11 memory model here. At least that's the part that got forwarded to g...@gcc.gnu.org, and was subject of the nearest emails in the thread. This memory model requires programs to be data-race-free. Thus, we can optimize accordingly. If you don't like that, then this isn't C11/C++11 anymore. Which would be fine, but then complain about that specifically. > Speculative stores are a bad idea in general. I disagree in the context of C11/C++11 programs. At least from a correctness point of view. > They are completely > invalid for anything that says "atomic". I agree, as you will see when you read the other emails I posted as part of this thread. But those two things are separate. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: > On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote: > > Hi Paul, > > > > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote: > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > > a wooden stake through its hart. > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until > > > > that > > > > is sorted. > > > > > > There actually is a proposal being put forward, but it might not make ARM > > > and Power people happy because it involves adding a compare, a branch, > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > much preferring the no-store-speculation approach. > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > emit any additional instructions to prevent that from happening. > > Requiring a compare/branch/ISB after each relaxed load enables a simple(r) > proof that out-of-thin-air values cannot be observed in the face of any > compiler optimization that refrains from reordering a prior relaxed load > with a subsequent relaxed store. > > > Stores can, of course, be observed out-of-order but that's a lot more > > reasonable :) > > So let me try an example. I am sure that Torvald Riegel will jump in > with any needed corrections or amplifications: > > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); > > One would intuitively expect r1 == r2 == 0 as the only possible outcome. 
> But suppose that the compiler used specialization optimizations, as it > would if there was a function that has a very lightweight implementation > for some values and a very heavyweight one for other. In particular, > suppose that the lightweight implementation was for the value 42. > Then the compiler might do something like the following: > > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > if (r1 == 42) > atomic_store_explicit(42, y, memory_order_relaxed); > else > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); > > Suddenly we have an explicit constant 42 showing up. Of course, if > the compiler carefully avoided speculative stores (as both Peter and > I believe that it should if its code generation is to be regarded as > anything other than an act of vandalism, the words in the standard > notwithstanding), there would be no problem. But currently, a number > of compiler writers see absolutely nothing wrong with transforming > the optimized-for-42 version above with something like this: > > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > atomic_store_explicit(42, y, memory_order_relaxed); > if (r1 != 42) > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); Intuitively, this is wrong because this let's the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. 
For this to be correct, the compiler would actually have to prove that the speculative store is "as-if correct", which in turn would mean that it needs to be aware of all potential observers, and check whether those observers aren't actually affected by the speculative store. I would guess that the compilers you have in mind don't really do that. If they do, then I don't see why this should be okay, unless you think out-of-thin-air values are something good (which I wouldn't agree with). > And then it is a short and uncontroversial step to the following: > > Initial state: x == y == 0 > > T1: atomic_store_explicit(42, y, memory_order_rela
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote: > > > > > > There are also so many ways to blow your head off it's untrue. For > > > > > > example, > > > > > > cmpxchg takes a separate memory model parameter for failure and > > > > > > success, but > > > > > > then there are restrictions on the sets you can use for each. It's > > > > > > not hard > > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then > > > > > > there's > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely > > > > > > ignores consume > > > > > > atm and optimises all of the data dependencies away) as well as the > > > > > > definition > > > > > > of "data races", which seem to be used as an excuse to miscompile a > > > > > > program > > > > > > at the earliest opportunity. > > > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > > > memory_order_consume until the compilers implement it both correctly > > > > > and > > > > > efficiently. They are not there yet, and there is currently no > > > > > shortage > > > > > of compiler writers who would prefer to ignore memory_order_consume. > > > > > > > > Do you have any input on > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > > > > language standard's definition of dependencies? > > > > > > Let's see... 
1.10p9 says that a dependency must be carried unless: > > > > > > — B is an invocation of any specialization of std::kill_dependency > > > (29.3), or > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or > > > logical OR (||, see 5.15) operator, > > > or > > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > > > after the "?" will carry a dependency, so the code fragment in 59448 > > > needs to do the ordering rather than just optimizing "flag - flag" out > > > of existence. One way to do that on both ARM and Power is to actually > > > emit code for "flag - flag", but there are a number of other ways to > > > make that work. > > > > And that's what would concern me, considering that these requirements > > seem to be able to creep out easily. Also, whereas the other atomics > > just constrain compilers wrt. reordering across atomic accesses or > > changes to the atomic accesses themselves, the dependencies are new > > requirements on pieces of otherwise non-synchronizing code. The latter > > seems far more involved to me. > > Well, the wording of 1.10p9 is pretty explicit on this point. > There are only a few exceptions to the rule that dependencies from > memory_order_consume loads must be tracked. And to your point about > requirements being placed on pieces of otherwise non-synchronizing code, > we already have that with plain old load acquire and store release -- > both of these put ordering constraints that affect the surrounding > non-synchronizing code. I think there's a significant difference. With acquire/release or more general memory orders, it's true that we can't order _across_ the atomic access. However, we can reorder and optimize without additional constraints if we do not reorder. 
This is not the case with consume memory order, as the (p + flag - flag) example shows. > This issue got a lot of discussion, and the compromise is that > dependencies cannot leak into or out of functions unless the relevant > parameters or return values are annotated with [[carries_dependency]]. > This means that the compiler can see all the places where dependencies > must be tracked. This is described in 7.6.4.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-07 at 08:50 -0800, Paul E. McKenney wrote: > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > Hopefully some discussion of out-of-thin-air values as well. > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > a wooden stake through its hart. > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > is sorted. > > There actually is a proposal being put forward, but it might not make ARM > and Power people happy because it involves adding a compare, a branch, > and an ISB/isync after every relaxed load... Me, I agree with you, > much preferring the no-store-speculation approach. My vague recollection is that everyone agrees that out-of-thin-air values shouldn't be allowed, but that it's surprisingly complex to actually specify this properly. However, the example that Peter posted further down in the thread seems to be unrelated to out-of-thin-air. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-07 at 18:06 +0100, Peter Zijlstra wrote: > On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote: > > Hi Paul, > > > > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote: > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > > a wooden stake through its hart. > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until > > > > that > > > > is sorted. > > > > > > There actually is a proposal being put forward, but it might not make ARM > > > and Power people happy because it involves adding a compare, a branch, > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > much preferring the no-store-speculation approach. > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > emit any additional instructions to prevent that from happening. > > > > Stores can, of course, be observed out-of-order but that's a lot more > > reasonable :) > > This is more about the compiler speculating on stores; imagine: > > if (x) > y = 1; > else > y = 2; > > The compiler is allowed to change that into: > > y = 2; > if (x) > y = 1; If you write the example like that, this is indeed allowed because it's all sequential code (and there's no volatiles in there, at least you didn't show them :). A store to y would happen in either case. You cannot observe the difference between both examples in a data-race-free program. Are there supposed to be atomic/non-sequential accesses in there? If so, please update the example. > Which is of course a big problem when you want to rely on the ordering. 
> > There's further problems where things like memset() can write outside > the specified address range. Examples are memset() using single > instructions to wipe entire cachelines and then 'restoring' the tail > bit. As Joseph said, this would be a bug IMO. > While valid for single threaded, its a complete disaster for concurrent > code. > > There's more, but it all boils down to doing stores you don't expect in > a 'sane' concurrent environment and/or don't respect the control flow. A few of those got fixed already, because they violated the memory model's requirements. If you have further examples that are valid code in the C11/C++11 model, please report them. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-06 at 20:06 -0800, Paul E. McKenney wrote: > On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote: > > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote: > > > > On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote: > > > > > To answer that question, you need to go and look at the definitions of > > > > > synchronises-with, happens-before, dependency_ordered_before and a > > > > > whole > > > > > pile of vaguely written waffle to realise that you don't know. > > > > > > > > Are you familiar with the formalization of the C11/C++11 model by Batty > > > > et al.? > > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > > > > > They also have a nice tool that can run condensed examples and show you > > > > all allowed (and forbidden) executions (it runs in the browser, so is > > > > slow for larger examples), including nice annotated graphs for those: > > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ > > > > > > > > It requires somewhat special syntax, but the following, which should be > > > > equivalent to your example above, runs just fine: > > > > > > > > int main() { > > > > atomic_int foo = 0; > > > > atomic_int bar = 0; > > > > atomic_int baz = 0; > > > > {{{ { > > > > foo.store(42, memory_order_relaxed); > > > > bar.store(1, memory_order_seq_cst); > > > > baz.store(42, memory_order_relaxed); > > > > } > > > > ||| { > > > > r1=baz.load(memory_order_seq_cst).readsvalue(42); > > > > r2=foo.load(memory_order_seq_cst).readsvalue(0); > > > > } > > > > }}}; > > > > return 0; } > > > > > > > > That yields 3 consistent executions for me, and likewise if the last > > > > readsvalue() is using 42 as argument. 
> > > > > > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the > > > > program can't observe != 42 for foo anymore, because the seq-cst fence > > > > is adding a synchronizes-with edge via the baz reads-from. > > > > > > > > I think this is a really neat tool, and very helpful to answer such > > > > questions as in your example. > > > > > > Hmmm... The tool doesn't seem to like fetch_add(). But let's assume that > > > your substitution of store() for fetch_add() is correct. Then this shows > > > that we cannot substitute fetch_add() for atomic_add_return(). > > > > It should be in this example, I believe. > > You lost me on this one. I mean that in this example, substituting fetch_add() with store() should not change meaning, given that what the fetch_add reads-from seems irrelevant.
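For readers who want to run the litmus test outside cppmem, here is a hedged C11/pthreads rendering of Will's example with the real fetch_add restored and the seq-cst fence discussed above placed after the store to foo (function names are mine, not from the thread). With the fence, the standard guarantees the reader observes foo == 42 once it has seen baz == 42: the seq-cst fence acts as a release fence sequenced before the baz store, so it synchronizes with the acquire-or-stronger load that reads that store.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int foo, bar, baz;

static void *writer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&foo, 42, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* the fence under discussion */
    atomic_fetch_add_explicit(&bar, 1, memory_order_seq_cst);
    atomic_store_explicit(&baz, 42, memory_order_relaxed);
    return NULL;
}

/* Spins until baz == 42, then returns the value observed in foo.
 * With the fence above, this must be 42 under the C11 model. */
int run_litmus(void)
{
    pthread_t t;
    atomic_init(&foo, 0); atomic_init(&bar, 0); atomic_init(&baz, 0);
    pthread_create(&t, NULL, writer, NULL);
    while (atomic_load_explicit(&baz, memory_order_seq_cst) != 42)
        ; /* spin */
    int r = atomic_load_explicit(&foo, memory_order_seq_cst);
    pthread_join(t, NULL);
    return r;
}
```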
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote: > > > > There are also so many ways to blow your head off it's untrue. For > > > > example, > > > > cmpxchg takes a separate memory model parameter for failure and > > > > success, but > > > > then there are restrictions on the sets you can use for each. It's not > > > > hard > > > > to find well-known memory-ordering experts shouting "Just use > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then > > > > there's > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores > > > > consume > > > > atm and optimises all of the data dependencies away) as well as the > > > > definition > > > > of "data races", which seem to be used as an excuse to miscompile a > > > > program > > > > at the earliest opportunity. > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > memory_order_consume until the compilers implement it both correctly and > > > efficiently. They are not there yet, and there is currently no shortage > > > of compiler writers who would prefer to ignore memory_order_consume. > > > > Do you have any input on > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > > language standard's definition of dependencies? > > Let's see... 1.10p9 says that a dependency must be carried unless: > > — B is an invocation of any specialization of std::kill_dependency (29.3), or > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical > OR (||, see 5.15) operator, > or > — A is the left operand of a conditional (?:, see 5.16) operator, or > — A is the left operand of the built-in comma (,) operator (5.18); > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > after the "?" 
will carry a dependency, so the code fragment in 59448 > needs to do the ordering rather than just optimizing "flag - flag" out > of existence. One way to do that on both ARM and Power is to actually > emit code for "flag - flag", but there are a number of other ways to > make that work. And that's what would concern me, considering that these requirements seem to be able to creep out easily. Also, whereas the other atomics just constrain compilers wrt. reordering across atomic accesses or changes to the atomic accesses themselves, the dependencies are new requirements on pieces of otherwise non-synchronizing code. The latter seems far more involved to me. > BTW, there is some discussion on 1.10p9's handling of && and ||, and > that clause is likely to change. And yes, I am behind on analyzing > usage in the Linux kernel to find out if Linux cares... Do you have any pointers to these discussions (e.g., LWG issues)? > > > And rcu_dereference() will need per-arch overrides for some time during > > > any transition to memory_order_consume. > > > > > > > Trying to introduce system concepts (writes to devices, interrupts, > > > > non-coherent agents) into this mess is going to be an uphill battle > > > > IMHO. I'd > > > > just rather stick to the semantics we have and the asm volatile > > > > barriers. > > > > > > And barrier() isn't going to go away any time soon, either. And > > > ACCESS_ONCE() needs to keep volatile semantics until there is some > > > memory_order_whatever that prevents loads and stores from being coalesced. > > > > I'd be happy to discuss something like this in ISO C++ SG1 (or has this > > been discussed in the past already?). But it needs to have a paper I > > suppose. > > The current position of the usual suspects other than me is that this > falls into the category of forward-progress guarantees, which are > considered (again, by the usual suspects other than me) to be out > of scope.
But I think we need to better describe forward progress, even though that might be tricky. We made at least some progress on http://cplusplus.github.io/LWG/lwg-active.html#2159 in Chicago, even though we can't constrain the OS schedulers too much, and for lock-free we're in this weird position that on most general-purpose schedulers and machines, obstruction-free algorithms are likely
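The 1.10p9 rules quoted earlier in this message can be illustrated with a small C11 sketch (identifiers are invented for illustration): `kill_dependency()` explicitly ends dependency carrying, and the left operand of `?:` never carries a dependency to begin with. The computed values are identical either way; what differs is only which loads the implementation would be required to order under memory_order_consume.

```c
#include <stdatomic.h>

static _Atomic(int *) ptr;
static int value = 42;

int consume_demo(void)
{
    atomic_store_explicit(&ptr, &value, memory_order_release);

    int *p = atomic_load_explicit(&ptr, memory_order_consume);
    int carried = *p;                  /* dependency carried from the consume load */
    int broken  = *kill_dependency(p); /* same value; dependency explicitly ended */
    int flag = (p != 0);
    int chosen = flag ? carried : 0;   /* 'flag', left of ?:, carries no dependency */
    return carried + broken + chosen;  /* 42 + 42 + 42 */
}
```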
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-06 at 22:13 +, Joseph S. Myers wrote: > On Thu, 6 Feb 2014, Torvald Riegel wrote: > > > > It seems that nobody really > > > agrees on exactly how the C11 atomics map to real architectural > > > instructions on anything but the trivial architectures. > > > > There's certainly different ways to implement the memory model and those > > have to be specified elsewhere, but I don't see how this differs much > > from other things specified in the ABI(s) for each architecture. > > It is not clear to me that there is any good consensus understanding of > how to specify this in an ABI, or how, given an ABI, to determine whether > an implementation is valid. > > For ABIs not considering atomics / concurrency, it's well understood, for > example, that the ABI specifies observable conditions at function call > boundaries ("these arguments are in these registers", "the stack pointer > has this alignment", "on return from the function, the values of these > registers are unchanged from the values they had on entry"). It may > sometimes specify things at other times (e.g. presence or absence of a red > zone - whether memory beyond the stack pointer may be overwritten on an > interrupt). But if it gives a code sequence, it's clear this is just an > example rather than a requirement for particular code to be generated - > any code sequence suffices if it meets the observable conditions at the > points where code generated by one implementation may pass control to code > generated by another implementation. > > When atomics are involved, you no longer have a limited set of > well-defined points where control may pass from code generated by one > implementation to code generated by another - the code generated by the > two may be running concurrently. Agreed. > We know of certain cases > <http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html> where there are > choices of the mapping of atomic operations to particular instructions. 
> But I'm not sure there's much evidence that these are the only ABI issues > arising from concurrency - that there aren't any other ways in which an > implementation may transform code, consistent with the as-if rule of ISO > C, that may run into incompatibilities of different choices. I can't think of other incompatibilities with high likelihood -- provided we ignore consume memory order and the handling of dependencies (see below). But I doubt there is a high risk of such incompatibilities because (1) the data race definition should hopefully not cause subtle incompatibilities and (2) there is a clear "hand-off point" from the compiler to a specific instruction representing an atomic access. For example, if we have a release store, and we agree on the instructions used for that, then compilers will have to ensure happens-before for anything before the release store; for example, as long as stores sequenced-before the release store are performed, it doesn't matter in which order that happens. Subsequently, an acquire-load somewhere else can pick this sequence of events up just by using the agreed-upon acquire-load; like with the stores, it can order subsequent loads in any way it sees fit, including different optimizations. That's obviously not a formal proof, though. But it seems likely to me, at least :) I'm more concerned about consume and dependencies because as far as I understand the standard's requirements, dependencies need to be tracked across function calls. Thus, we might have several compilers involved in that, and we can't just "condense" things to happens-before; what matters instead is how, and whether, we keep dependencies intact. Because of that, I'm wondering whether this is actually implementable in practice. 
(See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10) > And even if > those are the only issues, it's not clear there are well-understood ways > to define the mapping from the C11 memory model to the architecture's > model, which provide a good way to reason about whether a particular > choice of instructions is valid according to the mapping. I think that if we have different options, there needs to be agreement on which to choose across the compilers, at the very least. I don't quite know how this looks like for GCC and LLVM, for example. > > Are you familiar with the formalization of the C11/C++11 model by Batty > > et al.? > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > These discuss, as well as the model itself, proving the validity of a > particular choice of x86 instructions. I imagine that might be a starting > point
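The "hand-off point" argument earlier in this message corresponds to the classic release/acquire message-passing pattern; a minimal sketch with invented names follows. The producer may perform its plain stores in any order the compiler likes, as long as all of them complete before the agreed-upon release store; the consumer recovers all of them through the matching acquire load.

```c
#include <pthread.h>
#include <stdatomic.h>

static int payload_a, payload_b;   /* plain, non-atomic data */
static atomic_int ready;

static void *producer(void *arg)
{
    (void)arg;
    payload_a = 1;                 /* the order of these two plain  */
    payload_b = 2;                 /* stores is up to the compiler  */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Returns payload_a + payload_b as observed after the acquire load;
 * release/acquire guarantees the consumer sees both stores. */
int handoff(void)
{
    pthread_t t;
    atomic_init(&ready, 0);
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load_explicit(&ready, memory_order_acquire) != 1)
        ; /* spin until the release store is visible */
    int sum = payload_a + payload_b;
    pthread_join(t, NULL);
    return sum;
}
```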
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote: > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote: > > > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote: > > > > On 02/06/14 18:25, David Howells wrote: > > > > > > > > > > Is it worth considering a move towards using C11 atomics and barriers > > > > > and > > > > > compiler intrinsics inside the kernel? The compiler _ought_ to be > > > > > able to do > > > > > these. > > > > > > > > > > > > It sounds interesting to me, if we can make it work properly and > > > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in. > > > > > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > > > really think this is a bad idea for the kernel. > > > > I'm not going to comment on what's best for the kernel (simply because I > > don't work on it), but I disagree with several of your statements. > > > > > It seems that nobody really > > > agrees on exactly how the C11 atomics map to real architectural > > > instructions on anything but the trivial architectures. > > > > There's certainly different ways to implement the memory model and those > > have to be specified elsewhere, but I don't see how this differs much > > from other things specified in the ABI(s) for each architecture. > > > > > For example, should > > > the following code fire the assert? > > > > I don't see how your example (which is about what the language requires > > or not) relates to the statement about the mapping above? 
> > > > > > > > extern atomic foo, bar, baz; > > > > > > void thread1(void) > > > { > > > foo.store(42, memory_order_relaxed); > > > bar.fetch_add(1, memory_order_seq_cst); > > > baz.store(42, memory_order_relaxed); > > > } > > > > > > void thread2(void) > > > { > > > while (baz.load(memory_order_seq_cst) != 42) { > > > /* do nothing */ > > > } > > > > > > assert(foo.load(memory_order_seq_cst) == 42); > > > } > > > > > > > It's a good example. My first gut feeling was that the assertion should > > never fire, but that was wrong because (as I seem to usually forget) the > > seq-cst total order is just a constraint but doesn't itself contribute > > to synchronizes-with -- but this is different for seq-cst fences. > > From what I can see, Will's point is that mapping the Linux kernel's > atomic_add_return() primitive into fetch_add() does not work because > atomic_add_return()'s ordering properties require that the assert() > never fire. > > Augmenting the fetch_add() with a seq_cst fence would work on many > architectures, but not for all similar examples. The reason is that > the C11 seq_cst fence is deliberately weak compared to ARM's dmb or > Power's sync. To your point, I believe that it would make the above > example work, but there are some IRIW-like examples that would fail > according to the standard (though a number of specific implementations > would in fact work correctly). Thanks for the background. I don't read LKML, and it wasn't obvious from reading just the part of the thread posted to gcc@ that achieving these semantics is the goal. > > > To answer that question, you need to go and look at the definitions of > > > synchronises-with, happens-before, dependency_ordered_before and a whole > > > pile of vaguely written waffle to realise that you don't know. > > > > Are you familiar with the formalization of the C11/C++11 model by Batty > > et al.? 
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > They also have a nice tool that can run condensed examples and show you > > all allowed (and forbidden) executions (it runs in the browser, so is > > slow for larger examples), including nice annotated graphs for those: > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ > > > > It requires somewhat special syntax, but the following, which should be > > equivalent to your example above, runs just fine: > > > > int main() { > > atomic_int foo = 0; > > atomic_int bar = 0; > > atomic_int baz = 0; > > {{{ { > > foo.store(42, memory_order_rel
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote: > > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote: > > > On 02/06/14 18:25, David Howells wrote: > > > > > > > > Is it worth considering a move towards using C11 atomics and barriers > > > > and > > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able > > > > to do > > > > these. > > > > > > > > > It sounds interesting to me, if we can make it work properly and > > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in. > > > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > > really think this is a bad idea for the kernel. It seems that nobody really > > agrees on exactly how the C11 atomics map to real architectural > > instructions on anything but the trivial architectures. For example, should > > the following code fire the assert? > > > > > > extern atomic foo, bar, baz; > > > > void thread1(void) > > { > > foo.store(42, memory_order_relaxed); > > bar.fetch_add(1, memory_order_seq_cst); > > baz.store(42, memory_order_relaxed); > > } > > > > void thread2(void) > > { > > while (baz.load(memory_order_seq_cst) != 42) { > > /* do nothing */ > > } > > > > assert(foo.load(memory_order_seq_cst) == 42); > > } > > > > > > To answer that question, you need to go and look at the definitions of > > synchronises-with, happens-before, dependency_ordered_before and a whole > > pile of vaguely written waffle to realise that you don't know. Certainly, > > the code that arm64 GCC currently spits out would allow the assertion to > > fire > > on some microarchitectures. > > Yep! I believe that a memory_order_seq_cst fence in combination with the > fetch_add() would do the trick on many architectures, however. All of > this is one reason that any C11 definitions need to be individually > overridable by individual architectures. "Overridable" in which sense? 
Do you want to change the semantics on the language level in the sense of altering the memory model, or rather use a different implementation under the hood to, for example, fix deficiencies in the compilers? > > There are also so many ways to blow your head off it's untrue. For example, > > cmpxchg takes a separate memory model parameter for failure and success, but > > then there are restrictions on the sets you can use for each. It's not hard > > to find well-known memory-ordering experts shouting "Just use > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores > > consume > > atm and optimises all of the data dependencies away) as well as the > > definition > > of "data races", which seem to be used as an excuse to miscompile a program > > at the earliest opportunity. > > Trust me, rcu_dereference() is not going to be defined in terms of > memory_order_consume until the compilers implement it both correctly and > efficiently. They are not there yet, and there is currently no shortage > of compiler writers who would prefer to ignore memory_order_consume. Do you have any input on http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the language standard's definition of dependencies? > And rcu_dereference() will need per-arch overrides for some time during > any transition to memory_order_consume. > > > Trying to introduce system concepts (writes to devices, interrupts, > > non-coherent agents) into this mess is going to be an uphill battle IMHO. > > I'd > > just rather stick to the semantics we have and the asm volatile barriers. > > And barrier() isn't going to go away any time soon, either. And > ACCESS_ONCE() needs to keep volatile semantics until there is some > memory_order_whatever that prevents loads and stores from being coalesced. I'd be happy to discuss something like this in ISO C++ SG1 (or has this been discussed in the past already?). 
But it needs to have a paper I suppose. Will you be in Issaquah for the C++ meeting next week?
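The cmpxchg point quoted in this message (separate memory orders for failure and success, with restrictions on the allowed combinations) looks like this in C11; the wrapper below is an illustrative sketch, not code from the thread. The failure order applies to the load performed when the comparison fails, and C11 restricts it: it may not be a release order, and (as originally specified) may be no stronger than the success order, so acq_rel/acquire is a valid pairing.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lockword;

/* Try to change lockword from 'expect' to 'desired'.
 * Returns true on success; on failure, *seen receives the value
 * actually found (written back into the 'expected' argument by
 * atomic_compare_exchange_strong_explicit). */
bool try_transition(int expect, int desired, int *seen)
{
    int e = expect;
    bool ok = atomic_compare_exchange_strong_explicit(
        &lockword, &e, desired,
        memory_order_acq_rel,   /* order if the exchange succeeds */
        memory_order_acquire);  /* order for the load on failure  */
    *seen = e;
    return ok;
}
```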
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote: > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote: > > On 02/06/14 18:25, David Howells wrote: > > > > > > Is it worth considering a move towards using C11 atomics and barriers and > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able > > > to do > > > these. > > > > > > It sounds interesting to me, if we can make it work properly and > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in. > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > really think this is a bad idea for the kernel. I'm not going to comment on what's best for the kernel (simply because I don't work on it), but I disagree with several of your statements. > It seems that nobody really > agrees on exactly how the C11 atomics map to real architectural > instructions on anything but the trivial architectures. There's certainly different ways to implement the memory model and those have to be specified elsewhere, but I don't see how this differs much from other things specified in the ABI(s) for each architecture. > For example, should > the following code fire the assert? I don't see how your example (which is about what the language requires or not) relates to the statement about the mapping above? > > extern atomic foo, bar, baz; > > void thread1(void) > { > foo.store(42, memory_order_relaxed); > bar.fetch_add(1, memory_order_seq_cst); > baz.store(42, memory_order_relaxed); > } > > void thread2(void) > { > while (baz.load(memory_order_seq_cst) != 42) { > /* do nothing */ > } > > assert(foo.load(memory_order_seq_cst) == 42); > } > It's a good example. My first gut feeling was that the assertion should never fire, but that was wrong because (as I seem to usually forget) the seq-cst total order is just a constraint but doesn't itself contribute to synchronizes-with -- but this is different for seq-cst fences. 
> To answer that question, you need to go and look at the definitions of > synchronises-with, happens-before, dependency_ordered_before and a whole > pile of vaguely written waffle to realise that you don't know. Are you familiar with the formalization of the C11/C++11 model by Batty et al.? http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf http://www.cl.cam.ac.uk/~mjb220/n3132.pdf They also have a nice tool that can run condensed examples and show you all allowed (and forbidden) executions (it runs in the browser, so is slow for larger examples), including nice annotated graphs for those: http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ It requires somewhat special syntax, but the following, which should be equivalent to your example above, runs just fine: int main() { atomic_int foo = 0; atomic_int bar = 0; atomic_int baz = 0; {{{ { foo.store(42, memory_order_relaxed); bar.store(1, memory_order_seq_cst); baz.store(42, memory_order_relaxed); } ||| { r1=baz.load(memory_order_seq_cst).readsvalue(42); r2=foo.load(memory_order_seq_cst).readsvalue(0); } }}}; return 0; } That yields 3 consistent executions for me, and likewise if the last readsvalue() is using 42 as argument. If you add a "fence(memory_order_seq_cst);" after the store to foo, the program can't observe != 42 for foo anymore, because the seq-cst fence is adding a synchronizes-with edge via the baz reads-from. I think this is a really neat tool, and very helpful to answer such questions as in your example. > Certainly, > the code that arm64 GCC currently spits out would allow the assertion to fire > on some microarchitectures. > > There are also so many ways to blow your head off it's untrue. For example, > cmpxchg takes a separate memory model parameter for failure and success, but > then there are restrictions on the sets you can use for each. That's in there for the architectures without a single-instruction CAS/cmpxchg, I believe. 
> It's not hard > to find well-known memory-ordering experts shouting "Just use > memory_model_seq_cst for everything, it's too hard otherwise". Everyone I've heard saying this meant this as advice to people new to synchronization or just dealing infrequently with it. The advice is the simple and safe fallback, and I don't think it's meant as an acknowledgment that the model itself would be too hard. If the language's memory model is supposed to represent weak HW memory models to at least some extent, there's only so much you can do in terms of keeping it simple. If all architectures had x86-like models, the language's model would certainly be simpler... :) > Then there's > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > atm and optimises all of the data dependencies away) AFAIK consume memory order was added to model Power/ARM-specific behavior. I agree that the way the standard specifies how dependencies are to be p