> Date: Sun, 24 Nov 2019 19:25:52 +0000
> From: Taylor R Campbell <riastr...@netbsd.org>
>
> This thread is not converging on consensus, so we're discussing the
> semantics and naming of these operations as core and will come back
> with a decision by the end of the week.
We (core) carefully read the thread, and discussed this and the related
Linux READ_ONCE/WRITE_ONCE macros as well as the C11 atomic API.

For maxv: Please add conditional definitions in <sys/atomic.h> according
to what KCSAN needs, and use atomic_load/store_relaxed for counters and
other integer objects in the rest of your patch.  (I didn't see any
pointer loads there.)

For uvm's lossy counters, please use

	atomic_store_relaxed(p, 1 + atomic_load_relaxed(p))

and not an __add_once macro -- since these should really be per-CPU
counters, we don't want to endorse this pattern by making it pretty.

* Summary

We added a few macros to <sys/atomic.h> for the purpose,
atomic_load_<ordering>(p) and atomic_store_<ordering>(p,v).  The
orderings are relaxed, acquire, consume, and release, and are intended
to match C11 semantics.  See the new atomic_loadstore(9) man page for
reference.

Currently they are defined in terms of volatile loads and stores, but
we should eventually use the C11 atomic API instead in order to provide
the intended atomicity guarantees under all compilers without having to
rely on the folklore interpretations of volatile.

* Details

There are four main properties involved in the operations under
discussion:

1. No tearing.  A 32-bit write can't be split into two separate 16-bit
   writes, for instance.

   * In _some_ cases, namely aligned pointers to sufficiently small
     objects, Linux READ_ONCE/WRITE_ONCE guarantee no tearing.

   * C11 atomic_load/store guarantees no tearing -- although on large
     objects it may involve locks, requiring the C11 type qualifier
     _Atomic and changing the ABI.  This was the primary motivation for
     maxv's original question.

2. No fusing.  Consecutive writes can't be combined into one, for
   instance, and a write followed by a read can't skip the read to
   return the value that was written.

   * Linux's READ_ONCE/WRITE_ONCE and C11's atomic_load/store
     guarantee no fusing.

3. Data-dependent memory ordering.
   If you read a pointer, and then dereference the pointer (maybe plus
   some offset), the reads happen in that order.

   * Linux's READ_ONCE guarantees this by issuing the analogue of
     membar_datadep_consumer on DEC Alpha, and nothing on other CPUs.

   * C11's atomic_load guarantees this with seq_cst, acquire, or
     consume memory ordering.

4. Cost.  There's no need to incur the cost of read/modify/write
   atomic operations, and for many purposes, no need to incur the cost
   of memory-ordering barriers.

To express these, we've decided to add a few macros that are similar to
Linux's READ_ONCE/WRITE_ONCE and C11's atomic_load/store_explicit but
are less error-prone and less cumbersome:

	#include <sys/atomic.h>

- atomic_load_relaxed(p) is like *p, but guarantees no tearing and no
  fusing.  No ordering relative to memory operations on other objects
  is guaranteed.

- atomic_store_relaxed(p, v) is like *p = v, but guarantees no tearing
  and no fusing.  No ordering relative to memory operations on other
  objects is guaranteed.

- atomic_store_release(p, v) and atomic_load_acquire(p) are,
  respectively, like *p = v and *p, but guarantee no tearing and no
  fusing.  They _also_ guarantee, for logic like

	Thread A			Thread B
	--------			--------
	stuff();
	atomic_store_release(p, v);	u = atomic_load_acquire(p);
					things();

  that _if_ the atomic_load_acquire(p) in thread B witnesses the state
  of the object at p set by atomic_store_release(p, v) in thread A,
  then all memory operations in stuff() happen before any memory
  operations in things().

  No guarantees if only one thread participates -- the store-release
  and load-acquire _must_ be paired.

- atomic_load_consume(p) is like atomic_load_acquire(p), but it only
  guarantees ordering for data-dependent memory references.  Like
  atomic_load_acquire, it must be paired with atomic_store_release.
  However, on most CPUs, it is as _cheap_ as atomic_load_relaxed.
The atomic load/store operations are defined _only_ on objects as large
as the architecture can support -- so, for example, on 32-bit platforms
they cannot be used on 64-bit quantities; attempts to do so will lead
to compile-time errors.  They are also defined _only_ on aligned
pointers -- using them on unaligned pointers may lead to run-time
crashes, even on architectures without strict alignment requirements.

* Why the names atomic_{load,store}_<ordering>?

- Atomic.  Although `atomic' may suggest `expensive' to some people
  (and I'm guilty of making that connection in the past), what's really
  expensive is atomic _read/modify/write_ operations and _memory
  ordering guarantees_.  Merely preventing tearing and fusing is often
  cheap -- normal CPU load/store instructions are usually cheap and
  atomic, and these operations help to ensure that (a) we catch
  mistakes with aggregate objects like 64-bit words on a 32-bit
  machine, and (b) the compiler doesn't do any tricks behind our back
  to violate those guarantees.

- Load/store.  We could say read/write but we see little value in
  deviating from the modern C11 API.

- Memory ordering.  C11 defines atomic_load and atomic_store with
  _sequential consistency_, the most expensive kind of ordering -- in
  C11, there is a total order over all sequentially consistent memory
  operations that every thread shares.  So the names atomic_load and
  atomic_store would conflict with that.

  It's not obvious from the names READ_ONCE/WRITE_ONCE that any
  ordering guarantees are needed.  And for things like lossy counters,
  ordering is not needed.  But in Linux, some applications (like RCU)
  _do_ rely on ordering guarantees from READ_ONCE -- and those _must_
  be paired with ordering guarantees on the writer side in order to
  work.

  We could have adopted the rather cumbersome atomic_load_explicit and
  atomic_store_explicit from C11, but I figured it would be better if
  we just name the five useful versions with shorter names.
We see little value in deviating from the nomenclature in C11, since
the terminology `relaxed', `acquire', and `release' is ubiquitous in
the literature today (personally, I might prefer `unordered' over
`relaxed', but not enough to warrant divergence from the literature and
standard), and the semantics doesn't exactly match our existing
membar_ops(3) anyway.

Thus, the names are atomic_{load,store}_* and annotated with the C11
memory ordering so you have to be clear about it -- but not quite as
cumbersome as the C11 `atomic_load_explicit(p, memory_order_acquire)'.

General rules:

- For any atomic_load_acquire or atomic_load_consume, make sure you can
  identify the atomic_store_release that it corresponds with, and vice
  versa.  Leave a code comment on each part pointing out its
  counterpart.

- Translate Linux READ_ONCE into atomic_load_consume, unless you _must_
  operate on large or unaligned objects.

  => Optimization: If downstream memory operations do not depend on
     the value, then you can use atomic_load_relaxed.

- Translate Linux WRITE_ONCE into atomic_store_relaxed, unless you
  _must_ operate on large or unaligned objects.

* How do they relate to the existing atomic_ops(3) and membar_ops(3)
  API?

We're still working on details, but for now, you can treat

	atomic_r/m/w(p, ...);	// from atomic_ops(3), except the *_ni
	membar_enter();

as a load-acquire, and

	membar_exit();
	atomic_r/m/w(q, ...);	// from atomic_ops(3), except the *_ni

as a store-release.

On architectures with __HAVE_ATOMIC_AS_MEMBAR, such as x86, the
membar_enter/exit is not necessary and every atomic_r/m/w implies
store-release _and_ load-acquire.

(Caveat parallel programmer: membar_enter is _not_ the same as C11
atomic_thread_fence(memory_order_acquire), and atomic_load_relaxed
followed by membar_enter() is _not_ a load-acquire, which is why this
is not the end of the story.)