> Date: Tue, 15 Feb 2022 12:54:09 +0100
> From: Edgar Fuß <e...@math.uni-bonn.de>
>
> If there's a widely adopted terminology, one should probably stick
> to it even if the wording is counter-intuitive or misleading (but
> note that fact in the documentation).  After all, Simple Groups are
> not easy at all and you need to know about Galois Theory to
> understand why Solvable Groups are named that way.
The C11 terminology -- which is widely used in the literature these days -- is store-release and load-acquire operations: store-release(p, v) is roughly *p = v, and u = load-acquire(p) is roughly u = *p, with the additional property that if

1. one thread does some memory operations A, and then
2. issues store-release(p, v); and
3. another thread doing u = load-acquire(p) witnesses that store, and
4. proceeds to do some other memory operations B,

then all the memory operations in A happen before all the memory operations in B -- even if they were in different threads, on different CPUs, with crazy relaxed memory models like SPARC RMO or Alpha.

Store-release is like a load/store-before-store barrier and then a store, and load-acquire is like a load and then a load-before-load/store barrier.  Except, because store-release and load-acquire only impose ordering relative to a _specific_ store or load, they may be cheaper than that (e.g., on Arm, there are LDAR and STLR instructions for individual load-acquire and store-release operations on a specific address).
C11 introduces atomic operations with an argument specifying memory ordering, so you can write, e.g.,

	use_stuff(obj->data);
	/* obj->refcnt += -1 */
	atomic_fetch_add_explicit(&obj->refcnt, -1, memory_order_release);

in one thread to drop a reference count as a store-release; then, another thread watching to see if it has gone to zero can use a load-acquire to make sure all memory operations done by users of the object happen before all memory operations used to free resources connected to it:

	if (atomic_load_explicit(&obj->refcnt, memory_order_acquire) == 0) {
		clear(obj->data);
		free_resources(obj->data);
	}

In these fragments, if the store is observed by the load, then we're guaranteed that

	use_stuff(obj->data);

happens before

	clear(obj->data);
	free_resources(obj->data);

(These fragments are just to illustrate reasoning about ordering -- there's generally going to be a lot more to a real-world program doing reference-counting, like a lock and a condvar around dropping the last reference or something.)

If I were naming NetBSD's membars today, I might choose the names membar_acquire for load-before-load/store and membar_release for load/store-before-store.  But we've already had the names membar_enter and membar_exit for a while, and they've gathered a lot of definitions and uses...and it _turns out_ that, despite the word `store' in the man page, membar_enter was always implemented (except on riscv, which has never shipped) _and_ used (except in one place, where no barrier was needed anyway) as load-before-load/store, not as store-before-load/store.

(Of course, atomic-r/m/w-then-load/store is ordered equally by load-before-load/store or store-before-load/store, but on every CPU architecture I know with the distinction, load-before-load/store is cheaper than store-before-load/store -- and load-before-load/store is _useful_ for non-atomic-r/m/w uses, while store-before-load/store is not.)
So I think we should just fix the man page -- and spend any _engineering_ effort on adopting C11 atomics.

> When I read frozz_enter() and frozz_exit() in code, my expectation
> is that every call to enter is paired with a call to exit _in the
> control flow_, i.e., there's no (other than panic) code path that
> goes through one of them, but not the other.

This is the case for the memory barriers issued by mutexes: mutex_enter essentially implies membar_enter, and mutex_exit essentially implies membar_exit.  So if you write another kind of lock, say with atomic_cas, that's what you'll want to use:

	myspinlock_enter()
	{
		while (atomic_cas(&lock, 0, 1) != 0)
			continue
		membar_enter()
	}

	myspinlock_exit()
	{
		membar_exit()
		lock = 0
	}

But the important thing about memory barriers is not so much that _one_ sequential fragment of code looks like this (if you look under the hood of what mutex_spin_enter/exit essentially do):

	while (atomic_cas(&lock, 0, 1) != 0)
		continue
	membar_enter()
	...memory operations on data...
	membar_exit()
	lock = 0

Rather, what's really important is that the code running on _two CPUs_ looks like this:

	// cpu0				// cpu1
	...memory operations A...
	membar_exit()
	lock = 0
					while (atomic_cas(&lock, 0, 1) != 0)
						continue
					membar_enter()
					...memory operations B...

In other words, memory barriers aren't really about ordering _two_ memory operations in _one_ thread -- they're about ordering _four_ operations in _two_ threads:

1. memory operations A, on cpu0
2. lock = 0, on cpu0
3. atomic_cas(&lock, 0, 1), on cpu1
4. memory operations B, on cpu1

Our real goal here (other than the mutual exclusion part of a mutex) is to ensure that memory operations A and memory operations B happen in that order.  And it's only when _both threads_ issue matching memory barriers that everything works out.

> Would it make sense to call the intended-usage aliases something like
> push/pull, provide/consume or publish/whatever?
Well, what we have now are:

	membar_enter and membar_exit
	membar_consumer and membar_producer

I think these are pretty good pairs, as far as English words go.  My only issue with them right now is the text about membar_enter in the man page -- it disagrees with 15 years of practice in NetBSD _and_ with sensible algorithm design.

(There's also membar_sync, of which there are (as far as I can tell) zero legitimate uses in-tree, because almost nothing ever needs store-before-load ordering except exotic kludges like Dekker's algorithm -- almost everything is satisfied by release/acquire ordering or less.)