> Date: Fri, 11 Feb 2022 15:47:01 -0800
> From: Jason Thorpe <thor...@me.com>
>
> My beef with the membar_enter definitional change is with the word
> "acquire".  I.e. you want to give it what is called today "acquire"
> semantics.  My beef is with now "acquire" is defined, as
> load-before-load/store.

Whatever the name is, do you contend that store-before-load/store is
_useful_?  Can you show why?  And, can you show an architecture where
it's actually cheaper than membar_sync?

(I can show plenty of examples of where load-before-load/store is
useful -- heck, just search for membar_enter and you'll find some!)

I would rather avoid introducing a proliferation of membar names,
because the more there are, the more confusing the choice is.  Having
nicely paired names helps: if you see `membar_exit', that's a hint you
should see a corresponding `membar_enter' -- and if you don't, that
should raise alarm bells in your head.  We could add
membar_acquire/release, but `membar_exit' is already appropriate here.

Semantically, generally load/store-before-store (membar_exit) is
appropriately paired with load-before-load/store to make a
happens-before relation that makes programs easy to reason about.  But
store-before-load/store?  Raises alarm bells of an incoherent design
or terrible choice like Dekker's algorithm.

I contend that store-before-load/store is not worth naming -- except
possibly for the never-released riscv, we have _zero_ definitions that
are cheaper than membar_sync (and I'm not sure fence w,rw is actually
cheaper than fence rw,rw on any real hardware -- likely isn't), and
_zero_ uses.

> v9-PSO -- Because Atomic load-stores ("ldstub" and "casx") are not
> ordered with respect to stores, you would need "membar #StoreStore"
> (in PSO mode, Atomic load-stores are already strongly ordered with
> respect to other loads).

This is not accurate.  There is no need for `membar #StoreStore' here,
because, from the other part you quoted about PSO:

        Each load and atomic load-store instruction behaves as if it
        were followed by MEMBAR with a mask value of 05_16.

LoadLoad = 0x01, LoadStore = 0x04, so LoadLoad|LoadStore = 0x05 or
`05_16'; in other words, this is load-before-load/store.  (Confirmed
in Appendix D.5, which spells it out as MEMBAR #LoadLoad|LoadStore.)
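
To make the mask arithmetic concrete, here is a minimal sketch in C.
The LoadLoad and LoadStore values are the ones given above; the
StoreLoad (0x02) and StoreStore (0x08) values are taken from the V9
MEMBAR mmask encoding and are not part of the quoted text.  All of the
identifiers (MMASK_*, pso_implied_mask, mask_covers, check_pso_mask)
are hypothetical names made up for this illustration.

```c
#include <assert.h>

/*
 * SPARC V9 MEMBAR mmask bits.  Hypothetical names; 0x01 and 0x04 are
 * the values discussed above, 0x02 and 0x08 are assumed from the V9
 * encoding.
 */
enum membar_mmask {
    MMASK_LOADLOAD   = 0x01,
    MMASK_STORELOAD  = 0x02,
    MMASK_LOADSTORE  = 0x04,
    MMASK_STORESTORE = 0x08
};

/* Mask implied after every load and atomic load-store in PSO. */
static unsigned
pso_implied_mask(void)
{
    return MMASK_LOADLOAD | MMASK_LOADSTORE;
}

/* Does the barrier mask `have' include every ordering in `need'? */
static int
mask_covers(unsigned have, unsigned need)
{
    return (have & need) == need;
}

/* Check that the implied mask is the `05_16' from the manual. */
static void
check_pso_mask(void)
{
    assert(pso_implied_mask() == 0x05);
}
```

With these definitions, mask_covers(pso_implied_mask(),
MMASK_STORESTORE) is false: the implied barrier orders the atomic
before later loads and stores, but says nothing about earlier stores.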

> Now, because in PSO mode, Atomic load-stores are not strongly
> ordered with respect to stores, in order for the following code to
> work:
>
>         mutex_enter();
>         *foo = 0;
>         result = *bar;
>         mutex_exit();
>
> ...then you need to issue a "membar #StoreStore" because the
> ordering of the lock acquisition and the store through *foo is not
> guaranteed without it.  But you can also issue a "membar #StoreLoad
> | #StoreStore", which also works in RMO mode.

No membar needed here in PSO because the CAS or LDSTUB in mutex_enter
already implies MEMBAR #LoadLoad|LoadStore without any explicit
instruction.  So the CAS/LDSTUB inside mutex_enter happens-before all
loads and stores afterward, namely *foo = 0 and result = *bar.

In PSO you _do_ need MEMBAR #StoreStore in mutex_exit, even if
mutex_exit uses an atomic r/m/w to unlock the mutex, because the store
*foo = 0 could be delayed until after the atomic r/m/w inside
mutex_exit.  That's why, as you said, `atomic load-stores are not
ordered with respect to stores' -- they can be reordered _in one
direction_, which is relevant to mutex_exit but not to mutex_enter.

> In other words, it's the **store into the lock cell** that actually
> performs the acquisition of the lock.

No, it's the atomic r/m/w operation as a unit.  The operation is
atomic; there's no meaningful separation between the parts.  Even with
LL/SC, the only way you can elicit a semantic difference between the
two choices of memory barrier in

        ll
        ...other logic...
        sc
        (repeat if failed)
        membar load-before-load/store vs store-before-load/store

is by issuing a load or store in `...other logic...' that is ordered
differently by the barrier.  The LL/SC itself functions as a single
atomic memory operation with both a load and a store, and so is
equally ordered by load-before-load/store or store-before-load/store
here.
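
The enter/exit pairing described above can be sketched with C11
atomics.  This is an analogy under stated assumptions, not NetBSD's
mutex implementation: atomic_thread_fence(memory_order_acquire) stands
in for a load-before-load/store barrier after the atomic r/m/w, and
atomic_thread_fence(memory_order_release) stands in for the
load/store-before-store barrier (membar_exit) before the unlocking
store.  The names struct spinlock, spin_init, spin_enter, spin_exit,
and locked_clear_and_read are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin lock, for illustration only. */
struct spinlock {
    atomic_bool locked;
};

static void
spin_init(struct spinlock *l)
{
    atomic_init(&l->locked, false);
}

static void
spin_enter(struct spinlock *l)
{
    /*
     * The exchange is one atomic r/m/w.  It is relaxed, so all of
     * the ordering comes from the fence that follows it.
     */
    while (atomic_exchange_explicit(&l->locked, true,
        memory_order_relaxed))
        continue;
    /* Assumed analogue of load-before-load/store. */
    atomic_thread_fence(memory_order_acquire);
}

static void
spin_exit(struct spinlock *l)
{
    /* The caller must hold the lock. */
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed));
    /* Assumed analogue of load/store-before-store (membar_exit). */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&l->locked, false, memory_order_relaxed);
}

/* The critical section from the quoted example. */
static int
locked_clear_and_read(struct spinlock *l, int *foo, const int *bar)
{
    int result;

    spin_enter(l);
    *foo = 0;
    result = *bar;
    spin_exit(l);
    return result;
}
```

The release fence in spin_exit is what keeps *foo = 0 from drifting
past the unlocking store; the acquire fence in spin_enter keeps the
critical section from starting before the exchange has completed.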

> In addition to being true on
> platforms that have Atomic load-store (like SPARC), it is also true
> on platforms that have LL/SC semantics (the load in that case
> doesn't mean jack-squat, and the ordering guarantees that the LL has
> are specifically with respect to the paired SC).

[citation needed]

Can you exhibit a program using LL/SC on one of the architectures you
have in mind, such that it behaves differently depending on which
barrier you issue -- and without cheating by using an intermediate
load or store in `...other logic...' that vacuously makes the
difference independent of the LL/SC?

If not, this is all a distinction without a difference -- any
difference boils down to how membar_enter affects memory operations
that _aren't_ atomic r/m/w (or, equivalently, LL/SC).

Which brings us back to: What utility does store-before-load/store
have?  Very little in NetBSD, it seems!  Store-before-load ordering is
generally only ever needed in weird exotic schemes like Dekker's
algorithm which you generally don't want to use in practice, or early
CPU spinup with a busy loop that is perfectly adequately served by
membar_sync or DELAY().  But load-before-load/store, in contrast, is
ubiquitous and important in performance-critical code.
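
For reference, this is roughly what a scheme that needs
store-before-load looks like.  It is a deliberately simplified
Dekker-style flag handshake (no turn variable, so it is only a
try-lock with no progress guarantee), written with C11 atomics as an
illustration.  The sequentially consistent fence is assumed here to
play the role of membar_sync; the names want, dekker_try_enter, and
dekker_exit are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One intent flag per party (0 or 1); zero-initialized. */
static atomic_int want[2];

/*
 * Try to enter the critical section as party `me' (0 or 1).
 * Returns true on success, false if the other party wants in.
 */
static bool
dekker_try_enter(int me)
{
    atomic_store_explicit(&want[me], 1, memory_order_relaxed);
    /*
     * Store-before-load: our flag must be visible before we read
     * the other flag.  Acquire or release fences are not enough;
     * this is the full barrier (assumed analogue of membar_sync).
     */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&want[1 - me], memory_order_acquire)) {
        /* Contended: back off instead of spinning. */
        atomic_store_explicit(&want[me], 0, memory_order_relaxed);
        return false;
    }
    return true;
}

static void
dekker_exit(int me)
{
    /* The caller must have entered successfully. */
    assert(atomic_load_explicit(&want[me], memory_order_relaxed));
    atomic_store_explicit(&want[me], 0, memory_order_release);
}
```

Note that the barrier this needs is the full one: nothing in the
sketch would get cheaper from a store-before-load/store barrier that
is weaker than membar_sync.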