> Date: Tue, 15 Feb 2022 12:54:09 +0100
> From: Edgar Fuß <e...@math.uni-bonn.de>
>
> If there's a widely adopted terminology, one should probably stick
> to it even if the wording is counter-intuitive or misleading (but
> note that fact in the documentation).  After all, Simple Groups are
> not easy at all and you need to know about Galois Theory to
> understand why Solvable Groups are named that way.
The C11 terminology -- which is widely used in the literature these days -- is store-release and load-acquire operations: store-release(p, v) is roughly *p = v, and u = load-acquire(p) is roughly u = *p, with the additional property that if

1. one thread does some memory operations A, and then
2. issues store-release(p, v); and
3. another thread doing u = load-acquire(p) witnesses that store, and
4. proceeds to do some other memory operations B,

then all the memory operations in A happen before all the memory operations in B -- even if they were in different threads, on different CPUs, with crazy relaxed memory models like SPARC RMO or Alpha.

Store-release is like a load/store-before-store barrier and then a store, and load-acquire is like a load and then a load-before-load/store barrier.  Except, because store-release and load-acquire only impose ordering relative to a _specific_ store or load, they may be cheaper than that (e.g., on Arm, there are LDAR and STLR instructions for individual load-acquire and store-release operations on a specific address).
C11 introduces atomic operations with an argument specifying memory ordering, so you can write, e.g.,

	use_stuff(obj->data);
	/* obj->refcnt += -1 */
	atomic_fetch_add_explicit(&obj->refcnt, -1, memory_order_release);

in one thread to drop a reference count as a store-release; then, another thread watching to see if it has gone to zero can use a load-acquire to make sure all memory operations done by users of the object happen before all memory operations used to free resources connected to it:

	if (atomic_load_explicit(&obj->refcnt, memory_order_acquire) == 0) {
		clear(obj->data);
		free_resources(obj->data);
	}

In these fragments, if the store is observed by the load, then we're guaranteed that

	use_stuff(obj->data);

happens before

	clear(obj->data);
	free_resources(obj->data);

(These fragments are just to illustrate reasoning about ordering -- there's generally going to be a lot more to a real-world program doing reference-counting, like a lock and a condvar around dropping the last reference or something.)

If I were naming NetBSD's membars today, I might choose the names membar_acquire for load-before-load/store and membar_release for load/store-before-store.  But we've already had the names membar_enter and membar_exit for a while, and they've gathered a lot of definitions and uses...and it _turns out_ that, despite the word `store' in the man page, membar_enter was always implemented (except on riscv, which has never shipped) _and_ used (except in one place, where no barrier was needed anyway) as load-before-load/store, not as store-before-load/store.

(Of course, atomic-r/m/w-then-load/store is ordered equally by load-before-load/store or store-before-load/store, but on every CPU architecture I know with the distinction, load-before-load/store is cheaper than store-before-load/store -- and load-before-load/store is _useful_ for non-atomic-r/m/w uses, while store-before-load/store is not.)
So I think we should just fix the man page -- and spend any _engineering_ effort on adopting C11 atomics.

> When I read frozz_enter() and frozz_exit() in code, my expectation
> is that every call to enter is paired with a call to exit _in the
> control flow_, i.e., there's no (other than panic) code path that
> goes through one of them, but not the other.

This is the case for the memory barriers issued by mutexes: mutex_enter essentially implies membar_enter, and mutex_exit essentially implies membar_exit.  So if you write another kind of lock, say with atomic_cas, that's what you'll want to use:

	myspinlock_enter()
	{
		while (atomic_cas(&lock, 0, 1) != 0)
			continue
		membar_enter()
	}

	myspinlock_exit()
	{
		membar_exit()
		lock = 0
	}

But the important thing about memory barriers is not so much that _one_ sequential fragment of code looks like this (if you look under the hood of what mutex_spin_enter/exit essentially do):

	while (atomic_cas(&lock, 0, 1) != 0)
		continue
	membar_enter()
	...memory operations on data...
	membar_exit()
	lock = 0

Rather, what's really important is that the code running on _two CPUs_ looks like this:

	// cpu0				// cpu1
	...memory operations A...
	membar_exit()
	lock = 0
					while (atomic_cas(&lock, 0, 1) != 0)
						continue
					membar_enter()
					...memory operations B...

In other words, memory barriers aren't really about ordering _two_ memory operations in _one_ thread -- they're about ordering _four_ operations in _two_ threads:

1. memory operations A, on cpu0
2. lock = 0, on cpu0
3. atomic_cas(&lock, 0, 1), on cpu1
4. memory operations B, on cpu1

Our real goal here (other than the mutual exclusion part of a mutex) is to ensure that memory operations A and memory operations B happen in that order.  And it's only when _both threads_ issue matching memory barriers that everything works out.

> Would it make sense to call the intended-usage aliases something like
> push/pull, provide/consume or publish/whatever?
Well, what we have now are:

	membar_enter and membar_exit
	membar_consumer and membar_producer

I think these are pretty good pairs, as far as English words go.  My only issue with them right now is the text about membar_enter in the man page -- it disagrees with 15 years of practice in NetBSD _and_ with sensible algorithm design.

(There's also membar_sync, of which there are (as far as I can tell) zero legitimate uses in-tree, because almost nothing ever needs store-before-load ordering except exotic kludges like Dekker's algorithm -- almost everything is satisfied by release/acquire ordering or less.)