The main tip I give people who try to understand fences, barriers, and 
memory ordering rules is to completely (and very intentionally) ignore and 
avoid thinking about any CPU's internal details (like buffers, queues, 
flushes, or such). The simple rule to remember is this: before a CPU ever 
sees the instructions you think you are running, a compiler (static, JIT, 
whatever) will have a chance to completely shuffle those instructions 
around and hand them to the CPU in any order it sees fit. So what a CPU 
actually does with barriers has no functional/correctness implications; it 
is the compiler's job to make the CPU do what it needs to. 
Barriers/fences/memory models are first and foremost compiler concerns, and 
have both correctness and (potentially profound) performance implications. 
The CPU concerns are only secondary, and only affect performance (never 
correctness). This is critical to understand and get your head around 
before trying to reason about what fences/barriers/etc. *mean* to program 
code.

For example, the sort of code mutation we are talking about is not just a 
simple change in order. It is also a combination of reordering with other 
optimizations. E.g. a repeated load from the same location in a loop can 
(and will) float "up above" the loop, execute only once (instead of once 
per loop iteration), and cache the loaded result in a register, unless 
barriers/fences/ordering rules prevent it from doing so. Similarly, 
redundant stores to the same location can be folded into a single store, 
unless rules prevent it.
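
To make that concrete, here is a minimal Java sketch of the loop-hoisting 
hazard (class and field names are made up for illustration), with volatile 
acting as the compiler-level barrier that prevents it:

```java
// Sketch: a repeated load of a plain field can be hoisted out of a loop
// and cached in a register; a volatile field forbids that hoisting.
class StopFlag {
    boolean plainDone;          // plain field: loads of it may be hoisted
    volatile boolean safeDone;  // volatile: must be re-loaded each iteration

    void spinUnsafe() {
        // The compiler/JIT may legally transform this into:
        //   if (!plainDone) { while (true) { } }
        // i.e. load once, cache the result in a register, and spin forever
        // even after another thread sets plainDone to true.
        while (!plainDone) { }
    }

    void spinSafe() {
        // The volatile load cannot float above the loop, so a write by
        // another thread is eventually observed and the loop exits.
        while (!safeDone) { }
    }
}
```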

With that in mind, you can start looking at the various barriers and fence 
semantics available in various languages (that actually define them) and in 
various ad-hoc tooling (like e.g. libraries and conventions used in C).

I like to use the simplistic LoadLoad, LoadStore, StoreStore, StoreLoad 
mental model as a starting point for thinking about barriers, as it can be 
used to model most (not all) barrier semantics. Those are simple to 
understand:
- LoadStore means "all loads that precede this fence will appear (from the 
point of view of any other thread) to execute before any store that 
follows the fence appears to execute."
- LoadLoad means "all loads that precede this fence will appear (from the 
point of view of any other thread) to execute before any load that follows 
the fence appears to execute."
And so on for StoreStore and StoreLoad.
The name tells you what the rule is, clear and simple. Nothing more is 
promised, and nothing less.
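
As a sketch of those raw rules in Java terms: since Java 9, VarHandle 
exposes them directly as static fence methods. The class and field names 
below are made up; the fence placement is the point. (Real code would 
normally use volatile or VarHandle access modes rather than plain fields 
plus raw fences.)

```java
import java.lang.invoke.VarHandle;

// Sketch: hand-rolled publication using raw StoreStore/LoadLoad fences.
class Message {
    int payload;     // plain field, published via the flag below
    boolean ready;   // plain flag

    void publish(int value) {
        payload = value;
        // StoreStore: the payload store must appear to execute before
        // the ready store, from any other thread's point of view.
        VarHandle.storeStoreFence();
        ready = true;
    }

    int consumeIfReady() {
        if (!ready) return -1;
        // LoadLoad: the ready load must appear to execute before the
        // payload load; otherwise the payload read could be stale.
        VarHandle.loadLoadFence();
        return payload;
    }
}
```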

I like to think of Acquire and Release (which are somewhat more 
restrictive) in terms of "what would a lock need to do?" (even if no lock 
is involved). Thinking in terms of "in a critical section protected by a 
lock, what sort of code motion would the acquire/release fences associated 
with the lock's acquire and release actions allow?" lets me answer 
questions about them without looking up a definition or thinking about the 
more abstract notions that define them. E.g. a lock acquire can allow 
preceding loads and stores that (in the program) appear "before" the lock 
acquire to "float forward past the acquire and into the locked region". 
But it cannot allow loads and stores that (in the program) appear "after" 
the lock acquire to "float backwards past the acquire and out of the 
locked region". So an acquire fence needs to enforce those rules. 
Similarly, a release can allow loads and stores that follow the release to 
"float back past the release and into the locked region", but cannot allow 
loads and stores that precede it to "float forward past the release and 
out of the locked region". 
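
In Java terms, the allowed and forbidden motions around a lock look like 
this (a sketch with made-up names; the comments mark what may and may not 
move):

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the "roach motel" rule: code may float INTO a locked
// region, but never OUT of it.
class CriticalSection {
    private final ReentrantLock lock = new ReentrantLock();
    private int shared;

    int update(int before, int delta) {
        int a = before * 2;  // may legally float down, past lock(), into the region
        lock.lock();         // acquire: nothing below may float up above this line
        try {
            shared += delta; // must stay between lock() and unlock()
            return a + shared;
        } finally {
            lock.unlock();   // release: nothing above may float down below this line
        }
        // (loads/stores written after unlock() could float up into the region)
    }
}
```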

Since an acquire doesn't (necessarily) involve a load or a store, Acquire 
doesn't quite equate to LoadLoad|LoadStore. It can be equivalent when 
acquiring the lock involves a load followed by a LoadLoad|LoadStore, but 
there are lock implementations where that doesn't actually happen, and 
there are uses of Acquire fences with no locking or loads involved. 
Similarly, since a release doesn't (necessarily) involve a load or a 
store, it is not quite equivalent to LoadStore|StoreStore (but it can be 
when releasing involves a LoadStore|StoreStore followed by a store).
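
One concrete case of an acquire fence with no lock and no single 
"acquiring" load is a seqlock reader. This is a sketch (single writer 
assumed, names made up), using the Java 9+ VarHandle static fences:

```java
import java.lang.invoke.VarHandle;

// Sketch: a minimal seqlock. The reader uses an acquire fence even though
// there is no lock acquisition and no one load that "is" the acquire.
class SeqLock {
    volatile int seq;  // even = stable, odd = writer in progress
    int x, y;          // plain data protected by the sequence number

    // Single writer assumed.
    void write(int a, int b) {
        seq = seq + 1;               // odd: writer active
        VarHandle.storeStoreFence(); // data stores may not float above the odd store
        x = a;
        y = b;
        seq = seq + 1;               // even: the volatile store releases the data stores
    }

    long read() {
        while (true) {
            int s0 = seq;                 // volatile load of the sequence
            if ((s0 & 1) != 0) continue;  // writer active: retry
            int a = x, b = y;             // speculative plain loads
            VarHandle.acquireFence();     // data loads may not sink below the re-check
            if (seq == s0) {              // unchanged: the loads above were consistent
                return ((long) a << 32) | (b & 0xFFFFFFFFL);
            }
        }
    }
}
```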

HTH

On Sunday, June 12, 2016 at 4:28:14 PM UTC-7, Dain Ironfoot wrote:
>
> Hi Guys,
>
> I am attempting to understand memory barriers at a level useful for java 
> lock-free programmers.
> This level, I feel, is somewhere between learning about volatile from java 
> books and learning about working of Store/Load buffers from an x86 manual.
> I spent some time reading a bunch of blogs/cookbooks and have come up with 
> the summary shown below.
>
> LFENCE
> ====================================================
> Name            : LFENCE/Load Barrier/Acquire Fence
> Barriers        : LoadLoad + LoadStore
> Details         : Given the sequence {Load1, LFENCE, Load2, Store1}, the 
> barrier ensures that Load1 can't be moved south and Load2 and Store1 
> can't be moved north of the barrier. Note that Load2 and Store1 can 
> still be reordered.
> Buffer Effect   : Causes the contents of the LoadBuffer (pending loads) 
> to be processed for that CPU. This makes program state exposed from 
> other CPUs visible to this CPU before Load2 and Store1 are executed.
> Cost on x86     : Either very cheap or a no-op.
> Java instruction: Reading a volatile variable, Unsafe.loadFence()
>     
>
> SFENCE
> ====================================================
> Name            : SFENCE/Store Barrier/Release Fence
> Barriers        : StoreStore + LoadStore
> Details         : Given the sequence {Load1, Store1, SFENCE, Store2, 
> Load2}, the barrier ensures that Load1 and Store1 can't be moved south 
> and Store2 can't be moved north of the barrier. Note that Load1 and 
> Store1 can still be reordered AND Load2 can be moved north of the 
> barrier.
> Buffer Effect   : Causes the contents of the StoreBuffer to be flushed 
> to cache for the CPU on which it is issued. This makes program state 
> visible to other CPUs before Store2 and Load2 are executed.
> Cost on x86     : Either very cheap or a no-op.
> Java instruction: lazySet(), Unsafe.storeFence(), Unsafe.putOrdered*()
>
>
> MFENCE
> ====================================================
> Name            : MFENCE/Full Barrier/Fence
> Barriers        : StoreLoad
> Details         : Obtains the effects of the other three barriers, so it 
> can serve as a general-purpose barrier. Given the sequence {Load1, 
> Store1, MFENCE, Store2, Load2}, the barrier ensures that Load1 and 
> Store1 can't be moved south and Store2 and Load2 can't be moved north 
> of the barrier. Note that Load1 and Store1 can still be reordered AND 
> Store2 and Load2 can still be reordered.
> Buffer Effect   : Causes the contents of the LoadBuffer (pending loads) 
> to be processed for that CPU, AND causes the contents of the 
> StoreBuffer to be flushed to cache for the CPU on which it is issued.
> Cost on x86     : The most expensive kind.
> Java instruction: Writing to a volatile, Unsafe.fullFence(), Using locks
>     
> My questions are:
>
> a) Is my understanding correct, or have I missed something critical?
> b) If both SFENCE and MFENCE drain the StoreBuffer (invalidating cache 
> lines and waiting for acks from other CPUs), why is one a no-op and the 
> other a very expensive op?
>
> Thanks
>
> ** Edited to make the info easier to read.
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to