On 09/13/2011 10:58 AM, Geert Bosch wrote:
On Sep 13, 2011, at 08:08, Andrew MacLeod wrote:
On 09/12/2011 09:52 PM, Geert Bosch wrote:
No, that's false. Even on systems with nice memory models, such as x86 and SPARC
with TSO, you need a fence to ensure that a write-load of the same location
(Note: by write-load here I mean a write instruction *and* a subsequent
load instruction.)
makes it all the way to coherent memory and is not forwarded directly from the
write buffer or L1 cache. The reason fences are expensive is exactly that
they require system-wide agreement.
On x86, all the atomic operations are prefixed with LOCK, which is supposed to
grant them exclusive use of shared memory. Ken's comments would appear to
indicate that this imposes a total order across all processors.
Yes, that's right. All atomic read-modify-write operations have an implicit
full barrier on x86 and on SPARC. However, my example was about regular stores
and loads from an atomic int using the C++ relaxed memory model. Indeed, just
using XCHG (or SWAP on SPARC) instructions for writes and regular loads for
reads is sufficient to establish a total order.
Your example was not about regular stores; it used atomic variables.
*ALL* atomic variable writes are prefixed by LOCK on x86. This is one
reason we have built-ins for all atomic reads and writes: to let targets
define the appropriate sequences to ensure atomicity.
Additional costs may come with synchronizing the *other* shared memory
variables in a thread, which is what the memory models are there for.
For relaxed mode, no other shared memory values have to be
flushed/sorted out because relaxed doesn't synchronize.
When you switch to release/acquire, the 2 threads have to get
themselves into a consistent state, so any other pending writes before
an atomic release operation in one thread must be flushed back to shared
memory, possibly requiring extra instructions on some architectures.
The simple use of the lock prefix on x86 writes satisfies this
constraint as well.
And then seq-cst requires pretty much everyone in the system to get
straightened away, which can be a very expensive operation. As it
turns out, x86 is still satisfied by just using a LOCK on the atomic write.
x86 obviously doesn't benefit much from the more relaxed models since
it's pretty much seq-cst by default, but some other architectures do. There's
a table being built here:
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
which shows what the sequences should be for various architectures, and
that's what I'm planning to use for the atomic sequences on each of
those targets. As you can see, some architectures match the various
memory models more closely than x86 does.
The optimizers are still free to do shared-memory optimizations subject
to the memory model restrictions (i.e., all sorts of code motion can
happen across relaxed atomics, and none across seq-cst). This is where
x86 might benefit in performance.
Andrew
BTW, If someone cares about the sequences for their favourite
architecture, and it isn't listed there, I encourage you to contact
Peter or Jaroslav with the relevant information to get it added to this
page. (I CC'd them on this reply.)