On 09/13/2011 10:58 AM, Geert Bosch wrote:

On Sep 13, 2011, at 08:08, Andrew MacLeod wrote:

On 09/12/2011 09:52 PM, Geert Bosch wrote:
No, that's false. Even on systems with nice memory models, such as x86 and SPARC with a TSO model, you need a fence to ensure that a write-load of the same location (by write-load I mean a write instruction *and* a subsequent load instruction) makes it all the way to coherent memory and is not forwarded directly from the write buffer or L1 cache. The reason that fences are expensive is exactly that they require system-wide agreement.

On x86, all the atomic operations are prefixed with LOCK, which is supposed to grant them exclusive use of shared memory. Ken's comments would appear to indicate that this imposes a total order across all processors.
Yes, that's right. All atomic read-modify-write operations have an implicit 
full barrier on x86 and on SPARC. However, my example was about regular stores 
and loads from an atomic int using the C++ relaxed memory model. Indeed, just 
using XCHG (or SWAP on SPARC) instructions for writes and regular loads for 
reads is sufficient to establish a total order.



Your example was not about regular stores; it used atomic variables. *ALL* atomic variable writes are LOCK-prefixed on x86. This is one reason we have built-ins for all atomic reads and writes: to let targets define the appropriate sequences to ensure atomicity.

Additional costs may come with synchronizing the *other* shared memory variables in a thread, which is what the memory models are there for.

For relaxed mode, no other shared memory values have to be flushed/sorted out because relaxed doesn't synchronize.

When you switch to release/acquire, two threads have to get themselves into a consistent state, so any other pending writes before an atomic release operation in one thread must be flushed back to shared memory, possibly requiring extra instruction(s) on some architectures. The simple use of the LOCK prefix on x86 writes satisfies this constraint as well.

And then seq-cst requires pretty much everyone in the system to get straightened away which could be a very expensive operation. As it turns out, x86 is still satisfied by just using a lock on the atomic write.

x86 obviously doesn't benefit much from the more relaxed models since it's pretty much seq-cst by default, but some other architectures do. There's a table being built here:

http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

which shows what the sequences should be for various architectures, and that's what I'm planning to use for the atomic sequences on each of those targets. As you can see, some architectures match the various memory models more closely than x86 does.

The optimizers are still free to perform shared memory optimizations, subject to the memory model restrictions (i.e., all sorts of code motion can happen across relaxed atomics, and none across seq-cst). This is where x86 might benefit in performance.

Andrew

BTW, If someone cares about the sequences for their favourite architecture, and it isn't listed there, I encourage you to contact Peter or Jaroslav with the relevant information to get it added to this page. (I CC'd them on this reply.)

