On 09/13/2011 10:58 AM, Geert Bosch wrote:
On Sep 13, 2011, at 08:08, Andrew MacLeod wrote:
On 09/12/2011 09:52 PM, Geert Bosch wrote:
No, that's false. Even on systems with nice memory models, such as x86 and SPARC
with TSO, you need a fence to ensure that a write-load of the same location
(Note: by write-load here I mean a write instruction *and* a subsequent
load instruction.)
makes it all the way to coherent memory and is not forwarded directly from the
write buffer or L1 cache. The reason fences are expensive is exactly that
they require system-wide agreement.
On x86, all the atomic operations are prefixed with LOCK, which is supposed to
grant them exclusive use of shared memory. Ken's comments would appear to
indicate that this imposes a total order across all processors.
Yes, that's right. All atomic read-modify-write operations have an implicit
full barrier on x86 and on SPARC. However, my example was about regular stores
and loads from an atomic int using the C++ relaxed memory model. Indeed, just
using XCHG (or SWAP on SPARC) instructions for writes and regular loads for
reads is sufficient to establish a total order.
Your example was not about regular stores; it used atomic variables.
*ALL* atomic variable writes are prefixed by LOCK on x86. This is one
reason we have built-ins for all atomic reads and writes: to let targets
define the appropriate sequences to ensure atomicity.
Additional costs may come with synchronizing the *other* shared memory
variables in a thread, which is what the memory models are there for.
For relaxed mode, no other shared memory values have to be
flushed/sorted out because relaxed doesn't synchronize.
When you switch to release/acquire, the 2 threads have to get
themselves into a consistent state, so any other pending writes before
an atomic release operation in one thread must be flushed back to shared
memory, possibly requiring extra instructions on some architectures.
The simple use of the lock prefix on x86 writes satisfies this
constraint as well.
And then seq-cst requires pretty much everyone in the system to get
straightened away, which can be a very expensive operation. As it
turns out, x86 is still satisfied by just using a LOCK on the atomic write.
x86 obviously doesn't benefit much from the more relaxed models since
it's pretty much seq-cst by default, but some other architectures do. There's
a table being built here:
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
which shows what the sequences should be for various architectures, and
that's what I'm planning to use for the atomic sequences on each of
those targets. As you can see, some architectures match the various
memory models more closely than x86 does.
The optimizers are still free to do shared-memory optimizations subject
to the memory model restrictions (i.e., all sorts of code motion can
happen across relaxed atomics, and none across seq-cst). This is where
x86 might benefit in performance.
Andrew
BTW, If someone cares about the sequences for their favourite
architecture, and it isn't listed there, I encourage you to contact
Peter or Jaroslav with the relevant information to get it added to this
page. (I CC'd them on this reply.)