On Mon, Dec 8, 2014 at 12:46 AM, David Holmes <davidchol...@aapt.net.au> wrote:
> Martin,
>
> The paper you cite is about ARM and Power architectures - why do you think
> the lack of mention of x86/sparc implies those architectures are
> multiple-copy-atomic?

Reading some more in the same paper, I see:

"""
Returning to the two properties above, in TSO a thread can see its own writes before they become visible to other threads (by reading them from its write buffer), but any write becomes visible to all other threads simultaneously: TSO is a multiple-copy atomic model, in the terminology of Collier [Col92]. One can also see the possibility of reading from the local write buffer as allowing a specific kind of local reordering. A program that writes one location x then reads another location y might execute by adding the write to x to the thread’s buffer, then reading y from memory, before finally making the write to x visible to other threads by flushing it from the buffer. In this case the thread reads the value of y that was in the memory before the new write of x hits memory.
"""

So (as you say) with TSO you don't have a total order of stores if you read your own writes out of your own CPU's write buffer.

However, my interpretation of "multiple-copy atomic" is that the initial publishing thread can choose to write to memory using an instruction with a sufficiently strong memory barrier attached (e.g. LOCK;XXX on x86), so that the write buffer is flushed, and plain relaxed loads can then be used everywhere else to read those memory locations. This explains the situation on x86 and sparc, where volatile writes are expensive and volatile reads are "free", and you still get sequential consistency for Java volatiles.

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
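
As an illustration of the write-then-read pattern described in the quoted passage, here is a minimal sketch of the classic store-buffering litmus test using Java volatiles. The class name StoreBufferingSketch, the method countBothZero, and the field names are hypothetical and are not taken from the paper. The assumption is that a JIT on x86 typically implements each volatile store with a fenced or LOCK-prefixed instruction and each volatile load as a plain load; the Java memory model only promises the resulting behaviour, not that instruction choice.

```java
public class StoreBufferingSketch {
    // Volatile accesses are sequentially consistent under the Java memory model.
    static volatile int x, y;
    // Plain fields are enough for the results: Thread.join() gives the
    // happens-before edge that makes them visible to the checking thread.
    static int r1, r2;

    // Runs the litmus test repeatedly and counts how often both threads read 0.
    static int countBothZero(int iterations) {
        int bothZero = 0;
        for (int i = 0; i < iterations; i++) {
            x = 0;
            y = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; }); // write x, then read y
            Thread t2 = new Thread(() -> { y = 1; r2 = x; }); // write y, then read x
            t1.start();
            t2.start();
            try {
                t1.join();
                t2.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
            if (r1 == 0 && r2 == 0) {
                bothZero++; // the outcome a store buffer alone would allow
            }
        }
        return bothZero;
    }

    public static void main(String[] args) {
        // With volatile x and y this count should be 0 under the Java memory model.
        System.out.println("both-zero outcomes: " + countBothZero(1000));
    }
}
```

If the volatile modifier is dropped from x and y, the r1 == 0 && r2 == 0 outcome becomes permitted, both by the store buffering that TSO allows and by compiler reordering. Whether it actually shows up in a given run depends on timing, so a run that never observes it proves nothing.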