On Tue, Jun 18, 2013 at 03:24:24PM +0200, Paolo Bonzini wrote: > Il 17/06/2013 20:57, Richard Henderson ha scritto: > >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq, > >> + * while an atomic_mb_set is a st.rel followed by a memory barrier. > > ... > >> + */ > >> +#ifndef atomic_mb_read > >> +#if QEMU_GNUC_PREREQ(4, 8) > >> +#define atomic_mb_read(ptr) ({ \ > >> + typeof(*ptr) _val; \ > >> + __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \ > >> + _val; \ > >> +}) > >> +#else > >> +#define atomic_mb_read(ptr) ({ \ > >> + typeof(*ptr) _val = atomic_read(ptr); \ > >> + smp_rmb(); \ > >> + _val; \ > > > > This latter definition is ACQUIRE not SEQ_CST (except for ia64). Without > > load_acquire, one needs barriers before and after the atomic_read in order > > to > > implement SEQ_CST. > > The store-load barrier between atomic_mb_set and atomic_mb_read are > provided by the atomic_mb_set. The load-load barrier between two > atomic_mb_reads are provided by the first read. > > > So again I have to ask, what semantics are you actually looking for here? > > So, everything I found points to Java volatile being sequentially > consistent, though I'm still not sure why C11 suggests hwsync for load > seq-cst on POWER instead of lwsync. Comparing the sequences that my > code (based on the JSR-133 cookbook) generate with the C11 suggestions > you get: > > Load seqcst hwsync; ld; cmp; bc; isync > Store seqcst hwsync; st > (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html) > > Load Java volatile ld; lwsync > Store Java volatile lwsync; st; hwsync > (source: http://g.oswego.edu/dl/jmm/cookbook.html) > > where the lwsync in loads acts as a load-load barrier, while the one in > stores acts as load-store + store-store barrier. > > Is the cookbook known to be wrong? > > Or is Java volatile somewhere between acq_rel and seq_cst, as the last > paragraph of > http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile > seems to suggest?
First, I am not a fan of SC, mostly because there don't seem to be many (any?) production-quality algorithms that need SC. But if you really want to take a parallel-programming trip back to the 1980s, let's go! ;-) The canonical litmus test for SC is independent reads of independent writes, or IRIW for short. It is as follows, with x and y both initially zero: Thread 0 Thread 1 Thread 2 Thread 3 x=1; y=1; r1=x; r2=y; r3=y; r4=x; On a sequentially consistent system, r1==1&&r2==1&&r3==0&&&r4==0 is a forbidden final outcome, where "final" means that we wait for all four threads to terminate and for their effects on memory to propagate fully through the system before checking the final values. The first mapping has been demonstrated to be correct at the URL you provide. So let's apply the second mapping: Thread 0 Thread 1 Thread 2 Thread 3 lwsync; lwsync; r1=x; r2=y; x=1; y=1; lwsync; lwsync; hwsync; hwsync; r3=y; r4=x; lwsync; lwsync; Now barriers operate by forcing ordering between accesses on either side of the barrier. This means that barriers at the very start or the very end of a given thread's execution have no effect and may be ignored. This results in the following: Thread 0 Thread 1 Thread 2 Thread 3 x=1; y=1; r1=x; r2=y; lwsync; lwsync; r3=y; r4=x; This sequence does -not- result in SC execution, as documented in the table at the beginning of Section 6.3 on page 23 of: http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf See the row with "IRIW+lwsync", which shows that this code sequence does not preserve SC ordering, neither in theory nor on real hardware. (And I strongly recommend this paper, even before having read it thoroughly myself, hence my taking the precaution of CCing the authors in case I am misinterpreting something.) If you want to prove it for yourself, use the software tool called out at http://lwn.net/Articles/470681/. So is this a real issue? Show me a parallel algorithm that relies on SC, demonstrate that it is the best algorithm for the problem that it solves, and then demonstrate that your problem statement is the best match for some important real-world problem. Until then, no, this is not a real issue. Thanx, Paul PS: Nevertheless, I personally prefer the C++ formulation, but that is only because I stand with one foot in theory and the other in practice. If I were a pure practitioner, I would probably strongly prefer the Java formulation.