On 18 June 2013 15:50, Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> On Tue, Jun 18, 2013 at 03:24:24PM +0200, Paolo Bonzini wrote:
>> On 17/06/2013 20:57, Richard Henderson wrote:
>> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
>> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
>> > ...
>> >> + */
>> >> +#ifndef atomic_mb_read
>> >> +#if QEMU_GNUC_PREREQ(4, 8)
>> >> +#define atomic_mb_read(ptr) ({ \
>> >> + typeof(*ptr) _val; \
>> >> + __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
>> >> + _val; \
>> >> +})
>> >> +#else
>> >> +#define atomic_mb_read(ptr) ({ \
>> >> + typeof(*ptr) _val = atomic_read(ptr); \
>> >> + smp_rmb(); \
>> >> + _val; \
>> >
>> > This latter definition is ACQUIRE not SEQ_CST (except for ia64). Without
>> > load_acquire, one needs barriers before and after the atomic_read in
>> > order to implement SEQ_CST.
>>
>> The store-load barrier between atomic_mb_set and atomic_mb_read is
>> provided by the atomic_mb_set. The load-load barrier between two
>> atomic_mb_reads is provided by the first read.
>>
>> > So again I have to ask, what semantics are you actually looking for here?
>>
>> So, everything I found points to Java volatile being sequentially
>> consistent, though I'm still not sure why C11 suggests hwsync for load
>> seq-cst on POWER instead of lwsync. Comparing the sequences that my
>> code (based on the JSR-133 cookbook) generates with the C11 suggestions
>> you get:
>>
>>     Load seqcst            hwsync; ld; cmp; bc; isync
>>     Store seqcst           hwsync; st
>>     (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
>>
>>     Load Java volatile     ld; lwsync
>>     Store Java volatile    lwsync; st; hwsync
>>     (source: http://g.oswego.edu/dl/jmm/cookbook.html)
>>
>> where the lwsync in loads acts as a load-load barrier, while the one in
>> stores acts as load-store + store-store barrier.

(I don't know the context of these mails, but let me add a gloss to
correct a common misconception there: it's not sufficient to think
about these things solely in terms of thread-local reordering, as the
tutorial that Paul points to below explains. One has to think about
the lack of multi-copy atomicity too, as seen in that IRIW example.)

>> Is the cookbook known to be wrong?
>>
>> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
>> paragraph of
>> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
>> seems to suggest?
>
> First, I am not a fan of SC, mostly because there don't seem to be many
> (any?) production-quality algorithms that need SC. But if you really
> want to take a parallel-programming trip back to the 1980s, let's go! ;-)
>
> The canonical litmus test for SC is independent reads of independent
> writes, or IRIW for short. It is as follows, with x and y both initially
> zero:
>
>     Thread 0        Thread 1        Thread 2        Thread 3
>     x=1;            y=1;            r1=x;           r2=y;
>                                     r3=y;           r4=x;
>
> On a sequentially consistent system, r1==1&&r2==1&&r3==0&&r4==0 is
> a forbidden final outcome, where "final" means that we wait for all
> four threads to terminate and for their effects on memory to propagate
> fully through the system before checking the final values.
>
> The first mapping has been demonstrated to be correct at the URL
> you provide. So let's apply the second mapping:
>
>     Thread 0        Thread 1        Thread 2        Thread 3
>     lwsync;         lwsync;         r1=x;           r2=y;
>     x=1;            y=1;            lwsync;         lwsync;
>     hwsync;         hwsync;         r3=y;           r4=x;
>                                     lwsync;         lwsync;
>
> Now barriers operate by forcing ordering between accesses on either
> side of the barrier. This means that barriers at the very start or
> the very end of a given thread's execution have no effect and may
> be ignored.
> This results in the following:
>
>     Thread 0        Thread 1        Thread 2        Thread 3
>     x=1;            y=1;            r1=x;           r2=y;
>                                     lwsync;         lwsync;
>                                     r3=y;           r4=x;
>
> This sequence does -not- result in SC execution, as documented in the
> table at the beginning of Section 6.3 on page 23 of:
>
>     http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>
> See the row with "IRIW+lwsync", which shows that this code sequence does
> not preserve SC ordering, neither in theory nor on real hardware. (And I
> strongly recommend this paper, even before having read it thoroughly
> myself, hence my taking the precaution of CCing the authors in case I
> am misinterpreting something.) If you want to prove it for yourself,
> use the software tool called out at http://lwn.net/Articles/470681/.
>
> So is this a real issue? Show me a parallel algorithm that relies on
> SC, demonstrate that it is the best algorithm for the problem that it
> solves, and then demonstrate that your problem statement is the best
> match for some important real-world problem. Until then, no, this is
> not a real issue.
>
>                                                         Thanx, Paul
>
> PS: Nevertheless, I personally prefer the C++ formulation, but that is
>     only because I stand with one foot in theory and the other in
>     practice. If I were a pure practitioner, I would probably strongly
>     prefer the Java formulation.
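
For anyone who wants to poke at IRIW from C, here is a minimal sketch of
the litmus test above using C11 <stdatomic.h> and pthreads. It is not
code from the patch under discussion: the helper name
iriw_count_forbidden and the harness around it are made up for
illustration. The plain atomic_load/atomic_store calls default to
memory_order_seq_cst, so the C11 model forbids the outcome
r1==1&&r2==1&&r3==0&&r4==0 and the count should stay at zero. A harness
this crude (fresh threads on every iteration) is also unlikely to hit the
outcome even with weaker orderings, so a clean run shows very little; the
tool from the LWN article is the right way to explore what the
architecture actually permits.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Shared locations of the IRIW test; both are reset to zero per run. */
static atomic_int x, y;
/* "Registers" filled in by the two reader threads. */
static int r1, r2, r3, r4;

/* Thread 0 and Thread 1: one independent seq_cst store each. */
static void *writer_x(void *arg) { (void)arg; atomic_store(&x, 1); return NULL; }
static void *writer_y(void *arg) { (void)arg; atomic_store(&y, 1); return NULL; }

/* Thread 2: reads x, then y (both seq_cst loads). */
static void *reader_xy(void *arg)
{
    (void)arg;
    r1 = atomic_load(&x);
    r3 = atomic_load(&y);
    return NULL;
}

/* Thread 3: reads y, then x (both seq_cst loads). */
static void *reader_yx(void *arg)
{
    (void)arg;
    r2 = atomic_load(&y);
    r4 = atomic_load(&x);
    return NULL;
}

/*
 * Hypothetical helper: run IRIW `iterations` times and return how often
 * the outcome forbidden under SC was observed, or -1 if a thread could
 * not be created.
 */
long iriw_count_forbidden(long iterations)
{
    void *(*body[4])(void *) = { writer_x, writer_y, reader_xy, reader_yx };
    long forbidden = 0;

    assert(iterations >= 0);
    for (long i = 0; i < iterations; i++) {
        pthread_t t[4];
        int started = 0;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        while (started < 4 &&
               pthread_create(&t[started], NULL, body[started], NULL) == 0)
            started++;
        /* pthread_join orders the readers' writes to r1..r4 before us. */
        for (int j = 0; j < started; j++)
            pthread_join(t[j], NULL);
        if (started < 4)
            return -1;
        if (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0)
            forbidden++;
    }
    return forbidden;
}
```

Calling iriw_count_forbidden(10000) from a main() and printing the result
is enough to try it. Weakening the loads to memory_order_acquire and the
stores to memory_order_release gives a variant for which C11 no longer
forbids the outcome, which is the gap between acq_rel and seq_cst that
the thread is circling around.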