On 18 June 2013 15:50, Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> On Tue, Jun 18, 2013 at 03:24:24PM +0200, Paolo Bonzini wrote:
>> Il 17/06/2013 20:57, Richard Henderson ha scritto:
>> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
>> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
>> > ...
>> >> + */
>> >> +#ifndef atomic_mb_read
>> >> +#if QEMU_GNUC_PREREQ(4, 8)
>> >> +#define atomic_mb_read(ptr)       ({             \
>> >> +    typeof(*ptr) _val;                           \
>> >> +    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
>> >> +    _val;                                        \
>> >> +})
>> >> +#else
>> >> +#define atomic_mb_read(ptr)    ({           \
>> >> +    typeof(*ptr) _val = atomic_read(ptr);   \
>> >> +    smp_rmb();                              \
>> >> +    _val;                                   \
>> >
>> > This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
>> > load_acquire, one needs barriers before and after the atomic_read in order 
>> > to
>> > implement SEQ_CST.
>>
>> The store-load barrier between atomic_mb_set and atomic_mb_read are
>> provided by the atomic_mb_set.  The load-load barrier between two
>> atomic_mb_reads are provided by the first read.
>>
>> > So again I have to ask, what semantics are you actually looking for here?
>>
>> So, everything I found points to Java volatile being sequentially
>> consistent, though I'm still not sure why C11 suggests hwsync for load
>> seq-cst on POWER instead of lwsync.  Comparing the sequences that my
>> code (based on the JSR-133 cookbook) generate with the C11 suggestions
>> you get:
>>
>>     Load seqcst               hwsync; ld; cmp; bc; isync
>>     Store seqcst              hwsync; st
>>     (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
>>
>>     Load Java volatile        ld; lwsync
>>     Store Java volatile       lwsync; st; hwsync
>>     (source: http://g.oswego.edu/dl/jmm/cookbook.html)
>>
>> where the lwsync in loads acts as a load-load barrier, while the one in
>> stores acts as load-store + store-store barrier.

(I don't know the context of these mails, but let me add a gloss to
correct a common misconception there:  it's not sufficient to think
about
these things solely in terms of thread-local reordering, as the
tutorial that Paul points to
below explains. One has to think about the lack of multi-copy
atomicity too, as seen in that IRIW example)

>> Is the cookbook known to be wrong?
>>
>> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
>> paragraph of
>> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
>> seems to suggest?
>
> First, I am not a fan of SC, mostly because there don't seem to be many
> (any?) production-quality algorithms that need SC.  But if you really
> want to take a parallel-programming trip back to the 1980s, let's go!  ;-)
>
> The canonical litmus test for SC is independent reads of independent
> writes, or IRIW for short.  It is as follows, with x and y both initially
> zero:
>
>         Thread 0        Thread 1        Thread 2        Thread 3
>         x=1;            y=1;            r1=x;           r2=y;
>                                         r3=y;           r4=x;
>
> On a sequentially consistent system, r1==1&&r2==1&&r3==0&&&r4==0 is
> a forbidden final outcome, where "final" means that we wait for all
> four threads to terminate and for their effects on memory to propagate
> fully through the system before checking the final values.
>
> The first mapping has been demonstrated to be correct at the URL
> you provide.  So let's apply the second mapping:
>
>         Thread 0        Thread 1        Thread 2        Thread 3
>         lwsync;         lwsync;         r1=x;           r2=y;
>         x=1;            y=1;            lwsync;         lwsync;
>         hwsync;         hwsync;         r3=y;           r4=x;
>                                         lwsync;         lwsync;
>
> Now barriers operate by forcing ordering between accesses on either
> side of the barrier.  This means that barriers at the very start or
> the very end of a given thread's execution have no effect and may
> be ignored.  This results in the following:
>
>         Thread 0        Thread 1        Thread 2        Thread 3
>         x=1;            y=1;            r1=x;           r2=y;
>                                         lwsync;         lwsync;
>                                         r3=y;           r4=x;
>
> This sequence does -not- result in SC execution, as documented in the
> table at the beginning of Section 6.3 on page 23 of:
>
>         http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>
> See the row with "IRIW+lwsync", which shows that this code sequence does
> not preserve SC ordering, neither in theory nor on real hardware.  (And I
> strongly recommend this paper, even before having read it thoroughly
> myself, hence my taking the precaution of CCing the authors in case I
> am misinterpreting something.)  If you want to prove it for yourself,
> use the software tool called out at http://lwn.net/Articles/470681/.
>
> So is this a real issue?  Show me a parallel algorithm that relies on
> SC, demonstrate that it is the best algorithm for the problem that it
> solves, and then demonstrate that your problem statement is the best
> match for some important real-world problem.  Until then, no, this is
> not a real issue.
>
>                                                         Thanx, Paul
>
> PS:  Nevertheless, I personally prefer the C++ formulation, but that is
>      only because I stand with one foot in theory and the other in
>      practice.  If I were a pure practitioner, I would probably strongly
>      prefer the Java formulation.
>

Reply via email to