Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

Paul E. McKenney Tue, 18 Jun 2013 14:00:18 -0700

On Tue, Jun 18, 2013 at 03:24:24PM +0200, Paolo Bonzini wrote:
> Il 17/06/2013 20:57, Richard Henderson ha scritto:
> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
> > ...
> >> + */
> >> +#ifndef atomic_mb_read
> >> +#if QEMU_GNUC_PREREQ(4, 8)
> >> +#define atomic_mb_read(ptr)       ({             \
> >> +    typeof(*ptr) _val;                           \
> >> +    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
> >> +    _val;                                        \
> >> +})
> >> +#else
> >> +#define atomic_mb_read(ptr)    ({           \
> >> +    typeof(*ptr) _val = atomic_read(ptr);   \
> >> +    smp_rmb();                              \
> >> +    _val;                                   \
> > 
> > This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
> > load_acquire, one needs barriers before and after the atomic_read in order 
> > to
> > implement SEQ_CST.
> 
> The store-load barrier between atomic_mb_set and atomic_mb_read are
> provided by the atomic_mb_set.  The load-load barrier between two
> atomic_mb_reads are provided by the first read.
> 
> > So again I have to ask, what semantics are you actually looking for here?
> 
> So, everything I found points to Java volatile being sequentially
> consistent, though I'm still not sure why C11 suggests hwsync for load
> seq-cst on POWER instead of lwsync.  Comparing the sequences that my
> code (based on the JSR-133 cookbook) generate with the C11 suggestions
> you get:
> 
>     Load seqcst               hwsync; ld; cmp; bc; isync
>     Store seqcst              hwsync; st
>     (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
> 
>     Load Java volatile        ld; lwsync
>     Store Java volatile       lwsync; st; hwsync
>     (source: http://g.oswego.edu/dl/jmm/cookbook.html)
> 
> where the lwsync in loads acts as a load-load barrier, while the one in
> stores acts as load-store + store-store barrier.
> 
> Is the cookbook known to be wrong?
> 
> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
> paragraph of
> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
> seems to suggest?


First, I am not a fan of SC, mostly because there don't seem to be many
(any?) production-quality algorithms that need SC.  But if you really
want to take a parallel-programming trip back to the 1980s, let's go!  ;-)

The canonical litmus test for SC is independent reads of independent
writes, or IRIW for short.  It is as follows, with x and y both initially
zero:

        Thread 0        Thread 1        Thread 2        Thread 3
        x=1;            y=1;            r1=x;           r2=y;
                                        r3=y;           r4=x;

On a sequentially consistent system, r1==1&&r2==1&&r3==0&&&r4==0 is
a forbidden final outcome, where "final" means that we wait for all
four threads to terminate and for their effects on memory to propagate
fully through the system before checking the final values.

The first mapping has been demonstrated to be correct at the URL
you provide.  So let's apply the second mapping:

        Thread 0        Thread 1        Thread 2        Thread 3
        lwsync;         lwsync;         r1=x;           r2=y;
        x=1;            y=1;            lwsync;         lwsync;
        hwsync;         hwsync;         r3=y;           r4=x;
                                        lwsync;         lwsync;

Now barriers operate by forcing ordering between accesses on either
side of the barrier.  This means that barriers at the very start or
the very end of a given thread's execution have no effect and may
be ignored.  This results in the following:

        Thread 0        Thread 1        Thread 2        Thread 3
        x=1;            y=1;            r1=x;           r2=y;
                                        lwsync;         lwsync;
                                        r3=y;           r4=x;

This sequence does -not- result in SC execution, as documented in the
table at the beginning of Section 6.3 on page 23 of:

        http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

See the row with "IRIW+lwsync", which shows that this code sequence does
not preserve SC ordering, neither in theory nor on real hardware.  (And I
strongly recommend this paper, even before having read it thoroughly
myself, hence my taking the precaution of CCing the authors in case I
am misinterpreting something.)  If you want to prove it for yourself,
use the software tool called out at http://lwn.net/Articles/470681/.

So is this a real issue?  Show me a parallel algorithm that relies on
SC, demonstrate that it is the best algorithm for the problem that it
solves, and then demonstrate that your problem statement is the best
match for some important real-world problem.  Until then, no, this is
not a real issue.

                                                        Thanx, Paul

PS:  Nevertheless, I personally prefer the C++ formulation, but that is
     only because I stand with one foot in theory and the other in
     practice.  If I were a pure practitioner, I would probably strongly
     prefer the Java formulation.

Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

Reply via email to