Re: [concurrency-interest] RFR: 8065804: JEP171:Clarifications/corrections for fence intrinsics

Hans Boehm Mon, 01 Dec 2014 09:26:05 -0800

Definitions here seem to be less clear than I would like.  What I meant by
"store atomicity", which I think is more or less synonymous with
"multi-copy atomicity" is that a store becomes visible to all observers at
the same time, or equivalently stores become visible to all observers in a
consistent order.  In my view, IRIW is the canonical test for that.
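
For readers who want to see IRIW in Java terms, here is a minimal sketch, assuming two volatile fields play the roles of the two independently written locations. The names IriwLitmus, Trial and countForbidden are hypothetical and chosen only for this illustration. A finite run can only fail to observe the outcome; it does not prove anything about a given JVM.

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative IRIW litmus test; all names here are hypothetical.
class IriwLitmus {
    // One fresh instance per trial so every trial starts from foo = bar = 0.
    static final class Trial {
        volatile int foo, bar;   // the two independently written locations
        int r1, r2, r3, r4;      // plain fields, read only after join()
    }

    // Runs one trial and reports whether the two readers saw the two
    // writes in opposite orders (r1 = 1, r2 = 0, r3 = 1, r4 = 0).
    static boolean runOnce() {
        Trial t = new Trial();
        CyclicBarrier start = new CyclicBarrier(4);
        Thread[] threads = {
            worker(start, () -> t.foo = 1),                        // writer 1
            worker(start, () -> t.bar = 1),                        // writer 2
            worker(start, () -> { t.r1 = t.bar; t.r2 = t.foo; }),  // reader 1
            worker(start, () -> { t.r3 = t.foo; t.r4 = t.bar; }),  // reader 2
        };
        for (Thread th : threads) th.start();
        try {
            for (Thread th : threads) th.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return t.r1 == 1 && t.r2 == 0 && t.r3 == 1 && t.r4 == 0;
    }

    // Each worker waits on the barrier so the four bodies start close together.
    static Thread worker(CyclicBarrier start, Runnable body) {
        return new Thread(() -> {
            try {
                start.await();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
            body.run();
        });
    }

    // Counts how many trials produced the inconsistent-order outcome.
    static int countForbidden(int trials) {
        int n = 0;
        for (int i = 0; i < trials; i++) {
            if (runOnce()) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("forbidden outcomes: " + countForbidden(2000));
    }
}
```

Because foo and bar are volatile, the Java memory model forbids that outcome, so the count should stay at zero on a conforming JVM.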


I agree with Roman that IRIW requirements for Java volatiles are here to
stay.  Many of us thought about ways to relax the requirement about 8 or 9
years ago.  In my view:

- Sequential consistency for data-race-free code is the only model that we
can possibly explain to the majority of programmers.  (Even stronger models
like some notions of region serializability may also make sense, but
they'll cost you.)  This requires IRIW.  This model is also by far the
easiest to reason about formally.

- The next weaker model that seems to be somewhat explainable, but really
only to experts, is something along the lines of the C++ acquire/release
model.  This doesn't require IRIW.  It's clearly too weak to replace Java
volatile behavior, since it also fails to work for Dekker-like settings,
which are fairly common.  (Nonexperts perhaps shouldn't write lock-free
Dekker-like code, but it's hard to explain precisely what they shouldn't
be doing.)

- A large amount of effort to generate models between those two failed to
generate anything viable.  The general experience was that once you no
longer require IRIW, you also end up failing various other, potentially
more important, litmus tests in ways that are really difficult to explain.
And those models generally looked too complex to me to form a viable basis
for real programs.
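
The Dekker-style case in the second bullet can be sketched in the same way. This is a minimal illustration, assuming each thread stores to its own volatile flag and then loads the other thread's flag; the names DekkerLitmus, Trial and countBothMissed are hypothetical. The property the pattern relies on is that the two threads cannot both miss each other's store.

```java
// Illustrative Dekker-style store-then-load test; all names are hypothetical.
class DekkerLitmus {
    static final class Trial {
        volatile boolean flagA, flagB;  // each thread raises its own flag
        boolean aSawB, bSawA;           // plain fields, read only after join()
    }

    // Runs one trial and reports whether both threads missed the other's flag.
    static boolean bothMissed() {
        Trial t = new Trial();
        Thread a = new Thread(() -> { t.flagA = true; t.aSawB = t.flagB; });
        Thread b = new Thread(() -> { t.flagB = true; t.bSawA = t.flagA; });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return !t.aSawB && !t.bSawA;
    }

    // Counts trials in which neither thread saw the other's store.
    static int countBothMissed(int trials) {
        int n = 0;
        for (int i = 0; i < trials; i++) {
            if (bothMissed()) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("both missed: " + countBothMissed(5000));
    }
}
```

With volatile flags the count should stay at zero.  If the store were only a release and the load only an acquire, the store could be reordered after the following load, and both threads could miss each other.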

I think many people, even those who would rather not enforce IRIW,
generally agree with this characterization.

Hans

On Tue, Nov 25, 2014 at 6:10 PM, David Holmes <davidchol...@aapt.net.au>
wrote:

>  Hi Hans,
>
> Given IRIW is a thorn in everyone's side and has no known useful benefit,
> and can hopefully be killed off in the future, let's not get bogged down in
> IRIW. But none of what you say below relates to multi-copy-atomicity.
>
> Cheers,
> David
>
> -----Original Message-----
> *From:* hjkhbo...@gmail.com [mailto:hjkhbo...@gmail.com]*On Behalf Of *Hans
> Boehm
> *Sent:* Wednesday, 26 November 2014 12:04 PM
> *To:* dhol...@ieee.org
> *Cc:* Stephan Diestelhorst; concurrency-inter...@cs.oswego.edu;
> core-libs-dev
> *Subject:* Re: [concurrency-interest] RFR: 8065804:
> JEP171:Clarifications/corrections for fence intrinsics
>
> To be concrete here, on Power, loads can normally be ordered by an address
> dependency or light-weight fence (lwsync).  However, neither is enough to
> prevent the questionable outcome for IRIW, since neither ensures that the
> stores in T1 and T2 will be made visible to other threads in a consistent
> order.  That outcome can be prevented by using heavyweight fence (sync)
> instructions between the loads instead.  Peter Sewell's group concluded
> that to enforce correct volatile behavior on Power, you essentially need a
> heavyweight fence between every pair of volatile operations.
> That cannot be understood based on simple ordering constraints.
>
> As Stephan pointed out, there are similar issues on ARM, but they're less
> commonly encountered in a Java implementation.  If you're lucky, you can
> get to the right implementation recipe by looking at only reordering, I
> think.
>
>
> On Tue, Nov 25, 2014 at 4:36 PM, David Holmes <davidchol...@aapt.net.au>
> wrote:
>
>>  Stephan Diestelhorst writes:
>> >
>> > David Holmes wrote:
>> > > Stephan Diestelhorst writes:
>> > > > On Tuesday, 25 November 2014, 11:15:36, Hans Boehm wrote:
>> > > > > I'm no hardware architect, but fundamentally it seems to me that
>> > > > >
>> > > > > load x
>> > > > > acquire_fence
>> > > > >
>> > > > > imposes a much more stringent constraint than
>> > > > >
>> > > > > load_acquire x
>> > > > >
>> > > > > Consider the case in which the load from x is an L1 hit, but a
>> > > > > preceding load (from say y) is a long-latency miss.  If we enforce
>> > > > > ordering by just waiting for completion of prior operation, the
>> > > > > former has to wait for the load from y to complete; while the
>> > > > > latter doesn't.  I find it hard to believe that this doesn't leave
>> > > > > an appreciable amount of performance on the table, at least for
>> > > > > some interesting microarchitectures.
>> > > >
>> > > > I agree, Hans, that this is a reasonable assumption.  Load_acquire x
>> > > > does allow roach motel, whereas the acquire fence does not.
>> > > >
>> > > > >  In addition, for better or worse, fencing requirements on at
>> > > > >  least Power are actually driven as much by store atomicity
>> > > > >  issues, as by the ordering issues discussed in the cookbook.
>> > > > >  This was not understood in 2005, and unfortunately doesn't seem
>> > > > >  to be amenable to the kind of straightforward explanation as in
>> > > > >  Doug's cookbook.
>> > > >
>> > > > Coming from a strongly ordered architecture to a weakly ordered one
>> > > > myself, I also needed some mental adjustment about store
>> > > > (multi-copy) atomicity.  I can imagine others will be unaware of
>> > > > this difference, too, even in 2014.
>> > >
>> > > Sorry I'm missing the connection between fences and multi-copy
>> > > atomicity.
>> >
>> > One example is the classic IRIW.  With non-multi copy atomic stores, but
>> > ordered (say through a dependency) loads in the following example:
>> >
>> > Memory: foo = bar = 0
>> > _T1_         _T2_         _T3_                              _T4_
>> > st (foo),1   st (bar),1   ld r1, (bar)                      ld r3,(foo)
>> >                           <addr dep / local "fence" here>   <addr dep>
>> >                           ld r2, (foo)                      ld r4, (bar)
>> >
>> > You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on non-multi-copy atomic
>> > machines.  On TSO boxes, this is not possible.  That means that the
>> > memory fence that will prevent such a behaviour (DMB on ARM) needs to
>> > carry some additional oomph in ensuring multi-copy atomicity, or rather
>> > prevent you from seeing it (which is the same thing).
>>
>> I take it as given that any code for which you may have ordering
>> constraints, must first have basic atomicity properties for loads and
>> stores. I would not expect any kind of fence to add multi-copy-atomicity
>> where there was none.
>>
>> David
>>
>> > Stephan
>> >
>> > _______________________________________________
>> > Concurrency-interest mailing list
>> > concurrency-inter...@cs.oswego.edu
>> > http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>>
>
>
