I see it differently. The issue is ordering - the inability of non-TSO platforms to enforce a total order of independent stores. The first loads are also independent, so their ordering can be neither enforced nor detected. But the following load can detect the lack of a total order over the stores and loads, so it has to be enforced through a heavyweight barrier.

But I understood now why other barriers won't work. Thank you.

Alex


On 09/12/2014 21:59, David Holmes wrote:
In this case the issue is not ordering per se (which is what dependencies help with) but global visibility. After performing the first read each thread must ensure that its second read will return what the other thread saw for the first read - hence a full dmb/sync between the reads; or, generalizing, a full dmb/sync after every volatile read.
David
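
[A minimal Java sketch of the IRIW shape under discussion; the class and method names are hypothetical, not from the thread. For volatile fields the JMM forbids the outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 below, which is why a Power implementation ends up needing a full sync between the two loads in each reader thread:]

    class IriwSketch {
        volatile int x, y;

        void writer1() { x = 1; }          // T1
        void writer2() { y = 1; }          // T2

        void reader1(int[] r) {            // T3
            r[0] = x;                      // r1; full sync needed here on Power
            r[1] = y;                      // r2
        }

        void reader2(int[] r) {            // T4
            r[2] = y;                      // r3; full sync needed here on Power
            r[3] = x;                      // r4
        }
    }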

    -----Original Message-----
    *From:* Oleksandr Otenko [mailto:oleksandr.ote...@oracle.com]
    *Sent:* Wednesday, 10 December 2014 7:54 AM
    *To:* dhol...@ieee.org; Hans Boehm
    *Cc:* core-libs-dev; concurrency-inter...@cs.oswego.edu
    *Subject:* Re: [concurrency-interest] RFR: 8065804:
    JEP171:Clarifications/corrections for fence intrinsics

    Yes, I do understand the reader needs barriers, too. I guess I was
    wondering more why the reader would need something stronger than
    what dependencies etc. could enforce. I guess I'll read what Martin
    forwarded first.

    Alex


    On 09/12/2014 21:37, David Holmes wrote:
    See my earlier response to Martin. The reader has to force a
    consistent view of memory - the writer can't, as the write
    escapes before it can issue the barrier.
    David

        -----Original Message-----
        *From:* concurrency-interest-boun...@cs.oswego.edu
        [mailto:concurrency-interest-boun...@cs.oswego.edu]*On Behalf
        Of *Oleksandr Otenko
        *Sent:* Wednesday, 10 December 2014 6:04 AM
        *To:* Hans Boehm; dhol...@ieee.org
        *Cc:* core-libs-dev; concurrency-inter...@cs.oswego.edu
        *Subject:* Re: [concurrency-interest] RFR: 8065804:
        JEP171:Clarifications/corrections for fence intrinsics

        On 26/11/2014 02:04, Hans Boehm wrote:
        To be concrete here, on Power, loads can normally be ordered
        by an address dependency or a light-weight fence (lwsync).
        However, neither is enough to prevent the questionable
        outcome for IRIW, since it doesn't ensure that the stores in
        T1 and T2 will be made visible to other threads in a
        consistent order.  That outcome can be prevented by using
        heavyweight fence (sync) instructions between the loads
        instead.

        Why would they need fences between loads instead of syncing
        the order of stores?


        Alex


        Peter Sewell's group concluded that to enforce correct
        volatile behavior on Power, you essentially need a
        heavyweight fence between every pair of volatile operations.
        That cannot be understood based on simple ordering
        constraints.

        As Stephan pointed out, there are similar issues on ARM, but
        they're less commonly encountered in a Java implementation.
        If you're lucky, you can get to the right implementation
        recipe by looking at only reordering, I think.
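
        [To make "implementation recipe" concrete: a hypothetical sketch in
        terms of the JEP 171 sun.misc.Unsafe fence intrinsics, contrasting a
        reordering-only read with the stronger full-fence read discussed
        above. This is illustrative only, not how HotSpot actually lowers
        volatile accesses:]

        import java.lang.reflect.Field;
        import sun.misc.Unsafe;

        class VolatileReadRecipes {
            private static final Unsafe U = unsafe();

            static int field;  // stand-in for the storage behind a volatile field

            // Reordering-only recipe: load, then loadFence (roughly
            // lwsync-class on Power).  Per the discussion above, this is
            // NOT sufficient to rule out the IRIW outcome.
            static int readAcquireOnly() {
                int v = field;
                U.loadFence();
                return v;
            }

            // Stronger recipe: a full fence (sync on Power, dmb on ARM)
            // after every volatile read, restoring a globally consistent
            // view across reader threads.
            static int readWithFullFence() {
                int v = field;
                U.fullFence();
                return v;
            }

            private static Unsafe unsafe() {
                try {
                    Field f = Unsafe.class.getDeclaredField("theUnsafe");
                    f.setAccessible(true);
                    return (Unsafe) f.get(null);
                } catch (ReflectiveOperationException e) {
                    throw new AssertionError(e);
                }
            }
        }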


        On Tue, Nov 25, 2014 at 4:36 PM, David Holmes
        <davidchol...@aapt.net.au <mailto:davidchol...@aapt.net.au>>
        wrote:

            Stephan Diestelhorst writes:
            >
            > David Holmes wrote:
            > > Stephan Diestelhorst writes:
            > > > Am Dienstag, 25. November 2014, 11:15:36 schrieb
            Hans Boehm:
            > > > > I'm no hardware architect, but fundamentally it
            seems to me that
            > > > >
            > > > > load x
            > > > > acquire_fence
            > > > >
            > > > > imposes a much more stringent constraint than
            > > > >
            > > > > load_acquire x
            > > > >
            > > > > Consider the case in which the load from x is an
            L1 hit, but a
            > > > > preceding load (from say y) is a long-latency
            miss.  If we enforce
            > > > > ordering by just waiting for completion of prior
            operation, the
            > > > > former has to wait for the load from y to
            complete; while the
            > > > > latter doesn't.  I find it hard to believe that
            this doesn't leave
            > > > > an appreciable amount of performance on the
            table, at least for
            > > > > some interesting microarchitectures.
            > > >
            > > > I agree, Hans, that this is a reasonable
            assumption.  Load_acquire x
            > > > does allow roach motel, whereas the acquire fence
            does not.
            > > >
            > > > >  In addition, for better or worse, fencing
            requirements on at least
            > > > >  Power are actually driven as much by store
            atomicity issues, as by
            > > > >  the ordering issues discussed in the cookbook. This was not
            > > > >  understood in 2005, and unfortunately doesn't
            seem to be
            > amenable to
            > > > >  the kind of straightforward explanation as in
            Doug's cookbook.
            > > >
            > > > Coming from a strongly ordered architecture to a
            weakly ordered one
            > > > myself, I also needed some mental adjustment about
            store (multi-copy)
            > > > atomicity.  I can imagine others will be unaware
            of this difference,
            > > > too, even in 2014.
            > >
            > > Sorry I'm missing the connection between fences and
            multi-copy
            > atomicity.
            >
            > One example is the classic IRIW.  With non-multi-copy-atomic
            stores, but ordered (say through a dependency) loads in the
            following example:
            >
            > Memory: foo = bar = 0
            > _T1_          _T2_          _T3_                              _T4_
            > st (foo),1    st (bar),1    ld r1, (bar)                      ld r3, (foo)
            >                             <addr dep / local "fence" here>   <addr dep>
            >                             ld r2, (foo)                      ld r4, (bar)
            >
            > You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on
            non-multi-copy atomic
            > machines.  On TSO boxes, this is not possible.  That
            means that the
            > memory fence that will prevent such a behaviour (DMB
            on ARM) needs to
            > carry some additional oomph in ensuring multi-copy
            atomicity, or rather
            > prevent you from seeing it (which is the same thing).

            I take it as given that any code for which you may have
            ordering constraints must first have basic atomicity
            properties for loads and stores.  I would not expect any
            kind of fence to add multi-copy-atomicity where there was
            none.

            David

            > Stephan
            >
            > _______________________________________________
            > Concurrency-interest mailing list
            > concurrency-inter...@cs.oswego.edu
            <mailto:concurrency-inter...@cs.oswego.edu>
            > http://cs.oswego.edu/mailman/listinfo/concurrency-interest

            _______________________________________________
            Concurrency-interest mailing list
            concurrency-inter...@cs.oswego.edu
            <mailto:concurrency-inter...@cs.oswego.edu>
            http://cs.oswego.edu/mailman/listinfo/concurrency-interest




        _______________________________________________
        Concurrency-interest mailing list
        concurrency-inter...@cs.oswego.edu
        http://cs.oswego.edu/mailman/listinfo/concurrency-interest


