Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-31 Thread Hans Boehm
If we look at using purely store fences and purely load fences in the
initialized-flag example, as in this discussion, I think it's worth
distinguishing two possible scenarios:

1) We guarantee some form of dependency-based ordering, as most real
computer architectures do.  This probably invalidates the example from my
committee paper that's under discussion here.  The problem is, as always,
that we don't know how to make this precise at the programming language
level.  It's the compiler's job to break certain dependencies, like the
dependency of the store to x on the load of y in x = 0 * y.  Many people
are thinking about this problem, both to deal with out-of-thin-air issues
correctly in various memory models, and to design a version of C++'s
memory_order_consume that's more usable.  If we had a way to guarantee some
well-defined notion of dependency-based ordering, then at least some of the
examples here would need to be revisited.

2) We don't guarantee that dependencies imply any sort of ordering.  Then I
think the weird example under discussion here stands.  There is officially
nothing to prevent the load of x.a in thread 1 from being reordered with
the store to x_init.

But there may actually be better examples as to why the store-store
ordering in the initializing thread is not always enough.  Consider:

Thread 1:
x.a = 1;
if (x.a != 1) world_is_broken = true;
StoreStore fence;
x_init = true;
...
if (world_is_broken) die();

Thread 2:
if (x_init) {
full fence;
x.a++;
}

I think there is nothing to prevent the read of x.a in Thread 1 from seeing
the incremented value, at least if (1) the compiler promotes
world_is_broken to a register, and  (2) at the assembly level the store to
x_init is not dependent on the load of x.a.  (1) seems quite plausible, and
(2) seems very reasonable if the architecture has a conditional move
instruction or the like.  (For Itanium, (2) holds even for the naive
compilation.)
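
A hedged, Java-flavored sketch of the example above, using the VarHandle fence statics; the field names (x.a, xInit, worldIsBroken) follow the pseudocode, and xInit is made volatile purely so a test run terminates predictably, which is a concession not present in the original scenario:

```java
import java.lang.invoke.VarHandle;

class InitRace {
    static class X { int a; }
    static final X x = new X();
    static volatile boolean xInit = false;  // volatile only so the sketch terminates
    static boolean worldIsBroken = false;

    static int run() {
        Thread t1 = new Thread(() -> {
            x.a = 1;
            if (x.a != 1) worldIsBroken = true;  // the read Hans worries about
            VarHandle.storeStoreFence();         // StoreStore only, not a release
            xInit = true;
        });
        Thread t2 = new Thread(() -> {
            if (xInit) {
                VarHandle.fullFence();
                x.a++;
            }
        });
        t1.start(); t2.start();
        try { t1.join(); t2.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return x.a;  // 1, or 2 if t2 observed xInit and incremented
    }
}
```

On a real JVM the in-thread read of x.a returns 1, so worldIsBroken stays false here; Hans's point is that nothing in the StoreStore-only specification forbids the pathological interleaving.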

This is not a particularly likely scenario, but I have no idea how one would
concoct programming rules that would guarantee to prevent this kind of
weirdness.  The first two statements of Thread 1 might appear inside an
initialization routine of a library that knows nothing about concurrency.

Hans


On Wed, Dec 17, 2014 at 10:54 AM, Martin Buchholz marti...@google.com
wrote:

 On Wed, Dec 17, 2014 at 1:28 AM, Peter Levart peter.lev...@gmail.com
 wrote:
  On 12/17/2014 03:28 AM, David Holmes wrote:
 
  On 17/12/2014 10:06 AM, Martin Buchholz wrote:
  Hans allows for the nonsensical, in my view, possibility that the load
 of
  x.a can happen after the x_init=true store and yet somehow be subject
 to the
  ++ and the ensuing store that has to come before the x_init = true.
 
  Perhaps, he is speaking about why it is dangerous to replace BOTH release
  with just store-store AND acquire with just load-load?

 I'm pretty sure he's talking about weakening EITHER.

 Clearly, and unsurprisingly, it is unsafe to replace the
 load_acquire with a version that restricts only load ordering in this
 case. That would allow the store to x in thread 2 to become visible
 before the initialization of x by thread 1 is complete, possibly
 losing the update, or corrupting the state of x during initialization.

 More interestingly, it is also generally unsafe to restrict the
 release ordering constraint in thread 1 to only stores.

 (What's clear and unsurprising to Hans may not be to the rest of us)



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-17 Thread Peter Levart

On 12/17/2014 03:28 AM, David Holmes wrote:

On 17/12/2014 10:06 AM, Martin Buchholz wrote:

On Thu, Dec 11, 2014 at 1:08 AM, Andrew Haley a...@redhat.com wrote:

On 11/12/14 00:53, David Holmes wrote:

There are many good uses of storestore in the various lock-free
algorithms inside hotspot.


There may be many uses, but I am extremely suspicious of how good
they are.  I wonder if we went through all the uses of storestore in
hotspot how many bugs we'd find.  As far as I can see (in the absence
of other barriers) the only things you can write before a storestore
are constants.


Hans has provided us with the canonical writeup opposing store-store
and load-load barriers, here:
http://www.hboehm.info/c++mm/no_write_fences.html
Few programmers will be able to deal confidently with causality
defying time paradoxes, especially loads from the future.


Well I take that with a grain of salt - Hans dismisses ordering based 
on dependencies which puts us into the realm of those causality 
defying time paradoxes in my opinion. Given:


x.a = 0;
x.a++
storestore
x_init = true

Hans allows for the nonsensical, in my view, possibility that the load 
of x.a can happen after the x_init=true store and yet somehow be 
subject to the ++ and the ensuing store that has to come before the 
x_init = true.


David
-



Perhaps, he is speaking about why it is dangerous to replace BOTH 
release with just store-store AND acquire with just load-load?


The example would then become:

T1:

store x.a <- 0
load r <- x.a
store x.a <- r+1
; store-store
store x_init <- true

T2:

load r <- x.a
; load-load
if (r)
store x.a <- 42


Suppose a store on some hypothetical architecture is actually a 
two-phase execution: prepare-store, commit-store


With prepare-store imagined as speculative posting of the store to the write 
buffer and commit-store just marking it in the write buffer as committed, 
so that it is written to main memory on write-buffer flush. Non-committed 
stores are not written to main memory, but are allowed to be visible to 
loads in some threads (executing on the same core?) which are not ordered by 
load-store before the speculative prepare-store. A load-load does not 
prevent T2 from being executed as follows:


T2:

prepare-store x.a <- 42
load r <- x.a
; load-load
if (r)
commit-store x.a <- 42

Now this speculative prepare-store can (in real time) happen long before 
T1 instructions are executed. The loads in T1 are then allowed to see 
this speculative prepare-store from T2, because just store-store does 
not logically order them in any way - only load-store would.



Does this make sense?

Regards, Peter



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-17 Thread Peter Levart

On 12/17/2014 10:28 AM, Peter Levart wrote:

The example would then become:

T1:

store x.a <- 0
load r <- x.a
store x.a <- r+1
; store-store
store x_init <- true

T2:

load r <- x.a
; load-load
if (r)
store x.a <- 42


Sorry, the above has an error. I meant:

T2:

load r <- x_init
; load-load
if (r)
store x.a <- 42


Peter


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-16 Thread Martin Buchholz
On Thu, Dec 11, 2014 at 1:08 AM, Andrew Haley a...@redhat.com wrote:
 On 11/12/14 00:53, David Holmes wrote:
 There are many good uses of storestore in the various lock-free
 algorithms inside hotspot.

 There may be many uses, but I am extremely suspicious of how good
 they are.  I wonder if we went through all the uses of storestore in
 hotspot how many bugs we'd find.  As far as I can see (in the absence
 of other barriers) the only things you can write before a storestore
 are constants.

Hans has provided us with the canonical writeup opposing store-store
and load-load barriers, here:
http://www.hboehm.info/c++mm/no_write_fences.html
Few programmers will be able to deal confidently with causality
defying time paradoxes, especially loads from the future.


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-16 Thread David Holmes

On 17/12/2014 10:06 AM, Martin Buchholz wrote:

On Thu, Dec 11, 2014 at 1:08 AM, Andrew Haley a...@redhat.com wrote:

On 11/12/14 00:53, David Holmes wrote:

There are many good uses of storestore in the various lock-free
algorithms inside hotspot.


There may be many uses, but I am extremely suspicious of how good
they are.  I wonder if we went through all the uses of storestore in
hotspot how many bugs we'd find.  As far as I can see (in the absence
of other barriers) the only things you can write before a storestore
are constants.


Hans has provided us with the canonical writeup opposing store-store
and load-load barriers, here:
http://www.hboehm.info/c++mm/no_write_fences.html
Few programmers will be able to deal confidently with causality
defying time paradoxes, especially loads from the future.


Well I take that with a grain of salt - Hans dismisses ordering based on 
dependencies which puts us into the realm of those causality defying 
time paradoxes in my opinion. Given:


x.a = 0;
x.a++
storestore
x_init = true

Hans allows for the nonsensical, in my view, possibility that the load 
of x.a can happen after the x_init=true store and yet somehow be subject 
to the ++ and the ensuing store that has to come before the x_init = true.


David
-



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-11 Thread Andrew Haley
On 11/12/14 00:53, David Holmes wrote:
 On 11/12/2014 7:02 AM, Andrew Haley wrote:
 On 12/05/2014 09:49 PM, Martin Buchholz wrote:
 The actual implementations of storestore (see below) seem to
 universally give you the stronger ::release barrier, and it seems
 likely that hotspot engineers are implicitly relying on that, that
 some uses of ::storestore in the hotspot sources are bugs (should be
 ::release instead) and that there is very little potential performance
 benefit from using ::storestore instead of ::release, precisely
 because the additional loadstore barrier is very close to free on all
 current hardware.

 That's not really true for ARM, where the additional loadstore requires
 a full barrier.  There is a good use for a storestore, which is when
 zeroing a newly-created object.
 
 There are many good uses of storestore in the various lock-free 
 algorithms inside hotspot.

There may be many uses, but I am extremely suspicious of how good
they are.  I wonder if we went through all the uses of storestore in
hotspot how many bugs we'd find.  As far as I can see (in the absence
of other barriers) the only things you can write before a storestore
are constants.

Andrew.
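
A sketch of the one pattern Andrew concedes is safe: every store before the StoreStore fence writes a constant, so no load feeds the stored values and a missing LoadStore constraint has nothing to reorder. The class and field names are illustrative, not from hotspot:

```java
import java.lang.invoke.VarHandle;

class ConstantPublish {
    static final int[] slot = new int[4];
    static boolean published;  // plain field; the fence supplies the ordering

    static int[] publish() {
        // Only constants are stored before the fence (cf. zeroing a new object).
        for (int i = 0; i < slot.length; i++) slot[i] = 0;
        VarHandle.storeStoreFence();  // order the zeroing before the flag store
        published = true;
        return slot;
    }
}
```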


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-10 Thread Martin Buchholz
Today I learned
that "release fence" and "acquire fence" are technical terms defined
in the C11/C++11 standards.

So my latest version reads instead:

 * Ensures that loads and stores before the fence will not be reordered with
 * stores after the fence; a StoreStore plus LoadStore barrier.
 *
 * Corresponds to C11 atomic_thread_fence(memory_order_release)
 * (a release fence).
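
An illustrative use of the fence pair that comment documents, as a hedged sketch: a release fence before the flag store, an acquire fence after the flag load. Run in one thread the result is deterministic; the fences matter when writer() and reader() race on different threads.

```java
import java.lang.invoke.VarHandle;

class ReleaseAcquire {
    static int payload;
    static boolean ready;

    static void writer() {
        payload = 42;
        VarHandle.releaseFence();  // StoreStore + LoadStore: payload before ready
        ready = true;
    }

    static int reader() {
        if (!ready) return -1;
        VarHandle.acquireFence();  // LoadLoad + LoadStore: ready before payload
        return payload;
    }
}
```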


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-10 Thread Andrew Haley
On 12/05/2014 09:49 PM, Martin Buchholz wrote:
 The actual implementations of storestore (see below) seem to
 universally give you the stronger ::release barrier, and it seems
 likely that hotspot engineers are implicitly relying on that, that
 some uses of ::storestore in the hotspot sources are bugs (should be
 ::release instead) and that there is very little potential performance
 benefit from using ::storestore instead of ::release, precisely
 because the additional loadstore barrier is very close to free on all
 current hardware.

That's not really true for ARM, where the additional loadstore requires
a full barrier.  There is a good use for a storestore, which is when
zeroing a newly-created object.

Andrew.



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-10 Thread David Holmes

On 11/12/2014 3:31 AM, Martin Buchholz wrote:

Today I learned
that "release fence" and "acquire fence" are technical terms defined
in the C11/C++11 standards.

So my latest version reads instead:

  * Ensures that loads and stores before the fence will not be reordered 
with
  * stores after the fence; a StoreStore plus LoadStore barrier.
  *
  * Corresponds to C11 atomic_thread_fence(memory_order_release)
  * (a release fence).


Thank you Martin - I find the updated wording much more appropriate.

For the email record, as I have written in the bug report, I think the 
correction of the semantics for storeFence has resulted in 
problematic naming, where storeFence and loadFence have opposite 
directionality constraints but the names suggest the same directionality 
constraints. Had the original API suggested these names with the revised 
semantics I would have argued against them. This area is confusing 
enough without adding to that confusion with names that don't suggest 
the action.


David



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-10 Thread David Holmes

On 11/12/2014 7:02 AM, Andrew Haley wrote:

On 12/05/2014 09:49 PM, Martin Buchholz wrote:

The actual implementations of storestore (see below) seem to
universally give you the stronger ::release barrier, and it seems
likely that hotspot engineers are implicitly relying on that, that
some uses of ::storestore in the hotspot sources are bugs (should be
::release instead) and that there is very little potential performance
benefit from using ::storestore instead of ::release, precisely
because the additional loadstore barrier is very close to free on all
current hardware.


That's not really true for ARM, where the additional loadstore requires
a full barrier.  There is a good use for a storestore, which is when
zeroing a newly-created object.


There are many good uses of storestore in the various lock-free 
algorithms inside hotspot.


David


Andrew.



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-10 Thread Martin Buchholz
On Wed, Dec 10, 2014 at 4:52 PM, David Holmes david.hol...@oracle.com wrote:

 For the email record, as I have written in the bug report, I think the
 correction of the semantics for storeFence have resulted in problematic
 naming where storeFence and loadFence have opposite directionality
 constraints but the names suggest the same directionality constraints. Had
 the original API suggested these names with the revised semantics I would
 have argued against them. This area is confusing enough without adding to
 that confusion with names that don't suggest the action.

I also dislike the names of the atomic methods in Unsafe and would
like to align them as much as possible with C/C++ 11 atomics
nomenclature.


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-10 Thread Martin Buchholz
On Wed, Dec 10, 2014 at 1:02 PM, Andrew Haley a...@redhat.com wrote:
 On 12/05/2014 09:49 PM, Martin Buchholz wrote:
 The actual implementations of storestore (see below) seem to
 universally give you the stronger ::release barrier, and it seems
 likely that hotspot engineers are implicitly relying on that, that
 some uses of ::storestore in the hotspot sources are bugs (should be
 ::release instead) and that there is very little potential performance
 benefit from using ::storestore instead of ::release, precisely
 because the additional loadstore barrier is very close to free on all
 current hardware.

 That's not really true for ARM, where the additional loadstore requires
 a full barrier.  There is a good use for a storestore, which is when
 zeroing a newly-created object.

Andrew and David,

I agree that given ARM's decision to provide the choice of StoreStore
and full fences, hotspot is right to use them in a few carefully
chosen places, like object initializers.

But ARM's decision seems poor to me. No mainstream language (like
C/C++ or Java) is likely to support those in any way accessible to
user programs (not even via Java Unsafe).  Making their StoreStore
barrier also do LoadStore would dramatically increase the
applicability of their instruction with low cost.  Maybe it's not too
late for ARM to do so.


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-09 Thread Martin Buchholz
On Mon, Dec 8, 2014 at 8:51 PM, David Holmes david.hol...@oracle.com wrote:

 Then please point me to the common industry definition of it because I
 couldn't find anything definitive. And as you state yourself above one
 definition of it - the corresponding C11 fence - does not in fact have the
 same semantics!

I changed to the terminology acquire fence and release fence as
popularized by preshing
http://preshing.com/20130922/acquire-and-release-fences/

 * Ensures that loads and stores before the fence will not be reordered with
 * stores after the fence; also referred to as a StoreStore plus LoadStore
 * barrier, or as a release fence.


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-09 Thread Martin Buchholz
On Mon, Dec 8, 2014 at 8:35 PM, David Holmes david.hol...@oracle.com wrote:

 So (as you say) with TSO you don't have a total order of stores if you
 read your own writes out of your own CPU's write buffer.  However, my
 interpretation of multiple-copy atomic is that the initial
 publishing thread can choose to use an instruction with sufficiently
 strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
 memory so that the write buffer is flushed and then use plain relaxed
 loads everywhere else to read those memory locations and this explains
 the situation on x86 and sparc where volatile writes are expensive and
 volatile reads are free and you get sequential consistency for Java
 volatiles.


 We don't use lock'd instructions for volatile stores on x86, but the
 trailing mfence achieves the flushing.

 However this still raised some questions for me. Using a mfence on x86 or
 equivalent on sparc, is no different to issuing a DMB SYNC on ARM, or a
 SYNC on PowerPC. They each ensure TSO for volatile stores with global
 visibility. So when such fences are used the resulting system should be
 multiple-copy atomic - no? (No!**) And there seems to be an equivalence
 between being multiple-copy atomic and providing the IRIW property. Yet we
 know that on ARM/Power, as per the paper, TSO with global visibility is not

ARM/Power don't have TSO.

 sufficient to achieve IRIW. So what is it that x86 and sparc have in
 addition to TSO that provides for IRIW?

We have both been learning to think in new ways.
I found the second section of Peter Sewell's tutorial,
"2. From Sequential Consistency to Relaxed Memory Models",
to be most useful, especially the diagrams.

 I pondered this for quite a while before realizing that the mfence on x86
 (or equivalent on sparc) is not in fact playing the same role as the
 DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we can
 ignore the store buffers) is that stores become globally visible - if any
 other thread sees a store then all other threads see the same store. Whereas
 on ARM/PPC you can imagine a store casually making its way through the
 system, gradually becoming visible to more and more threads - unless there
 is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW
 placing the DMB/SYNC after the store does not suffice because prior to the
 DMB/SYNC the store may be visible to an arbitrary subset of threads.
 Consequently IRIW requires the DMB/SYNC between the loads - to ensure that
 each thread on their second load, must see the value that the other thread
 saw on its first load (ref Section 6.1 of the paper).

 ** So using DMB/SYNC does not achieve multiple-copy atomicity, because until
 the DMB/SYNC happens different threads can have different views of memory.

To me, the most desirable property of x86-style TSO is that barriers
are only necessary on stores to achieve sequential consistency - the
publisher gets to decide.  Volatile reads can then be close to free.

 All of which reinforces to me that IRIW is an undesirable property to have
 to implement. YMMV. (And I also need to re-examine the PPC64 implementation
 to see exactly where they add/remove barriers when IRIW is enabled.)

I believe you get a full sync between volatile reads.

#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
  oop p = JNIHandles::resolve(obj); \
  if (support_IRIW_for_not_multiple_copy_atomic_cpu) { \
OrderAccess::fence(); \
  } \


 Cheers,
 David

 http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-09 Thread David Holmes

On 10/12/2014 4:15 AM, Martin Buchholz wrote:

On Mon, Dec 8, 2014 at 8:35 PM, David Holmes david.hol...@oracle.com wrote:


So (as you say) with TSO you don't have a total order of stores if you
read your own writes out of your own CPU's write buffer.  However, my
interpretation of multiple-copy atomic is that the initial
publishing thread can choose to use an instruction with sufficiently
strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
memory so that the write buffer is flushed and then use plain relaxed
loads everywhere else to read those memory locations and this explains
the situation on x86 and sparc where volatile writes are expensive and
volatile reads are free and you get sequential consistency for Java
volatiles.



We don't use lock'd instructions for volatile stores on x86, but the
trailing mfence achieves the flushing.

However this still raised some questions for me. Using a mfence on x86 or
equivalent on sparc, is no different to issuing a DMB SYNC on ARM, or a
SYNC on PowerPC. They each ensure TSO for volatile stores with global
visibility. So when such fences are used the resulting system should be
multiple-copy atomic - no? (No!**) And there seems to be an equivalence
between being multiple-copy atomic and providing the IRIW property. Yet we
know that on ARM/Power, as per the paper, TSO with global visibility is not


ARM/Power don't have TSO.


Yes we all know that. Please re-read what I wrote.


sufficient to achieve IRIW. So what is it that x86 and sparc have in
addition to TSO that provides for IRIW?


We have both been learning to think in new ways.
I found the second section of Peter Sewell's tutorial,
"2. From Sequential Consistency to Relaxed Memory Models",
to be most useful, especially the diagrams.


I pondered this for quite a while before realizing that the mfence on x86
(or equivalent on sparc) is not in fact playing the same role as the
DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we can
ignore the store buffers) is that stores become globally visible - if any
other thread sees a store then all other threads see the same store. Whereas
on ARM/PPC you can imagine a store casually making its way through the
system, gradually becoming visible to more and more threads - unless there
is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW
placing the DMB/SYNC after the store does not suffice because prior to the
DMB/SYNC the store may be visible to an arbitrary subset of threads.
Consequently IRIW requires the DMB/SYNC between the loads - to ensure that
each thread on their second load, must see the value that the other thread
saw on its first load (ref Section 6.1 of the paper).

** So using DMB/SYNC does not achieve multiple-copy atomicity, because until
the DMB/SYNC happens different threads can have different views of memory.


To me, the most desirable property of x86-style TSO is that barriers
are only necessary on stores to achieve sequential consistency - the
publisher gets to decide.  Volatile reads can then be close to free.


TSO doesn't need store barriers for sequential consistency.

It is somewhat amusing I think that the free-ness of volatile reads on 
TSO comes from the fact that all writes cause global memory 
synchronization. But because we can't turn that off we can't actually 
measure the cost we pay for those synchronizing writes. In contrast on 
non-TSO we have to explicitly cause synchronizing writes and so 
potentially require synchronizing reads - and then complain because the 
hidden costs are no longer hidden :)



All of which reinforces to me that IRIW is an undesirable property to have
to implement. YMMV. (And I also need to re-examine the PPC64 implementation
to see exactly where they add/remove barriers when IRIW is enabled.)


I believe you get a full sync between volatile reads.

#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
   oop p = JNIHandles::resolve(obj); \
   if (support_IRIW_for_not_multiple_copy_atomic_cpu) { \
 OrderAccess::fence(); \
   } \


Yes, it was more the remove part whose details I was unsure of - I 
think they simply remove the trailing fence (i.e. the PPC SYNC) from the 
volatile writes.


Thanks,
David




Cheers,
David


http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-08 Thread Martin Buchholz
On Sun, Dec 7, 2014 at 2:58 PM, David Holmes david.hol...@oracle.com wrote:

 I believe the comment _does_ reflect hotspot's current implementation
 (entirely from exploring the sources).
 I believe it's correct to say all of the platforms are
 multiple-copy-atomic except PPC.

... current hotspot sources don't contain ARM support.

 Here is the definition of multi-copy atomicity from the ARM architecture
 manual:

 In a multiprocessing system, writes to a memory location are multi-copy
 atomic if the following conditions are both true:
 • All writes to the same location are serialized, meaning they are observed
 in the same order by all observers, although some observers might not
 observe all of the writes.
 • A read of a location does not return the value of a write until all
 observers observe that write.

The hotspot sources give


// To assure the IRIW property on processors that are not multiple copy
// atomic, sync instructions must be issued between volatile reads to
// assure their ordering, instead of after volatile stores.
// (See A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
// by Luc Maranget, Susmit Sarkar and Peter Sewell, INRIA/Cambridge)
#ifdef CPU_NOT_MULTIPLE_COPY_ATOMIC
const bool support_IRIW_for_not_multiple_copy_atomic_cpu = true;


and the referenced paper gives


on POWER and ARM, two threads can observe writes to different
locations in different orders, even in
the absence of any thread-local reordering. In other words, the
architectures are not multiple-copy atomic [Col92].


which strongly suggests that x86 and sparc are OK.

 The first condition is met by Total-Store-Order (TSO) systems like x86 and
 sparc; and not by relaxed-memory-order (RMO) systems like ARM and PPC.
 However the second condition is not met simply by having TSO. If the local
 processor can see a write from the local store buffer prior to it being
 visible to other processors, then we do not have multi-copy atomicity and I
 believe that is true for x86 and sparc. Hence none of our supported
 platforms are multi-copy-atomic as far as I can see.

 I believe hotspot must implement IRIW correctly to fulfil the promise
 of sequential consistency for standard Java, so on ppc volatile reads
 get a full fence, which leads us back to the ppc pointer chasing
 performance problem that started all of this.


 Note that nothing in the JSR-133 cookbook allows for IRIW, even on x86 and
 sparc. The key feature needed for IRIW is a load barrier that forces global
 memory synchronization to ensure that all processors see writes at the same
 time. I'm not even sure we can force that on x86 and sparc! Such a load
 barrier negates the need for some store barriers as defined in the cookbook.

 My understanding, which could be wrong, is that the JMM implies
 linearizability of volatile accesses, which in turn provides the IRIW
 property. It is also my understanding that linearizability is a necessary
 property for current proof systems to be applicable. However absence of
 proof is not proof of absence, and it doesn't follow that code that doesn't
 rely on IRIW is incorrect if IRIW is not ensured on a system. As has been
 stated many times now, in the literature no practical lock-free algorithm
 seems to rely on IRIW. So I still hope that IRIW can somehow be removed
 because implementing it will impact everything related to the JMM in
 hotspot.
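
The IRIW (Independent Reads of Independent Writes) litmus test debated above can be sketched as follows; this is an illustrative Java version, not hotspot code. With Java volatiles the JMM forbids the outcome r1==1, r2==0, r3==1, r4==0, i.e. the two reader threads may not disagree on the order of the two independent writes:

```java
class Iriw {
    static volatile int x, y;
    static int r1, r2, r3, r4;

    // Runs the 4-thread litmus test repeatedly; returns true if the
    // JMM-forbidden outcome (readers disagree on write order) is ever seen.
    static boolean forbiddenOutcomeSeen(int iterations) {
        for (int i = 0; i < iterations; i++) {
            x = 0; y = 0;
            Thread w1 = new Thread(() -> x = 1);          // writer 1
            Thread w2 = new Thread(() -> y = 1);          // writer 2
            Thread p1 = new Thread(() -> { r1 = x; r2 = y; });  // reader 1
            Thread p2 = new Thread(() -> { r3 = y; r4 = x; });  // reader 2
            Thread[] ts = { w1, w2, p1, p2 };
            for (Thread t : ts) t.start();
            try { for (Thread t : ts) t.join(); }
            catch (InterruptedException e) { throw new RuntimeException(e); }
            if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) return true;
        }
        return false;
    }
}
```

On PPC, guaranteeing this for volatiles is exactly what forces the sync between volatile reads discussed above.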


RE: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-08 Thread David Holmes
Martin,

The paper you cite is about ARM and Power architectures - why do you think the 
lack of mention of x86/sparc implies those architectures are 
multiple-copy-atomic?

David

 -Original Message-
 From: Martin Buchholz [mailto:marti...@google.com]
 Sent: Monday, 8 December 2014 6:42 PM
 To: David Holmes
 Cc: David Holmes; Vladimir Kozlov; core-libs-dev; concurrency-interest
 Subject: Re: [concurrency-interest] RFR: 8065804: JEP 171:
 Clarifications/corrections for fence intrinsics
 
 
 On Sun, Dec 7, 2014 at 2:58 PM, David Holmes 
 david.hol...@oracle.com wrote:
 
  I believe the comment _does_ reflect hotspot's current implementation
  (entirely from exploring the sources).
  I believe it's correct to say all of the platforms are
  multiple-copy-atomic except PPC.
 
 ... current hotspot sources don't contain ARM support.
 
  Here is the definition of multi-copy atomicity from the ARM architecture
  manual:
 
  In a multiprocessing system, writes to a memory location are multi-copy
  atomic if the following conditions are both true:
  • All writes to the same location are serialized, meaning they 
 are observed
  in the same order by all observers, although some observers might not
  observe all of the writes.
  • A read of a location does not return the value of a write until all
  observers observe that write.
 
 The hotspot sources give
 
 
 // To assure the IRIW property on processors that are not multiple copy
 // atomic, sync instructions must be issued between volatile reads to
 // assure their ordering, instead of after volatile stores.
 // (See A Tutorial Introduction to the ARM and POWER Relaxed 
 Memory Models
 // by Luc Maranget, Susmit Sarkar and Peter Sewell, INRIA/Cambridge)
 #ifdef CPU_NOT_MULTIPLE_COPY_ATOMIC
 const bool support_IRIW_for_not_multiple_copy_atomic_cpu = true;
 
 
 and the referenced paper gives
 
 
 on POWER and ARM, two threads can observe writes to different
 locations in different orders, even in
 the absence of any thread-local reordering. In other words, the
 architectures are not multiple-copy atomic [Col92].
 
 
 which strongly suggests that x86 and sparc are OK.
 
  The first condition is met by Total-Store-Order (TSO) systems 
 like x86 and
  sparc; and not by relaxed-memory-order (RMO) systems like ARM and PPC.
  However the second condition is not met simply by having TSO. 
 If the local
  processor can see a write from the local store buffer prior to it being
  visible to other processors, then we do not have multi-copy 
 atomicity and I
  believe that is true for x86 and sparc. Hence none of our supported
  platforms are multi-copy-atomic as far as I can see.
 
  I believe hotspot must implement IRIW correctly to fulfil the promise
  of sequential consistency for standard Java, so on ppc volatile reads
  get a full fence, which leads us back to the ppc pointer chasing
  performance problem that started all of this.
 
 
  Note that nothing in the JSR-133 cookbook allows for IRIW, even on x86 and
  sparc. The key feature needed for IRIW is a load barrier that forces global
  memory synchronization to ensure that all processors see writes at the same
  time. I'm not even sure we can force that on x86 and sparc! Such a load
  barrier negates the need for some store barriers as defined in the cookbook.
 
  My understanding, which could be wrong, is that the JMM implies
  linearizability of volatile accesses, which in turn provides the IRIW
  property. It is also my understanding that linearizability is a necessary
  property for current proof systems to be applicable. However, absence of
  proof is not proof of absence, and it doesn't follow that code that doesn't
  rely on IRIW is incorrect if IRIW is not ensured on a system. As has been
  stated many times now, in the literature no practical lock-free algorithm
  seems to rely on IRIW. So I still hope that IRIW can somehow be removed,
  because implementing it will impact everything related to the JMM in
  hotspot.
 



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-08 Thread Martin Buchholz
On Mon, Dec 8, 2014 at 12:46 AM, David Holmes davidchol...@aapt.net.au wrote:
 Martin,

 The paper you cite is about ARM and Power architectures - why do you think 
 the lack of mention of x86/sparc implies those architectures are 
 multiple-copy-atomic?

Reading some more in the same paper, I see:

Returning to the two properties above, in TSO a thread can see its
own writes before they become visible to other
threads (by reading them from its write buffer), but any write becomes
visible to all other threads simultaneously: TSO
is a multiple-copy atomic model, in the terminology of Collier
[Col92]. One can also see the possibility of reading
from the local write buffer as allowing a specific kind of local
reordering. A program that writes one location x then
reads another location y might execute by adding the write to x to the
thread’s buffer, then reading y from memory,
before finally making the write to x visible to other threads by
flushing it from the buffer. In this case the thread reads
the value of y that was in the memory before the new write of x hits memory.

So (as you say) with TSO you don't have a total order of stores if you
read your own writes out of your own CPU's write buffer.  However, my
interpretation of multiple-copy atomic is that the initial publishing
thread can choose to use an instruction with a sufficiently strong
memory barrier attached (e.g. LOCK;XXX on x86) to write to memory, so
that the write buffer is flushed, and then use plain relaxed loads
everywhere else to read those memory locations.  This explains the
situation on x86 and sparc, where volatile writes are expensive,
volatile reads are free, and you get sequential consistency for Java
volatiles.

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
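Martin's reading above can be sketched with the VarHandle fence methods that later replaced these Unsafe intrinsics (an illustration of the publish pattern only, not hotspot's actual code generation; the class and member names are invented for this example):

```java
import java.lang.invoke.VarHandle;

public class TsoPublish {
    static int data;          // payload written before publication
    static boolean published; // plain flag; the publisher pays the ordering cost

    // The publisher uses a strong barrier (on x86 this corresponds to a
    // LOCK-prefixed instruction or MFENCE, which drains the store buffer).
    static void publish(int v) {
        data = v;
        VarHandle.fullFence();
        published = true;
    }

    // Readers can then use plain relaxed loads: on a multiple-copy-atomic
    // machine, once `published` is visible, `data` is visible to everyone.
    static int read() {
        return published ? data : -1;
    }

    public static void main(String[] args) {
        publish(42);
        System.out.println(read());
    }
}
```

In a single-threaded run the reader trivially sees the published value; the interesting property (all observers seeing the flush simultaneously) is of course only visible across processors.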


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-08 Thread Martin Buchholz
Webrev updated to remove the comparison with volatile loads and stores.

On Sun, Dec 7, 2014 at 2:40 PM, David Holmes david.hol...@oracle.com wrote:
 On 6/12/2014 7:49 AM, Martin Buchholz wrote:

 On Thu, Dec 4, 2014 at 5:55 PM, David Holmes david.hol...@oracle.com
 wrote:

 In general phrasing like:  also known as a LoadLoad plus LoadStore
 barrier
 ... is misleading to me as these are not aliases- the loadFence (in
 this
 case) is being specified to have the same semantics as the
 loadload|storeload. It should say corresponds to a LoadLoad plus
 LoadStore
 barrier


 + * Ensures that loads before the fence will not be reordered with
 loads and
 + * stores after the fence; also known as a LoadLoad plus LoadStore
 barrier,

 I don't understand this.  I believe they _are_ aliases.  The first
 clause perfectly describes a LoadLoad plus LoadStore barrier.


 I find the language use inappropriate - you are defining the first to be the
 second.

Am I missing something?  Is there something else that LoadLoad plus
LoadStore barrier (as used in hotspot sources and elsewhere) could
possibly mean?

   - as per the "Corresponds to a C11 ..."  And referring to things

 like load-acquire fence is meaningless without some reference to a
 definition - who defines a load-acquire fence? Is there a universal
 definition? I would be okay with something looser eg:


 Well, I'm defining load-acquire fence here in the javadoc - I'm
 claiming that loadFence is also known via other terminology, including
 load-acquire fence.  Although it's true that load-acquire fence is
 also used to refer to the corresponding C11 fence, which has subtly
 different semantics.


 When you say also known as XXX it means that XXX is already defined
 elsewhere. Unless there is a generally accepted definition of XXX then this
 doesn't add much value.

Everything about this topic is confusing, but I continue to think that
load-acquire fence is a common industry term.  One of the reasons I
want to include them in the javadoc is precisely because these
different terms are generally used synonymously.


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-08 Thread David Holmes

On 9/12/2014 5:25 AM, Martin Buchholz wrote:

Webrev updated to remove the comparison with volatile loads and stores.


Thanks. More inline ...


On Sun, Dec 7, 2014 at 2:40 PM, David Holmes david.hol...@oracle.com wrote:

On 6/12/2014 7:49 AM, Martin Buchholz wrote:


On Thu, Dec 4, 2014 at 5:55 PM, David Holmes david.hol...@oracle.com
wrote:


In general phrasing like:  also known as a LoadLoad plus LoadStore
barrier
... is misleading to me as these are not aliases- the loadFence (in
this
case) is being specified to have the same semantics as the
loadload|storeload. It should say corresponds to a LoadLoad plus
LoadStore
barrier



+ * Ensures that loads before the fence will not be reordered with
loads and
+ * stores after the fence; also known as a LoadLoad plus LoadStore
barrier,

I don't understand this.  I believe they _are_ aliases.  The first
clause perfectly describes a LoadLoad plus LoadStore barrier.



I find the language use inappropriate - you are defining the first to be the
second.


Am I missing something?  Is there something else that LoadLoad plus
LoadStore barrier (as used in hotspot sources and elsewhere) could
possibly mean?


I think we are getting our wires crossed here. My issue is with the wording:

also known as a LoadLoad plus LoadStore barrier

because we are explicitly choosing to specify the operation as being a 
LoadLoad plus LoadStore barrier. Hence I would much prefer it to say 
that this corresponds to a LoadLoad plus LoadStore barrier.



   - as per the "Corresponds to a C11 ..."  And referring to things


like load-acquire fence is meaningless without some reference to a
definition - who defines a load-acquire fence? Is there a universal
definition? I would be okay with something looser eg:



Well, I'm defining load-acquire fence here in the javadoc - I'm
claiming that loadFence is also known via other terminology, including
load-acquire fence.  Although it's true that load-acquire fence is
also used to refer to the corresponding C11 fence, which has subtly
different semantics.



When you say also known as XXX it means that XXX is already defined
elsewhere. Unless there is a generally accepted definition of XXX then this
doesn't add much value.


Everything about this topic is confusing, but I continue to think that
load-acquire fence is a common industry term.


Then please point me to the common industry definition of it because I 
couldn't find anything definitive. And as you state yourself above one 
definition of it - the corresponding C11 fence - does not in fact have 
the same semantics!



One of the reasons I
want to include them in the javadoc is precisely because these
different terms are generally used synonymously.


I would concede "sometimes referred to as ..." as that doesn't imply a 
single standard agreed-upon definition, while still making the link. 
Though even then there is a danger of making false assumptions about the 
equivalence of this fence and those going by the name "load-acquire fence".


David


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-07 Thread David Holmes

On 6/12/2014 7:49 AM, Martin Buchholz wrote:

On Thu, Dec 4, 2014 at 5:55 PM, David Holmes david.hol...@oracle.com wrote:


In general phrasing like:  also known as a LoadLoad plus LoadStore barrier
... is misleading to me as these are not aliases- the loadFence (in this
case) is being specified to have the same semantics as the
loadload|storeload. It should say corresponds to a LoadLoad plus LoadStore
barrier


+ * Ensures that loads before the fence will not be reordered with loads and
+ * stores after the fence; also known as a LoadLoad plus LoadStore barrier,

I don't understand this.  I believe they _are_ aliases.  The first
clause perfectly describes a LoadLoad plus LoadStore barrier.


I find the language use inappropriate - you are defining the first to be 
the second.



  - as per the Corresponds to a C11  And referring to things

like load-acquire fence is meaningless without some reference to a
definition - who defines a load-acquire fence? Is there a universal
definition? I would be okay with something looser eg:


Well, I'm defining load-acquire fence here in the javadoc - I'm
claiming that loadFence is also known via other terminology, including
load-acquire fence.  Although it's true that load-acquire fence is
also used to refer to the corresponding C11 fence, which has subtly
different semantics.


When you say also known as XXX it means that XXX is already defined 
elsewhere. Unless there is a generally accepted definition of XXX then 
this doesn't add much value.



/**
   * Ensures that loads before the fence will not be reordered with loads and
   * stores after the fence. Corresponds to a LoadLoad plus LoadStore
barrier,
   * and also to the C11 atomic_thread_fence(memory_order_acquire).
   * Sometimes referred to as a load-acquire fence.
   *

Also I find this comment strange:

!  * A pure StoreStore fence is not provided, since the addition of
LoadStore
!  * is almost always desired, and most current hardware instructions
that
!  * provide a StoreStore barrier also provide a LoadStore barrier for
free.

because inside hotspot we use storeStore barriers a lot, without any
loadStore at the same point.


I believe the use of e.g. OrderAccess::storestore in the hotspot
sources is unfortunate.


I don't! Not at all!


The actual implementations of storestore (see below) seem to
universally give you the stronger ::release barrier,


Don't conflate hardware barriers and compiler barriers. On TSO systems 
storestore() is a no-op for the hardware but a compiler-barrier is still 
required. The compiler barrier is stronger than storestore but that's 
because we don't have fine-grained compiler barriers. As I've said 
previously there is an open bug to clean up the orderAccess definitions 
because semantically it is very misleading to define things like 
storestore() as release(). However if you look at the implementation of 
things we are not actually adding any additional semantics to the 
storestore(). E.g. for x86 (after a recent change to clean up the 
compiler barrier):


inline void OrderAccess::release() {
  compiler_barrier();
}

So storestore() is also just a compiler barrier, though I'd prefer to 
see it expressed as:


inline void OrderAccess::storestore() { compiler_barrier(); }

than the misleading (and very wrong on non-TSO):

inline void OrderAccess::storestore() { release(); }

And of course the non-TSO platforms use the barrier necessary on that 
platform.



and it seems
likely that hotspot engineers are implicitly relying on that, that
some uses of ::storestore in the hotspot sources are bugs (should be
::release instead)


That seems a rather baseless speculation on your part. Having been 
involved in a lot of discussions involving memory barrier usage in 
various algorithms in different areas of hotspot I can assure you that 
the use of storestore() has been quite deliberate and does not assume an 
implicit loadstore().


Also as I've said many many times now release() as a stand-alone 
operation is meaningless.


David
-

and that there is very little potential performance

benefit from using ::storestore instead of ::release, precisely
because the additional loadstore barrier is very close to free on all
current hardware.  Writing correct code using ::storestore is harder
than ::release, which is already difficult enough. C11 doesn't provide
a corresponding fence, which is a strong hint.


./bsd_zero/vm/orderAccess_bsd_zero.inline.hpp:71:inline void
OrderAccess::storestore() { release(); }
./linux_sparc/vm/orderAccess_linux_sparc.inline.hpp:35:inline void
OrderAccess::storestore() { release(); }
./aix_ppc/vm/orderAccess_aix_ppc.inline.hpp:73:inline void
OrderAccess::storestore() { inlasm_lwsync();  }
./linux_zero/vm/orderAccess_linux_zero.inline.hpp:70:inline void
OrderAccess::storestore() { release(); }
./solaris_sparc/vm/orderAccess_solaris_sparc.inline.hpp:40:inline void
OrderAccess::storestore() { release(); }

Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-07 Thread David Holmes

On 6/12/2014 7:29 AM, Martin Buchholz wrote:

On Thu, Dec 4, 2014 at 5:36 PM, David Holmes david.hol...@oracle.com wrote:

Martin,

On 2/12/2014 6:46 AM, Martin Buchholz wrote:



Is this finalized then? You can only make one commit per CR.


Right.  I'd like to commit and then perhaps do another round of clarifications.


I still find this entire comment block to be misguided and misplaced:

! // Fences, also known as memory barriers, or membars.
! // See hotspot sources for more details:
! // orderAccess.hpp memnode.hpp unsafe.cpp
! //
! // One way of implementing Java language-level volatile variables
using
! // fences (but there is often a better way without) is by:
! // translating a volatile store into the sequence:
! // - storeFence()
! // - relaxed store
! // - fullFence()
! // and translating a volatile load into the sequence:
! // - if (CPU_NOT_MULTIPLE_COPY_ATOMIC) fullFence()
! // - relaxed load
! // - loadFence()
! // The full fence on volatile stores ensures the memory model
guarantee of
! // sequential consistency on most platforms.  On some platforms (ppc)
we
! // need an additional full fence between volatile loads as well (see
! // hotspot's CPU_NOT_MULTIPLE_COPY_ATOMIC).


Even I think this comment is marginal - I will delete it.  But
consider this a plea for better documentation of the hotspot
internals.


Okay, but Unsafe.java is not the place to document anything about hotspot.


 why do you want this description here - it has no relevance to the API itself,
nor to how volatiles are implemented in the VM. And as I said in the bug
report CPU_NOT_MULTIPLE_COPY_ATOMIC exists only for platforms that want to
implement IRIW (none of our platforms are multiple-copy-atomic, but only PPC
sets this so that it employs IRIW).


I believe the comment _does_ reflect hotspot's current implementation
(entirely from exploring the sources).
I believe it's correct to say all of the platforms are
multiple-copy-atomic except PPC.


Here is the definition of multi-copy atomicity from the ARM architecture 
manual:


In a multiprocessing system, writes to a memory location are multi-copy 
atomic if the following conditions are both true:
• All writes to the same location are serialized, meaning they are 
observed in the same order by all observers, although some observers 
might not observe all of the writes.
• A read of a location does not return the value of a write until all 
observers observe that write.


The first condition is met by Total-Store-Order (TSO) systems like x86 
and sparc; and not by relaxed-memory-order (RMO) systems like ARM and 
PPC. However the second condition is not met simply by having TSO. If 
the local processor can see a write from the local store buffer prior to 
it being visible to other processors, then we do not have multi-copy 
atomicity and I believe that is true for x86 and sparc. Hence none of 
our supported platforms are multi-copy-atomic as far as I can see.



I believe hotspot must implement IRIW correctly to fulfil the promise
of sequential consistency for standard Java, so on ppc volatile reads
get a full fence, which leads us back to the ppc pointer chasing
performance problem that started all of this.


Note that nothing in the JSR-133 cookbook allows for IRIW, even on x86 
and sparc. The key feature needed for IRIW is a load barrier that forces 
global memory synchronization to ensure that all processors see writes 
at the same time. I'm not even sure we can force that on x86 and sparc! 
Such a load barrier negates the need for some store barriers as defined 
in the cookbook.


My understanding, which could be wrong, is that the JMM implies 
linearizability of volatile accesses, which in turn provides the IRIW 
property. It is also my understanding that linearizability is a 
necessary property for current proof systems to be applicable. However 
absence of proof is not proof of absence, and it doesn't follow that 
code that doesn't rely on IRIW is incorrect if IRIW is not ensured on a 
system. As has been stated many times now, in the literature no 
practical lock-free algorithm seems to rely on IRIW. So I still hope 
that IRIW can somehow be removed because implementing it will impact 
everything related to the JMM in hotspot.


David
-


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-05 Thread Martin Buchholz
On Thu, Dec 4, 2014 at 5:36 PM, David Holmes david.hol...@oracle.com wrote:
 Martin,

 On 2/12/2014 6:46 AM, Martin Buchholz wrote:

 Is this finalized then? You can only make one commit per CR.

Right.  I'd like to commit and then perhaps do another round of clarifications.

 I still find this entire comment block to be misguided and misplaced:

 ! // Fences, also known as memory barriers, or membars.
 ! // See hotspot sources for more details:
 ! // orderAccess.hpp memnode.hpp unsafe.cpp
 ! //
 ! // One way of implementing Java language-level volatile variables
 using
 ! // fences (but there is often a better way without) is by:
 ! // translating a volatile store into the sequence:
 ! // - storeFence()
 ! // - relaxed store
 ! // - fullFence()
 ! // and translating a volatile load into the sequence:
 ! // - if (CPU_NOT_MULTIPLE_COPY_ATOMIC) fullFence()
 ! // - relaxed load
 ! // - loadFence()
 ! // The full fence on volatile stores ensures the memory model
 guarantee of
 ! // sequential consistency on most platforms.  On some platforms (ppc)
 we
 ! // need an additional full fence between volatile loads as well (see
 ! // hotspot's CPU_NOT_MULTIPLE_COPY_ATOMIC).

Even I think this comment is marginal - I will delete it.  But
consider this a plea for better documentation of the hotspot
internals.

 why do you want this description here - it has no relevance to the API itself,
 nor to how volatiles are implemented in the VM. And as I said in the bug
 report CPU_NOT_MULTIPLE_COPY_ATOMIC exists only for platforms that want to
 implement IRIW (none of our platforms are multiple-copy-atomic, but only PPC
 sets this so that it employs IRIW).

I believe the comment _does_ reflect hotspot's current implementation
(entirely from exploring the sources).
I believe it's correct to say all of the platforms are
multiple-copy-atomic except PPC.
I believe hotspot must implement IRIW correctly to fulfil the promise
of sequential consistency for standard Java, so on ppc volatile reads
get a full fence, which leads us back to the ppc pointer chasing
performance problem that started all of this.
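The fence-based translation of volatiles quoted above can be sketched with today's VarHandle fences (a hedged illustration of the commented scheme, not what hotspot actually emits; releaseFence stands in for storeFence, acquireFence for loadFence, and the names are invented):

```java
import java.lang.invoke.VarHandle;

public class VolatileViaFences {
    static int cell; // a plain field, accessed only through the helpers below

    // Volatile store per the comment: storeFence; relaxed store; fullFence.
    static void volatileStore(int v) {
        VarHandle.releaseFence(); // storeFence: StoreStore + LoadStore
        cell = v;                 // relaxed store
        VarHandle.fullFence();    // the StoreLoad part that buys sequential consistency
    }

    // Volatile load per the comment: relaxed load; loadFence.  On a
    // non-multiple-copy-atomic CPU (hotspot's CPU_NOT_MULTIPLE_COPY_ATOMIC,
    // i.e. ppc) a fullFence would be needed before the load as well.
    static int volatileLoad() {
        int v = cell;             // relaxed load
        VarHandle.acquireFence(); // loadFence: LoadLoad + LoadStore
        return v;
    }

    public static void main(String[] args) {
        volatileStore(17);
        System.out.println(volatileLoad());
    }
}
```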


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-05 Thread Martin Buchholz
On Thu, Dec 4, 2014 at 5:55 PM, David Holmes david.hol...@oracle.com wrote:

 In general phrasing like:  also known as a LoadLoad plus LoadStore barrier
 ... is misleading to me as these are not aliases- the loadFence (in this
 case) is being specified to have the same semantics as the
 loadload|storeload. It should say corresponds to a LoadLoad plus LoadStore
 barrier

+ * Ensures that loads before the fence will not be reordered with loads and
+ * stores after the fence; also known as a LoadLoad plus LoadStore barrier,

I don't understand this.  I believe they _are_ aliases.  The first
clause perfectly describes a LoadLoad plus LoadStore barrier.

 - as per the "Corresponds to a C11 ..."  And referring to things
 like load-acquire fence is meaningless without some reference to a
 definition - who defines a load-acquire fence? Is there a universal
 definition? I would be okay with something looser eg:

Well, I'm defining load-acquire fence here in the javadoc - I'm
claiming that loadFence is also known via other terminology, including
load-acquire fence.  Although it's true that load-acquire fence is
also used to refer to the corresponding C11 fence, which has subtly
different semantics.

 /**
   * Ensures that loads before the fence will not be reordered with loads and
   * stores after the fence. Corresponds to a LoadLoad plus LoadStore
 barrier,
   * and also to the C11 atomic_thread_fence(memory_order_acquire).
   * Sometimes referred to as a load-acquire fence.
   *

 Also I find this comment strange:

 !  * A pure StoreStore fence is not provided, since the addition of
 LoadStore
 !  * is almost always desired, and most current hardware instructions
 that
 !  * provide a StoreStore barrier also provide a LoadStore barrier for
 free.

 because inside hotspot we use storeStore barriers a lot, without any
 loadStore at the same point.

I believe the use of e.g. OrderAccess::storestore in the hotspot
sources is unfortunate.

The actual implementations of storestore (see below) seem to
universally give you the stronger ::release barrier, and it seems
likely that hotspot engineers are implicitly relying on that, that
some uses of ::storestore in the hotspot sources are bugs (should be
::release instead) and that there is very little potential performance
benefit from using ::storestore instead of ::release, precisely
because the additional loadstore barrier is very close to free on all
current hardware.  Writing correct code using ::storestore is harder
than ::release, which is already difficult enough. C11 doesn't provide
a corresponding fence, which is a strong hint.


./bsd_zero/vm/orderAccess_bsd_zero.inline.hpp:71:inline void
OrderAccess::storestore() { release(); }
./linux_sparc/vm/orderAccess_linux_sparc.inline.hpp:35:inline void
OrderAccess::storestore() { release(); }
./aix_ppc/vm/orderAccess_aix_ppc.inline.hpp:73:inline void
OrderAccess::storestore() { inlasm_lwsync();  }
./linux_zero/vm/orderAccess_linux_zero.inline.hpp:70:inline void
OrderAccess::storestore() { release(); }
./solaris_sparc/vm/orderAccess_solaris_sparc.inline.hpp:40:inline void
OrderAccess::storestore() { release(); }
./linux_ppc/vm/orderAccess_linux_ppc.inline.hpp:75:inline void
OrderAccess::storestore() { inlasm_lwsync();  }
./solaris_x86/vm/orderAccess_solaris_x86.inline.hpp:40:inline void
OrderAccess::storestore() { release(); }
./linux_x86/vm/orderAccess_linux_x86.inline.hpp:35:inline void
OrderAccess::storestore() { release(); }
./bsd_x86/vm/orderAccess_bsd_x86.inline.hpp:35:inline void
OrderAccess::storestore() { release(); }
./windows_x86/vm/orderAccess_windows_x86.inline.hpp:35:inline void
OrderAccess::storestore() { release(); }
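For contrast with David's position, the canonical initialize-then-publish idiom, where a pure StoreStore barrier suffices, can be sketched as follows (illustrative only; VarHandle.storeStoreFence plays the role of OrderAccess::storestore, and the names are invented):

```java
import java.lang.invoke.VarHandle;

public class StoreStorePublish {
    static int[] shared; // readers poll this reference for non-null

    // Only store->store ordering is required here: the stores initializing
    // the new object must not be reordered after the publishing store.
    static void publish(int v) {
        int[] p = { v };             // initializing store
        VarHandle.storeStoreFence(); // keep initialization before publication
        shared = p;                  // publishing store
    }

    public static void main(String[] args) {
        publish(5);
        System.out.println(shared[0]);
    }
}
```

Whether the extra LoadStore of a full release fence would ever cost anything here is exactly the point under dispute in this thread.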


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-04 Thread David Holmes

Martin,

On 2/12/2014 6:46 AM, Martin Buchholz wrote:

David, Paul (i.e. Reviewers) and Doug,

I'd like to commit corrections so we make progress.


Is this finalized then? You can only make one commit per CR.


I think the current webrev is simple progress with the exception of my
attempt to translate volatiles into fences, which is marginal (but was
a good learning exercise for me).


I still find this entire comment block to be misguided and misplaced:

! // Fences, also known as memory barriers, or membars.
! // See hotspot sources for more details:
! // orderAccess.hpp memnode.hpp unsafe.cpp
! //
! // One way of implementing Java language-level volatile variables 
using

! // fences (but there is often a better way without) is by:
! // translating a volatile store into the sequence:
! // - storeFence()
! // - relaxed store
! // - fullFence()
! // and translating a volatile load into the sequence:
! // - if (CPU_NOT_MULTIPLE_COPY_ATOMIC) fullFence()
! // - relaxed load
! // - loadFence()
! // The full fence on volatile stores ensures the memory model 
guarantee of
! // sequential consistency on most platforms.  On some platforms 
(ppc) we

! // need an additional full fence between volatile loads as well (see
! // hotspot's CPU_NOT_MULTIPLE_COPY_ATOMIC).

why do you want this description here - it has no relevance to the API 
itself, nor to how volatiles are implemented in the VM. And as I said in 
the bug report CPU_NOT_MULTIPLE_COPY_ATOMIC exists only for platforms 
that want to implement IRIW (none of our platforms are 
multiple-copy-atomic, but only PPC sets this so that it employs IRIW).


David


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-04 Thread David Holmes

On 2/12/2014 6:46 AM, Martin Buchholz wrote:

David, Paul (i.e. Reviewers) and Doug,

I'd like to commit corrections so we make progress.

I think the current webrev is simple progress with the exception of my
attempt to translate volatiles into fences, which is marginal (but was
a good learning exercise for me).


Looking at the actual API changes ...

In general, phrasing like "also known as a LoadLoad plus LoadStore 
barrier ..." is misleading to me, as these are not aliases - the 
loadFence (in this case) is being specified to have the same semantics 
as the loadload|storeload. It should say "corresponds to a LoadLoad plus 
LoadStore barrier" - as per the "Corresponds to a C11 ..."  And 
referring to things like "load-acquire fence" is meaningless without 
some reference to a definition - who defines a "load-acquire fence"? Is 
there a universal definition? I would be okay with something looser eg:


/**
  * Ensures that loads before the fence will not be reordered with 
loads and
  * stores after the fence. Corresponds to a LoadLoad plus LoadStore 
barrier,

  * and also to the C11 atomic_thread_fence(memory_order_acquire).
  * Sometimes referred to as a load-acquire fence.
  *

Also I find this comment strange:

!  * A pure StoreStore fence is not provided, since the addition of 
LoadStore
!  * is almost always desired, and most current hardware 
instructions that
!  * provide a StoreStore barrier also provide a LoadStore barrier 
for free.


because inside hotspot we use storeStore barriers a lot, without any 
loadStore at the same point.


David


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-02 Thread Paul Sandoz

On Dec 2, 2014, at 1:58 AM, Doug Lea d...@cs.oswego.edu wrote:

 On 12/01/2014 03:46 PM, Martin Buchholz wrote:
 David, Paul (i.e. Reviewers) and Doug,
 
 I'd like to commit corrections so we make progress.
 
 The current one looks OK to me.
 (http://cr.openjdk.java.net/~martin/webrevs/openjdk9/fence-intrinsics/)
 

Same here, looks ok.

I anticipate we will be revisiting this area with the enhanced volatiles [1] 
work and related JMM updates, where there will be a public API for low-level 
enhanced field/array access [2].

As you rightly observed Unsafe does not currently have a get/read-acquire 
method. Implementations of [2] currently emulate that with a relaxed read + 
Unsafe.loadFence. It's something we need to add.

Paul.

[1] http://openjdk.java.net/jeps/193
[2] 
http://hg.openjdk.java.net/valhalla/valhalla/jdk/file/2d4531473a89/src/java.base/share/classes/java/lang/invoke/VarHandle.java


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread Stephan Diestelhorst
On Tuesday, 25 November 2014 at 11:15:36, Hans Boehm wrote:
 I'm no hardware architect, but fundamentally it seems to me that
 
 load x
 acquire_fence
 
 imposes a much more stringent constraint than
 
 load_acquire x
 
 Consider the case in which the load from x is an L1 hit, but a preceding
 load (from say y) is a long-latency miss.  If we enforce ordering by just
 waiting for completion of prior operation, the former has to wait for the
 load from y to complete; while the latter doesn't.  I find it hard to
 believe that this doesn't leave an appreciable amount of performance on the
 table, at least for some interesting microarchitectures.

I agree, Hans, that this is a reasonable assumption.  Load_acquire x
does allow roach motel, whereas the acquire fence does not.
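Hans's two forms can be written out with VarHandles (a sketch only; the cost difference he describes is microarchitectural and not observable in this single-threaded harness, and the class and method names are invented):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class AcquireForms {
    static int x = 1;
    static final VarHandle X;
    static {
        try {
            X = MethodHandles.lookup()
                    .findStaticVarHandle(AcquireForms.class, "x", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Form 1: plain load followed by a standalone acquire fence.  The fence
    // orders *all* earlier loads against later accesses, so conceptually it
    // must wait out any outstanding long-latency miss (e.g. on some y).
    static int loadThenFence() {
        int v = x;
        VarHandle.acquireFence();
        return v;
    }

    // Form 2: a load-acquire on x alone.  Later accesses may still move in
    // past earlier independent loads (roach-motel reordering is permitted).
    static int loadAcquire() {
        return (int) X.getAcquire();
    }

    public static void main(String[] args) {
        System.out.println(loadThenFence() + " " + loadAcquire());
    }
}
```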

  In addition, for better or worse, fencing requirements on at least
  Power are actually driven as much by store atomicity issues, as by
  the ordering issues discussed in the cookbook.  This was not
  understood in 2005, and unfortunately doesn't seem to be amenable to
  the kind of straightforward explanation as in Doug's cookbook.

Coming from a strongly ordered architecture to a weakly ordered one
myself, I also needed some mental adjustment about store (multi-copy)
atomicity.  I can imagine others will be unaware of this difference,
too, even in 2014.

Stephan



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread DT
I see, from time to time, comments in the JVM sources referencing membars and fences.
Would you say that they are used interchangeably? Having the same meaning but
for different CPU architectures?

Sent from my iPhone

 On Nov 25, 2014, at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote:
 
 Hi Martin,
 
 Thanks for looking into this.
 
 1141  * Currently hotspot's implementation of a Java language-level 
 volatile
 1142  * store has the same effect as a storeFence followed by a relaxed 
 store,
 1143  * although that may be a little stronger than needed.
 
 IIUC to emulate hotspot's volatile store you will need to say that a fullFence 
 immediately follows the relaxed store.
 
 The bit that always confuses me about release and acquire is that ordering is 
 restricted to one direction, as talked about in orderAccess.hpp [1]. So for a 
 release, accesses prior to the release cannot move below it, but accesses 
 succeeding the release can move above it. And that seems to apply to 
 Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++ 
 release fences where ordering is restricted both to prior and succeeding 
 accesses? [3]
 
 So what about the following?
 
  a = r1; // Cannot move below the fence
  Unsafe.storeFence();
  b = r2; // Can move above the fence?
 
 Paul.
 
 [1] In orderAccess.hpp
 // Execution by a processor of release makes the effect of all memory
 // accesses issued by it previous to the release visible to all
 // processors *before* the release completes.  The effect of subsequent
 // memory accesses issued by it *may* be made visible *before* the
 // release.  I.e., subsequent memory accesses may float above the
 // release, but prior ones may not float below it.
 
 [2] In memnode.hpp
 // Release - no earlier ref can move after (but later refs can move
 // up, like a speculative pipelined cache-hitting Load).  Requires
 // multi-cpu visibility.  Inserted independent of any store, as required
 // for intrinsic sun.misc.Unsafe.storeFence().
 class StoreFenceNode: public MemBarNode {
 public:
  StoreFenceNode(Compile* C, int alias_idx, Node* precedent)
: MemBarNode(C, alias_idx, precedent) {}
  virtual int Opcode() const;
 };
 
 [3] 
 http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/
 
 On Nov 25, 2014, at 1:47 AM, Martin Buchholz marti...@google.com wrote:
 
 OK, I worked in some wording for comparison with volatiles.
 I believe you when you say that the semantics of the corresponding C++
fences are slightly different, but it's rather subtle - can we say
anything more than "closely related to"?
 
 On Mon, Nov 24, 2014 at 1:29 PM, Aleksey Shipilev
 aleksey.shipi...@oracle.com wrote:
 Hi Martin,
 
 On 11/24/2014 11:56 PM, Martin Buchholz wrote:
 Review carefully - I am trying to learn about fences by explaining them!
 I have borrowed some wording from my reviewers!
 
 https://bugs.openjdk.java.net/browse/JDK-8065804
 http://cr.openjdk.java.net/~martin/webrevs/openjdk9/fence-intrinsics/
 
I think "implies the effect of C++11" is too strong wording; "related"
might be more appropriate.
 
 See also comments here for connection with volatiles:
 https://bugs.openjdk.java.net/browse/JDK-8038978
 
Take note of Hans' correction that fences generally imply more than
volatile load/store, but since you are listing the related things in the
docs, I think the native Java example is good to have.
 
 -Aleksey.
 ___
 Concurrency-interest mailing list
 concurrency-inter...@cs.oswego.edu
 http://cs.oswego.edu/mailman/listinfo/concurrency-interest
 


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread Alexander Terekhov
 memory_order_release meaningful, but somewhat different from the non-fence versions

it would be nice to have a release fence with an artificial dependency
to define the set of stores that actually release, without constraining
other subsequent stores (or the order of the release stores with respect
to each other), e.g.:

// set multiple flags, each indicating 'release', without imposing
// ordering on the 'release' stores with respect to each other and without
// constraining other subsequent stores

.
.
.

if (atomic_thread_fence(memory_order_release)) {

  flag1.store(READY, memory_order_relaxed);
  flag2.store(READY, memory_order_relaxed);

}

regards,
alexander.

Hans Boehm bo...@acm.org@cs.oswego.edu on 29.11.2014 05:56:04

Sent by: concurrency-interest-boun...@cs.oswego.edu


To: Peter Levart peter.lev...@gmail.com
cc: Vladimir Kozlov vladimir.koz...@oracle.com, concurrency-interest
   concurrency-inter...@cs.oswego.edu, Martin Buchholz
   marti...@google.com, core-libs-dev
   core-libs-dev@openjdk.java.net, dhol...@ieee.org
Subject: Re: [concurrency-interest] RFR: 8065804: JEP 171:
   Clarifications/corrections for fence intrinsics


I basically agree with David's observation.  However the C++

atomic_thread_fence(memory_order_acquire)

actually has somewhat different semantics from load(memory_order_acquire).
It basically ensures that prior atomic loads L are not reordered with later
(i.e. following the fence in program order) loads and stores, making it
something like a LoadLoad|LoadStore fence.  Thus the fence orders two sets
of operations where the acquire load orders a single operation with respect
to a set.  This makes the fence versions of memory_order_acquire and
memory_order_release meaningful, but somewhat different from the non-fence
versions.  The terminology is probably not great, but that seems to be the
most common usage now.

On Wed, Nov 26, 2014 at 11:48 PM, Peter Levart peter.lev...@gmail.com wrote:
  On 11/27/2014 04:00 AM, David Holmes wrote:
  Can I make an observation about acquire() and release() - to me they are
  meaningless when considered in isolation. Given their definitions they allow
  anything to move into a region bounded by acquire() and release(), then you
  can effectively move the whole program into the region and thus the
  acquire() and release() do not constrain any reorderings. acquire() and
  release() only make sense when their own movement is constrained with
  respect to something else - such as lock acquisition/release, or when
  combined with specific load/store actions.

  ...or another acquire/release region?

  Regards, Peter



David

Martin Buchholz writes:
  On Tue, Nov 25, 2014 at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote:
    Hi Martin,

    Thanks for looking into this.

    1141  * Currently hotspot's implementation of a Java language-level volatile
    1142  * store has the same effect as a storeFence followed by a relaxed store,
    1143  * although that may be a little stronger than needed.

    IIUC to emulate hotspot's volatile store you will need to say
    that a fullFence immediately follows the relaxed store.

  Right - I've been grokking that.

    The bit that always confuses me about release and acquire is that
    ordering is restricted to one direction, as talked about in
    orderAccess.hpp [1]. So for a release, accesses prior to the
    release cannot move below it, but accesses succeeding the release
    can move above it. And that seems to apply to Unsafe.storeFence
    [2] (acting like a monitor exit). Is that contrary to C++ release
    fences where ordering is restricted both to prior and succeeding
    accesses? [3]

    So what about the following?

      a = r1; // Cannot move below the fence
      Unsafe.storeFence();
      b = r2; // Can move above the fence?

  I think the hotspot docs need to be more precise about when they're
  talking about movement of stores and when about loads.

    // release.  I.e., subsequent memory accesses may float above the
    // release, but prior ones may not float below it.

Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread Martin Buchholz
Hans,

(Thanks for your excellent work on C/C++ 11 and your eternal patience)

On Tue, Nov 25, 2014 at 11:15 AM, Hans Boehm bo...@acm.org wrote:
 It seems to me that a (dubiously named) loadFence is intended to have
 essentially the same semantics as the (perhaps slightly less dubiously
 named) C++ atomic_thread_fence(memory_order_acquire), and a storeFence
 matches atomic_thread_fence(memory_order_release).  The C++ standard and,
 even more so, Mark Batty's work have a precise definition of what those mean
 in terms of implied synchronizes with relationships.

 It looks to me like this whole implementation model for volatiles in terms
 of fences is fundamentally doomed, and it probably makes more sense to get
 rid of it rather than spending time on renaming it (though we just did the
 latter in Android to avoid similar confusion about semantics).  It's

I would also like to see alignment to leverage the technical and
cultural work done on C11.  I would like to see Unsafe get
load-acquire and store-release methods and these should be used in
preference to fences where possible.  I'd like to see the C11 wording
reused as much as possible.  The meanings of the words acquire and
release are now owned by the C11 community and we should tag
along.

A better API for Unsafe would be

putOrdered - storeRelease
put - storeRelaxed
(ordinary volatile write) - store (default is sequential consistent)

etc ...

but the high cost of renaming methods in Unsafe probably makes this a
no-go, even though Unsafe is, in theory, not a public API.

At least the documentation of all the methods should indicate what the
memory effects and the corresponding C++11 memory model interpretation
are.

E.g. Unsafe.compareAndSwap should document the memory effects, i.e.
sequential consistency.

Unsafe doesn't currently have a readAcquire method (mirror of
putOrdered) probably because volatile read is _almost_ the same (but
not on ppc!).

 fundamentally incompatible with the way volatiles/atomics are intended to be
 implemented on ARMv8 (and Itanium).  Which I think fundamentally get this
 much closer to right than traditional fence-based ISAs.

 I'm no hardware architect, but fundamentally it seems to me that

 load x
 acquire_fence

 imposes a much more stringent constraint than

 load_acquire x

 Consider the case in which the load from x is an L1 hit, but a preceding
 load (from say y) is a long-latency miss.  If we enforce ordering by just
 waiting for completion of prior operation, the former has to wait for the
 load from y to complete; while the latter doesn't.  I find it hard to
 believe that this doesn't leave an appreciable amount of performance on the
 table, at least for some interesting microarchitectures.

I agree.  Fences should be used rarely.


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread Martin Buchholz
David, Paul (i.e. Reviewers) and Doug,

I'd like to commit corrections so we make progress.

I think the current webrev is simple progress with the exception of my
attempt to translate volatiles into fences, which is marginal (but was
a good learning exercise for me).


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread Hans Boehm
Needless to say, I would clearly also like to see a simple correspondence.

But this does raise the interesting question of whether put/get and
store(..., memory_order_relaxed)/load(memory_order_relaxed) are intended to
have similar semantics.  I would guess not, in that the former don't
satisfy coherence; accesses to the same variable can be reordered as for
normal variable accesses, while the C++11/C11 variants do provide those
guarantees.  On most, but not all, architectures that's entirely a compiler
issue; the hardware claims to provide that guarantee.

This affects, for example, whether a variable that is only ever incremented
by one thread can appear to another thread to decrease in value.  Or if a
reference set to a non-null value exactly once can appear to change back to
null after appearing non-null.  In my opinion, it makes sense to always
provide coherence for atomics, since the overhead is small, and so are the
odds of getting code relying on non-coherent racing accesses correct.  But
for ordinary variables whose accesses are not intended to race the
trade-offs are very different.



Hans

On Mon, Dec 1, 2014 at 12:40 PM, Martin Buchholz marti...@google.com
wrote:

 Hans,

 (Thanks for your excellent work on C/C++ 11 and your eternal patience)

 On Tue, Nov 25, 2014 at 11:15 AM, Hans Boehm bo...@acm.org wrote:
  It seems to me that a (dubiously named) loadFence is intended to have
  essentially the same semantics as the (perhaps slightly less dubiously
  named) C++ atomic_thread_fence(memory_order_acquire), and a storeFence
  matches atomic_thread_fence(memory_order_release).  The C++ standard and,
  even more so, Mark Batty's work have a precise definition of what those
 mean
  in terms of implied synchronizes with relationships.
 
  It looks to me like this whole implementation model for volatiles in
 terms
  of fences is fundamentally doomed, and it probably makes more sense to
 get
  rid of it rather than spending time on renaming it (though we just did
 the
  latter in Android to avoid similar confusion about semantics).  It's

 I would also like to see alignment to leverage the technical and
 cultural work done on C11.  I would like to see Unsafe get
 load-acquire and store-release methods and these should be used in
 preference to fences where possible.  I'd like to see the C11 wording
 reused as much as possible.  The meanings of the words acquire and
 release are now owned by the C11 community and we should tag
 along.

 A better API for Unsafe would be

 putOrdered - storeRelease
 put - storeRelaxed
 (ordinary volatile write) - store (default is sequential consistent)

 etc ...

 but the high cost of renaming methods in Unsafe probably makes this a
 no-go, even though Unsafe is not a public API in theory.

 At least the documentation of all the methods should indicate what the
 memory effects and the corresponding C++11 memory model interpretation
 is.

 E.g. Unsafe.compareAndSwap should document the memory effects, i.e.
 sequential consistency.

 Unsafe doesn't currently have a readAcquire method (mirror of
 putOrdered) probably because volatile read is _almost_ the same (but
 not on ppc!).

  fundamentally incompatible with the way volatiles/atomics are intended
 to be
  implemented on ARMv8 (and Itanium).  Which I think fundamentally get this
  much closer to right than traditional fence-based ISAs.
 
  I'm no hardware architect, but fundamentally it seems to me that
 
  load x
  acquire_fence
 
  imposes a much more stringent constraint than
 
  load_acquire x
 
  Consider the case in which the load from x is an L1 hit, but a preceding
  load (from say y) is a long-latency miss.  If we enforce ordering by just
  waiting for completion of prior operation, the former has to wait for the
  load from y to complete; while the latter doesn't.  I find it hard to
  believe that this doesn't leave an appreciable amount of performance on
 the
  table, at least for some interesting microarchitectures.

 I agree.  Fences should be used rarely.



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread Doug Lea

On 12/01/2014 03:46 PM, Martin Buchholz wrote:

David, Paul (i.e. Reviewers) and Doug,

I'd like to commit corrections so we make progress.


The current one looks OK to me.
(http://cr.openjdk.java.net/~martin/webrevs/openjdk9/fence-intrinsics/)

-Doug




I think the current webrev is simple progress with the exception of my
attempt to translate volatiles into fences, which is marginal (but was
a good learning exercise for me).





Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread Martin Buchholz
On Mon, Dec 1, 2014 at 1:51 PM, Hans Boehm bo...@acm.org wrote:
 Needless to say, I would clearly also like to see a simple correspondence.

 But this does raise the interesting question of whether put/get and
 store(..., memory_order_relaxed)/load(memory_order_relaxed) are intended to
 have similar semantics.  I would guess not, in that the former don't satisfy
 coherence; accesses to the same variable can be reordered as for normal
 variable accesses, while the C++11/C11 variants do provide those guarantees.
 On most, but not all, architectures that's entirely a compiler issue; the
 hardware claims to provide that guarantee.

 This affects, for example, whether a variable that is only ever incremented
 by one thread can appear to another thread to decrease in value.  Or if a
 reference set to a non-null value exactly once can appear to change back to
 null after appearing non-null.  In my opinion, it makes sense to always
 provide coherence for atomics, since the overhead is small, and so are the
 odds of getting code relying on non-coherent racing accesses correct.  But
 for ordinary variables whose accesses are not intended to race the
 trade-offs are very different.

It would be nice to pretend that ordinary Java loads and stores map
perfectly to C11 relaxed loads and stores.  This maps well to the lack
of undefined behavior for data races in Java, but it also fails because
of the lack of atomicity of Java longs and doubles.  I have no intuition
as to whether always requiring per-variable sequential consistency
would be a performance problem.  Introducing an explicit relaxed
memory-order mode in Java, where the distinction from ordinary accesses
is smaller than in C/C++11, would be confusing.

Despite all that, it would be clean, consistent and seemingly
straightforward to simply add all of the C/C++ atomic loads, stores
and fences to sun.misc.Unsafe (with the possible exception of consume,
which is still under a cloud).  If that works out for jdk-internal
code, we can add them to a public API.  Providing the full set will
help with interoperability with C code running in another thread
accessing a direct buffer.


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-12-01 Thread Hans Boehm
I think that requiring coherence for ordinary Java accesses would be a
performance problem.  The pre-2005 Java memory model actually promised it,
but implementations ignored that requirement.  That was one significant
motivation of the 2005 memory model overhaul.

The basic problem is that if you have

r1 = x.f;
r2 = y.f;
r3 = x.f;

the compiler can no longer perform common subexpression elimination on the
two loads from x.f unless it can prove that x and y do not alias, which is
probably rare.  Loads kill available expressions.  Clearly this can
significantly reduce the effectiveness of CSE and similar basic
optimizations.  Enforcing coherence also turns out to be somewhat expensive
on Itanium, and rather expensive on some slightly older ARM processors.

Those arguments probably don't apply to sun.misc.unsafe.  But no matter
what you do for sun.misc.unsafe, something will be inconsistent.

(The other problem of course is that we still don't really know how to
define memory_order_relaxed any better than we know how to define ordinary
Java memory references.)

On Mon, Dec 1, 2014 at 5:05 PM, Martin Buchholz marti...@google.com wrote:

 On Mon, Dec 1, 2014 at 1:51 PM, Hans Boehm bo...@acm.org wrote:
  Needless to say, I would clearly also like to see a simple
 correspondence.
 
  But this does raise the interesting question of whether put/get and
  store(..., memory_order_relaxed)/load(memory_order_relaxed) are intended
 to
  have similar semantics.  I would guess not, in that the former don't
 satisfy
  coherence; accesses to the same variable can be reordered as for normal
  variable accesses, while the C++11/C11 variants do provide those
 guarantees.
  On most, but not all, architectures that's entirely a compiler issue; the
  hardware claims to provide that guarantee.
 
  This affects, for example, whether a variable that is only ever
 incremented
  by one thread can appear to another thread to decrease in value.  Or if a
  reference set to a non-null value exactly once can appear to change back
 to
  null after appearing non-null.  In my opinion, it makes sense to always
  provide coherence for atomics, since the overhead is small, and so are
 the
  odds of getting code relying on non-coherent racing accesses correct.
 But
  for ordinary variables whose accesses are not intended to race the
  trade-offs are very different.

 It would be nice to pretend that ordinary java loads and stores map
 perfectly to C11 relaxed loads and stores.  This maps well to the lack
 of undefined behavior for data races in Java.  But this fails also
 with lack of atomicity of Java longs and doubles.  I have no intuition
 as to whether always requiring per-variable sequential consistency
 would be a performance problem.  Introducing an explicit relaxed
 memory order mode in Java when the distinction between ordinary access
 is smaller than in C/C++ 11 would be confusing.

 Despite all that, it would be clean, consistent and seemingly
 straightforward to simply add all of the C/C++ atomic loads, stores
 and fences to sun.misc.Unsafe (with the possible exception of consume,
 which is still under a cloud).  If that works out for jdk-internal
 code, we can add them to a public API.  Providing the full set will
 help with interoperability with C code running in another thread
 accessing a direct buffer.



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-27 Thread Doug Lea

On 11/26/2014 09:56 PM, David Holmes wrote:

Martin Buchholz writes:

On Wed, Nov 26, 2014 at 5:08 PM, David Holmes
david.hol...@oracle.com wrote:

Please explain why you have changed the defined semantics for

storeFence.

You have completely reversed the direction of the barrier.


Yes.  I believe the current spec of storeFence was a copy-paste typo,
and it seems others feel likewise.


Can whomever wrote that original spec please confirm that.



The translations of loadFence == [LoadLoad|LoadStore]
and storeFence == [StoreStore|LoadStore] into prose got mangled
at some point. (Probably by me; sorry if so!)

-Doug




Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-26 Thread Martin Buchholz
On Tue, Nov 25, 2014 at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote:
 Hi Martin,

 Thanks for looking into this.

 1141  * Currently hotspot's implementation of a Java language-level volatile
 1142  * store has the same effect as a storeFence followed by a relaxed store,
 1143  * although that may be a little stronger than needed.

 IIUC to emulate hotspot's volatile store you will need to say that a fullFence
 immediately follows the relaxed store.

Right - I've been grokking that.

 The bit that always confuses me about release and acquire is that ordering is
 restricted to one direction, as talked about in orderAccess.hpp [1]. So for a
 release, accesses prior to the release cannot move below it, but accesses
 succeeding the release can move above it. And that seems to apply to
 Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++
 release fences where ordering is restricted both to prior and succeeding
 accesses? [3]

 So what about the following?

   a = r1; // Cannot move below the fence
   Unsafe.storeFence();
   b = r2; // Can move above the fence?

I think the hotspot docs need to be more precise about when they're
talking about movement of stores and when about loads.

 // release.  I.e., subsequent memory accesses may float above the
 // release, but prior ones may not float below it.

As I've said elsewhere, the above makes no sense without restricting
the type of access.


Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-26 Thread Martin Buchholz
On Tue, Nov 25, 2014 at 1:41 PM, Andrew Haley a...@redhat.com wrote:
 On 11/24/2014 08:56 PM, Martin Buchholz wrote:
 + * Currently hotspot's implementation of a Java language-level volatile
 + * store has the same effect as a storeFence followed by a relaxed store,
 + * although that may be a little stronger than needed.

 While this may be true today

No - it was very wrong, since it doesn't give you sequential consistency!

, I'm hopefully about to commit an
 AArch64 OpenJDK port that uses the ARMv8 stlr instruction.  I
 don't think that what you've written here is terribly misleading,
 but bear in mind that it may be there for some time.


RE: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-26 Thread David Holmes
Martin Buchholz writes:
 On Wed, Nov 26, 2014 at 5:08 PM, David Holmes 
 david.hol...@oracle.com wrote:
  Please explain why you have changed the defined semantics for 
 storeFence.
  You have completely reversed the direction of the barrier.
 
 Yes.  I believe the current spec of storeFence was a copy-paste typo,
 and it seems others feel likewise.

Can whomever wrote that original spec please confirm that.

Thanks,
David



RE: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-26 Thread David Holmes
Can I make an observation about acquire() and release() - to me they are
meaningless when considered in isolation. Given their definitions they allow
anything to move into a region bounded by acquire() and release(), then you
can effectively move the whole program into the region and thus the
acquire() and release() do not constrain any reorderings. acquire() and
release() only make sense when their own movement is constrained with
respect to something else - such as lock acquisition/release, or when
combined with specific load/store actions.

David

Martin Buchholz writes:

 On Tue, Nov 25, 2014 at 6:04 AM, Paul Sandoz
 paul.san...@oracle.com wrote:
  Hi Martin,
 
  Thanks for looking into this.
 
  1141  * Currently hotspot's implementation of a Java
 language-level volatile
  1142  * store has the same effect as a storeFence followed
 by a relaxed store,
  1143  * although that may be a little stronger than needed.
 
  IIUC to emulate hotspot's volatile store you will need to say
 that a fullFence immediately follows the relaxed store.

 Right - I've been grokking that.

  The bit that always confuses me about release and acquire is that
 ordering is restricted to one direction, as talked about in
 orderAccess.hpp [1]. So for a release, accesses prior to the
 release cannot move below it, but accesses succeeding the release
 can move above it. And that seems to apply to Unsafe.storeFence
 [2] (acting like a monitor exit). Is that contrary to C++ release
 fences where ordering is restricted both to prior and succeeding
 accesses? [3]
 
  So what about the following?
 
a = r1; // Cannot move below the fence
Unsafe.storeFence();
b = r2; // Can move above the fence?

 I think the hotspot docs need to be more precise about when they're
 talking about movement of stores and when about loads.

  // release.  I.e., subsequent memory accesses may float above the
  // release, but prior ones may not float below it.

 As I've said elsewhere, the above makes no sense without restricting
 the type of access.




Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-26 Thread Martin Buchholz
On Wed, Nov 26, 2014 at 7:00 PM, David Holmes davidchol...@aapt.net.au wrote:
 Can I make an observation about acquire() and release() - to me they are
 meaningless when considered in isolation. Given their definitions they allow
 anything to move into a region bounded by acquire() and release(), then you
 can effectively move the whole program into the region and thus the
 acquire() and release() do not constrain any reorderings. acquire() and
 release() only make sense when their own movement is constrained with
 respect to something else - such as lock acquisition/release, or when
 combined with specific load/store actions.

David, it seems you are agreeing with my argument below.  The
definitions in the hotspot sources should be fixed, in the same sort
of way that I'm trying to make the specs for Unsafe loads clearer and
more precise.

 David

 Martin Buchholz writes:
 I think the hotspot docs need to be more precise about when they're
 talking about movement of stores and when about loads.

  // release.  I.e., subsequent memory accesses may float above the
  // release, but prior ones may not float below it.

 As I've said elsewhere, the above makes no sense without restricting
 the type of access.


RE: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-26 Thread David Holmes
Martin writes:
 On Wed, Nov 26, 2014 at 7:00 PM, David Holmes 
 davidchol...@aapt.net.au wrote:
  Can I make an observation about acquire() and release() - to me they are
  meaningless when considered in isolation. Given their 
 definitions they allow
  anything to move into a region bounded by acquire() and 
 release(), then you
  can effectively move the whole program into the region and thus the
  acquire() and release() do not constrain any reorderings. acquire() and
  release() only make sense when their own movement is constrained with
  respect to something else - such as lock acquisition/release, or when
  combined with specific load/store actions.
 
 David, it seems you are agreeing with my argument below.  The
 definitions in the hotspot sources should be fixed, in the same sort
 of way that I'm trying to make the specs for Unsafe loads clearer and
 more precise.

Please see:

https://bugs.openjdk.java.net/browse/JDK-7143664

Though I'm not sure my ramblings there reflect my current thoughts on all this.
I really think acquire/release are too confusingly used to be useful - by which
I mean that the names do not reflect their actions, so you will always have to
remember/look up exactly what *release* and *acquire* mean in that context, and
hence talking about acquire semantics and release semantics becomes
meaningless. In contrast, the loadload|loadstore etc. barriers are completely
straightforward to understand from their names. However, it seems they are too
strong compared to what recent hardware provides.

Hotspot implementations in orderAccess are confusing - barriers with different
semantics have been defined in terms of one another, but the low-level
implementations provide a barrier that is stronger than the required semantics,
so the high-level APIs are satisfied correctly, even if not implemented in a
way that makes sense if you reason about what each barrier theoretically allows.

David
 
  David
 
  Martin Buchholz writes:
  I think the hotspot docs need to be more precise about when they're
  talking about movement of stores and when about loads.
 
   // release.  I.e., subsequent memory accesses may float above the
   // release, but prior ones may not float below it.
 
  As I've said elsewhere, the above makes no sense without restricting
  the type of access.
 



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-26 Thread Peter Levart

On 11/27/2014 04:00 AM, David Holmes wrote:

Can I make an observation about acquire() and release() - to me they are
meaningless when considered in isolation. Given their definitions they allow
anything to move into a region bounded by acquire() and release(), then you
can effectively move the whole program into the region and thus the
acquire() and release() do not constrain any reorderings.
  acquire() and
release() only make sense when their own movement is constrained with
respect to something else - such as lock acquisition/release, or when
combined with specific load/store actions.


...or another acquire/release region?

Regards, Peter



David

Martin Buchholz writes:
  On Tue, Nov 25, 2014 at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote:
    Hi Martin,

    Thanks for looking into this.

    1141  * Currently hotspot's implementation of a Java language-level volatile
    1142  * store has the same effect as a storeFence followed by a relaxed store,
    1143  * although that may be a little stronger than needed.

    IIUC to emulate hotspot's volatile store you will need to say
    that a fullFence immediately follows the relaxed store.

  Right - I've been grokking that.

    The bit that always confuses me about release and acquire is that
    ordering is restricted to one direction, as talked about in
    orderAccess.hpp [1]. So for a release, accesses prior to the
    release cannot move below it, but accesses succeeding the release
    can move above it. And that seems to apply to Unsafe.storeFence
    [2] (acting like a monitor exit). Is that contrary to C++ release
    fences where ordering is restricted both to prior and succeeding
    accesses? [3]

    So what about the following?

      a = r1; // Cannot move below the fence
      Unsafe.storeFence();
      b = r2; // Can move above the fence?

  I think the hotspot docs need to be more precise about when they're
  talking about movement of stores and when about loads.

    // release.  I.e., subsequent memory accesses may float above the
    // release, but prior ones may not float below it.

  As I've said elsewhere, the above makes no sense without restricting
  the type of access.





Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-25 Thread Paul Sandoz
Hi Martin,

Thanks for looking into this.

1141  * Currently hotspot's implementation of a Java language-level volatile
1142  * store has the same effect as a storeFence followed by a relaxed store,
1143  * although that may be a little stronger than needed.

IIUC to emulate hotspot's volatile store you will need to say that a fullFence
immediately follows the relaxed store.

The bit that always confuses me about release and acquire is that ordering is
restricted to one direction, as talked about in orderAccess.hpp [1]. So for a
release, accesses prior to the release cannot move below it, but accesses
succeeding the release can move above it. And that seems to apply to
Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++
release fences where ordering is restricted both to prior and succeeding
accesses? [3]

So what about the following?

  a = r1; // Cannot move below the fence
  Unsafe.storeFence();
  b = r2; // Can move above the fence?
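
[Editorial note: the publication pattern behind this question can be made concrete with the Java 9+ VarHandle fences, the public successors of the Unsafe intrinsics discussed here. An illustrative sketch only; the class and field names are invented, and the join in main keeps the demo deterministic rather than exercising a real race.]

```java
import java.lang.invoke.VarHandle;

public class Publish {
    static int data;        // payload, written with a plain store
    static boolean ready;   // publication flag, also a plain store

    static void writer() {
        data = 42;
        VarHandle.releaseFence(); // stores above may not reorder below the fence
        ready = true;             // publish
    }

    static int reader() {
        if (ready) {
            VarHandle.acquireFence(); // loads below may not reorder above the fence
            return data;              // 42 is guaranteed once ready is seen
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread w = new Thread(Publish::writer);
        w.start();
        w.join(); // join also establishes happens-before, so the demo is deterministic
        System.out.println(reader()); // prints 42
    }
}
```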

Paul.

[1] In orderAccess.hpp
// Execution by a processor of release makes the effect of all memory
// accesses issued by it previous to the release visible to all
// processors *before* the release completes.  The effect of subsequent
// memory accesses issued by it *may* be made visible *before* the
// release.  I.e., subsequent memory accesses may float above the
// release, but prior ones may not float below it.

[2] In memnode.hpp
// Release - no earlier ref can move after (but later refs can move
// up, like a speculative pipelined cache-hitting Load).  Requires
// multi-cpu visibility.  Inserted independent of any store, as required
// for intrinsic sun.misc.Unsafe.storeFence().
class StoreFenceNode: public MemBarNode {
public:
  StoreFenceNode(Compile* C, int alias_idx, Node* precedent)
: MemBarNode(C, alias_idx, precedent) {}
  virtual int Opcode() const;
};

[3] 
http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/

On Nov 25, 2014, at 1:47 AM, Martin Buchholz marti...@google.com wrote:

 OK, I worked in some wording for comparison with volatiles.
 I believe you when you say that the semantics of the corresponding C++
 fences are slightly different, but it's rather subtle - can we say
 anything more than "closely related to"?
 
 On Mon, Nov 24, 2014 at 1:29 PM, Aleksey Shipilev
 aleksey.shipi...@oracle.com wrote:
 Hi Martin,
 
 On 11/24/2014 11:56 PM, Martin Buchholz wrote:
 Review carefully - I am trying to learn about fences by explaining them!
 I have borrowed some wording from my reviewers!
 
 https://bugs.openjdk.java.net/browse/JDK-8065804
 http://cr.openjdk.java.net/~martin/webrevs/openjdk9/fence-intrinsics/
 
 I think "implies the effect of C++11" is too strong wording. "related"
 might be more appropriate.
 
 See also comments here for connection with volatiles:
 https://bugs.openjdk.java.net/browse/JDK-8038978
 
 Take note of Hans' correction that fences generally imply more than
 volatile load/store, but since you are listing the related things in the
 docs, I think the native Java example is good to have.
 
 -Aleksey.
 
 



Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-25 Thread Andrew Haley
On 11/24/2014 08:56 PM, Martin Buchholz wrote:
 Hi folks,
 
 Review carefully - I am trying to learn about fences by explaining them!
 I have borrowed some wording from my reviewers!

+ * Currently hotspot's implementation of a Java language-level volatile
+ * store has the same effect as a storeFence followed by a relaxed store,
+ * although that may be a little stronger than needed.

While this may be true today, I'm hopefully about to commit an
AArch64 OpenJDK port that uses the ARMv8 stlr instruction.  I
don't think that what you've written here is terribly misleading,
but bear in mind that it may be there for some time.

Andrew.



RE: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

2014-11-25 Thread David Holmes
Stephan Diestelhorst writes:

 Am Dienstag, 25. November 2014, 11:15:36 schrieb Hans Boehm:
  I'm no hardware architect, but fundamentally it seems to me that
 
  load x
  acquire_fence
 
  imposes a much more stringent constraint than
 
  load_acquire x
 
   Consider the case in which the load from x is an L1 hit, but a preceding
   load (from say y) is a long-latency miss.  If we enforce ordering by
   just waiting for completion of prior operations, the former has to
   wait for the load from y to complete, while the latter doesn't.  I
   find it hard to believe that this doesn't leave an appreciable amount
   of performance on the table, at least for some interesting
   microarchitectures.

 I agree, Hans, that this is a reasonable assumption.  Load_acquire x
 does allow roach motel, whereas the acquire fence does not.
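
[Editorial note: the contrast Hans and Stephan are drawing can be shown with the Java 9+ VarHandle API, the public successor of the intrinsics in this thread. A sketch with invented names: an acquire fence orders all prior loads against all later accesses, whereas a load-acquire constrains only that one load.]

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class AcquireDemo {
    static int x;
    static final VarHandle X;
    static {
        try {
            X = MethodHandles.lookup()
                    .findStaticVarHandle(AcquireDemo.class, "x", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int fenceVersion() {
        int v = x;                // plain load
        VarHandle.acquireFence(); // orders ALL prior loads (including slow,
                                  // unrelated ones) before all later accesses
        return v;
    }

    static int loadAcquireVersion() {
        return (int) X.getAcquire(); // orders only THIS load before later
                                     // accesses; prior loads stay unconstrained
    }

    public static void main(String[] args) {
        x = 7;
        System.out.println(fenceVersion() + " " + loadAcquireVersion()); // prints 7 7
    }
}
```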

   In addition, for better or worse, fencing requirements on at least
   Power are actually driven as much by store atomicity issues, as by
   the ordering issues discussed in the cookbook.  This was not
   understood in 2005, and unfortunately doesn't seem to be amenable to
   the kind of straightforward explanation as in Doug's cookbook.

 Coming from a strongly ordered architecture to a weakly ordered one
 myself, I also needed some mental adjustment about store (multi-copy)
 atomicity.  I can imagine others will be unaware of this difference,
 too, even in 2014.

Sorry, I'm missing the connection between fences and multi-copy atomicity.

David

 Stephan
