Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
If we look at using purely store fences and purely load fences in the initialized-flag example in this discussion, I think it's worth distinguishing two possible scenarios: 1) We guarantee some form of dependency-based ordering, as most real computer architectures do. This probably invalidates the example from my committee paper that's under discussion here. The problem is, as always, that we don't know how to make this precise at the programming-language level. It's the compiler's job to break certain dependencies, like the dependency of the store to x on the load of y in x = 0 * y. Many people are thinking about this problem, both to deal with out-of-thin-air issues correctly in various memory models, and to design a version of C++'s memory_order_consume that's more usable. If we had a way to guarantee some well-defined notion of dependency-based ordering, then at least some of the examples here would need to be revisited. 2) We don't guarantee that dependencies imply any sort of ordering. Then I think the weird example under discussion here stands. There is officially nothing to prevent the load of x.a in thread 1 from being reordered with the store to x_init. But there may actually be better examples as to why the store-store ordering in the initializing thread is not always enough. Consider:

Thread 1:
x.a = 1;
if (x.a != 1) world_is_broken = true;
StoreStore fence;
x_init = true;
...
if (world_is_broken) die();

Thread 2:
if (x_init) { full fence; x.a++; }

I think there is nothing to prevent the read of x.a in Thread 1 from seeing the incremented value, at least if (1) the compiler promotes world_is_broken to a register, and (2) at the assembly level the store to x_init is not dependent on the load of x.a. (1) seems quite plausible, and (2) seems very reasonable if the architecture has a conditional-move instruction or the like. (For Itanium, (2) holds even for the naive compilation.)
This is not a particularly likely scenario, but I have no idea how one would concoct programming rules that would guarantee to prevent this kind of weirdness. The first two statements of Thread 1 might appear inside a library initialization routine that knows nothing about concurrency. Hans On Wed, Dec 17, 2014 at 10:54 AM, Martin Buchholz marti...@google.com wrote: On Wed, Dec 17, 2014 at 1:28 AM, Peter Levart peter.lev...@gmail.com wrote: On 12/17/2014 03:28 AM, David Holmes wrote: On 17/12/2014 10:06 AM, Martin Buchholz wrote: Hans allows for the nonsensical, in my view, possibility that the load of x.a can happen after the x_init=true store and yet somehow be subject to the ++ and the ensuing store that has to come before the x_init = true. Perhaps, he is speaking about why it is dangerous to replace BOTH release with just store-store AND acquire with just load-load? I'm pretty sure he's talking about weakening EITHER. Clearly, and unsurprisingly, it is unsafe to replace the load_acquire with a version that restricts only load ordering in this case. That would allow the store to x in thread 2 to become visible before the initialization of x by thread 1 is complete, possibly losing the update, or corrupting the state of x during initialization. More interestingly, it is also generally unsafe to restrict the release ordering constraint in thread 1 to only stores. (What's clear and unsurprising to Hans may not be to the rest of us)
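[Editor's note] Hans's Thread 1 / Thread 2 example can be sketched in Java using the VarHandle fence methods (Java 9+) as stand-ins for the raw barriers; the class and field names here are invented for illustration. Run sequentially the outcome is benign; the point of the example is what a compiler plus weak hardware may legally do when the threads run concurrently with only a StoreStore fence on the writer side.

```java
import java.lang.invoke.VarHandle;

// Sketch of Hans's example. Names (Holder, x, xInit, worldIsBroken) are
// hypothetical. storeStoreFence() orders stores against later stores only;
// it does NOT order the load of x.a in the 'if' against the later store to
// xInit, which is exactly the loophole the example exploits.
class StoreStoreHazard {
    static class Holder { int a; }
    static final Holder x = new Holder();
    static boolean xInit;
    static boolean worldIsBroken;

    static void thread1() {
        x.a = 1;
        if (x.a != 1) worldIsBroken = true; // compiler may keep this flag in a register
        VarHandle.storeStoreFence();        // StoreStore only: no LoadStore constraint
        xInit = true;
        // ... later ...
        if (worldIsBroken) throw new IllegalStateException("die()");
    }

    static void thread2() {
        if (xInit) {
            VarHandle.fullFence();
            x.a++;
        }
    }
}
```

Replacing storeStoreFence() with releaseFence() adds the missing LoadStore constraint and closes the loophole Hans describes.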
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 12/17/2014 03:28 AM, David Holmes wrote: On 17/12/2014 10:06 AM, Martin Buchholz wrote: On Thu, Dec 11, 2014 at 1:08 AM, Andrew Haley a...@redhat.com wrote: On 11/12/14 00:53, David Holmes wrote: There are many good uses of storestore in the various lock-free algorithms inside hotspot. There may be many uses, but I am extremely suspicious of how good they are. I wonder if we went through all the uses of storestore in hotspot how many bugs we'd find. As far as I can see (in the absence of other barriers) the only things you can write before a storestore are constants. Hans has provided us with the canonical writeup opposing store-store and load-load barriers, here: http://www.hboehm.info/c++mm/no_write_fences.html Few programmers will be able to deal confidently with causality defying time paradoxes, especially loads from the future. Well I take that with a grain of salt - Hans dismisses ordering based on dependencies which puts us into the realm of those causality defying time paradoxes in my opinion. Given:

x.a = 0;
x.a++
storestore
x_init = true

Hans allows for the nonsensical, in my view, possibility that the load of x.a can happen after the x_init=true store and yet somehow be subject to the ++ and the ensuing store that has to come before the x_init = true. David - Perhaps, he is speaking about why it is dangerous to replace BOTH release with just store-store AND acquire with just load-load? The example would then become:

T1:
store x.a <- 0
load r <- x.a
store x.a <- r+1
; store-store
store x_init <- true

T2:
load r <- x.a
; load-load
if (r) store x.a <- 42

Suppose a store on some hypothetical architecture is actually a two-phase execution: prepare-store, commit-store. With prepare-store imagined as speculative posting of the store to the write buffer and commit-store just marking it in the write buffer as committed, so that it is written to main memory on write-buffer flush.
Non-committed stores are not written to main memory, but are allowed to be visible to loads in some threads (executing on the same core?) which are not ordered by load-store before the speculative prepare-store. A load-load does not prevent T2 from being executed as follows:

T2:
prepare-store x.a <- 42
load r <- x.a
; load-load
if (r) commit-store x.a <- 42

Now this speculative prepare-store can (in real time) happen long before T1's instructions are executed. The loads in T1 are then allowed to see this speculative prepare-store from T2, because just store-store does not logically order them in any way - only load-store would. Does this make sense? Regards, Peter
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 12/17/2014 10:28 AM, Peter Levart wrote: The example would then become:

T1:
store x.a <- 0
load r <- x.a
store x.a <- r+1
; store-store
store x_init <- true

T2:
load r <- x.a
; load-load
if (r) store x.a <- 42

Sorry, the above has an error. I meant:

T2:
load r <- x_init
; load-load
if (r) store x.a <- 42

Peter
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Thu, Dec 11, 2014 at 1:08 AM, Andrew Haley a...@redhat.com wrote: On 11/12/14 00:53, David Holmes wrote: There are many good uses of storestore in the various lock-free algorithms inside hotspot. There may be many uses, but I am extremely suspicious of how good they are. I wonder if we went through all the uses of storestore in hotspot how many bugs we'd find. As far as I can see (in the absence of other barriers) the only things you can write before a storestore are constants. Hans has provided us with the canonical writeup opposing store-store and load-load barriers, here: http://www.hboehm.info/c++mm/no_write_fences.html Few programmers will be able to deal confidently with causality defying time paradoxes, especially loads from the future.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 17/12/2014 10:06 AM, Martin Buchholz wrote: On Thu, Dec 11, 2014 at 1:08 AM, Andrew Haley a...@redhat.com wrote: On 11/12/14 00:53, David Holmes wrote: There are many good uses of storestore in the various lock-free algorithms inside hotspot. There may be many uses, but I am extremely suspicious of how good they are. I wonder if we went through all the uses of storestore in hotspot how many bugs we'd find. As far as I can see (in the absence of other barriers) the only things you can write before a storestore are constants. Hans has provided us with the canonical writeup opposing store-store and load-load barriers, here: http://www.hboehm.info/c++mm/no_write_fences.html Few programmers will be able to deal confidently with causality defying time paradoxes, especially loads from the future. Well I take that with a grain of salt - Hans dismisses ordering based on dependencies which puts us into the realm of those causality defying time paradoxes in my opinion. Given:

x.a = 0;
x.a++
storestore
x_init = true

Hans allows for the nonsensical, in my view, possibility that the load of x.a can happen after the x_init=true store and yet somehow be subject to the ++ and the ensuing store that has to come before the x_init = true. David -
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 11/12/14 00:53, David Holmes wrote: On 11/12/2014 7:02 AM, Andrew Haley wrote: On 12/05/2014 09:49 PM, Martin Buchholz wrote: The actual implementations of storestore (see below) seem to universally give you the stronger ::release barrier, and it seems likely that hotspot engineers are implicitly relying on that, that some uses of ::storestore in the hotspot sources are bugs (should be ::release instead) and that there is very little potential performance benefit from using ::storestore instead of ::release, precisely because the additional loadstore barrier is very close to free on all current hardware. That's not really true for ARM, where the additional loadstore requires a full barrier. There is a good use for a storestore, which is when zeroing a newly-created object. There are many good uses of storestore in the various lock-free algorithms inside hotspot. There may be many uses, but I am extremely suspicious of how good they are. I wonder if we went through all the uses of storestore in hotspot how many bugs we'd find. As far as I can see (in the absence of other barriers) the only things you can write before a storestore are constants. Andrew.
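[Editor's note] Andrew's "only constants" observation and the zeroing use case can be illustrated with VarHandle.storeStoreFence() (names here are hypothetical): when every store before the fence writes a constant, the missing LoadStore ordering cannot matter, because there is no prior load whose ordering could be violated.

```java
import java.lang.invoke.VarHandle;

// The one clearly safe storestore pattern described above: publish an
// object whose fields were only ever assigned constants (e.g. zeroing a
// newly-created object). A sketch, not hotspot's actual code.
class ConstPublish {
    static final class Point {
        int x, y;
        Point() { x = 0; y = 0; } // constant (zeroing) stores only
    }

    static Point shared; // read by other threads after publication

    static void publish() {
        Point p = new Point();       // stores of constants only
        VarHandle.storeStoreFence(); // orders the zeroing before the publishing store
        shared = p;
    }
}
```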
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Today I Learned that release fence and acquire fence are technical terms defined in the C/C++ 11 standards. So my latest version reads instead:

* Ensures that loads and stores before the fence will not be reordered with
* stores after the fence; a StoreStore plus LoadStore barrier.
*
* Corresponds to C11 atomic_thread_fence(memory_order_release)
* (a release fence).
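[Editor's note] The revised wording pairs naturally with its acquire-side counterpart. A sketch using the VarHandle analogues of these fences (releaseFence/acquireFence, Java 9+); the field names are invented for illustration:

```java
import java.lang.invoke.VarHandle;

// Release fence on the writer side (StoreStore + LoadStore), matched by
// an acquire fence on the reader side (LoadLoad + LoadStore).
class ReleaseAcquire {
    static int payload;
    static boolean ready;

    static void writer() {
        payload = 42;
        VarHandle.releaseFence(); // prior loads/stores not reordered with later stores
        ready = true;
    }

    static int reader() {
        if (ready) {
            VarHandle.acquireFence(); // prior loads not reordered with later loads/stores
            return payload;           // 42, once ready has been observed true
        }
        return -1; // flag not observed yet
    }
}
```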
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 12/05/2014 09:49 PM, Martin Buchholz wrote: The actual implementations of storestore (see below) seem to universally give you the stronger ::release barrier, and it seems likely that hotspot engineers are implicitly relying on that, that some uses of ::storestore in the hotspot sources are bugs (should be ::release instead) and that there is very little potential performance benefit from using ::storestore instead of ::release, precisely because the additional loadstore barrier is very close to free on all current hardware. That's not really true for ARM, where the additional loadstore requires a full barrier. There is a good use for a storestore, which is when zeroing a newly-created object. Andrew.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 11/12/2014 3:31 AM, Martin Buchholz wrote: Today I Learned that release fence and acquire fence are technical terms defined in the C/C++ 11 standards. So my latest version reads instead:

* Ensures that loads and stores before the fence will not be reordered with
* stores after the fence; a StoreStore plus LoadStore barrier.
*
* Corresponds to C11 atomic_thread_fence(memory_order_release)
* (a release fence).

Thank you Martin - I find the updated wording much more appropriate. For the email record, as I have written in the bug report, I think the correction of the semantics for storeFence has resulted in problematic naming, where storeFence and loadFence have opposite directionality constraints but the names suggest the same directionality constraints. Had the original API suggested these names with the revised semantics I would have argued against them. This area is confusing enough without adding to that confusion with names that don't suggest the action. David
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 11/12/2014 7:02 AM, Andrew Haley wrote: On 12/05/2014 09:49 PM, Martin Buchholz wrote: The actual implementations of storestore (see below) seem to universally give you the stronger ::release barrier, and it seems likely that hotspot engineers are implicitly relying on that, that some uses of ::storestore in the hotspot sources are bugs (should be ::release instead) and that there is very little potential performance benefit from using ::storestore instead of ::release, precisely because the additional loadstore barrier is very close to free on all current hardware. That's not really true for ARM, where the additional loadstore requires a full barrier. There is a good use for a storestore, which is when zeroing a newly-created object. There are many good uses of storestore in the various lock-free algorithms inside hotspot. David Andrew.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Wed, Dec 10, 2014 at 4:52 PM, David Holmes david.hol...@oracle.com wrote: For the email record, as I have written in the bug report, I think the correction of the semantics for storeFence has resulted in problematic naming, where storeFence and loadFence have opposite directionality constraints but the names suggest the same directionality constraints. Had the original API suggested these names with the revised semantics I would have argued against them. This area is confusing enough without adding to that confusion with names that don't suggest the action. I also dislike the names of the atomic methods in Unsafe and would like to align them as much as possible with C/C++ 11 atomics nomenclature.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Wed, Dec 10, 2014 at 1:02 PM, Andrew Haley a...@redhat.com wrote: On 12/05/2014 09:49 PM, Martin Buchholz wrote: The actual implementations of storestore (see below) seem to universally give you the stronger ::release barrier, and it seems likely that hotspot engineers are implicitly relying on that, that some uses of ::storestore in the hotspot sources are bugs (should be ::release instead) and that there is very little potential performance benefit from using ::storestore instead of ::release, precisely because the additional loadstore barrier is very close to free on all current hardware. That's not really true for ARM, where the additional loadstore requires a full barrier. There is a good use for a storestore, which is when zeroing a newly-created object. Andrew and David, I agree that given ARM's decision to provide the choice of StoreStore and full fences, hotspot is right to use them in a few carefully chosen places, like object initializers. But ARM's decision seems poor to me. No mainstream language (like C/C++ or Java) is likely to support those in any way accessible to user programs (not even via Java Unsafe). Making their StoreStore barrier also do LoadStore would dramatically increase the applicability of their instruction with low cost. Maybe it's not too late for ARM to do so.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Mon, Dec 8, 2014 at 8:51 PM, David Holmes david.hol...@oracle.com wrote: Then please point me to the common industry definition of it because I couldn't find anything definitive. And as you state yourself above one definition of it - the corresponding C11 fence - does not in fact have the same semantics! I changed to the terminology acquire fence and release fence as popularized by preshing http://preshing.com/20130922/acquire-and-release-fences/

* Ensures that loads and stores before the fence will not be reordered with
* stores after the fence; also referred to as a StoreStore plus LoadStore
* barrier, or as a release fence.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Mon, Dec 8, 2014 at 8:35 PM, David Holmes david.hol...@oracle.com wrote: So (as you say) with TSO you don't have a total order of stores if you read your own writes out of your own CPU's write buffer. However, my interpretation of multiple-copy atomic is that the initial publishing thread can choose to use an instruction with sufficiently strong memory barrier attached (e.g. LOCK;XXX on x86) to write to memory so that the write buffer is flushed and then use plain relaxed loads everywhere else to read those memory locations and this explains the situation on x86 and sparc where volatile writes are expensive and volatile reads are free and you get sequential consistency for Java volatiles. We don't use lock'd instructions for volatile stores on x86, but the trailing mfence achieves the flushing. However this still raised some questions for me. Using a mfence on x86 or equivalent on sparc, is no different to issuing a DMB SYNC on ARM, or a SYNC on PowerPC. They each ensure TSO for volatile stores with global visibility. So when such fences are used the resulting system should be multiple-copy atomic - no? (No!**) And there seems to be an equivalence between being multiple-copy atomic and providing the IRIW property. Yet we know that on ARM/Power, as per the paper, TSO with global visibility is not ARM/Power don't have TSO. sufficient to achieve IRIW. So what is it that x86 and sparc have in addition to TSO that provides for IRIW? We have both been learning to think in new ways. I found the second section of Peter Sewell's tutorial 2 From Sequential Consistency to Relaxed Memory Models to be most useful, especially the diagrams. I pondered this for quite a while before realizing that the mfence on x86 (or equivalent on sparc) is not in fact playing the same role as the DMB/SYNC on ARM/PPC. 
The key property that x86 and sparc have (and we can ignore the store buffers) is that stores become globally visible - if any other thread sees a store then all other threads see the same store. Whereas on ARM/PPC you can imagine a store casually making its way through the system, gradually becoming visible to more and more threads - unless there is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW placing the DMB/SYNC after the store does not suffice because prior to the DMB/SYNC the store may be visible to an arbitrary subset of threads. Consequently IRIW requires the DMB/SYNC between the loads - to ensure that each thread, on its second load, sees the value that the other thread saw on its first load (ref Section 6.1 of the paper). ** So using DMB/SYNC does not achieve multiple-copy atomicity, because until the DMB/SYNC happens different threads can have different views of memory. To me, the most desirable property of x86-style TSO is that barriers are only necessary on stores to achieve sequential consistency - the publisher gets to decide. Volatile reads can then be close to free. All of which reinforces to me that IRIW is an undesirable property to have to implement. YMMV. (And I also need to re-examine the PPC64 implementation to see exactly where they add/remove barriers when IRIW is enabled.) I believe you get a full sync between volatile reads.

#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
  oop p = JNIHandles::resolve(obj); \
  if (support_IRIW_for_not_multiple_copy_atomic_cpu) { \
    OrderAccess::fence(); \
  } \

Cheers, David http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 10/12/2014 4:15 AM, Martin Buchholz wrote: On Mon, Dec 8, 2014 at 8:35 PM, David Holmes david.hol...@oracle.com wrote: So (as you say) with TSO you don't have a total order of stores if you read your own writes out of your own CPU's write buffer. However, my interpretation of multiple-copy atomic is that the initial publishing thread can choose to use an instruction with sufficiently strong memory barrier attached (e.g. LOCK;XXX on x86) to write to memory so that the write buffer is flushed and then use plain relaxed loads everywhere else to read those memory locations and this explains the situation on x86 and sparc where volatile writes are expensive and volatile reads are free and you get sequential consistency for Java volatiles. We don't use lock'd instructions for volatile stores on x86, but the trailing mfence achieves the flushing. However this still raised some questions for me. Using a mfence on x86 or equivalent on sparc, is no different to issuing a DMB SYNC on ARM, or a SYNC on PowerPC. They each ensure TSO for volatile stores with global visibility. So when such fences are used the resulting system should be multiple-copy atomic - no? (No!**) And there seems to be an equivalence between being multiple-copy atomic and providing the IRIW property. Yet we know that on ARM/Power, as per the paper, TSO with global visibility is not ARM/Power don't have TSO. Yes we all know that. Please re-read what I wrote. sufficient to achieve IRIW. So what is it that x86 and sparc have in addition to TSO that provides for IRIW? We have both been learning to think in new ways. I found the second section of Peter Sewell's tutorial 2 From Sequential Consistency to Relaxed Memory Models to be most useful, especially the diagrams. I pondered this for quite a while before realizing that the mfence on x86 (or equivalent on sparc) is not in fact playing the same role as the DMB/SYNC on ARM/PPC. 
The key property that x86 and sparc have (and we can ignore the store buffers) is that stores become globally visible - if any other thread sees a store then all other threads see the same store. Whereas on ARM/PPC you can imagine a store casually making its way through the system, gradually becoming visible to more and more threads - unless there is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW placing the DMB/SYNC after the store does not suffice because prior to the DMB/SYNC the store may be visible to an arbitrary subset of threads. Consequently IRIW requires the DMB/SYNC between the loads - to ensure that each thread, on its second load, sees the value that the other thread saw on its first load (ref Section 6.1 of the paper). ** So using DMB/SYNC does not achieve multiple-copy atomicity, because until the DMB/SYNC happens different threads can have different views of memory. To me, the most desirable property of x86-style TSO is that barriers are only necessary on stores to achieve sequential consistency - the publisher gets to decide. Volatile reads can then be close to free. TSO doesn't need store barriers for sequential consistency. It is somewhat amusing I think that the free-ness of volatile reads on TSO comes from the fact that all writes cause global memory synchronization. But because we can't turn that off we can't actually measure the cost we pay for those synchronizing writes. In contrast on non-TSO we have to explicitly cause synchronizing writes and so potentially require synchronizing reads - and then complain because the hidden costs are no longer hidden :) All of which reinforces to me that IRIW is an undesirable property to have to implement. YMMV. (And I also need to re-examine the PPC64 implementation to see exactly where they add/remove barriers when IRIW is enabled.) I believe you get a full sync between volatile reads.
#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
  oop p = JNIHandles::resolve(obj); \
  if (support_IRIW_for_not_multiple_copy_atomic_cpu) { \
    OrderAccess::fence(); \
  } \

Yes, it was more the remove part whose details I was unsure of - I think they simply remove the trailing fence (i.e. PPC SYNC) from the volatile writes. Thanks, David Cheers, David http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Sun, Dec 7, 2014 at 2:58 PM, David Holmes david.hol...@oracle.com wrote: I believe the comment _does_ reflect hotspot's current implementation (entirely from exploring the sources). I believe it's correct to say all of the platforms are multiple-copy-atomic except PPC. ... current hotspot sources don't contain ARM support. Here is the definition of multi-copy atomicity from the ARM architecture manual: In a multiprocessing system, writes to a memory location are multi-copy atomic if the following conditions are both true: • All writes to the same location are serialized, meaning they are observed in the same order by all observers, although some observers might not observe all of the writes. • A read of a location does not return the value of a write until all observers observe that write. The hotspot sources give

// To assure the IRIW property on processors that are not multiple copy
// atomic, sync instructions must be issued between volatile reads to
// assure their ordering, instead of after volatile stores.
// (See A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
// by Luc Maranget, Susmit Sarkar and Peter Sewell, INRIA/Cambridge)
#ifdef CPU_NOT_MULTIPLE_COPY_ATOMIC
const bool support_IRIW_for_not_multiple_copy_atomic_cpu = true;

and the referenced paper gives: on POWER and ARM, two threads can observe writes to different locations in different orders, even in the absence of any thread-local reordering. In other words, the architectures are not multiple-copy atomic [Col92]. which strongly suggests that x86 and sparc are OK. The first condition is met by Total-Store-Order (TSO) systems like x86 and sparc; and not by relaxed-memory-order (RMO) systems like ARM and PPC. However the second condition is not met simply by having TSO. If the local processor can see a write from the local store buffer prior to it being visible to other processors, then we do not have multi-copy atomicity and I believe that is true for x86 and sparc.
Hence none of our supported platforms are multi-copy-atomic as far as I can see. I believe hotspot must implement IRIW correctly to fulfil the promise of sequential consistency for standard Java, so on ppc volatile reads get a full fence, which leads us back to the ppc pointer chasing performance problem that started all of this. Note that nothing in the JSR-133 cookbook allows for IRIW, even on x86 and sparc. The key feature needed for IRIW is a load barrier that forces global memory synchronization to ensure that all processors see writes at the same time. I'm not even sure we can force that on x86 and sparc! Such a load barrier negates the need for some store barriers as defined in the cookbook. My understanding, which could be wrong, is that the JMM implies linearizability of volatile accesses, which in turn provides the IRIW property. It is also my understanding that linearizability is a necessary property for current proof systems to be applicable. However absence of proof is not proof of absence, and it doesn't follow that code that doesn't rely on IRIW is incorrect if IRIW is not ensured on a system. As has been stated many times now, in the literature no practical lock-free algorithm seems to rely on IRIW. So I still hope that IRIW can somehow be removed because implementing it will impact everything related to the JMM in hotspot.
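[Editor's note] The IRIW (Independent Reads of Independent Writes) litmus test under discussion can be written down directly in Java. With x and y volatile, the JMM's sequential consistency for volatiles forbids the outcome r1==1, r2==0, r3==1, r4==0 - the two reader threads disagreeing on the order of the two writes. The harness below is a sketch with invented names; a single run only samples one interleaving, but the forbidden outcome may never appear.

```java
// IRIW litmus test: two writers to independent locations, two readers
// reading them in opposite orders. Volatile semantics forbid the readers
// from observing the two writes in contradictory orders.
class IRIW {
    static volatile int x, y;
    static int r1, r2, r3, r4;

    static void run() {
        Thread w1 = new Thread(() -> x = 1);
        Thread w2 = new Thread(() -> y = 1);
        Thread rA = new Thread(() -> { r1 = x; r2 = y; });
        Thread rB = new Thread(() -> { r3 = y; r4 = x; });
        w1.start(); w2.start(); rA.start(); rB.start();
        try { w1.join(); w2.join(); rA.join(); rB.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
    }
}
```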
RE: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Martin, The paper you cite is about ARM and Power architectures - why do you think the lack of mention of x86/sparc implies those architectures are multiple-copy-atomic? David -Original Message- From: Martin Buchholz [mailto:marti...@google.com] Sent: Monday, 8 December 2014 6:42 PM To: David Holmes Cc: David Holmes; Vladimir Kozlov; core-libs-dev; concurrency-interest Subject: Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics On Sun, Dec 7, 2014 at 2:58 PM, David Holmes david.hol...@oracle.com wrote: I believe the comment _does_ reflect hotspot's current implementation (entirely from exploring the sources). I believe it's correct to say all of the platforms are multiple-copy-atomic except PPC. ... current hotspot sources don't contain ARM support. Here is the definition of multi-copy atomicity from the ARM architecture manual: In a multiprocessing system, writes to a memory location are multi-copy atomic if the following conditions are both true: • All writes to the same location are serialized, meaning they are observed in the same order by all observers, although some observers might not observe all of the writes. • A read of a location does not return the value of a write until all observers observe that write. The hotspot sources give

// To assure the IRIW property on processors that are not multiple copy
// atomic, sync instructions must be issued between volatile reads to
// assure their ordering, instead of after volatile stores.
// (See A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
// by Luc Maranget, Susmit Sarkar and Peter Sewell, INRIA/Cambridge)
#ifdef CPU_NOT_MULTIPLE_COPY_ATOMIC
const bool support_IRIW_for_not_multiple_copy_atomic_cpu = true;

and the referenced paper gives on POWER and ARM, two threads can observe writes to different locations in different orders, even in the absence of any thread-local reordering.
In other words, the architectures are not multiple-copy atomic [Col92]. which strongly suggests that x86 and sparc are OK. The first condition is met by Total-Store-Order (TSO) systems like x86 and sparc; and not by relaxed-memory-order (RMO) systems like ARM and PPC. However the second condition is not met simply by having TSO. If the local processor can see a write from the local store buffer prior to it being visible to other processors, then we do not have multi-copy atomicity and I believe that is true for x86 and sparc. Hence none of our supported platforms are multi-copy-atomic as far as I can see. I believe hotspot must implement IRIW correctly to fulfil the promise of sequential consistency for standard Java, so on ppc volatile reads get a full fence, which leads us back to the ppc pointer chasing performance problem that started all of this. Note that nothing in the JSR-133 cookbook allows for IRIW, even on x86 and sparc. The key feature needed for IRIW is a load barrier that forces global memory synchronization to ensure that all processors see writes at the same time. I'm not even sure we can force that on x86 and sparc! Such a load barrier negates the need for some store barriers as defined in the cookbook. My understanding, which could be wrong, is that the JMM implies linearizability of volatile accesses, which in turn provides the IRIW property. It is also my understanding that linearizability is a necessary property for current proof systems to be applicable. However absence of proof is not proof of absence, and it doesn't follow that code that doesn't rely on IRIW is incorrect if IRIW is not ensured on a system. As has been stated many times now, in the literature no practical lock-free algorithm seems to rely on IRIW. So I still hope that IRIW can somehow be removed because implementing it will impact everything related to the JMM in hotspot.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Mon, Dec 8, 2014 at 12:46 AM, David Holmes davidchol...@aapt.net.au wrote: Martin, The paper you cite is about ARM and Power architectures - why do you think the lack of mention of x86/sparc implies those architectures are multiple-copy-atomic? Reading some more in the same paper, I see: Returning to the two properties above, in TSO a thread can see its own writes before they become visible to other threads (by reading them from its write buffer), but any write becomes visible to all other threads simultaneously: TSO is a multiple-copy atomic model, in the terminology of Collier [Col92]. One can also see the possibility of reading from the local write buffer as allowing a specific kind of local reordering. A program that writes one location x then reads another location y might execute by adding the write to x to the thread’s buffer, then reading y from memory, before finally making the write to x visible to other threads by flushing it from the buffer. In this case the thread reads the value of y that was in the memory before the new write of x hits memory. So (as you say) with TSO you don't have a total order of stores if you read your own writes out of your own CPU's write buffer. However, my interpretation of multiple-copy atomic is that the initial publishing thread can choose to use an instruction with sufficiently strong memory barrier attached (e.g. LOCK;XXX on x86) to write to memory so that the write buffer is flushed and then use plain relaxed loads everywhere else to read those memory locations and this explains the situation on x86 and sparc where volatile writes are expensive and volatile reads are free and you get sequential consistency for Java volatiles. http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
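[Editor's note] The write-buffer behavior quoted above is the classic store-buffering (SB) litmus test: each thread stores to one location, then loads the other, and without fences both loads may return the stale value even on TSO, precisely because each thread can read its own write from its buffer before it drains. A full fence (mfence on x86) between the store and the load rules that out. A sketch with invented names, using VarHandle.fullFence():

```java
import java.lang.invoke.VarHandle;

// Store-buffering litmus test. Without the full fences, r1 == 0 && r2 == 0
// is allowed on TSO hardware: each CPU loads the other location from memory
// before its own store has left the write buffer. With fullFence() in both
// threads, that outcome is forbidden.
class StoreBuffering {
    static int x, y;
    static int r1, r2;

    static void run() {
        Thread t1 = new Thread(() -> { x = 1; VarHandle.fullFence(); r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; VarHandle.fullFence(); r2 = x; });
        t1.start(); t2.start();
        try { t1.join(); t2.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
    }
}
```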
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Webrev updated to remove the comparison with volatile loads and stores. On Sun, Dec 7, 2014 at 2:40 PM, David Holmes david.hol...@oracle.com wrote: On 6/12/2014 7:49 AM, Martin Buchholz wrote: On Thu, Dec 4, 2014 at 5:55 PM, David Holmes david.hol...@oracle.com wrote: In general phrasing like: also known as a LoadLoad plus LoadStore barrier ... is misleading to me as these are not aliases- the loadFence (in this case) is being specified to have the same semantics as the loadload|storeload. It should say corresponds to a LoadLoad plus LoadStore barrier + * Ensures that loads before the fence will not be reordered with loads and + * stores after the fence; also known as a LoadLoad plus LoadStore barrier, I don't understand this. I believe they _are_ aliases. The first clause perfectly describes a LoadLoad plus LoadStore barrier. I find the language use inappropriate - you are defining the first to be the second. Am I missing something? Is there something else that LoadLoad plus LoadStore barrier (as used in hotspot sources and elsewhere) could possibly mean? - as per the Corresponds to a C11 And referring to things like load-acquire fence is meaningless without some reference to a definition - who defines a load-acquire fence? Is there a universal definition? I would be okay with something looser eg: Well, I'm defining load-acquire fence here in the javadoc - I'm claiming that loadFence is also known via other terminology, including load-acquire fence. Although it's true that load-acquire fence is also used to refer to the corresponding C11 fence, which has subtly different semantics. When you say also known as XXX it means that XXX is already defined elsewhere. Unless there is a generally accepted definition of XXX then this doesn't add much value. Everything about this topic is confusing, but I continue to think that load-acquire fence is a common industry term. 
One of the reasons I want to include them in the javadoc is precisely because these different terms are generally used synonymously.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 9/12/2014 5:25 AM, Martin Buchholz wrote: Webrev updated to remove the comparison with volatile loads and stores. Thanks. More inline ... On Sun, Dec 7, 2014 at 2:40 PM, David Holmes david.hol...@oracle.com wrote: On 6/12/2014 7:49 AM, Martin Buchholz wrote: On Thu, Dec 4, 2014 at 5:55 PM, David Holmes david.hol...@oracle.com wrote: In general phrasing like: also known as a LoadLoad plus LoadStore barrier ... is misleading to me as these are not aliases- the loadFence (in this case) is being specified to have the same semantics as the loadload|storeload. It should say corresponds to a LoadLoad plus LoadStore barrier + * Ensures that loads before the fence will not be reordered with loads and + * stores after the fence; also known as a LoadLoad plus LoadStore barrier, I don't understand this. I believe they _are_ aliases. The first clause perfectly describes a LoadLoad plus LoadStore barrier. I find the language use inappropriate - you are defining the first to be the second. Am I missing something? Is there something else that LoadLoad plus LoadStore barrier (as used in hotspot sources and elsewhere) could possibly mean? I think we are getting our wires crossed here. My issue is with the wording: also known as a LoadLoad plus LoadStore barrier because we are explicitly choosing to specify the operation as being a LoadLoad plus LoadStore barrier. Hence I would much prefer it to say that this corresponds to a LoadLoad plus LoadStore barrier. - as per the Corresponds to a C11 And referring to things like load-acquire fence is meaningless without some reference to a definition - who defines a load-acquire fence? Is there a universal definition? I would be okay with something looser eg: Well, I'm defining load-acquire fence here in the javadoc - I'm claiming that loadFence is also known via other terminology, including load-acquire fence. 
Although it's true that load-acquire fence is also used to refer to the corresponding C11 fence, which has subtly different semantics. When you say also known as XXX it means that XXX is already defined elsewhere. Unless there is a generally accepted definition of XXX then this doesn't add much value. Everything about this topic is confusing, but I continue to think that load-acquire fence is a common industry term. Then please point me to the common industry definition of it because I couldn't find anything definitive. And as you state yourself above one definition of it - the corresponding C11 fence - does not in fact have the same semantics! One of the reasons I want to include them in the javadoc is precisely because these different terms are generally used synonymously. I would concede sometimes referred to as ... as that doesn't imply a single standard agreed upon definition, while still making the link. Though even then there is a danger of making false assumptions about the equivalence of this fence and those going by the name load-acquire fence. David
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 6/12/2014 7:49 AM, Martin Buchholz wrote: On Thu, Dec 4, 2014 at 5:55 PM, David Holmes david.hol...@oracle.com wrote: In general phrasing like: also known as a LoadLoad plus LoadStore barrier ... is misleading to me as these are not aliases- the loadFence (in this case) is being specified to have the same semantics as the loadload|storeload. It should say corresponds to a LoadLoad plus LoadStore barrier + * Ensures that loads before the fence will not be reordered with loads and + * stores after the fence; also known as a LoadLoad plus LoadStore barrier, I don't understand this. I believe they _are_ aliases. The first clause perfectly describes a LoadLoad plus LoadStore barrier. I find the language use inappropriate - you are defining the first to be the second. - as per the Corresponds to a C11 And referring to things like load-acquire fence is meaningless without some reference to a definition - who defines a load-acquire fence? Is there a universal definition? I would be okay with something looser eg: Well, I'm defining load-acquire fence here in the javadoc - I'm claiming that loadFence is also known via other terminology, including load-acquire fence. Although it's true that load-acquire fence is also used to refer to the corresponding C11 fence, which has subtly different semantics. When you say also known as XXX it means that XXX is already defined elsewhere. Unless there is a generally accepted definition of XXX then this doesn't add much value. /** * Ensures that loads before the fence will not be reordered with loads and * stores after the fence. Corresponds to a LoadLoad plus LoadStore barrier, * and also to the C11 atomic_thread_fence(memory_order_acquire). * Sometimes referred to as a load-acquire fence. * Also I find this comment strange: ! * A pure StoreStore fence is not provided, since the addition of LoadStore ! * is almost always desired, and most current hardware instructions that ! 
* provide a StoreStore barrier also provide a LoadStore barrier for free. because inside hotspot we use storeStore barriers a lot, without any loadStore at the same point. I believe the use of e.g. OrderAccess::storestore in the hotspot sources is unfortunate. I don't! Not at all! The actual implementations of storestore (see below) seem to universally give you the stronger ::release barrier, Don't conflate hardware barriers and compiler barriers. On TSO systems storestore() is a no-op for the hardware but a compiler-barrier is still required. The compiler barrier is stronger than storestore but that's because we don't have fine-grained compiler barriers. As I've said previously there is an open bug to clean up the orderAccess definitions because semantically it is very misleading to define things like storestore() as release(). However if you look at the implementation of things we are not actually adding any additional semantics to the storestore(). E.g. for x86 (after a recent change to clean up the compiler-barrier): inline void OrderAccess::release() { compiler_barrier(); } So storestore() is also just a compiler barrier, though I'd prefer to see it expressed as: inline void OrderAccess::storestore() { compiler_barrier(); } rather than the misleading (and very wrong on non-TSO): inline void OrderAccess::storestore() { release(); } And of course the non-TSO platforms use the barrier necessary on that platform. and it seems likely that hotspot engineers are implicitly relying on that, that some uses of ::storestore in the hotspot sources are bugs (should be ::release instead) That seems a rather baseless speculation on your part. Having been involved in a lot of discussions involving memory barrier usage in various algorithms in different areas of hotspot I can assure you that the use of storestore() has been quite deliberate and does not assume an implicit loadstore(). Also as I've said many many times now release() as a stand-alone operation is meaningless.
David - and that there is very little potential performance benefit from using ::storestore instead of ::release, precisely because the additional loadstore barrier is very close to free on all current hardware. Writing correct code using ::storestore is harder than ::release, which is already difficult enough. C11 doesn't provide a corresponding fence, which is a strong hint. ./bsd_zero/vm/orderAccess_bsd_zero.inline.hpp:71:inline void OrderAccess::storestore() { release(); } ./linux_sparc/vm/orderAccess_linux_sparc.inline.hpp:35:inline void OrderAccess::storestore() { release(); } ./aix_ppc/vm/orderAccess_aix_ppc.inline.hpp:73:inline void OrderAccess::storestore() { inlasm_lwsync(); } ./linux_zero/vm/orderAccess_linux_zero.inline.hpp:70:inline void OrderAccess::storestore() { release(); } ./solaris_sparc/vm/orderAccess_solaris_sparc.inline.hpp:40:inline void OrderAccess::storestore() { release(); }
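The publication pattern the storestore()-vs-release() debate is about can be sketched in Java using the VarHandle fence methods (JDK 9+, which postdate this thread; class and field names below are mine). The writer orders its data store before the flag store with a StoreStore-only fence, mirroring hotspot's storestore(); the open question above is whether the LoadStore half (VarHandle.releaseFence()) is also wanted. The flag here is volatile, so the JMM guarantees the result either way and the fences merely mark where the hotspot barriers would sit.

```java
import java.lang.invoke.VarHandle;

// Safe publication: data store ordered before flag store.
public class StoreStorePublish {
    static int data;               // plain data being published
    static volatile int ready;     // publication flag

    public static void main(String[] args) throws Exception {
        Thread writer = new Thread(() -> {
            data = 42;
            VarHandle.storeStoreFence(); // where hotspot places storestore()
            ready = 1;
        });
        Thread reader = new Thread(() -> {
            while (ready == 0) Thread.onSpinWait();
            VarHandle.loadLoadFence();   // reader-side counterpart
            if (data != 42) throw new AssertionError("saw unpublished data");
        });
        writer.start(); reader.start();
        writer.join(); reader.join();
        System.out.println("published: " + data);
    }
}
```

Notably, the public VarHandle API ended up exposing both storeStoreFence() and releaseFence() as distinct methods, i.e. the distinction argued over in this thread survived into the eventual API.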
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 6/12/2014 7:29 AM, Martin Buchholz wrote: On Thu, Dec 4, 2014 at 5:36 PM, David Holmes david.hol...@oracle.com wrote: Martin, On 2/12/2014 6:46 AM, Martin Buchholz wrote: Is this finalized then? You can only make one commit per CR. Right. I'd like to commit and then perhaps do another round of clarifications. I still find this entire comment block to be misguided and misplaced: ! // Fences, also known as memory barriers, or membars. ! // See hotspot sources for more details: ! // orderAccess.hpp memnode.hpp unsafe.cpp ! // ! // One way of implementing Java language-level volatile variables using ! // fences (but there is often a better way without) is by: ! // translating a volatile store into the sequence: ! // - storeFence() ! // - relaxed store ! // - fullFence() ! // and translating a volatile load into the sequence: ! // - if (CPU_NOT_MULTIPLE_COPY_ATOMIC) fullFence() ! // - relaxed load ! // - loadFence() ! // The full fence on volatile stores ensures the memory model guarantee of ! // sequential consistency on most platforms. On some platforms (ppc) we ! // need an additional full fence between volatile loads as well (see ! // hotspot's CPU_NOT_MULTIPLE_COPY_ATOMIC). Even I think this comment is marginal - I will delete it. But consider this a plea for better documentation of the hotspot internals. Okay, but Unsafe.java is not the place to document anything about hotspot. Why do you want this description here - it has no relevance to the API itself, nor to how volatiles are implemented in the VM. And as I said in the bug report CPU_NOT_MULTIPLE_COPY_ATOMIC exists only for platforms that want to implement IRIW (none of our platforms are multiple-copy-atomic, but only PPC sets this so that it employs IRIW). I believe the comment _does_ reflect hotspot's current implementation (entirely from exploring the sources). I believe it's correct to say all of the platforms are multiple-copy-atomic except PPC. 
Here is the definition of multi-copy atomicity from the ARM architecture manual: In a multiprocessing system, writes to a memory location are multi-copy atomic if the following conditions are both true: • All writes to the same location are serialized, meaning they are observed in the same order by all observers, although some observers might not observe all of the writes. • A read of a location does not return the value of a write until all observers observe that write. The first condition is met by Total-Store-Order (TSO) systems like x86 and sparc; and not by relaxed-memory-order (RMO) systems like ARM and PPC. However the second condition is not met simply by having TSO. If the local processor can see a write from the local store buffer prior to it being visible to other processors, then we do not have multi-copy atomicity and I believe that is true for x86 and sparc. Hence none of our supported platforms are multi-copy-atomic as far as I can see. I believe hotspot must implement IRIW correctly to fulfil the promise of sequential consistency for standard Java, so on ppc volatile reads get a full fence, which leads us back to the ppc pointer chasing performance problem that started all of this. Note that nothing in the JSR-133 cookbook allows for IRIW, even on x86 and sparc. The key feature needed for IRIW is a load barrier that forces global memory synchronization to ensure that all processors see writes at the same time. I'm not even sure we can force that on x86 and sparc! Such a load barrier negates the need for some store barriers as defined in the cookbook. My understanding, which could be wrong, is that the JMM implies linearizability of volatile accesses, which in turn provides the IRIW property. It is also my understanding that linearizability is a necessary property for current proof systems to be applicable. 
However absence of proof is not proof of absence, and it doesn't follow that code that doesn't rely on IRIW is incorrect if IRIW is not ensured on a system. As has been stated many times now, in the literature no practical lock-free algorithm seems to rely on IRIW. So I still hope that IRIW can somehow be removed because implementing it will impact everything related to the JMM in hotspot. David -
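The IRIW (independent reads of independent writes) test argued about above can be written as a Java harness (this one is mine, not from the thread): two writers store to unrelated volatiles, and two readers read them in opposite orders. Sequential consistency for volatiles forbids the readers from disagreeing about which write happened first, which is exactly the property PPC needs the extra full fence to provide.

```java
// IRIW litmus test: with volatiles, the two readers may never observe the
// two independent writes in contradictory orders.
public class IriwLitmus {
    static volatile int x, y;

    public static void main(String[] args) throws Exception {
        int forbidden = 0;
        for (int i = 0; i < 1_000; i++) {
            x = 0;
            y = 0;
            int[] r = new int[4];
            Thread[] ts = {
                new Thread(() -> x = 1),
                new Thread(() -> y = 1),
                new Thread(() -> { r[0] = x; r[1] = y; }),
                new Thread(() -> { r[2] = y; r[3] = x; }),
            };
            for (Thread t : ts) t.start();
            for (Thread t : ts) t.join();
            // Reader 1 saw x's write before y's; reader 2 saw y's before x's.
            if (r[0] == 1 && r[1] == 0 && r[2] == 1 && r[3] == 0) forbidden++;
        }
        if (forbidden != 0)
            throw new AssertionError("IRIW violated");
        System.out.println("IRIW violations: " + forbidden);
    }
}
```

A handful of trials like this cannot prove the property, of course; tools such as jcstress hammer shapes like this far harder. The harness only illustrates the outcome the thread is debating.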
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Thu, Dec 4, 2014 at 5:36 PM, David Holmes david.hol...@oracle.com wrote: Martin, On 2/12/2014 6:46 AM, Martin Buchholz wrote: Is this finalized then? You can only make one commit per CR. Right. I'd like to commit and then perhaps do another round of clarifications. I still find this entire comment block to be misguided and misplaced: ! // Fences, also known as memory barriers, or membars. ! // See hotspot sources for more details: ! // orderAccess.hpp memnode.hpp unsafe.cpp ! // ! // One way of implementing Java language-level volatile variables using ! // fences (but there is often a better way without) is by: ! // translating a volatile store into the sequence: ! // - storeFence() ! // - relaxed store ! // - fullFence() ! // and translating a volatile load into the sequence: ! // - if (CPU_NOT_MULTIPLE_COPY_ATOMIC) fullFence() ! // - relaxed load ! // - loadFence() ! // The full fence on volatile stores ensures the memory model guarantee of ! // sequential consistency on most platforms. On some platforms (ppc) we ! // need an additional full fence between volatile loads as well (see ! // hotspot's CPU_NOT_MULTIPLE_COPY_ATOMIC). Even I think this comment is marginal - I will delete it. But consider this a plea for better documentation of the hotspot internals. Why do you want this description here - it has no relevance to the API itself, nor to how volatiles are implemented in the VM. And as I said in the bug report CPU_NOT_MULTIPLE_COPY_ATOMIC exists only for platforms that want to implement IRIW (none of our platforms are multiple-copy-atomic, but only PPC sets this so that it employs IRIW). I believe the comment _does_ reflect hotspot's current implementation (entirely from exploring the sources). I believe it's correct to say all of the platforms are multiple-copy-atomic except PPC. 
I believe hotspot must implement IRIW correctly to fulfil the promise of sequential consistency for standard Java, so on ppc volatile reads get a full fence, which leads us back to the ppc pointer chasing performance problem that started all of this.
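The fence recipe quoted in the comment block above can be restated as runnable Java using the JDK 9+ VarHandle fences in place of Unsafe's (this restatement is mine; the class name is hypothetical). Note this is only a sketch of the machine-level recipe: a plain field plus fences is not a JMM-correct replacement for a volatile field, since the compiler may still optimize the plain accesses.

```java
import java.lang.invoke.VarHandle;

// Emulating the quoted volatile-translation recipe with explicit fences.
public class VolatileViaFences {
    // true on ppc per the thread's CPU_NOT_MULTIPLE_COPY_ATOMIC discussion
    static final boolean CPU_NOT_MULTIPLE_COPY_ATOMIC = false;
    static int v;   // plain field standing in for a volatile

    static void volatileStore(int value) {
        VarHandle.releaseFence();   // the recipe's storeFence()
        v = value;                  // relaxed store
        VarHandle.fullFence();      // restores sequential consistency
    }

    static int volatileLoad() {
        if (CPU_NOT_MULTIPLE_COPY_ATOMIC) VarHandle.fullFence();
        int value = v;              // relaxed load
        VarHandle.acquireFence();   // the recipe's loadFence()
        return value;
    }

    public static void main(String[] args) {
        volatileStore(7);
        System.out.println("loaded: " + volatileLoad());
    }
}
```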
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Thu, Dec 4, 2014 at 5:55 PM, David Holmes david.hol...@oracle.com wrote: In general phrasing like: also known as a LoadLoad plus LoadStore barrier ... is misleading to me as these are not aliases- the loadFence (in this case) is being specified to have the same semantics as the loadload|storeload. It should say corresponds to a LoadLoad plus LoadStore barrier + * Ensures that loads before the fence will not be reordered with loads and + * stores after the fence; also known as a LoadLoad plus LoadStore barrier, I don't understand this. I believe they _are_ aliases. The first clause perfectly describes a LoadLoad plus LoadStore barrier. - as per the Corresponds to a C11 And referring to things like load-acquire fence is meaningless without some reference to a definition - who defines a load-acquire fence? Is there a universal definition? I would be okay with something looser eg: Well, I'm defining load-acquire fence here in the javadoc - I'm claiming that loadFence is also known via other terminology, including load-acquire fence. Although it's true that load-acquire fence is also used to refer to the corresponding C11 fence, which has subtly different semantics. /** * Ensures that loads before the fence will not be reordered with loads and * stores after the fence. Corresponds to a LoadLoad plus LoadStore barrier, * and also to the C11 atomic_thread_fence(memory_order_acquire). * Sometimes referred to as a load-acquire fence. * Also I find this comment strange: ! * A pure StoreStore fence is not provided, since the addition of LoadStore ! * is almost always desired, and most current hardware instructions that ! * provide a StoreStore barrier also provide a LoadStore barrier for free. because inside hotspot we use storeStore barriers a lot, without any loadStore at the same point. I believe the use of e.g. OrderAccess::storestore in the hotspot sources is unfortunate. 
The actual implementations of storestore (see below) seem to universally give you the stronger ::release barrier, and it seems likely that hotspot engineers are implicitly relying on that, that some uses of ::storestore in the hotspot sources are bugs (should be ::release instead) and that there is very little potential performance benefit from using ::storestore instead of ::release, precisely because the additional loadstore barrier is very close to free on all current hardware. Writing correct code using ::storestore is harder than ::release, which is already difficult enough. C11 doesn't provide a corresponding fence, which is a strong hint. ./bsd_zero/vm/orderAccess_bsd_zero.inline.hpp:71:inline void OrderAccess::storestore() { release(); } ./linux_sparc/vm/orderAccess_linux_sparc.inline.hpp:35:inline void OrderAccess::storestore() { release(); } ./aix_ppc/vm/orderAccess_aix_ppc.inline.hpp:73:inline void OrderAccess::storestore() { inlasm_lwsync(); } ./linux_zero/vm/orderAccess_linux_zero.inline.hpp:70:inline void OrderAccess::storestore() { release(); } ./solaris_sparc/vm/orderAccess_solaris_sparc.inline.hpp:40:inline void OrderAccess::storestore() { release(); } ./linux_ppc/vm/orderAccess_linux_ppc.inline.hpp:75:inline void OrderAccess::storestore() { inlasm_lwsync(); } ./solaris_x86/vm/orderAccess_solaris_x86.inline.hpp:40:inline void OrderAccess::storestore() { release(); } ./linux_x86/vm/orderAccess_linux_x86.inline.hpp:35:inline void OrderAccess::storestore() { release(); } ./bsd_x86/vm/orderAccess_bsd_x86.inline.hpp:35:inline void OrderAccess::storestore() { release(); } ./windows_x86/vm/orderAccess_windows_x86.inline.hpp:35:inline void OrderAccess::storestore() { release(); }
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Martin, On 2/12/2014 6:46 AM, Martin Buchholz wrote: David, Paul (i.e. Reviewers) and Doug, I'd like to commit corrections so we make progress. Is this finalized then? You can only make one commit per CR. I think the current webrev is simple progress with the exception of my attempt to translate volatiles into fences, which is marginal (but was a good learning exercise for me). I still find this entire comment block to be misguided and misplaced: ! // Fences, also known as memory barriers, or membars. ! // See hotspot sources for more details: ! // orderAccess.hpp memnode.hpp unsafe.cpp ! // ! // One way of implementing Java language-level volatile variables using ! // fences (but there is often a better way without) is by: ! // translating a volatile store into the sequence: ! // - storeFence() ! // - relaxed store ! // - fullFence() ! // and translating a volatile load into the sequence: ! // - if (CPU_NOT_MULTIPLE_COPY_ATOMIC) fullFence() ! // - relaxed load ! // - loadFence() ! // The full fence on volatile stores ensures the memory model guarantee of ! // sequential consistency on most platforms. On some platforms (ppc) we ! // need an additional full fence between volatile loads as well (see ! // hotspot's CPU_NOT_MULTIPLE_COPY_ATOMIC). Why do you want this description here - it has no relevance to the API itself, nor to how volatiles are implemented in the VM. And as I said in the bug report CPU_NOT_MULTIPLE_COPY_ATOMIC exists only for platforms that want to implement IRIW (none of our platforms are multiple-copy-atomic, but only PPC sets this so that it employs IRIW). David
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 2/12/2014 6:46 AM, Martin Buchholz wrote: David, Paul (i.e. Reviewers) and Doug, I'd like to commit corrections so we make progress. I think the current webrev is simple progress with the exception of my attempt to translate volatiles into fences, which is marginal (but was a good learning exercise for me). Looking at the actual API changes ... In general phrasing like: also known as a LoadLoad plus LoadStore barrier ... is misleading to me as these are not aliases- the loadFence (in this case) is being specified to have the same semantics as the loadload|storeload. It should say corresponds to a LoadLoad plus LoadStore barrier - as per the Corresponds to a C11 And referring to things like load-acquire fence is meaningless without some reference to a definition - who defines a load-acquire fence? Is there a universal definition? I would be okay with something looser eg: /** * Ensures that loads before the fence will not be reordered with loads and * stores after the fence. Corresponds to a LoadLoad plus LoadStore barrier, * and also to the C11 atomic_thread_fence(memory_order_acquire). * Sometimes referred to as a load-acquire fence. * Also I find this comment strange: ! * A pure StoreStore fence is not provided, since the addition of LoadStore ! * is almost always desired, and most current hardware instructions that ! * provide a StoreStore barrier also provide a LoadStore barrier for free. because inside hotspot we use storeStore barriers a lot, without any loadStore at the same point. David
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Dec 2, 2014, at 1:58 AM, Doug Lea d...@cs.oswego.edu wrote: On 12/01/2014 03:46 PM, Martin Buchholz wrote: David, Paul (i.e. Reviewers) and Doug, I'd like to commit corrections so we make progress. The current one looks OK to me. (http://cr.openjdk.java.net/~martin/webrevs/openjdk9/fence-intrinsics/) Same here, looks ok. I anticipate we will be revisiting this area with the enhanced volatiles [1] work and related JMM updates, where there will be a public API for low-level enhanced field/array access [2]. As you rightly observed Unsafe does not currently have a get/read-acquire method. Implementations of [2] currently emulate that with a relaxed read + Unsafe.loadFence. It's something we need to add. Paul. [1] http://openjdk.java.net/jeps/193 [2] http://hg.openjdk.java.net/valhalla/valhalla/jdk/file/2d4531473a89/src/java.base/share/classes/java/lang/invoke/VarHandle.java
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Tuesday, 25 November 2014, 11:15:36, Hans Boehm wrote: I'm no hardware architect, but fundamentally it seems to me that "load x; acquire_fence" imposes a much more stringent constraint than "load_acquire x". Consider the case in which the load from x is an L1 hit, but a preceding load (from say y) is a long-latency miss. If we enforce ordering by just waiting for completion of prior operations, the former has to wait for the load from y to complete; while the latter doesn't. I find it hard to believe that this doesn't leave an appreciable amount of performance on the table, at least for some interesting microarchitectures. I agree, Hans, that this is a reasonable assumption. Load_acquire x does allow roach motel, whereas the acquire fence does not. In addition, for better or worse, fencing requirements on at least Power are actually driven as much by store atomicity issues, as by the ordering issues discussed in the cookbook. This was not understood in 2005, and unfortunately doesn't seem to be amenable to the kind of straightforward explanation as in Doug's cookbook. Coming from a strongly ordered architecture to a weakly ordered one myself, I also needed some mental adjustment about store (multi-copy) atomicity. I can imagine others will be unaware of this difference, too, even in 2014. Stephan
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
From time to time I see comments in the JVM sources referencing membars and fences. Would you say that they are used interchangeably? Having the same meaning but for different CPU arch. Sent from my iPhone On Nov 25, 2014, at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote: Hi Martin, Thanks for looking into this. 1141 * Currently hotspot's implementation of a Java language-level volatile 1142 * store has the same effect as a storeFence followed by a relaxed store, 1143 * although that may be a little stronger than needed. IIUC to emulate hotspot's volatile store you will need to say that a fullFence immediately follows the relaxed store. The bit that always confuses me about release and acquire is that ordering is restricted to one direction, as talked about in orderAccess.hpp [1]. So for a release, accesses prior to the release cannot move below it, but accesses succeeding the release can move above it. And that seems to apply to Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++ release fences where ordering is restricted both to prior and succeeding accesses? [3] So what about the following? a = r1; // Cannot move below the fence Unsafe.storeFence(); b = r2; // Can move above the fence? Paul. [1] In orderAccess.hpp // Execution by a processor of release makes the effect of all memory // accesses issued by it previous to the release visible to all // processors *before* the release completes. The effect of subsequent // memory accesses issued by it *may* be made visible *before* the // release. I.e., subsequent memory accesses may float above the // release, but prior ones may not float below it. [2] In memnode.hpp // Release - no earlier ref can move after (but later refs can move // up, like a speculative pipelined cache-hitting Load). Requires // multi-cpu visibility. Inserted independent of any store, as required // for intrinsic sun.misc.Unsafe.storeFence(). 
class StoreFenceNode: public MemBarNode { public: StoreFenceNode(Compile* C, int alias_idx, Node* precedent) : MemBarNode(C, alias_idx, precedent) {} virtual int Opcode() const; }; [3] http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ On Nov 25, 2014, at 1:47 AM, Martin Buchholz marti...@google.com wrote: OK, I worked in some wording for comparison with volatiles. I believe you when you say that the semantics of the corresponding C++ fences are slightly different, but it's rather subtle - can we say anything more than closely related to? On Mon, Nov 24, 2014 at 1:29 PM, Aleksey Shipilev aleksey.shipi...@oracle.com wrote: Hi Martin, On 11/24/2014 11:56 PM, Martin Buchholz wrote: Review carefully - I am trying to learn about fences by explaining them! I have borrowed some wording from my reviewers! https://bugs.openjdk.java.net/browse/JDK-8065804 http://cr.openjdk.java.net/~martin/webrevs/openjdk9/fence-intrinsics/ I think implies the effect of C++11 is too strong wording. related might be more appropriate. See also comments here for connection with volatiles: https://bugs.openjdk.java.net/browse/JDK-8038978 Take note the Hans' correction that fences generally imply more than volatile load/store, but since you are listing the related things in the docs, I think the native Java example is good to have. -Aleksey. ___ Concurrency-interest mailing list concurrency-inter...@cs.oswego.edu http://cs.oswego.edu/mailman/listinfo/concurrency-interest ___ Concurrency-interest mailing list concurrency-inter...@cs.oswego.edu http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
It would be nice to have a release fence with an artificial dependency to define a set of actual release stores, without constraining other subsequent stores (and the order of release stores with respect to each other), e.g.: // set multiple flags each indicating 'release' without imposing // ordering on 'release' stores with respect to each other and not // constraining other subsequent stores . . . if (atomic_thread_fence(memory_order_release)) { flag1.store(READY, memory_order_relaxed); flag2.store(READY, memory_order_relaxed); } regards, alexander. Hans Boehm bo...@acm.org@cs.oswego.edu on 29.11.2014 05:56:04 Sent by: concurrency-interest-boun...@cs.oswego.edu To: Peter Levart peter.lev...@gmail.com cc: Vladimir Kozlov vladimir.koz...@oracle.com, concurrency-interest concurrency-inter...@cs.oswego.edu, Martin Buchholz marti...@google.com, core-libs-dev core-libs-dev@openjdk.java.net, dhol...@ieee.org Subject: Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics I basically agree with David's observation. However the C++ atomic_thread_fence(memory_order_acquire) actually has somewhat different semantics from load(memory_order_acquire). It basically ensures that prior atomic loads L are not reordered with later (i.e. following the fence in program order) loads and stores, making it something like a LoadLoad|LoadStore fence. Thus the fence orders two sets of operations where the acquire load orders a single operation with respect to a set. This makes the fence versions of memory_order_acquire and memory_order_release meaningful, but somewhat different from the non-fence versions. The terminology is probably not great, but that seems to be the most common usage now. 
On Wed, Nov 26, 2014 at 11:48 PM, Peter Levart peter.lev...@gmail.com wrote: On 11/27/2014 04:00 AM, David Holmes wrote: Can I make an observation about acquire() and release() - to me they are meaningless when considered in isolation. Given their definitions they allow anything to move into a region bounded by acquire() and release (), then you can effectively move the whole program into the region and thus the acquire() and release() do not constrain any reorderings. acquire() and release() only make sense when their own movement is constrained with respect to something else - such as lock acquisition/release, or when combined with specific load/store actions. ...or another acquire/release region? Regards, Peter David Martin Buchholz writes: On Tue, Nov 25, 2014 at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote: Hi Martin, Thanks for looking into this. 1141 * Currently hotspot's implementation of a Java language-level volatile 1142 * store has the same effect as a storeFence followed by a relaxed store, 1143 * although that may be a little stronger than needed. IIUC to emulate hotpot's volatile store you will need to say that a fullFence immediately follows the relaxed store. Right - I've been groking that. The bit that always confuses me about release and acquire is ordering is restricted to one direction, as talked about in orderAccess.hpp [1]. So for a release, accesses prior to the release cannot move below it, but accesses succeeding the release can move above it. And that seems to apply to Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++ release fences where ordering is restricted both to prior and succeeding accesses? [3] So what about the following? a = r1; // Cannot move below the fence Unsafe.storeFence(); b = r2; // Can move above the fence? I think the hotspot docs need to be more precise about when they're talking about movement of stores and when about loads. // release. I.e., subsequent memory accesses may float above
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Hans, (Thanks for your excellent work on C/C++11 and your eternal patience.)

On Tue, Nov 25, 2014 at 11:15 AM, Hans Boehm bo...@acm.org wrote: It seems to me that a (dubiously named) loadFence is intended to have essentially the same semantics as the (perhaps slightly less dubiously named) C++ atomic_thread_fence(memory_order_acquire), and a storeFence matches atomic_thread_fence(memory_order_release). The C++ standard and, even more so, Mark Batty's work have a precise definition of what those mean in terms of implied synchronizes-with relationships. It looks to me like this whole implementation model for volatiles in terms of fences is fundamentally doomed, and it probably makes more sense to get rid of it rather than spending time on renaming it (though we just did the latter in Android to avoid similar confusion about semantics).

I would also like to see alignment to leverage the technical and cultural work done on C11. I would like to see Unsafe get load-acquire and store-release methods, and these should be used in preference to fences where possible. I'd like to see the C11 wording reused as much as possible. The meanings of the words acquire and release are now owned by the C11 community and we should tag along. A better API for Unsafe would be: putOrdered - storeRelease; put - storeRelaxed; (ordinary volatile write) - store (default is sequentially consistent); etc. ... but the high cost of renaming methods in Unsafe probably makes this a no-go, even though Unsafe is in theory not a public API. At least the documentation of all the methods should indicate what the memory effects and the corresponding C++11 memory model interpretation are. E.g. Unsafe.compareAndSwap should document its memory effects, i.e. sequential consistency. Unsafe doesn't currently have a readAcquire method (mirror of putOrdered), probably because volatile read is _almost_ the same (but not on ppc!).

It's fundamentally incompatible with the way volatiles/atomics are intended to be implemented on ARMv8 (and Itanium). Which I think fundamentally gets this much closer to right than traditional fence-based ISAs. I'm no hardware architect, but fundamentally it seems to me that load x; acquire_fence imposes a much more stringent constraint than load_acquire x. Consider the case in which the load from x is an L1 hit, but a preceding load (from, say, y) is a long-latency miss. If we enforce ordering by just waiting for completion of prior operations, the former has to wait for the load from y to complete, while the latter doesn't. I find it hard to believe that this doesn't leave an appreciable amount of performance on the table, at least for some interesting microarchitectures. I agree. Fences should be used rarely.
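Hans's point about load_acquire versus load-plus-fence can be restated in code. The sketch below uses the JDK 9+ VarHandle API (which post-dates this thread) as a stand-in for the Unsafe intrinsics under discussion; the class and field names are purely illustrative. The acquire load orders only itself against later accesses, while the acquire fence orders all prior loads against later accesses, which is the stronger constraint Hans describes.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class AcquireDemo {
    int data;
    boolean ready;

    static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                .findVarHandle(AcquireDemo.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Variant 1: acquire load -- orders only THIS load against later accesses.
    // A preceding unrelated load (Hans's long-latency miss on y) need not
    // complete before this one is satisfied.
    int readWithAcquireLoad() {
        return ((boolean) READY.getAcquire(this)) ? data : -1;
    }

    // Variant 2: plain load followed by an acquire fence -- orders ALL prior
    // loads against later accesses (a LoadLoad|LoadStore constraint over a
    // whole set of operations), which is strictly more stringent.
    int readWithFence() {
        boolean r = (boolean) READY.getOpaque(this);
        VarHandle.acquireFence();
        return r ? data : -1;
    }

    public static void main(String[] args) {
        AcquireDemo d = new AcquireDemo();
        d.data = 42;
        READY.setRelease(d, true);
        // Single-threaded sanity check; the ordering claims concern
        // what a concurrent reader is permitted to observe.
        if (d.readWithAcquireLoad() != 42) throw new AssertionError();
        if (d.readWithFence() != 42) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The performance argument in the thread is exactly the gap between these two variants: an implementation may satisfy variant 1 without waiting for earlier cache misses, but variant 2 forces the wait.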
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
David, Paul (i.e. Reviewers) and Doug, I'd like to commit corrections so we make progress. I think the current webrev is simple progress with the exception of my attempt to translate volatiles into fences, which is marginal (but was a good learning exercise for me).
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Needless to say, I would clearly also like to see a simple correspondence. But this does raise the interesting question of whether put/get and store(..., memory_order_relaxed)/load(memory_order_relaxed) are intended to have similar semantics. I would guess not, in that the former don't satisfy coherence; accesses to the same variable can be reordered as for normal variable accesses, while the C++11/C11 variants do provide those guarantees. On most, but not all, architectures that's entirely a compiler issue; the hardware claims to provide that guarantee. This affects, for example, whether a variable that is only ever incremented by one thread can appear to another thread to decrease in value. Or whether a reference set to a non-null value exactly once can appear to change back to null after appearing non-null. In my opinion, it makes sense to always provide coherence for atomics, since the overhead is small, and so are the odds of getting code relying on non-coherent racing accesses correct. But for ordinary variables whose accesses are not intended to race, the trade-offs are very different. Hans On Mon, Dec 1, 2014 at 12:40 PM, Martin Buchholz marti...@google.com wrote: Hans, (Thanks for your excellent work on C/C++11 and your eternal patience.) On Tue, Nov 25, 2014 at 11:15 AM, Hans Boehm bo...@acm.org wrote: It seems to me that a (dubiously named) loadFence is intended to have essentially the same semantics as the (perhaps slightly less dubiously named) C++ atomic_thread_fence(memory_order_acquire), and a storeFence matches atomic_thread_fence(memory_order_release). The C++ standard and, even more so, Mark Batty's work have a precise definition of what those mean in terms of implied synchronizes-with relationships.
It looks to me like this whole implementation model for volatiles in terms of fences is fundamentally doomed, and it probably makes more sense to get rid of it rather than spending time on renaming it (though we just did the latter in Android to avoid similar confusion about semantics).

I would also like to see alignment to leverage the technical and cultural work done on C11. I would like to see Unsafe get load-acquire and store-release methods, and these should be used in preference to fences where possible. I'd like to see the C11 wording reused as much as possible. The meanings of the words acquire and release are now owned by the C11 community and we should tag along. A better API for Unsafe would be: putOrdered - storeRelease; put - storeRelaxed; (ordinary volatile write) - store (default is sequentially consistent); etc. ... but the high cost of renaming methods in Unsafe probably makes this a no-go, even though Unsafe is in theory not a public API. At least the documentation of all the methods should indicate what the memory effects and the corresponding C++11 memory model interpretation are. E.g. Unsafe.compareAndSwap should document its memory effects, i.e. sequential consistency. Unsafe doesn't currently have a readAcquire method (mirror of putOrdered), probably because volatile read is _almost_ the same (but not on ppc!).

It's fundamentally incompatible with the way volatiles/atomics are intended to be implemented on ARMv8 (and Itanium). Which I think fundamentally gets this much closer to right than traditional fence-based ISAs. I'm no hardware architect, but fundamentally it seems to me that load x; acquire_fence imposes a much more stringent constraint than load_acquire x. Consider the case in which the load from x is an L1 hit, but a preceding load (from, say, y) is a long-latency miss. If we enforce ordering by just waiting for completion of prior operations, the former has to wait for the load from y to complete, while the latter doesn't.
I find it hard to believe that this doesn't leave an appreciable amount of performance on the table, at least for some interesting microarchitectures. I agree. Fences should be used rarely.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 12/01/2014 03:46 PM, Martin Buchholz wrote: David, Paul (i.e. Reviewers) and Doug, I'd like to commit corrections so we make progress. The current one looks OK to me. (http://cr.openjdk.java.net/~martin/webrevs/openjdk9/fence-intrinsics/) -Doug I think the current webrev is simple progress with the exception of my attempt to translate volatiles into fences, which is marginal (but was a good learning exercise for me).
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Mon, Dec 1, 2014 at 1:51 PM, Hans Boehm bo...@acm.org wrote: Needless to say, I would clearly also like to see a simple correspondence. But this does raise the interesting question of whether put/get and store(..., memory_order_relaxed)/load(memory_order_relaxed) are intended to have similar semantics. I would guess not, in that the former don't satisfy coherence; accesses to the same variable can be reordered as for normal variable accesses, while the C++11/C11 variants do provide those guarantees. On most, but not all, architectures that's entirely a compiler issue; the hardware claims to provide that guarantee. This affects, for example, whether a variable that is only ever incremented by one thread can appear to another thread to decrease in value. Or whether a reference set to a non-null value exactly once can appear to change back to null after appearing non-null. In my opinion, it makes sense to always provide coherence for atomics, since the overhead is small, and so are the odds of getting code relying on non-coherent racing accesses correct. But for ordinary variables whose accesses are not intended to race, the trade-offs are very different. It would be nice to pretend that ordinary Java loads and stores map perfectly to C11 relaxed loads and stores. This maps well to the lack of undefined behavior for data races in Java. But it also fails because of the lack of atomicity of Java longs and doubles. I have no intuition as to whether always requiring per-variable sequential consistency would be a performance problem. Introducing an explicit relaxed memory order mode in Java, when the distinction from ordinary access is smaller than in C/C++11, would be confusing. Despite all that, it would be clean, consistent and seemingly straightforward to simply add all of the C/C++ atomic loads, stores and fences to sun.misc.Unsafe (with the possible exception of consume, which is still under a cloud). If that works out for jdk-internal code, we can add them to a public API.
Providing the full set will help with interoperability with C code running in another thread accessing a direct buffer.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
I think that requiring coherence for ordinary Java accesses would be a performance problem. The pre-2005 Java memory model actually promised it, but implementations ignored that requirement; that was one significant motivation of the 2005 memory model overhaul. The basic problem is that if you have r1 = x.f; r2 = y.f; r3 = x.f; the compiler can no longer perform common subexpression elimination on the two loads from x.f unless it can prove that x and y do not alias, which is probably rare. Loads kill available expressions. Clearly this can significantly reduce the effectiveness of CSE and similar basic optimizations. Enforcing coherence also turns out to be somewhat expensive on Itanium, and rather expensive on some slightly older ARM processors. Those arguments probably don't apply to sun.misc.Unsafe. But no matter what you do for sun.misc.Unsafe, something will be inconsistent. (The other problem of course is that we still don't really know how to define memory_order_relaxed any better than we know how to define ordinary Java memory references.) On Mon, Dec 1, 2014 at 5:05 PM, Martin Buchholz marti...@google.com wrote: On Mon, Dec 1, 2014 at 1:51 PM, Hans Boehm bo...@acm.org wrote: Needless to say, I would clearly also like to see a simple correspondence. But this does raise the interesting question of whether put/get and store(..., memory_order_relaxed)/load(memory_order_relaxed) are intended to have similar semantics. I would guess not, in that the former don't satisfy coherence; accesses to the same variable can be reordered as for normal variable accesses, while the C++11/C11 variants do provide those guarantees. On most, but not all, architectures that's entirely a compiler issue; the hardware claims to provide that guarantee. This affects, for example, whether a variable that is only ever incremented by one thread can appear to another thread to decrease in value.
Or whether a reference set to a non-null value exactly once can appear to change back to null after appearing non-null. In my opinion, it makes sense to always provide coherence for atomics, since the overhead is small, and so are the odds of getting code relying on non-coherent racing accesses correct. But for ordinary variables whose accesses are not intended to race, the trade-offs are very different. It would be nice to pretend that ordinary Java loads and stores map perfectly to C11 relaxed loads and stores. This maps well to the lack of undefined behavior for data races in Java. But it also fails because of the lack of atomicity of Java longs and doubles. I have no intuition as to whether always requiring per-variable sequential consistency would be a performance problem. Introducing an explicit relaxed memory order mode in Java, when the distinction from ordinary access is smaller than in C/C++11, would be confusing. Despite all that, it would be clean, consistent and seemingly straightforward to simply add all of the C/C++ atomic loads, stores and fences to sun.misc.Unsafe (with the possible exception of consume, which is still under a cloud). If that works out for jdk-internal code, we can add them to a public API. Providing the full set will help with interoperability with C code running in another thread accessing a direct buffer.
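Hans's CSE argument can be made concrete. In the sketch below (class and field names are illustrative, not from the JDK), a compiler would like to reuse r1 for r3, but if per-variable coherence were required for ordinary accesses, the intervening load of y.f kills the available expression whenever x and y might alias:

```java
class Cell { int f; }

public class CseDemo {
    // Without a coherence requirement, the JIT may rewrite r3 = x.f as
    // r3 = r1 (common subexpression elimination). With coherence required,
    // it may only do so after proving x and y do not alias, because a
    // racing write through y could have changed x.f in between.
    static int sum(Cell x, Cell y) {
        int r1 = x.f;
        int r2 = y.f;   // possible alias of x.f: kills the available expression
        int r3 = x.f;   // the load the compiler would like to eliminate
        return r1 + r2 + r3;
    }

    public static void main(String[] args) {
        Cell a = new Cell(); a.f = 1;
        Cell b = new Cell(); b.f = 2;
        if (sum(a, b) != 4) throw new AssertionError(); // distinct cells
        if (sum(a, a) != 3) throw new AssertionError(); // aliasing case
        System.out.println("ok");
    }
}
```

The aliasing call sum(a, a) shows why the proof obligation is hard in general: the same code must be correct whether or not the arguments alias.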
Re: [concurrency-interest] RFR: 8065804: JEP 171:Clarifications/corrections for fence intrinsics
On 11/26/2014 09:56 PM, David Holmes wrote: Martin Buchholz writes: On Wed, Nov 26, 2014 at 5:08 PM, David Holmes david.hol...@oracle.com wrote: Please explain why you have changed the defined semantics for storeFence. You have completely reversed the direction of the barrier. Yes. I believe the current spec of storeFence was a copy-paste typo, and it seems others feel likewise. Can whoever wrote that original spec please confirm that. The translations of loadFence == [LoadLoad|LoadStore] and storeFence == [StoreStore|LoadStore] into prose got mangled at some point. (Probably by me; sorry if so!) -Doug
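Doug's barrier equations correspond to the standard publication idiom. A sketch, using the JDK 9+ VarHandle fences as stand-ins for the Unsafe intrinsics (releaseFence ~ storeFence = StoreStore|LoadStore, acquireFence ~ loadFence = LoadLoad|LoadStore); names are illustrative:

```java
import java.lang.invoke.VarHandle;

public class PublishDemo {
    static int data;
    static boolean ready;

    static void writer() {
        data = 42;                  // ordinary store
        VarHandle.releaseFence();   // StoreStore|LoadStore: the data store may
                                    // not be reordered below the flag store
        ready = true;               // ordinary store acting as the flag
    }

    static Integer reader() {
        boolean r = ready;          // ordinary load of the flag
        VarHandle.acquireFence();   // LoadLoad|LoadStore: the data load may
                                    // not be reordered above the flag load
        return r ? data : null;     // sees 42 whenever the flag was seen
    }

    public static void main(String[] args) {
        writer();
        Integer v = reader();
        if (v == null || v != 42) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Each fence orders a set of prior operations against a set of later ones, which is exactly why the prose descriptions must say which kinds of accesses are constrained on each side.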
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Tue, Nov 25, 2014 at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote: Hi Martin, Thanks for looking into this. 1141 * Currently hotspot's implementation of a Java language-level volatile 1142 * store has the same effect as a storeFence followed by a relaxed store, 1143 * although that may be a little stronger than needed. IIUC, to emulate hotspot's volatile store you will need to say that a fullFence immediately follows the relaxed store. Right - I've been grokking that. The bit that always confuses me about release and acquire is that ordering is restricted to one direction, as talked about in orderAccess.hpp [1]. So for a release, accesses prior to the release cannot move below it, but accesses succeeding the release can move above it. And that seems to apply to Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++ release fences, where ordering is restricted both for prior and succeeding accesses? [3] So what about the following? a = r1; // Cannot move below the fence Unsafe.storeFence(); b = r2; // Can move above the fence? I think the hotspot docs need to be more precise about when they're talking about movement of stores and when about loads. // release. I.e., subsequent memory accesses may float above the // release, but prior ones may not float below it. As I've said elsewhere, the above makes no sense without restricting the type of access.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Tue, Nov 25, 2014 at 1:41 PM, Andrew Haley a...@redhat.com wrote: On 11/24/2014 08:56 PM, Martin Buchholz wrote: + * Currently hotspot's implementation of a Java language-level volatile + * store has the same effect as a storeFence followed by a relaxed store, + * although that may be a little stronger than needed. While this may be true today, I'm hopefully about to commit an AArch64 OpenJDK port that uses the ARMv8 stlr instruction. I don't think that what you've written here is terribly misleading, but bear in mind that it may be there for some time. No - it was very wrong, since it doesn't give you sequential consistency!
RE: [concurrency-interest] RFR: 8065804: JEP 171:Clarifications/corrections for fence intrinsics
Martin Buchholz writes: On Wed, Nov 26, 2014 at 5:08 PM, David Holmes david.hol...@oracle.com wrote: Please explain why you have changed the defined semantics for storeFence. You have completely reversed the direction of the barrier. Yes. I believe the current spec of storeFence was a copy-paste typo, and it seems others feel likewise. Can whoever wrote that original spec please confirm that. Thanks, David
RE: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Can I make an observation about acquire() and release() - to me they are meaningless when considered in isolation. Given their definitions they allow anything to move into a region bounded by acquire() and release(); you can then effectively move the whole program into the region, and thus the acquire() and release() do not constrain any reorderings. acquire() and release() only make sense when their own movement is constrained with respect to something else - such as lock acquisition/release, or when combined with specific load/store actions. David Martin Buchholz writes: On Tue, Nov 25, 2014 at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote: Hi Martin, Thanks for looking into this. 1141 * Currently hotspot's implementation of a Java language-level volatile 1142 * store has the same effect as a storeFence followed by a relaxed store, 1143 * although that may be a little stronger than needed. IIUC, to emulate hotspot's volatile store you will need to say that a fullFence immediately follows the relaxed store. Right - I've been grokking that. The bit that always confuses me about release and acquire is that ordering is restricted to one direction, as talked about in orderAccess.hpp [1]. So for a release, accesses prior to the release cannot move below it, but accesses succeeding the release can move above it. And that seems to apply to Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++ release fences, where ordering is restricted both for prior and succeeding accesses? [3] So what about the following? a = r1; // Cannot move below the fence Unsafe.storeFence(); b = r2; // Can move above the fence? I think the hotspot docs need to be more precise about when they're talking about movement of stores and when about loads. // release. I.e., subsequent memory accesses may float above the // release, but prior ones may not float below it. As I've said elsewhere, the above makes no sense without restricting the type of access.
___ Concurrency-interest mailing list concurrency-inter...@cs.oswego.edu http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On Wed, Nov 26, 2014 at 7:00 PM, David Holmes davidchol...@aapt.net.au wrote: Can I make an observation about acquire() and release() - to me they are meaningless when considered in isolation. Given their definitions they allow anything to move into a region bounded by acquire() and release(), then you can effectively move the whole program into the region and thus the acquire() and release() do not constrain any reorderings. acquire() and release() only make sense when their own movement is constrained with respect to something else - such as lock acquisition/release, or when combined with specific load/store actions. David, it seems you are agreeing with my argument below. The definitions in the hotspot sources should be fixed, in the same sort of way that I'm trying to make the specs for Unsafe loads clearer and more precise. David Martin Buchholz writes: I think the hotspot docs need to be more precise about when they're talking about movement of stores and when about loads. // release. I.e., subsequent memory accesses may float above the // release, but prior ones may not float below it. As I've said elsewhere, the above makes no sense without restricting the type of access.
RE: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Martin writes: On Wed, Nov 26, 2014 at 7:00 PM, David Holmes davidchol...@aapt.net.au wrote: Can I make an observation about acquire() and release() - to me they are meaningless when considered in isolation. Given their definitions they allow anything to move into a region bounded by acquire() and release(); you can then effectively move the whole program into the region, and thus the acquire() and release() do not constrain any reorderings. acquire() and release() only make sense when their own movement is constrained with respect to something else - such as lock acquisition/release, or when combined with specific load/store actions. David, it seems you are agreeing with my argument below. The definitions in the hotspot sources should be fixed, in the same sort of way that I'm trying to make the specs for Unsafe loads clearer and more precise. Please see: https://bugs.openjdk.java.net/browse/JDK-7143664 Though I'm not sure my ramblings there reflect my current thoughts on all this. I really think acquire/release are too confusingly used to be useful - by which I mean that the names do not reflect their actions, so you will always have to remember or look up exactly what *release* and *acquire* mean in a given context; hence talk of acquire semantics and release semantics becomes meaningless. In contrast, the loadload|loadstore etc. barriers are completely straightforward to understand from their names. However, it seems they are too strong compared to what recent hardware provides. The hotspot implementations in orderAccess are confusing - barriers with different semantics have been defined in terms of each other, but the low-level implementations provide a barrier that is stronger than the required semantics, so the high-level APIs are satisfied correctly, even if not implemented in a way that makes sense when you reason about what each barrier theoretically allows.
David David Martin Buchholz writes: I think the hotspot docs need to be more precise about when they're talking about movement of stores and when about loads. // release. I.e., subsequent memory accesses may float above the // release, but prior ones may not float below it. As I've said elsewhere, the above makes no sense without restricting the type of access.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 11/27/2014 04:00 AM, David Holmes wrote: Can I make an observation about acquire() and release() - to me they are meaningless when considered in isolation. Given their definitions they allow anything to move into a region bounded by acquire() and release(); you can then effectively move the whole program into the region, and thus the acquire() and release() do not constrain any reorderings. acquire() and release() only make sense when their own movement is constrained with respect to something else - such as lock acquisition/release, or when combined with specific load/store actions. David ...or another acquire/release region? Regards, Peter Martin Buchholz writes: On Tue, Nov 25, 2014 at 6:04 AM, Paul Sandoz paul.san...@oracle.com wrote: Hi Martin, Thanks for looking into this. 1141 * Currently hotspot's implementation of a Java language-level volatile 1142 * store has the same effect as a storeFence followed by a relaxed store, 1143 * although that may be a little stronger than needed. IIUC, to emulate hotspot's volatile store you will need to say that a fullFence immediately follows the relaxed store. Right - I've been grokking that. The bit that always confuses me about release and acquire is that ordering is restricted to one direction, as talked about in orderAccess.hpp [1]. So for a release, accesses prior to the release cannot move below it, but accesses succeeding the release can move above it. And that seems to apply to Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++ release fences, where ordering is restricted both for prior and succeeding accesses? [3] So what about the following? a = r1; // Cannot move below the fence Unsafe.storeFence(); b = r2; // Can move above the fence? I think the hotspot docs need to be more precise about when they're talking about movement of stores and when about loads. // release. I.e., subsequent memory accesses may float above the // release, but prior ones may not float below it.
As I've said elsewhere, the above makes no sense without restricting the type of access.
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
Hi Martin, Thanks for looking into this. 1141 * Currently hotspot's implementation of a Java language-level volatile 1142 * store has the same effect as a storeFence followed by a relaxed store, 1143 * although that may be a little stronger than needed. IIUC, to emulate hotspot's volatile store you will need to say that a fullFence immediately follows the relaxed store. The bit that always confuses me about release and acquire is that ordering is restricted to one direction, as talked about in orderAccess.hpp [1]. So for a release, accesses prior to the release cannot move below it, but accesses succeeding the release can move above it. And that seems to apply to Unsafe.storeFence [2] (acting like a monitor exit). Is that contrary to C++ release fences, where ordering is restricted both for prior and succeeding accesses? [3] So what about the following? a = r1; // Cannot move below the fence Unsafe.storeFence(); b = r2; // Can move above the fence? Paul. [1] In orderAccess.hpp // Execution by a processor of release makes the effect of all memory // accesses issued by it previous to the release visible to all // processors *before* the release completes. The effect of subsequent // memory accesses issued by it *may* be made visible *before* the // release. I.e., subsequent memory accesses may float above the // release, but prior ones may not float below it. [2] In memnode.hpp // Release - no earlier ref can move after (but later refs can move // up, like a speculative pipelined cache-hitting Load). Requires // multi-cpu visibility. Inserted independent of any store, as required // for intrinsic sun.misc.Unsafe.storeFence().
    class StoreFenceNode : public MemBarNode {
    public:
      StoreFenceNode(Compile* C, int alias_idx, Node* precedent)
        : MemBarNode(C, alias_idx, precedent) {}
      virtual int Opcode() const;
    };

[3] http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/

On Nov 25, 2014, at 1:47 AM, Martin Buchholz marti...@google.com wrote: OK, I worked in some wording for comparison with volatiles. I believe you when you say that the semantics of the corresponding C++ fences are slightly different, but it's rather subtle - can we say anything more than closely related to? On Mon, Nov 24, 2014 at 1:29 PM, Aleksey Shipilev aleksey.shipi...@oracle.com wrote: Hi Martin, On 11/24/2014 11:56 PM, Martin Buchholz wrote: Review carefully - I am trying to learn about fences by explaining them! I have borrowed some wording from my reviewers! https://bugs.openjdk.java.net/browse/JDK-8065804 http://cr.openjdk.java.net/~martin/webrevs/openjdk9/fence-intrinsics/ I think implies the effect of C++11 is too strong wording; related might be more appropriate. See also comments here for the connection with volatiles: https://bugs.openjdk.java.net/browse/JDK-8038978 Take note of Hans's correction that fences generally imply more than volatile load/store, but since you are listing the related things in the docs, I think the native Java example is good to have. -Aleksey.
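Paul's two-line example can be restated in runnable form. This is a sketch only (class and method names are hypothetical, and VarHandle.releaseFence is a JDK 9+ stand-in for Unsafe.storeFence); the comments record the two competing readings rather than resolving them:

```java
import java.lang.invoke.VarHandle;

public class FenceMotion {
    static int a, b;

    static void demo(int r1, int r2) {
        a = r1;                    // a store BEFORE the fence: on both the
                                   // orderAccess.hpp reading and the C++ fence
                                   // reading, it may not sink below the fence
        VarHandle.releaseFence();  // ~ Unsafe.storeFence (StoreStore|LoadStore)
        b = r2;                    // a store AFTER the fence: the one-directional
                                   // orderAccess.hpp wording permits it to float
                                   // above the fence; whether a C++-style release
                                   // fence should also permit that is exactly
                                   // Paul's question [3]
    }

    public static void main(String[] args) {
        demo(1, 2);
        // Single-threaded sanity check only; the disputed reorderings concern
        // what a concurrent observer of a and b may see.
        if (a != 1 || b != 2) throw new AssertionError();
        System.out.println("ok");
    }
}
```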
Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
On 11/24/2014 08:56 PM, Martin Buchholz wrote: Hi folks, Review carefully - I am trying to learn about fences by explaining them! I have borrowed some wording from my reviewers! + * Currently hotspot's implementation of a Java language-level volatile + * store has the same effect as a storeFence followed by a relaxed store, + * although that may be a little stronger than needed. While this may be true today, I'm hopefully about to commit an AArch64 OpenJDK port that uses the ARMv8 stlr instruction. I don't think that what you've written here is terribly misleading, but bear in mind that it may be there for some time. Andrew.
RE: [concurrency-interest] RFR: 8065804: JEP 171:Clarifications/corrections for fence intrinsics
Stephan Diestelhorst writes: On Tuesday, 25 November 2014, 11:15:36, Hans Boehm wrote: I'm no hardware architect, but fundamentally it seems to me that load x; acquire_fence imposes a much more stringent constraint than load_acquire x. Consider the case in which the load from x is an L1 hit, but a preceding load (from, say, y) is a long-latency miss. If we enforce ordering by just waiting for completion of prior operations, the former has to wait for the load from y to complete, while the latter doesn't. I find it hard to believe that this doesn't leave an appreciable amount of performance on the table, at least for some interesting microarchitectures. I agree, Hans, that this is a reasonable assumption. Load_acquire x does allow roach motel, whereas the acquire fence does not. In addition, for better or worse, fencing requirements on at least Power are actually driven as much by store atomicity issues as by the ordering issues discussed in the cookbook. This was not understood in 2005, and unfortunately doesn't seem to be amenable to the kind of straightforward explanation as in Doug's cookbook. Coming from a strongly ordered architecture to a weakly ordered one myself, I also needed some mental adjustment about store (multi-copy) atomicity. I can imagine others will be unaware of this difference, too, even in 2014. Sorry, I'm missing the connection between fences and multi-copy atomicity. David Stephan
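David's question about the connection between fences and multi-copy atomicity is usually illustrated with the IRIW (independent reads of independent writes) litmus test. A sketch (class name illustrative), using volatile as the sequentially consistent access mode: on a non-multi-copy-atomic machine such as Power, the two readers could disagree about the order in which the two writes became visible, and forbidding that requires a full fence between the readers' loads, not just local reordering constraints:

```java
public class Iriw {
    static volatile int x, y;    // volatile (SC) accesses; plain fields would
                                 // not forbid the outcome below
    static int r1, r2, r3, r4;

    public static void main(String[] args) throws InterruptedException {
        Thread w1 = new Thread(() -> x = 1);            // writer 1
        Thread w2 = new Thread(() -> y = 1);            // writer 2
        Thread rA = new Thread(() -> { r1 = x; r2 = y; }); // reader A
        Thread rB = new Thread(() -> { r3 = y; r4 = x; }); // reader B
        for (Thread t : new Thread[]{w1, w2, rA, rB}) t.start();
        for (Thread t : new Thread[]{w1, w2, rA, rB}) t.join();
        // The outcome r1==1,r2==0,r3==1,r4==0 would mean reader A saw the
        // write to x before the write to y while reader B saw the opposite
        // order: a store-atomicity violation, forbidden under SC. This is an
        // ordering cost driven by store atomicity, not by the local
        // reorderings in the cookbook.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0)
            throw new AssertionError("SC violated");
        System.out.println("ok");
    }
}
```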