Re: [HACKERS] better atomics - v0.5
Hi Andres,

Are you planning to continue working on this? Summarizing the discussion so far:

* Robert listed a bunch of little cleanup tasks (http://www.postgresql.org/message-id/ca+tgmozshvvqjakul6p3kdvzvpibtgkzoti3m+fvvjg5v+x...@mail.gmail.com). Amit posted yet more detailed comments.

* We talked about changing the file layout. I think everyone is happy with your proposal here: http://www.postgresql.org/message-id/20140702131729.gp21...@awork2.anarazel.de, with an overview description of what goes where (http://www.postgresql.org/message-id/CA+TgmoYuxfsm2dSy48=jgutrh1mozrvmjlhgqrfku7_wpv-...@mail.gmail.com).

* Talked about nested spinlocks. The consensus seems to be to (continue to) forbid nested spinlocks, but to allow atomic operations while holding a spinlock. When atomics are emulated with spinlocks, it's OK to acquire the emulation spinlock while holding another spinlock.

* Talked about whether emulating atomics with spinlocks is a good idea. You posted performance results showing that, at least with the patch to use atomics to implement LWLocks, the emulation performs more or less the same as the current spinlock-based implementation. I believe everyone was more or less satisfied with that.

So ISTM we have consensus that the approach to spinlock emulation and the nesting stuff is OK. To finish the patch, you'll just need to address the file layout and the laundry list of other little details that have been raised.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
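The emulation scheme agreed on above can be sketched roughly as follows. This is illustrative only: the type and function names are hypothetical, not the patch's actual API, and a pthread mutex stands in for the emulation spinlock.

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Hedged sketch of spinlock-emulated atomics.  Each emulated atomic
 * carries its own lock, so acquiring the emulation lock while holding an
 * unrelated spinlock is safe -- which is exactly the nesting rule agreed
 * on above: no nested spinlocks, but atomics-under-spinlock is fine.
 */
typedef struct
{
	pthread_mutex_t lock;	/* the per-variable emulation "spinlock" */
	uint32_t	value;
} emulated_atomic_u32;

static uint32_t
emulated_fetch_add_u32(emulated_atomic_u32 *ptr, uint32_t add)
{
	uint32_t	old;

	pthread_mutex_lock(&ptr->lock);
	old = ptr->value;			/* read and modify under the lock ... */
	ptr->value = old + add;
	pthread_mutex_unlock(&ptr->lock);	/* ... so the update is atomic */
	return old;
}
```

Note that the hot path still executes one locked instruction per operation, which is why the emulation benchmarks close to the native spinlock code.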
Re: [HACKERS] better atomics - v0.5
Hi, On 2014-08-20 12:43:05 +0300, Heikki Linnakangas wrote: Are you planning to continue working on this? Yes, I am! I've recently been on holiday and am now busy catching up with everything that has happened since. I hope to have a next version ready early next week. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
On Mon, Jul 14, 2014 at 12:47 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-07-08 11:51:14 +0530, Amit Kapila wrote: On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com wrote: Further review of patch: 1. /* * pg_atomic_test_and_set_flag - TAS() * * Acquire/read barrier semantics. */ STATIC_IF_INLINE_DECLARE bool pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr); a. I think Acquire and read barrier semantics aren't equal. Right. It was meant as Acquire (thus including read barrier) semantics. Okay, I got confused by reading the comments; maybe saying so explicitly in the comments would be better. I guess I'll better update README.barrier to have definitions of all barriers. I think documenting the reasons behind using specific barriers for the atomic operations defined by PG can be useful for people who want to understand or use the atomic op APIs. b. I think it adheres to Acquire semantics for platforms where those semantics are supported, else it defaults to full memory ordering. Example: #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) Is it right to declare that the generic version of the function adheres to Acquire semantics? Implementing stronger semantics than required should always be fine. There's simply no sane way to work with the variety of platforms we support in any other way. Agreed; maybe we can have a note somewhere, either in the code or the README, to mention it, if you think that would be helpful. 2. bool pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr) { return TAS((slock_t *) ptr->sema); #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) Where's that from? It's there in s_lock.h. See the code below: #ifdef WIN32_ONLY_COMPILER typedef LONG slock_t; #define HAS_TEST_AND_SET #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) I can't recall adding a line of code like that? It is not added by your patch but used in your patch.
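The Windows TAS() definition being questioned can be mirrored in portable C11 to see why any nonzero lock word reads as "held" (a hedged sketch, not the actual s_lock.h code):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Portable mirror of TAS(lock) = InterlockedCompareExchange(lock, 1, 0).
 * The compare-exchange stores 1 only when the lock word is exactly 0 and
 * returns the old value, so any nonzero word -- even a negative one --
 * simply counts as "already taken" and is never overwritten.  That is
 * why the definition stays sane for unexpected values of sema.
 */
static bool
tas(_Atomic uint32_t *lock)
{
	uint32_t	expected = 0;

	/* TAS convention: returns true if the lock could NOT be acquired */
	return !atomic_compare_exchange_strong(lock, &expected, 1);
}
```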
I am not sure if that is exactly what you want the atomic API to define for WIN32, but I think it is picking up this definition. I have one question here: if the value of sema is other than 1 or 0, then it won't set the flag, and I am not sure what it will return (e.g. if the value is negative), so is this implementation completely sane? a. In the above usage (ptr->sema), sema itself is not declared as volatile, so would it be right to use it this way for an API (InterlockedCompareExchange) expecting volatile? Upgrading a pointer to volatile is defined, so I don't see the problem? That's okay. However all other structure definitions use it as volatile; for example, in atomics-arch-amd64.h it is defined as follows: #define PG_HAVE_ATOMIC_FLAG_SUPPORT typedef struct pg_atomic_flag { volatile char value; } pg_atomic_flag; So is there any harm if we keep this definition in sync with the others? 3. /* * pg_atomic_unlocked_test_flag - TAS() * * No barrier semantics. */ STATIC_IF_INLINE_DECLARE bool pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr); a. There is no setting in this, so using TAS in the function comments seems to be wrong. Good point. b. BTW, I am not able to see this function in the C11 specs. So? It's needed for a sensible implementation of spinlocks on top of atomics. Okay, got the point. Do you think it makes sense to have a common agreement/consensus w.r.t. which APIs are necessary as a first cut? For me as a reviewer, if the API is implemented as per the specs or as it is desired to behave, then we can have it; however, I am not sure if there is general consensus, or at least some people agreed to a set of APIs which are required for the first cut. Have I missed some conclusion/discussion about it? 5. atomics-generic-gcc.h #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG static inline bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { return ptr->value == 0; } #endif Don't we need to compare it with 1 instead of 0? Why?
It returns true if the lock is free, which makes sense imo? The other implementation in atomics-generic.h seems to implement it the other way: #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG static inline bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { return pg_atomic_read_u32_impl(ptr) == 1; } Will add comments to atomics.h I think that will be helpful. 6. atomics-fallback.h pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { /* * Can't do this in the semaphore based implementation - always return * true. Do this in the header so compilers can optimize the test away. */ return true; } Why can't we implement this in the semaphore based implementation? Because semaphores don't provide the API for it? Okay, but it is not clear from the comments why this test should always return true; basically I am not able to see why, in case of a missing API, returning true is the correct behaviour for this API?
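One way to see why the fallback's unconditional "true" is safe: the unlocked test only feeds an optimistic pre-check in a spin loop; the authoritative step is still the test-and-set. A hedged C11 sketch (illustrative names, not the patch's code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Test-and-test-and-set spin loop.  The cheap unlocked read avoids
 * hammering the bus with locked instructions while the lock is held.
 * A fallback that cannot peek at the lock word (semaphores expose no
 * such API) can return "looks free" unconditionally: callers just
 * proceed to the real test-and-set, losing only the cheap-spin
 * optimization, never correctness.
 */
static bool
looks_free(atomic_uint *l)
{
	/* no barrier semantics; a plain read of the lock word */
	return atomic_load_explicit(l, memory_order_relaxed) == 0;
}

static void
spin_acquire(atomic_uint *l)
{
	for (;;)
	{
		if (looks_free(l) &&
			atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
			return;				/* got it */
		/* else: delay/backoff would go here */
	}
}

static void
spin_release(atomic_uint *l)
{
	atomic_store_explicit(l, 0, memory_order_release);
}
```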
Re: [HACKERS] better atomics - v0.5
On 2014-07-10 08:46:55 +0530, Amit Kapila wrote: As per my understanding of the general theory around barriers, read and write are defined to avoid reordering due to compiler and full memory barriers are defined to avoid reordering due to processors. No, that's not the case. There are several processors where write/read barriers are an actual thing at the hardware level. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
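The point that read/write barriers are real hardware instructions can be illustrated with the standard message-passing pattern in C11 fences (a sketch; on weakly ordered CPUs such as POWER these fences become real instructions like lwsync, while on x86 both typically compile to nothing because the hardware already orders stores with stores and loads with loads):

```c
#include <stdatomic.h>

/* Message passing: 'data' is handed off through 'flag'. */
static atomic_int data;
static atomic_int flag;

static void
producer(void)
{
	atomic_store_explicit(&data, 42, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* "write barrier" */
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

static int
consumer(void)
{
	while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
		;						/* wait for the flag */
	atomic_thread_fence(memory_order_acquire);	/* "read barrier" */
	return atomic_load_explicit(&data, memory_order_relaxed);
}
```

Without the two fences, a weakly ordered processor (not just the compiler) may let the consumer observe flag == 1 while still reading a stale data.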
Re: [HACKERS] better atomics - v0.5
Hi Amit, On 2014-07-08 11:51:14 +0530, Amit Kapila wrote: On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com wrote: Further review of patch: 1. /* * pg_atomic_test_and_set_flag - TAS() * * Acquire/read barrier semantics. */ STATIC_IF_INLINE_DECLARE bool pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr); a. I think Acquire and read barrier semantics aren't equal. Right. It was meant as Acquire (thus including read barrier) semantics. I guess I'll better update README.barrier to have definitions of all barriers. b. I think it adheres to Acquire semantics for platforms where those semantics are supported, else it defaults to full memory ordering. Example: #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) Is it right to declare that the generic version of the function adheres to Acquire semantics? Implementing stronger semantics than required should always be fine. There's simply no sane way to work with the variety of platforms we support in any other way. 2. bool pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr) { return TAS((slock_t *) ptr->sema); #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) Where's that from? I can't recall adding a line of code like that? a. In the above usage (ptr->sema), sema itself is not declared as volatile, so would it be right to use it this way for an API (InterlockedCompareExchange) expecting volatile? Upgrading a pointer to volatile is defined, so I don't see the problem? 3. /* * pg_atomic_unlocked_test_flag - TAS() * * No barrier semantics. */ STATIC_IF_INLINE_DECLARE bool pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr); a. There is no setting in this, so using TAS in the function comments seems to be wrong. Good point. b. BTW, I am not able to see this function in the C11 specs. So? It's needed for a sensible implementation of spinlocks on top of atomics. 4. 
#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32) .. #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG static inline bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { return pg_atomic_read_u32_impl(ptr) == 1; } #elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) .. #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG static inline bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { return (bool) pg_atomic_read_u32_impl(ptr); } Is there any reason to keep these two implementations different? No, will change. 5. atomics-generic-gcc.h #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG static inline bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { return ptr->value == 0; } #endif Don't we need to compare it with 1 instead of 0? Why? It returns true if the lock is free, which makes sense imo? Will add comments to atomics.h 6. atomics-fallback.h pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { /* * Can't do this in the semaphore based implementation - always return * true. Do this in the header so compilers can optimize the test away. */ return true; } Why can't we implement this in the semaphore based implementation? Because semaphores don't provide the API for it? 7. /* * pg_atomic_clear_flag - release lock set by TAS() * * Release/write barrier semantics. */ STATIC_IF_INLINE_DECLARE void pg_atomic_clear_flag(volatile pg_atomic_flag *ptr); a. Are Release and write barriers equivalent? Same answer as above: meant as Release (thus including write barrier) semantics. b. IIUC then the current code doesn't have release semantics for unlock/clear. If you're referring to s_lock.h with 'current code', yes it should have: * On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and * S_UNLOCK() macros must further include hardware-level memory fence * instructions to prevent similar re-ordering at the hardware level. 
* TAS() and TAS_SPIN() must guarantee that loads and stores issued after * the macro are not executed until the lock has been obtained. Conversely, * S_UNLOCK() must guarantee that loads and stores issued before the macro * have been executed before the lock is released. We are planning to introduce it in this patch; also, this doesn't follow the specs, which say that clear should have memory_order_seq_cst ordering semantics. Making it guarantee full memory barrier semantics is a major performance loss on x86-64, so I don't want to do that. Also there is atomic_flag_test_and_set_explicit()... As per its current usage in the patch (S_UNLOCK), it seems reasonable to have *release* semantics for this API; however, I think if we use it for generic purposes then it might not be right to define it with Release semantics. I can't really see a user where it's not what you
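The trade-off under discussion can be sketched in C11 (illustrative, not the patch's code). Release ordering guarantees that stores made inside the critical section are visible before the lock word reads as free, which is exactly the S_UNLOCK() requirement quoted above; C11's atomic_flag_clear() defaults to seq_cst, which additionally orders the *following* accesses, and that extra store-load ordering is the cost being avoided on x86-64:

```c
#include <stdatomic.h>

/* Unlock with release ordering: cheap on x86 (a plain store suffices). */
static void
clear_flag_release(atomic_uint *f)
{
	atomic_store_explicit(f, 0, memory_order_release);
}

/* Unlock with seq_cst, as the C11 spec prescribes for atomic_flag_clear:
 * on x86-64 this needs a locked instruction or mfence -- the performance
 * loss being objected to. */
static void
clear_flag_seq_cst(atomic_uint *f)
{
	atomic_store_explicit(f, 0, memory_order_seq_cst);
}
```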
Re: [HACKERS] better atomics - v0.5
On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote: As per my understanding of the general theory around barriers, read and write are defined to avoid reordering due to compiler and full memory barriers are defined to avoid reordering due to processors. There are some processors that support instructions for memory barriers with acquire, release and fence semantics. Not exactly: both compilers and processors can rearrange loads and stores. Because the architecture most developers work on (x86) has such a strong memory model (it goes to a lot of effort to hide reordering), people are under the impression that it's only the compiler you need to worry about, but that's not true for any other architecture. Memory barriers/fences are on the whole discouraged if you can use acquire/release semantics, because the latter are faster and much easier to optimise for. I strongly recommend the Atomic Weapons talk on the C11 memory model to help understand how they work. As a bonus it includes correct implementations for the major architectures. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
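The recommendation above is to attach acquire/release ordering to the operations themselves rather than using free-standing fences. A hedged sketch of the difference in the C11 model:

```c
#include <stdatomic.h>

/*
 * Acquire/release on the operations, not standalone fences: the ordering
 * constraint is tied to exactly one store and one load, which is cheaper
 * on weakly ordered hardware and easier for compilers to optimize than a
 * free-standing full fence.
 */
static _Atomic int ready;
static int payload;			/* plain data handed off through 'ready' */

static void
publish(int v)
{
	payload = v;
	/* release store: 'payload' is visible before 'ready' reads as 1 */
	atomic_store_explicit(&ready, 1, memory_order_release);
}

static int
consume(void)
{
	/* acquire load: everything before the matching release is visible */
	while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
		;
	return payload;
}
```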
Re: [HACKERS] better atomics - v0.5
On Sat, Jul 12, 2014 at 1:26 AM, Martijn van Oosterhout klep...@svana.org wrote: On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote: As per my understanding of the general theory around barriers, read and write are defined to avoid reordering due to compiler and full memory barriers are defined to avoid reordering due to processors. There are some processors that support instructions for memory barriers with acquire, release and fence semantics. Not exactly: both compilers and processors can rearrange loads and stores. Because the architecture most developers work on (x86) has such a strong memory model (it goes to a lot of effort to hide reordering), people are under the impression that it's only the compiler you need to worry about, but that's not true for any other architecture. I am not saying that we need barriers only to avoid reordering due to the compiler; rather, my point was that we need to avoid reordering due to both compiler and processor, though the ways to achieve these require different APIs. Memory barriers/fences are on the whole discouraged if you can use acquire/release semantics, because the latter are faster and much easier to optimise for. Do all processors support instructions for memory barriers with acquire, release semantics? I strongly recommend the Atomic Weapons talk on the C11 memory model to help understand how they work. As a bonus it includes correct implementations for the major architectures. Yes, it will be good if we can implement as per the C11 memory model, and that's what I am referring to while reviewing this patch; however, even if that is not possible, or is difficult to achieve in all cases, it is worth discussing the memory model used in the current APIs implemented in the patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] better atomics - v0.5
On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com wrote: Further review of patch: 1. /* * pg_atomic_test_and_set_flag - TAS() * * Acquire/read barrier semantics. */ STATIC_IF_INLINE_DECLARE bool pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr); a. I think Acquire and read barrier semantics aren't equal. With acquire semantics, the results of the operation are available before the results of any operation that appears after it in code, which means it applies to both loads and stores. The definition of a read barrier just ensures loads. So will it be right to declare like the above in comments? Yeah. Barriers can be categorized by the operations they keep from being reordered (read, write, or both) and by the direction in which they prevent reordering (acquire, release, or both). So in theory there are 9 combinations: read barrier (both directions), read barrier (acquire), read barrier (release), write barrier (both directions), write barrier (acquire), write barrier (release), full barrier (both directions), full barrier (acquire), full barrier (release). To make things more confusing, a read barrier plus a write barrier need not equal a full barrier. Read barriers prevent reads from being reordered with other reads, and write barriers keep writes from being reordered with other writes, but neither prevents a read from being reordered relative to a write. A full barrier, however, must prevent all reordering: read/read, read/write, write/read, and write/write. Clarity here is essential. barrier.h only proposed to implement the following: read barrier (both directions), write barrier (both directions), full barrier (both directions). The reason I did that is because it seemed to be more or less what Linux was doing, and it's generally suitable for lock-less algorithms. 
The acquire and release semantics come into play specifically when you're using barriers to implement locking primitives, which isn't what I was trying to enable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
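The "read barrier plus write barrier need not equal a full barrier" point can be made concrete with a Dekker-style entry protocol (a hedged sketch in C11, not code from the patch): nothing short of a full barrier orders a store against a *later* load.

```c
#include <stdatomic.h>

/*
 * Dekker-style entry: each side raises its own flag, then checks the
 * other's.  If the seq_cst fence below were replaced by a release fence
 * (write barrier) plus an acquire fence (read barrier), the store could
 * still be reordered after the load, and both sides could read 0 and
 * enter the critical section together.  Only a full barrier provides
 * the required store->load ordering.
 */
static atomic_int flag0;
static atomic_int flag1;

static int
side0_sees_side1(void)
{
	atomic_store_explicit(&flag0, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* full barrier */
	return atomic_load_explicit(&flag1, memory_order_relaxed);
}
```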
Re: [HACKERS] better atomics - v0.5
On Wed, Jul 9, 2014 at 8:58 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com wrote: Further review of patch: 1. /* * pg_atomic_test_and_set_flag - TAS() * * Acquire/read barrier semantics. */ STATIC_IF_INLINE_DECLARE bool pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr); a. I think Acquire and read barrier semantics aren't equal. With acquire semantics, the results of the operation are available before the results of any operation that appears after it in code, which means it applies to both loads and stores. The definition of a read barrier just ensures loads. So will it be right to declare like the above in comments? Yeah. Barriers can be categorized by the operations they keep from being reordered (read, write, or both) and by the direction in which they prevent reordering (acquire, release, or both). So in theory there are 9 combinations: read barrier (both directions), read barrier (acquire), read barrier (release), write barrier (both directions), write barrier (acquire), write barrier (release), full barrier (both directions), full barrier (acquire), full barrier (release). With these definitions, I think it will fall into the *full barrier (acquire)* category, as it needs to ensure that no reordering happens for both loads and stores. As an example, we can refer to its implementation for msvc, which ensures that it follows full memory barrier semantics. #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) To make things more confusing, a read barrier plus a write barrier need not equal a full barrier. Read barriers prevent reads from being reordered with other reads, and write barriers keep writes from being reordered with other writes, but neither prevents a read from being reordered relative to a write. A full barrier, however, must prevent all reordering: read/read, read/write, write/read, and write/write. Clarity here is essential. 
barrier.h only proposed to implement the following: read barrier (both directions), write barrier (both directions), full barrier (both directions). As per my understanding of the general theory around barriers, read and write barriers are defined to avoid reordering due to the compiler, and full memory barriers are defined to avoid reordering due to processors. There are some processors that support instructions for memory barriers with acquire, release and fence semantics. I think the current definitions of barriers are as per the needs of their usage in PostgreSQL and not by referring to a standard (am I missing something here); however, as you explained, the definitions are quite clear. The reason I did that is because it seemed to be more or less what Linux was doing, and it's generally suitable for lock-less algorithms. The acquire and release semantics come into play specifically when you're using barriers to implement locking primitives, which isn't what I was trying to enable. Now, by implementing atomic ops, I think we are moving more towards lock-less algorithms, so I am not sure if we just want to adhere to the existing definitions as you listed above (what the current barrier.h implements) and implement accordingly, or whether we can extend the existing definitions. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] better atomics - v0.5
On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com wrote: Further review of patch: 1. /* * pg_atomic_test_and_set_flag - TAS() * * Acquire/read barrier semantics. */ STATIC_IF_INLINE_DECLARE bool pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr); a. I think Acquire and read barrier semantics aren't equal. With acquire semantics, the results of the operation are available before the results of any operation that appears after it in code, which means it applies to both loads and stores. The definition of a read barrier just ensures loads. So will it be right to declare like the above in comments? b. I think it adheres to Acquire semantics for platforms where those semantics are supported, else it defaults to full memory ordering. Example: #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) Is it right to declare that the generic version of the function adheres to Acquire semantics? 2. bool pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr) { return TAS((slock_t *) ptr->sema); #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) a. In the above usage (ptr->sema), sema itself is not declared as volatile, so would it be right to use it this way for an API (InterlockedCompareExchange) expecting volatile? b. Also, shouldn't this implementation be moved to atomics-generic-msvc.h? 3. /* * pg_atomic_unlocked_test_flag - TAS() * * No barrier semantics. */ STATIC_IF_INLINE_DECLARE bool pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr); a. There is no setting in this, so using TAS in the function comments seems to be wrong. b. BTW, I am not able to see this function in the C11 specs. 4. #if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32) .. 
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG static inline bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { return pg_atomic_read_u32_impl(ptr) == 1; } #elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) .. #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG static inline bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { return (bool) pg_atomic_read_u32_impl(ptr); } Is there any reason to keep these two implementations different? 5. atomics-generic-gcc.h #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG static inline bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { return ptr->value == 0; } #endif Don't we need to compare it with 1 instead of 0? 6. atomics-fallback.h pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) { /* * Can't do this in the semaphore based implementation - always return * true. Do this in the header so compilers can optimize the test away. */ return true; } Why can't we implement this in the semaphore based implementation? 7. /* * pg_atomic_clear_flag - release lock set by TAS() * * Release/write barrier semantics. */ STATIC_IF_INLINE_DECLARE void pg_atomic_clear_flag(volatile pg_atomic_flag *ptr); a. Are Release and write barriers equivalent? With release semantics, the results of the operation are available after the results of any operation that appears before it in code. A write barrier acts as a compiler barrier, and orders stores. I think the two definitions are not the same. b. IIUC then the current code doesn't have release semantics for unlock/clear. We are planning to introduce it in this patch; also, this doesn't follow the specs, which say that clear should have memory_order_seq_cst ordering semantics. As per its current usage in the patch (S_UNLOCK), it seems reasonable to have *release* semantics for this API; however, I think if we use it for generic purposes then it might not be right to define it with Release semantics. 
8. #define PG_HAS_ATOMIC_CLEAR_FLAG static inline void pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr) { /* XXX: release semantics suffice? */ pg_memory_barrier_impl(); pg_atomic_write_u32_impl(ptr, 0); } Are we sure that having a memory barrier before clearing the flag will achieve *Release* semantics? As per my understanding, the definition of Release semantics is as below, and the above doesn't seem to follow it: With release semantics, the results of the operation are available after the results of any operation that appears before it in code. 9. static inline uint32 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_) { uint32 old; while (true) { old = pg_atomic_read_u32_impl(ptr); if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_)) break; } return old; } What is the reason to implement exchange as compare-and-exchange? At least for Windows I know a corresponding function (InterlockedExchange) exists. I could not see any other implementation of the same. I think this at least deserves a comment explaining why we have done the implementation this way. 10. STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr,
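The exchange-from-CAS fallback quoted in point 9 can be mirrored in portable C11 (a sketch, not the patch itself); platforms with a native exchange instruction, such as InterlockedExchange on Windows or xchg on x86, can provide it directly and skip the loop:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Generic fallback: exchange built from compare-and-exchange.  The loop
 * is needed because another writer (or, with the weak variant, a
 * spurious failure) can invalidate the value read; on failure 'old' is
 * refreshed with the current contents, so the retry always uses a fresh
 * expected value.
 */
static uint32_t
exchange_u32(_Atomic uint32_t *ptr, uint32_t newval)
{
	uint32_t	old = atomic_load_explicit(ptr, memory_order_relaxed);

	while (!atomic_compare_exchange_weak(ptr, &old, newval))
		;						/* retry with the refreshed 'old' */
	return old;					/* the value that was replaced */
}
```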
Re: [HACKERS] better atomics - v0.5
On Fri, Jun 27, 2014 at 9:00 PM, Andres Freund and...@2ndquadrant.com wrote: Imagine the situation for the buffer header spinlock, which is one of the bigger performance issues atm. We could aim to replace all usages of that with clever and complicated logic, but it's hard. The IMO reasonable (and prototyped) way to do it is to make the common paths lockless, but fall back to the spinlock for the more complicated situations. For the buffer header that means that pin/unpin and buffer lookup are lockless, but IO and changing the identity of a buffer still require the spinlock. My attempts to avoid the latter basically required a buffer header specific reimplementation of spinlocks. There is a 2010 paper [1] that demonstrates a fully non-blocking approach to buffer management using the same generalized clock algorithm that PostgreSQL has. The site also has an implementation for Apache Derby. You may find some interesting ideas in there. [1] http://code.google.com/p/derby-nb/source/browse/trunk/derby-nb/ICDE10_conf_full_409.pdf Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] better atomics - v0.5
On 2014-07-02 09:27:52 +0300, Ants Aasma wrote: On Fri, Jun 27, 2014 at 9:00 PM, Andres Freund and...@2ndquadrant.com wrote: Imagine the situation for the buffer header spinlock which is one of the bigger performance issues atm. We could aim to replace all usages of that with clever and complicated logic, but it's hard. The IMO reasonable (and prototyped) way to do it is to make the common paths lockless, but fall back to the spinlock for the more complicated situations. For the buffer header that means that pin/unpin and buffer lookup are lockless, but IO and changing the identity of a buffer still require the spinlock. My attempts to avoid the latter basically required a buffer header specific reimplementation of spinlocks. There is a 2010 paper [1] that demonstrates a fully non-blocking approach to buffer management using the same generalized clock algorithm that PostgreSQL has. The site also has an implementation for Apache Derby. You may find some interesting ideas in there. [1] http://code.google.com/p/derby-nb/source/browse/trunk/derby-nb/ICDE10_conf_full_409.pdf Interesting. Thanks for the link. I think I had pretty much all the parts they described lockless as well (excluding the buffer mapping hashtable itself, which they didn't focus on either), it was just operations like replacing a dirty victim buffer where I fell back to the spinlock. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
Hi, On 2014-06-30 20:28:52 +0200, Andres Freund wrote: To quantify this, on my 2 socket xeon E5520 workstation - which is too small to heavily show the contention problems in pgbench -S - the numbers are:

pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeable variability)
master: 152354.294117
lwlocks-atomics: 170528.784958
lwlocks-atomics-spin: 159416.202578

pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeable variability)
master: 16273.702191
lwlocks-atomics: 16768.712433
lwlocks-atomics-spin: 16744.913083

So, there really isn't a problem with the emulation. It's not actually that surprising - the absolute number of atomic ops is pretty much the same. Where we earlier took a spinlock to increment shared, we now still take one. Tom, in a nearby thread you've voiced concern that using atomics will regress performance for platforms that don't have them. Can these numbers at least ease that a bit? I've compared the atomics vs emulated atomics on 32bit x86, 64bit x86 and POWER7. That's all I have access to at the moment. In all cases the performance with lwlocks on top of emulated atomics was the same as the old spinlock based algorithm, or even a bit better. I'm sure that it's possible to design atomics based algorithms where that's not the case, but I don't think we need to solve those before we meet them. I think for most contended places which can possibly be improved, two things matter: the number of instructions that need to work atomically (i.e. TAS/CAS/xchg for spinlocks, xadd/cmpxchg for atomics) and the duration locks are held. When converting to atomics, even if emulated, the hot paths shouldn't need more locked instructions than before, and often enough the duration during which spinlocks are held will be smaller. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
On 2014-06-26 00:42:37 +0300, Heikki Linnakangas wrote: On 06/25/2014 11:36 PM, Andres Freund wrote: - I completely loathe the file layout you've proposed in src/include/storage. If we're going to have separate #include files for each architecture (and I'd rather we didn't go that route, because it's messy and unlike what we have now), then shouldn't that stuff go in src/include/port or a new directory src/include/port/atomics? Polluting src/include/storage with a whole bunch of files called atomics-whatever.h strikes me as completely ugly, and there's a lot of duplicate code in there. I don't mind moving it somewhere else and it's easy enough to change. I think having a separate file for each architecture is nice. I totally agree that they don't belong in src/include/storage, though. s_lock.h has always been misplaced there, but we've let it be for historical reasons, but now that we're adding a dozen new files, it's time to move them out. So, src/include/port/atomics.h, src/include/port/atomics/generic-$compiler.h, src/include/port/atomics/arch-$arch.h, src/include/port/atomics/fallback.h, src/include/port/atomics/generic.h, src/backend/port/atomics.c ? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-07-01 10:44:19 +0530, Amit Kapila wrote: I think defining barrier support for some atomic ops and not for others will lead to inconsistency with specs and we need to additionally define rules for such atomic ops usage.

Meh. It's quite common that read/load do not have barrier semantics. The majority of read/write users won't need barriers and it'll be expensive to provide them with them.

So in such a case they shouldn't use the atomic API and rather access directly. I think part of the purpose of using atomic ops is also that they should provide synchronisation for read/write among processes; earlier we didn't have any such facility, so wherever required we had no option but to use barriers to achieve the same. One more point here: in the patch, 64-bit atomic read/write are implemented using the *_exchange_* API, won't this automatically ensure that the 64-bit versions have barrier support?

+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
..
+	pg_atomic_compare_exchange_u64_impl(ptr, old, 0);
..
+}

b) It's only supported from vista onwards. Afaik we still support XP.

Well, we have quite good evidence for it working by knowing that postgres has in the past worked on xp? If you have a better idea, send in a patch for today's barrier.h and I'll adopt that.

Here the only point was to check if it is better to use the version of the InterLocked... API which is recommended for the case of Itanium-based processors and will behave the same for others. If you have any inclination/suggestion for the same, then I can try for it.

In this case, I have a question for you.

Either noone has ever tested postgres on itanium windows (quite possible), because afaik _Asm_sched_fence() doesn't work on windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros are for HP's acc on HPUX.

Hmm..
Do you think in such a case, it would have gone in the below define:

#elif defined(__INTEL_COMPILER)
/*
 * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
 */
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier()		__mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier()		_mm_mfence()
#endif

Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's acc, I don't see why?

Sorry, I don't understand what you mean by the above. The point I wanted to clarify here was whether it is possible that the above defines lead to different definitions for Itanium on windows.

But they should go into atomics-generic-acc.h.

Okay, but in your current patch (0001-Basic-atomic-ops-implementation), they seem to be in different files.

-#define pg_compiler_barrier()	__memory_barrier()

Currently this will be considered as the compiler barrier for __INTEL_COMPILER, but after the patch, I don't see this define. I think after the patch it will be compiler_impl specific, but why such a change?

#if defined(__INTEL_COMPILER)
#define pg_compiler_barrier_impl()	__memory_barrier()
#else
#define pg_compiler_barrier_impl()	__asm__ __volatile__("" ::: "memory")
#endif

There earlier was a typo defining pg_compiler_barrier instead of pg_compiler_barrier_impl for intel, maybe that's what you were referring to?

I have tried to search for this in patch file 0001-Basic-atomic-ops-implementation and found that it has been removed from barrier.h but not added anywhere else.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] better atomics - v0.5
On 2014-07-01 10:44:19 +0530, Amit Kapila wrote: I think for such usage, we need to rely on barriers wherever there is a need for synchronisation; a recent example I have noticed is in your patch where we have to use pg_write_barrier() during wakeup. However if we go by the atomic ops definition, then no such dependency is required: if somewhere we need to use pg_atomic_compare_exchange_u32(), after that we don't need to ensure about barriers, because this atomic API adheres to full barrier semantics. I think defining barrier support for some atomic ops and not for others will lead to inconsistency with specs and we need to additionally define rules for such atomic ops usage.

Meh. It's quite common that read/load do not have barrier semantics. The majority of read/write users won't need barriers and it'll be expensive to provide them with them.

b) It's only supported from vista onwards. Afaik we still support XP.

Well, with barrier.h as it stands they'd get a compiler error if it indeed is unsupported? But I think there's something else going on - msft might just be removing XP from its documentation.

Okay, but why would they remove it for MemoryBarrier() and not for InterlockedCompareExchange()? I think it is a bit difficult to predict the reason, and if we want to use anything which is not as per the msft docs, we should do that carefully and have some way to ensure that it will work fine.

Well, we have quite good evidence for it working by knowing that postgres has in the past worked on xp? If you have a better idea, send in a patch for today's barrier.h and I'll adopt that.

In this case, I have a question for you. Un-patched usage in barrier.h is as follows: ..

If I understand correctly the current define mechanism in barrier.h, it will have different definitions for Itanium processors even for windows.
Either noone has ever tested postgres on itanium windows (quite possible), because afaik _Asm_sched_fence() doesn't work on windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros are for HP's acc on HPUX.

Hmm.. Do you think in such a case, it would have gone in the below define:

#elif defined(__INTEL_COMPILER)
/*
 * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
 */
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier()		__mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier()		_mm_mfence()
#endif

Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's acc, I don't see why? But they should go into atomics-generic-acc.h.

-#define pg_compiler_barrier()	__memory_barrier()

Currently this will be considered as the compiler barrier for __INTEL_COMPILER, but after the patch, I don't see this define. I think after the patch it will be compiler_impl specific, but why such a change?

#if defined(__INTEL_COMPILER)
#define pg_compiler_barrier_impl()	__memory_barrier()
#else
#define pg_compiler_barrier_impl()	__asm__ __volatile__("" ::: "memory")
#endif

There earlier was a typo defining pg_compiler_barrier instead of pg_compiler_barrier_impl for intel, maybe that's what you were referring to?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
On 2014-06-30 11:04:53 +0530, Amit Kapila wrote: On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-29 11:53:28 +0530, Amit Kapila wrote: On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com

I think it is better to have some discussion about it. Apart from this, today I noticed atomic read/write doesn't use any barrier semantics and just relies on volatile.

Yes, intentionally so. It's often important to get/set the current value without the cost of emitting a memory barrier. It just has to be a recent value and it actually has to come from memory.

I agree with you that we don't want to incur the cost of a memory barrier for get/set, however it should not be at the cost of correctness.

I really can't follow here. A volatile read/store forces it to go to memory without barriers. The ABIs of all platforms we work with guarantee that 4-byte stores/reads are atomic - we've been relying on that for a long while. And that's actually enforced by the current definition - I think?

Yeah, this is the only point which I am trying to ensure; that's why I sent you a few links in the previous mail where I had got some suspicion that just doing get/set with volatile might lead to correctness issues in some cases.

But this isn't something this patch started doing - we've been doing that for a long while?

Another note I found today while reading more on this, which suggests that my previous observation is right: "Within a thread of execution, accesses (reads and writes) to all volatile objects are guaranteed to not be reordered relative to each other, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization." http://en.cppreference.com/w/c/atomic/memory_order

That's just saying there's no ordering guarantees with volatile stores/reads. I don't see a contradiction?

b) It's only supported from vista onwards. Afaik we still support XP.
#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl()	MemoryBarrier()
#endif

The MemoryBarrier() macro used here is also supported only from vista onwards, so I think reasons similar to those for using MemoryBarrier() can apply to this as well.

I think that's odd because barrier.h has been using MemoryBarrier() without a version test for a long time now. I guess everyone uses a new enough visual studio.

Yeap, or it might be that the people who are not using a new enough version don't have a use case that can hit a problem due to MemoryBarrier().

Well, with barrier.h as it stands they'd get a compiler error if it indeed is unsupported? But I think there's something else going on - msft might just be removing XP from its documentation. Postgres supports building with visual studio 2005 and MemoryBarrier() seems to be supported by it. I think we'd otherwise have seen problems since 9.1 where barrier.h was introduced.

In this case, I have a question for you. Un-patched usage in barrier.h is as follows:

..
#elif defined(__ia64__) || defined(__ia64)
#define pg_compiler_barrier()	_Asm_sched_fence()
#define pg_memory_barrier()		_Asm_mf()
#elif defined(WIN32_ONLY_COMPILER)
/* Should work on both MSVC and Borland. */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier()	_ReadWriteBarrier()
#define pg_memory_barrier()		MemoryBarrier()
#endif

If I understand correctly the current define mechanism in barrier.h, it will have different definitions for Itanium processors even for windows.

Either noone has ever tested postgres on itanium windows (quite possible), because afaik _Asm_sched_fence() doesn't work on windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros are for HP's acc on HPUX.
However the patch defines as below:

#if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
#	include "storage/atomics-generic-gcc.h"
#elif defined(WIN32_ONLY_COMPILER)
#	include "storage/atomics-generic-msvc.h"

What I can understand from the above is that the defines in storage/atomics-generic-msvc.h will override any previous defines for compiler/memory barriers, and _ReadWriteBarrier()/MemoryBarrier() will always be considered for Windows.

Well, the memory barrier is surrounded by #ifndef pg_memory_barrier_impl. The compiler barrier can't reasonably be defined earlier since it's a compiler, not an architecture, thing.

6. pg_atomic_compare_exchange_u32() It is better to have comments above this and all other related functions.

Okay, generally code has such comments on top of the function definition rather than with the declaration. I don't want to have it on the _impl functions because they're duplicated for the individual platforms and will just become out of date...

Sure, I was also not asking for _impl functions. What I was asking in this point was to have comments on top of the definition of
Re: [HACKERS] better atomics - v0.5
On Sat, Jun 28, 2014 at 4:25 AM, Andres Freund and...@2ndquadrant.com wrote: To make pin/unpin et al safe without acquiring the spinlock the pincount and the flags had to be moved into one variable so that when incrementing the pincount you automatically get the flagbits back. If it's currently valid all is good, otherwise the spinlock needs to be acquired. But for that to work you need to set flags while the spinlock is held. Possibly you can come up with ways to avoid that, but I am pretty damn sure that the code will be less robust by forbidding atomics inside a critical section, not the contrary. It's a good idea to avoid it, but not at all costs.

You might be right, but I'm not convinced.

Why? Anything I can do to convince you of this? Note that other users of atomics (e.g. the linux kernel) widely use atomics inside spinlock protected regions.

The Linux kernel isn't a great example, because they can ensure that they aren't preempted while holding the spinlock. We can't, and that's a principal part of my concern here.

More broadly, it doesn't seem consistent. I think that other projects also sometimes write code that acquires a spinlock while holding another spinlock, and we don't do that; in fact, we've elevated that to a principle, regardless of whether it performs well in practice. In some cases, the CPU instruction that we issue to acquire that spinlock might be the exact same instruction we'd issue to manipulate an atomic variable. I don't see how it can be right to say that a spinlock-protected critical section is allowed to manipulate an atomic variable with cmpxchg, but not allowed to acquire another spinlock with cmpxchg. Either acquiring the second spinlock is OK, in which case our current coding rule is wrong, or manipulating the atomic variable is not OK, either. I don't see how we can have that both ways.

What I'm basically afraid of is that this will work fine in many cases but have really ugly failure modes.
That's pretty much what happens with spinlocks already - the overhead is insignificant at low levels of contention, but as the spinlock starts to become contended the CPUs all fight over the cache line and performance goes to pot. ISTM that making the spinlock critical section significantly longer by putting atomic ops in there could appear to win in some cases while being very bad in others.

Well, I'm not saying it's something I suggest doing all the time. But if using an atomic op in the slow path allows you to remove the spinlock from 99% of the cases I don't see it having a significant impact. In most scenarios (where atomics aren't emulated, i.e. platforms we expect to be used in heavily concurrent cases) the spinlock and the atomic will be on the same cacheline, making stalls much less likely.

And this again is my point: why can't we make the same argument about two spinlocks situated on the same cache line? I don't have a bit of trouble believing that doing the same thing with a couple of spinlocks could sometimes work out well, too, but Tom is adamantly opposed to that. I don't want to just make an end run around that issue without understanding whether he has a valid point.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] better atomics - v0.5
On 2014-06-30 12:15:06 -0400, Robert Haas wrote: More broadly, it doesn't seem consistent. I think that other projects also sometimes write code that acquires a spinlock while holding another spinlock, and we don't do that; in fact, we've elevated that to a principle, regardless of whether it performs well in practice.

It's actually more than a principle: it's now required for correctness, because otherwise the semaphore based spinlock implementation will deadlock.

In some cases, the CPU instruction that we issue to acquire that spinlock might be the exact same instruction we'd issue to manipulate an atomic variable. I don't see how it can be right to say that a spinlock-protected critical section is allowed to manipulate an atomic variable with cmpxchg, but not allowed to acquire another spinlock with cmpxchg. Either acquiring the second spinlock is OK, in which case our current coding rule is wrong, or manipulating the atomic variable is not OK, either. I don't see how we can have that both ways.

The "no nested spinlocks" thing used to be a guideline, one nobody had big problems with because it didn't prohibit solutions to problems. Since guidelines are more useful when simple, it's "no nested spinlocks!".

Also, those two cmpxchgs aren't the same: the CAS for spinlock acquisition will only succeed if the lock isn't acquired. If the lock holder sleeps you'll have to wait for it to wake up. In contrast to that, CASs for lockfree (or lock-reduced) algorithms won't be blocked if the last manipulation was done by a backend that's now sleeping. It'll possibly loop once, get the new value, and retry the CAS. Yes, it can still take a couple of iterations under heavy concurrency, but you don't have the problem that the lockholder goes to sleep.

Now, you can argue that the spinlock based fallbacks make that difference moot, but I think that's just the price for an emulation fringe platforms will have to pay. They aren't platforms used under heavy concurrency.
What I'm basically afraid of is that this will work fine in many cases but have really ugly failure modes. That's pretty much what happens with spinlocks already - the overhead is insignificant at low levels of contention, but as the spinlock starts to become contended the CPUs all fight over the cache line and performance goes to pot. ISTM that making the spinlock critical section significantly longer by putting atomic ops in there could appear to win in some cases while being very bad in others.

Well, I'm not saying it's something I suggest doing all the time. But if using an atomic op in the slow path allows you to remove the spinlock from 99% of the cases I don't see it having a significant impact. In most scenarios (where atomics aren't emulated, i.e. platforms we expect to be used in heavily concurrent cases) the spinlock and the atomic will be on the same cacheline, making stalls much less likely.

And this again is my point: why can't we make the same argument about two spinlocks situated on the same cache line?

Because it's not relevant? Where and why do we want to acquire two spinlocks that are on the same cacheline? As far as I know there's never been an actual need for nested spinlocks. So the guideline can be as restrictive as it is because the price is low and there's no benefit in making it more complex.

I think this line of discussion is pretty backwards. The reason I (and seemingly Heikki) want to use atomics is that it can reduce contention significantly. That contention is today to a good part based on spinlocks. Your argument is basically that doing things that can take more than a fixed number of cycles while holding a spinlock can increase contention. But the changes we've talked about only make sense if they significantly reduce spinlock contention in the first place - the potential for a couple of iterations while the lock is held had better not be relevant for the proposed patches. I don't think a blanket rule makes sense here.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
Robert Haas robertmh...@gmail.com writes: ... this again is my point: why can't we make the same argument about two spinlocks situated on the same cache line? I don't have a bit of trouble believing that doing the same thing with a couple of spinlocks could sometimes work out well, too, but Tom is adamantly opposed to that. I think you might be overstating my position here. What I'm concerned about is that we be sure that spinlocks are held for a sufficiently short time that it's very unlikely that we get pre-empted while holding one. I don't have any particular bright line about how short a time that is, but more than a few instructions worries me. As you say, the Linux kernel is a bad example to follow because it hasn't got a risk of losing its timeslice while holding a spinlock. The existing coding rules discourage looping (though I might be okay with a small constant loop count), and subroutine calls (mainly because somebody might add $random_amount_of_work to the subroutine if they don't realize it can be called while holding a spinlock). Both of these rules are meant to reduce the risk that a short interval balloons into a long one due to unrelated coding changes. The existing coding rules also discourage spinlocking within a spinlock, and the reason for that is that there's no very clear upper bound to the time required to obtain a spinlock, so that there would also be no clear upper bound to the time you're holding the original one (thus possibly leading to cascading performance problems). So ISTM the question we ought to be asking is whether atomic operations have bounded execution time, or more generally what the distribution of execution times is likely to be. I'd be OK with an answer that includes sometimes it can be long so long as sometimes is pretty damn seldom. Spinlocks have a nonzero risk of taking a long time already, since we can't entirely prevent the possibility of losing our timeslice while holding one. 
The issue here is just to be sure that that happens seldom enough that it doesn't cause performance problems. If we fail to do that we might negate all the desired performance improvements from adopting atomic ops in the first place.

regards, tom lane
Re: [HACKERS] better atomics - v0.5
On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane t...@sss.pgh.pa.us wrote: The existing coding rules also discourage spinlocking within a spinlock, and the reason for that is that there's no very clear upper bound to the time required to obtain a spinlock, so that there would also be no clear upper bound to the time you're holding the original one (thus possibly leading to cascading performance problems). Well, as Andres points out, commit daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad requirement than a suggestion. If it's sometimes OK to acquire a spinlock from within a spinlock, then that commit was badly misguided; but the design of that patch was pretty much driven by your insistence that that was never, ever OK. If that was wrong, then we should reconsider that whole commit more broadly; but if it was right, then the atomics patch shouldn't drill a hole through it to install an exception just for emulating atomics. I'm personally not convinced that we're approaching this topic in the right way. I'm not convinced that it's at all reasonable to try to emulate atomics on platforms that don't have them. I would punt the problem into the next layer and force things like lwlock.c to have fallback implementations at that level that don't require atomics, and remove those fallback implementations if and when we move the goalposts so that all supported platforms must have working atomics implementations. People who write code that uses atomics are not likely to think about how those algorithms will actually perform when those atomics are merely emulated, and I suspect that means that in practice platforms that have only emulated atomics are going to regress significantly vs. the status quo today. 
If this patch goes in the way it's written right now, then I think it basically amounts to throwing platforms that don't have working atomics emulation under the bus, and my vote is that if we're going to do that, we should do it explicitly: sorry, we now only support platforms with working atomics. You don't have any, so you can't run PostgreSQL. Have a nice day. If we *don't* want to make that decision, then I'm not convinced that the approach this patch takes is in any way viable, completely aside from this particular question.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] better atomics - v0.5
On 2014-06-30 13:15:23 -0400, Robert Haas wrote: On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane t...@sss.pgh.pa.us wrote: The existing coding rules also discourage spinlocking within a spinlock, and the reason for that is that there's no very clear upper bound to the time required to obtain a spinlock, so that there would also be no clear upper bound to the time you're holding the original one (thus possibly leading to cascading performance problems).

Well, as Andres points out, commit daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad requirement than a suggestion. If it's sometimes OK to acquire a spinlock from within a spinlock, then that commit was badly misguided; but the design of that patch was pretty much driven by your insistence that that was never, ever OK. If that was wrong, then we should reconsider that whole commit more broadly; but if it was right, then the atomics patch shouldn't drill a hole through it to install an exception just for emulating atomics.

Well, since nobody has a problem with the rule of not having nested spinlocks and since the problems aren't the same, I'm failing to see why we should reconsider this. My recollection was that we primarily discussed whether it'd be a good idea to add code to Assert() when spinlocks are nested, to which Tom responded that it's not necessary because critical sections should be short enough to immediately see that that's not a problem. Then we argued a bit back and forth around that.

I'm personally not convinced that we're approaching this topic in the right way. I'm not convinced that it's at all reasonable to try to emulate atomics on platforms that don't have them. I would punt the problem into the next layer and force things like lwlock.c to have fallback implementations at that level that don't require atomics, and remove those fallback implementations if and when we move the goalposts so that all supported platforms must have working atomics implementations.
If we're requiring an atomics-less implementation for lwlock.c, bufmgr.c et al. I'll stop working on those features (and by extension on atomics). It's an utter waste of resources and maintainability trying to maintain parallel high level locking algorithms in several pieces of the codebase. The nonatomic versions will stagnate and won't actually work under load. I'd rather spend my time on something useful. The spinlock based atomics emulation is *small*. It's easy to reason about correctness there. If we're going that way, we've made a couple of absolute fringe platforms hold postgres, as actually used in reality, hostage. I think that's insane.

People who write code that uses atomics are not likely to think about how those algorithms will actually perform when those atomics are merely emulated, and I suspect that means that in practice platforms that have only emulated atomics are going to regress significantly vs. the status quo today.

I think you're overstating the likely performance penalty for nonparallel platforms/workloads here quite a bit. The platforms without chances of getting atomics implemented aren't ones where that's a likely thing. Yes, you'll see a regression if you run a readonly pgbench over a 4 node NUMA system - but it's not large. You won't see much improved performance in comparison to 9.4, but I think that's primarily it. Also it's not like this is something new - the semaphore based fallback *sucks* but we still have it. *Many* features in new major versions regress performance for fringe platforms. That's normal.

If this patch goes in the way it's written right now, then I think it basically amounts to throwing platforms that don't have working atomics emulation under the bus

Why? Nobody is going to run 9.5 under high performance workloads on such platforms. They'll be so IO and RAM starved that the spinlocks won't be the bottleneck.
, and my vote is that if we're going to do that, we should do it explicitly: sorry, we now only support platforms with working atomics. You don't have any, so you can't run PostgreSQL. Have a nice day.

I'd be fine with that. I don't think we're gaining much by those platforms. *But* I don't really see why it's required at this point. The fallback works and unless you're using semaphore based spinlocks the performance isn't bad. The lwlock code doesn't really get slower/faster in comparison to before when the atomics implementation is backed by semaphores. It's pretty much the same number of syscalls using the atomics/current implementation.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
Robert Haas robertmh...@gmail.com writes: I'm personally not convinced that we're approaching this topic in the right way. I'm not convinced that it's at all reasonable to try to emulate atomics on platforms that don't have them. I would punt the problem into the next layer and force things like lwlock.c to have fallback implementations at that level that don't require atomics, and remove those fallback implementations if and when we move the goalposts so that all supported platforms must have working atomics implementations. People who write code that uses atomics are not likely to think about how those algorithms will actually perform when those atomics are merely emulated, and I suspect that means that in practice platforms that have only emulated atomics are going to regress significantly vs. the status quo today.

I think this is a valid objection, and I for one am not prepared to say that we no longer care about platforms that don't have atomic ops (especially not if it's not a *very small* set of atomic ops). Also, just because a platform claims to have atomic ops doesn't mean that those ops perform well. If there's a kernel trap involved, they don't, at least not for our purposes. We're only going to be bothering with installing atomic-op code in places that are contention bottlenecks for us already, so we are not going to be happy with the results for any atomic-op implementation that's not industrial strength. This is one reason why I'm extremely suspicious of depending on gcc's intrinsics for this; that will not make the issue go away, only make it beyond our power to control.

regards, tom lane
Re: [HACKERS] better atomics - v0.5
On 2014-06-30 12:54:10 -0400, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: ... this again is my point: why can't we make the same argument about two spinlocks situated on the same cache line? I don't have a bit of trouble believing that doing the same thing with a couple of spinlocks could sometimes work out well, too, but Tom is adamantly opposed to that.

I think you might be overstating my position here. What I'm concerned about is that we be sure that spinlocks are held for a sufficiently short time that it's very unlikely that we get pre-empted while holding one. I don't have any particular bright line about how short a time that is, but more than a few instructions worries me. As you say, the Linux kernel is a bad example to follow because it hasn't got a risk of losing its timeslice while holding a spinlock.

Well, they actually have preemptable and irq-interruptible versions for some of these locks, but whatever :).

So ISTM the question we ought to be asking is whether atomic operations have bounded execution time, or more generally what the distribution of execution times is likely to be. I'd be OK with an answer that includes sometimes it can be long so long as sometimes is pretty damn seldom.

I think it ranges from 'never' to 'sometimes, but pretty darn seldom', depending on the platform and emulation. And, leaving the emulation and badly written code aside, there's no issue with another backend sleeping while the atomic variable needs to be modified.

Spinlocks have a nonzero risk of taking a long time already, since we can't entirely prevent the possibility of losing our timeslice while holding one. The issue here is just to be sure that that happens seldom enough that it doesn't cause performance problems. If we fail to do that we might negate all the desired performance improvements from adopting atomic ops in the first place.

I think individual patches will have to stand up to a test like this.
Modifying an atomic variable inside a spinlock is only going to be worth it if it allows us to avoid spinlocks altogether 95%+ of the time. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
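To make the "95%+ fastpath" shape concrete, here is a minimal sketch (all names and the flag-bit layout are hypothetical, not taken from the patch): a pin count is decremented with a single CAS unless a flag bit indicates the header spinlock is held, in which case the caller must fall back to the locked slowpath. C11 atomics stand in for the proposed pg_atomic API.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical: high bit of the state word marks "header spinlock held". */
#define DEMO_LOCKED_FLAG (1u << 31)

static bool
release_pin_fastpath(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load(state);

    while (!(old & DEMO_LOCKED_FLAG))
    {
        /* on failure, 'old' is refreshed with the current value for us */
        if (atomic_compare_exchange_weak(state, &old, old - 1))
            return true;        /* released without touching the spinlock */
    }
    return false;               /* caller must take the slowpath/spinlock */
}
```

The point of the thread applies directly: this only pays off if the `return true` path is taken the overwhelming majority of the time.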
Re: [HACKERS] better atomics - v0.5
On 2014-06-30 19:44:47 +0200, Andres Freund wrote: On 2014-06-30 13:15:23 -0400, Robert Haas wrote: People who write code that uses atomics are not likely to think about how those algorithms will actually perform when those atomics are merely emulated, and I suspect that means that in practice platforms that have only emulated atomics are going to regress significantly vs. the status quo today. I think you're overstating the likely performance penalty for nonparallel platforms/workloads here quite a bit. The platforms without chances of getting atomics implemented aren't ones where that's a likely problem. Yes, you'll see a regression if you run a readonly pgbench over a 4 node NUMA system - but it's not large. You won't see much improved performance in comparison to 9.4, but I think that's primarily it. To quantify this, on my 2 socket Xeon E5520 workstation - which is too small to clearly show the contention problems in pgbench -S - the numbers are: pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeable variability) master: 152354.294117 lwlocks-atomics: 170528.784958 lwlocks-atomics-spin: 159416.202578 pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeable variability) master: 16273.702191 lwlocks-atomics: 16768.712433 lwlocks-atomics-spin: 16744.913083 So, there really isn't a problem with the emulation. It's not actually that surprising - the absolute number of atomic ops is pretty much the same. Where we earlier took a spinlock to increment shared we now still take one. I expect that repeating this on the 4 socket machine will show a large gap between lwlocks-atomics and the other two. The other two will be about the same, with lwlocks-atomics-spin remaining a bit better than master. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On 2014-06-30 13:45:52 -0400, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: I'm personally not convinced that we're approaching this topic in the right way. I'm not convinced that it's at all reasonable to try to emulate atomics on platforms that don't have them. I would punt the problem into the next layer and force things like lwlock.c to have fallback implementations at that level that don't require atomics, and remove those fallback implementations if and when we move the goalposts so that all supported platforms must have working atomics implementations. People who write code that uses atomics are not likely to think about how those algorithms will actually perform when those atomics are merely emulated, and I suspect that means that in practice platforms that have only emulated atomics are going to regress significantly vs. the status quo today. I think this is a valid objection, and I for one am not prepared to say that we no longer care about platforms that don't have atomic ops (especially not if it's not a *very small* set of atomic ops). It's still TAS(), cmpxchg(), xadd. With TAS/xadd possibly implemented via cmpxchg (which quite possibly *is* the native implementation for $arch). Also, just because a platform claims to have atomic ops doesn't mean that those ops perform well. If there's a kernel trap involved, they don't, at least not for our purposes. Yea. I think if we start to use 8byte atomics at some later point we're going to have to prohibit e.g. older arms from using them, even if the compiler claims they're supported. But that's just one #define. This is one reason why I'm extremely suspicious of depending on gcc's intrinsics for this; that will not make the issue go away, only make it beyond our power to control. The patch as submitted makes it easy to provide a platform specific implementation of the important routines if it's likely that it's better than the intrinsics provided one. 
I'm not too worried about the intrinsics though - the number of projects relying on them these days is quite large. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
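Andres's remark that TAS and xadd can be "implemented via cmpxchg" is easy to illustrate. The sketch below builds a fetch-and-add out of a compare-and-swap retry loop, which is essentially what a generic fallback has to do on architectures whose native primitive is CAS; GCC's `__sync` builtin stands in for the platform cmpxchg, and the function name is illustrative, not PostgreSQL's.

```c
#include <stdint.h>

/* Sketch: xadd built from cmpxchg alone.  Retries until no other thread
 * modified the value between our snapshot and the CAS. */
static uint32_t
fetch_add_via_cas(volatile uint32_t *ptr, uint32_t add)
{
    uint32_t old;

    do
    {
        old = *ptr;             /* snapshot the current value */
    } while (__sync_val_compare_and_swap(ptr, old, old + add) != old);

    return old;                 /* value before the add, like xadd */
}
```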
Re: [HACKERS] better atomics - v0.5
On Mon, Jun 30, 2014 at 07:44:47PM +0200, Andres Freund wrote: On 2014-06-30 13:15:23 -0400, Robert Haas wrote: I'm personally not convinced that we're approaching this topic in the right way. I'm not convinced that it's at all reasonable to try to emulate atomics on platforms that don't have them. I would punt the problem into the next layer and force things like lwlock.c to have fallback implementations at that level that don't require atomics, and remove those fallback implementations if and when we move the goalposts so that all supported platforms must have working atomics implementations. If we're requiring an atomics-less implementation for lwlock.c, bufmgr.c et al., I'll stop working on those features (and by extension on atomics). It's an utter waste of resources and maintainability trying to maintain parallel high level locking algorithms in several pieces of the codebase. The nonatomic versions will stagnate and won't actually work under load. I'd rather spend my time on something useful. The spinlock based atomics emulation is *small*. It's easy to reason about correctness there. If we're going that way, we've made a couple of absolute fringe platforms hold postgres, as actually used in reality, hostage. I think that's insane. I agree completely. If this patch goes in the way it's written right now, then I think it basically amounts to throwing platforms that don't have working atomics emulation under the bus, and my vote is that if we're going to do that, we should do it explicitly: sorry, we now only support platforms with working atomics. You don't have any, so you can't run PostgreSQL. Have a nice day. I might share that view if a platform faced a huge and easily-seen performance regression, say a 40% regression to 5-client performance. I don't expect the shortfall to reach that ballpark for any atomics-exploiting algorithm we'll adopt, and your (Andres's) latest measurements corroborate that. 
-- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
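The "spinlock based atomics emulation is *small*" claim is easy to see in a sketch: every operation on the emulated atomic funnels through one embedded spinlock, so correctness is trivial to reason about. Here C11's `atomic_flag` stands in for PostgreSQL's s_lock spinlock, and the type and function names are illustrative, not the patch's.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Emulated atomic: a plain value guarded by an embedded spinlock. */
typedef struct
{
    atomic_flag lock;           /* stands in for an s_lock spinlock */
    uint32_t    value;
} emulated_atomic_u32;

static uint32_t
emulated_fetch_add_u32(emulated_atomic_u32 *ptr, int32_t add)
{
    uint32_t old;

    while (atomic_flag_test_and_set_explicit(&ptr->lock, memory_order_acquire))
        ;                       /* spin; real code would back off */
    old = ptr->value;
    ptr->value += (uint32_t) add;
    atomic_flag_clear_explicit(&ptr->lock, memory_order_release);
    return old;
}
```

Note the nesting rule discussed upthread: because this emulation lock can be taken while another spinlock is held, atomic ops under a spinlock remain safe even on emulated platforms, while spinlock-in-spinlock stays forbidden.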
Re: [HACKERS] better atomics - v0.5
On Mon, Jun 30, 2014 at 2:54 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-30 11:04:53 +0530, Amit Kapila wrote: On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund and...@2ndquadrant.com wrote: Yes, intentionally so. It's often important to get/set the current value without the cost of emitting a memory barrier. It just has to be a recent value and it actually has to come from memory. I agree with you that we don't want to incur the cost of a memory barrier for get/set, however it should not be at the cost of correctness. I really can't follow here. A volatile read/store forces it to go to memory without barriers. The ABIs of all platforms we work with guarantee that 4-byte stores/reads are atomic - we've been relying on that for a long while. I think for such usage, we need to rely on barriers wherever there is a need for synchronisation; a recent example I have noticed is in your patch where we have to use pg_write_barrier() during wakeup. However if we go by the atomic ops definition, then no such dependency is required: if somewhere we need to use pg_atomic_compare_exchange_u32(), after that we don't need to worry about barriers, because this atomic API adheres to full barrier semantics. I think defining barrier support for some atomic ops and not for others will lead to inconsistency with the specs, and we need to additionally define rules for such atomic ops usage. I think we can still rely on volatile read/store for an ordering guarantee within the same thread of execution, and wherever we need synchronisation among different processes, we can directly use atomic ops (get/set) without the need to use barriers. b) It's only supported from vista onwards. Afaik we still support XP. Well, with barrier.h as it stands they'd get a compiler error if it indeed is unsupported? But I think there's something else going on - msft might just be removing XP from its documentation. 
Okay, but why would they remove it for MemoryBarrier() and not for InterlockedCompareExchange()? I think it is a bit difficult to predict the reason, and if we want to use anything which is not as per the msft docs, we shall do that carefully and have some way to ensure that it will work fine. In this case, I have a question for you. Un-patched usage in barrier.h is as follows: .. If I understand correctly the current define mechanism in barrier.h, it will have a different definition for Itanium processors even for windows. Either no one has ever tested postgres on itanium windows (quite possible), because afaik _Asm_sched_fence() doesn't work on windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros are for HP's acc on HPUX. Hmm.. Do you think in such a case, it would have gone into the below define: #elif defined(__INTEL_COMPILER) /* * icc defines __GNUC__, but doesn't support gcc's inline asm syntax */ #if defined(__ia64__) || defined(__ia64) #define pg_memory_barrier() __mf() #elif defined(__i386__) || defined(__x86_64__) #define pg_memory_barrier() _mm_mfence() #endif -#define pg_compiler_barrier() __memory_barrier() Currently this will be considered as the compiler barrier for __INTEL_COMPILER, but after the patch, I don't see this define. I think after the patch, it will be compiler_impl specific, but why such a change? However the patch defines as below: .. What I can understand from the above is that the defines in storage/atomics-generic-msvc.h will override any previous defines for compiler/memory barriers, and _ReadWriteBarrier()/MemoryBarrier() will be considered for Windows always. Well, the memory barrier is surrounded by #ifndef pg_memory_barrier_impl. The compiler barrier can't reasonably be defined earlier since it's a compiler not an architecture thing. I think as per your new compiler specific implementation stuff, this holds true. 6. Sure, I was also not asking for _impl functions. 
What I was asking in this point was to have comments on top of the definition of pg_atomic_compare_exchange_u32() in atomics.h. In particular, on top of the below and similar functions, rather than at the place where they are declared. Hm, we can do that. Don't think it'll be clearer (because you need to go to the header anyway), but I don't feel strongly. I think this depends on an individual's perspective, so do it the way you feel is better; the intention of this point was let's not deviate from existing coding conventions. I'd much rather get rid of the separated definition/declaration, but we'd need to rely on PG_USE_INLINE for it... Okay. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] better atomics - v0.5
On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-28 11:58:41 +0530, Amit Kapila wrote: 1. +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, + uint32 *expected, uint32 newval) +{ + bool ret; + uint32 current; + current = InterlockedCompareExchange(&ptr->value, newval, *expected); Is there a reason why using InterlockedCompareExchange() is better than using the InterlockedCompareExchangeAcquire() variant? Two things: a) compare_exchange is a read/write operator and so far I've defined it to have full barrier semantics (documented in atomics.h). I'd be happy to discuss which semantics we want for which operation. I think it is better to have some discussion about it. Apart from this, today I noticed atomic read/write doesn't use any barrier semantics and just relies on volatile. I think it might not be sane in all cases; for example, with Visual Studio 2003, volatile to volatile references are ordered; the compiler will not re-order volatile variable access. However, these operations could be re-ordered by the processor. For a detailed example, refer to the link below: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686355(v=vs.85).aspx Also if we look at the C11 specs, both load and store use memory_order_seq_cst by default, which is the same as for compare and exchange; I have referred to the link below: http://en.cppreference.com/w/c/atomic/atomic_store b) It's only supported from vista onwards. Afaik we still support XP. #ifndef pg_memory_barrier_impl #define pg_memory_barrier_impl() MemoryBarrier() #endif The MemoryBarrier() macro used is also supported only from vista onwards, so I think reasons similar to those for using MemoryBarrier() can apply here as well. c) It doesn't make any difference on x86 ;) What about processors like Itanium which care about acquire/release memory barrier semantics? 2. +pg_atomic_compare_exchange_u32_impl() { .. + *expected = current; } Can we implement this API without changing the expected value? .. 
I think this value is required for the lwlock patch, but I am wondering why the same can't be achieved if we just return the *current* value and then let the lwlock patch do the handling based on it. This will have another advantage that our pg_* API will also have a similar signature to the native APIs. Many usages of compare/exchange require that you'll get the old value back in an atomic fashion. Unfortunately the Interlocked* API doesn't provide a nice way to do that. Yes, but I think the same cannot be accomplished even by using expected. Note that this definition of compare/exchange maps nicely both to C11's atomics and the actual x86 cmpxchg instruction. I've generally tried to mimic C11's API as far as it's possible. There are some limitations because we can't do generic types and such, but other than that. If I am reading C11's spec for this API correctly, then it says as below: Atomically compares the value pointed to by obj with the value pointed to by expected, and if those are equal, replaces the former with desired (performs read-modify-write operation). Otherwise, loads the actual value pointed to by obj into *expected (performs load operation). So it essentially means that it loads the actual value into *expected only if the operation is not successful. 3. Is there an overhead of using Add, instead of the increment/decrement version of Interlocked? There's a theoretical difference for IA64, which has a special increment/decrement operation that can only change the value by a rather limited set of amounts. I don't think it's worth caring about. Okay. 4. .. There is a Caution notice on the Microsoft site indicating _ReadWriteBarrier/MemoryBarrier are deprecated. It seemed to be the most widely available API, and it's what barrier.h already used. Do you have a different suggestion? I am trying to think based on the suggestion given by Microsoft, but am not completely clear how to accomplish the same at this moment. 6. 
pg_atomic_compare_exchange_u32() It is better to have comments above this and all other related functions. Check atomics.h, there's comments above it: /* * pg_atomic_compare_exchange_u32 - CAS operation * * Atomically compare the current value of ptr with *expected and store newval * iff ptr and *expected have the same value. The current value of *ptr will * always be stored in *expected. * * Return whether the values have been exchanged. * * Full barrier semantics. */ Okay, generally code has such comments on top of function definition rather than with declaration. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] better atomics - v0.5
On 2014-06-29 11:53:28 +0530, Amit Kapila wrote: On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com Two things: a) compare_exchange is a read/write operator and so far I've defined it to have full barrier semantics (documented in atomics.h). I'd be happy to discuss which semantics we want for which operation. I think it is better to have some discussion about it. Apart from this, today I noticed atomic read/write doesn't use any barrier semantics and just rely on volatile. Yes, intentionally so. It's often important to get/set the current value without the cost of emitting a memory barrier. It just has to be a recent value and it actually has to come from memory. And that's actually enforced by the current definition - I think? b) It's only supported from vista onwards. Afaik we still support XP. #ifndef pg_memory_barrier_impl #define pg_memory_barrier_impl() MemoryBarrier() #endif The MemoryBarrier() macro used also supports only on vista onwards, so I think for reasons similar to using MemoryBarrier() can apply for this as well. I think that's odd because barrier.h has been using MemoryBarrier() without a version test for a long time now. I guess everyone uses a new enough visual studio. Those intrinsics aren't actually OS but just compiler dependent. Otherwise we could just define it as: FORCEINLINE VOID MemoryBarrier ( VOID ) { LONG Barrier; __asm { xchg Barrier, eax } } c) It doesn't make any difference on x86 ;) What about processors like Itanium which care about acquire/release memory barrier semantics? Well, it still will be correct? I don't think it makes much sense to focus overly much on itanium here with the price of making anything more complicated for others. I think this value is required for lwlock patch, but I am wondering why can't the same be achieved if we just return the *current* value and then let lwlock patch do the handling based on it. 
This will have another advantage that our pg_* API will also have a similar signature to the native APIs. Many usages of compare/exchange require that you'll get the old value back in an atomic fashion. Unfortunately the Interlocked* API doesn't provide a nice way to do that. Yes, but I think the same cannot be accomplished even by using expected. More complicatedly so, yes? I don't think we want those comparisons on practically every callsite. Note that this definition of compare/exchange maps nicely both to C11's atomics and the actual x86 cmpxchg instruction. I've generally tried to mimic C11's API as far as it's possible. There are some limitations because we can't do generic types and such, but other than that. If I am reading C11's spec for this API correctly, then it says as below: Atomically compares the value pointed to by obj with the value pointed to by expected, and if those are equal, replaces the former with desired (performs read-modify-write operation). Otherwise, loads the actual value pointed to by obj into *expected (performs load operation). So it essentially means that it loads the actual value into *expected only if the operation is not successful. Yes. But in the case it's successful it will already contain the right value. 4. .. There is a Caution notice on the Microsoft site indicating _ReadWriteBarrier/MemoryBarrier are deprecated. It seemed to be the most widely available API, and it's what barrier.h already used. Do you have a different suggestion? I am trying to think based on the suggestion given by Microsoft, but am not completely clear how to accomplish the same at this moment. Well, they refer to C11 stuff. But I think it'll be a while before it makes sense to use a fallback based on that. 6. pg_atomic_compare_exchange_u32() It is better to have comments above this and all other related functions. 
Check atomics.h, there's comments above it: /* * pg_atomic_compare_exchange_u32 - CAS operation * * Atomically compare the current value of ptr with *expected and store newval * iff ptr and *expected have the same value. The current value of *ptr will * always be stored in *expected. * * Return whether the values have been exchanged. * * Full barrier semantics. */ Okay, generally code has such comments on top of function definition rather than with declaration. I don't want to have it on the _impl functions because they're duplicated for the individual platforms and will just become out of date... Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
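The "it will already contain the right value" point is why the *expected write-back saves a comparison at practically every callsite. C11's atomic_compare_exchange has exactly the semantics documented in atomics.h above, so a retry loop can be written without any extra load on failure; the example operation (setting a flag bit) is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Retry loop exploiting CAS write-back: on failure, 'expected' is
 * refreshed with the current value, so no separate re-read is needed. */
static uint32_t
atomic_set_bit(_Atomic uint32_t *ptr, uint32_t bit)
{
    uint32_t expected = atomic_load(ptr);

    while (!atomic_compare_exchange_strong(ptr, &expected, expected | bit))
        ;                       /* 'expected' already holds the new snapshot */
    return expected;            /* value as it was before our update */
}
```

With an API that only returned success/failure, every iteration would need an explicit load plus a comparison at the callsite instead.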
Re: [HACKERS] better atomics - v0.5
On 2014-06-28 19:36:39 +0300, Heikki Linnakangas wrote: On 06/27/2014 08:15 PM, Robert Haas wrote: On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote: I don't really see use cases where it's not related data that's being touched, but: The point is that the fastpath (not using a spinlock) might touch the atomic variable, even while the slowpath has acquired the spinlock. So the slowpath can't just use non-atomic operations on the atomic variable. Imagine something like releasing a buffer pin while somebody else is doing something that requires holding the buffer header spinlock. If the atomic variable can be manipulated without the spinlock under *any* circumstances, then how is it a good idea to ever manipulate it with the spinlock? With the WALInsertLock scaling stuff in 9.4, there are now two variables protected by a spinlock: the current WAL insert location, and the prev pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you need to grab the spinlock to update both of them atomically. But to just read the WAL insert pointer, you could do an atomic read of CurrBytePos if the architecture supports it - as things stand, you have to grab the spinlock. Admittedly that's just an atomic read, not an atomic compare and exchange or fetch-and-add. There are several architectures where you can do atomic 8-byte reads, but only by using cmpxchg or similar, *not* using a plain read... So I think it's actually a fair example. Or if the architecture has an atomic 128-bit compare exchange op you could replace the spinlock with that. But it's not hard to imagine similar situations where you sometimes need to lock a larger data structure to modify it atomically, but sometimes you just need to modify part of it and an atomic op would suffice. Yes. I thought Andres' LWLock patch also did something like that. If the lock is not contended, you can acquire it with an atomic compare exchange to increment the exclusive/shared counter. 
But to manipulate the wait queue, you need the spinlock. Right. It just happens that the algorithm requires that none of the atomic ops have to be under the spinlock... But I generally think that we'll see more slowpath/fastpath-like situations cropping up where we can't always avoid the atomic op while holding the lock. How about adding a paragraph that explains that it's better to avoid using atomics while holding a spinlock because of the prolonged runtime, especially due to the emulation and cache misses? But also saying it's OK if the algorithm needs it and the benefit is sufficient? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-29 11:53:28 +0530, Amit Kapila wrote: On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com I think it is better to have some discussion about it. Apart from this, today I noticed atomic read/write doesn't use any barrier semantics and just relies on volatile. Yes, intentionally so. It's often important to get/set the current value without the cost of emitting a memory barrier. It just has to be a recent value and it actually has to come from memory. I agree with you that we don't want to incur the cost of a memory barrier for get/set, however it should not be at the cost of correctness. And that's actually enforced by the current definition - I think? Yeah, this is the only point which I am trying to ensure; that's why I sent you a few links in a previous mail where I had some suspicion that just doing get/set with volatile might lead to correctness issues in some cases. Another note I found today while reading more on it, which suggests that my previous observation is right: Within a thread of execution, accesses (reads and writes) to all volatile objects are guaranteed to not be reordered relative to each other, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization. http://en.cppreference.com/w/c/atomic/memory_order b) It's only supported from vista onwards. Afaik we still support XP. #ifndef pg_memory_barrier_impl #define pg_memory_barrier_impl() MemoryBarrier() #endif The MemoryBarrier() macro used is also supported only from vista onwards, so I think reasons similar to those for using MemoryBarrier() can apply here as well. I think that's odd because barrier.h has been using MemoryBarrier() without a version test for a long time now. I guess everyone uses a new enough visual studio. 
Yeah, or it might be that people who aren't using a new enough version don't have a use case that can hit a problem due to MemoryBarrier(). Those intrinsics aren't actually OS-dependent but just compiler-dependent. Otherwise we could just define it as: FORCEINLINE VOID MemoryBarrier ( VOID ) { LONG Barrier; __asm { xchg Barrier, eax } } Yes, I think that will be better; how about defining it this way just for the cases where the MemoryBarrier macro is not supported? c) It doesn't make any difference on x86 ;) What about processors like Itanium which care about acquire/release memory barrier semantics? Well, it still will be correct? I don't think it makes much sense to focus overly much on itanium here at the price of making anything more complicated for others. Completely agreed that we should not add or change anything which makes the patch complicated or can affect performance; however the reason for focusing on it is just to ensure that we don't break any existing use case for Itanium w.r.t. performance or otherwise. Another way is that we can ignore it, or just give lower priority to verifying whether the patch has anything wrong for Itanium processors. If I am reading C11's spec for this API correctly, then it says as below: Atomically compares the value pointed to by obj with the value pointed to by expected, and if those are equal, replaces the former with desired (performs read-modify-write operation). Otherwise, loads the actual value pointed to by obj into *expected (performs load operation). So it essentially means that it loads the actual value into *expected only if the operation is not successful. Yes. But in the case it's successful it will already contain the right value. The point was just that we are not completely following the C11 atomic specification, so it might be okay to tweak a bit and make things simpler and save a LOAD instruction. 
However as there is no correctness issue here and you think that the current implementation is good and can serve more cases, we can keep it as is and move forward. 4. .. There is a Caution notice on the Microsoft site indicating _ReadWriteBarrier/MemoryBarrier are deprecated. It seemed to be the most widely available API, and it's what barrier.h already used. Do you have a different suggestion? I am trying to think based on the suggestion given by Microsoft, but am not completely clear how to accomplish the same at this moment. Well, they refer to C11 stuff. But I think it'll be a while before it makes sense to use a fallback based on that. In this case, I have a question for you. Un-patched usage in barrier.h is as follows: .. #elif defined(__ia64__) || defined(__ia64) #define pg_compiler_barrier() _Asm_sched_fence() #define pg_memory_barrier() _Asm_mf() #elif defined(WIN32_ONLY_COMPILER) /* Should work on both MSVC and Borland. */ #include <intrin.h> #pragma intrinsic(_ReadWriteBarrier) #define pg_compiler_barrier() _ReadWriteBarrier() #define pg_memory_barrier() MemoryBarrier()
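The ordering concern in this exchange - that volatile keeps the compiler from reordering but does not order stores as seen from another thread - is exactly why the patch needs pg_write_barrier() at wakeup. A minimal sketch of the publish pattern, with C11 fences standing in for barrier.h (names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

static uint32_t payload;        /* ordinary shared data, no atomics */

/* Publish 'payload' to another backend: the release fence guarantees the
 * plain store is visible before the 'ready' flag is observed as set. */
static void
publish(uint32_t v, _Atomic uint32_t *ready)
{
    payload = v;                                  /* plain store */
    atomic_thread_fence(memory_order_release);    /* ~ pg_write_barrier() */
    atomic_store_explicit(ready, 1, memory_order_relaxed);
}
```

A reader would pair this with a read barrier (acquire fence) between loading `ready` and loading `payload`; without both fences, volatile alone gives no inter-thread guarantee.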
Re: [HACKERS] better atomics - v0.5
On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund and...@2ndquadrant.com wrote: Attached is a new version of the atomic operations patch. Lots has changed since the last post: * gcc, msvc work. acc, xlc, sunpro have blindly written support which should be relatively easy to fix up. All try to implement TAS, 32 bit atomic add, 32 bit compare exchange; some do it for 64bit as well. I have reviewed the msvc part of the patch and below are my findings: 1. +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, + uint32 *expected, uint32 newval) +{ + bool ret; + uint32 current; + current = InterlockedCompareExchange(&ptr->value, newval, *expected); Is there a reason why using InterlockedCompareExchange() is better than using the InterlockedCompareExchangeAcquire() variant? *Acquire or *Release variants can provide acquire memory ordering semantics for processors which support the same, else they will be defined as InterlockedCompareExchange. 2. +pg_atomic_compare_exchange_u32_impl() { .. + *expected = current; } Can we implement this API without changing the expected value? I mean if the InterlockedCompareExchange() succeeds, then it will store the newval in ptr->value, else *ret* will be false. I am not sure if there is anything implementation specific in changing *expected*. I think this value is required for the lwlock patch, but I am wondering why the same can't be achieved if we just return the *current* value and then let the lwlock patch do the handling based on it. This will have another advantage that our pg_* API will also have a similar signature to the native APIs. 3. Is there an overhead of using Add, instead of the increment/decrement version of Interlocked? 
I could not find any evidence which clearly indicates whether one is better than the other, except some recommendation in the link below which suggests using Increment/Decrement to *maintain reference count*. Refer to *The Locking Cookbook* in the link below: http://blogs.technet.com/b/winserverperformance/archive/2008/05/21/designing-applications-for-high-performance-part-ii.aspx However, when I checked the instructions each function generates, I didn't find anything which suggests that the *Add variant can be slower. Refer to the instructions for both functions: cPublicProc _InterlockedIncrement,1 cPublicFpo 1,0 mov ecx,Addend ; get pointer to addend variable mov eax,1 ; set increment value Lock1: lock xadd[ecx],eax ; interlocked increment inc eax ; adjust return value stdRET _InterlockedIncrement; stdENDP _InterlockedIncrement cPublicProc _InterlockedExchangeAdd, 2 cPublicFpo 2,0 mov ecx, [esp + 4] ; get addend address mov eax, [esp + 8] ; get increment value Lock4: lock xadd[ecx], eax ; exchange add stdRET _InterlockedExchangeAdd stdENDP _InterlockedExchangeAdd For details, refer to: http://read.pudn.com/downloads3/sourcecode/windows/248345/win2k/private/windows/base/client/i386/critsect.asm__.htm I don't think there is any need to change the current implementation; I just wanted to share my analysis in case anyone else sees a reason to prefer one implementation over the other. 4. #pragma intrinsic(_ReadWriteBarrier) #define pg_compiler_barrier_impl() _ReadWriteBarrier() #ifndef pg_memory_barrier_impl #define pg_memory_barrier_impl() MemoryBarrier() #endif There is a Caution notice on the Microsoft site indicating _ReadWriteBarrier/MemoryBarrier are deprecated. Please refer to the link below: http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.110).aspx When I checked previous versions of MSVC, I noticed that these are supported on x86, IPF, and x64 till VS2008, and only on x86 and x64 from VS2012 onwards. 
Link for VS2010 support: http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.100).aspx 5. + * atomics-generic-msvc.h .. + * Portions Copyright (c) 2013, PostgreSQL Global Development Group Shouldn't copyright notice be 2014? 6. pg_atomic_compare_exchange_u32() It is better to have comments above this and all other related functions. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
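The compare-exchange semantics Amit asks about here can be modelled portably with C11 atomics, which (as noted later in the thread) the pg_* API deliberately resembles. A minimal sketch, not the actual patch code; the pg_atomic_uint32 wrapper below is a stand-in:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the patch's pg_atomic_uint32 wrapper struct. */
typedef struct pg_atomic_uint32
{
    _Atomic uint32_t value;
} pg_atomic_uint32;

/*
 * CAS with the semantics under discussion: on failure the current value
 * of ptr->value is written back into *expected, so callers retrying in
 * a loop need no separate atomic read.
 */
static bool
pg_atomic_compare_exchange_u32(pg_atomic_uint32 *ptr,
                               uint32_t *expected, uint32_t newval)
{
    return atomic_compare_exchange_strong(&ptr->value, expected, newval);
}
```

Note that this matches C11's atomic_compare_exchange_strong() exactly: success returns true, and failure hands back the observed value, which is the behaviour Amit questions in point 2.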
Re: [HACKERS] better atomics - v0.5
On 2014-06-28 11:58:41 +0530, Amit Kapila wrote: On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund and...@2ndquadrant.com wrote: Attached is a new version of the atomic operations patch. Lots has changed since the last post: * gcc, msvc work. acc, xlc, sunpro have blindly written support which should be relatively easy to fix up. All try to implement TAS, 32 bit atomic add, 32 bit compare exchange; some do it for 64bit as well. I have reviewed msvc part of patch and below are my findings: Thanks for looking at it - I pretty much wrote that part blindly.

1.
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+                                    uint32 *expected, uint32 newval)
+{
+    bool ret;
+    uint32 current;
+    current = InterlockedCompareExchange(&ptr->value, newval, *expected);

Is there a reason why using InterlockedCompareExchange() is better than using the InterlockedCompareExchangeAcquire() variant? Two things: a) compare_exchange is a read/write operator and so far I've defined it to have full barrier semantics (documented in atomics.h). I'd be happy to discuss which semantics we want for which operation. b) It's only supported from Vista onwards. Afaik we still support XP. c) It doesn't make any difference on x86 ;) *Acquire or *Release variants can provide acquire memory ordering semantics for processors which support the same else will be defined as InterlockedCompareExchange.

2.
+pg_atomic_compare_exchange_u32_impl()
{
..
+    *expected = current;
}

Can we implement this API without changing expected value? I mean if the InterlockedCompareExchange() succeeds, then it will store the newval in ptr->value, else *ret* will be false. I am not sure if there is anything implementation specific in changing *expected*. I think this value is required for lwlock patch, but I am wondering why can't the same be achieved if we just return the *current* value and then let lwlock patch do the handling based on it.
This will have another advantage that our pg_* API will also have similar signature as native API's. Many usages of compare/exchange require that you'll get the old value back in an atomic fashion. Unfortunately the Interlocked* API doesn't provide a nice way to do that. Note that this definition of compare/exchange maps nicely both to C11's atomics and to the actual x86 cmpxchg instruction. I've generally tried to mimic C11's API as far as possible. There are some limitations because we can't do generic types and such, but other than that. 3. Is there any overhead of using Add, instead of increment/decrement version of Interlocked? There's a theoretical difference for IA64 which has a special increment/decrement operation which can only change the value by a rather limited set of values. I don't think it's worth caring about.

4.
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier_impl() _ReadWriteBarrier()
#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

There is a Caution notice on the Microsoft site indicating that _ReadWriteBarrier/MemoryBarrier are deprecated. It seemed to be the most widely available API, and it's what barrier.h already used. Do you have a different suggestion? 6. pg_atomic_compare_exchange_u32() It is better to have comments above this and all other related functions. Check atomics.h, there's a comment above it:

/*
 * pg_atomic_compare_exchange_u32 - CAS operation
 *
 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.
 *
 * Return whether the values have been exchanged.
 *
 * Full barrier semantics.
 */

Thanks, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
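The point about getting the old value back in *expected can be illustrated with a CAS retry loop: each failed compare-exchange refreshes the caller's copy, so the loop needs only one explicit load. A sketch using C11 atomics rather than the patch's per-platform code; fetch_or_u32 is an invented name:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Build an atomic fetch-or out of compare-exchange.  Because a failed
 * compare-exchange writes the value it found into 'old', the loop body
 * contains no additional atomic read - exactly the convenience the
 * "*expected is always updated" contract provides.
 */
static uint32_t
fetch_or_u32(_Atomic uint32_t *ptr, uint32_t or_mask)
{
    uint32_t old = atomic_load(ptr);

    while (!atomic_compare_exchange_strong(ptr, &old, old | or_mask))
        ;                       /* 'old' was refreshed, just retry */
    return old;                 /* value before the or was applied */
}
```

With the "return only a bool" variant Amit suggests, every failed iteration would need an extra atomic load before retrying.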
Re: [HACKERS] better atomics - v0.5
On 2014-06-27 22:47:02 -0400, Robert Haas wrote: On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-27 13:15:25 -0400, Robert Haas wrote: On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote: I don't really see usecases where it's not related data that's being touched, but: The point is that the fastpath (not using a spinlock) might touch the atomic variable, even while the slowpath has acquired the spinlock. So the slowpath can't just use non-atomic operations on the atomic variable. Imagine something like releasing a buffer pin while somebody else is doing something that requires holding the buffer header spinlock. If the atomic variable can be manipulated without the spinlock under *any* circumstances, then how is it a good idea to ever manipulate it with the spinlock? That seems hard to reason about, and unnecessary. Critical sections for spinlocks should be short and contain only the instructions that need protection, and clearly the atomic op is not one of those. Imagine the situation for the buffer header spinlock which is one of the bigger performance issues atm. We could aim to replace all usages of that with clever and complicated logic, but it's hard. The IMO reasonable (and prototyped) way to do it is to make the common paths lockless, but fall back to the spinlock for the more complicated situations. For the buffer header that means that pin/unpin and buffer lookup are lockless, but IO and changing the identity of a buffer still require the spinlock. My attempts to avoid the latter basically required a buffer header specific reimplementation of spinlocks. To make pin/unpin et al safe without acquiring the spinlock the pincount and the flags had to be moved into one variable so that when incrementing the pincount you automatically get the flagbits back. If it's currently valid all is good, otherwise the spinlock needs to be acquired. 
But for that to work you need to set flags while the spinlock is held. Possibly you can come up with ways to avoid that, but I am pretty damn sure that the code will be less robust by forbidding atomics inside a critical section, not the contrary. It's a good idea to avoid it, but not at all costs. You might be right, but I'm not convinced. Why? Anything I can do to convince you of this? Note that other users of atomics (e.g. the linux kernel) widely use atomics inside spinlock protected regions. Does the lwlock scalability patch work this way, too? No. I've moved one add/sub inside a critical section in the current version while debugging, but it's not required at all. I generally think it makes sense to document the suggestion of moving atomics outside, but not make it a requirement. What I'm basically afraid of is that this will work fine in many cases but have really ugly failure modes. That's pretty much what happens with spinlocks already - the overhead is insignificant at low levels of contention but as the spinlock starts to become contended the CPUs all fight over the cache line and performance goes to pot. ISTM that making the spinlock critical section significantly longer by putting atomic ops in there could appear to win in some cases while being very bad in others. Well, I'm not saying it's something I suggest doing all the time. But if using an atomic op in the slow path allows you to remove the spinlock from 99% of the cases I don't see it having a significant impact. In most scenarios (where atomics aren't emulated, i.e. platforms we expect to used in heavily concurrent cases) the spinlock and the atomic will be on the same cacheline making stalls much less likely. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
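The pincount-plus-flags scheme described above can be sketched as follows. The bit layout, names, and widths here are invented for illustration, not taken from the actual patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low bits hold the pin count, one high bit marks
 * the buffer valid.  Flags and pincount share a single atomic word. */
#define BUF_PIN_ONE   1u
#define BUF_PIN_MASK  ((1u << 18) - 1)
#define BUF_VALID     (1u << 18)

typedef struct
{
    _Atomic uint32_t state;     /* flags | pincount in one word */
} buf_hdr;

/*
 * Fast path: a single atomic add both takes a pin and returns the flag
 * bits, so validity can be checked without the header spinlock.  When
 * the buffer isn't valid, undo the pin and tell the caller to fall
 * back to the spinlocked slow path.
 */
static bool
try_pin_fast(buf_hdr *buf)
{
    uint32_t old = atomic_fetch_add(&buf->state, BUF_PIN_ONE);

    if (old & BUF_VALID)
        return true;
    atomic_fetch_sub(&buf->state, BUF_PIN_ONE);
    return false;
}
```

This is also why the slow path must be able to set flag bits atomically while holding the spinlock: a concurrent fast-path fetch_add may touch the same word at any moment.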
Re: [HACKERS] better atomics - v0.5
On 06/27/2014 08:15 PM, Robert Haas wrote: On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote: I don't really see usecases where it's not related data that's being touched, but: The point is that the fastpath (not using a spinlock) might touch the atomic variable, even while the slowpath has acquired the spinlock. So the slowpath can't just use non-atomic operations on the atomic variable. Imagine something like releasing a buffer pin while somebody else is doing something that requires holding the buffer header spinlock. If the atomic variable can be manipulated without the spinlock under *any* circumstances, then how is it a good idea to ever manipulate it with the spinlock? With the WALInsertLock scaling stuff in 9.4, there are now two variables protected by a spinlock: the current WAL insert location, and the prev pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you need to grab the spinlock to update both of them atomically. But to just read the WAL insert pointer, you could do an atomic read of CurrBytePos if the architecture supports it - as it is, you have to grab the spinlock for that too. Admittedly that's just an atomic read, not an atomic compare and exchange or fetch-and-add. Or if the architecture has an atomic 128-bit compare exchange op you could replace the spinlock with that. But it's not hard to imagine similar situations where you sometimes need to lock a larger data structure to modify it atomically, but sometimes you just need to modify part of it and an atomic op would suffice. I thought Andres' LWLock patch also did something like that. If the lock is not contended, you can acquire it with an atomic compare exchange to increment the exclusive/shared counter. But to manipulate the wait queue, you need the spinlock. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
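The CurrBytePos/PrevBytePos situation Heikki describes can be sketched like this, with an atomic_flag standing in for the spinlock. This is an illustration of the pattern, not the actual xlog.c code:

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_flag insertpos_lck = ATOMIC_FLAG_INIT;
static _Atomic uint64_t CurrBytePos;    /* next insert position */
static _Atomic uint64_t PrevBytePos;    /* start of previous record */

/* Writers must update both fields together, so they take the lock. */
static uint64_t
reserve_insert(uint64_t size)
{
    uint64_t start;

    while (atomic_flag_test_and_set(&insertpos_lck))
        ;                       /* spin */
    start = atomic_load(&CurrBytePos);
    atomic_store(&PrevBytePos, start);
    atomic_store(&CurrBytePos, start + size);
    atomic_flag_clear(&insertpos_lck);
    return start;
}

/*
 * A reader that only wants the insert pointer could do a plain atomic
 * read and skip the lock entirely - the optimization Heikki points out
 * is currently unavailable because both fields share one spinlock.
 */
static uint64_t
read_insert_pos(void)
{
    return atomic_load(&CurrBytePos);
}
```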
Re: [HACKERS] better atomics - v0.5
On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote: I don't really see usecases where it's not related data that's being touched, but: The point is that the fastpath (not using a spinlock) might touch the atomic variable, even while the slowpath has acquired the spinlock. So the slowpath can't just use non-atomic operations on the atomic variable. Imagine something like releasing a buffer pin while somebody else is doing something that requires holding the buffer header spinlock. If the atomic variable can be manipulated without the spinlock under *any* circumstances, then how is it a good idea to ever manipulate it with the spinlock? That seems hard to reason about, and unnecessary. Critical sections for spinlocks should be short and contain only the instructions that need protection, and clearly the atomic op is not one of those. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On 2014-06-27 13:15:25 -0400, Robert Haas wrote: On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote: I don't really see usecases where it's not related data that's being touched, but: The point is that the fastpath (not using a spinlock) might touch the atomic variable, even while the slowpath has acquired the spinlock. So the slowpath can't just use non-atomic operations on the atomic variable. Imagine something like releasing a buffer pin while somebody else is doing something that requires holding the buffer header spinlock. If the atomic variable can be manipulated without the spinlock under *any* circumstances, then how is it a good idea to ever manipulate it with the spinlock? That seems hard to reason about, and unnecessary. Critical sections for spinlocks should be short and contain only the instructions that need protection, and clearly the atomic op is not one of those. Imagine the situation for the buffer header spinlock which is one of the bigger performance issues atm. We could aim to replace all usages of that with clever and complicated logic, but it's hard. The IMO reasonable (and prototyped) way to do it is to make the common paths lockless, but fall back to the spinlock for the more complicated situations. For the buffer header that means that pin/unpin and buffer lookup are lockless, but IO and changing the identity of a buffer still require the spinlock. My attempts to avoid the latter basically required a buffer header specific reimplementation of spinlocks. To make pin/unpin et al safe without acquiring the spinlock the pincount and the flags had to be moved into one variable so that when incrementing the pincount you automatically get the flagbits back. If it's currently valid all is good, otherwise the spinlock needs to be acquired. But for that to work you need to set flags while the spinlock is held. 
Possibly you can come up with ways to avoid that, but I am pretty damn sure that the code will be less robust by forbidding atomics inside a critical section, not the contrary. It's a good idea to avoid it, but not at all costs. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-27 13:15:25 -0400, Robert Haas wrote: On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote: I don't really see usecases where it's not related data that's being touched, but: The point is that the fastpath (not using a spinlock) might touch the atomic variable, even while the slowpath has acquired the spinlock. So the slowpath can't just use non-atomic operations on the atomic variable. Imagine something like releasing a buffer pin while somebody else is doing something that requires holding the buffer header spinlock. If the atomic variable can be manipulated without the spinlock under *any* circumstances, then how is it a good idea to ever manipulate it with the spinlock? That seems hard to reason about, and unnecessary. Critical sections for spinlocks should be short and contain only the instructions that need protection, and clearly the atomic op is not one of those. Imagine the situation for the buffer header spinlock which is one of the bigger performance issues atm. We could aim to replace all usages of that with clever and complicated logic, but it's hard. The IMO reasonable (and prototyped) way to do it is to make the common paths lockless, but fall back to the spinlock for the more complicated situations. For the buffer header that means that pin/unpin and buffer lookup are lockless, but IO and changing the identity of a buffer still require the spinlock. My attempts to avoid the latter basically required a buffer header specific reimplementation of spinlocks. To make pin/unpin et al safe without acquiring the spinlock the pincount and the flags had to be moved into one variable so that when incrementing the pincount you automatically get the flagbits back. If it's currently valid all is good, otherwise the spinlock needs to be acquired. But for that to work you need to set flags while the spinlock is held. 
Possibly you can come up with ways to avoid that, but I am pretty damn sure that the code will be less robust by forbidding atomics inside a critical section, not the contrary. It's a good idea to avoid it, but not at all costs. You might be right, but I'm not convinced. Does the lwlock scalability patch work this way, too? I was hoping to look at that RSN, so if there are roughly the same issues there it might shed some light on this point also. What I'm basically afraid of is that this will work fine in many cases but have really ugly failure modes. That's pretty much what happens with spinlocks already - the overhead is insignificant at low levels of contention but as the spinlock starts to become contended the CPUs all fight over the cache line and performance goes to pot. ISTM that making the spinlock critical section significantly longer by putting atomic ops in there could appear to win in some cases while being very bad in others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On 2014-06-25 20:16:08 -0400, Robert Haas wrote: On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com wrote: Since it better be legal to manipulate an atomic variable while holding a spinlock we cannot simply use an arbitrary spinlock as backing for atomics. That'd possibly cause us to wait on ourselves or cause deadlocks. I think that's going to fall afoul of Tom's previously-articulated no loops inside spinlocks rule. Most atomics, by nature, are loop-until-it-works. Well, so is TAS itself :). More seriously, I think we're not going to have much fun if we're making up the rule that you can't do an atomic add/sub while a spinlock is held. That just precludes too many use cases and will make the code much harder to understand. I don't think we're going to end up having many problems if we allow atomic read/add/sub/write in there. How much would we lose if we supported compiler intrinsics on versions of GCC and MSVC that have them and left everything else to future patches? The discussion so far seemed pretty clear that we can't regress somewhat frequently used platforms. And dropping support for all but msvc and gcc would end up doing that. We're going to have to do the legwork for the most common platforms... Note that I didn't use assembler for those, but relied on intrinsics... We can't *regress* somewhat-frequently used platforms, but that's different from saying we've got to support *new* facilities on those platforms. Well, we want to use the new facilities somewhere. And it's not entirely unfair to call it a regression if other os/compiler/arch combinations speed up but yours don't. Essentially the entire buildfarm is running either gcc, some Microsoft compiler, or a compiler that supports the same atomics intrinsics as gcc, i.e. icc or clang. Some of those compilers may be too old to support the atomics intrinsics, and there's one Sun Studio animal, but I don't know that we need to care about those things in this patch...
I think we should support those where it's easy and where we'll see the breakage on the buildfarm. Sun studio's intrinsics are simple enough that I don't think my usage of them will be too hard to fix. ...unless of course the atomics fallbacks in this patch are sufficiently sucky that anyone who ends up using those is going to be sad. Then the bar is a lot higher. But if that's the case then I wonder if we're really on the right course here. I don't think you'll notice it much on mostly nonconcurrent uses. But some of those architectures/platforms do support larger NUMA-like setups. And for those it'd probably hurt for some workloads. And e.g. solaris is still fairly common for such setups. I added the x86 inline assembler because a fair number of buildfarm animals use ancient gcc's and because I could easily test it. It's also more efficient for gcc < 4.6. I'm not wedded to keeping it. Hmm. gcc 4.6 is only just over a year old, so if pre-4.6 implementations aren't that good, that's a pretty good argument for keeping our own implementations around. :-( The difference isn't that big. It's

return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
                                   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);

vs

bool ret;
uint64 current;
current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
ret = current == *expected;
*expected = current;
return ret;

On x86 that's an additional cmp, mov and such. Useless work, because cmpxchg already computes all that... (that's returned by the setz in my patch). MSVC only supports the second form as well. There's a fair number of 4.2 gccs on the buildfarm as well though. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On 2014-06-25 20:22:31 -0400, Robert Haas wrote: On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I think having a separate file for each architecture is nice. I totally agree that they don't belong in src/include/storage, though. s_lock.h has always been misplaced there, but we've let it be for historical reasons, but now that we're adding a dozen new files, it's time to move them out. I find the current organization pretty confusing, but maybe that could be solved by better documentation of what's supposed to go in each architecture or compiler-dependent file. The idea is that first an architecture-specific file (atomics-arch-*.h) is included. That file can provide a (partial) implementation for the specific architecture. Or it can do pretty much nothing. After that a compiler-specific file is included (atomics-generic-*.h). If atomics aren't yet implemented, that can provide an intrinsics-based implementation if the compiler version has support for it. At the very least a compiler barrier should be provided. After that the spinlock-based fallback implementation (atomics-fallback.h) provides atomics and barriers if not yet available. By here we're sure that nothing else will provide them. Then we can provide operations (atomics-generic.h) that build on top of the provided functions. E.g. implement _sub, _and et al. I'll include some more of that explanation in the header. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-25 20:16:08 -0400, Robert Haas wrote: On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com wrote: Since it better be legal to manipulate a atomic variable while holding a spinlock we cannot simply use an arbitrary spinlock as backing for atomics. That'd possibly cause us to wait on ourselves or cause deadlocks. I think that's going to fall afoul of Tom's previously-articulated no loops inside spinlocks rule. Most atomics, by nature, are loop-until-it-works. Well, so is TAS itself :). More seriously, I think we're not going to have much fun if we're making up the rule that you can't do an atomic add/sub while a spinlock is held. That just precludes to many use cases and will make the code much harder to understand. I don't think we're going to end up having many problems if we allow atomic read/add/sub/write in there. That rule seems reasonable -- why would you ever want to do this? While you couldn't properly deadlock it seems like it could lead to unpredictable and hard to diagnose performance stalls. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On 2014-06-26 07:44:14 -0500, Merlin Moncure wrote: On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-25 20:16:08 -0400, Robert Haas wrote: On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com wrote: Since it better be legal to manipulate an atomic variable while holding a spinlock we cannot simply use an arbitrary spinlock as backing for atomics. That'd possibly cause us to wait on ourselves or cause deadlocks. I think that's going to fall afoul of Tom's previously-articulated no loops inside spinlocks rule. Most atomics, by nature, are loop-until-it-works. Well, so is TAS itself :). More seriously, I think we're not going to have much fun if we're making up the rule that you can't do an atomic add/sub while a spinlock is held. That just precludes too many use cases and will make the code much harder to understand. I don't think we're going to end up having many problems if we allow atomic read/add/sub/write in there. That rule seems reasonable -- why would you ever want to do this? Are you wondering why you'd ever manipulate an atomic op inside a spinlock? There are enough algorithms where a slowpath is done under a spinlock but the fastpath is done without. You can't simply use nonatomic operations for manipulation under the spinlock because the fastpaths might then observe/cause bogus state. Obviously I'm not advocating doing random stuff in spinlocks. We're talking about single lock xadd's (or the LL/SC equivalent) here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On Thu, Jun 26, 2014 at 12:20:06PM +0200, Andres Freund wrote: On 2014-06-25 20:16:08 -0400, Robert Haas wrote: On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com wrote: Since it better be legal to manipulate a atomic variable while holding a spinlock we cannot simply use an arbitrary spinlock as backing for atomics. That'd possibly cause us to wait on ourselves or cause deadlocks. I think that's going to fall afoul of Tom's previously-articulated no loops inside spinlocks rule. Most atomics, by nature, are loop-until-it-works. Well, so is TAS itself :). More seriously, I think we're not going to have much fun if we're making up the rule that you can't do an atomic add/sub while a spinlock is held. That just precludes to many use cases and will make the code much harder to understand. I don't think we're going to end up having many problems if we allow atomic read/add/sub/write in there. I agree it's valuable to have a design that permits invoking an atomic operation while holding a spinlock. I added the x86 inline assembler because a fair number of buildfarm animals use ancient gcc's and because I could easily test it. It's also more efficient for gcc 4.6. I'm not wedded to keeping it. Hmm. gcc 4.6 is only just over a year old, so if pre-4.6 implementations aren't that good, that's a pretty good argument for keeping our own implementations around. :-( GCC 4.6 was released 2011-03-25, though that doesn't change the bottom line. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-25 20:16:08 -0400, Robert Haas wrote: On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com wrote: Since it better be legal to manipulate an atomic variable while holding a spinlock we cannot simply use an arbitrary spinlock as backing for atomics. That'd possibly cause us to wait on ourselves or cause deadlocks. I think that's going to fall afoul of Tom's previously-articulated no loops inside spinlocks rule. Most atomics, by nature, are loop-until-it-works. Well, so is TAS itself :). Yeah, which is why we have a rule that you're not supposed to acquire another spinlock while already holding one. You're trying to make the spinlocks used to simulate atomic ops an exception to that rule, but I'm not sure that's OK. An atomic op is a lot more expensive than a regular unlocked load or store even if it doesn't loop, and if it does loop, it's worse, potentially by a large multiple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On Thu, Jun 26, 2014 at 7:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-25 20:22:31 -0400, Robert Haas wrote: On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I think having a separate file for each architecture is nice. I totally agree that they don't belong in src/include/storage, though. s_lock.h has always been misplaced there, but we've let it be for historical reasons, but now that we're adding a dozen new files, it's time to move them out. I find the current organization pretty confusing, but maybe that could be solved by better documentation of what's supposed to go in each architecture or compiler-dependent file. The idea is that first an architecture-specific file (atomics-arch-*.h) is included. That file can provide a (partial) implementation for the specific architecture. Or it can do pretty much nothing. After that a compiler-specific file is included (atomics-generic-*.h). If atomics aren't yet implemented, that can provide an intrinsics-based implementation if the compiler version has support for it. At the very least a compiler barrier should be provided. After that the spinlock-based fallback implementation (atomics-fallback.h) provides atomics and barriers if not yet available. By here we're sure that nothing else will provide them. Then we can provide operations (atomics-generic.h) that build on top of the provided functions. E.g. implement _sub, _and et al. I'll include some more of that explanation in the header. I get the general principle, but I think it would be good to have one place that says something like: Each architecture must provide A and B, may provide both or neither of C and D, and may also provide any or all of E, F, and G. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
Robert Haas robertmh...@gmail.com writes: On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-06-25 20:16:08 -0400, Robert Haas wrote: I think that's going to fall afoul of Tom's previously-articulated no loops inside spinlocks rule. Most atomics, by nature, are loop-until-it-works. Well, so is TAS itself :). Yeah, which is why we have a rule that you're not suppose to acquire another spinlock while already holding one. You're trying to make the spinlocks used to simulate atomic ops an exception to that rule, but I'm not sure that's OK. An atomic op is a lot more expensive than a regular unlocked load or store even if it doesn't loop, and if it does loop, it's worse, potentially by a large multiple. Well, the point here is to be sure that there's a reasonably small bound on how long we hold the spinlock. It doesn't have to be zero, just small. Would it be practical to say that the coding rule is that you can only use atomic ops on fields that are protected by the spinlock, ie, nobody else is supposed to be touching those fields while you have the spinlock? If that's the case, then the atomic op should see no contention and thus not take very long. On the other hand, if the atomic op is touching something not protected by the spinlock, that seems to me to be morally equivalent to taking a spinlock while holding another one, which as Robert says is forbidden by our current coding rules, and for very good reasons IMO. However, that may be safe only as long as we have real atomic ops. If we're simulating those using spinlocks then you have to ask questions about how many underlying spinlocks there are and whether artificial contention might ensue (due to the same spinlock being used for multiple things). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] better atomics - v0.5
On 2014-06-26 11:47:15 -0700, Tom Lane wrote:
> Robert Haas robertmh...@gmail.com writes:
>> On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund and...@2ndquadrant.com wrote:
>>> On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
>>>> I think that's going to fall afoul of Tom's previously-articulated no loops inside spinlocks rule. Most atomics, by nature, are loop-until-it-works.
>>>
>>> Well, so is TAS itself :).
>>
>> Yeah, which is why we have a rule that you're not supposed to acquire another spinlock while already holding one. You're trying to make the spinlocks used to simulate atomic ops an exception to that rule, but I'm not sure that's OK. An atomic op is a lot more expensive than a regular unlocked load or store even if it doesn't loop, and if it does loop, it's worse, potentially by a large multiple.
>
> Well, the point here is to be sure that there's a reasonably small bound on how long we hold the spinlock. It doesn't have to be zero, just small. Would it be practical to say that the coding rule is that you can only use atomic ops on fields that are protected by the spinlock, ie, nobody else is supposed to be touching those fields while you have the spinlock? If that's the case, then the atomic op should see no contention and thus not take very long.

I don't really see use cases where it's not related data that's being touched, but: the point is that the fastpath (not using a spinlock) might touch the atomic variable, even while the slowpath has acquired the spinlock. So the slowpath can't just use non-atomic operations on the atomic variable. Imagine something like releasing a buffer pin while somebody else is doing something that requires holding the buffer header spinlock.

> On the other hand, if the atomic op is touching something not protected by the spinlock, that seems to me to be morally equivalent to taking a spinlock while holding another one, which as Robert says is forbidden by our current coding rules, and for very good reasons IMO.
> However, that may be safe only as long as we have real atomic ops. If we're simulating those using spinlocks then you have to ask questions about how many underlying spinlocks there are and whether artificial contention might ensue (due to the same spinlock being used for multiple things).

With the code as submitted it's one spinlock per atomic variable. Which is fine until spinlocks are emulated using semaphores - in that case a separate array of semaphores (quite small in the patch as submitted) is used, so there's no chance of deadlocks against a 'real' spinlock or anything like that.

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
On Thu, Jun 26, 2014 at 1:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
> Would it be practical to say that the coding rule is that you can only use atomic ops on fields that are protected by the spinlock, ie, nobody else is supposed to be touching those fields while you have the spinlock? If that's the case, then the atomic op should see no contention and thus not take very long.

I wonder if this is true in all cases. The address you are locking might be logically protected but still sit near other memory that is under contention. In other words, I don't think you can assume an atomic op locks exactly 4 bytes of memory for the operation.

> On the other hand, if the atomic op is touching something not protected by the spinlock, that seems to me to be morally equivalent to taking a spinlock while holding another one, which as Robert says is forbidden by our current coding rules, and for very good reasons IMO.

Well, not quite: you don't have the possibility of deadlock.

merlin
Re: [HACKERS] better atomics - v0.5
On 06/25/2014 10:14 AM, Andres Freund wrote:
> Hi, [sorry for the second copy Robert] Attached is a new version of the atomic operations patch. Lots has changed since the last post:

Is this at a state where we can performance-test it yet?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] better atomics - v0.5
On 2014-06-25 10:39:53 -0700, Josh Berkus wrote:
> On 06/25/2014 10:14 AM, Andres Freund wrote:
>> Hi, [sorry for the second copy Robert] Attached is a new version of the atomic operations patch. Lots has changed since the last post:
>
> Is this at a state where we can performance-test it yet?

Well, this patch itself won't have a performance impact at all. It's about providing the infrastructure to use atomic operations in further patches in a compiler- and platform-independent way. I'll repost a new version of the lwlocks patch (still fighting an issue I introduced while rebasing on top of Heikki's xlog scalability/lwlock merge) soon, addressing the review comments Amit has made since.

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
On Wed, Jun 25, 2014 at 1:14 PM, Andres Freund and...@2ndquadrant.com wrote:
> [sorry for the second copy Robert]
>
> Attached is a new version of the atomic operations patch. Lots has changed since the last post:
>
> * gcc, msvc work. acc, xlc, sunpro have blindly written support which should be relatively easy to fix up. All try to implement TAS, 32 bit atomic add, 32 bit compare exchange; some do it for 64bit as well.
> * x86 gcc inline assembly means that we support every gcc version on x86 >= i486; even when the __sync APIs aren't available yet.
> * 'inline' support isn't required anymore. We fall back to ATOMICS_INCLUDE_DEFINITIONS/STATIC_IF_INLINE etc. trickery.
> * When the current platform doesn't have support for atomic operations a spinlock-backed implementation is used. This can be forced using --disable-atomics. That even works when semaphores are used to implement spinlocks (a separate array is used there to avoid nesting problems). In contrast to an earlier implementation this even works when atomics are mapped to different addresses in individual processes (think dsm).
> * s_lock.h falls back to the atomics.h provided APIs iff it doesn't have native support for the current platform. This can be forced by defining USE_ATOMICS_BASED_SPINLOCKS. Due to generic compiler-intrinsics-based implementations that should make it easier to bring up postgres on different platforms.
> * I tried to improve the documentation of the facilities in src/include/storage/atomics.h. Including documentation of the various barrier semantics.
> * There are tests in regress.c that get called via a SQL function from the regression tests.
> * Lots of other details changed, but that's the major pieces.

Review:

- The changes to spin.c include unnecessary whitespace adjustments.
- The new argument to s_init_lock_sema() isn't used.
- "on top" is two words, not one ("ontop").
- The changes to pg_config_manual.h add a superfluous blank line.
- Are the regression tests in regress.c likely to catch anything? Really?
- The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which seems to be testing for __builtin_constant_p. I don't see that being used in the patch anywhere, though; and also, why doesn't the naming match?
- s_lock.h adds some fallback code to use atomics to define TAS on platforms where GCC supports atomics but where we do not have a working TAS implementation. However, this supposed fallback code defines HAS_TEST_AND_SET unconditionally, so I think it will get used even if we don't have (or have disabled via configure) atomics.
- I completely loathe the file layout you've proposed in src/include/storage. If we're going to have separate #include files for each architecture (and I'd rather we didn't go that route, because it's messy and unlike what we have now), then shouldn't that stuff go in src/include/port or a new directory src/include/port/atomics? Polluting src/include/storage with a whole bunch of files called atomics-whatever.h strikes me as completely ugly, and there's a lot of duplicate code in there.
- What's the point of defining pg_read_barrier() to be pg_read_barrier_impl() and pg_write_barrier() to be pg_write_barrier_impl() and so forth? That seems like unnecessary notation.
- Should SpinlockSemaInit() be assigning SpinlockSemas() to a local variable instead of re-executing it every time? Granted, the compiler might inline this somehow, but...

More generally, I'm really wondering if you're making the compare-and-swap implementation a lot more complicated than it needs to be. How much would we lose if we supported compiler intrinsics on versions of GCC and MSVC that have it and left everything else to future patches? I suspect this patch could be about 20% of its current size and give us everything that we need. I've previously opposed discarding s_lock.h on the grounds that it's extremely well battle-tested, and if we change it, we might introduce subtle bugs that are dependent on GCC versions and such.
But now that compiler intrinsics are relatively common, I don't really see any reason for us to provide our own assembler versions of *new* primitives we want to use. As long as we have some kind of wrapper functions or macros, we retain the *option* to add workarounds for compiler bugs or lack of compiler support on platforms, but it seems an awful lot simpler to me to start by assuming that the compiler will DTRT, and only roll our own if that proves to be necessary. It's not like our hand-rolled implementations couldn't have bugs - which is different from s_lock.h, where we have evidence that the exact coding in that file does or did work on those platforms.

Similarly, I would rip out - or separate into a separate patch - the code that tries to use atomic operations to implement TAS if we don't already have an implementation in s_lock.h. That might be worth doing, if there are any common architectures that don't already have
Re: [HACKERS] better atomics - v0.5
Hi,

On 2014-06-25 15:54:37 -0400, Robert Haas wrote:
> - The new argument to s_init_lock_sema() isn't used.

It's used when atomics fall back to spinlocks which fall back to semaphores, c.f. atomics.c. Since it better be legal to manipulate an atomic variable while holding a spinlock we cannot simply use an arbitrary spinlock as backing for atomics. That'd possibly cause us to wait on ourselves or cause deadlocks.

> - Are the regression tests in regress.c likely to catch anything? Really?

They already caught several bugs. If the operations were immediately widely used they probably wouldn't be necessary.

> - The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which seems to be testing for __builtin_constant_p. I don't see that being used in the patch anywhere, though; and also, why doesn't the naming match?

Uh, that's a rebase screwup. Should entirely be gone.

> - s_lock.h adds some fallback code to use atomics to define TAS on platforms where GCC supports atomics but where we do not have a working TAS implementation. However, this supposed fallback code defines HAS_TEST_AND_SET unconditionally, so I think it will get used even if we don't have (or have disabled via configure) atomics.

Hm. Good point. It should protect against that.

> - I completely loathe the file layout you've proposed in src/include/storage. If we're going to have separate #include files for each architecture (and I'd rather we didn't go that route, because it's messy and unlike what we have now), then shouldn't that stuff go in src/include/port or a new directory src/include/port/atomics? Polluting src/include/storage with a whole bunch of files called atomics-whatever.h strikes me as completely ugly, and there's a lot of duplicate code in there.

I don't mind moving it somewhere else and it's easy enough to change.

> - What's the point of defining pg_read_barrier() to be pg_read_barrier_impl() and pg_write_barrier() to be pg_write_barrier_impl() and so forth? That seems like unnecessary notation.
We could forgo it for pg_read_barrier() et al, but for the actual atomics it's useful because it allows us to put common checks in the !impl routines.

> - Should SpinlockSemaInit() be assigning SpinlockSemas() to a local variable instead of re-executing it every time? Granted, the compiler might inline this somehow, but...

I sure hope it will, but still a good idea.

> More generally, I'm really wondering if you're making the compare-and-swap implementation a lot more complicated than it needs to be.

Possible.

> How much would we lose if we supported compiler intrinsics on versions of GCC and MSVC that have it and left everything else to future patches?

The discussion so far seemed pretty clear that we can't regress somewhat frequently used platforms. And dropping support for all but msvc and gcc would end up doing that. We're going to have to do the legwork for the most common platforms... Note that I didn't use assembler for those, but relied on intrinsics...

> I suspect this patch could be about 20% of its current size and give us everything that we need.

I think the only way to do that would be to create an s_lock.h-like maze. Which imo is an utterly horrible idea. I'm pretty sure there'll be more assembler implementations for 9.5 and unless we think ahead of how to structure that we'll create something bad. I also think that a year or so down the road we're going to support more operations. E.g. optional support for a 16-byte CAS can allow for some awesome things when you want to do lockless stuff on a 64bit platform. I think there's a chance of reducing the size by merging i386/amd64; that looked better earlier than it does now (due to the addition of the inline assembler).

> I've previously opposed discarding s_lock.h on the grounds that it's extremely well battle-tested, and if we change it, we might introduce subtle bugs that are dependent on GCC versions and such.

I think we should entirely get rid of s_lock.h. It's an unmaintainable mess.
That's what I'd done in an earlier version of the patch, but you'd argued against it. I think you're right in that it shouldn't be done in the same patch and possibly not even in the same cycle, but I think we better do it sometime not too far away. I e.g. don't see how we can guarantee sane barrier semantics with the current implementation.

> But now that compiler intrinsics are relatively common, I don't really see any reason for us to provide our own assembler versions of *new* primitives we want to use. As long as we have some kind of wrapper functions or macros, we retain the *option* to add workarounds for compiler bugs or lack of compiler support on platforms, but it seems an awful lot simpler to me to start by assuming that the compiler will DTRT, and only roll our own if that proves to be necessary. It's not like our hand-rolled implementations couldn't have bugs - which is different from s_lock.h, where we have evidence that
Re: [HACKERS] better atomics - v0.5
On 06/25/2014 11:36 PM, Andres Freund wrote:
>> - I completely loathe the file layout you've proposed in src/include/storage. If we're going to have separate #include files for each architecture (and I'd rather we didn't go that route, because it's messy and unlike what we have now), then shouldn't that stuff go in src/include/port or a new directory src/include/port/atomics? Polluting src/include/storage with a whole bunch of files called atomics-whatever.h strikes me as completely ugly, and there's a lot of duplicate code in there.
>
> I don't mind moving it somewhere else and it's easy enough to change.

I think having a separate file for each architecture is nice. I totally agree that they don't belong in src/include/storage, though. s_lock.h has always been misplaced there, but we've let it be for historical reasons, but now that we're adding a dozen new files, it's time to move them out.

- Heikki
Re: [HACKERS] better atomics - v0.5
Heikki Linnakangas wrote:
> On 06/25/2014 11:36 PM, Andres Freund wrote:
>>> - I completely loathe the file layout you've proposed in src/include/storage. If we're going to have separate #include files for each architecture (and I'd rather we didn't go that route, because it's messy and unlike what we have now), then shouldn't that stuff go in src/include/port or a new directory src/include/port/atomics? Polluting src/include/storage with a whole bunch of files called atomics-whatever.h strikes me as completely ugly, and there's a lot of duplicate code in there.
>>
>> I don't mind moving it somewhere else and it's easy enough to change.
>
> I think having a separate file for each architecture is nice. I totally agree that they don't belong in src/include/storage, though. s_lock.h has always been misplaced there, but we've let it be for historical reasons, but now that we're adding a dozen new files, it's time to move them out.

We also have a bunch of files in src/backend/storage/lmgr, both current and introduced by this patch, which would be good to relocate. (Really, how did lmgr/ end up in storage/ in the first place?)

-- 
Álvaro Herrera	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.5
On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com wrote:
> Since it better be legal to manipulate an atomic variable while holding a spinlock we cannot simply use an arbitrary spinlock as backing for atomics. That'd possibly cause us to wait on ourselves or cause deadlocks.

I think that's going to fall afoul of Tom's previously-articulated no loops inside spinlocks rule. Most atomics, by nature, are loop-until-it-works.

>> How much would we lose if we supported compiler intrinsics on versions of GCC and MSVC that have it and left everything else to future patches?
>
> The discussion so far seemed pretty clear that we can't regress somewhat frequently used platforms. And dropping support for all but msvc and gcc would end up doing that. We're going to have to do the legwork for the most common platforms... Note that I didn't use assembler for those, but relied on intrinsics...

We can't *regress* somewhat-frequently-used platforms, but that's different from saying we've got to support *new* facilities on those platforms. Essentially the entire buildfarm is running either gcc, some Microsoft compiler, or a compiler that supports the same atomics intrinsics as gcc, i.e. icc or clang. Some of those compilers may be too old to support the atomics intrinsics, and there's one Sun Studio animal, but I don't know that we need to care about those things in this patch...

...unless of course the atomics fallbacks in this patch are sufficiently sucky that anyone who ends up using those is going to be sad. Then the bar is a lot higher. But if that's the case then I wonder if we're really on the right course here.

> I added the x86 inline assembler because a fair number of buildfarm animals use ancient gccs and because I could easily test it. It's also more efficient for gcc < 4.6. I'm not wedded to keeping it.

Hmm. gcc 4.6 is only just over a year old, so if pre-4.6 implementations aren't that good, that's a pretty good argument for keeping our own implementations around.
:-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] better atomics - v0.5
On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> I think having a separate file for each architecture is nice. I totally agree that they don't belong in src/include/storage, though. s_lock.h has always been misplaced there, but we've let it be for historical reasons, but now that we're adding a dozen new files, it's time to move them out.

I find the current organization pretty confusing, but maybe that could be solved by better documentation of what's supposed to go in each architecture- or compiler-dependent file.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company