Re: [HACKERS] better atomics - v0.5

2014-08-20 Thread Heikki Linnakangas

Hi Andres,

Are you planning to continue working on this? Summarizing the discussion 
so far:


* Robert listed a bunch of little cleanup tasks 
(http://www.postgresql.org/message-id/ca+tgmozshvvqjakul6p3kdvzvpibtgkzoti3m+fvvjg5v+x...@mail.gmail.com). 
Amit posted yet more detailed comments.


* We talked about changing the file layout. I think everyone is happy 
with your proposal here: 
http://www.postgresql.org/message-id/20140702131729.gp21...@awork2.anarazel.de, 
with an overview description of what goes where 
(http://www.postgresql.org/message-id/CA+TgmoYuxfsm2dSy48=jgutrh1mozrvmjlhgqrfku7_wpv-...@mail.gmail.com)


* Talked about nested spinlocks. The consensus seems to be to (continue 
to) forbid nested spinlocks, but allow atomic operations while holding a 
spinlock. When atomics are emulated with spinlocks, it's OK to acquire 
the emulation spinlock while holding another spinlock.


* Talked about whether emulating atomics with spinlocks is a good idea. 
You posted performance results showing that at least with the patch to
use atomics to implement LWLocks, the emulation performs more or less 
the same as the current spinlock-based implementation. I believe 
everyone was more or less satisfied with that.
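For readers following the thread, the emulation being benchmarked can be sketched roughly as follows. This is a hypothetical illustration (the names and the GCC __sync builtins are mine, not the patch's; the real code is built on PostgreSQL's slock_t and per-platform TAS):

```c
#include <stdbool.h>

/* A spinlock-emulated atomic uint32: every operation takes a tiny
 * per-variable lock, so the locked-instruction count on the hot path
 * stays comparable to the old spinlock-based code. */
typedef struct
{
    volatile int      lock;     /* emulation spinlock, 0 = free */
    volatile unsigned value;
} emulated_atomic_u32;

static void
emu_spin_lock(volatile int *lock)
{
    /* __sync_lock_test_and_set returns the previous value; spin while held */
    while (__sync_lock_test_and_set(lock, 1))
        ;
}

static void
emu_spin_unlock(volatile int *lock)
{
    __sync_lock_release(lock);  /* stores 0 with release semantics */
}

/* fetch-add performed under the per-variable emulation spinlock */
static unsigned
emu_fetch_add_u32(emulated_atomic_u32 *ptr, unsigned add_)
{
    unsigned old;

    emu_spin_lock(&ptr->lock);
    old = ptr->value;
    ptr->value = old + add_;
    emu_spin_unlock(&ptr->lock);
    return old;
}
```

Because the emulation lock is internal and always acquired and released within a single operation, this is exactly the "atomic op while holding another spinlock" case the consensus above allows.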



So ISTM we have consensus that the approach to spinlock emulation and 
nesting stuff is OK. To finish the patch, you'll just need to address 
the file layout and the laundry list of other little details that have 
been raised.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] better atomics - v0.5

2014-08-20 Thread Andres Freund
Hi,

On 2014-08-20 12:43:05 +0300, Heikki Linnakangas wrote:
 Are you planning to continue working on this?

Yes, I am! I've recently been on holiday and now I'm busy with catching
up with everything that has happened since. I hope to have a next
version ready early next week.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] better atomics - v0.5

2014-07-16 Thread Amit Kapila
On Mon, Jul 14, 2014 at 12:47 AM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:
  On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com
  wrote:

  Further review of patch:
  1.
  /*
   * pg_atomic_test_and_set_flag - TAS()
   *
   * Acquire/read barrier semantics.
   */
  STATIC_IF_INLINE_DECLARE bool
  pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
 
  a. I think Acquire and read barrier semantics aren't equal.

 Right. It was meant as Acquire (thus including read barrier) semantics.

Okay, I got confused by reading the comments; maybe saying the same
explicitly in the comments would be better.

 I guess I'd better update README.barrier to have definitions of all
 barriers.

I think documenting the reasons behind using specific
barriers for the atomic operations defined by PG can be useful
for people who want to understand or use the atomic op APIs.

  b. I think it adheres to Acquire semantics on platforms where those
  semantics are supported, and otherwise defaults to full memory ordering.
  Example :
  #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
 
  Is it right to declare that the generic version of the function adheres to
  Acquire semantics?

 Implementing stronger semantics than required should always be
 fine. There's simply no sane way to work with the variety of platforms
 we support in any other way.

Agreed; maybe we can have a note somewhere, either in the code or the
README, to mention it, if you think that can be helpful.

 
  2.
  bool
  pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
  {
  return TAS((slock_t *) &ptr->sema);
  #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

 Where's that from?

It's there in s_lock.h. Refer to the code below:
#ifdef WIN32_ONLY_COMPILER
typedef LONG slock_t;

#define HAS_TEST_AND_SET
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

 I can't recall adding a line of code like that?

It is not added by your patch but used in your patch.

I am not sure if that is exactly what you want the atomic API to define
for WIN32, but I think it is picking up this definition.  I have one question
here: if the value of sema is other than 1 or 0, then it won't set the flag,
and I am not sure what it will return (e.g. if the value is negative), so is
this implementation completely sane?
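To make the return-value question concrete, here is a hedged, deliberately non-atomic model of the TAS-via-compare-exchange idiom; model_cas() only mimics the interface of InterlockedCompareExchange (which returns the *initial* value of the destination) for reasoning purposes:

```c
/* Illustrative model of TAS(lock) = InterlockedCompareExchange(lock, 1, 0).
 * NOT atomic -- only for reasoning about the return-value semantics. */
static int
model_cas(volatile int *dest, int exchange, int comparand)
{
    int old = *dest;

    if (old == comparand)       /* swap happens only when *dest == comparand */
        *dest = exchange;
    return old;                 /* the intrinsic returns the initial value */
}

static int
model_tas(volatile int *lock)
{
    /* a nonzero result means the flag was NOT acquired */
    return model_cas(lock, 1, 0);
}
```

Under this model, a sema value of, say, -5 fails the compare with 0, leaves the flag unchanged, and returns -5, which callers treat as "lock already held". So the operation stays well-defined for unexpected values; whether such values can legitimately occur is a separate question.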


  a. In above usage (ptr->sema), sema itself is not declared as volatile,
  so would it be right to use it in this way for API
  (InterlockedCompareExchange)
  expecting volatile.

 Upgrading a pointer to volatile is defined, so I don't see the problem?

That's okay.  However, all other structure definitions use it as
volatile; for example, in atomics-arch-amd64.h it is defined as follows:
#define PG_HAVE_ATOMIC_FLAG_SUPPORT
typedef struct pg_atomic_flag
{
volatile char value;
} pg_atomic_flag;

So is there any harm if we keep this definition in sync with others?

  3.
  /*
   * pg_atomic_unlocked_test_flag - TAS()
   *
   * No barrier semantics.
   */
  STATIC_IF_INLINE_DECLARE bool
  pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
 
  a. There is no setting in this, so using TAS in function comments
  seems to be wrong.

 Good point.

  b. BTW, I am not able to see this function in the C11 specs.

 So? It's needed for a sensible implementation of spinlocks on top of
 atomics.

Okay, got the point. Do you think it makes sense to have a common
agreement/consensus on which APIs are necessary as a first
cut?  For me as a reviewer, if an API is implemented as per the specs, or
as per what it is intended to do, then we can have it; however, I am not
sure if there is general consensus, or at least a set of APIs some people
have agreed are required for the first cut. Have I missed some
conclusion/discussion about it?

  5.
  atomics-generic-gcc.h
  #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
  #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
  static inline bool
  pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
  {
  return ptr->value == 0;
  }
  #endif
 
  Don't we need to compare it with 1 instead of 0?

 Why? It returns true if the lock is free, makes sense imo?

The other implementation, in atomics-generic.h, seems to implement it the
other way:
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}

 Will add comments to atomics.h

I think that will be helpful.

  6.
  atomics-fallback.h
  pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
  {
  /*
   * Can't do this in the semaphore based implementation - always return
   * true. Do this in the header so compilers can optimize the test away.
   */
  return true;
  }
 
  Why we can't implement this in semaphore based implementation?

 Because semaphores don't provide the API for it?

Okay, but it is not clear from the comments why this test should always
return true; basically, I am not able to see why, in case of the missing
API, returning true is the correct behaviour for this API.

  

Re: [HACKERS] better atomics - v0.5

2014-07-13 Thread Andres Freund
On 2014-07-10 08:46:55 +0530, Amit Kapila wrote:
 As per my understanding of the general theory around barriers,
 read and write are defined to avoid reordering due to compiler and
 full memory barriers are defined to avoid reordering due to
 processors.

No, that's not the case. There are several processors where write/read
barriers are an actual thing at the hardware level.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-07-13 Thread Andres Freund
Hi Amit,

On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:
 On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com
 wrote:
 
 Further review of patch:
 1.
 /*
  * pg_atomic_test_and_set_flag - TAS()
  *
  * Acquire/read barrier semantics.
  */
 STATIC_IF_INLINE_DECLARE bool
 pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
 
 a. I think Acquire and read barrier semantics aren't equal.

Right. It was meant as Acquire (thus including read barrier) semantics. I
guess I'd better update README.barrier to have definitions of all barriers.

 b. I think it adheres to Acquire semantics on platforms where those
 semantics are supported, and otherwise defaults to full memory ordering.
 Example :
 #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
 
 Is it right to declare that the generic version of the function adheres to
 Acquire semantics?

Implementing stronger semantics than required should always be
fine. There's simply no sane way to work with the variety of platforms
we support in any other way.

 
 2.
 bool
 pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
 {
 return TAS((slock_t *) &ptr->sema);
 #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Where's that from? I can't recall adding a line of code like that?

 a. In above usage (ptr->sema), sema itself is not declared as volatile,
 so would it be right to use it in this way for API
 (InterlockedCompareExchange)
 expecting volatile.

Upgrading a pointer to volatile is defined, so I don't see the problem?

 3.
 /*
  * pg_atomic_unlocked_test_flag - TAS()
  *
  * No barrier semantics.
  */
 STATIC_IF_INLINE_DECLARE bool
 pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
 
 a. There is no setting in this, so using TAS in function comments
 seems to be wrong.

Good point.

 b. BTW, I am not able to see this function in the C11 specs.

So? It's needed for a sensible implementation of spinlocks on top of
atomics.

 4.
  #if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
 ..
 #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
 static inline bool
 pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
 {
 return pg_atomic_read_u32_impl(ptr) == 1;
 }
 
 
  #elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
 ..
 #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
 static inline bool
 pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
 {
 return (bool) pg_atomic_read_u32_impl(ptr);
 }
 
 Is there any reason to keep these two implementations different?

No, will change.

 5.
 atomics-generic-gcc.h
 #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
 #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
 static inline bool
 pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
 {
 return ptr->value == 0;
 }
 #endif
 
 Don't we need to compare it with 1 instead of 0?

Why? It returns true if the lock is free, makes sense imo? Will add
comments to atomics.h

 6.
 atomics-fallback.h
 pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
 {
 /*
  * Can't do this in the semaphore based implementation - always return
  * true. Do this in the header so compilers can optimize the test away.
  */
 return true;
 }
 
 Why we can't implement this in semaphore based implementation?

Because semaphores don't provide the API for it?

 7.
 /*
  * pg_atomic_clear_flag - release lock set by TAS()
  *
  * Release/write barrier semantics.
  */
 STATIC_IF_INLINE_DECLARE void
 pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
 
 a. Are Release and write barriers equivalent?

Same answer as above. Meant as Release (thus including write barrier)
semantics.

 b. IIUC then current code doesn't have release semantics for unlock/clear.

If you're referring to s_lock.h with 'current code', yes it should have:
 *  On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
 *  S_UNLOCK() macros must further include hardware-level memory fence
 *  instructions to prevent similar re-ordering at the hardware level.
 *  TAS() and TAS_SPIN() must guarantee that loads and stores issued after
 *  the macro are not executed until the lock has been obtained.  
Conversely,
 *  S_UNLOCK() must guarantee that loads and stores issued before the macro
 *  have been executed before the lock is released.

 We are planning to introduce it in this patch; also, this doesn't follow the
 spec, which says that clear should have memory_order_seq_cst ordering
 semantics.

Making it guarantee full memory barrier semantics is a major performance
loss on x86-64, so I don't want to do that.

Also there is atomic_flag_test_and_set_explicit()...

 As per its current usage in the patch (S_UNLOCK), it seems reasonable
 to have *release* semantics for this API; however, I think if we use it for
 a generic purpose then it might not be right to define it with Release
 semantics.

I can't really see a user where it's not what you 

Re: [HACKERS] better atomics - v0.5

2014-07-11 Thread Martijn van Oosterhout
On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:
 As per my understanding of the general theory around barriers,
 read and write are defined to avoid reordering due to compiler and
 full memory barriers are defined to avoid reordering due to processors.
 There are some processors that support instructions for memory
 barriers with acquire, release and fence semantics.

Not exactly; both compilers and processors can rearrange loads and
stores.  Because the architecture most developers work on (x86) has
such a strong memory model (it goes to a lot of effort to hide
reordering), people are under the impression that it's only the compiler
you need to worry about, but that's not true for any other
architecture.

Memory barriers/fences are on the whole discouraged if you can use
acquire/release semantics because the latter are faster and much easier
to optimise for.
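The acquire/release point can be illustrated with the C11 message-passing idiom (a sketch in standard <stdatomic.h>, not the patch's API):

```c
#include <stdatomic.h>

static atomic_int ready;    /* flag, 0 initially */
static int        payload;  /* plain, non-atomic data */

static void
publish(void)
{
    payload = 42;                                           /* ordinary store */
    atomic_store_explicit(&ready, 1, memory_order_release); /* release: the
                                                             * payload store
                                                             * cannot sink
                                                             * below this */
}

static int
consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;               /* acquire: later loads cannot float above this */
    return payload;     /* guaranteed to observe 42 once ready is seen */
}
```

No seq_cst fence is needed for this handoff: the release store pairs with the acquire load, which is why acquire/release is typically cheaper than a full barrier.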

I strongly recommend the Atomic Weapons talk on the C11 memory model
to help understand how they work. As a bonus it includes correct
implementations for the major architectures.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] better atomics - v0.5

2014-07-11 Thread Amit Kapila
On Sat, Jul 12, 2014 at 1:26 AM, Martijn van Oosterhout klep...@svana.org
wrote:
 On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:
  As per my understanding of the general theory around barriers,
  read and write are defined to avoid reordering due to compiler and
  full memory barriers are defined to avoid reordering due to processors.
  There are some processors that support instructions for memory
  barriers with acquire, release and fence semantics.

 Not exactly; both compilers and processors can rearrange loads and
 stores.  Because the architecture most developers work on (x86) has
 such a strong memory model (it goes to a lot of effort to hide
 reordering), people are under the impression that it's only the compiler
 you need to worry about, but that's not true for any other
 architecture.

I am not saying that we need barriers only to avoid reordering by the
compiler; rather, my point was that we need to avoid reordering by both
compiler and processor, though the ways to achieve each require different
APIs.

 Memory barriers/fences are on the whole discouraged if you can use
 acquire/release semantics because the latter are faster and much easier
 to optimise for.

Do all processors support instructions for memory barriers with acquire,
release semantics?

 I strongly recommend the Atomic Weapons talk on the C11 memory model
 to help understand how they work. As a bonus it includes correct
 implementations for the major architectures.

Yes, it would be good if we could implement as per the C11 memory model,
and that's what I am referring to while reviewing this patch; however, even
if that is not possible, or is difficult to achieve in all cases, it is
worth having a discussion about the memory model used in the current APIs
implemented in the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] better atomics - v0.5

2014-07-09 Thread Robert Haas
On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
 On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com
 wrote:

 Further review of patch:
 1.
 /*
  * pg_atomic_test_and_set_flag - TAS()
  *
  * Acquire/read barrier semantics.
  */
 STATIC_IF_INLINE_DECLARE bool
 pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);

 a. I think Acquire and read barrier semantics aren't equal.
 With acquire semantics, the results of the operation are available before
 the results of any operation that appears after it in code, which means it
 applies to both loads and stores. The definition of a read barrier just
 ensures loads.

 So will it be right to declare it like the above in the comments?

Yeah.  Barriers can be classified by the operations they keep from being
reordered (read, write, or both) and by the direction in which they
prevent reordering (acquire, release, or both).  So in theory there
are 9 combinations:

read barrier (both directions)
read barrier (acquire)
read barrier (release)
write barrier (both directions)
write barrier (acquire)
write barrier (release)
full barrier (both directions)
full barrier (acquire)
full barrier (release)

To make things more confusing, a read barrier plus a write barrier
need not equal a full barrier.  Read barriers prevent reads from being
reordered with other reads, and write barriers keep writes from being
reordered with other writes, but neither prevents a read from being
reordered relative to a write.  A full barrier, however, must prevent
all reordering: read/read, read/write, write/read, and write/write.

Clarity here is essential.

barrier.h only proposed to implement the following:

read barrier (both directions), write barrier (both directions), full
barrier (both directions)

The reason I did that is because it seemed to be more or less what
Linux was doing, and it's generally suitable for lock-less algorithms.
The acquire and release semantics come into play specifically when
you're using barriers to implement locking primitives, which isn't
what I was trying to enable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] better atomics - v0.5

2014-07-09 Thread Amit Kapila
On Wed, Jul 9, 2014 at 8:58 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com
  wrote:
 
  Further review of patch:
  1.
  /*
   * pg_atomic_test_and_set_flag - TAS()
   *
   * Acquire/read barrier semantics.
   */
  STATIC_IF_INLINE_DECLARE bool
  pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
 
  a. I think Acquire and read barrier semantics aren't equal.
  With acquire semantics, the results of the operation are available before
  the results of any operation that appears after it in code, which means it
  applies to both loads and stores. The definition of a read barrier just
  ensures loads.
 
  So will it be right to declare it like the above in the comments?

 Yeah.  Barriers can be classified by the operations they keep from being
 reordered (read, write, or both) and by the direction in which they
 prevent reordering (acquire, release, or both).  So in theory there
 are 9 combinations:

 read barrier (both directions)
 read barrier (acquire)
 read barrier (release)
 write barrier (both directions)
 write barrier (acquire)
 write barrier (release)
 full barrier (both directions)
 full barrier (acquire)
 full barrier (release)

With these definitions, I think it will fall into the *full barrier (acquire)*
category, as it needs to ensure that no reordering happens for either
loads or stores.  As an example, we can refer to its implementation for
MSVC, which ensures that it follows full memory barrier semantics.
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

 To make things more confusing, a read barrier plus a write barrier
 need not equal a full barrier.  Read barriers prevent reads from being
 reordered with other reads, and write barriers keep writes from being
 reordered with other writes, but neither prevents a read from being
 reordered relative to a write.  A full barrier, however, must prevent
 all reordering: read/read, read/write, write/read, and write/write.

 Clarity here is essential.

 barrier.h only proposed to implement the following:

 read barrier (both directions), write barrier (both directions), full
 barrier (both directions)

As per my understanding of the general theory around barriers,
read and write are defined to avoid reordering due to compiler and
full memory barriers are defined to avoid reordering due to processors.
There are some processors that support instructions for memory
barriers with acquire, release and fence semantics.

I think the current definitions of barriers are as per the needs of usage in
PostgreSQL, and not by reference to the standard (am I missing something
here?); however, as you explained, the definitions are quite clear.

 The reason I did that is because it seemed to be more or less what
 Linux was doing, and it's generally suitable for lock-less algorithms.
 The acquire and release semantics come into play specifically when
 you're using barriers to implement locking primitives, which isn't
 what I was trying to enable.

Now, by implementing atomic ops, I think we are moving more towards
lock-less algorithms, so I am not sure whether we want to just adhere to
the existing definitions you listed above (what the current barrier.h
implements) and implement accordingly, or whether we can extend the
existing definitions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] better atomics - v0.5

2014-07-08 Thread Amit Kapila
On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
 On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com
wrote:

Further review of patch:
1.
/*
 * pg_atomic_test_and_set_flag - TAS()
 *
 * Acquire/read barrier semantics.
 */
STATIC_IF_INLINE_DECLARE bool
pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);

a. I think Acquire and read barrier semantics aren't equal.
With acquire semantics, the results of the operation are available before
the results of any operation that appears after it in code, which means it
applies to both loads and stores. The definition of a read barrier just
ensures loads.

So will it be right to declare it like the above in the comments?

b. I think it adheres to Acquire semantics on platforms where those
semantics are supported, and otherwise defaults to full memory ordering.
Example :
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Is it right to declare that the generic version of the function adheres to
Acquire semantics?


2.
bool
pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
{
return TAS((slock_t *) &ptr->sema);
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

a. In above usage (ptr->sema), sema itself is not declared as volatile,
so would it be right to use it in this way for API
(InterlockedCompareExchange)
expecting volatile.

b. Also shouldn't this implementation be moved to atomics-generic-msvc.h

3.
/*
 * pg_atomic_unlocked_test_flag - TAS()
 *
 * No barrier semantics.
 */
STATIC_IF_INLINE_DECLARE bool
pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);

a. There is no setting in this, so using TAS in function comments
seems to be wrong.
b. BTW, I am not able to see this function in the C11 specs.

4.
#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}


#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return (bool) pg_atomic_read_u32_impl(ptr);
}

Is there any reason to keep these two implementations different?

5.
atomics-generic-gcc.h
#ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return ptr->value == 0;
}
#endif

Don't we need to compare it with 1 instead of 0?

6.
atomics-fallback.h
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
/*
 * Can't do this in the semaphore based implementation - always return
 * true. Do this in the header so compilers can optimize the test away.
 */
return true;
}

Why we can't implement this in semaphore based implementation?

7.
/*
 * pg_atomic_clear_flag - release lock set by TAS()
 *
 * Release/write barrier semantics.
 */
STATIC_IF_INLINE_DECLARE void
pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);

a. Are Release and write barriers equivalent?

With release semantics, the results of the operation are available after the
results of any operation that appears before it in code.
A write barrier acts as a compiler barrier, and orders stores.

I think both definitions are not the same.

b. IIUC then current code doesn't have release semantics for unlock/clear.
We are planning to introduce it in this patch; also, this doesn't follow the
spec, which says that clear should have memory_order_seq_cst ordering
semantics.

As per its current usage in the patch (S_UNLOCK), it seems reasonable
to have *release* semantics for this API; however, I think if we use it for
a generic purpose then it might not be right to define it with Release
semantics.

8.
#define PG_HAS_ATOMIC_CLEAR_FLAG
static inline void
pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
{
/* XXX: release semantics suffice? */
pg_memory_barrier_impl();
pg_atomic_write_u32_impl(ptr, 0);
}

Are we sure that having a memory barrier before clearing the flag will
achieve *Release* semantics?  As per my understanding, the definition of
Release semantics is as below, and the above doesn't seem to follow it:
With release semantics, the results of the operation are available after
the results of any operation that appears before it in code.

9.
static inline uint32
pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
{
uint32 old;
while (true)
{
old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, old, xchg_))
break;
}
return old;
}

What is the reason to implement exchange as compare-and-exchange?  At least
for Windows I know a corresponding function (InterlockedExchange) exists.
I could not see any other implementation of the same.
I think this at least deserves a comment explaining why we have done
the implementation this way.
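For comparison, the CAS-loop fallback being questioned amounts to the following (a sketch using the GCC __sync builtin rather than the patch's pg_atomic_* primitives):

```c
/* Exchange built from compare-and-swap: re-read and retry until our
 * CAS wins, then return the value we displaced. */
static unsigned
exchange_via_cas(volatile unsigned *ptr, unsigned xchg_)
{
    unsigned old;

    do
        old = *ptr;
    while (!__sync_bool_compare_and_swap(ptr, old, xchg_));
    return old;
}
```

Platforms with a native swap (xchg on x86, InterlockedExchange on Windows) can do this in a single instruction, which appears to be the point of the question.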

10.
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, 

Re: [HACKERS] better atomics - v0.5

2014-07-02 Thread Ants Aasma
On Fri, Jun 27, 2014 at 9:00 PM, Andres Freund and...@2ndquadrant.com wrote:
 Imagine the situation for the buffer header spinlock which is one of the
 bigger performance issues atm. We could aim to replace all usages of
 that with clever and complicated logic, but it's hard.

 The IMO reasonable (and prototyped) way to do it is to make the common
 paths lockless, but fall back to the spinlock for the more complicated
 situations. For the buffer header that means that pin/unpin and buffer
 lookup are lockless, but IO and changing the identity of a buffer still
 require the spinlock. My attempts to avoid the latter basically required
 a buffer header specific reimplementation of spinlocks.

There is a 2010 paper [1] that demonstrates a fully non-blocking
approach to buffer management using the same generalized clock
algorithm that PostgreSQL has. The site also has an implementation for
Apache Derby. You may find some interesting ideas in there.

[1] 
http://code.google.com/p/derby-nb/source/browse/trunk/derby-nb/ICDE10_conf_full_409.pdf

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] better atomics - v0.5

2014-07-02 Thread Andres Freund
On 2014-07-02 09:27:52 +0300, Ants Aasma wrote:
 On Fri, Jun 27, 2014 at 9:00 PM, Andres Freund and...@2ndquadrant.com wrote:
  Imagine the situation for the buffer header spinlock which is one of the
  bigger performance issues atm. We could aim to replace all usages of
  that with clever and complicated logic, but it's hard.
 
  The IMO reasonable (and prototyped) way to do it is to make the common
  paths lockless, but fall back to the spinlock for the more complicated
  situations. For the buffer header that means that pin/unpin and buffer
  lookup are lockless, but IO and changing the identity of a buffer still
  require the spinlock. My attempts to avoid the latter basically required
  a buffer header specific reimplementation of spinlocks.
 
 There is a 2010 paper [1] that demonstrates a fully non-blocking
 approach to buffer management using the same generalized clock
 algorithm that PostgreSQL has. The site also has an implementation for
 Apache Derby. You may find some interesting ideas in there.
 
 [1] 
 http://code.google.com/p/derby-nb/source/browse/trunk/derby-nb/ICDE10_conf_full_409.pdf

Interesting. Thanks for the link. I think I had pretty much all the
parts they described lockless as well (excluding the buffer mapping
hashtable itself, which they didn't focus on either); it was just
operations like replacing a dirty victim buffer where I fell back to the
spinlock.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-07-02 Thread Andres Freund
Hi,

On 2014-06-30 20:28:52 +0200, Andres Freund wrote:
 To quantify this, on my 2 socket xeon E5520 workstation - which is too
 small to heavily show the contention problems in pgbench -S - the
 numbers are:

 pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeable variability)
 master: 152354.294117
 lwlocks-atomics: 170528.784958
 lwlocks-atomics-spin: 159416.202578

 pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeable variability)
 master: 16273.702191
 lwlocks-atomics: 16768.712433
 lwlocks-atomics-spin: 16744.913083

 So, there really isn't a problem with the emulation. It's not actually
 that surprising - the absolute number of atomic ops is pretty much the
 same. Where we earlier took a spinlock to increment shared we now still
 take one.

Tom, in a nearby thread you've voiced concern that using atomics will
regress performance on platforms that don't have native atomics. Can these
numbers at least ease those concerns a bit?
I've compared the atomics vs. emulated atomics on 32bit x86, 64bit x86
and POWER7. That's all I have access to at the moment.  In all cases the
performance with lwlocks on top of emulated atomics was the same as the old
spinlock based algorithm or even a bit better. I'm sure that it's
possible to design atomics based algorithms where that's not the case,
but I don't think we need to solve those before we meet them.

I think for most contended places which possibly can be improved two
things matter: The number of instructions that need to work atomically
(i.e. TAS/CAS/xchg for spinlocks, xadd/cmpxchg for atomics) and the
duration locks are held. When converting to atomics, even if emulated,
the hot paths shouldn't need more locked instructions than before and
often enough the duration during which spinlocks are held will be
smaller.
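To make the point about instruction counts concrete, here is a minimal C11 sketch (illustrative only, not PostgreSQL code): the spinlock path pays one locked test-and-set on the uncontended hot path, while the atomics path pays one locked fetch-add - the same count, but with no window during which a lock is held.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Illustration of the claim above: the hot path needs the same number
 * of locked instructions either way.
 */
typedef struct
{
	atomic_flag lock;
	uint32_t	count;
} spin_counter;

static void
spin_counter_inc(spin_counter *c)
{
	/* TAS is the single locked instruction on the uncontended path */
	while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
		;						/* spin */
	c->count++;
	atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

static uint32_t
atomic_counter_inc(_Atomic uint32_t *c)
{
	/* a single locked xadd; no lock held, no window in which to sleep */
	return atomic_fetch_add(c, 1);
}
```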

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-07-02 Thread Andres Freund
On 2014-06-26 00:42:37 +0300, Heikki Linnakangas wrote:
 On 06/25/2014 11:36 PM, Andres Freund wrote:
 
 - I completely loathe the file layout you've proposed in
 src/include/storage.  If we're going to have separate #include files
 for each architecture (and I'd rather we didn't go that route, because
 it's messy and unlike what we have now), then shouldn't that stuff go
 in src/include/port or a new directory src/include/port/atomics?
 Polluting src/include/storage with a whole bunch of files called
 atomics-whatever.h strikes me as completely ugly, and there's a lot of
 duplicate code in there.
 I don't mind moving it somewhere else and it's easy enough to change.
 
 I think having a separate file for each architecture is nice. I totally
 agree that they don't belong in src/include/storage, though. s_lock.h has
 always been misplaced there, but we've let it be for historical reasons, but
 now that we're adding a dozen new files, it's time to move them out.

So,
src/include/port/atomics.h,
src/include/port/atomics/generic-$compiler.h,
src/include/port/atomics/arch-$arch.h,
src/include/port/atomics/fallback.h,
src/include/port/atomics/generic.h,
src/backend/port/atomics.c
?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-07-02 Thread Amit Kapila
On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-07-01 10:44:19 +0530, Amit Kapila wrote:

  I think defining barrier support for some atomic ops and not for others
  will lead to inconsistency with specs and we need to additionally
  define rules for such atomic ops usage.

 Meh. It's quite common that read/load do not have barrier semantics. The
 majority of read/write users won't need barriers and it'll be expensive
 to provide them with them.

So in such a case they shouldn't use the atomic API and should rather
access the variable directly. I think part of the purpose of using atomic
ops is that they provide synchronisation for reads/writes among processes;
earlier we didn't have any such facility, so wherever required we had no
option but to use barriers to achieve the same.

One more point here: in the patch, 64-bit atomic read/write are
implemented using the *_exchange_* API; won't this automatically ensure
that the 64-bit versions have barrier semantics?


+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
..
+ pg_atomic_compare_exchange_u64_impl(ptr, old, 0);
..
+}
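The trick the quoted fragment relies on can be sketched in C11 (a stand-in, not the actual patch code): with expected = 0 and desired = 0, a successful compare-exchange rewrites the 0 that was already there, while a failed one hands back the current value. Either way the value has been observed, and with the full-barrier semantics of the CAS.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Sketch: read a 64-bit value via compare-exchange on platforms
 * lacking a native atomic 64-bit load.  If *ptr is 0 the CAS
 * "succeeds" but changes nothing; if it fails, C11 stores the
 * current value into 'old'.  Either way we have read the value,
 * with full-barrier semantics.
 */
static uint64_t
read_u64_via_cas(_Atomic uint64_t *ptr)
{
	uint64_t	old = 0;

	atomic_compare_exchange_strong(ptr, &old, 0);
	return old;
}
```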


   b) It's only supported from vista onwards. Afaik we still support
  XP.

 Well, we have quite good evidence for it working by knowing that
 postgres has in the past worked on xp? If you have a better idea, send
 in a patch for today's barrier.h and I'll adopt that.

Here the only point was to check whether it is better to use the version
of the InterLocked... API that is recommended for Itanium-based
processors and behaves the same for others.  If you have any
inclination/suggestion for the same, then I can try it.

In this case, I have a question for you.

   Either noone has ever tested postgres on itanium windows (quite
   possible), because afaik _Asm_sched_fence() doesn't work on
   windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
 are for HP's acc on HPUX.
 
  Hmm.. Do you think that in such a case it would have gone into the
  define below:

  #elif defined(__INTEL_COMPILER)
  /*
   * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
   */
  #if defined(__ia64__) || defined(__ia64)
  #define pg_memory_barrier() __mf()
  #elif defined(__i386__) || defined(__x86_64__)
  #define pg_memory_barrier() _mm_mfence()
  #endif

 Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's
 acc, I don't see why?

Sorry, I don't understand what you mean by the above.
The point I wanted to clarify here was: is it possible that the above
defines lead to different definitions for Itanium on Windows?

 But they should go into atomics-generic-acc.h.

Okay, but in your current patch (0001-Basic-atomic-ops-implementation),
they seem to be in different files.

  -#define pg_compiler_barrier() __memory_barrier()
 
  Currently this will be considered as compiler barrier for
  __INTEL_COMPILER, but after patch, I don't see this define. I think
  after patch, it will be compiler_impl specific, but why such a change?

 #if defined(__INTEL_COMPILER)
 #define pg_compiler_barrier_impl()  __memory_barrier()
 #else
 #define pg_compiler_barrier_impl()  __asm__ __volatile__("" ::: "memory")
 #endif

 There earlier was a typo defining pg_compiler_barrier instead of
 pg_compiler_barrier_impl for intel, maybe that's what you were referring
to?

I have tried to search for this in patch
file 0001-Basic-atomic-ops-implementation
and found that it has been removed from barrier.h but not added anywhere
else.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] better atomics - v0.5

2014-07-01 Thread Andres Freund
On 2014-07-01 10:44:19 +0530, Amit Kapila wrote:
 I think for such usage, we need to rely on barriers wherever there is a
 need for synchronisation, recent example I have noticed is in your patch
 where we have to use pg_write_barrier() during wakeup.  However if we
 go by atomic ops definition, then no such dependency is required, like
 if  somewhere we need to use pg_atomic_compare_exchange_u32(),
 after that we don't need to ensure about barriers, because this atomic
 API adheres to Full barrier semantics.

 I think defining barrier support for some atomic ops and not for others
 will lead to inconsistency with specs and we need to additionally
 define rules for such atomic ops usage.

Meh. It's quite common that read/load do not have barrier semantics. The
majority of read/write users won't need barriers and it'll be expensive
to provide them with them.

  b) It's only supported from vista onwards. Afaik we still support
 XP.
 
  Well, with barrier.h as it stands they'd get a compiler error if it
  indeed is unsupported? But I think there's something else going on -
  msft might just be removing XP from its documentation.
 
 Okay, but why would they remove it for MemoryBarrier() and not
 for InterlockedCompareExchange()?  I think it is a bit difficult to predict
 the reason, and if we want to use anything that doesn't match the msft
 docs, we should do so carefully and have some way to ensure that it will
 work fine.

Well, we have quite good evidence for it working by knowing that
postgres has in the past worked on xp? If you have a better idea, send
in a patch for today's barrier.h and I'll adopt that.

   In this case, I have a question for you.
  
   Un-patched usage  in barrier.h is as follows:
   ..
  
   If I understand correctly the current define mechanism in barrier.h,
   it will have different definition for Itanium processors even for windows.
 
  Either noone has ever tested postgres on itanium windows (quite
  possible), because afaik _Asm_sched_fence() doesn't work on
  windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
  are for HP's acc on HPUX.
 
 Hmm.. Do you think in such a case, it would have gone in below
 define:

 #elif defined(__INTEL_COMPILER)
 /*
  * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
  */
 #if defined(__ia64__) || defined(__ia64)
 #define pg_memory_barrier() __mf()
 #elif defined(__i386__) || defined(__x86_64__)
 #define pg_memory_barrier() _mm_mfence()
 #endif

Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's
acc, I don't see why?

But they should go into atomics-generic-acc.h.

 -#define pg_compiler_barrier() __memory_barrier()
 
 Currently this will be considered as compiler barrier for
 __INTEL_COMPILER, but after patch, I don't see this define. I think
 after patch, it will be compiler_impl specific, but why such a change?

#if defined(__INTEL_COMPILER)
#define pg_compiler_barrier_impl()  __memory_barrier()
#else
#define pg_compiler_barrier_impl()  __asm__ __volatile__("" ::: "memory")
#endif

There earlier was a typo defining pg_compiler_barrier instead of
pg_compiler_barrier_impl for intel, maybe that's what you were referring to?


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Andres Freund
On 2014-06-30 11:04:53 +0530, Amit Kapila wrote:
 On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:
   On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com
   I think it is better to have some discussion about it. Apart from this,
   today I noticed atomic read/write don't use any barrier semantics
   and just rely on volatile.
 
  Yes, intentionally so. It's often important to get/set the current value
  without the cost of emitting a memory barrier. It just has to be a
  recent value  and it actually has to come from memory.
 
 I agree with you that we don't want to incur the cost of memory barrier
 for get/set, however it should not be at the cost of correctness.

I really can't follow here. A volatile read/store forces it to go to
memory without barriers. The ABIs of all platforms we work with
guarantee that 4-byte stores/reads are atomic - we've been relying on
that for a long while.
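A minimal sketch of what such a barrier-free read/write amounts to (assuming, as argued above, that aligned 4-byte loads and stores are atomic on the target ABI; this is an illustration, not the PostgreSQL implementation):

```c
#include <stdint.h>

/*
 * 'volatile' only forces the compiler to issue a real memory access;
 * it establishes no ordering against other accesses, which is exactly
 * why these operations carry no barrier semantics.
 */
static uint32_t
unlocked_read_u32(volatile uint32_t *ptr)
{
	return *ptr;
}

static void
unlocked_write_u32(volatile uint32_t *ptr, uint32_t val)
{
	*ptr = val;
}
```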

  And that's actually
  enforced by the current definition - I think?
 
 Yeah, this is the only point which I am trying to ensure; that's why I
 sent you a few links in a previous mail where I had some suspicion that
 just doing get/set with volatile might lead to correctness issues in
 some cases.

But this isn't something this patch started doing - we've been doing
that for a long while?

 Another note I found today while reading more on this, which suggests
 that my previous observation is right:
 
 Within a thread of execution, accesses (reads and writes) to all volatile
 objects are guaranteed to not be reordered relative to each other, but this
 order is not guaranteed to be observed by another thread, since volatile
 access does not establish inter-thread synchronization.
 http://en.cppreference.com/w/c/atomic/memory_order

That's just saying there's no ordering guarantees with volatile
stores/reads. I don't see a contradiction?

b) It's only supported from vista onwards. Afaik we still support XP.
   #ifndef pg_memory_barrier_impl
   #define pg_memory_barrier_impl() MemoryBarrier()
   #endif
  
   The MemoryBarrier() macro used is also supported only from vista
   onwards, so I think reasoning similar to that for MemoryBarrier() can
   apply for this as well.
 
  I think that's odd because barrier.h has been using MemoryBarrier()
  without a version test for a long time now. I guess everyone uses a new
  enough visual studio.
 
 Yeah, or maybe the people who are not using a new enough version
 don't have a use case that can hit a problem due to MemoryBarrier()

Well, with barrier.h as it stands they'd get a compiler error if it
indeed is unsupported? But I think there's something else going on -
msft might just be removing XP from its documentation. Postgres supports
building with visual studio 2005 and MemoryBarrier() seems to be
supported by it.
I think we'd otherwise seen problems since 9.1 where barrier.h was introduced.

 In this case, I have a question for you.
 
 Un-patched usage  in barrier.h is as follows:
 ..
 #elif defined(__ia64__) || defined(__ia64)
 #define pg_compiler_barrier() _Asm_sched_fence()
 #define pg_memory_barrier() _Asm_mf()
 
 #elif defined(WIN32_ONLY_COMPILER)
 /* Should work on both MSVC and Borland. */
 #include <intrin.h>
 #pragma intrinsic(_ReadWriteBarrier)
 #define pg_compiler_barrier() _ReadWriteBarrier()
 #define pg_memory_barrier() MemoryBarrier()
 #endif
 
 If I understand correctly the current define mechanism in barrier.h,
 it will have different definition for Itanium processors even for windows.

Either noone has ever tested postgres on itanium windows (quite
possible), because afaik _Asm_sched_fence() doesn't work on
windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
arefor HP's acc on HPUX.

 However the patch defines as below:
 
 #if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
 # include storage/atomics-generic-gcc.h
 #elif defined(WIN32_ONLY_COMPILER)
 # include storage/atomics-generic-msvc.h
 
 What I can understand from above is that defines in 
 storage/atomics-generic-msvc.h, will override any previous defines
 for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier()
 will be considered for Windows always.

Well, the memory barrier is surrounded by #ifndef
pg_memory_barrier_impl. The compiler barrier can't reasonably be defined
earlier since it's a compiler not an architecture thing.

 6.
 pg_atomic_compare_exchange_u32()

 It is better to have comments above this and all other related
   functions.
   Okay, generally code has such comments on top of function
   definition rather than with declaration.
 
  I don't want to have it on the _impl functions because they're
  duplicated for the individual platforms and will just become out of
  date...
 
 Sure, I was also not asking for _impl functions.  What I was asking
 in this point was to have comments on top of definition of
 

Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Robert Haas
On Sat, Jun 28, 2014 at 4:25 AM, Andres Freund and...@2ndquadrant.com wrote:
  To make pin/unpin et al safe without acquiring the spinlock the pincount
  and the flags had to be moved into one variable so that when
  incrementing the pincount you automatically get the flagbits back. If
  it's currently valid all is good, otherwise the spinlock needs to be
  acquired. But for that to work you need to set flags while the spinlock
  is held.
 
  Possibly you can come up with ways to avoid that, but I am pretty damn
  sure that the code will be less robust by forbidding atomics inside a
  critical section, not the contrary. It's a good idea to avoid it, but
  not at all costs.
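The pin-count-plus-flags scheme described above can be sketched in C11 like so (a hypothetical layout for illustration, not the actual buffer header): flag bits share one word with the pin count, so a single atomic increment both pins the buffer and hands back the flags; only when the flags say the buffer isn't valid does the caller fall back to the spinlock path.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUF_VALID_FLAG		(1u << 31)	/* hypothetical flag bit */
#define BUF_REFCOUNT_MASK	0x0000FFFFu /* hypothetical pin-count bits */

/*
 * Returns true if the fast path sufficed; false means the caller must
 * acquire the spinlock and handle the invalid buffer there.
 */
static bool
pin_buffer_fast(_Atomic uint32_t *state)
{
	/* one xadd: increments the pin count and reads the flags */
	uint32_t	old = atomic_fetch_add(state, 1);

	return (old & BUF_VALID_FLAG) != 0;
}
```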

 You might be right, but I'm not convinced.

 Why? Anything I can do to convince you of this? Note that other users of
 atomics (e.g. the linux kernel) widely use atomics inside spinlock
 protected regions.

The Linux kernel isn't a great example, because they can ensure that
they aren't preempted while holding the spinlock.  We can't, and
that's a principal part of my concern here.

More broadly, it doesn't seem consistent.  I think that other projects
also sometimes write code that acquires a spinlock while holding
another spinlock, and we don't do that; in fact, we've elevated that
to a principle, regardless of whether it performs well in practice.
In some cases, the CPU instruction that we issue to acquire that
spinlock might be the exact same instruction we'd issue to manipulate
an atomic variable.  I don't see how it can be right to say that a
spinlock-protected critical section is allowed to manipulate an atomic
variable with cmpxchg, but not allowed to acquire another spinlock
with cmpxchg.  Either acquiring the second spinlock is OK, in which
case our current coding rule is wrong, or manipulating the atomic
variable is not OK, either.  I don't see how we can have that both
ways.

 What I'm basically afraid of is that this will work fine in many cases
 but have really ugly failure modes.  That's pretty much what happens
 with spinlocks already - the overhead is insignificant at low levels
 of contention but as the spinlock starts to become contended the CPUs
 all fight over the cache line and performance goes to pot.  ISTM that
 making the spinlock critical section significantly longer by putting
 atomic ops in there could appear to win in some cases while being very
 bad in others.

 Well, I'm not saying it's something I suggest doing all the time. But if
 using an atomic op in the slow path allows you to remove the spinlock
 from 99% of the cases I don't see it having a significant impact.
 In most scenarios (where atomics aren't emulated, i.e. platforms we
 expect to be used in heavily concurrent cases) the spinlock and the atomic
 will be on the same cacheline making stalls much less likely.

And this again is my point: why can't we make the same argument about
two spinlocks situated on the same cache line?  I don't have a bit of
trouble believing that doing the same thing with a couple of spinlocks
could sometimes work out well, too, but Tom is adamantly opposed to
that.  I don't want to just make an end run around that issue without
understanding whether he has a valid point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Andres Freund
On 2014-06-30 12:15:06 -0400, Robert Haas wrote:
 More broadly, it doesn't seem consistent.  I think that other projects
 also sometimes write code that acquires a spinlock while holding
 another spinlock, and we don't do that; in fact, we've elevated that
 to a principle, regardless of whether it performs well in practice.

It's actually more than a principle: it's now required for correctness,
because the semaphore-based spinlock implementation would otherwise
deadlock.

 In some cases, the CPU instruction that we issue to acquire that
 spinlock might be the exact same instruction we'd issue to manipulate
 an atomic variable.  I don't see how it can be right to say that a
 spinlock-protected critical section is allowed to manipulate an atomic
 variable with cmpxchg, but not allowed to acquire another spinlock
 with cmpxchg.  Either acquiring the second spinlock is OK, in which
 case our current coding rule is wrong, or acquiring the atomic
 variable is not OK, either.  I don't see how we can have that both
 ways.

The "no nested spinlocks" thing used to be a guideline - one nobody had
big problems with, because it didn't prohibit solutions to problems.
Since guidelines are more useful when simple, it's "no nested
spinlocks!".

Also those two cmpxchg's aren't the same:

The CAS for spinlock acquisition will only succeed if the lock isn't
acquired. If the lock holder sleeps you'll have to wait for it to wake
up.

In contrast to that CASs for lockfree (or lock-reduced) algorithms won't
be blocked if the last manipulation was done by a backend that's now
sleeping. It'll possibly loop once, get the new value, and retry the
CAS. Yes, it can still take a couple of iterations under heavy
concurrency, but you don't have the problem that the lockholder goes to
sleep.
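The retry behaviour described here can be sketched as the classic CAS loop (a C11 illustration under the assumption of a simple flag-setting operation, not code from the patch):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * If another backend changed the word between our read and our CAS,
 * the CAS fails, refreshes 'old' with the current value, and we retry.
 * A sleeping backend that performed the last modification cannot block
 * us - unlike a sleeping spinlock holder.
 */
static uint32_t
set_flag_lockfree(_Atomic uint32_t *word, uint32_t flag)
{
	uint32_t	old = atomic_load(word);

	while (!atomic_compare_exchange_weak(word, &old, old | flag))
		;						/* 'old' now holds the current value; retry */
	return old;					/* value before our update */
}
```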

Now, you can argue that the spinlock based fallbacks make that
difference moot, but I think that's just the price fringe platforms
will have to pay for the emulation. They aren't platforms used under
heavy concurrency.

  What I'm basically afraid of is that this will work fine in many cases
  but have really ugly failure modes.  That's pretty much what happens
  with spinlocks already - the overhead is insignificant at low levels
  of contention but as the spinlock starts to become contended the CPUs
  all fight over the cache line and performance goes to pot.  ISTM that
  making the spinlock critical section significantly longer by putting
  atomic ops in there could appear to win in some cases while being very
  bad in others.
 
  Well, I'm not saying it's something I suggest doing all the time. But if
  using an atomic op in the slow path allows you to remove the spinlock
  from 99% of the cases I don't see it having a significant impact.
  In most scenarios (where atomics aren't emulated, i.e. platforms we
  expect to be used in heavily concurrent cases) the spinlock and the atomic
  will be on the same cacheline making stalls much less likely.
 
 And this again is my point: why can't we make the same argument about
 two spinlocks situated on the same cache line?

Because it's not relevant? Where and why do we want to acquire two
spinlocks that are on the same cacheline?
As far as I know there's never been an actual need for nested
spinlocks. So the guideline can be as restrictive as it is because
the price is low and there's no benefit in making it more complex.


I think this line of discussion is pretty backwards. The reason I (and
seemingly Heikki) want to use atomics is that it can reduce contention
significantly. Today that contention is to a good part caused by
spinlocks.  Your argument is basically that doing things that can take
more than a fixed number of cycles while holding a spinlock can
increase contention. But the changes we've talked about only make sense
if they significantly reduce spinlock contention in the first place - a
potential couple of iterations while the lock is held had better not be
relevant for the proposed patches.

I don't think a blanket rule makes sense here.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 ... this again is my point: why can't we make the same argument about
 two spinlocks situated on the same cache line?  I don't have a bit of
 trouble believing that doing the same thing with a couple of spinlocks
 could sometimes work out well, too, but Tom is adamantly opposed to
 that.

I think you might be overstating my position here.  What I'm concerned
about is that we be sure that spinlocks are held for a sufficiently short
time that it's very unlikely that we get pre-empted while holding one.
I don't have any particular bright line about how short a time that is,
but more than a few instructions worries me.  As you say, the Linux
kernel is a bad example to follow because it hasn't got a risk of losing
its timeslice while holding a spinlock.

The existing coding rules discourage looping (though I might be okay
with a small constant loop count), and subroutine calls (mainly because
somebody might add $random_amount_of_work to the subroutine if they
don't realize it can be called while holding a spinlock).  Both of these
rules are meant to reduce the risk that a short interval balloons
into a long one due to unrelated coding changes.

The existing coding rules also discourage spinlocking within a spinlock,
and the reason for that is that there's no very clear upper bound to the
time required to obtain a spinlock, so that there would also be no clear
upper bound to the time you're holding the original one (thus possibly
leading to cascading performance problems).

So ISTM the question we ought to be asking is whether atomic operations
have bounded execution time, or more generally what the distribution of
execution times is likely to be.  I'd be OK with an answer that includes
"sometimes it can be long" so long as "sometimes" is pretty damn seldom.
Spinlocks have a nonzero risk of taking a long time already, since we
can't entirely prevent the possibility of losing our timeslice while
holding one.  The issue here is just to be sure that that happens seldom
enough that it doesn't cause performance problems.  If we fail to do that
we might negate all the desired performance improvements from adopting
atomic ops in the first place.

regards, tom lane




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Robert Haas
On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The existing coding rules also discourage spinlocking within a spinlock,
 and the reason for that is that there's no very clear upper bound to the
 time required to obtain a spinlock, so that there would also be no clear
 upper bound to the time you're holding the original one (thus possibly
 leading to cascading performance problems).

Well, as Andres points out, commit
daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad
requirement than a suggestion.  If it's sometimes OK to acquire a
spinlock from within a spinlock, then that commit was badly misguided;
but the design of that patch was pretty much driven by your insistence
that that was never, ever OK.  If that was wrong, then we should
reconsider that whole commit more broadly; but if it was right, then
the atomics patch shouldn't drill a hole through it to install an
exception just for emulating atomics.

I'm personally not convinced that we're approaching this topic in the
right way.  I'm not convinced that it's at all reasonable to try to
emulate atomics on platforms that don't have them.  I would punt the
problem into the next layer and force things like lwlock.c to have
fallback implementations at that level that don't require atomics, and
remove those fallback implementations if and when we move the
goalposts so that all supported platforms must have working atomics
implementations.  People who write code that uses atomics are not
likely to think about how those algorithms will actually perform when
those atomics are merely emulated, and I suspect that means that in
practice platforms that have only emulated atomics are going to
regress significantly vs. the status quo today.  If this patch goes in
the way it's written right now, then I think it basically amounts to
throwing platforms that don't have working atomics emulation under the
bus, and my vote is that if we're going to do that, we should do it
explicitly: "Sorry, we now only support platforms with working atomics.
You don't have any, so you can't run PostgreSQL. Have a nice day."

If we *don't* want to make that decision, then I'm not convinced that
the approach this patch takes is in any way viable, completely aside
from this particular question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Andres Freund
On 2014-06-30 13:15:23 -0400, Robert Haas wrote:
 On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  The existing coding rules also discourage spinlocking within a spinlock,
  and the reason for that is that there's no very clear upper bound to the
  time required to obtain a spinlock, so that there would also be no clear
  upper bound to the time you're holding the original one (thus possibly
  leading to cascading performance problems).
 
 Well, as Andres points out, commit
 daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad
 requirement than a suggestion.  If it's sometimes OK to acquire a
 spinlock from within a spinlock, then that commit was badly misguided;
 but the design of that patch was pretty much driven by your insistence
 that that was never, ever OK.  If that was wrong, then we should
 reconsider that whole commit more broadly; but if it was right, then
 the atomics patch shouldn't drill a hole through it to install an
 exception just for emulating atomics.

Well, since nobody has a problem with the rule of not having nested
spinlocks and since the problems aren't the same I'm failing to see why
we should reconsider this.

My recollection was that we primarily discussed whether it'd be a good
idea to add code to Assert() when spinlocks are nested to which Tom
responded that it's not necessary because critical sections should be
short enough to immediately see that that's not a problem. Then we
argued a bit back and forth around that.

 I'm personally not convinced that we're approaching this topic in the
 right way.  I'm not convinced that it's at all reasonable to try to
 emulate atomics on platforms that don't have them.  I would punt the
 problem into the next layer and force things like lwlock.c to have
 fallback implementations at that level that don't require atomics, and
 remove those fallback implementations if and when we move the
 goalposts so that all supported platforms must have working atomics
 implementations.

If we're requiring an atomics-less implementation for lwlock.c,
bufmgr.c et al., I'll stop working on those features (and by extension on
atomics). It's an utter waste of resources and maintainability trying to
maintain parallel high level locking algorithms in several pieces of the
codebase. The nonatomic versions will stagnate and won't actually work
under load. I'd rather spend my time on something useful. The spinlock
based atomics emulation is *small*. It's easy to reason about
correctness there.

If we're going that way, we've let a couple of absolute fringe
platforms hold postgres, as actually used in reality, hostage. I think
that's insane.

 People who write code that uses atomics are not
 likely to think about how those algorithms will actually perform when
 those atomics are merely emulated, and I suspect that means that in
 practice platforms that have only emulated atomics are going to
 regress significantly vs. the status quo today.

I think you're overstating the likely performance penalty for
nonparallel platforms/workloads here quite a bit. The platforms without
a chance of getting atomics implemented aren't ones where that's a
likely thing. Yes, you'll see a regression if you run a readonly pgbench
over a 4 node NUMA system - but it's not large. You won't see much of
improved performance in comparison to 9.4, but I think that's primarily
it.

Also it's not like this is something new - the semaphore based fallback
*sucks* but we still have it.

*Many* features in new major versions regress performance for fringe
platforms. That's normal.

 If this patch goes in
 the way it's written right now, then I think it basically amounts to
 throwing platforms that don't have working atomics emulation under the
 bus

Why? Nobody is going to run 9.5 under high performance workloads on such
platforms. They'll be so IO and RAM starved that the spinlocks won't be
the bottleneck.

, and my vote is that if we're going to do that, we should do it
 explicitly: sorry, we now only support platforms with working atomics.
 You don't have any, so you can't run PostgreSQL.  Have a nice day.

I'd be fine with that. I don't think we're gaining much by those
platforms.
*But* I don't really see why it's required at this point. The fallback
works and unless you're using semaphore based spinlocks the performance
isn't bad.
The lwlock code doesn't really get slower/faster in comparison to before
when the atomics implementation is backed by semaphores. It's pretty
much the same number of syscalls using the atomics/current implementation.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I'm personally not convinced that we're approaching this topic in the
 right way.  I'm not convinced that it's at all reasonable to try to
 emulate atomics on platforms that don't have them.  I would punt the
 problem into the next layer and force things like lwlock.c to have
 fallback implementations at that level that don't require atomics, and
 remove those fallback implementations if and when we move the
 goalposts so that all supported platforms must have working atomics
 implementations.  People who write code that uses atomics are not
 likely to think about how those algorithms will actually perform when
 those atomics are merely emulated, and I suspect that means that in
 practice platforms that have only emulated atomics are going to
 regress significantly vs. the status quo today.

I think this is a valid objection, and I for one am not prepared to
say that we no longer care about platforms that don't have atomic ops
(especially not if it's not a *very small* set of atomic ops).

Also, just because a platform claims to have atomic ops doesn't mean that
those ops perform well.  If there's a kernel trap involved, they don't,
at least not for our purposes.  We're only going to be bothering with
installing atomic-op code in places that are contention bottlenecks
for us already, so we are not going to be happy with the results for any
atomic-op implementation that's not industrial strength.  This is one
reason why I'm extremely suspicious of depending on gcc's intrinsics for
this; that will not make the issue go away, only make it beyond our
power to control.

regards, tom lane




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Andres Freund
On 2014-06-30 12:54:10 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  ... this again is my point: why can't we make the same argument about
  two spinlocks situated on the same cache line?  I don't have a bit of
  trouble believing that doing the same thing with a couple of spinlocks
  could sometimes work out well, too, but Tom is adamantly opposed to
  that.
 
 I think you might be overstating my position here.  What I'm concerned
 about is that we be sure that spinlocks are held for a sufficiently short
 time that it's very unlikely that we get pre-empted while holding one.
 I don't have any particular bright line about how short a time that is,
 but more than a few instructions worries me.  As you say, the Linux
 kernel is a bad example to follow because it hasn't got a risk of losing
 its timeslice while holding a spinlock.

Well, they actually have preemptable and irq-interruptible versions for
some of these locks, but whatever :).

 So ISTM the question we ought to be asking is whether atomic operations
 have bounded execution time, or more generally what the distribution of
execution times is likely to be.  I'd be OK with an answer that includes
"sometimes it can be long" so long as "sometimes" is pretty damn
seldom.

I think it ranges from 'never' to 'sometimes, but pretty darn
seldom'. Depending on the platform and emulation.

And, leaving the emulation and badly written code aside, there's no
issue with another backend sleeping while the atomic variable needs to
be modified.

 Spinlocks have a nonzero risk of taking a long time already, since we
 can't entirely prevent the possibility of losing our timeslice while
 holding one.  The issue here is just to be sure that that happens seldom
 enough that it doesn't cause performance problems.  If we fail to do that
 we might negate all the desired performance improvements from adopting
 atomic ops in the first place.

I think individual patches will have to stand up to a test like
this. Modifying an atomic variable inside a spinlock is only going to be
worth it if it allows avoiding spinlocks altogether 95%+ of the
time.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Andres Freund
On 2014-06-30 19:44:47 +0200, Andres Freund wrote:
 On 2014-06-30 13:15:23 -0400, Robert Haas wrote:
  People who write code that uses atomics are not
  likely to think about how those algorithms will actually perform when
  those atomics are merely emulated, and I suspect that means that in
  practice platforms that have only emulated atomics are going to
  regress significantly vs. the status quo today.
 
 I think you're overstating the likely performance penalty for
 nonparallel platforms/workloads here quite a bit. The platforms without
 chances of getting atomics implemented aren't ones where that's a
 likely thing. Yes, you'll see a regression if you run a readonly pgbench
 over a 4 node NUMA system - but it's not large. You won't see much of
 improved performance in comparison to 9.4, but I think that's primarily
 it.

To quantify this, on my 2 socket xeon E5520 workstation - which is too
small to heavily show the contention problems in pgbench -S - the
numbers are:

pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeable variability)
master: 152354.294117
lwlocks-atomics: 170528.784958
lwlocks-atomics-spin: 159416.202578

pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeable variability)
master: 16273.702191
lwlocks-atomics: 16768.712433
lwlocks-atomics-spin: 16744.913083

So, there really isn't a problem with the emulation. It's not actually
that surprising - the absolute number of atomic ops is pretty much the
same. Where we earlier took a spinlock to increment shared we now still
take one.

I expect that repeating this on the 4 socket machine will show a large
gap between lwlocks-atomics and the other two. The other two will be
about the same, with lwlocks-atomics-spin remaining a bit better than
master.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Andres Freund
On 2014-06-30 13:45:52 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  I'm personally not convinced that we're approaching this topic in the
  right way.  I'm not convinced that it's at all reasonable to try to
  emulate atomics on platforms that don't have them.  I would punt the
  problem into the next layer and force things like lwlock.c to have
  fallback implementations at that level that don't require atomics, and
  remove those fallback implementations if and when we move the
  goalposts so that all supported platforms must have working atomics
  implementations.  People who write code that uses atomics are not
  likely to think about how those algorithms will actually perform when
  those atomics are merely emulated, and I suspect that means that in
  practice platforms that have only emulated atomics are going to
  regress significantly vs. the status quo today.
 
 I think this is a valid objection, and I for one am not prepared to
 say that we no longer care about platforms that don't have atomic ops
 (especially not if it's not a *very small* set of atomic ops).

It's still TAS(), cmpxchg(), xadd. With TAS/xadd possibly implemented
via cmpxchg (which quite possibly *is* the native implementation for
$arch).

 Also, just because a platform claims to have atomic ops doesn't mean that
 those ops perform well.  If there's a kernel trap involved, they don't,
 at least not for our purposes.

Yea. I think if we start to use 8byte atomics at some later point we're
going to have to prohibit e.g. older arms from using them, even if the
compiler claims they're supported. But that's just one #define.

 This is one
 reason why I'm extremely suspicious of depending on gcc's intrinsics for
 this; that will not make the issue go away, only make it beyond our
 power to control.

The patch as submitted makes it easy to provide a platform specific
implementation of the important routines if it's likely that it's better
than the intrinsics provided one.

I'm not too worried about the intrinsics though - the number of projects
relying on them these days is quite large.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Noah Misch
On Mon, Jun 30, 2014 at 07:44:47PM +0200, Andres Freund wrote:
 On 2014-06-30 13:15:23 -0400, Robert Haas wrote:
  I'm personally not convinced that we're approaching this topic in the
  right way.  I'm not convinced that it's at all reasonable to try to
  emulate atomics on platforms that don't have them.  I would punt the
  problem into the next layer and force things like lwlock.c to have
  fallback implementations at that level that don't require atomics, and
  remove those fallback implementations if and when we move the
  goalposts so that all supported platforms must have working atomics
  implementations.
 
 If we're requiring a atomics-less implementation for lwlock.c,
 bufmgr. et al. I'll stop working on those features (and by extension on
 atomics). It's an utter waste of resources and maintainability trying to
 maintain parallel high level locking algorithms in several pieces of the
 codebase. The nonatomic versions will stagnate and won't actually work
 under load. I'd rather spend my time on something useful. The spinlock
 based atomics emulation is *small*. It's easy to reason about
 correctness there.
 
 If we're going that way, we've made a couple of absolute fringe
 platforms hold postgres, as actually used in reality, hostage. I think
 that's insane.

I agree completely.

  If this patch goes in
  the way it's written right now, then I think it basically amounts to
  throwing platforms that don't have working atomics emulation under the
  bus

 , and my vote is that if we're going to do that, we should do it
  explicitly: sorry, we now only support platforms with working atomics.
  You don't have any, so you can't run PostgreSQL.  Have a nice day.

I might share that view if a platform faced a huge and easily-seen performance
regression, say a 40% regression to 5-client performance.  I don't expect the
shortfall to reach that ballpark for any atomics-exploiting algorithm we'll
adopt, and your (Andres's) latest measurements corroborate that.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] better atomics - v0.5

2014-06-30 Thread Amit Kapila
On Mon, Jun 30, 2014 at 2:54 PM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-06-30 11:04:53 +0530, Amit Kapila wrote:
  On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund and...@2ndquadrant.com
wrote:

   Yes, intentionally so. It's often important to get/set the current value
   without the cost of emitting a memory barrier. It just has to be a
   recent value - and it actually has to come from memory.
 
  I agree with you that we don't want to incur the cost of memory barrier
  for get/set, however it should not be at the cost of correctness.

 I really can't follow here. A volatile read/store forces it to go to
 memory without barriers. The ABIs of all platforms we work with
 guarantee that 4bytes stores/reads are atomic - we've been relying on
 that for a long while.

I think for such usage, we need to rely on barriers wherever there is a
need for synchronisation, recent example I have noticed is in your patch
where we have to use pg_write_barrier() during wakeup.  However if we
go by atomic ops definition, then no such dependency is required, like
if somewhere we need to use pg_atomic_compare_exchange_u32(),
after that we don't need to ensure about barriers, because this atomic
API adheres to Full barrier semantics.

I think defining barrier support for some atomic ops and not for others
will lead to inconsistency with specs and we need to additionally
define rules for such atomic ops usage.

I think we can still rely on volatile read/store for ordering guarantee
within the same thread of execution, and wherever we need synchronisation
of usage among different processes, we can directly use atomic ops
(get/set) without the need to use barriers.

  b) It's only supported from vista onwards. Afaik we still support XP.

 Well, with barrier.h as it stands they'd get a compiler error if it
 indeed is unsupported? But I think there's something else going on -
 msft might just be removing XP from its documentation.

Okay, but why would they remove it for MemoryBarrier() and not
for InterlockedCompareExchange()?  I think it is a bit difficult to predict
the reason, and if we want to use anything which is not as per the msft docs,
we should do that carefully and have some way to ensure that it will
work fine.

  In this case, I have a question for you.
 
  Un-patched usage  in barrier.h is as follows:
  ..
 
  If I understand correctly the current define mechanism in barrier.h,
  it will have different definition for Itanium processors even for
windows.

 Either noone has ever tested postgres on itanium windows (quite
 possible), because afaik _Asm_sched_fence() doesn't work on
 windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
 are for HP's aCC on HPUX.

Hmm.. Do you think in such a case, it would have gone in below
define:
#elif defined(__INTEL_COMPILER)
/*
 * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
 */
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier() __mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier() _mm_mfence()
#endif



-#define pg_compiler_barrier() __memory_barrier()

Currently this will be considered as a compiler barrier for
__INTEL_COMPILER, but after the patch, I don't see this define. I think
after the patch it will be compiler_impl specific, but why such a change?

  However the patch defines as below:
..
  What I can understand from above is that defines in
  storage/atomics-generic-msvc.h, will override any previous defines
  for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier()
  will be considered for Windows always.

 Well, the memory barrier is surrounded by #ifndef
 pg_memory_barrier_impl. The compiler barrier can't reasonably be defined
 earlier since it's a compiler not an architecture thing.

I think as per your new compiler specific implementation stuff, this holds
true.

  6.
  Sure, I was also not asking for _impl functions.  What I was asking
  in this point was to have comments on top of definition of
  pg_atomic_compare_exchange_u32() in atomics.h
  In particular, on top of below and similar functions, rather than
  at the place where they are declared.

 Hm, we can do that. Don't think it'll be clearer (because you need to go
 to the header anyway), but I don't feel strongly.

I think this depends on an individual's perspective, so do it the way
you feel is better; the intention of this point was to not deviate from
existing coding conventions.

 I'd much rather get rid of the separated definition/declaration, but
 we'd need to rely on PG_USE_INLINE for it...

Okay.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] better atomics - v0.5

2014-06-29 Thread Amit Kapila
On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-06-28 11:58:41 +0530, Amit Kapila wrote:
  1.
  +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
  + uint32 *expected, uint32 newval)
  +{
  + bool ret;
  + uint32 current;
 + current = InterlockedCompareExchange(&ptr->value, newval, *expected);
 
  Is there a reason why using InterlockedCompareExchange() is better
  than using InterlockedCompareExchangeAcquire() variant?

 Two things:
 a) compare_exchange is a read/write operator and so far I've defined it
to have full barrier semantics (documented in atomics.h). I'd be
happy to discuss which semantics we want for which operation.

I think it is better to have some discussion about it. Apart from this,
today I noticed atomic read/write doesn't use any barrier semantics
and just relies on volatile.  I think it might not be sane in all cases;
for example, with Visual Studio 2003, volatile-to-volatile references are
ordered; the compiler will not re-order volatile variable accesses. However,
these operations could be re-ordered by the processor.
For detailed example, refer below link:

http://msdn.microsoft.com/en-us/library/windows/desktop/ms686355(v=vs.85).aspx

Also, if we look at the C11 specs, both load and store use
memory_order_seq_cst as the default memory order, which is the same as for
compare-and-exchange.
I have referred below link:
   http://en.cppreference.com/w/c/atomic/atomic_store

 b) It's only supported from vista onwards. Afaik we still support XP.
#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

The MemoryBarrier() macro is also supported only from Vista onwards,
so I think reasons similar to those for using MemoryBarrier() can apply
to this as well.

 c) It doesn't make any difference on x86 ;)

What about processors like Itanium which care about acquire/release
memory barrier semantics?

  2.
  +pg_atomic_compare_exchange_u32_impl()
  {
  ..
  + *expected = current;
  }
 
  Can we implement this API without changing expected value?
..
  I think this value is required for lwlock patch, but I am wondering why
  can't the same be achieved if we just return the *current* value and
  then let lwlock patch do the handling based on it.  This will have
another
  advantage that our pg_* API will also have similar signature as native
  API's.

 Many usages of compare/exchange require that you'll get the old value
 back in an atomic fashion. Unfortunately the Interlocked* API doesn't
 provide a nice way to do that.

Yes, but I think the same cannot be accomplished even by using
expected.

Note that this definition of
 compare/exchange both maps nicely to C11's atomics and the actual x86
 cmpxchg instruction.

 I've generally tried to mimick C11's API as far as it's
 possible. There's some limitations because we can't do generic types and
 such, but other than that.

If I am reading C11's spec for this API correctly, then it says as below:
"Atomically compares the value pointed to by obj with the value pointed
to by expected, and if those are equal, replaces the former with desired
(performs read-modify-write operation). Otherwise, loads the actual value
pointed to by obj into *expected (performs load operation)."

So it essentially means that it loads the actual value into *expected only
if the operation is not successful.


  3. Is there a overhead of using Add, instead of increment/decrement
  version of Interlocked?

 There's a theoretical difference for IA64 which has a special
 increment/decrement operation which can only change the value by a
 rather limited number of values. I don't think it's worth to care.

Okay.

  4.
..
  There is a Caution notice on the Microsoft site indicating
  _ReadWriteBarrier/MemoryBarrier are deprecated.

 It seemed to be the most widely available API, and it's what barrier.h
 already used.
 Do you have a different suggestion?

I am trying to think based on suggestion given by Microsoft, but
not completely clear how to accomplish the same at this moment.

  6.
  pg_atomic_compare_exchange_u32()
 
  It is better to have comments above this and all other related
functions.

 Check atomics.h, there's comments above it:
 /*
  * pg_atomic_compare_exchange_u32 - CAS operation
  *
  * Atomically compare the current value of ptr with *expected and store
  * newval iff ptr and *expected have the same value. The current value
  * of *ptr will always be stored in *expected.
  *
  * Return whether the values have been exchanged.
  *
  * Full barrier semantics.
  */

Okay, generally code has such comments on top of function
definition rather than with declaration.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] better atomics - v0.5

2014-06-29 Thread Andres Freund
On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:
 On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com
  Two things:
  a) compare_exchange is a read/write operator and so far I've defined it
 to have full barrier semantics (documented in atomics.h). I'd be
 happy to discuss which semantics we want for which operation.
 
 I think it is better to have some discussion about it. Apart from this,
 today I noticed atomic read/write doesn't use any barrier semantics
 and just rely on volatile.

Yes, intentionally so. It's often important to get/set the current value
without the cost of emitting a memory barrier. It just has to be a
recent value - and it actually has to come from memory. And that's actually
enforced by the current definition - I think?

  b) It's only supported from vista onwards. Afaik we still support XP.
 #ifndef pg_memory_barrier_impl
 #define pg_memory_barrier_impl() MemoryBarrier()
 #endif
 
 The MemoryBarrier() macro used also supports only on vista onwards,
 so I think for reasons similar to using MemoryBarrier() can apply for
 this as well.

I think that's odd because barrier.h has been using MemoryBarrier()
without a version test for a long time now. I guess everyone uses a new
enough visual studio. Those intrinsics aren't actually OS but just
compiler dependent.

Otherwise we could just define it as:

FORCEINLINE
VOID
MemoryBarrier (
VOID
)
{
LONG Barrier;
__asm {
xchg Barrier, eax
}
}


  c) It doesn't make any difference on x86 ;)
 
 What about processors like Itanium which care about acquire/release
 memory barrier semantics?

Well, it still will be correct? I don't think it makes much sense to
focus overly much on itanium here at the price of making anything more
complicated for others.

   I think this value is required for lwlock patch, but I am wondering why
   can't the same be achieved if we just return the *current* value and
   then let lwlock patch do the handling based on it.  This will have
 another
   advantage that our pg_* API will also have similar signature as native
   API's.
 
  Many usages of compare/exchange require that you'll get the old value
  back in an atomic fashion. Unfortunately the Interlocked* API doesn't
  provide a nice way to do that.
 
 Yes, but I think the same cannot be accomplished even by using
 expected.

More complicatedly so, yes? I don't think we want those comparisons on
practically every callsite.

 Note that this definition of
  compare/exchange both maps nicely to C11's atomics and the actual x86
  cmpxchg instruction.
 
  I've generally tried to mimick C11's API as far as it's
  possible. There's some limitations because we can't do generic types and
  such, but other than that.
 
 If I am reading C11's spec for this API correctly, then it says as below:
 Atomically compares the value pointed to by obj with the value pointed
 to by expected, and if those are equal, replaces the former with desired
 (performs read-modify-write operation). Otherwise, loads the actual value
 pointed to by obj into *expected (performs load operation).
 
  So it essentially means that it loads the actual value in expected only if
 operation is not successful.

Yes. But in the case it's successful it will already contain the right
value.

   4.
 ..
   There is a Caution notice in microsoft site indicating
   _ReadWriteBarrier/MemoryBarrier are deprecated.
 
  It seemed to be the most widely available API, and it's what barrier.h
  already used.
  Do you have a different suggestion?
 
 I am trying to think based on suggestion given by Microsoft, but
 not completely clear how to accomplish the same at this moment.

Well, they refer to C11 stuff. But I think it'll be a while before it
makes sense to use a fallback based on that.

   6.
   pg_atomic_compare_exchange_u32()
  
   It is better to have comments above this and all other related
 functions.
 
  Check atomics.h, there's comments above it:
  /*
   * pg_atomic_compare_exchange_u32 - CAS operation
   *
   * Atomically compare the current value of ptr with *expected and store
   * newval iff ptr and *expected have the same value. The current value
   * of *ptr will always be stored in *expected.
   *
   * Return whether the values have been exchanged.
   *
   * Full barrier semantics.
   */
 
 Okay, generally code has such comments on top of function
 definition rather than with declaration.

I don't want to have it on the _impl functions because they're
duplicated for the individual platforms and will just become out of
date...

Greetings,

Andres Freund




Re: [HACKERS] better atomics - v0.5

2014-06-29 Thread Andres Freund
On 2014-06-28 19:36:39 +0300, Heikki Linnakangas wrote:
 On 06/27/2014 08:15 PM, Robert Haas wrote:
 On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 I don't really see usecases where it's not related data that's being
 touched, but: The point is that the fastpath (not using a spinlock) might
 touch the atomic variable, even while the slowpath has acquired the
 spinlock. So the slowpath can't just use non-atomic operations on the
 atomic variable.
 Imagine something like releasing a buffer pin while somebody else is
 doing something that requires holding the buffer header spinlock.
 
 If the atomic variable can be manipulated without the spinlock under
 *any* circumstances, then how is it a good idea to ever manipulate it
 with the spinlock?
 
 With the WALInsertLock scaling stuff in 9.4, there are now two variables
 protected by a spinlock: the current WAL insert location, and the prev
 pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you need
 to grab the spinlock to update both of them atomically. But to just read the
 WAL insert pointer, you could do an atomic read of CurrBytePos if the
 architecture supports it - now you have to grab the spinlock.
 Admittedly that's just an atomic read, not an atomic compare and exchange or
 fetch-and-add.

There's several architectures where you can do atomic 8byte reads, but
only by using cmpxchg or similar, *not* a plain read... So I think
it's actually a fair example.

 Or if the architecture has an atomic 128-bit compare &
 exchange op you could replace the spinlock with that. But it's not hard to
 imagine similar situations where you sometimes need to lock a larger data
 structure to modify it atomically, but sometimes you just need to modify
 part of it and an atomic op would suffice.

Yes.

 I thought Andres' LWLock patch also did something like that. If the lock is
 not contended, you can acquire it with an atomic compare & exchange to
 increment the exclusive/shared counter. But to manipulate the wait queue,
 you need the spinlock.

Right. It just happens that the algorithm requires none of the atomic
ops have to be under the spinlock... But I generally think that we'll
see more slow/fastpath like situations cropping up where we can't always
avoid the atomic op while holding the lock.

How about adding a paragraph that explains that it's better to avoid
atomics usage inside of spinlocks because of the prolonged runtime,
especially due to the emulation and cache misses? But also saying it's ok
if the algorithm needs it and the benefit is sufficient?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-29 Thread Amit Kapila
On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:
  On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com
  I think it is better to have some discussion about it. Apart from this,
  today I noticed atomic read/write doesn't use any barrier semantics
  and just rely on volatile.

 Yes, intentionally so. It's often important to get/set the current value
 without the cost of emitting a memory barrier. It just has to be a
  recent value - and it actually has to come from memory.

I agree with you that we don't want to incur the cost of memory barrier
for get/set, however it should not be at the cost of correctness.

 And that's actually
 enforced by the current definition - I think?

Yeah, this is the only point which I am trying to ensure; that's why I sent
you few links in previous mail where I had got some suspicion that
just doing get/set with volatile might lead to correctness issue in some
cases.

Another note I found today while reading more on it, which suggests
that my previous observation is right:

"Within a thread of execution, accesses (reads and writes) to all volatile
objects are guaranteed to not be reordered relative to each other, but this
order is not guaranteed to be observed by another thread, since volatile
access does not establish inter-thread synchronization."
http://en.cppreference.com/w/c/atomic/memory_order

   b) It's only supported from vista onwards. Afaik we still support XP.
  #ifndef pg_memory_barrier_impl
  #define pg_memory_barrier_impl() MemoryBarrier()
  #endif
 
  The MemoryBarrier() macro used also supports only on vista onwards,
  so I think for reasons similar to using MemoryBarrier() can apply for
  this as well.

 I think that's odd because barrier.h has been using MemoryBarrier()
 without a version test for a long time now. I guess everyone uses a new
 enough visual studio.

Yeap, or maybe the people who are not using a new enough version
don't have a use case that can hit a problem due to MemoryBarrier().

 Those intrinsics aren't actually OS but just
 compiler dependent.

 Otherwise we could just define it as:

 FORCEINLINE
 VOID
 MemoryBarrier (
 VOID
 )
 {
 LONG Barrier;
 __asm {
 xchg Barrier, eax
 }
 }

Yes, I think that will be better; how about defining it this way just for
the cases where the MemoryBarrier macro is not supported?

   c) It doesn't make any difference on x86 ;)
 
  What about processors like Itanium which care about acquire/release
  memory barrier semantics?

 Well, it still will be correct? I don't think it makes much sense to
 focus overly much on itanium here with the price of making anything more
 complicated for others.

Completely agreed that we should not add or change anything which
makes patch complicated or can effect the performance, however
the reason for focusing is just to ensure that we should not break
any existing use case for Itanium w.r.t performance or otherwise.

Alternatively, we can ignore Itanium or just give lower priority to
verifying whether the patch does anything wrong for Itanium processors.

  If I am reading C11's spec for this API correctly, then it says as
below:
  Atomically compares the value pointed to by obj with the value pointed
  to by expected, and if those are equal, replaces the former with desired
  (performs read-modify-write operation). Otherwise, loads the actual
value
  pointed to by obj into *expected (performs load operation).
 
  So it essentially means that it loads the actual value into expected
  only if the operation is not successful.

 Yes. But in case it's successful it will already contain the right
 value.

The point was just that we are not completely following the C11 atomics
specification, so it might be okay to tweak it a bit to make things
simpler and save a LOAD instruction.  However, as there is no correctness
issue here and you think that the current implementation is good and
can serve more cases, we can keep it as is and move forward.

4.
  ..
There is a Caution notice on the Microsoft site indicating that
_ReadWriteBarrier/MemoryBarrier are deprecated.
  
   It seemed to be the most widely available API, and it's what barrier.h
   already used.
   Do you have a different suggestion?
 
  I am trying to think it through based on the suggestion given by
  Microsoft, but it's not completely clear to me how to accomplish that
  at the moment.

 Well, they refer to C11 stuff. But I think it'll be a while before it
 makes sense to use a fallback based on that.

In this case, I have a question for you.

Un-patched usage in barrier.h is as follows:
..
#elif defined(__ia64__) || defined(__ia64)
#define pg_compiler_barrier() _Asm_sched_fence()
#define pg_memory_barrier() _Asm_mf()

#elif defined(WIN32_ONLY_COMPILER)
/* Should work on both MSVC and Borland. */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier() _ReadWriteBarrier()
#define pg_memory_barrier() MemoryBarrier()

Re: [HACKERS] better atomics - v0.5

2014-06-28 Thread Amit Kapila
On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund and...@2ndquadrant.com
wrote:

 Attached is a new version of the atomic operations patch. Lots has
 changed since the last post:

 * gcc, msvc work. acc, xlc, sunpro have blindly written support which
   should be relatively easy to fix up. All try to implement TAS, 32 bit
   atomic add, 32 bit compare exchange; some do it for 64bit as well.

I have reviewed msvc part of patch and below are my findings:

1.
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+ uint32 *expected, uint32 newval)
+{
+ bool ret;
+ uint32 current;
+ current = InterlockedCompareExchange(&ptr->value, newval, *expected);

Is there a reason why using InterlockedCompareExchange() is better
than using the InterlockedCompareExchangeAcquire() variant?
The *Acquire or *Release variants provide acquire or release memory
ordering semantics on processors which support them; otherwise they
are defined as InterlockedCompareExchange.


2.
+pg_atomic_compare_exchange_u32_impl()
{
..
+ *expected = current;
}

Can we implement this API without changing the expected value?
I mean, if InterlockedCompareExchange() succeeds, then it will
store newval in ptr->value; otherwise *ret* will be false.
I am not sure if there is anything implementation-specific in changing
*expected*.

I think this value is required for lwlock patch, but I am wondering why
can't the same be achieved if we just return the *current* value and
then let lwlock patch do the handling based on it.  This will have another
advantage that our pg_* API will also have similar signature as native
API's.


3. Is there an overhead of using the Add variant instead of the
increment/decrement versions of Interlocked?

I could not find any evidence clearly indicating that one is better
than the other, except a recommendation in the link below which suggests
using Increment/Decrement to *maintain reference counts*.

Refer to *The Locking Cookbook* in the link below:
http://blogs.technet.com/b/winserverperformance/archive/2008/05/21/designing-applications-for-high-performance-part-ii.aspx

However, when I checked the instructions each function generates,
I did not find anything suggesting that the *Add variant can be slower.

Refer to the instructions for both functions:

cPublicProc _InterlockedIncrement,1
cPublicFpo 1,0
mov ecx,Addend  ; get pointer to addend variable
mov eax,1   ; set increment value
Lock1:
   lock xadd[ecx],eax   ; interlocked increment
inc eax ; adjust return value
stdRET _InterlockedIncrement;

stdENDP _InterlockedIncrement




cPublicProc _InterlockedExchangeAdd, 2
cPublicFpo 2,0

mov ecx, [esp + 4]  ; get addend address
mov eax, [esp + 8]  ; get increment value
Lock4:
   lock xadd[ecx], eax  ; exchange add

stdRET  _InterlockedExchangeAdd

stdENDP _InterlockedExchangeAdd

For details, refer link:
http://read.pudn.com/downloads3/sourcecode/windows/248345/win2k/private/windows/base/client/i386/critsect.asm__.htm

I don't think there is any need to change the current implementation;
however, I just wanted to share my analysis in case anyone else sees
a reason to prefer one implementation over the other.

4.
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier_impl() _ReadWriteBarrier()

#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

There is a Caution notice on the Microsoft site indicating that
_ReadWriteBarrier/MemoryBarrier are deprecated.

Please refer below link:
http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.110).aspx

When I checked previous versions of MSVC, I noticed that these are
supported on x86, IPF, x64 till VS2008 and supported only on x86, x64
for VS2012 onwards.

Link for VS2010 support:
http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.100).aspx

5.
+ * atomics-generic-msvc.h
..
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group

Shouldn't copyright notice be 2014?

6.
pg_atomic_compare_exchange_u32()

It is better to have comments above this and all other related functions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] better atomics - v0.5

2014-06-28 Thread Andres Freund
On 2014-06-28 11:58:41 +0530, Amit Kapila wrote:
 On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund and...@2ndquadrant.com
 wrote:
 
  Attached is a new version of the atomic operations patch. Lots has
  changed since the last post:
 
  * gcc, msvc work. acc, xlc, sunpro have blindly written support which
should be relatively easy to fix up. All try to implement TAS, 32 bit
atomic add, 32 bit compare exchange; some do it for 64bit as well.
 
 I have reviewed msvc part of patch and below are my findings:

Thanks for looking at it - I pretty much wrote that part blindly.

 1.
 +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 + uint32 *expected, uint32 newval)
 +{
 + bool ret;
 + uint32 current;
 + current = InterlockedCompareExchange(&ptr->value, newval, *expected);
 
 Is there a reason why using InterlockedCompareExchange() is better
 than using the InterlockedCompareExchangeAcquire() variant?

Two things:
a) compare_exchange is a read/write operator and so far I've defined it
   to have full barrier semantics (documented in atomics.h). I'd be
   happy to discuss which semantics we want for which operation.
b) It's only supported from vista onwards. Afaik we still support XP.
c) It doesn't make any difference on x86 ;)

 The *Acquire or *Release variants provide acquire or release memory
 ordering semantics on processors which support them; otherwise they
 are defined as InterlockedCompareExchange.

 2.
 +pg_atomic_compare_exchange_u32_impl()
 {
 ..
 + *expected = current;
 }
 
 Can we implement this API without changing expected value?
 I mean, if InterlockedCompareExchange() succeeds, then it will
 store newval in ptr->value; otherwise *ret* will be false.
 I am not sure if there is anything implementation specific in changing
 *expected*.
 I think this value is required for lwlock patch, but I am wondering why
 can't the same be achieved if we just return the *current* value and
 then let lwlock patch do the handling based on it.  This will have another
 advantage that our pg_* API will also have similar signature as native
 API's.

Many usages of compare/exchange require that you'll get the old value
back in an atomic fashion. Unfortunately the Interlocked* API doesn't
provide a nice way to do that. Note that this definition of
compare/exchange maps nicely to both C11's atomics and the actual x86
cmpxchg instruction.

I've generally tried to mimic C11's API as far as possible. There are
some limitations because we can't do generic types and such, but other
than that.
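A hedged sketch of why compare-exchange hands back the old value: on failure, C11's atomic_compare_exchange_strong reloads *expected with the value it observed, so a retry loop needs no separate atomic read per iteration. The function name here is illustrative, not from the patch.

```c
#include <stdatomic.h>

unsigned fetch_and_set_bit(atomic_uint *p, unsigned bit)
{
    unsigned old = atomic_load(p);

    /* On failure, "old" already holds the freshly observed value,
     * exactly as the pg_atomic_compare_exchange_u32 contract specifies;
     * just retry with it. */
    while (!atomic_compare_exchange_strong(p, &old, old | bit))
        ;
    return old;                 /* value before the bit was set */
}
```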

 3. Is there an overhead of using the Add variant instead of the
 increment/decrement versions of Interlocked?

There's a theoretical difference for IA64 which has a special
increment/decrement operation which can only change the value by a
rather limited number of values. I don't think it's worth to care.

 4.
 #pragma intrinsic(_ReadWriteBarrier)
 #define pg_compiler_barrier_impl() _ReadWriteBarrier()
 
 #ifndef pg_memory_barrier_impl
 #define pg_memory_barrier_impl() MemoryBarrier()
 #endif
 
 There is a Caution notice on the Microsoft site indicating that
 _ReadWriteBarrier/MemoryBarrier are deprecated.

It seemed to be the most widely available API, and it's what barrier.h
already used.
Do you have a different suggestion?

 6.
 pg_atomic_compare_exchange_u32()
 
 It is better to have comments above this and all other related functions.

Check atomics.h, there's comments above it:
/*
 * pg_atomic_compare_exchange_u32 - CAS operation
 *
 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.
 *
 * Return whether the values have been exchanged.
 *
 * Full barrier semantics.
 */

Thanks,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] better atomics - v0.5

2014-06-28 Thread Andres Freund
On 2014-06-27 22:47:02 -0400, Robert Haas wrote:
 On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-06-27 13:15:25 -0400, Robert Haas wrote:
  On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   I don't really see usecases where it's not related data that's being
   touched, but: The point is that the fastpath (not using a spinlock) might
   touch the atomic variable, even while the slowpath has acquired the
   spinlock. So the slowpath can't just use non-atomic operations on the
   atomic variable.
   Imagine something like releasing a buffer pin while somebody else is
   doing something that requires holding the buffer header spinlock.
 
  If the atomic variable can be manipulated without the spinlock under
  *any* circumstances, then how is it a good idea to ever manipulate it
  with the spinlock?  That seems hard to reason about, and unnecessary.
  Critical sections for spinlocks should be short and contain only the
  instructions that need protection, and clearly the atomic op is not
  one of those.
 
  Imagine the situation for the buffer header spinlock which is one of the
  bigger performance issues atm. We could aim to replace all usages of
  that with clever and complicated logic, but it's hard.
 
  The IMO reasonable (and prototyped) way to do it is to make the common
  paths lockless, but fall back to the spinlock for the more complicated
  situations. For the buffer header that means that pin/unpin and buffer
  lookup are lockless, but IO and changing the identity of a buffer still
  require the spinlock. My attempts to avoid the latter basically required
  a buffer header specific reimplementation of spinlocks.
 
  To make pin/unpin et al safe without acquiring the spinlock the pincount
  and the flags had to be moved into one variable so that when
  incrementing the pincount you automatically get the flagbits back. If
  it's currently valid all is good, otherwise the spinlock needs to be
  acquired. But for that to work you need to set flags while the spinlock
  is held.
 
  Possibly you can come up with ways to avoid that, but I am pretty damn
  sure that the code will be less robust by forbidding atomics inside a
  critical section, not the contrary. It's a good idea to avoid it, but
  not at all costs.
 
 You might be right, but I'm not convinced.

Why? Anything I can do to convince you of this? Note that other users of
atomics (e.g. the linux kernel) widely use atomics inside spinlock
protected regions.

 Does the lwlock scalability patch work this way, too?

No. I've moved one add/sub inside a critical section in the current
version while debugging, but it's not required at all. I generally think
it makes sense to document the suggestion of moving atomics outside, but
not make it a requirement.

 What I'm basically afraid of is that this will work fine in many cases
 but have really ugly failure modes.  That's pretty much what happens
 with spinlocks already - the overhead is insignificant at low levels
 of contention but as the spinlock starts to become contended the CPUs
 all fight over the cache line and performance goes to pot.  ISTM that
 making the spinlock critical section significantly longer by putting
 atomic ops in there could appear to win in some cases while being very
 bad in others.

Well, I'm not saying it's something I suggest doing all the time. But if
using an atomic op in the slow path allows you to remove the spinlock
from 99% of the cases I don't see it having a significant impact.
In most scenarios (where atomics aren't emulated, i.e. platforms we
expect to be used in heavily concurrent cases) the spinlock and the atomic
will be on the same cacheline making stalls much less likely.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-28 Thread Heikki Linnakangas

On 06/27/2014 08:15 PM, Robert Haas wrote:

On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote:

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.


If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock?


With the WALInsertLock scaling stuff in 9.4, there are now two variables 
protected by a spinlock: the current WAL insert location, and the prev 
pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you 
need to grab the spinlock to update both of them atomically. But to just 
read the WAL insert pointer, you could do an atomic read of CurrBytePos 
if the architecture supports it; as it is, you have to grab the spinlock.


Admittedly that's just an atomic read, not an atomic compare and 
exchange or fetch-and-add. Or if the architecture has an atomic 128-bit 
compare & exchange op you could replace the spinlock with that. But it's 
not hard to imagine similar situations where you sometimes need to lock 
a larger data structure to modify it atomically, but sometimes you just 
need to modify part of it and an atomic op would suffice.
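An illustrative sketch of the situation described above, with made-up names: two related positions are updated together under a lock, but the insert position alone could be read with a single atomic 64-bit load on platforms that support it.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Invented layout for illustration, not the actual WAL insert state. */
typedef struct XLogInsertPos
{
    _Atomic uint64_t CurrBytePos;   /* could be read without the spinlock */
    uint64_t         PrevBytePos;   /* only accessed with the lock held */
} XLogInsertPos;

uint64_t read_insert_pos(XLogInsertPos *pos)
{
    /* one atomic load replaces acquire/release of the spinlock */
    return atomic_load(&pos->CurrBytePos);
}
```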


I thought Andres' LWLock patch also did something like that. If the lock 
is not contended, you can acquire it with an atomic compare & exchange 
to increment the exclusive/shared counter. But to manipulate the wait 
queue, you need the spinlock.


- Heikki





Re: [HACKERS] better atomics - v0.5

2014-06-27 Thread Robert Haas
On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't really see usecases where it's not related data that's being
 touched, but: The point is that the fastpath (not using a spinlock) might
 touch the atomic variable, even while the slowpath has acquired the
 spinlock. So the slowpath can't just use non-atomic operations on the
 atomic variable.
 Imagine something like releasing a buffer pin while somebody else is
 doing something that requires holding the buffer header spinlock.

If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock?  That seems hard to reason about, and unnecessary.
Critical sections for spinlocks should be short and contain only the
instructions that need protection, and clearly the atomic op is not
one of those.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] better atomics - v0.5

2014-06-27 Thread Andres Freund
On 2014-06-27 13:15:25 -0400, Robert Haas wrote:
 On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com wrote:
  I don't really see usecases where it's not related data that's being
  touched, but: The point is that the fastpath (not using a spinlock) might
  touch the atomic variable, even while the slowpath has acquired the
  spinlock. So the slowpath can't just use non-atomic operations on the
  atomic variable.
  Imagine something like releasing a buffer pin while somebody else is
  doing something that requires holding the buffer header spinlock.
 
 If the atomic variable can be manipulated without the spinlock under
 *any* circumstances, then how is it a good idea to ever manipulate it
 with the spinlock?  That seems hard to reason about, and unnecessary.
 Critical sections for spinlocks should be short and contain only the
 instructions that need protection, and clearly the atomic op is not
 one of those.

Imagine the situation for the buffer header spinlock which is one of the
bigger performance issues atm. We could aim to replace all usages of
that with clever and complicated logic, but it's hard.

The IMO reasonable (and prototyped) way to do it is to make the common
paths lockless, but fall back to the spinlock for the more complicated
situations. For the buffer header that means that pin/unpin and buffer
lookup are lockless, but IO and changing the identity of a buffer still
require the spinlock. My attempts to avoid the latter basically required
a buffer header specific reimplementation of spinlocks.

To make pin/unpin et al safe without acquiring the spinlock the pincount
and the flags had to be moved into one variable so that when
incrementing the pincount you automatically get the flagbits back. If
it's currently valid all is good, otherwise the spinlock needs to be
acquired. But for that to work you need to set flags while the spinlock
is held.

Possibly you can come up with ways to avoid that, but I am pretty damn
sure that the code will be less robust by forbidding atomics inside a
critical section, not the contrary. It's a good idea to avoid it, but
not at all costs.
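The pincount-plus-flags scheme described above can be sketched as follows. This is a hypothetical layout with invented names, not the actual buffer header: the pin count lives in the low bits and the flags in the high bits of one atomic word, so a lockless fetch_add on the pin count returns the flag bits in the same atomic read-modify-write.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative layout: pin count in the low 16 bits, flags in the high bits. */
#define BUF_FLAG_VALID     (1u << 31)
#define BUF_PINCOUNT_MASK  0x0000FFFFu

typedef struct BufDesc { atomic_uint state; } BufDesc;

/* Lockless fastpath: incrementing the pin count returns the flag bits in
 * the same operation, so validity can be checked without the spinlock. */
bool pin_buffer_fastpath(BufDesc *buf)
{
    unsigned old = atomic_fetch_add(&buf->state, 1);

    if (old & BUF_FLAG_VALID)
        return true;        /* pinned a valid buffer without the spinlock */

    /* not valid: undo the pin; caller falls back to the
     * spinlock-protected slowpath */
    atomic_fetch_sub(&buf->state, 1);
    return false;
}
```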

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-27 Thread Robert Haas
On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-27 13:15:25 -0400, Robert Haas wrote:
 On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  I don't really see usecases where it's not related data that's being
  touched, but: The point is that the fastpath (not using a spinlock) might
  touch the atomic variable, even while the slowpath has acquired the
  spinlock. So the slowpath can't just use non-atomic operations on the
  atomic variable.
  Imagine something like releasing a buffer pin while somebody else is
  doing something that requires holding the buffer header spinlock.

 If the atomic variable can be manipulated without the spinlock under
 *any* circumstances, then how is it a good idea to ever manipulate it
 with the spinlock?  That seems hard to reason about, and unnecessary.
 Critical sections for spinlocks should be short and contain only the
 instructions that need protection, and clearly the atomic op is not
 one of those.

 Imagine the situation for the buffer header spinlock which is one of the
 bigger performance issues atm. We could aim to replace all usages of
 that with clever and complicated logic, but it's hard.

 The IMO reasonable (and prototyped) way to do it is to make the common
 paths lockless, but fall back to the spinlock for the more complicated
 situations. For the buffer header that means that pin/unpin and buffer
 lookup are lockless, but IO and changing the identity of a buffer still
 require the spinlock. My attempts to avoid the latter basically required
 a buffer header specific reimplementation of spinlocks.

 To make pin/unpin et al safe without acquiring the spinlock the pincount
 and the flags had to be moved into one variable so that when
 incrementing the pincount you automatically get the flagbits back. If
 it's currently valid all is good, otherwise the spinlock needs to be
 acquired. But for that to work you need to set flags while the spinlock
 is held.

 Possibly you can come up with ways to avoid that, but I am pretty damn
 sure that the code will be less robust by forbidding atomics inside a
 critical section, not the contrary. It's a good idea to avoid it, but
 not at all costs.

You might be right, but I'm not convinced.  Does the lwlock
scalability patch work this way, too?  I was hoping to look at that
RSN, so if there are roughly the same issues there it might shed some
light on this point also.

What I'm basically afraid of is that this will work fine in many cases
but have really ugly failure modes.  That's pretty much what happens
with spinlocks already - the overhead is insignificant at low levels
of contention but as the spinlock starts to become contended the CPUs
all fight over the cache line and performance goes to pot.  ISTM that
making the spinlock critical section significantly longer by putting
atomic ops in there could appear to win in some cases while being very
bad in others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Andres Freund
On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
 On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com wrote:
  Since it better be legal to manipulate a atomic variable while holding a
  spinlock we cannot simply use an arbitrary spinlock as backing for
  atomics. That'd possibly cause us to wait on ourselves or cause
  deadlocks.
 
 I think that's going to fall afoul of Tom's previously-articulated no
 loops inside spinlocks rule.  Most atomics, by nature, are
 loop-until-it-works.

Well, so is TAS itself :).

More seriously, I think we're not going to have much fun if we're making
up the rule that you can't do an atomic add/sub while a spinlock is
held. That just precludes too many use cases and will make the code much
harder to understand. I don't think we're going to end up having many
problems if we allow atomic read/add/sub/write in there.

  How much would we lose if we supported compiler intrinsics on
  versions of GCC and MSVC that have it and left everything else to
  future patches?
 
  The discussion so far seemed pretty clear that we can't regress somewhat
  frequently used platforms. And dropping support for all but msvc and gcc
  would end up doing that. We're going to have to do the legwork for the
  most common platforms... Note that I didn't use assembler for those, but
  relied on intrinsics...
 
 We can't *regress* somewhat-frequently used platforms, but that's
 different from saying we've got to support *new* facilities on those
 platforms.

Well, we want to use the new facilities somewhere. And it's not entirely
unfair to call it a regression if other os/compiler/arch combinations
speed up but yours don't.

 Essentially the entire buildfarm is running either gcc,
 some Microsoft compiler, or a compiler that's supports the same
 atomics intrinsics as gcc i.e. icc or clang.  Some of those compilers
 may be too old to support the atomics intrinsics, and there's one Sun
 Studio animal, but I don't know that we need to care about those
 things in this patch...

I think we should support those where it's easy and where we'll see the
breakage on the buildfarm. Sun studio's intrinsics are simple enough
that I don't think my usage of them will be too hard to fix.

 ...unless of course the atomics fallbacks in this patch are
 sufficiently sucky that anyone who ends up using those is going to be
 sad.  Then the bar is a lot higher.  But if that's the case then I
 wonder if we're really on the right course here.

I don't think you'll notice it much on mostly nonconcurrent uses. But some of
those architectures/platforms do support larger NUMA like setups. And
for those it'd probably hurt for some workloads. And e.g. solaris is
still fairly common for such setups.

  I added the x86 inline assembler because a fair number of buildfarm
  animals use ancient gcc's and because I could easily test it. It's also
  more efficient for gcc < 4.6. I'm not wedded to keeping it.
 
 Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
 implementations aren't that good, that's a pretty good argument for
 keeping our own implementations around.  :-(

The difference isn't that big. It's
return __atomic_compare_exchange_n(&ptr->value, expected, newval,
   false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
vs

bool    ret;
uint64  current;

current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
ret = current == *expected;
*expected = current;
return ret;

On x86 that's an additional cmp, mov and such. Useless pos because
cmpxchg already computes all that... (that's returned by the setz in my
patch). MSVC only supports the second form as well.

There's a fair number of < 4.2 gccs on the buildfarm as well though.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Andres Freund
On 2014-06-25 20:22:31 -0400, Robert Haas wrote:
 On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  I think having a separate file for each architecture is nice. I totally
  agree that they don't belong in src/include/storage, though. s_lock.h has
  always been misplaced there, but we've let it be for historical reasons, but
  now that we're adding a dozen new files, it's time to move them out.
 
 I find the current organization pretty confusing, but maybe that could
 be solved by better documentation of what's supposed to go in each
 architecture or compiler-dependent file.

The idea is that first an architecture-specific file (atomics-arch-*.h)
is included. That file can provide a (partial) implementation for the
specific architecture. Or it can do pretty much nothing.

After that a compiler specific file is included
(atomics-generic-*.h). If atomics aren't yet implemented that can
provide an intrinsics based implementation if the compiler version has
support for it. At the very least a compiler barrier should be provided.

After that the spinlock based fallback implementation
(atomics-fallback.h) provides atomics and barriers if not yet
available. By here we're sure that nothing else will provide them.

Then we can provide operations (atomics-generic.h) that build on top of
the provided functions. E.g. implement _sub, _and et al.

I'll include some more of that explanation in the header.
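The layering above can be sketched as a chain of includes. The file names follow the proposal; the concrete architecture and compiler files picked here are placeholders:

```c
/* Sketch of the include order described above (illustrative file names). */
#include "atomics-arch-x86.h"      /* 1. may provide a (partial) implementation */
#include "atomics-generic-gcc.h"   /* 2. compiler intrinsics fill remaining gaps */
#include "atomics-fallback.h"      /* 3. spinlock emulation for whatever is left */
#include "atomics-generic.h"       /* 4. derived ops (_sub, _and, ...) on top */
```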

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Merlin Moncure
On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
 On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  Since it better be legal to manipulate a atomic variable while holding a
  spinlock we cannot simply use an arbitrary spinlock as backing for
  atomics. That'd possibly cause us to wait on ourselves or cause
  deadlocks.

 I think that's going to fall afoul of Tom's previously-articulated no
 loops inside spinlocks rule.  Most atomics, by nature, are
 loop-until-it-works.

 Well, so is TAS itself :).

 More seriously, I think we're not going to have much fun if we're making
 up the rule that you can't do an atomic add/sub while a spinlock is
  held. That just precludes too many use cases and will make the code much
 harder to understand. I don't think we're going to end up having many
 problems if we allow atomic read/add/sub/write in there.

That rule seems reasonable -- why would you ever want to do this?
While you couldn't properly deadlock it seems like it could lead to
unpredictable and hard to diagnose performance stalls.

merlin




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Andres Freund
On 2014-06-26 07:44:14 -0500, Merlin Moncure wrote:
 On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
  On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   Since it better be legal to manipulate a atomic variable while holding a
   spinlock we cannot simply use an arbitrary spinlock as backing for
   atomics. That'd possibly cause us to wait on ourselves or cause
   deadlocks.
 
  I think that's going to fall afoul of Tom's previously-articulated no
  loops inside spinlocks rule.  Most atomics, by nature, are
  loop-until-it-works.
 
  Well, so is TAS itself :).
 
  More seriously, I think we're not going to have much fun if we're making
  up the rule that you can't do an atomic add/sub while a spinlock is
   held. That just precludes too many use cases and will make the code much
  harder to understand. I don't think we're going to end up having many
  problems if we allow atomic read/add/sub/write in there.
 
 That rule seems reasonable -- why would you ever want to do this?

Are you wondering why you'd ever manipulate an atomic op inside a
spinlock?
There's enough algorithms where a slowpath is done under a spinlock but
the fastpath is done without. You can't simply use nonatomic operations
for manipulation under the spinlock because the fastpaths might then
observe/cause bogus state.

Obviously I'm not advocating doing random stuff in spinlocks. We're
talking about single lock xadd's (or the LL/SC equivalent) here.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Noah Misch
On Thu, Jun 26, 2014 at 12:20:06PM +0200, Andres Freund wrote:
 On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
  On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   Since it better be legal to manipulate an atomic variable while holding a
   spinlock we cannot simply use an arbitrary spinlock as backing for
   atomics. That'd possibly cause us to wait on ourselves or cause
   deadlocks.
  
  I think that's going to fall afoul of Tom's previously-articulated no
  loops inside spinlocks rule.  Most atomics, by nature, are
  loop-until-it-works.
 
 Well, so is TAS itself :).
 
 More seriously, I think we're not going to have much fun if we're making
 up the rule that you can't do an atomic add/sub while a spinlock is
 held. That just precludes too many use cases and will make the code much
 harder to understand. I don't think we're going to end up having many
 problems if we allow atomic read/add/sub/write in there.

I agree it's valuable to have a design that permits invoking an atomic
operation while holding a spinlock.

   I added the x86 inline assembler because a fair number of buildfarm
   animals use ancient gcc's and because I could easily test it. It's also
   more efficient for gcc < 4.6. I'm not wedded to keeping it.
  
  Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
  implementations aren't that good, that's a pretty good argument for
  keeping our own implementations around.  :-(

GCC 4.6 was released 2011-03-25, though that doesn't change the bottom line.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Robert Haas
On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
 On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  Since it better be legal to manipulate an atomic variable while holding a
  spinlock we cannot simply use an arbitrary spinlock as backing for
  atomics. That'd possibly cause us to wait on ourselves or cause
  deadlocks.

 I think that's going to fall afoul of Tom's previously-articulated no
 loops inside spinlocks rule.  Most atomics, by nature, are
 loop-until-it-works.

 Well, so is TAS itself :).

Yeah, which is why we have a rule that you're not supposed to acquire
another spinlock while already holding one.  You're trying to make the
spinlocks used to simulate atomic ops an exception to that rule, but
I'm not sure that's OK.  An atomic op is a lot more expensive than a
regular unlocked load or store even if it doesn't loop, and if it does
loop, it's worse, potentially by a large multiple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Robert Haas
On Thu, Jun 26, 2014 at 7:21 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-25 20:22:31 -0400, Robert Haas wrote:
 On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  I think having a separate file for each architecture is nice. I totally
  agree that they don't belong in src/include/storage, though. s_lock.h has
  always been misplaced there, but we've let it be for historical reasons, 
  but
  now that we're adding a dozen new files, it's time to move them out.

 I find the current organization pretty confusing, but maybe that could
 be solved by better documentation of what's supposed to go in each
 architecture or compiler-dependent file.

 The idea is that first a architecture specific file (atomics-arch-*.h)
 is included. That file can provide a (partial) implementation for the
 specific architecture. Or it can do pretty much nothing.

 After that a compiler specific file is included
 (atomics-generic-*.h). If atomics aren't yet implemented that can
 provide an intrinsics based implementation if the compiler version has
 support for it. At the very least a compiler barrier should be provided.

 After that the spinlock based fallback implementation
 (atomics-fallback.h) provides atomics and barriers if not yet
 available. By here we're sure that nothing else will provide them.

 Then we can provide operations (atomics-generic.h) that build ontop of
 the provided functions. E.g. implement _sub, _and et al.

 I'll include some more of that explanation in the header.
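As a sketch, the include layering Andres describes could be written as an umbrella header (using the file names from his description; the exact contents are illustrative, not the patch's code):

```c
/* atomics.h -- umbrella header; the include order is the contract */

/* 1. Architecture-specific file: may provide a (partial)
 *    implementation, e.g. inline assembler, or do pretty much nothing. */
#include "atomics-arch-x86.h"

/* 2. Compiler-specific file: fills remaining gaps with intrinsics where
 *    the compiler version supports them; at the very least provides a
 *    compiler barrier. */
#include "atomics-generic-gcc.h"

/* 3. Spinlock-based fallback: after this point, atomics and barriers
 *    are guaranteed to be defined. */
#include "atomics-fallback.h"

/* 4. Derived operations built on top of the primitives,
 *    e.g. _sub and _and implemented via _add / compare-exchange. */
#include "atomics-generic.h"
```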

I get the general principle, but I think it would be good to have one
place that says something like:

Each architecture must provide A and B, may provide both or neither of
C and D, and may also provide any or all of E, F, and G.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
 I think that's going to fall afoul of Tom's previously-articulated no
 loops inside spinlocks rule.  Most atomics, by nature, are
 loop-until-it-works.

 Well, so is TAS itself :).

 Yeah, which is why we have a rule that you're not supposed to acquire
 another spinlock while already holding one.  You're trying to make the
 spinlocks used to simulate atomic ops an exception to that rule, but
 I'm not sure that's OK.  An atomic op is a lot more expensive than a
 regular unlocked load or store even if it doesn't loop, and if it does
 loop, it's worse, potentially by a large multiple.

Well, the point here is to be sure that there's a reasonably small bound
on how long we hold the spinlock.  It doesn't have to be zero, just small.

Would it be practical to say that the coding rule is that you can only
use atomic ops on fields that are protected by the spinlock, ie, nobody
else is supposed to be touching those fields while you have the spinlock?
If that's the case, then the atomic op should see no contention and thus
not take very long.

On the other hand, if the atomic op is touching something not protected
by the spinlock, that seems to me to be morally equivalent to taking a
spinlock while holding another one, which as Robert says is forbidden
by our current coding rules, and for very good reasons IMO.

However, that may be safe only as long as we have real atomic ops.
If we're simulating those using spinlocks then you have to ask questions
about how many underlying spinlocks there are and whether artificial
contention might ensue (due to the same spinlock being used for multiple
things).

regards, tom lane




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Andres Freund
On 2014-06-26 11:47:15 -0700, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
  On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
  I think that's going to fall afoul of Tom's previously-articulated no
  loops inside spinlocks rule.  Most atomics, by nature, are
  loop-until-it-works.
 
  Well, so is TAS itself :).
 
  Yeah, which is why we have a rule that you're not supposed to acquire
  another spinlock while already holding one.  You're trying to make the
  spinlocks used to simulate atomic ops an exception to that rule, but
  I'm not sure that's OK.  An atomic op is a lot more expensive than a
  regular unlocked load or store even if it doesn't loop, and if it does
  loop, it's worse, potentially by a large multiple.
 
 Well, the point here is to be sure that there's a reasonably small bound
 on how long we hold the spinlock.  It doesn't have to be zero, just small.
 
 Would it be practical to say that the coding rule is that you can only
 use atomic ops on fields that are protected by the spinlock, ie, nobody
 else is supposed to be touching those fields while you have the spinlock?
 If that's the case, then the atomic op should see no contention and thus
 not take very long.

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.
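A minimal sketch of that buffer-pin scenario, using C11 atomics and hypothetical names (the real buffer headers are more involved): the fastpath can fire at any time, so the lock-holding slowpath must also use atomic operations on the shared field.

```c
#include <stdatomic.h>

/* The refcount is manipulated two ways: a lock-free fastpath, and a
 * slowpath that holds the header spinlock.  Because the fastpath can
 * run concurrently, a plain read-modify-write under the lock could
 * lose a concurrent unpin -- the slowpath must stay atomic too. */
typedef struct
{
    atomic_int  header_lock;    /* stand-in for the buffer header spinlock */
    atomic_uint refcount;
} BufferDesc;

/* Fastpath: release a pin without taking the spinlock. */
static void
unpin_fast(BufferDesc *buf)
{
    atomic_fetch_sub(&buf->refcount, 1);
}

/* Slowpath: needs the header lock for other work, but still has to
 * bump the refcount atomically because unpin_fast() may run meanwhile. */
static void
pin_locked(BufferDesc *buf)
{
    while (atomic_exchange(&buf->header_lock, 1) != 0)
        ;                       /* spin -- SpinLockAcquire() in reality */
    atomic_fetch_add(&buf->refcount, 1);
    atomic_store(&buf->header_lock, 0);
}
```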

 On the other hand, if the atomic op is touching something not protected
 by the spinlock, that seems to me to be morally equivalent to taking a
 spinlock while holding another one, which as Robert says is forbidden
 by our current coding rules, and for very good reasons IMO.

 However, that may be safe only as long as we have real atomic ops.
 If we're simulating those using spinlocks then you have to ask questions
 about how many underlying spinlocks there are and whether artificial
 contention might ensue (due to the same spinlock being used for multiple
 things).

With the code as submitted it's one spinlock per atomic variable. Which
is fine until spinlocks are emulated using semaphores - in that case a
separate array of semaphores (quite small in the patch as submitted) is
used so there's no chance of deadlocks against a 'real' spinlock or
anything like that.
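A toy model of that emulation (not the patch's exact code): each atomic variable carries its own backing spinlock, so taking it while holding an unrelated "real" spinlock cannot deadlock; when spinlocks are themselves emulated with semaphores, the patch instead draws locks from a small dedicated semaphore array.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One spinlock per atomic variable, embedded in the struct -- this also
 * works when the variable is mapped at different addresses in different
 * processes (think dsm), unlike an address-hashed lock table. */
typedef struct
{
    atomic_flag lock;           /* stand-in for the per-variable slock_t */
    uint32_t    value;
} pg_atomic_uint32;

static void
pg_atomic_init_u32(pg_atomic_uint32 *ptr, uint32_t val)
{
    atomic_flag_clear(&ptr->lock);
    ptr->value = val;
}

/* Spinlock-backed fetch-and-add: the emulation's critical section is a
 * handful of instructions, keeping the hold time strictly bounded. */
static uint32_t
pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, int32_t add)
{
    uint32_t    old;

    while (atomic_flag_test_and_set(&ptr->lock))
        ;                       /* spin */
    old = ptr->value;
    ptr->value = old + add;
    atomic_flag_clear(&ptr->lock);
    return old;
}
```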

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-26 Thread Merlin Moncure
On Thu, Jun 26, 2014 at 1:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Would it be practical to say that the coding rule is that you can only
 use atomic ops on fields that are protected by the spinlock, ie, nobody
 else is supposed to be touching those fields while you have the spinlock?
 If that's the case, then the atomic op should see no contention and thus
 not take very long.

I wonder if this is true in all cases.  The address you are locking
might be logically protected but at the same time nearby to other
memory that is under contention.  In other words, I don't think you
can assume an atomic op locks exactly 4 bytes of memory for the
operation.
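Merlin's caveat is about granularity: the hardware cache-locks a whole line, not just the operand's 4 bytes, so an atomic op on a logically protected field can still bounce against unrelated hot data on the same line. A common mitigation is padding (illustrative sketch, assuming 64-byte lines):

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 64      /* typical on x86; an assumption here */

/* A lock xadd on 'counter' effectively locks its entire cache line.
 * Padding the struct keeps unrelated hot fields off that line. */
typedef struct
{
    uint32_t counter;
    char     pad[CACHE_LINE_SIZE - sizeof(uint32_t)];
} padded_counter;
```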

 On the other hand, if the atomic op is touching something not protected
 by the spinlock, that seems to me to be morally equivalent to taking a
 spinlock while holding another one, which as Robert says is forbidden
 by our current coding rules, and for very good reasons IMO.

Well not quite: you don't have the possibility of deadlock.

merlin




Re: [HACKERS] better atomics - v0.5

2014-06-25 Thread Josh Berkus
On 06/25/2014 10:14 AM, Andres Freund wrote:
 Hi,
 
 [sorry for the second copy Robert]
 
 Attached is a new version of the atomic operations patch. Lots has
 changed since the last post:

Is this at a state where we can performance-test it yet?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] better atomics - v0.5

2014-06-25 Thread Andres Freund
On 2014-06-25 10:39:53 -0700, Josh Berkus wrote:
 On 06/25/2014 10:14 AM, Andres Freund wrote:
  Hi,
  
  [sorry for the second copy Robert]
  
  Attached is a new version of the atomic operations patch. Lots has
  changed since the last post:
 
 Is this at a state where we can performance-test it yet?

Well, this patch itself won't have a performance impact at all. It's
about providing the infrastructure to use atomic operations in further
patches in a compiler- and platform-independent way.
I'll repost a new version of the lwlocks patch (still fighting an issue
I introduced while rebasing on top of Heikki's xlog scalability/lwlock
merge) soon, addressing the review comments Amit has made since.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-25 Thread Robert Haas
On Wed, Jun 25, 2014 at 1:14 PM, Andres Freund and...@2ndquadrant.com wrote:
 [sorry for the second copy Robert]

 Attached is a new version of the atomic operations patch. Lots has
 changed since the last post:

 * gcc, msvc work. acc, xlc, sunpro have blindly written support which
   should be relatively easy to fix up. All try to implement TAS, 32 bit
   atomic add, 32 bit compare exchange; some do it for 64bit as well.

 * x86 gcc inline assembly means that we support every gcc version on x86
   >= i486; even when the __sync APIs aren't available yet.

 * 'inline' support isn't required anymore. We fall back to
   ATOMICS_INCLUDE_DEFINITIONS/STATIC_IF_INLINE etc. trickery.

 * When the current platform doesn't have support for atomic operations a
   spinlock backed implementation is used. This can be forced using
   --disable-atomics.

   That even works when semaphores are used to implement spinlocks (a
   separate array is used there to avoid nesting problems). In contrast
   to an earlier implementation, this even works when atomics are mapped
   to different addresses in individual processes (think dsm).

 * s_lock.h falls back to the atomics.h provided APIs iff it doesn't have
   native support for the current platform. This can be forced by
   defining USE_ATOMICS_BASED_SPINLOCKS. Due to generic compiler
   intrinsics based implementations that should make it easier to bring
   up postgres on different platforms.

 * I tried to improve the documentation of the facilities in
   src/include/storage/atomics.h. Including documentation of the various
   barrier semantics.

 * There's tests in regress.c that get call via a SQL function from the
   regression tests.

 * Lots of other details changed, but that's the major pieces.

Review:

- The changes to spin.c include unnecessary whitespace adjustments.
- The new argument to s_init_lock_sema() isn't used.
- on top is two words, not one (ontop).
- The changes to pg_config_manual.h add a superfluous blank line.
- Are the regression tests in regress.c likely to catch anything?  Really?
- The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which
seems to be testing for __builtin_constant_p.  I don't see that being
used in the patch anywhere, though; and also, why doesn't the naming
match?
- s_lock.h adds some fallback code to use atomics to define TAS on
platforms where GCC supports atomics but where we do not have a
working TAS implementation.  However, this supposed fallback code
defines HAS_TEST_AND_SET unconditionally, so I think it will get used
even if we don't have (or have disabled via configure) atomics.
- I completely loathe the file layout you've proposed in
src/include/storage.  If we're going to have separate #include files
for each architecture (and I'd rather we didn't go that route, because
it's messy and unlike what we have now), then shouldn't that stuff go
in src/include/port or a new directory src/include/port/atomics?
Polluting src/include/storage with a whole bunch of files called
atomics-whatever.h strikes me as completely ugly, and there's a lot of
duplicate code in there.
- What's the point of defining pg_read_barrier() to be
pg_read_barrier_impl() and pg_write_barrier() to be
pg_write_barrier_impl() and so forth?  That seems like unnecessary
notation.
- Should SpinlockSemaInit() be assigning SpinlockSemas() to a local
variable instead of re-executing it every time?  Granted, the compiler
might inline this somehow, but...

More generally, I'm really wondering if you're making the
compare-and-swap implementation a lot more complicated than it needs
to be.  How much would we lose if we supported compiler intrinsics on
versions of GCC and MSVC that have it and left everything else to
future patches?  I suspect this patch could be about 20% of its
current size and give us everything that we need.  I've previously
opposed discarding s_lock.h on the grounds that it's extremely well
battle-tested, and if we change it, we might introduce subtle bugs
that are dependent on GCC versions and such.  But now that compiler
intrinsics are relatively common, I don't really see any reason for us
to provide our own assembler versions of *new* primitives we want to
use.  As long as we have some kind of wrapper functions or macros, we
retain the *option* to add workarounds for compiler bugs or lack of
compiler support on platforms, but it seems an awful lot simpler to me
to start by assuming that the compiler will DTRT, and only roll
our own if that proves to be necessary.  It's not like our hand-rolled
implementations couldn't have bugs - which is different from s_lock.h,
where we have evidence that the exact coding in that file does or did
work on those platforms.
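The wrapper approach Robert describes might look like the following sketch (hypothetical names; the gcc `__sync` builtins are real, everything else is illustrative). The default path trusts the compiler, while the `#else` branch marks where a hand-rolled per-platform override would slot in if needed.

```c
#include <stdint.h>

/* Thin wrapper over the compiler intrinsic.  __sync_fetch_and_add and
 * __sync_val_compare_and_swap exist in gcc since 4.1 and are also
 * supported by clang and icc. */
static inline uint32_t
pg_atomic_fetch_add_u32(volatile uint32_t *ptr, int32_t add)
{
#if defined(__GNUC__) || defined(__INTEL_COMPILER)
    return __sync_fetch_and_add(ptr, add);
#else
#error "no atomics support on this compiler; supply a platform override"
#endif
}

/* Returns true on success; on failure, *expected is updated to the
 * value actually found, so callers can loop cheaply. */
static inline int
pg_atomic_compare_exchange_u32(volatile uint32_t *ptr,
                               uint32_t *expected, uint32_t newval)
{
#if defined(__GNUC__) || defined(__INTEL_COMPILER)
    uint32_t    old = __sync_val_compare_and_swap(ptr, *expected, newval);
    int         ok = (old == *expected);

    *expected = old;
    return ok;
#else
#error "no atomics support on this compiler; supply a platform override"
#endif
}
```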

Similarly, I would rip out - or separate into a separate patch - the
code that tries to use atomic operations to implement TAS if we don't
already have an implementation in s_lock.h.  That might be worth
doing, if there are any common architectures that don't already have

Re: [HACKERS] better atomics - v0.5

2014-06-25 Thread Andres Freund
Hi,

On 2014-06-25 15:54:37 -0400, Robert Haas wrote:
 - The new argument to s_init_lock_sema() isn't used.

It's used when atomics fallback to spinlocks which fall back to
semaphores. c.f. atomics.c.

Since it better be legal to manipulate an atomic variable while holding a
spinlock we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.

 - Are the regression tests in regress.c likely to catch anything?
 Really?

They already caught several bugs. If the operations were immediately
widely used they probably wouldn't be necessary.

 - The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which
 seems to be testing for __builtin_constant_p.  I don't see that being
 used in the patch anywhere, though; and also, why doesn't the naming
 match?

Uh, that's a rebase screwup. It should be entirely gone.

 - s_lock.h adds some fallback code to use atomics to define TAS on
 platforms where GCC supports atomics but where we do not have a
 working TAS implementation.  However, this supposed fallback code
 defines HAS_TEST_AND_SET unconditionally, so I think it will get used
 even if we don't have (or have disabled via configure) atomics.

Hm. Good point. It should protect against that.

 - I completely loathe the file layout you've proposed in
 src/include/storage.  If we're going to have separate #include files
 for each architecture (and I'd rather we didn't go that route, because
 it's messy and unlike what we have now), then shouldn't that stuff go
 in src/include/port or a new directory src/include/port/atomics?
 Polluting src/include/storage with a whole bunch of files called
 atomics-whatever.h strikes me as completely ugly, and there's a lot of
 duplicate code in there.

I don't mind moving it somewhere else and it's easy enough to change.

 - What's the point of defining pg_read_barrier() to be
 pg_read_barrier_impl() and pg_write_barrier() to be
 pg_write_barrier_impl() and so forth?  That seems like unnecessary
 notation.

We could forgo it for pg_read_barrier() et al, but for the actual
atomics it's useful because it allows putting common checks in the !impl
routines.

 - Should SpinlockSemaInit() be assigning SpinlockSemas() to a local
 variable instead of re-executing it every time?  Granted, the compiler
 might inline this somehow, but...

I sure hope it will, but still a good idea.

 More generally, I'm really wondering if you're making the
 compare-and-swap implementation a lot more complicated than it needs
 to be.

Possible.

 How much would we lose if we supported compiler intrinsics on
 versions of GCC and MSVC that have it and left everything else to
 future patches?

The discussion so far seemed pretty clear that we can't regress somewhat
frequently used platforms. And dropping support for all but msvc and gcc
would end up doing that. We're going to have to do the legwork for the
most common platforms... Note that I didn't use assembler for those, but
relied on intrinsics...

 I suspect this patch could be about 20% of its current size and give
 us everything that we need.

I think the only way to do that would be to create a s_lock.h like
maze. Which imo is a utterly horrible idea. I'm pretty sure there'll be
more assembler implementations for 9.5, and unless we think ahead about how
to structure that we'll create something bad.
I also think that a year or so down the road we're going to support more
operations. E.g. optional support for a 16byte CAS can allow for some
awesome things when you want to do lockless stuff on a 64bit platform.

I think there's chances for reducing the size by merging i386/amd64,
that looked better earlier than it does now (due to the addition of the
inline assembler).

 I've previously
 opposed discarding s_lock.h on the grounds that it's extremely well
 battle-tested, and if we change it, we might introduce subtle bugs
 that are dependent on GCC versions and such.

I think we should entirely get rid of s_lock.h. It's an unmaintainable
mess. That's what I'd done in an earlier version of the patch, but you'd
argued against it. I think you're right in that it shouldn't be done in
the same patch and possibly not even in the same cycle, but I think we
better do it sometime not too far away.
I e.g. don't see how we can guarantee sane barrier semantics with the
current implementation.

 But now that compiler
 intrinsics are relatively common, I don't really see any reason for us
 to provide our own assembler versions of *new* primitives we want to
 use.  As long as we have some kind of wrapper functions or macros, we
 retain the *option* to add workarounds for compiler bugs or lack of
 compiler support on platforms, but it seems an awful lot simpler to me
 to start by assuming that the compiler will DTRT, and only roll
 our own if that proves to be necessary.  It's not like our hand-rolled
 implementations couldn't have bugs - which is different from s_lock.h,
 where we have evidence that 

Re: [HACKERS] better atomics - v0.5

2014-06-25 Thread Heikki Linnakangas

On 06/25/2014 11:36 PM, Andres Freund wrote:



- I completely loathe the file layout you've proposed in
src/include/storage.  If we're going to have separate #include files
for each architecture (and I'd rather we didn't go that route, because
it's messy and unlike what we have now), then shouldn't that stuff go
in src/include/port or a new directory src/include/port/atomics?
Polluting src/include/storage with a whole bunch of files called
atomics-whatever.h strikes me as completely ugly, and there's a lot of
duplicate code in there.

I don't mind moving it somewhere else and it's easy enough to change.


I think having a separate file for each architecture is nice. I totally 
agree that they don't belong in src/include/storage, though. s_lock.h 
has always been misplaced there, but we've let it be for historical 
reasons, but now that we're adding a dozen new files, it's time to move 
them out.


- Heikki




Re: [HACKERS] better atomics - v0.5

2014-06-25 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 On 06/25/2014 11:36 PM, Andres Freund wrote:
 
 - I completely loathe the file layout you've proposed in
 src/include/storage.  If we're going to have separate #include files
 for each architecture (and I'd rather we didn't go that route, because
 it's messy and unlike what we have now), then shouldn't that stuff go
 in src/include/port or a new directory src/include/port/atomics?
 Polluting src/include/storage with a whole bunch of files called
 atomics-whatever.h strikes me as completely ugly, and there's a lot of
 duplicate code in there.
 I don't mind moving it somewhere else and it's easy enough to change.
 
 I think having a separate file for each architecture is nice. I
 totally agree that they don't belong in src/include/storage, though.
 s_lock.h has always been misplaced there, but we've let it be for
 historical reasons, but now that we're adding a dozen new files,
 it's time to move them out.

We also have a bunch of files in src/backend/storage/lmgr, both current
and introduced by this patch, which would be good to relocate.  (Really,
how did lmgr/ end up in storage/ in the first place?)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-06-25 Thread Robert Haas
On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund and...@2ndquadrant.com wrote:
 Since it better be legal to manipulate an atomic variable while holding a
 spinlock we cannot simply use an arbitrary spinlock as backing for
 atomics. That'd possibly cause us to wait on ourselves or cause
 deadlocks.

I think that's going to fall afoul of Tom's previously-articulated no
loops inside spinlocks rule.  Most atomics, by nature, are
loop-until-it-works.

 How much would we lose if we supported compiler intrinsics on
 versions of GCC and MSVC that have it and left everything else to
 future patches?

 The discussion so far seemed pretty clear that we can't regress somewhat
 frequently used platforms. And dropping support for all but msvc and gcc
 would end up doing that. We're going to have to do the legwork for the
 most common platforms... Note that I didn't use assembler for those, but
 relied on intrinsics...

We can't *regress* somewhat-frequently used platforms, but that's
different from saying we've got to support *new* facilities on those
platforms.  Essentially the entire buildfarm is running either gcc,
some Microsoft compiler, or a compiler that supports the same
atomics intrinsics as gcc i.e. icc or clang.  Some of those compilers
may be too old to support the atomics intrinsics, and there's one Sun
Studio animal, but I don't know that we need to care about those
things in this patch...

...unless of course the atomics fallbacks in this patch are
sufficiently sucky that anyone who ends up using those is going to be
sad.  Then the bar is a lot higher.  But if that's the case then I
wonder if we're really on the right course here.

 I added the x86 inline assembler because a fair number of buildfarm
 animals use ancient gcc's and because I could easily test it. It's also
 more efficient for gcc < 4.6. I'm not wedded to keeping it.

Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
implementations aren't that good, that's a pretty good argument for
keeping our own implementations around.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] better atomics - v0.5

2014-06-25 Thread Robert Haas
On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I think having a separate file for each architecture is nice. I totally
 agree that they don't belong in src/include/storage, though. s_lock.h has
 always been misplaced there, but we've let it be for historical reasons, but
 now that we're adding a dozen new files, it's time to move them out.

I find the current organization pretty confusing, but maybe that could
be solved by better documentation of what's supposed to go in each
architecture or compiler-dependent file.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

