Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
I wrote:
> Ah.  I was not thinking of touching pg_atomic_read_u32/u64_impl,
> although now that you mention it, it's not clear to me why we
> couldn't simplify

> - return *(>value);
> + return ptr->value;

Just to check, I applied that change to pg_atomic_read_u32_impl and
pg_atomic_read_u64_impl, and recompiled.  I get bit-for-bit the
same backend executable.  Maybe it would have an effect on some
other compiler, but I doubt it, except perhaps at -O0.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
Andres Freund  writes:
> On 2017-09-06 15:25:20 -0400, Tom Lane wrote:
>> I think we can just use "old = ptr->value" to set up for the cmpxchg
>> loop in every generic.h function that uses such a loop.

> I think we might have been talking past each other - I thought you were
> talking about changing the pg_atomic_read_u64_impl implementation for
> external users.

Ah.  I was not thinking of touching pg_atomic_read_u32/u64_impl,
although now that you mention it, it's not clear to me why we
couldn't simplify

-   return *(>value);
+   return ptr->value;

AFAIK, the compiler is entitled to, and does, simplify away that
take-a-pointer-and-immediately-dereference-it dance.  If it did
not, a whole lot of standard array locutions would be much less
efficient than they should be.  What matters here is the volatile
qualifier, which we've already got.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Andres Freund
On 2017-09-06 15:25:20 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-09-06 15:12:13 -0400, Tom Lane wrote:
> >> It looks to me like two of the three implementations promise no such
> >> thing.
> 
> > They're volatile vars, so why not?
> 
> Yeah, but so are the caller's variables.  That is, in
> 
> pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_)
> {
>   uint64 old;
>   old = ptr->value;
> 
> ISTM that the compiler is required to actually fetch ptr->value, not
> rely on some previous read of it.  I do not think that (the first
> version of) pg_atomic_read_u64_impl is adding any guarantee that wasn't
> there already.
> 
> >> Even if they somehow do, it hardly matters given that the cmpxchg loop
> >> would be self-correcting.
> 
> > Well, in this one instance maybe, hardly in others.
> 
> All the functions involved use nigh-identical cmpxchg loops.
> 
> > What are you suggesting as an alternative?
> 
> I think we can just use "old = ptr->value" to set up for the cmpxchg
> loop in every generic.h function that uses such a loop.

I think we might have been talking past each other - I thought you were
talking about changing the pg_atomic_read_u64_impl implementation for
external users.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
Andres Freund  writes:
> On 2017-09-06 15:12:13 -0400, Tom Lane wrote:
>> It looks to me like two of the three implementations promise no such
>> thing.

> They're volatile vars, so why not?

Yeah, but so are the caller's variables.  That is, in

pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_)
{
uint64 old;
old = ptr->value;

ISTM that the compiler is required to actually fetch ptr->value, not
rely on some previous read of it.  I do not think that (the first
version of) pg_atomic_read_u64_impl is adding any guarantee that wasn't
there already.

>> Even if they somehow do, it hardly matters given that the cmpxchg loop
>> would be self-correcting.

> Well, in this one instance maybe, hardly in others.

All the functions involved use nigh-identical cmpxchg loops.

> What are you suggesting as an alternative?

I think we can just use "old = ptr->value" to set up for the cmpxchg
loop in every generic.h function that uses such a loop.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Andres Freund
On 2017-09-06 15:12:13 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-09-06 14:31:26 -0400, Tom Lane wrote:
> >> However, if that's the reasoning, why don't we make all of these
> >> use simple reads?  It seems unlikely that a locked read is free.
> 
> > We don't really use locked reads? All the _atomic_ wrapper forces is an
> > actual read from memory rather than a register.
> 
> It looks to me like two of the three implementations promise no such
> thing.

They're volatile vars, so why not?


> Even if they somehow do, it hardly matters given that the cmpxchg loop
> would be self-correcting.

Well, in this one instance maybe, hardly in others.


> Mostly, though, I'm looking at the fallback pg_atomic_read_u64_impl
> implementation (with a CAS), which seems far more expensive than can
> be justified for this.

What are you suggesting as an alternative?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
Andres Freund  writes:
> On 2017-09-06 14:31:26 -0400, Tom Lane wrote:
>> However, if that's the reasoning, why don't we make all of these
>> use simple reads?  It seems unlikely that a locked read is free.

> We don't really use locked reads? All the _atomic_ wrapper forces is an
> actual read from memory rather than a register.

It looks to me like two of the three implementations promise no such
thing.  Even if they somehow do, it hardly matters given that the cmpxchg
loop would be self-correcting.  Mostly, though, I'm looking at the
fallback pg_atomic_read_u64_impl implementation (with a CAS), which
seems far more expensive than can be justified for this.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Andres Freund
Hi,

On 2017-09-06 14:31:26 -0400, Tom Lane wrote:
> I wrote:
> > Anyway, I don't have a big objection to applying this.  My concern
> > is more that we need to be taking a harder look at other parts of
> > the atomics infrastructure, because tweaks there are likely to buy
> > much more.
> 
> I went ahead and pushed these patches, adding __sync_fetch_and_sub
> since gcc seems to provide that on the same footing as these other
> functions.
> 
> Looking at these generic functions a bit closer, I notice that
> most of them are coded like
> 
>   old = pg_atomic_read_u32_impl(ptr);
>   while (!pg_atomic_compare_exchange_u32_impl(ptr, , old | or_))
>   /* skip */;
> 
> but there's an exception: pg_atomic_exchange_u64_impl just does
> 
>   old = ptr->value;
>   while (!pg_atomic_compare_exchange_u64_impl(ptr, , xchg_))
>   /* skip */;
> 
> That's interesting.  Why not pg_atomic_read_u64_impl there?

Because our platforms don't generally guaranatee that 64bit reads:

/*
 * 64 bit reads aren't safe on all platforms. In the generic
 * implementation implement them as a compare/exchange with 0. That'll
 * fail or succeed, but always return the old value. Possible might 
store
 * a 0, but only if the prev. value also was a 0 - i.e. harmless.
 */
pg_atomic_compare_exchange_u64_impl(ptr, , 0);

so I *guess* I decided doing an unnecessary cmpxchg wasn't useful. But
since then we've added an optimization for platforms where we know 64bit
reads have single copy atomicity. So we could change that now.


> I think that the simple read is actually okay as it stands: if we
> are unlucky enough to get a torn read, the first compare/exchange
> will fail to compare equal, and it will replace "old" with a valid
> atomically-read value, and then the next loop iteration has a chance
> to succeed.  Therefore there's no need to pay the extra cost of a
> locked read instead of an unlocked one.
> 
> However, if that's the reasoning, why don't we make all of these
> use simple reads?  It seems unlikely that a locked read is free.

We don't really use locked reads? All the _atomic_ wrapper forces is an
actual read from memory rather than a register.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
I wrote:
> Anyway, I don't have a big objection to applying this.  My concern
> is more that we need to be taking a harder look at other parts of
> the atomics infrastructure, because tweaks there are likely to buy
> much more.

I went ahead and pushed these patches, adding __sync_fetch_and_sub
since gcc seems to provide that on the same footing as these other
functions.

Looking at these generic functions a bit closer, I notice that
most of them are coded like

old = pg_atomic_read_u32_impl(ptr);
while (!pg_atomic_compare_exchange_u32_impl(ptr, , old | or_))
/* skip */;

but there's an exception: pg_atomic_exchange_u64_impl just does

old = ptr->value;
while (!pg_atomic_compare_exchange_u64_impl(ptr, , xchg_))
/* skip */;

That's interesting.  Why not pg_atomic_read_u64_impl there?

I think that the simple read is actually okay as it stands: if we
are unlucky enough to get a torn read, the first compare/exchange
will fail to compare equal, and it will replace "old" with a valid
atomically-read value, and then the next loop iteration has a chance
to succeed.  Therefore there's no need to pay the extra cost of a
locked read instead of an unlocked one.

However, if that's the reasoning, why don't we make all of these
use simple reads?  It seems unlikely that a locked read is free.

If there's actually a reason for pg_atomic_exchange_u64_impl to be
different from the rest, it needs to have a comment explaining why.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Jesper Pedersen

Hi Jeff,

On 09/05/2017 03:47 PM, Jeff Janes wrote:

I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using
both logged and unlogged tables. Also ran an internal benchmark which
didn't show anything either.



What scale factor and client count? How many cores per socket?  It looks
like Sokolov was just starting to see gains at 200 clients on 72 cores,
using -N transaction.


I have done a run with scale factor 300, and another with 3000 on a 
2S/28C/56T/256Gb w/ 2 x RAID10 SSD machine; up to 200 clients.


I would consider the runs as "noise" as I'm seeing +-1% for all client 
counts, so nothing like Yura is seeing in [1] for the higher client counts.


I did a run with -N too using scale factor 300, using the settings in 
[1], but with same result (+-1%).


[1] 
https://www.postgresql.org/message-id/d62d7d9d473d07e172d799d5a57e7...@postgrespro.ru


Best regards,
 Jesper


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
Robert Haas  writes:
> It's not a question of whether the return value is used, but of
> whether the updated value of *old is used.

Right, but if we re-read "old" in the loop, and if the primitive
doesn't return "old" (or does, but call site ignores it) then in
principle the CAS might be strength-reduced a bit.  Whether that
can win enough to be better than removing the unlocked read,
I dunno.

In the case at hand, looking at a loop like

while (count-- > 0)
{
(void) pg_atomic_fetch_or_u32(myptr, 1);
}

with our HEAD code and RHEL6's gcc I get this for the inner loop:

.L9:
movl(%rdx), %eax
movl%eax, %ecx
orl $1, %ecx
lock
cmpxchgl%ecx,(%rdx) 
setz%cl 
testb   %cl, %cl
je  .L9
subq$1, %rbx
testq   %rbx, %rbx
jg  .L9

Applying the proposed generic.h patch, it becomes

.L10:
movl(%rsi), %eax
.L9:
movl%eax, %ecx
orl $1, %ecx
lock
cmpxchgl%ecx,(%rdx) 
setz%cl 
testb   %cl, %cl
je  .L9
subq$1, %rbx
testq   %rbx, %rbx
jg  .L10

Note that in both cases the cmpxchgl is coming out of the asm construct in
pg_atomic_compare_exchange_u32_impl from atomics/arch-x86.h, so that even
if a simpler assembly instruction were possible given that we don't need
%eax to be updated, there's no chance of that actually happening.  This
gets back to the point I made in the other thread that maybe the
arch-x86.h asm sequences are not an improvement over the gcc intrinsics.

The code is the same if the loop is modified to use the result of
pg_atomic_fetch_or_u32, so I won't bother showing that.

Adding the proposed generic-gcc.h patch, the loop reduces to

.L11:
lock orl$1, (%rax)
subq$1, %rbx
testq   %rbx, %rbx
jg  .L11

or if we make the loop do
junk += pg_atomic_fetch_or_u32(myptr, 1);
then we get

.L13:
movl(%rsi), %eax
.L10:
movl%eax, %edi
movl%eax, %ecx
orl $1, %ecx
lock cmpxchgl   %ecx, (%rdx)
jne .L10
addl%edi, %r8d
subq$1, %rbx
testq   %rbx, %rbx
jg  .L13

So that last is slightly better than the generic.h + asm CAS version,
perhaps not meaningfully so --- but it's definitely better when
the compiler can see the result isn't used.

In short then, given the facts on the ground here, in particular the
asm-based CAS primitive at the heart of these loops, it's clear that
taking the read out of the loop can't hurt.  But that should be read
as a rather narrow conclusion.  With a different compiler and/or a
different version of pg_atomic_compare_exchange_u32_impl, maybe the
answer would be different.  And of course it's moot once the
generic-gcc.h patch is applied.

Anyway, I don't have a big objection to applying this.  My concern
is more that we need to be taking a harder look at other parts of
the atomics infrastructure, because tweaks there are likely to buy
much more.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Robert Haas
On Wed, Sep 6, 2017 at 12:01 PM, Tom Lane  wrote:
>> This seems like a pretty sound argument to me.  I think Tom's probably
>> right that the changes in generic-gcc.h are the important ones, but
>> I'm not sure that's an argument against patching generics.h.  Given
>> that pg_atomic_compare_exchange_u32_impl is defined to update *old
>> there seems to be no reason to call pg_atomic_read_u32_impl every time
>> through the loop.
>
> Probably not.  I'm not quite 100% convinced of that, because of my
> observation that gcc is capable of generating different and better
> code for some of these primitives if it can prove that the return
> value is not needed.  It's not clear that that could apply in any
> of these uses of pg_atomic_compare_exchange_u32_impl, though.
> In any case, by my own argument, it shouldn't matter, because if
> any of these are really performance-critical then we should be
> looking for better ways.

It's not a question of whether the return value is used, but of
whether the updated value of *old is used.

Unless I'm confused.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
Robert Haas  writes:
> On Wed, Sep 6, 2017 at 9:42 AM, Sokolov Yura
>  wrote:
>> But I think, generic version still should be "fixed".
>> If generic version is not reached on any platform, then why it is kept?
>> If it is reached somewhere, then it should be improved.

> This seems like a pretty sound argument to me.  I think Tom's probably
> right that the changes in generic-gcc.h are the important ones, but
> I'm not sure that's an argument against patching generics.h.  Given
> that pg_atomic_compare_exchange_u32_impl is defined to update *old
> there seems to be no reason to call pg_atomic_read_u32_impl every time
> through the loop.

Probably not.  I'm not quite 100% convinced of that, because of my
observation that gcc is capable of generating different and better
code for some of these primitives if it can prove that the return
value is not needed.  It's not clear that that could apply in any
of these uses of pg_atomic_compare_exchange_u32_impl, though.
In any case, by my own argument, it shouldn't matter, because if
any of these are really performance-critical then we should be
looking for better ways.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Robert Haas
On Wed, Sep 6, 2017 at 9:42 AM, Sokolov Yura
 wrote:
> Yes, you're right.
>
> But I think, generic version still should be "fixed".
> If generic version is not reached on any platform, then why it is kept?
> If it is reached somewhere, then it should be improved.

This seems like a pretty sound argument to me.  I think Tom's probably
right that the changes in generic-gcc.h are the important ones, but
I'm not sure that's an argument against patching generics.h.  Given
that pg_atomic_compare_exchange_u32_impl is defined to update *old
there seems to be no reason to call pg_atomic_read_u32_impl every time
through the loop.  Whether doing so hurts performance in practice is
likely to depend on a bunch of stuff, but we don't normally refuse to
remove unnecessary code on the grounds that it will rarely be reached.
Given that CPUs are weird, it's possible that there is some system
where throwing an extra read of a value we already have into the loop
works out to a win, but in the absence of evidence I'm reluctant to
presume it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Sokolov Yura

On 2017-09-06 16:36, Tom Lane wrote:

Sokolov Yura  writes:

On 2017-09-06 15:56, Tom Lane wrote:

The point I'm trying to make is that if tweaking generic.h improves
performance then it's an indicator of missed cases in the 
less-generic
atomics code, and the latter is where our attention should be 
focused.
I think basically all of the improvement Sokolov got was from 
upgrading

the coverage of generic-gcc.h.



Not exactly. I've checked, that new version of generic
pg_atomic_fetch_or_u32
loop also gives improvement.


But once you put in the generic-gcc version, that's not reached 
anymore.




Yes, you're right.

But I think, generic version still should be "fixed".
If generic version is not reached on any platform, then why it is kept?
If it is reached somewhere, then it should be improved.

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
Sokolov Yura  writes:
> On 2017-09-06 15:56, Tom Lane wrote:
>> The point I'm trying to make is that if tweaking generic.h improves
>> performance then it's an indicator of missed cases in the less-generic
>> atomics code, and the latter is where our attention should be focused.
>> I think basically all of the improvement Sokolov got was from upgrading
>> the coverage of generic-gcc.h.

> Not exactly. I've checked, that new version of generic 
> pg_atomic_fetch_or_u32
> loop also gives improvement.

But once you put in the generic-gcc version, that's not reached anymore.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Sokolov Yura

On 2017-09-06 15:56, Tom Lane wrote:

Simon Riggs  writes:

On 5 September 2017 at 21:23, Tom Lane  wrote:
Moreover, it matters which primitive you're testing, on which 
platform,

with which compiler, because we have a couple of layers of atomic ops
implementations.



If there is no gain on 2-socket, at least there is no loss either.


The point I'm trying to make is that if tweaking generic.h improves
performance then it's an indicator of missed cases in the less-generic
atomics code, and the latter is where our attention should be focused.
I think basically all of the improvement Sokolov got was from upgrading
the coverage of generic-gcc.h.

regards, tom lane


Not exactly. I've checked, that new version of generic 
pg_atomic_fetch_or_u32

loop also gives improvement. Without that check I'd not suggest to fix
generic atomic functions. Of course, gcc intrinsic gives more gain.

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Tom Lane
Simon Riggs  writes:
> On 5 September 2017 at 21:23, Tom Lane  wrote:
>> Moreover, it matters which primitive you're testing, on which platform,
>> with which compiler, because we have a couple of layers of atomic ops
>> implementations.

> If there is no gain on 2-socket, at least there is no loss either.

The point I'm trying to make is that if tweaking generic.h improves
performance then it's an indicator of missed cases in the less-generic
atomics code, and the latter is where our attention should be focused.
I think basically all of the improvement Sokolov got was from upgrading
the coverage of generic-gcc.h.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Simon Riggs
On 5 September 2017 at 21:23, Tom Lane  wrote:
> Jeff Janes  writes:
>> What scale factor and client count? How many cores per socket?  It looks
>> like Sokolov was just starting to see gains at 200 clients on 72 cores,
>> using -N transaction.

...
> Moreover, it matters which primitive you're testing, on which platform,
> with which compiler, because we have a couple of layers of atomic ops
> implementations.
...

I think Sokolov was aiming at 4-socket servers specifically, rather
than as a general performance gain.

If there is no gain on 2-socket, at least there is no loss either.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Sokolov Yura

On 2017-09-06 14:54, Sokolov Yura wrote:

On 2017-09-06 07:23, Tom Lane wrote:

Jeff Janes  writes:
What scale factor and client count? How many cores per socket?  It 
looks
like Sokolov was just starting to see gains at 200 clients on 72 
cores,

using -N transaction.

This means that Sokolov's proposed changes in atomics/generic.h
ought to be uninteresting for performance on any platform we care 
about --- at

least for pg_atomic_fetch_add_u32().


May I cite you this way:


I apologize for tone of my previous message.
My tongue is my enemy.

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-06 Thread Sokolov Yura

On 2017-09-06 07:23, Tom Lane wrote:

Jeff Janes  writes:
What scale factor and client count? How many cores per socket?  It 
looks
like Sokolov was just starting to see gains at 200 clients on 72 
cores,

using -N transaction.

This means that Sokolov's proposed changes in atomics/generic.h
ought to be uninteresting for performance on any platform we care about 
--- at

least for pg_atomic_fetch_add_u32().


May I cite you this way:
"Tom Lane says PostgreSQL core team doesn't care for 4 socket 72 core
Intel Xeon servers"
???

To be honestly, I didn't measure exact impact of changing 
pg_atomic_fetch_add_u32.

I found issue with pg_atomic_fetch_or_u32, cause it is highly contended
both in LWLock, and in buffer page state management. I found changing
loop for generic pg_atomic_fetch_or_u32 already gives improvement.
And I changed loop for other generic atomic functions to make
all functions uniform.

It is bad style guide not to changing worse (but correct) code into
better (and also correct) just because you can't measure difference
(in my opinion. Perhaps i am mistaken).

(and yes: gcc intrinsic gives more improvement).

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-05 Thread Tom Lane
Jeff Janes  writes:
> What scale factor and client count? How many cores per socket?  It looks
> like Sokolov was just starting to see gains at 200 clients on 72 cores,
> using -N transaction.

I spent some time poking at this with the test scaffolding I showed at
https://www.postgresql.org/message-id/17694.1504665058%40sss.pgh.pa.us

AFAICT, there are not huge differences between different coding methods
even for two processes in direct competition in the tightest possible
atomic-ops loops.  So if you're testing SQL operations, you're definitely
not going to see anything without a whole lot of concurrent processes.

Moreover, it matters which primitive you're testing, on which platform,
with which compiler, because we have a couple of layers of atomic ops
implementations.

I started out testing pg_atomic_fetch_add_u32(), as shown in the above-
mentioned message.  On x86_x64 with gcc, that is implemented by the
code for it in src/include/port/atomics/arch-x86.h.  If you dike out
that support, it falls back to the version in atomics/generic-gcc.h,
which seems no worse and possibly better.  Only if you also dike out
generic-gcc.h do you get to the version in atomics/generic.h that this
patch wants to change.  However, at that point you're also down to a
spinlock-based implementation of the underlying pg_atomic_read_u32_impl
and pg_atomic_compare_exchange_u32_impl, which means that performance
is going to be less than great anyway.  No platform that we consider
well-supported ought to be hitting the spinlock paths.  This means
that Sokolov's proposed changes in atomics/generic.h ought to be
uninteresting for performance on any platform we care about --- at
least for pg_atomic_fetch_add_u32().

However, Sokolov also proposes adding gcc-intrinsic support for
pg_atomic_fetch_and_xxx and pg_atomic_fetch_or_xxx.  This is a
different kettle of fish.  I repeated the experiments I'd done
for pg_atomic_fetch_add_u32(), per the above message, using the "or"
primitive, and found largely the same results as for "add": it's
slightly better under contention than the generic code, and
significantly better if the result value of the atomic op isn't needed.

So I think we should definitely take the part of the patch that
changes generic-gcc.h --- and we should check to see if we're missing
out on any other gcc primitives we should be using there.

I'm less excited about the proposed changes in generic.h.  There's
nothing wrong with them in principle, but they mostly shouldn't
make any performance difference on any platform we care about,
because we shouldn't get to that code in the first place on any
platform we care about.  If we are getting to that code for any
specific primitive, then either there's a gap in the platform-specific
or compiler-specific support, or it's debatable that we ought to
consider that primitive to be very primitive.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-05 Thread Jesper Pedersen

Hi,

On 05/25/2017 11:12 AM, Sokolov Yura wrote:

I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).

Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
   condition for loop exit, so it is clearly just a loop.

Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.



I have tested this patch on a 2-socket machine, but don't see any 
performance change in the various runs. However, there is no regression 
either in all cases.


As such, I have marked the entry "Ready for Committer".

Remember to add a version postfix to your patches such that is easy to 
identify which is the latest version.


Best regards,
 Jesper


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-05 Thread Jeff Janes
On Tue, Sep 5, 2017 at 11:39 AM, Jesper Pedersen  wrote:

> On 09/05/2017 02:24 PM, Tom Lane wrote:
>
>> Jesper Pedersen  writes:
>>
>>> I have tested this patch on a 2-socket machine, but don't see any
>>> performance change in the various runs. However, there is no regression
>>> either in all cases.
>>>
>>
>> Hm, so if we can't demonstrate a performance win, it's hard to justify
>> risking touching this code.  What test case(s) did you use?
>>
>>
> I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using
> both logged and unlogged tables. Also ran an internal benchmark which
> didn't show anything either.
>

What scale factor and client count? How many cores per socket?  It looks
like Sokolov was just starting to see gains at 200 clients on 72 cores,
using -N transaction.

Cheers,

Jeff


Re: [HACKERS] Fix performance of generic atomics

2017-09-05 Thread Tom Lane
Jesper Pedersen  writes:
> On 09/05/2017 02:24 PM, Tom Lane wrote:
>> Hm, so if we can't demonstrate a performance win, it's hard to justify
>> risking touching this code.  What test case(s) did you use?

> I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using 
> both logged and unlogged tables. Also ran an internal benchmark which 
> didn't show anything either.

That may just mean that pgbench isn't stressing any atomic ops very
hard (at least in the default scenario).

I'm tempted to write a little C function that just hits the relevant
atomic ops in a tight loop, and see how long it takes to do a few
million iterations.  That would be erring in the opposite direction,
of overstating the importance of atomic ops to real-world scenarios
--- but if we didn't get any win that way, then it's surely in the noise.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-05 Thread Jesper Pedersen

On 09/05/2017 02:24 PM, Tom Lane wrote:

Jesper Pedersen  writes:

I have tested this patch on a 2-socket machine, but don't see any
performance change in the various runs. However, there is no regression
either in all cases.


Hm, so if we can't demonstrate a performance win, it's hard to justify
risking touching this code.  What test case(s) did you use?



I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using 
both logged and unlogged tables. Also ran an internal benchmark which 
didn't show anything either.


Setting the entry back to "Needs Review" for additional feedback from 
others.


Best regards,
 Jesper


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-05 Thread Tom Lane
Jesper Pedersen  writes:
> I have tested this patch on a 2-socket machine, but don't see any 
> performance change in the various runs. However, there is no regression 
> either in all cases.

Hm, so if we can't demonstrate a performance win, it's hard to justify
risking touching this code.  What test case(s) did you use?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-09-05 Thread Jesper Pedersen

Hi,

On 05/25/2017 11:12 AM, Sokolov Yura wrote:

I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).

Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
   condition for loop exit, so it is clearly just a loop.

Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.



I have tested this patch on a 2-socket machine, but don't see any 
performance change in the various runs. However, there is no regression 
either in all cases.


As such, I have marked the entry "Ready for Committer".

Remember to add a version postfix to your patches such that is easy to 
identify which is the latest version.


Best regards,
 Jesper



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-06-04 Thread Sokolov Yura

Good day, every one.

I'm just posting benchmark numbers for atomics patch.

Hardware: 4 socket 72 core (144HT) x86_64 Centos 7.1
postgresql.conf tuning:
shared_buffers = 32GB
fsync = on
synchronous_commit = on
full_page_writes = off
wal_buffers = 16MB
wal_writer_flush_after = 16MB
commit_delay = 2
max_wal_size = 16GB

Results:
pgbench -i -s 300 + pgbench --skip-some-updates

Clients |  master |  atomics
+=+===
   50   |  53.1k  |  53.2k
  100   | 101.2k  | 103.5k
  150   | 119.1k  | 121.9k
  200   | 128.7k  | 132.5k
  252   | 120.2k  | 130.0k
  304   | 100.8k  | 115.9k
  356   |  78.1k  |  90.1k
  395   |  70.2k  |  79.0k
  434   |  61.6k  |  70.7k

Also graph with more points attached.

On 2017-05-25 18:12, Sokolov Yura wrote:

Hello, Tom.

I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).

Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
  condition for loop exit, so it is clearly just a loop.

Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.

Tom Lane wrote 2017-05-25 17:39:

Sokolov Yura  writes:
@@ -382,12 +358,8 @@ static inline uint64
 pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 
and_)

 {
uint64 old;
-   while (true)
-   {
-   old = pg_atomic_read_u64_impl(ptr);
-   if (pg_atomic_compare_exchange_u64_impl(ptr, , old & and_))
-   break;
-   }
+   old = pg_atomic_read_u64_impl(ptr);
+   while (!pg_atomic_compare_exchange_u64_impl(ptr, , old & and_));
return old;
 }
 #endif

FWIW, I do not think that writing the loops like that is good style.
It looks like a typo and will confuse readers.  You could perhaps
write the same code with better formatting, eg

while (!pg_atomic_compare_exchange_u64_impl(ptr, , old & and_))
/* skip */ ;

but why not leave the formulation with while(true) and a break alone?

(I take no position on whether moving the read of "old" outside the
loop is a valid optimization.)

regards, tom lane


With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-05-25 Thread Sokolov Yura

Hello, Tom.

I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).

Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
  condition for loop exit, so it is clearly just a loop.

Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.

Tom Lane wrote 2017-05-25 17:39:

Sokolov Yura  writes:
@@ -382,12 +358,8 @@ static inline uint64
 pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 
and_)

 {
uint64 old;
-   while (true)
-   {
-   old = pg_atomic_read_u64_impl(ptr);
-   if (pg_atomic_compare_exchange_u64_impl(ptr, , old & and_))
-   break;
-   }
+   old = pg_atomic_read_u64_impl(ptr);
+   while (!pg_atomic_compare_exchange_u64_impl(ptr, , old & and_));
return old;
 }
 #endif

FWIW, I do not think that writing the loops like that is good style.
It looks like a typo and will confuse readers.  You could perhaps
write the same code with better formatting, eg

while (!pg_atomic_compare_exchange_u64_impl(ptr, , old & and_))
/* skip */ ;

but why not leave the formulation with while(true) and a break alone?

(I take no position on whether moving the read of "old" outside the
loop is a valid optimization.)

regards, tom lane


With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres CompanyFrom d377e36da652ade715b868a6fc7164736f38d1df Mon Sep 17 00:00:00 2001
From: Sokolov Yura 
Date: Thu, 25 May 2017 14:54:45 +0300
Subject: [PATCH] Fix performance of Atomics generic implementation

pg_atomic_compare_exchange_*_impl stores old value into `old` variable,
so there is no need to reread value in loop body.

Also add gcc specific __sync_fetch_and_(or|and), cause looks like
compiler may optimize code around intrinsic better than around assembler,
although intrinsic is compiled to almost same CAS loop.
---
 src/include/port/atomics/generic-gcc.h | 36 +
 src/include/port/atomics/generic.h | 72 --
 2 files changed, 60 insertions(+), 48 deletions(-)

diff --git a/src/include/port/atomics/generic-gcc.h b/src/include/port/atomics/generic-gcc.h
index 7efc0861e7..79ada94047 100644
--- a/src/include/port/atomics/generic-gcc.h
+++ b/src/include/port/atomics/generic-gcc.h
@@ -185,6 +185,24 @@ pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAVE_ATOMIC_FETCH_AND_U32
+static inline uint32
+pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	return __sync_fetch_and_and(>value, and_);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAVE_ATOMIC_FETCH_OR_U32
+static inline uint32
+pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	return __sync_fetch_and_or(>value, or_);
+}
+#endif
+
 
 #if !defined(PG_DISABLE_64_BIT_ATOMICS)
 
@@ -223,6 +241,24 @@ pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAVE_ATOMIC_FETCH_AND_U64
+static inline uint64
+pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	return __sync_fetch_and_and(>value, and_);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAVE_ATOMIC_FETCH_OR_U64
+static inline uint64
+pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	return __sync_fetch_and_or(>value, or_);
+}
+#endif
+
 #endif /* !defined(PG_DISABLE_64_BIT_ATOMICS) */
 
 #endif /* defined(HAVE_ATOMICS) */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index 424543604a..75ffaf6e87 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -170,12 +170,9 @@ static inline uint32
 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
 {
 	uint32 old;
-	while (true)
-	{
-		old = pg_atomic_read_u32_impl(ptr);
-		if (pg_atomic_compare_exchange_u32_impl(ptr, , xchg_))
-			break;
-	}
+	old = pg_atomic_read_u32_impl(ptr);
+	while (!pg_atomic_compare_exchange_u32_impl(ptr, , xchg_))
+		/* skip */;
 	return old;
 }
 #endif
@@ -186,12 +183,9 @@ static inline uint32
 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 {
 	uint32 old;
-	while (true)
-	{
-		old = pg_atomic_read_u32_impl(ptr);
-		if (pg_atomic_compare_exchange_u32_impl(ptr, , old + add_))
-			break;
-	}
+	old = 

Re: [HACKERS] Fix performance of generic atomics

2017-05-25 Thread Aleksander Alekseev
Hi Yura,

> Attached patch contains patch for all generic atomic
> functions, and also __sync_fetch_and_(or|and) for gcc, cause
> I believe GCC optimize code around intrinsic better than
> around inline assembler.
> (final performance is around 86000tps, but difference between
> 83000tps and 86000tps is not so obvious in NUMA system).

I don't see any patch in my email client or pgsql-hackers@ archive. I
would also recommend to add your patch to the nearest commitfest [1].

[1] https://commitfest.postgresql.org/

-- 
Best regards,
Aleksander Alekseev


signature.asc
Description: PGP signature


Re: [HACKERS] Fix performance of generic atomics

2017-05-25 Thread Tom Lane
Sokolov Yura  writes:
@@ -382,12 +358,8 @@ static inline uint64
 pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
 {
uint64 old;
-   while (true)
-   {
-   old = pg_atomic_read_u64_impl(ptr);
-   if (pg_atomic_compare_exchange_u64_impl(ptr, , old & and_))
-   break;
-   }
+   old = pg_atomic_read_u64_impl(ptr);
+   while (!pg_atomic_compare_exchange_u64_impl(ptr, , old & and_));
return old;
 }
 #endif

FWIW, I do not think that writing the loops like that is good style.
It looks like a typo and will confuse readers.  You could perhaps
write the same code with better formatting, eg

while (!pg_atomic_compare_exchange_u64_impl(ptr, , old & and_))
/* skip */ ;

but why not leave the formulation with while(true) and a break alone?

(I take no position on whether moving the read of "old" outside the
loop is a valid optimization.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fix performance of generic atomics

2017-05-25 Thread Sokolov Yura

Good day, everyone.

I've been played with pgbench on huge machine.
(72 cores, 56 for postgresql, enough memory to fit base
both into shared_buffers and file cache)
(pgbench scale 500, unlogged tables, fsync=off,
synchronous commit=off, wal_writer_flush_after=0).

With 200 clients performance is around 76000tps and main
bottleneck in this dumb test is LWLockWaitListLock.

I added gcc specific implementation for pg_atomic_fetch_or_u32_impl
(ie using __sync_fetch_and_or) and performance became 83000tps.

It were a bit strange at a first look, cause __sync_fetch_and_or
compiles to almost same CAS loop.

Looking closely, I noticed that intrinsic performs doesn't do
read in the loop body, but at loop initialization. It is correct
behavior cause `lock cmpxchg` instruction stores old value in EAX
register.

It is expected behavior, and pg_compare_and_exchange_*_impl does
the same in all implementations. So there is no need to re-read
value in the loop body:

Example diff for pg_atomic_exchange_u32_impl:

 static inline uint32
 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 
xchg_)

 {
uint32 old;
+   old = pg_atomic_read_u32_impl(ptr);
while (true)
{
-   old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, , xchg_))
break;
}
return old;
 }

After applying this change to all generic atomic functions
(and for pg_atomic_fetch_or_u32_impl ), performance became
equal to __sync_fetch_and_or intrinsic.

Attached patch contains patch for all generic atomic
functions, and also __sync_fetch_and_(or|and) for gcc, cause
I believe GCC optimize code around intrinsic better than
around inline assembler.
(final performance is around 86000tps, but difference between
83000tps and 86000tps is not so obvious in NUMA system).

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance of generic atomics

2017-05-25 Thread Sokolov Yura

A bit cleaner version of a patch.

Sokolov Yura писал 2017-05-25 15:22:

Good day, everyone.

I've been played with pgbench on huge machine.
(72 cores, 56 for postgresql, enough memory to fit base
both into shared_buffers and file cache)
(pgbench scale 500, unlogged tables, fsync=off,
synchronous commit=off, wal_writer_flush_after=0).

With 200 clients performance is around 76000tps and main
bottleneck in this dumb test is LWLockWaitListLock.

I added gcc specific implementation for pg_atomic_fetch_or_u32_impl
(ie using __sync_fetch_and_or) and performance became 83000tps.

It were a bit strange at a first look, cause __sync_fetch_and_or
compiles to almost same CAS loop.

Looking closely, I noticed that intrinsic performs doesn't do
read in the loop body, but at loop initialization. It is correct
behavior cause `lock cmpxchg` instruction stores old value in EAX
register.

It is expected behavior, and pg_compare_and_exchange_*_impl does
the same in all implementations. So there is no need to re-read
value in the loop body:

Example diff for pg_atomic_exchange_u32_impl:

 static inline uint32
 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 
xchg_)

 {
uint32 old;
+   old = pg_atomic_read_u32_impl(ptr);
while (true)
{
-   old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, , xchg_))
break;
}
return old;
 }

After applying this change to all generic atomic functions
(and for pg_atomic_fetch_or_u32_impl ), performance became
equal to __sync_fetch_and_or intrinsic.

Attached patch contains patch for all generic atomic
functions, and also __sync_fetch_and_(or|and) for gcc, cause
I believe GCC optimize code around intrinsic better than
around inline assembler.
(final performance is around 86000tps, but difference between
83000tps and 86000tps is not so obvious in NUMA system).

With regards,


--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres CompanyFrom 2c58a32ca89867e7d244e7037a38aa9ccc2b92c2 Mon Sep 17 00:00:00 2001
From: Sokolov Yura 
Date: Thu, 25 May 2017 14:54:45 +0300
Subject: [PATCH] Fix performance of Atomics generic implementation

pg_atomic_compare_exchange_*_impl modifies value of `old` pointer,
so there is no need to reread value in loop body.

Also add gcc specific __sync_fetch_and_(or|and), cause looks like
compiler may optimize code around intrisic better than around assembler,
although intrisic is compiled to almost same CAS loop.
---
 src/include/port/atomics/generic-gcc.h | 36 +++
 src/include/port/atomics/generic.h | 64 +-
 2 files changed, 52 insertions(+), 48 deletions(-)

diff --git a/src/include/port/atomics/generic-gcc.h b/src/include/port/atomics/generic-gcc.h
index 7efc0861e7..79ada94047 100644
--- a/src/include/port/atomics/generic-gcc.h
+++ b/src/include/port/atomics/generic-gcc.h
@@ -185,6 +185,24 @@ pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAVE_ATOMIC_FETCH_AND_U32
+static inline uint32
+pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	return __sync_fetch_and_and(>value, and_);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAVE_ATOMIC_FETCH_OR_U32
+static inline uint32
+pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	return __sync_fetch_and_or(>value, or_);
+}
+#endif
+
 
 #if !defined(PG_DISABLE_64_BIT_ATOMICS)
 
@@ -223,6 +241,24 @@ pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAVE_ATOMIC_FETCH_AND_U64
+static inline uint64
+pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	return __sync_fetch_and_and(>value, and_);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAVE_ATOMIC_FETCH_OR_U64
+static inline uint64
+pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	return __sync_fetch_and_or(>value, or_);
+}
+#endif
+
 #endif /* !defined(PG_DISABLE_64_BIT_ATOMICS) */
 
 #endif /* defined(HAVE_ATOMICS) */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index 424543604a..6d2671beab 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -170,12 +170,8 @@ static inline uint32
 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
 {
 	uint32 old;
-	while (true)
-	{
-		old = pg_atomic_read_u32_impl(ptr);
-		if (pg_atomic_compare_exchange_u32_impl(ptr, , xchg_))
-			break;
-	}
+	old = pg_atomic_read_u32_impl(ptr);
+	while