Re: [HACKERS] Fix performance of generic atomics
I wrote: > Ah. I was not thinking of touching pg_atomic_read_u32/u64_impl, > although now that you mention it, it's not clear to me why we > couldn't simplify > - return *(&ptr->value); > + return ptr->value; Just to check, I applied that change to pg_atomic_read_u32_impl and pg_atomic_read_u64_impl, and recompiled. I get bit-for-bit the same backend executable. Maybe it would have an effect on some other compiler, but I doubt it, except perhaps at -O0. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Andres Freund writes: > On 2017-09-06 15:25:20 -0400, Tom Lane wrote: >> I think we can just use "old = ptr->value" to set up for the cmpxchg >> loop in every generic.h function that uses such a loop. > I think we might have been talking past each other - I thought you were > talking about changing the pg_atomic_read_u64_impl implementation for > external users. Ah. I was not thinking of touching pg_atomic_read_u32/u64_impl, although now that you mention it, it's not clear to me why we couldn't simplify - return *(&ptr->value); + return ptr->value; AFAIK, the compiler is entitled to, and does, simplify away that take-a-pointer-and-immediately-dereference-it dance. If it did not, a whole lot of standard array locutions would be much less efficient than they should be. What matters here is the volatile qualifier, which we've already got. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On 2017-09-06 15:25:20 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2017-09-06 15:12:13 -0400, Tom Lane wrote: > >> It looks to me like two of the three implementations promise no such > >> thing. > > > They're volatile vars, so why not? > > Yeah, but so are the caller's variables. That is, in > > pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_) > { > uint64 old; > old = ptr->value; > > ISTM that the compiler is required to actually fetch ptr->value, not > rely on some previous read of it. I do not think that (the first > version of) pg_atomic_read_u64_impl is adding any guarantee that wasn't > there already. > > >> Even if they somehow do, it hardly matters given that the cmpxchg loop > >> would be self-correcting. > > > Well, in this one instance maybe, hardly in others. > > All the functions involved use nigh-identical cmpxchg loops. > > > What are you suggesting as an alternative? > > I think we can just use "old = ptr->value" to set up for the cmpxchg > loop in every generic.h function that uses such a loop. I think we might have been talking past each other - I thought you were talking about changing the pg_atomic_read_u64_impl implementation for external users. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Andres Freund writes: > On 2017-09-06 15:12:13 -0400, Tom Lane wrote: >> It looks to me like two of the three implementations promise no such >> thing. > They're volatile vars, so why not? Yeah, but so are the caller's variables. That is, in pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_) { uint64 old; old = ptr->value; ISTM that the compiler is required to actually fetch ptr->value, not rely on some previous read of it. I do not think that (the first version of) pg_atomic_read_u64_impl is adding any guarantee that wasn't there already. >> Even if they somehow do, it hardly matters given that the cmpxchg loop >> would be self-correcting. > Well, in this one instance maybe, hardly in others. All the functions involved use nigh-identical cmpxchg loops. > What are you suggesting as an alternative? I think we can just use "old = ptr->value" to set up for the cmpxchg loop in every generic.h function that uses such a loop. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On 2017-09-06 15:12:13 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2017-09-06 14:31:26 -0400, Tom Lane wrote: > >> However, if that's the reasoning, why don't we make all of these > >> use simple reads? It seems unlikely that a locked read is free. > > > We don't really use locked reads? All the _atomic_ wrapper forces is an > > actual read from memory rather than a register. > > It looks to me like two of the three implementations promise no such > thing. They're volatile vars, so why not? > Even if they somehow do, it hardly matters given that the cmpxchg loop > would be self-correcting. Well, in this one instance maybe, hardly in others. > Mostly, though, I'm looking at the fallback pg_atomic_read_u64_impl > implementation (with a CAS), which seems far more expensive than can > be justified for this. What are you suggesting as an alternative? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Andres Freund writes: > On 2017-09-06 14:31:26 -0400, Tom Lane wrote: >> However, if that's the reasoning, why don't we make all of these >> use simple reads? It seems unlikely that a locked read is free. > We don't really use locked reads? All the _atomic_ wrapper forces is an > actual read from memory rather than a register. It looks to me like two of the three implementations promise no such thing. Even if they somehow do, it hardly matters given that the cmpxchg loop would be self-correcting. Mostly, though, I'm looking at the fallback pg_atomic_read_u64_impl implementation (with a CAS), which seems far more expensive than can be justified for this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Hi, On 2017-09-06 14:31:26 -0400, Tom Lane wrote: > I wrote: > > Anyway, I don't have a big objection to applying this. My concern > > is more that we need to be taking a harder look at other parts of > > the atomics infrastructure, because tweaks there are likely to buy > > much more. > > I went ahead and pushed these patches, adding __sync_fetch_and_sub > since gcc seems to provide that on the same footing as these other > functions. > > Looking at these generic functions a bit closer, I notice that > most of them are coded like > > old = pg_atomic_read_u32_impl(ptr); > while (!pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_)) > /* skip */; > > but there's an exception: pg_atomic_exchange_u64_impl just does > > old = ptr->value; > while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_)) > /* skip */; > > That's interesting. Why not pg_atomic_read_u64_impl there? Because our platforms don't generally guaranatee that 64bit reads: /* * 64 bit reads aren't safe on all platforms. In the generic * implementation implement them as a compare/exchange with 0. That'll * fail or succeed, but always return the old value. Possible might store * a 0, but only if the prev. value also was a 0 - i.e. harmless. */ pg_atomic_compare_exchange_u64_impl(ptr, &old, 0); so I *guess* I decided doing an unnecessary cmpxchg wasn't useful. But since then we've added an optimization for platforms where we know 64bit reads have single copy atomicity. So we could change that now. > I think that the simple read is actually okay as it stands: if we > are unlucky enough to get a torn read, the first compare/exchange > will fail to compare equal, and it will replace "old" with a valid > atomically-read value, and then the next loop iteration has a chance > to succeed. Therefore there's no need to pay the extra cost of a > locked read instead of an unlocked one. > > However, if that's the reasoning, why don't we make all of these > use simple reads? It seems unlikely that a locked read is free. We don't really use locked reads? All the _atomic_ wrapper forces is an actual read from memory rather than a register. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
I wrote: > Anyway, I don't have a big objection to applying this. My concern > is more that we need to be taking a harder look at other parts of > the atomics infrastructure, because tweaks there are likely to buy > much more. I went ahead and pushed these patches, adding __sync_fetch_and_sub since gcc seems to provide that on the same footing as these other functions. Looking at these generic functions a bit closer, I notice that most of them are coded like old = pg_atomic_read_u32_impl(ptr); while (!pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_)) /* skip */; but there's an exception: pg_atomic_exchange_u64_impl just does old = ptr->value; while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_)) /* skip */; That's interesting. Why not pg_atomic_read_u64_impl there? I think that the simple read is actually okay as it stands: if we are unlucky enough to get a torn read, the first compare/exchange will fail to compare equal, and it will replace "old" with a valid atomically-read value, and then the next loop iteration has a chance to succeed. Therefore there's no need to pay the extra cost of a locked read instead of an unlocked one. However, if that's the reasoning, why don't we make all of these use simple reads? It seems unlikely that a locked read is free. If there's actually a reason for pg_atomic_exchange_u64_impl to be different from the rest, it needs to have a comment explaining why. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Hi Jeff, On 09/05/2017 03:47 PM, Jeff Janes wrote: I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using both logged and unlogged tables. Also ran an internal benchmark which didn't show anything either. What scale factor and client count? How many cores per socket? It looks like Sokolov was just starting to see gains at 200 clients on 72 cores, using -N transaction. I have done a run with scale factor 300, and another with 3000 on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD machine; up to 200 clients. I would consider the runs as "noise" as I'm seeing +-1% for all client counts, so nothing like Yura is seeing in [1] for the higher client counts. I did a run with -N too using scale factor 300, using the settings in [1], but with same result (+-1%). [1] https://www.postgresql.org/message-id/d62d7d9d473d07e172d799d5a57e7...@postgrespro.ru Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Robert Haas writes: > It's not a question of whether the return value is used, but of > whether the updated value of *old is used. Right, but if we re-read "old" in the loop, and if the primitive doesn't return "old" (or does, but call site ignores it) then in principle the CAS might be strength-reduced a bit. Whether that can win enough to be better than removing the unlocked read, I dunno. In the case at hand, looking at a loop like while (count-- > 0) { (void) pg_atomic_fetch_or_u32(myptr, 1); } with our HEAD code and RHEL6's gcc I get this for the inner loop: .L9: movl(%rdx), %eax movl%eax, %ecx orl $1, %ecx lock cmpxchgl%ecx,(%rdx) setz%cl testb %cl, %cl je .L9 subq$1, %rbx testq %rbx, %rbx jg .L9 Applying the proposed generic.h patch, it becomes .L10: movl(%rsi), %eax .L9: movl%eax, %ecx orl $1, %ecx lock cmpxchgl%ecx,(%rdx) setz%cl testb %cl, %cl je .L9 subq$1, %rbx testq %rbx, %rbx jg .L10 Note that in both cases the cmpxchgl is coming out of the asm construct in pg_atomic_compare_exchange_u32_impl from atomics/arch-x86.h, so that even if a simpler assembly instruction were possible given that we don't need %eax to be updated, there's no chance of that actually happening. This gets back to the point I made in the other thread that maybe the arch-x86.h asm sequences are not an improvement over the gcc intrinsics. The code is the same if the loop is modified to use the result of pg_atomic_fetch_or_u32, so I won't bother showing that. Adding the proposed generic-gcc.h patch, the loop reduces to .L11: lock orl$1, (%rax) subq$1, %rbx testq %rbx, %rbx jg .L11 or if we make the loop do junk += pg_atomic_fetch_or_u32(myptr, 1); then we get .L13: movl(%rsi), %eax .L10: movl%eax, %edi movl%eax, %ecx orl $1, %ecx lock cmpxchgl %ecx, (%rdx) jne .L10 addl%edi, %r8d subq$1, %rbx testq %rbx, %rbx jg .L13 So that last is slightly better than the generic.h + asm CAS version, perhaps not meaningfully so --- but it's definitely better when the compiler can see the result isn't used. In short then, given the facts on the ground here, in particular the asm-based CAS primitive at the heart of these loops, it's clear that taking the read out of the loop can't hurt. But that should be read as a rather narrow conclusion. With a different compiler and/or a different version of pg_atomic_compare_exchange_u32_impl, maybe the answer would be different. And of course it's moot once the generic-gcc.h patch is applied. Anyway, I don't have a big objection to applying this. My concern is more that we need to be taking a harder look at other parts of the atomics infrastructure, because tweaks there are likely to buy much more. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On Wed, Sep 6, 2017 at 12:01 PM, Tom Lane wrote: >> This seems like a pretty sound argument to me. I think Tom's probably >> right that the changes in generic-gcc.h are the important ones, but >> I'm not sure that's an argument against patching generics.h. Given >> that pg_atomic_compare_exchange_u32_impl is defined to update *old >> there seems to be no reason to call pg_atomic_read_u32_impl every time >> through the loop. > > Probably not. I'm not quite 100% convinced of that, because of my > observation that gcc is capable of generating different and better > code for some of these primitives if it can prove that the return > value is not needed. It's not clear that that could apply in any > of these uses of pg_atomic_compare_exchange_u32_impl, though. > In any case, by my own argument, it shouldn't matter, because if > any of these are really performance-critical then we should be > looking for better ways. It's not a question of whether the return value is used, but of whether the updated value of *old is used. Unless I'm confused. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Robert Haas writes: > On Wed, Sep 6, 2017 at 9:42 AM, Sokolov Yura > wrote: >> But I think, generic version still should be "fixed". >> If generic version is not reached on any platform, then why it is kept? >> If it is reached somewhere, then it should be improved. > This seems like a pretty sound argument to me. I think Tom's probably > right that the changes in generic-gcc.h are the important ones, but > I'm not sure that's an argument against patching generics.h. Given > that pg_atomic_compare_exchange_u32_impl is defined to update *old > there seems to be no reason to call pg_atomic_read_u32_impl every time > through the loop. Probably not. I'm not quite 100% convinced of that, because of my observation that gcc is capable of generating different and better code for some of these primitives if it can prove that the return value is not needed. It's not clear that that could apply in any of these uses of pg_atomic_compare_exchange_u32_impl, though. In any case, by my own argument, it shouldn't matter, because if any of these are really performance-critical then we should be looking for better ways. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On Wed, Sep 6, 2017 at 9:42 AM, Sokolov Yura wrote: > Yes, you're right. > > But I think, generic version still should be "fixed". > If generic version is not reached on any platform, then why it is kept? > If it is reached somewhere, then it should be improved. This seems like a pretty sound argument to me. I think Tom's probably right that the changes in generic-gcc.h are the important ones, but I'm not sure that's an argument against patching generics.h. Given that pg_atomic_compare_exchange_u32_impl is defined to update *old there seems to be no reason to call pg_atomic_read_u32_impl every time through the loop. Whether doing so hurts performance in practice is likely to depend on a bunch of stuff, but we don't normally refuse to remove unnecessary code on the grounds that it will rarely be reached. Given that CPUs are weird, it's possible that there is some system where throwing an extra read of a value we already have into the loop works out to a win, but in the absence of evidence I'm reluctant to presume it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On 2017-09-06 16:36, Tom Lane wrote: Sokolov Yura writes: On 2017-09-06 15:56, Tom Lane wrote: The point I'm trying to make is that if tweaking generic.h improves performance then it's an indicator of missed cases in the less-generic atomics code, and the latter is where our attention should be focused. I think basically all of the improvement Sokolov got was from upgrading the coverage of generic-gcc.h. Not exactly. I've checked, that new version of generic pg_atomic_fetch_or_u32 loop also gives improvement. But once you put in the generic-gcc version, that's not reached anymore. Yes, you're right. But I think, generic version still should be "fixed". If generic version is not reached on any platform, then why it is kept? If it is reached somewhere, then it should be improved. -- Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Sokolov Yura writes: > On 2017-09-06 15:56, Tom Lane wrote: >> The point I'm trying to make is that if tweaking generic.h improves >> performance then it's an indicator of missed cases in the less-generic >> atomics code, and the latter is where our attention should be focused. >> I think basically all of the improvement Sokolov got was from upgrading >> the coverage of generic-gcc.h. > Not exactly. I've checked, that new version of generic > pg_atomic_fetch_or_u32 > loop also gives improvement. But once you put in the generic-gcc version, that's not reached anymore. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On 2017-09-06 15:56, Tom Lane wrote: Simon Riggs writes: On 5 September 2017 at 21:23, Tom Lane wrote: Moreover, it matters which primitive you're testing, on which platform, with which compiler, because we have a couple of layers of atomic ops implementations. If there is no gain on 2-socket, at least there is no loss either. The point I'm trying to make is that if tweaking generic.h improves performance then it's an indicator of missed cases in the less-generic atomics code, and the latter is where our attention should be focused. I think basically all of the improvement Sokolov got was from upgrading the coverage of generic-gcc.h. regards, tom lane Not exactly. I've checked, that new version of generic pg_atomic_fetch_or_u32 loop also gives improvement. Without that check I'd not suggest to fix generic atomic functions. Of course, gcc intrinsic gives more gain. -- Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Simon Riggs writes: > On 5 September 2017 at 21:23, Tom Lane wrote: >> Moreover, it matters which primitive you're testing, on which platform, >> with which compiler, because we have a couple of layers of atomic ops >> implementations. > If there is no gain on 2-socket, at least there is no loss either. The point I'm trying to make is that if tweaking generic.h improves performance then it's an indicator of missed cases in the less-generic atomics code, and the latter is where our attention should be focused. I think basically all of the improvement Sokolov got was from upgrading the coverage of generic-gcc.h. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On 5 September 2017 at 21:23, Tom Lane wrote: > Jeff Janes writes: >> What scale factor and client count? How many cores per socket? It looks >> like Sokolov was just starting to see gains at 200 clients on 72 cores, >> using -N transaction. ... > Moreover, it matters which primitive you're testing, on which platform, > with which compiler, because we have a couple of layers of atomic ops > implementations. ... I think Sokolov was aiming at 4-socket servers specifically, rather than as a general performance gain. If there is no gain on 2-socket, at least there is no loss either. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On 2017-09-06 14:54, Sokolov Yura wrote: On 2017-09-06 07:23, Tom Lane wrote: Jeff Janes writes: What scale factor and client count? How many cores per socket? It looks like Sokolov was just starting to see gains at 200 clients on 72 cores, using -N transaction. This means that Sokolov's proposed changes in atomics/generic.h ought to be uninteresting for performance on any platform we care about --- at least for pg_atomic_fetch_add_u32(). May I cite you this way: I apologize for tone of my previous message. My tongue is my enemy. -- Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On 2017-09-06 07:23, Tom Lane wrote: Jeff Janes writes: What scale factor and client count? How many cores per socket? It looks like Sokolov was just starting to see gains at 200 clients on 72 cores, using -N transaction. This means that Sokolov's proposed changes in atomics/generic.h ought to be uninteresting for performance on any platform we care about --- at least for pg_atomic_fetch_add_u32(). May I cite you this way: "Tom Lane says PostgreSQL core team doesn't care for 4 socket 72 core Intel Xeon servers" ??? To be honestly, I didn't measure exact impact of changing pg_atomic_fetch_add_u32. I found issue with pg_atomic_fetch_or_u32, cause it is highly contended both in LWLock, and in buffer page state management. I found changing loop for generic pg_atomic_fetch_or_u32 already gives improvement. And I changed loop for other generic atomic functions to make all functions uniform. It is bad style guide not to changing worse (but correct) code into better (and also correct) just because you can't measure difference (in my opinion. Perhaps i am mistaken). (and yes: gcc intrinsic gives more improvement). -- With regards, Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Jeff Janes writes: > What scale factor and client count? How many cores per socket? It looks > like Sokolov was just starting to see gains at 200 clients on 72 cores, > using -N transaction. I spent some time poking at this with the test scaffolding I showed at https://www.postgresql.org/message-id/17694.1504665058%40sss.pgh.pa.us AFAICT, there are not huge differences between different coding methods even for two processes in direct competition in the tightest possible atomic-ops loops. So if you're testing SQL operations, you're definitely not going to see anything without a whole lot of concurrent processes. Moreover, it matters which primitive you're testing, on which platform, with which compiler, because we have a couple of layers of atomic ops implementations. I started out testing pg_atomic_fetch_add_u32(), as shown in the above- mentioned message. On x86_x64 with gcc, that is implemented by the code for it in src/include/port/atomics/arch-x86.h. If you dike out that support, it falls back to the version in atomics/generic-gcc.h, which seems no worse and possibly better. Only if you also dike out generic-gcc.h do you get to the version in atomics/generic.h that this patch wants to change. However, at that point you're also down to a spinlock-based implementation of the underlying pg_atomic_read_u32_impl and pg_atomic_compare_exchange_u32_impl, which means that performance is going to be less than great anyway. No platform that we consider well-supported ought to be hitting the spinlock paths. This means that Sokolov's proposed changes in atomics/generic.h ought to be uninteresting for performance on any platform we care about --- at least for pg_atomic_fetch_add_u32(). However, Sokolov also proposes adding gcc-intrinsic support for pg_atomic_fetch_and_xxx and pg_atomic_fetch_or_xxx. This is a different kettle of fish. I repeated the experiments I'd done for pg_atomic_fetch_add_u32(), per the above message, using the "or" primitive, and found largely the same results as for "add": it's slightly better under contention than the generic code, and significantly better if the result value of the atomic op isn't needed. So I think we should definitely take the part of the patch that changes generic-gcc.h --- and we should check to see if we're missing out on any other gcc primitives we should be using there. I'm less excited about the proposed changes in generic.h. There's nothing wrong with them in principle, but they mostly shouldn't make any performance difference on any platform we care about, because we shouldn't get to that code in the first place on any platform we care about. If we are getting to that code for any specific primitive, then either there's a gap in the platform-specific or compiler-specific support, or it's debatable that we ought to consider that primitive to be very primitive. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Hi, On 05/25/2017 11:12 AM, Sokolov Yura wrote: I agree that lonely semicolon looks bad. Applied your suggestion for empty loop body (/* skip */). Patch in first letter had while(true), but I removed it cause I think it is uglier: - `while(true)` was necessary for grouping read with `if`, - but now there is single statement in a loop body and it is condition for loop exit, so it is clearly just a loop. Optimization is valid cause compare_exchange always store old value in `old` variable in a same atomic manner as atomic read. I have tested this patch on a 2-socket machine, but don't see any performance change in the various runs. However, there is no regression either in all cases. As such, I have marked the entry "Ready for Committer". Remember to add a version postfix to your patches such that is easy to identify which is the latest version. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On Tue, Sep 5, 2017 at 11:39 AM, Jesper Pedersen wrote: > On 09/05/2017 02:24 PM, Tom Lane wrote: > >> Jesper Pedersen writes: >> >>> I have tested this patch on a 2-socket machine, but don't see any >>> performance change in the various runs. However, there is no regression >>> either in all cases. >>> >> >> Hm, so if we can't demonstrate a performance win, it's hard to justify >> risking touching this code. What test case(s) did you use? >> >> > I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using > both logged and unlogged tables. Also ran an internal benchmark which > didn't show anything either. > What scale factor and client count? How many cores per socket? It looks like Sokolov was just starting to see gains at 200 clients on 72 cores, using -N transaction. Cheers, Jeff
Re: [HACKERS] Fix performance of generic atomics
Jesper Pedersen writes: > On 09/05/2017 02:24 PM, Tom Lane wrote: >> Hm, so if we can't demonstrate a performance win, it's hard to justify >> risking touching this code. What test case(s) did you use? > I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using > both logged and unlogged tables. Also ran an internal benchmark which > didn't show anything either. That may just mean that pgbench isn't stressing any atomic ops very hard (at least in the default scenario). I'm tempted to write a little C function that just hits the relevant atomic ops in a tight loop, and see how long it takes to do a few million iterations. That would be erring in the opposite direction, of overstating the importance of atomic ops to real-world scenarios --- but if we didn't get any win that way, then it's surely in the noise. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
On 09/05/2017 02:24 PM, Tom Lane wrote: Jesper Pedersen writes: I have tested this patch on a 2-socket machine, but don't see any performance change in the various runs. However, there is no regression either in all cases. Hm, so if we can't demonstrate a performance win, it's hard to justify risking touching this code. What test case(s) did you use? I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using both logged and unlogged tables. Also ran an internal benchmark which didn't show anything either. Setting the entry back to "Needs Review" for additional feedback from others. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Jesper Pedersen writes: > I have tested this patch on a 2-socket machine, but don't see any > performance change in the various runs. However, there is no regression > either in all cases. Hm, so if we can't demonstrate a performance win, it's hard to justify risking touching this code. What test case(s) did you use? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Hi, On 05/25/2017 11:12 AM, Sokolov Yura wrote: I agree that lonely semicolon looks bad. Applied your suggestion for empty loop body (/* skip */). Patch in first letter had while(true), but I removed it cause I think it is uglier: - `while(true)` was necessary for grouping read with `if`, - but now there is single statement in a loop body and it is condition for loop exit, so it is clearly just a loop. Optimization is valid cause compare_exchange always store old value in `old` variable in a same atomic manner as atomic read. I have tested this patch on a 2-socket machine, but don't see any performance change in the various runs. However, there is no regression either in all cases. As such, I have marked the entry "Ready for Committer". Remember to add a version postfix to your patches such that is easy to identify which is the latest version. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Good day, every one. I'm just posting benchmark numbers for atomics patch. Hardware: 4 socket 72 core (144HT) x86_64 Centos 7.1 postgresql.conf tuning: shared_buffers = 32GB fsync = on synchronous_commit = on full_page_writes = off wal_buffers = 16MB wal_writer_flush_after = 16MB commit_delay = 2 max_wal_size = 16GB Results: pgbench -i -s 300 + pgbench --skip-some-updates Clients | master | atomics +=+=== 50 | 53.1k | 53.2k 100 | 101.2k | 103.5k 150 | 119.1k | 121.9k 200 | 128.7k | 132.5k 252 | 120.2k | 130.0k 304 | 100.8k | 115.9k 356 | 78.1k | 90.1k 395 | 70.2k | 79.0k 434 | 61.6k | 70.7k Also graph with more points attached. On 2017-05-25 18:12, Sokolov Yura wrote: Hello, Tom. I agree that lonely semicolon looks bad. Applied your suggestion for empty loop body (/* skip */). Patch in first letter had while(true), but I removed it cause I think it is uglier: - `while(true)` was necessary for grouping read with `if`, - but now there is single statement in a loop body and it is condition for loop exit, so it is clearly just a loop. Optimization is valid cause compare_exchange always store old value in `old` variable in a same atomic manner as atomic read. Tom Lane wrote 2017-05-25 17:39: Sokolov Yura writes: @@ -382,12 +358,8 @@ static inline uint64 pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_) { uint64 old; - while (true) - { - old = pg_atomic_read_u64_impl(ptr); - if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)) - break; - } + old = pg_atomic_read_u64_impl(ptr); + while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)); return old; } #endif FWIW, I do not think that writing the loops like that is good style. It looks like a typo and will confuse readers. You could perhaps write the same code with better formatting, eg while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)) /* skip */ ; but why not leave the formulation with while(true) and a break alone? (I take no position on whether moving the read of "old" outside the loop is a valid optimization.) regards, tom lane With regards, -- Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
Hello, Tom. I agree that lonely semicolon looks bad. Applied your suggestion for empty loop body (/* skip */). Patch in first letter had while(true), but I removed it cause I think it is uglier: - `while(true)` was necessary for grouping read with `if`, - but now there is single statement in a loop body and it is condition for loop exit, so it is clearly just a loop. Optimization is valid cause compare_exchange always store old value in `old` variable in a same atomic manner as atomic read. Tom Lane wrote 2017-05-25 17:39: Sokolov Yura writes: @@ -382,12 +358,8 @@ static inline uint64 pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_) { uint64 old; - while (true) - { - old = pg_atomic_read_u64_impl(ptr); - if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)) - break; - } + old = pg_atomic_read_u64_impl(ptr); + while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)); return old; } #endif FWIW, I do not think that writing the loops like that is good style. It looks like a typo and will confuse readers. You could perhaps write the same code with better formatting, eg while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)) /* skip */ ; but why not leave the formulation with while(true) and a break alone? (I take no position on whether moving the read of "old" outside the loop is a valid optimization.) regards, tom lane With regards, -- Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres CompanyFrom d377e36da652ade715b868a6fc7164736f38d1df Mon Sep 17 00:00:00 2001 From: Sokolov Yura Date: Thu, 25 May 2017 14:54:45 +0300 Subject: [PATCH] Fix performance of Atomics generic implementation pg_atomic_compare_exchange_*_impl stores old value into `old` variable, so there is no need to reread value in loop body. Also add gcc specific __sync_fetch_and_(or|and), cause looks like compiler may optimize code around intrinsic better than around assembler, although intrinsic is compiled to almost same CAS loop. --- src/include/port/atomics/generic-gcc.h | 36 + src/include/port/atomics/generic.h | 72 -- 2 files changed, 60 insertions(+), 48 deletions(-) diff --git a/src/include/port/atomics/generic-gcc.h b/src/include/port/atomics/generic-gcc.h index 7efc0861e7..79ada94047 100644 --- a/src/include/port/atomics/generic-gcc.h +++ b/src/include/port/atomics/generic-gcc.h @@ -185,6 +185,24 @@ pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_) } #endif +#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U32) && defined(HAVE_GCC__SYNC_INT32_CAS) +#define PG_HAVE_ATOMIC_FETCH_AND_U32 +static inline uint32 +pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_) +{ + return __sync_fetch_and_and(&ptr->value, and_); +} +#endif + +#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U32) && defined(HAVE_GCC__SYNC_INT32_CAS) +#define PG_HAVE_ATOMIC_FETCH_OR_U32 +static inline uint32 +pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_) +{ + return __sync_fetch_and_or(&ptr->value, or_); +} +#endif + #if !defined(PG_DISABLE_64_BIT_ATOMICS) @@ -223,6 +241,24 @@ pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_) } #endif +#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U64) && defined(HAVE_GCC__SYNC_INT64_CAS) +#define PG_HAVE_ATOMIC_FETCH_AND_U64 +static inline uint64 +pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_) +{ + return __sync_fetch_and_and(&ptr->value, and_); +} +#endif + +#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U64) && defined(HAVE_GCC__SYNC_INT64_CAS) +#define PG_HAVE_ATOMIC_FETCH_OR_U64 +static inline uint64 +pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_) +{ + return __sync_fetch_and_or(&ptr->value, or_); +} +#endif + #endif /* !defined(PG_DISABLE_64_BIT_ATOMICS) */ #endif /* defined(HAVE_ATOMICS) */ diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h index 424543604a..75ffaf6e87 100644 --- a/src/include/port/atomics/generic.h +++ b/src/include/port/atomics/generic.h @@ -170,12 +170,9 @@ static inline uint32 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_) { uint32 old; - while (true) - { - old = pg_atomic_read_u32_impl(ptr); - if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_)) - break; - } + old = pg_atomic_read_u32_impl(ptr); + while (!pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_)) + /* skip */; return old; } #endif @@ -186,12 +183,9 @@ static inline uint32 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_) { uint32 old; - while (true) - { - old = pg_atomic_read_u32_impl(ptr); - if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old + add_)) - break; - } + old = pg_atomic_read_u32_imp
Re: [HACKERS] Fix performance of generic atomics
Hi Yura, > Attached patch contains patch for all generic atomic > functions, and also __sync_fetch_and_(or|and) for gcc, cause > I believe GCC optimize code around intrinsic better than > around inline assembler. > (final performance is around 86000tps, but difference between > 83000tps and 86000tps is not so obvious in NUMA system). I don't see any patch in my email client or pgsql-hackers@ archive. I would also recommend to add your patch to the nearest commitfest [1]. [1] https://commitfest.postgresql.org/ -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature
Re: [HACKERS] Fix performance of generic atomics
Sokolov Yura writes: @@ -382,12 +358,8 @@ static inline uint64 pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_) { uint64 old; - while (true) - { - old = pg_atomic_read_u64_impl(ptr); - if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)) - break; - } + old = pg_atomic_read_u64_impl(ptr); + while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)); return old; } #endif FWIW, I do not think that writing the loops like that is good style. It looks like a typo and will confuse readers. You could perhaps write the same code with better formatting, eg while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_)) /* skip */ ; but why not leave the formulation with while(true) and a break alone? (I take no position on whether moving the read of "old" outside the loop is a valid optimization.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix performance of generic atomics
A bit cleaner version of a patch. Sokolov Yura писал 2017-05-25 15:22: Good day, everyone. I've been played with pgbench on huge machine. (72 cores, 56 for postgresql, enough memory to fit base both into shared_buffers and file cache) (pgbench scale 500, unlogged tables, fsync=off, synchronous commit=off, wal_writer_flush_after=0). With 200 clients performance is around 76000tps and main bottleneck in this dumb test is LWLockWaitListLock. I added gcc specific implementation for pg_atomic_fetch_or_u32_impl (ie using __sync_fetch_and_or) and performance became 83000tps. It were a bit strange at a first look, cause __sync_fetch_and_or compiles to almost same CAS loop. Looking closely, I noticed that intrinsic performs doesn't do read in the loop body, but at loop initialization. It is correct behavior cause `lock cmpxchg` instruction stores old value in EAX register. It is expected behavior, and pg_compare_and_exchange_*_impl does the same in all implementations. So there is no need to re-read value in the loop body: Example diff for pg_atomic_exchange_u32_impl: static inline uint32 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_) { uint32 old; + old = pg_atomic_read_u32_impl(ptr); while (true) { - old = pg_atomic_read_u32_impl(ptr); if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_)) break; } return old; } After applying this change to all generic atomic functions (and for pg_atomic_fetch_or_u32_impl ), performance became equal to __sync_fetch_and_or intrinsic. Attached patch contains patch for all generic atomic functions, and also __sync_fetch_and_(or|and) for gcc, cause I believe GCC optimize code around intrinsic better than around inline assembler. (final performance is around 86000tps, but difference between 83000tps and 86000tps is not so obvious in NUMA system). With regards, -- Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres CompanyFrom 2c58a32ca89867e7d244e7037a38aa9ccc2b92c2 Mon Sep 17 00:00:00 2001 From: Sokolov Yura Date: Thu, 25 May 2017 14:54:45 +0300 Subject: [PATCH] Fix performance of Atomics generic implementation pg_atomic_compare_exchange_*_impl modifies value of `old` pointer, so there is no need to reread value in loop body. Also add gcc specific __sync_fetch_and_(or|and), cause looks like compiler may optimize code around intrisic better than around assembler, although intrisic is compiled to almost same CAS loop. --- src/include/port/atomics/generic-gcc.h | 36 +++ src/include/port/atomics/generic.h | 64 +- 2 files changed, 52 insertions(+), 48 deletions(-) diff --git a/src/include/port/atomics/generic-gcc.h b/src/include/port/atomics/generic-gcc.h index 7efc0861e7..79ada94047 100644 --- a/src/include/port/atomics/generic-gcc.h +++ b/src/include/port/atomics/generic-gcc.h @@ -185,6 +185,24 @@ pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_) } #endif +#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U32) && defined(HAVE_GCC__SYNC_INT32_CAS) +#define PG_HAVE_ATOMIC_FETCH_AND_U32 +static inline uint32 +pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_) +{ + return __sync_fetch_and_and(&ptr->value, and_); +} +#endif + +#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U32) && defined(HAVE_GCC__SYNC_INT32_CAS) +#define PG_HAVE_ATOMIC_FETCH_OR_U32 +static inline uint32 +pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_) +{ + return __sync_fetch_and_or(&ptr->value, or_); +} +#endif + #if !defined(PG_DISABLE_64_BIT_ATOMICS) @@ -223,6 +241,24 @@ pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_) } #endif +#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U64) && defined(HAVE_GCC__SYNC_INT64_CAS) +#define PG_HAVE_ATOMIC_FETCH_AND_U64 +static inline uint64 +pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_) +{ + return __sync_fetch_and_and(&ptr->value, and_); +} +#endif + +#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U64) && defined(HAVE_GCC__SYNC_INT64_CAS) +#define PG_HAVE_ATOMIC_FETCH_OR_U64 +static inline uint64 +pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_) +{ + return __sync_fetch_and_or(&ptr->value, or_); +} +#endif + #endif /* !defined(PG_DISABLE_64_BIT_ATOMICS) */ #endif /* defined(HAVE_ATOMICS) */ diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h index 424543604a..6d2671beab 100644 --- a/src/include/port/atomics/generic.h +++ b/src/include/port/atomics/generic.h @@ -170,12 +170,8 @@ static inline uint32 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_) { uint32 old; - while (true) - { - old = pg_atomic_read_u32_impl(ptr); - if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_)) - break; - } + old = pg_atomic_read_u32_impl(ptr); + while (!pg_atomic_co