Re: WAL Insertion Lock Improvements

2023-12-20 Thread Michael Paquier
On Mon, Dec 18, 2023 at 10:00:29PM -0600, Nathan Bossart wrote:
> I found this code when searching for callers that use atomic exchanges as
> atomic writes with barriers (for a separate thread [0]).  Can't we use
> pg_atomic_write_u64() here since the locking functions that follow should
> serve as barriers?

The full barrier guaranteed by pg_atomic_exchange_u64() in
LWLockUpdateVar() was also necessary for the shortcut based on
read_u32() to see if there are no waiters, but it has been discarded
in the later versions of the patch because it did not influence
performance under heavy WAL inserts.
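
For readers catching up on what that shortcut looked like, here is a minimal
sketch reconstructed from the discussion upthread (my reconstruction, not the
exact 0003 patch text; the LW_FLAG_HAS_WAITERS check is an assumption):

	/* in LWLockUpdateVar(), sketch of the dropped fast path */
	pg_atomic_exchange_u64(valptr, val);

	/*
	 * Quick exit when nobody is queued on this lock.  The full barrier of
	 * the exchange above matters here: the new value has to be visible
	 * before we look at the state, so that a waiter queueing itself
	 * concurrently is guaranteed to see it when it re-checks the variable
	 * after queueing.
	 */
	if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
		return;

	/* otherwise fall through to the usual wait-list handling */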

So you mean to rely on the full barriers taken by the fetches in
LWLockRelease() and LWLockWaitListLock()?  Hmm, I got the impression
that pg_atomic_exchange_u64() with its full barrier was necessary in
these two paths so that all the loads and stores are completed *before*
updating these variables.  So I am not sure I get why it would be
safe to switch to a write with no barrier.
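
For what it's worth, the implied barriers in question come from the atomic
read-modify-write operations inside those two routines; roughly (a paraphrase
of the current lwlock.c, trimmed to the relevant calls):

	/* LWLockWaitListLock(): grabs the wait-list bit with a full-barrier RMW */
	old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);

	/* LWLockRelease(): drops the lock, also a full-barrier RMW */
	oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);

A plain pg_atomic_write_u64() issued just before either of these should still
become visible before waiters are woken or the lock is seen as released, which
is the property the attached patch relies on; whether that covers all the
orderings we care about is exactly the question above.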

> I've attached a patch to demonstrate what I'm thinking.  This might be more
> performant, although maybe less so after commit 64b1fb5.  Am I missing
> something obvious here?  If not, I might rerun the benchmarks to see
> whether it makes any difference.

I was wondering as well if the numbers we got upthread would go up
after what you have committed to improve the exchanges.  :)
Any change in this area should be strictly benchmarked. 
--
Michael




Re: WAL Insertion Lock Improvements

2023-12-18 Thread Nathan Bossart
On Tue, Jul 25, 2023 at 04:43:16PM +0900, Michael Paquier wrote:
> 0001 has been now applied.  I have done more tests while looking at
> this patch since yesterday and was surprised to see higher TPS numbers
> on HEAD with the same tests as previously, and the patch was still
> shining with more than 256 clients.

I found this code when searching for callers that use atomic exchanges as
atomic writes with barriers (for a separate thread [0]).  Can't we use
pg_atomic_write_u64() here since the locking functions that follow should
serve as barriers?

I've attached a patch to demonstrate what I'm thinking.  This might be more
performant, although maybe less so after commit 64b1fb5.  Am I missing
something obvious here?  If not, I might rerun the benchmarks to see
whether it makes any difference.

[0] 
https://www.postgresql.org/message-id/flat/20231110205128.GB1315705%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..b43972bd2e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1752,10 +1752,10 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
 	PRINT_LWDEBUG("LWLockUpdateVar", lock, LW_EXCLUSIVE);
 
 	/*
-	 * Note that pg_atomic_exchange_u64 is a full barrier, so we're guaranteed
-	 * that the variable is updated before waking up waiters.
+	 * We rely on the barrier provided by LWLockWaitListLock() to ensure the
+	 * variable is updated before waking up waiters.
 	 */
-	pg_atomic_exchange_u64(valptr, val);
+	pg_atomic_write_u64(valptr, val);
 
 	proclist_init(&wakeup);
 
@@ -1881,10 +1881,10 @@ void
 LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
 {
 	/*
-	 * Note that pg_atomic_exchange_u64 is a full barrier, so we're guaranteed
-	 * that the variable is updated before releasing the lock.
+	 * We rely on the barrier provided by LWLockRelease() to ensure the
+	 * variable is updated before releasing the lock.
 	 */
-	pg_atomic_exchange_u64(valptr, val);
+	pg_atomic_write_u64(valptr, val);
 
 	LWLockRelease(lock);
 }


Re: WAL Insertion Lock Improvements

2023-07-28 Thread Bharath Rupireddy
On Wed, Jul 26, 2023 at 1:27 AM Andres Freund  wrote:
>
> > 0001 has been now applied.  I have done more tests while looking at
> > this patch since yesterday and was surprised to see higher TPS numbers
> > on HEAD with the same tests as previously, and the patch was still
> > shining with more than 256 clients.
>
> Just a small heads up:
>
> I just rebased my aio tree over the commit and promptly, on the first run, saw
> a hang. I did some debugging on that. Unfortunately repeated runs haven't
> repeated that hang, despite quite a bit of trying.

Hm. Please share the workload details, test scripts, system info and any
special settings so that I can run it in my setup.

> The symptom I was seeing is that all running backends were stuck in
> LWLockWaitForVar(), even though the value they're waiting for had
> changed. Which obviously "shouldn't be possible".

Were the backends stuck there indefinitely? IOW, did they get into a deadlock?

> It's of course possible that this is AIO specific, but I didn't see anything
> in stacks to suggest that.
>
> I do wonder if this possibly exposed an undocumented prior dependency on the
> value update always happening under the list lock.

I'm going through the other thread mentioned by Michael Paquier. I'm
wondering if the deadlock issue illustrated here
https://www.postgresql.org/message-id/55BB50D3.9000702%40iki.fi is
showing up again, because 71e4cc6b8e reduced the contention on
waitlist lock and made things *a bit* faster.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-07-26 Thread Michael Paquier
On Tue, Jul 25, 2023 at 04:43:16PM +0900, Michael Paquier wrote:
> We really need to do something in terms of documentation with
> something like 0002, so I'll try to look at that next.

I have applied a slightly-tweaked version of 0002 as of 66d86d4 to
improve the documentation of the area a bit, and marked the CF entry
as committed.

(I got interested in what Andres has seen on his latest AIO branch, so
I have a few extra benchmarks running in the background on HEAD, but
nothing has managed to freeze all the backends waiting for a variable
update yet.  These are still running now.)
--
Michael




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-26 07:40:31 +0900, Michael Paquier wrote:
> On Tue, Jul 25, 2023 at 09:49:01AM -0700, Andres Freund wrote:
> > FWIW, I'm working on a patch that replaces WAL insert locks as a whole,
> > because they don't scale all that well.
>
> What were you looking at here?  Just wondering.

Here's what I had written offlist a few days ago:

> The basic idea is to have a ringbuffer of in-progress insertions. The
> acquisition of a position in the ringbuffer is done at the same time as
> advancing the reserved LSN, using a 64bit xadd. The trick that makes that
> possible is to use the high bits of the atomic for the position in the
> ringbuffer. The point of using the high bits is that they wrap around, without
> affecting the rest of the value.
>
> Of course, when using xadd, we can't keep the "prev" pointer in the
> atomic. That's where the ringbuffer comes into play. Whenever one inserter has
> determined the byte pos of its insertion, it updates the "prev byte pos" in
> the *next* ringbuffer entry.
>
> Of course that means that insertion N+1 needs to wait for N to set the prev
> position - but that happens very quickly. In my very hacky prototype the
> relevant path (which for now just spins) is reached very rarely, even when
> massively oversubscribed. While I've not implemented that, N+1 could actually
> do the first "iteration" in CopyXLogRecordToWAL() before it needs the prev
> position, the COMP_CRC32C() could happen "inside" the buffer.
>
>
> There's a fair bit of trickiness in turning that into something working, of
> course :).  Ensuring that the ring buffer of insertions doesn't wrap around is
> non-trivial. Nor is it trivial to ensure that the "reduced" space LSN in the
> atomic can't overflow.
>
> I do wish MAX_BACKENDS were smaller...
>
>
> Until last night I thought all my schemes would continue to need something
> like the existing WAL insertion locks, to implement
> WALInsertLockAcquireExclusive().
>
> But I think I came up with an idea to do away with that (not even prototyped
> yet): Use one bit in the atomic that indicates that no new insertions are
> allowed. Whenever the xadd finds that the old value actually was locked, it
> "aborts" the insertion, and instead waits for a condition variable (or
> something similar). Of course that's after modifying the atomic - to deal with
> that, the "lock holder" reverts all modifications that have been made to the
> atomic when releasing the "lock"; they weren't actually successful and all
> those backends will retry.
>
> Except that this doesn't quite suffice - XLogInsertRecord() needs to be able
> to "roll back", when it finds that we now need to log FPIs. I can't quite see
> how to make that work with what I describe above. The only idea I have so far
> is to just waste the space with a NOOP record - it should be pretty rare. At
> least if we updated RedoRecPtr eagerly (or just stopped this stupid business
> of having an outdated backend-local copy).
>
>
>
> My prototype shows this idea to be promising. It's a tad slower at low
> concurrency, but much better at high concurrency. I think most if not all of
> the low-end overhead isn't inherent, but comes from having both old and new
> infrastructure in place (as well as a lot of debugging cruft).
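
To make the "position in the high bits" trick above more concrete, a toy
sketch follows.  Everything here is invented for illustration (names, bit
widths, ring size); it is not the actual prototype:

	#define INSERT_RING_SIZE	1024	/* power of two */
	#define RING_POS_SHIFT		48		/* high 16 bits = ring position */
	#define BYTEPOS_MASK		((UINT64CONST(1) << RING_POS_SHIFT) - 1)

	typedef struct WALInsertRingEntry
	{
		pg_atomic_uint64 prev_bytepos;	/* start position of the previous
										 * insertion, filled in by that
										 * inserter */
	} WALInsertRingEntry;

	static WALInsertRingEntry InsertRing[INSERT_RING_SIZE];
	static pg_atomic_uint64 ReserveState;	/* low 48 bits: reserved byte pos,
											 * high 16 bits: ring position
											 * (pg_atomic_init_u64() elided) */

	static void
	ReserveInsertion(uint64 size, uint64 *start_bytepos, uint32 *my_slot)
	{
		/* a single xadd reserves the bytes and advances the ring position */
		uint64		old = pg_atomic_fetch_add_u64(&ReserveState,
												  (UINT64CONST(1) << RING_POS_SHIFT) + size);

		*start_bytepos = old & BYTEPOS_MASK;

		/* the high bits wrap around without disturbing the byte position */
		*my_slot = (uint32) (old >> RING_POS_SHIFT) % INSERT_RING_SIZE;

		/*
		 * Publish our start position as the "prev byte pos" that insertion
		 * N+1 will need; that inserter may have to wait briefly until this
		 * store becomes visible.
		 */
		pg_atomic_write_u64(&InsertRing[(*my_slot + 1) % INSERT_RING_SIZE].prev_bytepos,
							*start_bytepos);
	}

Keeping the low bits from overflowing into the ring position, and keeping the
ring from wrapping onto still-in-progress insertions, are exactly the tricky
parts this sketch ignores.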


Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Michael Paquier
On Tue, Jul 25, 2023 at 09:49:01AM -0700, Andres Freund wrote:
> FWIW, I'm working on a patch that replaces WAL insert locks as a whole,
> because they don't scale all that well.

What were you looking at here?  Just wondering.
--
Michael




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Michael Paquier
On Tue, Jul 25, 2023 at 12:57:37PM -0700, Andres Freund wrote:
> I just rebased my aio tree over the commit and promptly, on the first run, saw
> a hang. I did some debugging on that. Unfortunately repeated runs haven't
> repeated that hang, despite quite a bit of trying.
> 
> The symptom I was seeing is that all running backends were stuck in
> LWLockWaitForVar(), even though the value they're waiting for had
> changed. Which obviously "shouldn't be possible".

Hmm.  I've also spent a few days looking at this past report that made
the LWLock part what it is today, but I don't immediately see
how it would be possible to reach a state where all the backends are
waiting for an update that's not happening:
https://www.postgresql.org/message-id/CAMkU=1zlztrowh3b42oxsb04r9zmesk3658qen4_8+b+k3e...@mail.gmail.com

All the assumptions of this code and its dependencies on
xloginsert.c are hard to keep track of.

> It's of course possible that this is AIO specific, but I didn't see anything
> in stacks to suggest that.

Or does AIO handle the WAL syncs so quickly that it has more chances of
showing a race condition here?

> I do wonder if this possibly exposed an undocumented prior dependency on the
> value update always happening under the list lock.

I would not be surprised by that.
--
Michael




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 16:43:16 +0900, Michael Paquier wrote:
> On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote:
> > Yes, it looks safe to me too.
> 
> 0001 has been now applied.  I have done more tests while looking at
> this patch since yesterday and was surprised to see higher TPS numbers
> on HEAD with the same tests as previously, and the patch was still
> shining with more than 256 clients.

Just a small heads up:

I just rebased my aio tree over the commit and promptly, on the first run, saw
a hang. I did some debugging on that. Unfortunately repeated runs haven't
repeated that hang, despite quite a bit of trying.

The symptom I was seeing is that all running backends were stuck in
LWLockWaitForVar(), even though the value they're waiting for had
changed. Which obviously "shouldn't be possible".

It's of course possible that this is AIO specific, but I didn't see anything
in stacks to suggest that.


I do wonder if this possibly exposed an undocumented prior dependency on the
value update always happening under the list lock.
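
For anyone trying to pin down what that dependency was, a condensed
before/after of LWLockUpdateVar() (heavily trimmed; the waiter wakeup logic
is elided):

	/* Before the commit: the variable was only ever changed while holding
	 * the wait list lock, so it could not change "underneath" anyone
	 * inspecting it under that same lock. */
	LWLockWaitListLock(lock);
	*valptr = val;
	/* ... collect and wake up waiters ... */
	LWLockWaitListUnlock(lock);

	/* After the commit: the update happens outside the wait list lock, and
	 * only the full barrier of the exchange orders it against the wakeups. */
	pg_atomic_exchange_u64(valptr, val);
	LWLockWaitListLock(lock);
	/* ... collect and wake up waiters ... */
	LWLockWaitListUnlock(lock);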

Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 16:43:16 +0900, Michael Paquier wrote:
> On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote:
> > Yes, it looks safe to me too.
> 
> 0001 has been now applied.  I have done more tests while looking at
> this patch since yesterday and was surprised to see higher TPS numbers
> on HEAD with the same tests as previously, and the patch was still
> shining with more than 256 clients.
> 
> > FWIW, 0001 essentially implements what
> > an existing TODO comment introduced by commit 008608b9d5106 says:
> 
> We really need to do something in terms of documentation with
> something like 0002, so I'll try to look at that next.  Regarding
> 0003, I don't know.  I think that we'd better look more into cases
> where it shows actual benefits for specific workloads (like workloads
> with a fixed rate of read and/or write operations?).

FWIW, I'm working on a patch that replaces WAL insert locks as a whole,
because they don't scale all that well. If there are no very clear improvements,
I'm not sure it's worth putting too much effort into polishing them all that
much.

Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Michael Paquier
On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote:
> Yes, it looks safe to me too.

0001 has been now applied.  I have done more tests while looking at
this patch since yesterday and was surprised to see higher TPS numbers
on HEAD with the same tests as previously, and the patch was still
shining with more than 256 clients.

> FWIW, 0001 essentially implements what
> an existing TODO comment introduced by commit 008608b9d5106 says:

We really need to do something in terms of documentation with
something like 0002, so I'll try to look at that next.  Regarding
0003, I don't know.  I think that we'd better look more into cases
where it shows actual benefits for specific workloads (like workloads
with a fixed rate of read and/or write operations?).
--
Michael




Re: WAL Insertion Lock Improvements

2023-07-22 Thread Bharath Rupireddy
On Fri, Jul 21, 2023 at 11:29 AM Michael Paquier  wrote:
>
> +   /* Reading atomically avoids getting a torn value */
> +   value = pg_atomic_read_u64(valptr);
>
> Should this specify that this is specifically important for platforms
> where reading a uint64 could lead to a torn value read, if you apply
> this term in this context?  Sounding picky, I would make that a bit
> longer, say something like that:
> "Reading this value atomically is safe even on platforms where uint64
> cannot be read without observing a torn value."
>
> Only xlogprefetcher.c uses the term "torn" for a value by the way, but
> for a write.

Done.

> 0001 looks OK-ish seen from here.  Thoughts?

Yes, it looks safe to me too. FWIW, 0001 essentially implements what
an existing TODO comment introduced by commit 008608b9d5106 says:

/*
 * Read value using the lwlock's wait list lock, as we can't generally
 * rely on atomic 64 bit reads/stores.  TODO: On platforms with a way to
 * do atomic 64 bit reads/writes the spinlock should be optimized away.
 */

I'm attaching v10 patch set here - 0001 has modified the comment as
above, no other changes in patch set.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v10-0001-Optimize-WAL-insertion-lock-acquisition-and-rele.patch
Description: Binary data


v10-0002-Improve-commets-in-and-around-LWLockWaitForVar-c.patch
Description: Binary data


v10-0003-Have-a-quick-exit-for-LWLockUpdateVar-when-there.patch
Description: Binary data


Re: WAL Insertion Lock Improvements

2023-07-20 Thread Michael Paquier
On Thu, Jul 20, 2023 at 02:38:29PM +0530, Bharath Rupireddy wrote:
> On Fri, Jul 14, 2023 at 4:17 AM Andres Freund  wrote:
>> I think this commit does too many things at once.
> 
> I've split the patch into three - 1) Make insertingAt 64-bit atomic.
> 2) Have better commenting on why there's no memory barrier or spinlock
> in and around LWLockWaitForVar call sites. 3) Have a quick exit for
> LWLockUpdateVar.

FWIW, I was kind of already OK with 0001, as it shows most of the
gains observed while 0003 had a limited impact:
https://www.postgresql.org/message-id/CALj2ACWgeOPEKVY-TEPvno%3DVatyzrb-vEEP8hN7QqrQ%3DyPRupA%40mail.gmail.com

It is kind of a no-brainer to replace the spinlocks with atomic reads
and writes there.

>> I don't think:
>>  * NB: LWLockConflictsWithVar (which is called from
>>  * LWLockWaitForVar) relies on the spinlock used 
>> above in this
>>  * function and doesn't use a memory barrier.
>>
>> helps to understand why any of this is safe to a meaningful degree.
>>
>> The existing comments obviously aren't sufficient to explain this, but
>> the reason it's, I think, safe today, is that we are only waiting for
>> insertions that started before WaitXLogInsertionsToFinish() was called.  The
>> lack of memory barriers in the loop means that we might see locks as "unused"
>> that have *since* become used, which is fine, because they only can be for
>> later insertions that we wouldn't want to wait on anyway.
> 
> Right.

FWIW, I always have a hard time coming back to this code and seeing it
rely on undocumented assumptions in lwlock.c while we need
to keep an eye on xlog.c (we take a spinlock there while the variable
wait logic relies on it for ordering @-@).  So the proposal of getting
more documentation in place via 0002 goes in the right direction.
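
To keep the xlog.c side of that dependency visible on this thread, here is
the relevant part of WaitXLogInsertionsToFinish(), heavily trimmed (see
xlog.c for the authoritative version):

	/* the spinlock here is the one the lwlock.c comments refer to */
	SpinLockAcquire(&Insert->insertpos_lck);
	bytepos = Insert->CurrBytePos;
	SpinLockRelease(&Insert->insertpos_lck);
	reservedUpto = XLogBytePosToEndRecPtr(bytepos);

	finishedUpto = reservedUpto;
	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
	{
		XLogRecPtr	insertingat = InvalidXLogRecPtr;

		do
		{
			/*
			 * LWLockWaitForVar() sleeps until insertingAt differs from the
			 * value we last saw, or the lock is free (no insertion in
			 * progress); the loop keeps going until it passes 'upto'.
			 */
			if (!LWLockWaitForVar(&WALInsertLocks[i].l.lock,
								  &WALInsertLocks[i].l.insertingAt,
								  insertingat, &insertingat))
			{
				insertingat = InvalidXLogRecPtr;
				break;
			}
		} while (insertingat < upto);

		if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto)
			finishedUpto = insertingat;
	}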

>> This comment seems off to me. Using atomics doesn't guarantee not needing
>> locking. It just guarantees that we are reading a non-torn value.
> 
> Modified the comment.

-   /*
-* Read value using the lwlock's wait list lock, as we can't generally
-* rely on atomic 64 bit reads/stores.  TODO: On platforms with a way to
-* do atomic 64 bit reads/writes the spinlock should be optimized away.
-*/
-   LWLockWaitListLock(lock);
-   value = *valptr;
-   LWLockWaitListUnlock(lock);
+   /* Reading atomically avoids getting a torn value */
+   value = pg_atomic_read_u64(valptr);

Should this specify that this is specifically important for platforms
where reading a uint64 could lead to a torn value read, if you apply
this term in this context?  Sounding picky, I would make that a bit
longer, say something like that:
"Reading this value atomically is safe even on platforms where uint64
cannot be read without observing a torn value."

Only xlogprefetcher.c uses the term "torn" for a value by the way, but
for a write.

>>> @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock,
>>>   *
>>>   * Note: this function ignores shared lock holders; if the lock is held
>>>   * in shared mode, returns 'true'.
>>> + *
>>> + * Be careful that LWLockConflictsWithVar() does not include a memory 
>>> barrier,
>>> + * hence the caller of this function may want to rely on an explicit 
>>> barrier or
>>> + * a spinlock to avoid memory ordering issues.
>>>   */
>>
>> s/careful/aware/?
>>
>> s/spinlock/implied barrier via spinlock or lwlock/?
> 
> Done.

Okay to mention a LWLock here, even if the sole caller of this routine
relies on a spinlock.

0001 looks OK-ish seen from here.  Thoughts?
--
Michael




Re: WAL Insertion Lock Improvements

2023-07-20 Thread Bharath Rupireddy
On Fri, Jul 14, 2023 at 4:17 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-13 14:04:31 -0700, Andres Freund wrote:
> > From b74b6e953cb5a7e7ea1a89719893f6ce9e231bba Mon Sep 17 00:00:00 2001
> > From: Bharath Rupireddy 
> > Date: Fri, 19 May 2023 15:00:21 +
> > Subject: [PATCH v8] Optimize WAL insertion lock acquisition and release
> >
> > This commit optimizes WAL insertion lock acquisition and release
> > in the following way:
>
> I think this commit does too many things at once.

I've split the patch into three - 1) Make insertingAt 64-bit atomic.
2) Have better commenting on why there's no memory barrier or spinlock
in and around LWLockWaitForVar call sites. 3) Have a quick exit for
LWLockUpdateVar.

> > 1. WAL insertion lock's variable insertingAt is currently read and
> > written with the help of lwlock's wait list lock to avoid
> > torn reads/writes. This wait list lock can become a point of
> > contention on highly concurrent write workloads. Therefore, make
> > insertingAt a 64-bit atomic which inherently provides torn-free
> > reads/writes.
>
> "inherently" is a bit strong, given that we emulate 64bit atomics where not
> available...

Modified.

> > 2. LWLockUpdateVar currently acquires lwlock's wait list lock even when
> > there are no waiters at all. Add a fastpath exit to LWLockUpdateVar when
> > there are no waiters to avoid unnecessary locking.
>
> I don't think there's enough of an explanation for why this isn't racy.
>
> The reason it's, I think, safe, is that anyone using LWLockConflictsWithVar()
> will do so twice in a row, with a barrier inbetween. But that really relies on
> what I explain in the paragraph below:

The twice-in-a-row lock acquisition protocol used by LWLockWaitForVar
is what lets LWLockUpdateVar have a quick exit: LWLockWaitForVar
ensures that waiters are added to the wait queue even if
LWLockUpdateVar thinks that there aren't any. Is my understanding
correct here?
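
For reference, the queue-then-recheck dance in LWLockWaitForVar() that this
relies on looks roughly like the following (condensed; error paths,
instrumentation and the actual sleep elided):

	for (;;)
	{
		mustwait = LWLockConflictsWithVar(lock, valptr, oldval, newval, &result);
		if (!mustwait)
			break;				/* value changed from oldval, or lock is free */

		/* add ourselves to the wait queue (atomic RMW on lock->state) */
		LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE);

		/* make sure a release will wake us up */
		pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_RELEASE_OK);

		/*
		 * Recheck after queueing.  This is what would keep a no-waiters
		 * fast path in LWLockUpdateVar() safe: if the updater saw no
		 * waiters because we were not queued yet, the barriers above
		 * guarantee that we see its new value here and bail out instead
		 * of sleeping.
		 */
		mustwait = LWLockConflictsWithVar(lock, valptr, oldval, newval, &result);
		if (!mustwait)
		{
			LWLockDequeueSelf(lock);
			break;
		}

		/* ... sleep on our semaphore until woken up ... */
	}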

> > It also adds notes on why LWLockConflictsWithVar doesn't need a
> > memory barrier as far as its current usage is concerned.
>
> I don't think:
>  * NB: LWLockConflictsWithVar (which is called from
>  * LWLockWaitForVar) relies on the spinlock used 
> above in this
>  * function and doesn't use a memory barrier.
>
> helps to understand why any of this is safe to a meaningful degree.
>
> The existing comments obviously aren't sufficient to explain this, but
> the reason it's, I think, safe today, is that we are only waiting for
> insertions that started before WaitXLogInsertionsToFinish() was called.  The
> lack of memory barriers in the loop means that we might see locks as "unused"
> that have *since* become used, which is fine, because they only can be for
> later insertions that we wouldn't want to wait on anyway.

Right.

> Not taking a lock to acquire the current insertingAt value means that we might
> see older insertingAt value. Which should also be fine, because if we read a
> too old value, we'll add ourselves to the queue, which contains atomic
> operations.

Right. With an older value we add ourselves to the queue in LWLockWaitForVar,
and we should eventually be woken up by LWLockUpdateVar.

This matches with my understanding. I used more or less your above
wording in 0002 patch.

> >   /*
> > -  * Read value using the lwlock's wait list lock, as we can't generally
> > -  * rely on atomic 64 bit reads/stores.  TODO: On platforms with a way 
> > to
> > -  * do atomic 64 bit reads/writes the spinlock should be optimized 
> > away.
> > +  * Reading the value atomically ensures that we don't need any 
> > explicit
> > +  * locking. Note that in general, 64 bit atomic APIs in postgres 
> > inherently
> > +  * provide explicit locking for the platforms without atomics support.
> >*/
>
> This comment seems off to me. Using atomics doesn't guarantee not needing
> locking. It just guarantees that we are reading a non-torn value.

Modified the comment.

> > @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock,
> >   *
> >   * Note: this function ignores shared lock holders; if the lock is held
> >   * in shared mode, returns 'true'.
> > + *
> > + * Be careful that LWLockConflictsWithVar() does not include a memory 
> > barrier,
> > + * hence the caller of this function may want to rely on an explicit 
> > barrier or
> > + * a spinlock to avoid memory ordering issues.
> >   */
>
> s/careful/aware/?
>
> s/spinlock/implied barrier via spinlock or lwlock/?

Done.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v9-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch
Description: Binary data


v9-0002-Improve-commets-in-and-around-LWLockWaitForVar-ca.patch
Description: Binary data


v9-0003-Have-a-quick-exit-for-LWLockUpdateVar-when-there-.patch
Description: Binary data


Re: WAL Insertion Lock Improvements

2023-07-13 Thread Andres Freund
Hi,

On 2023-07-13 14:04:31 -0700, Andres Freund wrote:
> From b74b6e953cb5a7e7ea1a89719893f6ce9e231bba Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy 
> Date: Fri, 19 May 2023 15:00:21 +
> Subject: [PATCH v8] Optimize WAL insertion lock acquisition and release
>
> This commit optimizes WAL insertion lock acquisition and release
> in the following way:

I think this commit does too many things at once.


> 1. WAL insertion lock's variable insertingAt is currently read and
> written with the help of lwlock's wait list lock to avoid
> torn reads/writes. This wait list lock can become a point of
> contention on highly concurrent write workloads. Therefore, make
> insertingAt a 64-bit atomic which inherently provides torn-free
> reads/writes.

"inherently" is a bit strong, given that we emulate 64bit atomics where not
available...


> 2. LWLockUpdateVar currently acquires lwlock's wait list lock even when
> there are no waiters at all. Add a fastpath exit to LWLockUpdateVar when
> there are no waiters to avoid unnecessary locking.

I don't think there's enough of an explanation for why this isn't racy.

The reason it's, I think, safe, is that anyone using LWLockConflictsWithVar()
will do so twice in a row, with a barrier inbetween. But that really relies on
what I explain in the paragraph below:



> It also adds notes on why LWLockConflictsWithVar doesn't need a
> memory barrier as far as its current usage is concerned.

I don't think:
 * NB: LWLockConflictsWithVar (which is called from
 * LWLockWaitForVar) relies on the spinlock used above 
in this
 * function and doesn't use a memory barrier.

helps to understand why any of this is safe to a meaningful degree.


The existing comments obviously aren't sufficient to explain this, but
the reason it's, I think, safe today, is that we are only waiting for
insertions that started before WaitXLogInsertionsToFinish() was called.  The
lack of memory barriers in the loop means that we might see locks as "unused"
that have *since* become used, which is fine, because they only can be for
later insertions that we wouldn't want to wait on anyway.

Not taking a lock to acquire the current insertingAt value means that we might
see older insertingAt value. Which should also be fine, because if we read a
too old value, we'll add ourselves to the queue, which contains atomic
operations.



> diff --git a/src/backend/storage/lmgr/lwlock.c 
> b/src/backend/storage/lmgr/lwlock.c
> index 59347ab951..82266e6897 100644
> --- a/src/backend/storage/lmgr/lwlock.c
> +++ b/src/backend/storage/lmgr/lwlock.c
> @@ -1547,9 +1547,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
>   * *result is set to true if the lock was free, and false otherwise.
>   */
>  static bool
> -LWLockConflictsWithVar(LWLock *lock,
> -uint64 *valptr, uint64 oldval, 
> uint64 *newval,
> -bool *result)
> +LWLockConflictsWithVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 oldval,
> +uint64 *newval, bool *result)
>  {
>   boolmustwait;
>   uint64  value;
> @@ -1572,13 +1571,11 @@ LWLockConflictsWithVar(LWLock *lock,
>   *result = false;
>
>   /*
> -  * Read value using the lwlock's wait list lock, as we can't generally
> -  * rely on atomic 64 bit reads/stores.  TODO: On platforms with a way to
> -  * do atomic 64 bit reads/writes the spinlock should be optimized away.
> +  * Reading the value atomically ensures that we don't need any explicit
> +  * locking. Note that in general, 64 bit atomic APIs in postgres 
> inherently
> +  * provide explicit locking for the platforms without atomics support.
>*/

This comment seems off to me. Using atomics doesn't guarantee not needing
locking. It just guarantees that we are reading a non-torn value.
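
A concrete (entirely hypothetical) example of that distinction, for the
archives:

	/*
	 * On a 32-bit platform where a uint64 access is split into two 4-byte
	 * accesses (numbers invented for illustration):
	 *
	 *   value before the writer:  0x00000000FFFFFFFF
	 *   writer stores:            0x0000000100000000  (say, high word first)
	 *   reader interleaves:       new high word 0x00000001,
	 *                             then old low word 0xFFFFFFFF
	 *   reader observes:          0x00000001FFFFFFFF  -- a value nobody wrote
	 *
	 * pg_atomic_read_u64()/pg_atomic_write_u64() rule that out, but by
	 * themselves they impose no ordering against surrounding loads and
	 * stores; that is what the full-barrier operations (or the surrounding
	 * spinlock/LWLock operations) provide.
	 */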



> @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock,
>   *
>   * Note: this function ignores shared lock holders; if the lock is held
>   * in shared mode, returns 'true'.
> + *
> + * Be careful that LWLockConflictsWithVar() does not include a memory 
> barrier,
> + * hence the caller of this function may want to rely on an explicit barrier 
> or
> + * a spinlock to avoid memory ordering issues.
>   */

s/careful/aware/?

s/spinlock/implied barrier via spinlock or lwlock/?


Greetings,

Andres




Re: WAL Insertion Lock Improvements

2023-07-13 Thread Andres Freund
On 2023-07-11 09:20:45 +0900, Michael Paquier wrote:
> On Mon, Jun 05, 2023 at 08:00:00AM +0530, Bharath Rupireddy wrote:
> > On Wed, May 31, 2023 at 5:05 PM Michael Paquier  wrote:
> >> @Andres: Were there any extra tests you wanted to be run for more
> >> input?
> > 
> > @Andres Freund please let us know your thoughts.
> 
> Err, ping.  It seems like this thread is waiting on input from you,
> Andres?

Looking. Sorry for not getting to this earlier.

- Andres




Re: WAL Insertion Lock Improvements

2023-07-10 Thread Michael Paquier
On Mon, Jun 05, 2023 at 08:00:00AM +0530, Bharath Rupireddy wrote:
> On Wed, May 31, 2023 at 5:05 PM Michael Paquier  wrote:
>> @Andres: Were there any extra tests you wanted to be run for more
>> input?
> 
> @Andres Freund please let us know your thoughts.

Err, ping.  It seems like this thread is waiting on input from you,
Andres?
--
Michael




Re: WAL Insertion Lock Improvements

2023-06-04 Thread Bharath Rupireddy
On Wed, May 31, 2023 at 5:05 PM Michael Paquier  wrote:
>
> On Mon, May 22, 2023 at 09:26:25AM +0900, Michael Paquier wrote:
> > Simpler and consistent, nice.  I don't have much more to add, so I
> > have switched the patch as RfC.
>
> While at PGcon, Andres has asked me how many sockets are in the
> environment I used for the tests,

I'm glad to know that the feature was discussed at PGCon.

> and lscpu tells me the following,
> which is more than 1:
> CPU(s):  64
> On-line CPU(s) list: 0-63
> Core(s) per socket:  16
> Socket(s):   2
> NUMA node(s):2

Mine says this:

CPU(s):  96
  On-line CPU(s) list:   0-95
Core(s) per socket:  24
Socket(s):   2
NUMA:
  NUMA node(s):  2
  NUMA node0 CPU(s): 0-23,48-71
  NUMA node1 CPU(s): 24-47,72-95

> @Andres: Were there any extra tests you wanted to be run for more
> input?

@Andres Freund please let us know your thoughts.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-05-31 Thread Michael Paquier
On Mon, May 22, 2023 at 09:26:25AM +0900, Michael Paquier wrote:
> Simpler and consistent, nice.  I don't have much more to add, so I
> have switched the patch as RfC.

While at PGcon, Andres has asked me how many sockets are in the
environment I used for the tests, and lscpu tells me the following,
which is more than 1:
CPU(s):  64
On-line CPU(s) list: 0-63
Core(s) per socket:  16
Socket(s):   2
NUMA node(s):2

@Andres: Were there any extra tests you wanted to be run for more
input?
--
Michael




Re: WAL Insertion Lock Improvements

2023-05-21 Thread Michael Paquier
On Fri, May 19, 2023 at 08:34:16PM +0530, Bharath Rupireddy wrote:
> I get it. How about the following similar to what
> ProcessProcSignalBarrier() has?
> 
> + * Note that pg_atomic_exchange_u64 is a full barrier, so we're 
> guaranteed
> + * that the variable is updated before waking up waiters.
> + */
> 
> + * Note that pg_atomic_exchange_u64 is a full barrier, so we're 
> guaranteed
> + * that the variable is updated before releasing the lock.
>   */
> 
> Please find the attached v8 patch with the above change.

Simpler and consistent, nice.  I don't have much more to add, so I
have switched the patch as RfC.
--
Michael




Re: WAL Insertion Lock Improvements

2023-05-19 Thread Bharath Rupireddy
On Fri, May 19, 2023 at 12:24 PM Michael Paquier  wrote:
>
> On Thu, May 18, 2023 at 11:18:25AM +0530, Bharath Rupireddy wrote:
> > I think what I have so far seems more verbose explaining what a
> > barrier does and all that. I honestly think we don't need to be that
> > verbose, thanks to README.barrier.
>
> Agreed.  This file is a mine of information.
>
> > I simplified those 2 comments as the following:
> >
> >  * NB: pg_atomic_exchange_u64, having full barrier semantics will ensure
> >  * the variable is updated before releasing the lock.
> >
> >  * NB: pg_atomic_exchange_u64, having full barrier semantics will ensure
> >  * the variable is updated before waking up waiters.
> >
> > Please find the attached v7 patch.
>
> Nit.  These sentences seem to be worded a bit weirdly to me.  How
> about:
> "pg_atomic_exchange_u64 has full barrier semantics, ensuring that the
> variable is updated before (releasing the lock|waking up waiters)."

I get it. How about the following similar to what
ProcessProcSignalBarrier() has?

+ * Note that pg_atomic_exchange_u64 is a full barrier, so we're guaranteed
+ * that the variable is updated before waking up waiters.
+ */

+ * Note that pg_atomic_exchange_u64 is a full barrier, so we're guaranteed
+ * that the variable is updated before releasing the lock.
  */

Please find the attached v8 patch with the above change.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v8-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch
Description: Binary data


Re: WAL Insertion Lock Improvements

2023-05-19 Thread Michael Paquier
On Thu, May 18, 2023 at 11:18:25AM +0530, Bharath Rupireddy wrote:
> I think what I have so far seems more verbose explaining what a
> barrier does and all that. I honestly think we don't need to be that
> verbose, thanks to README.barrier.

Agreed.  This file is a mine of information.

> I simplified those 2 comments as the following:
> 
>  * NB: pg_atomic_exchange_u64, having full barrier semantics will ensure
>  * the variable is updated before releasing the lock.
> 
>  * NB: pg_atomic_exchange_u64, having full barrier semantics will ensure
>  * the variable is updated before waking up waiters.
> 
> Please find the attached v7 patch.

Nit.  These sentences seem to be worded a bit weirdly to me.  How
about:
"pg_atomic_exchange_u64 has full barrier semantics, ensuring that the
variable is updated before (releasing the lock|waking up waiters)."

+ * Be careful that LWLockConflictsWithVar() does not include a memory barrier,
+ * hence the caller of this function may want to rely on an explicit barrier or
+ * a spinlock to avoid memory ordering issues.

Thanks, this addition looks OK to me.
--
Michael




Re: WAL Insertion Lock Improvements

2023-05-17 Thread Bharath Rupireddy
On Wed, May 10, 2023 at 5:34 PM Michael Paquier  wrote:
>
> +* NB: LWLockConflictsWithVar (which is called from
> +* LWLockWaitForVar) relies on the spinlock used above in this
> +* function and doesn't use a memory barrier.
>
> This patch adds the following comment in WaitXLogInsertionsToFinish()
> because lwlock.c on HEAD mentions that:
> /*
>  * Test first to see if it the slot is free right now.
>  *
>  * XXX: the caller uses a spinlock before this, so we don't need a memory
>  * barrier here as far as the current usage is concerned.  But that might
>  * not be safe in general.
>  */
>
> Should this be something we'd better be noisy about at the top of
> LWLockWaitForVar()?  We don't want to add a memory barrier at the
> beginning of LWLockConflictsWithVar(), still it strikes me that
> somebody that aims at using LWLockWaitForVar() may miss this point
> because LWLockWaitForVar() is the routine published in lwlock.h, not
> LWLockConflictsWithVar().  This does not need to be really
> complicated, say a note at the top of LWLockWaitForVar() along the
> lines of (?):
> "Be careful that LWLockConflictsWithVar() does not include a memory
> barrier, hence the caller of this function may want to rely on an
> explicit barrier or a spinlock to avoid memory ordering issues."

+1. Now, we have comments in 3 places to warn about the
LWLockConflictsWithVar not using memory barrier - one in
WaitXLogInsertionsToFinish, one in LWLockWaitForVar and another one
(existing) in LWLockConflictsWithVar specifying where exactly a memory
barrier is needed if the caller doesn't use a spinlock. Looks fine to
me.

> +* NB: pg_atomic_exchange_u64 is used here as opposed to just
> +* pg_atomic_write_u64 to update the variable. Since 
> pg_atomic_exchange_u64
> +* is a full barrier, we're guaranteed that all loads and stores issued
> +* prior to setting the variable are completed before any loads or stores
> +* issued after setting the variable.
>
> This is the same explanation as LWLockUpdateVar(), except that we
> lose the details explaining why we are doing the update before
> releasing the lock.

I think what I have so far seems more verbose explaining what a
barrier does and all that. I honestly think we don't need to be that
verbose, thanks to README.barrier.

I simplified those 2 comments as the following:

 * NB: pg_atomic_exchange_u64, having full barrier semantics will ensure
 * the variable is updated before releasing the lock.

 * NB: pg_atomic_exchange_u64, having full barrier semantics will ensure
 * the variable is updated before waking up waiters.

Please find the attached v7 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 8e4eeccadc106381bc2898c1887109f96c3db869 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 18 May 2023 05:17:05 +
Subject: [PATCH v7] Optimize WAL insertion lock acquisition and release

This commit optimizes WAL insertion lock acquisition and release
in the following way:

1. WAL insertion lock's variable insertingAt is currently read and
written with the help of lwlock's wait list lock to avoid
torn reads/writes. This wait list lock can become a point of
contention on highly concurrent write workloads. Therefore, make
insertingAt a 64-bit atomic which inherently provides torn-free
reads/writes.

2. LWLockUpdateVar currently acquires lwlock's wait list lock even
when there are no waiters at all. Add a fastpath exit to
LWLockUpdateVar when there are no waiters to avoid unnecessary
locking.

Note that atomic exchange operation (which is a full barrier) is
used when necessary, instead of atomic write to ensure the memory
ordering is preserved.

It also adds notes on why LWLockConflictsWithVar doesn't need a
memory barrier as far as its current usage is concerned.

Suggested-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart
Reviewed-by: Michael Paquier
Discussion: https://www.postgresql.org/message-id/20221124184619.xit4sfi52bcz2tva%40awork3.anarazel.de
---
 src/backend/access/transam/xlog.c |  8 +++-
 src/backend/storage/lmgr/lwlock.c | 64 +++
 src/include/storage/lwlock.h  |  6 +--
 3 files changed, 48 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bc5a8e0569..92b0b87d1e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -376,7 +376,7 @@ typedef struct XLogwrtResult
 typedef struct
 {
 	LWLock		lock;
-	XLogRecPtr	insertingAt;
+	pg_atomic_uint64	insertingAt;
 	XLogRecPtr	lastImportantAt;
 } WALInsertLock;
 
@@ -1495,6 +1495,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 			 * calling LWLockUpdateVar.  But if it has to sleep, it will
 			 * advertise the insertion point with LWLockUpdateVar before
 			 * sleeping.
+			 *
+			 * NB: 

Re: WAL Insertion Lock Improvements

2023-05-12 Thread Michael Paquier
On Fri, May 12, 2023 at 07:35:20AM +0530, Bharath Rupireddy wrote:
> --enable-atomics=no, -T60:
> --enable-spinlocks=no, -T60:
> --enable-atomics=no --enable-spinlocks=no, -T60:

Thanks for these extra tests.  I have not done these specific cases, but
the profiles look similar to what I've seen myself.  If I recall
correctly, the fallback implementation of atomics just uses spinlocks
internally to force the required barriers.
--
Michael




Re: WAL Insertion Lock Improvements

2023-05-11 Thread Bharath Rupireddy
On Thu, May 11, 2023 at 11:56 AM Michael Paquier  wrote:
>
> On Wed, May 10, 2023 at 09:04:47PM +0900, Michael Paquier wrote:
> > It took me some time, but I have been able to deploy a big box to see
> > the effect of this patch at a rather large scale (64 vCPU, 512G of
> > memory), with the following test characteristics for HEAD and v6:
> > - TPS comparison with pgbench and pg_logical_emit_message().
> > - Record sizes of 16, 64, 256, 1k, 4k and 16k.
> > - Clients and jobs equal at 4, 16, 64, 256, 512, 1024, 2048, 4096.
> > - Runs of 3 mins for each of the 48 combinations, meaning 96 runs in
> > total.
> >
> > And here are the results I got:
> > message_size_b| 16 | 64 |256 |   1024 |  4096 |   16k
> > ------------------|--------|--------|--------|--------|-------|------
> > head_4_clients|   3026 |   2965 |   2846 |   2880 |  2778 |  2412
> > head_16_clients   |  12087 |  11287 |  11670 |  11100 |  9003 |  5608
> > head_64_clients   |  42995 |  44005 |  43592 |  35437 | 21533 | 11273
> > head_256_clients  | 106775 | 109233 | 104201 |  80759 | 42118 | 16254
> > head_512_clients  | 153849 | 156950 | 142915 |  99288 | 57714 | 16198
> > head_1024_clients | 122102 | 123895 | 114248 | 117317 | 62270 | 16261
> > head_2048_clients | 126730 | 115594 | 109671 | 119564 | 62454 | 16298
> > head_4096_clients | 111564 | 111697 | 119164 | 123483 | 62430 | 16140
> > v6_4_clients  |   2893 |   2917 |   3087 |   2904 |  2786 |  2262
> > v6_16_clients |  12097 |  11387 |  11579 |  11242 |  9228 |  5661
> > v6_64_clients |  45124 |  46533 |  42275 |  36124 | 21696 | 11386
> > v6_256_clients| 121500 | 125732 | 104328 |  78989 | 41949 | 16254
> > v6_512_clients| 164120 | 174743 | 146677 |  98110 | 60228 | 16171
> > v6_1024_clients   | 168990 | 180710 | 149894 | 117431 | 62271 | 16259
> > v6_2048_clients   | 165426 | 162893 | 146322 | 132476 | 62468 | 16274
> > v6_4096_clients   | 161283 | 158732 | 162474 | 135636 | 62461 | 16030
>
> Another thing I was wondering is if it would be able to see a
> difference by reducing the I/O pressure.  After mounting pg_wal to a
> tmpfs, I am getting the following table:
>  message_size_b| 16 | 64 |256 |   1024 |   4096 | 16000
> ---------------+--------+--------+--------+--------+--------+-------
>  head_4_clients|  86476 |  86592 |  84645 |  76784 |  57887 | 30199
>  head_16_clients   | 277006 | 278431 | 263238 | 228614 | 143880 | 67237
>  head_64_clients   | 373972 | 370082 | 352217 | 297377 | 190974 | 96843
>  head_256_clients  | 144510 | 147077 | 146281 | 189059 | 156294 | 88345
>  head_512_clients  | 122863 | 119054 | 127790 | 162187 | 142771 | 84109
>  head_1024_clients | 140802 | 138728 | 147200 | 172449 | 138022 | 81054
>  head_2048_clients | 175950 | 164143 | 154070 | 161432 | 128205 | 76732
>  head_4096_clients | 161438 | 158666 | 152057 | 139520 | 113955 | 69335
>  v6_4_clients  |  87356 |  86985 |  83933 |  76397 |  57352 | 30084
>  v6_16_clients | 277466 | 280125 | 259733 | 224916 | 143832 | 66589
>  v6_64_clients | 388352 | 386188 | 362358 | 302719 | 190353 | 96687
>  v6_256_clients| 365797 | 360114 | 337135 | 266851 | 172252 | 88898
>  v6_512_clients| 339751 | 332384 | 308182 | 249624 | 158868 | 84258
>  v6_1024_clients   | 301294 | 295140 | 276769 | 226034 | 148392 | 80909
>  v6_2048_clients   | 268846 | 261001 | 247110 | 205332 | 137271 | 76299
>  v6_4096_clients   | 229322 | 227049 | 217271 | 183708 | 124888 | 69263
>
> This shows more difference from 64 clients up to 4k records, without
> degradation noticed across the board.

Impressive. I further covered the following test cases. There's a
clear gain with the patch, i.e. reducing the burden on the LWLock's wait list
lock is helping out.

fsync=off, -T120:
 message_size_b   |   16   |   64   |  256   |  1024  |  4096  | 16384
-------------------+--------+--------+--------+--------+--------+--------
 head_1_clients|  33609 |  33862 |  32975 |  29722 |  21842 |  10606
 head_2_clients|  60583 |  60524 |  57833 |  53582 |  38583 |  20120
 head_4_clients| 115209 | 114012 | 114077 | 102991 |  73452 |  39179
 head_8_clients| 181786 | 177592 | 174404 | 155350 |  98642 |  41406
 head_16_clients   | 313750 | 309024 | 295375 | 253101 | 159328 |  73617
 head_32_clients   | 406456 | 416809 | 400527 | 344573 | 213756 |  96322
 head_64_clients   | 199619 | 197948 | 198871 | 208606 | 221751 | 107762
 head_128_clients  | 108576 | 108727 | 107606 | 112137 | 173998 | 106976
 head_256_clients  |  75303 |  74983 |  73986 |  76100 | 148209 |  98080
 head_512_clients  |  62559 |  60189 |  59588 |  61102 | 131803 |  90534
 head_768_clients  |  55650 |  54486 |  54813 |  55515 | 120707 |  88009
 head_1024_clients |  54709 |  52395 |  51672 |  52910 | 113904 |  86116
 head_2048_clients |  48640 |  47098 |  46787 |  47582 |  98394 |  80766
 head_4096_clients |  43205 |  42709 |  42591 |  43649 |  88903 |  72362
 v6_1_clients|  7 |  32877 |  31880 | 

Re: WAL Insertion Lock Improvements

2023-05-11 Thread Michael Paquier
On Wed, May 10, 2023 at 09:04:47PM +0900, Michael Paquier wrote:
> It took me some time, but I have been able to deploy a big box to see
> the effect of this patch at a rather large scale (64 vCPU, 512G of
> memory), with the following test characteristics for HEAD and v6:
> - TPS comparison with pgbench and pg_logical_emit_message().
> - Record sizes of 16, 64, 256, 1k, 4k and 16k.
> - Clients and jobs equal at 4, 16, 64, 256, 512, 1024, 2048, 4096.
> - Runs of 3 mins for each of the 48 combinations, meaning 96 runs in
> total.
> 
> And here are the results I got:
> message_size_b| 16 | 64 |256 |   1024 |  4096 |   16k
> ------------------|--------|--------|--------|--------|-------|------
> head_4_clients|   3026 |   2965 |   2846 |   2880 |  2778 |  2412
> head_16_clients   |  12087 |  11287 |  11670 |  11100 |  9003 |  5608
> head_64_clients   |  42995 |  44005 |  43592 |  35437 | 21533 | 11273
> head_256_clients  | 106775 | 109233 | 104201 |  80759 | 42118 | 16254
> head_512_clients  | 153849 | 156950 | 142915 |  99288 | 57714 | 16198
> head_1024_clients | 122102 | 123895 | 114248 | 117317 | 62270 | 16261
> head_2048_clients | 126730 | 115594 | 109671 | 119564 | 62454 | 16298
> head_4096_clients | 111564 | 111697 | 119164 | 123483 | 62430 | 16140
> v6_4_clients  |   2893 |   2917 |   3087 |   2904 |  2786 |  2262
> v6_16_clients |  12097 |  11387 |  11579 |  11242 |  9228 |  5661
> v6_64_clients |  45124 |  46533 |  42275 |  36124 | 21696 | 11386
> v6_256_clients| 121500 | 125732 | 104328 |  78989 | 41949 | 16254
> v6_512_clients| 164120 | 174743 | 146677 |  98110 | 60228 | 16171
> v6_1024_clients   | 168990 | 180710 | 149894 | 117431 | 62271 | 16259
> v6_2048_clients   | 165426 | 162893 | 146322 | 132476 | 62468 | 16274
> v6_4096_clients   | 161283 | 158732 | 162474 | 135636 | 62461 | 16030

Another thing I was wondering is if it would be able to see a
difference by reducing the I/O pressure.  After mounting pg_wal to a
tmpfs, I am getting the following table:
 message_size_b| 16 | 64 |256 |   1024 |   4096 | 16000
---------------+--------+--------+--------+--------+--------+-------
 head_4_clients|  86476 |  86592 |  84645 |  76784 |  57887 | 30199
 head_16_clients   | 277006 | 278431 | 263238 | 228614 | 143880 | 67237
 head_64_clients   | 373972 | 370082 | 352217 | 297377 | 190974 | 96843
 head_256_clients  | 144510 | 147077 | 146281 | 189059 | 156294 | 88345
 head_512_clients  | 122863 | 119054 | 127790 | 162187 | 142771 | 84109
 head_1024_clients | 140802 | 138728 | 147200 | 172449 | 138022 | 81054
 head_2048_clients | 175950 | 164143 | 154070 | 161432 | 128205 | 76732
 head_4096_clients | 161438 | 158666 | 152057 | 139520 | 113955 | 69335
 v6_4_clients  |  87356 |  86985 |  83933 |  76397 |  57352 | 30084
 v6_16_clients | 277466 | 280125 | 259733 | 224916 | 143832 | 66589
 v6_64_clients | 388352 | 386188 | 362358 | 302719 | 190353 | 96687
 v6_256_clients| 365797 | 360114 | 337135 | 266851 | 172252 | 88898
 v6_512_clients| 339751 | 332384 | 308182 | 249624 | 158868 | 84258
 v6_1024_clients   | 301294 | 295140 | 276769 | 226034 | 148392 | 80909
 v6_2048_clients   | 268846 | 261001 | 247110 | 205332 | 137271 | 76299
 v6_4096_clients   | 229322 | 227049 | 217271 | 183708 | 124888 | 69263

This shows more difference from 64 clients up to 4k records, without
degradation noticed across the board.
--
Michael




Re: WAL Insertion Lock Improvements

2023-05-10 Thread Michael Paquier
On Wed, May 10, 2023 at 10:40:20PM +0530, Bharath Rupireddy wrote:
> test-case 2: -T900, WAL ~256 bytes - ran for about 3.5 hours and a
> more than 3X improvement in TPS is seen - 3.11X @ 512, 3.79X @ 768, 3.47X
> @ 1024, 2.27X @ 2048, 2.77X @ 4096
>
> [...]
>
> test-case 2: -t100, WAL ~256 bytes - ran for more than 12 hours
> and the maximum improvement is 1.84X @ 1024 clients.

Thanks.  So that's pretty close to what I was seeing for this message
size, where the effect is much larger with at least ~512
clients.  All of these tests have been using fsync =
on, I assume.  I think that disabling fsync or just mounting pg_wal on
a tmpfs should show the same pattern for larger record sizes (after 1k
of message size the curve begins to go down with ~512 clients).
--
Michael




Re: WAL Insertion Lock Improvements

2023-05-10 Thread Bharath Rupireddy
On Tue, May 9, 2023 at 9:24 AM Bharath Rupireddy
 wrote:
>
> On Tue, May 9, 2023 at 9:02 AM Michael Paquier  wrote:
> >
> > On Mon, May 08, 2023 at 04:04:10PM -0700, Nathan Bossart wrote:
> > > On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
> > >> test-case 1: -T5, WAL ~16 bytes
> > >> test-case 1: -t1000, WAL ~16 bytes
> > >
> > > I wonder if it's worth doing a couple of long-running tests, too.
> >
> > Yes, 5s or 1000 transactions per client is too small, though it shows
> > that things are going in the right direction.
>
> I'll pick a test case that generates a reasonable amount of WAL 256
> bytes. What do you think of the following?
>
> test-case 2: -T900, WAL ~256 bytes (for c in 1 2 4 8 16 32 64 128 256
> 512 768 1024 2048 4096 - takes 3.5hrs)
> test-case 2: -t100, WAL ~256 bytes
>
> If okay, I'll fire the tests.

test-case 2: -T900, WAL ~256 bytes - ran for about 3.5 hours and a
more than 3X improvement in TPS is seen - 3.11X @ 512, 3.79X @ 768, 3.47X
@ 1024, 2.27X @ 2048, 2.77X @ 4096

test-case 2: -T900, WAL ~256 bytes
 clients |   HEAD | PATCHED
---------+--------+--------
       1 |   1394 |    1351
       2 |   1551 |    1445
       4 |   3104 |    2881
       8 |   5974 |    5774
      16 |  12154 |   11319
      32 |  22438 |   21606
      64 |  43689 |   40567
     128 |  80726 |   77993
     256 | 139987 |  141638
     512 |  60108 |  187126
     768 |  51188 |  194406
    1024 |  48766 |  169353
    2048 |  46617 |  105961
    4096 |  44163 |  122697

test-case 2: -t100, WAL ~256 bytes - ran for more than 12 hours
and the maximum improvement is 1.84X @ 1024 clients.

test-case 2: -t100, WAL ~256 bytes
 clients |   HEAD | PATCHED
---------+--------+--------
       1 |   1454 |    1500
       2 |   1657 |    1612
       4 |   3223 |    3224
       8 |   6305 |    6295
      16 |  12447 |   12260
      32 |  24855 |   24335
      64 |  45229 |   44386
     128 |  80752 |   79518
     256 | 120663 |  119083
     512 | 149546 |  159396
     768 | 118298 |  181732
    1024 | 101829 |  187492
    2048 | 107506 |  191378
    4096 | 125130 |  186728

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Re: WAL Insertion Lock Improvements

2023-05-10 Thread Michael Paquier
On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
> Note that I've used pg_logical_emit_message() for ease of
> understanding about the txns generating various amounts of WAL, but
> the pattern is the same if txns are generating various amounts of WAL
> say with inserts.

Sounds good to me to just rely on that for some comparison numbers.

+* NB: LWLockConflictsWithVar (which is called from
+* LWLockWaitForVar) relies on the spinlock used above in this
+* function and doesn't use a memory barrier.

This patch adds the following comment in WaitXLogInsertionsToFinish()
because lwlock.c on HEAD mentions that:
/*
 * Test first to see if it the slot is free right now.
 *
 * XXX: the caller uses a spinlock before this, so we don't need a memory
 * barrier here as far as the current usage is concerned.  But that might
 * not be safe in general.
 */

Should this be something we'd better be noisy about at the top of
LWLockWaitForVar()?  We don't want to add a memory barrier at the
beginning of LWLockConflictsWithVar(), still it strikes me that
somebody that aims at using LWLockWaitForVar() may miss this point
because LWLockWaitForVar() is the routine published in lwlock.h, not
LWLockConflictsWithVar().  This does not need to be really
complicated, say a note at the top of LWLockWaitForVar() along the
lines of (?):
"Be careful that LWLockConflictsWithVar() does not include a memory
barrier, hence the caller of this function may want to rely on an
explicit barrier or a spinlock to avoid memory ordering issues."

>> +* NB: Note the use of pg_atomic_exchange_u64 as opposed to just
>> +* pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 
>> is
>> +* a full barrier, we're guaranteed that the subsequent shared memory
>> +* reads/writes, if any, happen after we reset the lock variable.
>>
>> This mentions that the subsequent read/write operations are safe, so
>> this refers to anything happening after the variable is reset.  As
>> a full barrier, should be also mention that this is also ordered with
>> respect to anything that the caller did before clearing the variable?
>> From this perspective using pg_atomic_exchange_u64() makes sense to me
>> in LWLockReleaseClearVar().
>
> Wordsmithed that comment a bit.

-* Set the variable's value before releasing the lock, that prevents race
-* a race condition wherein a new locker acquires the lock, but hasn't yet
-* set the variables value.
[...]
+* NB: pg_atomic_exchange_u64 is used here as opposed to just
+* pg_atomic_write_u64 to update the variable. Since pg_atomic_exchange_u64
+* is a full barrier, we're guaranteed that all loads and stores issued
+* prior to setting the variable are completed before any loads or stores
+* issued after setting the variable.

This is the same explanation as LWLockUpdateVar(), except that we
lose the details explaining why we are doing the update before
releasing the lock.

It took me some time, but I have been able to deploy a big box to see
the effect of this patch at a rather large scale (64 vCPU, 512G of
memory), with the following test characteristics for HEAD and v6:
- TPS comparison with pgbench and pg_logical_emit_message().
- Record sizes of 16, 64, 256, 1k, 4k and 16k.
- Clients and jobs equal at 4, 16, 64, 256, 512, 1024, 2048, 4096.
- Runs of 3 mins for each of the 48 combinations, meaning 96 runs in
total.

And here are the results I got:
message_size_b| 16 | 64 |256 |   1024 |  4096 |   16k
------------------|--------|--------|--------|--------|-------|------
head_4_clients|   3026 |   2965 |   2846 |   2880 |  2778 |  2412
head_16_clients   |  12087 |  11287 |  11670 |  11100 |  9003 |  5608
head_64_clients   |  42995 |  44005 |  43592 |  35437 | 21533 | 11273
head_256_clients  | 106775 | 109233 | 104201 |  80759 | 42118 | 16254
head_512_clients  | 153849 | 156950 | 142915 |  99288 | 57714 | 16198
head_1024_clients | 122102 | 123895 | 114248 | 117317 | 62270 | 16261
head_2048_clients | 126730 | 115594 | 109671 | 119564 | 62454 | 16298
head_4096_clients | 111564 | 111697 | 119164 | 123483 | 62430 | 16140
v6_4_clients  |   2893 |   2917 |   3087 |   2904 |  2786 |  2262
v6_16_clients |  12097 |  11387 |  11579 |  11242 |  9228 |  5661
v6_64_clients |  45124 |  46533 |  42275 |  36124 | 21696 | 11386
v6_256_clients| 121500 | 125732 | 104328 |  78989 | 41949 | 16254
v6_512_clients| 164120 | 174743 | 146677 |  98110 | 60228 | 16171
v6_1024_clients   | 168990 | 180710 | 149894 | 117431 | 62271 | 16259
v6_2048_clients   | 165426 | 162893 | 146322 | 132476 | 62468 | 16274
v6_4096_clients   | 161283 | 158732 | 162474 | 135636 | 62461 | 16030

These tests are not showing me any degradation, and a correlation
between the record size and the number of clients where the TPS begins
to show a difference between HEAD and v6 of the patch.  In short the
shorter the record, the 

Re: WAL Insertion Lock Improvements

2023-05-09 Thread Michael Paquier
On Tue, May 09, 2023 at 02:10:20PM +0900, Michael Paquier wrote:
> Should we split this patch into two parts, as they aim at tackling two
> different cases then?  One for LWLockConflictsWithVar() and
> LWLockReleaseClearVar() which are the straight-forward pieces
> (using one pg_atomic_write_u64() in LWLockUpdateVar instead), then
> a second for LWLockUpdateVar()?

I have been studying that a bit more, and I'd like to take this
suggestion back.  Apologies for the noise.
--
Michael


signature.asc
Description: PGP signature


Re: WAL Insertion Lock Improvements

2023-05-08 Thread Michael Paquier
On Mon, May 08, 2023 at 08:18:04PM +0530, Bharath Rupireddy wrote:
> On Mon, May 8, 2023 at 5:57 PM Bharath Rupireddy
>  wrote:
>> On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier  wrote:
>>> The sensitive change is in LWLockUpdateVar().  I am not completely
>>> sure to understand this removal, though.  Does that influence the case
>>> where there are waiters?
>>
>> I'll send about this in a follow-up email to not overload this
>> response with too much data.
> 
> It helps the case when there are no waiters. IOW, it updates the value
> without the waitlist lock when there are no waiters, so there is no extra
> waitlist lock acquisition/release just to update the value. In turn, it
> helps the other backend wanting to flush the WAL and looking for the new
> updated value of insertingAt in WaitXLogInsertionsToFinish(); the flushing
> backend can now get the new value faster.

Sure, which is what the memory barrier given by exchange_u64
guarantees.  My thought on this one is that I am not completely sure
that we won't miss any waiters that should have been awakened.

> The fastpath exit in LWLockUpdateVar() doesn't seem to influence the
> results much, see below results. However, it avoids waitlist lock
> acquisition when there are no waiters.
> 
> test-case 1: -T5, WAL ~16 bytes
> clients | HEAD  | PATCHED with fastpath | PATCHED no fast path
> 64      | 50135 | 45528                 | 46653
> 128     | 94903 | 89791                 | 89103
> 256     | 82289 | 152915                | 152835
> 512     | 62498 | 138838                | 142084
> 768     | 57083 | 125074                | 126768
> 1024    | 51308 | 113593                | 115930
> 2048    | 41084 | 88764                 | 85110
> 4096    | 19939 | 42257                 | 43917

Considering that there could be a few percent of noise mixed into
that, that's not really surprising, as the workload is highly
concurrent on inserts, so the fast path won't really shine :)

Should we split this patch into two parts, as they aim at tackling two
different cases then?  One for LWLockConflictsWithVar() and
LWLockReleaseClearVar() which are the straight-forward pieces
(using one pg_atomic_write_u64() in LWLockUpdateVar instead), then
a second for LWLockUpdateVar()?

Also, the fast path treatment in LWLockUpdateVar() may show some
better benefits when there are really few backends and a bunch of very
small records?  Still, even that sounds a bit limited...

> Out of the 3 functions that got rid of the waitlist lock
> (LWLockConflictsWithVar/LWLockWaitForVar, LWLockUpdateVar,
> LWLockReleaseClearVar), perf reports tell that the biggest gain (for
> the use-cases that I've tried) is for
> LWLockConflictsWithVar/LWLockWaitForVar:
>
> test-case 6: -T5, WAL 4096 bytes
> HEAD:
> +   29.66% 0.07%  postgres  [.] LWLockWaitForVar
> +   20.94% 0.08%  postgres  [.] LWLockConflictsWithVar
>  0.19% 0.03%  postgres  [.] LWLockUpdateVar
> 
> PATCHED:
> +3.95% 0.08%  postgres  [.] LWLockWaitForVar
>  0.19% 0.03%  postgres  [.] LWLockConflictsWithVar
> +1.73% 0.00%  postgres  [.] LWLockReleaseClearVar

Indeed.
--
Michael


signature.asc
Description: PGP signature


Re: WAL Insertion Lock Improvements

2023-05-08 Thread Michael Paquier
On Tue, May 09, 2023 at 09:34:56AM +0530, Bharath Rupireddy wrote:
> Below is the configuration I've been using. I have been keeping the
> checkpoints away so far to get expected numbers. Probably, something
> that I should modify for this long run? Change checkpoint_timeout to
> 15 min or so?
> 
> max_wal_size=64GB
> checkpoint_timeout=1d
> shared_buffers=8GB
> max_connections=5000

Noted.  Something like that should be OK IMO, with all the checkpoints
generated based on the volume generated.  With records that have a
fixed size, this should, I assume, lead to results that could be
compared across runs, even if the patched code would lead to more
checkpoints generated.
--
Michael


signature.asc
Description: PGP signature


Re: WAL Insertion Lock Improvements

2023-05-08 Thread Bharath Rupireddy
On Tue, May 9, 2023 at 9:27 AM Michael Paquier  wrote:
>
> On Tue, May 09, 2023 at 09:24:14AM +0530, Bharath Rupireddy wrote:
> > I'll pick a test case that generates a reasonable amount of WAL 256
> > bytes. What do you think of the following?
> >
> > test-case 2: -T900, WAL ~256 bytes (for c in 1 2 4 8 16 32 64 128 256
> > 512 768 1024 2048 4096 - takes 3.5hrs)
> > test-case 2: -t100, WAL ~256 bytes
> >
> > If okay, I'll fire the tests.
>
> Sounds like a sensible duration, yes.  What's your setting for
> min/max_wal_size?  I assume that there are still 16GB throttled with
> target_completion at 0.9?

Below is the configuration I've been using. I have been keeping the
checkpoints away so far to get expected numbers. Probably, something
that I should modify for this long run? Change checkpoint_timeout to
15 min or so?

max_wal_size=64GB
checkpoint_timeout=1d
shared_buffers=8GB
max_connections=5000

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-05-08 Thread Michael Paquier
On Tue, May 09, 2023 at 09:24:14AM +0530, Bharath Rupireddy wrote:
> I'll pick a test case that generates a reasonable amount of WAL 256
> bytes. What do you think of the following?
> 
> test-case 2: -T900, WAL ~256 bytes (for c in 1 2 4 8 16 32 64 128 256
> 512 768 1024 2048 4096 - takes 3.5hrs)
> test-case 2: -t100, WAL ~256 bytes
> 
> If okay, I'll fire the tests.

Sounds like a sensible duration, yes.  What's your setting for
min/max_wal_size?  I assume that there are still 16GB throttled with
target_completion at 0.9?
--
Michael


signature.asc
Description: PGP signature


Re: WAL Insertion Lock Improvements

2023-05-08 Thread Bharath Rupireddy
On Tue, May 9, 2023 at 9:02 AM Michael Paquier  wrote:
>
> On Mon, May 08, 2023 at 04:04:10PM -0700, Nathan Bossart wrote:
> > On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
> >> test-case 1: -T5, WAL ~16 bytes
> >> test-case 1: -t1000, WAL ~16 bytes
> >
> > I wonder if it's worth doing a couple of long-running tests, too.
>
> Yes, 5s or 1000 transactions per client is too small, though it shows
> that things are going in the right direction.

I'll pick a test case that generates a reasonable amount of WAL 256
bytes. What do you think of the following?

test-case 2: -T900, WAL ~256 bytes (for c in 1 2 4 8 16 32 64 128 256
512 768 1024 2048 4096 - takes 3.5hrs)
test-case 2: -t100, WAL ~256 bytes

If okay, I'll fire the tests.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-05-08 Thread Michael Paquier
On Mon, May 08, 2023 at 04:04:10PM -0700, Nathan Bossart wrote:
> On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
>> test-case 1: -T5, WAL ~16 bytes
>> test-case 1: -t1000, WAL ~16 bytes
> 
> I wonder if it's worth doing a couple of long-running tests, too.

Yes, 5s or 1000 transactions per client is too small, though it shows
that things are going in the right direction. 

(Will reply to the rest in a bit..)
--
Michael


signature.asc
Description: PGP signature


Re: WAL Insertion Lock Improvements

2023-05-08 Thread Nathan Bossart
On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
> I ran performance tests on the patch with different use-cases. Clearly
> the patch reduces burden on LWLock's waitlist lock (evident from perf
> reports [1]). However, to see visible impact in the output, the txns
> must be generating small (between 16 bytes to 2 KB) amounts of WAL in
> a highly concurrent manner, check the results below (FWIW, I've zipped
> and attached perf images for better illustration along with test
> setup).
> 
> When the txns are generating a small amount of WAL i.e. between 16
> bytes to 2 KB in a highly concurrent manner, the benefit is clearly
> visible in the TPS more than 2.3X improvement. When the txns are
> generating more WAL i.e. more than 2 KB, the gain from reduced burden
> on waitlist lock is offset by increase in the wait/release for WAL
> insertion locks and no visible benefit is seen.
> 
> As the amount of WAL each txn generates increases, it looks like the
> benefit gained from reduced burden on waitlist lock is offset by
> increase in the wait for WAL insertion locks.

Nice.

> test-case 1: -T5, WAL ~16 bytes
> test-case 1: -t1000, WAL ~16 bytes

I wonder if it's worth doing a couple of long-running tests, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-05-08 Thread Bharath Rupireddy
On Mon, May 8, 2023 at 5:57 PM Bharath Rupireddy
 wrote:
>
> On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier  wrote:
>
> > -LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
> > +LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
> > [...]
> > > Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
> >
> > -   /* Update the lock's value */
> > -   *valptr = val;
> >
> > The sensitive change is in LWLockUpdateVar().  I am not completely
> > sure to understand this removal, though.  Does that influence the case
> > where there are waiters?
>
> I'll send about this in a follow-up email to not overload this
> response with too much data.

It helps the case when there are no waiters. IOW, it updates the value
without the waitlist lock when there are no waiters, so there is no extra
waitlist lock acquisition/release just to update the value. In turn, it
helps the other backend wanting to flush the WAL and looking for the new
updated value of insertingAt in WaitXLogInsertionsToFinish(); the flushing
backend can now get the new value faster.
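
To make that concrete, the flushing side looks roughly like this (a
simplified paraphrase of WaitXLogInsertionsToFinish(), not the exact
tree code; upto, reservedUpto and finishedUpto are the variables of
that function):

    /* simplified paraphrase, not the exact xlog.c code */
    XLogRecPtr  finishedUpto = reservedUpto;
    int         i;

    for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
    {
        XLogRecPtr  insertingat = InvalidXLogRecPtr;

        do
        {
            /* Re-read insertingAt until the inserter is done or has
             * advertised a position past the one we are waiting for. */
            if (LWLockWaitForVar(&WALInsertLocks[i].l.lock,
                                 &WALInsertLocks[i].l.insertingAt,
                                 insertingat, &insertingat))
            {
                /* the lock was free, so no insertion in progress here */
                insertingat = InvalidXLogRecPtr;
                break;
            }
        } while (insertingat < upto);

        if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto)
            finishedUpto = insertingat;
    }

The sooner the new insertingAt value becomes visible here, the sooner
the flusher can stop waiting on that slot.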

> > Another thing I was wondering about: how much does the fast-path used
> > in LWLockUpdateVar() influence the performance numbers? Am I right to
> > guess that it counts for most of the gain seen?
>
> I'll send about this in a follow-up email to not overload this
> response with too much data.

The fastpath exit in LWLockUpdateVar() doesn't seem to influence the
results much, see below results. However, it avoids waitlist lock
acquisition when there are no waiters.

test-case 1: -T5, WAL ~16 bytes
clients | HEAD  | PATCHED with fastpath | PATCHED no fast path
1       | 1482  | 1486                  | 1457
2       | 1617  | 1620                  | 1569
4       | 3174  | 3233                  | 3031
8       | 6136  | 6365                  | 5725
16      | 12566 | 12269                 | 11685
32      | 24284 | 23621                 | 23177
64      | 50135 | 45528                 | 46653
128     | 94903 | 89791                 | 89103
256     | 82289 | 152915                | 152835
512     | 62498 | 138838                | 142084
768     | 57083 | 125074                | 126768
1024    | 51308 | 113593                | 115930
2048    | 41084 | 88764                 | 85110
4096    | 19939 | 42257                 | 43917

> > Or could it be that
> > the removal of the spin lock in
> > LWLockConflictsWithVar()/LWLockWaitForVar() the point that has the
> > highest effect?
>
> I'll send about this in a follow-up email to not overload this
> response with too much data.

Out of the 3 functions that got rid of the waitlist lock
(LWLockConflictsWithVar/LWLockWaitForVar, LWLockUpdateVar,
LWLockReleaseClearVar), perf reports tell that the biggest gain (for
the use-cases that I've tried) is for
LWLockConflictsWithVar/LWLockWaitForVar:

test-case 1: -T5, WAL ~16 bytes
HEAD:
+   61.89% 0.05%  postgres  [.] LWLockWaitForVar
+   43.19% 0.12%  postgres  [.] LWLockConflictsWithVar
+1.62% 0.00%  postgres  [.] LWLockReleaseClearVar

PATCHED:
+   38.79% 0.11%  postgres  [.] LWLockWaitForVar
 0.40% 0.02%  postgres  [.] LWLockConflictsWithVar
+2.80% 0.00%  postgres  [.] LWLockReleaseClearVar

test-case 6: -T5, WAL 4096 bytes
HEAD:
+   29.66% 0.07%  postgres  [.] LWLockWaitForVar
+   20.94% 0.08%  postgres  [.] LWLockConflictsWithVar
 0.19% 0.03%  postgres  [.] LWLockUpdateVar

PATCHED:
+3.95% 0.08%  postgres  [.] LWLockWaitForVar
 0.19% 0.03%  postgres  [.] LWLockConflictsWithVar
+1.73% 0.00%  postgres  [.] LWLockReleaseClearVar

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-04-09 Thread Michael Paquier
On Mon, Feb 20, 2023 at 09:49:48PM -0800, Nathan Bossart wrote:
> I'm marking this as ready-for-committer.  I think a couple of the comments
> could use some small adjustments, but that probably doesn't need to hold up
> this patch.

Apologies.  I was planning to have a thorough look at this patch but
life got in the way and I have not been able to study what's happening
on this thread this close to the feature freeze.

Anyway, I am attaching two modules I have written for the sake of this
thread while beginning my lookup of the patch:
- lwlock_test.tar.gz, validation module for LWLocks with variable
waits.  This module can be loaded with shared_preload_libraries to
have two LWLocks and two variables in shmem, then have 2 backends play
ping-pong with each other's locks.  An isolation test may be possible,
though I have not thought hard about it.  Just use a SQL sequence like
that, for example, with N > 1 (see README):
Backend 1: SELECT lwlock_test_acquire();
Backend 2: SELECT lwlock_test_wait(N);
Backend 1: SELECT lwlock_test_update(N);
Backend 1: SELECT lwlock_test_release();
- custom_wal.tar.gz, thin wrapper for LogLogicalMessage() able to
generate N records of size M bytes in a single SQL call.  This can be
used to generate records of various sizes for benchmarking, limiting
the overhead of individual calls to pg_logical_emit_message_bytea().
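
The wrapper is essentially a loop around LogLogicalMessage(); a minimal
sketch of its shape (not the attached code, and assuming the
four-argument LogLogicalMessage() signature) would be:

#include "postgres.h"

#include "fmgr.h"
#include "replication/message.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(custom_wal);

/* Hypothetical sketch: custom_wal(record_size int, record_number int)
 * emits record_number transactional logical messages of record_size
 * bytes each; the attached module may differ in details. */
Datum
custom_wal(PG_FUNCTION_ARGS)
{
    int32       record_size = PG_GETARG_INT32(0);
    int32       record_number = PG_GETARG_INT32(1);
    char       *payload = palloc0(record_size);

    for (int i = 0; i < record_number; i++)
        LogLogicalMessage("custom_wal", payload, record_size, true);

    pfree(payload);
    PG_RETURN_VOID();
}

The SQL-level declaration is then something like "CREATE FUNCTION
custom_wal(int, int) RETURNS void AS 'MODULE_PATHNAME' LANGUAGE C
STRICT;", so each pgbench transaction boils down to a single call.
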
I have begun gathering numbers with WAL records of various size and
length, using pgbench like:
$ cat script.sql
\set record_size 1
\set record_number 5000
SELECT custom_wal(:record_size, :record_number);
$ pgbench -n -c 500 -t 100 -f script.sql
So this limits most of the overhead coming from parsing, planning, and
most of the INSERT logic.

I have been trying to get some reproducible numbers, but I think that
I am going to need a bigger machine than what I have been using for
the last few days, up to 400 connections.  It is worth noting that
00d1e02b may influence the results a bit, so we may want to have more
numbers with that in place, particularly with INSERTs, and one of the
tests used upthread uses single-row INSERTs.

Another question I had: would it be worth having some tests with
pg_wal/ mounted on a tmpfs so that I/O would not be a bottleneck?  It
should be instructive to get more measurements with a fixed number of
transactions and a rather high number of concurrent connections (1k at
least?), where the contention would be on the variable waits.  My
first impression is that the records should not be too small if you
want to see the effects of this patch more clearly, either.

Looking at the patch..  LWLockConflictsWithVar() and
LWLockReleaseClearVar() are the trivial bits.  These are OK.

+* NB: Note the use of pg_atomic_exchange_u64 as opposed to just
+* pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
+* a full barrier, we're guaranteed that the subsequent shared memory
+* reads/writes, if any, happen after we reset the lock variable.

This mentions that the subsequent read/write operations are safe, so
this refers to anything happening after the variable is reset.  As this
is a full barrier, should we also mention that it is also ordered with
respect to anything that the caller did before clearing the variable?
From this perspective using pg_atomic_exchange_u64() makes sense to me
in LWLockReleaseClearVar().

+* XXX: Use of a spinlock at the beginning of this function to read
+* current insert position implies memory ordering. That means that
+* the immediate loads and stores to shared memory (for instance,
+* in LWLockUpdateVar called via LWLockWaitForVar) don't need an
+* explicit memory barrier as far as the current usage is
+* concerned. But that might not be safe in general.
 */
What's the part where this is not safe?  Based on what I see, this
code path is safe because of the previous spinlock.  This is the same
comment as at the beginning of LWLockConflictsWithVar().  Is that
something that we ought to document at the top of LWLockWaitForVar()
as well?  We have one caller of this function currently, but there may
be more in the future.

- * you're about to write out.
+ * you're about to write out. Using an atomic variable for insertingAt avoids
+ * taking any explicit lock for reads and writes.

Hmm.  Not sure that we need to comment at all.

-LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
+LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
[...]
Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
 
-   /* Update the lock's value */
-   *valptr = val;

The sensitive change is in LWLockUpdateVar().  I am not completely
sure to understand this removal, though.  Does that influence the case
where there are waiters?

Another thing I was wondering about: how much does the fast-path used
in LWLockUpdateVar() influence the performance numbers?  Am I right to
guess that it counts for most of the gain seen? 

Re: WAL Insertion Lock Improvements

2023-02-20 Thread Nathan Bossart
On Thu, Feb 09, 2023 at 11:51:28AM +0530, Bharath Rupireddy wrote:
> On Thu, Feb 9, 2023 at 3:36 AM Nathan Bossart  
> wrote:
>> Overall, I think this patch is in reasonable shape.
> 
> Thanks for reviewing. Please see the attached v5 patch.

I'm marking this as ready-for-committer.  I think a couple of the comments
could use some small adjustments, but that probably doesn't need to hold up
this patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-02-08 Thread Bharath Rupireddy
On Thu, Feb 9, 2023 at 3:36 AM Nathan Bossart  wrote:
>
> +   pg_atomic_exchange_u64(valptr, val);
>
> nitpick: I'd add a (void) at the beginning of these calls to
> pg_atomic_exchange_u64() so that it's clear that we are discarding the
> return value.

I did that in the attached v5 patch although it's a mix elsewhere;
some doing explicit return value cast with (void) and some not.

> +   /*
> +* Update the lock variable atomically first without having to 
> acquire wait
> +* list lock, so that if anyone looking for the lock will have chance 
> to
> +* grab it a bit quickly.
> +*
> +* NB: Note the use of pg_atomic_exchange_u64 as opposed to just
> +* pg_atomic_write_u64 to update the value. Since 
> pg_atomic_exchange_u64 is
> +* a full barrier, we're guaranteed that the subsequent atomic read 
> of lock
> +* state to check if it has any waiters happens after we set the lock
> +* variable to new value here. Without a barrier, we could end up 
> missing
> +* waiters that otherwise should have been woken up.
> +*/
> +   pg_atomic_exchange_u64(valptr, val);
> +
> +   /*
> +* Quick exit when there are no waiters. This avoids unnecessary 
> lwlock's
> +* wait list lock acquisition and release.
> +*/
> +   if ((pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) == 0)
> +   return;
>
> I think this makes sense.  A waiter could queue itself after the exchange,
> but it'll recheck after queueing.  IIUC this is basically how this works
> today.  We update the value and release the lock before waking up any
> waiters, so the same principle applies.

Yes, a waiter, right after self-queuing (LWLockQueueSelf), checks the
value (LWLockConflictsWithVar) before it goes and waits until awakened
in LWLockWaitForVar. A waiter added to the queue is guaranteed to be
woken up by LWLockUpdateVar, but before that the lock value is set, and
pg_atomic_exchange_u64 acts as a memory barrier, so there is no memory
reordering. Essentially, the order of these operations isn't changed.
The benefit that we're seeing comes from avoiding LWLock's waitlist
lock for reading and updating the lock value, relying on 64-bit atomics
instead.
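
For reference, the waiter side is shaped roughly like this (a heavily
simplified sketch of LWLockWaitForVar(), not the actual code;
wait-event reporting, spurious-wakeup handling and other details are
omitted):

    for (;;)
    {
        bool        result;     /* set to whether the lock is free */

        /* Done if the lock is free or the value has already changed. */
        if (!LWLockConflictsWithVar(lock, valptr, oldval, newval, &result))
            break;

        /* Add ourselves to the wait queue; this sets LW_FLAG_HAS_WAITERS. */
        LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE);

        /* Recheck after queueing, in case the updater changed the value
         * just before it looked at LW_FLAG_HAS_WAITERS. */
        if (!LWLockConflictsWithVar(lock, valptr, oldval, newval, &result))
        {
            LWLockDequeueSelf(lock);
            break;
        }

        /* Sleep; LWLockUpdateVar() or LWLockRelease() will wake us up. */
        PGSemaphoreLock(MyProc->sem);
    }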

> Overall, I think this patch is in reasonable shape.

Thanks for reviewing. Please see the attached v5 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v5-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch
Description: Binary data


Re: WAL Insertion Lock Improvements

2023-02-08 Thread Nathan Bossart
+   pg_atomic_exchange_u64(valptr, val);

nitpick: I'd add a (void) at the beginning of these calls to
pg_atomic_exchange_u64() so that it's clear that we are discarding the
return value.

+   /*
+* Update the lock variable atomically first without having to acquire 
wait
+* list lock, so that if anyone looking for the lock will have chance to
+* grab it a bit quickly.
+*
+* NB: Note the use of pg_atomic_exchange_u64 as opposed to just
+* pg_atomic_write_u64 to update the value. Since 
pg_atomic_exchange_u64 is
+* a full barrier, we're guaranteed that the subsequent atomic read of 
lock
+* state to check if it has any waiters happens after we set the lock
+* variable to new value here. Without a barrier, we could end up 
missing
+* waiters that otherwise should have been woken up.
+*/
+   pg_atomic_exchange_u64(valptr, val);
+
+   /*
+* Quick exit when there are no waiters. This avoids unnecessary 
lwlock's
+* wait list lock acquisition and release.
+*/
+   if ((pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) == 0)
+   return;

I think this makes sense.  A waiter could queue itself after the exchange,
but it'll recheck after queueing.  IIUC this is basically how this works
today.  We update the value and release the lock before waking up any
waiters, so the same principle applies.

Overall, I think this patch is in reasonable shape.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-02-02 Thread Bharath Rupireddy
On Tue, Jan 24, 2023 at 7:00 PM Bharath Rupireddy
 wrote:
>
> I'm attaching the v3 patch with the above review comments addressed.
> Hopefully, no memory ordering issues now. FWIW, I've added it to CF
> https://commitfest.postgresql.org/42/4141/.
>
> Test results with the v3 patch and insert workload are the same as
> that of the earlier run - TPS starts to scale at higher clients as
> expected after 512 clients and peaks at 2X with 2048 and 4096 clients.
>
> HEAD:
> 1 1380.411086
> 2 1358.378988
> 4 2701.974332
> 8 5925.380744
> 16 10956.501237
> 32 20877.513953
> 64 40838.046774
> 128 70251.744161
> 256 108114.321299
> 512 120478.988268
> 768 99140.425209
> 1024 93645.984364
> 2048 70111.159909
> 4096 55541.804826
>
> v3 PATCHED:
> 1 1493.800209
> 2 1569.414953
> 4 3154.186605
> 8 5965.578904
> 16 11912.587645
> 32 22720.964908
> 64 42001.094528
> 128 78361.158983
> 256 110457.926232
> 512 148941.378393
> 768 167256.590308
> 1024 155510.675372
> 2048 147499.376882
> 4096 119375.457779

I slightly modified the comments and attached the v4 patch for further
review. I also took a perf report - there's a clear reduction in the
functions that are affected by the patch - LWLockWaitListLock,
WaitXLogInsertionsToFinish, LWLockWaitForVar and
LWLockConflictsWithVar. Note that I compiled the source code with
-ggdb to capture symbols for perf; still, the benefit stands at > 2X
for a higher number of clients.

HEAD:
+   16.87% 0.01%  postgres  [.] CommitTransactionCommand
+   16.86% 0.00%  postgres  [.] finish_xact_command
+   16.81% 0.01%  postgres  [.] CommitTransaction
+   15.09% 0.20%  postgres  [.] LWLockWaitListLock
+   14.53% 0.01%  postgres  [.] WaitXLogInsertionsToFinish
+   14.51% 0.02%  postgres  [.] LWLockWaitForVar
+   11.70% 11.63%  postgres  [.] pg_atomic_read_u32_impl
+   11.66% 0.08%  postgres  [.] pg_atomic_read_u32
+9.96% 0.03%  postgres  [.] LWLockConflictsWithVar
+4.78% 0.00%  postgres  [.] LWLockQueueSelf
+1.91% 0.01%  postgres  [.] pg_atomic_fetch_or_u32
+1.91% 1.89%  postgres  [.] pg_atomic_fetch_or_u32_impl
+1.73% 0.00%  postgres  [.] XLogInsert
+1.69% 0.01%  postgres  [.] XLogInsertRecord
+1.41% 0.01%  postgres  [.] LWLockRelease
+1.37% 0.47%  postgres  [.] perform_spin_delay
+1.11% 1.11%  postgres  [.] spin_delay
+1.10% 0.03%  postgres  [.] exec_bind_message
+0.91% 0.00%  postgres  [.] WALInsertLockRelease
+0.91% 0.00%  postgres  [.] LWLockReleaseClearVar
+0.72% 0.02%  postgres  [.] LWLockAcquire
+0.60% 0.00%  postgres  [.] LWLockDequeueSelf
+0.58% 0.00%  postgres  [.] GetTransactionSnapshot
 0.58% 0.49%  postgres  [.] GetSnapshotData
+0.58% 0.00%  postgres  [.] WALInsertLockAcquire
+0.55% 0.00%  postgres  [.] XactLogCommitRecord

TPS (compiled with -ggdb for capturing symbols for perf)
1 1392.512967
2 1435.899119
4 3104.091923
8 6159.305522
16 11477.641780
32 22701.000718
64 41662.425880
128 23743.426209
256 89837.651619
512 65164.221500
768 66015.733370
1024 56421.223080
2048 52909.018072
4096 40071.146985

PATCHED:
+2.19% 0.05%  postgres  [.] LWLockWaitListLock
+2.10% 0.01%  postgres  [.] LWLockQueueSelf
+1.73% 1.71%  postgres  [.] pg_atomic_read_u32_impl
+1.73% 0.02%  postgres  [.] pg_atomic_read_u32
+1.72% 0.02%  postgres  [.] LWLockRelease
+1.65% 0.04%  postgres  [.] exec_bind_message
+1.43% 0.00%  postgres  [.] XLogInsert
+1.42% 0.01%  postgres  [.] WaitXLogInsertionsToFinish
+1.40% 0.03%  postgres  [.] LWLockWaitForVar
+1.38% 0.02%  postgres  [.] XLogInsertRecord
+0.93% 0.03%  postgres  [.] LWLockAcquireOrWait
+0.91% 0.00%  postgres  [.] GetTransactionSnapshot
+0.91% 0.79%  postgres  [.] GetSnapshotData
+0.91% 0.00%  postgres  [.] WALInsertLockRelease
+0.91% 0.00%  postgres  [.] LWLockReleaseClearVar
+0.53% 0.02%  postgres  [.] ExecInitModifyTable

TPS (compiled with -ggdb for capturing symbols for perf)
1 1295.296611
2 1459.079162
4 2865.688987
8 5533.724983
16 10771.697842
32 20557.499312
64 39436.423783
128 42555.639048
256 73139.060227
512 124649.665196
768 131162.826976
1024 132185.160007
2048 117377.586644
4096 88240.336940


--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 74c5bd8cc4f1497aa7f2fa02c6487039dc91e847 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 2 Feb 2023 03:42:27 +
Subject: [PATCH v4] Optimize WAL insertion lock acquisition and release

This commit optimizes WAL insertion lock acquisition and release
in the following way:

1. WAL insertion lock's variable insertingAt is currently read and
written with the help of lwlock's wait list lock to avoid torn
reads/writes. This wait list lock can become a point of contention
on highly concurrent write workloads.

Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())

2023-01-24 Thread Bharath Rupireddy
On Tue, Dec 6, 2022 at 12:00 AM Andres Freund  wrote:
>
> Hi

Thanks for reviewing.

> FWIW, I don't see an advantage in 0003. If it allows us to make something else
> simpler / faster, cool, but on its own it doesn't seem worthwhile.

I've discarded this change.

> On 2022-12-02 16:31:58 -0800, Nathan Bossart wrote:
> > On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote:
> > > On Fri, Dec 2, 2022 at 6:10 AM Andres Freund  wrote:
> > >> I'm not sure this is quite right - don't we need a memory barrier. But I 
> > >> don't
> > >> see a reason to not just leave this code as-is. I think this should be
> > >> optimized entirely in lwlock.c
> > >
> > > Actually, we don't need that at all as LWLockWaitForVar() will return
> > > immediately if the lock is free. So, I removed it.
> >
> > I briefly looked at the latest patch set, and I'm curious how this change
> > avoids introducing memory ordering bugs.  Perhaps I am missing something
> > obvious.
>
> I'm a bit confused too - the comment above talks about LWLockWaitForVar(), but
> the patches seem to optimize LWLockUpdateVar().
>
> I think it'd be safe to optimize LWLockConflictsWithVar(), due to some
> pre-existing, quite crufty, code. LWLockConflictsWithVar() says:
>
>  * Test first to see if it the slot is free right now.
>  *
>  * XXX: the caller uses a spinlock before this, so we don't need a 
> memory
>  * barrier here as far as the current usage is concerned.  But that 
> might
>  * not be safe in general.
>
> which happens to be true in the single, indirect, caller:
>
> /* Read the current insert position */
> SpinLockAcquire(&Insert->insertpos_lck);
> bytepos = Insert->CurrBytePos;
> SpinLockRelease(&Insert->insertpos_lck);
> reservedUpto = XLogBytePosToEndRecPtr(bytepos);
>
> I think at the very least we ought to have a comment in
> WaitXLogInsertionsToFinish() highlighting this.

Done.

> It's not at all clear to me that the proposed fast-path for LWLockUpdateVar()
> is safe. I think at the very least we could end up missing waiters that we
> should have woken up.
>
> I think it ought to be safe to do something like
>
> pg_atomic_exchange_u64()..
> if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
>   return;
>
> because the pg_atomic_exchange_u64() will provide the necessary memory
> barrier.

Done.

I'm attaching the v3 patch with the above review comments addressed.
Hopefully, no memory ordering issues now. FWIW, I've added it to CF
https://commitfest.postgresql.org/42/4141/.

Test results with the v3 patch and insert workload are the same as
that of the earlier run - TPS starts to scale at higher clients as
expected after 512 clients and peaks at 2X with 2048 and 4096 clients.

HEAD:
1 1380.411086
2 1358.378988
4 2701.974332
8 5925.380744
16 10956.501237
32 20877.513953
64 40838.046774
128 70251.744161
256 108114.321299
512 120478.988268
768 99140.425209
1024 93645.984364
2048 70111.159909
4096 55541.804826

v3 PATCHED:
1 1493.800209
2 1569.414953
4 3154.186605
8 5965.578904
16 11912.587645
32 22720.964908
64 42001.094528
128 78361.158983
256 110457.926232
512 148941.378393
768 167256.590308
1024 155510.675372
2048 147499.376882
4096 119375.457779

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v3-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch
Description: Binary data


Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())

2023-01-09 Thread Andres Freund
Hi,

On 2022-12-08 12:29:54 +0530, Bharath Rupireddy wrote:
> On Tue, Dec 6, 2022 at 12:00 AM Andres Freund  wrote:
> > I think it'd be safe to optimize LWLockConflictsWithVar(), due to some
> > pre-existing, quite crufty, code. LWLockConflictsWithVar() says:
> >
> >  * Test first to see if it the slot is free right now.
> >  *
> >  * XXX: the caller uses a spinlock before this, so we don't need a 
> > memory
> >  * barrier here as far as the current usage is concerned.  But that 
> > might
> >  * not be safe in general.
> >
> > which happens to be true in the single, indirect, caller:
> >
> > /* Read the current insert position */
> > SpinLockAcquire(&Insert->insertpos_lck);
> > bytepos = Insert->CurrBytePos;
> > SpinLockRelease(&Insert->insertpos_lck);
> > reservedUpto = XLogBytePosToEndRecPtr(bytepos);
> >
> > I think at the very least we ought to have a comment in
> > WaitXLogInsertionsToFinish() highlighting this.
> 
> So, using a spinlock ensures no memory ordering occurs while reading
> lock->state in LWLockConflictsWithVar()?

No, a spinlock *does* imply ordering. But your patch does remove several of
the spinlock acquisitions (via LWLockWaitListLock()). And moved the assignment
in LWLockUpdateVar() out from under the spinlock.

If you remove spinlock operations (or other barrier primitives), you need to
make sure that such modifications don't break the required memory ordering.


> How does spinlock that gets acquired and released in the caller
> WaitXLogInsertionsToFinish() itself and the memory barrier in the called
> function LWLockConflictsWithVar() relate here? Can you please help me
> understand this a bit?

The caller's barrier means that we'll see values that are at least as "up to
date" as at the time of the barrier (it's a bit more complicated than that, a
barrier always needs to be paired with another barrier).
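
As a toy illustration of that pairing, using the primitives from
src/include/port/atomics.h (the Shared struct, the "shared" pointer
into shmem and the value 42 are made up for the example; flag is
assumed to have been initialized with pg_atomic_init_u32()):

typedef struct
{
    int                 data;
    pg_atomic_uint32    flag;
} Shared;               /* hypothetical shmem struct, for illustration */

/* backend A: publish the data, then set the flag */
shared->data = 42;
pg_write_barrier();                     /* order the data store before the flag store */
pg_atomic_write_u32(&shared->flag, 1);

/* backend B: check the flag, then read the data */
if (pg_atomic_read_u32(&shared->flag) == 1)
{
    pg_read_barrier();                  /* pairs with A's write barrier */
    Assert(shared->data == 42);         /* B is guaranteed to see A's store */
}

If either side drops its barrier, the guarantee is gone, which is the
kind of hole the review is trying to avoid in lwlock.c.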


> > It's not at all clear to me that the proposed fast-path for 
> > LWLockUpdateVar()
> > is safe. I think at the very least we could end up missing waiters that we
> > should have woken up.
> >
> > I think it ought to be safe to do something like
> >
> > pg_atomic_exchange_u64()..
> > if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
> >   return;
> 
> pg_atomic_exchange_u64(&lock->state, exchange_with_what_?. Exchange will
> change the value no?

Not lock->state, but the atomic passed to LWLockUpdateVar(), which we do want
to update. A pg_atomic_exchange_u64() includes a memory barrier.
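
Spelled out with the two primitives side by side (illustration only;
valptr, val and lock as in LWLockUpdateVar()):

/* A plain atomic store has no barrier semantics, so the read of
 * lock->state below could be reordered before the store to *valptr is
 * visible to other backends, and a waiter could be missed. */
pg_atomic_write_u64(valptr, val);

/* The exchange is a full barrier, so the store to *valptr is ordered
 * before the check for waiters that follows. */
(void) pg_atomic_exchange_u64(valptr, val);

if ((pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) == 0)
    return;             /* no one to wake up */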


> > because the pg_atomic_exchange_u64() will provide the necessary memory
> > barrier.
> 
> I'm reading some comments [1], are these also true for 64-bit atomic
> CAS?

Yes. See
/* 
 * The 64 bit operations have the same semantics as their 32bit counterparts
 * if they are available. Check the corresponding 32bit function for
 * documentation.
 * 
 */


> Does it mean that an atomic CAS operation inherently provides a
> memory barrier?

Yes.


> Can you please point me if it's described better somewhere else?

I'm not sure what you'd like to have described more extensively, tbh.

Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())

2022-12-07 Thread Bharath Rupireddy
On Tue, Dec 6, 2022 at 12:00 AM Andres Freund  wrote:
>
> FWIW, I don't see an advantage in 0003. If it allows us to make something else
> simpler / faster, cool, but on its own it doesn't seem worthwhile.

Thanks. I will discard it.

> I think it'd be safe to optimize LWLockConflictsWithVar(), due to some
> pre-existing, quite crufty, code. LWLockConflictsWithVar() says:
>
>  * Test first to see if it the slot is free right now.
>  *
>  * XXX: the caller uses a spinlock before this, so we don't need a 
> memory
>  * barrier here as far as the current usage is concerned.  But that 
> might
>  * not be safe in general.
>
> which happens to be true in the single, indirect, caller:
>
> /* Read the current insert position */
> SpinLockAcquire(&Insert->insertpos_lck);
> bytepos = Insert->CurrBytePos;
> SpinLockRelease(&Insert->insertpos_lck);
> reservedUpto = XLogBytePosToEndRecPtr(bytepos);
>
> I think at the very least we ought to have a comment in
> WaitXLogInsertionsToFinish() highlighting this.

So, using a spinlock ensures no memory ordering occurs while reading
lock->state in LWLockConflictsWithVar()? How does spinlock that gets
acquired and released in the caller WaitXLogInsertionsToFinish()
itself and the memory barrier in the called function
LWLockConflictsWithVar() relate here? Can you please help me
understand this a bit?

> It's not at all clear to me that the proposed fast-path for LWLockUpdateVar()
> is safe. I think at the very least we could end up missing waiters that we
> should have woken up.
>
> I think it ought to be safe to do something like
>
> pg_atomic_exchange_u64()..
> if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
>   return;

pg_atomic_exchange_u64(&lock->state, exchange_with_what_?. Exchange
will change the value no?

> because the pg_atomic_exchange_u64() will provide the necessary memory
> barrier.

I'm reading some comments [1], are these also true for 64-bit atomic
CAS? Does it mean that an atomic CAS operation inherently provides a
memory barrier? Can you please point me if it's described better
somewhere else?

[1]
 * Full barrier semantics.
 */
static inline uint32
pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr,

/*
 * Get and clear the flags that are set for this backend. Note that
 * pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
 * read of the barrier generation above happens before we atomically
 * extract the flags, and that any subsequent state changes happen
 * afterward.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())

2022-12-05 Thread Andres Freund
Hi,

FWIW, I don't see an advantage in 0003. If it allows us to make something else
simpler / faster, cool, but on its own it doesn't seem worthwhile.




On 2022-12-02 16:31:58 -0800, Nathan Bossart wrote:
> On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote:
> > On Fri, Dec 2, 2022 at 6:10 AM Andres Freund  wrote:
> >> I'm not sure this is quite right - don't we need a memory barrier. But I 
> >> don't
> >> see a reason to not just leave this code as-is. I think this should be
> >> optimized entirely in lwlock.c
> > 
> > Actually, we don't need that at all as LWLockWaitForVar() will return
> > immediately if the lock is free. So, I removed it.
> 
> I briefly looked at the latest patch set, and I'm curious how this change
> avoids introducing memory ordering bugs.  Perhaps I am missing something
> obvious.

I'm a bit confused too - the comment above talks about LWLockWaitForVar(), but
the patches seem to optimize LWLockUpdateVar().


I think it'd be safe to optimize LWLockConflictsWithVar(), due to some
pre-existing, quite crufty, code. LWLockConflictsWithVar() says:

 * Test first to see if it the slot is free right now.
 *
 * XXX: the caller uses a spinlock before this, so we don't need a 
memory
 * barrier here as far as the current usage is concerned.  But that 
might
 * not be safe in general.

which happens to be true in the single, indirect, caller:

/* Read the current insert position */
SpinLockAcquire(&Insert->insertpos_lck);
bytepos = Insert->CurrBytePos;
SpinLockRelease(&Insert->insertpos_lck);
reservedUpto = XLogBytePosToEndRecPtr(bytepos);

I think at the very least we ought to have a comment in
WaitXLogInsertionsToFinish() highlighting this.



It's not at all clear to me that the proposed fast-path for LWLockUpdateVar()
is safe. I think at the very least we could end up missing waiters that we
should have woken up.

I think it ought to be safe to do something like

pg_atomic_exchange_u64()..
if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
  return;

because the pg_atomic_exchange_u64() will provide the necessary memory
barrier.

Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())

2022-12-02 Thread Nathan Bossart
On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 2, 2022 at 6:10 AM Andres Freund  wrote:
>> I'm not sure this is quite right - don't we need a memory barrier. But I 
>> don't
>> see a reason to not just leave this code as-is. I think this should be
>> optimized entirely in lwlock.c
> 
> Actually, we don't need that at all as LWLockWaitForVar() will return
> immediately if the lock is free. So, I removed it.

I briefly looked at the latest patch set, and I'm curious how this change
avoids introducing memory ordering bugs.  Perhaps I am missing something
obvious.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com