Re: LogwrtResult contended spinlock

2024-04-08 Thread Jeff Davis
On Mon, 2024-04-08 at 10:24 +0200, Alvaro Herrera wrote:
> My trouble with the "copy" term is that we don't use that term
> anywhere
> in relation to WAL.

I got the term from CopyXLogRecordToWAL().

> This "copy" is in
> reality just the insertion, after it's finished.  The "Result" suffix
> is intended to convey that it's the point where the operation has
> successfully finished.

That's a convincing point.

> Maybe we can add some commentary to make this clearer.

For now, I'd just suggest a comment on GetXLogInsertRecPtr() explaining
what it's returning and how that's different from logInsertResult.

In the future, it would be nice to clarify where variable names like
reservedUpto and CurrBytePos fit in. Also the external documentation
for pg_current_wal_insert_lsn() is a bit ambiguous. 

> Lastly, I just noticed that I forgot credits and discussion link in
> that
> commit message.  I would have attributed authorship to Bharath
> though,
> because I had not realized that this patch had actually been written
> by
> you initially[1].  My apologies.

No worries. Thank you for reviewing and committing it.

Regards,
Jeff Davis





Re: LogwrtResult contended spinlock

2024-04-08 Thread Alvaro Herrera
On 2024-Apr-07, Jeff Davis wrote:

> On Sun, 2024-04-07 at 14:19 +0200, Alvaro Herrera wrote:
> > I pushed the "copy" pointer now, except that I renamed it to
> > "insert",
> > which is what we call the operation being tracked.  I also added some
> > comments.
> 
> The "copy" pointer, as I called it, is not the same as the "insert"
> pointer (as returned by GetXLogInsertRecPtr()).
> 
> LSNs before the "insert" pointer are reserved for WAL inserters (and
> may or may not have made it to WAL buffers yet). LSNs before the "copy"
> pointer are written to WAL buffers with CopyXLogRecordToWAL() (and may
> or may not have been evicted to the OS file yet).

It seems a matter of interpretation.  WAL insertion starts with
reserving the space (in ReserveXLogInsertLocation) and moving the
CurrBytePos point forward; the same WAL insertion completes when the
actual data has been copied to the buffers.  It is this process as a
whole that we call an insertion.  CurrBytePos (what GetXLogInsertRecPtr
returns) is the point where the next insertions are going to occur; the
new logInsertResult is the point where previous insertions are known to
have completed.

We could think about insertions as something that's happening to a range
of bytes.  CurrBytePos is the high end of that range, logInsertResult is
its low end.  (Although in reality we don't know the true low end,
because the process writing the oldest WAL doesn't advertise as soon as
it finishes its insertion, because it doesn't know that it is writing
the oldest.  We only know that X is this "oldest known" after we have
waited for all those that were inserting earlier than X to have
finished.)

My trouble with the "copy" term is that we don't use that term anywhere
in relation to WAL.  It's a term we don't need.  This "copy" is in
reality just the insertion, after it's finished.  The "Result" suffix
is intended to convey that it's the point where the operation has
successfully finished.

Maybe we can add some commentary to make this clearer.

Now, if you're set on renaming the variable back to Copy, I won't
object.

Lastly, I just noticed that I forgot credits and discussion link in that
commit message.  I would have attributed authorship to Bharath though,
because I had not realized that this patch had actually been written by
you initially[1].  My apologies.

[1] 
https://postgr.es/m/727449f3151c6b9ab76ba706fa4d30bf7b03ad4f.ca...@j-davis.com

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: LogwrtResult contended spinlock

2024-04-07 Thread Jeff Davis
On Sun, 2024-04-07 at 14:19 +0200, Alvaro Herrera wrote:
> I pushed the "copy" pointer now, except that I renamed it to
> "insert",
> which is what we call the operation being tracked.  I also added some
> comments.

The "copy" pointer, as I called it, is not the same as the "insert"
pointer (as returned by GetXLogInsertRecPtr()).

LSNs before the "insert" pointer are reserved for WAL inserters (and
may or may not have made it to WAL buffers yet). LSNs before the "copy"
pointer are written to WAL buffers with CopyXLogRecordToWAL() (and may
or may not have been evicted to the OS file yet).

Other than that, looks good, thank you.

Regards,
Jeff Davis





Re: LogwrtResult contended spinlock

2024-04-07 Thread Alvaro Herrera
I pushed the "copy" pointer now, except that I renamed it to "insert",
which is what we call the operation being tracked.  I also added some
comments.

One perhaps significant change from what Bharath submitted is that we
now use the return value from monotonic advance to return an updated
value of finishedUpto from WaitXLogInsertionsToFinish.  I think this is
a useful optimization: the only caller of WaitXLogInsertionsToFinish
which cares about its return value is XLogFlush, which AFAICS benefits
from having a more up-to-date value.

Also, I did adopt your (Jeff's) idea of using a barrier when reading
logInsertResult in WaitXLogInsertionsToFinish.  If I understand
correctly, this is actually useful: if some other CPU core was also
recently busy waiting for xlog insertions and wrote an updated version
of logInsertResult that's past what we last saw, we benefit from an
updated value because we can skip the spinlock dance and WAL-insert-var
stuff altogether.  With no barrier, we would potentially see an outdated
value of logInsertResult and waste a lot of effort to get it updated.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Those who use electric razors are infidels destined to burn in hell while
we drink from rivers of beer, download free vids and mingle with naked
well shaved babes." (http://slashdot.org/comments.pl?sid=44793=4647152)




Re: LogwrtResult contended spinlock

2024-04-06 Thread Jeff Davis
On Sat, 2024-04-06 at 18:13 +0200, Alvaro Herrera wrote:
> my understanding of pg_atomic_compare_exchange_u64 is that
> *expected is set to the value that's stored in *ptr prior to the
> exchange.

Sorry, my mistake. Your version looks good.

Regards,
Jeff Davis





Re: LogwrtResult contended spinlock

2024-04-06 Thread Alvaro Herrera
On 2024-Apr-05, Jeff Davis wrote:

> Minor comments:

> * Also, I assume that the Max() call in
> pg_atomic_monotonic_advance_u64() is just for clarity? Surely the
> currval cannot be less than _target when it returns. I'd probably just
> do Assert(currval >= _target) and then return currval.

Uhh ... my understanding of pg_atomic_compare_exchange_u64 is that
*expected is set to the value that's stored in *ptr prior to the
exchange.  Its comment

 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.

is actually not very clear, because what does "current" mean in this
context?  Is it the original value, or the value after the exchange?
Anyway, looking at the spinlock-emulated code for
pg_atomic_compare_exchange_u32_impl in atomics.c,

/* perform compare/exchange logic */
ret = ptr->value == *expected;
*expected = ptr->value;
if (ret)
ptr->value = newval;

it's clear that *expected receives the original value, not the new one.

Because of this behavior, not doing the Max() would return the obsolete
value rather than the one we just installed.  (It would only work
correctly without Max() when our cmpxchg operation "fails", that is,
somebody else incremented the value past the one we want to install.)


Another reason is that when I used pg_atomic_monotonic_advance_u64 with
the Write and Flush pointers and did not have the Max(), the assertion
failed pretty quickly :-)
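
To illustrate, here's a standalone C11-atomics model of the loop (an
illustrative sketch only, not the actual patch; the function name is made
up, but atomic_compare_exchange_strong has the same
update-*expected-only-on-failure behavior as pg_atomic_compare_exchange_u64):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Toy model of pg_atomic_monotonic_advance_u64 using C11 atomics.
 * atomic_compare_exchange_strong stores the value found in *ptr into
 * *expected only when the exchange fails; on success, *expected keeps
 * the OLD value.  Hence the final Max(): without it we would return the
 * obsolete value after a successful exchange.
 */
static uint64_t
monotonic_advance(_Atomic uint64_t *ptr, uint64_t target)
{
	uint64_t	currval = atomic_load(ptr);

	if (currval >= target)
		return currval;			/* already past target; nothing to do */

	while (!atomic_compare_exchange_strong(ptr, &currval, target))
	{
		/* failed: currval now holds the latest value observed in *ptr */
		if (currval >= target)
			return currval;		/* a concurrent advance got there first */
	}

	/* success: currval still holds the old value, so take the Max */
	return currval > target ? currval : target;
}
```

The same model also shows the monotonic property: advancing to a target
below the current value leaves the variable (and the return value)
unchanged.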

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Al principio era UNIX, y UNIX habló y dijo: "Hello world\n".
No dijo "Hello New Jersey\n", ni "Hello USA\n".




Re: LogwrtResult contended spinlock

2024-04-06 Thread Bharath Rupireddy
On Sat, Apr 6, 2024 at 9:21 AM Thomas Munro  wrote:
>
> On Sat, Apr 6, 2024 at 6:55 AM Alvaro Herrera  wrote:
> > Pushed 0001.
>
> Could that be related to the 3 failures on parula that look like this?
>
> TRAP: failed Assert("node->next == 0 && node->prev == 0"), File:
> "../../../../src/include/storage/proclist.h", Line: 63, PID: 29119
> 2024-04-05 16:16:26.812 UTC [29114:15] pg_regress/drop_operator LOG:
> statement: DROP OPERATOR <|(bigint, bigint);
> postgres: postgres regression [local] CREATE
> ROLE(ExceptionalCondition+0x4c)[0x9b3fdc]
> postgres: postgres regression [local] CREATE ROLE[0x8529e4]
> postgres: postgres regression [local] CREATE
> ROLE(LWLockWaitForVar+0xec)[0x8538fc]
> postgres: postgres regression [local] CREATE ROLE[0x54c7d4]
> postgres: postgres regression [local] CREATE ROLE(XLogFlush+0xf0)[0x552600]
> postgres: postgres regression [local] CREATE ROLE[0x54a9b0]
> postgres: postgres regression [local] CREATE ROLE[0x54bbdc]
>
> Hmm, the comments for LWLockWaitForVar say:
>
>  * Be aware that LWLockConflictsWithVar() does not include a memory barrier,
>  * hence the caller of this function may want to rely on an explicit barrier 
> or
>  * an implied barrier via spinlock or LWLock to avoid memory ordering issues.
>
> But that seems to be more likely to make LWLockWaitForVar suffer data
> races (ie hang), not break assertions about LWLock sanity, so I don't
> know what's going on there.  I happened to have a shell on a Graviton
> box, but I couldn't reproduce it after a while...

Thanks for reporting. I'll try to spin up an instance similar to
parula and reproduce. Meanwhile, I'm wondering if it is somehow
related to what's discussed in "Why is parula failing?"
https://www.postgresql.org/message-id/4009739.1710878318%40sss.pgh.pa.us.
It seems like parula is behaving unexpectedly because of the compiler
and other stuff.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: LogwrtResult contended spinlock

2024-04-05 Thread Thomas Munro
On Sat, Apr 6, 2024 at 6:55 AM Alvaro Herrera  wrote:
> Pushed 0001.

Could that be related to the 3 failures on parula that look like this?

TRAP: failed Assert("node->next == 0 && node->prev == 0"), File:
"../../../../src/include/storage/proclist.h", Line: 63, PID: 29119
2024-04-05 16:16:26.812 UTC [29114:15] pg_regress/drop_operator LOG:
statement: DROP OPERATOR <|(bigint, bigint);
postgres: postgres regression [local] CREATE
ROLE(ExceptionalCondition+0x4c)[0x9b3fdc]
postgres: postgres regression [local] CREATE ROLE[0x8529e4]
postgres: postgres regression [local] CREATE
ROLE(LWLockWaitForVar+0xec)[0x8538fc]
postgres: postgres regression [local] CREATE ROLE[0x54c7d4]
postgres: postgres regression [local] CREATE ROLE(XLogFlush+0xf0)[0x552600]
postgres: postgres regression [local] CREATE ROLE[0x54a9b0]
postgres: postgres regression [local] CREATE ROLE[0x54bbdc]

Hmm, the comments for LWLockWaitForVar say:

 * Be aware that LWLockConflictsWithVar() does not include a memory barrier,
 * hence the caller of this function may want to rely on an explicit barrier or
 * an implied barrier via spinlock or LWLock to avoid memory ordering issues.

But that seems to be more likely to make LWLockWaitForVar suffer data
races (ie hang), not break assertions about LWLock sanity, so I don't
know what's going on there.  I happened to have a shell on a Graviton
box, but I couldn't reproduce it after a while...




Re: LogwrtResult contended spinlock

2024-04-05 Thread Jeff Davis
On Fri, 2024-04-05 at 13:54 +0200, Alvaro Herrera wrote:
> Couldn't push: I tested with --disable-atomics --disable-spinlocks
> and
> the tests fail because the semaphore for the atomic variables is not
> always initialized.  This is weird -- it's like a client process is
> running at a time when StartupXLOG has not initialized the variable
> ...
> so the initialization in the other place was not completely wrong.

It looks like you solved the problem and pushed the first patch. Looks
good to me.

Patch 0002 also looks good to me, after a mostly-trivial rebase (just
remember to initialize logCopyResult).

Minor comments:

* You could consider using a read barrier before reading the Copy ptr,
or using the pg_atomic_read_membarrier_u64() variant. I don't see a
correctness problem, but it might be slightly more clear and I don't
see a lot of downside.

* Also, I assume that the Max() call in
pg_atomic_monotonic_advance_u64() is just for clarity? Surely the
currval cannot be less than _target when it returns. I'd probably just
do Assert(currval >= _target) and then return currval.

Regards,
Jeff Davis





Re: LogwrtResult contended spinlock

2024-04-05 Thread Alvaro Herrera
Pushed 0001.  Here's the patch that adds the Copy position one more
time, with the monotonic_advance function returning the value.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
From 3f5c860576245b92701e7bfc517947c418c68510 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 5 Apr 2024 13:52:44 +0200
Subject: [PATCH v17] Add logCopyResult

Updated using monotonic increment
---
 src/backend/access/transam/xlog.c | 32 ++-
 src/include/port/atomics.h| 36 +++
 2 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b4499cda7b..a40c1fb79e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -469,6 +469,7 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastSegSwitchLSN;
 
 	/* These are accessed using atomics -- info_lck not needed */
+	pg_atomic_uint64 logCopyResult; /* last byte + 1 copied to WAL buffers */
 	pg_atomic_uint64 logWriteResult;	/* last byte + 1 written out */
 	pg_atomic_uint64 logFlushResult;	/* last byte + 1 flushed */
 
@@ -1499,6 +1500,7 @@ static XLogRecPtr
 WaitXLogInsertionsToFinish(XLogRecPtr upto)
 {
 	uint64		bytepos;
+	XLogRecPtr	copyptr;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
@@ -1507,6 +1509,11 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	if (MyProc == NULL)
 		elog(PANIC, "cannot wait without a PGPROC structure");
 
+	/* check if there's any work to do */
+	copyptr = pg_atomic_read_u64(&XLogCtl->logCopyResult);
+	if (upto <= copyptr)
+		return copyptr;
+
 	/* Read the current insert position */
 	SpinLockAcquire(&Insert->insertpos_lck);
 	bytepos = Insert->CurrBytePos;
@@ -1586,6 +1593,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto)
 			finishedUpto = insertingat;
 	}
+
+	finishedUpto =
		pg_atomic_monotonic_advance_u64(&XLogCtl->logCopyResult, finishedUpto);
+
 	return finishedUpto;
 }
 
@@ -1727,13 +1738,24 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 {
 	char	   *pdst = dstbuf;
 	XLogRecPtr	recptr = startptr;
+	XLogRecPtr	copyptr;
 	Size		nbytes = count;
 
 	if (RecoveryInProgress() || tli != GetWALInsertionTimeLine())
 		return 0;
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
-	Assert(startptr + count <= LogwrtResult.Write);
+
+	/*
+	 * Caller should ensure that the requested data has been copied to WAL
+	 * buffers before we try to read it.
+	 */
+	copyptr = pg_atomic_read_u64(&XLogCtl->logCopyResult);
+	if (startptr + count > copyptr)
+		ereport(ERROR,
+errmsg("request to read past end of generated WAL; requested %X/%X, current position %X/%X",
+	   LSN_FORMAT_ARGS(startptr + count),
+	   LSN_FORMAT_ARGS(copyptr)));
 
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
@@ -2571,13 +2593,19 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 	{
 		XLogRecPtr	Flush;
 		XLogRecPtr	Write;
+		XLogRecPtr	Copy;
 
 		Flush = pg_atomic_read_u64(&XLogCtl->logFlushResult);
 		pg_read_barrier();
 		Write = pg_atomic_read_u64(&XLogCtl->logWriteResult);
+		pg_read_barrier();
+		Copy = pg_atomic_read_u64(&XLogCtl->logCopyResult);
 
 		/* WAL written to disk is always ahead of WAL flushed */
 		Assert(Write >= Flush);
+
+		/* WAL copied to buffers is always ahead of WAL written */
+		Assert(Copy >= Write);
 	}
 #endif
 }
@@ -4951,6 +4979,7 @@ XLOGShmemInit(void)
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
+	pg_atomic_init_u64(&XLogCtl->logCopyResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
@@ -5979,6 +6008,7 @@ StartupXLOG(void)
 	 * because no other process can be reading or writing WAL yet.
 	 */
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+	pg_atomic_write_u64(&XLogCtl->logCopyResult, EndOfLog);
 	pg_atomic_write_u64(&XLogCtl->logWriteResult, EndOfLog);
 	pg_atomic_write_u64(&XLogCtl->logFlushResult, EndOfLog);
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index ff47782cdb..78987f3154 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -570,6 +570,42 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+/*
+ * Monotonically advance the given variable using only atomic operations until
+ * it's at least the target value.  Returns the latest value observed, which
+ * may or may not be the target value.
+ *
+ * Full barrier semantics (even when value is unchanged).
+ */
+static inline uint64
+pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
+{
+	uint64		currval;
+
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+
+	currval = pg_atomic_read_u64_impl(ptr);
+	if (currval >= 

Re: LogwrtResult contended spinlock

2024-04-05 Thread Alvaro Herrera
Couldn't push: I tested with --disable-atomics --disable-spinlocks and
the tests fail because the semaphore for the atomic variables is not
always initialized.  This is weird -- it's like a client process is
running at a time when StartupXLOG has not initialized the variable ...
so the initialization in the other place was not completely wrong.
I'll investigate after lunch.  Here's v16.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"XML!" Exclaimed C++.  "What are you doing here? You're not a programming
language."
"Tell that to the people who use me," said XML.
https://burningbird.net/the-parable-of-the-languages/
From 2164778b74681910544a64ba6c2ae36e5204ed9e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 5 Apr 2024 13:21:39 +0200
Subject: [PATCH v16 1/2] Make XLogCtl->log{Write,Flush}Result accessible with
 atomics

This removes the need to hold both the info_lck spinlock and
WALWriteLock to update them.  We use stock atomic write instead, with
WALWriteLock held.  Readers can use atomic read, without any locking.

This allows for some code to be reordered: some places were a bit
contorted to avoid repeated spinlock acquisition, but that's no longer a
concern, so we can turn them to more natural coding.  Some further
changes are likely possible with a positive performance impact, but in
this commit I did rather minimal ones only, to avoid increasing the
blast radius.

Reviewed-by: Bharath Rupireddy 
Reviewed-by: Jeff Davis 
Reviewed-by: Andres Freund  (earlier versions)
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 105 --
 1 file changed, 57 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3bdd9a2ddd..748ed1f2ef 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -292,12 +292,7 @@ static bool doPageWrites;
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
  * The positions already written/fsynced are maintained in logWriteResult
- * and logFlushResult.
- *
- * To read XLogCtl->logWriteResult or ->logFlushResult, you must hold either
- * info_lck or WALWriteLock.  To update them, you need to hold both locks.
- * The point of this arrangement is that the value can be examined by code
- * that already holds WALWriteLock without needing to grab info_lck as well.
+ * and logFlushResult using atomic access.
  * In addition to the shared variable, each backend has a private copy of
  * both in LogwrtResult, which is updated when convenient.
  *
@@ -473,12 +468,9 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogRecPtr	logWriteResult; /* last byte + 1 written out */
-	XLogRecPtr	logFlushResult; /* last byte + 1 flushed */
+	/* These are accessed using atomics -- info_lck not needed */
+	pg_atomic_uint64 logWriteResult;	/* last byte + 1 written out */
+	pg_atomic_uint64 logFlushResult;	/* last byte + 1 flushed */
 
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
@@ -616,11 +608,15 @@ static XLogwrtResult LogwrtResult = {0, 0};
 
 /*
  * Update local copy of shared XLogCtl->log{Write,Flush}Result
+ *
+ * It's critical that Flush always trails Write, so the order of the reads is
+ * important, as is the barrier.  See also XLogWrite.
  */
 #define RefreshXLogWriteResult(_target) \
 	do { \
-		_target.Write = XLogCtl->logWriteResult; \
-		_target.Flush = XLogCtl->logFlushResult; \
+		_target.Flush = pg_atomic_read_u64(&XLogCtl->logFlushResult); \
+		pg_read_barrier(); \
+		_target.Write = pg_atomic_read_u64(&XLogCtl->logWriteResult); \
 	} while (0)
 
 /*
@@ -968,9 +964,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
-		RefreshXLogWriteResult(LogwrtResult);
 		SpinLockRelease(&XLogCtl->info_lck);
+		RefreshXLogWriteResult(LogwrtResult);
 	}
 
 	/*
@@ -1989,17 +1984,17 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 			if (opportunistic)
 break;
 
-			/* Before waiting, get info_lck and update LogwrtResult */
+			/* Advance shared memory write request position */
 			SpinLockAcquire(&XLogCtl->info_lck);
 			if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
 XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
-			RefreshXLogWriteResult(LogwrtResult);
 			SpinLockRelease(&XLogCtl->info_lck);
 
 			/*
-			 * Now that we have an up-to-date LogwrtResult value, see if we
-			 * still need to write it or if someone else already did.
+			 * Acquire an up-to-date LogwrtResult value and see if we still
+			 * need to 

Re: LogwrtResult contended spinlock

2024-04-05 Thread Alvaro Herrera
On 2024-Apr-05, Bharath Rupireddy wrote:

> 1.
>  /*
>   * Update local copy of shared XLogCtl->log{Write,Flush}Result
> + *
> + * It's critical that Flush always trails Write, so the order of the reads is
> + * important, as is the barrier.
>   */
>  #define RefreshXLogWriteResult(_target) \
>  do { \
> -_target.Write = XLogCtl->logWriteResult; \
> -_target.Flush = XLogCtl->logFlushResult; \
> +_target.Flush = pg_atomic_read_u64(&XLogCtl->logFlushResult); \
> +pg_read_barrier(); \
> +_target.Write = pg_atomic_read_u64(&XLogCtl->logWriteResult); \
>  } while (0)
> 
> Is it "Flush always trails Write" or "Flush always leads Write"? I
> guess the latter, no?

Well, Flush cannot lead Write.  We cannot Flush what hasn't been
Written!  To me, "Flush trails Write" means that flush is behind.  That
seems supported by Merriam-Webster[1], which says in defining it as a
transitive verb

3 a : to follow upon the scent or trace of : TRACK
  b : to follow in the footsteps of : PURSUE
  c : to follow along behind
  d : to lag behind (someone, such as a competitor)

Maybe it's not super clear as a term.  We could turn it around and say "Write
always leads Flush".

[1] https://www.merriam-webster.com/dictionary/trail

The reason we have the barrier, and the reason we write and read them in
this order, is that we must never read a Flush value that is newer than
the Write value we read.  That is: the values are written in pairs
(w1,f1) first, (w2,f2) next, and so on.  It's okay if we obtain (w2,f1)
(new write, old flush), but it's not okay if we obtain (w1,f2) (old
write, new flush).  If we did, we might end up with a Flush value that's
ahead of Write, violating the invariant.  

> 2.
> +
> +pg_atomic_write_u64(&XLogCtl->logWriteResult, LogwrtResult.Write);
> +pg_write_barrier();
> +pg_atomic_write_u64(&XLogCtl->logFlushResult, LogwrtResult.Flush);
>  }
> 
> Maybe add the reason as to why we had to write logWriteResult first
> and then logFlushResult, similar to the comment atop
> RefreshXLogWriteResult?

Hmm, I had one there and removed it.  I'll put something back.

> > For 0002, did you consider having pg_atomic_monotonic_advance_u64()
> > return the currval?
> 
> +1 for returning currval here.

Oh yeah, I had that when the monotonic stuff was used by 0001 but lost
it when I moved to 0002.  I'll push 0001 now and send an updated 0002
with the return value added.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)




Re: LogwrtResult contended spinlock

2024-04-04 Thread Bharath Rupireddy
On Fri, Apr 5, 2024 at 4:21 AM Jeff Davis  wrote:
>
> > 4. If we're not modifying any callers of WALReadFromBuffers(), then
> > AFAICS the added check that the request is not past the Copy pointer
> > is
> > useless.  In a quick look at that code, I think we only try to read
> > data
> > that's been flushed, not written, so the stricter check that we don't
> > read data that hasn't been Copied does nothing.
>
> Bharath has indicated that he may call WALReadFromBuffers() in an
> extension, so I believe some error checking is appropriate there.
>
> >   (Honestly I'm not sure
> > that we want XLogSendPhysical to be reading data that has not been
> > written, or do we?)
>
> Not yet, but there has been some discussion[1][2] about future work to
> allow replicating data before it's been flushed locally.

Right. Although callers of WALReadFromBuffers() in core postgres don't
need it now (in the future they will), it allows one to write
something up externally - for example, see the 0004 patch in
https://www.postgresql.org/message-id/CALj2ACWmW+1ZdYE0BD5-4KVLzYGn3=pudomwphuhmi4yquo...@mail.gmail.com
where I implemented an xlogreader page_read callback that just keeps
reading the WAL that's fully copied to WAL buffers.

> Regarding the patches themselves, 0001 looks good to me.

A few comments on 0001:

1.
 /*
  * Update local copy of shared XLogCtl->log{Write,Flush}Result
+ *
+ * It's critical that Flush always trails Write, so the order of the reads is
+ * important, as is the barrier.
  */
 #define RefreshXLogWriteResult(_target) \
 do { \
-_target.Write = XLogCtl->logWriteResult; \
-_target.Flush = XLogCtl->logFlushResult; \
+_target.Flush = pg_atomic_read_u64(&XLogCtl->logFlushResult); \
+pg_read_barrier(); \
+_target.Write = pg_atomic_read_u64(&XLogCtl->logWriteResult); \
 } while (0)

Is it "Flush always trails Write" or "Flush always leads Write"? I
guess the latter, no?

2.
+
+pg_atomic_write_u64(&XLogCtl->logWriteResult, LogwrtResult.Write);
+pg_write_barrier();
+pg_atomic_write_u64(&XLogCtl->logFlushResult, LogwrtResult.Flush);
 }

Maybe add the reason as to why we had to write logWriteResult first
and then logFlushResult, similar to the comment atop
RefreshXLogWriteResult?

> For 0002, did you consider having pg_atomic_monotonic_advance_u64()
> return the currval?

+1 for returning currval here.

Except for the above comments, the patches look good to me. I've also
run the test loop to check for any assertion failure:

for i in {1..100}; do
  make check PROVE_TESTS="t/027_stream_regress.pl"
  if [ $? -ne 0 ]; then echo "The command failed on iteration $i"; break; fi
done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: LogwrtResult contended spinlock

2024-04-04 Thread Jeff Davis
On Thu, 2024-04-04 at 19:45 +0200, Alvaro Herrera wrote:
> 1. Using pg_atomic_write_membarrier_u64 is useless and it imposes
> more
> barriers than we actually need.  So I switched back to
> pg_atomic_write_u64 and add one barrier between the two writes.  Same
> for reads.

+1.

This looks correct to me. Just before the writes there's a spinlock,
which acts as a full barrier; and just afterwards, the function returns
and the WALWriteLock is released, again acting as a full barrier. The
write barrier in between enforces the Write >= Flush invariant.

> 2. Using monotonic_advance for Write and Flush is useless.

+1.

> 3. Testing the invariant that the Copy pointer cannot be 0 is
> useless,
> because we initialize that pointer to EndOfLog during StartupXLOG.
> So, removed.

+1.

> 4. If we're not modifying any callers of WALReadFromBuffers(), then
> AFAICS the added check that the request is not past the Copy pointer
> is
> useless.  In a quick look at that code, I think we only try to read
> data
> that's been flushed, not written, so the stricter check that we don't
> read data that hasn't been Copied does nothing.

Bharath has indicated that he may call WALReadFromBuffers() in an
extension, so I believe some error checking is appropriate there.

>   (Honestly I'm not sure
> that we want XLogSendPhysical to be reading data that has not been
> written, or do we?)

Not yet, but there has been some discussion[1][2] about future work to
allow replicating data before it's been flushed locally.

>   Please argue why we need this patch.

I'm not sure what you mean by "this patch"?

> 5. The existing weird useless-looking block at the end of XLogWrite
> is
> there because we used to have it to declare a volatile pointer to
> XLogCtl (cf.  commit 6ba4ecbf477e).  That's no longer needed, so we
> could remove it.  Or we could leave it alone (just because it's
> ancient
> and it doesn't hurt anything), but there's no reason to have the new
> invariant-testing block inside the same block.  So I added another
> weird
> useless-looking block, except that this one does have two variable
> declaration at its top.

That didn't bother me, but it could be cleaned up a bit in a later
patch.

> 6. In a few places, we read both Write and Flush to only use one of
> them.  This is wasteful and we could dial this back to reading only
> the
> one we need.  Andres suggested as much in [1].  I didn't touch this
> in
> the current patch, and I don't necessarily think we need to address
> it
> right now.  Addressing this should probably done similar to what I
> posted in [2]'s 0002.

I agree that it should be a separate patch. I haven't thought about the
consequences of making them fully independent -- I think that means we
give up the invariant that Copy >= Write >= Flush?


Regarding the patches themselves, 0001 looks good to me.

For 0002, did you consider having pg_atomic_monotonic_advance_u64()
return the currval?

Regards,
Jeff Davis

[1]
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/20230125211540.zylu74dj2uuh3k7w%40awork3.anarazel.de
[3]
https://www.postgresql.org/message-id/CALj2ACW65mqn6Ukv57SqDTMzAJgd1N_AdQtDgy%2BgMDqu6v618Q%40mail.gmail.com





Re: LogwrtResult contended spinlock

2024-04-04 Thread Alvaro Herrera
I've noticed a few things here, v16 attached with some rather largish
changes.

1. Using pg_atomic_write_membarrier_u64 is useless and it imposes more
barriers than we actually need.  So I switched back to
pg_atomic_write_u64 and add one barrier between the two writes.  Same
for reads.

2. Using monotonic_advance for Write and Flush is useless.  We can use a
simple atomic_write with a write barrier in between.  The reason is
that, as Andres said[1], we only write those with WALWriteLock held, so
it's not possible for them to move forward while we aren't looking.  All
callers of XLogWrite do RefreshXLogWriteResult() with the WALWriteLock
held.  Therefore we can just use pg_atomic_write_u64.  Consequently I
moved the addition of the monotonic advance function to the patch that
adds Copy.

3. Testing the invariant that the Copy pointer cannot be 0 is useless,
because we initialize that pointer to EndOfLog during StartupXLOG.
So, removed.

4. If we're not modifying any callers of WALReadFromBuffers(), then
AFAICS the added check that the request is not past the Copy pointer is
useless.  In a quick look at that code, I think we only try to read data
that's been flushed, not written, so the stricter check that we don't
read data that hasn't been Copied does nothing.  (Honestly I'm not sure
that we want XLogSendPhysical to be reading data that has not been
written, or do we?)  Please argue why we need this patch.

5. The existing weird useless-looking block at the end of XLogWrite is
there because we used to have it to declare a volatile pointer to
XLogCtl (cf.  commit 6ba4ecbf477e).  That's no longer needed, so we
could remove it.  Or we could leave it alone (just because it's ancient
and it doesn't hurt anything), but there's no reason to have the new
invariant-testing block inside the same block.  So I added another weird
useless-looking block, except that this one does have two variable
declarations at its top.

6. In a few places, we read both Write and Flush to only use one of
them.  This is wasteful and we could dial this back to reading only the
one we need.  Andres suggested as much in [1].  I didn't touch this in
the current patch, and I don't necessarily think we need to address it
right now.  Addressing this should probably be done similarly to what I
posted in [2]'s 0002.

[1] https://postgr.es/m/20210130023011.n545o54j65t4k...@alap3.anarazel.de
[2] https://postgr.es/m/202203221611.hqbjdinzsbu2@alvherre.pgsql

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
>From f791c7ee9815871a1843116febffc0077d29a9a3 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 3 Apr 2024 11:48:27 +0200
Subject: [PATCH v16 1/2] Make XLogCtl->log{Write,Flush}Result accessible with
 atomics

This removes the need to hold both the info_lck spinlock and
WALWriteLock to update them.
---
 src/backend/access/transam/xlog.c | 91 +--
 1 file changed, 49 insertions(+), 42 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f26533ec01..9b68afd400 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -292,12 +292,7 @@ static bool doPageWrites;
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
  * The positions already written/fsynced are maintained in logWriteResult
- * and logFlushResult.
- *
- * To read XLogCtl->logWriteResult or ->logFlushResult, you must hold either
- * info_lck or WALWriteLock.  To update them, you need to hold both locks.
- * The point of this arrangement is that the value can be examined by code
- * that already holds WALWriteLock without needing to grab info_lck as well.
+ * and logFlushResult using atomic access.
  * In addition to the shared variable, each backend has a private copy of
  * both in LogwrtResult, which is updated when convenient.
  *
@@ -473,12 +468,9 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogRecPtr	logWriteResult; /* last byte + 1 written out */
-	XLogRecPtr	logFlushResult; /* last byte + 1 flushed */
+	/* These are accessed using atomics -- info_lck not needed */
+	pg_atomic_uint64 logWriteResult;	/* last byte + 1 written out */
+	pg_atomic_uint64 logFlushResult;	/* last byte + 1 flushed */
 
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
@@ -616,11 +608,15 @@ static XLogwrtResult LogwrtResult = {0, 0};
 
 /*
  * Update local copy of shared XLogCtl->log{Write,Flush}Result
+ *
+ * It's critical that Flush always trails Write, so the order of the reads is
+ * important, as is the barrier.
  */
 #define RefreshXLogWriteResult(_target) \
 	do { \
-		_target.Write = XLogCtl->logWriteResult; \
-		_target.Flush = XLogCtl->logFlushResult; \
+		

Re: LogwrtResult contended spinlock

2024-04-03 Thread Jeff Davis
On Wed, 2024-04-03 at 13:19 +0200, Alvaro Herrera wrote:
> So what I do in the attached 0001 is stop using the XLogwrtResult
> struct
> in XLogCtl and replace it with separate Write and Flush values, and
> add
> the macro XLogUpdateLocalLogwrtResult() that copies the values of
> Write
> and Flush from the shared XLogCtl to the local variable given as
> macro
> argument.

+1

> 0002 then adds pg_atomic_monotonic_advance_u64.  (I don't add the
> _u32
> variant, because I don't think it's a great idea to add dead code. 
> If
> later we see a need for it we can put it in.)  It also changes the
> two
> new members to be atomics, changes the macro to use atomic read, and
> XLogWrite now uses monotonic increment.  A couple of other places can
> move the macro calls to occur outside the spinlock.  Also, XLogWrite
> gains the invariant checking that involves Write and Flush.

To maintain the invariant that Write >= Flush, I believe you need to
always store to Write first, then Flush; and load from Flush first,
then from Write. So RefreshXLogWriteResult should load Flush then load
Write, and the same for the Assert code. And in 0003, loading from Copy
should happen last.

Also, should pg_atomic_monotonic_advance_u64() return currval? I don't
think it's important for the current patches, but I could imagine a
caller wanting the most up-to-date value possible, even if it's beyond
what the caller requested. Returning void seems slightly wasteful.
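For illustration, a currval-returning monotonic advance could look like
the sketch below, using C11 atomics in place of PostgreSQL's pg_atomic
API (the function name and signature are illustrative, not the actual
patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Advance *ptr to at least 'target' and return the value the variable
 * holds afterwards, which may exceed 'target' if another process
 * already advanced it further.
 */
static uint64_t
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
	uint64_t	currval = atomic_load(ptr);

	while (currval < target)
	{
		/* on failure, the CAS reloads currval with the latest value */
		if (atomic_compare_exchange_weak(ptr, &currval, target))
			return target;
	}
	return currval;				/* already at or past target */
}
```

A caller wanting the most up-to-date value could then just use the
return value instead of issuing a separate atomic read.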

Other than that, it looks good to me.

> Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and
> Flush, and updates WALReadFromBuffers to test that instead of the
> Write
> pointer, and adds in XLogWrite the invariant checks that involve the
> Copy pointer.

Looks good to me.

Regards,
Jeff Davis





Re: LogwrtResult contended spinlock

2024-04-03 Thread Alvaro Herrera
On 2024-Apr-03, Bharath Rupireddy wrote:

> On Wed, Apr 3, 2024 at 4:49 PM Alvaro Herrera  wrote:

> > So what I do in the attached 0001 is stop using the XLogwrtResult struct
> > in XLogCtl and replace it with separate Write and Flush values, and add
> > the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
> > and Flush from the shared XLogCtl to the local variable given as macro
> > argument.  (I also added our idiomatic do {} while(0) to the macro
> > definition, for safety).  The new members are XLogCtl->logWriteResult
> > and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
> > essentially identical semantics as the previous code.  No atomic access
> > yet!
> 
> +1.

> Also, I'm fine with renaming XLogUpdateLocalLogwrtResult() to
> RefreshXLogWriteResult().

Okay, I have pushed 0001 with the name change, will see about getting
the others in tomorrow.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: LogwrtResult contended spinlock

2024-04-03 Thread Bharath Rupireddy
On Wed, Apr 3, 2024 at 4:49 PM Alvaro Herrera  wrote:
>
> Thanks for keeping this moving forward.  I gave your proposed patches a
> look.   One thing I didn't like much is that we're adding a new member
> (Copy) to XLogwrtAtomic -- but this struct is supposed to be a mirror of
> XLogwrtResult for use with atomic access.  Since this new member is not
> added to XLogwrtResult (because it's not needed there), the whole idea
> of there being symmetry between those two structs crumbles down.
> Because we later stop using struct-assign anyway, meaning we no longer
> need the structs to match, we can instead spell out the members in
> XLogCtl and call it a day.

Hm, I have no objection to having separate variables in XLogCtl.

> So what I do in the attached 0001 is stop using the XLogwrtResult struct
> in XLogCtl and replace it with separate Write and Flush values, and add
> the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
> and Flush from the shared XLogCtl to the local variable given as macro
> argument.  (I also added our idiomatic do {} while(0) to the macro
> definition, for safety).  The new members are XLogCtl->logWriteResult
> and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
> essentially identical semantics as the previous code.  No atomic access
> yet!

+1.

> 0002 then adds pg_atomic_monotonic_advance_u64.  (I don't add the _u32
> variant, because I don't think it's a great idea to add dead code.  If
> later we see a need for it we can put it in.)  It also changes the two
> new members to be atomics, changes the macro to use atomic read, and
> XLogWrite now uses monotonic increment.  A couple of other places can
> move the macro calls to occur outside the spinlock.  Also, XLogWrite
> gains the invariant checking that involves Write and Flush.

I'm fine with not having the 32 bit variant of
pg_atomic_monotonic_advance. However, a recent commit bd5132db5 added
both 32 and 64 bit versions of pg_atomic_read_membarrier even though
32 bit isn't being used.

> Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and
> Flush, and updates WALReadFromBuffers to test that instead of the Write
> pointer, and adds in XLogWrite the invariant checks that involve the
> Copy pointer.

+1.

The attached patches look good to me.

Also, I'm fine with renaming XLogUpdateLocalLogwrtResult() to
RefreshXLogWriteResult().

> I haven't rerun Bharath test loop yet; will do so shortly.

I ran it 100 times [1] on top of all 3 patches; it looks fine.

[1] for i in {1..100}; do make check
PROVE_TESTS="t/027_stream_regress.pl"; if [ $? -ne 0 ]; then echo "The
command failed on iteration $i"; break; fi; done

./configure --prefix=$PWD/pg17/ --enable-debug --enable-tap-tests
--enable-cassert CC=/usr/bin/clang-14 > install.log && make -j 8
install > install.log 2>&1 &

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: LogwrtResult contended spinlock

2024-04-03 Thread Alvaro Herrera
On 2024-Apr-03, Alvaro Herrera wrote:

> So what I do in the attached 0001 is stop using the XLogwrtResult struct
> in XLogCtl and replace it with separate Write and Flush values, and add
> the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
> and Flush from the shared XLogCtl to the local variable given as macro
> argument.  (I also added our idiomatic do {} while(0) to the macro
> definition, for safety).  The new members are XLogCtl->logWriteResult
> and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
> essentially identical semantics as the previous code.  No atomic access
> yet!

BTW I forgot.  I didn't like the name XLogUpdateLocalLogwrtResult() name
much.  What do you think about RefreshXLogWriteResult() instead?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it." (ncm, http://lwn.net/Articles/174769/)




Re: LogwrtResult contended spinlock

2024-04-03 Thread Alvaro Herrera
Thanks for keeping this moving forward.  I gave your proposed patches a
look.   One thing I didn't like much is that we're adding a new member
(Copy) to XLogwrtAtomic -- but this struct is supposed to be a mirror of
XLogwrtResult for use with atomic access.  Since this new member is not
added to XLogwrtResult (because it's not needed there), the whole idea
of there being symmetry between those two structs crumbles down.
Because we later stop using struct-assign anyway, meaning we no longer
need the structs to match, we can instead spell out the members in
XLogCtl and call it a day.

So what I do in the attached 0001 is stop using the XLogwrtResult struct
in XLogCtl and replace it with separate Write and Flush values, and add
the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
and Flush from the shared XLogCtl to the local variable given as macro
argument.  (I also added our idiomatic do {} while(0) to the macro
definition, for safety).  The new members are XLogCtl->logWriteResult
and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
essentially identical semantics as the previous code.  No atomic access
yet!

0002 then adds pg_atomic_monotonic_advance_u64.  (I don't add the _u32
variant, because I don't think it's a great idea to add dead code.  If
later we see a need for it we can put it in.)  It also changes the two
new members to be atomics, changes the macro to use atomic read, and
XLogWrite now uses monotonic increment.  A couple of other places can
move the macro calls to occur outside the spinlock.  Also, XLogWrite
gains the invariant checking that involves Write and Flush.

Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and
Flush, and updates WALReadFromBuffers to test that instead of the Write
pointer, and adds in XLogWrite the invariant checks that involve the
Copy pointer.

I haven't rerun Bharath test loop yet; will do so shortly.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick."  (Andrew Sullivan)
https://postgr.es/m/20050809113420.gd2...@phlogiston.dyndns.org
>From e6ddbe87d598c6a1090e39845a4a308060f0af1e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 3 Apr 2024 11:23:12 +0200
Subject: [PATCH v15 1/3] split XLogCtl->LogwrtResult into separate struct
 members

---
 src/backend/access/transam/xlog.c | 59 ++-
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87ea03954b..6213d99561 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -291,16 +291,15 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * The positions already written/fsynced are maintained in logWriteResult
+ * and logFlushResult.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * To read XLogCtl->logWriteResult or ->logFlushResult, you must hold either
+ * info_lck or WALWriteLock.  To update them, you need to hold both locks.
+ * The point of this arrangement is that the value can be examined by code
+ * that already holds WALWriteLock without needing to grab info_lck as well.
+ * In addition to the shared variable, each backend has a private copy of
+ * both in LogwrtResult, which is updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -478,7 +477,8 @@ typedef struct XLogCtlData
 	 * Protected by info_lck and WALWriteLock (you must hold either lock to
 	 * read it, but both to update)
 	 */
-	XLogwrtResult LogwrtResult;
+	XLogRecPtr	logWriteResult; /* last byte + 1 written out */
+	XLogRecPtr	logFlushResult; /* last byte + 1 flushed */
 
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
@@ -614,6 +614,15 @@ static int	UsableBytesInSegment;
  */
 static XLogwrtResult LogwrtResult = {0, 0};
 
+/*
+ * Update local copy of shared XLogCtl->log{Write,Flush}Result
+ */
+#define XLogUpdateLocalLogwrtResult(_target) \
+	do { \
+		_target.Write = 

Re: LogwrtResult contended spinlock

2024-03-14 Thread Bharath Rupireddy
On Mon, Mar 4, 2024 at 9:15 PM Bharath Rupireddy
 wrote:
>
> > 0003:
> >
> > * We need to maintain the invariant that Copy >= Write >= Flush. I
> > believe that's always satisfied, because the
> > XLogWaitInsertionsToFinish() is always called before XLogWrite(). But
> > we should add an assert or runtime check of this invariant somewhere.
>
> Yes, that invariant is already maintained by the server. Although I
> don't fully agree, I added an assertion to WaitXLogInsertionsToFinish
> after updating XLogCtl->LogwrtResult.Copy. CF bot is happy with it -
> https://github.com/BRupireddy2/postgres/tree/atomic_LogwrtResult_v13.

I've now separated these invariants out into the 0004 patch.

With the assertions placed in WaitXLogInsertionsToFinish after
updating Copy ptr, I observed the assertion failing in one of the CF
bot machines - https://cirrus-ci.com/build/6202112288227328. I could
reproduce it locally with [1]. I guess the reason is that the Write
and Flush ptrs are now updated independently and atomically without
lock, they might drift and become out-of-order for a while if
concurrently they are accessed in WaitXLogInsertionsToFinish. So, I
guess the right place to verify the invariant Copy >= Write >= Flush
is in XLogWrite once Write and Flush ptrs in shared memory are updated
(note that only one process at a time can do this). Accordingly, I've
moved the assertions to XLogWrite in the attached v14-0004 patch.

> Please see the attached v13 patch set for further review.

Earlier versions of the patches removed a piece of code ensuring the
shared WAL 'request' values did not fall behind the 'result' values.
There's a good reason for us to have it. So, I restored it.

-   /*
-    * Update shared-memory status
-    *
-    * We make sure that the shared 'request' values do not fall behind the
-    * 'result' values.  This is not absolutely essential, but it saves some
-    * code in a couple of places.
-    */

Please see the attached v14 patch set.

[1] for i in {1..100}; do make check
PROVE_TESTS="t/027_stream_regress.pl"; if [ $? -ne 0 ]; then echo "The
command failed on iteration $i"; break; fi; done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v14-0001-Add-monotonic-advancement-functions-for-atomics.patch
Description: Binary data


v14-0002-Make-XLogCtl-LogwrtResult-accessible-with-atomic.patch
Description: Binary data


v14-0003-Add-Copy-pointer-to-track-data-copied-to-WAL-buf.patch
Description: Binary data


v14-0004-Add-invariants-for-shared-LogwrtResult-members.patch
Description: Binary data


Re: LogwrtResult contended spinlock

2024-03-04 Thread Bharath Rupireddy
Thanks for looking into this.

On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis  wrote:
>
> > 3.
> > @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI)
> >  If at all, a read
> > barrier is warranted here, we can use atomic read with full barrier
>
> I don't think we need a full barrier but I'm fine with using
> pg_atomic_read_membarrier_u64() if it's better for whatever reason.

For the sake of clarity and correctness, I've used
pg_atomic_read_membarrier_u64 everywhere for reading
XLogCtl->LogwrtResult.Write and XLogCtl->LogwrtResult.Flush.

> > 5. I guess we'd better use pg_atomic_read_u64_impl and
> > pg_atomic_compare_exchange_u64_impl in
> > pg_atomic_monotonic_advance_u64
> > to reduce one level of function indirections.
>
> Aren't they inlined?

Yes, all of them are inlined. But it seems to be a convention that the
XXX_impl functions are used when implementing the exposed functions.
Therefore, having pg_atomic_read_u64_impl and
pg_atomic_compare_exchange_u64_impl doesn't sound bad IMV.

> > 6.
> > + * Full barrier semantics (even when value is unchanged).
> >
> > +currval = pg_atomic_read_u64(ptr);
> > +if (currval >= target_)
> > +{
> > +pg_memory_barrier();
>
> I don't think they are exactly equivalent: in the current patch, the
> first pg_atomic_read_u64() could be reordered with earlier reads;
> whereas with pg_atomic_read_membarrier_u64() it
> could not be. I'm not sure whether that could create a performance
> problem or not.

I left it as-is.

> > 9.
> > +copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy);
> > +if (startptr + count > copyptr)
> > +ereport(WARNING,
> > +(errmsg("request to read past end of generated WAL;
> > request %X/%X, current position %X/%X",
> > +LSN_FORMAT_ARGS(startptr + count),
> > LSN_FORMAT_ARGS(copyptr))));
> >
> > Any specific reason for this to be a WARNING rather than an ERROR?
>
> Good question. WaitXLogInsertionsToFinish() uses a LOG level message
> for the same situation. They should probably be the same log level, and
> I would think it would be either PANIC or WARNING. I have no idea why
> LOG was chosen.

WaitXLogInsertionsToFinish clamps 'upto' after emitting that LOG so
that it never waits past the current insert position, even if a caller
asks for WAL that doesn't yet exist. And the comment there says "Here
we just assume that to mean that all WAL that has been reserved needs
to be finished."

In contrast, WALReadFromBuffers effectively requires callers to do
WaitXLogInsertionsToFinish first (IOW, it asks callers to request only
WAL that exists on the server). Therefore, an ERROR seems a reasonable
choice to me, if PANIC sounds too strong given that it affects all the
postgres processes.

FWIW, a PANIC when requested to flush past the end of WAL in
WaitXLogInsertionsToFinish instead of LOG seems to be good. CF bot
animals don't complain -
https://github.com/BRupireddy2/postgres/tree/be_harsh_when_request_to_flush_past_end_of_WAL_WIP.

> 0001:
>
> * The comments on the two versions of the functions are redundant, and
> the style in that header seems to be to omit the comment from the u64
> version.

Removed comments atop 64-bit version.

> * I'm not sure the AssertPointerAlignment is needed in the u32 version?

I borrowed them from pg_atomic_read_u32 and
pg_atomic_compare_exchange_u32, which assert alignment the same way
before calling the XXX_impl versions. I don't see any problem with them.

> 0002:
>
> * All updates to the non-shared LogwrtResult should update both values.
> It's confusing to update those local values independently, because it
> violates the invariant that LogwrtResult.Flush <= LogwrtResult.Write.
>
> > 2. I guess we need to update both the Write and Flush local copies in
> > AdvanceXLInsertBuffer.
>
> I agree. Whenever we update the non-shared LogwrtResult, let's update
> the whole thing. Otherwise it's confusing.

Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult
using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs.

> * pg_memory_barrier() is not needed right before a spinlock

Got rid of it as we read both Flush and Write local copies with
pg_atomic_read_membarrier_u64.

> * As mentioned above, I think GetFlushRecPtr() needs two read barriers.
> Also, I think the same for GetXLogWriteRecPtr().

Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult
using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs.

> * In general, for any place using both Write and Flush, I think Flush
> should be loaded first, followed by a read barrier, followed by a load
> of the Write pointer.

Why read Flush first rather than Write? I think it's enough to do
{read Write,  read barrier, read Flush}. This works because Write is
monotonically advanced first before Flush using full barriers and we
don't get reordering issues between the readers and writers no? Am I
missing anything here?

> And I think in most of those places there should
> be 

Re: LogwrtResult contended spinlock

2024-02-22 Thread Robert Haas
On Fri, Feb 23, 2024 at 1:18 AM Jeff Davis  wrote:
> I don't see the global non-shared variable as a huge problem, so if it
> serves a purpose then I'm fine keeping it. Perhaps we could make it a
> bit safer by using some wrapper functions.

I actually really hate these kinds of variables. I think they likely
mask various bugs (where the value might not be up to date where we
think it is) and anti-bugs (where we actually rely on the value being
out of date but we don't know that we do because the code is so
opaque). My vintage 2021 adventures in getting rid of the global
variable ThisTimeLineID and a few other global variables found some
actual but minor bugs, also found a bunch of confused code that didn't
really do what it thought it did, and took up a huge amount of my time
trying to analyze what was happening. I'm not prepared to recommend
what we should do in this particular case. I would like to believe
that the local copies can be eliminated somehow, but I haven't studied
the code or done benchmarking so I don't really know enough to guess
what sort of code changes would or would not be good enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: LogwrtResult contended spinlock

2024-02-22 Thread Jeff Davis
On Thu, 2024-02-22 at 10:17 +0530, Robert Haas wrote:
> I think the problems tend to be worst when you have some bit of data
> that's being frequently modified by multiple backends. Every backend
> that wants to modify the value needs to steal the cache line, and
> eventually you spend most of your CPU time stealing cache lines from
> other sockets and not much of it doing any actual work. If you have a
> value that's just being read by a lot of backends without
> modification, I think the cache line can be shared in read only mode
> by all the CPUs and it's not too bad.

That makes sense. I guess they'd be on the same cache line as well,
which means a write to either will invalidate both.

Some places (XLogWrite, XLogInsertRecord, XLogSetAsyncXactLSN,
GetFlushRecPtr, GetXLogWriteRecPtr) already update it from shared
memory anyway, so those are non-issues.

The potential problem areas (unless I missed something) are:

  * AdvanceXLInsertBuffer reads it as an early check to see if a buffer
is already written out.

  * XLogFlush / XLogNeedsFlush use it for an early return

  * WALReadFromBuffers reads it for an error check (currently an
Assert, but we might want to make that an elog).

  * We are discussing adding a Copy pointer, which would be advanced by
WaitXLogInsertionsToFinish(), and if we do replication-before-flush, we
may want to be more eager about advancing it. That could cause an
issue, as well.

I don't see the global non-shared variable as a huge problem, so if it
serves a purpose then I'm fine keeping it. Perhaps we could make it a
bit safer by using some wrapper functions. Something like:

  bool
  IsWriteRecPtrAtLeast(XLogRecPtr recptr)
  {
 XLogRecPtr writeptr;
 if (LogwrtResult.Write >= recptr)
 return true;
 writeptr = GetXLogWriteRecPtr();
 return (writeptr >= recptr);
  }

That would reduce the number of direct references to LogwrtResult,
callers would never see a stale value, and would avoid the cache miss
problem that you're concerned about. Not sure if they'd need to be
inline or not.

Regards,
Jeff Davis





Re: LogwrtResult contended spinlock

2024-02-21 Thread Robert Haas
On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis  wrote:
> On Wed, 2024-02-21 at 20:00 +0530, Bharath Rupireddy wrote:
> > The local copy of LogwrtResult is so frequently used in the backends,
> > if we were to replace it with atomic accesses, won't the atomic reads
> > be costly and start showing up in perf profiles?
>
> I don't see exactly where the extra cost would come from. What is your
> concern?
>
> In functions that access it several times, it may make sense to copy it
> to a function-local variable, of course. But having a global non-shared
> LogwrtResult doesn't seem particularly useful to me.

I would love to get rid of the global variable; all of our code relies
too much on global variables, but the xlog code is some of the worst.
The logic is complex and difficult to reason about.

But I do think that there's room for concern about performance, too. I
don't know if there is an actual problem, but there could be a
problem. My impression is that accessing shared memory isn't
intrinsically more costly than accessing backend-private memory, but
accessing heavily contended memory can definitely be more expensive
than accessing uncontended memory -- and not just slightly more
expensive, but WAY more expensive.

I think the problems tend to be worst when you have some bit of data
that's being frequently modified by multiple backends. Every backend
that wants to modify the value needs to steal the cache line, and
eventually you spend most of your CPU time stealing cache lines from
other sockets and not much of it doing any actual work. If you have a
value that's just being read by a lot of backends without
modification, I think the cache line can be shared in read only mode
by all the CPUs and it's not too bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: LogwrtResult contended spinlock

2024-02-21 Thread Jeff Davis
On Wed, 2024-02-21 at 20:00 +0530, Bharath Rupireddy wrote:
> The local copy of LogwrtResult is so frequently used in the backends,
> if we were to replace it with atomic accesses, won't the atomic reads
> be costly and start showing up in perf profiles?

I don't see exactly where the extra cost would come from. What is your
concern?

In functions that access it several times, it may make sense to copy it
to a function-local variable, of course. But having a global non-shared
LogwrtResult doesn't seem particularly useful to me.

>  In any case, I prefer
> discussing this separately once we get to a conclusion on converting
> shared memory Write and Flush ptrs to atomics.

I agree that it shouldn't block the earlier patches, but it seems on-
topic for this thread.

> 
> 1. I guess we need to initialize the new atomics with
> pg_atomic_init_u64 initially in XLOGShmemInit:

Thank you.

> 2. I guess we need to update both the Write and Flush local copies in
> AdvanceXLInsertBuffer.

I agree. Whenever we update the non-shared LogwrtResult, let's update
the whole thing. Otherwise it's confusing.

> 3.
> @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI)
>  {
>  Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE);
> 
> +    pg_read_barrier();
>  LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
> +    LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
> 
> Do we need a read barrier here to not reorder things when more than
> one process is accessing the flush and write ptrs?

The above seems wrong: we don't want to reorder a load of Write before
a load of Flush, otherwise it could look like Write < Flush.
(Similarly, we never want to reorder a store of Flush before a store of
Write.) So I think we should have another read barrier between those
two atomic reads.

I also think we need the read barrier at the beginning because the
caller doesn't expect a stale value of the Flush pointer. It might not
be strictly required for correctness, because I don't think that stale
value can be inconsistent with anything else, but getting an up-to-date
value is better.
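To spell out the ordering being argued for: the writer stores Write
before Flush with a write barrier in between, and the reader loads them
in the mirror order. A sketch with C11 fences standing in for
pg_write_barrier/pg_read_barrier (the shared variables are hypothetical
stand-ins for XLogCtl's atomics):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct { uint64_t Write; uint64_t Flush; } WalPtrs;

static _Atomic uint64_t sharedWrite;
static _Atomic uint64_t sharedFlush;

/* Writer: store Write first, barrier, then Flush, so a reader never
 * observes Flush > Write. */
static void
publish(uint64_t write, uint64_t flush)
{
	atomic_store_explicit(&sharedWrite, write, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&sharedFlush, flush, memory_order_relaxed);
}

/* Reader: load Flush first, barrier, then Write -- the mirror order,
 * so the Write value seen is at least as new as the Flush value. */
static WalPtrs
refresh(void)
{
	WalPtrs		r;

	r.Flush = atomic_load_explicit(&sharedFlush, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	r.Write = atomic_load_explicit(&sharedWrite, memory_order_relaxed);
	return r;
}
```

Loading in the other order (Write before Flush) would allow the reader
to pair an old Write with a newer Flush and conclude Write < Flush,
violating the invariant.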

>  If at all, a read
> barrier is warranted here, we can use atomic read with full barrier

I don't think we need a full barrier but I'm fine with using
pg_atomic_read_membarrier_u64() if it's better for whatever reason.


> +    pg_atomic_write_u64(&XLogCtl->LogwrtResult.Write, EndOfLog);
> +    pg_atomic_write_u64(&XLogCtl->LogwrtResult.Flush, EndOfLog);
>  XLogCtl->LogwrtRqst.Write = EndOfLog;
> 
> pg_atomic_write_u64 here seems fine to me as no other process is
> actively writing WAL yet; otherwise, we need a write with full barrier

I don't see any memory ordering hazard. In any case, there's an
LWLockAcquire a few lines later, which acts as a full barrier.

> something like pg_write_barrier + pg_atomic_write_u64 or
> pg_atomic_exchange_u64 like in LWLockUpdateVar or use atomic write
> with full barrier semantics as proposed here (when that's in) -
> https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13
> .
> 
> 5. I guess we'd better use pg_atomic_read_u64_impl and
> pg_atomic_compare_exchange_u64_impl in
> pg_atomic_monotonic_advance_u64
> to reduce one level of function indirection.

Aren't they inlined?

> 6.
> + * Full barrier semantics (even when value is unchanged).
> 
> +    currval = pg_atomic_read_u64(ptr);
> +    if (currval >= target_)
> +    {
> +    pg_memory_barrier();
> 
> Perhaps, we can use atomic read with full barrier sematices proposed
> here -
> https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13

I don't think they are exactly equivalent: in the current patch, the
first pg_atomic_read_u64() could be reordered with earlier reads,
whereas with pg_atomic_read_membarrier_u64() it could not be. I'm not
sure whether that difference could create a performance problem or not.

> Just for the sake of completeness, do we need
> pg_atomic_monotonic_advance_u32 as well?

+1.

> 8. I'd split the addition of these new monotonic functions into 0001,
> then 0002 adding XLogwrtAtomic and 0003 adding Copy to XLogwrtAtomic.

+1

> 9.
> +    copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy);
> +    if (startptr + count > copyptr)
> +    ereport(WARNING,
> +    (errmsg("request to read past end of generated WAL;
> request %X/%X, current position %X/%X",
> +    LSN_FORMAT_ARGS(startptr + count),
> LSN_FORMAT_ARGS(copyptr))));
> 
> Any specific reason for this to be a WARNING rather than an ERROR?

Good question. WaitXLogInsertionsToFinish() uses a LOG level message
for the same situation. They should probably be the same log level, and
I would think it would be either PANIC or WARNING. I have no idea why
LOG was chosen.

> I've addressed some of my review comments (#1, #5, #7 and #8) above,
> merged the v11j-0002-fixup.patch into 0002, ran pgindent. Please see
> the attached v12 patch set. FWIW, CF bot is happy with these patches.

Re: LogwrtResult contended spinlock

2024-02-21 Thread Bharath Rupireddy
On Sat, Feb 17, 2024 at 2:24 AM Jeff Davis  wrote:
>
> Though it looks like we can remove the non-shared LogwrtResult
> entirely. Andres expressed some concern here:
>
> https://www.postgresql.org/message-id/20210130020211.rtu5ir3dpjrbi...@alap3.anarazel.de
>
> But then seemed to favor removing it here:
>
> https://www.postgresql.org/message-id/20240213001150.4uqzh7tinuhvo...@awork3.anarazel.de
>
> I'm inclined to think we can get rid of the non-shared copy.

The local copy of LogwrtResult is used so frequently in the backends
that, if we were to replace it with atomic accesses, won't the atomic
reads be costly and start showing up in perf profiles? In any case, I prefer
discussing this separately once we get to a conclusion on converting
shared memory Write and Flush ptrs to atomics.

> A few other comments:
>
>  * Should GetFlushRecPtr()/GetXLogWriteRecPtr() use a read memory
> barrier?
>  * Why did you add pg_memory_barrier() right before a spinlock
> acquisition?
>  * Is it an invariant that Write >= Flush at all times? Are there
> guaranteed to be write barriers in the right place to ensure that?

I'll continue to think about these points.

> I would also like it if we could add a new "Copy" pointer indicating
> how much WAL data has been copied to the WAL buffers. That would be set
> by WaitXLogInsertionsToFinish() so that subsequent calls are cheap.
> Attached a patch (0003) for illustration purposes. It adds to the size
> of XLogCtlData, but it's fairly large already, so I'm not sure if
> that's a problem. If we do add this, there would be an invariant that
> Copy >= Write at all times.

Thanks. I have a few comments on v11j patches.

1. I guess we need to initialize the new atomics with
pg_atomic_init_u64 initially in XLOGShmemInit:

2. I guess we need to update both the Write and Flush local copies in
AdvanceXLInsertBuffer. Overall, I guess we need to update both the
Write and Flush local copies whenever we previously read LogwrtResult,
no? Failing to do that caused regression tests to fail, and because we
caught it we ended up with v11j-0002-fixup.patch. There might be cases
we aren't able to catch if we don't update both the Write and Flush
local copies. I'd suggest we just wrap both of these reads under a
macro:

 LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
 LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);

and just use it wherever we did LogwrtResult = XLogCtl->LogwrtResult;
previously but not under spinlock.

3.
@@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI)
 {
 Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE);

+pg_read_barrier();
 LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
+LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);

Do we need a read barrier here to not reorder things when more than
one process is accessing the flush and write ptrs? If at all, a read
barrier is warranted here, we can use atomic read with full barrier
semantics as proposed here (when that's in) -
https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13.

4.
+/*
+ * Update local and shared status.  This is OK to do without any locks
+ * because no other process can be reading or writing WAL yet.
+ */
 LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

+pg_atomic_write_u64(&XLogCtl->LogwrtResult.Write, EndOfLog);
+pg_atomic_write_u64(&XLogCtl->LogwrtResult.Flush, EndOfLog);
 XLogCtl->LogwrtRqst.Write = EndOfLog;

pg_atomic_write_u64 here seems fine to me as no other process is
actively writing WAL yet; otherwise, we need a write with full barrier
semantics, something like pg_write_barrier + pg_atomic_write_u64 or
pg_atomic_exchange_u64 like in LWLockUpdateVar, or use atomic write
with full barrier semantics as proposed here (when that's in) -
https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13.

5. I guess we'd better use pg_atomic_read_u64_impl and
pg_atomic_compare_exchange_u64_impl in pg_atomic_monotonic_advance_u64
to reduce one level of function indirection. Of course, we'd still
need the pointer alignment checks that pg_atomic_compare_exchange_u64
does in the new monotonic function.

6.
+ * Full barrier semantics (even when value is unchanged).

+currval = pg_atomic_read_u64(ptr);
+if (currval >= target_)
+{
+pg_memory_barrier();

Perhaps, we can use atomic read with full barrier semantics proposed
here - 
https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13

7.
+static inline void
+pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)

Just for the sake of completeness, do we need
pg_atomic_monotonic_advance_u32 as well?
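For reference, the shape of the proposed function can be sketched with C11 atomics as follows. This is illustrative only: the boolean return value is an assumed convenience for testing, not part of the patch, and a u32 variant would be identical apart from the width:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/*
 * Sketch of pg_atomic_monotonic_advance_u64: advance *ptr to 'target'
 * unless it is already at or past it, with full barrier semantics even
 * when the value is left unchanged (matching the fragment quoted in
 * point 6).
 */
static bool
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
	uint64_t	currval = atomic_load_explicit(ptr, memory_order_relaxed);

	while (currval < target)
	{
		/* On success the CAS is a full barrier; on failure it reloads currval. */
		if (atomic_compare_exchange_weak(ptr, &currval, target))
			return true;
	}
	/* Preserve full barrier semantics even though we didn't move the value. */
	atomic_thread_fence(memory_order_seq_cst);
	return false;
}
```

The CAS loop tolerates concurrent advancers and spurious compare-exchange failures: each failed attempt refreshes currval, and the loop exits as soon as some process (this one or another) has pushed the value to target or beyond.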

8. I'd split the addition of these new monotonic functions into 0001,
then 0002 adding XLogwrtAtomic and 0003 adding Copy to XLogwrtAtomic.

9.
+copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy);
+if (startptr + count > copyptr)
+ereport(WARNING,
+

Re: LogwrtResult contended spinlock

2024-02-16 Thread Jeff Davis
On Mon, 2024-02-12 at 17:44 -0800, Jeff Davis wrote:
> It looks like there's some renewed interest in this patch:

After rebasing (attached as 0001), I'm seeing some test failures. It
looks like the local LogwrtResult is not being updated in as many
places, and that's hitting the Assert that I recently added. The fix is
easy (attached as 0002).

Though it looks like we can remove the non-shared LogwrtResult
entirely. Andres expressed some concern here:

https://www.postgresql.org/message-id/20210130020211.rtu5ir3dpjrbi...@alap3.anarazel.de

But then seemed to favor removing it here:

https://www.postgresql.org/message-id/20240213001150.4uqzh7tinuhvo...@awork3.anarazel.de

I'm inclined to think we can get rid of the non-shared copy.

A few other comments:

 * Should GetFlushRecPtr()/GetXLogWriteRecPtr() use a read memory
barrier?
 * Why did you add pg_memory_barrier() right before a spinlock
acquisition?
 * Is it an invariant that Write >= Flush at all times? Are there
guaranteed to be write barriers in the right place to ensure that?

I would also like it if we could add a new "Copy" pointer indicating
how much WAL data has been copied to the WAL buffers. That would be set
by WaitXLogInsertionsToFinish() so that subsequent calls are cheap.
Attached a patch (0003) for illustration purposes. It adds to the size
of XLogCtlData, but it's fairly large already, so I'm not sure if
that's a problem. If we do add this, there would be an invariant that
Copy >= Write at all times.
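A minimal sketch of the idea, using C11 atomics as stand-ins; the function names here are hypothetical, not from the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Illustrative stand-in for the proposed XLogCtl->LogwrtResult.Copy. */
static _Atomic uint64_t CopyPtr;

/*
 * What a WaitXLogInsertionsToFinish()-style caller could do once it has
 * established that everything before 'upto' is in the WAL buffers:
 * advance the shared Copy pointer monotonically.
 */
static void
publish_copied_up_to(uint64_t upto)
{
	uint64_t	cur = atomic_load(&CopyPtr);

	while (cur < upto &&
		   !atomic_compare_exchange_weak(&CopyPtr, &cur, upto))
		;					/* retry; cur was refreshed by the failed CAS */
}

/*
 * Fast path for subsequent callers: if the published Copy pointer
 * already covers 'upto', there is no need to scan the insertion locks.
 */
static bool
already_copied(uint64_t upto)
{
	return atomic_load(&CopyPtr) >= upto;
}
```

This is what makes repeated calls cheap: only the first caller past a given point pays for the scan, and everyone after it answers from a single atomic read while the invariant Copy >= Write >= Flush holds.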

Regards,
Jeff Davis

From cd48125afc61faa46f15f444390ed01f64988320 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v11j 1/3] Make XLogCtl->LogwrtResult accessible with atomics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables. Do so.
In a few places, we can adjust the exact location where the locals are
updated to account for the fact that we no longer need the spinlock.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 106 ++
 src/include/port/atomics.h|  29 
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 50c347a679..46394c5d39 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -294,16 +294,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -326,6 +323,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -461,6 +464,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -478,12 +482,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by 

Re: LogwrtResult contended spinlock

2024-02-12 Thread Jeff Davis
On Fri, 2022-09-23 at 10:49 +0200, Alvaro Herrera wrote:
> On 2022-Jul-28, Alvaro Herrera wrote:
> 
> > v10 is just a trivial rebase.  No changes.  Moved to next
> > commitfest.
> 
> I realized that because of commit e369f3708636 this change is no
> longer
> as critical as it used to be, so I'm withdrawing this patch from the
> commitfest.

It looks like there's some renewed interest in this patch:

https://www.postgresql.org/message-id/20240213001150.4uqzh7tinuhvo...@awork3.anarazel.de

even if there's no pressing performance concern now, it could clean
up the complexity of having a non-shared copy of LogwrtResult.

Regards,
Jeff Davis





Re: LogwrtResult contended spinlock

2022-09-23 Thread Alvaro Herrera
On 2022-Jul-28, Alvaro Herrera wrote:

> v10 is just a trivial rebase.  No changes.  Moved to next commitfest.

I realized that because of commit e369f3708636 this change is no longer
as critical as it used to be, so I'm withdrawing this patch from the
commitfest.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No necesitamos banderas
 No reconocemos fronteras"  (Jorge González)




Re: LogwrtResult contended spinlock

2022-07-28 Thread Alvaro Herrera
v10 is just a trivial rebase.  No changes.  Moved to next commitfest.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From 86655a0231e277c3f1bc907a0f0eb669943d4c71 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v10] Make XLogCtl->LogwrtResult accessible with atomics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables. Do so.
In a few places, we can adjust the exact location where the locals are
updated to account for the fact that we no longer need the spinlock.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 106 ++
 src/include/port/atomics.h|  29 
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 15ab8d90d4..29ebd38103 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -287,16 +287,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -319,6 +316,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -456,6 +459,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -473,12 +477,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -598,7 +596,7 @@ static ControlFileData *ControlFile = NULL;
 static int	UsableBytesInSegment;
 
 /*
- * Private, possibly out-of-date copy of shared LogwrtResult.
+ * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult.
  * See discussion above.
  */
 static XLogwrtResult LogwrtResult = {0, 0};
@@ -907,8 +905,6 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
-		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);
 	}
 
@@ -1786,6 +1782,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	 * Now that we have the lock, check if someone initialized the page
 	 * already.
 	 */
+	LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
 	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
 	{
 		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
@@ -1805,17 +1802,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 			if (opportunistic)
 break;
 
-			/* Before 

Re: LogwrtResult contended spinlock

2022-04-07 Thread Alvaro Herrera
On 2022-Apr-05, Alvaro Herrera wrote:

> Apologies -- I selected the wrong commit to extract the commit message
> from.  Here it is again.  I also removed an obsolete /* XXX */ comment.

I spent a lot of time staring at this to understand the needs for memory
barriers in the interactions.  In the end I decided not to get this out
for this cycle because I don't want to create subtle bugs in WAL.  I'll
come back with this for pg16.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")




Re: LogwrtResult contended spinlock

2022-04-05 Thread Alvaro Herrera
Apologies -- I selected the wrong commit to extract the commit message
from.  Here it is again.  I also removed an obsolete /* XXX */ comment.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
>From 0b26f90f1b95f8e9b932eb34bbf9c2a50729cf60 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v9] Make XLogCtl->LogwrtResult accessible with atomics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables. Do so.
In a few places, we can adjust the exact location where the locals are
updated to account for the fact that we no longer need the spinlock.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 106 ++
 src/include/port/atomics.h|  29 
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 17a56152f1..50fdfcbc90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -285,16 +285,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -317,6 +314,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -480,6 +483,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -497,12 +501,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -622,7 +620,7 @@ static ControlFileData *ControlFile = NULL;
 static int	UsableBytesInSegment;
 
 /*
- * Private, possibly out-of-date copy of shared LogwrtResult.
+ * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult.
  * See discussion above.
  */
 static XLogwrtResult LogwrtResult = {0, 0};
@@ -932,8 +930,6 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
-		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);
 	}
 
@@ -1811,6 +1807,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	 * Now that we have the lock, check if someone initialized the page
 	 * already.
 	 */
+	LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
 	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
 	{
 		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
@@ -1830,17 +1827,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, 

Re: LogwrtResult contended spinlock

2022-04-05 Thread Alvaro Herrera
Here's a v8, where per my previous comment I removed some code that I
believe is no longer necessary.

I've omitted the patch that renames LogwrtResult subvariables into
LogWriteResult/LogWriteFlush; I still think the end result is better
after that one, but it's a pretty trivial change that can be dealt with
separately afterwards.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)
>From fdca5f32573342d950feefbdd4d64da6dd1cfb34 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 22 Mar 2022 16:49:40 +0100
Subject: [PATCH v8] Split LogwrtResult into separate variables

Since the schedule of updating each portion is independent, it's not
very useful to have them both as a single struct.  Before we used
atomics for the corresponding XLogCtl struct this made sense because we
could use struct assignment, but that's no longer the case.

Suggested by Andres Freund.
---
 src/backend/access/transam/xlog.c | 99 ++-
 src/include/port/atomics.h| 29 +
 src/tools/pgindent/typedefs.list  |  1 +
 3 files changed, 74 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 17a56152f1..f1cb725bed 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -285,16 +285,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -317,6 +314,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -480,6 +483,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -497,12 +501,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -622,7 +620,7 @@ static ControlFileData *ControlFile = NULL;
 static int	UsableBytesInSegment;
 
 /*
- * Private, possibly out-of-date copy of shared LogwrtResult.
+ * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult.
  * See discussion above.
  */
 static XLogwrtResult LogwrtResult = {0, 0};
@@ -932,8 +930,6 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
-		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);
 	}
 
@@ -1811,6 +1807,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	 * Now that we have the lock, check if someone initialized the page
 	 * already.
 	 */
+	LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
 	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
 	{
 		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
@@ -1830,17 +1827,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 			if (opportunistic)
 break;
 
-			/* Before waiting, get info_lck and 

Re: LogwrtResult contended spinlock

2022-03-22 Thread Alvaro Herrera
So I've been wondering about this block at the bottom of XLogWrite:

/*
 * Make sure that the shared 'request' values do not fall behind the
 * 'result' values.  This is not absolutely essential, but it saves some
 * code in a couple of places.
 */
{
SpinLockAcquire(&XLogCtl->info_lck);
if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write)
XLogCtl->LogwrtRqst.Write = LogwrtResult.Write;
if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush)
XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
SpinLockRelease(&XLogCtl->info_lck);
}

I just noticed that my 0001 makes the comment a lie: it is now quite
possible that 'result' is advanced beyond 'request'.  Before the patch
that never happened because they were both advanced in the region locked
by the spinlock.

I think we could still maintain this promise if we just moved this
entire block before the first pg_atomic_monotonic_advance_u64 setting
XLogCtl->LogwrtResult.Write.  Or we could halve the whole block, and put
one acquire/test/set/release stanza before each monotonic increase of
the corresponding variable.


However, I wonder if this is still necessary.  This code was added in
4d14fe0048c (March 2001) and while everything else was quite different
back then, this hasn't changed at all.  I can't quite figure out what
are those "couple of places" that would need additional code if this
block is just removed.  I tried running the tests (including
wal_consistency_checking), and nothing breaks.  Reading the code
surrounding the other accesses of XLogCtl->LogwrtRqst, there's nothing
that looks to me like it depends on these values not lagging behind
LogwrtResult.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."   (New York Times, about Microsoft PowerPoint)




Re: LogwrtResult contended spinlock

2022-03-22 Thread Alvaro Herrera
On 2022-Mar-22, Tom Lane wrote:

> I looked briefly at 0001, and I've got to say that I disagree with
> your decision to rearrange the representation of the local LogwrtResult
> copy.  It clutters the patch tremendously and makes it hard to
> understand what the actual functional change is.  Moreover, I'm
> not entirely convinced that it's a notational improvement in the
> first place.
> 
> Perhaps it'd help if you split 0001 into two steps, one to do the
> mechanical change of the representation and then a second patch that
> converts the shared variable to atomics.  Since you've moved around
> the places that read the shared variable, that part is subtler than
> one could wish and really needs to be studied on its own.

Hmm, I did it the other way around: first the change to use atomics, then
the mechanical change.  I think that makes the usefulness of the change
more visible, because before the switch to atomics, using the combined
struct as a unit remains sensible.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")
From dd9b53878faeedba921ea7027e98ddbb433e8647 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v7 1/3] Change XLogCtl->LogwrtResult to use atomic ops
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables.

In addition, this commit splits the process-local copy of these values
(kept in the freestanding LogwrtResult struct) into two separate
variables, LogWriteResult and LogFlushResult.  This is not strictly
necessary, but it makes it clearer that these are updated separately,
each on its own schedule.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 85 +++
 src/include/port/atomics.h| 29 +++
 src/tools/pgindent/typedefs.list  |  1 +
 3 files changed, 72 insertions(+), 43 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4ac3871c74..6f2eb494fe 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -285,16 +285,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -317,6 +314,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -480,6 +483,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -497,12 +501,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -622,7 

Re: LogwrtResult contended spinlock

2022-03-22 Thread Tom Lane
Alvaro Herrera  writes:
> On 2022-Mar-21, Andres Freund wrote:
>> Are you aiming this for v15? Otherwise I'd like to move the entry to the next
>> CF. Marked as waiting-on-author.

> I'd like to get 0001 pushed to pg15, yes.  I'll let 0002 sit here for
> discussion, but I haven't seen any evidence that we need it.  If others
> vouch for it, I can push that one too, but I'd rather have it be a
> separate thing.

I looked briefly at 0001, and I've got to say that I disagree with
your decision to rearrange the representation of the local LogwrtResult
copy.  It clutters the patch tremendously and makes it hard to
understand what the actual functional change is.  Moreover, I'm
not entirely convinced that it's a notational improvement in the
first place.

Perhaps it'd help if you split 0001 into two steps, one to do the
mechanical change of the representation and then a second patch that
converts the shared variable to atomics.  Since you've moved around
the places that read the shared variable, that part is subtler than
one could wish and really needs to be studied on its own.

regards, tom lane




Re: LogwrtResult contended spinlock

2022-03-22 Thread Alvaro Herrera
On 2022-Mar-21, Andres Freund wrote:

> This patch currently doesn't apply: http://cfbot.cputube.org/patch_37_2716.log

Updated.

> Are you aiming this for v15? Otherwise I'd like to move the entry to the next
> CF. Marked as waiting-on-author.

I'd like to get 0001 pushed to pg15, yes.  I'll let 0002 sit here for
discussion, but I haven't seen any evidence that we need it.  If others
vouch for it, I can push that one too, but I'd rather have it be a
separate thing.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)
From 4d5a8d915b497b79bbe62e18352f7b9d8c1b9bca Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v6 1/2] Change XLogCtl->LogwrtResult to use atomic ops
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables.

In addition, this commit splits the process-local copy of these values
(kept in the freestanding LogwrtResult struct) into two separate
variables, LogWriteResult and LogFlushResult.  This is not strictly
necessary, but it makes it clearer that these are updated separately,
each on its own schedule.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 195 +++---
 src/include/port/atomics.h|  29 +
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 125 insertions(+), 100 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4ac3871c74..311a8a1192 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -285,16 +285,14 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * each member of LogwrtResult (LogWriteResult and LogFlushResult), each of
+ * which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -317,18 +315,18 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 of write position */
+	pg_atomic_uint64 Flush;		/* last byte + 1 of flush position */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
 	XLogRecPtr	Flush;			/* last byte + 1 to flush */
 } XLogwrtRqst;
 
-typedef struct XLogwrtResult
-{
-	XLogRecPtr	Write;			/* last byte + 1 written out */
-	XLogRecPtr	Flush;			/* last byte + 1 flushed */
-} XLogwrtResult;
-
 /*
  * Inserting to WAL is protected by a small fixed number of WAL insertion
  * locks. To insert to the WAL, you must hold one of the locks - it doesn't
@@ -480,6 +478,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -497,12 +496,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -622,10 +615,11 @@ static ControlFileData *ControlFile = NULL;
 static int	UsableBytesInSegment;
 
 /*
- * 

Re: LogwrtResult contended spinlock

2022-03-21 Thread Andres Freund
Hi,

On 2021-11-22 18:56:43 -0300, Alvaro Herrera wrote:
> There was an earlier comment by Andres that asyncXactLSN should also be
> atomic, to avoid an ugly spinlock interaction with the new atomic-based
> logwrtResult.  The 0002 here is an attempt at doing that; I found that
> it also needed to change WalWriterSleeping to use atomics, to avoid
> XLogSetAsyncXactLSN having to grab the spinlock for that.

This patch currently doesn't apply: http://cfbot.cputube.org/patch_37_2716.log

Are you aiming this for v15? Otherwise I'd like to move the entry to the next
CF. Marked as waiting-on-author.

Greetings,

Andres Freund




Re: LogwrtResult contended spinlock

2021-11-22 Thread Alvaro Herrera
There was an earlier comment by Andres that asyncXactLSN should also be
atomic, to avoid an ugly spinlock interaction with the new atomic-based
logwrtResult.  The 0002 here is an attempt at doing that; I found that
it also needed to change WalWriterSleeping to use atomics, to avoid
XLogSetAsyncXactLSN having to grab the spinlock for that.


-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Learn about compilers. Then everything looks like either a compiler or
a database, and now you have two problems but one of them is fun."
https://twitter.com/thingskatedid/status/1456027786158776329
From 0d48422df1a0e738721bfba5cd2d3f1ede5db423 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v5 1/2] make LogwrtResult atomic

---
 src/backend/access/transam/xlog.c | 192 +++---
 src/include/port/atomics.h|  29 +
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 123 insertions(+), 99 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 221e4cb34f..225ac8bc2d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -382,16 +382,14 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
+ * LogWriteResult, LogFlushResult indicate the byte positions we have already
+ * written/fsynced.
  * These structs are identical but are declared separately to indicate their
  * slightly different functions.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * XLogCtl->LogwrtResult is read and written using atomic operations.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -414,18 +412,18 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 of write position */
+	pg_atomic_uint64 Flush;		/* last byte + 1 of flush position */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
 	XLogRecPtr	Flush;			/* last byte + 1 to flush */
 } XLogwrtRqst;
 
-typedef struct XLogwrtResult
-{
-	XLogRecPtr	Write;			/* last byte + 1 written out */
-	XLogRecPtr	Flush;			/* last byte + 1 flushed */
-} XLogwrtResult;
-
 /*
  * Inserting to WAL is protected by a small fixed number of WAL insertion
  * locks. To insert to the WAL, you must hold one of the locks - it doesn't
@@ -577,6 +575,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -594,12 +593,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -770,10 +763,11 @@ static ControlFileData *ControlFile = NULL;
 static int	UsableBytesInSegment;
 
 /*
- * Private, possibly out-of-date copy of shared LogwrtResult.
+ * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult.
  * See discussion above.
  */
-static XLogwrtResult LogwrtResult = {0, 0};
+static XLogRecPtr LogWriteResult = 0;
+static XLogRecPtr LogFlushResult = 0;
 
 /*
  * Codes indicating where we got a WAL file from during recovery, or where
@@ -1201,8 +1195,6 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
-		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);
 	}
 
@@ -2174,6 +2166,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	 * Now that we have the lock, check if someone initialized the page
 	 * already.
 	 */
+	LogWriteResult = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
 	while (upto >= XLogCtl->InitializedUpTo || 

Re: LogwrtResult contended spinlock

2021-11-18 Thread Alvaro Herrera
Here's a further attempt at this.  Sorry it took so long.
In this version, I replaced the coupled-in-a-struct representation of
Write and Flush with two separate global variables.  The reason to do this
is to cater to Andres' idea to keep them up-to-date separately.  Of
course, I could have kept them together, but it seems more sensible this way.

Andres also suggested not to use monotonic-advance in XLogWrite, because
these updates are always done with WALWriteLock held in exclusive mode.
However, that doesn't work (and tests fail pretty quickly) because
XLogWrite can apparently be called out of order, so you still have to
check whether the current value is newer than what you are about to
write; a plain atomic write is not enough.
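
The out-of-order hazard is easy to see in a standalone sketch.  The code
below uses C11 <stdatomic.h> and pthreads as a stand-in for the
pg_atomic_* API (all names here are hypothetical, not taken from the
tree): writers publish completion positions in arbitrary order, and a
plain store could let a stale position clobber a newer one, while the
compare-and-exchange loop only ever moves the shared position forward.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <pthread.h>

/* Shared "result" position; a stand-in for something like
 * XLogCtl->LogwrtResult.Write. */
static _Atomic uint64_t write_result;

/* Advance *ptr to at least target; never move it backwards.  On CAS
 * failure, cur is reloaded with the current value, so a competing
 * advance to a larger target simply terminates the loop. */
static void
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
	uint64_t	cur = atomic_load(ptr);

	while (cur < target)
	{
		if (atomic_compare_exchange_strong(ptr, &cur, target))
			break;
	}
}

/* Each worker publishes its own completion position; the scheduler
 * decides the order, so smaller positions may arrive after larger ones. */
static void *
worker(void *arg)
{
	monotonic_advance_u64(&write_result, (uint64_t) (uintptr_t) arg);
	return NULL;
}

/* Run n (<= 64) workers publishing positions 1..n; the final value must
 * be n no matter how the stores interleave. */
uint64_t
run_out_of_order_writers(int n)
{
	pthread_t	threads[64];

	if (n > 64)
		n = 64;
	atomic_store(&write_result, 0);
	for (int i = 0; i < n; i++)
		pthread_create(&threads[i], NULL, worker,
					   (void *) (uintptr_t) (i + 1));
	for (int i = 0; i < n; i++)
		pthread_join(threads[i], NULL);
	return atomic_load(&write_result);
}
```

This mirrors why XLogWrite cannot simply do a plain atomic write of the
new value: with possibly out-of-order callers, the check against the
current value is what preserves monotonicity.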

So this is v4.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1616448368..ad74862cb6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -382,16 +382,14 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
+ * LogWriteResult, LogFlushResult indicate the byte positions we have already
+ * written/fsynced.
  * These structs are identical but are declared separately to indicate their
  * slightly different functions.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * XLogCtl->LogwrtResult is read and written using atomic operations.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -414,18 +412,18 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 of write position */
+	pg_atomic_uint64 Flush;		/* last byte + 1 of flush position */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
 	XLogRecPtr	Flush;			/* last byte + 1 to flush */
 } XLogwrtRqst;
 
-typedef struct XLogwrtResult
-{
-	XLogRecPtr	Write;			/* last byte + 1 written out */
-	XLogRecPtr	Flush;			/* last byte + 1 flushed */
-} XLogwrtResult;
-
 /*
  * Inserting to WAL is protected by a small fixed number of WAL insertion
  * locks. To insert to the WAL, you must hold one of the locks - it doesn't
@@ -577,6 +575,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -594,12 +593,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -770,10 +763,11 @@ static ControlFileData *ControlFile = NULL;
 static int	UsableBytesInSegment;
 
 /*
- * Private, possibly out-of-date copy of shared LogwrtResult.
+ * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult.
  * See discussion above.
  */
-static XLogwrtResult LogwrtResult = {0, 0};
+static XLogRecPtr LogWriteResult = 0;
+static XLogRecPtr LogFlushResult = 0;
 
 /*
  * Codes indicating where we got a WAL file from during recovery, or where
@@ -1201,8 +1195,6 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
-		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);
 	}
 
@@ -2174,6 +2166,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	 * Now that we have the lock, check if someone initialized the page
 	 * already.
 	 */
+	LogWriteResult = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
 	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
 	{
 		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
@@ -2184,7 +2177,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 * already written out.
 		 */
 	

Re: LogwrtResult contended spinlock

2021-11-04 Thread Daniel Gustafsson
> On 8 Sep 2021, at 17:14, Jaime Casanova  wrote:
> 
> On Tue, Feb 02, 2021 at 08:19:19PM -0300, Alvaro Herrera wrote:
>> Hello,
>> 
>> So I addressed about half of your comments in this version merely by
>> fixing silly bugs.  The problem I had which I described as
>> "synchronization fails" was one of those silly bugs.
>> 
> 
> Hi Álvaro,
> 
> Are we waiting for another version of the patch based on Andres'
> comments? Or this version is good enough for testing?

Any update on this?  Should the patch be reset back to "Needs review" or will
there be a new version?

--
Daniel Gustafsson   https://vmware.com/





Re: LogwrtResult contended spinlock

2021-09-08 Thread Jaime Casanova
On Tue, Feb 02, 2021 at 08:19:19PM -0300, Alvaro Herrera wrote:
> Hello,
> 
> So I addressed about half of your comments in this version merely by
> fixing silly bugs.  The problem I had which I described as
> "synchronization fails" was one of those silly bugs.
> 

Hi Álvaro,

Are we waiting for another version of the patch based on Andres'
comments? Or this version is good enough for testing?

BTW, current patch still applies cleanly.

regards,

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL




Re: LogwrtResult contended spinlock

2021-02-02 Thread Andres Freund
Hi,

On 2021-02-02 20:19:19 -0300, Alvaro Herrera wrote:
> > > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata,
> > >   {
> > >   /* advance global request to include new block(s) */
> > >   pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, EndPos);
> > > + pg_memory_barrier();
> >
> > That's not really useful - the path that actually updates already
> > implies a barrier. It'd probably be better to add a barrier to a "never
> > executed cmpxchg" fastpath.
>
> Got it.  Do you mean in pg_atomic_monotonic_advance_u64() itself?

Yes.


> I'm not sure which is the nicer semantics.  (If it's got to be at the
> caller, then we'll need to return a boolean from there, which sounds
> worse.)

Nearly all other modifying atomic operations have full barrier
semantics, so I think it'd be better to have it inside the
pg_atomic_monotonic_advance_u64().


> +/*
> + * Monotonically advance the given variable using only atomic operations until
> + * it's at least the target value.
> + */
> +static inline void
> +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
> +{
> + uint64  currval;
> +
> +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> + AssertPointerAlignment(ptr, 8);
> +#endif
> +
> + currval = pg_atomic_read_u64(ptr);
> + while (currval < target_)
> + {
> + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
> + break;
> + }
> +}

So I think it'd be

static inline void
pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
{
uint64  currval;

#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
AssertPointerAlignment(ptr, 8);
#endif

currval = pg_atomic_read_u64(ptr);
if (currval >= target_)
{
pg_memory_barrier();
return;
}

while (currval < target_)
{
if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
break;
}
}


> @@ -1172,9 +1170,10 @@ XLogInsertRecord(XLogRecData *rdata,
>   /* advance global request to include new block(s) */
>   if (XLogCtl->LogwrtRqst.Write < EndPos)
>   XLogCtl->LogwrtRqst.Write = EndPos;
> - /* update local result copy while I have the chance */
> - LogwrtResult = XLogCtl->LogwrtResult;
>   SpinLockRelease(&XLogCtl->info_lck);
> + /* update local result copy */
> + LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
> + LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
>   }

As mentioned before - it's not clear to me why this is a valid thing to
do without verifying all LogwrtResult.* usages. You can get updates
completely out of order / independently.


> @@ -2675,8 +2673,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
>* code in a couple of places.
>*/
>   {
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Write, LogwrtResult.Write);
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Flush, LogwrtResult.Flush);
> + pg_memory_barrier();
>   SpinLockAcquire(&XLogCtl->info_lck);
> - XLogCtl->LogwrtResult = LogwrtResult;

I still don't see why we need "locked" atomic operations here, rather
than just a pg_atomic_write_u64(). They can only be modified
with WALWriteLock held.  There's two reasons for using a spinlock in
this place:
1) it avoids torn reads of 64bit values -
   pg_atomic_write_u64()/pg_atomic_read_u64() avoid that already.
2) it ensures that Write/Flush are updated in unison - but that's not
   useful anymore, given that other places now read the variables
   separately.


> @@ -3064,8 +3063,10 @@ XLogBackgroundFlush(void)
>   WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
>
>   /* if we have already flushed that far, consider async commit records */
> + LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
>   if (WriteRqst.Write <= LogwrtResult.Flush)
>   {
> + pg_memory_barrier();
>   SpinLockAcquire(&XLogCtl->info_lck);
>   WriteRqst.Write = XLogCtl->asyncXactLSN;
>   SpinLockRelease(&XLogCtl->info_lck);

A SpinLockAcquire() is a full memory barrier on its own I think. This'd
probably be better solved by just making asyncXactLSN atomic...


Greetings,

Andres Freund




Re: LogwrtResult contended spinlock

2021-02-02 Thread Alvaro Herrera
Hello,

So I addressed about half of your comments in this version merely by
fixing silly bugs.  The problem I had which I described as
"synchronization fails" was one of those silly bugs.

So in further thinking, it seems simpler to make only LogwrtResult
atomic, and leave LogwrtRqst as currently, using the spinlock.  This
should solve the contention problem we saw at the customer (but I've
asked Jaime very nicely to do a test run, if possible, to confirm).

For things like logical replication, which call GetFlushRecPtr() very
frequently (causing the spinlock issue we saw), this is good, because
we're no longer hitting the spinlock at all in that case.

I have another (pretty mechanical) patch that renames LogwrtResult.Write
to LogWriteResult and LogwrtResult.Flush to LogFlushResult.  That more
clearly shows that we're no longer updating them in unison.  I didn't
attach it here because I haven't rebased it on the current one.  But it
seems logical: there's no longer any point in doing struct assignment,
which is the only thing that stuff was good for.


On 2021-Jan-29, Andres Freund wrote:

> > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata,
> > {
> > /* advance global request to include new block(s) */
> > pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, EndPos);
> > +   pg_memory_barrier();
> 
> That's not really useful - the path that actually updates already
> implies a barrier. It'd probably be better to add a barrier to a "never
> executed cmpxchg" fastpath.

Got it.  Do you mean in pg_atomic_monotonic_advance_u64() itself?  I'm
not sure which is the nicer semantics.  (If it's got to be at the
caller, then we'll need to return a boolean from there, which sounds
worse.)

> > @@ -2905,6 +2909,7 @@ XLogFlush(XLogRecPtr record)
> > WriteRqstPtr = pg_atomic_read_u64(&XLogCtl->LogwrtRqst.Write);
> > LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
> > LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
> > +   pg_read_barrier();
> 
> I'm not sure you really can get away with just a read barrier in these
> places. We can't e.g. have later updates to other shared memory
> variables be "moved" to before the barrier (which a read barrier
> allows).

Ah, that makes sense.
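
The concern about barrier strength can be made concrete with a minimal
message-passing sketch.  This is a standalone C11 example, not code from
xlog.c, and every name in it is hypothetical: the release/acquire fence
pairing below is the portable analogue of the pg_write_barrier() /
pg_memory_barrier() discipline under discussion, whereas a load-load-only
barrier such as pg_read_barrier() would order the reader's two loads but
not its later stores — which is the point Andres is making.

```c
#include <stdatomic.h>
#include <pthread.h>

static int payload;			/* ordinary, non-atomic shared data */
static atomic_int ready;	/* publication flag */

static void *
producer(void *arg)
{
	(void) arg;
	payload = 42;			/* plain store */
	/* Order the payload store before the flag store (cf. pg_write_barrier). */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
	return NULL;
}

/* Spin until the flag is set, then read the payload.  The acquire fence
 * keeps the payload access (and any later accesses) from moving up
 * before the point where the flag was observed; a barrier that only
 * orders loads against loads would not constrain later stores. */
static int
consume_when_ready(void)
{
	while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
		;					/* spin */
	atomic_thread_fence(memory_order_acquire);
	return payload;
}

/* Run one producer/consumer round; with the fences in place the
 * consumer is guaranteed to see the payload, not a stale zero. */
int
run_message_pass(void)
{
	pthread_t	t;
	int			v;

	payload = 0;
	atomic_store(&ready, 0);
	pthread_create(&t, NULL, producer, NULL);
	v = consume_when_ready();
	pthread_join(t, NULL);
	return v;
}
```

Dropping the fences (or weakening the consumer side to a load-load
barrier while it also writes other shared state) would reintroduce the
reordering hazard described above.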

I have not really studied the barrier locations terribly closely in this
version of the patch.  It probably misses some (eg. in GetFlushRecPtr
and GetXLogWriteRecPtr).  It is passing the tests for me, but alone
that's probably not enough.  I'm gonna try and study the generated
assembly and see if I can make sense of things ...

-- 
Álvaro Herrera39°49'30"S 73°17'W
From ca05447f41270d6b3ac3b6fcf816f86aba86d08f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 29 Jan 2021 22:34:23 -0300
Subject: [PATCH 1/2] add pg_atomic_monotonic_advance_u64

---
 src/include/port/atomics.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..a1c4764d18 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,27 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+/*
+ * Monotonically advance the given variable using only atomic operations until
+ * it's at least the target value.
+ */
+static inline void
+pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
+{
+	uint64		currval;
+
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+
+	currval = pg_atomic_read_u64(ptr);
+	while (currval < target_)
+	{
+		if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
+			break;
+	}
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif			/* ATOMICS_H */
-- 
2.20.1

From d185cf50a47a0f9e346e49ccda038bb016ce246b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH 2/2] make LogwrtResult atomic

---
 src/backend/access/transam/xlog.c | 66 +++
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f03bd473e2..10802bd56f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -405,12 +405,9 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  * These structs are identical but are declared separately to indicate their
  * slightly different functions.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * 

Re: LogwrtResult contended spinlock

2021-01-29 Thread Andres Freund
Hi,

On 2021-01-29 22:47:56 -0300, Alvaro Herrera wrote:
> So I tried this, but -- perhaps not suprisingly -- I can't get it to
> work properly; the synchronization fails.

What do you mean by "synchronization fails"?


> Strangely, all tests work for me, but the pg_upgrade one in particular
> fails.

It's one of the few tests using fsync=on.


> +static inline void
> +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
> +{
> +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> + AssertPointerAlignment(ptr, 8);
> +#endif
> + /* FIXME is this algorithm correct if we have u64 simulation? */

I don't see a problem.


> + while (true)
> + {
> + uint64  currval;
> +
> + currval = pg_atomic_read_u64(ptr);
> + if (currval > target_)
> + break;  /* already done by somebody else */
> + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
> + break;  /* successfully done */
> + }
> +}

I suggest writing this as

currval = pg_atomic_read_u64(ptr);
while (currval < target_)
{
if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
break;
}
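Andres's suggested loop shape can be sketched self-contained with C11 atomics standing in for the pg_atomic API (an illustration, not PostgreSQL code; note that `atomic_compare_exchange_strong`, like `pg_atomic_compare_exchange_u64`, refreshes the expected value on failure, which is what lets the loop re-test its condition):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Advance *ptr to target if target is larger; never move it backwards.
 * On CAS failure, currval is refreshed with the value some other thread
 * installed, so the while condition is re-tested against it. */
static inline void
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
    uint64_t    currval = atomic_load(ptr);

    while (currval < target)
    {
        if (atomic_compare_exchange_strong(ptr, &currval, target))
            break;              /* we installed target ourselves */
    }
}
```

Either the caller's target wins the race, or a competitor has already advanced the counter at least as far; in both cases the value never regresses.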

>  /*
>   * Inserting to WAL is protected by a small fixed number of WAL insertion
> @@ -596,8 +599,10 @@ typedef struct XLogCtlData
>  {
>   XLogCtlInsert Insert;
>  
> + XLogwrtAtomic LogwrtRqst;
> + XLogwrtAtomic LogwrtResult;
> +
>   /* Protected by info_lck: */
> - XLogwrtRqst LogwrtRqst;

Not sure putting these into the same cacheline is a good idea.


> @@ -2166,12 +2163,11 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool 
> opportunistic)
>   if (opportunistic)
>   break;
>  
> - /* Before waiting, get info_lck and update LogwrtResult 
> */
> - SpinLockAcquire(&XLogCtl->info_lck);
> - if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
> - XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
> - LogwrtResult = XLogCtl->LogwrtResult;
> - SpinLockRelease(&XLogCtl->info_lck);
> + /* Before waiting, update LogwrtResult */
> + 
> pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, OldPageRqstPtr);
> +
> + LogwrtResult.Write = 
> pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
> + LogwrtResult.Flush = 
> pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);

I don't think it's quite as easy as this. Write/Flush now aren't
guaranteed to be coherent with each other - previously they were. And
because it's in a global variable used everywhere, we can't just be more
careful about update protocols in one place...

We also shouldn't re-read a variable that we just did via the
pg_atomic_monotonic_advance_u64().

I think we should stop updating both the Write/Flush position at the
same time. That way we don't have an expectation of them being coherent
with each other. Most places really don't need both anyway.
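If the two positions live in separate atomics, a snapshot that still satisfies the invariant Flush <= Write can be taken without a lock by ordering the reads, because both counters only ever advance. This is an illustrative sketch of that idea (hypothetical names, not code from the patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
    uint64_t    write;          /* last byte + 1 written out */
    uint64_t    flush;          /* last byte + 1 flushed */
} WalPosSnapshot;

/* Load flush before write: the later write load is at least as new as
 * the earlier flush load, so flush <= write holds in the snapshot even
 * though the pair is not read atomically. */
static WalPosSnapshot
snapshot_positions(_Atomic uint64_t *write_p, _Atomic uint64_t *flush_p)
{
    WalPosSnapshot snap;

    snap.flush = atomic_load(flush_p);
    snap.write = atomic_load(write_p);
    return snap;
}
```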


>   {
> - SpinLockAcquire(&XLogCtl->info_lck);
> - XLogCtl->LogwrtResult = LogwrtResult;
> - if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write)
> - XLogCtl->LogwrtRqst.Write = LogwrtResult.Write;
> - if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush)
> - XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
> - SpinLockRelease(&XLogCtl->info_lck);
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Write, 
> LogwrtResult.Write);
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Flush, 
> LogwrtResult.Flush);
> +
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, 
> LogwrtResult.Write);
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Flush, 
> LogwrtResult.Flush);

Hm. We went from one cheap atomic operation (SpinLockAcquire) to four
expensive ones in the happy path. That's not free...

I don't think we need to manipulate XLogCtl->LogwrtResult.* using atomic
ops - they can only be updated with WALWriteLock held, right?

XLogCtl->LogwrtResult was updated with plain assignment before, why did
you change it to pg_atomic_monotonic_advance_u64()?
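The point behind these two questions — that a value only ever stored while holding a lock needs no CAS on the write side — can be sketched like this (a hedged illustration with C11 atomics and a hypothetical lock discipline, not the patch's code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t logwrt_write;   /* stand-in for LogwrtResult.Write */

/* Writer side: assumed to run only while holding the write lock, so no
 * concurrent store exists and a plain release store is sufficient. */
static void
publish_write_result(uint64_t newval)
{
    atomic_store_explicit(&logwrt_write, newval, memory_order_release);
}

/* Reader side: any thread may read without taking any lock. */
static uint64_t
read_write_result(void)
{
    return atomic_load_explicit(&logwrt_write, memory_order_acquire);
}
```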


> @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata,
>   {
>   /* advance global request to include new block(s) */
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, 
> EndPos);
> + pg_memory_barrier();
> +

That's not really useful - the path that actually updates already
implies a barrier. It'd probably be better to add a barrier to a "never
executed cmpxchg" fastpath.



> @@ -2905,6 +2909,7 @@ XLogFlush(XLogRecPtr record)
> WriteRqstPtr = pg_atomic_read_u64(&XLogCtl->LogwrtRqst.Write);
> LogwrtResult.Write = 
> pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
> LogwrtResult.Flush = 
> pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
> + 

Re: LogwrtResult contended spinlock

2021-01-29 Thread Andres Freund
Hi,


On 2021-01-29 12:40:18 -0300, Alvaro Herrera wrote:
> On 2020-Aug-31, Andres Freund wrote:
> 
> > Wouldn't the better fix here be to allow reading of individual members 
> > without a lock? E.g. by wrapping each in a 64bit atomic.
> 
> So I've been playing with this and I'm annoyed about having two
> datatypes to represent Write/Flush positions:
> 
> typedef struct XLogwrtRqst
> {
>   XLogRecPtr  Write;  /* last byte + 1 to write out */
>   XLogRecPtr  Flush;  /* last byte + 1 to flush */
> } XLogwrtRqst;
> 
> typedef struct XLogwrtResult
> {
>   XLogRecPtr  Write;  /* last byte + 1 written out */
>   XLogRecPtr  Flush;  /* last byte + 1 flushed */
> } XLogwrtResult;
> 
> Don't they look, um, quite similar?  I am strongly tempted to remove
> that distinction, since it seems quite pointless, and introduce a
> different one:

Their spelling drives me nuts. Like, one is *Rqst, the other *Result?
Come on.


> Now, I do wonder if there's a point in keeping LogwrtResult as a local
> variable at all; maybe since the ones in shared memory are going to use
> unlocked access, we don't need it anymore?  I'd prefer to defer that
> decision to after this patch is done, since ISTM that it'd merit more
> careful benchmarking.

I think doing that might be a bit harmful - we update LogwrtResult
fairly granularly in XLogWrite(). Doing that in those small steps in
shared memory will increase the likelihood of cache misses in other
backends.

Greetings,

Andres Freund




Re: LogwrtResult contended spinlock

2021-01-29 Thread Alvaro Herrera
So I tried this, but -- perhaps not surprisingly -- I can't get it to
work properly; the synchronization fails.  I suspect I need some
barriers, but I tried adding a few (probably some that are not really
necessary) and that didn't have the expected effect.  Strangely, all
tests work for me, but the pg_upgrade one in particular fails.

(The attached is of course POC quality at best.)

I'll have another look next week.

-- 
Álvaro Herrera        39°49'30"S 73°17'W
From 90507185357391a661cd856fc231b140079c8d78 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 29 Jan 2021 22:34:23 -0300
Subject: [PATCH v2 1/3] add pg_atomic_monotonic_advance_u64

---
 src/include/port/atomics.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..752f39d66e 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,25 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline void
+pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	/* FIXME is this algorithm correct if we have u64 simulation? */
+	while (true)
+	{
+		uint64		currval;
+
+		currval = pg_atomic_read_u64(ptr);
+		if (currval > target_)
+			break;	/* already done by somebody else */
+		if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
+			break;	/* successfully done */
+	}
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif			/* ATOMICS_H */
-- 
2.20.1

From 84a62ac0594f8f0b13c06958cf6e94e2a2723082 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 29 Jan 2021 22:34:04 -0300
Subject: [PATCH v2 2/3] atomics in xlog.c

---
 src/backend/access/transam/xlog.c | 129 ++
 1 file changed, 61 insertions(+), 68 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f03bd473e2..8073a92ceb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -413,7 +413,8 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  * which is updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
- * (protected by info_lck), but we don't need to cache any copies of it.
+ * (which use atomic access, so no locks are needed), but we don't need to
+ * cache any copies of it.
  *
  * info_lck is only held long enough to read/update the protected variables,
  * so it's a plain spinlock.  The other locks are held longer (potentially
@@ -433,17 +434,19 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  *--
  */
 
-typedef struct XLogwrtRqst
+/* Write/Flush position to be used in shared memory */
+typedef struct XLogwrtAtomic
 {
-	XLogRecPtr	Write;			/* last byte + 1 to write out */
-	XLogRecPtr	Flush;			/* last byte + 1 to flush */
-} XLogwrtRqst;
+	pg_atomic_uint64 Write;			/* last byte + 1 of write position */
+	pg_atomic_uint64 Flush;			/* last byte + 1 of flush position */
+} XLogwrtAtomic;
 
-typedef struct XLogwrtResult
+/* Write/Flush position to be used in process-local memory */
+typedef struct XLogwrt
 {
 	XLogRecPtr	Write;			/* last byte + 1 written out */
 	XLogRecPtr	Flush;			/* last byte + 1 flushed */
-} XLogwrtResult;
+} XLogwrt;
 
 /*
  * Inserting to WAL is protected by a small fixed number of WAL insertion
@@ -596,8 +599,10 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtRqst;
+	XLogwrtAtomic LogwrtResult;
+
 	/* Protected by info_lck: */
-	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
 	FullTransactionId ckptFullXid;	/* nextXid of latest checkpoint */
 	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
@@ -613,12 +618,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -779,7 +778,7 @@ static int	UsableBytesInSegment;
  * Private, possibly out-of-date copy of shared LogwrtResult.
  * See discussion above.
  */
-static XLogwrtResult LogwrtResult = {0, 0};
+static XLogwrt LogwrtResult = {0, 0};
 
 /*
  * Codes indicating where we got a WAL file from during recovery, or where
@@ -910,7 +909,7 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
 static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
-static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
+static void XLogWrite(XLogwrt WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
    bool find_free, XLogSegNo 

Re: LogwrtResult contended spinlock

2021-01-29 Thread Alvaro Herrera
On 2020-Aug-31, Andres Freund wrote:

> Wouldn't the better fix here be to allow reading of individual members 
> without a lock? E.g. by wrapping each in a 64bit atomic.

So I've been playing with this and I'm annoyed about having two
datatypes to represent Write/Flush positions:

typedef struct XLogwrtRqst
{
XLogRecPtr  Write;  /* last byte + 1 to write out */
XLogRecPtr  Flush;  /* last byte + 1 to flush */
} XLogwrtRqst;

typedef struct XLogwrtResult
{
XLogRecPtr  Write;  /* last byte + 1 written out */
XLogRecPtr  Flush;  /* last byte + 1 flushed */
} XLogwrtResult;

Don't they look, um, quite similar?  I am strongly tempted to remove
that distinction, since it seems quite pointless, and introduce a
different one:

typedef struct XLogwrtAtomic
{
pg_atomic_uint64Write;  /* last byte + 1 of write 
position */
pg_atomic_uint64Flush;  /* last byte + 1 of flush 
position */
} XLogwrtAtomic;

this one, with atomics, would be used for the XLogCtl struct members
LogwrtRqst and LogwrtResult, and are always accessed using atomic ops.
On the other hand we would have

typedef struct XLogwrt
{
XLogRecPtr  Write;  /* last byte + 1 of write position */
XLogRecPtr  Flush;  /* last byte + 1 of flush position */
} XLogwrt;

to be used for the local-memory-only LogwrtResult, using normal
assignment.

Now, I do wonder if there's a point in keeping LogwrtResult as a local
variable at all; maybe since the ones in shared memory are going to use
unlocked access, we don't need it anymore?  I'd prefer to defer that
decision to after this patch is done, since ISTM that it'd merit more
careful benchmarking.

Thoughts?

-- 
Álvaro Herrera   Valdivia, Chile




Re: LogwrtResult contended spinlock

2021-01-22 Thread Masahiko Sawada
On Fri, Jan 22, 2021 at 10:39 PM Alvaro Herrera  wrote:
>
> On 2021-Jan-22, Masahiko Sawada wrote:
>
> > On Wed, Nov 25, 2020 at 12:02 AM Alvaro Herrera  
> > wrote:
>
> > > Yes, please move it forward.  I'll post an update sometime before the
> > > next CF.
> >
> > Any update on this?
>
> I'll update this one early next week.

Great, thanks! I'll look at that.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: LogwrtResult contended spinlock

2021-01-22 Thread Alvaro Herrera
On 2021-Jan-22, Masahiko Sawada wrote:

> On Wed, Nov 25, 2020 at 12:02 AM Alvaro Herrera  
> wrote:

> > Yes, please move it forward.  I'll post an update sometime before the
> > next CF.
> 
> Any update on this?

I'll update this one early next week.

Thanks!

-- 
Álvaro Herrera   Valdivia, Chile
"I would rather have GNU than GNOT."  (ccchips, lwn.net/Articles/37595/)




Re: LogwrtResult contended spinlock

2021-01-22 Thread Masahiko Sawada
Hi Alvaro,

On Wed, Nov 25, 2020 at 12:02 AM Alvaro Herrera  wrote:
>
> On 2020-Nov-24, Anastasia Lubennikova wrote:
>
> > On 04.09.2020 20:13, Andres Freund wrote:
>
> > > Re general routine: On second thought, it might actually be worth having
> > > it. Even just for LSNs - there's plenty places where it's useful to
> > > ensure a variable is at least a certain size.  I think I would be in
> > > favor of a general helper function.
> > Do you mean by general helper function something like this?
> >
> > void
> > swap_lsn(XLogRecPtr old_value, XLogRecPtr new_value, bool to_largest)
>
> Something like that, yeah, though maybe name it "pg_atomic_increase_lsn"
> or some similar name that makes it clear that
>
> 1. it is supposed to use atomics
> 2. it can only be used to *advance* a value rather than a generic swap.
>
> (I'm not 100% clear that that's the exact API we need.)
>
> > This CF entry was inactive for a while. Alvaro, are you going to continue
> > working on it?
>
> Yes, please move it forward.  I'll post an update sometime before the
> next CF.

Any update on this?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: LogwrtResult contended spinlock

2020-11-24 Thread Alvaro Herrera
On 2020-Nov-24, Anastasia Lubennikova wrote:

> On 04.09.2020 20:13, Andres Freund wrote:

> > Re general routine: On second thought, it might actually be worth having
> > it. Even just for LSNs - there's plenty places where it's useful to
> > ensure a variable is at least a certain size.  I think I would be in
> > favor of a general helper function.
> Do you mean by general helper function something like this?
> 
> void
> swap_lsn(XLogRecPtr old_value, XLogRecPtr new_value, bool to_largest)

Something like that, yeah, though maybe name it "pg_atomic_increase_lsn"
or some similar name that makes it clear that 

1. it is supposed to use atomics
2. it can only be used to *advance* a value rather than a generic swap.

(I'm not 100% clear that that's the exact API we need.)

> This CF entry was inactive for a while. Alvaro, are you going to continue
> working on it?

Yes, please move it forward.  I'll post an update sometime before the
next CF.




Re: LogwrtResult contended spinlock

2020-11-24 Thread Anastasia Lubennikova

On 04.09.2020 20:13, Andres Freund wrote:

Hi,

On 2020-09-04 10:05:45 -0700, Andres Freund wrote:

On 2020-09-03 14:34:52 -0400, Alvaro Herrera wrote:

Looking at patterns like this

if (XLogCtl->LogwrtRqst.Write < EndPos)
XLogCtl->LogwrtRqst.Write = EndPos;

It seems possible to implement with

 do {
XLogRecPtr  currwrite;

 currwrite = pg_atomic_read_u64(LogwrtRqst.Write);
if (currwrite > EndPos)
 break;  // already done by somebody else
 if (pg_atomic_compare_exchange_u64(LogwrtRqst.Write,
   currwrite, EndPos))
 break;  // successfully updated
 } while (true);

This assumes that LogwrtRqst.Write never goes backwards, so it doesn't
seem good material for a general routine.

This *seems* correct to me, though this is muddy territory to me.  Also,
are there better ways to go about this?

Hm, I was thinking that we'd first go for reading it without a spinlock,
but continuing to write it as we currently do.

But yea, I can't see an issue with what you propose here. I personally
find do {} while () weird and avoid it when not explicitly useful, but
that's extremely minor, obviously.

Re general routine: On second thought, it might actually be worth having
it. Even just for LSNs - there's plenty places where it's useful to
ensure a variable is at least a certain size.  I think I would be in
favor of a general helper function.

Do you mean by general helper function something like this?

void
swap_lsn(XLogRecPtr old_value, XLogRecPtr new_value, bool to_largest)
{
  while (true) {
    XLogRecPtr  currwrite;

    currwrite = pg_atomic_read_u64(old_value);

    if (to_largest)
    {
      if (currwrite > new_value)
        break;  /* already done by somebody else */
    }
    else
    {
      if (currwrite < new_value)
        break;  /* already done by somebody else */
    }

    if (pg_atomic_compare_exchange_u64(old_value,
                                       &currwrite, new_value))
      break;  /* successfully updated */
  }
}


which will be called like
swap_lsn(XLogCtl->LogwrtRqst.Write, EndPos, true);



Greetings,

Andres Freund


This CF entry was inactive for a while. Alvaro, are you going to 
continue working on it?


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: LogwrtResult contended spinlock

2020-09-04 Thread Andres Freund
Hi,

On 2020-09-04 10:05:45 -0700, Andres Freund wrote:
> On 2020-09-03 14:34:52 -0400, Alvaro Herrera wrote:
> > Looking at patterns like this
> > 
> > if (XLogCtl->LogwrtRqst.Write < EndPos)
> > XLogCtl->LogwrtRqst.Write = EndPos;
> > 
> > It seems possible to implement with
> > 
> > do {
> > XLogRecPtr  currwrite;
> > 
> > currwrite = pg_atomic_read_u64(LogwrtRqst.Write);
> > if (currwrite > EndPos)
> > break;  // already done by somebody else
> > if (pg_atomic_compare_exchange_u64(LogwrtRqst.Write,
> >currwrite, EndPos))
> > break;  // successfully updated
> > } while (true);
> > 
> > This assumes that LogwrtRqst.Write never goes backwards, so it doesn't
> > seem good material for a general routine.
> > 
> > This *seems* correct to me, though this is muddy territory to me.  Also,
> > are there better ways to go about this?
> 
> Hm, I was thinking that we'd first go for reading it without a spinlock,
> but continuing to write it as we currently do.
> 
> But yea, I can't see an issue with what you propose here. I personally
> find do {} while () weird and avoid it when not explicitly useful, but
> that's extremely minor, obviously.

Re general routine: On second thought, it might actually be worth having
it. Even just for LSNs - there's plenty places where it's useful to
ensure a variable is at least a certain size.  I think I would be in
favor of a general helper function.

Greetings,

Andres Freund




Re: LogwrtResult contended spinlock

2020-09-04 Thread Andres Freund
Hi,

On 2020-09-03 14:34:52 -0400, Alvaro Herrera wrote:
> Looking at patterns like this
> 
>   if (XLogCtl->LogwrtRqst.Write < EndPos)
>   XLogCtl->LogwrtRqst.Write = EndPos;
> 
> It seems possible to implement with
> 
> do {
>   XLogRecPtr  currwrite;
> 
> currwrite = pg_atomic_read_u64(LogwrtRqst.Write);
>   if (currwrite > EndPos)
> break;  // already done by somebody else
> if (pg_atomic_compare_exchange_u64(LogwrtRqst.Write,
>  currwrite, EndPos))
> break;  // successfully updated
> } while (true);
> 
> This assumes that LogwrtRqst.Write never goes backwards, so it doesn't
> seem good material for a general routine.
> 
> This *seems* correct to me, though this is muddy territory to me.  Also,
> are there better ways to go about this?

Hm, I was thinking that we'd first go for reading it without a spinlock,
but continuing to write it as we currently do.

But yea, I can't see an issue with what you propose here. I personally
find do {} while () weird and avoid it when not explicitly useful, but
that's extremely minor, obviously.

Greetings,

Andres Freund




Re: LogwrtResult contended spinlock

2020-09-03 Thread Alvaro Herrera
Looking at patterns like this

if (XLogCtl->LogwrtRqst.Write < EndPos)
XLogCtl->LogwrtRqst.Write = EndPos;

It seems possible to implement with

do {
XLogRecPtr  currwrite;

currwrite = pg_atomic_read_u64(LogwrtRqst.Write);
if (currwrite > EndPos)
break;  // already done by somebody else
if (pg_atomic_compare_exchange_u64(LogwrtRqst.Write,
   currwrite, EndPos))
break;  // successfully updated
} while (true);

This assumes that LogwrtRqst.Write never goes backwards, so it doesn't
seem good material for a general routine.

This *seems* correct to me, though this is muddy territory to me.  Also,
are there better ways to go about this?

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: LogwrtResult contended spinlock

2020-08-31 Thread Andres Freund
Hi, 

On August 31, 2020 11:34:45 AM PDT, Alvaro Herrera  
wrote:
>On 2020-Aug-31, Andres Freund wrote:
>
>> Hi, 
>> 
>> On August 31, 2020 11:21:56 AM PDT, Alvaro Herrera
> wrote:
>
>> >At first I wanted to make the new LWLock cover only LogwrtResult
>> >proper,
>> >and leave LogwrtRqst alone.  However on doing it, it seemed that
>that
>> >might change the locking protocol in a nontrivial way.  So I decided
>to
>> >make it cover both and call it a day.  We did verify that the patch
>> >solves the reported problem, at any rate.
>> 
>> Wouldn't the better fix here be to allow reading of individual
>members
>> without a lock? E.g. by wrapping each in a 64bit atomic.
>
>Heh, Simon said the same.  It's not clear to me due to the lack of
>general availability of 64-bit atomics.  If they are spinlock-protected
>when emulated, I think that would make the problem worse.
>
>IIRC Thomas wanted to start relying on atomic 64-bit vars in some
>patch,
>but I don't remember what it was.

All relevant platforms have 64bit atomics. So I don't think there's much point 
in worrying about the emulated performance. Correctness, sure. Performance, not 
so much.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: LogwrtResult contended spinlock

2020-08-31 Thread Alvaro Herrera
On 2020-Aug-31, Andres Freund wrote:

> Hi, 
> 
> On August 31, 2020 11:21:56 AM PDT, Alvaro Herrera  
> wrote:

> >At first I wanted to make the new LWLock cover only LogwrtResult
> >proper,
> >and leave LogwrtRqst alone.  However on doing it, it seemed that that
> >might change the locking protocol in a nontrivial way.  So I decided to
> >make it cover both and call it a day.  We did verify that the patch
> >solves the reported problem, at any rate.
> 
> Wouldn't the better fix here be to allow reading of individual members
> without a lock? E.g. by wrapping each in a 64bit atomic.

Heh, Simon said the same.  It's not clear to me due to the lack of
general availability of 64-bit atomics.  If they are spinlock-protected
when emulated, I think that would make the problem worse.

IIRC Thomas wanted to start relying on atomic 64-bit vars in some patch,
but I don't remember what it was.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: LogwrtResult contended spinlock

2020-08-31 Thread Andres Freund
Hi, 

On August 31, 2020 11:21:56 AM PDT, Alvaro Herrera  
wrote:
>Jaime Casanova recently reported a situation where pglogical
>replicating
>from 64 POS sites to a single central (64-core) node, each with two
>replication sets, causes XLog's info_lck to become highly contended
>because of frequently reading LogwrtResult.  We tested the simple fix
>of
>adding a new LWLock that protects LogwrtResult and LogwrtRqst; that
>seems to solve the problem easily enough.
>
>At first I wanted to make the new LWLock cover only LogwrtResult
>proper,
>and leave LogwrtRqst alone.  However on doing it, it seemed that that
>might change the locking protocol in a nontrivial way.  So I decided to
>make it cover both and call it a day.  We did verify that the patch
>solves the reported problem, at any rate.

Wouldn't the better fix here be to allow reading of individual members without 
a lock? E.g. by wrapping each in a 64bit atomic.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




LogwrtResult contended spinlock

2020-08-31 Thread Alvaro Herrera
Jaime Casanova recently reported a situation where pglogical replicating
from 64 POS sites to a single central (64-core) node, each with two
replication sets, causes XLog's info_lck to become highly contended
because of frequently reading LogwrtResult.  We tested the simple fix of
adding a new LWLock that protects LogwrtResult and LogwrtRqst; that
seems to solve the problem easily enough.

At first I wanted to make the new LWLock cover only LogwrtResult proper,
and leave LogwrtRqst alone.  However on doing it, it seemed that that
might change the locking protocol in a nontrivial way.  So I decided to
make it cover both and call it a day.  We did verify that the patch
solves the reported problem, at any rate.
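The contended pattern can be reduced to a minimal sketch (hypothetical C11 stand-ins; PostgreSQL's real s_lock-based spinlock differs): every reader serializes on one flag just to copy two 64-bit values, so with many concurrent readers the lock's cache line ping-pongs between cores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
    uint64_t    write;          /* last byte + 1 written out */
    uint64_t    flush;          /* last byte + 1 flushed */
} LogwrtPos;

static atomic_flag  info_lck = ATOMIC_FLAG_INIT;    /* stand-in spinlock */
static LogwrtPos    shared_result;                  /* protected by info_lck */

/* Copy the shared position under the spinlock; this is the read path
 * whose contention the report describes. */
static LogwrtPos
read_logwrt_result(void)
{
    LogwrtPos   copy;

    while (atomic_flag_test_and_set_explicit(&info_lck, memory_order_acquire))
        ;                       /* spin until the flag is clear */
    copy = shared_result;
    atomic_flag_clear_explicit(&info_lck, memory_order_release);
    return copy;
}
```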

-- 
Álvaro Herrera        PostgreSQL Expert, https://www.2ndQuadrant.com/
From 3eb871f235c1b6005ff5dc88561173e4e92c1314 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 17 Aug 2020 19:05:22 -0400
Subject: [PATCH] use an LWLock rather than spinlock for Logwrt{Result,Rqst}

---
 src/backend/access/transam/xlog.c| 76 +---
 src/backend/storage/lmgr/lwlocknames.txt |  2 +-
 2 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..d838739bc4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -403,15 +403,13 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  * These structs are identical but are declared separately to indicate their
  * slightly different functions.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * To read/write XLogCtl->LogwrtResult, you must hold WALWriteResultLock.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, which is updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
- * (protected by info_lck), but we don't need to cache any copies of it.
+ * (also protected by WALWriteResultLock), but we don't need to cache any
+ * copies of it.
  *
  * info_lck is only held long enough to read/update the protected variables,
  * so it's a plain spinlock.  The other locks are held longer (potentially
@@ -615,10 +613,7 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
+	/* Protected by WALWriteResultLock. */
 	XLogwrtResult LogwrtResult;
 
 	/*
@@ -1158,13 +1153,12 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
-		SpinLockAcquire(&XLogCtl->info_lck);
 		/* advance global request to include new block(s) */
+		LWLockAcquire(WALWriteResultLock, LW_EXCLUSIVE);
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
 		LogwrtResult = XLogCtl->LogwrtResult;
-		SpinLockRelease(&XLogCtl->info_lck);
+		LWLockRelease(WALWriteResultLock);
 	}
 
 	/*
@@ -2155,12 +2149,12 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 			if (opportunistic)
 break;
 
-			/* Before waiting, get info_lck and update LogwrtResult */
-			SpinLockAcquire(&XLogCtl->info_lck);
+			/* Before waiting, update LogwrtResult */
+			LWLockAcquire(WALWriteResultLock, LW_EXCLUSIVE);
 			if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
 XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
 			LogwrtResult = XLogCtl->LogwrtResult;
-			SpinLockRelease(&XLogCtl->info_lck);
+			LWLockRelease(WALWriteResultLock);
 
 			/*
 			 * Now that we have an up-to-date LogwrtResult value, see if we
@@ -2178,9 +2172,11 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 
 WaitXLogInsertionsToFinish(OldPageRqstPtr);
 
-LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-
+LWLockAcquire(WALWriteResultLock, LW_SHARED);
 LogwrtResult = XLogCtl->LogwrtResult;
+LWLockRelease(WALWriteResultLock);
+
+LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 if (LogwrtResult.Write >= OldPageRqstPtr)
 {
 	/* OK, someone wrote it already */
@@ -2426,7 +2422,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	/*
 	 * Update local LogwrtResult (caller probably did this already, but...)
 	 */
+	LWLockAcquire(WALWriteResultLock, LW_SHARED);
 	LogwrtResult = XLogCtl->LogwrtResult;
+	LWLockRelease(WALWriteResultLock);
 
 	/*
 	 * Since successive pages in the xlog cache are consecutively allocated,
@@ -2663,13 +2661,14 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	 * code in a couple of places.
 	 */
 	{
-