On 25/10/2023 02:09, Andres Freund wrote:
> Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
> and Latch->is_set, we also sometimes end up with multiple processes signalling
> walwriter, which can be bad, because it increases the likelihood that some of
> the signals may be received when we are already holding WALWriteLock, delaying
> its release...

That can only happen when walwriter has just come out of "hibernation", i.e. when the system has been idle for a while. So probably not a big deal in practice.

> I think the most important optimization we need is to have
> XLogSetAsyncXactLSN() only wake up walwriter if there is a certain amount of
> unflushed WAL. Unless walwriter is hibernating, it will wake up on its own
> after wal_writer_delay.  I think we can just reuse WalWriterFlushAfter for
> this.
>
> E.g. a condition like
>
>     if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
>         return;
>
> drastically cuts down on the amount of wakeups, without - I think - losing
> guarantees around synchronous_commit=off.

In the patch, you actually did:

+       if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
+               return;

It means that you never wake up the walwriter to merely *write* the WAL. You only wake it up if it's going to also fsync() it. I think that's correct and appropriate, but it took me a while to reach that conclusion:

It might be beneficial to wake up the walwriter just to perform a write(), to offload that work from the backend. And walwriter will actually also perform an fsync() after finishing the current segment, so it would make sense to also wake it up when 'asyncXactLSN' crosses a segment boundary. However, if those extra wakeups make sense, they would also make sense when there are no asynchronous commits involved. Therefore those extra wakeups should be done elsewhere, perhaps somewhere around AdvanceXLInsertBuffer(). The condition you have in the patch is appropriate for XLogSetAsyncXactLSN().

Another reason to write the WAL aggressively, even if you don't flush it, would be to reduce the number of lost transactions on a segfault. But we don't give any guarantees on that, and even with the current aggressive logic, we only write when a page is full so you're anyway going to lose the last partial page.

It also took me a while to convince myself that this calculation matches the calculation that XLogBackgroundFlush() uses to determine whether it needs to flush or not. XLogBackgroundFlush() first divides the request and result by XLOG_BLCKSZ and then compares the number of blocks, whereas here you perform the calculation in bytes. I think the result is the same, but to make it more clear, let's do it the same way in both places.

See attached. It's the same logic as in your patch, just formulated more clearly IMHO.

> Because of the frequent wakeups, we do something else that's not smart: We
> write out 8k blocks individually, many times a second. Often thousands of
> 8k pwrites a second.
Even with this patch, when I bumped wal_writer_delay up to 2 s so that the walwriter gets woken up by the async commits rather than by the timeout, the write pattern is a bit silly:

$ strace -p 1099926 # walwriter
strace: Process 1099926 attached
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 1991) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., 1007616, 49152) = 1007616
fdatasync(5)                            = 0
pwrite64(5, "\24\321\5\0\1\0\0\0\0 \20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) = 16384
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., 1040384, 1073152) = 1040384
fdatasync(5)                            = 0
pwrite64(5, "\24\321\4\0\1\0\0\0\0@ \373\5\0\0\0\0\0\0\0\0\0\0\0;\0\0\0\264'\246\3"..., 16384, 2113536) = 16384
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0\200 \373\5\0\0\0003\0\0\0\0\0\0\0\320\177 \373\5\0\0\0"..., 1040384, 2129920) = 1040384
fdatasync(5)                            = 0

In each cycle, the walwriter writes a full 1 MB chunk (wal_writer_flush_after = '1MB'), flushes it, and then performs a smaller write before going to sleep.

Those smaller writes seem a bit silly. But I think it's fine.

> More throughput for less CPU, seems neat :)

Indeed, impressive speedup from such a small patch!

> I'm not addressing that here, but I think we also have the opposite behaviour
> - we're not waking up walwriter often enough. E.g. if you have lots of bulk
> data loads, walwriter will just wake up once per wal_writer_delay, leading to
> most of the work being done by backends. We should probably wake walwriter at
> the end of XLogInsertRecord() if there is sufficient outstanding WAL.

Right, that's basically the same issue that I reasoned through above. I did some quick testing with a few different settings of wal_buffers, wal_writer_flush_after and wal_writer_delay, to try to see that effect. But I was not able to find a case where it makes a difference.

--
Heikki Linnakangas
Neon (https://neon.tech)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea3..e2c0f17fd61 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2437,35 +2437,44 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
 {
 	XLogRecPtr	WriteRqstPtr = asyncXactLSN;
 	bool		sleeping;
+	bool		wakeup = false;
+	XLogRecPtr	prevAsyncXactLSN;
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	sleeping = XLogCtl->WalWriterSleeping;
+	prevAsyncXactLSN = XLogCtl->asyncXactLSN;
 	if (XLogCtl->asyncXactLSN < asyncXactLSN)
 		XLogCtl->asyncXactLSN = asyncXactLSN;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
-	 * If the WALWriter is sleeping, we should kick it to make it come out of
-	 * low-power mode.  Otherwise, determine whether there's a full page of
-	 * WAL available to write.
+	 * If somebody else already called this function with a more aggressive
+	 * LSN, they will have done what we needed (and perhaps more).
 	 */
-	if (!sleeping)
+	if (asyncXactLSN <= prevAsyncXactLSN)
+		return;
+
+	/*
+	 * If the WALWriter is sleeping, kick it to make it come out of low-power
+	 * mode, so that this async commit will reach disk within the expected
+	 * amount of time.  Otherwise, determine whether it has enough WAL
+	 * available to flush, the same way that XLogBackgroundFlush() does.
+	 */
+	if (sleeping)
+		wakeup = true;
+	else
 	{
-		/* back off to last completed page boundary */
-		WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
+		int			flushblocks;
 
-		/* if we have already flushed that far, we're done */
-		if (WriteRqstPtr <= LogwrtResult.Flush)
-			return;
+		flushblocks =
+			WriteRqstPtr / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
+
+		if (WalWriterFlushAfter == 0 || flushblocks >= WalWriterFlushAfter)
+			wakeup = true;
 	}
 
-	/*
-	 * Nudge the WALWriter: it has a full page of WAL to write, or we want it
-	 * to come out of low-power mode so that this async commit will reach disk
-	 * within the expected amount of time.
-	 */
-	if (ProcGlobal->walwriterLatch)
+	if (wakeup && ProcGlobal->walwriterLatch)
 		SetLatch(ProcGlobal->walwriterLatch);
 }
 
@@ -2784,7 +2793,7 @@ XLogBackgroundFlush(void)
 	bool		flexible = true;
 	static TimestampTz lastflush;
 	TimestampTz now;
-	int			flushbytes;
+	int			flushblocks;
 	TimeLineID	insertTLI;
 
 	/* XLOG doesn't need flushing during recovery */
@@ -2836,9 +2845,13 @@ XLogBackgroundFlush(void)
 	/*
 	 * Determine how far to flush WAL, based on the wal_writer_delay and
 	 * wal_writer_flush_after GUCs.
+	 *
+	 * Note that XLogSetAsyncXactLSN() performs a similar calculation based on
+	 * wal_writer_flush_after, to decide when to wake us up.  Make sure the
+	 * logic is the same in both places if you change this.
 	 */
 	now = GetCurrentTimestamp();
-	flushbytes =
+	flushblocks =
 		WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
 
 	if (WalWriterFlushAfter == 0 || lastflush == 0)
@@ -2857,7 +2870,7 @@ XLogBackgroundFlush(void)
 		WriteRqst.Flush = WriteRqst.Write;
 		lastflush = now;
 	}
-	else if (flushbytes >= WalWriterFlushAfter)
+	else if (flushblocks >= WalWriterFlushAfter)
 	{
 		/* exceeded wal_writer_flush_after blocks, flush */
 		WriteRqst.Flush = WriteRqst.Write;
