Hi,

I recently mentioned to Heikki that I was seeing latch-related wakeups frequent
enough to prevent walwriter from doing a whole lot of useful work. He asked
me to write that set of concerns up, which seems quite fair...


Here's a profile of walwriter while the following pgbench run was ongoing:

c=1; psql -Xq -c 'drop table if exists testtable_logged; CREATE TABLE testtable_logged(v int not null default 0);' && \
  PGOPTIONS='-c synchronous_commit=off' \
  pgbench -n -c$c -j$c -Mprepared -T150 \
  -f <(echo 'INSERT INTO testtable_logged DEFAULT VALUES;') -P1

Looking at top, walwriter is around 15-20% busy with this
workload. Unfortunately a profile quickly shows that little of that work is
useful:

perf record --call-graph dwarf -m16M -p $(pgrep -f 'walwriter') sleep 3

-   94.42%     0.00%  postgres  postgres          [.] AuxiliaryProcessMain
     AuxiliaryProcessMain
   - WalWriterMain
      + 78.26% WaitLatch
      + 14.01% XLogBackgroundFlush
      + 0.51% pgstat_report_wal
        0.29% ResetLatch
        0.13% pgstat_flush_io
      + 0.02% asm_sysvec_apic_timer_interrupt
        0.01% HandleWalWriterInterrupts (inlined)


This is confirmed by the distribution of syscalls being made:

perf trace -m128M --summary -p $(pgrep -f 'walwriter') sleep 5
   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)       (%)
   --------------- --------  ------ -------- --------- --------- --------- --------
   epoll_wait        216610      0  3744.984     0.000     0.017     0.113     0.03%
   read              216602      0   333.905     0.001     0.002     0.029     0.03%
   fdatasync             27      0    94.703     1.939     3.508    11.279     8.83%
   pwrite64            2998      0    15.646     0.004     0.005     0.027     0.45%
   openat                 2      0     0.019     0.006     0.010     0.013    34.84%
   close                  2      0     0.004     0.002     0.002     0.003    25.76%

We're doing far more latch-related work than actual work.

The walwriter wakes up many, many times without having anything to do.

And if you increase the number of clients to e.g. c=8, it gets worse in some
ways:

perf trace:
   epoll_wait        291512      0  2364.067     0.001     0.008     0.693     0.10%
   read              290938      0   479.837     0.001     0.002     0.020     0.05%
   fdatasync            146      0   410.043     2.508     2.809     7.006     1.90%
   futex              56384  43982   183.896     0.001     0.003     2.791     1.65%
   pwrite64           17058      0   105.625     0.004     0.006     4.015     4.61%
   clock_nanosleep        1      0     1.063     1.063     1.063     1.063     0.00%
   openat                 9      0     0.072     0.006     0.008     0.014    14.35%
   close                  9      0     0.018     0.002     0.002     0.003     5.55%

Note that we have 5x more lock waits (the futex calls) than writes!


I think the problem is mainly that XLogSetAsyncXactLSN() wakes up walwriter
whenever it is sleeping, regardless of whether the modified asyncXactLSN will
lead to a write.  We even wake up walwriter when we haven't changed
asyncXactLSN, because our LSN is older than some other backend's!

So often we'll just wake up walwriter, which finds no work, immediately goes
to sleep, just to be woken again.
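
For reference, the pre-patch logic in XLogSetAsyncXactLSN() looks roughly like
this (abridged - compare the context lines in the second patch below):

	SpinLockAcquire(&XLogCtl->info_lck);
	LogwrtResult = XLogCtl->LogwrtResult;
	sleeping = XLogCtl->WalWriterSleeping;
	if (XLogCtl->asyncXactLSN < asyncXactLSN)
		XLogCtl->asyncXactLSN = asyncXactLSN;
	SpinLockRelease(&XLogCtl->info_lck);

	if (!sleeping)
	{
		/* back off to last completed page boundary */
		WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;

		/* if we have already flushed that far, we're done */
		if (WriteRqstPtr <= LogwrtResult.Flush)
			return;
	}

	/* otherwise, unconditionally nudge walwriter */
	if (ProcGlobal->walwriterLatch)
		SetLatch(ProcGlobal->walwriterLatch);

I.e. whenever walwriter has announced that it is sleeping, every single call
ends in a SetLatch(), no matter how little WAL has accumulated.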

Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...

Because of the frequent wakeups, we do something else that's not smart: We
write out 8k blocks individually, many times a second. Often thousands of
8k pwrites a second.

We also acquire WALWriteLock and call WaitXLogInsertionsToFinish(), even if we
could already know we're not going to flush! That's not cheap when you do it
this many times a second.


There is an absolutely basic optimization, helping a bit at higher client
counts: don't wake walwriter if the new asyncXactLSN is <= the old one. But it
doesn't help that much.
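
In code form that's just remembering the previous value while still holding
the spinlock and bailing out early if we didn't advance it; the second
attached patch includes exactly this check:

	prevAsyncXactLSN = XLogCtl->asyncXactLSN;
	if (XLogCtl->asyncXactLSN < asyncXactLSN)
		XLogCtl->asyncXactLSN = asyncXactLSN;
	SpinLockRelease(&XLogCtl->info_lck);

	if (asyncXactLSN <= prevAsyncXactLSN)
		return;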

I think the most important optimization we need is to have
XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed
WAL. Unless walwriter is hibernating, it will wake up on its own after
wal_writer_delay anyway.  I think we can just reuse WalWriterFlushAfter for this.

E.g. a condition like

	if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
		return;

drastically cuts down on the number of wakeups, without - I think - losing
guarantees around synchronous_commit=off.
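
With default settings (wal_writer_flush_after = 1MB, i.e. WalWriterFlushAfter
= 128 blocks of the default 8kB XLOG_BLCKSZ, and wal_writer_delay = 200ms)
that means a backend only wakes walwriter once at least 1MB of WAL has
accumulated beyond the last known write position; for anything less,
walwriter's own wal_writer_delay-driven loop still picks the work up soon
enough for async commits.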

1 client:

before:
tps = 42926.288765 (without initial connection time)

   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)       (%)
   --------------- --------  ------ -------- --------- --------- --------- --------
   epoll_wait        209077      0  3746.918     0.000     0.018     0.143     0.03%
   read              209073      0   310.532     0.001     0.001     0.021     0.02%
   fdatasync             25      0    82.673     2.623     3.307     3.457     1.13%
   pwrite64            2892      0    14.600     0.004     0.005     0.018     0.43%

after:

tps = 46244.394058 (without initial connection time)

   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)       (%)
   --------------- --------  ------ -------- --------- --------- --------- --------
   epoll_wait            25      0  4732.625     0.000   189.305   200.281     4.17%
   fdatasync             25      0    90.264     2.814     3.611     3.835     1.02%
   pwrite64              48      0    15.825     0.020     0.330     0.707    12.76%
   read                  21      0     0.117     0.003     0.006     0.007     3.69%


8 clients:

before:
tps = 279316.646315 (without initial connection time)

 postgres (2861734), 1215159 events, 100.0%

   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)       (%)
   --------------- --------  ------ -------- --------- --------- --------- --------
   epoll_wait        267517      0  2150.206     0.000     0.008     0.973     0.12%
   read              266683      0   512.348     0.001     0.002     0.036     0.08%
   fdatasync            149      0   413.658     2.583     2.776     3.395     0.29%
   futex              56597  49588   183.174     0.001     0.003     1.047     0.69%
   pwrite64           17516      0   126.208     0.004     0.007     2.927     3.93%


after:

tps = 290958.322594 (without initial connection time)

 postgres (2861534), 1626 events, 100.0%

   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)       (%)
   --------------- --------  ------ -------- --------- --------- --------- --------
   epoll_wait           153      0  4383.285     0.000    28.649    32.699     0.92%
   fdatasync            153      0   464.088     2.452     3.033    19.999     4.88%
   pwrite64             306      0    80.361     0.049     0.263     0.590     4.38%
   read                 153      0     0.459     0.002     0.003     0.004     1.37%
   futex                 49     46     0.211     0.002     0.004     0.038    17.05%


More throughput for less CPU, seems neat :)


I'm not addressing that here, but I think we also have the opposite behaviour
- we're not waking up walwriter often enough. E.g. if you have lots of bulk
data loads, walwriter will just wake up once per wal_writer_delay, leading to
most of the work being done by backends. We should probably wake walwriter at
the end of XLogInsertRecord() if there is sufficient outstanding WAL.
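
A very rough sketch of the shape that could take - hypothetical, not part of
the attached patches, and ignoring that the backend-local LogwrtResult copy
may be stale:

	/* at the end of XLogInsertRecord(), once the insert is done */
	if (EndPos - LogwrtResult.Write >= WalWriterFlushAfter * XLOG_BLCKSZ &&
		ProcGlobal->walwriterLatch != NULL)
		SetLatch(ProcGlobal->walwriterLatch);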

Greetings,

Andres Freund
From 953b38dc4e6f35e25abf805c531350f43e867d15 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Tue, 24 Oct 2023 16:00:18 -0700
Subject: [PATCH v1 1/2] xlog.c: Rename misleadingly named variable

---
 src/backend/access/transam/xlog.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea3..6065fd63b8b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2784,7 +2784,7 @@ XLogBackgroundFlush(void)
 	bool		flexible = true;
 	static TimestampTz lastflush;
 	TimestampTz now;
-	int			flushbytes;
+	int			flushblocks;
 	TimeLineID	insertTLI;
 
 	/* XLOG doesn't need flushing during recovery */
@@ -2838,7 +2838,7 @@ XLogBackgroundFlush(void)
 	 * wal_writer_flush_after GUCs.
 	 */
 	now = GetCurrentTimestamp();
-	flushbytes =
+	flushblocks =
 		WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
 
 	if (WalWriterFlushAfter == 0 || lastflush == 0)
@@ -2857,7 +2857,7 @@ XLogBackgroundFlush(void)
 		WriteRqst.Flush = WriteRqst.Write;
 		lastflush = now;
 	}
-	else if (flushbytes >= WalWriterFlushAfter)
+	else if (flushblocks >= WalWriterFlushAfter)
 	{
 		/* exceeded wal_writer_flush_after blocks, flush */
 		WriteRqst.Flush = WriteRqst.Write;
-- 
2.38.0

From 5e56bd0801962bd04fa2829d6f42579d85c47958 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Tue, 24 Oct 2023 16:01:11 -0700
Subject: [PATCH v1 2/2] Reduce rate of wakeups of walwriter due to async
 commits

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/transam/xlog.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6065fd63b8b..867ab3bd742 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2437,14 +2437,23 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
 {
 	XLogRecPtr	WriteRqstPtr = asyncXactLSN;
 	bool		sleeping;
+	XLogRecPtr  prevAsyncXactLSN;
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	sleeping = XLogCtl->WalWriterSleeping;
+	prevAsyncXactLSN = XLogCtl->asyncXactLSN;
 	if (XLogCtl->asyncXactLSN < asyncXactLSN)
 		XLogCtl->asyncXactLSN = asyncXactLSN;
 	SpinLockRelease(&XLogCtl->info_lck);
 
+	/*
+	 * If somebody else already called this function with a more aggressive
+	 * LSN, they will have done what we needed (and perhaps more).
+	 */
+	if (asyncXactLSN <= prevAsyncXactLSN)
+		return;
+
 	/*
 	 * If the WALWriter is sleeping, we should kick it to make it come out of
 	 * low-power mode.  Otherwise, determine whether there's a full page of
@@ -2458,6 +2467,18 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
 		/* if we have already flushed that far, we're done */
 		if (WriteRqstPtr <= LogwrtResult.Flush)
 			return;
+
+		/*
+		 * There is no benefit in waking walwriter if we only have a small
+		 * amount of writes outstanding. We want to flush WAL at least once
+		 * per wal_writer_flush_after and not defer writing for more than
+		 * wal_writer_delay. As we wake walwriter below, if it's currently
+		 * hibernating, its own loop will ensure that wal_writer_delay is
+		 * taken care of. This condition takes care of flushing if there is
+		 * more than wal_writer_flush_after unflushed data.
+		 */
+		if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
+			return;
 	}
 
 	/*
-- 
2.38.0
