Hi,

On 2021-07-24 15:50:36 -0700, Andres Freund wrote:
> As an example, here's the log output of a workload that has a replication slot
> preventing WAL files from being recycled (and too small max_wal_size):
>
> 2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55767 buffers (42.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, longest=0.010 s, average=0.005 s; distance=540578 kB, estimate=540905 kB
> 2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55806 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, longest=0.008 s, average=0.004 s; distance=540708 kB, estimate=540886 kB
> 2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55850 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, longest=0.009 s, average=0.004 s; distance=540657 kB, estimate=540863 kB
> 2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55879 buffers (42.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, longest=0.007 s, average=0.003 s; distance=540733 kB, estimate=540850 kB
> 2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55847 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, longest=0.008 s, average=0.003 s; distance=540640 kB, estimate=540829 kB
>
> So it's 3 new WAL files in that timeframe, one might think?
>
> A probe instrumenting xlog file creation shows something very different:
> perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a XLogFileInitInternal:39
> (39 is the O_CREAT BasicOpenFile(), not the recycle path).
>
> perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
>      1.001030943                  9      probe_postgres:XLogFileInitInternal_L39
>      2.002998896                  8      probe_postgres:XLogFileInitInternal_L39
>      3.005098857                  8      probe_postgres:XLogFileInitInternal_L39
>      4.007000426                  6      probe_postgres:XLogFileInitInternal_L39
>      5.008423119                  9      probe_postgres:XLogFileInitInternal_L39
>      6.013427568                  8      probe_postgres:XLogFileInitInternal_L39
>      7.015087613                  8      probe_postgres:XLogFileInitInternal_L39
>      8.017393339                  8      probe_postgres:XLogFileInitInternal_L39
>      9.018425526                  7      probe_postgres:XLogFileInitInternal_L39
>     10.020398520                  5      probe_postgres:XLogFileInitInternal_L39
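
To compare those per-second numbers with the "N WAL file(s) added" figures in
the checkpoint log, the interval counts from perf stat can be totalled with a
bit of awk, roughly like this (the "sleep 20" merely bounds the measurement
window, and perf stat writes its interval output to stderr, hence the
redirect):

# total WAL file creations over a 20 second window, using the probe from above
perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000 sleep 20 2>&1 \
    | awk '/XLogFileInitInternal_L39/ { gsub(",", "", $2); sum += $2 } END { print sum }'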

And even more extreme, the logs can end up suggesting pg_wal is shrinking,
when the opposite is the case. Here's the checkpoint output from a parallel
copy data load (without a replication slot holding things back):

2021-07-24 15:59:03.215 PDT [2253324][checkpointer][:0] LOG:  checkpoint complete: wrote 22291 buffers (17.0%); 0 WAL file(s) added, 27 removed, 141 recycled; write=9.737 s, sync=4.153 s, total=14.884 s; sync files=108, longest=0.116 s, average=0.039 s; distance=2756904 kB, estimate=2756904 kB
2021-07-24 15:59:12.978 PDT [2253324][checkpointer][:0] LOG:  checkpoint complete: wrote 21840 buffers (16.7%); 0 WAL file(s) added, 53 removed, 149 recycled; write=5.531 s, sync=3.008 s, total=9.763 s; sync files=81, longest=0.201 s, average=0.037 s; distance=3313885 kB, estimate=3313885 kB
2021-07-24 15:59:23.421 PDT [2253324][checkpointer][:0] LOG:  checkpoint complete: wrote 22326 buffers (17.0%); 0 WAL file(s) added, 56 removed, 149 recycled; write=5.787 s, sync=3.230 s, total=10.436 s; sync files=81, longest=0.099 s, average=0.040 s; distance=3346125 kB, estimate=3346125 kB
2021-07-24 15:59:34.424 PDT [2253324][checkpointer][:0] LOG:  checkpoint complete: wrote 22155 buffers (16.9%); 0 WAL file(s) added, 60 removed, 148 recycled; write=6.096 s, sync=3.432 s, total=10.995 s; sync files=81, longest=0.101 s, average=0.043 s; distance=3409281 kB, estimate=3409281 kB

It does look like WAL space usage is shrinking, but the opposite is true:
we're creating so much WAL that the checkpointer can't checkpoint fast enough
to keep WAL usage below max_wal_size. So WAL files are constantly created that
then need to be removed again (hence the non-zero removed counts). Here's the
same probe during this load:

#           time             counts unit events
   277.087990275                 15      probe_postgres:XLogFileInitInternal_L39
   278.098549960                 15      probe_postgres:XLogFileInitInternal_L39
   279.104787575                  6      probe_postgres:XLogFileInitInternal_L39
   280.108980690                  5      probe_postgres:XLogFileInitInternal_L39
   281.111781617                  6      probe_postgres:XLogFileInitInternal_L39
   282.113601958                  2      probe_postgres:XLogFileInitInternal_L39
   283.115711683                  0      probe_postgres:XLogFileInitInternal_L39
   284.121508636                  0      probe_postgres:XLogFileInitInternal_L39
   285.124865325                  0      probe_postgres:XLogFileInitInternal_L39
   286.126932016                  0      probe_postgres:XLogFileInitInternal_L39
   287.129874993                 11      probe_postgres:XLogFileInitInternal_L39
   288.131838429                 15      probe_postgres:XLogFileInitInternal_L39
   289.133609021                 13      probe_postgres:XLogFileInitInternal_L39
   290.136254341                  6      probe_postgres:XLogFileInitInternal_L39
   291.139368485                  5      probe_postgres:XLogFileInitInternal_L39
   292.142728293                  6      probe_postgres:XLogFileInitInternal_L39
   293.148078766                  2      probe_postgres:XLogFileInitInternal_L39
   294.150258476                  0      probe_postgres:XLogFileInitInternal_L39
   295.172398897                  0      probe_postgres:XLogFileInitInternal_L39
   296.174658196                  0      probe_postgres:XLogFileInitInternal_L39
   297.176818943                  0      probe_postgres:XLogFileInitInternal_L39
   298.179003473                 14      probe_postgres:XLogFileInitInternal_L39
   299.181597777                 14      probe_postgres:XLogFileInitInternal_L39
   300.184711566                 14      probe_postgres:XLogFileInitInternal_L39
   301.188919194                  6      probe_postgres:XLogFileInitInternal_L39
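
To see the real trend it's easier to look at pg_wal itself rather than at the
checkpoint log, e.g. with something crude like this while the load is running
(pg_ls_waldir() is available from v10 onwards; connection options as
appropriate):

# sample the actual number and total size of WAL segments every few seconds
while sleep 5; do
    psql -XAtc "SELECT now(), count(*), pg_size_pretty(sum(size)) FROM pg_ls_waldir()"
done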

Greetings,

Andres Freund

