Hi Thomas, hackers,
>> > To move these writes out of recovery's way, we should probably just
>> > run the bgwriter process during crash recovery. I'm going to look
>> > into that.
>>
>> Sounds awesome.
>
>I wrote a quick and dirty experimental patch to try that. I can't see
>any benefit from it on pgbench with default shared buffers, but maybe
>it would do better with your append test due to locality, especially
>if you can figure out how to tune bgwriter to pace itself optimally.
>https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery
OK, so I've quickly tested those two PoC patches together, under the following conditions:
- a similar append-only workload generated by pgbench (to eliminate other already-known WAL bottlenecks, e.g. sorting) -- roughly as sketched after this list
- 4.3GB of WAL to be applied (mostly Btree/INSERT_LEAF)
- on the same system as last time (ext4 on NVMe, 1s8c16, 4.14 kernel)
- 14master with the SLRU-fsync-to-checkpointer / pg_qgsort patches already applied
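For reference, a minimal sketch of the kind of append-only setup I mean (script, scale and duration below are illustrative, not the exact ones I used):

pgbench -i -s 100 pgb
cat > append_only.sql <<'EOF'
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (1, 1, 1, 1, now());
EOF
# checkpoint_timeout/max_wal_size raised high enough beforehand that a few GB
# of WAL accumulate since the last checkpoint (my assumption, exact values vary)
pgbench -n -c 16 -j 16 -T 600 -f append_only.sql pgb
pg_ctl -D "$PGDATA" stop -m immediate   # simulate a crash
pg_ctl -D "$PGDATA" start               # redo time read off the server log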
TEST bgwriterPOC1:
- in severe dirty-memory conditions (artificially simulated here via a small s_b)
--> so for workloads with very high FlushBuffer activity in StartupXLOG
- with fsync=off/fpw=off by default and on NVMe (e.g. scenario: I want to perform some PITR as fast as I can to see what the production data looked like in the past, before some user deleted it); the conf fragment is sketched after this list
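The settings for this scenario boil down to the following postgresql.conf fragment (just restating the values above):

cat >> "$PGDATA/postgresql.conf" <<'EOF'
shared_buffers = 128MB      # artificially small, to force heavy FlushBuffer traffic in the startup process
fsync = off
full_page_writes = off
EOF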
baseline s_b@128MB: 140.404, 0.123 (1st number = redo time [s], 2nd = end-of-recovery checkpoint time; the 2nd is small as there is only a small region to checkpoint)
22.49% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--14.72%--copyin
| __pwrite_nocancel
| FileWrite
| mdwrite
| FlushBuffer
| ReadBuffer_common
| --14.52%--btree_xlog_insert
--7.77%--copyout
__pread_nocancel
--7.57%--FileRead
mdread
ReadBuffer_common
6.13% postgres [kernel.kallsyms] [k] do_syscall_64
|--1.64%--__pwrite_nocancel
--1.23%--__pread_nocancel
3.68% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--1.02%--smgropen
After applying:
patch -p1 < ../0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch
patch -p1 < ../0002-Optionally-don-t-wait-for-end-of-recovery-checkpoint.patch
0001+0002 s_b@128MB: similar result to above
0001+0002 s_b@128MB: 108.871, 0.114 , bgwriter_delay = 10ms / bgwriter_lru_maxpages = 1000
0001+0002 s_b@128MB: 85.392, 0.103 , bgwriter_delay = 10ms / bgwriter_lru_maxpages = 50000 #~390MB max?
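For reference, the pacing of the best run above as a conf fragment; if I read the GUCs right, 50000 pages * 8kB (default block size) =~ 390MB is what the bgwriter may write per bgwriter_delay round, hence my "~390MB" comment:

cat >> "$PGDATA/postgresql.conf" <<'EOF'
bgwriter_delay = 10ms           # the minimum allowed value
bgwriter_lru_maxpages = 50000   # 50000 * 8kB =~ 390MB per round
EOF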
18.40% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--17.79%--copyout
| __pread_nocancel
| |--16.56%--FileRead
| | mdread
| | ReadBuffer_common
--0.61%--copyin // WOW
__pwrite_nocancel
FileWrite
mdwrite
FlushBuffer
ReadBuffer_common
9.20% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--4.70%--smgropen
Of course, there is another WOW moment during recovery (the background writer itself at "61.9%" CPU):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 120935 0.9 0.0 866052 3824 ? Ss 09:47 0:00 postgres: checkpointer
postgres 120936 61.9 0.0 865796 3824 ? Rs 09:47 0:22 postgres: background writer
postgres 120937 97.4 0.0 865940 5228 ? Rs 09:47 0:36 postgres: startup recovering 000000010000000000000089
That's a speedup of 1.647x when dirty memory is in the way. When it's not:
baseline s_b@24000MB: 39.199, 1.448 (2x patches off)
0001+0002 s_b@24000MB: 39.383, 1.442 , bgwriter_delay = 10ms / bgwriter_lru_maxpages = 50000 #~390MB/s max, yay
There's no regression. I have only one comment about these two WIP patches: bgwriter_lru_maxpages should perhaps be called standby_bgwriter_lru_maxpages in this scenario, or, even better, there should be no maximum at all while the DB is still closed and recovering.
TEST bgwriterPOC2a: this showcases the 2nd patch, which opens the DB for read-write users before the final end-of-recovery checkpoint finishes after redo. The DBA may make that decision via the parameter end_of_recovery_checkpoint_wait=off.
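With the 0002 patch applied that is, as far as I understand the WIP patch, just:

cat >> "$PGDATA/postgresql.conf" <<'EOF'
end_of_recovery_checkpoint_wait = off   # GUC added by the WIP 0002 patch
EOF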
- on slow storage (xvda, fsync=on) and even with high memory:
s_b@24000MB: 39.043, 15.639 -- even with WAL recovery being 100% CPU bound (mostly on hash_search_with_hash_value() for buffers / __memmove_ssse3_back), it took an additional ~15s to perform the end-of-recovery checkpoint before the DB was open for users (it had to write 269462 buffers =~ 2GB at =~ 140MB/s, which is close to the xvda device speed). The complete output on 14master looks similar to this:
1598609928.620 startup 22543 LOG: redo done at 1/12201C88
1598609928.624 checkpointer 22541 LOG: checkpoint starting: end-of-recovery immediate wait
1598609944.908 checkpointer 22541 LOG: checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.145 s, sync=0.138 s, total=16.285 s; sync files=11, longest=0.133 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB
1598609944.912 postmaster 22538 LOG: database system is ready to accept connections
s_b@24000MB: 39.96, 0 , with end_of_recovery_checkpoint_wait = off -- the DB opens ~15s sooner:
1598610331.556 startup 29499 LOG: redo done at 1/12201C88
1598610331.559 checkpointer 29497 LOG: checkpoint starting: immediate force
1598610331.562 postmaster 29473 LOG: database system is ready to accept connections
1598610347.202 checkpointer 29497 LOG: checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.092 s, sync=0.149 s, total=15.643 s; sync files=12, longest=0.142 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB
I suppose a checkpoint for a large shared_buffers (hundreds of GB) might take a lot of time, and this 0002 patch bypasses that. I would find it quite useful in some scenarios (e.g. testing backups, PITR recoveries, opening a DB from storage snapshots / storage replication, maybe for DWH-after-crash too).
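As a rough illustration of the potential saving (my back-of-the-envelope arithmetic, using the ~140MB/s checkpoint write rate observed above): with, say, 100GB of dirty buffers to flush, the end-of-recovery checkpoint alone would keep the DB closed for about 100*1024/140 =~ 730s, i.e. roughly 12 minutes, all of which the 0002 patch moves past the point where the DB starts accepting connections.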
TEST bgwriterPOC2b: FYI, I was also testing the hot_standby code path -- to see whether the parameter would reduce the time to start / open a fresh standby for read-only queries, but it doesn't seem to influence that in my tests. As I've learned, reproducing what I'm after is apparently much more complex and involves a lot of reading about LogStandbySnapshot() / standby recovery points on my side.
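For completeness, the kind of fresh-standby setup I had in mind was simply (an illustrative sketch; host and options are placeholders):

pg_basebackup -h primary.example -D "$PGDATA" -R -X stream   # -R creates standby.signal + primary_conninfo
# hot_standby = on (the default); the point was to measure how quickly the
# standby starts accepting read-only connections after:
pg_ctl -D "$PGDATA" start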
Now, back to the smgropen() / hash_search_with_hash_value() reproducer...
-Jakub Wartak.