Re: WIP: WAL prefetch (another approach)
On Wed, Apr 13, 2022 at 8:05 AM Thomas Munro wrote:
> On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker wrote:
> > Simon Riggs writes:
> > > This is a nice feature if it is safe to turn off full_page_writes.
> > > When is it safe to do that? On which platform?
> > >
> > > I am not aware of any released software that allows full_page_writes
> > > to be safely disabled. Perhaps something has been released recently
> > > that allows this? I think we have substantial documentation about
> > > safety of other settings, so we should carefully document things here
> > > also.
> >
> > Our WAL reliability docs claim that ZFS is safe against torn pages:
> >
> > https://www.postgresql.org/docs/current/wal-reliability.html:
> >
> > If you have file-system software that prevents partial page writes
> > (e.g., ZFS), you can turn off this page imaging by turning off the
> > full_page_writes parameter.
>
> Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS
> right now :-(.

Update: OpenZFS now has this working in its master branch (Linux only for
now), so fingers crossed for the next release.
Re: WIP: WAL prefetch (another approach)
On Tue, Apr 26, 2022 at 6:11 PM Thomas Munro wrote: > I will poke some more tomorrow to try to confirm this and try to come > up with a fix. Done, and moved over to the pg_walinspect commit thread to reach the right eyeballs: https://www.postgresql.org/message-id/CA%2BhUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA%40mail.gmail.com
Re: WIP: WAL prefetch (another approach)
On Tue, Apr 26, 2022 at 6:11 AM Tom Lane wrote:
> I believe that the WAL prefetch patch probably accounts for the
> intermittent errors that buildfarm member topminnow has shown
> since it went in, eg [1]:
>
> diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out

Hmm, maybe but I suspect not. I think I might see what's happening here.

> +ERROR: could not read WAL at 0/1903E40
> I've reproduced this manually on that machine, and confirmed that the
> proximate cause is that XLogNextRecord() is returning NULL because
> state->decode_queue_head == NULL, without bothering to provide an errormsg
> (which doesn't seem very well thought out in itself). I obtained the

Thanks for doing that. After several hours of trying I also managed to
reproduce it on that gcc23 system (not at all sure why it doesn't show up
elsewhere; MIPS 32 bit layout may be a factor), and added some trace to get
some more clues. Still looking into it, but here is the current hypothesis
I'm testing:

1. The reason there's a messageless ERROR in this case is because there is
new read_page callback logic introduced for pg_walinspect, called via
read_local_xlog_page_no_wait(), which is like the old read_local_xlog_page()
except that it returns -1 if you try to read past the current "flushed" LSN,
and we have no queued message. An error is then reported by
XLogReadRecord(), and appears to the user.

2. The reason pg_walinspect tries to read WAL data past the flushed LSN is
because its GetWALRecordsInfo() function keeps calling XLogReadRecord()
until EndRecPtr >= end_lsn, where end_lsn is taken from a snapshot of the
flushed LSN, but I don't see where it takes into account that the flushed
LSN might momentarily fall in the middle of a record. In that case,
xlogreader.c will try to read the next page, which fails because it's past
the flushed LSN (see point 1).

I will poke some more tomorrow to try to confirm this and try to come up
with a fix.
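To spell that second point out as (rough, paraphrased) code -- this is not
the actual pg_walinspect source, just the shape of the loop being described:

/* Paraphrased sketch of the suspected problem, not the real code. */
XLogRecPtr  end_lsn = GetFlushRecPtr(NULL);    /* snapshot of the flushed LSN */

while (XLogReadRecord(xlogreader, &errormsg) != NULL &&
       xlogreader->EndRecPtr < end_lsn)
{
    /* collect information about this record */
}

/*
 * If end_lsn happens to land in the middle of a record, decoding that record
 * makes xlogreader.c ask for the next page, and the no-wait callback refuses
 * to read past the flushed LSN, so the loop ends in an ERROR instead of
 * stopping cleanly at the last record that is fully flushed.
 */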
Re: WIP: WAL prefetch (another approach)
Oh, one more bit of data: here's an excerpt from pg_waldump output after the
failed test:

rmgr: Btree       len (rec/tot): 72/72, tx: 727, lsn: 0/01903BC8, prev 0/01903B70, desc: INSERT_LEAF off 111, blkref #0: rel 1663/16384/2673 blk 9
rmgr: Btree       len (rec/tot): 72/72, tx: 727, lsn: 0/01903C10, prev 0/01903BC8, desc: INSERT_LEAF off 141, blkref #0: rel 1663/16384/2674 blk 7
rmgr: Standby     len (rec/tot): 42/42, tx: 727, lsn: 0/01903C58, prev 0/01903C10, desc: LOCK xid 727 db 16384 rel 16391
rmgr: Transaction len (rec/tot): 437/437, tx: 727, lsn: 0/01903C88, prev 0/01903C58, desc: COMMIT 2022-04-25 20:16:03.374197 CEST; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16391
rmgr: Heap        len (rec/tot): 59/59, tx: 728, lsn: 0/01903E40, prev 0/01903C88, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap        len (rec/tot): 59/59, tx: 728, lsn: 0/01903E80, prev 0/01903E40, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/34, tx: 728, lsn: 0/01903EC0, prev 0/01903E80, desc: COMMIT 2022-04-25 20:16:03.379323 CEST
rmgr: Heap        len (rec/tot): 59/59, tx: 729, lsn: 0/01903EE8, prev 0/01903EC0, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap        len (rec/tot): 59/59, tx: 729, lsn: 0/01903F28, prev 0/01903EE8, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/34, tx: 729, lsn: 0/01903F68, prev 0/01903F28, desc: COMMIT 2022-04-25 20:16:03.381720 CEST

The error is complaining about not being able to read 0/01903E40, which
AFAICT is from the first "INSERT INTO sample_tbl" command, which most
certainly ought to be down to disk at this point.

Also, I modified the test script to see what WAL LSNs it thought it was
dealing with, and got

+\echo 'wal_lsn1 = ' :wal_lsn1
+wal_lsn1 = 0/1903E40
+\echo 'wal_lsn2 = ' :wal_lsn2
+wal_lsn2 = 0/1903EE8

confirming that idea of where 0/01903E40 is in the WAL history.

So this is sure looking like a bug somewhere in xlogreader.c, not in
pg_walinspect.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
I believe that the WAL prefetch patch probably accounts for the
intermittent errors that buildfarm member topminnow has shown
since it went in, eg [1]:

diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out
--- /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out  2022-04-10 03:05:15.972622440 +0200
+++ /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out  2022-04-25 05:09:49.861642059 +0200
@@ -34,11 +34,7 @@
 (1 row)
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok
-
- t
-(1 row)
-
+ERROR: could not read WAL at 0/1903E40
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats(:'wal_lsn1', :'wal_lsn2');
  ok
@@ -46,11 +42,7 @@
 (1 row)
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
- ok
-
- t
-(1 row)
-
+ERROR: could not read WAL at 0/1903E40
 -- ===
 -- Test for filtering out WAL records of a particular table
 -- ===

I've reproduced this manually on that machine, and confirmed that the
proximate cause is that XLogNextRecord() is returning NULL because
state->decode_queue_head == NULL, without bothering to provide an errormsg
(which doesn't seem very well thought out in itself). I obtained the
contents of the xlogreader struct at failure:

(gdb) p *xlogreader
$1 = {routine = {page_read = 0x594270 , segment_open = 0x593b44 ,
    segment_close = 0x593d38 }, system_identifier = 0, private_data = 0x0,
  ReadRecPtr = 26230672, EndRecPtr = 26230752, abortedRecPtr = 26230752,
  missingContrecPtr = 26230784, overwrittenRecPtr = 0,
  DecodeRecPtr = 26230672, NextRecPtr = 26230752, PrevRecPtr = 0,
  record = 0x0, decode_buffer = 0xf25428 "\240", decode_buffer_size = 65536,
  free_decode_buffer = true, decode_buffer_head = 0xf25428 "\240",
  decode_buffer_tail = 0xf25428 "\240", decode_queue_head = 0x0,
  decode_queue_tail = 0x0, readBuf = 0xf173f0 "\020\321\005", readLen = 0,
  segcxt = { ws_dir = '\000' , ws_segsize = 16777216},
  seg = { ws_file = 25, ws_segno = 0, ws_tli = 1}, segoff = 0,
  latestPagePtr = 26222592, latestPageTLI = 1, currRecPtr = 26230752,
  currTLI = 1, currTLIValidUntil = 0, nextTLI = 0,
  readRecordBuf = 0xf1b3f8 "<", readRecordBufSize = 40960,
  errormsg_buf = 0xef3270 "", errormsg_deferred = false, nonblocking = false}

I don't have an intuition about where to look beyond that, any suggestions?
What I do know so far is that while the failure reproduces fairly reliably
under "make check" (more than half the time, which squares with topminnow's
history), it doesn't reproduce at all under "make installcheck" (after
removing NO_INSTALLCHECK), which seems odd. Maybe it's dependent on how much
WAL history the installation has accumulated?

It could be that this is a bug in pg_walinspect or a fault in its test case;
hard to tell since that got committed at about the same time as the prefetch
changes.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47
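(For reference, the code path that produces a NULL result with no message is
the empty-queue branch of XLogNextRecord(); condensed -- not a verbatim quote
of xlogreader.c -- it looks roughly like this:)

if (state->decode_queue_head == NULL)
{
    *errormsg = NULL;
    if (state->errormsg_deferred)
    {
        if (state->errormsg_buf[0] != '\0')
            *errormsg = state->errormsg_buf;
        state->errormsg_deferred = false;
    }
    /* no record and, unless an error was deferred, no message either */
    return NULL;
}

In the struct dump above, errormsg_deferred is false, which is why the caller
ends up with neither a record nor an error message.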
Re: WIP: WAL prefetch (another approach)
On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker wrote:
> Simon Riggs writes:
> > This is a nice feature if it is safe to turn off full_page_writes.

As others have said/shown, it does also help if a block with FPW is evicted
and then read back in during one checkpoint cycle, in other words if the
working set is larger than shared buffers.

This also provides infrastructure for proposals in the next cycle, as part
of commitfest #3316:

* in direct I/O mode, I/O stalls become more likely due to lack of kernel
  prefetching/double-buffering, so prefetching becomes more essential
* even in buffered I/O mode when benefiting from free double-buffering, the
  copy from kernel buffer to user space buffer can be finished in the
  background instead of calling pread() when you need the page, but you need
  to start it sooner
* adjacent blocks accessed by nearby records can be merged into a single
  scatter-read, for example with preadv() in the background
* repeated buffer lookups, pins, locks (and maybe eventually replay) to the
  same page can be consolidated

Pie-in-the-sky ideas:

* someone might eventually want to be able to replay in parallel (hard, but
  certainly requires lookahead)
* I sure hope we'll eventually use different techniques for torn-page
  protection to avoid the high online costs of FPW

> > When is it safe to do that? On which platform?
> >
> > I am not aware of any released software that allows full_page_writes
> > to be safely disabled. Perhaps something has been released recently
> > that allows this? I think we have substantial documentation about
> > safety of other settings, so we should carefully document things here
> > also.
>
> Our WAL reliability docs claim that ZFS is safe against torn pages:
>
> https://www.postgresql.org/docs/current/wal-reliability.html:
>
> If you have file-system software that prevents partial page writes
> (e.g., ZFS), you can turn off this page imaging by turning off the
> full_page_writes parameter.

Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS right
now :-(. I have some patches to fix that on Linux[1] and FreeBSD and it
seems like there's a good chance of getting them committed based on
feedback, but it needs some more work on tests and mmap integration. If
anyone's interested in helping get that landed faster, please ping me
off-list.

[1] https://github.com/openzfs/zfs/pull/9807
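For anyone who hasn't looked at the mechanism: the hint in question boils
down to a posix_fadvise(POSIX_FADV_WILLNEED) call on the relation data file,
something like this standalone sketch (file name and block number invented
for illustration):

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* Hypothetical relation data file and block; both invented. */
    const char *path = "base/16384/16391";
    off_t       blcksz = 8192;
    off_t       offset = 42 * blcksz;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * Ask the kernel to start reading the block in the background.  Note
     * that a zero return only means the hint was accepted, not that any
     * read-ahead will actually happen; on file systems that ignore the
     * hint (like ZFS today, as discussed above), a later pread() of the
     * block still blocks on I/O.
     */
    if (posix_fadvise(fd, offset, blcksz, POSIX_FADV_WILLNEED) != 0)
        fprintf(stderr, "posix_fadvise failed\n");

    close(fd);
    return 0;
}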
Re: WIP: WAL prefetch (another approach)
> Other way around. FPWs make prefetch unnecessary.
> Therefore you would only want prefetch with FPW=off, AFAIK.

A few scenarios I can imagine page prefetch helping are:

1/ A DR replica instance that is a smaller instance size than the primary.
Page prefetch can bring the pages back into memory in advance when they are
evicted. This speeds up the replay and is cost effective.

2/ Allows larger checkpoint_timeout for the same recovery SLA and perhaps
improved performance?

3/ WAL prefetch (not pages by itself) can improve replay by itself (not sure
if it was measured in isolation, Tomas V can comment on it).

4/ Read replica running the analytical workload scenario Tomas V mentioned
earlier.

> Or put this another way: when is it safe and sensible to use
> recovery_prefetch != off?

When checkpoint_timeout is set large and under heavy write activity, on a
read replica that has a working set larger than memory and is receiving
constant updates from the primary. This covers 1 & 4 above.

> --
> Simon Riggs    http://www.EnterpriseDB.com/
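As a concrete illustration of the sort of standby configuration those
scenarios imply (placeholder values only, not a recommendation):

# postgresql.conf on the standby -- illustrative values only
recovery_prefetch = 'try'          # prefetch referenced blocks where posix_fadvise works
wal_decode_buffer_size = '512kB'   # how far ahead to look in the WAL
maintenance_io_concurrency = 10    # cap on concurrent prefetch hints
checkpoint_timeout = '30min'       # longer cycles mean more evicted-and-revisited pages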
Re: WIP: WAL prefetch (another approach)
On 4/12/22 17:46, Simon Riggs wrote: > On Tue, 12 Apr 2022 at 16:41, Tomas Vondra > wrote: >> >> On 4/12/22 15:58, Simon Riggs wrote: >>> On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote: >>> With that... I've finally pushed the 0002 patch and will be watching the build farm. >>> >>> This is a nice feature if it is safe to turn off full_page_writes. >>> >>> When is it safe to do that? On which platform? >>> >>> I am not aware of any released software that allows full_page_writes >>> to be safely disabled. Perhaps something has been released recently >>> that allows this? I think we have substantial documentation about >>> safety of other settings, so we should carefully document things here >>> also. >>> >> >> I don't see why/how would an async prefetch make FPW unnecessary. Did >> anyone claim that be the case? > > Other way around. FPWs make prefetch unnecessary. > Therefore you would only want prefetch with FPW=off, AFAIK. > > Or put this another way: when is it safe and sensible to use > recovery_prefetch != off? > That assumes the FPI stays in memory until the next modification, and that can be untrue for a number of reasons. Long checkpoint interval with enough random accesses in between is a nice example. See the benchmarks I did a year ago (regular pgbench). Or imagine a r/o replica used to run analytics queries, that access so much data it evicts the buffers initialized by the FPI records. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Simon Riggs writes: > On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote: > >> With that... I've finally pushed the 0002 patch and will be watching >> the build farm. > > This is a nice feature if it is safe to turn off full_page_writes. > > When is it safe to do that? On which platform? > > I am not aware of any released software that allows full_page_writes > to be safely disabled. Perhaps something has been released recently > that allows this? I think we have substantial documentation about > safety of other settings, so we should carefully document things here > also. Our WAL reliability docs claim that ZFS is safe against torn pages: https://www.postgresql.org/docs/current/wal-reliability.html: If you have file-system software that prevents partial page writes (e.g., ZFS), you can turn off this page imaging by turning off the full_page_writes parameter. - ilmari
Re: WIP: WAL prefetch (another approach)
On Tue, 12 Apr 2022 at 16:41, Tomas Vondra wrote:
>
> On 4/12/22 15:58, Simon Riggs wrote:
> > On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote:
> >
> >> With that... I've finally pushed the 0002 patch and will be watching
> >> the build farm.
> >
> > This is a nice feature if it is safe to turn off full_page_writes.
> >
> > When is it safe to do that? On which platform?
> >
> > I am not aware of any released software that allows full_page_writes
> > to be safely disabled. Perhaps something has been released recently
> > that allows this? I think we have substantial documentation about
> > safety of other settings, so we should carefully document things here
> > also.
> >
>
> I don't see why/how would an async prefetch make FPW unnecessary. Did
> anyone claim that be the case?

Other way around. FPWs make prefetch unnecessary.
Therefore you would only want prefetch with FPW=off, AFAIK.

Or put this another way: when is it safe and sensible to use
recovery_prefetch != off?

--
Simon Riggs    http://www.EnterpriseDB.com/
Re: WIP: WAL prefetch (another approach)
On 4/12/22 15:58, Simon Riggs wrote: > On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote: > >> With that... I've finally pushed the 0002 patch and will be watching >> the build farm. > > This is a nice feature if it is safe to turn off full_page_writes. > > When is it safe to do that? On which platform? > > I am not aware of any released software that allows full_page_writes > to be safely disabled. Perhaps something has been released recently > that allows this? I think we have substantial documentation about > safety of other settings, so we should carefully document things here > also. > I don't see why/how would an async prefetch make FPW unnecessary. Did anyone claim that be the case? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote:
> With that... I've finally pushed the 0002 patch and will be watching
> the build farm.

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

--
Simon Riggs    http://www.EnterpriseDB.com/
RE: WIP: WAL prefetch (another approach)
Hi,
Thank you for your reply. I missed the message, sorry.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Thomas Munro
Sent: Tuesday, April 12, 2022 6:28 PM
To: Shinoda, Noriyoshi (PN Japan FSIP)
Cc: Justin Pryzby ; Tomas Vondra ; Stephen Frost ; Andres Freund ; Jakub Wartak ; Alvaro Herrera ; Tomas Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; David Steele ; pgsql-hackers
Subject: Re: WIP: WAL prefetch (another approach)

On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> Thank you for developing the great feature. I tested this feature and checked
> the documentation. Currently, the documentation for the
> pg_stat_prefetch_recovery view is included in the description for the
> pg_stat_subscription view.
>
> https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

Hi! Thanks. I had just committed a fix before I saw your message, because
there was already another report here:

https://www.postgresql.org/message-id/flat/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA%40mail.gmail.com
Re: WIP: WAL prefetch (another approach)
On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP) wrote: > Thank you for developing the great feature. I tested this feature and checked > the documentation. Currently, the documentation for the > pg_stat_prefetch_recovery view is included in the description for the > pg_stat_subscription view. > > https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION Hi! Thanks. I had just committed a fix before I saw your message, because there was already another report here: https://www.postgresql.org/message-id/flat/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA%40mail.gmail.com
RE: WIP: WAL prefetch (another approach)
Hi, Thank you for developing the great feature. I tested this feature and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view. https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION It is also not displayed in the list of "28.2. The Statistics Collector". https://www.postgresql.org/docs/devel/monitoring.html The attached patch modifies the pg_stat_prefetch_recovery view to appear as a separate view. Regards, Noriyoshi Shinoda -Original Message- From: Thomas Munro Sent: Friday, April 8, 2022 10:47 AM To: Justin Pryzby Cc: Tomas Vondra ; Stephen Frost ; Andres Freund ; Jakub Wartak ; Alvaro Herrera ; Tomas Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; David Steele ; pgsql-hackers Subject: Re: WIP: WAL prefetch (another approach) On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby wrote: > The docs seem to be wrong about the default. > > +are not yet in the buffer pool, during recovery. Valid values are > +off (the default), on and > +try. The setting try > + enables Fixed. > + concurrency and distance, respectively. By default, it is set to > + try, which enabled the feature on systems where > + posix_fadvise is available. > > Should say "which enables". Fixed. > Curiously, I reported a similar issue last year. Sorry. I guess both times we only agreed on what the default should be in the final review round before commit, and I let the docs get out of sync (well, the default is mentioned in two places and I apparently ended my search too soon, changing only one). I also found another recently obsoleted sentence: the one about showing nulls sometimes was no longer true. Removed. pg_stat_recovery_prefetch_doc_v1.diff Description: pg_stat_recovery_prefetch_doc_v1.diff
Re: WIP: WAL prefetch (another approach)
On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby wrote: > The docs seem to be wrong about the default. > > +are not yet in the buffer pool, during recovery. Valid values are > +off (the default), on and > +try. The setting try enables Fixed. > + concurrency and distance, respectively. By default, it is set to > + try, which enabled the feature on systems where > + posix_fadvise is available. > > Should say "which enables". Fixed. > Curiously, I reported a similar issue last year. Sorry. I guess both times we only agreed on what the default should be in the final review round before commit, and I let the docs get out of sync (well, the default is mentioned in two places and I apparently ended my search too soon, changing only one). I also found another recently obsoleted sentence: the one about showing nulls sometimes was no longer true. Removed.
Re: WIP: WAL prefetch (another approach)
The docs seem to be wrong about the default. +are not yet in the buffer pool, during recovery. Valid values are +off (the default), on and +try. The setting try enables + concurrency and distance, respectively. By default, it is set to + try, which enabled the feature on systems where + posix_fadvise is available. Should say "which enables". + { + {"recovery_prefetch", PGC_SIGHUP, WAL_RECOVERY, + gettext_noop("Prefetch referenced blocks during recovery"), + gettext_noop("Look ahead in the WAL to find references to uncached data.") + }, + &recovery_prefetch, + RECOVERY_PREFETCH_TRY, recovery_prefetch_options, + check_recovery_prefetch, assign_recovery_prefetch, NULL + }, Curiously, I reported a similar issue last year. On Thu, Apr 08, 2021 at 10:37:04PM -0500, Justin Pryzby wrote: > --- a/doc/src/sgml/wal.sgml > +++ b/doc/src/sgml/wal.sgml > @@ -816,9 +816,7 @@ > prefetching mechanism is most likely to be effective on systems > with full_page_writes set to > off (where that is safe), and where the working > - set is larger than RAM. By default, prefetching in recovery is enabled > - on operating systems that have posix_fadvise > - support. > + set is larger than RAM. By default, prefetching in recovery is disabled. > >
Re: WIP: WAL prefetch (another approach)
On Mon, Apr 4, 2022 at 3:12 PM Julien Rouhaud wrote:
> [review]

Thanks! I took almost all of your suggestions about renaming things,
comments, docs and moving a magic number into a macro. Minor changes:

1. Rebased over the shmem stats changes and others that have just landed
today (woo!). The way my simple SharedStats object works and is reset looks
a little primitive next to the shiny new stats infrastructure, but I can
always adjust that in a follow-up patch if required.

2. It was a bit annoying that the pg_stat_recovery_prefetch view would
sometimes show stale numbers when waiting for WAL to be streamed, since that
happens at arbitrary points X bytes apart in the WAL. Now it also happens
before sleeping/waiting and when recovery ends.

3. Last year, commit a55a9847 synchronised config.sgml with guc.c's
categories. A couple of hunks in there that modified the previous version of
this work before it all got reverted. So I've re-added the WAL_RECOVERY GUC
category, to match the new section in config.sgml.

About test coverage, the most interesting lines of xlogprefetcher.c that
stand out as unreached in a gcov report are in the special handling for the
new CREATE DATABASE in file-copy mode -- but that's probably something to
raise in the thread that introduced that new functionality without a test.
I've tested that code locally; if you define XLOGPREFETCHER_DEBUG_LEVEL
you'll see that it won't touch anything in the new database until recovery
has replayed the file-copy.

As for current CI-vs-buildfarm blind spots that recently bit me and others,
I also tested -m32 and -fsanitize=undefined,unaligned builds. I reran one of
the quick pgbench/crash/drop-caches/recover tests I had lying around and saw
a 17s -> 6s speedup with FPW off (you need much longer tests to see speedup
with them on, so this is a good way for quick sanity checks -- see Tomas V's
results for long runs with FPWs and curved effects).

With that... I've finally pushed the 0002 patch and will be watching the
build farm.
Re: WIP: WAL prefetch (another approach)
On Thu, Mar 31, 2022 at 10:49:32PM +1300, Thomas Munro wrote: > On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud wrote: > > So I finally finished looking at this patch. Here again, AFAICS the > > feature is > > working as expected and I didn't find any problem. I just have some minor > > comments, like for the previous patch. > > Thanks very much for the review. I've attached a new version > addressing most of your feedback, and also rebasing over the new > WAL-logged CREATE DATABASE. I've also fixed a couple of bugs (see > end). > > > For the docs: > > > > +Whether to try to prefetch blocks that are referenced in the WAL > > that > > +are not yet in the buffer pool, during recovery. Valid values are > > +off (the default), on and > > +try. The setting try enables > > +prefetching only if the operating system provides the > > +posix_fadvise function, which is currently > > used > > +to implement prefetching. Note that some operating systems > > provide the > > +function, but don't actually perform any prefetching. > > > > Is there any reason not to change it to try? I'm wondering if some system > > says > > that the function exists but simply raise an error if you actually try to > > use > > it. I think that at least WSL does that for some functions. > > Yeah, we could just default it to try. Whether we should ship that > way is another question, but done for now. Should there be an associated pg15 open item for that, when the patch will be committed? Note that in wal.sgml, the patch still says: + [...] By default, prefetching in + recovery is disabled. I guess this should be changed even if we eventually choose to disable it by default? > I don't think there are any supported systems that have a > posix_fadvise() that fails with -1, or we'd know about it, because > we already use it in other places. We do support one OS that provides > a dummy function in libc that does nothing at all (Solaris/illumos), > and at least a couple that enter the kernel but are known to do > nothing at all for WILLNEED (AIX, FreeBSD). Ah, I didn't know that, thanks for the info! > > bool > > XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, > >RelFileNode *rnode, ForkNumber *forknum, BlockNumber > > *blknum) > > +{ > > + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, > > NULL); > > +} > > + > > +bool > > +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, > > + RelFileNode *rnode, ForkNumber *forknum, > > + BlockNumber *blknum, > > + Buffer *prefetch_buffer) > > { > > > > It's missing comments on that function. XLogRecGetBlockTag comments should > > probably be reworded at the same time. > > New comment added for XLogRecGetBlockInfo(). Wish I could come up > with a better name for that... Not quite sure what you thought I should > change about XLogRecGetBlockTag(). Since XLogRecGetBlockTag is now a wrapper for XLogRecGetBlockInfo, I thought it would be better to document only the specific behavior for this one (so no prefetch_buffer), rather than duplicating the whole description in both places. It seems like a good recipe to miss one of the comments the next time something is changed there. For the name, why not the usual XLogRecGetBlockTagExtended()? > > @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool > > randAccess, > > */ > > if (lastSourceFailed) > > { > > + /* > > +* Don't allow any retry loops to occur during nonblocking > > +* readahead. Let the caller process everything that has been > > +* decoded already first. 
> > +*/ > > + if (nonblocking) > > + return XLREAD_WOULDBLOCK; > > > > Is that really enough? I'm wondering if the code path in ReadRecord() that > > forces lastSourceFailed to False while it actually failed when switching > > into > > archive recovery (xlogrecovery.c around line 3044) can be problematic here. > > I don't see the problem scenario, could you elaborate? Sorry, I missed that in standby mode ReadRecord would keep going until a record is found, so no problem indeed. > > + /* Do we have a clue where the buffer might be already? */ > > + if (BufferIsValid(recent_buffer) && > > + mode == RBM_NORMAL && > > + ReadRecentBuffer(rnode, forknum, blkno, recent_buffer)) > > + { > > + buffer = recent_buffer; > > + goto recent_buffer_fast_path; > > + } > > > > Should this increment (local|shared)_blks_hit, since ReadRecentBuffer > > doesn't? > > Hmm. I guess ReadRecentBuffer() should really do that. Done. Ah, I also thought it be be better there but was assuming that there was some possible usage where it's not wanted. Good then! Should ReadRecentBuffer comment be updated to mention that pgBufferUsage is incremented as
Re: WIP: WAL prefetch (another approach)
On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud wrote: > So I finally finished looking at this patch. Here again, AFAICS the feature > is > working as expected and I didn't find any problem. I just have some minor > comments, like for the previous patch. Thanks very much for the review. I've attached a new version addressing most of your feedback, and also rebasing over the new WAL-logged CREATE DATABASE. I've also fixed a couple of bugs (see end). > For the docs: > > +Whether to try to prefetch blocks that are referenced in the WAL that > +are not yet in the buffer pool, during recovery. Valid values are > +off (the default), on and > +try. The setting try enables > +prefetching only if the operating system provides the > +posix_fadvise function, which is currently used > +to implement prefetching. Note that some operating systems provide > the > +function, but don't actually perform any prefetching. > > Is there any reason not to change it to try? I'm wondering if some system > says > that the function exists but simply raise an error if you actually try to use > it. I think that at least WSL does that for some functions. Yeah, we could just default it to try. Whether we should ship that way is another question, but done for now. I don't think there are any supported systems that have a posix_fadvise() that fails with -1, or we'd know about it, because we already use it in other places. We do support one OS that provides a dummy function in libc that does nothing at all (Solaris/illumos), and at least a couple that enter the kernel but are known to do nothing at all for WILLNEED (AIX, FreeBSD). > + > + The parameter can > + be used to improve I/O performance during recovery by instructing > + PostgreSQL to initiate reads > + of disk blocks that will soon be needed but are not currently in > + PostgreSQL's buffer pool. > + The and > +settings limit prefetching > + concurrency and distance, respectively. > + By default, prefetching in recovery is disabled. > + > > I think that "improving I/O performance" is a bit misleading, maybe reduce I/O > wait time or something like that? Also, I don't know if we need to be that > precise, but maybe we should say that it's the underlying kernel that will > (asynchronously) initiate the reads, and postgres will simply notifies it. Updated with this new text: The parameter can be used to reduce I/O wait times during recovery by instructing the kernel to initiate reads of disk blocks that will soon be needed but are not currently in PostgreSQL's buffer pool and will soon be read. > + > + The pg_stat_prefetch_recovery view will contain > only > + one row. It is filled with nulls if recovery is not running or WAL > + prefetching is not enabled. See > + for more information. > + > > That's not the implemented behavior as far as I can see. It just prints > whatever is in SharedStats > regardless of the recovery state or the prefetch_wal setting (assuming that > there's no pending reset request). Yeah. Updated text: "It is filled with nulls if recovery has not run or ...". > Similarly, there's a mention that > pg_stat_reset_shared('wal') will reset the stats, but I don't see anything > calling XLogPrefetchRequestResetStats(). It's 'prefetch_recovery', not 'wal', but yeah, oops, it looks like I got carried away between v18 and v19 while simplifying the stats and lost a hunk I should have kept. Fixed. 
> Finally, I think we should documented what are the cumulated counters in that > view (that should get reset) and the dynamic counters (that shouldn't get > reset). OK, done. > For the code: > > bool > XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, >RelFileNode *rnode, ForkNumber *forknum, BlockNumber > *blknum) > +{ > + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, > NULL); > +} > + > +bool > +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, > + RelFileNode *rnode, ForkNumber *forknum, > + BlockNumber *blknum, > + Buffer *prefetch_buffer) > { > > It's missing comments on that function. XLogRecGetBlockTag comments should > probably be reworded at the same time. New comment added for XLogRecGetBlockInfo(). Wish I could come up with a better name for that... Not quite sure what you thought I should change about XLogRecGetBlockTag(). > +ReadRecord(XLogPrefetcher *xlogprefetcher, int emode, >bool fetching_ckpt, TimeLineID replayTLI) > { > XLogRecord *record; > + XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher); > > nit: maybe name it XLogPrefetcherGetReader()? OK. > * containing it (if not open already), and returns true. When end of standby > * mode is triggered by the user, and there is no more WAL available, returns > * false. > + * > + * If nonblocking is
Re: WIP: WAL prefetch (another approach)
Hi, On Sun, Mar 20, 2022 at 05:36:38PM +1300, Thomas Munro wrote: > On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro wrote: > > I'll push 0001 today to let the build farm chew on it for a few days > > before moving to 0002. > > Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to > fail occasionally, but that predates the above commit. I didn't > follow the existing discussion on that, so I'll try to look into that > tomorrow. > > Here's a rebase of the 0002 patch, now called 0001 So I finally finished looking at this patch. Here again, AFAICS the feature is working as expected and I didn't find any problem. I just have some minor comments, like for the previous patch. For the docs: +Whether to try to prefetch blocks that are referenced in the WAL that +are not yet in the buffer pool, during recovery. Valid values are +off (the default), on and +try. The setting try enables +prefetching only if the operating system provides the +posix_fadvise function, which is currently used +to implement prefetching. Note that some operating systems provide the +function, but don't actually perform any prefetching. Is there any reason not to change it to try? I'm wondering if some system says that the function exists but simply raise an error if you actually try to use it. I think that at least WSL does that for some functions. + + The parameter can + be used to improve I/O performance during recovery by instructing + PostgreSQL to initiate reads + of disk blocks that will soon be needed but are not currently in + PostgreSQL's buffer pool. + The and +settings limit prefetching + concurrency and distance, respectively. + By default, prefetching in recovery is disabled. + I think that "improving I/O performance" is a bit misleading, maybe reduce I/O wait time or something like that? Also, I don't know if we need to be that precise, but maybe we should say that it's the underlying kernel that will (asynchronously) initiate the reads, and postgres will simply notifies it. + + The pg_stat_prefetch_recovery view will contain only + one row. It is filled with nulls if recovery is not running or WAL + prefetching is not enabled. See + for more information. + That's not the implemented behavior as far as I can see. It just prints whatever is in SharedStats regardless of the recovery state or the prefetch_wal setting (assuming that there's no pending reset request). Similarly, there's a mention that pg_stat_reset_shared('wal') will reset the stats, but I don't see anything calling XLogPrefetchRequestResetStats(). Finally, I think we should documented what are the cumulated counters in that view (that should get reset) and the dynamic counters (that shouldn't get reset). For the code: bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum) +{ + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL); +} + +bool +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, + RelFileNode *rnode, ForkNumber *forknum, + BlockNumber *blknum, + Buffer *prefetch_buffer) { It's missing comments on that function. XLogRecGetBlockTag comments should probably be reworded at the same time. +ReadRecord(XLogPrefetcher *xlogprefetcher, int emode, bool fetching_ckpt, TimeLineID replayTLI) { XLogRecord *record; + XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher); nit: maybe name it XLogPrefetcherGetReader()? * containing it (if not open already), and returns true. 
When end of standby * mode is triggered by the user, and there is no more WAL available, returns * false. + * + * If nonblocking is true, then give up immediately if we can't satisfy the + * request, returning XLREAD_WOULDBLOCK instead of waiting. */ -static bool +static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, The comment still mentions a couple of time returning true/false rather than XLREAD_*, same for at least XLogPageRead(). @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, */ if (lastSourceFailed) { + /* +* Don't allow any retry loops to occur during nonblocking +* readahead. Let the caller process everything that has been +* decoded already first. +*/ + if (nonblocking) + return XLREAD_WOULDBLOCK; Is that really enough? I'm wondering if the code path in ReadRecord() that forces lastSourceFailed to False while it actually failed when switching into archive recovery (xlogrecovery.c around line 3044) can be problematic here. {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY, gettext_noo
Re: WIP: WAL prefetch (another approach)
On Sun, Mar 20, 2022 at 5:36 PM Thomas Munro wrote: > Clearly 018_wal_optimize.pl is flapping Correction, 019_replslot_limit.pl, discussed at https://www.postgresql.org/message-id/flat/83b46e5f-2a52-86aa-fa6c-8174908174b8%40iki.fi .
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro wrote: > I'll push 0001 today to let the build farm chew on it for a few days > before moving to 0002. Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to fail occasionally, but that predates the above commit. I didn't follow the existing discussion on that, so I'll try to look into that tomorrow. Here's a rebase of the 0002 patch, now called 0001 From 3ac04122e635b98c50d6e48677fe74535d631388 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 20 Mar 2022 16:56:12 +1300 Subject: [PATCH v24] Prefetch referenced data in recovery, take II. Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Since not all OSes have that system call, "try" is provided so that it can be enabled on operating systems where it is available, and that is used in 027_stream_regress.pl so that we effectively exercise on and off behaviors in the build farm. Better mechanisms will follow in later work on the I/O subsystem. The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed. The GUC wal_decode_buffer_size limits the maximum distance we are prepared to read ahead in the WAL to find uncached blocks. Reviewed-by: Tomas Vondra Reviewed-by: Alvaro Herrera (earlier version) Reviewed-by: Andres Freund (earlier version) Reviewed-by: Justin Pryzby (earlier version) Tested-by: Tomas Vondra (earlier version) Tested-by: Jakub Wartak (earlier version) Tested-by: Dmitry Dolgov <9erthali...@gmail.com> (earlier version) Tested-by: Sait Talha Nisanci (earlier version) Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com --- doc/src/sgml/config.sgml | 64 ++ doc/src/sgml/monitoring.sgml | 77 +- doc/src/sgml/wal.sgml | 12 + src/backend/access/transam/Makefile | 1 + src/backend/access/transam/xlog.c | 2 + src/backend/access/transam/xlogprefetcher.c | 968 ++ src/backend/access/transam/xlogreader.c | 13 + src/backend/access/transam/xlogrecovery.c | 160 ++- src/backend/access/transam/xlogutils.c| 27 +- src/backend/catalog/system_views.sql | 13 + src/backend/storage/freespace/freespace.c | 3 +- src/backend/storage/ipc/ipci.c| 3 + src/backend/utils/misc/guc.c | 53 +- src/backend/utils/misc/postgresql.conf.sample | 5 + src/include/access/xlog.h | 1 + src/include/access/xlogprefetcher.h | 51 + src/include/access/xlogreader.h | 8 + src/include/access/xlogutils.h| 3 +- src/include/catalog/pg_proc.dat | 8 + src/include/utils/guc.h | 4 + src/test/recovery/t/027_stream_regress.pl | 3 + src/test/regress/expected/rules.out | 10 + src/tools/pgindent/typedefs.list | 7 + 23 files changed, 1434 insertions(+), 62 deletions(-) create mode 100644 src/backend/access/transam/xlogprefetcher.c create mode 100644 src/include/access/xlogprefetcher.h diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7a48973b3c..ce84f379a8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3644,6 +3644,70 @@ include_dir 'conf.d' + + +Recovery + + + configuration + of recovery + general settings + + + + This section describes the settings that apply to recovery in general, + affecting crash recovery, streaming replication and archive-based + replication. 
+ + + + + + recovery_prefetch (enum) + + recovery_prefetch configuration parameter + + + + +Whether to try to prefetch blocks that are referenced in the WAL that +are not yet in the buffer pool, during recovery. Valid values are +off (the default), on and +try. The setting try enables +prefetching only if the operating system provides the +posix_fadvise function, which is currently used +to implement prefetching. Note that some operating systems provide the +function, but don't actually perform any prefetching. + + +Prefetching blocks that will soon be needed can reduce I/O wait times +during recovery with some workloads. +See also the and + settings, which limit +prefetching activity. + + + + + + wal_decode_buffer_size (integer) + + wal_decode_buffer_size configuration parameter + + +
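To make the "pessimistic heuristics" mentioned in the commit message a bit
more concrete, here is a toy, self-contained sketch of the limiting idea
(invented names and simplified bookkeeping -- this is not xlogprefetcher.c):
a hint counts as "in flight" from the moment it is issued until replay has
passed the record that triggered it, and no new hint is issued once the
concurrency cap is reached or the look-ahead distance is exceeded.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

#define MAX_IN_FLIGHT 10              /* cf. maintenance_io_concurrency */
#define MAX_DISTANCE (512 * 1024)     /* cf. wal_decode_buffer_size */
#define TOY_BLCKSZ 8192

typedef struct PrefetchSlot
{
    uint64_t    lsn;                  /* WAL position of the record that wanted the block */
    bool        in_use;
} PrefetchSlot;

static PrefetchSlot slots[MAX_IN_FLIGHT];

/* Pessimistically treat a hint as completed once replay has passed its record. */
static void
reclaim_slots(uint64_t replayed_up_to)
{
    for (int i = 0; i < MAX_IN_FLIGHT; i++)
    {
        if (slots[i].in_use && slots[i].lsn <= replayed_up_to)
            slots[i].in_use = false;
    }
}

/* Issue a hint for one referenced block, if both limits allow it. */
static bool
maybe_prefetch(uint64_t record_lsn, uint64_t replayed_up_to,
               int fd, off_t block_offset)
{
    if (record_lsn - replayed_up_to > MAX_DISTANCE)
        return false;                 /* don't look further ahead than this */

    reclaim_slots(replayed_up_to);

    for (int i = 0; i < MAX_IN_FLIGHT; i++)
    {
        if (!slots[i].in_use)
        {
            slots[i].in_use = true;
            slots[i].lsn = record_lsn;
            (void) posix_fadvise(fd, block_offset, TOY_BLCKSZ, POSIX_FADV_WILLNEED);
            return true;
        }
    }

    return false;                     /* too many hints already in flight */
}

The real code is considerably more involved (it also has to decide which
referenced blocks are worth hinting at all), but the two limits above are
where maintenance_io_concurrency and wal_decode_buffer_size come in.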
Re: WIP: WAL prefetch (another approach)
On Mon, Mar 14, 2022 at 8:17 PM Julien Rouhaud wrote: > Great! I'm happy with 0001 and I think it's good to go! I'll push 0001 today to let the build farm chew on it for a few days before moving to 0002.
Re: WIP: WAL prefetch (another approach)
On Mon, Mar 14, 2022 at 06:15:59PM +1300, Thomas Munro wrote: > On Fri, Mar 11, 2022 at 9:27 PM Julien Rouhaud wrote: > > > > Also, is it worth an assert (likely at the top of the function) for > > > > that? > > > > > > How could I assert that EndRecPtr has the right value? > > > > Sorry, I meant to assert that some value was assigned > > (!XLogRecPtrIsInvalid). > > It can only make sure that the first call is done after XLogBeginRead / > > XLogFindNextRecord, but that's better than nothing and consistent with the > > top > > comment. > > Done. Just a small detail: I would move that assert at the top of the function as it should always be valid. > > I also fixed the compile failure with -DWAL_DEBUG, and checked that > output looks sane with wal_debug=on. Great! I'm happy with 0001 and I think it's good to go! > > > > The other thing I need to change is that I should turn on > > > recovery_prefetch for platforms that support it (ie Linux and maybe > > > NetBSD only for now), in the tests. Right now you need to put > > > recovery_prefetch=on in a file and then run the tests with > > > "TEMP_CONFIG=path_to_that make -C src/test/recovery check" to > > > excercise much of 0002. > > > > +1 with Andres' idea to have a "try" setting. > > Done. The default is still "off" for now, but in > 027_stream_regress.pl I set it to "try". Great too! Unless you want to commit both patches right now I'd like to review 0002 too (this week), as I barely look into it for now.
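(Spelled out, the assertion being discussed is of this shape -- an
illustrative form only, not the exact committed hunk:)

/* Some position must have been established by XLogBeginRead() or
 * XLogFindNextRecord() before the first call. */
Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));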
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 11, 2022 at 06:31:13PM +1300, Thomas Munro wrote: > On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud wrote: > > > > This could use XLogRecGetBlock? Note that this macro is for now never used. > > xlogreader.c also has some similar forgotten code that could use > > XLogRecMaxBlockId. > > That is true, but I was thinking of it like this: most of the existing > code that interacts with xlogreader.c is working with the old model, > where the XLogReader object holds only one "current" record. For that > reason the XLogRecXXX() macros continue to work as before, implicitly > referring to the record that XLogReadRecord() most recently returned. > For xlogreader.c code, I prefer not to use the XLogRecXXX() macros, > even when referring to the "current" record, since xlogreader.c has > switched to a new multi-record model. In other words, they're sort of > 'old API' accessors provided for continuity. Does this make sense? Ah I see, it does make sense. I'm wondering if there should be some comment somewhere on the top of the file to mention it, as otherwise someone may be tempted to change it to avoid some record->record->xxx usage. > > +DecodedXLogRecord * > > +XLogNextRecord(XLogReaderState *state, char **errormsg) > > +{ > > [...] > > + /* > > +* state->EndRecPtr is expected to have been set by the last call to > > +* XLogBeginRead() or XLogNextRecord(), and is the location of the > > +* error. > > +*/ > > + > > + return NULL; > > > > The comment should refer to XLogFindNextRecord, not XLogNextRecord? > > No, it does mean to refer to the XLogNextRecord() (ie the last time > you called XLogNextRecord and successfully dequeued a record, we put > its end LSN there, so if there is a deferred error, that's the > corresponding LSN). Make sense? It does, thanks! > > > Also, is it worth an assert (likely at the top of the function) for that? > > How could I assert that EndRecPtr has the right value? Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid). It can only make sure that the first call is done after XLogBeginRead / XLogFindNextRecord, but that's better than nothing and consistent with the top comment. > > + if (unlikely(state->decode_buffer == NULL)) > > + { > > + if (state->decode_buffer_size == 0) > > + state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE; > > + state->decode_buffer = palloc(state->decode_buffer_size); > > + state->decode_buffer_head = state->decode_buffer; > > + state->decode_buffer_tail = state->decode_buffer; > > + state->free_decode_buffer = true; > > + } > > > > Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it > > here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as > > the only caller is the recovery prefetching. > > I don't think it matters much? The thing is that for now the only caller to XLogReaderSetDecodeBuffer (in 0002) only uses it to set the length, so a buffer is actually never passed to that function. Since frontend code can rely on a palloc emulation, is there really a use case to use e.g. some stack buffer there, or something in a specific memory context? It seems to be the only use cases for having XLogReaderSetDecodeBuffer() rather than simply a XLogReaderSetDecodeBufferSize(). But overall I agree it doesn't matter much, so no objection to keep it as-is. > > It's also not particulary obvious why XLogFindNextRecord() doesn't check for > > this value. AFAICS callers don't (and should never) call it with a > > nonblocking == true state, maybe add an assert for that? 
> > Fair point. I have now explicitly cleared that flag. (I don't much > like state->nonblocking, which might be better as an argument to > page_read(), but in fact I don't like the fact that page_read > callbacks are blocking in the first place, which is why I liked > Horiguchi-san's patch to get rid of that... but that can be a subject > for later work.) Agreed. > > static void > > ResetDecoder(XLogReaderState *state) > > { > > [...] > > + /* Reset the decoded record queue, freeing any oversized records. */ > > + while ((r = state->decode_queue_tail)) > > > > nit: I think it's better to explicitly check for the assignment being != > > NULL, > > and existing code is more frequently written this way AFAICS. > > I think it's perfectly normal idiomatic C, but if you think it's > clearer that way, OK, done like that. The thing I don't like about this form is that you can never be sure that an assignment was really meant unless you read the rest of the nearby code. Other than that agreed, if perfectly normal idiomatic C. > I realised that this version has broken -DWAL_DEBUG. I'll fix that > shortly, but I wanted to post this update ASAP, so here's a new > version. + * Returns XLREAD_WOULDBLOCK if he requested data can't be read without + * waiting. This can be returned only if the installed page_read cal
Re: WIP: WAL prefetch (another approach)
On March 10, 2022 9:31:13 PM PST, Thomas Munro wrote: > The other thing I need to change is that I should turn on >recovery_prefetch for platforms that support it (ie Linux and maybe >NetBSD only for now), in the tests. Could a setting of "try" make sense? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
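(In other words, something along these lines -- a sketch of the idea only,
not the committed GUC hooks:)

/* Sketch of a three-valued recovery_prefetch with a "try" fallback. */
typedef enum RecoveryPrefetchSetting
{
    RECOVERY_PREFETCH_OFF,
    RECOVERY_PREFETCH_ON,
    RECOVERY_PREFETCH_TRY
} RecoveryPrefetchSetting;

static bool
prefetch_effectively_enabled(RecoveryPrefetchSetting setting)
{
#ifdef USE_PREFETCH
    /* posix_fadvise() is usable here: both "on" and "try" prefetch */
    return setting != RECOVERY_PREFETCH_OFF;
#else
    /* no way to issue hints on this platform; "try" quietly degrades to off */
    (void) setting;
    return false;
#endif
}

That would let the tests set "try" unconditionally and exercise whichever
behaviour the platform supports.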
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 11, 2022 at 6:31 PM Thomas Munro wrote: > Thanks for your review of 0001! It gave me a few things to think > about and some good improvements. And just in case it's useful, here's what changed between v21 and v22.. diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 86a7b4c5c8..0d0c556b7c 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -90,8 +90,8 @@ XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size) state->decode_buffer = buffer; state->decode_buffer_size = size; - state->decode_buffer_head = buffer; state->decode_buffer_tail = buffer; + state->decode_buffer_head = buffer; } /* @@ -271,7 +271,7 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr) /* * See if we can release the last record that was returned by - * XLogNextRecord(), to free up space. + * XLogNextRecord(), if any, to free up space. */ void XLogReleasePreviousRecord(XLogReaderState *state) @@ -283,16 +283,16 @@ XLogReleasePreviousRecord(XLogReaderState *state) /* * Remove it from the decoded record queue. It must be the oldest item -* decoded, decode_queue_tail. +* decoded, decode_queue_head. */ record = state->record; - Assert(record == state->decode_queue_tail); + Assert(record == state->decode_queue_head); state->record = NULL; - state->decode_queue_tail = record->next; + state->decode_queue_head = record->next; - /* It might also be the newest item decoded, decode_queue_head. */ - if (state->decode_queue_head == record) - state->decode_queue_head = NULL; + /* It might also be the newest item decoded, decode_queue_tail. */ + if (state->decode_queue_tail == record) + state->decode_queue_tail = NULL; /* Release the space. */ if (unlikely(record->oversized)) @@ -302,11 +302,11 @@ XLogReleasePreviousRecord(XLogReaderState *state) } else { - /* It must be the tail record in the decode buffer. */ - Assert(state->decode_buffer_tail == (char *) record); + /* It must be the head (oldest) record in the decode buffer. */ + Assert(state->decode_buffer_head == (char *) record); /* -* We need to update tail to point to the next record that is in the +* We need to update head to point to the next record that is in the * decode buffer, if any, being careful to skip oversized ones * (they're not in the decode buffer). */ @@ -316,8 +316,8 @@ XLogReleasePreviousRecord(XLogReaderState *state) if (record) { - /* Adjust tail to release space up to the next record. */ - state->decode_buffer_tail = (char *) record; + /* Adjust head to release space up to the next record. */ + state->decode_buffer_head = (char *) record; } else { @@ -327,8 +327,8 @@ XLogReleasePreviousRecord(XLogReaderState *state) * we'll keep overwriting the same piece of memory if we're not * doing any prefetching. */ - state->decode_buffer_tail = state->decode_buffer; state->decode_buffer_head = state->decode_buffer; + state->decode_buffer_tail = state->decode_buffer; } } } @@ -351,7 +351,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg) /* Release the last record returned by XLogNextRecord(). */ XLogReleasePreviousRecord(state); - if (state->decode_queue_tail == NULL) + if (state->decode_queue_head == NULL) { *errormsg = NULL; if (state->errormsg_deferred) @@ -376,7 +376,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg) * XLogRecXXX(xlogreader) macros, which work with the decoder rather than * the record for historical reasons. 
*/ - state->record = state->decode_queue_tail; + state->record = state->decode_queue_head; /* * Update the pointers to the beginning and one-past-the-end of this @@ -428,12 +428,12 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) if (!XLogReaderHasQueuedRecordOrError(state)) XLogReadAhead(state, false /* nonblocking */ ); - /* Consume the tail record or error. */ + /* Consume the head record or error. */ decoded = XLogNextRecord(state, errormsg); if (decoded) { /* -* XLogReadRecord() returns a pointer to the record's header, not the +
Re: WIP: WAL prefetch (another approach)
Hi, On Tue, Mar 08, 2022 at 06:15:43PM +1300, Thomas Munro wrote: > On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro wrote: > > https://github.com/macdice/postgres/tree/recovery-prefetch-ii > > Here's a rebase. This mostly involved moving hunks over to the new > xlogrecovery.c file. One thing that seemed a little strange to me > with the new layout is that xlogreader is now a global variable. I > followed that pattern and made xlogprefetcher a global variable too, > for now. I for now went through 0001, TL;DR the patch looks good to me. I have a few minor comments though, mostly to make things a bit clearer (at least to me). diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 2340dc247b..c129df44ac 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len) * add an accessor macro for this. */ *fpi_len = 0; + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++) { if (XLogRecHasBlockImage(record, block_id)) - *fpi_len += record->blocks[block_id].bimg_len; + *fpi_len += record->record->blocks[block_id].bimg_len; } (and similar in that file, xlogutils.c and xlogreader.c) This could use XLogRecGetBlock? Note that this macro is for now never used. xlogreader.c also has some similar forgotten code that could use XLogRecMaxBlockId. + * See if we can release the last record that was returned by + * XLogNextRecord(), to free up space. + */ +void +XLogReleasePreviousRecord(XLogReaderState *state) The comment seems a bit misleading, as I first understood it as it could be optional even if the record exists. Maybe something more like "Release the last record if any"? +* Remove it from the decoded record queue. It must be the oldest item +* decoded, decode_queue_tail. +*/ + record = state->record; + Assert(record == state->decode_queue_tail); + state->record = NULL; + state->decode_queue_tail = record->next; The naming is a bit counter intuitive to me, as before reading the rest of the code I wasn't expecting the item at the tail of the queue to have a next element. Maybe just inverting tail and head would make it clearer? +DecodedXLogRecord * +XLogNextRecord(XLogReaderState *state, char **errormsg) +{ [...] + /* +* state->EndRecPtr is expected to have been set by the last call to +* XLogBeginRead() or XLogNextRecord(), and is the location of the +* error. +*/ + + return NULL; The comment should refer to XLogFindNextRecord, not XLogNextRecord? Also, is it worth an assert (likely at the top of the function) for that? XLogRecord * XLogReadRecord(XLogReaderState *state, char **errormsg) +{ [...] + if (decoded) + { + /* +* XLogReadRecord() returns a pointer to the record's header, not the +* actual decoded record. The caller will access the decoded record +* through the XLogRecGetXXX() macros, which reach the decoded +* recorded as xlogreader->record. +*/ + Assert(state->record == decoded); + return &decoded->header; I find it a bit weird to mention XLogReadRecord() as it's the current function. +/* + * Allocate space for a decoded record. The only member of the returned + * object that is initialized is the 'oversized' flag, indicating that the + * decoded record wouldn't fit in the decode buffer and must eventually be + * freed explicitly. + * + * Return NULL if there is no space in the decode buffer and allow_oversized + * is false, or if memory allocation fails for an oversized buffer. 
+ */ +static DecodedXLogRecord * +XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized) Is it worth clearly stating that it's the responsibility of the caller to update the decode_buffer_head (with the real size) after a successful decoding of this buffer? + if (unlikely(state->decode_buffer == NULL)) + { + if (state->decode_buffer_size == 0) + state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE; + state->decode_buffer = palloc(state->decode_buffer_size); + state->decode_buffer_head = state->decode_buffer; + state->decode_buffer_tail = state->decode_buffer; + state->free_decode_buffer = true; + } Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as the only caller is the recovery prefetching. + return decoded; +} I would find it a bit clearer to explicitly return NULL here. readOff = ReadPageInternal(state, targetPagePtr, Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ)); - if (readOff < 0) + if (readOff == XLREAD_WOULDBLOCK) + return XLREAD_WOULDBLOCK; + else if (readOff < 0) ReadPageInternal comment should be updated to mention
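To make the head/tail naming question above concrete, here is a tiny stand-alone FIFO sketch in the convention I would find intuitive (purely illustrative, invented names, not the actual DecodedXLogRecord queue): items are appended at the tail and consumed from the head, so the head is always the oldest queued item and "next" always points towards the tail.

    /*
     * Purely illustrative FIFO, not the real xlogreader structures: append at
     * the tail, consume from the head, so the head is always the oldest item.
     */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Item
    {
        int         value;
        struct Item *next;      /* next (newer) item, towards the tail */
    } Item;

    typedef struct Fifo
    {
        Item       *head;       /* oldest item, consumed first */
        Item       *tail;       /* newest item, appended last */
    } Fifo;

    static void
    fifo_append(Fifo *q, int value)
    {
        Item       *item = malloc(sizeof(Item));

        item->value = value;
        item->next = NULL;
        if (q->tail)
            q->tail->next = item;
        else
            q->head = item;
        q->tail = item;
    }

    static int
    fifo_consume(Fifo *q)
    {
        /* assumes the queue is not empty */
        Item       *item = q->head;
        int         value = item->value;

        q->head = item->next;
        if (q->head == NULL)
            q->tail = NULL;
        free(item);
        return value;
    }

    int
    main(void)
    {
        Fifo        q = {NULL, NULL};

        fifo_append(&q, 1);
        fifo_append(&q, 2);
        printf("%d\n", fifo_consume(&q));   /* prints 1, the head (oldest) item */
        return 0;
    }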
Re: WIP: WAL prefetch (another approach)
Hi, On 2022-03-08 18:15:43 +1300, Thomas Munro wrote: > I'm now starting to think about committing this soon. +1 Are you thinking of committing both patches at once, or with a bit of distance? I think something in the regression tests ought to enable recovery_prefetch. 027_stream_regress or 001_stream_rep seem like the obvious candidates? - Andres
Re: WIP: WAL prefetch (another approach)
On 3/8/22 06:15, Thomas Munro wrote: > On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro wrote: >> https://github.com/macdice/postgres/tree/recovery-prefetch-ii > > Here's a rebase. This mostly involved moving hunks over to the new > xlogrecovery.c file. One thing that seemed a little strange to me > with the new layout is that xlogreader is now a global variable. I > followed that pattern and made xlogprefetcher a global variable too, > for now. > > There is one functional change: now I block readahead at records that > might change the timeline ID. This removes the need to think about > scenarios where "replay TLI" and "read TLI" might differ. I don't > know of a concrete problem in that area with the previous version, but > the recent introduction of the variable(s) "replayTLI" and associated > comments in master made me realise I hadn't analysed the hazards here > enough. Since timelines are tricky things and timeline changes are > extremely infrequent, it seemed better to simplify matters by putting > up a big road block there. > > I'm now starting to think about committing this soon. +1. I don't have the capacity/hardware to do more testing at the moment, but all of this looks reasonable. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-12-29 17:29:52 +1300, Thomas Munro wrote: > > FWIW I don't think we include updates to typedefs.list in patches. > > Seems pretty harmless? And useful to keep around in development > branches because I like to pgindent stuff... I think it's even helpful. As long as it's done with a bit of manual oversight, I don't see a meaningful downside of doing so. One needs to be careful not to remove platform-dependent typedefs, but that's it. And especially for long-lived feature branches it's much less work to keep the typedefs.list changes in the tree, rather than coming up with them locally over and over / across multiple people working on a branch. Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: >> FWIW I don't think we include updates to typedefs.list in patches. > Seems pretty harmless? And useful to keep around in development > branches because I like to pgindent stuff... As far as that goes, my habit is to pull down https://buildfarm.postgresql.org/cgi-bin/typedefs.pl on a regular basis and pgindent against that. There have been some discussions about formalizing that process a bit more, but we've not come to any conclusions. regards, tom lane
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > But the bigger question is. Are we really concerned about this flaky > problem? Is it worth investing time and money on? I can get money to > go buy a G4 or G5 and spend some time on it. It just seems a bit... > niche. But if it's a real bug that represents something broken on > other architectures that just happens to be easier to trigger here it > might be worthwhile. TBH, I don't know. There seem to be three plausible explanations: 1. Flaky hardware in my unit. 2. Ancient macOS bug, as Andres suggested upthread. 3. Actual PG bug. If it's #1 or #2 then we're just wasting our time here. I'm not sure how to estimate the relative probabilities, but I suspect #3 is the least likely of the lot. FWIW, I did just reproduce the problem on that machine with current HEAD: 2021-12-17 18:40:40.293 EST [21369] FATAL: inconsistent page found, rel 1663/167772/2673, forknum 0, blkno 26 2021-12-17 18:40:40.293 EST [21369] CONTEXT: WAL redo at C/3DE3F658 for Btree/INSERT_LEAF: off 208; blkref #0: rel 1663/167772/2673, blk 26 FPW 2021-12-17 18:40:40.522 EST [21365] LOG: startup process (PID 21369) exited with exit code 1 That was after only five loops of the regression tests, so either I got lucky or the failure probability has increased again. In any case, it seems clear that the problem exists independently of Munro's patches, so I don't really think this question should be considered a blocker for those. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Fri, 17 Dec 2021 at 18:40, Tom Lane wrote: > > Greg Stark writes: > > Hm. I seem to have picked a bad checkout. I took the last one before > > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). > > FWIW, I think that's the first one *after* the revert. Doh But the bigger question is. Are we really concerned about this flaky problem? Is it worth investing time and money on? I can get money to go buy a G4 or G5 and spend some time on it. It just seems a bit... niche. But if it's a real bug that represents something broken on other architectures that just happens to be easier to trigger here it might be worthwhile. -- greg
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > Hm. I seem to have picked a bad checkout. I took the last one before > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). FWIW, I think that's the first one *after* the revert. > 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel > worker" (PID 54073) was terminated by signal 10: Bus error I'm betting on a weird emulation issue. None of my real PPC machines showed such things. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On 12/17/21 23:56, Greg Stark wrote: Hm. I seem to have picked a bad checkout. I took the last one before the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some incompatibility with the emulation and the IPC stuff parallel workers use. 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error 2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u; 2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes 2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode 2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing Interesting. In my experience SIGBUS on PPC tends to be due to incorrect alignment, but I'm not sure how that works with the emulation. Can you get a backtrace? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Hm. I seem to have picked a bad checkout. I took the last one before the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some incompatibility with the emulation and the IPC stuff parallel workers use. 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error 2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u; 2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes 2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode 2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1 > or maybe 4.0.1. What version is on your G4? $ gcc -v Using built-in specs. Target: powerpc-apple-darwin9 Configured with: /var/tmp/gcc/gcc-5493~1/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --program-prefix= --host=powerpc-apple-darwin9 --target=powerpc-apple-darwin9 Thread model: posix gcc version 4.0.1 (Apple Inc. build 5493) I see that gcc 4.2.1 is also present on this machine, but I've never used it. regards, tom lane
Re: WIP: WAL prefetch (another approach)
I have IBUILD:postgresql gsstark$ ls /usr/bin/*gcc* /usr/bin/gcc /usr/bin/gcc-4.0 /usr/bin/gcc-4.2 /usr/bin/i686-apple-darwin9-gcc-4.0.1 /usr/bin/i686-apple-darwin9-gcc-4.2.1 /usr/bin/powerpc-apple-darwin9-gcc-4.0.1 /usr/bin/powerpc-apple-darwin9-gcc-4.2.1 I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1 or maybe 4.0.1. What version is on your G4?
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > What tools and tool versions are you using to build? Is it just GCC for PPC? > There aren't any special build processes to make a fat binary involved? Nope, just "configure; make" using that macOS version's regular gcc. regards, tom lane
Re: WIP: WAL prefetch (another approach)
What tools and tool versions are you using to build? Is it just GCC for PPC? There aren't any special build processes to make a fat binary involved? On Thu, 16 Dec 2021 at 23:11, Tom Lane wrote: > > Greg Stark writes: > > But if you're interested and can explain the tests to run I can try to > > get the tests running on this machine: > > I'm not sure that machine is close enough to prove much, but by all > means give it a go if you wish. My test setup was explained in [1]: > > >> To recap, the test lashup is: > >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive) > >> * Standard debug build (--enable-debug --enable-cassert) > >> * Out-of-the-box configuration, except add wal_consistency_checking = all > >> and configure a wal-streaming standby on the same machine > >> * Repeatedly run "make installcheck-parallel", but skip the tablespace > >> test to avoid issues with the standby trying to use the same directory > >> * Delay long enough after each installcheck-parallel to let the > >> standby catch up (the run proper is ~24 min, plus 2 min for catchup) > > Remember also that the code in question is not in HEAD; you'd > need to apply Munro's patches, or check out some commit from > around 2021-04-22. > > regards, tom lane > > [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us -- greg
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > But if you're interested and can explain the tests to run I can try to > get the tests running on this machine: I'm not sure that machine is close enough to prove much, but by all means give it a go if you wish. My test setup was explained in [1]: >> To recap, the test lashup is: >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive) >> * Standard debug build (--enable-debug --enable-cassert) >> * Out-of-the-box configuration, except add wal_consistency_checking = all >> and configure a wal-streaming standby on the same machine >> * Repeatedly run "make installcheck-parallel", but skip the tablespace >> test to avoid issues with the standby trying to use the same directory >> * Delay long enough after each installcheck-parallel to let the >> standby catch up (the run proper is ~24 min, plus 2 min for catchup) Remember also that the code in question is not in HEAD; you'd need to apply Munro's patches, or check out some commit from around 2021-04-22. regards, tom lane [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us
Re: WIP: WAL prefetch (another approach)
The actual hardware of this machine is a Mac Mini Core 2 Duo. I'm not really clear how the emulation is done and whether it makes a reasonable test environment or not. Hardware Overview: Model Name: Mac mini Model Identifier: Macmini2,1 Processor Name: Intel Core 2 Duo Processor Speed: 2 GHz Number Of Processors: 1 Total Number Of Cores: 2 L2 Cache: 4 MB Memory: 2 GB Bus Speed: 667 MHz Boot ROM Version: MM21.009A.B00
Re: WIP: WAL prefetch (another approach)
On Fri, 26 Nov 2021 at 21:47, Tom Lane wrote: > > Yeah ... on the one hand, that machine has shown signs of > hard-to-reproduce flakiness, so it's easy to write off the failures > I saw as hardware issues. On the other hand, the flakiness I've > seen has otherwise manifested as kernel crashes, which is nothing > like the consistent test failures I was seeing with the patch. Hm. I asked around and found a machine I can use that can run PPC binaries, but it's actually, well, confusing. I think this is an x86 machine running Leopard which uses JIT to transparently run PPC binaries. I'm not sure this is really a good test. But if you're interested and can explain the tests to run I can try to get the tests running on this machine: IBUILD:~ gsstark$ uname -a Darwin IBUILD.MIT.EDU 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386 IBUILD:~ gsstark$ sw_vers ProductName: Mac OS X ProductVersion: 10.5.8 BuildVersion: 9L31a
Re: WIP: WAL prefetch (another approach)
On Fri, Nov 26, 2021 at 9:47 PM Tom Lane wrote: > Yeah ... on the one hand, that machine has shown signs of > hard-to-reproduce flakiness, so it's easy to write off the failures > I saw as hardware issues. On the other hand, the flakiness I've > seen has otherwise manifested as kernel crashes, which is nothing > like the consistent test failures I was seeing with the patch. > > Andres speculated that maybe we were seeing a kernel bug that > affects consistency of concurrent reads and writes. That could > be an explanation; but it's just evidence-free speculation so far, > so I don't feel real convinced by that idea either. > > Anyway, I hope to find time to see if the issue still reproduces > with Thomas' new patch set. Honestly, all the reasons that Thomas articulated for the revert seem relatively unimpressive from my point of view. Perhaps they are sufficient justification for a revert so near to the end of the development cycle, but that's just an argument for committing things a little sooner so we have time to work out the kinks. This kind of work is too valuable to get hung up for a year or three because of a couple of minor preexisting bugs and/or preexisting maybe-bugs. -- Robert Haas EDB: http://www.enterprisedb.com
Re: WIP: WAL prefetch (another approach)
Hi Thomas, I am unable to apply these new set of patches on HEAD. Can you please share the rebased patch or if you have any work branch can you please point it out, I will refer to it for the changes. -- With Regards, Ashutosh sharma. On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro wrote: > On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson > wrote: > > Could you post an updated version of the patch which is for review? > > Sorry for taking so long to come back; I learned some new things that > made me want to restructure this code a bit (see below). Here is an > updated pair of patches that I'm currently testing. > > Old problems: > > 1. Last time around, an infinite loop was reported in pg_waldump. I > believe Horiguchi-san has fixed that[1], but I'm no longer depending > on that patch. I thought his patch set was a good idea, but it's > complicated and there's enough going on here already... let's consider > that independently. > > This version goes back to what I had earlier, though (I hope) it is > better about how "nonblocking" states are communicated. In this > version, XLogPageRead() has a way to give up part way through a record > if it doesn't have enough data and there are queued up records that > could be replayed right now. In that case, we'll go back to the > beginning of the record (and occasionally, back a WAL page) next time > we try. That's the cost of not maintaining intra-record decoding > state. > > 2. Last time around, we could try to allocate a crazy amount of > memory when reading garbage past the end of the WAL. Fixed, by > validating first, like in master. > > New work: > > Since last time, I went away and worked on a "real" AIO version of > this feature. That's ongoing experimental work for a future proposal, > but I have a working prototype and I aim to share that soon, when that > branch is rebased to catch up with recent changes. In that version, > the prefetcher starts actual reads into the buffer pool, and recovery > receives already pinned buffers attached to the stream of records it's > replaying. > > That inspired a couple of refactoring changes to this non-AIO version, > to minimise the difference and anticipate the future work better: > > 1. The logic for deciding which block to start prefetching next is > moved into a new callback function in a sort of standard form (this is > approximately how all/most prefetching code looks in the AIO project, > ie sequential scans, bitmap heap scan, etc). > > 2. The logic for controlling how many IOs are running and deciding > when to call the above is in a separate component. In this non-AIO > version, it works using a simple ring buffer of LSNs to estimate the > number of in flight I/Os, just like before. This part would be thrown > away and replaced with the AIO branch's centralised "streaming read" > mechanism which tracks I/O completions based on a stream of completion > events from the kernel (or I/O worker processes). > > 3. In this version, the prefetcher still doesn't pin buffers, for > simplicity. That work did force me to study places where WAL streams > need prefetching "barriers", though, so in this patch you can > see that it's now a little more careful than it probably needs to be. > (It doesn't really matter much if you call posix_fadvise() on a > non-existent file region, or the wrong file after OID wraparound and > reuse, but it would matter if you actually read it into a buffer, and > if an intervening record might be trying to drop something you have > pinned). > > Some other changes: > > 1. 
I dropped the GUC recovery_prefetch_fpw. I think it was a > possibly useful idea but it's a niche concern and not worth worrying > about for now. > > 2. I simplified the stats. Coming up with a good running average > system seemed like a problem for another day (the numbers before were > hard to interpret). The new stats are super simple counters and > instantaneous values: > > postgres=# select * from pg_stat_prefetch_recovery ; > -[ RECORD 1 ]--+-- > stats_reset| 2021-11-10 09:02:08.590217+13 > prefetch | 13605674 <- times we called posix_fadvise() > hit| 24185289 <- times we found pages already cached > skip_init | 217215 <- times we did nothing because init, not read > skip_new | 192347 <- times we skipped because relation too small > skip_fpw | 27429<- times we skipped because fpw, not read > wal_distance | 10648<- how far ahead in WAL bytes > block_distance | 134 <- how far ahead in block references > io_depth | 50 <- fadvise() calls not yet followed by pread() > > I also removed the code to save and restore the stats via the stats > collector, for now. I figured that persistent stats could be a later > feature, perhaps after the shared memory stats stuff? > > 3. I dropped the code that was caching an SMgrRelation pointer to > avoid smgropen() calls that showed up in some profiles. That probably
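To illustrate the ring-buffer idea in point 2 above, here is a rough stand-alone sketch (all names and the ring size are invented for illustration, this is not the patch code): an LSN is recorded whenever a prefetch is issued, entries are retired once replay advances past them, and the number of live entries approximates io_depth. In the eventual AIO version this bookkeeping would be replaced by the centralised "streaming read" mechanism described above.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */

    #define PREFETCH_RING_SIZE 64   /* arbitrary for this sketch */

    typedef struct PrefetchRing
    {
        XLogRecPtr  lsns[PREFETCH_RING_SIZE];
        int         head;           /* next slot to fill */
        int         tail;           /* oldest still-in-flight entry */
        int         inflight;       /* current estimate of I/O depth */
    } PrefetchRing;

    /* Remember that a prefetch was issued for a block referenced at 'lsn'. */
    static void
    prefetch_issued(PrefetchRing *ring, XLogRecPtr lsn)
    {
        if (ring->inflight == PREFETCH_RING_SIZE)
            return;                 /* ring full: caller should throttle */
        ring->lsns[ring->head] = lsn;
        ring->head = (ring->head + 1) % PREFETCH_RING_SIZE;
        ring->inflight++;
    }

    /* Retire entries once replay has reached 'replayed_lsn'. */
    static void
    replay_advanced(PrefetchRing *ring, XLogRecPtr replayed_lsn)
    {
        while (ring->inflight > 0 && ring->lsns[ring->tail] <= replayed_lsn)
        {
            ring->tail = (ring->tail + 1) % PREFETCH_RING_SIZE;
            ring->inflight--;
        }
    }

    int
    main(void)
    {
        PrefetchRing ring = {{0}, 0, 0, 0};

        prefetch_issued(&ring, 1000);
        prefetch_issued(&ring, 2000);
        printf("io_depth=%d\n", ring.inflight);     /* 2 */
        replay_advanced(&ring, 1500);
        printf("io_depth=%d\n", ring.inflight);     /* 1 */
        return 0;
    }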
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra > wrote: >> One thing that's not clear to me is what happened to the reasons why >> this feature was reverted in the PG14 cycle? > 3. A wild goose chase for bugs on Tom Lane's antique 32 bit PPC > machine. Tom eventually reproduced it with the patches reverted, > which seemed to exonerate them but didn't leave a good feeling: what > was happening, and why did the patches hugely increase the likelihood > of the failure mode? I have no new information on that, but I know > that several people spent a huge amount of time and effort trying to > reproduce it on various types of systems, as did I, so despite not > reaching a conclusion of a bug, this certainly contributed to a > feeling that the patch had run out of steam for the 14 cycle. Yeah ... on the one hand, that machine has shown signs of hard-to-reproduce flakiness, so it's easy to write off the failures I saw as hardware issues. On the other hand, the flakiness I've seen has otherwise manifested as kernel crashes, which is nothing like the consistent test failures I was seeing with the patch. Andres speculated that maybe we were seeing a kernel bug that affects consistency of concurrent reads and writes. That could be an explanation; but it's just evidence-free speculation so far, so I don't feel real convinced by that idea either. Anyway, I hope to find time to see if the issue still reproduces with Thomas' new patch set. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra wrote: > One thing that's not clear to me is what happened to the reasons why > this feature was reverted in the PG14 cycle? Reasons for reverting: 1. A bug in commit 323cbe7c, "Remove read_page callback from XLogReader.". I couldn't easily revert just that piece. This new version doesn't depend on that change anymore, to try to keep things simple. (That particular bug has been fixed in a newer version of that patch[1], which I still think was a good idea incidentally.) 2. A bug where allocation for large records happened before validation. Concretely, you can see that this patch does XLogReadRecordAlloc() after validating the header (usually, same as master), but commit f003d9f8 did it first. (Though Andres pointed out[2] that more work is needed on that to make that logic more robust, and I'm keen to look into that, but that's independent of this work). 3. A wild goose chase for bugs on Tom Lane's antique 32 bit PPC machine. Tom eventually reproduced it with the patches reverted, which seemed to exonerate them but didn't leave a good feeling: what was happening, and why did the patches hugely increase the likelihood of the failure mode? I have no new information on that, but I know that several people spent a huge amount of time and effort trying to reproduce it on various types of systems, as did I, so despite not reaching a conclusion of a bug, this certainly contributed to a feeling that the patch had run out of steam for the 14 cycle. This week I'll have another crack at getting that TAP test I proposed that runs the regression tests with a streaming replica to work on Windows. That does approximately what Tom was doing when he saw problem #3, which I'd like to have as standard across the build farm. [1] https://www.postgresql.org/message-id/20211007.172820.1874635561738958207.horikyota.ntt%40gmail.com [2] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg%40alap3.anarazel.de
Re: WIP: WAL prefetch (another approach)
On 11/26/21 22:16, Thomas Munro wrote: On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra wrote: The results are pretty good / similar to previous results. Replaying the 1h worth of work on a smaller machine takes ~5:30h without prefetching (master or with prefetching disabled). With prefetching enabled this drops to ~2h (default config) and ~1h (with tuning). Thanks for testing! Wow, that's a nice graph. This has bit-rotted already due to Robert's work on ripping out globals, so I'll post a rebase early next week, and incorporate your code feedback. One thing that's not clear to me is what happened to the reasons why this feature was reverted in the PG14 cycle? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra wrote: > The results are pretty good / similar to previous results. Replaying the > 1h worth of work on a smaller machine takes ~5:30h without prefetching > (master or with prefetching disabled). With prefetching enabled this > drops to ~2h (default config) and ~1h (with tuning). Thanks for testing! Wow, that's a nice graph. This has bit-rotted already due to Robert's work on ripping out globals, so I'll post a rebase early next week, and incorporate your code feedback.
Re: WIP: WAL prefetch (another approach)
Hi, It's great you posted a new version of this patch, so I took a brief look at it. The code seems in pretty good shape, I haven't found any real issues - just two minor comments: This seems a bit strange: #define DEFAULT_DECODE_BUFFER_SIZE 0x1 Why not define this as a simple decimal value? Is there something special about this particular value, or is it arbitrary? I guess it's simply the minimum for the wal_decode_buffer_size GUC, but why not use the GUC in all places that decode WAL? FWIW I don't think we include updates to typedefs.list in patches. I also repeated the benchmarks I did at the beginning of the year [1]. Attached is a chart with four different configurations: 1) master (f79962d826) 2) patched (with prefetching disabled) 3) patched (with default configuration) 4) patched (with I/O concurrency 256 and 2MB decode buffer) For all configs the shared buffers were set to 64GB, checkpoints every 20 minutes, etc. The results are pretty good / similar to previous results. Replaying the 1h worth of work on a smaller machine takes ~5:30h without prefetching (master or with prefetching disabled). With prefetching enabled this drops to ~2h (default config) and ~1h (with tuning). regards [1] https://www.postgresql.org/message-id/c5d52837-6256-0556-ac8c-d6d3d558820a%40enterprisedb.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
> On 10 May 2021, at 06:11, Thomas Munro wrote: > On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost wrote: >> I tend to agree with the idea to revert it, perhaps a +0 on that, but if >> others argue it should be fixed in-place, I wouldn’t complain about it. > > Reverted. > > Note: eelpout may return a couple of failures because it's set up to > run with recovery_prefetch=on (now an unknown GUC), and it'll be a few > hours before I can access that machine to adjust that... > >> I very much encourage the idea of improving testing in this area and would >> be happy to try and help do so in the 15 cycle. > > Cool. I'm going to try out some ideas. Skimming this thread without all the context it's not entirely clear which patch the CF entry relates to (I assume it's the one from April 7 based on attached mail-id but there is a revert from May?), and the CF app and CF bot are also in disagreement which is the latest one. Could you post an updated version of the patch which is for review? -- Daniel Gustafsson https://vmware.com/
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost wrote: > On Wed, Apr 21, 2021 at 19:17 Thomas Munro wrote: >> On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro wrote: >> ... Personally I think the right thing to do now is to revert it >> and re-propose for 15 early in the cycle, supported with some better >> testing infrastructure. > > I tend to agree with the idea to revert it, perhaps a +0 on that, but if > others argue it should be fixed in-place, I wouldn’t complain about it. Reverted. Note: eelpout may return a couple of failures because it's set up to run with recovery_prefetch=on (now an unknown GUC), and it'll be a few hours before I can access that machine to adjust that... > I very much encourage the idea of improving testing in this area and would be > happy to try and help do so in the 15 cycle. Cool. I'm going to try out some ideas.
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-05-04 18:08:35 -0700, Andres Freund wrote: > But the issue that 70b4f82a4b is trying to address seems bigger to > me. The reason it's so easy to hit the issue is that walreceiver does < > 8KB writes into recycled WAL segments *without* zero-filling the tail > end of the page - which will commonly be filled with random older > contents, because we'll use recycled segments. I think that > *drastically* increases the likelihood of finding something that looks > like a valid record header compared to the situation on a primary where > zeroing pages before use makes that pretty unlikely. I've written an experimental patch to deal with this and, as expected, it does make the end-of-wal detection a lot more predictable and reliable. There are only two types of possible errors outside of crashes: A record length of 0 (the end of WAL is within a page), and the page header LSN mismatching (the end of WAL is at a page boundary). This seems like a significant improvement. However: It's nontrivial to do this nicely and in a backpatchable way in XLogWalRcvWrite(). Or at least I haven't found a good way: - We can't extend the input buffer to XLogWalRcvWrite(), it's from libpq. - We don't want to copy the entire buffer (commonly 128KiB) to a new buffer that we then can extend by 0-BLCKSZ of zeroes to cover the trailing part of the last page. - In PG13+ we can do this utilizing pg_writev(), adding another IOV entry covering the trailing space to be padded. - It's nicer to avoid increasing the number of write() calls, but it's not as crucial as the earlier points. I'm also a bit uncomfortable with another aspect, although I can't really see a problem: When switching to receiving WAL via walreceiver, we always start at a segment boundary, even if we had received most of that segment before. Currently that won't end up with any trailing space that needs to be zeroed, because the server will always send 128KB chunks, but there's no formal guarantee for that. It seems a bit odd that we could end up zeroing trailing space that already contains valid data, just to overwrite it with valid data again. But it ought to always be fine. The least offensive way I could come up with is for XLogWalRcvWrite() to always write partial pages in a separate pg_pwrite(). When writing a partial page, and the previous write position was not already on that same page, copy the buffer into a local XLOG_BLCKSZ-sized buffer (although we'll never use more than XLOG_BLCKSZ-1 I think), and (re)zero out the trailing part. One thing this does not yet handle is if we were to get a partial write - we'd not notice again that we need to pad the end of the page. Does anybody have a better idea? I really wish we had a version of pg_p{read,write}[v] that internally handled partial IOs, retrying as long as they see > 0 bytes written. Greetings, Andres Freund
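PS: To sketch the iovec idea from the list above in stand-alone form (using plain POSIX pwritev() rather than the pg_ wrapper, with invented names, and ignoring error handling and the partial-write problem mentioned above):

    #include <sys/uio.h>
    #include <unistd.h>

    #define XLOG_BLCKSZ 8192        /* usual build default */

    /*
     * Hypothetical helper, not the real XLogWalRcvWrite(): write 'nbytes' of
     * WAL at 'startoff' and, if the write ends partway through a page, append
     * zeroes up to the end of that page so stale recycled-segment bytes never
     * directly follow freshly written WAL on the same page.
     */
    ssize_t
    write_wal_zero_padded(int fd, const char *buf, size_t nbytes, off_t startoff)
    {
        static const char zeros[XLOG_BLCKSZ] = {0};
        size_t      tail = (size_t) ((startoff + nbytes) % XLOG_BLCKSZ);
        struct iovec iov[2];
        int         iovcnt = 1;

        iov[0].iov_base = (void *) buf;
        iov[0].iov_len = nbytes;
        if (tail != 0)
        {
            /* second iovec covers the rest of the final, partial page */
            iov[1].iov_base = (void *) zeros;
            iov[1].iov_len = XLOG_BLCKSZ - tail;
            iovcnt = 2;
        }
        return pwritev(fd, iov, iovcnt, startoff);
    }

The separate-pg_pwrite() variant described above would instead copy the partial page into a local XLOG_BLCKSZ-sized buffer and zero its tail before writing, at the cost of an extra write() call.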
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-05-04 09:46:12 -0400, Tom Lane wrote: > Yeah, I have also spent a fair amount of time trying to reproduce it > elsewhere, without success so far. Notably, I've been trying on a > PPC Mac laptop that has a fairly similar CPU to what's in the G4, > though a far slower disk drive. So that seems to exclude theories > based on it being PPC-specific. > > I suppose that if we're unable to reproduce it on at least one other box, > we have to write it off as hardware flakiness. I wonder if there's a chance what we're seeing is an OS memory ordering bug, or a race between walreceiver writing data and the startup process reading it. When the startup process is able to keep up, there often will be a very small time delta between the startup process reading a page that the walreceiver just wrote. And if the currently read page was the tail page written to by a 'w' message, it'll often be written to again in short order - potentially while the startup process is reading it. It'd not terribly surprise me if an old OS version on an old processor had some issues around that. Were there any cases of walsender terminating and reconnecting around the failures? It looks suspicious that XLogPageRead() does not invalidate the xlogreader state when retrying. Normally that's xlogreader's responsibility, but there is that whole XLogReaderValidatePageHeader() business. But I don't quite see how it'd actually cause problems. Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-05-04 15:47:41 -0400, Tom Lane wrote: > BTW, that conclusion shouldn't distract us from the very real bug > that Andres identified. I was just scraping the buildfarm logs > concerning recent failures, and I found several recent cases > that match the symptom he reported: > [...] > They all show the standby in recovery/019_replslot_limit.pl failing > with symptoms like > > 2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down > in recovery at 2021-05-04 07:41:39 UTC > 2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode > 2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8 > 2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state > reached at 0/1D0 > 2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request > size 1476397045 > 2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to > accept read only connections > 2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) > exited with exit code 1 Yea, that's the pre-existing end-of-log-issue that got more likely as well as more consequential (by accident) in Thomas' patch. It's easy to reach parity with the state in 13, it's just a matter of changing the order in one place. But I think we need to do something for all branches here. The bandaid that was added to allocate_recordbuf() doesn't really seem sufficient to me. This is commit 70b4f82a4b5cab5fc12ff876235835053e407155 Author: Michael Paquier Date: 2018-06-18 10:43:27 +0900 Prevent hard failures of standbys caused by recycled WAL segments In <= 13 the current state is that we'll allocate an effectively random number of bytes, as long as that random number is below 1GB, whenever we reach the end of the WAL with a record header split across a page boundary (because there we don't validate the header before allocating). That allocation is then not freed for the lifetime of the xlogreader. And for FRONTEND uses of xlogreader we'll just happily allocate 4GB. The specific problem here is that we don't validate the record header before allocating when the record header is split across a page boundary - without much need as far as I can tell? Until we've read the entire header, we actually don't need to allocate the record buffer? This seems like an issue that needs to be fixed to be more robust in crash recovery scenarios where obviously we could just have failed with half-written records. But the issue that 70b4f82a4b is trying to address seems bigger to me. The reason it's so easy to hit the issue is that walreceiver does < 8KB writes into recycled WAL segments *without* zero-filling the tail end of the page - which will commonly be filled with random older contents, because we'll use recycled segments. I think that *drastically* increases the likelihood of finding something that looks like a valid record header compared to the situation on a primary where zeroing pages before use makes that pretty unlikely. > (BTW, the behavior seen here where the failure occurs *immediately* > after reporting "consistent recovery state reached" is seen in the > other reports as well, including Andres' version. I wonder if that > means anything.) That's to be expected, I think. There's not a lot of data that needs to be replayed, and we'll always reach consistency before the end of the WAL unless you're dealing with starting from an in-progress base-backup that hasn't yet finished or such.
The test causes replication to fail shortly after that, so we'll always switch to doing recovery from pg_wal, which then will hit the end of the WAL, hitting this issue with, I think, ~25% likelihood (data from recycled WAL data is probably *roughly* evenly distributed, and any 4byte value above 1GB will hit this error in 14). Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
I wrote: > I suppose that if we're unable to reproduce it on at least one other box, > we have to write it off as hardware flakiness. BTW, that conclusion shouldn't distract us from the very real bug that Andres identified. I was just scraping the buildfarm logs concerning recent failures, and I found several recent cases that match the symptom he reported: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2021-04-23%2022%3A27%3A41 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-04-21%2005%3A15%3A24 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-20%2002%3A03%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-05-04%2004%3A07%3A41 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-20%2021%3A08%3A59 They all show the standby in recovery/019_replslot_limit.pl failing with symptoms like 2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC 2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode 2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8 2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D0 2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045 2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections 2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1 (BTW, the behavior seen here where the failure occurs *immediately* after reporting "consistent recovery state reached" is seen in the other reports as well, including Andres' version. I wonder if that means anything.) regards, tom lane
Re: WIP: WAL prefetch (another approach)
Tomas Vondra writes: > On 5/3/21 7:42 AM, Thomas Munro wrote: >> Hmm, yeah that does seem plausible. It would be nice to see a report >> from any other system though. I'm still trying, and reviewing... > FWIW I've ran the test (make installcheck-parallel in a loop) on four > different machines - two x86_64 ones, and two rpi4. The x86 boxes did > ~1000 rounds each (and one of them had 5 local replicas) without any > issue. The rpi4 machines did ~50 rounds each, also without failures. Yeah, I have also spent a fair amount of time trying to reproduce it elsewhere, without success so far. Notably, I've been trying on a PPC Mac laptop that has a fairly similar CPU to what's in the G4, though a far slower disk drive. So that seems to exclude theories based on it being PPC-specific. I suppose that if we're unable to reproduce it on at least one other box, we have to write it off as hardware flakiness. I'm not entirely comfortable with that answer, but I won't push for reversion of the WAL patches without more evidence that there's a real issue. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On 5/3/21 7:42 AM, Thomas Munro wrote: On Sun, May 2, 2021 at 3:16 PM Tom Lane wrote: That last point means that there was some hard-to-hit problem even before any of the recent WAL-related changes. However, 323cbe7c7 (Remove read_page callback from XLogReader) increased the failure rate by at least a factor of 5, and 1d257577e (Optionally prefetch referenced data) seems to have increased it by another factor of 4. But it looks like f003d9f87 (Add circular WAL decoding buffer) didn't materially change the failure rate. Oh, wow. There are several surprising results there. Thanks for running those tests for so long so that we could see the rarest failures. Even if there are somehow *two* causes of corruption, one preexisting and one added by the refactoring or decoding patches, I'm struggling to understand how the chance increases with 1d2575, since that only adds code that isn't reached when not enabled (though I'm going to re-review that). Considering that 323cbe7c7 was supposed to be just refactoring, and 1d257577e is allegedly disabled-by-default, these are surely not the results I was expecting to get. +1 It seems like it's still an open question whether all this is a real bug, or flaky hardware. I have seen occasional kernel freezeups (or so I think -- machine stops responding to keyboard or network input) over the past year or two, so I cannot in good conscience rule out the flaky-hardware theory. But it doesn't smell like that kind of problem to me. I think what we're looking at is a timing-sensitive bug that was there before (maybe long before?) and these commits happened to make it occur more often on this particular hardware. This hardware is enough unlike anything made in the past decade that it's not hard to credit that it'd show a timing problem that nobody else can reproduce. Hmm, yeah that does seem plausible. It would be nice to see a report from any other system though. I'm still trying, and reviewing... FWIW I've ran the test (make installcheck-parallel in a loop) on four different machines - two x86_64 ones, and two rpi4. The x86 boxes did ~1000 rounds each (and one of them had 5 local replicas) without any issue. The rpi4 machines did ~50 rounds each, also without failures. Obviously, it's possible there's something that neither of those (very different systems) triggers, but I'd say it might also be a hint that this really is a hw issue on the old ppc macs. Or maybe something very specific to that arch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
On Sun, May 2, 2021 at 3:16 PM Tom Lane wrote: > That last point means that there was some hard-to-hit problem even > before any of the recent WAL-related changes. However, 323cbe7c7 > (Remove read_page callback from XLogReader) increased the failure > rate by at least a factor of 5, and 1d257577e (Optionally prefetch > referenced data) seems to have increased it by another factor of 4. > But it looks like f003d9f87 (Add circular WAL decoding buffer) > didn't materially change the failure rate. Oh, wow. There are several surprising results there. Thanks for running those tests for so long so that we could see the rarest failures. Even if there are somehow *two* causes of corruption, one preexisting and one added by the refactoring or decoding patches, I'm struggling to understand how the chance increases with 1d2575, since that only adds code that isn't reached when not enabled (though I'm going to re-review that). > Considering that 323cbe7c7 was supposed to be just refactoring, > and 1d257577e is allegedly disabled-by-default, these are surely > not the results I was expecting to get. +1 > It seems like it's still an open question whether all this is > a real bug, or flaky hardware. I have seen occasional kernel > freezeups (or so I think -- machine stops responding to keyboard > or network input) over the past year or two, so I cannot in good > conscience rule out the flaky-hardware theory. But it doesn't > smell like that kind of problem to me. I think what we're looking > at is a timing-sensitive bug that was there before (maybe long > before?) and these commits happened to make it occur more often > on this particular hardware. This hardware is enough unlike > anything made in the past decade that it's not hard to credit > that it'd show a timing problem that nobody else can reproduce. Hmm, yeah that does seem plausible. It would be nice to see a report from any other system though. I'm still trying, and reviewing...
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 29, 2021 at 12:24 PM Tom Lane wrote: > Andres Freund writes: > > On 2021-04-28 19:24:53 -0400, Tom Lane wrote: > >> IOW, we've spent over twice as many CPU cycles shipping data to the > >> standby as we did in applying the WAL on the standby. > > > I don't really know how the time calculation works on mac. Is there a > > chance it includes time spent doing IO? For comparison, on a modern Linux system I see numbers like this, while running that 025_stream_rep_regress.pl test I posted in a nearby thread: USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND tmunro 2150863 22.5 0.0 55348 6752 ?Ss 12:59 0:07 postgres: standby_1: startup recovering 00010002003C tmunro 2150867 17.5 0.0 55024 6364 ?Ss 12:59 0:05 postgres: standby_1: walreceiver streaming 2/3C675D80 tmunro 2150868 11.7 0.0 55296 7192 ?Ss 12:59 0:04 postgres: primary: walsender tmunro [local] streaming 2/3C675D80 Those ratios are better but it's still hard work, and perf shows the CPU time is all in page cache schlep: 22.44% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string 20.12% postgres [kernel.kallsyms] [k] __add_to_page_cache_locked 7.30% postgres [kernel.kallsyms] [k] iomap_set_page_dirty That was with all three patches reverted, so it's nothing new. Definitely room for improvement... there have been a few discussions about not using a buffered file for high-frequency data exchange and relaxing various timing rules, which we should definitely look into, but I wouldn't be at all surprised if HFS+ was just much worse at this. Thinking more about good old HFS+... I guess it's remotely possible that there might have been coherency bugs in that could be exposed by our usage pattern, but then that doesn't fit too well with the clues I have from light reading: this is a non-SMP system, and it's said that HFS+ used to serialise pretty much everything on big filesystem locks anyway.
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > On Thu, Apr 29, 2021 at 4:45 AM Tom Lane wrote: >> Andres Freund writes: >>> Tom, any chance you could check if your machine repros the issue before >>> these commits? >> Wilco, but it'll likely take a little while to get results ... > FWIW I also chewed through many megawatts trying to reproduce this on > a PowerPC system in 64 bit big endian mode, with an emulator. No > cigar. However, it's so slow that I didn't make it to 10 runs... So I've expended a lot of kilowatt-hours over the past several days, and I've got results that are interesting but don't really get us any closer to a resolution. To recap, the test lashup is: * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive) * Standard debug build (--enable-debug --enable-cassert) * Out-of-the-box configuration, except add wal_consistency_checking = all and configure a wal-streaming standby on the same machine * Repeatedly run "make installcheck-parallel", but skip the tablespace test to avoid issues with the standby trying to use the same directory * Delay long enough after each installcheck-parallel to let the standby catch up (the run proper is ~24 min, plus 2 min for catchup) The failures I'm seeing generally look like 2021-05-01 15:33:10.968 EDT [8281] FATAL: inconsistent page found, rel 1663/58186/66338, forknum 0, blkno 19 2021-05-01 15:33:10.968 EDT [8281] CONTEXT: WAL redo at 3/4CE905B8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/58186/66338, blk 19 FPW with a variety of WAL record types being named, so it doesn't seem to be specific to any particular record type. I've twice gotten the bogus-checksum-and-then-assertion-failure I reported before: 2021-05-01 17:07:52.992 EDT [17464] LOG: incorrect resource manager data checksum in record at 3/E0073EA4 TRAP: FailedAssertion("state->recordRemainLen > 0", File: "xlogreader.c", Line: 567, PID: 17464) In both of those cases, the WAL on disk was perfectly fine, and the same is true of most of the "inconsistent page" complaints. So the issue definitely seems to be about the startup process mis-reading data that was correctly shipped over. Anyway, the new and interesting data concerns the relative failure rates of different builds: * Recent HEAD (from 4-28 and 5-1): 4 failures in 8 test cycles * Reverting 1d257577e: 1 failure in 8 test cycles * Reverting 1d257577e and f003d9f87: 3 failures in 28 cycles * Reverting 1d257577e, f003d9f87, and 323cbe7c7: 2 failures in 93 cycles That last point means that there was some hard-to-hit problem even before any of the recent WAL-related changes. However, 323cbe7c7 (Remove read_page callback from XLogReader) increased the failure rate by at least a factor of 5, and 1d257577e (Optionally prefetch referenced data) seems to have increased it by another factor of 4. But it looks like f003d9f87 (Add circular WAL decoding buffer) didn't materially change the failure rate. Considering that 323cbe7c7 was supposed to be just refactoring, and 1d257577e is allegedly disabled-by-default, these are surely not the results I was expecting to get. It seems like it's still an open question whether all this is a real bug, or flaky hardware. I have seen occasional kernel freezeups (or so I think -- machine stops responding to keyboard or network input) over the past year or two, so I cannot in good conscience rule out the flaky-hardware theory. But it doesn't smell like that kind of problem to me. I think what we're looking at is a timing-sensitive bug that was there before (maybe long before?) 
and these commits happened to make it occur more often on this particular hardware. This hardware is enough unlike anything made in the past decade that it's not hard to credit that it'd show a timing problem that nobody else can reproduce. (I did try the time-honored ritual of reseating all the machine's RAM, partway through this. Doesn't seem to have changed anything.) Anyway, I'm not sure where to go from here. I'm for sure nowhere near being able to identify the bug --- and if there really is a bug that formerly had a one-in-fifty reproduction rate, I have zero interest in trying to identify where it started by bisecting. It'd take at least a day per bisection step, and even that might not be accurate enough. (But, if anyone has ideas of specific commits to test, I'd be willing to try a few.) regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 29, 2021 at 3:14 PM Andres Freund wrote: > To me it looks like a smaller version of the problem is present in < 14, > albeit only when the record header is split across a page boundary. In that > case we don't validate the record header immediately, only once it's > completely read. But we do believe the total size, and try to allocate > that. > > There's a really crufty escape hatch (from 70b4f82a4b): Right, I made that problem worse, and that could probably be changed to be no worse than 13 by reordering those operations. PS Sorry for my intermittent/slow responses on this thread this week, as I'm mostly away from the keyboard due to personal commitments. I'll be back in the saddle next week to tidy this up, most likely by reverting. The main thought I've been having about this whole area is that, aside from the lack of general testing of recovery, which we should definitely address[1], what it really needs is a decent test harness to drive it through all interesting scenarios and states at a lower level, independently. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > I was now able to reproduce the problem again, and I'm afraid that the > bug I hit is likely separate from Tom's. Yeah, I think so --- the symptoms seem quite distinct. My score so far today on the G4 is: 12 error-free regression test cycles on b3ee4c503 (plus one more with shared_buffers set to 16MB, on the strength of your previous hunch --- didn't fail for me though) HEAD failed on the second run with the same symptom as before: 2021-04-28 22:57:17.048 EDT [50479] FATAL: inconsistent page found, rel 1663/58183/69545, forknum 0, blkno 696 2021-04-28 22:57:17.048 EDT [50479] CONTEXT: WAL redo at 4/B72D408 for Heap/INSERT: off 77 flags 0x00; blkref #0: rel 1663/58183/69545, blk 696 FPW This seems to me to be pretty strong evidence that I'm seeing *something* real. I'm currently trying to isolate a specific commit to pin it on. A straight "git bisect" isn't going to work because so many people had broken so many different things right around that date :-(, so it may take awhile to get a good answer. regards, tom lane
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 17:59:22 -0700, Andres Freund wrote: > I can however say that pg_waldump on the standby's pg_wal does also > fail. The failure as part of the backend is "invalid memory alloc > request size", whereas in pg_waldump I get the much more helpful: > pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect > prev-link 416200FF/FF00 at 4/7F5C3200 > > In frontend code that allocation actually succeeds, because there is no > size check. But in backend code we run into the size check, and thus > don't even display a useful error. > > In 13 the header is validated before allocating space for the > record(except if header is spread across pages) - it seems inadvisable > to turn that around? I was now able to reproduce the problem again, and I'm afraid that the bug I hit is likely separate from Tom's. The allocation thing above is the issue in my case: The walsender connection ended (I restarted the primary), thus the startup process switches to replaying locally. For some reason the end of the WAL contains non-zero data (I think it's because walreceiver doesn't zero out pages - that's bad!). Because the allocation happens before the header is validated, we reproducibly end up in the mcxt.c ERROR path, failing recovery. To me it looks like a smaller version of the problem is present in < 14, albeit only when the record header is split across a page boundary. In that case we don't validate the record header immediately, only once it's completely read. But we do believe the total size, and try to allocate that. There's a really crufty escape hatch (from 70b4f82a4b) to that: /* * Note that in much unlucky circumstances, the random data read from a * recycled segment can cause this routine to be called with a size * causing a hard failure at allocation. For a standby, this would cause * the instance to stop suddenly with a hard failure, preventing it to * retry fetching WAL from one of its sources which could allow it to move * on with replay without a manual restart. If the data comes from a past * recycled segment and is still valid, then the allocation may succeed * but record checks are going to fail so this would be short-lived. If * the allocation fails because of a memory shortage, then this is not a * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM. */ if (!AllocSizeIsValid(newSize)) return false; but it looks to me like that's pretty much the wrong fix, at least in the case where we've not yet validated the rest of the header. We don't need to allocate all that data before we've read the rest of the *fixed-size* header. It also seems to me that 70b4f82a4b should also have changed walsender to pad out the received data to an 8KB boundary? Greetings, Andres Freund
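PS: In simplified, hypothetical C (invented names, not the actual xlogreader code), the ordering being argued for here is: assemble the fixed-size header first, even when it is split across a page boundary, validate it, and only then size the allocation from xl_tot_len.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Invented stand-ins for the real fixed-size record header and checks. */
    typedef struct SketchHeader
    {
        uint32_t    xl_tot_len;     /* total record length, header included */
        uint32_t    xl_info;        /* stands in for the remaining fixed fields */
    } SketchHeader;

    /*
     * Assemble the fixed-size header into a small stack object, in one or two
     * pieces if it is split across a page boundary.  No heap allocation needed.
     */
    void
    assemble_header(SketchHeader *hdr,
                    const char *first_chunk, size_t first_len,
                    const char *next_page)
    {
        memcpy(hdr, first_chunk, first_len);
        if (first_len < sizeof(SketchHeader))
            memcpy((char *) hdr + first_len, next_page,
                   sizeof(SketchHeader) - first_len);
    }

    /* Stand-in for the real prev-link / rmgr / length sanity checks. */
    bool
    valid_header(const SketchHeader *hdr)
    {
        return hdr->xl_tot_len >= sizeof(SketchHeader) &&
            hdr->xl_tot_len < 1024UL * 1024 * 1024;
    }

    /* Only a header that already passed validation may size the allocation. */
    char *
    allocate_record_buffer(const SketchHeader *hdr)
    {
        if (!valid_header(hdr))
            return NULL;            /* garbage xl_tot_len never reaches malloc() */
        return malloc(hdr->xl_tot_len);
    }

With that ordering, a garbage length read off a recycled page is rejected by header validation instead of reaching the allocator at all.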
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 17:59:22 -0700, Andres Freund wrote: > I can however say that pg_waldump on the standby's pg_wal does also > fail. The failure as part of the backend is "invalid memory alloc > request size", whereas in pg_waldump I get the much more helpful: > pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect > prev-link 416200FF/FF00 at 4/7F5C3200 There's definitely something broken around continuation records, in XLogFindNextRecord(). Which means that it's not the cause for the server side issue, but obviously still not good. The conversion of XLogFindNextRecord() to be state machine based basically only works in a narrow set of circumstances. Whenever the end of the first record read is on a different page than the start of the record, we'll endlessly loop. We'll go into XLogFindNextRecord(), and return until we've successfully read the page header. Then we'll enter the second loop. Which will try to read until the end of the first record. But after returning the first loop will again ask for page header. Even if that's fixed, the second loop alone has the same problem: As XLogBeginRead() is called unconditionally we'll start read the start of the record, discover that it needs data on a second page, return, and do the same thing again. I think it needs something roughly like the attached. Greetings, Andres Freund diff --git i/src/include/access/xlogreader.h w/src/include/access/xlogreader.h index 3b8af31a8fe..82a80cf2bf5 100644 --- i/src/include/access/xlogreader.h +++ w/src/include/access/xlogreader.h @@ -297,6 +297,7 @@ struct XLogFindNextRecordState XLogReaderState *reader_state; XLogRecPtr targetRecPtr; XLogRecPtr currRecPtr; + bool found_start; }; /* Report that data is available for decoding. */ diff --git i/src/backend/access/transam/xlogreader.c w/src/backend/access/transam/xlogreader.c index 4277e92d7c9..935c841347f 100644 --- i/src/backend/access/transam/xlogreader.c +++ w/src/backend/access/transam/xlogreader.c @@ -868,7 +868,7 @@ XLogDecodeOneRecord(XLogReaderState *state, bool allow_oversized) /* validate record header if not yet */ if (!state->record_verified && record_len >= SizeOfXLogRecord) { -if (!ValidXLogRecordHeader(state, state->DecodeRecPtr, + if (!ValidXLogRecordHeader(state, state->DecodeRecPtr, state->PrevRecPtr, prec)) goto err; @@ -1516,6 +1516,7 @@ InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr) state->reader_state = reader_state; state->targetRecPtr = start_ptr; state->currRecPtr = start_ptr; + state->found_start = false; return state; } @@ -1545,7 +1546,7 @@ XLogFindNextRecord(XLogFindNextRecordState *state) * skip over potential continuation data, keeping in mind that it may span * multiple pages */ - while (true) + while (!state->found_start) { XLogRecPtr targetPagePtr; int targetRecOff; @@ -1616,7 +1617,12 @@ XLogFindNextRecord(XLogFindNextRecordState *state) * because either we're at the first record after the beginning of a page * or we just jumped over the remaining data of a continuation. */ - XLogBeginRead(state->reader_state, state->currRecPtr); + if (!state->found_start) + { + XLogBeginRead(state->reader_state, state->currRecPtr); + state->found_start = true; + } + while ((result = XLogReadRecord(state->reader_state, &record, &errormsg)) != XLREAD_FAIL) {
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 20:24:43 -0400, Tom Lane wrote: > Andres Freund writes: > > Oh! I was about to ask how much shared buffers your primary / standby > > have. > Default configurations, so 128MB each. I thought that possibly initdb would detect less or something... I assume this is 32bit? I did notice that a 32bit test took a lot longer than a 64bit test. But didn't investigate so far. > And I think I may actually have reproduce a variant of the issue! Unfortunately I had not set up things in a way that the primary retains the WAL, making it harder to compare whether it's the WAL that got corrupted or whether it's a decoding bug. I can however say that pg_waldump on the standby's pg_wal does also fail. The failure as part of the backend is "invalid memory alloc request size", whereas in pg_waldump I get the much more helpful: pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF00 at 4/7F5C3200 In frontend code that allocation actually succeeds, because there is no size check. But in backend code we run into the size check, and thus don't even display a useful error. In 13 the header is validated before allocating space for the record(except if header is spread across pages) - it seems inadvisable to turn that around? Greetings, Andres Freund
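The backend/frontend difference Andres describes comes down to the guard in the backend allocator. A simplified sketch (not the exact code paths in mcxt.c or fe_memutils.c; total_len and buf are illustrative names):

    /* Backend: any request above MaxAllocSize (~1 GB) is rejected up front,
     * which is the terse error Andres hit. */
    if (!AllocSizeIsValid(total_len))
        elog(ERROR, "invalid memory alloc request size %zu", (Size) total_len);
    buf = palloc(total_len);

    /* Frontend (pg_waldump): pg_malloc() has no such cap, so the bogus length
     * is simply allocated and the later prev-link / CRC validation gets a
     * chance to produce the far more useful message. */
    buf = pg_malloc(total_len);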
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > On 2021-04-28 19:24:53 -0400, Tom Lane wrote: >> IOW, we've spent over twice as many CPU cycles shipping data to the >> standby as we did in applying the WAL on the standby. > I don't really know how the time calculation works on mac. Is there a > chance it includes time spent doing IO? I'd be pretty astonished if it did. This is basically a NetBSD system remember (in fact, this ancient macOS release is a good deal closer to those roots than modern versions). BSDen have never accounted for time that way AFAIK. Also, the "ps" man page says specifically that that column is CPU time. > Oh! I was about to ask how much shared buffers your primary / standby > have. And I think I may actually have reproduce a variant of the issue! Default configurations, so 128MB each. regards, tom lane
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 19:24:53 -0400, Tom Lane wrote: > But I happened to notice the accumulated CPU time for the background > processes: > > USER PID %CPU %MEM VSZRSS TT STAT STARTED TIME COMMAND > tgl 19048 0.0 4.4 229952 92196 ?? Ss3:19PM 19:59.19 > postgres: startup recovering 000100140022 > tgl 19051 0.0 0.1 229656 1696 ?? Ss3:19PM 27:09.14 > postgres: walreceiver streaming 14/227D8F14 > tgl 19052 0.0 0.1 229904 2516 ?? Ss3:19PM 17:38.17 > postgres: walsender tgl [local] streaming 14/227D8F14 > > IOW, we've spent over twice as many CPU cycles shipping data to the > standby as we did in applying the WAL on the standby. Is this > expected? I've got wal_consistency_checking = all, which is bloating > the WAL volume quite a bit, but still it seems like the walsender and > walreceiver have little excuse for spending more cycles per byte > than the startup process. I don't really know how the time calculation works on mac. Is there a chance it includes time spent doing IO? On the primary the WAL IO is done by a lot of backends, but on the standby it's all going to be the walreceiver. And the walreceiver does fsyncs in a not particularly efficient manner. FWIW, on my linux workstation no such difference is visible: USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND andres 2910540 9.4 0.0 2237252 126680 ? Ss 16:55 0:20 postgres: dev assert standby: startup recovering 00010002003F andres 2910544 5.2 0.0 2236724 9260 ?Ss 16:55 0:11 postgres: dev assert standby: walreceiver streaming 2/3FDCF118 andres 2910545 2.1 0.0 2237036 10672 ? Ss 16:55 0:04 postgres: dev assert: walsender andres [local] streaming 2/3FDCF118 > (This is testing b3ee4c503, so if Thomas' WAL changes improved > efficiency of the replay process at all, the discrepancy could be > even worse in HEAD.) The prefetching isn't enabled by default, so I'd not expect meaningful differences... And even with the prefetching enabled, our normal regression tests largely are resident in s_b, so there shouldn't be much prefetching. Oh! I was about to ask how much shared buffers your primary / standby have. And I think I may actually have reproduce a variant of the issue! I previously had played around with different settings that I thought might increase the likelihood of reproducing the problem. But this time I set shared_buffers lower than before, and got: 2021-04-28 17:03:22.174 PDT [2913840][] LOG: database system was shut down in recovery at 2021-04-28 17:03:11 PDT 2021-04-28 17:03:22.174 PDT [2913840][] LOG: entering standby mode 2021-04-28 17:03:22.178 PDT [2913840][1/0] LOG: redo starts at 2/416C6278 2021-04-28 17:03:37.628 PDT [2913840][1/0] LOG: consistent recovery state reached at 4/7F5C3200 2021-04-28 17:03:37.628 PDT [2913840][1/0] FATAL: invalid memory alloc request size 3053455757 2021-04-28 17:03:37.628 PDT [2913839][] LOG: database system is ready to accept read only connections 2021-04-28 17:03:37.636 PDT [2913839][] LOG: startup process (PID 2913840) exited with exit code 1 This reproduces across restarts. Yay, I guess. Isn't it off that we get a "database system is ready to accept read only connections"? Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > FWIW I also chewed through many megawatts trying to reproduce this on > a PowerPC system in 64 bit big endian mode, with an emulator. No > cigar. However, it's so slow that I didn't make it to 10 runs... Speaking of megawatts ... my G4 has now finished about ten cycles of installcheck-parallel without a failure, which isn't really enough to draw any conclusions yet. But I happened to notice the accumulated CPU time for the background processes: USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND tgl 19048 0.0 4.4 229952 92196 ?? Ss 3:19PM 19:59.19 postgres: startup recovering 000000010000001400000022 tgl 19051 0.0 0.1 229656 1696 ?? Ss 3:19PM 27:09.14 postgres: walreceiver streaming 14/227D8F14 tgl 19052 0.0 0.1 229904 2516 ?? Ss 3:19PM 17:38.17 postgres: walsender tgl [local] streaming 14/227D8F14 IOW, we've spent over twice as many CPU cycles shipping data to the standby as we did in applying the WAL on the standby. Is this expected? I've got wal_consistency_checking = all, which is bloating the WAL volume quite a bit, but still it seems like the walsender and walreceiver have little excuse for spending more cycles per byte than the startup process. (This is testing b3ee4c503, so if Thomas' WAL changes improved efficiency of the replay process at all, the discrepancy could be even worse in HEAD.) regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 29, 2021 at 4:45 AM Tom Lane wrote: > Andres Freund writes: > > Tom, any chance you could check if your machine repros the issue before > > these commits? > > Wilco, but it'll likely take a little while to get results ... FWIW I also chewed through many megawatts trying to reproduce this on a PowerPC system in 64 bit big endian mode, with an emulator. No cigar. However, it's so slow that I didn't make it to 10 runs...
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > Tom, any chance you could check if your machine repros the issue before > these commits? Wilco, but it'll likely take a little while to get results ... regards, tom lane
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-22 13:59:58 +1200, Thomas Munro wrote: > On Thu, Apr 22, 2021 at 1:21 PM Tom Lane wrote: > > I've also tried to reproduce on 32-bit and 64-bit Intel, without > > success. So if this is real, maybe it's related to being big-endian > > hardware? But it's also quite sensitive to $dunno-what, maybe the > > history of WAL records that have already been replayed. > > Ah, that's interesting. There are a couple of sparc64 failures and a > ppc64 failure in the build farm, but I couldn't immediately spot what > was wrong with them or whether it might be related to this stuff. > > Thanks for the clues. I'll see what unusual systems I can find to try > this on FWIW, I've run 32 and 64 bit x86 through several hundred regression cycles, without hitting an issue. For a lot of them I set checkpoint_timeout to a lower value as I thought that might make it more likely to reproduce an issue. Tom, any chance you could check if your machine repros the issue before these commits? Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > On 2021-04-21 21:21:05 -0400, Tom Lane wrote: >> What I'm doing is running the core regression tests with a single >> standby (on the same machine) and wal_consistency_checking = all. > Do you run them over replication, or sequentially by storing data into > an archive? Just curious, because its so painful to run that scenario in > the replication case due to the tablespace conflicting between > primary/standby, unless one disables the tablespace tests. No, live over replication. I've been skipping the tablespace test. > Have you tried reproducing it on commits before the recent xlogreader > changes? Nope. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 22, 2021 at 1:21 PM Tom Lane wrote: > I've also tried to reproduce on 32-bit and 64-bit Intel, without > success. So if this is real, maybe it's related to being big-endian > hardware? But it's also quite sensitive to $dunno-what, maybe the > history of WAL records that have already been replayed. Ah, that's interesting. There are a couple of sparc64 failures and a ppc64 failure in the build farm, but I couldn't immediately spot what was wrong with them or whether it might be related to this stuff. Thanks for the clues. I'll see what unusual systems I can find to try this on.
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-21 21:21:05 -0400, Tom Lane wrote: > What I'm doing is running the core regression tests with a single > standby (on the same machine) and wal_consistency_checking = all. Do you run them over replication, or sequentially by storing data into an archive? Just curious, because it's so painful to run that scenario in the replication case due to the tablespace conflicting between primary/standby, unless one disables the tablespace tests. > The other PPC machine (with no known history of trouble) is the one > that had the CRC failure I showed earlier. That one does seem to be > actual bad data in the stored WAL, because the problem was also seen > by pg_waldump, and trying to restart the standby got the same failure > again. It seems like that could also indicate an xlogreader bug that is reliably hit? Once it gets confused about record lengths or such I'd expect CRC failures... If it were actually wrong WAL contents I don't think any of the xlogreader / prefetching changes could be responsible... Have you tried reproducing it on commits before the recent xlogreader changes? commit 1d257577e08d3e598011d6850fd1025858de8c8c Author: Thomas Munro Date: 2021-04-08 23:03:43 +1200 Optionally prefetch referenced data in recovery. commit f003d9f8721b3249e4aec8a1946034579d40d42c Author: Thomas Munro Date: 2021-04-08 23:03:34 +1200 Add circular WAL decoding buffer. Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=aovoddf...@mail.gmail.com commit 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b Author: Thomas Munro Date: 2021-04-08 23:03:23 +1200 Remove read_page callback from XLogReader. Trying 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b^ is probably the most interesting bit. > I've not been able to duplicate the consistency-check failures > there. But because that machine is a laptop with a much inferior disk > drive, the speeds are enough different that it's not real surprising > if it doesn't hit the same problem. > > I've also tried to reproduce on 32-bit and 64-bit Intel, without > success. So if this is real, maybe it's related to being big-endian > hardware? But it's also quite sensitive to $dunno-what, maybe the > history of WAL records that have already been replayed. It might just be disk speed influencing how long the tests take, which in turn increases the number of times checkpoints happen during the test, increasing the number of FPIs? Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Stephen Frost writes: > On Wed, Apr 21, 2021 at 19:17 Thomas Munro wrote: >> ... Personally I think the right thing to do now is to revert it >> and re-propose for 15 early in the cycle, supported with some better >> testing infrastructure. > I tend to agree with the idea to revert it, perhaps a +0 on that, but if > others argue it should be fixed in-place, I wouldn’t complain about it. FWIW, I've so far only been able to see problems on two old PPC Macs, one of which has been known to be a bit flaky in the past. So it's possible that what I'm looking at is a hardware glitch. But it's consistent enough that I rather doubt that. What I'm doing is running the core regression tests with a single standby (on the same machine) and wal_consistency_checking = all. Fairly reproducibly (more than one run in ten), what I get on the slightly-flaky machine is consistency check failures like 2021-04-21 17:42:56.324 EDT [42286] PANIC: inconsistent page found, rel 1663/354383/357033, forknum 0, blkno 9, byte offset 2069: replay 0x00 primary 0x03 2021-04-21 17:42:56.324 EDT [42286] CONTEXT: WAL redo at 24/121C97B0 for Heap/INSERT: off 107 flags 0x00; blkref #0: rel 1663/354383/357033, blk 9 FPW 2021-04-21 17:45:11.662 EDT [42284] LOG: startup process (PID 42286) was terminated by signal 6: Abort trap 2021-04-21 11:25:30.091 EDT [38891] PANIC: inconsistent page found, rel 1663/229880/237980, forknum 0, blkno 108, byte offset 3845: replay 0x00 primary 0x99 2021-04-21 11:25:30.091 EDT [38891] CONTEXT: WAL redo at 17/A99897FC for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/229880/237980, blk 108 FPW 2021-04-21 11:26:59.371 EDT [38889] LOG: startup process (PID 38891) was terminated by signal 6: Abort trap 2021-04-20 19:20:16.114 EDT [34405] PANIC: inconsistent page found, rel 1663/189216/197311, forknum 0, blkno 115, byte offset 6149: replay 0x37 primary 0x03 2021-04-20 19:20:16.114 EDT [34405] CONTEXT: WAL redo at 13/3CBFED00 for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/189216/197311, blk 115 FPW 2021-04-20 19:21:54.421 EDT [34403] LOG: startup process (PID 34405) was terminated by signal 6: Abort trap 2021-04-20 17:44:09.356 EDT [24106] FATAL: inconsistent page found, rel 1663/135419/143843, forknum 0, blkno 101, byte offset 6152: replay 0x40 primary 0x00 2021-04-20 17:44:09.356 EDT [24106] CONTEXT: WAL redo at D/5107D8A8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/135419/143843, blk 101 FPW (Note I modified checkXLogConsistency to PANIC on failure, so I could get a core dump to analyze; and it's also printing the first-mismatch location.) I have not analyzed each one of these failures exhaustively, but on the ones I have looked at closely, the replay_image_masked version of the page appears correct while the primary_image_masked version is *not*. Moreover, the primary_image_masked version does not match the full-page image that I see in the on-disk WAL file. It did however seem to match the in-memory WAL record contents that the decoder is operating on. So unless you want to believe the buggy-hardware theory, something's occasionally messing up while loading WAL records from disk. All of the trouble cases involve records that span across WAL pages (unsurprising since they contain FPIs), so maybe there's something not quite right in there. In the cases that I looked at closely, it appeared that there was a block of 32 wrong bytes somewhere within the page image, with the data before and after that being correct. 
I'm not sure if that pattern holds in all cases though. BTW, if I restart the failed standby, it plows through the same data just fine, confirming that the on-disk WAL is not corrupt. The other PPC machine (with no known history of trouble) is the one that had the CRC failure I showed earlier. That one does seem to be actual bad data in the stored WAL, because the problem was also seen by pg_waldump, and trying to restart the standby got the same failure again. I've not been able to duplicate the consistency-check failures there. But because that machine is a laptop with a much inferior disk drive, the speeds are enough different that it's not real surprising if it doesn't hit the same problem. I've also tried to reproduce on 32-bit and 64-bit Intel, without success. So if this is real, maybe it's related to being big-endian hardware? But it's also quite sensitive to $dunno-what, maybe the history of WAL records that have already been replayed. regards, tom lane
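For reference, the kind of instrumentation Tom describes (his actual patch is not posted in the thread) can be sketched as a byte-wise comparison in checkXLogConsistency() that promotes the error to PANIC, matching the log lines above. Variable names follow the PG14-era function, but this is an illustration rather than his diff:

    /* Sketch: report the first differing byte and PANIC so a core dump is left behind. */
    for (int i = 0; i < BLCKSZ; i++)
    {
        if (replay_image_masked[i] != primary_image_masked[i])
            elog(PANIC,
                 "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u, "
                 "byte offset %d: replay 0x%02x primary 0x%02x",
                 rnode.spcNode, rnode.dbNode, rnode.relNode, forknum, blkno,
                 i,
                 (unsigned char) replay_image_masked[i],
                 (unsigned char) primary_image_masked[i]);
    }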
Re: WIP: WAL prefetch (another approach)
Greetings, On Wed, Apr 21, 2021 at 19:17 Thomas Munro wrote: > On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro > wrote: > > That wasn't my plan, but I admit that the timing was non-ideal. In > > any case, I'll dig into these failures and then consider options. > > More soon. > > Yeah, this clearly needs more work. xlogreader.c is difficult to work > with and I think we need to keep trying to improve it, but I made a > bad call here trying to combine this with other refactoring work up > against a deadline and I made some dumb mistakes. I could of course > debug it in-tree, and I know that this has been an anticipated > feature. Personally I think the right thing to do now is to revert it > and re-propose for 15 early in the cycle, supported with some better > testing infrastructure. I tend to agree with the idea to revert it, perhaps a +0 on that, but if others argue it should be fixed in-place, I wouldn’t complain about it. I very much encourage the idea of improving testing in this area and would be happy to try and help do so in the 15 cycle. Thanks, Stephen >
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro wrote: > That wasn't my plan, but I admit that the timing was non-ideal. In > any case, I'll dig into these failures and then consider options. > More soon. Yeah, this clearly needs more work. xlogreader.c is difficult to work with and I think we need to keep trying to improve it, but I made a bad call here trying to combine this with other refactoring work up against a deadline and I made some dumb mistakes. I could of course debug it in-tree, and I know that this has been an anticipated feature. Personally I think the right thing to do now is to revert it and re-propose for 15 early in the cycle, supported with some better testing infrastructure.
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 22, 2021 at 8:07 AM Tomas Vondra wrote: > On 4/21/21 6:30 PM, Tom Lane wrote: > > Thomas Munro writes: > >> Yeah, it would have been nice to include that but it'll have to be for > >> v15 due to lack of time to convince myself that it was correct. I do > >> intend to look into more concurrency of that kind for v15. I have > >> pushed these patches, updated to be disabled by default. > > > > I have a fairly bad feeling about these patches. I've already fixed > > one critical bug (see 9e4114822), but I am still seeing random, hard > > to reproduce failures in WAL replay testing. It looks like sometimes > > the "decoded" version of a WAL record doesn't match what I see in > > the on-disk data, which I'm having no luck tracing down. Ugh. Looking into this now. Also, this week I have been researching a possible problem with eg ALTER TABLE SET TABLESPACE in the higher level patch, which I'll write about soon. > > I am not sure whether the checksum failure itself is real or a variant > > of the seeming bad-reconstruction problem, but what I'm on about right > > at this moment is that the error handling logic for this case seems > > quite broken. Why is a checksum failure only worthy of a LOG message? > > Why is ValidXLogRecord() issuing a log message for itself, rather than > > being tied into the report_invalid_record() mechanism? Why are we > > evidently still trying to decode records afterwards? > > Yeah, that seems suspicious. I may have invited trouble by deciding to rebase on the other proposal late in the cycle. That interfaces around there. > > In general, I'm not too pleased with the apparent attitude in this > > thread that it's okay to push a patch that only mostly works on the > > last day of the dev cycle and plan to stabilize it later. > > Was there such attitude? I don't think people were arguing for pushing a > patch's not working correctly. The discussion was mostly about getting > it committed even and leaving some optimizations for v15. That wasn't my plan, but I admit that the timing was non-ideal. In any case, I'll dig into these failures and then consider options. More soon.
Re: WIP: WAL prefetch (another approach)
On 4/21/21 6:30 PM, Tom Lane wrote: > Thomas Munro writes: >> Yeah, it would have been nice to include that but it'll have to be for >> v15 due to lack of time to convince myself that it was correct. I do >> intend to look into more concurrency of that kind for v15. I have >> pushed these patches, updated to be disabled by default. > > I have a fairly bad feeling about these patches. I've already fixed > one critical bug (see 9e4114822), but I am still seeing random, hard > to reproduce failures in WAL replay testing. It looks like sometimes > the "decoded" version of a WAL record doesn't match what I see in > the on-disk data, which I'm having no luck tracing down. > > Another interesting failure I just came across is > > 2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data > checksum in record at F/438000A4 > TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, > PID: 14606) > 2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was > terminated by signal 6: Abort trap > > with stack trace > > #0 0x90b669f0 in kill () > #1 0x90c01bfc in abort () > #2 0x0057a6a0 in ExceptionalCondition (conditionName= unavailable, due to optimizations>, errorType= due to optimizations>, fileName= optimizations>, lineNumber= optimizations>) at assert.c:69 > #3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 > '\001') at xlogreader.c:845 > #4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, > errormsg=0xbfffba9c) at xlogreader.c:466 > #5 0x000f695c in XLogReadRecord (state= to optimizations>, record=0xbfffba98, errormsg= unavailable, due to optimizations>) at xlogreader.c:352 > #6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 > '\0') at xlog.c:4398 > #7 0x000ea320 in StartupXLOG () at xlog.c:7567 > #8 0x00362218 in StartupProcessMain () at startup.c:244 > #9 0x000fc170 in AuxiliaryProcessMain (argc= due to optimizations>, argv= optimizations>) at bootstrap.c:447 > #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439 > #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406 > #12 0x0029737c in main (argc= optimizations>, argv=) > at main.c:209 > > > I am not sure whether the checksum failure itself is real or a variant > of the seeming bad-reconstruction problem, but what I'm on about right > at this moment is that the error handling logic for this case seems > quite broken. Why is a checksum failure only worthy of a LOG message? > Why is ValidXLogRecord() issuing a log message for itself, rather than > being tied into the report_invalid_record() mechanism? Why are we > evidently still trying to decode records afterwards? > Yeah, that seems suspicious. > In general, I'm not too pleased with the apparent attitude in this > thread that it's okay to push a patch that only mostly works on the > last day of the dev cycle and plan to stabilize it later. > Was there such attitude? I don't think people were arguing for pushing a patch's not working correctly. The discussion was mostly about getting it committed even and leaving some optimizations for v15. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > Yeah, it would have been nice to include that but it'll have to be for > v15 due to lack of time to convince myself that it was correct. I do > intend to look into more concurrency of that kind for v15. I have > pushed these patches, updated to be disabled by default. I have a fairly bad feeling about these patches. I've already fixed one critical bug (see 9e4114822), but I am still seeing random, hard to reproduce failures in WAL replay testing. It looks like sometimes the "decoded" version of a WAL record doesn't match what I see in the on-disk data, which I'm having no luck tracing down. Another interesting failure I just came across is 2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data checksum in record at F/438000A4 TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, PID: 14606) 2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was terminated by signal 6: Abort trap with stack trace #0 0x90b669f0 in kill () #1 0x90c01bfc in abort () #2 0x0057a6a0 in ExceptionalCondition (conditionName=, errorType=, fileName=, lineNumber=) at assert.c:69 #3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845 #4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466 #5 0x000f695c in XLogReadRecord (state=, record=0xbfffba98, errormsg=) at xlogreader.c:352 #6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398 #7 0x000ea320 in StartupXLOG () at xlog.c:7567 #8 0x00362218 in StartupProcessMain () at startup.c:244 #9 0x000fc170 in AuxiliaryProcessMain (argc=, argv=) at bootstrap.c:447 #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439 #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406 #12 0x0029737c in main (argc=, argv=) at main.c:209 I am not sure whether the checksum failure itself is real or a variant of the seeming bad-reconstruction problem, but what I'm on about right at this moment is that the error handling logic for this case seems quite broken. Why is a checksum failure only worthy of a LOG message? Why is ValidXLogRecord() issuing a log message for itself, rather than being tied into the report_invalid_record() mechanism? Why are we evidently still trying to decode records afterwards? In general, I'm not too pleased with the apparent attitude in this thread that it's okay to push a patch that only mostly works on the last day of the dev cycle and plan to stabilize it later. regards, tom lane
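For what it's worth, the pre-existing xlogreader convention Tom is pointing at looks roughly like the following inside ValidXLogRecord(): the CRC failure is routed through report_invalid_record() and returned as a hard failure, rather than being logged locally while decoding continues. This is a sketch; the exact code differs:

    FIN_CRC32C(crc);
    if (!EQ_CRC32C(record->xl_crc, crc))
    {
        report_invalid_record(state,
                              "incorrect resource manager data checksum in record at %X/%X",
                              LSN_FORMAT_ARGS(recptr));
        return false;   /* caller must treat the record as invalid, not keep decoding */
    }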
Re: WIP: WAL prefetch (another approach)
On Sat, Apr 10, 2021 at 2:16 AM Thomas Munro wrote: > In commit 1d257577e08d3e598011d6850fd1025858de8c8c, there is a change in file format for stats, won't it require bumping PGSTAT_FILE_FORMAT_ID? Actually, I came across this while working on my today's commit f5fc2f5b23 where I forgot to bump PGSTAT_FILE_FORMAT_ID. So, I thought maybe we can bump it just once if required? -- With Regards, Amit Kapila.
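For readers unfamiliar with the convention Amit refers to: any change to the on-disk layout of the statistics file is supposed to be accompanied by a bump of this constant, so that older stats files are discarded rather than misread. The kind of one-line change involved looks like this (the hex value below is a placeholder, not the real one):

    /* src/include/pgstat.h, illustrative value only */
    #define PGSTAT_FILE_FORMAT_ID   0x01A5BC90  /* bump whenever the stats file layout changes */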
RE: WIP: WAL prefetch (another approach)
Hi, Thank you for developing a great feature. I tested this feature and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view. https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION It is also not displayed in the list of "28.2. The Statistics Collector". https://www.postgresql.org/docs/devel/monitoring.html The attached patch modifies the pg_stat_prefetch_recovery view to appear as a separate view. Regards, Noriyoshi Shinoda -Original Message- From: Thomas Munro [mailto:thomas.mu...@gmail.com] Sent: Saturday, April 10, 2021 5:46 AM To: Justin Pryzby Cc: Tomas Vondra ; Stephen Frost ; Andres Freund ; Jakub Wartak ; Alvaro Herrera ; Tomas Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; David Steele ; pgsql-hackers Subject: Re: WIP: WAL prefetch (another approach) On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby wrote: > Did you see this? > https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM > > I meant to mail you so you could include it in the same commit, but > forgot until now. Done, thanks.
Re: WIP: WAL prefetch (another approach)
On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby wrote: > Did you see this? > https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM > > I meant to mail you so you could include it in the same commit, but forgot > until now. Done, thanks.
Re: WIP: WAL prefetch (another approach)
On Sat, Apr 10, 2021 at 08:27:42AM +1200, Thomas Munro wrote: > On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby wrote: > > Here's some little language fixes. > > Thanks! Done. I rewrote the gibberish comment that made you say > "XXX: what?". Pushed. > > > BTW, before beginning "recovery", PG syncs all the data dirs. > > This can be slow, and it seems like the slowness is frequently due to file > > metadata. For example, that's an obvious consequence of an OS crash, after > > which the page cache is empty. I've made a habit of running find /zfs -ls > > |wc > > to pre-warm it, which can take a little bit, but then the recovery process > > starts moments later. I don't have any timing measurements, but I expect > > that > > starting to stat() all data files as soon as possible would be a win. > > Did you see commit 61752afb, "Provide > recovery_init_sync_method=syncfs"? Actually I believe it's safe to > skip that phase completely and do a tiny bit more work during > recovery, which I'd like to work on for v15[1]. Yes, I have it in my list for v14 deployment. Thanks for that. Did you see this? https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM I meant to mail you so you could include it in the same commit, but forgot until now. -- Justin
Re: WIP: WAL prefetch (another approach)
On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby wrote: > Here's some little language fixes. Thanks! Done. I rewrote the gibberish comment that made you say "XXX: what?". Pushed. > BTW, before beginning "recovery", PG syncs all the data dirs. > This can be slow, and it seems like the slowness is frequently due to file > metadata. For example, that's an obvious consequence of an OS crash, after > which the page cache is empty. I've made a habit of running find /zfs -ls |wc > to pre-warm it, which can take a little bit, but then the recovery process > starts moments later. I don't have any timing measurements, but I expect that > starting to stat() all data files as soon as possible would be a win. Did you see commit 61752afb, "Provide recovery_init_sync_method=syncfs"? Actually I believe it's safe to skip that phase completely and do a tiny bit more work during recovery, which I'd like to work on for v15[1]. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B8Wm8TSfMWPteMEHfh194RytVTBNoOkggTQT1p5NTY7Q%40mail.gmail.com
Re: WIP: WAL prefetch (another approach)
Here's some little language fixes. BTW, before beginning "recovery", PG syncs all the data dirs. This can be slow, and it seems like the slowness is frequently due to file metadata. For example, that's an obvious consequence of an OS crash, after which the page cache is empty. I've made a habit of running find /zfs -ls |wc to pre-warm it, which can take a little bit, but then the recovery process starts moments later. I don't have any timing measurements, but I expect that starting to stat() all data files as soon as possible would be a win. commit cc9707de333fe8242607cde9f777beadc68dbf04 Author: Justin Pryzby Date: Thu Apr 8 10:43:14 2021 -0500 WIP: doc review: Optionally prefetch referenced data in recovery. 1d257577e08d3e598011d6850fd1025858de8c8c diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bc4a8b2279..139dee7aa2 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3621,7 +3621,7 @@ include_dir 'conf.d' pool after that. However, on file systems with a block size larger than PostgreSQL's, prefetching can avoid a -costly read-before-write when a blocks are later written. +costly read-before-write when blocks are later written. The default is off. diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml index 24cf567ee2..36e00c92c2 100644 --- a/doc/src/sgml/wal.sgml +++ b/doc/src/sgml/wal.sgml @@ -816,9 +816,7 @@ prefetching mechanism is most likely to be effective on systems with full_page_writes set to off (where that is safe), and where the working - set is larger than RAM. By default, prefetching in recovery is enabled - on operating systems that have posix_fadvise - support. + set is larger than RAM. By default, prefetching in recovery is disabled. diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c index 28764326bc..363c079964 100644 --- a/src/backend/access/transam/xlogprefetch.c +++ b/src/backend/access/transam/xlogprefetch.c @@ -31,7 +31,7 @@ * stall; this is counted with "skip_fpw". * * The only way we currently have to know that an I/O initiated with - * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(), + * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(), XXX: what ?? * and perform a synchronous read. Therefore, we track the number of * potentially in-flight I/Os by using a circular buffer of LSNs. When it's * full, we have to wait for recovery to replay records so that the queue @@ -660,7 +660,7 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher) /* * I/O has possibly been initiated (though we don't know if it was * already cached by the kernel, so we just have to assume that it -* has due to lack of better information). Record this as an I/O +* was due to lack of better information). Record this as an I/O * in progress until eventually we replay this LSN. 
*/ XLogPrefetchIncrement(&SharedStats->prefetch); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 090abdad8b..8c72ba1f1a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2774,7 +2774,7 @@ static struct config_int ConfigureNamesInt[] = { {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY, gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."), - gettext_noop("This controls the maximum distance we can read ahead n the WAL to prefetch referenced blocks."), + gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."), GUC_UNIT_BYTE }, &wal_decode_buffer_size,
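As an aside, the pre-warming trick Justin describes above (walking the data directory so the first pass over file metadata is cheap) can be expressed as a tiny standalone program. This is only an illustration of the idea, not anything proposed for the tree:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <ftw.h>

    /* ftw() stat()s every entry it visits, which is exactly the metadata
     * warm-up we want; the callback itself has nothing to do. */
    static int
    prewarm_one(const char *path, const struct stat *sb, int typeflag)
    {
        (void) path;
        (void) sb;
        (void) typeflag;
        return 0;
    }

    int
    main(int argc, char **argv)
    {
        if (argc != 2)
        {
            fprintf(stderr, "usage: %s <data-directory>\n", argv[0]);
            return 1;
        }
        return ftw(argv[1], prewarm_one, 64) != 0;
    }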
Re: WIP: WAL prefetch (another approach)
On 4/8/21 1:46 PM, Thomas Munro wrote: > On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra > wrote: >> On 4/7/21 1:24 PM, Thomas Munro wrote: >>> I made one other simplifying change: previously, the prefetch module >>> would read the WAL up to the "written" LSN (so, allowing itself to >>> read data that had been written but not yet flushed to disk by the >>> walreceiver), though it still waited until a record's LSN was >>> "flushed" before replaying. That allowed prefetching to happen >>> concurrently with the WAL flush, which was nice, but it felt a little >>> too "special". I decided to remove that part for now, and I plan to >>> look into making standbys work more like primary servers, using WAL >>> buffers, the WAL writer and optionally the standard log-before-data >>> rule. >> >> Not sure, but the removal seems unnecessary. I'm worried that this will >> significantly reduce the amount of data that we'll be able to prefetch. >> How likely it is that we have data that is written but not flushed? >> Let's assume the replica is lagging and network bandwidth is not the >> bottleneck - how likely is this "has to be flushed" a limit for the >> prefetching? > > Yeah, it would have been nice to include that but it'll have to be for > v15 due to lack of time to convince myself that it was correct. I do > intend to look into more concurrency of that kind for v15. I have > pushed these patches, updated to be disabled by default. I will look > into how I can run a BF animal that has it enabled during the recovery > tests for coverage. Thanks very much to everyone on this thread for > all the discussion and testing so far. > OK, understood. I'll rerun the benchmarks on this version, and if there's a significant negative impact we can look into that during the stabilization phase. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra wrote: > On 4/7/21 1:24 PM, Thomas Munro wrote: > > I made one other simplifying change: previously, the prefetch module > > would read the WAL up to the "written" LSN (so, allowing itself to > > read data that had been written but not yet flushed to disk by the > > walreceiver), though it still waited until a record's LSN was > > "flushed" before replaying. That allowed prefetching to happen > > concurrently with the WAL flush, which was nice, but it felt a little > > too "special". I decided to remove that part for now, and I plan to > > look into making standbys work more like primary servers, using WAL > > buffers, the WAL writer and optionally the standard log-before-data > > rule. > > Not sure, but the removal seems unnecessary. I'm worried that this will > significantly reduce the amount of data that we'll be able to prefetch. > How likely it is that we have data that is written but not flushed? > Let's assume the replica is lagging and network bandwidth is not the > bottleneck - how likely is this "has to be flushed" a limit for the > prefetching? Yeah, it would have been nice to include that but it'll have to be for v15 due to lack of time to convince myself that it was correct. I do intend to look into more concurrency of that kind for v15. I have pushed these patches, updated to be disabled by default. I will look into how I can run a BF animal that has it enabled during the recovery tests for coverage. Thanks very much to everyone on this thread for all the discussion and testing so far.
Re: WIP: WAL prefetch (another approach)
On 4/7/21 1:24 PM, Thomas Munro wrote: > Here's rebase, on top of Horiguchi-san's v19 patch set. My patches > start at 0007. Previously, there was a "nowait" flag that was passed > into all the callbacks so that XLogReader could wait for new WAL in > some cases but not others. This new version uses the proposed > XLREAD_NEED_DATA protocol, and the caller deals with waiting for data > to arrive when appropriate. This seems tidier to me. > OK, seems reasonable. > I made one other simplifying change: previously, the prefetch module > would read the WAL up to the "written" LSN (so, allowing itself to > read data that had been written but not yet flushed to disk by the > walreceiver), though it still waited until a record's LSN was > "flushed" before replaying. That allowed prefetching to happen > concurrently with the WAL flush, which was nice, but it felt a little > too "special". I decided to remove that part for now, and I plan to > look into making standbys work more like primary servers, using WAL > buffers, the WAL writer and optionally the standard log-before-data > rule. > Not sure, but the removal seems unnecessary. I'm worried that this will significantly reduce the amount of data that we'll be able to prefetch. How likely it is that we have data that is written but not flushed? Let's assume the replica is lagging and network bandwidth is not the bottleneck - how likely is this "has to be flushed" a limit for the prefetching? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
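To make the written-versus-flushed distinction concrete: the walreceiver tracks both positions, and the question above is which one the decoder is allowed to read up to. A minimal sketch, assuming the PG14-era walreceiver functions (treat the exact names and signatures as an assumption):

    XLogRecPtr  read_upto;

    /* Committed behaviour: only decode WAL the walreceiver has already fsync'd. */
    read_upto = GetWalRcvFlushRecPtr(NULL, NULL);

    /* The removed behaviour was closer to: decode WAL that has merely been
     * written, while still waiting for the flush before replaying it. */
    /* read_upto = GetWalRcvWriteRecPtr(); */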
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 19, 2021 at 2:29 PM Tomas Vondra wrote: > On 3/18/21 1:54 AM, Thomas Munro wrote: > > I'm now looking at Horiguchi-san and Heikki's patch[2] to remove > > XLogReader's callbacks, to try to understand how these two patch sets > > are related. I don't really like the way those callbacks work, and > > I'm afraid had to make them more complicated. But I don't yet know > > very much about that other patch set. More soon. > > OK. Do you think we should get both of those patches in, or do we need > to commit them in a particular order? Or what is your concern? I would like to commit the callback-removal patch first, and then the WAL decoder and prefetcher patches become simpler and cleaner on top of that. I will post the rebase and explanation shortly.
Re: WIP: WAL prefetch (another approach)
On 3/18/21 1:54 AM, Thomas Munro wrote: > On Thu, Mar 18, 2021 at 12:00 PM Tomas Vondra > wrote: >> On 3/17/21 10:43 PM, Stephen Frost wrote: >>> Guess I'm just not a fan of pushing out a change that will impact >>> everyone by default, in a possibly negative way (or positive, though >>> that doesn't seem terribly likely, but who knows), without actually >>> measuring what that impact will look like in those more common cases. >>> Showing that it's a great win when you're on ZFS or running with FPWs >>> disabled is good and the expected best case, but we should be >>> considering the worst case too when it comes to performance >>> improvements. >>> >> >> Well, maybe it'll behave differently on systems with ZFS. I don't know, >> and I have no such machine to test that at the moment. My argument >> however remains the same - if if happens to be a problem, just don't >> enable (or disable) the prefetching, and you get the current behavior. > > I see the road map for this feature being to get it working on every > OS via the AIO patchset, in later work, hopefully not very far in the > future (in the most portable mode, you get I/O worker processes doing > pread() or preadv() calls on behalf of recovery). So I'll be glad to > get this infrastructure in, even though it's maybe only useful for > some people in the first release. > +1 to that >> FWIW I'm not sure there was a discussion or argument about what should >> be the default setting (enabled or disabled). I'm fine with not enabling >> this by default, so that people have to enable it explicitly. >> >> In a way that'd be consistent with effective_io_concurrency being 1 by >> default, which almost disables regular prefetching. > > Yeah, I'm not sure but I'd be fine with disabling it by default in the > initial release. The current patch set has it enabled, but that's > mostly for testing, it's not an opinion on how it should ship. > +1 to that too. Better to have it disabled by default than not at all. > I've attached a rebased patch set with a couple of small changes: > > 1. I abandoned the patch that proposed > pg_atomic_unlocked_add_fetch_u{32,64}() and went for a simple function > local to xlogprefetch.c that just does pg_atomic_write_u64(counter, > pg_atomic_read_u64(counter) + 1), in response to complaints from > Andres[1]. > > 2. I fixed a bug in ReadRecentBuffer(), and moved it into its own > patch for separate review. > > I'm now looking at Horiguchi-san and Heikki's patch[2] to remove > XLogReader's callbacks, to try to understand how these two patch sets > are related. I don't really like the way those callbacks work, and > I'm afraid had to make them more complicated. But I don't yet know > very much about that other patch set. More soon. > OK. Do you think we should get both of those patches in, or do we need to commit them in a particular order? Or what is your concern? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Hi, On 3/17/21 10:43 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: >> Right, I was just going to point out the FPIs are not necessary - what >> matters is the presence of long streaks of WAL records touching the same >> set of blocks. But people with workloads where this is common likely >> don't need the WAL prefetching at all - the replica can keep up just >> fine, because it doesn't need to do much I/O anyway (and if it can't >> then prefetching won't help much anyway). So just don't enable the >> prefetching, and there'll be no overhead. > > Isn't this exactly the common case though..? Checkpoints happening > every 5 minutes, the replay of the FPI happens first and then the record > is updated and everything's in SB for the later changes? Well, as I said before, the FPIs are not very significant - you'll have mostly the same issue with any repeated changes to the same block. It does not matter much if you do FPI for block 1 WAL record for block 2 WAL record for block 1 WAL record for block 2 WAL record for block 1 or just WAL record for block 1 WAL record for block 2 WAL record for block 1 WAL record for block 2 WAL record for block 1 In both cases some of the prefetches are probably unnecessary. But the frequency of checkpoints does not really matter, the important bit is repeated changes to the same block(s). If you have active set much larger than RAM, this is quite unlikely. And we know from the pgbench tests that prefetching has a huge positive effect in this case. On smaller active sets, with frequent updates to the same block, we may issue unnecessary prefetches - that's true. But (a) you have not shown any numbers suggesting this is actually an issue, and (b) those cases don't really need prefetching because all the data is already either in shared buffers or in page cache. So if it happens to be an issue, the user can simply disable it. So how exactly would a problematic workload look like? > You mentioned elsewhere that this would improve 80% of cases but that > doesn't seem to be backed up by anything and certainly doesn't seem > likely to be the case if we're talking about across all PG > deployments. Obviously, the 80% was just a figure of speech, illustrating my belief that the proposed patch is beneficial for most users who currently have issues with replication lag. That is based on my experience with support customers who have such issues - it's almost invariably an OLTP workload with large active set, and we know (from the benchmarks) that in these cases it helps. Users who don't have issues with replication lag can disable (or not enable) the prefetching, and won't get any negative effects. Perhaps there are users with weird workloads that have replication lag issues but this patch won't help them - bummer, we can't solve everything in one go. Also, no one actually demonstrated such workload in this thread so far. But as you're suggesting we don't have data to support the claim that this actually helps many users (with no risk to others), I'd point out you have not actually provided any numbers showing that it actually is an issue in practice. > I also disagree that asking the kernel to go do random I/O for us, > even as a prefetch, is entirely free simply because we won't > actually need those pages. At the least, it potentially pushes out > pages that we might need shortly from the filesystem cache, no? Where exactly did I say it's free? 
I said that workloads where this happens a lot most likely don't need the prefetching at all, so it can be simply disabled, eliminating all negative effects. Moreover, looking at a limited number of recently prefetched blocks won't eliminate this problem anyway - imagine a random OLTP on large data set that however fits into RAM. After a while no read I/O needs to be done, but you'd need pretty much infinite list of prefetched blocks to eliminate that, and with smaller lists you'll still do 99% of the prefetches. Just disabling prefetching on such instances seems quite reasonable. >> If it was up to me, I'd just get the patch committed as is. Delaying the >> feature because of concerns that it might have some negative effect in >> some cases, when that can be simply mitigated by disabling the feature, >> is not really beneficial for our users. > > I don't know that we actually know how many cases it might have a > negative effect on or what the actual amount of such negative case there > might be- that's really why we should probably try to actually benchmark > it and get real numbers behind it, particularly when the chances of > running into such a negative effect with the default configuration (that > is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more > likely to occur in the field than the cases where FPWs are disabled and > someone's running on ZFS. > > Perhaps more to the point, it'd be nice to see how this
Re: WIP: WAL prefetch (another approach)
Greetings, * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: > Right, I was just going to point out the FPIs are not necessary - what > matters is the presence of long streaks of WAL records touching the same > set of blocks. But people with workloads where this is common likely > don't need the WAL prefetching at all - the replica can keep up just > fine, because it doesn't need to do much I/O anyway (and if it can't > then prefetching won't help much anyway). So just don't enable the > prefetching, and there'll be no overhead. Isn't this exactly the common case though..? Checkpoints happening every 5 minutes, the replay of the FPI happens first and then the record is updated and everything's in SB for the later changes? You mentioned elsewhere that this would improve 80% of cases but that doesn't seem to be backed up by anything and certainly doesn't seem likely to be the case if we're talking about across all PG deployments. I also disagree that asking the kernel to go do random I/O for us, even as a prefetch, is entirely free simply because we won't actually need those pages. At the least, it potentially pushes out pages that we might need shortly from the filesystem cache, no? > If it was up to me, I'd just get the patch committed as is. Delaying the > feature because of concerns that it might have some negative effect in > some cases, when that can be simply mitigated by disabling the feature, > is not really beneficial for our users. I don't know that we actually know how many cases it might have a negative effect on or what the actual amount of such negative case there might be- that's really why we should probably try to actually benchmark it and get real numbers behind it, particularly when the chances of running into such a negative effect with the default configuration (that is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more likely to occur in the field than the cases where FPWs are disabled and someone's running on ZFS. Perhaps more to the point, it'd be nice to see how this change actually improves the caes where PG is running with more-or-less the defaults on the more commonly deployed filesystems. If it doesn't then maybe it shouldn't be the default..? Surely the folks running on ZFS and running with FPWs disabled would be able to manage to enable it if they wished to and we could avoid entirely the question of if this has a negative impact on the more common cases. Guess I'm just not a fan of pushing out a change that will impact everyone by default, in a possibly negative way (or positive, though that doesn't seem terribly likely, but who knows), without actually measuring what that impact will look like in those more common cases. Showing that it's a great win when you're on ZFS or running with FPWs disabled is good and the expected best case, but we should be considering the worst case too when it comes to performance improvements. Anyhow, ultimately I don't know that there's much more to discuss on this thread with regard to this particular topic, at least. As I said before, if everyone else is on board and not worried about it then so be it; I feel that at least the concern that I raised has been heard. Thanks, Stephen signature.asc Description: PGP signature
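The mitigation floated for these repeated-prefetch cases, Tomas's small lossy table of recently seen blocks quoted in the older messages below, could look something like this. Every name and sizing choice here is invented for illustration, and it assumes PostgreSQL backend types (RelFileNode, BlockNumber, XLogRecPtr):

    /* Lossy table of recently prefetched / recently FPI'd blocks. */
    #define RECENT_BLOCK_SLOTS 64   /* stand-in for 2 * maintenance_io_concurrency */

    typedef struct RecentBlock
    {
        RelFileNode rnode;
        BlockNumber blkno;
        XLogRecPtr  lsn;        /* LSN at which we last prefetched or saw an FPI */
    } RecentBlock;

    static RecentBlock recent_blocks[RECENT_BLOCK_SLOTS];

    static inline uint32
    recent_block_slot(RelFileNode rnode, BlockNumber blkno)
    {
        return (rnode.relNode * 0x9E3779B1u + blkno) % RECENT_BLOCK_SLOTS;
    }

    /* Remember a block when we prefetch it or see an FPI for it. */
    static void
    recent_block_remember(RelFileNode rnode, BlockNumber blkno, XLogRecPtr lsn)
    {
        RecentBlock *slot = &recent_blocks[recent_block_slot(rnode, blkno)];

        slot->rnode = rnode;
        slot->blkno = blkno;
        slot->lsn = lsn;
    }

    /* Skip the prefetch if we saw this block recently enough. */
    static bool
    recent_block_seen(RelFileNode rnode, BlockNumber blkno,
                      XLogRecPtr lsn, uint64 max_distance)
    {
        RecentBlock *slot = &recent_blocks[recent_block_slot(rnode, blkno)];

        return RelFileNodeEquals(slot->rnode, rnode) &&
            slot->blkno == blkno &&
            lsn - slot->lsn <= max_distance;
    }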
Re: WIP: WAL prefetch (another approach)
On 2/15/21 12:18 AM, Stephen Frost wrote: > Greetings, > > ... > I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately. >>> >>> This seems like a great optimization, albeit a fair bit of code, for a >>> relatively uncommon use-case, specifically where full page writes are >>> disabled or very large checkpoints. As that's the case though, I would >>> think it's reasonable to ask that it go out of its way to avoid slowing >>> down the more common configurations, particularly since it's proposed to >>> have it on by default (which I agree with, provided it ends up improving >>> the common cases, which I think the suggestions above would certainly >>> make it more likely to do). >> >> I'm OK to do some benchmarking, but it's not quite clear to me why does it >> matter if the checkpoints are smaller than shared buffers? IMO what matters >> is how "localized" the updates are, i.e. how likely it is to hit the same >> page repeatedly (in a short amount of time). Regular pgbench is not very >> suitable for that, but some non-uniform distribution should do the trick, I >> think. > > I suppose strictly speaking it'd be > Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that > it's more about the wal_decode_buffer_size than the checkpoint's size. > Apologies for the confusion. As suggested above, one way to benchmark > this to really see if there's any issue would be to increase > wal_decode_buffer_size to some pretty big size and then compare the > performance vs. unpatched. I'd think that could even be done with > pgbench, so you're not having to arrange for the same pages to get > updated over and over. > What exactly would be the point of such benchmark? I don't think the patch does prefetching based on wal_decode_buffer_size, that just says how far ahead we decode - the prefetch distance I is defined by maintenance_io_concurrency. But it's not clear to me what exactly would the result say about the necessity of the optimization at hand (skipping prefetches for blocks with recent FPI). If the the maintenance_io_concurrency is very high, the probability that a block is evicted prematurely grows, making the prefetch useless in general. How does this say anything about the problem at hand? Sure, we'll do unnecessary I/O, causing issues, but that's a bit like complaining the engine gets very hot when driving on a highway in reverse. AFAICS to measure the worst case, you'd need a workload with a lot of FPIs, and very little actual I/O. That means, data set that fits into memory (either shared buffers or RAM), and short checkpoints. But that's exactly the case where you don't need prefetching ... >>> Perhaps this already improves the common cases and is worth the extra >>> code on that basis, but I don't recall seeing much in the way of >>> benchmarking in this thread for that case- that is, where FPIs are >>> enabled and checkpoints are smaller than shared buffers. Jakub's >>> testing was done with FPWs disabled and Tomas's testing used checkpoints >>> which were much larger than the size of shared buffers on the system >>> doing the replay. While it's certainly good that this patch improves >>> those cases, we should also be looking out for the worst case and make >>> sure that the patch doesn't degrade performance in that case. >> >> I'm with Andres on this. 
It's fine to leave some possible optimizations on >> the table for the future. And even if some workloads are affected >> negatively, it's still possible to disable the prefetching. > > While I'm generally in favor of this argument, that a feature is > particularly important and that it's worth slowing down the common cases > to enable it, I dislike that it's applied inconsistently. I'd certainly If you have a workload where this happens to cause issues, you can just disable that. IMHO that's a perfectly reasonable engineering approach, where we get something that significantly improves 80% of the cases, allow disabling it for cases where it might cause issues, and then improve it in the next version. > feel better about it if we had actual performance numbers to consider. > I don't doubt the possibility that the extra prefetch's just don't > amount to enough to matter but I have a hard time seeing them as not > having some cost and without actually measuring it, it's hard to say > what that cost is. > > Without looking farther back than the last record, we could end up > repeatedly asking for the same blocks to be prefetched too- > > FPI for block 1 > FPI for block 2 > WAL record for block 1 > WAL record for block 2 > WAL record for block 1 > WAL record for block 2 > WAL record for block 1 > WAL record for block 2 > > ... etc. > > Entirely possible my math is off, but seems like the worst case > situation right now might end up with some 4500 u
Re: WIP: WAL prefetch (another approach)
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 2/13/21 10:39 PM, Stephen Frost wrote:
> >* Andres Freund (and...@anarazel.de) wrote:
> >>On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> >>>Yeah, that's a good point. I think it'd make sense to keep track of recent FPIs and skip prefetching such blocks. But how exactly should we implement that, how many blocks do we need to track? If you get an FPI, how long should we skip prefetching of that block?
> >>>
> >>>I don't think the history needs to be very long, for two reasons. Firstly, the usual pattern is that we have FPI + several changes for that block shortly after it. Secondly, maintenance_io_concurrency limits this naturally - after crossing that, redo should place the FPI into shared buffers, allowing us to skip the prefetch.
> >>>
> >>>So I think using maintenance_io_concurrency is sufficient. We might track more buffers to allow skipping prefetches of blocks that were evicted from shared buffers, but that seems like overkill.
> >>>
> >>>However, maintenance_io_concurrency can be quite high, so just a simple queue is not very suitable - searching it linearly for each block would be too expensive. But I think we can use a simple hash table, tracking (relfilenode, block, LSN), over-sized to minimize collisions.
> >>>
> >>>Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, and whenever we prefetch a block or find an FPI, we simply add the block to the array as determined by hash(relfilenode, block)
> >>>
> >>>    hashtable[hash(...)] = {relfilenode, block, LSN}
> >>>
> >>>and then when deciding whether to prefetch a block, we look at that one position. If the (relfilenode, block) match, we check the LSN and skip the prefetch if it's sufficiently recent. Otherwise we prefetch.
> >>
> >>I'm a bit doubtful this is really needed at this point. Yes, the prefetching will do a buffer table lookup - but it's a lookup that already happens today. And the patch already avoids doing a second lookup after prefetching (by optimistically caching the last Buffer id, and re-checking).
> >
> >I agree that when a page is looked up, and found, in the buffer table, the subsequent caching of the buffer id in the WAL records does a good job of avoiding having to re-do that lookup. However, that isn't the case which was being discussed here or what Tomas's suggestion was intended to address.
> >
> >What I pointed out up-thread and what's being discussed here is what happens when the WAL contains a few FPIs and a few regular WAL records which are mixed up and not in ideal order. When that happens, with this patch, the FPIs will be ignored, the regular WAL records will reference blocks which aren't found in shared buffers (yet), and then we'll both issue prefetches for those and end up having spent effort doing a buffer lookup that we'll later re-do.
>
> The question is how common this pattern actually is - I don't know. As noted, the non-FPI would have to be fairly close to the FPI, i.e. within the wal_decode_buffer_size, to actually cause measurable harm.

Yeah, so it'll depend on how big wal_decode_buffer_size is. Increasing that would certainly help to show if there ends up being a degradation with this patch due to the extra prefetching being done.

> >To address the unnecessary syscalls we really just need to keep track of any FPIs that we've seen between the point where the prefetching is happening and the point where the replay is being done - once replay has replayed an FPI, our buffer lookup will succeed and we'll cache the buffer that the FPI is at. In other words, only wal_decode_buffer_size amount of WAL needs to be considered.
>
> Yeah, that's essentially what I proposed.

Glad I captured it correctly.

> >We could further leverage this tracking of FPIs to skip the prefetch syscalls, by caching with the FPI record which later records address the blocks that have FPIs earlier in the queue; then, when replay hits the FPI and loads it into shared_buffers, it could update the other WAL records in the queue with the buffer id of the page, allowing us to very likely avoid having to do another lookup later on.
>
> This seems like over-engineering, at least for v1.

Perhaps, though it didn't seem like it'd be very hard to do with the already proposed changes to stash the buffer id in the WAL records.

> >>I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately.
> >
> >This seems like a great optimization, albeit a fair bit of code, for a relatively uncommon use-case, specifically where full page writes are disabled or checkpoints are very large. As that's the cas
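As a rough illustration of the "stash the buffer id in the WAL records" idea floated above, here is a hedged C sketch. QueuedRef, propagate_fpi_buffer and the toy in-memory queue are hypothetical stand-ins, not structures from the patch, and a real implementation would still need to re-check the hinted buffer before using it, since the page could be evicted between replaying the FPI and replaying the later record (the "re-checking" Andres mentions).

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical shapes, not the patch's real structs: a decoded queue entry
 * remembers which block it touches and, optionally, a buffer id that is
 * believed to already hold that block (a "recent buffer" hint).
 */
typedef struct
{
    uint32_t relnode;        /* stand-in for RelFileNode */
    uint32_t block;
    int      recent_buffer;  /* -1 = unknown, otherwise a hint to re-check */
    bool     has_fpi;
} QueuedRef;

/*
 * The suggestion above: once replay restores an FPI into shared buffers,
 * sweep the not-yet-replayed part of the queue and attach the buffer id to
 * later references to the same block, so those can likely skip the buffer
 * lookup (after re-validating the hint).
 */
static void
propagate_fpi_buffer(QueuedRef *queue, int nqueued, int fpi_index, int buffer_id)
{
    const QueuedRef *fpi = &queue[fpi_index];

    for (int i = fpi_index + 1; i < nqueued; i++)
    {
        if (queue[i].relnode == fpi->relnode && queue[i].block == fpi->block)
            queue[i].recent_buffer = buffer_id;
    }
}

int main(void)
{
    QueuedRef queue[] = {
        {1000, 42, -1, true},    /* FPI for block 42 */
        {1000, 7,  -1, false},
        {1000, 42, -1, false},   /* later records touching block 42 */
        {1000, 42, -1, false},
    };
    int n = sizeof(queue) / sizeof(queue[0]);

    /* Pretend replay just restored the FPI at index 0 into buffer 17. */
    propagate_fpi_buffer(queue, n, 0, 17);

    for (int i = 0; i < n; i++)
        printf("entry %d: block %u, recent_buffer %d\n",
               i, queue[i].block, queue[i].recent_buffer);
    return 0;
}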
Re: WIP: WAL prefetch (another approach)
On 2/13/21 10:39 PM, Stephen Frost wrote:
> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
>> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
>>> Yeah, that's a good point. I think it'd make sense to keep track of recent FPIs and skip prefetching such blocks. But how exactly should we implement that, how many blocks do we need to track? If you get an FPI, how long should we skip prefetching of that block?
>>>
>>> I don't think the history needs to be very long, for two reasons. Firstly, the usual pattern is that we have FPI + several changes for that block shortly after it. Secondly, maintenance_io_concurrency limits this naturally - after crossing that, redo should place the FPI into shared buffers, allowing us to skip the prefetch.
>>>
>>> So I think using maintenance_io_concurrency is sufficient. We might track more buffers to allow skipping prefetches of blocks that were evicted from shared buffers, but that seems like overkill.
>>>
>>> However, maintenance_io_concurrency can be quite high, so just a simple queue is not very suitable - searching it linearly for each block would be too expensive. But I think we can use a simple hash table, tracking (relfilenode, block, LSN), over-sized to minimize collisions.
>>>
>>> Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, and whenever we prefetch a block or find an FPI, we simply add the block to the array as determined by hash(relfilenode, block)
>>>
>>>     hashtable[hash(...)] = {relfilenode, block, LSN}
>>>
>>> and then when deciding whether to prefetch a block, we look at that one position. If the (relfilenode, block) match, we check the LSN and skip the prefetch if it's sufficiently recent. Otherwise we prefetch.
>>
>> I'm a bit doubtful this is really needed at this point. Yes, the prefetching will do a buffer table lookup - but it's a lookup that already happens today. And the patch already avoids doing a second lookup after prefetching (by optimistically caching the last Buffer id, and re-checking).
>
> I agree that when a page is looked up, and found, in the buffer table, the subsequent caching of the buffer id in the WAL records does a good job of avoiding having to re-do that lookup. However, that isn't the case which was being discussed here or what Tomas's suggestion was intended to address.
>
> What I pointed out up-thread and what's being discussed here is what happens when the WAL contains a few FPIs and a few regular WAL records which are mixed up and not in ideal order. When that happens, with this patch, the FPIs will be ignored, the regular WAL records will reference blocks which aren't found in shared buffers (yet), and then we'll both issue prefetches for those and end up having spent effort doing a buffer lookup that we'll later re-do.

The question is how common this pattern actually is - I don't know. As noted, the non-FPI would have to be fairly close to the FPI, i.e. within the wal_decode_buffer_size, to actually cause measurable harm.

> To address the unnecessary syscalls we really just need to keep track of any FPIs that we've seen between the point where the prefetching is happening and the point where the replay is being done - once replay has replayed an FPI, our buffer lookup will succeed and we'll cache the buffer that the FPI is at. In other words, only wal_decode_buffer_size amount of WAL needs to be considered.

Yeah, that's essentially what I proposed.

> We could further leverage this tracking of FPIs to skip the prefetch syscalls, by caching with the FPI record which later records address the blocks that have FPIs earlier in the queue; then, when replay hits the FPI and loads it into shared_buffers, it could update the other WAL records in the queue with the buffer id of the page, allowing us to very likely avoid having to do another lookup later on.

This seems like over-engineering, at least for v1.

>> I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately.
>
> This seems like a great optimization, albeit a fair bit of code, for a relatively uncommon use-case, specifically where full page writes are disabled or checkpoints are very large. As that's the case though, I would think it's reasonable to ask that it go out of its way to avoid slowing down the more common configurations, particularly since it's proposed to have it on by default (which I agree with, provided it ends up improving the common cases, which I think the suggestions above would certainly make it more likely to do).

I'm OK to do some benchmarking, but it's not quite clear to me why it matters whether the checkpoints are smaller than shared buffers. IMO what matters is how "localized" the updates are, i.e. how likely it is to hit the same page repeatedly (in a short amount of time). Regular pgbench is not very suitable for that, but some non-uniform distribution should do the t
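Purely as an illustration of the "localized" updates a benchmark would need to produce (an FPI followed by many further changes to the same page within one checkpoint), here is a small C sketch of a skewed block chooser; in practice one would more likely reach for pgbench's built-in non-uniform random functions (e.g. random_exponential or random_zipfian) in a custom script rather than anything hand-rolled, and the constants below are arbitrary.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Map a uniform u in [0,1) to a block number in [0, nblocks) with an
 * exponential skew: a small "hot" region gets most of the hits, so a hot
 * block typically receives an FPI and then many follow-up changes within
 * the same checkpoint -- the localized pattern described above.
 */
static unsigned
skewed_block(double u, unsigned nblocks, double sharpness)
{
    /* inverse CDF of an exponential distribution truncated to [0,1) */
    double x = -log(1.0 - u * (1.0 - exp(-sharpness))) / sharpness;
    return (unsigned) (x * nblocks);
}

int main(void)
{
    unsigned hits[10] = {0};   /* bucket the block range into deciles */
    unsigned nblocks = 100000;

    srand(12345);
    for (int i = 0; i < 1000000; i++)
    {
        double u = (double) rand() / ((double) RAND_MAX + 1.0);
        unsigned b = skewed_block(u, nblocks, 8.0);
        hits[b * 10 / nblocks]++;   /* most updates land in the first decile */
    }

    for (int d = 0; d < 10; d++)
        printf("decile %d: %u updates\n", d, hits[d]);
    return 0;
}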
Re: WIP: WAL prefetch (another approach)
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> > Yeah, that's a good point. I think it'd make sense to keep track of recent FPIs and skip prefetching such blocks. But how exactly should we implement that, how many blocks do we need to track? If you get an FPI, how long should we skip prefetching of that block?
> >
> > I don't think the history needs to be very long, for two reasons. Firstly, the usual pattern is that we have FPI + several changes for that block shortly after it. Secondly, maintenance_io_concurrency limits this naturally - after crossing that, redo should place the FPI into shared buffers, allowing us to skip the prefetch.
> >
> > So I think using maintenance_io_concurrency is sufficient. We might track more buffers to allow skipping prefetches of blocks that were evicted from shared buffers, but that seems like overkill.
> >
> > However, maintenance_io_concurrency can be quite high, so just a simple queue is not very suitable - searching it linearly for each block would be too expensive. But I think we can use a simple hash table, tracking (relfilenode, block, LSN), over-sized to minimize collisions.
> >
> > Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, and whenever we prefetch a block or find an FPI, we simply add the block to the array as determined by hash(relfilenode, block)
> >
> >     hashtable[hash(...)] = {relfilenode, block, LSN}
> >
> > and then when deciding whether to prefetch a block, we look at that one position. If the (relfilenode, block) match, we check the LSN and skip the prefetch if it's sufficiently recent. Otherwise we prefetch.
>
> I'm a bit doubtful this is really needed at this point. Yes, the prefetching will do a buffer table lookup - but it's a lookup that already happens today. And the patch already avoids doing a second lookup after prefetching (by optimistically caching the last Buffer id, and re-checking).

I agree that when a page is looked up, and found, in the buffer table, the subsequent caching of the buffer id in the WAL records does a good job of avoiding having to re-do that lookup. However, that isn't the case which was being discussed here or what Tomas's suggestion was intended to address.

What I pointed out up-thread and what's being discussed here is what happens when the WAL contains a few FPIs and a few regular WAL records which are mixed up and not in ideal order. When that happens, with this patch, the FPIs will be ignored, the regular WAL records will reference blocks which aren't found in shared buffers (yet), and then we'll both issue prefetches for those and end up having spent effort doing a buffer lookup that we'll later re-do.

To address the unnecessary syscalls we really just need to keep track of any FPIs that we've seen between the point where the prefetching is happening and the point where the replay is being done - once replay has replayed an FPI, our buffer lookup will succeed and we'll cache the buffer that the FPI is at. In other words, only wal_decode_buffer_size amount of WAL needs to be considered.

We could further leverage this tracking of FPIs to skip the prefetch syscalls, by caching with the FPI record which later records address the blocks that have FPIs earlier in the queue; then, when replay hits the FPI and loads it into shared_buffers, it could update the other WAL records in the queue with the buffer id of the page, allowing us to very likely avoid having to do another lookup later on.

> I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately.

This seems like a great optimization, albeit a fair bit of code, for a relatively uncommon use-case, specifically where full page writes are disabled or checkpoints are very large. As that's the case though, I would think it's reasonable to ask that it go out of its way to avoid slowing down the more common configurations, particularly since it's proposed to have it on by default (which I agree with, provided it ends up improving the common cases, which I think the suggestions above would certainly make it more likely to do).

Perhaps this already improves the common cases and is worth the extra code on that basis, but I don't recall seeing much in the way of benchmarking in this thread for that case - that is, where FPIs are enabled and checkpoints are smaller than shared buffers. Jakub's testing was done with FPWs disabled and Tomas's testing used checkpoints which were much larger than the size of shared buffers on the system doing the replay. While it's certainly good that this patch improves those cases, we should also be looking out for the worst case and make sure that the patch doesn't degrade performance in that
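For concreteness, here is a minimal C sketch of the single-slot-per-hash filter described in the quoted proposal. The type names, the 64-slot array, the hash function and the recency window are stand-ins (the real thing would size the array from maintenance_io_concurrency and use XLogRecPtr and the actual relfilenode), so treat it as a sketch of the idea rather than an implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t relnode;   /* stand-in for RelFileNode */
    uint32_t block;
    uint64_t lsn;       /* LSN at which we saw the FPI or issued the prefetch */
    bool     used;
} RecentBlock;

#define FILTER_SLOTS 64          /* ~ 2 * maintenance_io_concurrency */

typedef struct
{
    RecentBlock slots[FILTER_SLOTS];
    uint64_t    recency_window;  /* how far back an entry still counts as "recent" */
} RecentBlockFilter;

static unsigned
filter_slot(uint32_t relnode, uint32_t block)
{
    /* any cheap hash over (relfilenode, block) will do for the sketch */
    uint32_t h = relnode * 0x9E3779B1u ^ block * 0x85EBCA77u;
    return h % FILTER_SLOTS;
}

/* Called when we issue a prefetch, or when the decoder sees an FPI. */
static void
filter_remember(RecentBlockFilter *f, uint32_t relnode, uint32_t block, uint64_t lsn)
{
    RecentBlock *s = &f->slots[filter_slot(relnode, block)];

    s->relnode = relnode;   /* collisions simply overwrite the slot */
    s->block = block;
    s->lsn = lsn;
    s->used = true;
}

/* Called before prefetching: true means "skip, we very likely don't need it". */
static bool
filter_skip_prefetch(const RecentBlockFilter *f, uint32_t relnode,
                     uint32_t block, uint64_t current_lsn)
{
    const RecentBlock *s = &f->slots[filter_slot(relnode, block)];

    return s->used &&
           s->relnode == relnode &&
           s->block == block &&
           current_lsn >= s->lsn &&
           current_lsn - s->lsn <= f->recency_window;
}

int main(void)
{
    RecentBlockFilter f = { .recency_window = 16 * 1024 * 1024 }; /* arbitrary window */

    filter_remember(&f, 1000, 42, 1000000);   /* decoder saw an FPI for block 42 */
    printf("skip block 42? %d\n", filter_skip_prefetch(&f, 1000, 42, 1200000)); /* 1 */
    printf("skip block 43? %d\n", filter_skip_prefetch(&f, 1000, 43, 1200000)); /* 0 */
    return 0;
}

A stale or colliding entry only costs either a redundant prefetch hint or one skipped prefetch (which replay then satisfies with a synchronous read); it never affects correctness, which is presumably why such a small, lossy, fixed-size table is being proposed rather than an exact structure.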