Re: WIP: WAL prefetch (another approach)

2022-09-24 Thread Thomas Munro
On Wed, Apr 13, 2022 at 8:05 AM Thomas Munro  wrote:
> On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker
>  wrote:
> > Simon Riggs  writes:
> > > This is a nice feature if it is safe to turn off full_page_writes.

> > > When is it safe to do that? On which platform?
> > >
> > > I am not aware of any released software that allows full_page_writes
> > > to be safely disabled. Perhaps something has been released recently
> > > that allows this? I think we have substantial documentation about
> > > safety of other settings, so we should carefully document things here
> > > also.
> >
> > Our WAL reliability docs claim that ZFS is safe against torn pages:
> >
> > https://www.postgresql.org/docs/current/wal-reliability.html:
> >
> > If you have file-system software that prevents partial page writes
> > (e.g., ZFS), you can turn off this page imaging by turning off the
> > full_page_writes parameter.
>
> Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS
> right now :-(.

Update: OpenZFS now has this working in its master branch (Linux only
for now), so fingers crossed for the next release.




Re: WIP: WAL prefetch (another approach)

2022-04-26 Thread Thomas Munro
On Tue, Apr 26, 2022 at 6:11 PM Thomas Munro  wrote:
> I will poke some more tomorrow to try to confirm this and try to come
> up with a fix.

Done, and moved over to the pg_walinspect commit thread to reach the
right eyeballs:

https://www.postgresql.org/message-id/CA%2BhUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA%40mail.gmail.com




Re: WIP: WAL prefetch (another approach)

2022-04-25 Thread Thomas Munro
On Tue, Apr 26, 2022 at 6:11 AM Tom Lane  wrote:
> I believe that the WAL prefetch patch probably accounts for the
> intermittent errors that buildfarm member topminnow has shown
> since it went in, eg [1]:
>
> diff -U3 
> /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out 
> /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out

Hmm, maybe but I suspect not.  I think I might see what's happening here.

> +ERROR:  could not read WAL at 0/1903E40

> I've reproduced this manually on that machine, and confirmed that the
> proximate cause is that XLogNextRecord() is returning NULL because
> state->decode_queue_head == NULL, without bothering to provide an errormsg
> (which doesn't seem very well thought out in itself).  I obtained the

Thanks for doing that.  After several hours of trying I also managed
to reproduce it on that gcc23 system (not at all sure why it doesn't
show up elsewhere; MIPS 32 bit layout may be a factor), and added some
trace to get some more clues.  Still looking into it, but here is the
current hypothesis I'm testing:

1.  The reason there's a messageless ERROR in this case is that there
is new read_page callback logic introduced for pg_walinspect, called
via read_local_xlog_page_no_wait(), which is like the old
read_local_xlog_page() except that it returns -1 if you try to read
past the current "flushed" LSN, without queueing an error message.  An
error is then reported by XLogReadRecord() and appears to the user.

2.  The reason pg_walinspect tries to read WAL data past the flushed
LSN is because its GetWALRecordsInfo() function keeps calling
XLogReadRecord() until EndRecPtr >= end_lsn, where end_lsn is taken
from a snapshot of the flushed LSN, but I don't see where it takes
into account that the flushed LSN might momentarily fall in the middle
of a record.  In that case, xlogreader.c will try to read the next
page, which fails because it's past the flushed LSN (see point 1).
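
To put concrete numbers on point 2, here is a tiny standalone sketch
(stand-in type only; the snapshot LSN is invented for illustration,
while the record start and length match the pg_waldump output quoted
elsewhere in this thread) of how a flushed-LSN snapshot can land inside
a record:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */

int
main(void)
{
	XLogRecPtr	rec_start = 0x01903E40; /* INSERT+INIT record from pg_waldump */
	XLogRecPtr	rec_len = 59;			/* its length, before alignment padding */
	XLogRecPtr	prev_end = rec_start;	/* EndRecPtr after the preceding COMMIT */
	XLogRecPtr	end_lsn = 0x01903E50;	/* invented flushed-LSN snapshot */

	/* EndRecPtr < end_lsn, so the loop asks for another record ... */
	if (prev_end < end_lsn && end_lsn < rec_start + rec_len)
		printf("end_lsn 0x%llX lies inside record 0x%llX..0x%llX: reading it\n"
			   "requires bytes beyond the flushed LSN, which the no-wait\n"
			   "callback refuses\n",
			   (unsigned long long) end_lsn,
			   (unsigned long long) rec_start,
			   (unsigned long long) (rec_start + rec_len));
	return 0;
}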

I will poke some more tomorrow to try to confirm this and try to come
up with a fix.




Re: WIP: WAL prefetch (another approach)

2022-04-25 Thread Tom Lane
Oh, one more bit of data: here's an excerpt from pg_waldump output after
the failed test:

rmgr: Btree       len (rec/tot):  72/72,  tx: 727, lsn: 0/01903BC8, prev 0/01903B70, desc: INSERT_LEAF off 111, blkref #0: rel 1663/16384/2673 blk 9
rmgr: Btree       len (rec/tot):  72/72,  tx: 727, lsn: 0/01903C10, prev 0/01903BC8, desc: INSERT_LEAF off 141, blkref #0: rel 1663/16384/2674 blk 7
rmgr: Standby     len (rec/tot):  42/42,  tx: 727, lsn: 0/01903C58, prev 0/01903C10, desc: LOCK xid 727 db 16384 rel 16391
rmgr: Transaction len (rec/tot): 437/437, tx: 727, lsn: 0/01903C88, prev 0/01903C58, desc: COMMIT 2022-04-25 20:16:03.374197 CEST; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16391
rmgr: Heap        len (rec/tot):  59/59,  tx: 728, lsn: 0/01903E40, prev 0/01903C88, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap        len (rec/tot):  59/59,  tx: 728, lsn: 0/01903E80, prev 0/01903E40, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot):  34/34,  tx: 728, lsn: 0/01903EC0, prev 0/01903E80, desc: COMMIT 2022-04-25 20:16:03.379323 CEST
rmgr: Heap        len (rec/tot):  59/59,  tx: 729, lsn: 0/01903EE8, prev 0/01903EC0, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap        len (rec/tot):  59/59,  tx: 729, lsn: 0/01903F28, prev 0/01903EE8, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot):  34/34,  tx: 729, lsn: 0/01903F68, prev 0/01903F28, desc: COMMIT 2022-04-25 20:16:03.381720 CEST

The error is complaining about not being able to read 0/01903E40,
which AFAICT is from the first "INSERT INTO sample_tbl" command,
which most certainly ought to be down to disk at this point.

Also, I modified the test script to see what WAL LSNs it thought
it was dealing with, and got

+\echo 'wal_lsn1 = ' :wal_lsn1
+wal_lsn1 =  0/1903E40
+\echo 'wal_lsn2 = ' :wal_lsn2
+wal_lsn2 =  0/1903EE8

confirming that idea of where 0/01903E40 is in the WAL history.
So this is sure looking like a bug somewhere in xlogreader.c,
not in pg_walinspect.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2022-04-25 Thread Tom Lane
I believe that the WAL prefetch patch probably accounts for the
intermittent errors that buildfarm member topminnow has shown
since it went in, eg [1]:

diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out
--- /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out	2022-04-10 03:05:15.972622440 +0200
+++ /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out	2022-04-25 05:09:49.861642059 +0200
@@ -34,11 +34,7 @@
 (1 row)
 
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok 
-
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats(:'wal_lsn1', :'wal_lsn2');
  ok 
 
@@ -46,11 +42,7 @@
 (1 row)
 
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
- ok 
-
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40
 -- ===
 -- Test for filtering out WAL records of a particular table
 -- ===


I've reproduced this manually on that machine, and confirmed that the
proximate cause is that XLogNextRecord() is returning NULL because
state->decode_queue_head == NULL, without bothering to provide an errormsg
(which doesn't seem very well thought out in itself).  I obtained the
contents of the xlogreader struct at failure:

(gdb) p *xlogreader
$1 = {routine = {page_read = 0x594270 , 
segment_open = 0x593b44 , 
segment_close = 0x593d38 }, system_identifier = 0, 
  private_data = 0x0, ReadRecPtr = 26230672, EndRecPtr = 26230752, 
  abortedRecPtr = 26230752, missingContrecPtr = 26230784, 
  overwrittenRecPtr = 0, DecodeRecPtr = 26230672, NextRecPtr = 26230752, 
  PrevRecPtr = 0, record = 0x0, decode_buffer = 0xf25428 "\240", 
  decode_buffer_size = 65536, free_decode_buffer = true, 
  decode_buffer_head = 0xf25428 "\240", decode_buffer_tail = 0xf25428 "\240", 
  decode_queue_head = 0x0, decode_queue_tail = 0x0, 
  readBuf = 0xf173f0 "\020\321\005", readLen = 0, segcxt = {
ws_dir = '\000' , ws_segsize = 16777216}, seg = {
ws_file = 25, ws_segno = 0, ws_tli = 1}, segoff = 0, 
  latestPagePtr = 26222592, latestPageTLI = 1, currRecPtr = 26230752, 
  currTLI = 1, currTLIValidUntil = 0, nextTLI = 0, 
  readRecordBuf = 0xf1b3f8 "<", readRecordBufSize = 40960, 
  errormsg_buf = 0xef3270 "", errormsg_deferred = false, nonblocking = false}

I don't have an intuition about where to look beyond that, any
suggestions?

What I do know so far is that while the failure reproduces fairly
reliably under "make check" (more than half the time, which squares
with topminnow's history), it doesn't reproduce at all under "make
installcheck" (after removing NO_INSTALLCHECK), which seems odd.
Maybe it's dependent on how much WAL history the installation has
accumulated?

It could be that this is a bug in pg_walinspect or a fault in its
test case; hard to tell since that got committed at about the same
time as the prefetch changes.

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47




Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread Thomas Munro
On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker
 wrote:
> Simon Riggs  writes:
> > This is a nice feature if it is safe to turn off full_page_writes.

As others have said/shown, it does also help if a block with an FPW is
evicted and then read back in during one checkpoint cycle, in other
words if the working set is larger than shared buffers.

This also provides infrastructure for proposals in the next cycle, as
part of commitfest #3316:
* in direct I/O mode, I/O stalls become more likely due to lack of
kernel prefetching/double-buffering, so prefetching becomes more
essential
* even in buffered I/O mode when benefiting from free
double-buffering, the copy from kernel buffer to user space buffer can
be finished in the background instead of calling pread() when you need
the page, but you need to start it sooner
* adjacent blocks accessed by nearby records can be merged into a
single scatter-read, for example with preadv() in the background (see
the sketch after these lists)
* repeated buffer lookups, pins, locks (and maybe eventually replay)
to the same page can be consolidated

Pie-in-the-sky ideas:
* someone might eventually want to be able to replay in parallel
(hard, but certainly requires lookahead)
* I sure hope we'll eventually use different techniques for torn-page
protection to avoid the high online costs of FPW
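
To make the preadv() idea above a bit more concrete, here is a minimal
standalone sketch (not code from any patch; the file name, block numbers
and 8 kB block size are assumptions for illustration) that fetches three
adjacent blocks with a single scatter-read.  preadv() is available on
Linux and the BSDs:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

int
main(void)
{
	static char bufs[3][BLCKSZ];
	struct iovec iov[3];
	ssize_t		nread;
	int			fd;

	/* Hypothetical relation file; any file at least 80 kB long will do. */
	fd = open("base/16384/16391", O_RDONLY);
	if (fd < 0)
	{
		perror("open");
		return EXIT_FAILURE;
	}

	/* One iovec per block; the kernel fills them in order. */
	for (int i = 0; i < 3; i++)
	{
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = BLCKSZ;
	}

	/* Blocks 7, 8 and 9 fetched with one system call instead of three. */
	nread = preadv(fd, iov, 3, (off_t) 7 * BLCKSZ);
	if (nread < 0)
		perror("preadv");
	else
		printf("read %zd bytes in one scatter-read\n", nread);

	close(fd);
	return 0;
}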

> > When is it safe to do that? On which platform?
> >
> > I am not aware of any released software that allows full_page_writes
> > to be safely disabled. Perhaps something has been released recently
> > that allows this? I think we have substantial documentation about
> > safety of other settings, so we should carefully document things here
> > also.
>
> Our WAL reliability docs claim that ZFS is safe against torn pages:
>
> https://www.postgresql.org/docs/current/wal-reliability.html:
>
> If you have file-system software that prevents partial page writes
> (e.g., ZFS), you can turn off this page imaging by turning off the
> full_page_writes parameter.

Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS
right now :-(.  I have some patches to fix that on Linux[1] and
FreeBSD and it seems like there's a good chance of getting them
committed based on feedback, but it needs some more work on tests and
mmap integration.  If anyone's interested in helping get that landed
faster, please ping me off-list.

[1] https://github.com/openzfs/zfs/pull/9807
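
For anyone following along, the hint that recovery prefetching currently
relies on essentially boils down to posix_fadvise(POSIX_FADV_WILLNEED)
on the region of the relation file holding the referenced block.  A
minimal standalone sketch (not code from the patch; the file name and
block number are made up, and 8192 is just PostgreSQL's default block
size):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

int
main(void)
{
	off_t		blkno = 42;		/* hypothetical block to prefetch */
	int			fd;
	int			rc;

	/* Hypothetical relation file. */
	fd = open("base/16384/16391", O_RDONLY);
	if (fd < 0)
	{
		perror("open");
		return EXIT_FAILURE;
	}

	/* Ask the kernel to start reading this block in the background. */
	rc = posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
	if (rc != 0)
		fprintf(stderr, "posix_fadvise: error %d\n", rc);

	/*
	 * On file systems whose WILLNEED handler is a no-op (as ZFS was before
	 * the pull request above), the call succeeds but nothing is read ahead.
	 */
	close(fd);
	return 0;
}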




Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread SATYANARAYANA NARLAPURAM
> Other way around. FPWs make prefetch unnecessary.
> Therefore you would only want prefetch with FPW=off, AFAIK.
>
A few scenarios where I can imagine page prefetch helping:

1/ A DR replica that is a smaller instance size than the primary. Page
prefetch can bring pages back into memory in advance after they have
been evicted, which speeds up replay and is cost effective.
2/ It allows a larger checkpoint_timeout for the same recovery SLA, and
perhaps improved performance?
3/ WAL prefetch (not the pages themselves) can improve replay by itself
(not sure if it was measured in isolation; Tomas V can comment on it).
4/ The scenario Tomas V mentioned earlier of a read replica running an
analytical workload.


>
> Or put this another way: when is it safe and sensible to use
> recovery_prefetch != off?
>
When checkpoint_timeout is set large under heavy write activity, or on a
read replica whose working set is larger than memory and which receives
constant updates from the primary. This covers 1 & 4 above.


> --
> Simon Riggs    http://www.EnterpriseDB.com/
>
>
>


Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread Tomas Vondra
On 4/12/22 17:46, Simon Riggs wrote:
> On Tue, 12 Apr 2022 at 16:41, Tomas Vondra
>  wrote:
>>
>> On 4/12/22 15:58, Simon Riggs wrote:
>>> On Thu, 7 Apr 2022 at 08:46, Thomas Munro  wrote:
>>>
 With that... I've finally pushed the 0002 patch and will be watching
 the build farm.
>>>
>>> This is a nice feature if it is safe to turn off full_page_writes.
>>>
>>> When is it safe to do that? On which platform?
>>>
>>> I am not aware of any released software that allows full_page_writes
>>> to be safely disabled. Perhaps something has been released recently
>>> that allows this? I think we have substantial documentation about
>>> safety of other settings, so we should carefully document things here
>>> also.
>>>
>>
>> I don't see why/how an async prefetch would make FPW unnecessary. Did
>> anyone claim that to be the case?
> 
> Other way around. FPWs make prefetch unnecessary.
> Therefore you would only want prefetch with FPW=off, AFAIK.
> 
> Or put this another way: when is it safe and sensible to use
> recovery_prefetch != off?
> 

That assumes the FPI stays in memory until the next modification, and
that can be untrue for a number of reasons. Long checkpoint interval
with enough random accesses in between is a nice example. See the
benchmarks I did a year ago (regular pgbench).

Or imagine a r/o replica used to run analytics queries, that access so
much data it evicts the buffers initialized by the FPI records.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread Dagfinn Ilmari Mannsåker
Simon Riggs  writes:

> On Thu, 7 Apr 2022 at 08:46, Thomas Munro  wrote:
>
>> With that... I've finally pushed the 0002 patch and will be watching
>> the build farm.
>
> This is a nice feature if it is safe to turn off full_page_writes.
>
> When is it safe to do that? On which platform?
>
> I am not aware of any released software that allows full_page_writes
> to be safely disabled. Perhaps something has been released recently
> that allows this? I think we have substantial documentation about
> safety of other settings, so we should carefully document things here
> also.

Our WAL reliability docs claim that ZFS is safe against torn pages:

https://www.postgresql.org/docs/current/wal-reliability.html:

If you have file-system software that prevents partial page writes
(e.g., ZFS), you can turn off this page imaging by turning off the
full_page_writes parameter.

- ilmari




Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread Simon Riggs
On Tue, 12 Apr 2022 at 16:41, Tomas Vondra
 wrote:
>
> On 4/12/22 15:58, Simon Riggs wrote:
> > On Thu, 7 Apr 2022 at 08:46, Thomas Munro  wrote:
> >
> >> With that... I've finally pushed the 0002 patch and will be watching
> >> the build farm.
> >
> > This is a nice feature if it is safe to turn off full_page_writes.
> >
> > When is it safe to do that? On which platform?
> >
> > I am not aware of any released software that allows full_page_writes
> > to be safely disabled. Perhaps something has been released recently
> > that allows this? I think we have substantial documentation about
> > safety of other settings, so we should carefully document things here
> > also.
> >
>
> I don't see why/how an async prefetch would make FPW unnecessary. Did
> anyone claim that to be the case?

Other way around. FPWs make prefetch unnecessary.
Therefore you would only want prefetch with FPW=off, AFAIK.

Or put this another way: when is it safe and sensible to use
recovery_prefetch != off?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread Tomas Vondra
On 4/12/22 15:58, Simon Riggs wrote:
> On Thu, 7 Apr 2022 at 08:46, Thomas Munro  wrote:
> 
>> With that... I've finally pushed the 0002 patch and will be watching
>> the build farm.
> 
> This is a nice feature if it is safe to turn off full_page_writes.
> 
> When is it safe to do that? On which platform?
> 
> I am not aware of any released software that allows full_page_writes
> to be safely disabled. Perhaps something has been released recently
> that allows this? I think we have substantial documentation about
> safety of other settings, so we should carefully document things here
> also.
> 

I don't see why/how an async prefetch would make FPW unnecessary. Did
anyone claim that to be the case?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread Simon Riggs
On Thu, 7 Apr 2022 at 08:46, Thomas Munro  wrote:

> With that... I've finally pushed the 0002 patch and will be watching
> the build farm.

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




RE: WIP: WAL prefetch (another approach)

2022-04-12 Thread Shinoda, Noriyoshi (PN Japan FSIP)
Hi,
Thank you for your reply. 
I missed the message, sorry.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Thomas Munro  
Sent: Tuesday, April 12, 2022 6:28 PM
To: Shinoda, Noriyoshi (PN Japan FSIP) 
Cc: Justin Pryzby ; Tomas Vondra 
; Stephen Frost ; Andres 
Freund ; Jakub Wartak ; Alvaro 
Herrera ; Tomas Vondra 
; Dmitry Dolgov <9erthali...@gmail.com>; David 
Steele ; pgsql-hackers 
Subject: Re: WIP: WAL prefetch (another approach)

On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP) 
 wrote:
> Thank you for developing the great feature. I tested this feature and checked 
> the documentation. Currently, the documentation for the 
> pg_stat_prefetch_recovery view is included in the description for the 
> pg_stat_subscription view.
>
> https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

Hi!  Thanks.  I had just committed a fix before I saw your message, because 
there was already another report here:

https://www.postgresql.org/message-id/flat/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA%40mail.gmail.com


Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread Thomas Munro
On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP)
 wrote:
> Thank you for developing the great feature. I tested this feature and checked 
> the documentation. Currently, the documentation for the 
> pg_stat_prefetch_recovery view is included in the description for the 
> pg_stat_subscription view.
>
> https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

Hi!  Thanks.  I had just committed a fix before I saw your message,
because there was already another report here:

https://www.postgresql.org/message-id/flat/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA%40mail.gmail.com




RE: WIP: WAL prefetch (another approach)

2022-04-12 Thread Shinoda, Noriyoshi (PN Japan FSIP)
Hi, 
Thank you for developing the great feature. I tested this feature and checked 
the documentation. Currently, the documentation for the 
pg_stat_prefetch_recovery view is included in the description for the 
pg_stat_subscription view.

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

It is also not displayed in the list of "28.2. The Statistics Collector".
https://www.postgresql.org/docs/devel/monitoring.html

The attached patch modifies the pg_stat_prefetch_recovery view to appear as a 
separate view.

Regards,
Noriyoshi Shinoda
-Original Message-
From: Thomas Munro  
Sent: Friday, April 8, 2022 10:47 AM
To: Justin Pryzby 
Cc: Tomas Vondra ; Stephen Frost 
; Andres Freund ; Jakub Wartak 
; Alvaro Herrera ; Tomas 
Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; 
David Steele ; pgsql-hackers 
Subject: Re: WIP: WAL prefetch (another approach)

On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby  wrote:
> The docs seem to be wrong about the default.
>
> +are not yet in the buffer pool, during recovery.  Valid values are
> +off (the default), on and
> +try.  The setting try 
> + enables

Fixed.

> +   concurrency and distance, respectively.  By default, it is set to
> +   try, which enabled the feature on systems where
> +   posix_fadvise is available.
>
> Should say "which enables".

Fixed.

> Curiously, I reported a similar issue last year.

Sorry.  I guess both times we only agreed on what the default should be in the 
final review round before commit, and I let the docs get out of sync (well, the 
default is mentioned in two places and I apparently ended my search too soon, 
changing only one).  I also found another recently obsoleted sentence: the one 
about showing nulls sometimes was no longer true.  Removed.




pg_stat_recovery_prefetch_doc_v1.diff
Description: pg_stat_recovery_prefetch_doc_v1.diff


Re: WIP: WAL prefetch (another approach)

2022-04-07 Thread Thomas Munro
On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby  wrote:
> The docs seem to be wrong about the default.
>
> +are not yet in the buffer pool, during recovery.  Valid values are
> +off (the default), on and
> +try.  The setting try enables

Fixed.

> +   concurrency and distance, respectively.  By default, it is set to
> +   try, which enabled the feature on systems where
> +   posix_fadvise is available.
>
> Should say "which enables".

Fixed.

> Curiously, I reported a similar issue last year.

Sorry.  I guess both times we only agreed on what the default should
be in the final review round before commit, and I let the docs get out
of sync (well, the default is mentioned in two places and I apparently
ended my search too soon, changing only one).  I also found another
recently obsoleted sentence: the one about showing nulls sometimes was
no longer true.  Removed.




Re: WIP: WAL prefetch (another approach)

2022-04-07 Thread Justin Pryzby
The docs seem to be wrong about the default.

+are not yet in the buffer pool, during recovery.  Valid values are
+off (the default), on and
+try.  The setting try enables

+   concurrency and distance, respectively.  By default, it is set to
+   try, which enabled the feature on systems where
+   posix_fadvise is available.

Should say "which enables".

+   {
+   {"recovery_prefetch", PGC_SIGHUP, WAL_RECOVERY,
+   gettext_noop("Prefetch referenced blocks during 
recovery"),
+   gettext_noop("Look ahead in the WAL to find references 
to uncached data.")
+   },
+   &recovery_prefetch,
+   RECOVERY_PREFETCH_TRY, recovery_prefetch_options,
+   check_recovery_prefetch, assign_recovery_prefetch, NULL
+   },

Curiously, I reported a similar issue last year.

On Thu, Apr 08, 2021 at 10:37:04PM -0500, Justin Pryzby wrote:
> --- a/doc/src/sgml/wal.sgml
> +++ b/doc/src/sgml/wal.sgml
> @@ -816,9 +816,7 @@
> prefetching mechanism is most likely to be effective on systems
> with full_page_writes set to
> off (where that is safe), and where the working
> -   set is larger than RAM.  By default, prefetching in recovery is enabled
> -   on operating systems that have posix_fadvise
> -   support.
> +   set is larger than RAM.  By default, prefetching in recovery is disabled.
>
>   




Re: WIP: WAL prefetch (another approach)

2022-04-07 Thread Thomas Munro
On Mon, Apr 4, 2022 at 3:12 PM Julien Rouhaud  wrote:
> [review]

Thanks!  I took almost all of your suggestions about renaming things,
comments, docs and moving a magic number into a macro.

Minor changes:

1.  Rebased over the shmem stats changes and others that have just
landed today (woo!).  The way my simple SharedStats object works and
is reset looks a little primitive next to the shiny new stats
infrastructure, but I can always adjust that in a follow-up patch if
required.

2.  It was a bit annoying that the pg_stat_recovery_prefetch view
would sometimes show stale numbers when waiting for WAL to be
streamed, since that happens at arbitrary points X bytes apart in the
WAL.  Now it also happens before sleeping/waiting and when recovery
ends.

3.  Last year, commit a55a9847 synchronised config.sgml with guc.c's
categories.  A couple of hunks in there modified the previous version of
this work, before it all got reverted.  So I've re-added the
WAL_RECOVERY GUC category, to match the new section in config.sgml.

About test coverage, the most interesting lines of xlogprefetcher.c
that stand out as unreached in a gcov report are in the special
handling for the new CREATE DATABASE in file-copy mode -- but that's
probably something to raise in the thread that introduced that new
functionality without a test.  I've tested that code locally; if you
define XLOGPREFETCHER_DEBUG_LEVEL you'll see that it won't touch
anything in the new database until recovery has replayed the
file-copy.

As for current CI-vs-buildfarm blind spots that recently bit me and
others, I also tested -m32 and -fsanitize=undefined,unaligned builds.

I reran one of the quick pgbench/crash/drop-caches/recover tests I had
lying around and saw a 17s -> 6s speedup with FPW off (you need much
longer tests to see speedup with them on, so this is a good way for
quick sanity checks -- see Tomas V's results for long runs with FPWs
and curved effects).

With that... I've finally pushed the 0002 patch and will be watching
the build farm.




Re: WIP: WAL prefetch (another approach)

2022-04-03 Thread Julien Rouhaud
On Thu, Mar 31, 2022 at 10:49:32PM +1300, Thomas Munro wrote:
> On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud  wrote:
> > So I finally finished looking at this patch.  Here again, AFAICS the 
> > feature is
> > working as expected and I didn't find any problem.  I just have some minor
> > comments, like for the previous patch.
>
> Thanks very much for the review.  I've attached a new version
> addressing most of your feedback, and also rebasing over the new
> WAL-logged CREATE DATABASE.  I've also fixed a couple of bugs (see
> end).
>
> > For the docs:
> >
> > +Whether to try to prefetch blocks that are referenced in the WAL 
> > that
> > +are not yet in the buffer pool, during recovery.  Valid values are
> > +off (the default), on and
> > +try.  The setting try enables
> > +prefetching only if the operating system provides the
> > +posix_fadvise function, which is currently 
> > used
> > +to implement prefetching.  Note that some operating systems 
> > provide the
> > +function, but don't actually perform any prefetching.
> >
> > Is there any reason not to change it to try?  I'm wondering if some system 
> > says
> > that the function exists but simply raise an error if you actually try to 
> > use
> > it.  I think that at least WSL does that for some functions.
>
> Yeah, we could just default it to try.  Whether we should ship that
> way is another question, but done for now.

Should there be an associated pg15 open item for that, when the patch will be
committed?  Note that in wal.sgml, the patch still says:

+   [...] By default, prefetching in
+   recovery is disabled.

I guess this should be changed even if we eventually choose to disable it by
default?

> I don't think there are any supported systems that have a
> posix_fadvise() that fails with -1, or we'd know about it, because
> we already use it in other places.  We do support one OS that provides
> a dummy function in libc that does nothing at all (Solaris/illumos),
> and at least a couple that enter the kernel but are known to do
> nothing at all for WILLNEED (AIX, FreeBSD).

Ah, I didn't know that, thanks for the info!

> >  bool
> >  XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
> >RelFileNode *rnode, ForkNumber *forknum, BlockNumber 
> > *blknum)
> > +{
> > +   return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, 
> > NULL);
> > +}
> > +
> > +bool
> > +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
> > +   RelFileNode *rnode, ForkNumber *forknum,
> > +   BlockNumber *blknum,
> > +   Buffer *prefetch_buffer)
> >  {
> >
> > It's missing comments on that function.  XLogRecGetBlockTag comments should
> > probably be reworded at the same time.
>
> New comment added for XLogRecGetBlockInfo().  Wish I could come up
> with a better name for that...  Not quite sure what you thought I should
> change about XLogRecGetBlockTag().

Since XLogRecGetBlockTag is now a wrapper for XLogRecGetBlockInfo, I thought it
would be better to document only the specific behavior for this one (so no
prefetch_buffer), rather than duplicating the whole description in both places.
It seems like a good recipe to miss one of the comments the next time something
is changed there.

For the name, why not the usual XLogRecGetBlockTagExtended()?

> > @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool 
> > randAccess,
> >  */
> > if (lastSourceFailed)
> > {
> > +   /*
> > +* Don't allow any retry loops to occur during nonblocking
> > +* readahead.  Let the caller process everything that has been
> > +* decoded already first.
> > +*/
> > +   if (nonblocking)
> > +   return XLREAD_WOULDBLOCK;
> >
> > Is that really enough?  I'm wondering if the code path in ReadRecord() that
> > forces lastSourceFailed to False while it actually failed when switching 
> > into
> > archive recovery (xlogrecovery.c around line 3044) can be problematic here.
>
> I don't see the problem scenario, could you elaborate?

Sorry, I missed that in standby mode ReadRecord would keep going until a record
is found, so no problem indeed.

> > +   /* Do we have a clue where the buffer might be already? */
> > +   if (BufferIsValid(recent_buffer) &&
> > +   mode == RBM_NORMAL &&
> > +   ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
> > +   {
> > +   buffer = recent_buffer;
> > +   goto recent_buffer_fast_path;
> > +   }
> >
> > Should this increment (local|shared)_blks_hit, since ReadRecentBuffer 
> > doesn't?
>
> Hmm.  I guess ReadRecentBuffer() should really do that.  Done.

Ah, I also thought it would be better there but was assuming that there
was some possible usage where it's not wanted.  Good then!

Should ReadRecentBuffer comment be updated to mention that pgBufferUsage is
incremented as 

Re: WIP: WAL prefetch (another approach)

2022-03-31 Thread Thomas Munro
On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud  wrote:
> So I finally finished looking at this patch.  Here again, AFAICS the feature 
> is
> working as expected and I didn't find any problem.  I just have some minor
> comments, like for the previous patch.

Thanks very much for the review.  I've attached a new version
addressing most of your feedback, and also rebasing over the new
WAL-logged CREATE DATABASE.  I've also fixed a couple of bugs (see
end).

> For the docs:
>
> +Whether to try to prefetch blocks that are referenced in the WAL that
> +are not yet in the buffer pool, during recovery.  Valid values are
> +off (the default), on and
> +try.  The setting try enables
> +prefetching only if the operating system provides the
> +posix_fadvise function, which is currently used
> +to implement prefetching.  Note that some operating systems provide 
> the
> +function, but don't actually perform any prefetching.
>
> Is there any reason not to change it to try?  I'm wondering if some system 
> says
> that the function exists but simply raise an error if you actually try to use
> it.  I think that at least WSL does that for some functions.

Yeah, we could just default it to try.  Whether we should ship that
way is another question, but done for now.

I don't think there are any supported systems that have a
posix_fadvise() that fails with -1, or we'd know about it, because
we already use it in other places.  We do support one OS that provides
a dummy function in libc that does nothing at all (Solaris/illumos),
and at least a couple that enter the kernel but are known to do
nothing at all for WILLNEED (AIX, FreeBSD).

> +  
> +   The  parameter can
> +   be used to improve I/O performance during recovery by instructing
> +   PostgreSQL to initiate reads
> +   of disk blocks that will soon be needed but are not currently in
> +   PostgreSQL's buffer pool.
> +   The  and
> +settings limit prefetching
> +   concurrency and distance, respectively.
> +   By default, prefetching in recovery is disabled.
> +  
>
> I think that "improving I/O performance" is a bit misleading, maybe reduce I/O
> wait time or something like that?  Also, I don't know if we need to be that
> precise, but maybe we should say that it's the underlying kernel that will
> (asynchronously) initiate the reads, and postgres will simply notify it.

Updated with this new text:

   The  parameter can be used to reduce
   I/O wait times during recovery by instructing the kernel to initiate reads
   of disk blocks that will soon be needed but are not currently in
   PostgreSQL's buffer pool and will soon be read.

> +  
> +   The pg_stat_prefetch_recovery view will contain 
> only
> +   one row.  It is filled with nulls if recovery is not running or WAL
> +   prefetching is not enabled.  See 
> +   for more information.
> +  
>
> That's not the implemented behavior as far as I can see.  It just prints 
> whatever is in SharedStats
> regardless of the recovery state or the prefetch_wal setting (assuming that
> there's no pending reset request).

Yeah.  Updated text: "It is filled with nulls if recovery has not run
or ...".

> Similarly, there's a mention that
> pg_stat_reset_shared('wal') will reset the stats, but I don't see anything
> calling XLogPrefetchRequestResetStats().

It's 'prefetch_recovery', not 'wal', but yeah, oops, it looks like I
got carried away between v18 and v19 while simplifying the stats and
lost a hunk I should have kept.  Fixed.

> Finally, I think we should documented what are the cumulated counters in that
> view (that should get reset) and the dynamic counters (that shouldn't get
> reset).

OK, done.

> For the code:
>
>  bool
>  XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
>RelFileNode *rnode, ForkNumber *forknum, BlockNumber 
> *blknum)
> +{
> +   return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, 
> NULL);
> +}
> +
> +bool
> +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
> +   RelFileNode *rnode, ForkNumber *forknum,
> +   BlockNumber *blknum,
> +   Buffer *prefetch_buffer)
>  {
>
> It's missing comments on that function.  XLogRecGetBlockTag comments should
> probably be reworded at the same time.

New comment added for XLogRecGetBlockInfo().  Wish I could come up
with a better name for that...  Not quite sure what you thought I should
change about XLogRecGetBlockTag().

> +ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
>bool fetching_ckpt, TimeLineID replayTLI)
>  {
> XLogRecord *record;
> +   XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);
>
> nit: maybe name it XLogPrefetcherGetReader()?

OK.

>   * containing it (if not open already), and returns true. When end of standby
>   * mode is triggered by the user, and there is no more WAL available, returns
>   * false.
> + *
> + * If nonblocking is

Re: WIP: WAL prefetch (another approach)

2022-03-21 Thread Julien Rouhaud
Hi,

On Sun, Mar 20, 2022 at 05:36:38PM +1300, Thomas Munro wrote:
> On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro  wrote:
> > I'll push 0001 today to let the build farm chew on it for a few days
> > before moving to 0002.
> 
> Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to
> fail occasionally, but that predates the above commit.  I didn't
> follow the existing discussion on that, so I'll try to look into that
> tomorrow.
> 
> Here's a rebase of the 0002 patch, now called 0001

So I finally finished looking at this patch.  Here again, AFAICS the feature is
working as expected and I didn't find any problem.  I just have some minor
comments, like for the previous patch.

For the docs:

+Whether to try to prefetch blocks that are referenced in the WAL that
+are not yet in the buffer pool, during recovery.  Valid values are
+off (the default), on and
+try.  The setting try enables
+prefetching only if the operating system provides the
+posix_fadvise function, which is currently used
+to implement prefetching.  Note that some operating systems provide the
+function, but don't actually perform any prefetching.

Is there any reason not to change it to try?  I'm wondering if some system says
that the function exists but simply raise an error if you actually try to use
it.  I think that at least WSL does that for some functions.

+  
+   The  parameter can
+   be used to improve I/O performance during recovery by instructing
+   PostgreSQL to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   PostgreSQL's buffer pool.
+   The  and
+settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  

I think that "improving I/O performance" is a bit misleading, maybe reduce I/O
wait time or something like that?  Also, I don't know if we need to be that
precise, but maybe we should say that it's the underlying kernel that will
(asynchronously) initiate the reads, and postgres will simply notify it.


+  
+   The pg_stat_prefetch_recovery view will contain 
only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See 
+   for more information.
+  

That's not the implemented behavior as far as I can see.  It just prints 
whatever is in SharedStats
regardless of the recovery state or the prefetch_wal setting (assuming that
there's no pending reset request).  Similarly, there's a mention that
pg_stat_reset_shared('wal') will reset the stats, but I don't see anything
calling XLogPrefetchRequestResetStats().

Finally, I think we should documented what are the cumulated counters in that
view (that should get reset) and the dynamic counters (that shouldn't get
reset).

For the code:

 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+   return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+   RelFileNode *rnode, ForkNumber *forknum,
+   BlockNumber *blknum,
+   Buffer *prefetch_buffer)
 {

It's missing comments on that function.  XLogRecGetBlockTag comments should
probably be reworded at the same time.

+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
   bool fetching_ckpt, TimeLineID replayTLI)
 {
XLogRecord *record;
+   XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);

nit: maybe name it XLogPrefetcherGetReader()?

  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,

The comment still mentions a couple of time returning true/false rather than
XLREAD_*, same for at least XLogPageRead().

@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool 
randAccess,
 */
if (lastSourceFailed)
{
+   /*
+* Don't allow any retry loops to occur during nonblocking
+* readahead.  Let the caller process everything that has been
+* decoded already first.
+*/
+   if (nonblocking)
+   return XLREAD_WOULDBLOCK;

Is that really enough?  I'm wondering if the code path in ReadRecord() that
forces lastSourceFailed to False while it actually failed when switching into
archive recovery (xlogrecovery.c around line 3044) can be problematic here.


{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
gettext_noo

Re: WIP: WAL prefetch (another approach)

2022-03-20 Thread Thomas Munro
On Sun, Mar 20, 2022 at 5:36 PM Thomas Munro  wrote:
> Clearly 018_wal_optimize.pl is flapping

Correction, 019_replslot_limit.pl, discussed at
https://www.postgresql.org/message-id/flat/83b46e5f-2a52-86aa-fa6c-8174908174b8%40iki.fi
.




Re: WIP: WAL prefetch (another approach)

2022-03-19 Thread Thomas Munro
On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro  wrote:
> I'll push 0001 today to let the build farm chew on it for a few days
> before moving to 0002.

Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to
fail occasionally, but that predates the above commit.  I didn't
follow the existing discussion on that, so I'll try to look into that
tomorrow.

Here's a rebase of the 0002 patch, now called 0001
From 3ac04122e635b98c50d6e48677fe74535d631388 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 20 Mar 2022 16:56:12 +1300
Subject: [PATCH v24] Prefetch referenced data in recovery, take II.

Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Since not all OSes have that system call, "try" is provided so that
it can be enabled on operating systems where it is available, and that
is used in 027_stream_regress.pl so that we effectively exercise on and
off behaviors in the build farm.  Better mechanisms will follow in later
work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size limits the maximum distance we are
prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Tomas Vondra 
Reviewed-by: Alvaro Herrera  (earlier version)
Reviewed-by: Andres Freund  (earlier version)
Reviewed-by: Justin Pryzby  (earlier version)
Tested-by: Tomas Vondra  (earlier version)
Tested-by: Jakub Wartak  (earlier version)
Tested-by: Dmitry Dolgov <9erthali...@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci  (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  64 ++
 doc/src/sgml/monitoring.sgml  |  77 +-
 doc/src/sgml/wal.sgml |  12 +
 src/backend/access/transam/Makefile   |   1 +
 src/backend/access/transam/xlog.c |   2 +
 src/backend/access/transam/xlogprefetcher.c   | 968 ++
 src/backend/access/transam/xlogreader.c   |  13 +
 src/backend/access/transam/xlogrecovery.c | 160 ++-
 src/backend/access/transam/xlogutils.c|  27 +-
 src/backend/catalog/system_views.sql  |  13 +
 src/backend/storage/freespace/freespace.c |   3 +-
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/utils/misc/guc.c  |  53 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h |   1 +
 src/include/access/xlogprefetcher.h   |  51 +
 src/include/access/xlogreader.h   |   8 +
 src/include/access/xlogutils.h|   3 +-
 src/include/catalog/pg_proc.dat   |   8 +
 src/include/utils/guc.h   |   4 +
 src/test/recovery/t/027_stream_regress.pl |   3 +
 src/test/regress/expected/rules.out   |  10 +
 src/tools/pgindent/typedefs.list  |   7 +
 23 files changed, 1434 insertions(+), 62 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a48973b3c..ce84f379a8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3644,6 +3644,70 @@ include_dir 'conf.d'
  
 
 
+   
+
+Recovery
+
+ 
+  configuration
+  of recovery
+  general settings
+ 
+
+
+ This section describes the settings that apply to recovery in general,
+ affecting crash recovery, streaming replication and archive-based
+ replication.
+
+
+
+
+ 
+  recovery_prefetch (enum)
+  
+   recovery_prefetch configuration parameter
+  
+  
+  
+   
+Whether to try to prefetch blocks that are referenced in the WAL that
+are not yet in the buffer pool, during recovery.  Valid values are
+off (the default), on and
+try.  The setting try enables
+prefetching only if the operating system provides the
+posix_fadvise function, which is currently used
+to implement prefetching.  Note that some operating systems provide the
+function, but don't actually perform any prefetching.
+   
+   
+Prefetching blocks that will soon be needed can reduce I/O wait times
+during recovery with some workloads.
+See also the  and
+ settings, which limit
+prefetching activity.
+   
+  
+ 
+
+ 
+  wal_decode_buffer_size (integer)
+  
+   wal_decode_buffer_size configuration parameter
+  
+  
+

Re: WIP: WAL prefetch (another approach)

2022-03-17 Thread Thomas Munro
On Mon, Mar 14, 2022 at 8:17 PM Julien Rouhaud  wrote:
> Great!  I'm happy with 0001 and I think it's good to go!

I'll push 0001 today to let the build farm chew on it for a few days
before moving to 0002.




Re: WIP: WAL prefetch (another approach)

2022-03-14 Thread Julien Rouhaud
On Mon, Mar 14, 2022 at 06:15:59PM +1300, Thomas Munro wrote:
> On Fri, Mar 11, 2022 at 9:27 PM Julien Rouhaud  wrote:
> > > > Also, is it worth an assert (likely at the top of the function) for 
> > > > that?
> > >
> > > How could I assert that EndRecPtr has the right value?
> >
> > Sorry, I meant to assert that some value was assigned 
> > (!XLogRecPtrIsInvalid).
> > It can only make sure that the first call is done after XLogBeginRead /
> > XLogFindNextRecord, but that's better than nothing and consistent with the 
> > top
> > comment.
>
> Done.

Just a small detail: I would move that assert at the top of the function as it
should always be valid.
>
> I also fixed the compile failure with -DWAL_DEBUG, and checked that
> output looks sane with wal_debug=on.

Great!  I'm happy with 0001 and I think it's good to go!
>
> > > The other thing I need to change is that I should turn on
> > > recovery_prefetch for platforms that support it (ie Linux and maybe
> > > NetBSD only for now), in the tests.  Right now you need to put
> > > recovery_prefetch=on in a file and then run the tests with
> > > "TEMP_CONFIG=path_to_that make -C src/test/recovery check" to
> > > excercise much of 0002.
> >
> > +1 with Andres' idea to have a "try" setting.
>
> Done.  The default is still "off" for now, but in
> 027_stream_regress.pl I set it to "try".

Great too!  Unless you want to commit both patches right now I'd like to review
0002 too (this week), as I've barely looked into it so far.




Re: WIP: WAL prefetch (another approach)

2022-03-11 Thread Julien Rouhaud
On Fri, Mar 11, 2022 at 06:31:13PM +1300, Thomas Munro wrote:
> On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud  wrote:
> >
> > This could use XLogRecGetBlock?  Note that this macro is for now never used.
> > xlogreader.c also has some similar forgotten code that could use
> > XLogRecMaxBlockId.
>
> That is true, but I was thinking of it like this: most of the existing
> code that interacts with xlogreader.c is working with the old model,
> where the XLogReader object holds only one "current" record.  For that
> reason the XLogRecXXX() macros continue to work as before, implicitly
> referring to the record that XLogReadRecord() most recently returned.
> For xlogreader.c code, I prefer not to use the XLogRecXXX() macros,
> even when referring to the "current" record, since xlogreader.c has
> switched to a new multi-record model.  In other words, they're sort of
> 'old API' accessors provided for continuity.  Does this make sense?

Ah I see, it does make sense.  I'm wondering if there should be some comment
somewhere on the top of the file to mention it, as otherwise someone may be
tempted to change it to avoid some record->record->xxx usage.

> > +DecodedXLogRecord *
> > +XLogNextRecord(XLogReaderState *state, char **errormsg)
> > +{
> > [...]
> > +   /*
> > +* state->EndRecPtr is expected to have been set by the last call to
> > +* XLogBeginRead() or XLogNextRecord(), and is the location of the
> > +* error.
> > +*/
> > +
> > +   return NULL;
> >
> > The comment should refer to XLogFindNextRecord, not XLogNextRecord?
> 
> No, it does mean to refer to the XLogNextRecord() (ie the last time
> you called XLogNextRecord and successfully dequeued a record, we put
> its end LSN there, so if there is a deferred error, that's the
> corresponding LSN).  Make sense?

It does, thanks!

> 
> > Also, is it worth an assert (likely at the top of the function) for that?
> 
> How could I assert that EndRecPtr has the right value?

Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid).
It can only make sure that the first call is done after XLogBeginRead /
XLogFindNextRecord, but that's better than nothing and consistent with the top
comment.

> > +   if (unlikely(state->decode_buffer == NULL))
> > +   {
> > +   if (state->decode_buffer_size == 0)
> > +   state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
> > +   state->decode_buffer = palloc(state->decode_buffer_size);
> > +   state->decode_buffer_head = state->decode_buffer;
> > +   state->decode_buffer_tail = state->decode_buffer;
> > +   state->free_decode_buffer = true;
> > +   }
> >
> > Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it
> > here too?  Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as
> > the only caller is the recovery prefetching.
> 
> I don't think it matters much?

The thing is that for now the only caller to XLogReaderSetDecodeBuffer (in
0002) only uses it to set the length, so a buffer is actually never passed to
that function.  Since frontend code can rely on a palloc emulation, is there
really a use case to use e.g. some stack buffer there, or something in a
specific memory context?  It seems to be the only use cases for having
XLogReaderSetDecodeBuffer() rather than simply a
XLogReaderSetDecodeBufferSize().  But overall I agree it doesn't matter much,
so no objection to keep it as-is.

> > It's also not particulary obvious why XLogFindNextRecord() doesn't check for
> > this value.  AFAICS callers don't (and should never) call it with a
> > nonblocking == true state, maybe add an assert for that?
> 
> Fair point.  I have now explicitly cleared that flag.  (I don't much
> like state->nonblocking, which might be better as an argument to
> page_read(), but in fact I don't like the fact that page_read
> callbacks are blocking in the first place, which is why I liked
> Horiguchi-san's patch to get rid of that... but that can be a subject
> for later work.)

Agreed.

> >  static void
> >  ResetDecoder(XLogReaderState *state)
> >  {
> > [...]
> > +   /* Reset the decoded record queue, freeing any oversized records. */
> > +   while ((r = state->decode_queue_tail))
> >
> > nit: I think it's better to explicitly check for the assignment being != 
> > NULL,
> > and existing code is more frequently written this way AFAICS.
> 
> I think it's perfectly normal idiomatic C, but if you think it's
> clearer that way, OK, done like that.

The thing I don't like about this form is that you can never be sure that an
assignment was really meant unless you read the rest of the nearby code.  Other
than that, agreed, it's perfectly normal idiomatic C.

> I realised that this version has broken -DWAL_DEBUG.  I'll fix that
> shortly, but I wanted to post this update ASAP, so here's a new
> version.

+ * Returns XLREAD_WOULDBLOCK if he requested data can't be read without
+ * waiting.  This can be returned only if the installed page_read cal

Re: WIP: WAL prefetch (another approach)

2022-03-10 Thread Andres Freund



On March 10, 2022 9:31:13 PM PST, Thomas Munro  wrote:
>  The other thing I need to change is that I should turn on
>recovery_prefetch for platforms that support it (ie Linux and maybe
>NetBSD only for now), in the tests.  

Could a setting of "try" make sense?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: WIP: WAL prefetch (another approach)

2022-03-10 Thread Thomas Munro
On Fri, Mar 11, 2022 at 6:31 PM Thomas Munro  wrote:
> Thanks for your review of 0001!  It gave me a few things to think
> about and some good improvements.

And just in case it's useful, here's what changed between v21 and v22..
diff --git a/src/backend/access/transam/xlogreader.c 
b/src/backend/access/transam/xlogreader.c
index 86a7b4c5c8..0d0c556b7c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -90,8 +90,8 @@ XLogReaderSetDecodeBuffer(XLogReaderState *state, void 
*buffer, size_t size)
 
state->decode_buffer = buffer;
state->decode_buffer_size = size;
-   state->decode_buffer_head = buffer;
state->decode_buffer_tail = buffer;
+   state->decode_buffer_head = buffer;
 }
 
 /*
@@ -271,7 +271,7 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 /*
  * See if we can release the last record that was returned by
- * XLogNextRecord(), to free up space.
+ * XLogNextRecord(), if any, to free up space.
  */
 void
 XLogReleasePreviousRecord(XLogReaderState *state)
@@ -283,16 +283,16 @@ XLogReleasePreviousRecord(XLogReaderState *state)
 
/*
 * Remove it from the decoded record queue.  It must be the oldest item
-* decoded, decode_queue_tail.
+* decoded, decode_queue_head.
 */
record = state->record;
-   Assert(record == state->decode_queue_tail);
+   Assert(record == state->decode_queue_head);
state->record = NULL;
-   state->decode_queue_tail = record->next;
+   state->decode_queue_head = record->next;
 
-   /* It might also be the newest item decoded, decode_queue_head. */
-   if (state->decode_queue_head == record)
-   state->decode_queue_head = NULL;
+   /* It might also be the newest item decoded, decode_queue_tail. */
+   if (state->decode_queue_tail == record)
+   state->decode_queue_tail = NULL;
 
/* Release the space. */
if (unlikely(record->oversized))
@@ -302,11 +302,11 @@ XLogReleasePreviousRecord(XLogReaderState *state)
}
else
{
-   /* It must be the tail record in the decode buffer. */
-   Assert(state->decode_buffer_tail == (char *) record);
+   /* It must be the head (oldest) record in the decode buffer. */
+   Assert(state->decode_buffer_head == (char *) record);
 
/*
-* We need to update tail to point to the next record that is 
in the
+* We need to update head to point to the next record that is 
in the
 * decode buffer, if any, being careful to skip oversized ones
 * (they're not in the decode buffer).
 */
@@ -316,8 +316,8 @@ XLogReleasePreviousRecord(XLogReaderState *state)
 
if (record)
{
-   /* Adjust tail to release space up to the next record. 
*/
-   state->decode_buffer_tail = (char *) record;
+   /* Adjust head to release space up to the next record. 
*/
+   state->decode_buffer_head = (char *) record;
}
else
{
@@ -327,8 +327,8 @@ XLogReleasePreviousRecord(XLogReaderState *state)
 * we'll keep overwriting the same piece of memory if 
we're not
 * doing any prefetching.
 */
-   state->decode_buffer_tail = state->decode_buffer;
state->decode_buffer_head = state->decode_buffer;
+   state->decode_buffer_tail = state->decode_buffer;
}
}
 }
@@ -351,7 +351,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
/* Release the last record returned by XLogNextRecord(). */
XLogReleasePreviousRecord(state);
 
-   if (state->decode_queue_tail == NULL)
+   if (state->decode_queue_head == NULL)
{
*errormsg = NULL;
if (state->errormsg_deferred)
@@ -376,7 +376,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 * XLogRecXXX(xlogreader) macros, which work with the decoder rather 
than
 * the record for historical reasons.
 */
-   state->record = state->decode_queue_tail;
+   state->record = state->decode_queue_head;
 
/*
 * Update the pointers to the beginning and one-past-the-end of this
@@ -428,12 +428,12 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
if (!XLogReaderHasQueuedRecordOrError(state))
XLogReadAhead(state, false /* nonblocking */ );
 
-   /* Consume the tail record or error. */
+   /* Consume the head record or error. */
decoded = XLogNextRecord(state, errormsg);
if (decoded)
{
/*
-* XLogReadRecord() returns a pointer to the record's header, not the
+

Re: WIP: WAL prefetch (another approach)

2022-03-08 Thread Julien Rouhaud
Hi,

On Tue, Mar 08, 2022 at 06:15:43PM +1300, Thomas Munro wrote:
> On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro  wrote:
> > https://github.com/macdice/postgres/tree/recovery-prefetch-ii
>
> Here's a rebase.  This mostly involved moving hunks over to the new
> xlogrecovery.c file.  One thing that seemed a little strange to me
> with the new layout is that xlogreader is now a global variable.  I
> followed that pattern and made xlogprefetcher a global variable too,
> for now.

For now I only went through 0001; TL;DR: the patch looks good to me.  I have a few
minor comments though, mostly to make things a bit clearer (at least to me).

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..c129df44ac 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 * add an accessor macro for this.
 */
*fpi_len = 0;
+   for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
{
if (XLogRecHasBlockImage(record, block_id))
-   *fpi_len += record->blocks[block_id].bimg_len;
+   *fpi_len += record->record->blocks[block_id].bimg_len;
}
(and similar in that file, xlogutils.c and xlogreader.c)

This could use XLogRecGetBlock?  Note that this macro is for now never used.

xlogreader.c also has some similar forgotten code that could use
XLogRecMaxBlockId.
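
To make that concrete, the loop could end up looking roughly like the sketch
below (illustration only -- I'm assuming XLogRecGetBlock() hands back a pointer
to the decoded block reference, which is how I read the macro):

    /* hypothetical rewrite using the accessor macros, not taken from the patch */
    *fpi_len = 0;
    for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
    {
        if (XLogRecHasBlockImage(record, block_id))
            *fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
    }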


+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)

The comment seems a bit misleading: on first read I took it to mean that the
release could be optional even when a record exists.  Maybe something more like
"Release the last record, if any"?


+* Remove it from the decoded record queue.  It must be the oldest item
+* decoded, decode_queue_tail.
+*/
+   record = state->record;
+   Assert(record == state->decode_queue_tail);
+   state->record = NULL;
+   state->decode_queue_tail = record->next;

The naming is a bit counterintuitive to me: before reading the rest of the
code I wasn't expecting the item at the tail of the queue to have a next
element.  Maybe just inverting tail and head would make it clearer?
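
To illustrate the convention I have in mind (purely a sketch; the
DecodedRecordQueue struct and queue_pop_oldest() below are made up, not part of
the patch):

    /* head = oldest decoded record, consumed first; tail = newest, appended last */
    typedef struct DecodedRecordQueue
    {
        DecodedXLogRecord *head;
        DecodedXLogRecord *tail;
    } DecodedRecordQueue;

    static DecodedXLogRecord *
    queue_pop_oldest(DecodedRecordQueue *queue)
    {
        DecodedXLogRecord *record = queue->head;

        if (record == NULL)
            return NULL;
        queue->head = record->next;
        if (queue->tail == record)  /* it was also the newest entry */
            queue->tail = NULL;
        return record;
    }

With that naming, the element whose "next" pointer gets followed is always the
head (the oldest one), which reads more naturally to me.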


+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+   /*
+* state->EndRecPtr is expected to have been set by the last call to
+* XLogBeginRead() or XLogNextRecord(), and is the location of the
+* error.
+*/
+
+   return NULL;

The comment should refer to XLogFindNextRecord, not XLogNextRecord?
Also, is it worth an assert (likely at the top of the function) for that?
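
For instance, something like this at the top of the function, just to make the
suggestion concrete (a hypothetical assertion; whether this is the right
invariant to check is up to you):

    Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));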


 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+   if (decoded)
+   {
+   /*
+* XLogReadRecord() returns a pointer to the record's header, not the
+* actual decoded record.  The caller will access the decoded record
+* through the XLogRecGetXXX() macros, which reach the decoded
+* recorded as xlogreader->record.
+*/
+   Assert(state->record == decoded);
+   return &decoded->header;

I find it a bit weird to mention XLogReadRecord() as it's the current function.


+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)

Is it worth clearly stating that it's the responsibility of the caller to update
the decode_buffer_head (with the real size) after successfully decoding into this
buffer?
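
In other words, document that the expected calling pattern is roughly the
sketch below (illustration only: decode_one_record() is a made-up placeholder
for the actual decoding step, field names follow the patch as posted, and I'm
assuming the decoded record remembers its own allocated size in ->size):

    decoded = XLogReadRecordAlloc(state, total_len, true /* allow_oversized */ );
    if (decoded == NULL)
        return NULL;            /* no memory for an oversized record */

    if (decode_one_record(state, decoded))
    {
        /* only now does the buffer's write position move past this record */
        if (!decoded->oversized)
            state->decode_buffer_head += MAXALIGN(decoded->size);
    }
    else if (decoded->oversized)
        pfree(decoded);         /* oversized records live outside the decode buffer */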


+   if (unlikely(state->decode_buffer == NULL))
+   {
+   if (state->decode_buffer_size == 0)
+   state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+   state->decode_buffer = palloc(state->decode_buffer_size);
+   state->decode_buffer_head = state->decode_buffer;
+   state->decode_buffer_tail = state->decode_buffer;
+   state->free_decode_buffer = true;
+   }

Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it
here too?  Otherwise XLogReaderSetDecodeBuffer should probably go in 0002, as
its only caller is the recovery prefetching patch.
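
Something along these lines is what I had in mind (only a sketch of the
suggestion, not tested):

    /*
     * Hypothetical variant: buffer == NULL asks the reader to allocate
     * (and later free) a decode buffer of the requested size itself.
     */
    void
    XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
    {
        if (size == 0)
            size = DEFAULT_DECODE_BUFFER_SIZE;
        if (buffer == NULL)
        {
            buffer = palloc(size);
            state->free_decode_buffer = true;
        }
        state->decode_buffer = buffer;
        state->decode_buffer_size = size;
        state->decode_buffer_head = buffer;
        state->decode_buffer_tail = buffer;
    }

XLogReadRecordAlloc() could then just call it with a NULL buffer when
state->decode_buffer is still unset.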

+   return decoded;
+}

I would find it a bit clearer to explicitly return NULL here.


readOff = ReadPageInternal(state, targetPagePtr,
   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-   if (readOff < 0)
+   if (readOff == XLREAD_WOULDBLOCK)
+   return XLREAD_WOULDBLOCK;
+   else if (readOff < 0)

ReadPageInternal's comment should be updated to mention the new XLREAD_WOULDBLOCK
return value.

Re: WIP: WAL prefetch (another approach)

2022-03-08 Thread Andres Freund
Hi,

On 2022-03-08 18:15:43 +1300, Thomas Munro wrote:
> I'm now starting to think about committing this soon.

+1

Are you thinking of committing both patches at once, or with a bit of
distance?

I think something in the regression tests ought to enable
recovery_prefetch. 027_stream_regress or 001_stream_rep seem like the obvious
candidates?


- Andres




Re: WIP: WAL prefetch (another approach)

2022-03-08 Thread Tomas Vondra



On 3/8/22 06:15, Thomas Munro wrote:
> On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro  wrote:
>> https://github.com/macdice/postgres/tree/recovery-prefetch-ii
> 
> Here's a rebase.  This mostly involved moving hunks over to the new
> xlogrecovery.c file.  One thing that seemed a little strange to me
> with the new layout is that xlogreader is now a global variable.  I
> followed that pattern and made xlogprefetcher a global variable too,
> for now.
> 
> There is one functional change: now I block readahead at records that
> might change the timeline ID.  This removes the need to think about
> scenarios where "replay TLI" and "read TLI" might differ.  I don't
> know of a concrete problem in that area with the previous version, but
> the recent introduction of the variable(s) "replayTLI" and associated
> comments in master made me realise I hadn't analysed the hazards here
> enough.  Since timelines are tricky things and timeline changes are
> extremely infrequent, it seemed better to simplify matters by putting
> up a big road block there.
> 
> I'm now starting to think about committing this soon.

+1. I don't have the capacity/hardware to do more testing at the moment,
but all of this looks reasonable.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2021-12-29 Thread Andres Freund
Hi,

On 2021-12-29 17:29:52 +1300, Thomas Munro wrote:
> > FWIW I don't think we include updates to typedefs.list in patches.
> 
> Seems pretty harmless? And useful to keep around in development
> branches because I like to pgindent stuff...

I think it's even helpful. As long as it's done with a bit of manual
oversight, I don't see a meaningful downside of doing so. One needs to be
careful to not remove platform-dependent typedefs, but that's it. And
especially for long-lived feature branches it's much less work to keep the
typedefs.list changes in the tree, rather than coming up with them locally
over and over / across multiple people working on a branch.

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-12-28 Thread Tom Lane
Thomas Munro  writes:
>> FWIW I don't think we include updates to typedefs.list in patches.

> Seems pretty harmless? And useful to keep around in development
> branches because I like to pgindent stuff...

As far as that goes, my habit is to pull down
https://buildfarm.postgresql.org/cgi-bin/typedefs.pl
on a regular basis and pgindent against that.  There have been
some discussions about formalizing that process a bit more,
but we've not come to any conclusions.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Tom Lane
Greg Stark  writes:
> But the bigger question is. Are we really concerned about this flaky
> problem? Is it worth investing time and money on? I can get money to
> go buy a G4 or G5 and spend some time on it. It just seems a bit...
> niche. But if it's a real bug that represents something broken on
> other architectures that just happens to be easier to trigger here it
> might be worthwhile.

TBH, I don't know.  There seem to be three plausible explanations:

1. Flaky hardware in my unit.
2. Ancient macOS bug, as Andres suggested upthread.
3. Actual PG bug.

If it's #1 or #2 then we're just wasting our time here.  I'm not
sure how to estimate the relative probabilities, but I suspect
#3 is the least likely of the lot.

FWIW, I did just reproduce the problem on that machine with current HEAD:

2021-12-17 18:40:40.293 EST [21369] FATAL:  inconsistent page found, rel 1663/167772/2673, forknum 0, blkno 26
2021-12-17 18:40:40.293 EST [21369] CONTEXT:  WAL redo at C/3DE3F658 for Btree/INSERT_LEAF: off 208; blkref #0: rel 1663/167772/2673, blk 26 FPW
2021-12-17 18:40:40.522 EST [21365] LOG:  startup process (PID 21369) exited with exit code 1

That was after only five loops of the regression tests, so either
I got lucky or the failure probability has increased again.

In any case, it seems clear that the problem exists independently of
Munro's patches, so I don't really think this question should be
considered a blocker for those.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Greg Stark
On Fri, 17 Dec 2021 at 18:40, Tom Lane  wrote:
>
> Greg Stark  writes:
> > Hm. I seem to have picked a bad checkout. I took the last one before
> > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).
>
> FWIW, I think that's the first one *after* the revert.

Doh

But the bigger question is. Are we really concerned about this flaky
problem? Is it worth investing time and money on? I can get money to
go buy a G4 or G5 and spend some time on it. It just seems a bit...
niche. But if it's a real bug that represents something broken on
other architectures that just happens to be easier to trigger here it
might be worthwhile.

-- 
greg




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Tom Lane
Greg Stark  writes:
> Hm. I seem to have picked a bad checkout. I took the last one before
> the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).

FWIW, I think that's the first one *after* the revert.

> 2021-12-17 17:51:51.688 EST [50955] LOG:  background worker "parallel
> worker" (PID 54073) was terminated by signal 10: Bus error

I'm betting on weird emulation issue.  None of my real PPC machines
showed such things.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Tomas Vondra

On 12/17/21 23:56, Greg Stark wrote:

Hm. I seem to have picked a bad checkout. I took the last one before
the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some
incompatibility with the emulation and the IPC stuff parallel workers
use.


2021-12-17 17:51:51.688 EST [50955] LOG:  background worker "parallel
worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL:  Failed process was
running: SELECT variance(unique1::int4), sum(unique1::int8),
regr_count(unique1::float8, unique1::float8)
FROM (SELECT * FROM tenk1
   UNION ALL SELECT * FROM tenk1
   UNION ALL SELECT * FROM tenk1
   UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG:  terminating any other active
server processes
2021-12-17 17:51:51.748 EST [54078] FATAL:  the database system is in
recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG:  all server processes
terminated; reinitializing



Interesting. In my experience SIGBUS on PPC tends to be due to incorrect 
alignment, but I'm not sure how that works with the emulation. Can you 
get a backtrace?


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Greg Stark
Hm. I seem to have picked a bad checkout. I took the last one before
the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some
incompatibility with the emulation and the IPC stuff parallel workers
use.


2021-12-17 17:51:51.688 EST [50955] LOG:  background worker "parallel
worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL:  Failed process was
running: SELECT variance(unique1::int4), sum(unique1::int8),
regr_count(unique1::float8, unique1::float8)
FROM (SELECT * FROM tenk1
  UNION ALL SELECT * FROM tenk1
  UNION ALL SELECT * FROM tenk1
  UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG:  terminating any other active
server processes
2021-12-17 17:51:51.748 EST [54078] FATAL:  the database system is in
recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG:  all server processes
terminated; reinitializing




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Tom Lane
Greg Stark  writes:
> I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1
> or maybe 4.0.1. What version is on your G4?

$ gcc -v
Using built-in specs.
Target: powerpc-apple-darwin9
Configured with: /var/tmp/gcc/gcc-5493~1/src/configure --disable-checking 
-enable-werror --prefix=/usr --mandir=/share/man 
--enable-languages=c,objc,c++,obj-c++ 
--program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ 
--with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib 
--build=i686-apple-darwin9 --program-prefix= --host=powerpc-apple-darwin9 
--target=powerpc-apple-darwin9
Thread model: posix
gcc version 4.0.1 (Apple Inc. build 5493)

I see that gcc 4.2.1 is also present on this machine, but I've
never used it.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Greg Stark
I have

IBUILD:postgresql gsstark$ ls /usr/bin/*gcc*
/usr/bin/gcc
/usr/bin/gcc-4.0
/usr/bin/gcc-4.2
/usr/bin/i686-apple-darwin9-gcc-4.0.1
/usr/bin/i686-apple-darwin9-gcc-4.2.1
/usr/bin/powerpc-apple-darwin9-gcc-4.0.1
/usr/bin/powerpc-apple-darwin9-gcc-4.2.1

I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1
or maybe 4.0.1. What version is on your G4?




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Tom Lane
Greg Stark  writes:
> What tools and tool versions are you using to build? Is it just GCC for PPC?
> There aren't any special build processes to make a fat binary involved?

Nope, just "configure; make" using that macOS version's regular gcc.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Greg Stark
What tools and tool versions are you using to build? Is it just GCC for PPC?

There aren't any special build processes to make a fat binary involved?

On Thu, 16 Dec 2021 at 23:11, Tom Lane  wrote:
>
> Greg Stark  writes:
> > But if you're interested and can explain the tests to run I can try to
> > get the tests running on this machine:
>
> I'm not sure that machine is close enough to prove much, but by all
> means give it a go if you wish.  My test setup was explained in [1]:
>
> >> To recap, the test lashup is:
> >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
> >> * Standard debug build (--enable-debug --enable-cassert)
> >> * Out-of-the-box configuration, except add wal_consistency_checking = all
> >> and configure a wal-streaming standby on the same machine
> >> * Repeatedly run "make installcheck-parallel", but skip the tablespace
> >> test to avoid issues with the standby trying to use the same directory
> >> * Delay long enough after each installcheck-parallel to let the
> >> standby catch up (the run proper is ~24 min, plus 2 min for catchup)
>
> Remember also that the code in question is not in HEAD; you'd
> need to apply Munro's patches, or check out some commit from
> around 2021-04-22.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us



-- 
greg




Re: WIP: WAL prefetch (another approach)

2021-12-16 Thread Tom Lane
Greg Stark  writes:
> But if you're interested and can explain the tests to run I can try to
> get the tests running on this machine:

I'm not sure that machine is close enough to prove much, but by all
means give it a go if you wish.  My test setup was explained in [1]:

>> To recap, the test lashup is:
>> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
>> * Standard debug build (--enable-debug --enable-cassert)
>> * Out-of-the-box configuration, except add wal_consistency_checking = all
>> and configure a wal-streaming standby on the same machine
>> * Repeatedly run "make installcheck-parallel", but skip the tablespace
>> test to avoid issues with the standby trying to use the same directory
>> * Delay long enough after each installcheck-parallel to let the 
>> standby catch up (the run proper is ~24 min, plus 2 min for catchup)

Remember also that the code in question is not in HEAD; you'd
need to apply Munro's patches, or check out some commit from
around 2021-04-22.

regards, tom lane

[1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us




Re: WIP: WAL prefetch (another approach)

2021-12-16 Thread Greg Stark
The actual hardware of this machine is a Mac Mini Core 2 Duo. I'm not
really clear how the emulation is done and whether it makes a
reasonable test environment or not.

Hardware Overview:

  Model Name: Mac mini
  Model Identifier: Macmini2,1
  Processor Name: Intel Core 2 Duo
  Processor Speed: 2 GHz
  Number Of Processors: 1
  Total Number Of Cores: 2
  L2 Cache: 4 MB
  Memory: 2 GB
  Bus Speed: 667 MHz
  Boot ROM Version: MM21.009A.B00




Re: WIP: WAL prefetch (another approach)

2021-12-16 Thread Greg Stark
On Fri, 26 Nov 2021 at 21:47, Tom Lane  wrote:
>
> Yeah ... on the one hand, that machine has shown signs of
> hard-to-reproduce flakiness, so it's easy to write off the failures
> I saw as hardware issues.  On the other hand, the flakiness I've
> seen has otherwise manifested as kernel crashes, which is nothing
> like the consistent test failures I was seeing with the patch.

Hm. I asked around and found a machine I can use that can run PPC
binaries, but it's actually, well, confusing. I think this is an x86
machine running Leopard which uses JIT to transparently run PPC
binaries. I'm not sure this is really a good test.

But if you're interested and can explain the tests to run I can try to
get the tests running on this machine:

IBUILD:~ gsstark$ uname -a
Darwin IBUILD.MIT.EDU 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15
16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386

IBUILD:~ gsstark$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.5.8
BuildVersion: 9L31a




Re: WIP: WAL prefetch (another approach)

2021-12-13 Thread Robert Haas
On Fri, Nov 26, 2021 at 9:47 PM Tom Lane  wrote:
> Yeah ... on the one hand, that machine has shown signs of
> hard-to-reproduce flakiness, so it's easy to write off the failures
> I saw as hardware issues.  On the other hand, the flakiness I've
> seen has otherwise manifested as kernel crashes, which is nothing
> like the consistent test failures I was seeing with the patch.
>
> Andres speculated that maybe we were seeing a kernel bug that
> affects consistency of concurrent reads and writes.  That could
> be an explanation; but it's just evidence-free speculation so far,
> so I don't feel real convinced by that idea either.
>
> Anyway, I hope to find time to see if the issue still reproduces
> with Thomas' new patch set.

Honestly, all the reasons that Thomas articulated for the revert seem
relatively unimpressive from my point of view. Perhaps they are
sufficient justification for a revert so near to the end of the
development cycle, but that's just an argument for committing things a
little sooner so we have time to work out the kinks. This kind of work
is too valuable to get hung up for a year or three because of a couple
of minor preexisting bugs and/or preexisting maybe-bugs.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP: WAL prefetch (another approach)

2021-12-10 Thread Ashutosh Sharma
Hi Thomas,

I am unable to apply this new set of patches on HEAD. Could you please share
a rebased patch, or if you have a working branch, point me to it? I will refer
to it for the changes.

--
With Regards,
Ashutosh sharma.

On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro  wrote:

> On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson 
> wrote:
> > Could you post an updated version of the patch which is for review?
>
> Sorry for taking so long to come back; I learned some new things that
> made me want to restructure this code a bit (see below).  Here is an
> updated pair of patches that I'm currently testing.
>
> Old problems:
>
> 1.  Last time around, an infinite loop was reported in pg_waldump.  I
> believe Horiguchi-san has fixed that[1], but I'm no longer depending
> on that patch.  I thought his patch set was a good idea, but it's
> complicated and there's enough going on here already... let's consider
> that independently.
>
> This version goes back to what I had earlier, though (I hope) it is
> better about how "nonblocking" states are communicated.  In this
> version, XLogPageRead() has a way to give up part way through a record
> if it doesn't have enough data and there are queued up records that
> could be replayed right now.  In that case, we'll go back to the
> beginning of the record (and occasionally, back a WAL page) next time
> we try.  That's the cost of not maintaining intra-record decoding
> state.
>
> 2.  Last time around, we could try to allocate a crazy amount of
> memory when reading garbage past the end of the WAL.  Fixed, by
> validating first, like in master.
>
> New work:
>
> Since last time, I went away and worked on a "real" AIO version of
> this feature.  That's ongoing experimental work for a future proposal,
> but I have a working prototype and I aim to share that soon, when that
> branch is rebased to catch up with recent changes.  In that version,
> the prefetcher starts actual reads into the buffer pool, and recovery
> receives already pinned buffers attached to the stream of records it's
> replaying.
>
> That inspired a couple of refactoring changes to this non-AIO version,
> to minimise the difference and anticipate the future work better:
>
> 1.  The logic for deciding which block to start prefetching next is
> moved into a new callback function in a sort of standard form (this is
> approximately how all/most prefetching code looks in the AIO project,
> ie sequential scans, bitmap heap scan, etc).
>
> 2.  The logic for controlling how many IOs are running and deciding
> when to call the above is in a separate component.  In this non-AIO
> version, it works using a simple ring buffer of LSNs to estimate the
> number of in flight I/Os, just like before.  This part would be thrown
> away and replaced with the AIO branch's centralised "streaming read"
> mechanism which tracks I/O completions based on a stream of completion
> events from the kernel (or I/O worker processes).
>
> 3.  In this version, the prefetcher still doesn't pin buffers, for
> simplicity.  That work did force me to study places where WAL streams
> need prefetching "barriers", though, so in this patch you can
> see that it's now a little more careful than it probably needs to be.
> (It doesn't really matter much if you call posix_fadvise() on a
> non-existent file region, or the wrong file after OID wraparound and
> reuse, but it would matter if you actually read it into a buffer, and
> if an intervening record might be trying to drop something you have
> pinned).
>
> Some other changes:
>
> 1.  I dropped the GUC recovery_prefetch_fpw.  I think it was a
> possibly useful idea but it's a niche concern and not worth worrying
> about for now.
>
> 2.  I simplified the stats.  Coming up with a good running average
> system seemed like a problem for another day (the numbers before were
> hard to interpret).  The new stats are super simple counters and
> instantaneous values:
>
> postgres=# select * from pg_stat_prefetch_recovery ;
> -[ RECORD 1 ]--+--
> stats_reset| 2021-11-10 09:02:08.590217+13
> prefetch   | 13605674 <- times we called posix_fadvise()
> hit| 24185289 <- times we found pages already cached
> skip_init  | 217215   <- times we did nothing because init, not read
> skip_new   | 192347   <- times we skipped because relation too small
> skip_fpw   | 27429<- times we skipped because fpw, not read
> wal_distance   | 10648<- how far ahead in WAL bytes
> block_distance | 134  <- how far ahead in block references
> io_depth   | 50   <- fadvise() calls not yet followed by pread()
>
> I also removed the code to save and restore the stats via the stats
> collector, for now.  I figured that persistent stats could be a later
> feature, perhaps after the shared memory stats stuff?
>
> 3.  I dropped the code that was caching an SMgrRelation pointer to
> avoid smgropen() calls that showed up in some profiles.  That probably

Re: WIP: WAL prefetch (another approach)

2021-11-26 Thread Tom Lane
Thomas Munro  writes:
> On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra
>  wrote:
>> One thing that's not clear to me is what happened to the reasons why
>> this feature was reverted in the PG14 cycle?

> 3.  A wild goose chase for bugs on Tom Lane's antique 32 bit PPC
> machine.  Tom eventually reproduced it with the patches reverted,
> which seemed to exonerate them but didn't leave a good feeling: what
> was happening, and why did the patches hugely increase the likelihood
> of the failure mode?  I have no new information on that, but I know
> that several people spent a huge amount of time and effort trying to
> reproduce it on various types of systems, as did I, so despite not
> reaching a conclusion of a bug, this certainly contributed to a
> feeling that the patch had run out of steam for the 14 cycle.

Yeah ... on the one hand, that machine has shown signs of
hard-to-reproduce flakiness, so it's easy to write off the failures
I saw as hardware issues.  On the other hand, the flakiness I've
seen has otherwise manifested as kernel crashes, which is nothing
like the consistent test failures I was seeing with the patch.

Andres speculated that maybe we were seeing a kernel bug that
affects consistency of concurrent reads and writes.  That could
be an explanation; but it's just evidence-free speculation so far,
so I don't feel real convinced by that idea either.

Anyway, I hope to find time to see if the issue still reproduces
with Thomas' new patch set.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-11-26 Thread Thomas Munro
On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra
 wrote:
> One thing that's not clear to me is what happened to the reasons why
> this feature was reverted in the PG14 cycle?

Reasons for reverting:

1.  A bug in commit 323cbe7c, "Remove read_page callback from
XLogReader.".  I couldn't easily revert just that piece.  This new
version doesn't depend on that change anymore, to try to keep things
simple.  (That particular bug has been fixed in a newer version of
that patch[1], which I still think was a good idea incidentally.)
2.  A bug where allocation for large records happened before
validation.  Concretely, you can see that this patch does
XLogReadRecordAlloc() after validating the header (usually, same as
master), but commit f003d9f8 did it first.  (Though Andres pointed
out[2] that more work is needed on that to make that logic more
robust, and I'm keen to look into that, but that's independent of this
work).
3.  A wild goose chase for bugs on Tom Lane's antique 32 bit PPC
machine.  Tom eventually reproduced it with the patches reverted,
which seemed to exonerate them but didn't leave a good feeling: what
was happening, and why did the patches hugely increase the likelihood
of the failure mode?  I have no new information on that, but I know
that several people spent a huge amount of time and effort trying to
reproduce it on various types of systems, as did I, so despite not
reaching a conclusion of a bug, this certainly contributed to a
feeling that the patch had run out of steam for the 14 cycle.

This week I'll have another crack at getting that TAP test I proposed
that runs the regression tests with a streaming replica to work on
Windows.  That does approximately what Tom was doing when he saw
problem #3, which I'd like to have as standard across the build farm.

[1] 
https://www.postgresql.org/message-id/20211007.172820.1874635561738958207.horikyota.ntt%40gmail.com
[2] 
https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg%40alap3.anarazel.de




Re: WIP: WAL prefetch (another approach)

2021-11-26 Thread Tomas Vondra

On 11/26/21 22:16, Thomas Munro wrote:

On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra
 wrote:

The results are pretty good / similar to previous results. Replaying the
1h worth of work on a smaller machine takes ~5:30h without prefetching
(master or with prefetching disabled). With prefetching enabled this
drops to ~2h (default config) and ~1h (with tuning).


Thanks for testing!  Wow, that's a nice graph.

This has bit-rotted already due to Robert's work on ripping out
globals, so I'll post a rebase early next week, and incorporate your
code feedback.



One thing that's not clear to me is what happened to the reasons why 
this feature was reverted in the PG14 cycle?


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2021-11-26 Thread Thomas Munro
On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra
 wrote:
> The results are pretty good / similar to previous results. Replaying the
> 1h worth of work on a smaller machine takes ~5:30h without prefetching
> (master or with prefetching disabled). With prefetching enabled this
> drops to ~2h (default config) and ~1h (with tuning).

Thanks for testing!  Wow, that's a nice graph.

This has bit-rotted already due to Robert's work on ripping out
globals, so I'll post a rebase early next week, and incorporate your
code feedback.




Re: WIP: WAL prefetch (another approach)

2021-11-25 Thread Tomas Vondra
Hi,

It's great you posted a new version of this patch, so I took a brief look
at it. The code seems in pretty good shape, I haven't found
any real issues - just two minor comments:

This seems a bit strange:

#define DEFAULT_DECODE_BUFFER_SIZE 0x1

Why not define this as a simple decimal value? Is there something
special about this particular value, or is it arbitrary? I guess it's
simply the minimum for the wal_decode_buffer_size GUC, but why not use
the GUC in all the places that decode WAL?

FWIW I don't think we include updates to typedefs.list in patches.


I also repeated the benchmarks I did at the beginning of the year [1].
Attached is a chart with four different configurations:

1) master (f79962d826)

2) patched (with prefetching disabled)

3) patched (with default configuration)

4) patched (with I/O concurrency 256 and 2MB decode buffer)

For all configs the shared buffers were set to 64GB, checkpoints every
20 minutes, etc.

The results are pretty good / similar to previous results. Replaying the
1h worth of work on a smaller machine takes ~5:30h without prefetching
(master or with prefetching disabled). With prefetching enabled this
drops to ~2h (default config) and ~1h (with tuning).

regards


[1]
https://www.postgresql.org/message-id/c5d52837-6256-0556-ac8c-d6d3d558820a%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: WAL prefetch (another approach)

2021-11-15 Thread Daniel Gustafsson
> On 10 May 2021, at 06:11, Thomas Munro  wrote:
> On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost  wrote:

>> I tend to agree with the idea to revert it, perhaps a +0 on that, but if 
>> others argue it should be fixed in-place, I wouldn’t complain about it.
> 
> Reverted.
> 
> Note: eelpout may return a couple of failures because it's set up to
> run with recovery_prefetch=on (now an unknown GUC), and it'll be a few
> hours before I can access that machine to adjust that...
> 
>> I very much encourage the idea of improving testing in this area and would 
>> be happy to try and help do so in the 15 cycle.
> 
> Cool.  I'm going to try out some ideas.

Skimming this thread without all the context, it's not entirely clear which
patch the CF entry relates to (I assume it's the one from April 7 based on the
attached mail-id, but there is a revert from May?), and the CF app and CF bot
also disagree about which one is the latest.

Could you post an updated version of the patch which is for review?

--
Daniel Gustafsson   https://vmware.com/





Re: WIP: WAL prefetch (another approach)

2021-05-09 Thread Thomas Munro
On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost  wrote:
> On Wed, Apr 21, 2021 at 19:17 Thomas Munro  wrote:
>> On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro  wrote:
>> ... Personally I think the right thing to do now is to revert it
>> and re-propose for 15 early in the cycle, supported with some better
>> testing infrastructure.
>
> I tend to agree with the idea to revert it, perhaps a +0 on that, but if 
> others argue it should be fixed in-place, I wouldn’t complain about it.

Reverted.

Note: eelpout may return a couple of failures because it's set up to
run with recovery_prefetch=on (now an unknown GUC), and it'll be a few
hours before I can access that machine to adjust that...

> I very much encourage the idea of improving testing in this area and would be 
> happy to try and help do so in the 15 cycle.

Cool.  I'm going to try out some ideas.




Re: WIP: WAL prefetch (another approach)

2021-05-06 Thread Andres Freund
Hi,

On 2021-05-04 18:08:35 -0700, Andres Freund wrote:
> But the issue that 70b4f82a4b is trying to address seems bigger to
> me. The reason it's so easy to hit the issue is that walreceiver does <
> 8KB writes into recycled WAL segments *without* zero-filling the tail
> end of the page - which will commonly be filled with random older
> contents, because we'll use a recycled segment. I think that
> *drastically* increases the likelihood of finding something that looks
> like a valid record header compared to the situation on a primary where
> the zeroing pages before use makes that pretty unlikely.

I've written an experimental patch to deal with this and, as expected,
it does make the end-of-wal detection a lot more predictable and
reliable. There's only two types of possible errors outside of crashes:
A record length of 0 (the end of WAL is within a page), and the page
header LSN mismatching (the end of WAL is at a page boundary).

This seems like a significant improvement.

However: It's nontrivial to do this nicely and in a backpatchable way in
XLogWalRcvWrite(). Or at least I haven't found a good way:
- We can't extend the input buffer to XLogWalRcvWrite(), it's from
  libpq.
- We don't want to copy the entire buffer (commonly 128KiB) to a new
  buffer that we then can extend by 0-BLCKSZ of zeroes to cover the
  trailing part of the last page.
- In PG13+ we can do this utilizing pg_writev(), adding another IOV
  entry covering the trailing space to be padded.
- It's nicer to avoid increasing the number of write() calls, but it's
  not as crucial as the earlier points.

I'm also a bit uncomfortable with another aspect, although I can't
really see a problem: when switching to receiving WAL via walreceiver, we
always start at a segment boundary, even if we had received most of that
segment before. Currently that won't end up with any trailing space that
needs to be zeroed, because the server always will send 128KB chunks,
but there's no formal guarantee for that.  It seems a bit odd that we
could end up zeroing trailing space that already contains valid data,
just to overwrite it with valid data again.  But it ought to always be
fine.

The least offensive way I could come up with is for XLogWalRcvWrite() to
always write partial pages in a separate pg_pwrite(). When writing a
partial page, and the previous write position was not already on that
same page, copy the buffer into a local XLOG_BLCKSZ sized buffer
(although we'll never use more than XLOG_BLCKSZ-1 I think), and (re)zero
out the trailing part.  One thing this does not yet handle is a partial
write - in that case we would not notice again that we need to pad the
end of the page.
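
A heavily simplified sketch of that idea (the helper below is hypothetical,
not the actual XLogWalRcvWrite(), and it ignores error handling and the
partial-write caveat just mentioned):

static void
write_zero_padded_tail_page(int fd, const char *data, size_t len, off_t offset)
{
	static char page[XLOG_BLCKSZ];

	Assert(len < XLOG_BLCKSZ);
	memcpy(page, data, len);
	/* zero the rest of the page so recycled-segment garbage never follows */
	memset(page + len, 0, XLOG_BLCKSZ - len);
	(void) pg_pwrite(fd, page, XLOG_BLCKSZ, offset);
}

XLogWalRcvWrite() would call something like that for the final, partial page
of a chunk whenever the previous write position wasn't already on that page.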

Does anybody have a better idea?

I really wish we had a version of pg_p{read,write}[v] that internally
handled partial IOs, retrying as long as they see > 0 bytes written.

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-05-06 Thread Andres Freund
Hi,

On 2021-05-04 09:46:12 -0400, Tom Lane wrote:
> Yeah, I have also spent a fair amount of time trying to reproduce it
> elsewhere, without success so far.  Notably, I've been trying on a
> PPC Mac laptop that has a fairly similar CPU to what's in the G4,
> though a far slower disk drive.  So that seems to exclude theories
> based on it being PPC-specific.
>
> I suppose that if we're unable to reproduce it on at least one other box,
> we have to write it off as hardware flakiness.

I wonder if there's a chance what we're seeing is an OS memory ordering
bug, or a race between walreceiver writing data and the startup process
reading it.

When the startup process is able to keep up, there often will be a very
small time delta between the startup process reading a page that the
walreceiver just wrote. And if the currently read page was the tail page
written to by a 'w' message, it'll often be written to again in short
order - potentially while the startup process is reading it.

It'd not terribly surprise me if an old OS version on an old processor
had some issues around that.


Were there any cases of walsender terminating and reconnecting around
the failures?


It looks suspicious that XLogPageRead() does not invalidate the
xlogreader state when retrying.  Normally that's xlogreader's
responsibility, but there is that whole XLogReaderValidatePageHeader()
business. But I don't quite see how it'd actually cause problems.

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-05-04 Thread Andres Freund
Hi,

On 2021-05-04 15:47:41 -0400, Tom Lane wrote:
> BTW, that conclusion shouldn't distract us from the very real bug
> that Andres identified.  I was just scraping the buildfarm logs
> concerning recent failures, and I found several recent cases
> that match the symptom he reported:
> [...]
> They all show the standby in recovery/019_replslot_limit.pl failing
> with symptoms like
>
> 2021-05-04 07:42:00.968 UTC [24707406:1] LOG:  database system was shut down in recovery at 2021-05-04 07:41:39 UTC
> 2021-05-04 07:42:00.968 UTC [24707406:2] LOG:  entering standby mode
> 2021-05-04 07:42:01.050 UTC [24707406:3] LOG:  redo starts at 0/1C000D8
> 2021-05-04 07:42:01.079 UTC [24707406:4] LOG:  consistent recovery state reached at 0/1D0
> 2021-05-04 07:42:01.079 UTC [24707406:5] FATAL:  invalid memory alloc request size 1476397045
> 2021-05-04 07:42:01.080 UTC [13238274:3] LOG:  database system is ready to accept read only connections
> 2021-05-04 07:42:01.082 UTC [13238274:4] LOG:  startup process (PID 24707406) exited with exit code 1

Yea, that's the pre-existing end-of-log-issue that got more likely as
well as more consequential (by accident) in Thomas' patch. It's easy to
reach parity with the state in 13, it's just changing the order in one
place.

But I think we need to do something for all branches here. The bandaid
that was added to allocate_recordbuf() does doesn't really seems
sufficient to me. This is

commit 70b4f82a4b5cab5fc12ff876235835053e407155
Author: Michael Paquier 
Date:   2018-06-18 10:43:27 +0900

Prevent hard failures of standbys caused by recycled WAL segments

In <= 13 the current state is that we'll allocate an effectively random
number of bytes, as long as that random number is below 1GB, whenever we reach
the end of the WAL with the record header on a page boundary (because there we
don't validate the header before allocating). That allocation is then not freed for the lifetime of the
xlogreader.  And for FRONTEND uses of xlogreader we'll just happily
allocate 4GB.  The specific problem here is that we don't validate the
record header before allocating when the record header is split across a
page boundary - without much need as far as I can tell? Until we've read
the entire header, we actually don't need to allocate the record buffer?

This seems like an issue that needs to be fixed to be more robust in
crash recovery scenarios where obviously we could just have failed with
half written records.
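
I.e. roughly this ordering (a simplified sketch of the idea, with the
header-reading step reduced to a made-up read_full_record_header() helper and
the real signatures trimmed):

	/* read the complete record header first, even if it spans a page boundary */
	if (!read_full_record_header(state, RecPtr, &header))
		return NULL;
	/* validate it before trusting anything in it */
	if (!ValidXLogRecordHeader(state, RecPtr, PrevRecPtr, &header, errormsg))
		return NULL;
	/* only now is xl_tot_len trustworthy enough to size the decode buffer */
	if (!allocate_recordbuf(state, header.xl_tot_len))
		return NULL;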

But the issue that 70b4f82a4b is trying to address seems bigger to
me. The reason it's so easy to hit the issue is that walreceiver does <
8KB writes into recycled WAL segments *without* zero-filling the tail
end of the page - which will commonly be filled with random older
contents, because we'll use a recycled segment. I think that
*drastically* increases the likelihood of finding something that looks
like a valid record header compared to the situation on a primary where
the zeroing pages before use makes that pretty unlikely.


> (BTW, the behavior seen here where the failure occurs *immediately*
> after reporting "consistent recovery state reached" is seen in the
> other reports as well, including Andres' version.  I wonder if that
> means anything.)

That's to be expected, I think. There's not a lot of data that needs to
be replayed, and we'll always reach consistency before the end of the
WAL unless you're dealing with starting from an in-progress base-backup
that hasn't yet finished or such. The test causes replication to fail
shortly after that, so we'll always switch to doing recovery from
pg_wal, which then will hit the end of the WAL, hitting this issue with,
I think, ~25% likelihood (data in recycled WAL segments is probably
*roughly* evenly distributed, and any 4-byte value above 1GB will hit
this error in 14).

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-05-04 Thread Tom Lane
I wrote:
> I suppose that if we're unable to reproduce it on at least one other box,
> we have to write it off as hardware flakiness.

BTW, that conclusion shouldn't distract us from the very real bug
that Andres identified.  I was just scraping the buildfarm logs
concerning recent failures, and I found several recent cases
that match the symptom he reported:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2021-04-23%2022%3A27%3A41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-04-21%2005%3A15%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-20%2002%3A03%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-05-04%2004%3A07%3A41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-20%2021%3A08%3A59

They all show the standby in recovery/019_replslot_limit.pl failing
with symptoms like

2021-05-04 07:42:00.968 UTC [24707406:1] LOG:  database system was shut down in recovery at 2021-05-04 07:41:39 UTC
2021-05-04 07:42:00.968 UTC [24707406:2] LOG:  entering standby mode
2021-05-04 07:42:01.050 UTC [24707406:3] LOG:  redo starts at 0/1C000D8
2021-05-04 07:42:01.079 UTC [24707406:4] LOG:  consistent recovery state reached at 0/1D0
2021-05-04 07:42:01.079 UTC [24707406:5] FATAL:  invalid memory alloc request size 1476397045
2021-05-04 07:42:01.080 UTC [13238274:3] LOG:  database system is ready to accept read only connections
2021-05-04 07:42:01.082 UTC [13238274:4] LOG:  startup process (PID 24707406) exited with exit code 1

(BTW, the behavior seen here where the failure occurs *immediately*
after reporting "consistent recovery state reached" is seen in the
other reports as well, including Andres' version.  I wonder if that
means anything.)

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-05-04 Thread Tom Lane
Tomas Vondra  writes:
> On 5/3/21 7:42 AM, Thomas Munro wrote:
>> Hmm, yeah that does seem plausible.  It would be nice to see a report
>> from any other system though.  I'm still trying, and reviewing...

> FWIW I've ran the test (make installcheck-parallel in a loop) on four 
> different machines - two x86_64 ones, and two rpi4. The x86 boxes did 
> ~1000 rounds each (and one of them had 5 local replicas) without any 
> issue. The rpi4 machines did ~50 rounds each, also without failures.

Yeah, I have also spent a fair amount of time trying to reproduce it
elsewhere, without success so far.  Notably, I've been trying on a
PPC Mac laptop that has a fairly similar CPU to what's in the G4,
though a far slower disk drive.  So that seems to exclude theories
based on it being PPC-specific.

I suppose that if we're unable to reproduce it on at least one other box,
we have to write it off as hardware flakiness.  I'm not entirely
comfortable with that answer, but I won't push for reversion of the WAL
patches without more evidence that there's a real issue.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-05-04 Thread Tomas Vondra




On 5/3/21 7:42 AM, Thomas Munro wrote:

On Sun, May 2, 2021 at 3:16 PM Tom Lane  wrote:

That last point means that there was some hard-to-hit problem even
before any of the recent WAL-related changes.  However, 323cbe7c7
(Remove read_page callback from XLogReader) increased the failure
rate by at least a factor of 5, and 1d257577e (Optionally prefetch
referenced data) seems to have increased it by another factor of 4.
But it looks like f003d9f87 (Add circular WAL decoding buffer)
didn't materially change the failure rate.


Oh, wow.  There are several surprising results there.  Thanks for
running those tests for so long so that we could see the rarest
failures.

Even if there are somehow *two* causes of corruption, one preexisting
and one added by the refactoring or decoding patches, I'm struggling
to understand how the chance increases with 1d2575, since that only
adds code that isn't reached when not enabled (though I'm going to
re-review that).


Considering that 323cbe7c7 was supposed to be just refactoring,
and 1d257577e is allegedly disabled-by-default, these are surely
not the results I was expecting to get.


+1


It seems like it's still an open question whether all this is
a real bug, or flaky hardware.  I have seen occasional kernel
freezeups (or so I think -- machine stops responding to keyboard
or network input) over the past year or two, so I cannot in good
conscience rule out the flaky-hardware theory.  But it doesn't
smell like that kind of problem to me.  I think what we're looking
at is a timing-sensitive bug that was there before (maybe long
before?) and these commits happened to make it occur more often
on this particular hardware.  This hardware is enough unlike
anything made in the past decade that it's not hard to credit
that it'd show a timing problem that nobody else can reproduce.


Hmm, yeah that does seem plausible.  It would be nice to see a report
from any other system though.  I'm still trying, and reviewing...



FWIW I've ran the test (make installcheck-parallel in a loop) on four 
different machines - two x86_64 ones, and two rpi4. The x86 boxes did 
~1000 rounds each (and one of them had 5 local replicas) without any 
issue. The rpi4 machines did ~50 rounds each, also without failures.


Obviously, it's possible there's something that neither of those (very 
different systems) triggers, but I'd say it might also be a hint that 
this really is a hw issue on the old ppc macs. Or maybe something very 
specific to that arch.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2021-05-02 Thread Thomas Munro
On Sun, May 2, 2021 at 3:16 PM Tom Lane  wrote:
> That last point means that there was some hard-to-hit problem even
> before any of the recent WAL-related changes.  However, 323cbe7c7
> (Remove read_page callback from XLogReader) increased the failure
> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
> referenced data) seems to have increased it by another factor of 4.
> But it looks like f003d9f87 (Add circular WAL decoding buffer)
> didn't materially change the failure rate.

Oh, wow.  There are several surprising results there.  Thanks for
running those tests for so long so that we could see the rarest
failures.

Even if there are somehow *two* causes of corruption, one preexisting
and one added by the refactoring or decoding patches, I'm struggling
to understand how the chance increases with 1d2575, since that only
adds code that isn't reached when not enabled (though I'm going to
re-review that).

> Considering that 323cbe7c7 was supposed to be just refactoring,
> and 1d257577e is allegedly disabled-by-default, these are surely
> not the results I was expecting to get.

+1

> It seems like it's still an open question whether all this is
> a real bug, or flaky hardware.  I have seen occasional kernel
> freezeups (or so I think -- machine stops responding to keyboard
> or network input) over the past year or two, so I cannot in good
> conscience rule out the flaky-hardware theory.  But it doesn't
> smell like that kind of problem to me.  I think what we're looking
> at is a timing-sensitive bug that was there before (maybe long
> before?) and these commits happened to make it occur more often
> on this particular hardware.  This hardware is enough unlike
> anything made in the past decade that it's not hard to credit
> that it'd show a timing problem that nobody else can reproduce.

Hmm, yeah that does seem plausible.  It would be nice to see a report
from any other system though.  I'm still trying, and reviewing...




Re: WIP: WAL prefetch (another approach)

2021-05-02 Thread Thomas Munro
On Thu, Apr 29, 2021 at 12:24 PM Tom Lane  wrote:
> Andres Freund  writes:
> > On 2021-04-28 19:24:53 -0400, Tom Lane wrote:
> >> IOW, we've spent over twice as many CPU cycles shipping data to the
> >> standby as we did in applying the WAL on the standby.
>
> > I don't really know how the time calculation works on mac. Is there a
> > chance it includes time spent doing IO?

For comparison, on a modern Linux system I see numbers like this,
while running that 025_stream_rep_regress.pl test I posted in a nearby
thread:

USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
tmunro   2150863 22.5  0.0  55348  6752 ?Ss   12:59   0:07
postgres: standby_1: startup recovering 00010002003C
tmunro   2150867 17.5  0.0  55024  6364 ?Ss   12:59   0:05
postgres: standby_1: walreceiver streaming 2/3C675D80
tmunro   2150868 11.7  0.0  55296  7192 ?Ss   12:59   0:04
postgres: primary: walsender tmunro [local] streaming 2/3C675D80

Those ratios are better but it's still hard work, and perf shows the
CPU time is all in page cache schlep:

  22.44%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
  20.12%  postgres  [kernel.kallsyms]   [k] __add_to_page_cache_locked
   7.30%  postgres  [kernel.kallsyms]   [k] iomap_set_page_dirty

That was with all three patches reverted, so it's nothing new.
Definitely room for improvement... there have been a few discussions
about not using a buffered file for high-frequency data exchange and
relaxing various timing rules, which we should definitely look into,
but I wouldn't be at all surprised if HFS+ was just much worse at
this.

Thinking more about good old HFS+... I guess it's remotely possible
that there might have been coherency bugs in it that could be exposed by
our usage pattern, but then that doesn't fit too well with the clues I
have from light reading: this is a non-SMP system, and it's said that
HFS+ used to serialise pretty much everything on big filesystem locks
anyway.




Re: WIP: WAL prefetch (another approach)

2021-05-01 Thread Tom Lane
Thomas Munro  writes:
> On Thu, Apr 29, 2021 at 4:45 AM Tom Lane  wrote:
>> Andres Freund  writes:
>>> Tom, any chance you could check if your machine repros the issue before
>>> these commits?

>> Wilco, but it'll likely take a little while to get results ...

> FWIW I also chewed through many megawatts trying to reproduce this on
> a PowerPC system in 64 bit big endian mode, with an emulator.  No
> cigar.  However, it's so slow that I didn't make it to 10 runs...

So I've expended a lot of kilowatt-hours over the past several days,
and I've got results that are interesting but don't really get us
any closer to a resolution.

To recap, the test lashup is:
* 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
* Standard debug build (--enable-debug --enable-cassert)
* Out-of-the-box configuration, except add wal_consistency_checking = all
and configure a wal-streaming standby on the same machine
* Repeatedly run "make installcheck-parallel", but skip the tablespace
test to avoid issues with the standby trying to use the same directory
* Delay long enough after each installcheck-parallel to let the 
standby catch up (the run proper is ~24 min, plus 2 min for catchup)

The failures I'm seeing generally look like

2021-05-01 15:33:10.968 EDT [8281] FATAL:  inconsistent page found, rel 1663/58186/66338, forknum 0, blkno 19
2021-05-01 15:33:10.968 EDT [8281] CONTEXT:  WAL redo at 3/4CE905B8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/58186/66338, blk 19 FPW

with a variety of WAL record types being named, so it doesn't seem
to be specific to any particular record type.  I've twice gotten the
bogus-checksum-and-then-assertion-failure I reported before:

2021-05-01 17:07:52.992 EDT [17464] LOG:  incorrect resource manager data checksum in record at 3/E0073EA4
TRAP: FailedAssertion("state->recordRemainLen > 0", File: "xlogreader.c", Line: 567, PID: 17464)

In both of those cases, the WAL on disk was perfectly fine, and the same
is true of most of the "inconsistent page" complaints.  So the issue
definitely seems to be about the startup process mis-reading data that
was correctly shipped over.

Anyway, the new and interesting data concerns the relative failure rates
of different builds:

* Recent HEAD (from 4-28 and 5-1): 4 failures in 8 test cycles

* Reverting 1d257577e: 1 failure in 8 test cycles

* Reverting 1d257577e and f003d9f87: 3 failures in 28 cycles

* Reverting 1d257577e, f003d9f87, and 323cbe7c7: 2 failures in 93 cycles

That last point means that there was some hard-to-hit problem even
before any of the recent WAL-related changes.  However, 323cbe7c7
(Remove read_page callback from XLogReader) increased the failure
rate by at least a factor of 5, and 1d257577e (Optionally prefetch
referenced data) seems to have increased it by another factor of 4.
But it looks like f003d9f87 (Add circular WAL decoding buffer)
didn't materially change the failure rate.

Considering that 323cbe7c7 was supposed to be just refactoring,
and 1d257577e is allegedly disabled-by-default, these are surely
not the results I was expecting to get.

It seems like it's still an open question whether all this is
a real bug, or flaky hardware.  I have seen occasional kernel
freezeups (or so I think -- machine stops responding to keyboard
or network input) over the past year or two, so I cannot in good
conscience rule out the flaky-hardware theory.  But it doesn't
smell like that kind of problem to me.  I think what we're looking
at is a timing-sensitive bug that was there before (maybe long
before?) and these commits happened to make it occur more often
on this particular hardware.  This hardware is enough unlike
anything made in the past decade that it's not hard to credit
that it'd show a timing problem that nobody else can reproduce.

(I did try the time-honored ritual of reseating all the machine's
RAM, partway through this.  Doesn't seem to have changed anything.)

Anyway, I'm not sure where to go from here.  I'm for sure nowhere
near being able to identify the bug --- and if there really is
a bug that formerly had a one-in-fifty reproduction rate, I have
zero interest in trying to identify where it started by bisecting.
It'd take at least a day per bisection step, and even that might
not be accurate enough.  (But, if anyone has ideas of specific
commits to test, I'd be willing to try a few.)

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Thomas Munro
On Thu, Apr 29, 2021 at 3:14 PM Andres Freund  wrote:
> To me it looks like a smaller version of the problem is present in < 14,
> albeit only when the page header is at a record boundary. In that case
> we don't validate the page header immediately, only once it's completely
> read. But we do believe the total size, and try to allocate
> that.
>
> There's a really crufty escape hatch (from 70b4f82a4b) to that:

Right, I made that problem worse, and that could probably be changed
to be no worse than 13 by reordering those operations.

PS Sorry for my intermittent/slow responses on this thread this week,
as I'm mostly away from the keyboard due to personal commitments.
I'll be back in the saddle next week to tidy this up, most likely by
reverting.  The main thought I've been having about this whole area is
that, aside from the lack of general testing of recovery, which we
should definitely address[1], what it really needs is a decent test
harness to drive it through all interesting scenarios and states at a
lower level, independently.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Tom Lane
Andres Freund  writes:
> I was now able to reproduce the problem again, and I'm afraid that the
> bug I hit is likely separate from Tom's.

Yeah, I think so --- the symptoms seem quite distinct.

My score so far today on the G4 is:

12 error-free regression test cycles on b3ee4c503

(plus one more with shared_buffers set to 16MB, on the strength
of your previous hunch --- didn't fail for me though)

HEAD failed on the second run with the same symptom as before:

2021-04-28 22:57:17.048 EDT [50479] FATAL:  inconsistent page found, rel 
1663/58183/69545, forknum 0, blkno 696
2021-04-28 22:57:17.048 EDT [50479] CONTEXT:  WAL redo at 4/B72D408 for 
Heap/INSERT: off 77 flags 0x00; blkref #0: rel 1663/58183/69545, blk 696 FPW

This seems to me to be pretty strong evidence that I'm seeing *something*
real.  I'm currently trying to isolate a specific commit to pin it on.
A straight "git bisect" isn't going to work because so many people had
broken so many different things right around that date :-(, so it may
take awhile to get a good answer.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:
> I can however say that pg_waldump on the standby's pg_wal does also
> fail. The failure as part of the backend is "invalid memory alloc
> request size", whereas in pg_waldump I get the much more helpful:
> pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect 
> prev-link 416200FF/FF00 at 4/7F5C3200
> 
> In frontend code that allocation actually succeeds, because there is no
> size check. But in backend code we run into the size check, and thus
> don't even display a useful error.
> 
> In 13 the header is validated before allocating space for the
> record (except if header is spread across pages) - it seems inadvisable
> to turn that around?

I was now able to reproduce the problem again, and I'm afraid that the
bug I hit is likely separate from Tom's. The allocation thing above is
the issue in my case:

The walsender connection ended (I restarted the primary), thus the
startup switches to replaying locally. For some reason the end of the
WAL contains non-zero data (I think it's because walreceiver doesn't
zero out pages - that's bad!). Because the allocation happens before the
header is validated, we reproducibly end up in the mcxt.c ERROR path,
failing recovery.

To me it looks like a smaller version of the problem is present in < 14,
albeit only when the page header is at a record boundary. In that case
we don't validate the page header immediately, only once it's completely
read. But we do believe the total size, and try to allocate
that.

There's a really crufty escape hatch (from 70b4f82a4b) to that:

/*
 * Note that in much unlucky circumstances, the random data read from a
 * recycled segment can cause this routine to be called with a size
 * causing a hard failure at allocation.  For a standby, this would cause
 * the instance to stop suddenly with a hard failure, preventing it to
 * retry fetching WAL from one of its sources which could allow it to move
 * on with replay without a manual restart. If the data comes from a past
 * recycled segment and is still valid, then the allocation may succeed
 * but record checks are going to fail so this would be short-lived.  If
 * the allocation fails because of a memory shortage, then this is not a
 * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
 */
if (!AllocSizeIsValid(newSize))
return false;

but it looks to me like that's pretty much the wrong fix, at least in
the case where we've not yet validated the rest of the header. We don't
need to allocate all that data before we've read the rest of the
*fixed-size* header.
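
For illustration, a rough standalone sketch of that ordering (hypothetical
names and types, not the actual xlogreader.c code): read the fixed-size
header, sanity-check the length it claims, and only then allocate, so that
random bytes from a recycled segment cannot trigger a huge allocation request.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct DemoRecordHeader
{
    uint32_t    total_len;      /* claimed length of the whole record */
    uint32_t    info;           /* other fixed-size header fields ... */
} DemoRecordHeader;

#define MAX_SANE_RECORD (1024 * 1024 * 1024)    /* arbitrary sanity limit */

/* Pretend validation of the fixed-size header (magic, prev-link, ...). */
static bool
demo_header_is_valid(const DemoRecordHeader *hdr)
{
    return hdr->total_len >= sizeof(DemoRecordHeader) &&
        hdr->total_len <= MAX_SANE_RECORD;
}

/*
 * Decode one record from 'avail' bytes of (possibly garbage) WAL data.
 * The allocation sized from total_len happens only after the fixed-size
 * header has passed validation.
 */
static uint8_t *
demo_read_record(const uint8_t *wal, size_t avail, size_t *len_out)
{
    DemoRecordHeader hdr;

    if (avail < sizeof(hdr))
        return NULL;            /* need more input first */
    memcpy(&hdr, wal, sizeof(hdr));

    if (!demo_header_is_valid(&hdr))
        return NULL;            /* bogus header: report error, no allocation */

    uint8_t    *buf = malloc(hdr.total_len);    /* safe: length sanity-checked */

    if (buf == NULL)
        return NULL;
    memcpy(buf, wal, avail < hdr.total_len ? avail : hdr.total_len);
    *len_out = hdr.total_len;
    return buf;
}

int
main(void)
{
    /* 0xFF bytes standing in for the tail of a recycled WAL segment. */
    uint8_t     junk[64];
    size_t      len;

    memset(junk, 0xFF, sizeof(junk));
    return demo_read_record(junk, sizeof(junk), &len) == NULL ? 0 : 1;
}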

It also seems to me that 70b4f82a4b should also have changed walsender
to pad out the received data to an 8KB boundary?

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:
> I can however say that pg_waldump on the standby's pg_wal does also
> fail. The failure as part of the backend is "invalid memory alloc
> request size", whereas in pg_waldump I get the much more helpful:
> pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect 
> prev-link 416200FF/FF00 at 4/7F5C3200

There's definitely something broken around continuation records, in
XLogFindNextRecord(). Which means that it's not the cause for the server
side issue, but obviously still not good.

The conversion of XLogFindNextRecord() to be state machine based
basically only works in a narrow set of circumstances. Whenever the end
of the first record read is on a different page than the start of the
record, we'll endlessly loop.

We'll go into XLogFindNextRecord() and return until we've successfully
read the page header. Then we'll enter the second loop, which will try
to read until the end of the first record. But after returning, the first
loop will again ask for the page header.

Even if that's fixed, the second loop alone has the same problem: as
XLogBeginRead() is called unconditionally, we'll start reading the start of
the record, discover that it needs data on a second page, return, and
do the same thing again.

I think it needs something roughly like the attached.

Greetings,

Andres Freund
diff --git i/src/include/access/xlogreader.h w/src/include/access/xlogreader.h
index 3b8af31a8fe..82a80cf2bf5 100644
--- i/src/include/access/xlogreader.h
+++ w/src/include/access/xlogreader.h
@@ -297,6 +297,7 @@ struct XLogFindNextRecordState
 	XLogReaderState *reader_state;
 	XLogRecPtr		targetRecPtr;
 	XLogRecPtr		currRecPtr;
+	bool			found_start;
 };
 
 /* Report that data is available for decoding. */
diff --git i/src/backend/access/transam/xlogreader.c w/src/backend/access/transam/xlogreader.c
index 4277e92d7c9..935c841347f 100644
--- i/src/backend/access/transam/xlogreader.c
+++ w/src/backend/access/transam/xlogreader.c
@@ -868,7 +868,7 @@ XLogDecodeOneRecord(XLogReaderState *state, bool allow_oversized)
 /* validate record header if not yet */
 if (!state->record_verified && record_len >= SizeOfXLogRecord)
 {
-if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
+	if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
 			   state->PrevRecPtr, prec))
 		goto err;
 
@@ -1516,6 +1516,7 @@ InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr)
 	state->reader_state = reader_state;
 	state->targetRecPtr = start_ptr;
 	state->currRecPtr = start_ptr;
+	state->found_start = false;
 
 	return state;
 }
@@ -1545,7 +1546,7 @@ XLogFindNextRecord(XLogFindNextRecordState *state)
 	 * skip over potential continuation data, keeping in mind that it may span
 	 * multiple pages
 	 */
-	while (true)
+	while (!state->found_start)
 	{
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
@@ -1616,7 +1617,12 @@ XLogFindNextRecord(XLogFindNextRecordState *state)
 	 * because either we're at the first record after the beginning of a page
 	 * or we just jumped over the remaining data of a continuation.
 	 */
-	XLogBeginRead(state->reader_state, state->currRecPtr);
+	if (!state->found_start)
+	{
+		XLogBeginRead(state->reader_state, state->currRecPtr);
+		state->found_start = true;
+	}
+
 	while ((result = XLogReadRecord(state->reader_state, &record, &errormsg)) !=
 		   XLREAD_FAIL)
 	{


Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-28 20:24:43 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Oh! I was about to ask how much shared buffers your primary / standby
> > have.
> Default configurations, so 128MB each.

I thought that possibly initdb would detect less or something...


I assume this is 32bit? I did notice that a 32bit test took a lot longer
than a 64bit test. But didn't investigate so far.


> And I think I may actually have reproduce a variant of the issue!

Unfortunately I had not set up things in a way that the primary retains
the WAL, making it harder to compare whether it's the WAL that got
corrupted or whether it's a decoding bug.

I can however say that pg_waldump on the standby's pg_wal does also
fail. The failure as part of the backend is "invalid memory alloc
request size", whereas in pg_waldump I get the much more helpful:
pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect 
prev-link 416200FF/FF00 at 4/7F5C3200

In frontend code that allocation actually succeeds, because there is no
size check. But in backend code we run into the size check, and thus
don't even display a useful error.

In 13 the header is validated before allocating space for the
record (except if header is spread across pages) - it seems inadvisable
to turn that around?


Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Tom Lane
Andres Freund  writes:
> On 2021-04-28 19:24:53 -0400, Tom Lane wrote:
>> IOW, we've spent over twice as many CPU cycles shipping data to the
>> standby as we did in applying the WAL on the standby.

> I don't really know how the time calculation works on mac. Is there a
> chance it includes time spent doing IO?

I'd be pretty astonished if it did.  This is basically a NetBSD system
remember (in fact, this ancient macOS release is a good deal closer
to those roots than modern versions).  BSDen have never accounted for
time that way AFAIK.  Also, the "ps" man page says specifically that
that column is CPU time.

> Oh! I was about to ask how much shared buffers your primary / standby
> have. And I think I may actually have reproduce a variant of the issue!

Default configurations, so 128MB each.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-28 19:24:53 -0400, Tom Lane wrote:
> But I happened to notice the accumulated CPU time for the background
> processes:
> 
> USER   PID  %CPU %MEM  VSZRSS   TT  STAT STARTED  TIME COMMAND
> tgl  19048   0.0  4.4   229952  92196   ??  Ss3:19PM  19:59.19 
> postgres: startup recovering 000100140022
> tgl  19051   0.0  0.1   229656   1696   ??  Ss3:19PM  27:09.14 
> postgres: walreceiver streaming 14/227D8F14
> tgl  19052   0.0  0.1   229904   2516   ??  Ss3:19PM  17:38.17 
> postgres: walsender tgl [local] streaming 14/227D8F14 
> 
> IOW, we've spent over twice as many CPU cycles shipping data to the
> standby as we did in applying the WAL on the standby.  Is this
> expected?  I've got wal_consistency_checking = all, which is bloating
> the WAL volume quite a bit, but still it seems like the walsender and
> walreceiver have little excuse for spending more cycles per byte
> than the startup process.

I don't really know how the time calculation works on mac. Is there a
chance it includes time spent doing IO? On the primary the WAL IO is
done by a lot of backends, but on the standby it's all going to be the
walreceiver. And the walreceiver does fsyncs in a not particularly
efficient manner.

FWIW, on my linux workstation no such difference is visible:
USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
andres   2910540  9.4  0.0 2237252 126680 ?  Ss   16:55   0:20 postgres: 
dev assert standby: startup recovering 00010002003F
andres   2910544  5.2  0.0 2236724 9260 ?Ss   16:55   0:11 postgres: 
dev assert standby: walreceiver streaming 2/3FDCF118
andres   2910545  2.1  0.0 2237036 10672 ?   Ss   16:55   0:04 postgres: 
dev assert: walsender andres [local] streaming 2/3FDCF118



> (This is testing b3ee4c503, so if Thomas' WAL changes improved
> efficiency of the replay process at all, the discrepancy could be
> even worse in HEAD.)

The prefetching isn't enabled by default, so I'd not expect meaningful
differences... And even with the prefetching enabled, our normal
regression tests largely are resident in s_b, so there shouldn't be much
prefetching.


Oh! I was about to ask how much shared buffers your primary / standby
have. And I think I may actually have reproduce a variant of the issue!

I previously had played around with different settings that I thought
might increase the likelihood of reproducing the problem. But this time
I set shared_buffers lower than before, and got:

2021-04-28 17:03:22.174 PDT [2913840][] LOG:  database system was shut down in 
recovery at 2021-04-28 17:03:11 PDT
2021-04-28 17:03:22.174 PDT [2913840][] LOG:  entering standby mode
2021-04-28 17:03:22.178 PDT [2913840][1/0] LOG:  redo starts at 2/416C6278
2021-04-28 17:03:37.628 PDT [2913840][1/0] LOG:  consistent recovery state 
reached at 4/7F5C3200
2021-04-28 17:03:37.628 PDT [2913840][1/0] FATAL:  invalid memory alloc request 
size 3053455757
2021-04-28 17:03:37.628 PDT [2913839][] LOG:  database system is ready to 
accept read only connections
2021-04-28 17:03:37.636 PDT [2913839][] LOG:  startup process (PID 2913840) 
exited with exit code 1

This reproduces across restarts. Yay, I guess.

Isn't it odd that we get a "database system is ready to accept read only
connections"?

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Tom Lane
Thomas Munro  writes:
> FWIW I also chewed through many megawatts trying to reproduce this on
> a PowerPC system in 64 bit big endian mode, with an emulator.  No
> cigar.  However, it's so slow that I didn't make it to 10 runs...

Speaking of megawatts ... my G4 has now finished about ten cycles of
installcheck-parallel without a failure, which isn't really enough
to draw any conclusions yet.  But I happened to notice the
accumulated CPU time for the background processes:

USER   PID  %CPU %MEM  VSZRSS   TT  STAT STARTED  TIME COMMAND
tgl  19048   0.0  4.4   229952  92196   ??  Ss3:19PM  19:59.19 
postgres: startup recovering 000100140022
tgl  19051   0.0  0.1   229656   1696   ??  Ss3:19PM  27:09.14 
postgres: walreceiver streaming 14/227D8F14
tgl  19052   0.0  0.1   229904   2516   ??  Ss3:19PM  17:38.17 
postgres: walsender tgl [local] streaming 14/227D8F14 

IOW, we've spent over twice as many CPU cycles shipping data to the
standby as we did in applying the WAL on the standby.  Is this
expected?  I've got wal_consistency_checking = all, which is bloating
the WAL volume quite a bit, but still it seems like the walsender and
walreceiver have little excuse for spending more cycles per byte
than the startup process.

(This is testing b3ee4c503, so if Thomas' WAL changes improved
efficiency of the replay process at all, the discrepancy could be
even worse in HEAD.)

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Thomas Munro
On Thu, Apr 29, 2021 at 4:45 AM Tom Lane  wrote:
> Andres Freund  writes:
> > Tom, any chance you could check if your machine repros the issue before
> > these commits?
>
> Wilco, but it'll likely take a little while to get results ...

FWIW I also chewed through many megawatts trying to reproduce this on
a PowerPC system in 64 bit big endian mode, with an emulator.  No
cigar.  However, it's so slow that I didn't make it to 10 runs...




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Tom Lane
Andres Freund  writes:
> Tom, any chance you could check if your machine repros the issue before
> these commits?

Wilco, but it'll likely take a little while to get results ...

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-22 13:59:58 +1200, Thomas Munro wrote:
> On Thu, Apr 22, 2021 at 1:21 PM Tom Lane  wrote:
> > I've also tried to reproduce on 32-bit and 64-bit Intel, without
> > success.  So if this is real, maybe it's related to being big-endian
> > hardware?  But it's also quite sensitive to $dunno-what, maybe the
> > history of WAL records that have already been replayed.
> 
> Ah, that's interesting.  There are a couple of sparc64 failures and a
> ppc64 failure in the build farm, but I couldn't immediately spot what
> was wrong with them or whether it might be related to this stuff.
> 
> Thanks for the clues.  I'll see what unusual systems I can find to try
> this on

FWIW, I've run 32 and 64 bit x86 through several hundred regression
cycles, without hitting an issue. For a lot of them I set
checkpoint_timeout to a lower value as I thought that might make it more
likely to reproduce an issue.

Tom, any chance you could check if your machine repros the issue before
these commits?

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Tom Lane
Andres Freund  writes:
> On 2021-04-21 21:21:05 -0400, Tom Lane wrote:
>> What I'm doing is running the core regression tests with a single
>> standby (on the same machine) and wal_consistency_checking = all.

> Do you run them over replication, or sequentially by storing data into
> an archive? Just curious, because it's so painful to run that scenario in
> the replication case due to the tablespace conflicting between
> primary/standby, unless one disables the tablespace tests.

No, live over replication.  I've been skipping the tablespace test.

> Have you tried reproducing it on commits before the recent xlogreader
> changes?

Nope.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Thomas Munro
On Thu, Apr 22, 2021 at 1:21 PM Tom Lane  wrote:
> I've also tried to reproduce on 32-bit and 64-bit Intel, without
> success.  So if this is real, maybe it's related to being big-endian
> hardware?  But it's also quite sensitive to $dunno-what, maybe the
> history of WAL records that have already been replayed.

Ah, that's interesting.  There are a couple of sparc64 failures and a
ppc64 failure in the build farm, but I couldn't immediately spot what
was wrong with them or whether it might be related to this stuff.

Thanks for the clues.  I'll see what unusual systems I can find to try
this on




Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Andres Freund
Hi,

On 2021-04-21 21:21:05 -0400, Tom Lane wrote:
> What I'm doing is running the core regression tests with a single
> standby (on the same machine) and wal_consistency_checking = all.

Do you run them over replication, or sequentially by storing data into
an archive? Just curious, because it's so painful to run that scenario in
the replication case due to the tablespace conflicting between
primary/standby, unless one disables the tablespace tests.


> The other PPC machine (with no known history of trouble) is the one
> that had the CRC failure I showed earlier.  That one does seem to be
> actual bad data in the stored WAL, because the problem was also seen
> by pg_waldump, and trying to restart the standby got the same failure
> again.

It seems like that could also indicate an xlogreader bug that is
reliably hit? Once it gets confused about record lengths or such I'd
expect CRC failures...

If it were actually wrong WAL contents I don't think any of the
xlogreader / prefetching changes could be responsible...


Have you tried reproducing it on commits before the recent xlogreader
changes?

commit 1d257577e08d3e598011d6850fd1025858de8c8c
Author: Thomas Munro 
Date:   2021-04-08 23:03:43 +1200

Optionally prefetch referenced data in recovery.

commit f003d9f8721b3249e4aec8a1946034579d40d42c
Author: Thomas Munro 
Date:   2021-04-08 23:03:34 +1200

Add circular WAL decoding buffer.

Discussion: 
https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=aovoddf...@mail.gmail.com

commit 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b
Author: Thomas Munro 
Date:   2021-04-08 23:03:23 +1200

Remove read_page callback from XLogReader.


Trying 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b^ is probably the most
interesting bit.


> I've not been able to duplicate the consistency-check failures
> there.  But because that machine is a laptop with a much inferior disk
> drive, the speeds are enough different that it's not real surprising
> if it doesn't hit the same problem.
>
> I've also tried to reproduce on 32-bit and 64-bit Intel, without
> success.  So if this is real, maybe it's related to being big-endian
> hardware?  But it's also quite sensitive to $dunno-what, maybe the
> history of WAL records that have already been replayed.

It might just be disk speed influencing how long the tests take, which
in turn increases the number of checkpoints during the test,
increasing the number of FPIs?

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Tom Lane
Stephen Frost  writes:
> On Wed, Apr 21, 2021 at 19:17 Thomas Munro  wrote:
>> ... Personally I think the right thing to do now is to revert it
>> and re-propose for 15 early in the cycle, supported with some better
>> testing infrastructure.

> I tend to agree with the idea to revert it, perhaps a +0 on that, but if
> others argue it should be fixed in-place, I wouldn’t complain about it.

FWIW, I've so far only been able to see problems on two old PPC Macs,
one of which has been known to be a bit flaky in the past.  So it's
possible that what I'm looking at is a hardware glitch.  But it's
consistent enough that I rather doubt that.

What I'm doing is running the core regression tests with a single
standby (on the same machine) and wal_consistency_checking = all.
Fairly reproducibly (more than one run in ten), what I get on the
slightly-flaky machine is consistency check failures like

2021-04-21 17:42:56.324 EDT [42286] PANIC:  inconsistent page found, rel 
1663/354383/357033, forknum 0, blkno 9, byte offset 2069: replay 0x00 primary 
0x03
2021-04-21 17:42:56.324 EDT [42286] CONTEXT:  WAL redo at 24/121C97B0 for 
Heap/INSERT: off 107 flags 0x00; blkref #0: rel 1663/354383/357033, blk 9 FPW
2021-04-21 17:45:11.662 EDT [42284] LOG:  startup process (PID 42286) was 
terminated by signal 6: Abort trap

2021-04-21 11:25:30.091 EDT [38891] PANIC:  inconsistent page found, rel 
1663/229880/237980, forknum 0, blkno 108, byte offset 3845: replay 0x00 primary 
0x99
2021-04-21 11:25:30.091 EDT [38891] CONTEXT:  WAL redo at 17/A99897FC for 
SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref 
#0: rel 1663/229880/237980, blk 108 FPW
2021-04-21 11:26:59.371 EDT [38889] LOG:  startup process (PID 38891) was 
terminated by signal 6: Abort trap

2021-04-20 19:20:16.114 EDT [34405] PANIC:  inconsistent page found, rel 
1663/189216/197311, forknum 0, blkno 115, byte offset 6149: replay 0x37 primary 
0x03
2021-04-20 19:20:16.114 EDT [34405] CONTEXT:  WAL redo at 13/3CBFED00 for 
SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref 
#0: rel 1663/189216/197311, blk 115 FPW
2021-04-20 19:21:54.421 EDT [34403] LOG:  startup process (PID 34405) was 
terminated by signal 6: Abort trap

2021-04-20 17:44:09.356 EDT [24106] FATAL:  inconsistent page found, rel 
1663/135419/143843, forknum 0, blkno 101, byte offset 6152: replay 0x40 primary 
0x00
2021-04-20 17:44:09.356 EDT [24106] CONTEXT:  WAL redo at D/5107D8A8 for 
Gist/PAGE_UPDATE: ; blkref #0: rel 1663/135419/143843, blk 101 FPW

(Note I modified checkXLogConsistency to PANIC on failure, so I could get
a core dump to analyze; and it's also printing the first-mismatch location.)

I have not analyzed each one of these failures exhaustively, but on the
ones I have looked at closely, the replay_image_masked version of the page
appears correct while the primary_image_masked version is *not*.
Moreover, the primary_image_masked version does not match the full-page
image that I see in the on-disk WAL file.  It did however seem to match
the in-memory WAL record contents that the decoder is operating on.
So unless you want to believe the buggy-hardware theory, something's
occasionally messing up while loading WAL records from disk.  All of the
trouble cases involve records that span across WAL pages (unsurprising
since they contain FPIs), so maybe there's something not quite right
in there.

In the cases that I looked at closely, it appeared that there was a
block of 32 wrong bytes somewhere within the page image, with the data
before and after that being correct.  I'm not sure if that pattern
holds in all cases though.

BTW, if I restart the failed standby, it plows through the same data
just fine, confirming that the on-disk WAL is not corrupt.

The other PPC machine (with no known history of trouble) is the one
that had the CRC failure I showed earlier.  That one does seem to be
actual bad data in the stored WAL, because the problem was also seen
by pg_waldump, and trying to restart the standby got the same failure
again.  I've not been able to duplicate the consistency-check failures
there.  But because that machine is a laptop with a much inferior disk
drive, the speeds are enough different that it's not real surprising
if it doesn't hit the same problem.

I've also tried to reproduce on 32-bit and 64-bit Intel, without
success.  So if this is real, maybe it's related to being big-endian
hardware?  But it's also quite sensitive to $dunno-what, maybe the
history of WAL records that have already been replayed.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Stephen Frost
Greetings,

On Wed, Apr 21, 2021 at 19:17 Thomas Munro  wrote:

> On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro 
> wrote:
> > That wasn't my plan, but I admit that the timing was non-ideal.  In
> > any case, I'll dig into these failures and then consider options.
> > More soon.
>
> Yeah, this clearly needs more work.  xlogreader.c is difficult to work
> with and I think we need to keep trying to improve it, but I made a
> bad call here trying to combine this with other refactoring work up
> against a deadline and I made some dumb mistakes.  I could of course
> debug it in-tree, and I know that this has been an anticipated
> feature.  Personally I think the right thing to do now is to revert it
> and re-propose for 15 early in the cycle, supported with some better
> testing infrastructure.


I tend to agree with the idea to revert it, perhaps a +0 on that, but if
others argue it should be fixed in-place, I wouldn’t complain about it.

I very much encourage the idea of improving testing in this area and would
be happy to try and help do so in the 15 cycle.

Thanks,

Stephen



Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Thomas Munro
On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro  wrote:
> That wasn't my plan, but I admit that the timing was non-ideal.  In
> any case, I'll dig into these failures and then consider options.
> More soon.

Yeah, this clearly needs more work.  xlogreader.c is difficult to work
with and I think we need to keep trying to improve it, but I made a
bad call here trying to combine this with other refactoring work up
against a deadline and I made some dumb mistakes.  I could of course
debug it in-tree, and I know that this has been an anticipated
feature.  Personally I think the right thing to do now is to revert it
and re-propose for 15 early in the cycle, supported with some better
testing infrastructure.




Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Thomas Munro
On Thu, Apr 22, 2021 at 8:07 AM Tomas Vondra
 wrote:
> On 4/21/21 6:30 PM, Tom Lane wrote:
> > Thomas Munro  writes:
> >> Yeah, it would have been nice to include that but it'll have to be for
> >> v15 due to lack of time to convince myself that it was correct.  I do
> >> intend to look into more concurrency of that kind for v15.  I have
> >> pushed these patches, updated to be disabled by default.
> >
> > I have a fairly bad feeling about these patches.  I've already fixed
> > one critical bug (see 9e4114822), but I am still seeing random, hard
> > to reproduce failures in WAL replay testing.  It looks like sometimes
> > the "decoded" version of a WAL record doesn't match what I see in
> > the on-disk data, which I'm having no luck tracing down.

Ugh.  Looking into this now.  Also, this week I have been researching
a possible problem with eg ALTER TABLE SET TABLESPACE in the higher
level patch, which I'll write about soon.

> > I am not sure whether the checksum failure itself is real or a variant
> > of the seeming bad-reconstruction problem, but what I'm on about right
> > at this moment is that the error handling logic for this case seems
> > quite broken.  Why is a checksum failure only worthy of a LOG message?
> > Why is ValidXLogRecord() issuing a log message for itself, rather than
> > being tied into the report_invalid_record() mechanism?  Why are we
> > evidently still trying to decode records afterwards?
>
> Yeah, that seems suspicious.

I may have invited trouble by deciding to rebase on the other proposal
late in the cycle.  That other work interfaces with the code around there.

> > In general, I'm not too pleased with the apparent attitude in this
> > thread that it's okay to push a patch that only mostly works on the
> > last day of the dev cycle and plan to stabilize it later.
>
> Was there such an attitude? I don't think people were arguing for pushing a
> patch that's not working correctly. The discussion was mostly about getting
> it committed and leaving some optimizations for v15.

That wasn't my plan, but I admit that the timing was non-ideal.  In
any case, I'll dig into these failures and then consider options.
More soon.




Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Tomas Vondra
On 4/21/21 6:30 PM, Tom Lane wrote:
> Thomas Munro  writes:
>> Yeah, it would have been nice to include that but it'll have to be for
>> v15 due to lack of time to convince myself that it was correct.  I do
>> intend to look into more concurrency of that kind for v15.  I have
>> pushed these patches, updated to be disabled by default.
> 
> I have a fairly bad feeling about these patches.  I've already fixed
> one critical bug (see 9e4114822), but I am still seeing random, hard
> to reproduce failures in WAL replay testing.  It looks like sometimes
> the "decoded" version of a WAL record doesn't match what I see in
> the on-disk data, which I'm having no luck tracing down.
> 
> Another interesting failure I just came across is
> 
> 2021-04-21 11:32:14.280 EDT [14606] LOG:  incorrect resource manager data 
> checksum in record at F/438000A4
> TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, 
> PID: 14606)
> 2021-04-21 11:38:23.066 EDT [14603] LOG:  startup process (PID 14606) was 
> terminated by signal 6: Abort trap
> 
> with stack trace
> 
> #0  0x90b669f0 in kill ()
> #1  0x90c01bfc in abort ()
> #2  0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<value temporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>, lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:69
> #3  0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845
> #4  0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466
> #5  0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due to optimizations>, record=0xbfffba98, errormsg=<value temporarily unavailable, due to optimizations>) at xlogreader.c:352
> #6  0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398
> #7  0x000ea320 in StartupXLOG () at xlog.c:7567
> #8  0x00362218 in StartupProcessMain () at startup.c:244
> #9  0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at bootstrap.c:447
> #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439
> #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406
> #12 0x0029737c in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at main.c:209
> 
> 
> I am not sure whether the checksum failure itself is real or a variant
> of the seeming bad-reconstruction problem, but what I'm on about right
> at this moment is that the error handling logic for this case seems
> quite broken.  Why is a checksum failure only worthy of a LOG message?
> Why is ValidXLogRecord() issuing a log message for itself, rather than
> being tied into the report_invalid_record() mechanism?  Why are we
> evidently still trying to decode records afterwards?
> 

Yeah, that seems suspicious.

> In general, I'm not too pleased with the apparent attitude in this
> thread that it's okay to push a patch that only mostly works on the
> last day of the dev cycle and plan to stabilize it later.
> 

Was there such an attitude? I don't think people were arguing for pushing a
patch that's not working correctly. The discussion was mostly about getting
it committed and leaving some optimizations for v15.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2021-04-21 Thread Tom Lane
Thomas Munro  writes:
> Yeah, it would have been nice to include that but it'll have to be for
> v15 due to lack of time to convince myself that it was correct.  I do
> intend to look into more concurrency of that kind for v15.  I have
> pushed these patches, updated to be disabled by default.

I have a fairly bad feeling about these patches.  I've already fixed
one critical bug (see 9e4114822), but I am still seeing random, hard
to reproduce failures in WAL replay testing.  It looks like sometimes
the "decoded" version of a WAL record doesn't match what I see in
the on-disk data, which I'm having no luck tracing down.

Another interesting failure I just came across is

2021-04-21 11:32:14.280 EDT [14606] LOG:  incorrect resource manager data 
checksum in record at F/438000A4
TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, PID: 
14606)
2021-04-21 11:38:23.066 EDT [14603] LOG:  startup process (PID 14606) was 
terminated by signal 6: Abort trap

with stack trace

#0  0x90b669f0 in kill ()
#1  0x90c01bfc in abort ()
#2  0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<value temporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>, lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:69
#3  0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845
#4  0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466
#5  0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due to optimizations>, record=0xbfffba98, errormsg=<value temporarily unavailable, due to optimizations>) at xlogreader.c:352
#6  0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398
#7  0x000ea320 in StartupXLOG () at xlog.c:7567
#8  0x00362218 in StartupProcessMain () at startup.c:244
#9  0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at bootstrap.c:447
#10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439
#11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406
#12 0x0029737c in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at main.c:209


I am not sure whether the checksum failure itself is real or a variant
of the seeming bad-reconstruction problem, but what I'm on about right
at this moment is that the error handling logic for this case seems
quite broken.  Why is a checksum failure only worthy of a LOG message?
Why is ValidXLogRecord() issuing a log message for itself, rather than
being tied into the report_invalid_record() mechanism?  Why are we
evidently still trying to decode records afterwards?

In general, I'm not too pleased with the apparent attitude in this
thread that it's okay to push a patch that only mostly works on the
last day of the dev cycle and plan to stabilize it later.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-16 Thread Amit Kapila
On Sat, Apr 10, 2021 at 2:16 AM Thomas Munro  wrote:
>

In commit 1d257577e08d3e598011d6850fd1025858de8c8c, there is a change
in file format for stats, won't it require bumping
PGSTAT_FILE_FORMAT_ID?

Actually, I came across this while working on my today's commit
f5fc2f5b23 where I forgot to bump PGSTAT_FILE_FORMAT_ID. So, I thought
maybe we can bump it just once if required?

-- 
With Regards,
Amit Kapila.




RE: WIP: WAL prefetch (another approach)

2021-04-12 Thread Shinoda, Noriyoshi (PN Japan FSIP)
Hi, 

Thank you for developing a great feature. I tested this feature and checked the 
documentation.
Currently, the documentation for the pg_stat_prefetch_recovery view is included 
in the description for the pg_stat_subscription view.

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

It is also not displayed in the list of "28.2. The Statistics Collector".
https://www.postgresql.org/docs/devel/monitoring.html

The attached patch modifies the pg_stat_prefetch_recovery view to appear as a 
separate view.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Thomas Munro [mailto:thomas.mu...@gmail.com] 
Sent: Saturday, April 10, 2021 5:46 AM
To: Justin Pryzby 
Cc: Tomas Vondra ; Stephen Frost 
; Andres Freund ; Jakub Wartak 
; Alvaro Herrera ; Tomas 
Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; 
David Steele ; pgsql-hackers 
Subject: Re: WIP: WAL prefetch (another approach)

On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby  wrote:
> Did you see this?
> https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM
>
> I meant to mail you so you could include it in the same commit, but 
> forgot until now.

Done, thanks.




pg_stat_prefetch_recovery_doc_v1.diff


Re: WIP: WAL prefetch (another approach)

2021-04-09 Thread Thomas Munro
On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby  wrote:
> Did you see this?
> https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM
>
> I meant to mail you so you could include it in the same commit, but forgot
> until now.

Done, thanks.




Re: WIP: WAL prefetch (another approach)

2021-04-09 Thread Justin Pryzby
On Sat, Apr 10, 2021 at 08:27:42AM +1200, Thomas Munro wrote:
> On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby  wrote:
> > Here's some little language fixes.
> 
> Thanks!  Done.  I rewrote the gibberish comment that made you say
> "XXX: what?".  Pushed.
> 
> > BTW, before beginning "recovery", PG syncs all the data dirs.
> > This can be slow, and it seems like the slowness is frequently due to file
> > metadata.  For example, that's an obvious consequence of an OS crash, after
> > which the page cache is empty.  I've made a habit of running find /zfs -ls 
> > |wc
> > to pre-warm it, which can take a little bit, but then the recovery process
> > starts moments later.  I don't have any timing measurements, but I expect 
> > that
> > starting to stat() all data files as soon as possible would be a win.
> 
> Did you see commit 61752afb, "Provide
> recovery_init_sync_method=syncfs"?  Actually I believe it's safe to
> skip that phase completely and do a tiny bit more work during
> recovery, which I'd like to work on for v15[1].

Yes, I have it in my list for v14 deployment.  Thanks for that.

Did you see this?
https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM

I meant to mail you so you could include it in the same commit, but forgot
until now.

-- 
Justin




Re: WIP: WAL prefetch (another approach)

2021-04-09 Thread Thomas Munro
On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby  wrote:
> Here's some little language fixes.

Thanks!  Done.  I rewrote the gibberish comment that made you say
"XXX: what?".  Pushed.

> BTW, before beginning "recovery", PG syncs all the data dirs.
> This can be slow, and it seems like the slowness is frequently due to file
> metadata.  For example, that's an obvious consequence of an OS crash, after
> which the page cache is empty.  I've made a habit of running find /zfs -ls |wc
> to pre-warm it, which can take a little bit, but then the recovery process
> starts moments later.  I don't have any timing measurements, but I expect that
> starting to stat() all data files as soon as possible would be a win.

Did you see commit 61752afb, "Provide
recovery_init_sync_method=syncfs"?  Actually I believe it's safe to
skip that phase completely and do a tiny bit more work during
recovery, which I'd like to work on for v15[1].

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B8Wm8TSfMWPteMEHfh194RytVTBNoOkggTQT1p5NTY7Q%40mail.gmail.com




Re: WIP: WAL prefetch (another approach)

2021-04-08 Thread Justin Pryzby
Here's some little language fixes.

BTW, before beginning "recovery", PG syncs all the data dirs.
This can be slow, and it seems like the slowness is frequently due to file
metadata.  For example, that's an obvious consequence of an OS crash, after
which the page cache is empty.  I've made a habit of running find /zfs -ls |wc
to pre-warm it, which can take a little bit, but then the recovery process
starts moments later.  I don't have any timing measurements, but I expect that
starting to stat() all data files as soon as possible would be a win.

commit cc9707de333fe8242607cde9f777beadc68dbf04
Author: Justin Pryzby 
Date:   Thu Apr 8 10:43:14 2021 -0500

WIP: doc review: Optionally prefetch referenced data in recovery.

1d257577e08d3e598011d6850fd1025858de8c8c

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bc4a8b2279..139dee7aa2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3621,7 +3621,7 @@ include_dir 'conf.d'
 pool after that.  However, on file systems with a block size larger
 than
 PostgreSQL's, prefetching can avoid a
-costly read-before-write when a blocks are later written.
+costly read-before-write when blocks are later written.
 The default is off.

   
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 24cf567ee2..36e00c92c2 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -816,9 +816,7 @@
prefetching mechanism is most likely to be effective on systems
with full_page_writes set to
off (where that is safe), and where the working
-   set is larger than RAM.  By default, prefetching in recovery is enabled
-   on operating systems that have posix_fadvise
-   support.
+   set is larger than RAM.  By default, prefetching in recovery is disabled.
   
  
 
diff --git a/src/backend/access/transam/xlogprefetch.c 
b/src/backend/access/transam/xlogprefetch.c
index 28764326bc..363c079964 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -31,7 +31,7 @@
  * stall; this is counted with "skip_fpw".
  *
  * The only way we currently have to know that an I/O initiated with
- * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(),
+ * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(), 
XXX: what ??
  * and perform a synchronous read.  Therefore, we track the number of
  * potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
  * full, we have to wait for recovery to replay records so that the queue
@@ -660,7 +660,7 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
/*
 * I/O has possibly been initiated (though we don't 
know if it was
 * already cached by the kernel, so we just have to 
assume that it
-* has due to lack of better information).  Record this 
as an I/O
+* was due to lack of better information).  Record this 
as an I/O
 * in progress until eventually we replay this LSN.
 */
XLogPrefetchIncrement(&SharedStats->prefetch);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 090abdad8b..8c72ba1f1a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2774,7 +2774,7 @@ static struct config_int ConfigureNamesInt[] =
{
{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
gettext_noop("Maximum buffer size for reading ahead in 
the WAL during recovery."),
-   gettext_noop("This controls the maximum distance we can 
read ahead n the WAL to prefetch referenced blocks."),
+   gettext_noop("This controls the maximum distance we can 
read ahead in the WAL to prefetch referenced blocks."),
GUC_UNIT_BYTE
},
&wal_decode_buffer_size,
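
The xlogprefetch.c comment above refers to tracking in-flight I/Os with a
circular buffer of LSNs.  As a rough illustration only (hypothetical names,
not the actual xlogprefetch.c data structures), the idea is roughly:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLsn;

#define RING_SIZE 8                     /* max prefetches assumed in flight */

typedef struct DemoPrefetchRing
{
    DemoLsn     entries[RING_SIZE];
    int         head;                   /* next slot to fill */
    int         count;                  /* how many prefetches in flight */
} DemoPrefetchRing;

/* Record that a prefetch was issued for the block referenced at 'lsn'. */
static bool
ring_note_prefetch(DemoPrefetchRing *ring, DemoLsn lsn)
{
    if (ring->count == RING_SIZE)
        return false;                   /* full: caller must wait for replay */
    ring->entries[ring->head] = lsn;
    ring->head = (ring->head + 1) % RING_SIZE;
    ring->count++;
    return true;
}

/* Replay has reached 'replayed_lsn': retire prefetches at or before it. */
static void
ring_replay_progress(DemoPrefetchRing *ring, DemoLsn replayed_lsn)
{
    while (ring->count > 0)
    {
        int         tail = (ring->head - ring->count + RING_SIZE) % RING_SIZE;

        if (ring->entries[tail] > replayed_lsn)
            break;                      /* oldest in-flight I/O not replayed yet */
        ring->count--;
    }
}

int
main(void)
{
    DemoPrefetchRing ring = {0};

    for (DemoLsn lsn = 100; lsn < 100 + RING_SIZE; lsn++)
        ring_note_prefetch(&ring, lsn);

    /* Ring is now full; another prefetch has to wait ... */
    bool        could_issue = ring_note_prefetch(&ring, 200);

    /* ... until replay passes some of the remembered LSNs. */
    ring_replay_progress(&ring, 103);
    return (!could_issue && ring_note_prefetch(&ring, 200)) ? 0 : 1;
}

When the ring fills up, no further prefetches are issued until replay has
passed the oldest remembered LSN, which is what bounds the amount of
speculative I/O outstanding at any time.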




Re: WIP: WAL prefetch (another approach)

2021-04-08 Thread Tomas Vondra



On 4/8/21 1:46 PM, Thomas Munro wrote:
> On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra
>  wrote:
>> On 4/7/21 1:24 PM, Thomas Munro wrote:
>>> I made one other simplifying change: previously, the prefetch module
>>> would read the WAL up to the "written" LSN (so, allowing itself to
>>> read data that had been written but not yet flushed to disk by the
>>> walreceiver), though it still waited until a record's LSN was
>>> "flushed" before replaying.  That allowed prefetching to happen
>>> concurrently with the WAL flush, which was nice, but it felt a little
>>> too "special".  I decided to remove that part for now, and I plan to
>>> look into making standbys work more like primary servers, using WAL
>>> buffers, the WAL writer and optionally the standard log-before-data
>>> rule.
>>
>> Not sure, but the removal seems unnecessary. I'm worried that this will
>> significantly reduce the amount of data that we'll be able to prefetch.
>> How likely is it that we have data that is written but not flushed?
>> Let's assume the replica is lagging and network bandwidth is not the
>> bottleneck - how likely is it that this "has to be flushed" rule is
>> the limit for the prefetching?
> 
> Yeah, it would have been nice to include that but it'll have to be for
> v15 due to lack of time to convince myself that it was correct.  I do
> intend to look into more concurrency of that kind for v15.  I have
> pushed these patches, updated to be disabled by default.  I will look
> into how I can run a BF animal that has it enabled during the recovery
> tests for coverage.  Thanks very much to everyone on this thread for
> all the discussion and testing so far.
> 

OK, understood. I'll rerun the benchmarks on this version, and if
there's a significant negative impact we can look into that during the
stabilization phase.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2021-04-08 Thread Thomas Munro
On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra
 wrote:
> On 4/7/21 1:24 PM, Thomas Munro wrote:
> > I made one other simplifying change: previously, the prefetch module
> > would read the WAL up to the "written" LSN (so, allowing itself to
> > read data that had been written but not yet flushed to disk by the
> > walreceiver), though it still waited until a record's LSN was
> > "flushed" before replaying.  That allowed prefetching to happen
> > concurrently with the WAL flush, which was nice, but it felt a little
> > too "special".  I decided to remove that part for now, and I plan to
> > look into making standbys work more like primary servers, using WAL
> > buffers, the WAL writer and optionally the standard log-before-data
> > rule.
>
> Not sure, but the removal seems unnecessary. I'm worried that this will
> significantly reduce the amount of data that we'll be able to prefetch.
> How likely is it that we have data that is written but not flushed?
> Let's assume the replica is lagging and network bandwidth is not the
> bottleneck - how likely is it that this "has to be flushed" rule is
> the limit for the prefetching?

Yeah, it would have been nice to include that but it'll have to be for
v15 due to lack of time to convince myself that it was correct.  I do
intend to look into more concurrency of that kind for v15.  I have
pushed these patches, updated to be disabled by default.  I will look
into how I can run a BF animal that has it enabled during the recovery
tests for coverage.  Thanks very much to everyone on this thread for
all the discussion and testing so far.




Re: WIP: WAL prefetch (another approach)

2021-04-07 Thread Tomas Vondra
On 4/7/21 1:24 PM, Thomas Munro wrote:
> Here's rebase, on top of Horiguchi-san's v19 patch set.  My patches
> start at 0007.  Previously, there was a "nowait" flag that was passed
> into all the callbacks so that XLogReader could wait for new WAL in
> some cases but not others.  This new version uses the proposed
> XLREAD_NEED_DATA protocol, and the caller deals with waiting for data
> to arrive when appropriate.  This seems tidier to me.
> 

OK, seems reasonable.

> I made one other simplifying change: previously, the prefetch module
> would read the WAL up to the "written" LSN (so, allowing itself to
> read data that had been written but not yet flushed to disk by the
> walreceiver), though it still waited until a record's LSN was
> "flushed" before replaying.  That allowed prefetching to happen
> concurrently with the WAL flush, which was nice, but it felt a little
> too "special".  I decided to remove that part for now, and I plan to
> look into making standbys work more like primary servers, using WAL
> buffers, the WAL writer and optionally the standard log-before-data
> rule.
> 

Not sure, but the removal seems unnecessary. I'm worried that this will
significantly reduce the amount of data that we'll be able to prefetch.
How likely is it that we have data that is written but not flushed?
Let's assume the replica is lagging and network bandwidth is not the
bottleneck - how likely is it that this "has to be flushed" rule is
the limit for the prefetching?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2021-04-01 Thread Thomas Munro
On Fri, Mar 19, 2021 at 2:29 PM Tomas Vondra
 wrote:
> On 3/18/21 1:54 AM, Thomas Munro wrote:
> > I'm now looking at Horiguchi-san and Heikki's patch[2] to remove
> > XLogReader's callbacks, to try to understand how these two patch sets
> > are related.  I don't really like the way those callbacks work, and
> > I'm afraid I had to make them more complicated.  But I don't yet know
> > very much about that other patch set.  More soon.
>
> OK. Do you think we should get both of those patches in, or do we need
> to commit them in a particular order? Or what is your concern?

I would like to commit the callback-removal patch first, and then the
WAL decoder and prefetcher patches become simpler and cleaner on top
of that.  I will post the rebase and explanation shortly.




Re: WIP: WAL prefetch (another approach)

2021-03-18 Thread Tomas Vondra
On 3/18/21 1:54 AM, Thomas Munro wrote:
> On Thu, Mar 18, 2021 at 12:00 PM Tomas Vondra
>  wrote:
>> On 3/17/21 10:43 PM, Stephen Frost wrote:
>>> Guess I'm just not a fan of pushing out a change that will impact
>>> everyone by default, in a possibly negative way (or positive, though
>>> that doesn't seem terribly likely, but who knows), without actually
>>> measuring what that impact will look like in those more common cases.
>>> Showing that it's a great win when you're on ZFS or running with FPWs
>>> disabled is good and the expected best case, but we should be
>>> considering the worst case too when it comes to performance
>>> improvements.
>>>
>>
>> Well, maybe it'll behave differently on systems with ZFS. I don't know,
>> and I have no such machine to test that at the moment. My argument
>> however remains the same - if it happens to be a problem, just don't
>> enable (or disable) the prefetching, and you get the current behavior.
> 
> I see the road map for this feature being to get it working on every
> OS via the AIO patchset, in later work, hopefully not very far in the
> future (in the most portable mode, you get I/O worker processes doing
> pread() or preadv() calls on behalf of recovery).  So I'll be glad to
> get this infrastructure in, even though it's maybe only useful for
> some people in the first release.
> 

+1 to that


>> FWIW I'm not sure there was a discussion or argument about what should
>> be the default setting (enabled or disabled). I'm fine with not enabling
>> this by default, so that people have to enable it explicitly.
>>
>> In a way that'd be consistent with effective_io_concurrency being 1 by
>> default, which almost disables regular prefetching.
> 
> Yeah, I'm not sure but I'd be fine with disabling it by default in the
> initial release.  The current patch set has it enabled, but that's
> mostly for testing, it's not an opinion on how it should ship.
> 

+1 to that too. Better to have it disabled by default than not at all.


> I've attached a rebased patch set with a couple of small changes:
> 
> 1.  I abandoned the patch that proposed
> pg_atomic_unlocked_add_fetch_u{32,64}() and went for a simple function
> local to xlogprefetch.c that just does pg_atomic_write_u64(counter,
> pg_atomic_read_u64(counter) + 1), in response to complaints from
> Andres[1].
> 
> 2.  I fixed a bug in ReadRecentBuffer(), and moved it into its own
> patch for separate review.
> 
> I'm now looking at Horiguchi-san and Heikki's patch[2] to remove
> XLogReader's callbacks, to try to understand how these two patch sets
> are related.  I don't really like the way those callbacks work, and
> I'm afraid I had to make them more complicated.  But I don't yet know
> very much about that other patch set.  More soon.
> 

OK. Do you think we should get both of those patches in, or do we need
to commit them in a particular order? Or what is your concern?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: WAL prefetch (another approach)

2021-03-17 Thread Tomas Vondra
Hi,

On 3/17/21 10:43 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
>> Right, I was just going to point out the FPIs are not necessary - what
>> matters is the presence of long streaks of WAL records touching the same
>> set of blocks. But people with workloads where this is common likely
>> don't need the WAL prefetching at all - the replica can keep up just
>> fine, because it doesn't need to do much I/O anyway (and if it can't
>> then prefetching won't help much anyway). So just don't enable the
>> prefetching, and there'll be no overhead.
> 
> Isn't this exactly the common case though..?  Checkpoints happening
> every 5 minutes, the replay of the FPI happens first and then the record
> is updated and everything's in SB for the later changes?

Well, as I said before, the FPIs are not very significant - you'll have
mostly the same issue with any repeated changes to the same block. It
does not matter much if you do

FPI for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

or just

WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

In both cases some of the prefetches are probably unnecessary. But the
frequency of checkpoints does not really matter, the important bit is
repeated changes to the same block(s).

If you have active set much larger than RAM, this is quite unlikely. And
we know from the pgbench tests that prefetching has a huge positive
effect in this case.

On smaller active sets, with frequent updates to the same block, we may
issue unnecessary prefetches - that's true. But (a) you have not shown
any numbers suggesting this is actually an issue, and (b) those cases
don't really need prefetching because all the data is already either in
shared buffers or in page cache. So if it happens to be an issue, the
user can simply disable it.

So how exactly would a problematic workload look like?

> You mentioned elsewhere that this would improve 80% of cases but that
> doesn't seem to be backed up by anything and certainly doesn't seem
> likely to be the case if we're talking about across all PG
> deployments.

Obviously, the 80% was just a figure of speech, illustrating my belief
that the proposed patch is beneficial for most users who currently have
issues with replication lag. That is based on my experience with support
customers who have such issues - it's almost invariably an OLTP workload
with large active set, and we know (from the benchmarks) that in these
cases it helps.

Users who don't have issues with replication lag can disable (or not
enable) the prefetching, and won't get any negative effects.

Perhaps there are users with weird workloads that have replication lag
issues but this patch won't help them - bummer, we can't solve
everything in one go. Also, no one has actually demonstrated such a workload
in this thread so far.

But as you're suggesting we don't have data to support the claim that
this actually helps many users (with no risk to others), I'd point out
you have not actually provided any numbers showing that it actually is
an issue in practice.


> I also disagree that asking the kernel to go do random I/O for us, 
> even as a prefetch, is entirely free simply because we won't
> actually need those pages.  At the least, it potentially pushes out
> pages that we might need shortly from the filesystem cache, no?

Where exactly did I say it's free? I said that workloads where this
happens a lot most likely don't need the prefetching at all, so it can
be simply disabled, eliminating all negative effects.

Moreover, looking at a limited number of recently prefetched blocks
won't eliminate this problem anyway - imagine a random OLTP workload on a
data set that is large but still fits into RAM. After a while no read I/O
needs to be done, but you'd need a practically infinite list of prefetched
blocks to eliminate that, and with smaller lists you'll still do 99% of the
prefetches.

Just disabling prefetching on such instances seems quite reasonable.
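
In practice, opting out on such an instance is a single setting (the GUC
name shown here is the one the feature ended up shipping with; the WIP
patch may spell it differently):

    # postgresql.conf
    recovery_prefetch = off    # don't issue prefetches during WAL replay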


>> If it was up to me, I'd just get the patch committed as is. Delaying the
>> feature because of concerns that it might have some negative effect in
>> some cases, when that can be simply mitigated by disabling the feature,
>> is not really beneficial for our users.
> 
> I don't know that we actually know how many cases it might have a
> negative effect on or what the actual amount of such negative case there
> might be- that's really why we should probably try to actually benchmark
> it and get real numbers behind it, particularly when the chances of
> running into such a negative effect with the default configuration (that
> is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more
> likely to occur in the field than the cases where FPWs are disabled and
> someone's running on ZFS.
> 
> Perhaps more to the point, it'd be nice to see how this

Re: WIP: WAL prefetch (another approach)

2021-03-17 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> Right, I was just going to point out the FPIs are not necessary - what
> matters is the presence of long streaks of WAL records touching the same
> set of blocks. But people with workloads where this is common likely
> don't need the WAL prefetching at all - the replica can keep up just
> fine, because it doesn't need to do much I/O anyway (and if it can't
> then prefetching won't help much anyway). So just don't enable the
> prefetching, and there'll be no overhead.

Isn't this exactly the common case though..?  Checkpoints happening
every 5 minutes, the replay of the FPI happens first and then the record
is updated and everything's in SB for the later changes?  You mentioned
elsewhere that this would improve 80% of cases but that doesn't seem to
be backed up by anything and certainly doesn't seem likely to be the
case if we're talking about across all PG deployments.  I also disagree
that asking the kernel to go do random I/O for us, even as a prefetch,
is entirely free simply because we won't actually need those pages.  At
the least, it potentially pushes out pages that we might need shortly
from the filesystem cache, no?

> If it was up to me, I'd just get the patch committed as is. Delaying the
> feature because of concerns that it might have some negative effect in
> some cases, when that can be simply mitigated by disabling the feature,
> is not really beneficial for our users.

I don't know that we actually know how many cases it might have a
negative effect on or what the actual amount of such negative case there
might be- that's really why we should probably try to actually benchmark
it and get real numbers behind it, particularly when the chances of
running into such a negative effect with the default configuration (that
is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more
likely to occur in the field than the cases where FPWs are disabled and
someone's running on ZFS.

Perhaps more to the point, it'd be nice to see how this change actually
improves the cases where PG is running with more-or-less the defaults on
the more commonly deployed filesystems.  If it doesn't then maybe it
shouldn't be the default..?  Surely the folks running on ZFS and running
with FPWs disabled would be able to manage to enable it if they
wished to and we could avoid entirely the question of if this has a
negative impact on the more common cases.

Guess I'm just not a fan of pushing out a change that will impact
everyone by default, in a possibly negative way (or positive, though
that doesn't seem terribly likely, but who knows), without actually
measuring what that impact will look like in those more common cases.
Showing that it's a great win when you're on ZFS or running with FPWs
disabled is good and the expected best case, but we should be
considering the worst case too when it comes to performance
improvements.

Anyhow, ultimately I don't know that there's much more to discuss on
this thread with regard to this particular topic, at least.  As I said
before, if everyone else is on board and not worried about it then so be
it; I feel that at least the concern that I raised has been heard.

Thanks,

Stephen




Re: WIP: WAL prefetch (another approach)

2021-03-17 Thread Tomas Vondra
On 2/15/21 12:18 AM, Stephen Frost wrote:
> Greetings,
> 
> ...
>
 I think there's potential for some significant optimization going
 forward, but I think it's basically optimization over what we're doing
 today. As this is already a nontrivial patch, I'd argue for doing so
 separately.
>>>
>>> This seems like a great optimization, albeit a fair bit of code, for a
>>> relatively uncommon use-case, specifically where full page writes are
>>> disabled or very large checkpoints.  As that's the case though, I would
>>> think it's reasonable to ask that it go out of its way to avoid slowing
>>> down the more common configurations, particularly since it's proposed to
>>> have it on by default (which I agree with, provided it ends up improving
>>> the common cases, which I think the suggestions above would certainly
>>> make it more likely to do).
>>
>> I'm OK to do some benchmarking, but it's not quite clear to me why it
>> matters if the checkpoints are smaller than shared buffers? IMO what matters
>> is how "localized" the updates are, i.e. how likely it is to hit the same
>> page repeatedly (in a short amount of time). Regular pgbench is not very
>> suitable for that, but some non-uniform distribution should do the trick, I
>> think.
> 
> I suppose strictly speaking it'd be
> Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that
> it's more about the wal_decode_buffer_size than the checkpoint's size.
> Apologies for the confusion.  As suggested above, one way to benchmark
> this to really see if there's any issue would be to increase
> wal_decode_buffer_size to some pretty big size and then compare the
> performance vs. unpatched.  I'd think that could even be done with
> pgbench, so you're not having to arrange for the same pages to get
> updated over and over.
> 

What exactly would be the point of such a benchmark? I don't think the
patch does prefetching based on wal_decode_buffer_size - that just says
how far ahead we decode; the prefetch distance is defined by
maintenance_io_concurrency.
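
To make that distinction concrete, the two knobs look like this (names as
in the committed feature; the values shown are the shipped defaults):

    wal_decode_buffer_size = 512kB     # how far ahead of replay WAL is decoded
    maintenance_io_concurrency = 10    # cap on prefetches actually in flight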

But it's not clear to me what exactly the result would say about the
necessity of the optimization at hand (skipping prefetches for blocks
with a recent FPI). If maintenance_io_concurrency is very high,
the probability that a block is evicted prematurely grows, making the
prefetch useless in general. How does this say anything about the
problem at hand? Sure, we'll do unnecessary I/O, causing issues, but
that's a bit like complaining the engine gets very hot when driving on a
highway in reverse.

AFAICS to measure the worst case, you'd need a workload with a lot of
FPIs, and very little actual I/O. That means, data set that fits into
memory (either shared buffers or RAM), and short checkpoints. But that's
exactly the case where you don't need prefetching ...

>>> Perhaps this already improves the common cases and is worth the extra
>>> code on that basis, but I don't recall seeing much in the way of
>>> benchmarking in this thread for that case- that is, where FPIs are
>>> enabled and checkpoints are smaller than shared buffers.  Jakub's
>>> testing was done with FPWs disabled and Tomas's testing used checkpoints
>>> which were much larger than the size of shared buffers on the system
>>> doing the replay.  While it's certainly good that this patch improves
>>> those cases, we should also be looking out for the worst case and make
>>> sure that the patch doesn't degrade performance in that case.
>>
>> I'm with Andres on this. It's fine to leave some possible optimizations on
>> the table for the future. And even if some workloads are affected
>> negatively, it's still possible to disable the prefetching.
> 
> While I'm generally in favor of this argument, that a feature is
> particularly important and that it's worth slowing down the common cases
> to enable it, I dislike that it's applied inconsistently.  I'd certainly

If you have a workload where this happens to cause issues, you can just
disable that. IMHO that's a perfectly reasonable engineering approach,
where we get something that significantly improves 80% of the cases,
allow disabling it for cases where it might cause issues, and then
improve it in the next version.


> feel better about it if we had actual performance numbers to consider.
> I don't doubt the possibility that the extra prefetches just don't
> amount to enough to matter but I have a hard time seeing them as not
> having some cost and without actually measuring it, it's hard to say
> what that cost is.
> 
> Without looking farther back than the last record, we could end up
> repeatedly asking for the same blocks to be prefetched too-
> 
> FPI for block 1
> FPI for block 2
> WAL record for block 1
> WAL record for block 2
> WAL record for block 1
> WAL record for block 2
> WAL record for block 1
> WAL record for block 2
> 
> ... etc.
> 
> Entirely possible my math is off, but seems like the worst case
> situation right now might end up with some 4500 u

Re: WIP: WAL prefetch (another approach)

2021-02-14 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 2/13/21 10:39 PM, Stephen Frost wrote:
> >* Andres Freund (and...@anarazel.de) wrote:
> >>On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> >>>Yeah, that's a good point. I think it'd make sense to keep track of recent
> >>>FPIs and skip prefetching such blocks. But how exactly should we implement
> >>>that, how many blocks do we need to track? If you get an FPI, how long
> >>>should we skip prefetching of that block?
> >>>
> >>>I don't think the history needs to be very long, for two reasons. Firstly,
> >>>the usual pattern is that we have FPI + several changes for that block
> >>>shortly after it. Secondly, maintenance_io_concurrency limits this 
> >>>naturally
> >>>- after crossing that, redo should place the FPI into shared buffers,
> >>>allowing us to skip the prefetch.
> >>>
> >>>So I think using maintenance_io_concurrency is sufficient. We might track
> >>>more buffers to allow skipping prefetches of blocks that were evicted from
> >>>shared buffers, but that seems like an overkill.
> >>>
> >>>However, maintenance_io_concurrency can be quite high, so just a simple
> >>>queue is not very suitable - searching it linearly for each block would be
> >>>too expensive. But I think we can use a simple hash table, tracking
> >>>(relfilenode, block, LSN), over-sized to minimize collisions.
> >>>
> >>>Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
> >>>and whenever we prefetch a block or find an FPI, we simply add the block to
> >>>the array as determined by hash(relfilenode, block)
> >>>
> >>> hashtable[hash(...)] = {relfilenode, block, LSN}
> >>>
> >>>and then when deciding whether to prefetch a block, we look at that one
> >>>position. If the (relfilenode, block) match, we check the LSN and skip the
> >>>prefetch if it's sufficiently recent. Otherwise we prefetch.
> >>
> >>I'm a bit doubtful this is really needed at this point. Yes, the
> >>prefetching will do a buffer table lookup - but it's a lookup that
> >>already happens today. And the patch already avoids doing a second
> >>lookup after prefetching (by optimistically caching the last Buffer id,
> >>and re-checking).
> >
> >I agree that when a page is looked up, and found, in the buffer table
> >that the subsequent cacheing of the buffer id in the WAL records does a
> >good job of avoiding having to re-do that lookup.  However, that isn't
> >the case which was being discussed here or what Tomas's suggestion was
> >intended to address.
> >
> >What I pointed out up-thread and what's being discussed here is what
> >happens when the WAL contains a few FPIs and a few regular WAL records
> >which are mixed up and not in ideal order.  When that happens, with this
> >patch, the FPIs will be ignored, the regular WAL records will reference
> >blocks which aren't found in shared buffers (yet) and then we'll both
> >issue pre-fetches for those and end up having spent effort doing a
> >buffer lookup that we'll later re-do.
> 
> The question is how common this pattern actually is - I don't know. As
> noted, the non-FPI would have to be fairly close to the FPI, i.e. within the
> wal_decode_buffer_size, to actually cause measurable harm.

Yeah, so it'll depend on how big wal_decode_buffer_size is.  Increasing
that would certainly help to show if there ends up being a degradation
with this patch due to the extra prefetching being done.
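
One way to exaggerate any such cost while testing would be to raise both
limits well above their defaults on the replaying server and rerun the
same workload, e.g. (purely illustrative values):

    wal_decode_buffer_size = 8MB
    maintenance_io_concurrency = 100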

> >To address the unnecessary syscalls we really just need to keep track of
> >any FPIs that we've seen between the point where the prefetching
> >is happening and the point where the replay is being done- once replay
> >has replayed an FPI, our buffer lookup will succeed and we'll cache the
> >buffer that the FPI is at- in other words, only wal_decode_buffer_size
> >amount of WAL needs to be considered.
> 
> Yeah, that's essentially what I proposed.

Glad I captured it correctly.

> >We could further leverage this tracking of FPIs, to skip the prefetch
> >syscalls, by cacheing what later records address the blocks that have
> >FPIs earlier in the queue with the FPI record and then when replay hits
> >the FPI and loads it into shared_buffers, it could update the other WAL
> >records in the queue with the buffer id of the page, allowing us to very
> >likely avoid having to do another lookup later on.
> 
> This seems like over-engineering, at least for v1.

Perhaps, though it didn't seem like it'd be very hard to do with the
already proposed changes to stash the buffer id in the WAL records.

> >>I think there's potential for some significant optimization going
> >>forward, but I think it's basically optimization over what we're doing
> >>today. As this is already a nontrivial patch, I'd argue for doing so
> >>separately.
> >
> >This seems like a great optimization, albeit a fair bit of code, for a
> >relatively uncommon use-case, specifically where full page writes are
> >disabled or very large checkpoints.  As that's the cas

Re: WIP: WAL prefetch (another approach)

2021-02-14 Thread Tomas Vondra




On 2/13/21 10:39 PM, Stephen Frost wrote:
> Greetings,
> 
> * Andres Freund (and...@anarazel.de) wrote:
>> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
>>> Yeah, that's a good point. I think it'd make sense to keep track of recent
>>> FPIs and skip prefetching such blocks. But how exactly should we implement
>>> that, how many blocks do we need to track? If you get an FPI, how long
>>> should we skip prefetching of that block?
>>>
>>> I don't think the history needs to be very long, for two reasons. Firstly,
>>> the usual pattern is that we have FPI + several changes for that block
>>> shortly after it. Secondly, maintenance_io_concurrency limits this naturally
>>> - after crossing that, redo should place the FPI into shared buffers,
>>> allowing us to skip the prefetch.
>>>
>>> So I think using maintenance_io_concurrency is sufficient. We might track
>>> more buffers to allow skipping prefetches of blocks that were evicted from
>>> shared buffers, but that seems like an overkill.
>>>
>>> However, maintenance_io_concurrency can be quite high, so just a simple
>>> queue is not very suitable - searching it linearly for each block would be
>>> too expensive. But I think we can use a simple hash table, tracking
>>> (relfilenode, block, LSN), over-sized to minimize collisions.
>>>
>>> Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
>>> and whenever we prefetch a block or find an FPI, we simply add the block to
>>> the array as determined by hash(relfilenode, block)
>>>
>>>  hashtable[hash(...)] = {relfilenode, block, LSN}
>>>
>>> and then when deciding whether to prefetch a block, we look at that one
>>> position. If the (relfilenode, block) match, we check the LSN and skip the
>>> prefetch if it's sufficiently recent. Otherwise we prefetch.
>>
>> I'm a bit doubtful this is really needed at this point. Yes, the
>> prefetching will do a buffer table lookup - but it's a lookup that
>> already happens today. And the patch already avoids doing a second
>> lookup after prefetching (by optimistically caching the last Buffer id,
>> and re-checking).
> 
> I agree that when a page is looked up, and found, in the buffer table
> that the subsequent cacheing of the buffer id in the WAL records does a
> good job of avoiding having to re-do that lookup.  However, that isn't
> the case which was being discussed here or what Tomas's suggestion was
> intended to address.
> 
> What I pointed out up-thread and what's being discussed here is what
> happens when the WAL contains a few FPIs and a few regular WAL records
> which are mixed up and not in ideal order.  When that happens, with this
> patch, the FPIs will be ignored, the regular WAL records will reference
> blocks which aren't found in shared buffers (yet) and then we'll both
> issue pre-fetches for those and end up having spent effort doing a
> buffer lookup that we'll later re-do.

The question is how common this pattern actually is - I don't know. As
noted, the non-FPI would have to be fairly close to the FPI, i.e. within
the wal_decode_buffer_size, to actually cause measurable harm.

> To address the unnecessary syscalls we really just need to keep track of
> any FPIs that we've seen between the point where the prefetching
> is happening and the point where the replay is being done- once replay
> has replayed an FPI, our buffer lookup will succeed and we'll cache the
> buffer that the FPI is at- in other words, only wal_decode_buffer_size
> amount of WAL needs to be considered.

Yeah, that's essentially what I proposed.

> We could further leverage this tracking of FPIs, to skip the prefetch
> syscalls, by cacheing what later records address the blocks that have
> FPIs earlier in the queue with the FPI record and then when replay hits
> the FPI and loads it into shared_buffers, it could update the other WAL
> records in the queue with the buffer id of the page, allowing us to very
> likely avoid having to do another lookup later on.

This seems like over-engineering, at least for v1.

>> I think there's potential for some significant optimization going
>> forward, but I think it's basically optimization over what we're doing
>> today. As this is already a nontrivial patch, I'd argue for doing so
>> separately.
> 
> This seems like a great optimization, albeit a fair bit of code, for a
> relatively uncommon use-case, specifically where full page writes are
> disabled or very large checkpoints.  As that's the case though, I would
> think it's reasonable to ask that it go out of its way to avoid slowing
> down the more common configurations, particularly since it's proposed to
> have it on by default (which I agree with, provided it ends up improving
> the common cases, which I think the suggestions above would certainly
> make it more likely to do).

I'm OK to do some benchmarking, but it's not quite clear to me why it
matters if the checkpoints are smaller than shared buffers? IMO what
matters is how "localized" the updates are, i.e. how likely it is to hit
the same page repeatedly (in a short amount of time). Regular pgbench is
not very suitable for that, but some non-uniform distribution should do
the trick, I think.
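
For illustration, the fixed-size array described in the quoted proposal
above could look roughly like the sketch below. None of this is code from
the patch; all names are hypothetical, and the "sufficiently recent"
cutoff is left as a parameter since that is exactly the policy question
being discussed:

    /* Sketch only; assumes common/hashfn.h, storage/relfilenode.h,
     * storage/block.h and access/xlogdefs.h. */

    /* One slot per hash position; a collision simply overwrites the older entry. */
    typedef struct RecentBlock
    {
        RelFileNode rnode;      /* relation */
        BlockNumber block;      /* block within the relation */
        XLogRecPtr  lsn;        /* LSN at which we saw the FPI / issued the prefetch */
    } RecentBlock;

    /* Sized to 2 * maintenance_io_concurrency and zeroed at startup. */
    static RecentBlock *recent_blocks;
    static int  recent_blocks_count;

    static inline int
    recent_block_slot(RelFileNode rnode, BlockNumber block)
    {
        uint32  h;

        h = hash_combine(hash_bytes((const unsigned char *) &rnode, sizeof(rnode)),
                         hash_uint32(block));
        return h % recent_blocks_count;
    }

    /* Remember that this block was covered by an FPI, or prefetched, at 'lsn'. */
    static void
    recent_block_remember(RelFileNode rnode, BlockNumber block, XLogRecPtr lsn)
    {
        RecentBlock *slot = &recent_blocks[recent_block_slot(rnode, block)];

        slot->rnode = rnode;
        slot->block = block;
        slot->lsn = lsn;
    }

    /*
     * Decide whether to skip the prefetch: anything remembered for this
     * block at or after 'cutoff_lsn' counts as sufficiently recent.
     */
    static bool
    recent_block_covered(RelFileNode rnode, BlockNumber block, XLogRecPtr cutoff_lsn)
    {
        RecentBlock *slot = &recent_blocks[recent_block_slot(rnode, block)];

        return RelFileNodeEquals(slot->rnode, rnode) &&
            slot->block == block &&
            slot->lsn >= cutoff_lsn;
    }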

Re: WIP: WAL prefetch (another approach)

2021-02-13 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> > Yeah, that's a good point. I think it'd make sense to keep track of recent
> > FPIs and skip prefetching such blocks. But how exactly should we implement
> > that, how many blocks do we need to track? If you get an FPI, how long
> > should we skip prefetching of that block?
> > 
> > I don't think the history needs to be very long, for two reasons. Firstly,
> > the usual pattern is that we have FPI + several changes for that block
> > shortly after it. Secondly, maintenance_io_concurrency limits this naturally
> > - after crossing that, redo should place the FPI into shared buffers,
> > allowing us to skip the prefetch.
> > 
> > So I think using maintenance_io_concurrency is sufficient. We might track
> > more buffers to allow skipping prefetches of blocks that were evicted from
> > shared buffers, but that seems like an overkill.
> > 
> > However, maintenance_io_concurrency can be quite high, so just a simple
> > queue is not very suitable - searching it linearly for each block would be
> > too expensive. But I think we can use a simple hash table, tracking
> > (relfilenode, block, LSN), over-sized to minimize collisions.
> > 
> > Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
> > and whenever we prefetch a block or find an FPI, we simply add the block to
> > the array as determined by hash(relfilenode, block)
> > 
> > hashtable[hash(...)] = {relfilenode, block, LSN}
> > 
> > and then when deciding whether to prefetch a block, we look at that one
> > position. If the (relfilenode, block) match, we check the LSN and skip the
> > prefetch if it's sufficiently recent. Otherwise we prefetch.
> 
> I'm a bit doubtful this is really needed at this point. Yes, the
> prefetching will do a buffer table lookup - but it's a lookup that
> already happens today. And the patch already avoids doing a second
> lookup after prefetching (by optimistically caching the last Buffer id,
> and re-checking).

I agree that when a page is looked up, and found, in the buffer table
that the subsequent cacheing of the buffer id in the WAL records does a
good job of avoiding having to re-do that lookup.  However, that isn't
the case which was being discussed here or what Tomas's suggestion was
intended to address.

What I pointed out up-thread and what's being discussed here is what
happens when the WAL contains a few FPIs and a few regular WAL records
which are mixed up and not in ideal order.  When that happens, with this
patch, the FPIs will be ignored, the regular WAL records will reference
blocks which aren't found in shared buffers (yet) and then we'll both
issue pre-fetches for those and end up having spent effort doing a
buffer lookup that we'll later re-do.

To address the unnecessary syscalls we really just need to keep track of
any FPIs that we've seen between the point where the prefetching
is happening and the point where the replay is being done- once replay
has replayed an FPI, our buffer lookup will succeed and we'll cache the
buffer that the FPI is at- in other words, only wal_decode_buffer_size
amount of WAL needs to be considered.

We could further leverage this tracking of FPIs, to skip the prefetch
syscalls, by cacheing what later records address the blocks that have
FPIs earlier in the queue with the FPI record and then when replay hits
the FPI and loads it into shared_buffers, it could update the other WAL
records in the queue with the buffer id of the page, allowing us to very
likely avoid having to do another lookup later on.

> I think there's potential for some significant optimization going
> forward, but I think it's basically optimization over what we're doing
> today. As this is already a nontrivial patch, I'd argue for doing so
> separately.

This seems like a great optimization, albeit a fair bit of code, for a
relatively uncommon use-case, specifically where full page writes are
disabled or very large checkpoints.  As that's the case though, I would
think it's reasonable to ask that it go out of its way to avoid slowing
down the more common configurations, particularly since it's proposed to
have it on by default (which I agree with, provided it ends up improving
the common cases, which I think the suggestions above would certainly
make it more likely to do).

Perhaps this already improves the common cases and is worth the extra
code on that basis, but I don't recall seeing much in the way of
benchmarking in this thread for that case- that is, where FPIs are
enabled and checkpoints are smaller than shared buffers.  Jakub's
testing was done with FPWs disabled and Tomas's testing used checkpoints
which were much larger than the size of shared buffers on the system
doing the replay.  While it's certainly good that this patch improves
those cases, we should also be looking out for the worst case and make
sure that the patch doesn't degrade performance in that case.
