Re: WIP: WAL prefetch (another approach)
On Wed, Apr 13, 2022 at 8:05 AM Thomas Munro wrote:
> On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker wrote:
> > Simon Riggs writes:
> > > This is a nice feature if it is safe to turn off full_page_writes.
> > > When is it safe to do that? On which platform?
> > >
> > > I am not aware of any released software that allows full_page_writes
> > > to be safely disabled. Perhaps something has been released recently
> > > that allows this? I think we have substantial documentation about
> > > safety of other settings, so we should carefully document things here
> > > also.
> >
> > Our WAL reliability docs claim that ZFS is safe against torn pages:
> >
> > https://www.postgresql.org/docs/current/wal-reliability.html:
> >
> > If you have file-system software that prevents partial page writes
> > (e.g., ZFS), you can turn off this page imaging by turning off the
> > full_page_writes parameter.
>
> Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS
> right now :-(.

Update: OpenZFS now has this working in its master branch (Linux only for
now), so fingers crossed for the next release.
Re: WIP: WAL prefetch (another approach)
On Tue, Apr 26, 2022 at 6:11 PM Thomas Munro wrote: > I will poke some more tomorrow to try to confirm this and try to come > up with a fix. Done, and moved over to the pg_walinspect commit thread to reach the right eyeballs: https://www.postgresql.org/message-id/CA%2BhUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA%40mail.gmail.com
Re: WIP: WAL prefetch (another approach)
On Tue, Apr 26, 2022 at 6:11 AM Tom Lane wrote:
> I believe that the WAL prefetch patch probably accounts for the
> intermittent errors that buildfarm member topminnow has shown
> since it went in, eg [1]:
>
> diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out

Hmm, maybe but I suspect not. I think I might see what's happening here.

> +ERROR: could not read WAL at 0/1903E40
> I've reproduced this manually on that machine, and confirmed that the
> proximate cause is that XLogNextRecord() is returning NULL because
> state->decode_queue_head == NULL, without bothering to provide an errormsg
> (which doesn't seem very well thought out in itself). I obtained the

Thanks for doing that. After several hours of trying I also managed to
reproduce it on that gcc23 system (not at all sure why it doesn't show up
elsewhere; MIPS 32 bit layout may be a factor), and added some trace to get
some more clues. Still looking into it, but here is the current hypothesis
I'm testing:

1. The reason there's a messageless ERROR in this case is because there is
new read_page callback logic introduced for pg_walinspect, called via
read_local_xlog_page_no_wait(), which is like the old read_local_xlog_page()
except that it returns -1 if you try to read past the current "flushed" LSN,
and we have no queued message. An error is then reported by
XLogReadRecord(), and appears to the user.

2. The reason pg_walinspect tries to read WAL data past the flushed LSN is
because its GetWALRecordsInfo() function keeps calling XLogReadRecord()
until EndRecPtr >= end_lsn, where end_lsn is taken from a snapshot of the
flushed LSN, but I don't see where it takes into account that the flushed
LSN might momentarily fall in the middle of a record. In that case,
xlogreader.c will try to read the next page, which fails because it's past
the flushed LSN (see point 1).

I will poke some more tomorrow to try to confirm this and try to come up
with a fix.
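To spell that second point out as (rough, paraphrased) code -- this is not
the actual pg_walinspect source, just the shape of the loop being described:

/* Paraphrased sketch of the suspected problem, not the real code. */
XLogRecPtr  end_lsn = GetFlushRecPtr(NULL);    /* snapshot of the flushed LSN */

while (XLogReadRecord(xlogreader, &errormsg) != NULL &&
       xlogreader->EndRecPtr < end_lsn)
{
    /* collect information about this record */
}

/*
 * If end_lsn happens to land in the middle of a record, decoding that record
 * makes xlogreader.c ask for the next page, and the no-wait callback refuses
 * to read past the flushed LSN, so the loop ends in an ERROR instead of
 * stopping cleanly at the last record that is fully flushed.
 */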
Re: WIP: WAL prefetch (another approach)
Oh, one more bit of data: here's an excerpt from pg_waldump output after the
failed test:

rmgr: Btree       len (rec/tot): 72/72, tx: 727, lsn: 0/01903BC8, prev 0/01903B70, desc: INSERT_LEAF off 111, blkref #0: rel 1663/16384/2673 blk 9
rmgr: Btree       len (rec/tot): 72/72, tx: 727, lsn: 0/01903C10, prev 0/01903BC8, desc: INSERT_LEAF off 141, blkref #0: rel 1663/16384/2674 blk 7
rmgr: Standby     len (rec/tot): 42/42, tx: 727, lsn: 0/01903C58, prev 0/01903C10, desc: LOCK xid 727 db 16384 rel 16391
rmgr: Transaction len (rec/tot): 437/437, tx: 727, lsn: 0/01903C88, prev 0/01903C58, desc: COMMIT 2022-04-25 20:16:03.374197 CEST; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16391
rmgr: Heap        len (rec/tot): 59/59, tx: 728, lsn: 0/01903E40, prev 0/01903C88, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap        len (rec/tot): 59/59, tx: 728, lsn: 0/01903E80, prev 0/01903E40, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/34, tx: 728, lsn: 0/01903EC0, prev 0/01903E80, desc: COMMIT 2022-04-25 20:16:03.379323 CEST
rmgr: Heap        len (rec/tot): 59/59, tx: 729, lsn: 0/01903EE8, prev 0/01903EC0, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap        len (rec/tot): 59/59, tx: 729, lsn: 0/01903F28, prev 0/01903EE8, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/34, tx: 729, lsn: 0/01903F68, prev 0/01903F28, desc: COMMIT 2022-04-25 20:16:03.381720 CEST

The error is complaining about not being able to read 0/01903E40, which
AFAICT is from the first "INSERT INTO sample_tbl" command, which most
certainly ought to be down to disk at this point.

Also, I modified the test script to see what WAL LSNs it thought it was
dealing with, and got

+\echo 'wal_lsn1 = ' :wal_lsn1
+wal_lsn1 = 0/1903E40
+\echo 'wal_lsn2 = ' :wal_lsn2
+wal_lsn2 = 0/1903EE8

confirming that idea of where 0/01903E40 is in the WAL history.

So this is sure looking like a bug somewhere in xlogreader.c, not in
pg_walinspect.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
I believe that the WAL prefetch patch probably accounts for the
intermittent errors that buildfarm member topminnow has shown
since it went in, eg [1]:

diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out
--- /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out  2022-04-10 03:05:15.972622440 +0200
+++ /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out  2022-04-25 05:09:49.861642059 +0200
@@ -34,11 +34,7 @@
 (1 row)
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok
-
- t
-(1 row)
-
+ERROR: could not read WAL at 0/1903E40
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats(:'wal_lsn1', :'wal_lsn2');
  ok
@@ -46,11 +42,7 @@
 (1 row)
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
- ok
-
- t
-(1 row)
-
+ERROR: could not read WAL at 0/1903E40
 -- ===
 -- Test for filtering out WAL records of a particular table
 -- ===

I've reproduced this manually on that machine, and confirmed that the
proximate cause is that XLogNextRecord() is returning NULL because
state->decode_queue_head == NULL, without bothering to provide an errormsg
(which doesn't seem very well thought out in itself). I obtained the
contents of the xlogreader struct at failure:

(gdb) p *xlogreader
$1 = {routine = {page_read = 0x594270 , segment_open = 0x593b44 ,
    segment_close = 0x593d38 }, system_identifier = 0, private_data = 0x0,
  ReadRecPtr = 26230672, EndRecPtr = 26230752, abortedRecPtr = 26230752,
  missingContrecPtr = 26230784, overwrittenRecPtr = 0,
  DecodeRecPtr = 26230672, NextRecPtr = 26230752, PrevRecPtr = 0,
  record = 0x0, decode_buffer = 0xf25428 "\240", decode_buffer_size = 65536,
  free_decode_buffer = true, decode_buffer_head = 0xf25428 "\240",
  decode_buffer_tail = 0xf25428 "\240", decode_queue_head = 0x0,
  decode_queue_tail = 0x0, readBuf = 0xf173f0 "\020\321\005", readLen = 0,
  segcxt = { ws_dir = '\000' , ws_segsize = 16777216},
  seg = { ws_file = 25, ws_segno = 0, ws_tli = 1}, segoff = 0,
  latestPagePtr = 26222592, latestPageTLI = 1, currRecPtr = 26230752,
  currTLI = 1, currTLIValidUntil = 0, nextTLI = 0,
  readRecordBuf = 0xf1b3f8 "<", readRecordBufSize = 40960,
  errormsg_buf = 0xef3270 "", errormsg_deferred = false, nonblocking = false}

I don't have an intuition about where to look beyond that, any suggestions?
What I do know so far is that while the failure reproduces fairly reliably
under "make check" (more than half the time, which squares with topminnow's
history), it doesn't reproduce at all under "make installcheck" (after
removing NO_INSTALLCHECK), which seems odd. Maybe it's dependent on how much
WAL history the installation has accumulated?

It could be that this is a bug in pg_walinspect or a fault in its test case;
hard to tell since that got committed at about the same time as the prefetch
changes.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47
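(For reference, the code path that produces a NULL result with no message is
the empty-queue branch of XLogNextRecord(); condensed -- not a verbatim quote
of xlogreader.c -- it looks roughly like this:)

if (state->decode_queue_head == NULL)
{
    *errormsg = NULL;
    if (state->errormsg_deferred)
    {
        if (state->errormsg_buf[0] != '\0')
            *errormsg = state->errormsg_buf;
        state->errormsg_deferred = false;
    }
    /* no record and, unless an error was deferred, no message either */
    return NULL;
}

In the struct dump above, errormsg_deferred is false, which is why the caller
ends up with neither a record nor an error message.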
Re: WIP: WAL prefetch (another approach)
On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker wrote:
> Simon Riggs writes:
> > This is a nice feature if it is safe to turn off full_page_writes.

As others have said/shown, it does also help if a block with FPW is evicted
and then read back in during one checkpoint cycle, in other words if the
working set is larger than shared buffers.

This also provides infrastructure for proposals in the next cycle, as part
of commitfest #3316:

* in direct I/O mode, I/O stalls become more likely due to lack of kernel
  prefetching/double-buffering, so prefetching becomes more essential
* even in buffered I/O mode when benefiting from free double-buffering, the
  copy from kernel buffer to user space buffer can be finished in the
  background instead of calling pread() when you need the page, but you need
  to start it sooner
* adjacent blocks accessed by nearby records can be merged into a single
  scatter-read, for example with preadv() in the background
* repeated buffer lookups, pins, locks (and maybe eventually replay) to the
  same page can be consolidated

Pie-in-the-sky ideas:

* someone might eventually want to be able to replay in parallel (hard, but
  certainly requires lookahead)
* I sure hope we'll eventually use different techniques for torn-page
  protection to avoid the high online costs of FPW

> > When is it safe to do that? On which platform?
> >
> > I am not aware of any released software that allows full_page_writes
> > to be safely disabled. Perhaps something has been released recently
> > that allows this? I think we have substantial documentation about
> > safety of other settings, so we should carefully document things here
> > also.
>
> Our WAL reliability docs claim that ZFS is safe against torn pages:
>
> https://www.postgresql.org/docs/current/wal-reliability.html:
>
> If you have file-system software that prevents partial page writes
> (e.g., ZFS), you can turn off this page imaging by turning off the
> full_page_writes parameter.

Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS right
now :-(. I have some patches to fix that on Linux[1] and FreeBSD and it
seems like there's a good chance of getting them committed based on
feedback, but it needs some more work on tests and mmap integration. If
anyone's interested in helping get that landed faster, please ping me
off-list.

[1] https://github.com/openzfs/zfs/pull/9807
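For anyone who hasn't looked at the mechanism: the hint in question boils
down to a posix_fadvise(POSIX_FADV_WILLNEED) call on the relation data file,
something like this standalone sketch (file name and block number invented
for illustration):

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* Hypothetical relation data file and block; both invented. */
    const char *path = "base/16384/16391";
    off_t       blcksz = 8192;
    off_t       offset = 42 * blcksz;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * Ask the kernel to start reading the block in the background.  Note
     * that a zero return only means the hint was accepted, not that any
     * read-ahead will actually happen; on file systems that ignore the
     * hint (like ZFS today, as discussed above), a later pread() of the
     * block still blocks on I/O.
     */
    if (posix_fadvise(fd, offset, blcksz, POSIX_FADV_WILLNEED) != 0)
        fprintf(stderr, "posix_fadvise failed\n");

    close(fd);
    return 0;
}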
Re: WIP: WAL prefetch (another approach)
> Other way around. FPWs make prefetch unnecessary.
> Therefore you would only want prefetch with FPW=off, AFAIK.

A few scenarios I can imagine page prefetch helping are:

1/ A DR replica instance that is a smaller instance size than the primary.
Page prefetch can bring the pages back into memory in advance when they are
evicted. This speeds up the replay and is cost effective.

2/ Allows larger checkpoint_timeout for the same recovery SLA and perhaps
improved performance?

3/ WAL prefetch (not pages by itself) can improve replay by itself (not sure
if it was measured in isolation, Tomas V can comment on it).

4/ Read replica running the analytical workload scenario Tomas V mentioned
earlier.

> Or put this another way: when is it safe and sensible to use
> recovery_prefetch != off?

When checkpoint_timeout is set large and under heavy write activity, on a
read replica that has a working set larger than memory and is receiving
constant updates from the primary. This covers 1 & 4 above.

> --
> Simon Riggs    http://www.EnterpriseDB.com/
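As a concrete illustration of the sort of standby configuration those
scenarios imply (placeholder values only, not a recommendation):

# postgresql.conf on the standby -- illustrative values only
recovery_prefetch = 'try'          # prefetch referenced blocks where posix_fadvise works
wal_decode_buffer_size = '512kB'   # how far ahead to look in the WAL
maintenance_io_concurrency = 10    # cap on concurrent prefetch hints
checkpoint_timeout = '30min'       # longer cycles mean more evicted-and-revisited pages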
Re: WIP: WAL prefetch (another approach)
On 4/12/22 17:46, Simon Riggs wrote: > On Tue, 12 Apr 2022 at 16:41, Tomas Vondra > wrote: >> >> On 4/12/22 15:58, Simon Riggs wrote: >>> On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote: >>> With that... I've finally pushed the 0002 patch and will be watching the build farm. >>> >>> This is a nice feature if it is safe to turn off full_page_writes. >>> >>> When is it safe to do that? On which platform? >>> >>> I am not aware of any released software that allows full_page_writes >>> to be safely disabled. Perhaps something has been released recently >>> that allows this? I think we have substantial documentation about >>> safety of other settings, so we should carefully document things here >>> also. >>> >> >> I don't see why/how would an async prefetch make FPW unnecessary. Did >> anyone claim that be the case? > > Other way around. FPWs make prefetch unnecessary. > Therefore you would only want prefetch with FPW=off, AFAIK. > > Or put this another way: when is it safe and sensible to use > recovery_prefetch != off? > That assumes the FPI stays in memory until the next modification, and that can be untrue for a number of reasons. Long checkpoint interval with enough random accesses in between is a nice example. See the benchmarks I did a year ago (regular pgbench). Or imagine a r/o replica used to run analytics queries, that access so much data it evicts the buffers initialized by the FPI records. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Simon Riggs writes: > On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote: > >> With that... I've finally pushed the 0002 patch and will be watching >> the build farm. > > This is a nice feature if it is safe to turn off full_page_writes. > > When is it safe to do that? On which platform? > > I am not aware of any released software that allows full_page_writes > to be safely disabled. Perhaps something has been released recently > that allows this? I think we have substantial documentation about > safety of other settings, so we should carefully document things here > also. Our WAL reliability docs claim that ZFS is safe against torn pages: https://www.postgresql.org/docs/current/wal-reliability.html: If you have file-system software that prevents partial page writes (e.g., ZFS), you can turn off this page imaging by turning off the full_page_writes parameter. - ilmari
Re: WIP: WAL prefetch (another approach)
On Tue, 12 Apr 2022 at 16:41, Tomas Vondra wrote:
>
> On 4/12/22 15:58, Simon Riggs wrote:
> > On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote:
> >
> >> With that... I've finally pushed the 0002 patch and will be watching
> >> the build farm.
> >
> > This is a nice feature if it is safe to turn off full_page_writes.
> >
> > When is it safe to do that? On which platform?
> >
> > I am not aware of any released software that allows full_page_writes
> > to be safely disabled. Perhaps something has been released recently
> > that allows this? I think we have substantial documentation about
> > safety of other settings, so we should carefully document things here
> > also.
> >
>
> I don't see why/how would an async prefetch make FPW unnecessary. Did
> anyone claim that be the case?

Other way around. FPWs make prefetch unnecessary.
Therefore you would only want prefetch with FPW=off, AFAIK.

Or put this another way: when is it safe and sensible to use
recovery_prefetch != off?

--
Simon Riggs    http://www.EnterpriseDB.com/
Re: WIP: WAL prefetch (another approach)
On 4/12/22 15:58, Simon Riggs wrote: > On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote: > >> With that... I've finally pushed the 0002 patch and will be watching >> the build farm. > > This is a nice feature if it is safe to turn off full_page_writes. > > When is it safe to do that? On which platform? > > I am not aware of any released software that allows full_page_writes > to be safely disabled. Perhaps something has been released recently > that allows this? I think we have substantial documentation about > safety of other settings, so we should carefully document things here > also. > I don't see why/how would an async prefetch make FPW unnecessary. Did anyone claim that be the case? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
On Thu, 7 Apr 2022 at 08:46, Thomas Munro wrote:
> With that... I've finally pushed the 0002 patch and will be watching
> the build farm.

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

--
Simon Riggs    http://www.EnterpriseDB.com/
RE: WIP: WAL prefetch (another approach)
Hi,
Thank you for your reply. I missed the message, sorry.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Thomas Munro
Sent: Tuesday, April 12, 2022 6:28 PM
To: Shinoda, Noriyoshi (PN Japan FSIP)
Cc: Justin Pryzby ; Tomas Vondra ; Stephen Frost ; Andres Freund ; Jakub Wartak ; Alvaro Herrera ; Tomas Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; David Steele ; pgsql-hackers
Subject: Re: WIP: WAL prefetch (another approach)

On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> Thank you for developing the great feature. I tested this feature and checked
> the documentation. Currently, the documentation for the
> pg_stat_prefetch_recovery view is included in the description for the
> pg_stat_subscription view.
>
> https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

Hi! Thanks. I had just committed a fix before I saw your message, because
there was already another report here:

https://www.postgresql.org/message-id/flat/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA%40mail.gmail.com
Re: WIP: WAL prefetch (another approach)
On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP) wrote: > Thank you for developing the great feature. I tested this feature and checked > the documentation. Currently, the documentation for the > pg_stat_prefetch_recovery view is included in the description for the > pg_stat_subscription view. > > https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION Hi! Thanks. I had just committed a fix before I saw your message, because there was already another report here: https://www.postgresql.org/message-id/flat/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA%40mail.gmail.com
RE: WIP: WAL prefetch (another approach)
Hi, Thank you for developing the great feature. I tested this feature and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view. https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION It is also not displayed in the list of "28.2. The Statistics Collector". https://www.postgresql.org/docs/devel/monitoring.html The attached patch modifies the pg_stat_prefetch_recovery view to appear as a separate view. Regards, Noriyoshi Shinoda -Original Message- From: Thomas Munro Sent: Friday, April 8, 2022 10:47 AM To: Justin Pryzby Cc: Tomas Vondra ; Stephen Frost ; Andres Freund ; Jakub Wartak ; Alvaro Herrera ; Tomas Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; David Steele ; pgsql-hackers Subject: Re: WIP: WAL prefetch (another approach) On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby wrote: > The docs seem to be wrong about the default. > > +are not yet in the buffer pool, during recovery. Valid values are > +off (the default), on and > +try. The setting try > + enables Fixed. > + concurrency and distance, respectively. By default, it is set to > + try, which enabled the feature on systems where > + posix_fadvise is available. > > Should say "which enables". Fixed. > Curiously, I reported a similar issue last year. Sorry. I guess both times we only agreed on what the default should be in the final review round before commit, and I let the docs get out of sync (well, the default is mentioned in two places and I apparently ended my search too soon, changing only one). I also found another recently obsoleted sentence: the one about showing nulls sometimes was no longer true. Removed. pg_stat_recovery_prefetch_doc_v1.diff Description: pg_stat_recovery_prefetch_doc_v1.diff
Re: WIP: WAL prefetch (another approach)
On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby wrote: > The docs seem to be wrong about the default. > > +are not yet in the buffer pool, during recovery. Valid values are > +off (the default), on and > +try. The setting try enables Fixed. > + concurrency and distance, respectively. By default, it is set to > + try, which enabled the feature on systems where > + posix_fadvise is available. > > Should say "which enables". Fixed. > Curiously, I reported a similar issue last year. Sorry. I guess both times we only agreed on what the default should be in the final review round before commit, and I let the docs get out of sync (well, the default is mentioned in two places and I apparently ended my search too soon, changing only one). I also found another recently obsoleted sentence: the one about showing nulls sometimes was no longer true. Removed.
Re: WIP: WAL prefetch (another approach)
The docs seem to be wrong about the default. +are not yet in the buffer pool, during recovery. Valid values are +off (the default), on and +try. The setting try enables + concurrency and distance, respectively. By default, it is set to + try, which enabled the feature on systems where + posix_fadvise is available. Should say "which enables". + { + {"recovery_prefetch", PGC_SIGHUP, WAL_RECOVERY, + gettext_noop("Prefetch referenced blocks during recovery"), + gettext_noop("Look ahead in the WAL to find references to uncached data.") + }, + &recovery_prefetch, + RECOVERY_PREFETCH_TRY, recovery_prefetch_options, + check_recovery_prefetch, assign_recovery_prefetch, NULL + }, Curiously, I reported a similar issue last year. On Thu, Apr 08, 2021 at 10:37:04PM -0500, Justin Pryzby wrote: > --- a/doc/src/sgml/wal.sgml > +++ b/doc/src/sgml/wal.sgml > @@ -816,9 +816,7 @@ > prefetching mechanism is most likely to be effective on systems > with full_page_writes set to > off (where that is safe), and where the working > - set is larger than RAM. By default, prefetching in recovery is enabled > - on operating systems that have posix_fadvise > - support. > + set is larger than RAM. By default, prefetching in recovery is disabled. > >
Re: WIP: WAL prefetch (another approach)
On Mon, Apr 4, 2022 at 3:12 PM Julien Rouhaud wrote:
> [review]

Thanks! I took almost all of your suggestions about renaming things,
comments, docs and moving a magic number into a macro. Minor changes:

1. Rebased over the shmem stats changes and others that have just landed
today (woo!). The way my simple SharedStats object works and is reset looks
a little primitive next to the shiny new stats infrastructure, but I can
always adjust that in a follow-up patch if required.

2. It was a bit annoying that the pg_stat_recovery_prefetch view would
sometimes show stale numbers when waiting for WAL to be streamed, since that
happens at arbitrary points X bytes apart in the WAL. Now it also happens
before sleeping/waiting and when recovery ends.

3. Last year, commit a55a9847 synchronised config.sgml with guc.c's
categories. A couple of hunks in there that modified the previous version of
this work before it all got reverted. So I've re-added the WAL_RECOVERY GUC
category, to match the new section in config.sgml.

About test coverage, the most interesting lines of xlogprefetcher.c that
stand out as unreached in a gcov report are in the special handling for the
new CREATE DATABASE in file-copy mode -- but that's probably something to
raise in the thread that introduced that new functionality without a test.
I've tested that code locally; if you define XLOGPREFETCHER_DEBUG_LEVEL
you'll see that it won't touch anything in the new database until recovery
has replayed the file-copy.

As for current CI-vs-buildfarm blind spots that recently bit me and others,
I also tested -m32 and -fsanitize=undefined,unaligned builds. I reran one of
the quick pgbench/crash/drop-caches/recover tests I had lying around and saw
a 17s -> 6s speedup with FPW off (you need much longer tests to see speedup
with them on, so this is a good way for quick sanity checks -- see Tomas V's
results for long runs with FPWs and curved effects).

With that... I've finally pushed the 0002 patch and will be watching the
build farm.
Re: WIP: WAL prefetch (another approach)
On Thu, Mar 31, 2022 at 10:49:32PM +1300, Thomas Munro wrote: > On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud wrote: > > So I finally finished looking at this patch. Here again, AFAICS the > > feature is > > working as expected and I didn't find any problem. I just have some minor > > comments, like for the previous patch. > > Thanks very much for the review. I've attached a new version > addressing most of your feedback, and also rebasing over the new > WAL-logged CREATE DATABASE. I've also fixed a couple of bugs (see > end). > > > For the docs: > > > > +Whether to try to prefetch blocks that are referenced in the WAL > > that > > +are not yet in the buffer pool, during recovery. Valid values are > > +off (the default), on and > > +try. The setting try enables > > +prefetching only if the operating system provides the > > +posix_fadvise function, which is currently > > used > > +to implement prefetching. Note that some operating systems > > provide the > > +function, but don't actually perform any prefetching. > > > > Is there any reason not to change it to try? I'm wondering if some system > > says > > that the function exists but simply raise an error if you actually try to > > use > > it. I think that at least WSL does that for some functions. > > Yeah, we could just default it to try. Whether we should ship that > way is another question, but done for now. Should there be an associated pg15 open item for that, when the patch will be committed? Note that in wal.sgml, the patch still says: + [...] By default, prefetching in + recovery is disabled. I guess this should be changed even if we eventually choose to disable it by default? > I don't think there are any supported systems that have a > posix_fadvise() that fails with -1, or we'd know about it, because > we already use it in other places. We do support one OS that provides > a dummy function in libc that does nothing at all (Solaris/illumos), > and at least a couple that enter the kernel but are known to do > nothing at all for WILLNEED (AIX, FreeBSD). Ah, I didn't know that, thanks for the info! > > bool > > XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, > >RelFileNode *rnode, ForkNumber *forknum, BlockNumber > > *blknum) > > +{ > > + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, > > NULL); > > +} > > + > > +bool > > +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, > > + RelFileNode *rnode, ForkNumber *forknum, > > + BlockNumber *blknum, > > + Buffer *prefetch_buffer) > > { > > > > It's missing comments on that function. XLogRecGetBlockTag comments should > > probably be reworded at the same time. > > New comment added for XLogRecGetBlockInfo(). Wish I could come up > with a better name for that... Not quite sure what you thought I should > change about XLogRecGetBlockTag(). Since XLogRecGetBlockTag is now a wrapper for XLogRecGetBlockInfo, I thought it would be better to document only the specific behavior for this one (so no prefetch_buffer), rather than duplicating the whole description in both places. It seems like a good recipe to miss one of the comments the next time something is changed there. For the name, why not the usual XLogRecGetBlockTagExtended()? > > @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool > > randAccess, > > */ > > if (lastSourceFailed) > > { > > + /* > > +* Don't allow any retry loops to occur during nonblocking > > +* readahead. Let the caller process everything that has been > > +* decoded already first. 
> > +*/ > > + if (nonblocking) > > + return XLREAD_WOULDBLOCK; > > > > Is that really enough? I'm wondering if the code path in ReadRecord() that > > forces lastSourceFailed to False while it actually failed when switching > > into > > archive recovery (xlogrecovery.c around line 3044) can be problematic here. > > I don't see the problem scenario, could you elaborate? Sorry, I missed that in standby mode ReadRecord would keep going until a record is found, so no problem indeed. > > + /* Do we have a clue where the buffer might be already? */ > > + if (BufferIsValid(recent_buffer) && > > + mode == RBM_NORMAL && > > + ReadRecentBuffer(rnode, forknum, blkno, recent_buffer)) > > + { > > + buffer = recent_buffer; > > + goto recent_buffer_fast_path; > > + } > > > > Should this increment (local|shared)_blks_hit, since ReadRecentBuffer > > doesn't? > > Hmm. I guess ReadRecentBuffer() should really do that. Done. Ah, I also thought it be be better there but was assuming that there was some possible usage where it's not wanted. Good then! Should ReadRecentBuffer comment be updated to mention that pgBufferUsage is incremented as
Re: WIP: WAL prefetch (another approach)
On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud wrote: > So I finally finished looking at this patch. Here again, AFAICS the feature > is > working as expected and I didn't find any problem. I just have some minor > comments, like for the previous patch. Thanks very much for the review. I've attached a new version addressing most of your feedback, and also rebasing over the new WAL-logged CREATE DATABASE. I've also fixed a couple of bugs (see end). > For the docs: > > +Whether to try to prefetch blocks that are referenced in the WAL that > +are not yet in the buffer pool, during recovery. Valid values are > +off (the default), on and > +try. The setting try enables > +prefetching only if the operating system provides the > +posix_fadvise function, which is currently used > +to implement prefetching. Note that some operating systems provide > the > +function, but don't actually perform any prefetching. > > Is there any reason not to change it to try? I'm wondering if some system > says > that the function exists but simply raise an error if you actually try to use > it. I think that at least WSL does that for some functions. Yeah, we could just default it to try. Whether we should ship that way is another question, but done for now. I don't think there are any supported systems that have a posix_fadvise() that fails with -1, or we'd know about it, because we already use it in other places. We do support one OS that provides a dummy function in libc that does nothing at all (Solaris/illumos), and at least a couple that enter the kernel but are known to do nothing at all for WILLNEED (AIX, FreeBSD). > + > + The parameter can > + be used to improve I/O performance during recovery by instructing > + PostgreSQL to initiate reads > + of disk blocks that will soon be needed but are not currently in > + PostgreSQL's buffer pool. > + The and > +settings limit prefetching > + concurrency and distance, respectively. > + By default, prefetching in recovery is disabled. > + > > I think that "improving I/O performance" is a bit misleading, maybe reduce I/O > wait time or something like that? Also, I don't know if we need to be that > precise, but maybe we should say that it's the underlying kernel that will > (asynchronously) initiate the reads, and postgres will simply notifies it. Updated with this new text: The parameter can be used to reduce I/O wait times during recovery by instructing the kernel to initiate reads of disk blocks that will soon be needed but are not currently in PostgreSQL's buffer pool and will soon be read. > + > + The pg_stat_prefetch_recovery view will contain > only > + one row. It is filled with nulls if recovery is not running or WAL > + prefetching is not enabled. See > + for more information. > + > > That's not the implemented behavior as far as I can see. It just prints > whatever is in SharedStats > regardless of the recovery state or the prefetch_wal setting (assuming that > there's no pending reset request). Yeah. Updated text: "It is filled with nulls if recovery has not run or ...". > Similarly, there's a mention that > pg_stat_reset_shared('wal') will reset the stats, but I don't see anything > calling XLogPrefetchRequestResetStats(). It's 'prefetch_recovery', not 'wal', but yeah, oops, it looks like I got carried away between v18 and v19 while simplifying the stats and lost a hunk I should have kept. Fixed. 
> Finally, I think we should documented what are the cumulated counters in that > view (that should get reset) and the dynamic counters (that shouldn't get > reset). OK, done. > For the code: > > bool > XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, >RelFileNode *rnode, ForkNumber *forknum, BlockNumber > *blknum) > +{ > + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, > NULL); > +} > + > +bool > +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, > + RelFileNode *rnode, ForkNumber *forknum, > + BlockNumber *blknum, > + Buffer *prefetch_buffer) > { > > It's missing comments on that function. XLogRecGetBlockTag comments should > probably be reworded at the same time. New comment added for XLogRecGetBlockInfo(). Wish I could come up with a better name for that... Not quite sure what you thought I should change about XLogRecGetBlockTag(). > +ReadRecord(XLogPrefetcher *xlogprefetcher, int emode, >bool fetching_ckpt, TimeLineID replayTLI) > { > XLogRecord *record; > + XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher); > > nit: maybe name it XLogPrefetcherGetReader()? OK. > * containing it (if not open already), and returns true. When end of standby > * mode is triggered by the user, and there is no more WAL available, returns > * false. > + * > + * If nonblocking is
Re: WIP: WAL prefetch (another approach)
Hi, On Sun, Mar 20, 2022 at 05:36:38PM +1300, Thomas Munro wrote: > On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro wrote: > > I'll push 0001 today to let the build farm chew on it for a few days > > before moving to 0002. > > Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to > fail occasionally, but that predates the above commit. I didn't > follow the existing discussion on that, so I'll try to look into that > tomorrow. > > Here's a rebase of the 0002 patch, now called 0001 So I finally finished looking at this patch. Here again, AFAICS the feature is working as expected and I didn't find any problem. I just have some minor comments, like for the previous patch. For the docs: +Whether to try to prefetch blocks that are referenced in the WAL that +are not yet in the buffer pool, during recovery. Valid values are +off (the default), on and +try. The setting try enables +prefetching only if the operating system provides the +posix_fadvise function, which is currently used +to implement prefetching. Note that some operating systems provide the +function, but don't actually perform any prefetching. Is there any reason not to change it to try? I'm wondering if some system says that the function exists but simply raise an error if you actually try to use it. I think that at least WSL does that for some functions. + + The parameter can + be used to improve I/O performance during recovery by instructing + PostgreSQL to initiate reads + of disk blocks that will soon be needed but are not currently in + PostgreSQL's buffer pool. + The and +settings limit prefetching + concurrency and distance, respectively. + By default, prefetching in recovery is disabled. + I think that "improving I/O performance" is a bit misleading, maybe reduce I/O wait time or something like that? Also, I don't know if we need to be that precise, but maybe we should say that it's the underlying kernel that will (asynchronously) initiate the reads, and postgres will simply notifies it. + + The pg_stat_prefetch_recovery view will contain only + one row. It is filled with nulls if recovery is not running or WAL + prefetching is not enabled. See + for more information. + That's not the implemented behavior as far as I can see. It just prints whatever is in SharedStats regardless of the recovery state or the prefetch_wal setting (assuming that there's no pending reset request). Similarly, there's a mention that pg_stat_reset_shared('wal') will reset the stats, but I don't see anything calling XLogPrefetchRequestResetStats(). Finally, I think we should documented what are the cumulated counters in that view (that should get reset) and the dynamic counters (that shouldn't get reset). For the code: bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum) +{ + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL); +} + +bool +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, + RelFileNode *rnode, ForkNumber *forknum, + BlockNumber *blknum, + Buffer *prefetch_buffer) { It's missing comments on that function. XLogRecGetBlockTag comments should probably be reworded at the same time. +ReadRecord(XLogPrefetcher *xlogprefetcher, int emode, bool fetching_ckpt, TimeLineID replayTLI) { XLogRecord *record; + XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher); nit: maybe name it XLogPrefetcherGetReader()? * containing it (if not open already), and returns true. 
When end of standby * mode is triggered by the user, and there is no more WAL available, returns * false. + * + * If nonblocking is true, then give up immediately if we can't satisfy the + * request, returning XLREAD_WOULDBLOCK instead of waiting. */ -static bool +static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, The comment still mentions a couple of time returning true/false rather than XLREAD_*, same for at least XLogPageRead(). @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, */ if (lastSourceFailed) { + /* +* Don't allow any retry loops to occur during nonblocking +* readahead. Let the caller process everything that has been +* decoded already first. +*/ + if (nonblocking) + return XLREAD_WOULDBLOCK; Is that really enough? I'm wondering if the code path in ReadRecord() that forces lastSourceFailed to False while it actually failed when switching into archive recovery (xlogrecovery.c around line 3044) can be problematic here. {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY, gettext_noo
Re: WIP: WAL prefetch (another approach)
On Sun, Mar 20, 2022 at 5:36 PM Thomas Munro wrote: > Clearly 018_wal_optimize.pl is flapping Correction, 019_replslot_limit.pl, discussed at https://www.postgresql.org/message-id/flat/83b46e5f-2a52-86aa-fa6c-8174908174b8%40iki.fi .
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro wrote: > I'll push 0001 today to let the build farm chew on it for a few days > before moving to 0002. Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to fail occasionally, but that predates the above commit. I didn't follow the existing discussion on that, so I'll try to look into that tomorrow. Here's a rebase of the 0002 patch, now called 0001 From 3ac04122e635b98c50d6e48677fe74535d631388 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 20 Mar 2022 16:56:12 +1300 Subject: [PATCH v24] Prefetch referenced data in recovery, take II. Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Since not all OSes have that system call, "try" is provided so that it can be enabled on operating systems where it is available, and that is used in 027_stream_regress.pl so that we effectively exercise on and off behaviors in the build farm. Better mechanisms will follow in later work on the I/O subsystem. The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed. The GUC wal_decode_buffer_size limits the maximum distance we are prepared to read ahead in the WAL to find uncached blocks. Reviewed-by: Tomas Vondra Reviewed-by: Alvaro Herrera (earlier version) Reviewed-by: Andres Freund (earlier version) Reviewed-by: Justin Pryzby (earlier version) Tested-by: Tomas Vondra (earlier version) Tested-by: Jakub Wartak (earlier version) Tested-by: Dmitry Dolgov <9erthali...@gmail.com> (earlier version) Tested-by: Sait Talha Nisanci (earlier version) Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com --- doc/src/sgml/config.sgml | 64 ++ doc/src/sgml/monitoring.sgml | 77 +- doc/src/sgml/wal.sgml | 12 + src/backend/access/transam/Makefile | 1 + src/backend/access/transam/xlog.c | 2 + src/backend/access/transam/xlogprefetcher.c | 968 ++ src/backend/access/transam/xlogreader.c | 13 + src/backend/access/transam/xlogrecovery.c | 160 ++- src/backend/access/transam/xlogutils.c| 27 +- src/backend/catalog/system_views.sql | 13 + src/backend/storage/freespace/freespace.c | 3 +- src/backend/storage/ipc/ipci.c| 3 + src/backend/utils/misc/guc.c | 53 +- src/backend/utils/misc/postgresql.conf.sample | 5 + src/include/access/xlog.h | 1 + src/include/access/xlogprefetcher.h | 51 + src/include/access/xlogreader.h | 8 + src/include/access/xlogutils.h| 3 +- src/include/catalog/pg_proc.dat | 8 + src/include/utils/guc.h | 4 + src/test/recovery/t/027_stream_regress.pl | 3 + src/test/regress/expected/rules.out | 10 + src/tools/pgindent/typedefs.list | 7 + 23 files changed, 1434 insertions(+), 62 deletions(-) create mode 100644 src/backend/access/transam/xlogprefetcher.c create mode 100644 src/include/access/xlogprefetcher.h diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7a48973b3c..ce84f379a8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3644,6 +3644,70 @@ include_dir 'conf.d' + + +Recovery + + + configuration + of recovery + general settings + + + + This section describes the settings that apply to recovery in general, + affecting crash recovery, streaming replication and archive-based + replication. 
+ + + + + + recovery_prefetch (enum) + + recovery_prefetch configuration parameter + + + + +Whether to try to prefetch blocks that are referenced in the WAL that +are not yet in the buffer pool, during recovery. Valid values are +off (the default), on and +try. The setting try enables +prefetching only if the operating system provides the +posix_fadvise function, which is currently used +to implement prefetching. Note that some operating systems provide the +function, but don't actually perform any prefetching. + + +Prefetching blocks that will soon be needed can reduce I/O wait times +during recovery with some workloads. +See also the and + settings, which limit +prefetching activity. + + + + + + wal_decode_buffer_size (integer) + + wal_decode_buffer_size configuration parameter + + +
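To make the "pessimistic heuristics" mentioned in the commit message a bit
more concrete, here is a toy, self-contained sketch of the limiting idea
(invented names and simplified bookkeeping -- this is not xlogprefetcher.c):
a hint counts as "in flight" from the moment it is issued until replay has
passed the record that triggered it, and no new hint is issued once the
concurrency cap is reached or the look-ahead distance is exceeded.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

#define MAX_IN_FLIGHT 10              /* cf. maintenance_io_concurrency */
#define MAX_DISTANCE (512 * 1024)     /* cf. wal_decode_buffer_size */
#define TOY_BLCKSZ 8192

typedef struct PrefetchSlot
{
    uint64_t    lsn;                  /* WAL position of the record that wanted the block */
    bool        in_use;
} PrefetchSlot;

static PrefetchSlot slots[MAX_IN_FLIGHT];

/* Pessimistically treat a hint as completed once replay has passed its record. */
static void
reclaim_slots(uint64_t replayed_up_to)
{
    for (int i = 0; i < MAX_IN_FLIGHT; i++)
    {
        if (slots[i].in_use && slots[i].lsn <= replayed_up_to)
            slots[i].in_use = false;
    }
}

/* Issue a hint for one referenced block, if both limits allow it. */
static bool
maybe_prefetch(uint64_t record_lsn, uint64_t replayed_up_to,
               int fd, off_t block_offset)
{
    if (record_lsn - replayed_up_to > MAX_DISTANCE)
        return false;                 /* don't look further ahead than this */

    reclaim_slots(replayed_up_to);

    for (int i = 0; i < MAX_IN_FLIGHT; i++)
    {
        if (!slots[i].in_use)
        {
            slots[i].in_use = true;
            slots[i].lsn = record_lsn;
            (void) posix_fadvise(fd, block_offset, TOY_BLCKSZ, POSIX_FADV_WILLNEED);
            return true;
        }
    }

    return false;                     /* too many hints already in flight */
}

The real code is considerably more involved (it also has to decide which
referenced blocks are worth hinting at all), but the two limits above are
where maintenance_io_concurrency and wal_decode_buffer_size come in.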
Re: WIP: WAL prefetch (another approach)
On Mon, Mar 14, 2022 at 8:17 PM Julien Rouhaud wrote: > Great! I'm happy with 0001 and I think it's good to go! I'll push 0001 today to let the build farm chew on it for a few days before moving to 0002.
Re: WIP: WAL prefetch (another approach)
On Mon, Mar 14, 2022 at 06:15:59PM +1300, Thomas Munro wrote: > On Fri, Mar 11, 2022 at 9:27 PM Julien Rouhaud wrote: > > > > Also, is it worth an assert (likely at the top of the function) for > > > > that? > > > > > > How could I assert that EndRecPtr has the right value? > > > > Sorry, I meant to assert that some value was assigned > > (!XLogRecPtrIsInvalid). > > It can only make sure that the first call is done after XLogBeginRead / > > XLogFindNextRecord, but that's better than nothing and consistent with the > > top > > comment. > > Done. Just a small detail: I would move that assert at the top of the function as it should always be valid. > > I also fixed the compile failure with -DWAL_DEBUG, and checked that > output looks sane with wal_debug=on. Great! I'm happy with 0001 and I think it's good to go! > > > > The other thing I need to change is that I should turn on > > > recovery_prefetch for platforms that support it (ie Linux and maybe > > > NetBSD only for now), in the tests. Right now you need to put > > > recovery_prefetch=on in a file and then run the tests with > > > "TEMP_CONFIG=path_to_that make -C src/test/recovery check" to > > > excercise much of 0002. > > > > +1 with Andres' idea to have a "try" setting. > > Done. The default is still "off" for now, but in > 027_stream_regress.pl I set it to "try". Great too! Unless you want to commit both patches right now I'd like to review 0002 too (this week), as I barely look into it for now.
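(Spelled out, the assertion being discussed is of this shape -- an
illustrative form only, not the exact committed hunk:)

/* Some position must have been established by XLogBeginRead() or
 * XLogFindNextRecord() before the first call. */
Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));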
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 11, 2022 at 06:31:13PM +1300, Thomas Munro wrote: > On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud wrote: > > > > This could use XLogRecGetBlock? Note that this macro is for now never used. > > xlogreader.c also has some similar forgotten code that could use > > XLogRecMaxBlockId. > > That is true, but I was thinking of it like this: most of the existing > code that interacts with xlogreader.c is working with the old model, > where the XLogReader object holds only one "current" record. For that > reason the XLogRecXXX() macros continue to work as before, implicitly > referring to the record that XLogReadRecord() most recently returned. > For xlogreader.c code, I prefer not to use the XLogRecXXX() macros, > even when referring to the "current" record, since xlogreader.c has > switched to a new multi-record model. In other words, they're sort of > 'old API' accessors provided for continuity. Does this make sense? Ah I see, it does make sense. I'm wondering if there should be some comment somewhere on the top of the file to mention it, as otherwise someone may be tempted to change it to avoid some record->record->xxx usage. > > +DecodedXLogRecord * > > +XLogNextRecord(XLogReaderState *state, char **errormsg) > > +{ > > [...] > > + /* > > +* state->EndRecPtr is expected to have been set by the last call to > > +* XLogBeginRead() or XLogNextRecord(), and is the location of the > > +* error. > > +*/ > > + > > + return NULL; > > > > The comment should refer to XLogFindNextRecord, not XLogNextRecord? > > No, it does mean to refer to the XLogNextRecord() (ie the last time > you called XLogNextRecord and successfully dequeued a record, we put > its end LSN there, so if there is a deferred error, that's the > corresponding LSN). Make sense? It does, thanks! > > > Also, is it worth an assert (likely at the top of the function) for that? > > How could I assert that EndRecPtr has the right value? Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid). It can only make sure that the first call is done after XLogBeginRead / XLogFindNextRecord, but that's better than nothing and consistent with the top comment. > > + if (unlikely(state->decode_buffer == NULL)) > > + { > > + if (state->decode_buffer_size == 0) > > + state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE; > > + state->decode_buffer = palloc(state->decode_buffer_size); > > + state->decode_buffer_head = state->decode_buffer; > > + state->decode_buffer_tail = state->decode_buffer; > > + state->free_decode_buffer = true; > > + } > > > > Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it > > here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as > > the only caller is the recovery prefetching. > > I don't think it matters much? The thing is that for now the only caller to XLogReaderSetDecodeBuffer (in 0002) only uses it to set the length, so a buffer is actually never passed to that function. Since frontend code can rely on a palloc emulation, is there really a use case to use e.g. some stack buffer there, or something in a specific memory context? It seems to be the only use cases for having XLogReaderSetDecodeBuffer() rather than simply a XLogReaderSetDecodeBufferSize(). But overall I agree it doesn't matter much, so no objection to keep it as-is. > > It's also not particulary obvious why XLogFindNextRecord() doesn't check for > > this value. AFAICS callers don't (and should never) call it with a > > nonblocking == true state, maybe add an assert for that? 
> > Fair point. I have now explicitly cleared that flag. (I don't much > like state->nonblocking, which might be better as an argument to > page_read(), but in fact I don't like the fact that page_read > callbacks are blocking in the first place, which is why I liked > Horiguchi-san's patch to get rid of that... but that can be a subject > for later work.) Agreed. > > static void > > ResetDecoder(XLogReaderState *state) > > { > > [...] > > + /* Reset the decoded record queue, freeing any oversized records. */ > > + while ((r = state->decode_queue_tail)) > > > > nit: I think it's better to explicitly check for the assignment being != > > NULL, > > and existing code is more frequently written this way AFAICS. > > I think it's perfectly normal idiomatic C, but if you think it's > clearer that way, OK, done like that. The thing I don't like about this form is that you can never be sure that an assignment was really meant unless you read the rest of the nearby code. Other than that agreed, if perfectly normal idiomatic C. > I realised that this version has broken -DWAL_DEBUG. I'll fix that > shortly, but I wanted to post this update ASAP, so here's a new > version. + * Returns XLREAD_WOULDBLOCK if he requested data can't be read without + * waiting. This can be returned only if the installed page_read cal
Re: WIP: WAL prefetch (another approach)
On March 10, 2022 9:31:13 PM PST, Thomas Munro wrote: > The other thing I need to change is that I should turn on >recovery_prefetch for platforms that support it (ie Linux and maybe >NetBSD only for now), in the tests. Could a setting of "try" make sense? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
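(In other words, something along these lines -- a sketch of the idea only,
not the committed GUC hooks:)

/* Sketch of a three-valued recovery_prefetch with a "try" fallback. */
typedef enum RecoveryPrefetchSetting
{
    RECOVERY_PREFETCH_OFF,
    RECOVERY_PREFETCH_ON,
    RECOVERY_PREFETCH_TRY
} RecoveryPrefetchSetting;

static bool
prefetch_effectively_enabled(RecoveryPrefetchSetting setting)
{
#ifdef USE_PREFETCH
    /* posix_fadvise() is usable here: both "on" and "try" prefetch */
    return setting != RECOVERY_PREFETCH_OFF;
#else
    /* no way to issue hints on this platform; "try" quietly degrades to off */
    (void) setting;
    return false;
#endif
}

That would let the tests set "try" unconditionally and exercise whichever
behaviour the platform supports.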
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 11, 2022 at 6:31 PM Thomas Munro wrote: > Thanks for your review of 0001! It gave me a few things to think > about and some good improvements. And just in case it's useful, here's what changed between v21 and v22.. diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 86a7b4c5c8..0d0c556b7c 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -90,8 +90,8 @@ XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size) state->decode_buffer = buffer; state->decode_buffer_size = size; - state->decode_buffer_head = buffer; state->decode_buffer_tail = buffer; + state->decode_buffer_head = buffer; } /* @@ -271,7 +271,7 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr) /* * See if we can release the last record that was returned by - * XLogNextRecord(), to free up space. + * XLogNextRecord(), if any, to free up space. */ void XLogReleasePreviousRecord(XLogReaderState *state) @@ -283,16 +283,16 @@ XLogReleasePreviousRecord(XLogReaderState *state) /* * Remove it from the decoded record queue. It must be the oldest item -* decoded, decode_queue_tail. +* decoded, decode_queue_head. */ record = state->record; - Assert(record == state->decode_queue_tail); + Assert(record == state->decode_queue_head); state->record = NULL; - state->decode_queue_tail = record->next; + state->decode_queue_head = record->next; - /* It might also be the newest item decoded, decode_queue_head. */ - if (state->decode_queue_head == record) - state->decode_queue_head = NULL; + /* It might also be the newest item decoded, decode_queue_tail. */ + if (state->decode_queue_tail == record) + state->decode_queue_tail = NULL; /* Release the space. */ if (unlikely(record->oversized)) @@ -302,11 +302,11 @@ XLogReleasePreviousRecord(XLogReaderState *state) } else { - /* It must be the tail record in the decode buffer. */ - Assert(state->decode_buffer_tail == (char *) record); + /* It must be the head (oldest) record in the decode buffer. */ + Assert(state->decode_buffer_head == (char *) record); /* -* We need to update tail to point to the next record that is in the +* We need to update head to point to the next record that is in the * decode buffer, if any, being careful to skip oversized ones * (they're not in the decode buffer). */ @@ -316,8 +316,8 @@ XLogReleasePreviousRecord(XLogReaderState *state) if (record) { - /* Adjust tail to release space up to the next record. */ - state->decode_buffer_tail = (char *) record; + /* Adjust head to release space up to the next record. */ + state->decode_buffer_head = (char *) record; } else { @@ -327,8 +327,8 @@ XLogReleasePreviousRecord(XLogReaderState *state) * we'll keep overwriting the same piece of memory if we're not * doing any prefetching. */ - state->decode_buffer_tail = state->decode_buffer; state->decode_buffer_head = state->decode_buffer; + state->decode_buffer_tail = state->decode_buffer; } } } @@ -351,7 +351,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg) /* Release the last record returned by XLogNextRecord(). */ XLogReleasePreviousRecord(state); - if (state->decode_queue_tail == NULL) + if (state->decode_queue_head == NULL) { *errormsg = NULL; if (state->errormsg_deferred) @@ -376,7 +376,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg) * XLogRecXXX(xlogreader) macros, which work with the decoder rather than * the record for historical reasons. 
*/ - state->record = state->decode_queue_tail; + state->record = state->decode_queue_head; /* * Update the pointers to the beginning and one-past-the-end of this @@ -428,12 +428,12 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) if (!XLogReaderHasQueuedRecordOrError(state)) XLogReadAhead(state, false /* nonblocking */ ); - /* Consume the tail record or error. */ + /* Consume the head record or error. */ decoded = XLogNextRecord(state, errormsg); if (decoded) { /* -* XLogReadRecord() returns a pointer to the record's header, not the +
Re: WIP: WAL prefetch (another approach)
Hi, On Tue, Mar 08, 2022 at 06:15:43PM +1300, Thomas Munro wrote: > On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro wrote: > > https://github.com/macdice/postgres/tree/recovery-prefetch-ii > > Here's a rebase. This mostly involved moving hunks over to the new > xlogrecovery.c file. One thing that seemed a little strange to me > with the new layout is that xlogreader is now a global variable. I > followed that pattern and made xlogprefetcher a global variable too, > for now. I for now went through 0001, TL;DR the patch looks good to me. I have a few minor comments though, mostly to make things a bit clearer (at least to me). diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 2340dc247b..c129df44ac 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len) * add an accessor macro for this. */ *fpi_len = 0; + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++) { if (XLogRecHasBlockImage(record, block_id)) - *fpi_len += record->blocks[block_id].bimg_len; + *fpi_len += record->record->blocks[block_id].bimg_len; } (and similar in that file, xlogutils.c and xlogreader.c) This could use XLogRecGetBlock? Note that this macro is for now never used. xlogreader.c also has some similar forgotten code that could use XLogRecMaxBlockId. + * See if we can release the last record that was returned by + * XLogNextRecord(), to free up space. + */ +void +XLogReleasePreviousRecord(XLogReaderState *state) The comment seems a bit misleading, as I first understood it as it could be optional even if the record exists. Maybe something more like "Release the last record if any"? +* Remove it from the decoded record queue. It must be the oldest item +* decoded, decode_queue_tail. +*/ + record = state->record; + Assert(record == state->decode_queue_tail); + state->record = NULL; + state->decode_queue_tail = record->next; The naming is a bit counter intuitive to me, as before reading the rest of the code I wasn't expecting the item at the tail of the queue to have a next element. Maybe just inverting tail and head would make it clearer? +DecodedXLogRecord * +XLogNextRecord(XLogReaderState *state, char **errormsg) +{ [...] + /* +* state->EndRecPtr is expected to have been set by the last call to +* XLogBeginRead() or XLogNextRecord(), and is the location of the +* error. +*/ + + return NULL; The comment should refer to XLogFindNextRecord, not XLogNextRecord? Also, is it worth an assert (likely at the top of the function) for that? XLogRecord * XLogReadRecord(XLogReaderState *state, char **errormsg) +{ [...] + if (decoded) + { + /* +* XLogReadRecord() returns a pointer to the record's header, not the +* actual decoded record. The caller will access the decoded record +* through the XLogRecGetXXX() macros, which reach the decoded +* recorded as xlogreader->record. +*/ + Assert(state->record == decoded); + return &decoded->header; I find it a bit weird to mention XLogReadRecord() as it's the current function. +/* + * Allocate space for a decoded record. The only member of the returned + * object that is initialized is the 'oversized' flag, indicating that the + * decoded record wouldn't fit in the decode buffer and must eventually be + * freed explicitly. + * + * Return NULL if there is no space in the decode buffer and allow_oversized + * is false, or if memory allocation fails for an oversized buffer. 
+ */ +static DecodedXLogRecord * +XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized) Is it worth clearly stating that it's the responsibility of the caller to update the decode_buffer_head (with the real size) after a successful decoding of this buffer? + if (unlikely(state->decode_buffer == NULL)) + { + if (state->decode_buffer_size == 0) + state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE; + state->decode_buffer = palloc(state->decode_buffer_size); + state->decode_buffer_head = state->decode_buffer; + state->decode_buffer_tail = state->decode_buffer; + state->free_decode_buffer = true; + } Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as the only caller is the recovery prefetching. + return decoded; +} I would find it a bit clearer to explicitly return NULL here. readOff = ReadPageInternal(state, targetPagePtr, Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ)); - if (readOff < 0) + if (readOff == XLREAD_WOULDBLOCK) + return XLREAD_WOULDBLOCK; + else if (readOff < 0) ReadPageInternal comment should be updated to mention
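To make the head/tail naming question above concrete, here is a tiny stand-alone FIFO sketch in the convention I would find intuitive (purely illustrative, invented names, not the actual DecodedXLogRecord queue): items are appended at the tail and consumed from the head, so the head is always the oldest queued item and "next" always points towards the tail.

    /*
     * Purely illustrative FIFO, not the real xlogreader structures: append at
     * the tail, consume from the head, so the head is always the oldest item.
     */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Item
    {
        int         value;
        struct Item *next;      /* next (newer) item, towards the tail */
    } Item;

    typedef struct Fifo
    {
        Item       *head;       /* oldest item, consumed first */
        Item       *tail;       /* newest item, appended last */
    } Fifo;

    static void
    fifo_append(Fifo *q, int value)
    {
        Item       *item = malloc(sizeof(Item));

        item->value = value;
        item->next = NULL;
        if (q->tail)
            q->tail->next = item;
        else
            q->head = item;
        q->tail = item;
    }

    static int
    fifo_consume(Fifo *q)
    {
        /* assumes the queue is not empty */
        Item       *item = q->head;
        int         value = item->value;

        q->head = item->next;
        if (q->head == NULL)
            q->tail = NULL;
        free(item);
        return value;
    }

    int
    main(void)
    {
        Fifo        q = {NULL, NULL};

        fifo_append(&q, 1);
        fifo_append(&q, 2);
        printf("%d\n", fifo_consume(&q));   /* prints 1, the head (oldest) item */
        return 0;
    }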
Re: WIP: WAL prefetch (another approach)
Hi, On 2022-03-08 18:15:43 +1300, Thomas Munro wrote: > I'm now starting to think about committing this soon. +1 Are you thinking of committing both patches at once, or with a bit of distance? I think something in the regression tests ought to enable recovery_prefetch. 027_stream_regress or 001_stream_rep seem like the obvious candidates? - Andres
Re: WIP: WAL prefetch (another approach)
On 3/8/22 06:15, Thomas Munro wrote: > On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro wrote: >> https://github.com/macdice/postgres/tree/recovery-prefetch-ii > > Here's a rebase. This mostly involved moving hunks over to the new > xlogrecovery.c file. One thing that seemed a little strange to me > with the new layout is that xlogreader is now a global variable. I > followed that pattern and made xlogprefetcher a global variable too, > for now. > > There is one functional change: now I block readahead at records that > might change the timeline ID. This removes the need to think about > scenarios where "replay TLI" and "read TLI" might differ. I don't > know of a concrete problem in that area with the previous version, but > the recent introduction of the variable(s) "replayTLI" and associated > comments in master made me realise I hadn't analysed the hazards here > enough. Since timelines are tricky things and timeline changes are > extremely infrequent, it seemed better to simplify matters by putting > up a big road block there. > > I'm now starting to think about committing this soon. +1. I don't have the capacity/hardware to do more testing at the moment, but all of this looks reasonable. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-12-29 17:29:52 +1300, Thomas Munro wrote: > > FWIW I don't think we include updates to typedefs.list in patches. > > Seems pretty harmless? And useful to keep around in development > branches because I like to pgindent stuff... I think it's even helpful. As long as it's done with a bit of manual oversight, I don't see a meaningful downside of doing so. One needs to be careful not to remove platform-dependent typedefs, but that's it. And especially for long-lived feature branches it's much less work to keep the typedefs.list changes in the tree, rather than coming up with them locally over and over / across multiple people working on a branch. Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: >> FWIW I don't think we include updates to typedefs.list in patches. > Seems pretty harmless? And useful to keep around in development > branches because I like to pgindent stuff... As far as that goes, my habit is to pull down https://buildfarm.postgresql.org/cgi-bin/typedefs.pl on a regular basis and pgindent against that. There have been some discussions about formalizing that process a bit more, but we've not come to any conclusions. regards, tom lane
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > But the bigger question is. Are we really concerned about this flaky > problem? Is it worth investing time and money on? I can get money to > go buy a G4 or G5 and spend some time on it. It just seems a bit... > niche. But if it's a real bug that represents something broken on > other architectures that just happens to be easier to trigger here it > might be worthwhile. TBH, I don't know. There seem to be three plausible explanations: 1. Flaky hardware in my unit. 2. Ancient macOS bug, as Andres suggested upthread. 3. Actual PG bug. If it's #1 or #2 then we're just wasting our time here. I'm not sure how to estimate the relative probabilities, but I suspect #3 is the least likely of the lot. FWIW, I did just reproduce the problem on that machine with current HEAD: 2021-12-17 18:40:40.293 EST [21369] FATAL: inconsistent page found, rel 1663/167772/2673, forknum 0, blkno 26 2021-12-17 18:40:40.293 EST [21369] CONTEXT: WAL redo at C/3DE3F658 for Btree/INSERT_LEAF: off 208; blkref #0: rel 1663/167772/2673, blk 26 FPW 2021-12-17 18:40:40.522 EST [21365] LOG: startup process (PID 21369) exited with exit code 1 That was after only five loops of the regression tests, so either I got lucky or the failure probability has increased again. In any case, it seems clear that the problem exists independently of Munro's patches, so I don't really think this question should be considered a blocker for those. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Fri, 17 Dec 2021 at 18:40, Tom Lane wrote: > > Greg Stark writes: > > Hm. I seem to have picked a bad checkout. I took the last one before > > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). > > FWIW, I think that's the first one *after* the revert. Doh But the bigger question is. Are we really concerned about this flaky problem? Is it worth investing time and money on? I can get money to go buy a G4 or G5 and spend some time on it. It just seems a bit... niche. But if it's a real bug that represents something broken on other architectures that just happens to be easier to trigger here it might be worthwhile. -- greg
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > Hm. I seem to have picked a bad checkout. I took the last one before > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). FWIW, I think that's the first one *after* the revert. > 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel > worker" (PID 54073) was terminated by signal 10: Bus error I'm betting on a weird emulation issue. None of my real PPC machines showed such things. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On 12/17/21 23:56, Greg Stark wrote: Hm. I seem to have picked a bad checkout. I took the last one before the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some incompatibility with the emulation and the IPC stuff parallel workers use. 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error 2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u; 2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes 2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode 2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing Interesting. In my experience SIGBUS on PPC tends to be due to incorrect alignment, but I'm not sure how that works with the emulation. Can you get a backtrace? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Hm. I seem to have picked a bad checkout. I took the last one before the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some incompatibility with the emulation and the IPC stuff parallel workers use. 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error 2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u; 2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes 2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode 2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1 > or maybe 4.0.1. What version is on your G4? $ gcc -v Using built-in specs. Target: powerpc-apple-darwin9 Configured with: /var/tmp/gcc/gcc-5493~1/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --program-prefix= --host=powerpc-apple-darwin9 --target=powerpc-apple-darwin9 Thread model: posix gcc version 4.0.1 (Apple Inc. build 5493) I see that gcc 4.2.1 is also present on this machine, but I've never used it. regards, tom lane
Re: WIP: WAL prefetch (another approach)
I have IBUILD:postgresql gsstark$ ls /usr/bin/*gcc* /usr/bin/gcc /usr/bin/gcc-4.0 /usr/bin/gcc-4.2 /usr/bin/i686-apple-darwin9-gcc-4.0.1 /usr/bin/i686-apple-darwin9-gcc-4.2.1 /usr/bin/powerpc-apple-darwin9-gcc-4.0.1 /usr/bin/powerpc-apple-darwin9-gcc-4.2.1 I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1 or maybe 4.0.1. What version is on your G4?
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > What tools and tool versions are you using to build? Is it just GCC for PPC? > There aren't any special build processes to make a fat binary involved? Nope, just "configure; make" using that macOS version's regular gcc. regards, tom lane
Re: WIP: WAL prefetch (another approach)
What tools and tool versions are you using to build? Is it just GCC for PPC? There aren't any special build processes to make a fat binary involved? On Thu, 16 Dec 2021 at 23:11, Tom Lane wrote: > > Greg Stark writes: > > But if you're interested and can explain the tests to run I can try to > > get the tests running on this machine: > > I'm not sure that machine is close enough to prove much, but by all > means give it a go if you wish. My test setup was explained in [1]: > > >> To recap, the test lashup is: > >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive) > >> * Standard debug build (--enable-debug --enable-cassert) > >> * Out-of-the-box configuration, except add wal_consistency_checking = all > >> and configure a wal-streaming standby on the same machine > >> * Repeatedly run "make installcheck-parallel", but skip the tablespace > >> test to avoid issues with the standby trying to use the same directory > >> * Delay long enough after each installcheck-parallel to let the > >> standby catch up (the run proper is ~24 min, plus 2 min for catchup) > > Remember also that the code in question is not in HEAD; you'd > need to apply Munro's patches, or check out some commit from > around 2021-04-22. > > regards, tom lane > > [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us -- greg
Re: WIP: WAL prefetch (another approach)
Greg Stark writes: > But if you're interested and can explain the tests to run I can try to > get the tests running on this machine: I'm not sure that machine is close enough to prove much, but by all means give it a go if you wish. My test setup was explained in [1]: >> To recap, the test lashup is: >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive) >> * Standard debug build (--enable-debug --enable-cassert) >> * Out-of-the-box configuration, except add wal_consistency_checking = all >> and configure a wal-streaming standby on the same machine >> * Repeatedly run "make installcheck-parallel", but skip the tablespace >> test to avoid issues with the standby trying to use the same directory >> * Delay long enough after each installcheck-parallel to let the >> standby catch up (the run proper is ~24 min, plus 2 min for catchup) Remember also that the code in question is not in HEAD; you'd need to apply Munro's patches, or check out some commit from around 2021-04-22. regards, tom lane [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us
Re: WIP: WAL prefetch (another approach)
The actual hardware of this machine is a Mac Mini Core 2 Duo. I'm not really clear how the emulation is done and whether it makes a reasonable test environment or not. Hardware Overview: Model Name: Mac mini Model Identifier: Macmini2,1 Processor Name: Intel Core 2 Duo Processor Speed: 2 GHz Number Of Processors: 1 Total Number Of Cores: 2 L2 Cache: 4 MB Memory: 2 GB Bus Speed: 667 MHz Boot ROM Version: MM21.009A.B00
Re: WIP: WAL prefetch (another approach)
On Fri, 26 Nov 2021 at 21:47, Tom Lane wrote: > > Yeah ... on the one hand, that machine has shown signs of > hard-to-reproduce flakiness, so it's easy to write off the failures > I saw as hardware issues. On the other hand, the flakiness I've > seen has otherwise manifested as kernel crashes, which is nothing > like the consistent test failures I was seeing with the patch. Hm. I asked around and found a machine I can use that can run PPC binaries, but it's actually, well, confusing. I think this is an x86 machine running Leopard which uses JIT to transparently run PPC binaries. I'm not sure this is really a good test. But if you're interested and can explain the tests to run I can try to get the tests running on this machine: IBUILD:~ gsstark$ uname -a Darwin IBUILD.MIT.EDU 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386 IBUILD:~ gsstark$ sw_vers ProductName: Mac OS X ProductVersion: 10.5.8 BuildVersion: 9L31a
Re: WIP: WAL prefetch (another approach)
On Fri, Nov 26, 2021 at 9:47 PM Tom Lane wrote: > Yeah ... on the one hand, that machine has shown signs of > hard-to-reproduce flakiness, so it's easy to write off the failures > I saw as hardware issues. On the other hand, the flakiness I've > seen has otherwise manifested as kernel crashes, which is nothing > like the consistent test failures I was seeing with the patch. > > Andres speculated that maybe we were seeing a kernel bug that > affects consistency of concurrent reads and writes. That could > be an explanation; but it's just evidence-free speculation so far, > so I don't feel real convinced by that idea either. > > Anyway, I hope to find time to see if the issue still reproduces > with Thomas' new patch set. Honestly, all the reasons that Thomas articulated for the revert seem relatively unimpressive from my point of view. Perhaps they are sufficient justification for a revert so near to the end of the development cycle, but that's just an argument for committing things a little sooner so we have time to work out the kinks. This kind of work is too valuable to get hung up for a year or three because of a couple of minor preexisting bugs and/or preexisting maybe-bugs. -- Robert Haas EDB: http://www.enterprisedb.com
Re: WIP: WAL prefetch (another approach)
Hi Thomas, I am unable to apply these new set of patches on HEAD. Can you please share the rebased patch or if you have any work branch can you please point it out, I will refer to it for the changes. -- With Regards, Ashutosh sharma. On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro wrote: > On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson > wrote: > > Could you post an updated version of the patch which is for review? > > Sorry for taking so long to come back; I learned some new things that > made me want to restructure this code a bit (see below). Here is an > updated pair of patches that I'm currently testing. > > Old problems: > > 1. Last time around, an infinite loop was reported in pg_waldump. I > believe Horiguchi-san has fixed that[1], but I'm no longer depending > on that patch. I thought his patch set was a good idea, but it's > complicated and there's enough going on here already... let's consider > that independently. > > This version goes back to what I had earlier, though (I hope) it is > better about how "nonblocking" states are communicated. In this > version, XLogPageRead() has a way to give up part way through a record > if it doesn't have enough data and there are queued up records that > could be replayed right now. In that case, we'll go back to the > beginning of the record (and occasionally, back a WAL page) next time > we try. That's the cost of not maintaining intra-record decoding > state. > > 2. Last time around, we could try to allocate a crazy amount of > memory when reading garbage past the end of the WAL. Fixed, by > validating first, like in master. > > New work: > > Since last time, I went away and worked on a "real" AIO version of > this feature. That's ongoing experimental work for a future proposal, > but I have a working prototype and I aim to share that soon, when that > branch is rebased to catch up with recent changes. In that version, > the prefetcher starts actual reads into the buffer pool, and recovery > receives already pinned buffers attached to the stream of records it's > replaying. > > That inspired a couple of refactoring changes to this non-AIO version, > to minimise the difference and anticipate the future work better: > > 1. The logic for deciding which block to start prefetching next is > moved into a new callback function in a sort of standard form (this is > approximately how all/most prefetching code looks in the AIO project, > ie sequential scans, bitmap heap scan, etc). > > 2. The logic for controlling how many IOs are running and deciding > when to call the above is in a separate component. In this non-AIO > version, it works using a simple ring buffer of LSNs to estimate the > number of in flight I/Os, just like before. This part would be thrown > away and replaced with the AIO branch's centralised "streaming read" > mechanism which tracks I/O completions based on a stream of completion > events from the kernel (or I/O worker processes). > > 3. In this version, the prefetcher still doesn't pin buffers, for > simplicity. That work did force me to study places where WAL streams > need prefetching "barriers", though, so in this patch you can > see that it's now a little more careful than it probably needs to be. > (It doesn't really matter much if you call posix_fadvise() on a > non-existent file region, or the wrong file after OID wraparound and > reuse, but it would matter if you actually read it into a buffer, and > if an intervening record might be trying to drop something you have > pinned). > > Some other changes: > > 1. 
I dropped the GUC recovery_prefetch_fpw. I think it was a > possibly useful idea but it's a niche concern and not worth worrying > about for now. > > 2. I simplified the stats. Coming up with a good running average > system seemed like a problem for another day (the numbers before were > hard to interpret). The new stats are super simple counters and > instantaneous values: > > postgres=# select * from pg_stat_prefetch_recovery ; > -[ RECORD 1 ]--+-- > stats_reset| 2021-11-10 09:02:08.590217+13 > prefetch | 13605674 <- times we called posix_fadvise() > hit| 24185289 <- times we found pages already cached > skip_init | 217215 <- times we did nothing because init, not read > skip_new | 192347 <- times we skipped because relation too small > skip_fpw | 27429<- times we skipped because fpw, not read > wal_distance | 10648<- how far ahead in WAL bytes > block_distance | 134 <- how far ahead in block references > io_depth | 50 <- fadvise() calls not yet followed by pread() > > I also removed the code to save and restore the stats via the stats > collector, for now. I figured that persistent stats could be a later > feature, perhaps after the shared memory stats stuff? > > 3. I dropped the code that was caching an SMgrRelation pointer to > avoid smgropen() calls that showed up in some profiles. That probably
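To illustrate the ring-buffer idea in point 2 above, here is a rough stand-alone sketch (all names and the ring size are invented for illustration, this is not the patch code): an LSN is recorded whenever a prefetch is issued, entries are retired once replay advances past them, and the number of live entries approximates io_depth. In the eventual AIO version this bookkeeping would be replaced by the centralised "streaming read" mechanism described above.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */

    #define PREFETCH_RING_SIZE 64   /* arbitrary for this sketch */

    typedef struct PrefetchRing
    {
        XLogRecPtr  lsns[PREFETCH_RING_SIZE];
        int         head;           /* next slot to fill */
        int         tail;           /* oldest still-in-flight entry */
        int         inflight;       /* current estimate of I/O depth */
    } PrefetchRing;

    /* Remember that a prefetch was issued for a block referenced at 'lsn'. */
    static void
    prefetch_issued(PrefetchRing *ring, XLogRecPtr lsn)
    {
        if (ring->inflight == PREFETCH_RING_SIZE)
            return;                 /* ring full: caller should throttle */
        ring->lsns[ring->head] = lsn;
        ring->head = (ring->head + 1) % PREFETCH_RING_SIZE;
        ring->inflight++;
    }

    /* Retire entries once replay has reached 'replayed_lsn'. */
    static void
    replay_advanced(PrefetchRing *ring, XLogRecPtr replayed_lsn)
    {
        while (ring->inflight > 0 && ring->lsns[ring->tail] <= replayed_lsn)
        {
            ring->tail = (ring->tail + 1) % PREFETCH_RING_SIZE;
            ring->inflight--;
        }
    }

    int
    main(void)
    {
        PrefetchRing ring = {{0}, 0, 0, 0};

        prefetch_issued(&ring, 1000);
        prefetch_issued(&ring, 2000);
        printf("io_depth=%d\n", ring.inflight);     /* 2 */
        replay_advanced(&ring, 1500);
        printf("io_depth=%d\n", ring.inflight);     /* 1 */
        return 0;
    }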
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra > wrote: >> One thing that's not clear to me is what happened to the reasons why >> this feature was reverted in the PG14 cycle? > 3. A wild goose chase for bugs on Tom Lane's antique 32 bit PPC > machine. Tom eventually reproduced it with the patches reverted, > which seemed to exonerate them but didn't leave a good feeling: what > was happening, and why did the patches hugely increase the likelihood > of the failure mode? I have no new information on that, but I know > that several people spent a huge amount of time and effort trying to > reproduce it on various types of systems, as did I, so despite not > reaching a conclusion of a bug, this certainly contributed to a > feeling that the patch had run out of steam for the 14 cycle. Yeah ... on the one hand, that machine has shown signs of hard-to-reproduce flakiness, so it's easy to write off the failures I saw as hardware issues. On the other hand, the flakiness I've seen has otherwise manifested as kernel crashes, which is nothing like the consistent test failures I was seeing with the patch. Andres speculated that maybe we were seeing a kernel bug that affects consistency of concurrent reads and writes. That could be an explanation; but it's just evidence-free speculation so far, so I don't feel real convinced by that idea either. Anyway, I hope to find time to see if the issue still reproduces with Thomas' new patch set. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra wrote: > One thing that's not clear to me is what happened to the reasons why > this feature was reverted in the PG14 cycle? Reasons for reverting: 1. A bug in commit 323cbe7c, "Remove read_page callback from XLogReader.". I couldn't easily revert just that piece. This new version doesn't depend on that change anymore, to try to keep things simple. (That particular bug has been fixed in a newer version of that patch[1], which I still think was a good idea incidentally.) 2. A bug where allocation for large records happened before validation. Concretely, you can see that this patch does XLogReadRecordAlloc() after validating the header (usually, same as master), but commit f003d9f8 did it first. (Though Andres pointed out[2] that more work is needed on that to make that logic more robust, and I'm keen to look into that, but that's independent of this work). 3. A wild goose chase for bugs on Tom Lane's antique 32 bit PPC machine. Tom eventually reproduced it with the patches reverted, which seemed to exonerate them but didn't leave a good feeling: what was happening, and why did the patches hugely increase the likelihood of the failure mode? I have no new information on that, but I know that several people spent a huge amount of time and effort trying to reproduce it on various types of systems, as did I, so despite not reaching a conclusion of a bug, this certainly contributed to a feeling that the patch had run out of steam for the 14 cycle. This week I'll have another crack at getting that TAP test I proposed that runs the regression tests with a streaming replica to work on Windows. That does approximately what Tom was doing when he saw problem #3, which I'd like to have as standard across the build farm. [1] https://www.postgresql.org/message-id/20211007.172820.1874635561738958207.horikyota.ntt%40gmail.com [2] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg%40alap3.anarazel.de
Re: WIP: WAL prefetch (another approach)
On 11/26/21 22:16, Thomas Munro wrote: On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra wrote: The results are pretty good / similar to previous results. Replaying the 1h worth of work on a smaller machine takes ~5:30h without prefetching (master or with prefetching disabled). With prefetching enabled this drops to ~2h (default config) and ~1h (with tuning). Thanks for testing! Wow, that's a nice graph. This has bit-rotted already due to Robert's work on ripping out globals, so I'll post a rebase early next week, and incorporate your code feedback. One thing that's not clear to me is what happened to the reasons why this feature was reverted in the PG14 cycle? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra wrote: > The results are pretty good / similar to previous results. Replaying the > 1h worth of work on a smaller machine takes ~5:30h without prefetching > (master or with prefetching disabled). With prefetching enabled this > drops to ~2h (default config) and ~1h (with tuning). Thanks for testing! Wow, that's a nice graph. This has bit-rotted already due to Robert's work on ripping out globals, so I'll post a rebase early next week, and incorporate your code feedback.
Re: WIP: WAL prefetch (another approach)
Hi, It's great you posted a new version of this patch, so I took a brief look at it. The code seems in pretty good shape, I haven't found any real issues - just two minor comments: This seems a bit strange: #define DEFAULT_DECODE_BUFFER_SIZE 0x1 Why not define this as a simple decimal value? Is there something special about this particular value, or is it arbitrary? I guess it's simply the minimum for the wal_decode_buffer_size GUC, but why not use the GUC in all places that decode WAL? FWIW I don't think we include updates to typedefs.list in patches. I also repeated the benchmarks I did at the beginning of the year [1]. Attached is a chart with four different configurations: 1) master (f79962d826) 2) patched (with prefetching disabled) 3) patched (with default configuration) 4) patched (with I/O concurrency 256 and 2MB decode buffer) For all configs the shared buffers were set to 64GB, checkpoints every 20 minutes, etc. The results are pretty good / similar to previous results. Replaying the 1h worth of work on a smaller machine takes ~5:30h without prefetching (master or with prefetching disabled). With prefetching enabled this drops to ~2h (default config) and ~1h (with tuning). regards [1] https://www.postgresql.org/message-id/c5d52837-6256-0556-ac8c-d6d3d558820a%40enterprisedb.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
> On 10 May 2021, at 06:11, Thomas Munro wrote: > On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost wrote: >> I tend to agree with the idea to revert it, perhaps a +0 on that, but if >> others argue it should be fixed in-place, I wouldn’t complain about it. > > Reverted. > > Note: eelpout may return a couple of failures because it's set up to > run with recovery_prefetch=on (now an unknown GUC), and it'll be a few > hours before I can access that machine to adjust that... > >> I very much encourage the idea of improving testing in this area and would >> be happy to try and help do so in the 15 cycle. > > Cool. I'm going to try out some ideas. Skimming this thread without all the context it's not entirely clear which patch the CF entry relates to (I assume it's the one from April 7 based on attached mail-id but there is a revert from May?), and the CF app and CF bot are also in disagreement which is the latest one. Could you post an updated version of the patch which is for review? -- Daniel Gustafsson https://vmware.com/
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost wrote: > On Wed, Apr 21, 2021 at 19:17 Thomas Munro wrote: >> On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro wrote: >> ... Personally I think the right thing to do now is to revert it >> and re-propose for 15 early in the cycle, supported with some better >> testing infrastructure. > > I tend to agree with the idea to revert it, perhaps a +0 on that, but if > others argue it should be fixed in-place, I wouldn’t complain about it. Reverted. Note: eelpout may return a couple of failures because it's set up to run with recovery_prefetch=on (now an unknown GUC), and it'll be a few hours before I can access that machine to adjust that... > I very much encourage the idea of improving testing in this area and would be > happy to try and help do so in the 15 cycle. Cool. I'm going to try out some ideas.
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-05-04 18:08:35 -0700, Andres Freund wrote: > But the issue that 70b4f82a4b is trying to address seems bigger to > me. The reason it's so easy to hit the issue is that walreceiver does < > 8KB writes into recycled WAL segments *without* zero-filling the tail > end of the page - which will commonly be filled with random older > contents, because we'll use recycled segments. I think that > *drastically* increases the likelihood of finding something that looks > like a valid record header compared to the situation on a primary where > zeroing pages before use makes that pretty unlikely. I've written an experimental patch to deal with this and, as expected, it does make the end-of-wal detection a lot more predictable and reliable. There are only two types of possible errors outside of crashes: A record length of 0 (the end of WAL is within a page), and the page header LSN mismatching (the end of WAL is at a page boundary). This seems like a significant improvement. However: It's nontrivial to do this nicely and in a backpatchable way in XLogWalRcvWrite(). Or at least I haven't found a good way: - We can't extend the input buffer to XLogWalRcvWrite(), it's from libpq. - We don't want to copy the entire buffer (commonly 128KiB) to a new buffer that we then can extend by 0-BLCKSZ of zeroes to cover the trailing part of the last page. - In PG13+ we can do this utilizing pg_writev(), adding another IOV entry covering the trailing space to be padded. - It's nicer to avoid increasing the number of write() calls, but it's not as crucial as the earlier points. I'm also a bit uncomfortable with another aspect, although I can't really see a problem: When switching to receiving WAL via walreceiver, we always start at a segment boundary, even if we had received most of that segment before. Currently that won't end up with any trailing space that needs to be zeroed, because the server will always send 128KB chunks, but there's no formal guarantee for that. It seems a bit odd that we could end up zeroing trailing space that already contains valid data, just to overwrite it with valid data again. But it ought to always be fine. The least offensive way I could come up with is for XLogWalRcvWrite() to always write partial pages in a separate pg_pwrite(). When writing a partial page, and the previous write position was not already on that same page, copy the buffer into a local XLOG_BLCKSZ-sized buffer (although we'll never use more than XLOG_BLCKSZ-1 I think), and (re)zero out the trailing part. One thing this does not yet handle is if we were to get a partial write - we'd not notice again that we need to pad the end of the page. Does anybody have a better idea? I really wish we had a version of pg_p{read,write}[v] that internally handled partial IOs, retrying as long as they see > 0 bytes written. Greetings, Andres Freund
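PS: To sketch the iovec idea from the list above in stand-alone form (using plain POSIX pwritev() rather than the pg_ wrapper, with invented names, and ignoring error handling and the partial-write problem mentioned above):

    #include <sys/uio.h>
    #include <unistd.h>

    #define XLOG_BLCKSZ 8192        /* usual build default */

    /*
     * Hypothetical helper, not the real XLogWalRcvWrite(): write 'nbytes' of
     * WAL at 'startoff' and, if the write ends partway through a page, append
     * zeroes up to the end of that page so stale recycled-segment bytes never
     * directly follow freshly written WAL on the same page.
     */
    ssize_t
    write_wal_zero_padded(int fd, const char *buf, size_t nbytes, off_t startoff)
    {
        static const char zeros[XLOG_BLCKSZ] = {0};
        size_t      tail = (size_t) ((startoff + nbytes) % XLOG_BLCKSZ);
        struct iovec iov[2];
        int         iovcnt = 1;

        iov[0].iov_base = (void *) buf;
        iov[0].iov_len = nbytes;
        if (tail != 0)
        {
            /* second iovec covers the rest of the final, partial page */
            iov[1].iov_base = (void *) zeros;
            iov[1].iov_len = XLOG_BLCKSZ - tail;
            iovcnt = 2;
        }
        return pwritev(fd, iov, iovcnt, startoff);
    }

The separate-pg_pwrite() variant described above would instead copy the partial page into a local XLOG_BLCKSZ-sized buffer and zero its tail before writing, at the cost of an extra write() call.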
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-05-04 09:46:12 -0400, Tom Lane wrote: > Yeah, I have also spent a fair amount of time trying to reproduce it > elsewhere, without success so far. Notably, I've been trying on a > PPC Mac laptop that has a fairly similar CPU to what's in the G4, > though a far slower disk drive. So that seems to exclude theories > based on it being PPC-specific. > > I suppose that if we're unable to reproduce it on at least one other box, > we have to write it off as hardware flakiness. I wonder if there's a chance what we're seeing is an OS memory ordering bug, or a race between walreceiver writing data and the startup process reading it. When the startup process is able to keep up, there often will be a very small time delta between the startup process reading a page that the walreceiver just wrote. And if the currently read page was the tail page written to by a 'w' message, it'll often be written to again in short order - potentially while the startup process is reading it. It'd not terribly surprise me if an old OS version on an old processor had some issues around that. Were there any cases of walsender terminating and reconnecting around the failures? It looks suspicious that XLogPageRead() does not invalidate the xlogreader state when retrying. Normally that's xlogreader's responsibility, but there is that whole XLogReaderValidatePageHeader() business. But I don't quite see how it'd actually cause problems. Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-05-04 15:47:41 -0400, Tom Lane wrote: > BTW, that conclusion shouldn't distract us from the very real bug > that Andres identified. I was just scraping the buildfarm logs > concerning recent failures, and I found several recent cases > that match the symptom he reported: > [...] > They all show the standby in recovery/019_replslot_limit.pl failing > with symptoms like > > 2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down > in recovery at 2021-05-04 07:41:39 UTC > 2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode > 2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8 > 2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state > reached at 0/1D0 > 2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request > size 1476397045 > 2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to > accept read only connections > 2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) > exited with exit code 1 Yea, that's the pre-existing end-of-log-issue that got more likely as well as more consequential (by accident) in Thomas' patch. It's easy to reach parity with the state in 13, it's just a matter of changing the order in one place. But I think we need to do something for all branches here. The bandaid that was added to allocate_recordbuf() doesn't really seem sufficient to me. This is commit 70b4f82a4b5cab5fc12ff876235835053e407155 Author: Michael Paquier Date: 2018-06-18 10:43:27 +0900 Prevent hard failures of standbys caused by recycled WAL segments In <= 13 the current state is that we'll allocate an effectively random number of bytes, as long as that random number is below 1GB, whenever we reach the end of the WAL with a record header split across a page boundary (because there we don't validate the header before allocating). That allocation is then not freed for the lifetime of the xlogreader. And for FRONTEND uses of xlogreader we'll just happily allocate 4GB. The specific problem here is that we don't validate the record header before allocating when the record header is split across a page boundary - without much need as far as I can tell? Until we've read the entire header, we actually don't need to allocate the record buffer? This seems like an issue that needs to be fixed to be more robust in crash recovery scenarios where obviously we could just have failed with half-written records. But the issue that 70b4f82a4b is trying to address seems bigger to me. The reason it's so easy to hit the issue is that walreceiver does < 8KB writes into recycled WAL segments *without* zero-filling the tail end of the page - which will commonly be filled with random older contents, because we'll use recycled segments. I think that *drastically* increases the likelihood of finding something that looks like a valid record header compared to the situation on a primary where zeroing pages before use makes that pretty unlikely. > (BTW, the behavior seen here where the failure occurs *immediately* > after reporting "consistent recovery state reached" is seen in the > other reports as well, including Andres' version. I wonder if that > means anything.) That's to be expected, I think. There's not a lot of data that needs to be replayed, and we'll always reach consistency before the end of the WAL unless you're dealing with starting from an in-progress base-backup that hasn't yet finished or such.
The test causes replication to fail shortly after that, so we'll always switch to doing recovery from pg_wal, which then will hit the end of the WAL, hitting this issue with, I think, ~25% likelihood (data from recycled WAL data is probably *roughly* evenly distributed, and any 4byte value above 1GB will hit this error in 14). Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
I wrote: > I suppose that if we're unable to reproduce it on at least one other box, > we have to write it off as hardware flakiness. BTW, that conclusion shouldn't distract us from the very real bug that Andres identified. I was just scraping the buildfarm logs concerning recent failures, and I found several recent cases that match the symptom he reported: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2021-04-23%2022%3A27%3A41 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-04-21%2005%3A15%3A24 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-20%2002%3A03%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-05-04%2004%3A07%3A41 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-20%2021%3A08%3A59 They all show the standby in recovery/019_replslot_limit.pl failing with symptoms like 2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC 2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode 2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8 2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D0 2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045 2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections 2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1 (BTW, the behavior seen here where the failure occurs *immediately* after reporting "consistent recovery state reached" is seen in the other reports as well, including Andres' version. I wonder if that means anything.) regards, tom lane
Re: WIP: WAL prefetch (another approach)
Tomas Vondra writes: > On 5/3/21 7:42 AM, Thomas Munro wrote: >> Hmm, yeah that does seem plausible. It would be nice to see a report >> from any other system though. I'm still trying, and reviewing... > FWIW I've ran the test (make installcheck-parallel in a loop) on four > different machines - two x86_64 ones, and two rpi4. The x86 boxes did > ~1000 rounds each (and one of them had 5 local replicas) without any > issue. The rpi4 machines did ~50 rounds each, also without failures. Yeah, I have also spent a fair amount of time trying to reproduce it elsewhere, without success so far. Notably, I've been trying on a PPC Mac laptop that has a fairly similar CPU to what's in the G4, though a far slower disk drive. So that seems to exclude theories based on it being PPC-specific. I suppose that if we're unable to reproduce it on at least one other box, we have to write it off as hardware flakiness. I'm not entirely comfortable with that answer, but I won't push for reversion of the WAL patches without more evidence that there's a real issue. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On 5/3/21 7:42 AM, Thomas Munro wrote: On Sun, May 2, 2021 at 3:16 PM Tom Lane wrote: That last point means that there was some hard-to-hit problem even before any of the recent WAL-related changes. However, 323cbe7c7 (Remove read_page callback from XLogReader) increased the failure rate by at least a factor of 5, and 1d257577e (Optionally prefetch referenced data) seems to have increased it by another factor of 4. But it looks like f003d9f87 (Add circular WAL decoding buffer) didn't materially change the failure rate. Oh, wow. There are several surprising results there. Thanks for running those tests for so long so that we could see the rarest failures. Even if there are somehow *two* causes of corruption, one preexisting and one added by the refactoring or decoding patches, I'm struggling to understand how the chance increases with 1d2575, since that only adds code that isn't reached when not enabled (though I'm going to re-review that). Considering that 323cbe7c7 was supposed to be just refactoring, and 1d257577e is allegedly disabled-by-default, these are surely not the results I was expecting to get. +1 It seems like it's still an open question whether all this is a real bug, or flaky hardware. I have seen occasional kernel freezeups (or so I think -- machine stops responding to keyboard or network input) over the past year or two, so I cannot in good conscience rule out the flaky-hardware theory. But it doesn't smell like that kind of problem to me. I think what we're looking at is a timing-sensitive bug that was there before (maybe long before?) and these commits happened to make it occur more often on this particular hardware. This hardware is enough unlike anything made in the past decade that it's not hard to credit that it'd show a timing problem that nobody else can reproduce. Hmm, yeah that does seem plausible. It would be nice to see a report from any other system though. I'm still trying, and reviewing... FWIW I've ran the test (make installcheck-parallel in a loop) on four different machines - two x86_64 ones, and two rpi4. The x86 boxes did ~1000 rounds each (and one of them had 5 local replicas) without any issue. The rpi4 machines did ~50 rounds each, also without failures. Obviously, it's possible there's something that neither of those (very different systems) triggers, but I'd say it might also be a hint that this really is a hw issue on the old ppc macs. Or maybe something very specific to that arch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
On Sun, May 2, 2021 at 3:16 PM Tom Lane wrote: > That last point means that there was some hard-to-hit problem even > before any of the recent WAL-related changes. However, 323cbe7c7 > (Remove read_page callback from XLogReader) increased the failure > rate by at least a factor of 5, and 1d257577e (Optionally prefetch > referenced data) seems to have increased it by another factor of 4. > But it looks like f003d9f87 (Add circular WAL decoding buffer) > didn't materially change the failure rate. Oh, wow. There are several surprising results there. Thanks for running those tests for so long so that we could see the rarest failures. Even if there are somehow *two* causes of corruption, one preexisting and one added by the refactoring or decoding patches, I'm struggling to understand how the chance increases with 1d2575, since that only adds code that isn't reached when not enabled (though I'm going to re-review that). > Considering that 323cbe7c7 was supposed to be just refactoring, > and 1d257577e is allegedly disabled-by-default, these are surely > not the results I was expecting to get. +1 > It seems like it's still an open question whether all this is > a real bug, or flaky hardware. I have seen occasional kernel > freezeups (or so I think -- machine stops responding to keyboard > or network input) over the past year or two, so I cannot in good > conscience rule out the flaky-hardware theory. But it doesn't > smell like that kind of problem to me. I think what we're looking > at is a timing-sensitive bug that was there before (maybe long > before?) and these commits happened to make it occur more often > on this particular hardware. This hardware is enough unlike > anything made in the past decade that it's not hard to credit > that it'd show a timing problem that nobody else can reproduce. Hmm, yeah that does seem plausible. It would be nice to see a report from any other system though. I'm still trying, and reviewing...
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 29, 2021 at 12:24 PM Tom Lane wrote: > Andres Freund writes: > > On 2021-04-28 19:24:53 -0400, Tom Lane wrote: > >> IOW, we've spent over twice as many CPU cycles shipping data to the > >> standby as we did in applying the WAL on the standby. > > > I don't really know how the time calculation works on mac. Is there a > > chance it includes time spent doing IO? For comparison, on a modern Linux system I see numbers like this, while running that 025_stream_rep_regress.pl test I posted in a nearby thread: USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND tmunro 2150863 22.5 0.0 55348 6752 ?Ss 12:59 0:07 postgres: standby_1: startup recovering 00010002003C tmunro 2150867 17.5 0.0 55024 6364 ?Ss 12:59 0:05 postgres: standby_1: walreceiver streaming 2/3C675D80 tmunro 2150868 11.7 0.0 55296 7192 ?Ss 12:59 0:04 postgres: primary: walsender tmunro [local] streaming 2/3C675D80 Those ratios are better but it's still hard work, and perf shows the CPU time is all in page cache schlep: 22.44% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string 20.12% postgres [kernel.kallsyms] [k] __add_to_page_cache_locked 7.30% postgres [kernel.kallsyms] [k] iomap_set_page_dirty That was with all three patches reverted, so it's nothing new. Definitely room for improvement... there have been a few discussions about not using a buffered file for high-frequency data exchange and relaxing various timing rules, which we should definitely look into, but I wouldn't be at all surprised if HFS+ was just much worse at this. Thinking more about good old HFS+... I guess it's remotely possible that there might have been coherency bugs in that could be exposed by our usage pattern, but then that doesn't fit too well with the clues I have from light reading: this is a non-SMP system, and it's said that HFS+ used to serialise pretty much everything on big filesystem locks anyway.
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > On Thu, Apr 29, 2021 at 4:45 AM Tom Lane wrote: >> Andres Freund writes: >>> Tom, any chance you could check if your machine repros the issue before >>> these commits? >> Wilco, but it'll likely take a little while to get results ... > FWIW I also chewed through many megawatts trying to reproduce this on > a PowerPC system in 64 bit big endian mode, with an emulator. No > cigar. However, it's so slow that I didn't make it to 10 runs... So I've expended a lot of kilowatt-hours over the past several days, and I've got results that are interesting but don't really get us any closer to a resolution. To recap, the test lashup is: * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive) * Standard debug build (--enable-debug --enable-cassert) * Out-of-the-box configuration, except add wal_consistency_checking = all and configure a wal-streaming standby on the same machine * Repeatedly run "make installcheck-parallel", but skip the tablespace test to avoid issues with the standby trying to use the same directory * Delay long enough after each installcheck-parallel to let the standby catch up (the run proper is ~24 min, plus 2 min for catchup) The failures I'm seeing generally look like 2021-05-01 15:33:10.968 EDT [8281] FATAL: inconsistent page found, rel 1663/58186/66338, forknum 0, blkno 19 2021-05-01 15:33:10.968 EDT [8281] CONTEXT: WAL redo at 3/4CE905B8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/58186/66338, blk 19 FPW with a variety of WAL record types being named, so it doesn't seem to be specific to any particular record type. I've twice gotten the bogus-checksum-and-then-assertion-failure I reported before: 2021-05-01 17:07:52.992 EDT [17464] LOG: incorrect resource manager data checksum in record at 3/E0073EA4 TRAP: FailedAssertion("state->recordRemainLen > 0", File: "xlogreader.c", Line: 567, PID: 17464) In both of those cases, the WAL on disk was perfectly fine, and the same is true of most of the "inconsistent page" complaints. So the issue definitely seems to be about the startup process mis-reading data that was correctly shipped over. Anyway, the new and interesting data concerns the relative failure rates of different builds: * Recent HEAD (from 4-28 and 5-1): 4 failures in 8 test cycles * Reverting 1d257577e: 1 failure in 8 test cycles * Reverting 1d257577e and f003d9f87: 3 failures in 28 cycles * Reverting 1d257577e, f003d9f87, and 323cbe7c7: 2 failures in 93 cycles That last point means that there was some hard-to-hit problem even before any of the recent WAL-related changes. However, 323cbe7c7 (Remove read_page callback from XLogReader) increased the failure rate by at least a factor of 5, and 1d257577e (Optionally prefetch referenced data) seems to have increased it by another factor of 4. But it looks like f003d9f87 (Add circular WAL decoding buffer) didn't materially change the failure rate. Considering that 323cbe7c7 was supposed to be just refactoring, and 1d257577e is allegedly disabled-by-default, these are surely not the results I was expecting to get. It seems like it's still an open question whether all this is a real bug, or flaky hardware. I have seen occasional kernel freezeups (or so I think -- machine stops responding to keyboard or network input) over the past year or two, so I cannot in good conscience rule out the flaky-hardware theory. But it doesn't smell like that kind of problem to me. I think what we're looking at is a timing-sensitive bug that was there before (maybe long before?) 
and these commits happened to make it occur more often on this particular hardware. This hardware is enough unlike anything made in the past decade that it's not hard to credit that it'd show a timing problem that nobody else can reproduce. (I did try the time-honored ritual of reseating all the machine's RAM, partway through this. Doesn't seem to have changed anything.) Anyway, I'm not sure where to go from here. I'm for sure nowhere near being able to identify the bug --- and if there really is a bug that formerly had a one-in-fifty reproduction rate, I have zero interest in trying to identify where it started by bisecting. It'd take at least a day per bisection step, and even that might not be accurate enough. (But, if anyone has ideas of specific commits to test, I'd be willing to try a few.) regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 29, 2021 at 3:14 PM Andres Freund wrote: > To me it looks like a smaller version of the problem is present in < 14, > albeit only when the record header is split across a page boundary. In that > case we don't validate the record header immediately, only once it's > completely read. But we do believe the total size, and try to allocate > that. > > There's a really crufty escape hatch (from 70b4f82a4b): Right, I made that problem worse, and that could probably be changed to be no worse than 13 by reordering those operations. PS Sorry for my intermittent/slow responses on this thread this week, as I'm mostly away from the keyboard due to personal commitments. I'll be back in the saddle next week to tidy this up, most likely by reverting. The main thought I've been having about this whole area is that, aside from the lack of general testing of recovery, which we should definitely address[1], what it really needs is a decent test harness to drive it through all interesting scenarios and states at a lower level, independently. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > I was now able to reproduce the problem again, and I'm afraid that the > bug I hit is likely separate from Tom's. Yeah, I think so --- the symptoms seem quite distinct. My score so far today on the G4 is: 12 error-free regression test cycles on b3ee4c503 (plus one more with shared_buffers set to 16MB, on the strength of your previous hunch --- didn't fail for me though) HEAD failed on the second run with the same symptom as before: 2021-04-28 22:57:17.048 EDT [50479] FATAL: inconsistent page found, rel 1663/58183/69545, forknum 0, blkno 696 2021-04-28 22:57:17.048 EDT [50479] CONTEXT: WAL redo at 4/B72D408 for Heap/INSERT: off 77 flags 0x00; blkref #0: rel 1663/58183/69545, blk 696 FPW This seems to me to be pretty strong evidence that I'm seeing *something* real. I'm currently trying to isolate a specific commit to pin it on. A straight "git bisect" isn't going to work because so many people had broken so many different things right around that date :-(, so it may take awhile to get a good answer. regards, tom lane
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 17:59:22 -0700, Andres Freund wrote: > I can however say that pg_waldump on the standby's pg_wal does also > fail. The failure as part of the backend is "invalid memory alloc > request size", whereas in pg_waldump I get the much more helpful: > pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect > prev-link 416200FF/FF00 at 4/7F5C3200 > > In frontend code that allocation actually succeeds, because there is no > size check. But in backend code we run into the size check, and thus > don't even display a useful error. > > In 13 the header is validated before allocating space for the > record(except if header is spread across pages) - it seems inadvisable > to turn that around? I was now able to reproduce the problem again, and I'm afraid that the bug I hit is likely separate from Tom's. The allocation thing above is the issue in my case: The walsender connection ended (I restarted the primary), thus the startup process switches to replaying locally. For some reason the end of the WAL contains non-zero data (I think it's because walreceiver doesn't zero out pages - that's bad!). Because the allocation happens before the header is validated, we reproducibly end up in the mcxt.c ERROR path, failing recovery. To me it looks like a smaller version of the problem is present in < 14, albeit only when the record header is split across a page boundary. In that case we don't validate the record header immediately, only once it's completely read. But we do believe the total size, and try to allocate that. There's a really crufty escape hatch (from 70b4f82a4b) to that: /* * Note that in much unlucky circumstances, the random data read from a * recycled segment can cause this routine to be called with a size * causing a hard failure at allocation. For a standby, this would cause * the instance to stop suddenly with a hard failure, preventing it to * retry fetching WAL from one of its sources which could allow it to move * on with replay without a manual restart. If the data comes from a past * recycled segment and is still valid, then the allocation may succeed * but record checks are going to fail so this would be short-lived. If * the allocation fails because of a memory shortage, then this is not a * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM. */ if (!AllocSizeIsValid(newSize)) return false; but it looks to me like that's pretty much the wrong fix, at least in the case where we've not yet validated the rest of the header. We don't need to allocate all that data before we've read the rest of the *fixed-size* header. It also seems to me that 70b4f82a4b should also have changed walsender to pad out the received data to an 8KB boundary? Greetings, Andres Freund
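PS: In simplified, hypothetical C (invented names, not the actual xlogreader code), the ordering being argued for here is: assemble the fixed-size header first, even when it is split across a page boundary, validate it, and only then size the allocation from xl_tot_len.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Invented stand-ins for the real fixed-size record header and checks. */
    typedef struct SketchHeader
    {
        uint32_t    xl_tot_len;     /* total record length, header included */
        uint32_t    xl_info;        /* stands in for the remaining fixed fields */
    } SketchHeader;

    /*
     * Assemble the fixed-size header into a small stack object, in one or two
     * pieces if it is split across a page boundary.  No heap allocation needed.
     */
    void
    assemble_header(SketchHeader *hdr,
                    const char *first_chunk, size_t first_len,
                    const char *next_page)
    {
        memcpy(hdr, first_chunk, first_len);
        if (first_len < sizeof(SketchHeader))
            memcpy((char *) hdr + first_len, next_page,
                   sizeof(SketchHeader) - first_len);
    }

    /* Stand-in for the real prev-link / rmgr / length sanity checks. */
    bool
    valid_header(const SketchHeader *hdr)
    {
        return hdr->xl_tot_len >= sizeof(SketchHeader) &&
            hdr->xl_tot_len < 1024UL * 1024 * 1024;
    }

    /* Only a header that already passed validation may size the allocation. */
    char *
    allocate_record_buffer(const SketchHeader *hdr)
    {
        if (!valid_header(hdr))
            return NULL;            /* garbage xl_tot_len never reaches malloc() */
        return malloc(hdr->xl_tot_len);
    }

With that ordering, a garbage length read off a recycled page is rejected by header validation instead of reaching the allocator at all.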
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 17:59:22 -0700, Andres Freund wrote: > I can however say that pg_waldump on the standby's pg_wal does also > fail. The failure as part of the backend is "invalid memory alloc > request size", whereas in pg_waldump I get the much more helpful: > pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect > prev-link 416200FF/FF00 at 4/7F5C3200 There's definitely something broken around continuation records, in XLogFindNextRecord(). Which means that it's not the cause for the server side issue, but obviously still not good. The conversion of XLogFindNextRecord() to be state machine based basically only works in a narrow set of circumstances. Whenever the end of the first record read is on a different page than the start of the record, we'll endlessly loop. We'll go into XLogFindNextRecord(), and return until we've successfully read the page header. Then we'll enter the second loop. Which will try to read until the end of the first record. But after returning the first loop will again ask for page header. Even if that's fixed, the second loop alone has the same problem: As XLogBeginRead() is called unconditionally we'll start read the start of the record, discover that it needs data on a second page, return, and do the same thing again. I think it needs something roughly like the attached. Greetings, Andres Freund diff --git i/src/include/access/xlogreader.h w/src/include/access/xlogreader.h index 3b8af31a8fe..82a80cf2bf5 100644 --- i/src/include/access/xlogreader.h +++ w/src/include/access/xlogreader.h @@ -297,6 +297,7 @@ struct XLogFindNextRecordState XLogReaderState *reader_state; XLogRecPtr targetRecPtr; XLogRecPtr currRecPtr; + bool found_start; }; /* Report that data is available for decoding. */ diff --git i/src/backend/access/transam/xlogreader.c w/src/backend/access/transam/xlogreader.c index 4277e92d7c9..935c841347f 100644 --- i/src/backend/access/transam/xlogreader.c +++ w/src/backend/access/transam/xlogreader.c @@ -868,7 +868,7 @@ XLogDecodeOneRecord(XLogReaderState *state, bool allow_oversized) /* validate record header if not yet */ if (!state->record_verified && record_len >= SizeOfXLogRecord) { -if (!ValidXLogRecordHeader(state, state->DecodeRecPtr, + if (!ValidXLogRecordHeader(state, state->DecodeRecPtr, state->PrevRecPtr, prec)) goto err; @@ -1516,6 +1516,7 @@ InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr) state->reader_state = reader_state; state->targetRecPtr = start_ptr; state->currRecPtr = start_ptr; + state->found_start = false; return state; } @@ -1545,7 +1546,7 @@ XLogFindNextRecord(XLogFindNextRecordState *state) * skip over potential continuation data, keeping in mind that it may span * multiple pages */ - while (true) + while (!state->found_start) { XLogRecPtr targetPagePtr; int targetRecOff; @@ -1616,7 +1617,12 @@ XLogFindNextRecord(XLogFindNextRecordState *state) * because either we're at the first record after the beginning of a page * or we just jumped over the remaining data of a continuation. */ - XLogBeginRead(state->reader_state, state->currRecPtr); + if (!state->found_start) + { + XLogBeginRead(state->reader_state, state->currRecPtr); + state->found_start = true; + } + while ((result = XLogReadRecord(state->reader_state, &record, &errormsg)) != XLREAD_FAIL) {
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 20:24:43 -0400, Tom Lane wrote: > Andres Freund writes: > > Oh! I was about to ask how much shared buffers your primary / standby > > have. > Default configurations, so 128MB each. I thought that possibly initdb would detect less or something... I assume this is 32bit? I did notice that a 32bit test took a lot longer than a 64bit test. But didn't investigate so far. > And I think I may actually have reproduce a variant of the issue! Unfortunately I had not set up things in a way that the primary retains the WAL, making it harder to compare whether it's the WAL that got corrupted or whether it's a decoding bug. I can however say that pg_waldump on the standby's pg_wal does also fail. The failure as part of the backend is "invalid memory alloc request size", whereas in pg_waldump I get the much more helpful: pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF00 at 4/7F5C3200 In frontend code that allocation actually succeeds, because there is no size check. But in backend code we run into the size check, and thus don't even display a useful error. In 13 the header is validated before allocating space for the record(except if header is spread across pages) - it seems inadvisable to turn that around? Greetings, Andres Freund
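The backend/frontend difference Andres describes comes down to the guard in the backend allocator. A simplified sketch (not the exact code paths in mcxt.c or fe_memutils.c; total_len and buf are illustrative names):

    /* Backend: any request above MaxAllocSize (~1 GB) is rejected up front,
     * which is the terse error Andres hit. */
    if (!AllocSizeIsValid(total_len))
        elog(ERROR, "invalid memory alloc request size %zu", (Size) total_len);
    buf = palloc(total_len);

    /* Frontend (pg_waldump): pg_malloc() has no such cap, so the bogus length
     * is simply allocated and the later prev-link / CRC validation gets a
     * chance to produce the far more useful message. */
    buf = pg_malloc(total_len);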
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > On 2021-04-28 19:24:53 -0400, Tom Lane wrote: >> IOW, we've spent over twice as many CPU cycles shipping data to the >> standby as we did in applying the WAL on the standby. > I don't really know how the time calculation works on mac. Is there a > chance it includes time spent doing IO? I'd be pretty astonished if it did. This is basically a NetBSD system remember (in fact, this ancient macOS release is a good deal closer to those roots than modern versions). BSDen have never accounted for time that way AFAIK. Also, the "ps" man page says specifically that that column is CPU time. > Oh! I was about to ask how much shared buffers your primary / standby > have. And I think I may actually have reproduce a variant of the issue! Default configurations, so 128MB each. regards, tom lane
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 19:24:53 -0400, Tom Lane wrote: > But I happened to notice the accumulated CPU time for the background > processes: > > USER PID %CPU %MEM VSZRSS TT STAT STARTED TIME COMMAND > tgl 19048 0.0 4.4 229952 92196 ?? Ss3:19PM 19:59.19 > postgres: startup recovering 000100140022 > tgl 19051 0.0 0.1 229656 1696 ?? Ss3:19PM 27:09.14 > postgres: walreceiver streaming 14/227D8F14 > tgl 19052 0.0 0.1 229904 2516 ?? Ss3:19PM 17:38.17 > postgres: walsender tgl [local] streaming 14/227D8F14 > > IOW, we've spent over twice as many CPU cycles shipping data to the > standby as we did in applying the WAL on the standby. Is this > expected? I've got wal_consistency_checking = all, which is bloating > the WAL volume quite a bit, but still it seems like the walsender and > walreceiver have little excuse for spending more cycles per byte > than the startup process. I don't really know how the time calculation works on mac. Is there a chance it includes time spent doing IO? On the primary the WAL IO is done by a lot of backends, but on the standby it's all going to be the walreceiver. And the walreceiver does fsyncs in a not particularly efficient manner. FWIW, on my linux workstation no such difference is visible: USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND andres 2910540 9.4 0.0 2237252 126680 ? Ss 16:55 0:20 postgres: dev assert standby: startup recovering 00010002003F andres 2910544 5.2 0.0 2236724 9260 ?Ss 16:55 0:11 postgres: dev assert standby: walreceiver streaming 2/3FDCF118 andres 2910545 2.1 0.0 2237036 10672 ? Ss 16:55 0:04 postgres: dev assert: walsender andres [local] streaming 2/3FDCF118 > (This is testing b3ee4c503, so if Thomas' WAL changes improved > efficiency of the replay process at all, the discrepancy could be > even worse in HEAD.) The prefetching isn't enabled by default, so I'd not expect meaningful differences... And even with the prefetching enabled, our normal regression tests largely are resident in s_b, so there shouldn't be much prefetching. Oh! I was about to ask how much shared buffers your primary / standby have. And I think I may actually have reproduce a variant of the issue! I previously had played around with different settings that I thought might increase the likelihood of reproducing the problem. But this time I set shared_buffers lower than before, and got: 2021-04-28 17:03:22.174 PDT [2913840][] LOG: database system was shut down in recovery at 2021-04-28 17:03:11 PDT 2021-04-28 17:03:22.174 PDT [2913840][] LOG: entering standby mode 2021-04-28 17:03:22.178 PDT [2913840][1/0] LOG: redo starts at 2/416C6278 2021-04-28 17:03:37.628 PDT [2913840][1/0] LOG: consistent recovery state reached at 4/7F5C3200 2021-04-28 17:03:37.628 PDT [2913840][1/0] FATAL: invalid memory alloc request size 3053455757 2021-04-28 17:03:37.628 PDT [2913839][] LOG: database system is ready to accept read only connections 2021-04-28 17:03:37.636 PDT [2913839][] LOG: startup process (PID 2913840) exited with exit code 1 This reproduces across restarts. Yay, I guess. Isn't it off that we get a "database system is ready to accept read only connections"? Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > FWIW I also chewed through many megawatts trying to reproduce this on > a PowerPC system in 64 bit big endian mode, with an emulator. No > cigar. However, it's so slow that I didn't make it to 10 runs... Speaking of megawatts ... my G4 has now finished about ten cycles of installcheck-parallel without a failure, which isn't really enough to draw any conclusions yet. But I happened to notice the accumulated CPU time for the background processes: USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND tgl 19048 0.0 4.4 229952 92196 ?? Ss 3:19PM 19:59.19 postgres: startup recovering 000000010000001400000022 tgl 19051 0.0 0.1 229656 1696 ?? Ss 3:19PM 27:09.14 postgres: walreceiver streaming 14/227D8F14 tgl 19052 0.0 0.1 229904 2516 ?? Ss 3:19PM 17:38.17 postgres: walsender tgl [local] streaming 14/227D8F14 IOW, we've spent over twice as many CPU cycles shipping data to the standby as we did in applying the WAL on the standby. Is this expected? I've got wal_consistency_checking = all, which is bloating the WAL volume quite a bit, but still it seems like the walsender and walreceiver have little excuse for spending more cycles per byte than the startup process. (This is testing b3ee4c503, so if Thomas' WAL changes improved efficiency of the replay process at all, the discrepancy could be even worse in HEAD.) regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 29, 2021 at 4:45 AM Tom Lane wrote: > Andres Freund writes: > > Tom, any chance you could check if your machine repros the issue before > > these commits? > > Wilco, but it'll likely take a little while to get results ... FWIW I also chewed through many megawatts trying to reproduce this on a PowerPC system in 64 bit big endian mode, with an emulator. No cigar. However, it's so slow that I didn't make it to 10 runs...
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > Tom, any chance you could check if your machine repros the issue before > these commits? Wilco, but it'll likely take a little while to get results ... regards, tom lane
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-22 13:59:58 +1200, Thomas Munro wrote: > On Thu, Apr 22, 2021 at 1:21 PM Tom Lane wrote: > > I've also tried to reproduce on 32-bit and 64-bit Intel, without > > success. So if this is real, maybe it's related to being big-endian > > hardware? But it's also quite sensitive to $dunno-what, maybe the > > history of WAL records that have already been replayed. > > Ah, that's interesting. There are a couple of sparc64 failures and a > ppc64 failure in the build farm, but I couldn't immediately spot what > was wrong with them or whether it might be related to this stuff. > > Thanks for the clues. I'll see what unusual systems I can find to try > this on FWIW, I've run 32 and 64 bit x86 through several hundred regression cycles, without hitting an issue. For a lot of them I set checkpoint_timeout to a lower value as I thought that might make it more likely to reproduce an issue. Tom, any chance you could check if your machine repros the issue before these commits? Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > On 2021-04-21 21:21:05 -0400, Tom Lane wrote: >> What I'm doing is running the core regression tests with a single >> standby (on the same machine) and wal_consistency_checking = all. > Do you run them over replication, or sequentially by storing data into > an archive? Just curious, because its so painful to run that scenario in > the replication case due to the tablespace conflicting between > primary/standby, unless one disables the tablespace tests. No, live over replication. I've been skipping the tablespace test. > Have you tried reproducing it on commits before the recent xlogreader > changes? Nope. regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 22, 2021 at 1:21 PM Tom Lane wrote: > I've also tried to reproduce on 32-bit and 64-bit Intel, without > success. So if this is real, maybe it's related to being big-endian > hardware? But it's also quite sensitive to $dunno-what, maybe the > history of WAL records that have already been replayed. Ah, that's interesting. There are a couple of sparc64 failures and a ppc64 failure in the build farm, but I couldn't immediately spot what was wrong with them or whether it might be related to this stuff. Thanks for the clues. I'll see what unusual systems I can find to try this on.
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-21 21:21:05 -0400, Tom Lane wrote: > What I'm doing is running the core regression tests with a single > standby (on the same machine) and wal_consistency_checking = all. Do you run them over replication, or sequentially by storing data into an archive? Just curious, because it's so painful to run that scenario in the replication case due to the tablespace conflicting between primary/standby, unless one disables the tablespace tests. > The other PPC machine (with no known history of trouble) is the one > that had the CRC failure I showed earlier. That one does seem to be > actual bad data in the stored WAL, because the problem was also seen > by pg_waldump, and trying to restart the standby got the same failure > again. It seems like that could also indicate an xlogreader bug that is reliably hit? Once it gets confused about record lengths or such I'd expect CRC failures... If it were actually wrong WAL contents I don't think any of the xlogreader / prefetching changes could be responsible... Have you tried reproducing it on commits before the recent xlogreader changes? commit 1d257577e08d3e598011d6850fd1025858de8c8c Author: Thomas Munro Date: 2021-04-08 23:03:43 +1200 Optionally prefetch referenced data in recovery. commit f003d9f8721b3249e4aec8a1946034579d40d42c Author: Thomas Munro Date: 2021-04-08 23:03:34 +1200 Add circular WAL decoding buffer. Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=aovoddf...@mail.gmail.com commit 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b Author: Thomas Munro Date: 2021-04-08 23:03:23 +1200 Remove read_page callback from XLogReader. Trying 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b^ is probably the most interesting bit. > I've not been able to duplicate the consistency-check failures > there. But because that machine is a laptop with a much inferior disk > drive, the speeds are enough different that it's not real surprising > if it doesn't hit the same problem. > > I've also tried to reproduce on 32-bit and 64-bit Intel, without > success. So if this is real, maybe it's related to being big-endian > hardware? But it's also quite sensitive to $dunno-what, maybe the > history of WAL records that have already been replayed. It might just be disk speed influencing how long the tests take, which in turn increases the number of times checkpoints happen during the test, increasing the number of FPIs? Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Stephen Frost writes: > On Wed, Apr 21, 2021 at 19:17 Thomas Munro wrote: >> ... Personally I think the right thing to do now is to revert it >> and re-propose for 15 early in the cycle, supported with some better >> testing infrastructure. > I tend to agree with the idea to revert it, perhaps a +0 on that, but if > others argue it should be fixed in-place, I wouldn’t complain about it. FWIW, I've so far only been able to see problems on two old PPC Macs, one of which has been known to be a bit flaky in the past. So it's possible that what I'm looking at is a hardware glitch. But it's consistent enough that I rather doubt that. What I'm doing is running the core regression tests with a single standby (on the same machine) and wal_consistency_checking = all. Fairly reproducibly (more than one run in ten), what I get on the slightly-flaky machine is consistency check failures like 2021-04-21 17:42:56.324 EDT [42286] PANIC: inconsistent page found, rel 1663/354383/357033, forknum 0, blkno 9, byte offset 2069: replay 0x00 primary 0x03 2021-04-21 17:42:56.324 EDT [42286] CONTEXT: WAL redo at 24/121C97B0 for Heap/INSERT: off 107 flags 0x00; blkref #0: rel 1663/354383/357033, blk 9 FPW 2021-04-21 17:45:11.662 EDT [42284] LOG: startup process (PID 42286) was terminated by signal 6: Abort trap 2021-04-21 11:25:30.091 EDT [38891] PANIC: inconsistent page found, rel 1663/229880/237980, forknum 0, blkno 108, byte offset 3845: replay 0x00 primary 0x99 2021-04-21 11:25:30.091 EDT [38891] CONTEXT: WAL redo at 17/A99897FC for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/229880/237980, blk 108 FPW 2021-04-21 11:26:59.371 EDT [38889] LOG: startup process (PID 38891) was terminated by signal 6: Abort trap 2021-04-20 19:20:16.114 EDT [34405] PANIC: inconsistent page found, rel 1663/189216/197311, forknum 0, blkno 115, byte offset 6149: replay 0x37 primary 0x03 2021-04-20 19:20:16.114 EDT [34405] CONTEXT: WAL redo at 13/3CBFED00 for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/189216/197311, blk 115 FPW 2021-04-20 19:21:54.421 EDT [34403] LOG: startup process (PID 34405) was terminated by signal 6: Abort trap 2021-04-20 17:44:09.356 EDT [24106] FATAL: inconsistent page found, rel 1663/135419/143843, forknum 0, blkno 101, byte offset 6152: replay 0x40 primary 0x00 2021-04-20 17:44:09.356 EDT [24106] CONTEXT: WAL redo at D/5107D8A8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/135419/143843, blk 101 FPW (Note I modified checkXLogConsistency to PANIC on failure, so I could get a core dump to analyze; and it's also printing the first-mismatch location.) I have not analyzed each one of these failures exhaustively, but on the ones I have looked at closely, the replay_image_masked version of the page appears correct while the primary_image_masked version is *not*. Moreover, the primary_image_masked version does not match the full-page image that I see in the on-disk WAL file. It did however seem to match the in-memory WAL record contents that the decoder is operating on. So unless you want to believe the buggy-hardware theory, something's occasionally messing up while loading WAL records from disk. All of the trouble cases involve records that span across WAL pages (unsurprising since they contain FPIs), so maybe there's something not quite right in there. In the cases that I looked at closely, it appeared that there was a block of 32 wrong bytes somewhere within the page image, with the data before and after that being correct. 
I'm not sure if that pattern holds in all cases though. BTW, if I restart the failed standby, it plows through the same data just fine, confirming that the on-disk WAL is not corrupt. The other PPC machine (with no known history of trouble) is the one that had the CRC failure I showed earlier. That one does seem to be actual bad data in the stored WAL, because the problem was also seen by pg_waldump, and trying to restart the standby got the same failure again. I've not been able to duplicate the consistency-check failures there. But because that machine is a laptop with a much inferior disk drive, the speeds are enough different that it's not real surprising if it doesn't hit the same problem. I've also tried to reproduce on 32-bit and 64-bit Intel, without success. So if this is real, maybe it's related to being big-endian hardware? But it's also quite sensitive to $dunno-what, maybe the history of WAL records that have already been replayed. regards, tom lane
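For reference, the kind of instrumentation Tom describes (his actual patch is not posted in the thread) can be sketched as a byte-wise comparison in checkXLogConsistency() that promotes the error to PANIC, matching the log lines above. Variable names follow the PG14-era function, but this is an illustration rather than his diff:

    /* Sketch: report the first differing byte and PANIC so a core dump is left behind. */
    for (int i = 0; i < BLCKSZ; i++)
    {
        if (replay_image_masked[i] != primary_image_masked[i])
            elog(PANIC,
                 "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u, "
                 "byte offset %d: replay 0x%02x primary 0x%02x",
                 rnode.spcNode, rnode.dbNode, rnode.relNode, forknum, blkno,
                 i,
                 (unsigned char) replay_image_masked[i],
                 (unsigned char) primary_image_masked[i]);
    }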
Re: WIP: WAL prefetch (another approach)
Greetings, On Wed, Apr 21, 2021 at 19:17 Thomas Munro wrote: > On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro > wrote: > > That wasn't my plan, but I admit that the timing was non-ideal. In > > any case, I'll dig into these failures and then consider options. > > More soon. > > Yeah, this clearly needs more work. xlogreader.c is difficult to work > with and I think we need to keep trying to improve it, but I made a > bad call here trying to combine this with other refactoring work up > against a deadline and I made some dumb mistakes. I could of course > debug it in-tree, and I know that this has been an anticipated > feature. Personally I think the right thing to do now is to revert it > and re-propose for 15 early in the cycle, supported with some better > testing infrastructure. I tend to agree with the idea to revert it, perhaps a +0 on that, but if others argue it should be fixed in-place, I wouldn’t complain about it. I very much encourage the idea of improving testing in this area and would be happy to try and help do so in the 15 cycle. Thanks, Stephen >
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro wrote: > That wasn't my plan, but I admit that the timing was non-ideal. In > any case, I'll dig into these failures and then consider options. > More soon. Yeah, this clearly needs more work. xlogreader.c is difficult to work with and I think we need to keep trying to improve it, but I made a bad call here trying to combine this with other refactoring work up against a deadline and I made some dumb mistakes. I could of course debug it in-tree, and I know that this has been an anticipated feature. Personally I think the right thing to do now is to revert it and re-propose for 15 early in the cycle, supported with some better testing infrastructure.
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 22, 2021 at 8:07 AM Tomas Vondra wrote: > On 4/21/21 6:30 PM, Tom Lane wrote: > > Thomas Munro writes: > >> Yeah, it would have been nice to include that but it'll have to be for > >> v15 due to lack of time to convince myself that it was correct. I do > >> intend to look into more concurrency of that kind for v15. I have > >> pushed these patches, updated to be disabled by default. > > > > I have a fairly bad feeling about these patches. I've already fixed > > one critical bug (see 9e4114822), but I am still seeing random, hard > > to reproduce failures in WAL replay testing. It looks like sometimes > > the "decoded" version of a WAL record doesn't match what I see in > > the on-disk data, which I'm having no luck tracing down. Ugh. Looking into this now. Also, this week I have been researching a possible problem with eg ALTER TABLE SET TABLESPACE in the higher level patch, which I'll write about soon. > > I am not sure whether the checksum failure itself is real or a variant > > of the seeming bad-reconstruction problem, but what I'm on about right > > at this moment is that the error handling logic for this case seems > > quite broken. Why is a checksum failure only worthy of a LOG message? > > Why is ValidXLogRecord() issuing a log message for itself, rather than > > being tied into the report_invalid_record() mechanism? Why are we > > evidently still trying to decode records afterwards? > > Yeah, that seems suspicious. I may have invited trouble by deciding to rebase on the other proposal late in the cycle. That interfaces around there. > > In general, I'm not too pleased with the apparent attitude in this > > thread that it's okay to push a patch that only mostly works on the > > last day of the dev cycle and plan to stabilize it later. > > Was there such attitude? I don't think people were arguing for pushing a > patch's not working correctly. The discussion was mostly about getting > it committed even and leaving some optimizations for v15. That wasn't my plan, but I admit that the timing was non-ideal. In any case, I'll dig into these failures and then consider options. More soon.
Re: WIP: WAL prefetch (another approach)
On 4/21/21 6:30 PM, Tom Lane wrote: > Thomas Munro writes: >> Yeah, it would have been nice to include that but it'll have to be for >> v15 due to lack of time to convince myself that it was correct. I do >> intend to look into more concurrency of that kind for v15. I have >> pushed these patches, updated to be disabled by default. > > I have a fairly bad feeling about these patches. I've already fixed > one critical bug (see 9e4114822), but I am still seeing random, hard > to reproduce failures in WAL replay testing. It looks like sometimes > the "decoded" version of a WAL record doesn't match what I see in > the on-disk data, which I'm having no luck tracing down. > > Another interesting failure I just came across is > > 2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data > checksum in record at F/438000A4 > TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, > PID: 14606) > 2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was > terminated by signal 6: Abort trap > > with stack trace > > #0 0x90b669f0 in kill () > #1 0x90c01bfc in abort () > #2 0x0057a6a0 in ExceptionalCondition (conditionName= unavailable, due to optimizations>, errorType= due to optimizations>, fileName= optimizations>, lineNumber= optimizations>) at assert.c:69 > #3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 > '\001') at xlogreader.c:845 > #4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, > errormsg=0xbfffba9c) at xlogreader.c:466 > #5 0x000f695c in XLogReadRecord (state= to optimizations>, record=0xbfffba98, errormsg= unavailable, due to optimizations>) at xlogreader.c:352 > #6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 > '\0') at xlog.c:4398 > #7 0x000ea320 in StartupXLOG () at xlog.c:7567 > #8 0x00362218 in StartupProcessMain () at startup.c:244 > #9 0x000fc170 in AuxiliaryProcessMain (argc= due to optimizations>, argv= optimizations>) at bootstrap.c:447 > #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439 > #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406 > #12 0x0029737c in main (argc= optimizations>, argv=) > at main.c:209 > > > I am not sure whether the checksum failure itself is real or a variant > of the seeming bad-reconstruction problem, but what I'm on about right > at this moment is that the error handling logic for this case seems > quite broken. Why is a checksum failure only worthy of a LOG message? > Why is ValidXLogRecord() issuing a log message for itself, rather than > being tied into the report_invalid_record() mechanism? Why are we > evidently still trying to decode records afterwards? > Yeah, that seems suspicious. > In general, I'm not too pleased with the apparent attitude in this > thread that it's okay to push a patch that only mostly works on the > last day of the dev cycle and plan to stabilize it later. > Was there such attitude? I don't think people were arguing for pushing a patch's not working correctly. The discussion was mostly about getting it committed even and leaving some optimizations for v15. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > Yeah, it would have been nice to include that but it'll have to be for > v15 due to lack of time to convince myself that it was correct. I do > intend to look into more concurrency of that kind for v15. I have > pushed these patches, updated to be disabled by default. I have a fairly bad feeling about these patches. I've already fixed one critical bug (see 9e4114822), but I am still seeing random, hard to reproduce failures in WAL replay testing. It looks like sometimes the "decoded" version of a WAL record doesn't match what I see in the on-disk data, which I'm having no luck tracing down. Another interesting failure I just came across is 2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data checksum in record at F/438000A4 TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, PID: 14606) 2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was terminated by signal 6: Abort trap with stack trace #0 0x90b669f0 in kill () #1 0x90c01bfc in abort () #2 0x0057a6a0 in ExceptionalCondition (conditionName=, errorType=, fileName=, lineNumber=) at assert.c:69 #3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845 #4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466 #5 0x000f695c in XLogReadRecord (state=, record=0xbfffba98, errormsg=) at xlogreader.c:352 #6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398 #7 0x000ea320 in StartupXLOG () at xlog.c:7567 #8 0x00362218 in StartupProcessMain () at startup.c:244 #9 0x000fc170 in AuxiliaryProcessMain (argc=, argv=) at bootstrap.c:447 #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439 #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406 #12 0x0029737c in main (argc=, argv=) at main.c:209 I am not sure whether the checksum failure itself is real or a variant of the seeming bad-reconstruction problem, but what I'm on about right at this moment is that the error handling logic for this case seems quite broken. Why is a checksum failure only worthy of a LOG message? Why is ValidXLogRecord() issuing a log message for itself, rather than being tied into the report_invalid_record() mechanism? Why are we evidently still trying to decode records afterwards? In general, I'm not too pleased with the apparent attitude in this thread that it's okay to push a patch that only mostly works on the last day of the dev cycle and plan to stabilize it later. regards, tom lane
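For what it's worth, the pre-existing xlogreader convention Tom is pointing at looks roughly like the following inside ValidXLogRecord(): the CRC failure is routed through report_invalid_record() and returned as a hard failure, rather than being logged locally while decoding continues. This is a sketch; the exact code differs:

    FIN_CRC32C(crc);
    if (!EQ_CRC32C(record->xl_crc, crc))
    {
        report_invalid_record(state,
                              "incorrect resource manager data checksum in record at %X/%X",
                              LSN_FORMAT_ARGS(recptr));
        return false;   /* caller must treat the record as invalid, not keep decoding */
    }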
Re: WIP: WAL prefetch (another approach)
On Sat, Apr 10, 2021 at 2:16 AM Thomas Munro wrote: > In commit 1d257577e08d3e598011d6850fd1025858de8c8c, there is a change in file format for stats, won't it require bumping PGSTAT_FILE_FORMAT_ID? Actually, I came across this while working on my today's commit f5fc2f5b23 where I forgot to bump PGSTAT_FILE_FORMAT_ID. So, I thought maybe we can bump it just once if required? -- With Regards, Amit Kapila.
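For readers unfamiliar with the convention Amit refers to: any change to the on-disk layout of the statistics file is supposed to be accompanied by a bump of this constant, so that older stats files are discarded rather than misread. The kind of one-line change involved looks like this (the hex value below is a placeholder, not the real one):

    /* src/include/pgstat.h, illustrative value only */
    #define PGSTAT_FILE_FORMAT_ID   0x01A5BC90  /* bump whenever the stats file layout changes */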
RE: WIP: WAL prefetch (another approach)
Hi, Thank you for developing a great feature. I tested this feature and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view. https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION It is also not displayed in the list of "28.2. The Statistics Collector". https://www.postgresql.org/docs/devel/monitoring.html The attached patch modifies the pg_stat_prefetch_recovery view to appear as a separate view. Regards, Noriyoshi Shinoda -Original Message- From: Thomas Munro [mailto:thomas.mu...@gmail.com] Sent: Saturday, April 10, 2021 5:46 AM To: Justin Pryzby Cc: Tomas Vondra ; Stephen Frost ; Andres Freund ; Jakub Wartak ; Alvaro Herrera ; Tomas Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; David Steele ; pgsql-hackers Subject: Re: WIP: WAL prefetch (another approach) On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby wrote: > Did you see this? > https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM > > I meant to mail you so you could include it in the same commit, but > forgot until now. Done, thanks.
Re: WIP: WAL prefetch (another approach)
On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby wrote: > Did you see this? > https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM > > I meant to mail you so you could include it in the same commit, but forgot > until now. Done, thanks.
Re: WIP: WAL prefetch (another approach)
On Sat, Apr 10, 2021 at 08:27:42AM +1200, Thomas Munro wrote: > On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby wrote: > > Here's some little language fixes. > > Thanks! Done. I rewrote the gibberish comment that made you say > "XXX: what?". Pushed. > > > BTW, before beginning "recovery", PG syncs all the data dirs. > > This can be slow, and it seems like the slowness is frequently due to file > > metadata. For example, that's an obvious consequence of an OS crash, after > > which the page cache is empty. I've made a habit of running find /zfs -ls > > |wc > > to pre-warm it, which can take a little bit, but then the recovery process > > starts moments later. I don't have any timing measurements, but I expect > > that > > starting to stat() all data files as soon as possible would be a win. > > Did you see commit 61752afb, "Provide > recovery_init_sync_method=syncfs"? Actually I believe it's safe to > skip that phase completely and do a tiny bit more work during > recovery, which I'd like to work on for v15[1]. Yes, I have it in my list for v14 deployment. Thanks for that. Did you see this? https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM I meant to mail you so you could include it in the same commit, but forgot until now. -- Justin
Re: WIP: WAL prefetch (another approach)
On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby wrote: > Here's some little language fixes. Thanks! Done. I rewrote the gibberish comment that made you say "XXX: what?". Pushed. > BTW, before beginning "recovery", PG syncs all the data dirs. > This can be slow, and it seems like the slowness is frequently due to file > metadata. For example, that's an obvious consequence of an OS crash, after > which the page cache is empty. I've made a habit of running find /zfs -ls |wc > to pre-warm it, which can take a little bit, but then the recovery process > starts moments later. I don't have any timing measurements, but I expect that > starting to stat() all data files as soon as possible would be a win. Did you see commit 61752afb, "Provide recovery_init_sync_method=syncfs"? Actually I believe it's safe to skip that phase completely and do a tiny bit more work during recovery, which I'd like to work on for v15[1]. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B8Wm8TSfMWPteMEHfh194RytVTBNoOkggTQT1p5NTY7Q%40mail.gmail.com
Re: WIP: WAL prefetch (another approach)
Here's some little language fixes. BTW, before beginning "recovery", PG syncs all the data dirs. This can be slow, and it seems like the slowness is frequently due to file metadata. For example, that's an obvious consequence of an OS crash, after which the page cache is empty. I've made a habit of running find /zfs -ls |wc to pre-warm it, which can take a little bit, but then the recovery process starts moments later. I don't have any timing measurements, but I expect that starting to stat() all data files as soon as possible would be a win. commit cc9707de333fe8242607cde9f777beadc68dbf04 Author: Justin Pryzby Date: Thu Apr 8 10:43:14 2021 -0500 WIP: doc review: Optionally prefetch referenced data in recovery. 1d257577e08d3e598011d6850fd1025858de8c8c diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bc4a8b2279..139dee7aa2 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3621,7 +3621,7 @@ include_dir 'conf.d' pool after that. However, on file systems with a block size larger than PostgreSQL's, prefetching can avoid a -costly read-before-write when a blocks are later written. +costly read-before-write when blocks are later written. The default is off. diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml index 24cf567ee2..36e00c92c2 100644 --- a/doc/src/sgml/wal.sgml +++ b/doc/src/sgml/wal.sgml @@ -816,9 +816,7 @@ prefetching mechanism is most likely to be effective on systems with full_page_writes set to off (where that is safe), and where the working - set is larger than RAM. By default, prefetching in recovery is enabled - on operating systems that have posix_fadvise - support. + set is larger than RAM. By default, prefetching in recovery is disabled. diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c index 28764326bc..363c079964 100644 --- a/src/backend/access/transam/xlogprefetch.c +++ b/src/backend/access/transam/xlogprefetch.c @@ -31,7 +31,7 @@ * stall; this is counted with "skip_fpw". * * The only way we currently have to know that an I/O initiated with - * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(), + * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(), XXX: what ?? * and perform a synchronous read. Therefore, we track the number of * potentially in-flight I/Os by using a circular buffer of LSNs. When it's * full, we have to wait for recovery to replay records so that the queue @@ -660,7 +660,7 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher) /* * I/O has possibly been initiated (though we don't know if it was * already cached by the kernel, so we just have to assume that it -* has due to lack of better information). Record this as an I/O +* was due to lack of better information). Record this as an I/O * in progress until eventually we replay this LSN. 
*/ XLogPrefetchIncrement(&SharedStats->prefetch); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 090abdad8b..8c72ba1f1a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2774,7 +2774,7 @@ static struct config_int ConfigureNamesInt[] = { {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY, gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."), - gettext_noop("This controls the maximum distance we can read ahead n the WAL to prefetch referenced blocks."), + gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."), GUC_UNIT_BYTE }, &wal_decode_buffer_size,
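As an aside, the pre-warming trick Justin describes above (walking the data directory so the first pass over file metadata is cheap) can be expressed as a tiny standalone program. This is only an illustration of the idea, not anything proposed for the tree:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <ftw.h>

    /* ftw() stat()s every entry it visits, which is exactly the metadata
     * warm-up we want; the callback itself has nothing to do. */
    static int
    prewarm_one(const char *path, const struct stat *sb, int typeflag)
    {
        (void) path;
        (void) sb;
        (void) typeflag;
        return 0;
    }

    int
    main(int argc, char **argv)
    {
        if (argc != 2)
        {
            fprintf(stderr, "usage: %s <data-directory>\n", argv[0]);
            return 1;
        }
        return ftw(argv[1], prewarm_one, 64) != 0;
    }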
Re: WIP: WAL prefetch (another approach)
On 4/8/21 1:46 PM, Thomas Munro wrote: > On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra > wrote: >> On 4/7/21 1:24 PM, Thomas Munro wrote: >>> I made one other simplifying change: previously, the prefetch module >>> would read the WAL up to the "written" LSN (so, allowing itself to >>> read data that had been written but not yet flushed to disk by the >>> walreceiver), though it still waited until a record's LSN was >>> "flushed" before replaying. That allowed prefetching to happen >>> concurrently with the WAL flush, which was nice, but it felt a little >>> too "special". I decided to remove that part for now, and I plan to >>> look into making standbys work more like primary servers, using WAL >>> buffers, the WAL writer and optionally the standard log-before-data >>> rule. >> >> Not sure, but the removal seems unnecessary. I'm worried that this will >> significantly reduce the amount of data that we'll be able to prefetch. >> How likely it is that we have data that is written but not flushed? >> Let's assume the replica is lagging and network bandwidth is not the >> bottleneck - how likely is this "has to be flushed" a limit for the >> prefetching? > > Yeah, it would have been nice to include that but it'll have to be for > v15 due to lack of time to convince myself that it was correct. I do > intend to look into more concurrency of that kind for v15. I have > pushed these patches, updated to be disabled by default. I will look > into how I can run a BF animal that has it enabled during the recovery > tests for coverage. Thanks very much to everyone on this thread for > all the discussion and testing so far. > OK, understood. I'll rerun the benchmarks on this version, and if there's a significant negative impact we can look into that during the stabilization phase. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra wrote: > On 4/7/21 1:24 PM, Thomas Munro wrote: > > I made one other simplifying change: previously, the prefetch module > > would read the WAL up to the "written" LSN (so, allowing itself to > > read data that had been written but not yet flushed to disk by the > > walreceiver), though it still waited until a record's LSN was > > "flushed" before replaying. That allowed prefetching to happen > > concurrently with the WAL flush, which was nice, but it felt a little > > too "special". I decided to remove that part for now, and I plan to > > look into making standbys work more like primary servers, using WAL > > buffers, the WAL writer and optionally the standard log-before-data > > rule. > > Not sure, but the removal seems unnecessary. I'm worried that this will > significantly reduce the amount of data that we'll be able to prefetch. > How likely it is that we have data that is written but not flushed? > Let's assume the replica is lagging and network bandwidth is not the > bottleneck - how likely is this "has to be flushed" a limit for the > prefetching? Yeah, it would have been nice to include that but it'll have to be for v15 due to lack of time to convince myself that it was correct. I do intend to look into more concurrency of that kind for v15. I have pushed these patches, updated to be disabled by default. I will look into how I can run a BF animal that has it enabled during the recovery tests for coverage. Thanks very much to everyone on this thread for all the discussion and testing so far.
Re: WIP: WAL prefetch (another approach)
On 4/7/21 1:24 PM, Thomas Munro wrote: > Here's rebase, on top of Horiguchi-san's v19 patch set. My patches > start at 0007. Previously, there was a "nowait" flag that was passed > into all the callbacks so that XLogReader could wait for new WAL in > some cases but not others. This new version uses the proposed > XLREAD_NEED_DATA protocol, and the caller deals with waiting for data > to arrive when appropriate. This seems tidier to me. > OK, seems reasonable. > I made one other simplifying change: previously, the prefetch module > would read the WAL up to the "written" LSN (so, allowing itself to > read data that had been written but not yet flushed to disk by the > walreceiver), though it still waited until a record's LSN was > "flushed" before replaying. That allowed prefetching to happen > concurrently with the WAL flush, which was nice, but it felt a little > too "special". I decided to remove that part for now, and I plan to > look into making standbys work more like primary servers, using WAL > buffers, the WAL writer and optionally the standard log-before-data > rule. > Not sure, but the removal seems unnecessary. I'm worried that this will significantly reduce the amount of data that we'll be able to prefetch. How likely it is that we have data that is written but not flushed? Let's assume the replica is lagging and network bandwidth is not the bottleneck - how likely is this "has to be flushed" a limit for the prefetching? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
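To make the written-versus-flushed distinction concrete: the walreceiver tracks both positions, and the question above is which one the decoder is allowed to read up to. A minimal sketch, assuming the PG14-era walreceiver functions (treat the exact names and signatures as an assumption):

    XLogRecPtr  read_upto;

    /* Committed behaviour: only decode WAL the walreceiver has already fsync'd. */
    read_upto = GetWalRcvFlushRecPtr(NULL, NULL);

    /* The removed behaviour was closer to: decode WAL that has merely been
     * written, while still waiting for the flush before replaying it. */
    /* read_upto = GetWalRcvWriteRecPtr(); */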
Re: WIP: WAL prefetch (another approach)
On Fri, Mar 19, 2021 at 2:29 PM Tomas Vondra wrote: > On 3/18/21 1:54 AM, Thomas Munro wrote: > > I'm now looking at Horiguchi-san and Heikki's patch[2] to remove > > XLogReader's callbacks, to try to understand how these two patch sets > > are related. I don't really like the way those callbacks work, and > > I'm afraid had to make them more complicated. But I don't yet know > > very much about that other patch set. More soon. > > OK. Do you think we should get both of those patches in, or do we need > to commit them in a particular order? Or what is your concern? I would like to commit the callback-removal patch first, and then the WAL decoder and prefetcher patches become simpler and cleaner on top of that. I will post the rebase and explanation shortly.
Re: WIP: WAL prefetch (another approach)
On 3/18/21 1:54 AM, Thomas Munro wrote: > On Thu, Mar 18, 2021 at 12:00 PM Tomas Vondra > wrote: >> On 3/17/21 10:43 PM, Stephen Frost wrote: >>> Guess I'm just not a fan of pushing out a change that will impact >>> everyone by default, in a possibly negative way (or positive, though >>> that doesn't seem terribly likely, but who knows), without actually >>> measuring what that impact will look like in those more common cases. >>> Showing that it's a great win when you're on ZFS or running with FPWs >>> disabled is good and the expected best case, but we should be >>> considering the worst case too when it comes to performance >>> improvements. >>> >> >> Well, maybe it'll behave differently on systems with ZFS. I don't know, >> and I have no such machine to test that at the moment. My argument >> however remains the same - if if happens to be a problem, just don't >> enable (or disable) the prefetching, and you get the current behavior. > > I see the road map for this feature being to get it working on every > OS via the AIO patchset, in later work, hopefully not very far in the > future (in the most portable mode, you get I/O worker processes doing > pread() or preadv() calls on behalf of recovery). So I'll be glad to > get this infrastructure in, even though it's maybe only useful for > some people in the first release. > +1 to that >> FWIW I'm not sure there was a discussion or argument about what should >> be the default setting (enabled or disabled). I'm fine with not enabling >> this by default, so that people have to enable it explicitly. >> >> In a way that'd be consistent with effective_io_concurrency being 1 by >> default, which almost disables regular prefetching. > > Yeah, I'm not sure but I'd be fine with disabling it by default in the > initial release. The current patch set has it enabled, but that's > mostly for testing, it's not an opinion on how it should ship. > +1 to that too. Better to have it disabled by default than not at all. > I've attached a rebased patch set with a couple of small changes: > > 1. I abandoned the patch that proposed > pg_atomic_unlocked_add_fetch_u{32,64}() and went for a simple function > local to xlogprefetch.c that just does pg_atomic_write_u64(counter, > pg_atomic_read_u64(counter) + 1), in response to complaints from > Andres[1]. > > 2. I fixed a bug in ReadRecentBuffer(), and moved it into its own > patch for separate review. > > I'm now looking at Horiguchi-san and Heikki's patch[2] to remove > XLogReader's callbacks, to try to understand how these two patch sets > are related. I don't really like the way those callbacks work, and > I'm afraid had to make them more complicated. But I don't yet know > very much about that other patch set. More soon. > OK. Do you think we should get both of those patches in, or do we need to commit them in a particular order? Or what is your concern? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Hi, On 3/17/21 10:43 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: >> Right, I was just going to point out the FPIs are not necessary - what >> matters is the presence of long streaks of WAL records touching the same >> set of blocks. But people with workloads where this is common likely >> don't need the WAL prefetching at all - the replica can keep up just >> fine, because it doesn't need to do much I/O anyway (and if it can't >> then prefetching won't help much anyway). So just don't enable the >> prefetching, and there'll be no overhead. > > Isn't this exactly the common case though..? Checkpoints happening > every 5 minutes, the replay of the FPI happens first and then the record > is updated and everything's in SB for the later changes? Well, as I said before, the FPIs are not very significant - you'll have mostly the same issue with any repeated changes to the same block. It does not matter much if you do FPI for block 1 WAL record for block 2 WAL record for block 1 WAL record for block 2 WAL record for block 1 or just WAL record for block 1 WAL record for block 2 WAL record for block 1 WAL record for block 2 WAL record for block 1 In both cases some of the prefetches are probably unnecessary. But the frequency of checkpoints does not really matter, the important bit is repeated changes to the same block(s). If you have active set much larger than RAM, this is quite unlikely. And we know from the pgbench tests that prefetching has a huge positive effect in this case. On smaller active sets, with frequent updates to the same block, we may issue unnecessary prefetches - that's true. But (a) you have not shown any numbers suggesting this is actually an issue, and (b) those cases don't really need prefetching because all the data is already either in shared buffers or in page cache. So if it happens to be an issue, the user can simply disable it. So how exactly would a problematic workload look like? > You mentioned elsewhere that this would improve 80% of cases but that > doesn't seem to be backed up by anything and certainly doesn't seem > likely to be the case if we're talking about across all PG > deployments. Obviously, the 80% was just a figure of speech, illustrating my belief that the proposed patch is beneficial for most users who currently have issues with replication lag. That is based on my experience with support customers who have such issues - it's almost invariably an OLTP workload with large active set, and we know (from the benchmarks) that in these cases it helps. Users who don't have issues with replication lag can disable (or not enable) the prefetching, and won't get any negative effects. Perhaps there are users with weird workloads that have replication lag issues but this patch won't help them - bummer, we can't solve everything in one go. Also, no one actually demonstrated such workload in this thread so far. But as you're suggesting we don't have data to support the claim that this actually helps many users (with no risk to others), I'd point out you have not actually provided any numbers showing that it actually is an issue in practice. > I also disagree that asking the kernel to go do random I/O for us, > even as a prefetch, is entirely free simply because we won't > actually need those pages. At the least, it potentially pushes out > pages that we might need shortly from the filesystem cache, no? Where exactly did I say it's free? 
I said that workloads where this happens a lot most likely don't need the prefetching at all, so it can be simply disabled, eliminating all negative effects. Moreover, looking at a limited number of recently prefetched blocks won't eliminate this problem anyway - imagine a random OLTP on large data set that however fits into RAM. After a while no read I/O needs to be done, but you'd need pretty much infinite list of prefetched blocks to eliminate that, and with smaller lists you'll still do 99% of the prefetches. Just disabling prefetching on such instances seems quite reasonable. >> If it was up to me, I'd just get the patch committed as is. Delaying the >> feature because of concerns that it might have some negative effect in >> some cases, when that can be simply mitigated by disabling the feature, >> is not really beneficial for our users. > > I don't know that we actually know how many cases it might have a > negative effect on or what the actual amount of such negative case there > might be- that's really why we should probably try to actually benchmark > it and get real numbers behind it, particularly when the chances of > running into such a negative effect with the default configuration (that > is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more > likely to occur in the field than the cases where FPWs are disabled and > someone's running on ZFS. > > Perhaps more to the point, it'd be nice to see how this
Re: WIP: WAL prefetch (another approach)
Greetings, * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: > Right, I was just going to point out the FPIs are not necessary - what > matters is the presence of long streaks of WAL records touching the same > set of blocks. But people with workloads where this is common likely > don't need the WAL prefetching at all - the replica can keep up just > fine, because it doesn't need to do much I/O anyway (and if it can't > then prefetching won't help much anyway). So just don't enable the > prefetching, and there'll be no overhead. Isn't this exactly the common case though..? Checkpoints happening every 5 minutes, the replay of the FPI happens first and then the record is updated and everything's in SB for the later changes? You mentioned elsewhere that this would improve 80% of cases but that doesn't seem to be backed up by anything and certainly doesn't seem likely to be the case if we're talking about across all PG deployments. I also disagree that asking the kernel to go do random I/O for us, even as a prefetch, is entirely free simply because we won't actually need those pages. At the least, it potentially pushes out pages that we might need shortly from the filesystem cache, no? > If it was up to me, I'd just get the patch committed as is. Delaying the > feature because of concerns that it might have some negative effect in > some cases, when that can be simply mitigated by disabling the feature, > is not really beneficial for our users. I don't know that we actually know how many cases it might have a negative effect on or what the actual amount of such negative case there might be- that's really why we should probably try to actually benchmark it and get real numbers behind it, particularly when the chances of running into such a negative effect with the default configuration (that is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more likely to occur in the field than the cases where FPWs are disabled and someone's running on ZFS. Perhaps more to the point, it'd be nice to see how this change actually improves the caes where PG is running with more-or-less the defaults on the more commonly deployed filesystems. If it doesn't then maybe it shouldn't be the default..? Surely the folks running on ZFS and running with FPWs disabled would be able to manage to enable it if they wished to and we could avoid entirely the question of if this has a negative impact on the more common cases. Guess I'm just not a fan of pushing out a change that will impact everyone by default, in a possibly negative way (or positive, though that doesn't seem terribly likely, but who knows), without actually measuring what that impact will look like in those more common cases. Showing that it's a great win when you're on ZFS or running with FPWs disabled is good and the expected best case, but we should be considering the worst case too when it comes to performance improvements. Anyhow, ultimately I don't know that there's much more to discuss on this thread with regard to this particular topic, at least. As I said before, if everyone else is on board and not worried about it then so be it; I feel that at least the concern that I raised has been heard. Thanks, Stephen signature.asc Description: PGP signature
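The mitigation floated for these repeated-prefetch cases, Tomas's small lossy table of recently seen blocks quoted in the older messages below, could look something like this. Every name and sizing choice here is invented for illustration, and it assumes PostgreSQL backend types (RelFileNode, BlockNumber, XLogRecPtr):

    /* Lossy table of recently prefetched / recently FPI'd blocks. */
    #define RECENT_BLOCK_SLOTS 64   /* stand-in for 2 * maintenance_io_concurrency */

    typedef struct RecentBlock
    {
        RelFileNode rnode;
        BlockNumber blkno;
        XLogRecPtr  lsn;        /* LSN at which we last prefetched or saw an FPI */
    } RecentBlock;

    static RecentBlock recent_blocks[RECENT_BLOCK_SLOTS];

    static inline uint32
    recent_block_slot(RelFileNode rnode, BlockNumber blkno)
    {
        return (rnode.relNode * 0x9E3779B1u + blkno) % RECENT_BLOCK_SLOTS;
    }

    /* Remember a block when we prefetch it or see an FPI for it. */
    static void
    recent_block_remember(RelFileNode rnode, BlockNumber blkno, XLogRecPtr lsn)
    {
        RecentBlock *slot = &recent_blocks[recent_block_slot(rnode, blkno)];

        slot->rnode = rnode;
        slot->blkno = blkno;
        slot->lsn = lsn;
    }

    /* Skip the prefetch if we saw this block recently enough. */
    static bool
    recent_block_seen(RelFileNode rnode, BlockNumber blkno,
                      XLogRecPtr lsn, uint64 max_distance)
    {
        RecentBlock *slot = &recent_blocks[recent_block_slot(rnode, blkno)];

        return RelFileNodeEquals(slot->rnode, rnode) &&
            slot->blkno == blkno &&
            lsn - slot->lsn <= max_distance;
    }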
Re: WIP: WAL prefetch (another approach)
On 2/15/21 12:18 AM, Stephen Frost wrote: > Greetings, > > ... > I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately. >>> >>> This seems like a great optimization, albeit a fair bit of code, for a >>> relatively uncommon use-case, specifically where full page writes are >>> disabled or very large checkpoints. As that's the case though, I would >>> think it's reasonable to ask that it go out of its way to avoid slowing >>> down the more common configurations, particularly since it's proposed to >>> have it on by default (which I agree with, provided it ends up improving >>> the common cases, which I think the suggestions above would certainly >>> make it more likely to do). >> >> I'm OK to do some benchmarking, but it's not quite clear to me why does it >> matter if the checkpoints are smaller than shared buffers? IMO what matters >> is how "localized" the updates are, i.e. how likely it is to hit the same >> page repeatedly (in a short amount of time). Regular pgbench is not very >> suitable for that, but some non-uniform distribution should do the trick, I >> think. > > I suppose strictly speaking it'd be > Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that > it's more about the wal_decode_buffer_size than the checkpoint's size. > Apologies for the confusion. As suggested above, one way to benchmark > this to really see if there's any issue would be to increase > wal_decode_buffer_size to some pretty big size and then compare the > performance vs. unpatched. I'd think that could even be done with > pgbench, so you're not having to arrange for the same pages to get > updated over and over. > What exactly would be the point of such benchmark? I don't think the patch does prefetching based on wal_decode_buffer_size, that just says how far ahead we decode - the prefetch distance I is defined by maintenance_io_concurrency. But it's not clear to me what exactly would the result say about the necessity of the optimization at hand (skipping prefetches for blocks with recent FPI). If the the maintenance_io_concurrency is very high, the probability that a block is evicted prematurely grows, making the prefetch useless in general. How does this say anything about the problem at hand? Sure, we'll do unnecessary I/O, causing issues, but that's a bit like complaining the engine gets very hot when driving on a highway in reverse. AFAICS to measure the worst case, you'd need a workload with a lot of FPIs, and very little actual I/O. That means, data set that fits into memory (either shared buffers or RAM), and short checkpoints. But that's exactly the case where you don't need prefetching ... >>> Perhaps this already improves the common cases and is worth the extra >>> code on that basis, but I don't recall seeing much in the way of >>> benchmarking in this thread for that case- that is, where FPIs are >>> enabled and checkpoints are smaller than shared buffers. Jakub's >>> testing was done with FPWs disabled and Tomas's testing used checkpoints >>> which were much larger than the size of shared buffers on the system >>> doing the replay. While it's certainly good that this patch improves >>> those cases, we should also be looking out for the worst case and make >>> sure that the patch doesn't degrade performance in that case. >> >> I'm with Andres on this. 
It's fine to leave some possible optimizations on >> the table for the future. And even if some workloads are affected >> negatively, it's still possible to disable the prefetching. > > While I'm generally in favor of this argument, that a feature is > particularly important and that it's worth slowing down the common cases > to enable it, I dislike that it's applied inconsistently. I'd certainly If you have a workload where this happens to cause issues, you can just disable that. IMHO that's a perfectly reasonable engineering approach, where we get something that significantly improves 80% of the cases, allow disabling it for cases where it might cause issues, and then improve it in the next version. > feel better about it if we had actual performance numbers to consider. > I don't doubt the possibility that the extra prefetch's just don't > amount to enough to matter but I have a hard time seeing them as not > having some cost and without actually measuring it, it's hard to say > what that cost is. > > Without looking farther back than the last record, we could end up > repeatedly asking for the same blocks to be prefetched too- > > FPI for block 1 > FPI for block 2 > WAL record for block 1 > WAL record for block 2 > WAL record for block 1 > WAL record for block 2 > WAL record for block 1 > WAL record for block 2 > > ... etc. > > Entirely possible my math is off, but seems like the worst case > situation right now might end up with some 4500 u
Re: WIP: WAL prefetch (another approach)
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 2/13/21 10:39 PM, Stephen Frost wrote:
> >* Andres Freund (and...@anarazel.de) wrote:
> >>On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> >>>Yeah, that's a good point. I think it'd make sense to keep track of recent FPIs and skip prefetching such blocks. But how exactly should we implement that, how many blocks do we need to track? If you get an FPI, how long should we skip prefetching of that block?
> >>>
> >>>I don't think the history needs to be very long, for two reasons. Firstly, the usual pattern is that we have FPI + several changes for that block shortly after it. Secondly, maintenance_io_concurrency limits this naturally - after crossing that, redo should place the FPI into shared buffers, allowing us to skip the prefetch.
> >>>
> >>>So I think using maintenance_io_concurrency is sufficient. We might track more buffers to allow skipping prefetches of blocks that were evicted from shared buffers, but that seems like overkill.
> >>>
> >>>However, maintenance_io_concurrency can be quite high, so just a simple queue is not very suitable - searching it linearly for each block would be too expensive. But I think we can use a simple hash table, tracking (relfilenode, block, LSN), over-sized to minimize collisions.
> >>>
> >>>Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, and whenever we prefetch a block or find an FPI, we simply add the block to the array as determined by hash(relfilenode, block)
> >>>
> >>>    hashtable[hash(...)] = {relfilenode, block, LSN}
> >>>
> >>>and then when deciding whether to prefetch a block, we look at that one position. If the (relfilenode, block) match, we check the LSN and skip the prefetch if it's sufficiently recent. Otherwise we prefetch.
> >>
> >>I'm a bit doubtful this is really needed at this point. Yes, the prefetching will do a buffer table lookup - but it's a lookup that already happens today. And the patch already avoids doing a second lookup after prefetching (by optimistically caching the last Buffer id, and re-checking).
> >
> >I agree that when a page is looked up, and found, in the buffer table, the subsequent caching of the buffer id in the WAL records does a good job of avoiding having to re-do that lookup. However, that isn't the case which was being discussed here or what Tomas's suggestion was intended to address.
> >
> >What I pointed out up-thread and what's being discussed here is what happens when the WAL contains a few FPIs and a few regular WAL records which are mixed up and not in ideal order. When that happens, with this patch, the FPIs will be ignored, the regular WAL records will reference blocks which aren't found in shared buffers (yet), and then we'll both issue prefetches for those and end up having spent effort doing a buffer lookup that we'll later re-do.
>
> The question is how common this pattern actually is - I don't know. As noted, the non-FPI would have to be fairly close to the FPI, i.e. within the wal_decode_buffer_size, to actually cause measurable harm.

Yeah, so it'll depend on how big wal_decode_buffer_size is. Increasing that would certainly help to show if there ends up being a degradation with this patch due to the extra prefetching being done.

> >To address the unnecessary syscalls we really just need to keep track of any FPIs that we've seen between the point where the prefetching is happening and the point where the replay is being done - once replay has replayed an FPI, our buffer lookup will succeed and we'll cache the buffer that the FPI is at. In other words, only wal_decode_buffer_size amount of WAL needs to be considered.
>
> Yeah, that's essentially what I proposed.

Glad I captured it correctly.

> >We could further leverage this tracking of FPIs to skip the prefetch syscalls, by caching with the FPI record which later records address the blocks that have FPIs earlier in the queue; then, when replay hits the FPI and loads it into shared_buffers, it could update the other WAL records in the queue with the buffer id of the page, allowing us to very likely avoid having to do another lookup later on.
>
> This seems like over-engineering, at least for v1.

Perhaps, though it didn't seem like it'd be very hard to do with the already proposed changes to stash the buffer id in the WAL records.

> >>I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately.
> >
> >This seems like a great optimization, albeit a fair bit of code, for a relatively uncommon use-case, specifically where full page writes are disabled or checkpoints are very large. As that's the cas
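As a rough illustration of the "stash the buffer id in the WAL records" idea floated above, here is a hedged C sketch. QueuedRef, propagate_fpi_buffer and the toy in-memory queue are hypothetical stand-ins, not structures from the patch, and a real implementation would still need to re-check the hinted buffer before using it, since the page could be evicted between replaying the FPI and replaying the later record (the "re-checking" Andres mentions).

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical shapes, not the patch's real structs: a decoded queue entry
 * remembers which block it touches and, optionally, a buffer id that is
 * believed to already hold that block (a "recent buffer" hint).
 */
typedef struct
{
    uint32_t relnode;        /* stand-in for RelFileNode */
    uint32_t block;
    int      recent_buffer;  /* -1 = unknown, otherwise a hint to re-check */
    bool     has_fpi;
} QueuedRef;

/*
 * The suggestion above: once replay restores an FPI into shared buffers,
 * sweep the not-yet-replayed part of the queue and attach the buffer id to
 * later references to the same block, so those can likely skip the buffer
 * lookup (after re-validating the hint).
 */
static void
propagate_fpi_buffer(QueuedRef *queue, int nqueued, int fpi_index, int buffer_id)
{
    const QueuedRef *fpi = &queue[fpi_index];

    for (int i = fpi_index + 1; i < nqueued; i++)
    {
        if (queue[i].relnode == fpi->relnode && queue[i].block == fpi->block)
            queue[i].recent_buffer = buffer_id;
    }
}

int main(void)
{
    QueuedRef queue[] = {
        {1000, 42, -1, true},    /* FPI for block 42 */
        {1000, 7,  -1, false},
        {1000, 42, -1, false},   /* later records touching block 42 */
        {1000, 42, -1, false},
    };
    int n = sizeof(queue) / sizeof(queue[0]);

    /* Pretend replay just restored the FPI at index 0 into buffer 17. */
    propagate_fpi_buffer(queue, n, 0, 17);

    for (int i = 0; i < n; i++)
        printf("entry %d: block %u, recent_buffer %d\n",
               i, queue[i].block, queue[i].recent_buffer);
    return 0;
}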
Re: WIP: WAL prefetch (another approach)
On 2/13/21 10:39 PM, Stephen Frost wrote:
> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
>> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
>>> Yeah, that's a good point. I think it'd make sense to keep track of recent FPIs and skip prefetching such blocks. But how exactly should we implement that, how many blocks do we need to track? If you get an FPI, how long should we skip prefetching of that block?
>>>
>>> I don't think the history needs to be very long, for two reasons. Firstly, the usual pattern is that we have FPI + several changes for that block shortly after it. Secondly, maintenance_io_concurrency limits this naturally - after crossing that, redo should place the FPI into shared buffers, allowing us to skip the prefetch.
>>>
>>> So I think using maintenance_io_concurrency is sufficient. We might track more buffers to allow skipping prefetches of blocks that were evicted from shared buffers, but that seems like overkill.
>>>
>>> However, maintenance_io_concurrency can be quite high, so just a simple queue is not very suitable - searching it linearly for each block would be too expensive. But I think we can use a simple hash table, tracking (relfilenode, block, LSN), over-sized to minimize collisions.
>>>
>>> Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, and whenever we prefetch a block or find an FPI, we simply add the block to the array as determined by hash(relfilenode, block)
>>>
>>>     hashtable[hash(...)] = {relfilenode, block, LSN}
>>>
>>> and then when deciding whether to prefetch a block, we look at that one position. If the (relfilenode, block) match, we check the LSN and skip the prefetch if it's sufficiently recent. Otherwise we prefetch.
>>
>> I'm a bit doubtful this is really needed at this point. Yes, the prefetching will do a buffer table lookup - but it's a lookup that already happens today. And the patch already avoids doing a second lookup after prefetching (by optimistically caching the last Buffer id, and re-checking).
>
> I agree that when a page is looked up, and found, in the buffer table, the subsequent caching of the buffer id in the WAL records does a good job of avoiding having to re-do that lookup. However, that isn't the case which was being discussed here or what Tomas's suggestion was intended to address.
>
> What I pointed out up-thread and what's being discussed here is what happens when the WAL contains a few FPIs and a few regular WAL records which are mixed up and not in ideal order. When that happens, with this patch, the FPIs will be ignored, the regular WAL records will reference blocks which aren't found in shared buffers (yet), and then we'll both issue prefetches for those and end up having spent effort doing a buffer lookup that we'll later re-do.

The question is how common this pattern actually is - I don't know. As noted, the non-FPI would have to be fairly close to the FPI, i.e. within the wal_decode_buffer_size, to actually cause measurable harm.

> To address the unnecessary syscalls we really just need to keep track of any FPIs that we've seen between the point where the prefetching is happening and the point where the replay is being done - once replay has replayed an FPI, our buffer lookup will succeed and we'll cache the buffer that the FPI is at. In other words, only wal_decode_buffer_size amount of WAL needs to be considered.

Yeah, that's essentially what I proposed.

> We could further leverage this tracking of FPIs to skip the prefetch syscalls, by caching with the FPI record which later records address the blocks that have FPIs earlier in the queue; then, when replay hits the FPI and loads it into shared_buffers, it could update the other WAL records in the queue with the buffer id of the page, allowing us to very likely avoid having to do another lookup later on.

This seems like over-engineering, at least for v1.

>> I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately.
>
> This seems like a great optimization, albeit a fair bit of code, for a relatively uncommon use-case, specifically where full page writes are disabled or checkpoints are very large. As that's the case though, I would think it's reasonable to ask that it go out of its way to avoid slowing down the more common configurations, particularly since it's proposed to have it on by default (which I agree with, provided it ends up improving the common cases, which I think the suggestions above would certainly make it more likely to do).

I'm OK to do some benchmarking, but it's not quite clear to me why it matters whether the checkpoints are smaller than shared buffers. IMO what matters is how "localized" the updates are, i.e. how likely it is to hit the same page repeatedly (in a short amount of time). Regular pgbench is not very suitable for that, but some non-uniform distribution should do the t
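Purely as an illustration of the "localized" updates a benchmark would need to produce (an FPI followed by many further changes to the same page within one checkpoint), here is a small C sketch of a skewed block chooser; in practice one would more likely reach for pgbench's built-in non-uniform random functions (e.g. random_exponential or random_zipfian) in a custom script rather than anything hand-rolled, and the constants below are arbitrary.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Map a uniform u in [0,1) to a block number in [0, nblocks) with an
 * exponential skew: a small "hot" region gets most of the hits, so a hot
 * block typically receives an FPI and then many follow-up changes within
 * the same checkpoint -- the localized pattern described above.
 */
static unsigned
skewed_block(double u, unsigned nblocks, double sharpness)
{
    /* inverse CDF of an exponential distribution truncated to [0,1) */
    double x = -log(1.0 - u * (1.0 - exp(-sharpness))) / sharpness;
    return (unsigned) (x * nblocks);
}

int main(void)
{
    unsigned hits[10] = {0};   /* bucket the block range into deciles */
    unsigned nblocks = 100000;

    srand(12345);
    for (int i = 0; i < 1000000; i++)
    {
        double u = (double) rand() / ((double) RAND_MAX + 1.0);
        unsigned b = skewed_block(u, nblocks, 8.0);
        hits[b * 10 / nblocks]++;   /* most updates land in the first decile */
    }

    for (int d = 0; d < 10; d++)
        printf("decile %d: %u updates\n", d, hits[d]);
    return 0;
}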
Re: WIP: WAL prefetch (another approach)
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> > Yeah, that's a good point. I think it'd make sense to keep track of recent FPIs and skip prefetching such blocks. But how exactly should we implement that, how many blocks do we need to track? If you get an FPI, how long should we skip prefetching of that block?
> >
> > I don't think the history needs to be very long, for two reasons. Firstly, the usual pattern is that we have FPI + several changes for that block shortly after it. Secondly, maintenance_io_concurrency limits this naturally - after crossing that, redo should place the FPI into shared buffers, allowing us to skip the prefetch.
> >
> > So I think using maintenance_io_concurrency is sufficient. We might track more buffers to allow skipping prefetches of blocks that were evicted from shared buffers, but that seems like overkill.
> >
> > However, maintenance_io_concurrency can be quite high, so just a simple queue is not very suitable - searching it linearly for each block would be too expensive. But I think we can use a simple hash table, tracking (relfilenode, block, LSN), over-sized to minimize collisions.
> >
> > Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, and whenever we prefetch a block or find an FPI, we simply add the block to the array as determined by hash(relfilenode, block)
> >
> >     hashtable[hash(...)] = {relfilenode, block, LSN}
> >
> > and then when deciding whether to prefetch a block, we look at that one position. If the (relfilenode, block) match, we check the LSN and skip the prefetch if it's sufficiently recent. Otherwise we prefetch.
>
> I'm a bit doubtful this is really needed at this point. Yes, the prefetching will do a buffer table lookup - but it's a lookup that already happens today. And the patch already avoids doing a second lookup after prefetching (by optimistically caching the last Buffer id, and re-checking).

I agree that when a page is looked up, and found, in the buffer table, the subsequent caching of the buffer id in the WAL records does a good job of avoiding having to re-do that lookup. However, that isn't the case which was being discussed here or what Tomas's suggestion was intended to address.

What I pointed out up-thread and what's being discussed here is what happens when the WAL contains a few FPIs and a few regular WAL records which are mixed up and not in ideal order. When that happens, with this patch, the FPIs will be ignored, the regular WAL records will reference blocks which aren't found in shared buffers (yet), and then we'll both issue prefetches for those and end up having spent effort doing a buffer lookup that we'll later re-do.

To address the unnecessary syscalls we really just need to keep track of any FPIs that we've seen between the point where the prefetching is happening and the point where the replay is being done - once replay has replayed an FPI, our buffer lookup will succeed and we'll cache the buffer that the FPI is at. In other words, only wal_decode_buffer_size amount of WAL needs to be considered.

We could further leverage this tracking of FPIs to skip the prefetch syscalls, by caching with the FPI record which later records address the blocks that have FPIs earlier in the queue; then, when replay hits the FPI and loads it into shared_buffers, it could update the other WAL records in the queue with the buffer id of the page, allowing us to very likely avoid having to do another lookup later on.

> I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately.

This seems like a great optimization, albeit a fair bit of code, for a relatively uncommon use-case, specifically where full page writes are disabled or checkpoints are very large. As that's the case though, I would think it's reasonable to ask that it go out of its way to avoid slowing down the more common configurations, particularly since it's proposed to have it on by default (which I agree with, provided it ends up improving the common cases, which I think the suggestions above would certainly make it more likely to do).

Perhaps this already improves the common cases and is worth the extra code on that basis, but I don't recall seeing much in the way of benchmarking in this thread for that case - that is, where FPIs are enabled and checkpoints are smaller than shared buffers. Jakub's testing was done with FPWs disabled and Tomas's testing used checkpoints which were much larger than the size of shared buffers on the system doing the replay. While it's certainly good that this patch improves those cases, we should also be looking out for the worst case and make sure that the patch doesn't degrade performance in that
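For concreteness, here is a minimal C sketch of the single-slot-per-hash filter described in the quoted proposal. The type names, the 64-slot array, the hash function and the recency window are stand-ins (the real thing would size the array from maintenance_io_concurrency and use XLogRecPtr and the actual relfilenode), so treat it as a sketch of the idea rather than an implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t relnode;   /* stand-in for RelFileNode */
    uint32_t block;
    uint64_t lsn;       /* LSN at which we saw the FPI or issued the prefetch */
    bool     used;
} RecentBlock;

#define FILTER_SLOTS 64          /* ~ 2 * maintenance_io_concurrency */

typedef struct
{
    RecentBlock slots[FILTER_SLOTS];
    uint64_t    recency_window;  /* how far back an entry still counts as "recent" */
} RecentBlockFilter;

static unsigned
filter_slot(uint32_t relnode, uint32_t block)
{
    /* any cheap hash over (relfilenode, block) will do for the sketch */
    uint32_t h = relnode * 0x9E3779B1u ^ block * 0x85EBCA77u;
    return h % FILTER_SLOTS;
}

/* Called when we issue a prefetch, or when the decoder sees an FPI. */
static void
filter_remember(RecentBlockFilter *f, uint32_t relnode, uint32_t block, uint64_t lsn)
{
    RecentBlock *s = &f->slots[filter_slot(relnode, block)];

    s->relnode = relnode;   /* collisions simply overwrite the slot */
    s->block = block;
    s->lsn = lsn;
    s->used = true;
}

/* Called before prefetching: true means "skip, we very likely don't need it". */
static bool
filter_skip_prefetch(const RecentBlockFilter *f, uint32_t relnode,
                     uint32_t block, uint64_t current_lsn)
{
    const RecentBlock *s = &f->slots[filter_slot(relnode, block)];

    return s->used &&
           s->relnode == relnode &&
           s->block == block &&
           current_lsn >= s->lsn &&
           current_lsn - s->lsn <= f->recency_window;
}

int main(void)
{
    RecentBlockFilter f = { .recency_window = 16 * 1024 * 1024 }; /* arbitrary window */

    filter_remember(&f, 1000, 42, 1000000);   /* decoder saw an FPI for block 42 */
    printf("skip block 42? %d\n", filter_skip_prefetch(&f, 1000, 42, 1200000)); /* 1 */
    printf("skip block 43? %d\n", filter_skip_prefetch(&f, 1000, 43, 1200000)); /* 0 */
    return 0;
}

A stale or colliding entry only costs either a redundant prefetch hint or one skipped prefetch (which replay then satisfies with a synchronous read); it never affects correctness, which is presumably why such a small, lossy, fixed-size table is being proposed rather than an exact structure.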