On Thu, Aug 13, 2020 at 06:57:20PM +1200, Thomas Munro wrote:
On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote: >On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra >> Any luck trying to reproduce thigs? Should I try again and collect some >> additional debug info? > >No luck. I'm working on it now, and also trying to reduce the >overheads so that we're not doing extra work when it doesn't help.OK, I'll see if I can still reproduce it.Since someone else ask me off-list, here's a rebase, with no functional changes. Soon I'll post a new improved version, but this version just fixes the bitrot and hopefully turns cfbot green.
I've decided to do some tests with this patch version, but I immediately ran into issues. What I did was initializing a 32GB pgbench database, backed it up (shutdown + tar) and then ran 2h pgbench with archiving. And then I restored the backed-up data directory and instructed it to replay WAL from the archive. There's about 16k WAL segments, so about 256GB of WAL. Unfortunately, the very first thing that happens after starting the recovery is this: LOG: starting archive recovery LOG: restored log file "000000010000001600000080" from archive LOG: consistent recovery state reached at 16/800000A0 LOG: redo starts at 16/800000A0 LOG: database system is ready to accept read only connections LOG: recovery started prefetching on timeline 1 at 0/800000A0 LOG: recovery no longer prefetching: unexpected pageaddr 8/84000000 in log segment 000000010000001600000081, offset 0 LOG: restored log file "000000010000001600000081" from archive LOG: restored log file "000000010000001600000082" from archive So we start applying 000000010000001600000081 and it fails almost immediately on the first segment. This is confirmed by prefetch stats, which look like this: -[ RECORD 1 ]---+----------------------------- stats_reset | 2020-09-01 15:02:31.18766+02 prefetch | 1044 skip_hit | 1995 skip_new | 87 skip_fpw | 2108 skip_seq | 27 distance | 0 queue_depth | 0 avg_distance | 135838.95 avg_queue_depth | 8.852459 So we do a little bit of prefetching and then it gets disabled :-( The segment looks perfectly fine when inspected using pg_waldump, see the attached file. I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6, and the failure seems fairly similar to what I reported before, except that now it happened right at the very beginning. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
000000010000001600000081.log.gz
Description: application/gzip