On Thu, Aug 13, 2020 at 06:57:20PM +1200, Thomas Munro wrote:
On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote:
>On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra
>> Any luck trying to reproduce thigs? Should I try again and collect some
>> additional debug info?
>
>No luck.  I'm working on it now, and also trying to reduce the
>overheads so that we're not doing extra work when it doesn't help.

OK, I'll see if I can still reproduce it.

Since someone else ask me off-list, here's a rebase, with no
functional changes.  Soon I'll post a new improved version, but this
version just fixes the bitrot and hopefully turns cfbot green.

I've decided to do some tests with this patch version, but I immediately
ran into issues. What I did was initializing a 32GB pgbench database,
backed it up (shutdown + tar) and then ran 2h pgbench with archiving.
And then I restored the backed-up data directory and instructed it to
replay WAL from the archive. There's about 16k WAL segments, so about
256GB of WAL.

Unfortunately, the very first thing that happens after starting the
recovery is this:

    LOG:  starting archive recovery
    LOG:  restored log file "000000010000001600000080" from archive
    LOG:  consistent recovery state reached at 16/800000A0
    LOG:  redo starts at 16/800000A0
    LOG:  database system is ready to accept read only connections
    LOG:  recovery started prefetching on timeline 1 at 0/800000A0
    LOG:  recovery no longer prefetching: unexpected pageaddr 8/84000000 in log 
segment 000000010000001600000081, offset 0
    LOG:  restored log file "000000010000001600000081" from archive
    LOG:  restored log file "000000010000001600000082" from archive

So we start applying 000000010000001600000081 and it fails almost
immediately on the first segment. This is confirmed by prefetch stats,
which look like this:

    -[ RECORD 1 ]---+-----------------------------
    stats_reset     | 2020-09-01 15:02:31.18766+02
    prefetch        | 1044
    skip_hit        | 1995
    skip_new        | 87
    skip_fpw        | 2108
    skip_seq        | 27
    distance        | 0
    queue_depth     | 0
    avg_distance    | 135838.95
    avg_queue_depth | 8.852459

So we do a little bit of prefetching and then it gets disabled :-(

The segment looks perfectly fine when inspected using pg_waldump, see
the attached file.

I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6,
and the failure seems fairly similar to what I reported before, except
that now it happened right at the very beginning.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: 000000010000001600000081.log.gz
Description: application/gzip

Reply via email to