Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> 2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG:  
> connection authorized: user=bf database=postgres 
> application_name=027_stream_regress.pl
> 2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG:  
> statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
>                FROM pg_catalog.pg_stat_replication
>                WHERE application_name IN ('standby_1', 'walreceiver')
> 2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG:  disconnection: 
> session time: 0:00:00.063 user=bf database=postgres host=[local]
> 2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG:  received 
> immediate shutdown request
> 2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG:  database system 
> is shut down
>
> There was no activity for 7 minutes.
>
> System statistics show relatively low CPU and IO load for the period from
> 22:00 - 22:10.
>
>
> I suspect we have some more fundamental instability at our hands, there have
> been failures like this going back a while, and on various machines.

I've reproduced something like this scenario locally, although I am not sure
it is precisely what is happening on the buildfarm.  At least here it looks
like the problem is that apply is lagging substantially:

2024-03-20 22:43:11.024 PDT [1023505][walreceiver][:0][] DEBUG:  sendtime 
2024-03-20 22:43:11.024348-07 receipttime 2024-03-20 22:43:11.02437-07 
replication apply delay 285322 ms transfer latency 1 ms

Which then means that we end up waiting a long time for apply to catch up:

Waiting for replication conn standby_1's replay_lsn to pass 0/14385E20 on 
primary
[22:41:34.521](0.221s) # state before polling:
# pid                | 1023508
# application_name   | standby_1
# sent_lsn           | 0/14385E20
# primary_wal_lsn    | 0/14385E20
# standby_write_lsn  | 0/14385E20
# primary_flush_lsn  | 0/14385E20
# standby_flush_lsn  | 0/14385E20
# standby_replay_lsn | 0/126D5C58
...
[22:43:16.376](0.161s) # running query, attempt 679/1800
[22:43:16.627](0.251s) # state now:
# pid                | 1023508
# application_name   | standby_1
# sent_lsn           | 0/14778468
# primary_wal_lsn    | 0/14778468
# standby_write_lsn  | 0/14778468
# primary_flush_lsn  | 0/14778468
# standby_flush_lsn  | 0/14778468
# standby_replay_lsn | 0/14778468
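
For reference, the wait above is just the test harness polling
pg_stat_replication with the query shown in the quoted log until it returns
true. A minimal libpq sketch of that loop (conninfo, target LSN and the retry
interval are placeholders, not the actual TAP code):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");  /* placeholder conninfo */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        /* same shape as the query in the quoted log; LSN is a placeholder */
        PGresult *res = PQexec(conn,
                               "SELECT '0/14385E20' <= replay_lsn AND state = 'streaming' "
                               "FROM pg_catalog.pg_stat_replication "
                               "WHERE application_name IN ('standby_1', 'walreceiver')");

        if (PQresultStatus(res) == PGRES_TUPLES_OK &&
            PQntuples(res) == 1 &&
            strcmp(PQgetvalue(res, 0, 0), "t") == 0)
        {
            PQclear(res);
            break;              /* standby has caught up */
        }
        PQclear(res);
        usleep(100 * 1000);     /* short sleep between attempts */
    }

    PQfinish(conn);
    return 0;
}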



I have not fully debugged why yet, but it sure looks like one part of it is
the startup process being busy unlinking files synchronously. This appears to
be exacerbated by mdunlinkfork() first truncating and then separately
unlinking the file - that seems to trigger a lot of filesystem journal
flushes (on xfs).
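
To make that concrete, here is a standalone sketch (plain POSIX, not the
actual md.c code; the file name is made up) of the per-segment syscall
pattern, a truncate followed by a separate unlink, each of them a journaled
metadata change:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("dummy_segment", O_CREAT | O_RDWR, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* step one: truncate the segment to zero length ... */
    if (ftruncate(fd, 0) < 0)
        perror("ftruncate");
    close(fd);

    /* ... step two: unlink it separately, a second metadata change */
    if (unlink("dummy_segment") < 0)
        perror("unlink");

    return 0;
}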

We also spend a fair bit of time updating the control file, because we flush
the WAL when replaying a transaction commit with a relation unlink. That also
interacts badly with the metadata operations...
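
From the filesystem's point of view the interleaving looks roughly like the
following standalone sketch (not PostgreSQL code; file names are made up):
for each replayed commit that drops a relation we first durably update a
small file (during recovery XLogFlush() ends up advancing minRecoveryPoint
in pg_control, AFAICS) and then unlink the relation's files:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void
update_fake_control_file(const char *path)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
    {
        perror("open");
        return;
    }
    if (write(fd, "lsn", 3) != 3)
        perror("write");
    if (fsync(fd) < 0)          /* durably record the new "minRecoveryPoint" */
        perror("fsync");
    close(fd);
}

int
main(void)
{
    /* one iteration per replayed commit record that drops a relation */
    for (int i = 0; i < 3; i++)
    {
        char relpath[32];
        int relfd;

        snprintf(relpath, sizeof(relpath), "fake_relation_%d", i);

        /* pretend the relation file exists */
        relfd = open(relpath, O_CREAT | O_WRONLY, 0600);
        if (relfd >= 0)
            close(relfd);

        update_fake_control_file("fake_pg_control");    /* fsync */
        unlink(relpath);        /* metadata change right behind the fsync */
    }
    return 0;
}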

Thirdly, we at times flush received WAL extremely granularly, each flush
requiring yet another fsync:
2024-03-20 23:30:21.469 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BB0000
2024-03-20 23:30:21.473 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BB0170
2024-03-20 23:30:21.479 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BB2528
2024-03-20 23:30:21.480 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BB58C8
2024-03-20 23:30:21.487 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BB7DA0
2024-03-20 23:30:21.490 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BB92B0
2024-03-20 23:30:21.494 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BBBAC0
2024-03-20 23:30:21.496 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BBCCC0
2024-03-20 23:30:21.499 PDT [1525084][walreceiver][:0][] LOG:  flushed received 
WAL up to 0/13BBCE18
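
For illustration, that boils down to an I/O pattern like this standalone
sketch (not walreceiver code; file name and chunk size are made up): nine
small appends within ~30ms, each followed by its own fsync:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("fake_wal_segment", O_CREAT | O_WRONLY, 0600);
    char buf[8192];

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 0, sizeof(buf));

    /* one iteration per "flushed received WAL up to ..." line above */
    for (int i = 0; i < 9; i++)
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            perror("write");
        if (fsync(fd) < 0)      /* one synchronous flush per small chunk */
            perror("fsync");
    }

    close(fd);
    return 0;
}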

All of this while we're already quite far behind with apply...

Greetings,

Andres Freund

