Thanks Bharath for your response,
> You mentioned that
> the connection to primary was lost, so you need to dive deep as to why
> it got lost. If the connection was lost half-way through fetching the
> WAL record, the standby may emit such a LOG message.
The connection was lost due to bad network. Currently we are okay with bad
network, because applications in general has to always expect bad network in
their design.
To simulate the bad network, we manually killed the wal-sender process in
primary multiple times, this should be same as primary unable to send messages
to standby due to bad network. And in these experiments, the standby is able to
join the primary after checking WAL files in archive, within few seconds.
But, when the standby gets disconnected due to bad network, the standby is
unable to join back, so we wanted to understand why this is happening.
> I understand this problem and there's a proposed patch to help with
> this -
> https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3njkvzt8q...@mail.gmail.com.
>
> It basically allows one to set a timeout as to how much duration the
> standby can restore from archive before switching to stream.
> Therefore, in your case, the standby doesn't have to wait for 1hr to
> connect to primary, but it can connect before that.
Good to know about this patch, because we tried to do something similar to this.
But this patch will not solve our problem, because here the hot-standby is NOT
downloading WAL files from the archive for very long time and wasting time,
but, the standby simply switching the sources from STREAM to ARCHIVE in a
tight loop doing nothing.
In fact, the standby is trying to connect to the primary for streaming back
after it fails to pull the latest WAL file from archive. But trying to stream
from the primary fails again immediately.
To explain this situation better, I am adding more logs.
Note: the logs mentioned here includes the custom logging we enable for
debugging purposes.
2023-02-09 19:52:30.909 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: Starting
the state machine again from starting, changing current source to
XLOG_FROM_ARCHIVE
2023-02-09 19:52:30.909 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: switched
WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:31.616383 Archive '00060002001C' does not
exist.
2023-02-09 19:52:31.618 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG:
Successfully read the WAL file using XLogFileReadAnyTLI from archive or
existing pg_wal, returning true and exit.
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: last
source archive failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: Moving to
XLOG_FROM_STREAM state and start walreceiver if necessary
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: switched
WAL source from archive to stream after failure with WalStreamingPreferred false
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: changing
the current file timeline to new tli.
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: we have
data from wal receiver.
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: after
streaming file, we dont have WAL file opened, so read now.
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: Due to WAL
receiver, we have WAL file present and opened and hence returning true
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: invalid
record length at 2/1C0324C8: wanted 24, got 0
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: last
source stream failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: Failure
while streaming, we might have found invalid record in WAL streamed from master
2023-02-09 19:52:31.619 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: Stopping
WAL receiver as we saw serious failure while streaming
INFO: 2023/02/09 19:52:31.796489 WAL: 0007.history will be downloaded from
archive
ERROR: 2023/02/09 19:52:32.330491 Archive '0007.history' does not exist.
2023-02-09 19:52:32.333 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: We were
requested to recover to latest timeline, but rescan is NOT needed.
2023-02-09 19:52:32.333 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: We decided
to sleep before retry the state-machine
2023-02-09 19:52:35.910 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: Starting
the state machine again from starting, changing current source to
XLOG_FROM_ARCHIVE
2023-02-09 19:52:35.910 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG: switched
WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:36.607613 Archive '00060002001C' does not
exist.
2023-02-09 19:52:36.610 GMT [ ] () [0 | 0 | 63e4244f.54]: LOG:
Successfully read the WAL file using XLogFile