On Mon, 1 Jun 2020 12:44:26 +0500 godjan • <g0d...@gmail.com> wrote:
> Hi, sorry for 2 weeks latency in answer :)
>
> >> It fixed our trouble, but there is another one. Now we should wait until all
> >> HA alive hosts finish replaying WAL to failover. It might take a while (for
> >> example, WAL contains a wal_record about splitting a b-tree).
> >
> > Indeed, this is the concern I wrote about yesterday in a second mail on this
> > thread.
>
> Actually, I found out that we use the wrong heuristic to understand that
> the standby is still replaying WAL. We compare values of
> pg_last_wal_replay_lsn() before and after sleeping. If the standby is
> replaying a huge wal_record (e.g. splitting a b-tree), it gives us the wrong
> result.

It could, yes.

> > Note that when you promote a node, it first replays available WALs before
> > acting as a primary.
>
> Do you know how Postgres understands that the standby still replays available
> WAL? I didn't get it from the code of promotion.

See chapter "26.2.2. Standby Server Operation" in official doc:

  «
  Standby mode is exited and the server switches to normal operation when
  pg_ctl promote is run or a trigger file is found (promote_trigger_file).
  Before failover, any WAL immediately available in the archive or in pg_wal
  will be restored, but no attempt is made to connect to the master.
  »

In the source code, dig around the following chain if interested:
StartupXLOG -> ReadRecord -> XLogReadRecord -> XLogPageRead ->
WaitForWALToBecomeAvailable.

[...]

> > Nope, no clean and elegant idea. Once your instances are killed, maybe you
> > can force flush the system cache (secure in-memory-only data)?
>
> Does "force flush the system cache" mean invoking this command
> https://linux.die.net/man/8/sync on the standby?

Yes, just for safety.

> > and read the latest received WAL using pg_waldump?
>
> I did an experiment with pg_waldump without sync:
> - write data on primary
> - kill primary
> - read the latest received WAL using pg_waldump:
>   0/1D019F38
> - pg_last_wal_replay_lsn():
>   0/1D019F68

Normal.
pg_waldump gives you the starting LSN of the record. pg_last_wal_replay_lsn()
returns lastReplayedEndRecPtr, which is the end of the record:

    /*
     * lastReplayedEndRecPtr points to end+1 of the last record successfully
     * replayed.

So I suppose your last xlogrecord was 0x30 (48) bytes long. If I remember
correctly, minimal xlogrecord length is 24 bytes, so I bet there's only one
xlogrecord there, starting at 0/1D019F38 with last byte at 0/1D019F67.

> So it's wrong to use pg_waldump to understand what was latest received LSN.
> At least without "forcing flush system cache".

Nope, just sum the xlogrecord length.
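
To make the "sum the xlogrecord length" arithmetic concrete, here is a minimal Python sketch. The helper names (parse_lsn, format_lsn, record_end) are made up for illustration and are not part of any PostgreSQL tool. It assumes an LSN is written as two hex numbers "hi/lo" forming a 64-bit position, that record ends are padded to an 8-byte boundary (typical MAXALIGN on 64-bit platforms, an assumption here), and that the record does not cross a WAL page boundary (page headers are ignored).

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN like '0/1D019F38' to a 64-bit integer position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit position back to the 'hi/lo' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def record_end(start_lsn: str, total_len: int, align: int = 8) -> str:
    """Return end+1 of a record, given its start LSN and total length.

    Assumption: the length is rounded up to `align` bytes and the record
    stays within one WAL page.
    """
    padded = (total_len + align - 1) // align * align
    return format_lsn(parse_lsn(start_lsn) + padded)


# Values from the experiment quoted above.
start = "0/1D019F38"      # start of the last record, as seen with pg_waldump
replayed = "0/1D019F68"   # result of pg_last_wal_replay_lsn()

gap = parse_lsn(replayed) - parse_lsn(start)
print(gap, hex(gap))
print(record_end(start, gap) == replayed)
```

If the start LSN of the last record plus its (padded) length matches pg_last_wal_replay_lsn(), both tools agree on where the received WAL ends.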