Hi, sorry for the two-week delay in replying :)

>> It fixed our trouble, but there is another one. Now we have to wait until all
>> alive HA hosts finish replaying WAL before failing over. That might take a
>> while (for example, when the WAL contains a record about splitting a b-tree).
> 
> Indeed, this is the concern I wrote about yesterday in a second mail on this
> thread.

Actually, I found out that we were using the wrong heuristic to detect that the 
standby is still replaying WAL.
We compare the values of pg_last_wal_replay_lsn() before and after sleeping.
If the standby is replaying a huge WAL record (e.g. a b-tree split), this gives 
us the wrong result.
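
To make it concrete, the check is roughly the following (a minimal Python 
sketch, not our actual code; the connection string and the sleep interval are 
made up):

    import time
    import psycopg2

    def replay_seems_finished(conninfo, pause=1.0):
        """Flawed heuristic: assume replay is done if the replay LSN
        did not advance while we slept."""
        with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
            cur.execute("SELECT pg_last_wal_replay_lsn()")
            before = cur.fetchone()[0]
            time.sleep(pause)
            cur.execute("SELECT pg_last_wal_replay_lsn()")
            after = cur.fetchone()[0]
        # If the standby spends longer than `pause` applying one huge WAL
        # record (e.g. a b-tree split), the LSN does not move during the
        # sleep and we wrongly conclude that replay has finished.
        return before == after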


> Note that when you promote a node, it first replays available WALs before
> acting as a primary.


Do you know how Postgres understands that the standby is still replaying 
available WAL? I couldn't figure it out from the promotion code.


> However, how useful is this test in a context of auto failover for
> **service** high availability?

Such a situation has a really low probability in our service, and we thought it 
would be acceptable to resolve it with on-call involvement.

> Nope, no clean and elegant idea. Once your instances are killed, maybe you can
> force flush the system cache (secure in-memory-only data)?

Do "force flush the system cache” means invoke this command 
https://linux.die.net/man/8/sync <https://linux.die.net/man/8/sync> on the 
standby?

> and read the latest received WAL using pg_waldump?


I did an experiment with pg_waldump without sync:
- write data on the primary
- kill the primary
- the latest LSN in the received WAL according to pg_waldump: 0/1D019F38
- pg_last_wal_replay_lsn(): 0/1D019F68

So it is wrong to use pg_waldump to determine the latest received LSN, at least 
without "forcing a flush of the system cache".
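
For reference, the pg_waldump side of the experiment was roughly the following 
(a sketch; the segment name below is hypothetical, in the experiment I simply 
took the newest segment in pg_wal of the killed standby):

    import subprocess

    # Hypothetical segment path; in practice, pick the newest file in pg_wal
    # of the killed standby.
    SEGMENT = "/var/lib/postgresql/data/pg_wal/00000001000000000000001D"

    # pg_waldump prints one line per decoded record; the "lsn:" field of the
    # last printed line is what I compared with pg_last_wal_replay_lsn().
    out = subprocess.run(["pg_waldump", SEGMENT], capture_output=True, text=True)
    print(out.stdout.rstrip().splitlines()[-1])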

> If all your nodes are killed in the same disaster, how/why an automatic
> cluster manager should take care of starting all nodes again and pick the
> right node to promote?

1. How?
The automatic cluster manager will restart the standbys in such a situation.
If the primary's lock in ZK is released, the automatic cluster manager starts 
the election of a new primary.
To decide which node should be promoted, the automatic cluster manager has to 
get the LSN of the last WAL record written to disk by each potential new 
primary (see the sketch after this list).
We used pg_last_wal_receive_lsn() for this, but that was a mistake, because 
after "kill -9" on a standby pg_last_wal_receive_lsn() reports the first LSN of 
the last segment.

2. Why?
- a sleepy on-call engineer can do something bad at night in such a situation :)
- pg_waldump didn't give the last LSN written to disk (at least without forcing 
a flush of the system cache), so I don't know how an on-call engineer could 
figure out which standby should be promoted
- the automatic cluster manager has successfully resolved such situations in 
clusters with one designated synchronous standby for years, and we hope it is 
possible to do the same in clusters with quorum replication
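
To make point 1 concrete, here is a simplified sketch of the election step (not 
our actual code; the connection handling and the node layout are made up). It 
shows the comparison we do, using pg_last_wal_receive_lsn(), which as explained 
above turned out to be unreliable after "kill -9":

    import psycopg2

    def last_received_lsn(conninfo):
        """Ask a standby which LSN it has received.
        Pitfall: after "kill -9", pg_last_wal_receive_lsn() reports the
        first LSN of the last segment, not the last record on disk."""
        with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
            cur.execute("SELECT pg_last_wal_receive_lsn()")
            return cur.fetchone()[0]

    def pick_new_primary(standbys):
        """standbys: dict of node name -> conninfo (hypothetical layout).
        Promote the node that has the most WAL."""
        def as_int(lsn):
            # pg_lsn text like '0/1D019F38' must be compared numerically,
            # not as a string.
            hi, lo = lsn.split("/")
            return (int(hi, 16) << 32) + int(lo, 16)
        lsns = {name: last_received_lsn(ci) for name, ci in standbys.items()}
        return max(lsns, key=lambda name: as_int(lsns[name]))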


