Hello Postgres gurus,

I'm writing a thin clustering layer on top of Postgres using the synchronous replication feature. The goal is to provide HA and survive the permanent loss of a single node. Using an external coordinator (Zookeeper), one of the nodes is elected as the primary. The primary then picks another healthy node as its standby and starts serving. Thereafter, the cluster monitors the primary and the standby, and triggers a re-election if either the primary or its standby goes down.
Detecting primary health is easy. But what is the best way to know whether the standby is alive? Since this is not a hot standby, I cannot send queries to it. Currently, I'm running the following query on the primary:

    SELECT * FROM pg_stat_replication;

I've noticed that when I terminate the standby (cleanly or with kill -9), the result of this query goes from one row to zero rows. It goes back to one row when the standby restarts and reconnects.

I was wondering whether there is any guarantee about the contents of pg_stat_replication when the standby suffers a network partition, or restarts and reconnects to the primary. Are there any parameters that control this behavior? I tried looking at WalSndLoop() in src/backend/replication/walsender.c, but I'm still not clear on the expected behavior.

Thanks for your time,
Abhishek
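P.S. For concreteness, here is a minimal sketch of the kind of check described above, written in Python. It is an illustration under assumptions, not the actual code in my clustering layer: the helper name `standby_is_live` and the standby name `node_b` are hypothetical, and requiring `state = 'streaming'` and `sync_state = 'sync'` is a stricter test than just counting rows. The function only interprets rows that have already been fetched from the primary with whatever driver is in use, so it says nothing about how quickly a row disappears after a partition, which is the open question.

```python
from typing import Iterable, Mapping

# Query to run on the primary. pg_stat_replication is a view,
# so it is selected from without parentheses.
QUERY = "SELECT application_name, state, sync_state FROM pg_stat_replication"


def standby_is_live(rows: Iterable[Mapping[str, str]], standby_name: str) -> bool:
    """Return True if the named standby appears as a streaming, synchronous walsender.

    `rows` is the result of QUERY, one mapping per row. `standby_name` is the
    application_name the standby connects with (hypothetical in this sketch).
    """
    for row in rows:
        if (
            row.get("application_name") == standby_name
            and row.get("state") == "streaming"
            and row.get("sync_state") == "sync"
        ):
            return True
    # No matching row: the standby is disconnected, still catching up,
    # or not (yet) the synchronous standby.
    return False
```

With a connected standby the check would be called as `standby_is_live(rows, "node_b")`; an empty result set, which is what I see after killing the standby, makes it return False.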