Hello Postgres gurus,

I'm writing a thin clustering layer on top of Postgres using the
synchronous replication feature.  The goal is to enable HA and survive
permanent loss of a single node.  Using an external coordinator
(ZooKeeper), one of the nodes is elected as the primary.  The primary node
then picks another healthy node as its standby and starts serving.
Thereafter, the cluster monitors the primary and the standby, and triggers
a re-election if either of them goes down.

Detecting primary health is easy.  But what is the best way to know if the
standby is live?  Since this is not a hot-standby, I cannot send queries to
it.  Currently, I'm sending the following query to the primary:

  SELECT * FROM pg_stat_replication;

I've noticed that when I terminate the standby (cleanly or through kill
-9), the result of the above query goes from one row to zero rows.  The
result comes back to one row when the standby restarts and reconnects.  I
was wondering if there is any kind of guarantee about the contents of
pg_stat_replication when the standby suffers a network partition, and/or
restarts and reconnects with the primary.  Are there any parameters that
control this behavior?
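
To make the check concrete, here is a minimal sketch of how the rows
returned by that query might be interpreted.  It assumes the rows have
already been fetched from the primary by some driver and are passed in as
dictionaries keyed by column name; the function name standby_is_live is
hypothetical.  It relies only on the "state" and "sync_state" columns of
pg_stat_replication, and it says nothing about how quickly a row disappears
after a partition, which is the open question here.

```python
from typing import Iterable, Mapping


def standby_is_live(rows: Iterable[Mapping[str, str]]) -> bool:
    """Return True if any row describes a streaming, synchronous standby.

    `rows` is assumed to be the result of querying pg_stat_replication on
    the primary, one mapping per row (hypothetical input shape).
    """
    return any(
        row.get("state") == "streaming" and row.get("sync_state") == "sync"
        for row in rows
    )


# Example: zero rows (standby gone) versus one healthy row.
no_standby = []
healthy = [{"application_name": "standby1",
            "state": "streaming",
            "sync_state": "sync"}]
catching_up = [{"application_name": "standby1",
                "state": "catchup",
                "sync_state": "sync"}]
```

Treating a standby that is still in "catchup" as not live is a design
choice of this sketch, not something the server mandates.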

I tried looking at WalSndLoop() in src/backend/replication/walsender.c but am
still not clear on the expected behavior.

Thanks for your time,
Abhishek
