Hello everyone,

We are doing a POC of a Postgres HA setup with (async) streaming replication, using Pgpool-II for load balancing & connection pooling and repmgr for HA & automatic failover. As a test case we isolate the VM1 node from the network completely for more than 2 minutes and then plug the network back in, since we want to verify how the system behaves during network glitches and whether there is any chance of split-brain.

Our current setup looks like this: 2 VMs on Azure cloud, each VM running Postgres along with the Pgpool service. [image: image.png]
We enabled watchdog and assigned a delegate IP.
*NOTE: due to some limitations we are using a floating IP as the delegate IP.*

During the test, here are our observations:
1. Client connections hung from the moment VM1 was lost from the network until VM1 came back.
2. Once VM1 was lost, Pgpool promoted VM2 as the LEADER node, and the Postgres standby on VM2 was promoted to primary as well, but client connections still did not reach the new primary. Why does this not happen?
3. Once VM1 was back on the network, there was a split-brain situation: pgpool on VM1 took the lead to become the LEADER node (per pgpool.log), and from then on clients connected to VM1 via the VIP.

*pgpool.conf*
sr_check_period = 10            (seconds)
health_check_period = 30        (seconds)
health_check_timeout = 20       (seconds)
health_check_max_retries = 3
health_check_retry_delay = 1
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30

*Logs information:*

From VM2: pgpool.log

14:30:17 network disconnected.

After 10 sec the streaming replication check failed and timed out:
2024-07-03 14:30:26.176: sr_check_worker pid 58187: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

Then pgpool's health check failed, timing out per health_check_timeout = 20:
2024-07-03 14:30:35.869: health_check0 pid 58188: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

Retrying health_check & sr_check, but they timed out again:
2024-07-03 14:30:46.187: sr_check_worker pid 58187: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out
2024-07-03 14:30:46.880: health_check0 pid 58188: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

Watchdog received a message saying the leader node is lost:
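As a sanity check on the timeline, here is our own back-of-the-envelope arithmetic for the worst-case detection windows implied by the settings above (an approximation only; actual pgpool-II scheduling may differ):

```python
# Rough worst-case detection windows from our pgpool.conf values
# (approximation; pgpool-II internals may schedule these differently).

wd_heartbeat_deadtime = 30    # s until watchdog declares the peer lost
health_check_timeout = 20     # s per backend connection attempt
health_check_max_retries = 3
health_check_retry_delay = 1  # s between retries

# Watchdog: the peer is marked lost roughly one deadtime
# after its last heartbeat was received.
watchdog_window = wd_heartbeat_deadtime

# Health check: the initial attempt plus each retry can consume up to
# timeout + retry_delay before the backend is finally declared failed.
health_check_window = health_check_timeout + \
    health_check_max_retries * (health_check_retry_delay + health_check_timeout)

print(watchdog_window)      # 30
print(health_check_window)  # 83
```

The ~51 s we observed between the 14:30:17 disconnect and the 14:31:08 quarantine falls inside that worst-case window.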
2024-07-03 14:30:47.192: watchdog pid 58151: WARNING: we have not received a beacon message from leader node "staging-ha0001:9999 Linux staging-ha0001"
2024-07-03 14:30:47.192: watchdog pid 58151: DETAIL: requesting info message from leader node
2024-07-03 14:30:54.312: watchdog pid 58151: LOG: read from socket failed, remote end closed the connection
2024-07-03 14:30:54.312: watchdog pid 58151: LOG: client socket of staging-ha0001:9999 Linux staging-ha0001 is closed
2024-07-03 14:30:54.313: watchdog pid 58151: LOG: remote node "staging-ha0001:9999 Linux staging-ha0001" is reporting that it has lost us
2024-07-03 14:30:54.313: watchdog pid 58151: LOG: we are lost on the leader node "staging-ha0001:9999 Linux staging-ha0001"

Retrying health_check & sr_check, but they timed out again:
2024-07-03 14:30:57.888: health_check0 pid 58188: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out
2024-07-03 14:30:57.888: health_check0 pid 58188: LOG: health check retrying on DB node: 0 (round:3)
2024-07-03 14:31:06.201: sr_check_worker pid 58187: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

10 sec after we lost the leader node, watchdog changed the current node to LEADER:
2024-07-03 14:31:04.199: watchdog pid 58151: LOG: watchdog node state changed from [STANDING FOR LEADER] to [LEADER]

health_check failed on node 0, a degenerate request was received for node 0, and the pgpool main process started quarantining staging-ha0001(5432) (shutting down):
2024-07-03 14:31:08.202: watchdog pid 58151: LOG: setting the local node "staging-ha0002:9999 Linux staging-ha0002" as watchdog cluster leader
2024-07-03 14:31:08.202: watchdog pid 58151: LOG: signal_user1_to_parent_with_reason(1)
2024-07-03 14:31:08.202: watchdog pid 58151: LOG: I am the cluster leader node but we do not have enough nodes in cluster
2024-07-03 14:31:08.202: watchdog pid 58151: DETAIL: waiting for the quorum to start escalation process
2024-07-03 14:31:08.202: main pid 58147: LOG: Pgpool-II parent process received SIGUSR1
2024-07-03 14:31:08.202: main pid 58147: LOG: Pgpool-II parent process received watchdog state change signal from watchdog
2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out
2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: health check failed on node 0 (timeout:0)
2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: received degenerate backend request for node_id: 0 from pid [58188]
2024-07-03 14:31:08.899: watchdog pid 58151: LOG: watchdog received the failover command from local pgpool-II on IPC interface
2024-07-03 14:31:08.899: watchdog pid 58151: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2024-07-03 14:31:08.899: watchdog pid 58151: LOG: failover requires the quorum to hold, which is not present at the moment
2024-07-03 14:31:08.899: watchdog pid 58151: DETAIL: Rejecting the failover request
2024-07-03 14:31:08.899: watchdog pid 58151: LOG: failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "staging-ha0002:9999 Linux staging-ha0002" is rejected because the watchdog cluster does not hold the quorum
2024-07-03 14:31:08.900: health_check0 pid 58188: LOG: degenerate backend request for 1 node(s) from pid [58188], is changed to quarantine node request by watchdog
2024-07-03 14:31:08.900: health_check0 pid 58188: DETAIL: watchdog does not holds the quorum
2024-07-03 14:31:08.900: health_check0 pid 58188: LOG: signal_user1_to_parent_with_reason(0)
2024-07-03 14:31:08.900: main pid 58147: LOG: Pgpool-II parent process received SIGUSR1
2024-07-03 14:31:08.900: main pid 58147: LOG: Pgpool-II parent process has received failover request
2024-07-03 14:31:08.900: watchdog pid 58151: LOG: received the failover indication from Pgpool-II on IPC interface
2024-07-03 14:31:08.900: watchdog pid 58151: LOG: watchdog is informed of failover start by the main process
2024-07-03 14:31:08.900: main pid 58147: LOG: === Starting quarantine. shutdown host staging-ha0001(5432) ===
2024-07-03 14:31:08.900: main pid 58147: LOG: Restart all children
2024-07-03 14:31:08.900: main pid 58147: LOG: failover: set new primary node: -1
2024-07-03 14:31:08.900: main pid 58147: LOG: failover: set new main node: 1
2024-07-03 14:31:08.906: sr_check_worker pid 58187: ERROR: Failed to check replication time lag
2024-07-03 14:31:08.906: sr_check_worker pid 58187: DETAIL: No persistent db connection for the node 0
2024-07-03 14:31:08.906: sr_check_worker pid 58187: HINT: check sr_check_user and sr_check_password
2024-07-03 14:31:08.906: sr_check_worker pid 58187: CONTEXT: while checking replication time lag
2024-07-03 14:31:08.906: sr_check_worker pid 58187: LOG: worker process received restart request
2024-07-03 14:31:08.906: watchdog pid 58151: LOG: received the failover indication from Pgpool-II on IPC interface
2024-07-03 14:31:08.906: watchdog pid 58151: LOG: watchdog is informed of failover end by the main process
2024-07-03 14:31:08.906: main pid 58147: LOG: === Quarantine done. shutdown host staging-ha0001(5432) ===
2024-07-03 14:31:09.906: pcp_main pid 58186: LOG: restart request received in pcp child process
2024-07-03 14:31:09.907: main pid 58147: LOG: PCP child 58186 exits with status 0 in failover()
2024-07-03 14:31:09.908: main pid 58147: LOG: fork a new PCP child pid 58578 in failover()
2024-07-03 14:31:09.908: main pid 58147: LOG: reaper handler
2024-07-03 14:31:09.908: pcp_main pid 58578: LOG: PCP process: 58578 started
2024-07-03 14:31:09.909: main pid 58147: LOG: reaper handler: exiting normally
2024-07-03 14:31:09.909: sr_check_worker pid 58579: LOG: process started
2024-07-03 14:31:19.915: watchdog pid 58151: LOG: not able to send messages to remote node "staging-ha0001:9999 Linux staging-ha0001"
2024-07-03 14:31:19.915: watchdog pid 58151: DETAIL: marking the node as lost
2024-07-03 14:31:19.915: watchdog pid 58151: LOG: remote node "staging-ha0001:9999 Linux staging-ha0001" is lost

From VM1: pgpool.log

2024-07-03 14:30:36.444: watchdog pid 8620: LOG: remote node "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
2024-07-03 14:30:36.444: watchdog pid 8620: DETAIL: missed beacon reply count:2
2024-07-03 14:30:37.448: sr_check_worker pid 65605: LOG: failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out
2024-07-03 14:30:46.067: health_check1 pid 8676: LOG: failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out
2024-07-03 14:30:46.068: health_check1 pid 8676: LOG: health check retrying on DB node: 1 (round:1)
2024-07-03 14:30:46.455: watchdog pid 8620: LOG: remote node "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
2024-07-03 14:30:46.455: watchdog pid 8620: DETAIL: missed beacon reply count:3
2024-07-03 14:30:47.449: sr_check_worker pid 65605: ERROR: Failed to check replication time lag
2024-07-03 14:30:47.449: sr_check_worker pid 65605: DETAIL: No persistent db connection for the node 1
2024-07-03 14:30:47.449: sr_check_worker pid 65605: HINT: check sr_check_user and sr_check_password
2024-07-03 14:30:47.449: sr_check_worker pid 65605: CONTEXT: while checking replication time lag
2024-07-03 14:30:55.104: child pid 65509: LOG: failover or failback event detected
2024-07-03 14:30:55.104: child pid 65509: DETAIL: restarting myself
2024-07-03 14:30:55.104: main pid 8617: LOG: reaper handler
2024-07-03 14:30:55.105: main pid 8617: LOG: reaper handler: exiting normally
2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons
2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL: missed beacon reply count:4
2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node "staging-ha0002:9999 Linux staging-ha0002" is not responding to our beacon messages
2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL: marking the node as lost
2024-07-03 14:30:56.459: watchdog pid 8620: LOG: remote node "staging-ha0002:9999 Linux staging-ha0002" is lost
2024-07-03 14:30:56.460: watchdog pid 8620: LOG: removing watchdog node "staging-ha0002:9999 Linux staging-ha0002" from the standby list
2024-07-03 14:30:56.460: watchdog pid 8620: LOG: We have lost the quorum
2024-07-03 14:30:56.460: watchdog pid 8620: LOG: signal_user1_to_parent_with_reason(3)
2024-07-03 14:30:56.460: main pid 8617: LOG: Pgpool-II parent process received SIGUSR1
2024-07-03 14:30:56.460: main pid 8617: LOG: Pgpool-II parent process received watchdog quorum change signal from watchdog
2024-07-03 14:30:56.461: watchdog_utility pid 66197: LOG: watchdog: de-escalation started
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
2024-07-03 14:30:57.078: health_check1 pid 8676: LOG: failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out
2024-07-03 14:30:57.078: health_check1 pid 8676: LOG: health check retrying on DB node: 1 (round:2)
2024-07-03 14:30:57.418: life_check pid 8639: LOG: informing the node status change to watchdog
2024-07-03 14:30:57.418: life_check pid 8639: DETAIL: node id :1 status = "NODE DEAD" message:"No heartbeat signal from node"
2024-07-03 14:30:57.418: watchdog pid 8620: LOG: received node status change ipc message
2024-07-03 14:30:57.418: watchdog pid 8620: DETAIL: No heartbeat signal from node
2024-07-03 14:30:57.418: watchdog pid 8620: LOG: remote node "staging-ha0002:9999 Linux staging-ha0002" is lost
2024-07-03 14:30:57.464: sr_check_worker pid 65605: LOG: failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out
sudo: a password is required
2024-07-03 14:30:59.301: watchdog_utility pid 66197: LOG: failed to release the delegate IP:"10.127.1.20"
2024-07-03 14:30:59.301: watchdog_utility pid 66197: DETAIL: 'if_down_cmd' failed
2024-07-03 14:30:59.301: watchdog_utility pid 66197: WARNING: watchdog de-escalation failed to bring down delegate IP
2024-07-03 14:30:59.301: watchdog pid 8620: LOG: watchdog de-escalation process with pid: 66197 exit with SUCCESS.
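A side note on the de-escalation failure above: the "sudo: a password is required" message suggests the OS user running pgpool cannot run the if_down_cmd command non-interactively, so the delegate IP was never released on VM1. A possible remedy (our assumption; the user name and command paths below are illustrative and must match your actual if_up_cmd/if_down_cmd) would be a passwordless sudoers entry:

```
# /etc/sudoers.d/pgpool -- hypothetical example; user and paths are assumptions.
# Let the pgpool OS user (here: postgres) run the interface and arping
# commands without a password or TTY, so if_up_cmd / if_down_cmd can succeed.
Defaults:postgres !requiretty
postgres ALL=(root) NOPASSWD: /sbin/ip
postgres ALL=(root) NOPASSWD: /usr/sbin/arping
```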
2024-07-03 14:31:07.465: sr_check_worker pid 65605: ERROR: Failed to check replication time lag
2024-07-03 14:31:07.465: sr_check_worker pid 65605: DETAIL: No persistent db connection for the node 1
2024-07-03 14:31:07.465: sr_check_worker pid 65605: HINT: check sr_check_user and sr_check_password
2024-07-03 14:31:07.465: sr_check_worker pid 65605: CONTEXT: while checking replication time lag
2024-07-03 14:31:08.089: health_check1 pid 8676: LOG: failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out
2024-07-03 14:31:08.089: health_check1 pid 8676: LOG: health check retrying on DB node: 1 (round:3)
2024-07-03 14:31:17.480: sr_check_worker pid 65605: LOG: failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out
2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out
2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: health check failed on node 1 (timeout:0)
2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: received degenerate backend request for node_id: 1 from pid [8676]
2024-07-03 14:31:19.097: watchdog pid 8620: LOG: watchdog received the failover command from local pgpool-II on IPC interface
2024-07-03 14:31:19.097: watchdog pid 8620: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2024-07-03 14:31:19.097: watchdog pid 8620: LOG: failover requires the quorum to hold, which is not present at the moment
2024-07-03 14:31:19.097: watchdog pid 8620: DETAIL: Rejecting the failover request
2024-07-03 14:31:19.097: watchdog pid 8620: LOG: failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "staging-ha0001:9999 Linux staging-ha0001" is rejected because the watchdog cluster does not hold the quorum
2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: degenerate backend request for 1 node(s) from pid [8676], is changed to quarantine node request by watchdog
2024-07-03 14:31:19.097: health_check1 pid 8676: DETAIL: watchdog does not holds the quorum
2024-07-03 14:31:19.097: health_check1 pid 8676: LOG: signal_user1_to_parent_with_reason(0)
2024-07-03 14:31:19.097: main pid 8617: LOG: Pgpool-II parent process received SIGUSR1
2024-07-03 14:31:19.097: main pid 8617: LOG: Pgpool-II parent process has received failover request
2024-07-03 14:31:19.098: watchdog pid 8620: LOG: received the failover indication from Pgpool-II on IPC interface
2024-07-03 14:31:19.098: watchdog pid 8620: LOG: watchdog is informed of failover start by the main process
2024-07-03 14:31:19.098: main pid 8617: LOG: === Starting quarantine. shutdown host staging-ha0002(5432) ===
2024-07-03 14:31:19.098: main pid 8617: LOG: Do not restart children because we are switching over node id 1 host: staging-ha0002 port: 5432 and we are in streaming replication mode
2024-07-03 14:31:19.098: main pid 8617: LOG: failover: set new primary node: 0
2024-07-03 14:31:19.098: main pid 8617: LOG: failover: set new main node: 0
2024-07-03 14:31:19.098: sr_check_worker pid 65605: ERROR: Failed to check replication time lag
2024-07-03 14:31:19.098: sr_check_worker pid 65605: DETAIL: No persistent db connection for the node 1
2024-07-03 14:31:19.098: sr_check_worker pid 65605: HINT: check sr_check_user and sr_check_password
2024-07-03 14:31:19.098: sr_check_worker pid 65605: CONTEXT: while checking replication time lag
2024-07-03 14:31:19.098: sr_check_worker pid 65605: LOG: worker process received restart request
2024-07-03 14:31:19.098: watchdog pid 8620: LOG: received the failover indication from Pgpool-II on IPC interface
2024-07-03 14:31:19.098: watchdog pid 8620: LOG: watchdog is informed of failover end by the main process
2024-07-03 14:31:19.098: main pid 8617: LOG: === Quarantine done. shutdown host staging-ha0002(5432) ===
2024-07-03 14:35:59.420: watchdog pid 8620: LOG: new outbound connection to staging-ha0002:9000
2024-07-03 14:35:59.423: watchdog pid 8620: LOG: "staging-ha0001:9999 Linux staging-ha0001" is the coordinator as per our record but "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a coordinator
2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: cluster is in the split-brain
2024-07-03 14:35:59.423: watchdog pid 8620: LOG: I am the coordinator but "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a coordinator
2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: trying to figure out the best contender for the leader/coordinator node
2024-07-03 14:35:59.423: watchdog pid 8620: LOG: remote node:"staging-ha0002:9999 Linux staging-ha0002" should step down from leader because we are the older leader
2024-07-03 14:35:59.423: watchdog pid 8620: LOG: We are in split brain, and I am the best candidate for leader/coordinator
2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: asking the remote node "staging-ha0002:9999 Linux staging-ha0002" to step down
2024-07-03 14:35:59.423: watchdog pid 8620: LOG: we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost
2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: we had lost this node because of "REPORTED BY LIFECHECK"
2024-07-03 14:35:59.423: watchdog pid 8620: LOG: node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process
2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: node will be added to cluster once life-check mark it as reachable again
2024-07-03 14:35:59.423: watchdog pid 8620: LOG: "staging-ha0001:9999 Linux staging-ha0001" is the coordinator as per our record but "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a coordinator
2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL: cluster is in the split-brain
2024-07-03 14:35:59.424: watchdog pid 8620: LOG: I am the coordinator but "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a coordinator
2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: trying to figure out the best contender for the leader/coordinator node
2024-07-03 14:35:59.424: watchdog pid 8620: LOG: remote node:"staging-ha0002:9999 Linux staging-ha0002" should step down from leader because we are the older leader
2024-07-03 14:35:59.424: watchdog pid 8620: LOG: We are in split brain, and I am the best candidate for leader/coordinator
2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: asking the remote node "staging-ha0002:9999 Linux staging-ha0002" to step down
2024-07-03 14:35:59.424: watchdog pid 8620: LOG: we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost
2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: we had lost this node because of "REPORTED BY LIFECHECK"
2024-07-03 14:35:59.424: watchdog pid 8620: LOG: node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process
2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL: node will be added to cluster once life-check mark it as reachable again
2024-07-03 14:35:59.424: watchdog pid 8620: LOG: remote node "staging-ha0002:9999 Linux staging-ha0002" is reporting that it has found us again
2024-07-03 14:35:59.425: watchdog pid 8620: LOG: leader/coordinator node "staging-ha0002:9999 Linux staging-ha0002" decided to resign from leader, probably because of split-brain
2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: It was not our coordinator/leader anyway. ignoring the message
2024-07-03 14:35:59.425: watchdog pid 8620: LOG: we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost
2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: we had lost this node because of "REPORTED BY LIFECHECK"
2024-07-03 14:35:59.425: watchdog pid 8620: LOG: node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process
2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL: node will be added to cluster once life-check mark it as reachable again
[the same four NODE INFO / node-lost lines repeat three more times between 14:35:59.425 and 14:35:59.427]
2024-07-03 14:36:00.213: health_check1 pid 8676: LOG: failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out
2024-07-03 14:36:00.213: health_check1 pid 8676: LOG: health check retrying on DB node: 1 (round:3)
2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: health check retrying on DB node: 1 succeeded
2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: received failback request for node_id: 1 from pid [8676]
2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: failback request from pid [8676] is changed to update status request because node_id: 1 was quarantined
2024-07-03 14:36:01.221: health_check1 pid 8676: LOG: signal_user1_to_parent_with_reason(0)
2024-07-03 14:36:01.221: main pid 8617: LOG: Pgpool-II parent process received SIGUSR1
2024-07-03 14:36:01.221: main pid 8617: LOG: Pgpool-II parent process has received failover request
2024-07-03 14:36:01.221: watchdog pid 8620: LOG: received the failover indication from Pgpool-II on IPC interface
2024-07-03 14:36:01.221: watchdog pid 8620: LOG: watchdog is informed of failover start by the main process
2024-07-03 14:36:01.221: watchdog pid 8620: LOG: watchdog is informed of failover start by the main process
2024-07-03 14:36:01.222: main pid 8617: LOG: === Starting fail back. reconnect host staging-ha0002(5432) ===
2024-07-03 14:36:01.222: main pid 8617: LOG: Node 0 is not down (status: 2)
2024-07-03 14:36:01.222: main pid 8617: LOG: Do not restart children because we are failing back node id 1 host: staging-ha0002 port: 5432 and we are in streaming replication mode and not all backends were down
2024-07-03 14:36:01.222: main pid 8617: LOG: failover: set new primary node: 0
2024-07-03 14:36:01.222: main pid 8617: LOG: failover: set new main node: 0
2024-07-03 14:36:01.222: sr_check_worker pid 66222: LOG: worker process received restart request
2024-07-03 14:36:01.222: watchdog pid 8620: LOG: received the failover indication from Pgpool-II on IPC interface
2024-07-03 14:36:01.222: watchdog pid 8620: LOG: watchdog is informed of failover end by the main process
2024-07-03 14:36:01.222: main pid 8617: LOG: === Failback done. reconnect host staging-ha0002(5432) ===

*Questions:*
1. Regarding observation 2: why are client connections not routed to the new primary on VM2 after it is promoted?
2. In this kind of setup, can writes be split across the two nodes (i.e. transactions landing on both primaries) during a network glitch?

If anyone has worked on a similar setup, please share your insights.

Thank you.

Regards,
Mukesh
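P.S. One configuration angle we are still investigating, in case it is relevant to the quorum rejections in the logs above (a sketch only; the parameter choices are our assumption, not a verified fix):

```
# pgpool.conf (pgpool-II >= 4.1) -- two-node watchdog quorum tuning (untested by us)

# With only two watchdog nodes, losing the peer means quorum is lost and
# failover gets rejected (as seen in our logs). This setting lets exactly
# half the votes (1 of 2) count as quorum -- at the cost of a higher
# split-brain risk when the network partitions.
enable_consensus_with_half_votes = on

# Perform failover only when the watchdog cluster holds quorum,
# and require consensus from the nodes before degenerating a backend.
failover_when_quorum_exists = on
failover_require_consensus = on
```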