Hello everyone,
We are doing a PoC of a PostgreSQL HA setup with asynchronous streaming
replication, using Pgpool-II for load balancing and connection pooling and
repmgr for HA and automatic failover.
As a test case, we isolate the VM1 node from the network completely for
more than 2 minutes and then reconnect it, because we want to verify how
the system behaves during network glitches and whether there is any chance
of split-brain.
Our current setup is as follows:
Two VMs on Azure, each running PostgreSQL along with the Pgpool-II service.
[image: image.png]
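
For reference, replication and failover on the PostgreSQL side are handled
by repmgr; we check the replication cluster state with something like the
following (the repmgr.conf path is just where it lives on our VMs and is
only illustrative):

# show repmgr's view of the replication cluster, on either VM
repmgr -f /etc/repmgr.conf cluster show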

We enabled the watchdog and assigned a delegate IP.
*NOTE: due to some limitations we are using a floating IP as the delegate
IP.*
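
For reference, the delegate-IP related watchdog settings look roughly like
the following (the interface name and exact commands are illustrative, not
our verbatim values; 10.127.1.20 is the delegate IP that appears in the
logs below):

delegate_ip = '10.127.1.20'
if_cmd_path = '/sbin'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev eth0'
arping_path = '/usr/sbin'
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'

Since these commands go through sudo, the OS user running pgpool needs a
passwordless sudo rule for them.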

During the test, here are our observations:
1. Client connections hung from the moment VM1 was lost from the network
until VM1 came back (a typical client test through the VIP is shown after
this list).
2. Once VM1 was lost, Pgpool-II on VM2 became the LEADER node and the
PostgreSQL standby on VM2 was promoted to primary, yet client connections
still did not reach the new primary. Why does this not work?
3. Once VM1 rejoined the network, a split-brain situation occurred: pgpool
on VM1 took over as the LEADER node (as pgpool.log shows), and from then on
clients connected to the VM1 node via the VIP.
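
For completeness, a typical client test through the VIP looks like the
following (the database and user names are illustrative; 9999 is the pgpool
port and 10.127.1.20 the delegate IP from the logs):

psql -h 10.127.1.20 -p 9999 -U appuser -d appdb -c 'SELECT 1'

This is the connection that hangs while VM1 is isolated.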

*pgpool.conf*

sr_check_period = 10
health_check_period = 30
health_check_timeout = 20
health_check_max_retries = 3
health_check_retry_delay = 1
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
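
For reference, we verify the watchdog and backend state with commands along
these lines (the PCP port 9898 and the user names are illustrative):

# watchdog leader/standby state as seen by the local pgpool
pcp_watchdog_info -h localhost -p 9898 -U pcpadmin -v

# backend node status via the pgpool port on the VIP
psql -h 10.127.1.20 -p 9999 -U appuser -d appdb -c 'SHOW pool_nodes;'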


*Log information:*

From VM2:

*pgpool.log*

14:30:17: network disconnected (VM1 isolated)

About 10 seconds later, the streaming replication check failed with a
timeout.

2024-07-03 14:30:26.176: sr_check_worker pid 58187: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out



Then pgpool's health check also failed, timing out according to
health_check_timeout = 20.

2024-07-03 14:30:35.869: health_check0 pid 58188: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out



health_check and sr_check were retried but timed out again.



2024-07-03 14:30:46.187: sr_check_worker pid 58187: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:30:46.880: health_check0 pid 58188: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out



The watchdog then reported that the leader node was lost.



2024-07-03 14:30:47.192: watchdog pid 58151: WARNING:  we have not received
a beacon message from leader node "staging-ha0001:9999 Linux staging-ha0001"

2024-07-03 14:30:47.192: watchdog pid 58151: DETAIL:  requesting info
message from leader node

2024-07-03 14:30:54.312: watchdog pid 58151: LOG:  read from socket failed,
remote end closed the connection

2024-07-03 14:30:54.312: watchdog pid 58151: LOG:  client socket of
staging-ha0001:9999 Linux staging-ha0001 is closed

2024-07-03 14:30:54.313: watchdog pid 58151: LOG:  remote node
"staging-ha0001:9999 Linux staging-ha0001" is reporting that it has lost us

2024-07-03 14:30:54.313: watchdog pid 58151: LOG:  we are lost on the
leader node "staging-ha0001:9999 Linux staging-ha0001"



health_check and sr_check were retried but timed out again.



2024-07-03 14:30:57.888: health_check0 pid 58188: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:30:57.888: health_check0 pid 58188: LOG:  health check
retrying on DB node: 0 (round:3)

2024-07-03 14:31:06.201: sr_check_worker pid 58187: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out





About 10 seconds after the leader node was lost, the watchdog changed the
current node's state to LEADER.

2024-07-03 14:31:04.199: watchdog pid 58151: LOG:  watchdog node state
changed from [STANDING FOR LEADER] to [LEADER]





The health check failed on node 0, a degenerate backend request was
received for node 0, and the pgpool main process started quarantining
staging-ha0001(5432) (shutting it down).



2024-07-03 14:31:08.202: watchdog pid 58151: LOG:  setting the local node
"staging-ha0002:9999 Linux staging-ha0002" as watchdog cluster leader

2024-07-03 14:31:08.202: watchdog pid 58151: LOG:
signal_user1_to_parent_with_reason(1)

2024-07-03 14:31:08.202: watchdog pid 58151: LOG:  I am the cluster leader
node but we do not have enough nodes in cluster

2024-07-03 14:31:08.202: watchdog pid 58151: DETAIL:  waiting for the
quorum to start escalation process

2024-07-03 14:31:08.202: main pid 58147: LOG:  Pgpool-II parent process
received SIGUSR1

2024-07-03 14:31:08.202: main pid 58147: LOG:  Pgpool-II parent process
received watchdog state change signal from watchdog

2024-07-03 14:31:08.899: health_check0 pid 58188: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:31:08.899: health_check0 pid 58188: LOG:  health check failed
on node 0 (timeout:0)

2024-07-03 14:31:08.899: health_check0 pid 58188: LOG:  received degenerate
backend request for node_id: 0 from pid [58188]

2024-07-03 14:31:08.899: watchdog pid 58151: LOG:  watchdog received the
failover command from local pgpool-II on IPC interface

2024-07-03 14:31:08.899: watchdog pid 58151: LOG:  watchdog is processing
the failover command [DEGENERATE_BACKEND_REQUEST] received from local
pgpool-II on IPC interface

2024-07-03 14:31:08.899: watchdog pid 58151: LOG:  failover requires the
quorum to hold, which is not present at the moment

2024-07-03 14:31:08.899: watchdog pid 58151: DETAIL:  Rejecting the
failover request

2024-07-03 14:31:08.899: watchdog pid 58151: LOG:  failover command
[DEGENERATE_BACKEND_REQUEST] request from pgpool-II node
"staging-ha0002:9999 Linux staging-ha0002" is rejected because the watchdog
cluster does not hold the quorum

2024-07-03 14:31:08.900: health_check0 pid 58188: LOG:  degenerate backend
request for 1 node(s) from pid [58188], is changed to quarantine node
request by watchdog

2024-07-03 14:31:08.900: health_check0 pid 58188: DETAIL:  watchdog does
not holds the quorum

2024-07-03 14:31:08.900: health_check0 pid 58188: LOG:
signal_user1_to_parent_with_reason(0)

2024-07-03 14:31:08.900: main pid 58147: LOG:  Pgpool-II parent process
received SIGUSR1

2024-07-03 14:31:08.900: main pid 58147: LOG:  Pgpool-II parent process has
received failover request

2024-07-03 14:31:08.900: watchdog pid 58151: LOG:  received the failover
indication from Pgpool-II on IPC interface

2024-07-03 14:31:08.900: watchdog pid 58151: LOG:  watchdog is informed of
failover start by the main process

2024-07-03 14:31:08.900: main pid 58147: LOG:  === Starting quarantine.
shutdown host staging-ha0001(5432) ===

2024-07-03 14:31:08.900: main pid 58147: LOG:  Restart all children

2024-07-03 14:31:08.900: main pid 58147: LOG:  failover: set new primary
node: -1

2024-07-03 14:31:08.900: main pid 58147: LOG:  failover: set new main node:
1

2024-07-03 14:31:08.906: sr_check_worker pid 58187: ERROR:  Failed to check
replication time lag

2024-07-03 14:31:08.906: sr_check_worker pid 58187: DETAIL:  No persistent
db connection for the node 0

2024-07-03 14:31:08.906: sr_check_worker pid 58187: HINT:  check
sr_check_user and sr_check_password

2024-07-03 14:31:08.906: sr_check_worker pid 58187: CONTEXT:  while
checking replication time lag

2024-07-03 14:31:08.906: sr_check_worker pid 58187: LOG:  worker process
received restart request

2024-07-03 14:31:08.906: watchdog pid 58151: LOG:  received the failover
indication from Pgpool-II on IPC interface

2024-07-03 14:31:08.906: watchdog pid 58151: LOG:  watchdog is informed of
failover end by the main process

2024-07-03 14:31:08.906: main pid 58147: LOG:  === Quarantine done.
shutdown host staging-ha0001(5432) ===

2024-07-03 14:31:09.906: pcp_main pid 58186: LOG:  restart request received
in pcp child process

2024-07-03 14:31:09.907: main pid 58147: LOG:  PCP child 58186 exits with
status 0 in failover()

2024-07-03 14:31:09.908: main pid 58147: LOG:  fork a new PCP child pid
58578 in failover()

2024-07-03 14:31:09.908: main pid 58147: LOG:  reaper handler

2024-07-03 14:31:09.908: pcp_main pid 58578: LOG:  PCP process: 58578
started

2024-07-03 14:31:09.909: main pid 58147: LOG:  reaper handler: exiting
normally

2024-07-03 14:31:09.909: sr_check_worker pid 58579: LOG:  process started

2024-07-03 14:31:19.915: watchdog pid 58151: LOG:  not able to send
messages to remote node "staging-ha0001:9999 Linux staging-ha0001"

2024-07-03 14:31:19.915: watchdog pid 58151: DETAIL:  marking the node as
lost

2024-07-03 14:31:19.915: watchdog pid 58151: LOG:  remote node
"staging-ha0001:9999 Linux staging-ha0001" is lost







From VM1:

*pgpool.log*

2024-07-03 14:30:36.444: watchdog pid 8620: LOG:  remote node
"staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons

2024-07-03 14:30:36.444: watchdog pid 8620: DETAIL:  missed beacon reply
count:2

2024-07-03 14:30:37.448: sr_check_worker pid 65605: LOG:  failed to connect
to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:30:46.067: health_check1 pid 8676: LOG:  failed to connect to
PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:30:46.068: health_check1 pid 8676: LOG:  health check
retrying on DB node: 1 (round:1)

2024-07-03 14:30:46.455: watchdog pid 8620: LOG:  remote node
"staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons

2024-07-03 14:30:46.455: watchdog pid 8620: DETAIL:  missed beacon reply
count:3

2024-07-03 14:30:47.449: sr_check_worker pid 65605: ERROR:  Failed to check
replication time lag

2024-07-03 14:30:47.449: sr_check_worker pid 65605: DETAIL:  No persistent
db connection for the node 1

2024-07-03 14:30:47.449: sr_check_worker pid 65605: HINT:  check
sr_check_user and sr_check_password

2024-07-03 14:30:47.449: sr_check_worker pid 65605: CONTEXT:  while
checking replication time lag

2024-07-03 14:30:55.104: child pid 65509: LOG:  failover or failback event
detected

2024-07-03 14:30:55.104: child pid 65509: DETAIL:  restarting myself

2024-07-03 14:30:55.104: main pid 8617: LOG:  reaper handler

2024-07-03 14:30:55.105: main pid 8617: LOG:  reaper handler: exiting
normally

2024-07-03 14:30:56.459: watchdog pid 8620: LOG:  remote node
"staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons

2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL:  missed beacon reply
count:4

2024-07-03 14:30:56.459: watchdog pid 8620: LOG:  remote node
"staging-ha0002:9999 Linux staging-ha0002" is not responding to our beacon
messages

2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL:  marking the node as
lost

2024-07-03 14:30:56.459: watchdog pid 8620: LOG:  remote node
"staging-ha0002:9999 Linux staging-ha0002" is lost

2024-07-03 14:30:56.460: watchdog pid 8620: LOG:  removing watchdog node
"staging-ha0002:9999 Linux staging-ha0002" from the standby list

2024-07-03 14:30:56.460: watchdog pid 8620: LOG:  We have lost the quorum

2024-07-03 14:30:56.460: watchdog pid 8620: LOG:
signal_user1_to_parent_with_reason(3)

2024-07-03 14:30:56.460: main pid 8617: LOG:  Pgpool-II parent process
received SIGUSR1

2024-07-03 14:30:56.460: main pid 8617: LOG:  Pgpool-II parent process
received watchdog quorum change signal from watchdog

2024-07-03 14:30:56.461: watchdog_utility pid 66197: LOG:  watchdog:
de-escalation started

sudo: a terminal is required to read the password; either use the -S option
to read from standard input or configure an askpass helper

2024-07-03 14:30:57.078: health_check1 pid 8676: LOG:  failed to connect to
PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:30:57.078: health_check1 pid 8676: LOG:  health check
retrying on DB node: 1 (round:2)

2024-07-03 14:30:57.418: life_check pid 8639: LOG:  informing the node
status change to watchdog

2024-07-03 14:30:57.418: life_check pid 8639: DETAIL:  node id :1 status =
"NODE DEAD" message:"No heartbeat signal from node"

2024-07-03 14:30:57.418: watchdog pid 8620: LOG:  received node status
change ipc message

2024-07-03 14:30:57.418: watchdog pid 8620: DETAIL:  No heartbeat signal
from node

2024-07-03 14:30:57.418: watchdog pid 8620: LOG:  remote node
"staging-ha0002:9999 Linux staging-ha0002" is lost

2024-07-03 14:30:57.464: sr_check_worker pid 65605: LOG:  failed to connect
to PostgreSQL server on "staging-ha0002:5432", timed out

sudo: a password is required

2024-07-03 14:30:59.301: watchdog_utility pid 66197: LOG:  failed to
release the delegate IP:"10.127.1.20"

2024-07-03 14:30:59.301: watchdog_utility pid 66197: DETAIL:  'if_down_cmd'
failed

2024-07-03 14:30:59.301: watchdog_utility pid 66197: WARNING:  watchdog
de-escalation failed to bring down delegate IP

2024-07-03 14:30:59.301: watchdog pid 8620: LOG:  watchdog de-escalation
process with pid: 66197 exit with SUCCESS.



2024-07-03 14:31:07.465: sr_check_worker pid 65605: ERROR:  Failed to check
replication time lag

2024-07-03 14:31:07.465: sr_check_worker pid 65605: DETAIL:  No persistent
db connection for the node 1

2024-07-03 14:31:07.465: sr_check_worker pid 65605: HINT:  check
sr_check_user and sr_check_password

2024-07-03 14:31:07.465: sr_check_worker pid 65605: CONTEXT:  while
checking replication time lag

2024-07-03 14:31:08.089: health_check1 pid 8676: LOG:  failed to connect to
PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:31:08.089: health_check1 pid 8676: LOG:  health check
retrying on DB node: 1 (round:3)

2024-07-03 14:31:17.480: sr_check_worker pid 65605: LOG:  failed to connect
to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  failed to connect to
PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  health check failed
on node 1 (timeout:0)

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  received degenerate
backend request for node_id: 1 from pid [8676]

2024-07-03 14:31:19.097: watchdog pid 8620: LOG:  watchdog received the
failover command from local pgpool-II on IPC interface

2024-07-03 14:31:19.097: watchdog pid 8620: LOG:  watchdog is processing
the failover command [DEGENERATE_BACKEND_REQUEST] received from local
pgpool-II on IPC interface

2024-07-03 14:31:19.097: watchdog pid 8620: LOG:  failover requires the
quorum to hold, which is not present at the moment

2024-07-03 14:31:19.097: watchdog pid 8620: DETAIL:  Rejecting the failover
request

2024-07-03 14:31:19.097: watchdog pid 8620: LOG:  failover command
[DEGENERATE_BACKEND_REQUEST] request from pgpool-II node
"staging-ha0001:9999 Linux staging-ha0001" is rejected because the watchdog
cluster does not hold the quorum

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  degenerate backend
request for 1 node(s) from pid [8676], is changed to quarantine node
request by watchdog

2024-07-03 14:31:19.097: health_check1 pid 8676: DETAIL:  watchdog does not
holds the quorum

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:
signal_user1_to_parent_with_reason(0)

2024-07-03 14:31:19.097: main pid 8617: LOG:  Pgpool-II parent process
received SIGUSR1

2024-07-03 14:31:19.097: main pid 8617: LOG:  Pgpool-II parent process has
received failover request

2024-07-03 14:31:19.098: watchdog pid 8620: LOG:  received the failover
indication from Pgpool-II on IPC interface

2024-07-03 14:31:19.098: watchdog pid 8620: LOG:  watchdog is informed of
failover start by the main process

2024-07-03 14:31:19.098: main pid 8617: LOG:  === Starting quarantine.
shutdown host staging-ha0002(5432) ===

2024-07-03 14:31:19.098: main pid 8617: LOG:  Do not restart children
because we are switching over node id 1 host: staging-ha0002 port: 5432 and
we are in streaming replication mode

2024-07-03 14:31:19.098: main pid 8617: LOG:  failover: set new primary
node: 0

2024-07-03 14:31:19.098: main pid 8617: LOG:  failover: set new main node: 0

2024-07-03 14:31:19.098: sr_check_worker pid 65605: ERROR:  Failed to check
replication time lag

2024-07-03 14:31:19.098: sr_check_worker pid 65605: DETAIL:  No persistent
db connection for the node 1

2024-07-03 14:31:19.098: sr_check_worker pid 65605: HINT:  check
sr_check_user and sr_check_password

2024-07-03 14:31:19.098: sr_check_worker pid 65605: CONTEXT:  while
checking replication time lag

2024-07-03 14:31:19.098: sr_check_worker pid 65605: LOG:  worker process
received restart request

2024-07-03 14:31:19.098: watchdog pid 8620: LOG:  received the failover
indication from Pgpool-II on IPC interface

2024-07-03 14:31:19.098: watchdog pid 8620: LOG:  watchdog is informed of
failover end by the main process

2024-07-03 14:31:19.098: main pid 8617: LOG:  === Quarantine done. shutdown
host staging-ha0002(5432) ==





2024-07-03 14:35:59.420: watchdog pid 8620: LOG:  new outbound connection
to staging-ha0002:9000

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  "staging-ha0001:9999
Linux staging-ha0001" is the coordinator as per our record but
"staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
coordinator

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  cluster is in the
split-brain

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  I am the coordinator but
"staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
coordinator

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  trying to figure out
the best contender for the leader/coordinator node

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  remote
node:"staging-ha0002:9999 Linux staging-ha0002" should step down from
leader because we are the older leader

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  We are in split brain,
and I am the best candidate for leader/coordinator

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  asking the remote node
"staging-ha0002:9999 Linux staging-ha0002" to step down

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  we have received the NODE
INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
was lost

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  we had lost this node
because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  node:"staging-ha0002:9999
Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  node will be added to
cluster once life-check mark it as reachable again

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  "staging-ha0001:9999
Linux staging-ha0001" is the coordinator as per our record but
"staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
coordinator

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  cluster is in the
split-brain

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  I am the coordinator but
"staging-ha0002:9999 Linux staging-ha0002" is also announcing as a
coordinator

2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL:  trying to figure out
the best contender for the leader/coordinator node

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  remote
node:"staging-ha0002:9999 Linux staging-ha0002" should step down from
leader because we are the older leader

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  We are in split brain,
and I am the best candidate for leader/coordinator

2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL:  asking the remote node
"staging-ha0002:9999 Linux staging-ha0002" to step down

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  we have received the NODE
INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
was lost

2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL:  we had lost this node
because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  node:"staging-ha0002:9999
Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL:  node will be added to
cluster once life-check mark it as reachable again

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  remote node
"staging-ha0002:9999 Linux staging-ha0002" is reporting that it has found
us again

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  leader/coordinator node
"staging-ha0002:9999 Linux staging-ha0002" decided to resign from leader,
probably because of split-brain

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  It was not our
coordinator/leader anyway. ignoring the message

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  we have received the NODE
INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
was lost

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  we had lost this node
because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  node:"staging-ha0002:9999
Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  node will be added to
cluster once life-check mark it as reachable again

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  we have received the NODE
INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
was lost

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  we had lost this node
because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  node:"staging-ha0002:9999
Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  node will be added to
cluster once life-check mark it as reachable again

2024-07-03 14:35:59.427: watchdog pid 8620: LOG:  we have received the NODE
INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
was lost

2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL:  we had lost this node
because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.427: watchdog pid 8620: LOG:  node:"staging-ha0002:9999
Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL:  node will be added to
cluster once life-check mark it as reachable again

2024-07-03 14:35:59.427: watchdog pid 8620: LOG:  we have received the NODE
INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that
was lost

2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL:  we had lost this node
because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.427: watchdog pid 8620: LOG:  node:"staging-ha0002:9999
Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL:  node will be added to
cluster once life-check mark it as reachable again

2024-07-03 14:36:00.213: health_check1 pid 8676: LOG:  failed to connect to
PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:36:00.213: health_check1 pid 8676: LOG:  health check
retrying on DB node: 1 (round:3)

2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:  health check
retrying on DB node: 1 succeeded

2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:  received failback
request for node_id: 1 from pid [8676]

2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:  failback request
from pid [8676] is changed to update status request because node_id: 1 was
quarantined

2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:
signal_user1_to_parent_with_reason(0)

2024-07-03 14:36:01.221: main pid 8617: LOG:  Pgpool-II parent process
received SIGUSR1

2024-07-03 14:36:01.221: main pid 8617: LOG:  Pgpool-II parent process has
received failover request

2024-07-03 14:36:01.221: watchdog pid 8620: LOG:  received the failover
indication from Pgpool-II on IPC interface

2024-07-03 14:36:01.221: watchdog pid 8620: LOG:  watchdog is informed of
failover start by the main process

2024-07-03 14:36:01.221: watchdog pid 8620: LOG:  watchdog is informed of
failover start by the main process

2024-07-03 14:36:01.222: main pid 8617: LOG:  === Starting fail back.
reconnect host staging-ha0002(5432) ===

2024-07-03 14:36:01.222: main pid 8617: LOG:  Node 0 is not down (status: 2)

2024-07-03 14:36:01.222: main pid 8617: LOG:  Do not restart children
because we are failing back node id 1 host: staging-ha0002 port: 5432 and
we are in streaming replication mode and not all backends were down

2024-07-03 14:36:01.222: main pid 8617: LOG:  failover: set new primary
node: 0

2024-07-03 14:36:01.222: main pid 8617: LOG:  failover: set new main node: 0

2024-07-03 14:36:01.222: sr_check_worker pid 66222: LOG:  worker process
received restart request

2024-07-03 14:36:01.222: watchdog pid 8620: LOG:  received the failover
indication from Pgpool-II on IPC interface

2024-07-03 14:36:01.222: watchdog pid 8620: LOG:  watchdog is informed of
failover end by the main process

2024-07-03 14:36:01.222: main pid 8617: LOG:  === Failback done. reconnect
host staging-ha0002(5432) ===


*Questions:*
1. Regarding point 2 in the observations, why do client connections not go
to the new primary?
2. In this kind of setup, can transactions be split across the two nodes
when there is a network glitch?

If anyone has worked on a similar kind of setup, please share your
insights.
Thank you

Regards
Mukesh
