Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi, here is the complete cluster configuration:

node 1: master
node 2: secondary
primitive AWSVIP awsvip \
        params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
        params pgdata="/var/lib/postgresql/9.5/main" bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
        op start timeout=60s interval=0 \
        op stop timeout=60s interval=0 \
        op promote timeout=15s interval=0 \
        op demote timeout=120s interval=0 \
        op monitor interval=15s timeout=10s role=Master \
        op monitor interval=16s timeout=10s role=Slave \
        op notify timeout=60 interval=0
primitive fencing-postgres-ha-2 stonith:external/ec2 \
        params port=master \
        op start interval=0s timeout=60s \
        op monitor interval=360s timeout=60s \
        op stop interval=0s timeout=60s
primitive fencing-test-rsyslog stonith:external/ec2 \
        params port=secondary \
        op start interval=0s timeout=60s \
        op monitor interval=360s timeout=60s \
        op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
        meta notify=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop symmetrical=false
location loc-fence-master fencing-postgres-ha-2 -inf: master
location loc-fence-secondary fencing-test-rsyslog -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start symmetrical=false
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        cluster-name=psql-ha \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        last-lrm-refresh=1556315444 \
        maintenance-mode=false
rsc_defaults rsc-options: \
        resource-stickiness=10 \
        migration-threshold=2

I tried to start Postgres manually to be sure it is OK. There are no errors in the Postgres log. I also tried with different meta parameters, but always with notify=true.
I also tried this:

ms PGSQL-HA PGSQL \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true

I have followed this link:
https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html

Once STONITH was enabled and working, I imported all the other resources and constraints together at the same time.

On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais wrote:
> Hi,
>
> On Thu, 25 Apr 2019 18:57:55 +0200
> Danka Ivanović wrote:
>
> > Apr 25 16:39:50 [4213] master lrmd: notice: operation_finished:
> > PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You must set meta
> > parameter notify=true for your master resource ]
>
> The resource agent pgsqlms refuses to start PostgreSQL because your
> configuration lacks the "notify=true" attribute in your master definition.
>
> Could you please share your full Pacemaker configuration?
>
> Regards,

--
Pozdrav
Danka Ivanovic
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
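For reference, the meta attribute named in the ocf-exit-reason above can also be set on an existing master/slave resource without re-creating it. A minimal crmsh sketch, assuming the resource name PGSQL-HA from the configuration above:

```shell
# Set the meta attribute PAF requires on the existing ms resource:
crm resource meta PGSQL-HA set notify true

# Verify the resource definition now carries notify=true:
crm configure show PGSQL-HA
```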
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,

Here are the logs from when Pacemaker fails to start the Postgres service on the master. It manages to start only the Postgres slave. I tried different configurations with the pgsqlms and pgsql resource agents. These errors appear when I use the pgsqlms agent, whose configuration I sent in the first mail:

Apr 25 16:40:23 [4213] master lrmd: info: log_execute: executing - rsc:PGSQL action:start call_id:51 launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_ctl --pgdata /var/lib/postgresql/9.5/main -w --timeout 120 start -o -c config_file=/etc/postgresql/9.5/main/postgresql.conf"
Apr 25 16:40:24 [4211] master cib: info: cib_perform_op: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='PGSQL']/lrm_rsc_op[@id='PGSQL_last_0']: @operation_key=PGSQL_start_0, @operation=start, @transition-key=12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf, @transition-magic=0:0;12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf, @call-id=176, @rc-code=0, @exec-time=1146, @queue-time=0
Apr 25 16:40:53 [4216] master crmd: debug: crm_timer_start: Started Shutdown Escalation (I_STOP:120ms), src=53
Apr 25 16:41:23 [4213] master lrmd: warning: child_timeout_callback: PGSQL_start_0 process (PID 5986) timed out

Part of the log is attached.

On Tue, 23 Apr 2019 at 17:28, Danka Ivanović wrote:
> Hi,
> It seems that an ldap timeout caused the cluster failure. The cluster
> checks status every 15s on the master and 16s on the slave. The cluster
> needs the postgres user for authentication, but ldap first queries the
> user on the ldap server and only then locally on the host. When the
> connection to the ldap server was interrupted, the cluster couldn't find
> the postgres user and authenticate on the db to check state. The problem
> is solved by reconfiguring /etc/ldap.conf and /etc/nslcd.conf. The
> following variable was added: nss_initgroups_ignoreusers, with the
> specified local users which should be ignored when querying the ldap
> server. Thanks for your help. :)
> Another problem is that I cannot start the postgres master with pacemaker.
> When I start postgres manually (with systemd) and then start pacemaker on
> the slave, pacemaker is able to recognize the master, start the slave,
> and failover works.
> That is another problem which I didn't manage to solve. Should I send a
> new mail for that issue, or can we continue in this thread?
>
> On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais
> wrote:
>
>> On Fri, 19 Apr 2019 17:26:14 +0200
>> Danka Ivanović wrote:
>> ...
>> > Should I change any of those timeout parameters in order to avoid
>> > timeout?
>>
>> You can try to raise the timeout, indeed. But as far as we don't know
>> **why** your VMs froze for some time, it is difficult to guess how high
>> these timeouts should be.
>>
>> Not to mention that it will raise your RTO.
>
> --
> Pozdrav
> Danka Ivanovic

--
Pozdrav
Danka Ivanovic

Apr 25 16:39:50 [4211] master cib: debug: crm_client_new: Connecting 0x55d8444e8e80 for uid=0 gid=0 pid=5791 id=c93d535d-77d8-4556-9a63-d9a1c2b45de9
Apr 25 16:39:50 [4211] master cib: debug: handle_new_connection: IPC credentials authenticated (4211-5791-13)
Apr 25 16:39:50 [4211] master cib: debug: qb_ipcs_shm_connect: connecting to client [5791]
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master cib: debug: cib_acl_enabled: CIB ACL is disabled
Apr 25 16:39:50 [4211] master cib: debug: qb_ipcs_dispatch_connection_request: HUP conn (4211-5791-13)
Apr 25 16:39:50 [4211] master cib: debug: qb_ipcs_disconnect: qb_ipcs_disconnect(4211-5791-13) state:2
Apr 25 16:39:50 [4211] master cib: debug: crm_client_destroy: Destroying 0 events
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-response-4211-5791-13-header
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-event-4211-5791-13-header
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-request-4211-5791-13-header
Apr 25 16:39:50 [15544] master corosync debug [QB] IPC credentials authenticated (15544-5837-24)
Apr 25 16:39:50 [15544] master corosync debug [QB] connecting to client [5837]
Apr 25 16:39:50 [15544] master corosync debug [QB] shm size:1048589;
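One way to separate a PostgreSQL-side problem from a Pacemaker-side one is to reproduce by hand, as the postgres user, the exact start command the lrmd log above shows. A sketch using only the paths and options taken from those logs:

```shell
# Run the same pg_ctl invocation Pacemaker's lrmd logged, as user postgres.
# If this also hangs past 120s, the problem is in PostgreSQL, not Pacemaker.
sudo -u postgres /usr/lib/postgresql/9.5/bin/pg_ctl \
    --pgdata /var/lib/postgresql/9.5/main \
    -w --timeout 120 start \
    -o "-c config_file=/etc/postgresql/9.5/main/postgresql.conf"
```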
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,

It seems that an LDAP timeout caused the cluster failure. The cluster checks status every 15s on the master and 16s on the slave. The cluster needs the postgres user for authentication, but NSS first queries the user on the LDAP server and only then locally on the host. When the connection to the LDAP server was interrupted, the cluster couldn't find the postgres user and authenticate to the database to check its state. The problem is solved by reconfiguring /etc/ldap.conf and /etc/nslcd.conf. The following variable was added: nss_initgroups_ignoreusers, with the specified local users which should be ignored when querying the LDAP server. Thanks for your help. :)

Another problem is that I cannot start the Postgres master with Pacemaker. When I start Postgres manually (with systemd) and then start Pacemaker on the slave, Pacemaker is able to recognize the master, start the slave, and failover works. That is another problem which I didn't manage to solve. Should I send a new mail for that issue, or can we continue in this thread?

On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais wrote:
> On Fri, 19 Apr 2019 17:26:14 +0200
> Danka Ivanović wrote:
> ...
> > Should I change any of those timeout parameters in order to avoid
> > timeout?
>
> You can try to raise the timeout, indeed. But as far as we don't know
> **why** your VMs froze for some time, it is difficult to guess how high
> these timeouts should be.
>
> Not to mention that it will raise your RTO.

--
Pozdrav
Danka Ivanovic
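For reference, the nslcd change described above looks roughly like this. A sketch only: the exact user list is an assumption and must match the local accounts on your own hosts.

```
# /etc/nslcd.conf (and similarly /etc/ldap.conf)
# Do not query the LDAP server for supplementary groups of these local
# users, so Pacemaker's monitor operations survive an LDAP outage:
nss_initgroups_ignoreusers postgres,hacluster,root
```

After editing, nslcd must be restarted for the change to take effect.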
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Here is the command output from crm configure show:

node 1: master \
        attributes master-PGSQL=1001
node 2: secondary \
        attributes master-PGSQL=1000
primitive AWSVIP awsvip \
        params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
        params pgdata="/var/lib/postgresql/9.5/main" bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
        op start timeout=60s interval=0 \
        op stop timeout=60s interval=0 \
        op promote timeout=15s interval=0 \
        op demote timeout=120s interval=0 \
        op monitor interval=15s timeout=10s role=Master \
        op monitor interval=16s timeout=10s role=Slave \
        op notify timeout=60 interval=0
primitive fencing-master stonith:external/ec2 \
        params port=master \
        op start interval=0s timeout=60s \
        op monitor interval=360s timeout=60s \
        op stop interval=0s timeout=60s
primitive fencing-secondary stonith:external/ec2 \
        params port=secondary \
        op start interval=0s timeout=60s \
        op monitor interval=360s timeout=60s \
        op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop symmetrical=false
location loc-fence-master fencing-master -inf: master
location loc-fence-secondary fencing-secondary -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start symmetrical=false
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        cluster-name=pgc-psql-ha \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        maintenance-mode=false \
        last-lrm-refresh=1551885417
rsc_defaults rsc-options: \
        resource-stickiness=10 \
        migration-threshold=1

Should I change any of those timeout parameters in order to avoid the timeout?
On Fri, 19 Apr 2019 at 12:23, Danka Ivanović wrote:
> Thanks for the clarification about failure-timeout, migration threshold
> and pacemaker.
> The instances are hosted on the AWS cloud, and they are in the same
> security groups and availability zones.
> I don't have information about the hardware which hosts those VMs, since
> they are not dedicated. The UTC timezone is configured on both machines,
> with the default ntp configuration.
>
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
>  0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
> +198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
> -time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
> -199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
> +helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
> *i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305
>
> On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais
> wrote:
>
>> On Fri, 19 Apr 2019 11:08:33 +0200
>> Danka Ivanović wrote:
>>
>> > Hi,
>> > Thank you for your response.
>> >
>> > Ok, It seems that fencing resources and secondary timed out at the same
>> > time, together with ldap.
>> > I understand that because of "migration-threshold=1", standby tried to
>> > recover just once and then was stopped. Is this ok, or should the
>> > threshold be increased?
>>
>> It depends on your use case, really.
>>
>> Note that as soon as a resource hits the migration threshold, there's an
>> implicit constraint forbidding it to come back on this node until you
>> reset the failcount. That's why your pgsql master resource never came
>> back anywhere.
>>
>> You can as well set failure-timeout if you are brave enough to automate
>> the failure reset. See:
>>
>> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
>>
>> > The master server is started with systemctl, then pacemaker is started
>> > on the master, which detects the master, and then when starting
>> > pacemaker on the secondary it brings up the postgres service in slave
>> > mode.
>
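Following the failure-timeout advice quoted above, the automatic failcount reset can be configured cluster-wide or per resource. A crmsh sketch; the 5-minute value is an arbitrary assumption to tune against your RTO:

```shell
# Cluster-wide default: expire failcounts after 5 minutes without failures.
crm configure rsc_defaults failure-timeout=300s

# Or only for the PostgreSQL master/slave resource:
crm resource meta PGSQL-HA set failure-timeout 300s
```

Note that failure-timeout only clears the implicit ban automatically; it does not make repeated failures within the window any less fatal under migration-threshold=1.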
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Thanks for the clarification about failure-timeout, migration threshold and Pacemaker.

The instances are hosted on the AWS cloud, and they are in the same security groups and availability zones. I don't have information about the hardware which hosts those VMs, since they are not dedicated. The UTC timezone is configured on both machines, with the default ntp configuration.

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
+198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
-time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
-199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
+helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
*i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305

On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais wrote:
> On Fri, 19 Apr 2019 11:08:33 +0200
> Danka Ivanović wrote:
>
> > Hi,
> > Thank you for your response.
> >
> > Ok, It seems that fencing resources and secondary timed out at the same
> > time, together with ldap.
> > I understand that because of "migration-threshold=1", standby tried to
> > recover just once and then was stopped. Is this ok, or should the
> > threshold be increased?
>
> It depends on your use case, really.
>
> Note that as soon as a resource hits the migration threshold, there's an
> implicit constraint forbidding it to come back on this node until you
> reset the failcount. That's why your pgsql master resource never came
> back anywhere.
>
> You can as well set failure-timeout if you are brave enough to automate
> the failure reset. See:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
>
> > The master server is started with systemctl, then pacemaker is started
> > on the master, which detects the master, and then when starting
> > pacemaker on the secondary it brings up the postgres service in slave
> > mode.
>
> You should not. Systemd should not mess with resources handled by
> Pacemaker.
>
> > I didn't manage to start the postgres master via pacemaker. I tested
> > failover with a setup like this and it works. I will try to set up
> > postgres to be run by pacemaker,
>
> Pacemaker is supposed to start the resource itself if it is enabled in
> its setup. Look at this whole chapter (its end is important):
>
> https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster
>
> > but I am concerned about those timeouts which
> > caused the cluster to crash. Can you help me investigate why this
> > happened, or what I should change in order to avoid it? For the virtual
> > ip an AWS secondary IP is used.
>
> Really I can't help on this. It looks like suddenly both VMs froze most
> of their processes, or maybe some kind of clock jump, exhausting the
> timeouts... I really don't know.
>
> It sounds more related to your virtualization stack, I suppose. Maybe
> some kind of "hot" backup? Maybe the hypervisor didn't schedule enough
> CPU to your VMs for too long?
>
> This is surprising both VMs had timeouts at almost the same time. Do you
> know if they are on the same hypervisor host? If they do, this is a SPoF:
> you should move one of them to another host.
> > ++
>
> > On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais <j...@dalibo.com>
> > wrote:
> >
> > > On Thu, 18 Apr 2019 14:19:44 +0200
> > > Danka Ivanović wrote:
> > >
> > > It seems you had timeout for both fencing resources and your standby
> > > in the same time here:
> > >
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > > monitor for fencing-secondary on master: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > > monitor for fencing-master on secondary: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > > monitor for PGSQL:1 on secondary: unknown error (1)
> > >
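Per the advice in this thread that systemd should not mess with resources handled by Pacemaker, the systemd unit can be kept from starting PostgreSQL itself. A sketch for Debian/Ubuntu; the unit names are assumptions for a 9.5 "main" cluster:

```shell
# Stop systemd from starting PostgreSQL at boot; Pacemaker will start it:
sudo systemctl disable postgresql
sudo systemctl disable postgresql@9.5-main

# Confirm the cluster, not systemd, now controls the service:
crm resource status PGSQL-HA
```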
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,

Thank you for your response.

OK, it seems that the fencing resources and the secondary timed out at the same time, together with LDAP. I understand that because of "migration-threshold=1" the standby tried to recover just once and then was stopped. Is this OK, or should the threshold be increased?

The master server is started with systemctl, then Pacemaker is started on the master, which detects the master; then, when starting Pacemaker on the secondary, it brings up the Postgres service in slave mode. I didn't manage to start the Postgres master via Pacemaker. I tested failover with a setup like this and it works. I will try to set up Postgres to be run by Pacemaker, but I am concerned about those timeouts which caused the cluster to crash. Can you help me investigate why this happened, or what I should change in order to avoid it? For the virtual IP an AWS secondary IP is used.

Link to the awsvip resource agent:
https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/awsvip
Link to the ec2 stonith resource agent:
https://raw.githubusercontent.com/ClusterLabs/cluster-glue/master/lib/plugins/stonith/external/ec2

Command output when the cluster works (crm status):

Stack: corosync
Current DC: postgres-ha-1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 5 resources configured

Online: [ postgres-ha-1 postgres-ha-2 ]

Full list of resources:

 AWSVIP (ocf::heartbeat:awsvip): Started postgres-ha-1
 Master/Slave Set: PGSQL-HA [PGSQL]
     Masters: [ postgres-ha-1 ]
     Slaves: [ postgres-ha-2 ]
 fencing-postgres-ha-1 (stonith:external/ec2): Started postgres-ha-2
 fencing-postgres-ha-2 (stonith:external/ec2): Started postgres-ha-1

On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais wrote:
> On Thu, 18 Apr 2019 14:19:44 +0200
> Danka Ivanović wrote:
>
> It seems you had timeouts for both fencing resources and your standby at
> the same time here:
>
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:1 on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> > away from master after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master
> > away from secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from secondary after 1 failures (max=1)
>
> Because you have "migration-threshold=1", the standby will be shut down:
>
> > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>
> The transition is stopped because the pgsql master timed out in the
> meantime:
>
> > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
>
> and as you mentioned, your ldap as well:
>
> > Apr 17 10:03:40 master nslcd[1518]: [d7e446] ldap_result() timed out
>
> Here are the four timeout errors (2 fencings and 2 pgsql instances):
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:0 on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:1 on secondary: unknown error (1)
>
> As a reaction, Pacemaker decides to stop everything because it cannot move
> resources anywhere:
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> > away from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master
> > away from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from
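With migration-threshold=1, each resource above stays banned from its node until its failcount is cleared. A crmsh sketch for recovering after such an event, assuming the resource names from this thread:

```shell
# Clear the failcounts (and the implicit location bans) so the stopped
# resources are allowed to run again:
crm resource cleanup PGSQL-HA
crm resource cleanup fencing-master
crm resource cleanup fencing-secondary

# Review remaining failcounts and resource state:
crm_mon -1 --failcounts
```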
[ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,

Can you help me with troubleshooting a Postgres Pacemaker cluster failure? Today the cluster failed without promoting the secondary to master. At the same time, an LDAP timeout appeared. Here are the logs; the master was stopped by Pacemaker at 10:03:40 AM UTC. Thank you in advance.

corosync.log:

Apr 17 10:03:34 master crmd[12481]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Recover PGSQL:1 (Slave secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3461: /var/lib/pacemaker/pengine/pe-input-58.bz2
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3462: /var/lib/pacemaker/pengine/pe-input-59.bz2
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000 process (PID 32372) timed out
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000:32372 - timed out after 1ms
Apr 17 10:03:40 master crmd[12481]: notice: Transition aborted by PGSQL_monitor_15000 'modify' on master: Old event (magic=2:1;8:7:8:319e4083-ccc0-440a-ae43-1bbd39275fe7, cib=0.93.14, source=process_graph_event:605, 0)
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-request-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-23321-32400-25-header
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: _get_controldata: found: {
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: pgsql_notify: environment variables: {
Apr 17