Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-26 Thread Danka Ivanović
Hi, here is a complete cluster configuration:

node 1: master
node 2: secondary
primitive AWSVIP awsvip \
params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
params pgdata="/var/lib/postgresql/9.5/main"
bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/"
recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
op start timeout=60s interval=0 \
op stop timeout=60s interval=0 \
op promote timeout=15s interval=0 \
op demote timeout=120s interval=0 \
op monitor interval=15s timeout=10s role=Master \
op monitor interval=16s timeout=10s role=Slave \
op notify timeout=60 interval=0
primitive fencing-postgres-ha-2 stonith:external/ec2 \
params port=master \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
primitive fencing-test-rsyslog stonith:external/ec2 \
params port=secondary \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
meta notify=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop
symmetrical=false
location loc-fence-master fencing-postgres-ha-2 -inf: master
location loc-fence-secondary fencing-test-rsyslog -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start
symmetrical=false
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-name=psql-ha \
stonith-enabled=true \
no-quorum-policy=ignore \
last-lrm-refresh=1556315444 \
maintenance-mode=false
rsc_defaults rsc-options: \
resource-stickiness=10 \
migration-threshold=2

I tried to start Postgres manually to be sure it is OK. There are no errors
in the Postgres log. I also tried different meta parameters, but always
with notify=true.
I also tried this:
ms PGSQL-HA PGSQL \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
notify=true interleave=true
I followed this link:
https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
Once STONITH was enabled and working, I imported all the other resources and
constraints together at the same time, as sketched below.
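
The one-shot import with crmsh looked roughly like this (a sketch; the file
name is just an example):

# put the primitives, the ms resource and all constraints in one file,
# then load it in a single transaction
crm configure load update /root/pgsql-cluster.crm
crm configure show    # verify what was committed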

On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais 
wrote:

> Hi,
>
> On Thu, 25 Apr 2019 18:57:55 +0200
> Danka Ivanović  wrote:
>
> > Apr 25 16:39:50 [4213] master   lrmd:   notice:
> > operation_finished:   PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
> > must set meta parameter notify=true for your master resource ]
>
> The pgsqlms resource agent refuses to start PgSQL because your configuration
> lacks the "notify=true" attribute in your master resource definition.
>
> Could you please share your full Pacemaker configuration?
>
> Regards,
>


-- 
Pozdrav
Danka Ivanovic
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-25 Thread Danka Ivanović
Hi,
Here are the logs from when Pacemaker fails to start the Postgres service on
the master. It manages to start only the Postgres slave.
I tried different configurations with the pgsqlms and pgsql resource agents.
These errors appear when I use the pgsqlms agent, whose configuration I sent
in the first mail:

Apr 25 16:40:23 [4213] master   lrmd: info: log_execute:  executing
- rsc:PGSQL action:start call_id:51
launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_ctl
--pgdata /var/lib/postgresql/9.5/main -w --timeout 120 start -o -c
config_file=/etc/postgresql/9.5/main/postgresql.conf"
Apr 25 16:40:24 [4211] master cib: info: cib_perform_op: +
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='PGSQL']/lrm_rsc_op[@id='PGSQL_last_0']:
@operation_key=PGSQL_start_0, @operation=start,
@transition-key=12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf,
@transition-magic=0:0;12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf,
@call-id=176, @rc-code=0, @exec-time=1146, @queue-time=0
Apr 25 16:40:53 [4216] master crmd: debug: crm_timer_start: Started
Shutdown Escalation (I_STOP:120ms), src=53
Apr 25 16:41:23 [4213] master   lrmd:  warning:
child_timeout_callback: PGSQL_start_0
process (PID 5986) timed out

Part of the log is attached.
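
For reference, the same start command seen in the log can also be run by hand
as the postgres user, to rule out a problem on the PostgreSQL side (a sketch,
reusing the paths from the configuration above):

sudo -u postgres /usr/lib/postgresql/9.5/bin/pg_ctl \
    --pgdata /var/lib/postgresql/9.5/main -w --timeout 120 start \
    -o "-c config_file=/etc/postgresql/9.5/main/postgresql.conf"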

On Tue, 23 Apr 2019 at 17:28, Danka Ivanović 
wrote:

> Hi,
> It seems that an LDAP timeout caused the cluster failure. The cluster checks
> status every 15s on the master and 16s on the slave. The cluster needs the
> postgres user for authentication, but LDAP first queries the user on the LDAP
> server and only then locally on the host. When the connection to the LDAP
> server was interrupted, the cluster couldn't find the postgres user and
> authenticate on the DB to check its state. The problem is solved by
> reconfiguring /etc/ldap.conf and /etc/nslcd.conf: the variable
> nss_initgroups_ignoreusers is added, listing the local users which should be
> ignored when querying the LDAP server. Thanks for your help. :)
> Another problem is that I cannot start the Postgres master with Pacemaker.
> When I start Postgres manually (with systemd) and then start Pacemaker on the
> slave, Pacemaker is able to recognize the master, start the slave, and
> failover works.
> That is another problem which I didn't manage to solve. Should I send a new
> mail for that issue, or can we continue in this thread?
>
> On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais 
> wrote:
>
>> On Fri, 19 Apr 2019 17:26:14 +0200
>> Danka Ivanović  wrote:
>> ...
>> > Should I change any of those timeout parameters in order to avoid
>> timeout?
>>
>> You can try to raise the timeouts, indeed. But as long as we don't know
>> **why** your VMs froze for some time, it is difficult to guess how high
>> these timeouts should be.
>>
>> Not to mention that it will raise your RTO.
>>
>
>
> --
> Pozdrav
> Danka Ivanovic
>


-- 
Pozdrav
Danka Ivanovic
Apr 25 16:39:50 [4211] master cib: debug: crm_client_new: Connecting 0x55d8444e8e80 for uid=0 gid=0 pid=5791 id=c93d535d-77d8-4556-9a63-d9a1c2b45de9
Apr 25 16:39:50 [4211] master cib: debug: handle_new_connection: IPC credentials authenticated (4211-5791-13)
Apr 25 16:39:50 [4211] master cib: debug: qb_ipcs_shm_connect: connecting to client [5791]
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master cib: debug: cib_acl_enabled: CIB ACL is disabled
Apr 25 16:39:50 [4211] master cib: debug: qb_ipcs_dispatch_connection_request: HUP conn (4211-5791-13)
Apr 25 16:39:50 [4211] master cib: debug: qb_ipcs_disconnect: qb_ipcs_disconnect(4211-5791-13) state:2
Apr 25 16:39:50 [4211] master cib: debug: crm_client_destroy: Destroying 0 events
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-response-4211-5791-13-header
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-event-4211-5791-13-header
Apr 25 16:39:50 [4211] master cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-request-4211-5791-13-header
Apr 25 16:39:50 [15544] master corosync debug   [QB] IPC credentials authenticated (15544-5837-24)
Apr 25 16:39:50 [15544] master corosync debug   [QB] connecting to client [5837]
Apr 25 16:39:50 [15544] master corosync debug   [QB] shm size:1048589;

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-23 Thread Danka Ivanović
Hi,
It seems that an LDAP timeout caused the cluster failure. The cluster checks
status every 15s on the master and 16s on the slave. The cluster needs the
postgres user for authentication, but LDAP first queries the user on the LDAP
server and only then locally on the host. When the connection to the LDAP
server was interrupted, the cluster couldn't find the postgres user and
authenticate on the DB to check its state. The problem is solved by
reconfiguring /etc/ldap.conf and /etc/nslcd.conf: the variable
nss_initgroups_ignoreusers is added, listing the local users which should be
ignored when querying the LDAP server. Thanks for your help. :)
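
For example, the added line looks roughly like this (a sketch; the exact list
of local users depends on the system):

# /etc/nslcd.conf (and similarly in /etc/ldap.conf): skip LDAP lookups for
# these local accounts so they still resolve when the LDAP server is down
nss_initgroups_ignoreusers postgres,root
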
Another problem is that I cannot start the Postgres master with Pacemaker. When
I start Postgres manually (with systemd) and then start Pacemaker on the slave,
Pacemaker is able to recognize the master, start the slave, and failover works.
That is another problem which I didn't manage to solve. Should I send a new
mail for that issue, or can we continue in this thread?

On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais 
wrote:

> On Fri, 19 Apr 2019 17:26:14 +0200
> Danka Ivanović  wrote:
> ...
> > Should I change any of those timeout parameters in order to avoid
> timeout?
>
> You can try to raise the timeouts, indeed. But as long as we don't know
> **why** your VMs froze for some time, it is difficult to guess how high
> these timeouts should be.
>
> Not to mention that it will raise your RTO.
>


-- 
Pozdrav
Danka Ivanovic
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Here is the command output from crm configure show:

node 1: master \
attributes master-PGSQL=1001
node 2: secondary \
attributes master-PGSQL=1000
primitive AWSVIP awsvip \
params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
params pgdata="/var/lib/postgresql/9.5/main"
bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/"
recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
op start timeout=60s interval=0 \
op stop timeout=60s interval=0 \
op promote timeout=15s interval=0 \
op demote timeout=120s interval=0 \
op monitor interval=15s timeout=10s role=Master \
op monitor interval=16s timeout=10s role=Slave \
op notify timeout=60 interval=0
primitive fencing-master stonith:external/ec2 \
params port=master \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
primitive fencing-secondary stonith:external/ec2 \
params port=secondary \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
notify=true interleave=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop
symmetrical=false
location loc-fence-master fencing-master -inf: master
location loc-fence-secondary fencing-secondary -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start
symmetrical=false
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-name=pgc-psql-ha \
stonith-enabled=true \
no-quorum-policy=ignore \
maintenance-mode=false \
last-lrm-refresh=1551885417
rsc_defaults rsc-options: \
resource-stickiness=10 \
migration-threshold=1

Should I change any of those timeout parameters in order to avoid the timeouts?
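If so, I guess the change would look roughly like this with crmsh (only a
sketch; the values are placeholders, not recommendations):

crm configure edit PGSQL
# ...then bump the operation timeouts, for example:
#   op start timeout=120s interval=0 \
#   op monitor interval=15s timeout=30s role=Master \
#   op monitor interval=16s timeout=30s role=Slave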

On Fri, 19 Apr 2019 at 12:23, Danka Ivanović 
wrote:

> Thanks for the clarification about failure-timeout, migration threshold
> and pacemaker.
> Instances are hosted on AWS cloud, and they are in the same security
> groups and availability zones.
> I don't have information about the hardware which hosts those VMs since they
> are not dedicated. The UTC timezone is configured on both machines, with the
> default NTP configuration.
>       remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
>  0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
> +198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
> -time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
> -199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
> +helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
> *i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305
>
>
> On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais 
> wrote:
>
>> On Fri, 19 Apr 2019 11:08:33 +0200
>> Danka Ivanović  wrote:
>>
>> > Hi,
>> > Thank you for your response.
>> >
>> > OK, it seems that the fencing resources and the secondary timed out at the
>> > same time, together with LDAP.
>> > I understand that because of "migration-threshold=1", the standby tried to
>> > recover just once and then was stopped. Is this OK, or should the threshold
>> > be increased?
>>
>> It depends on your use case, really.
>>
>> Note that as soon as a resource hits the migration threshold, there's an
>> implicit constraint forbidding it to come back on this node until you reset
>> the failcount. That's why your pgsql master resource never came back anywhere.
>>
>> You can also set failure-timeout if you are brave enough to automate the
>> failure reset. See:
>>
>> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
>>
>> > The master server is started with systemctl, then pacemaker is started on
>> > the master, which detects the master; then, when starting pacemaker on the
>> > secondary, it brings up the postgres service in slave mode.
>

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Thanks for the clarification about failure-timeout, migration threshold and
Pacemaker.
The instances are hosted in the AWS cloud, and they are in the same security
groups and availability zones.
I don't have information about the hardware which hosts those VMs since they
are not dedicated. The UTC timezone is configured on both machines, with the
default NTP configuration.
      remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
+198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
-time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
-199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
+helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
*i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305


On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais 
wrote:

> On Fri, 19 Apr 2019 11:08:33 +0200
> Danka Ivanović  wrote:
>
> > Hi,
> > Thank you for your response.
> >
> > OK, it seems that the fencing resources and the secondary timed out at the
> > same time, together with LDAP.
> > I understand that because of "migration-threshold=1", the standby tried to
> > recover just once and then was stopped. Is this OK, or should the threshold
> > be increased?
>
> It depends on your use case, really.
>
> Note that as soon as a resource hits the migration threshold, there's an
> implicit constraint forbidding it to come back on this node until you reset
> the failcount. That's why your pgsql master resource never came back anywhere.
>
> You can also set failure-timeout if you are brave enough to automate the
> failure reset. See:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
>
> > The master server is started with systemctl, then pacemaker is started on
> > the master, which detects the master; then, when starting pacemaker on the
> > secondary, it brings up the postgres service in slave mode.
>
> You should not. Systemd should not mess with resources handled by
> Pacemaker.
>
> > I didn't manage to start the postgres master through pacemaker. I tested
> > failover with a setup like this and it works. I will try to set up postgres
> > to be run with pacemaker,
>
> Pacemaker is supposed to start the resource itself if it is enabled in its
> setup. Look at this whole chapter (its end is important):
>
> https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster
>
> > but I am concerned about those timeouts which caused the cluster to crash.
> > Can you help me investigate why this happened, or what I should change in
> > order to avoid it? For the virtual IP, an AWS secondary IP is used.
>
> Really, I can't help on this. It looks like suddenly both VMs froze most of
> their processes, or maybe there was some kind of clock jump, exhausting the
> timeouts... I really don't know.
>
> It sounds more related to your virtualization stack I suppose. Maybe some
> kind
> of "hot" backup? Maybe the hypervisor didn't schedule enough CPU to your
> VMs
> for too long?
>
> It is surprising that both VMs had timeouts at almost the same time. Do you
> know if they are on the same hypervisor host? If they do, this is a SPoF: you
> should move one of them to another host.
>
> ++
>
> > On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais <
> j...@dalibo.com>
> > wrote:
> >
> > > On Thu, 18 Apr 2019 14:19:44 +0200
> > > Danka Ivanović  wrote:
> > >
> > >
> > >
> > > It seems you had timeouts for both fencing resources and your standby at
> > > the same time here:
> > >
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for fencing-secondary on master: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for fencing-master on secondary: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for PGSQL:1 on secondary: unknown error (1)
> > >

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Hi,
Thank you for your response.

OK, it seems that the fencing resources and the secondary timed out at the
same time, together with LDAP.
I understand that because of "migration-threshold=1", the standby tried to
recover just once and then was stopped. Is this OK, or should the threshold
be increased?

The master server is started with systemctl, then Pacemaker is started on the
master, which detects the master; then, when Pacemaker is started on the
secondary, it brings up the Postgres service in slave mode (see the sketch
below).
I didn't manage to start the Postgres master through Pacemaker. I tested
failover with a setup like this and it works. I will try to set up Postgres to
be run by Pacemaker, but I am concerned about those timeouts which caused the
cluster to crash. Can you help me investigate why this happened, or what I
should change in order to avoid it? For the virtual IP, an AWS secondary
private IP is used.
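
The startup sequence looks roughly like this (a sketch; the unit names are the
Debian defaults and may differ):

# on the master: start PostgreSQL outside the cluster, then the cluster stack
sudo systemctl start postgresql
sudo systemctl start corosync pacemaker

# on the secondary: start only the cluster stack; Pacemaker then brings up
# the PGSQL resource in slave mode
sudo systemctl start corosync pacemaker
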
Link to the awsvip resource:

https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/awsvip


Link to the ec2 stonith resource agent:


https://raw.githubusercontent.com/ClusterLabs/cluster-glue/master/lib/plugins/stonith/external/ec2


Command output when cluster works:

crm status

Output:

Stack: corosync
Current DC: postgres-ha-1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 5 resources configured

Online: [ postgres-ha-1 postgres-ha-2 ]

Full list of resources:

AWSVIP (ocf::heartbeat:awsvip): Started postgres-ha-1
Master/Slave Set: PGSQL-HA [PGSQL]
    Masters: [ postgres-ha-1 ]
    Slaves: [ postgres-ha-2 ]
fencing-postgres-ha-1 (stonith:external/ec2): Started postgres-ha-2
fencing-postgres-ha-2 (stonith:external/ec2): Started postgres-ha-1


On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais 
wrote:

> On Thu, 18 Apr 2019 14:19:44 +0200
> Danka Ivanović  wrote:
>
>
>
> It seems you had timeouts for both fencing resources and your standby at
> the same time here:
>
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> >   monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> >   monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> >   monitor for PGSQL:1 on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> >   away from master after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master
> away
> >   from secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> >   secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> >   secondary after 1 failures (max=1)
>
> Because you have "migration-threshold=1", the standby will be shut down:
>
> > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>
> The transition is stopped because the pgsql master timed out in the
> meantime:
>
> > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
>
> and as you mentioned, your ldap as well:
>
> > Apr 17 10:03:40 master nslcd[1518]: [d7e446]  ldap_result()
> > timed out
>
> Here are the four timeout errors (2 fencings and 2 pgsql instances):
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> >   monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> >   monitor for PGSQL:0 on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> >   monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> >   monitor for PGSQL:1 on secondary: unknown error (1)
>
> As a reaction, Pacemaker decides to stop everything because it cannot move
> resources anywhere:
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> > master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> > master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> > away from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master
> away
> > from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> 

[ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-18 Thread Danka Ivanović
Hi,

Can you help me with troubleshooting a Postgres Pacemaker cluster failure?
Today the cluster failed without promoting the secondary to master. At the
same time, an LDAP timeout appeared.
Here are the logs; the master was stopped by Pacemaker at 10:03:40 AM UTC.
Thank you in advance.

corosync.log

Apr 17 10:03:34 master crmd[12481]: notice: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Recover PGSQL:1 (Slave
secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3461:
/var/lib/pacemaker/pengine/pe-input-58.bz2
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3462:
/var/lib/pacemaker/pengine/pe-input-59.bz2
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000 process
(PID 32372) timed out
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000:32372 -
timed out after 1ms
Apr 17 10:03:40 master crmd[12481]: notice: Transition aborted by
PGSQL_monitor_15000 'modify' on master: Old event
(magic=2:1;8:7:8:319e4083-ccc0-440a-ae43-1bbd39275fe7, cib=0.93.14,
source=process_graph_event:605, 0)
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-request-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-request-23321-32400-25-header
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: _get_controldata:
found: {
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: pgsql_notify:
environment variables: {
Apr 17