Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-11 Thread Danka Ivanovic
We tried to fix the LDAP issue with the nss_initgroups_ignoreusers option in nslcd.conf for the postgres and hacluster users, so the cluster shouldn't contact the LDAP server every 15 seconds when it checks psql as the postgres user: /usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432 We have two

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Jehan-Guillaume de Rorthais
On Wed, 10 Jul 2019 17:25:57 +0200 Danka Ivanovic wrote: ... > I know it should be avoided starting master database with systemctl, but I > didn't find a way to start it with pacemaker. I will test again, but I am > out of ideas. Put the cluster in debug mode and provide the full logs +

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Jehan-Guillaume de Rorthais
On Wed, 10 Jul 2019 16:34:17 +0200 Danka Ivanovic wrote: > Hi, Thank you all for responding so quickly. Part of corosync.log file is > attached. Cluster failure occurred at 09:16 AM yesterday. > Debug mode is turned on in corosync configuration, but I didn't turn it on > in pacemaker config. I

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Jehan-Guillaume de Rorthais
On Wed, 10 Jul 2019 12:53:59 +0300 Andrei Borzenkov wrote: > On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais > wrote: > > > > > > > Jul 09 09:16:32 [2679] postgres1 lrmd:debug: > > > > child_kill_helper: Kill pid 12735's group Jul 09 09:16:34 [2679] > > > > postgres1

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Andrei Borzenkov
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote: > > > > Jul 09 09:16:32 [2679] postgres1 lrmd:debug: > > > child_kill_helper: Kill pid 12735's group Jul 09 09:16:34 [2679] > > > postgres1 lrmd: warning: child_timeout_callback: > > > PGSQL_monitor_15000

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Andrei Borzenkov
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote: > > > P.S. crm_resource is called by resource agent (pgsqlms). And it shows > > result of original resource probing which makes it confusing. At least > > it explains where these logs entries come from. > > Not sure to understand

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Jehan-Guillaume de Rorthais
On Tue, 9 Jul 2019 19:57:06 +0300 Andrei Borzenkov wrote: > 09.07.2019 13:08, Danka Ivanović wrote: > > Hi I didn't manage to start master with postgres, even if I increased start > > timeout. I checked executable paths and start options. We would require much more logs from this failure... >

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-09 Thread Andrei Borzenkov
09.07.2019 13:08, Danka Ivanović wrote: > Hi I didn't manage to start master with postgres, even if I increased start > timeout. I checked executable paths and start options. > When cluster is running with manually started master and slave started over > pacemaker, everything works ok. Today we

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-16 Thread Ken Gaillot
On Thu, 2019-05-16 at 10:20 +0200, Jehan-Guillaume de Rorthais wrote: > On Wed, 15 May 2019 16:53:48 -0500 > Ken Gaillot wrote: > > > On Wed, 2019-05-15 at 11:50 +0200, Jehan-Guillaume de Rorthais > > wrote: > > > On Mon, 29 Apr 2019 19:59:49 +0300 > > > Andrei Borzenkov wrote: > > > > > > >

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-16 Thread Jehan-Guillaume de Rorthais
On Wed, 15 May 2019 16:53:48 -0500 Ken Gaillot wrote: > On Wed, 2019-05-15 at 11:50 +0200, Jehan-Guillaume de Rorthais wrote: > > On Mon, 29 Apr 2019 19:59:49 +0300 > > Andrei Borzenkov wrote: > > > > > 29.04.2019 18:05, Ken Gaillot wrote: > > > > > > > > > > > Why does not it check

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-15 Thread Ken Gaillot
On Wed, 2019-05-15 at 11:50 +0200, Jehan-Guillaume de Rorthais wrote: > On Mon, 29 Apr 2019 19:59:49 +0300 > Andrei Borzenkov wrote: > > > 29.04.2019 18:05, Ken Gaillot wrote: > > > > > > > > > Why does not it check OCF_RESKEY_CRM_meta_notify? > > > > > > > > I was just not aware of this

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-15 Thread Jehan-Guillaume de Rorthais
On Mon, 29 Apr 2019 19:59:49 +0300 Andrei Borzenkov wrote: > 29.04.2019 18:05, Ken Gaillot wrote: > >> > >>> Why does not it check OCF_RESKEY_CRM_meta_notify? > >> > >> I was just not aware of this env variable. Sadly, it is not > >> documented > >> anywhere :( > > > > It's not a

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-15 Thread Jehan-Guillaume de Rorthais
On Tue, 30 Apr 2019 17:28:44 +0200 Danka Ivanović wrote: > Hi, I tried new clean config with upgraded postgres and corosync and > pacemaker packages. In this attempt, your PostgreSQL resource timed out while starting up: Apr 30 15:09:43 [13342] master lrmd:debug:

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-29 Thread Andrei Borzenkov
29.04.2019 18:05, Ken Gaillot wrote: >> >>> Why does not it check OCF_RESKEY_CRM_meta_notify? >> >> I was just not aware of this env variable. Sadly, it is not >> documented >> anywhere :( > > It's not a Pacemaker-created value like the other notify variables -- > all user-specified

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-29 Thread Ken Gaillot
On Sun, 2019-04-28 at 00:27 +0200, Jehan-Guillaume de Rorthais wrote: > On Sat, 27 Apr 2019 09:15:29 +0300 > Andrei Borzenkov wrote: > > > 27.04.2019 1:04, Danka Ivanović wrote: > > > Hi, here is a complete cluster configuration: > > > > > > node 1: master > > > node 2: secondary > > >

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-27 Thread Jehan-Guillaume de Rorthais
On Sat, 27 Apr 2019 09:15:29 +0300 Andrei Borzenkov wrote: > 27.04.2019 1:04, Danka Ivanović wrote: > > Hi, here is a complete cluster configuration: > > > > node 1: master > > node 2: secondary > > primitive AWSVIP awsvip \ > > params secondary_private_ip=10.x.x.x api_delay=5 > >

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-27 Thread Andrei Borzenkov
27.04.2019 1:04, Danka Ivanović wrote: > Hi, here is a complete cluster configuration: > > node 1: master > node 2: secondary > primitive AWSVIP awsvip \ > params secondary_private_ip=10.x.x.x api_delay=5 > primitive PGSQL pgsqlms \ > params pgdata="/var/lib/postgresql/9.5/main" >

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-26 Thread Jehan-Guillaume de Rorthais
Hi, On Thu, 25 Apr 2019 18:57:55 +0200 Danka Ivanović wrote: > Apr 25 16:39:50 [4213] master lrmd: notice: > operation_finished: PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You > must set meta parameter notify=true for your master resource ] Resource agent pgsqlms refuses to start

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-25 Thread Danka Ivanović
Hi, Here are the logs from when pacemaker fails to start the postgres service on master. It manages to start only the postgres slave. I tried different configurations with the pgsqlms and pgsql resource agents. These errors appear when I use the pgsqlms agent, whose configuration I sent in the first mail: Apr 25 16:40:23

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-23 Thread Danka Ivanović
Hi, It seems that an LDAP timeout caused the cluster failure. The cluster checks status every 15s on master and every 16s on slave. The cluster needs the postgres user for authentication, but the user is first looked up on the LDAP server and only then locally on the host. When connection to the LDAP server was interrupted, cluster couldn't

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Jehan-Guillaume de Rorthais
On Fri, 19 Apr 2019 17:26:14 +0200 Danka Ivanović wrote: ... > Should I change any of those timeout parameters in order to avoid timeout? You can try to raise the timeout, indeed. But as long as we don't know **why** your VMs froze for some time, it is difficult to guess how high should be these

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Here is the command output from crm configure show: node 1: master \ attributes master-PGSQL=1001 node 2: secondary \ attributes master-PGSQL=1000 primitive AWSVIP awsvip \ params secondary_private_ip=10.x.x.x api_delay=5 primitive PGSQL pgsqlms \ params pgdata="/var/lib/postgresql/9.5/main"

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Thanks for the clarification about failure-timeout, migration-threshold and pacemaker. The instances are hosted on AWS cloud, and they are in the same security groups and availability zones. I don't have information about the hardware which hosts those VMs since they are non-dedicated. UTC timezone is

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Jehan-Guillaume de Rorthais
On Fri, 19 Apr 2019 11:08:33 +0200 Danka Ivanović wrote: > Hi, > Thank you for your response. > > Ok, It seems that fencing resources and secondary timed out at the same > time, together with ldap. > I understand that because of "migration-threshold=1", standby tried to > recover just once and

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Hi, Thank you for your response. Ok, it seems that fencing resources and secondary timed out at the same time, together with ldap. I understand that because of "migration-threshold=1", standby tried to recover just once and then was stopped. Is this ok, or should the threshold be increased?

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-18 Thread Jehan-Guillaume de Rorthais
On Thu, 18 Apr 2019 14:19:44 +0200 Danka Ivanović wrote: It seems you had a timeout for both fencing resources and your standby at the same time here: > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op > monitor for fencing-secondary on master: unknown error (1) > Apr 17

[ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-18 Thread Danka Ivanović
Hi, Can you help me troubleshoot a postgres pacemaker cluster failure? Today the cluster failed without promoting secondary to master. An LDAP timeout appeared at the same time. Here are the logs, master was stopped by pacemaker at 10:03:40 AM UTC. Thank you in advance. corosync.log Apr 17