Here is the command output from crm configure show:

node 1: master \
        attributes master-PGSQL=1001
node 2: secondary \
        attributes master-PGSQL=1000
primitive AWSVIP awsvip \
        params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
        params pgdata="/var/lib/postgresql/9.5/main" bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
        op start timeout=60s interval=0 \
        op stop timeout=60s interval=0 \
        op promote timeout=15s interval=0 \
        op demote timeout=120s interval=0 \
        op monitor interval=15s timeout=10s role=Master \
        op monitor interval=16s timeout=10s role=Slave \
        op notify timeout=60 interval=0
primitive fencing-master stonith:external/ec2 \
        params port=master \
        op start interval=0s timeout=60s \
        op monitor interval=360s timeout=60s \
        op stop interval=0s timeout=60s
primitive fencing-secondary stonith:external/ec2 \
        params port=secondary \
        op start interval=0s timeout=60s \
        op monitor interval=360s timeout=60s \
        op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop symmetrical=false
location loc-fence-master fencing-master -inf: master
location loc-fence-secondary fencing-secondary -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start symmetrical=false
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        cluster-name=pgc-psql-ha \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        maintenance-mode=false \
        last-lrm-refresh=1551885417
rsc_defaults rsc-options: \
        resource-stickiness=10 \
        migration-threshold=1
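For reference, operation timeouts like the ones in the configuration above can be changed with the crmsh CLI. This is only an untested sketch; the commands assume crmsh is installed, and the timeout value shown is a placeholder, not a recommendation from this thread:

```shell
# Sketch (untested): adjusting Pacemaker operation timeouts with crmsh.
# Raise the cluster-wide default timeout for operations that do not set
# their own (per-op timeouts in the primitive still take precedence):
crm configure op_defaults timeout=120s

# Or open the PGSQL primitive in an editor to change a single op, e.g. the
# "op monitor interval=15s timeout=10s role=Master" line:
crm configure edit PGSQL

# Review the resulting definition before relying on it:
crm configure show PGSQL
```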
Should I change any of those timeout parameters in order to avoid the timeouts?

On Fri, 19 Apr 2019 at 12:23, Danka Ivanović <danka.ivano...@gmail.com> wrote:

> Thanks for the clarification about failure-timeout, migration-threshold
> and Pacemaker.
> The instances are hosted on the AWS cloud, and they are in the same
> security groups and availability zones.
> I don't have information about the hardware which hosts those VMs, since
> they are not dedicated. The UTC timezone is configured on both machines,
> with the default ntp configuration.
>
>      remote           refid      st t  when poll reach   delay   offset  jitter
> ==============================================================================
>  0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
> +198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
> -time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
> -199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
> +helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
> *i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305
>
> On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais <j...@dalibo.com>
> wrote:
>
>> On Fri, 19 Apr 2019 11:08:33 +0200
>> Danka Ivanović <danka.ivano...@gmail.com> wrote:
>>
>> > Hi,
>> > Thank you for your response.
>> >
>> > OK, it seems that the fencing resources and the secondary timed out at
>> > the same time, together with ldap.
>> > I understand that because of "migration-threshold=1", the standby tried
>> > to recover just once and was then stopped. Is this OK, or should the
>> > threshold be increased?
>>
>> It depends on your use case, really.
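As a side note on the migration-threshold discussion: the fail count that drives this behavior can be inspected, reset, or expired automatically. The following is a hedged, untested sketch in crmsh syntax, using the resource and node names from the configuration above:

```shell
# Sketch (untested, crmsh syntax): working with Pacemaker fail counts.
# Show the current fail count of the PGSQL-HA resource on node "secondary":
crm resource failcount PGSQL-HA show secondary

# Reset the fail count (this also lifts the implicit constraint that keeps
# the resource away from the node where it hit migration-threshold):
crm resource cleanup PGSQL-HA

# Or automate the reset by expiring failures after 5 minutes, cluster-wide:
crm configure rsc_defaults failure-timeout=300s
```

Whether an automatic failure-timeout is appropriate depends on the use case, as noted in the thread; it trades manual control for unattended recovery.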
>>
>> Note that as soon as a resource hits its migration threshold, there is an
>> implicit constraint forbidding it to come back on that node until you
>> reset the failcount. That's why your pgsql master resource never came
>> back anywhere.
>>
>> You can also set failure-timeout if you are brave enough to automate the
>> failure reset. See:
>>
>> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
>>
>> > The master server is started with systemctl, then pacemaker is started
>> > on the master, which detects the master, and then starting pacemaker on
>> > the secondary brings up the postgres service in slave mode.
>>
>> You should not. Systemd should not mess with resources handled by
>> Pacemaker.
>>
>> > I didn't manage to start the postgres master through pacemaker. I
>> > tested failover with a setup like this and it works. I will try to set
>> > up postgres to be run by pacemaker,
>>
>> Pacemaker is supposed to start the resource itself if the resource is
>> enabled in its setup. Look at this whole chapter (its end is important):
>>
>> https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster
>>
>> > but I am concerned about those timeouts which caused the cluster to
>> > crash. Can you help me investigate why this happened, or what I should
>> > change in order to avoid it? For the AWS virtual IP, an AWS secondary
>> > IP is used.
>>
>> Really, I can't help on this. It looks like both VMs suddenly froze most
>> of their processes, or maybe there was some kind of clock jump exhausting
>> the timeouts... I really don't know.
>>
>> It sounds more related to your virtualization stack, I suppose. Maybe
>> some kind of "hot" backup? Maybe the hypervisor didn't schedule enough
>> CPU to your VMs for too long?
>>
>> It is surprising that both VMs hit timeouts at almost the same time. Do
>> you know if they are on the same hypervisor host?
>> If they are, this is a SPoF: you should move one of them to another host.
>>
>> ++
>>
>> > On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais
>> > <j...@dalibo.com> wrote:
>> >
>> > > On Thu, 18 Apr 2019 14:19:44 +0200
>> > > Danka Ivanović <danka.ivano...@gmail.com> wrote:
>> > >
>> > > It seems you had timeouts for both fencing resources and your standby
>> > > at the same time here:
>> > >
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>> > > > monitor for fencing-secondary on master: unknown error (1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>> > > > monitor for fencing-master on secondary: unknown error (1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>> > > > monitor for PGSQL:1 on secondary: unknown error (1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing
>> > > > fencing-secondary away from master after 1 failures (max=1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing
>> > > > fencing-master away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA
>> > > > away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA
>> > > > away from secondary after 1 failures (max=1)
>> > >
>> > > Because you have "migration-threshold=1", the standby will be shut
>> > > down:
>> > >
>> > > > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>> > >
>> > > The transition is stopped because the pgsql master timed out in the
>> > > meantime:
>> > >
>> > > > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462
>> > > > (Complete=5, Pending=0, Fired=0, Skipped=1, Incomplete=6,
>> > > > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
>> > >
>> > > and, as you mentioned, your ldap as well:
>> > >
>> > > > Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)>
>> > > > ldap_result() timed out
>> > >
>> > > Here are the four timeout errors (2 fencing resources and 2 pgsql
>> > > instances):
>> > >
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>> > > > monitor for fencing-secondary on master: unknown error (1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>> > > > monitor for PGSQL:0 on master: unknown error (1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>> > > > monitor for fencing-master on secondary: unknown error (1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>> > > > monitor for PGSQL:1 on secondary: unknown error (1)
>> > >
>> > > As a reaction, Pacemaker decides to stop everything because it cannot
>> > > move the resources anywhere:
>> > >
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA
>> > > > away from master after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA
>> > > > away from master after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing
>> > > > fencing-secondary away from master after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing
>> > > > fencing-master away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA
>> > > > away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA
>> > > > away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
>> > > > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0
>> > > > (Master -> Stopped master)
>> > > > Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>> > >
>> > > Now,
>> > > the following lines are really unexpected. Why does systemd detect
>> > > that PostgreSQL stopped?
>> > >
>> > > > Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not
>> > > > running.
>> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service:
>> > > > Control process exited, code=exited status=2
>> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service:
>> > > > Unit entered failed state.
>> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service:
>> > > > Failed with result 'exit-code'.
>> > >
>> > > I suspect the service is still enabled or has been started by hand.
>> > >
>> > > As soon as you set up a resource in Pacemaker, the admin should
>> > > **always** ask Pacemaker to start/stop it. Never use systemctl to
>> > > handle the resource yourself.
>> > >
>> > > You must disable this service in systemd.
>> > >
> --
> Pozdrav
> Danka Ivanovic

--
Pozdrav
Danka Ivanovic
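The "disable this service in systemd" advice above can be sketched as follows for the unit name seen in the logs. This is only an illustrative sketch; note that `--now` is deliberately not used, since it would also stop a running PostgreSQL instance:

```shell
# Sketch: keep systemd from auto-starting the Pacemaker-managed instance.
# Intentionally without --now, so a running master is not stopped by systemd.
sudo systemctl disable postgresql@9.5-main

# Verify the unit will no longer start at boot:
systemctl is-enabled postgresql@9.5-main
```

After this, only Pacemaker (via the pgsqlms resource agent) should start and stop the PostgreSQL instance.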
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/