On Thu, 18 Apr 2019 14:19:44 +0200 Danka Ivanović <danka.ivano...@gmail.com> wrote:
It seems both fencing resources and your standby timed out at the same
time here:

> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:1 on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> away from master after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
> from secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)

Because you have "migration-threshold=1", the standby will be shut down:

> Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)

The transition is stopped because the pgsql master timed out in the
meantime:

> Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> Pending=0, Fired=0, Skipped=1, Incomplete=6,
> Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped

And, as you mentioned, your LDAP as well:

> Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()
> timed out

Here are the four timeout errors (2 fencings and 2 pgsql instances):

> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:0 on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:1 on secondary: unknown error (1)

In reaction, Pacemaker decides to stop everything because it cannot move
the resources anywhere:

> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> away from master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away
> from secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> Stopped master)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)

Now, the following lines are really not expected. Why does systemd detect
that PostgreSQL stopped?

> Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control
> process exited, code=exited status=2
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit
> entered failed state.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with
> result 'exit-code'.

I suspect the service is still enabled or has been started by hand.

As soon as you set up a resource in Pacemaker, the admin should **always**
ask Pacemaker to start/stop it. Never use systemctl to handle the resource
yourself. You must disable this service in systemd.
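
If you want to double check the count of failed monitors per pengine run on your side, something like the following Python sketch could help. It is only an illustration: the function name `failed_ops_by_time` and the regular expression are my own, and the pattern assumes the pengine warning format quoted above, with one entry per line as in the real log file (not wrapped as in this mail).

```python
import re
from collections import defaultdict

# Assumed format: "<date> <host> pengine[<pid>]: warning: Processing failed op
# <op> for <resource> on <node>: ...", all on a single line.
FAILED_OP = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+pengine\[\d+\]:\s+warning:\s+"
    r"Processing failed op (?P<op>\S+) for (?P<rsc>\S+) on (?P<node>\S+):"
)


def failed_ops_by_time(lines):
    """Group the (resource, node) pairs of failed operations by timestamp."""
    found = defaultdict(set)
    for line in lines:
        m = FAILED_OP.match(line)
        if m:
            found[m.group("ts")].add((m.group("rsc"), m.group("node")))
    return dict(found)


# Sample entries taken from the log lines quoted in this thread.
sample = [
    "Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1)",
    "Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1)",
    "Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1)",
    "Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1)",
    "Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for PGSQL:0 on master: unknown error (1)",
    "Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1)",
    "Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1)",
]

for ts, failures in sorted(failed_ops_by_time(sample).items()):
    print(ts, len(failures), sorted(failures))
```

With a real log you would pass the open file instead of `sample`. The path to that file depends on your syslog setup, so I leave it out here.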
++