Hi all. I formatted the DRBD disk to get rid of the corrupted postmaster.pid file. After that everything worked fine, and I could not reproduce the issue anymore.
Best regards,
Michal Mistina

From: Mistina Michal [mailto:michal.mist...@virte.sk]
Sent: Monday, August 19, 2013 9:39 AM
To: The Pacemaker cluster resource manager
Subject: [Pacemaker] PostgreSQL failed to stop after streaming replication established

Dear community.

The scenario of the redundant environment is in the "graphic" representation...

             +------------------------------------+
             |                WAN                 |
             +                                    v
+------------+------------+          +------------+------------+
|pgsql       |pgsql       |          |pgsql       |pgsql       |
+------------+------------+          +------------+------------+
|drbd-pri    |drbd-sec    |          |drbd-pri    |drbd-sec    |
+------------+------------+          +------------+------------+
|        pacemaker        |          |        pacemaker        |
+-------------------------+          +-------------------------+
|        corosync         |          |        corosync         |
+------------+------------+          +------------+------------+
|node1       |node2       |          |node1       |node2       |
+------------+------------+          +------------+------------+
            TC1                                  TC2

Within each technical center everything worked fine when migrating resources between nodes. Then I set up streaming replication from TC1 to TC2. Now migration from one node to another fails: Pacemaker reports that the stop operation on the postgres resource FAILED. PostgreSQL was in fact stopped, but postmaster.pid was left behind, corrupted.

This is where I ended up. I am unable to stop the postgresql service correctly on TC1 (the streaming replication master). After issuing /etc/init.d/postgresql-9.2 stop, postmaster.pid remains on the filesystem and, moreover, it is corrupted. I am unable to delete it with the rm command.
It looks like this:

[root@pcmk1 ~]# ll /var/lib/pgsql/9.2/data/
ls: cannot access /var/lib/pgsql/9.2/data/postmaster.pid: No such file or directory
total 56
drwx------ 7 postgres postgres    62 Jun 26 17:13 base
drwx------ 2 postgres postgres  4096 Aug 18 00:25 global
drwx------ 2 postgres postgres    17 Jun 26 09:54 pg_clog
-rw------- 1 postgres postgres  5127 Aug 17 16:24 pg_hba.conf
-rw------- 1 postgres postgres  1636 Jun 26 09:54 pg_ident.conf
drwx------ 2 postgres postgres  4096 Jul  2 00:00 pg_log
drwx------ 4 postgres postgres    34 Jun 26 09:53 pg_multixact
drwx------ 2 postgres postgres    17 Aug 18 00:23 pg_notify
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_serial
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_snapshots
drwx------ 2 postgres postgres     6 Aug 18 00:25 pg_stat_tmp
drwx------ 2 postgres postgres    17 Jun 26 09:54 pg_subtrans
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_tblspc
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_twophase
-rw------- 1 postgres postgres     4 Jun 26 09:53 PG_VERSION
drwx------ 3 postgres postgres  4096 Aug 18 00:25 pg_xlog
-rw------- 1 postgres postgres 19884 Aug 17 22:54 postgresql.conf
-rw------- 1 postgres postgres    71 Aug 18 00:23 postmaster.opts
?????????? ? ?        ?            ?            ? postmaster.pid
-rw-r--r-- 1 postgres postgres   491 Aug 17 16:33 recovery.done

I don't know whether the resource agent did something wrong while Pacemaker was trying to stop postgres, or whether postgres itself is the component that failed to stop correctly. What do you think? Has anybody experienced a problem like this?
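For anyone who wants to check for the same symptom without reading ls output by eye, here is a minimal sketch in Python. It reports directory entries that are listed but whose metadata cannot be read, which is what the "??????????" row for postmaster.pid shows. The function name find_unstatable_entries and the stat_fn parameter are my own, hypothetical names; they are not part of Pacemaker, DRBD, or PostgreSQL. The sketch only detects such entries, it does not repair anything.

```python
import os


def find_unstatable_entries(directory, stat_fn=os.lstat):
    """Return (name, reason) pairs for entries that the directory lists
    but whose metadata cannot be read.

    stat_fn is a hypothetical hook (defaults to os.lstat) so the check
    can be exercised without a damaged filesystem.
    """
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            stat_fn(path)
        except OSError as exc:
            # The entry exists in the directory listing, yet stat fails:
            # the same situation ls reports as "cannot access ...".
            broken.append((name, exc.strerror or str(exc)))
    return broken
```

Calling find_unstatable_entries("/var/lib/pgsql/9.2/data") on the affected node would be expected to list postmaster.pid if the entry is still in this state; an empty list means every listed entry could be stat'ed.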
I am using:
- pacemaker-1.1.7-6
- corosync-1.4.1-7
- resource-agents-3.9.2-12
- drbd-8.4.3-2

CONFIGURATION

[root@pcmk2 9.2]# crm configure show
node pcmk1 \
        attributes standby="off"
node pcmk2 \
        attributes standby="off"
primitive drbd_pg ocf:linbit:drbd \
        params drbd_resource="postgres" \
        op monitor interval="15" role="Master" \
        op monitor interval="16" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="120"
primitive pg_fs ocf:heartbeat:Filesystem \
        params device="/dev/vg_local-lv_pgsql/lv_pgsql" directory="/var/lib/pgsql/9.2/data" options="noatime,nodiratime" fstype="xfs" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="120"
primitive pg_lsb lsb:postgresql-9.2 \
        op monitor interval="30" timeout="60" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
primitive pg_lvm ocf:heartbeat:LVM \
        params volgrpname="vg_local-lv_pgsql" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive pg_vip ocf:heartbeat:IPaddr2 \
        params ip="x.x.x.x" iflabel="pcmkvip" \
        op monitor interval="5"
group PGServer pg_lvm pg_fs pg_lsb pg_vip \
        meta target-role="Started"
ms ms_drbd_pg drbd_pg \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location master-prefer-node1 pg_vip 50: pcmk1
colocation col_pg_drbd inf: PGServer ms_drbd_pg:Master
order ord_pg inf: ms_drbd_pg:promote PGServer:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="4" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        maintenance-mode="true" \
        last-lrm-refresh="1376753310"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Best regards,
Michal Mistina
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org