Re: [Pacemaker] Fencing of movable VirtualDomains
Andrew Beekhof <and...@beekhof.net> writes:

>> It may be due to two “order”:
>>
>> #+begin_src
>> order ONE-Frontend-after-its-Stonith inf: Stonith-ONE-Frontend ONE-Frontend
>> order Quorum-Node-after-its-Stonith inf: Stonith-Quorum-Node Quorum-Node
>> #+end_src
>
> Probably. Any particular reason for them to exist?

Maybe not, the colocation should be sufficient, but even without the orders, fencing of unclean VMs is attempted with the other Stonith devices.

I'll switch to a newer corosync/pacemaker and use pacemaker_remote if I can manage dlm/cLVM/OCFS2 with it.

Regards.

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Y should pacemaker be started simultaneously.
On 06/10/14 02:11 AM, Andrei Borzenkov wrote:

>> On Mon, Oct 6, 2014 at 9:03 AM, Digimer <li...@alteeve.ca> wrote:
>> If stonith was configured, after the timeout the first node would fence the second node (unable to reach != off). Alternatively, you can set corosync to 'wait_for_all' and have the first node do nothing until it sees the peer.
>
> Am I right that wait_for_all is available only in corosync 2.x and not in 1.x?

You are correct, yes. To do otherwise would be to risk a split-brain. Each node needs to know the state of its peer in order to run services safely. By having both start at the same time, each knows what the other is doing. By disabling quorum, you allow one node to continue to operate when the other leaves, but it needs that initial connection to know for sure what it's doing.

> Does it apply to both corosync 1.x and 2.x, or only to 2.x with wait_for_all? I was actually also confused about the precise meaning of disabling quorum in pacemaker (setting no-quorum-policy: ignore). So if I have a two-node cluster with pacemaker 1.x and corosync 1.x, with no-quorum-policy=ignore and no fencing, what happens when a single node starts?

Quorum tells the cluster that if a peer leaves (gracefully, or because it was fenced), the remaining node is allowed to continue providing services. Stonith is needed to put a node that is in an unknown state into a known state, be it because a peer couldn't be reached at startup or because it stopped responding. So quorum and stonith play rather different roles. Without stonith, regardless of quorum, you risk split-brain and/or data corruption. Operating a cluster without stonith is operating a cluster in an undetermined state, and should never be done.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
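For corosync 2.x, the wait_for_all behaviour Digimer describes is set in the quorum section of corosync.conf. A minimal sketch for a two-node cluster (values are illustrative; per votequorum, two_node: 1 implies wait_for_all: 1 unless explicitly overridden):

```
quorum {
    provider: corosync_votequorum

    # Two-node mode: each node keeps quorum when its peer leaves cleanly
    # or is fenced.
    two_node: 1

    # On a fresh start, do nothing until all nodes have been seen at
    # least once, avoiding a split-brain at boot.
    wait_for_all: 1
}
```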
Re: [Pacemaker] runing abitrary script when resource fails
On 10/06/2014 06:20 AM, Alex Samad - Yieldbroker wrote:

> Is it possible to do this? On any major fail, I would like to send a signal to my zabbix server.
> Alex

Hi Alex,

This sort of thing has been discussed before; for example, see
http://oss.clusterlabs.org/pipermail/pacemaker/2014-August/022418.html

At Gleim, we use an active monitoring approach: instead of waiting for a notification, our monitor polls the cluster regularly. In our case, we're using the check_crm nagios plugin available at
https://github.com/dnsmichi/icinga-plugins/blob/master/scripts/check_crm.
It's a fairly simple Perl script utilizing crm_mon, so you could probably tweak the output to fit something zabbix expects, if there isn't an equivalent for zabbix already. And of course you can configure zabbix to monitor the services running on the cluster as well.

--
Ken Gaillot kjgai...@gleim.com
Network Operations Center, Gleim Publications
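A minimal sketch of such a poller for zabbix, assuming `crm_mon` and `zabbix_sender` are installed on the host: the item key `pacemaker.failed_actions` and the server name `zabbix.example.com` are hypothetical placeholders, and the grep-based parsing is deliberately crude compared to check_crm's.

```shell
#!/bin/sh
# Count lines mentioning "failed" in crm_mon-style status text read from
# stdin; a crude stand-in for check_crm's real parsing.
count_failed() {
    grep -ci 'failed'
}

# Only push a value when both tools are actually present.
if command -v crm_mon >/dev/null 2>&1 && command -v zabbix_sender >/dev/null 2>&1; then
    fails=$(crm_mon -1 2>/dev/null | count_failed)
    # pacemaker.failed_actions is a hypothetical zabbix item key;
    # define a matching item in your zabbix template.
    zabbix_sender -z zabbix.example.com -s "$(hostname)" \
        -k pacemaker.failed_actions -o "${fails:-0}"
fi
```

Run from cron (or a systemd timer) at whatever polling interval your alerting needs; zabbix triggers can then fire on a non-zero value.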
Re: [Pacemaker] Fencing of movable VirtualDomains
On 6 Oct 2014, at 8:14 pm, Daniel Dehennin <daniel.dehen...@baby-gnu.org> wrote:

> Andrew Beekhof <and...@beekhof.net> writes:
>>> It may be due to two “order”:
>>> #+begin_src
>>> order ONE-Frontend-after-its-Stonith inf: Stonith-ONE-Frontend ONE-Frontend
>>> order Quorum-Node-after-its-Stonith inf: Stonith-Quorum-Node Quorum-Node
>>> #+end_src
>> Probably. Any particular reason for them to exist?
> Maybe not, the colocation should be sufficient, but even without the orders, fencing of unclean VMs is attempted with the other Stonith devices.

Which other devices? The config you sent through didn't have any others.

> I'll switch to a newer corosync/pacemaker and use pacemaker_remote if I can manage dlm/cLVM/OCFS2 with it.

No can do. All three services require corosync on the node.

> Regards.
> --
> Daniel Dehennin
> Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
On 6 Oct 2014, at 4:09 pm, renayama19661...@ybb.ne.jp wrote:

> Hi All,
>
> When I run the following sample on RHEL6.5 (glib2-2.22.5-7.el6) and on Ubuntu 14.04 (libglib2.0-0:amd64 2.40.0-2), the behaviour differs.
>
> * Sample : test2.c
>
> {{{
> #include <stdio.h>
> #include <stdlib.h>
> #include <glib.h>
> #include <sys/times.h>
>
> guint t1, t2, t3;
>
> gboolean timer_func2(gpointer data) {
>     printf("TIMER EXPIRE!2\n");
>     fflush(stdout);
>     return FALSE;
> }
>
> gboolean timer_func1(gpointer data) {
>     clock_t ret;
>     struct tms buff;
>
>     ret = times(&buff);
>     printf("TIMER EXPIRE!1 %d\n", (int)ret);
>     fflush(stdout);
>     return FALSE;
> }
>
> gboolean timer_func3(gpointer data) {
>     printf("TIMER EXPIRE 3!\n"); fflush(stdout);
>     printf("remove timer1!\n"); fflush(stdout);
>     g_source_remove(t1);
>     printf("remove timer2!\n"); fflush(stdout);
>     g_source_remove(t2);
>     printf("remove timer3!\n"); fflush(stdout);
>     g_source_remove(t3);
>     return FALSE;
> }
>
> int main(int argc, char **argv) {
>     GMainLoop *m;
>     clock_t ret;
>     struct tms buff;
>     gint64 t;
>
>     m = g_main_new(FALSE);
>     t1 = g_timeout_add(1000, timer_func1, NULL);
>     t2 = g_timeout_add(6000, timer_func2, NULL);
>     t3 = g_timeout_add(5000, timer_func3, NULL);
>     ret = times(&buff);
>     printf("START! %d\n", (int)ret);
>     g_main_run(m);
> }
> }}}
>
> * Result
>
> RHEL6.5 (glib2-2.22.5-7.el6):
>
> [root@snmp1 ~]# ./test2
> START! 429576012
> TIMER EXPIRE!1 429576112
> TIMER EXPIRE 3!
> remove timer1!
> remove timer2!
> remove timer3!
>
> Ubuntu 14.04 (libglib2.0-0:amd64 2.40.0-2):
>
> root@a1be102:~# ./test2
> START! 1718163089
> TIMER EXPIRE!1 1718163189
> TIMER EXPIRE 3!
> remove timer1!
> (process:1410): GLib-CRITICAL **: Source ID 1 was not found when attempting to remove it
> remove timer2!
> remove timer3!
>
> These problems seem to come from the following glib change:
> * https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c

The glib behaviour on Ubuntu seems reasonable; removing a source multiple times IS a valid error. I need the stack trace to know where/how this situation can occur in pacemaker.
> Before this change, g_source_remove() could delete a timer whose callback had already completed; after the change, doing so raises an error. Because of this, we get the following crit error when Pacemaker runs against a new version of glib:
>
> lrmd[1632]: error: crm_abort: crm_glib_handler: Forked child 1840 to record non-fatal assert at logging.c:73 : Source ID 51 was not found when attempting to remove it
> lrmd[1632]: crit: crm_glib_handler: GLib: Source ID 51 was not found when attempting to remove it
>
> It seems that Pacemaker will need some kind of handling for this, considering:
> * Distributions already using a new version of glib, such as Ubuntu.
> * Future glib upgrades in RHEL.
>
> A similar problem has been reported on the list:
> * http://www.gossamer-threads.com/lists/linuxha/pacemaker/91333#91333
> * http://www.gossamer-threads.com/lists/linuxha/pacemaker/92408
>
> Best Regards,
> Hideo Yamauchi.
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew,

>> These problems seem to come from the following glib change:
>> * https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
>
> The glib behaviour on Ubuntu seems reasonable; removing a source multiple times IS a valid error. I need the stack trace to know where/how this situation can occur in pacemaker.

As far as I have confirmed, Pacemaker does not remove sources several times. On Ubuntu (glib 2.40), an error occurs on the very first removal. It seems necessary to check for the source before deleting it in order not to produce an error on Ubuntu, and this also works well with the glib of RHEL6.x (and RHEL7.0):

if (g_main_context_find_source_by_id(NULL, t1) != NULL) {
    g_source_remove(t1);
}

I will send the stack trace to you after acquiring it.

Many Thanks!
Hideo Yamauchi.

----- Original Message -----
From: Andrew Beekhof <and...@beekhof.net>
To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
Cc:
Date: 2014/10/7, Tue 09:44
Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

> On 6 Oct 2014, at 4:09 pm, renayama19661...@ybb.ne.jp wrote:
>> Hi All,
>>
>> When I run the sample on RHEL6.5 (glib2-2.22.5-7.el6) and on Ubuntu 14.04 (libglib2.0-0:amd64 2.40.0-2), the behaviour differs.
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
On 7 Oct 2014, at 1:03 pm, renayama19661...@ybb.ne.jp wrote:

> Hi Andrew,
>
>>> These problems seem to come from the following glib change:
>>> * https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
>>
>> The glib behaviour on Ubuntu seems reasonable; removing a source multiple times IS a valid error. I need the stack trace to know where/how this situation can occur in pacemaker.
>
> As far as I have confirmed, Pacemaker does not remove sources several times. On Ubuntu (glib 2.40), an error occurs on the very first removal.

Not quite. Returning FALSE from the callback also removes the source from glib. So your test case effectively removes t1 twice: once implicitly, by returning FALSE in timer_func1(), and then again explicitly in timer_func3().

> It seems necessary to check for the source before deleting it in order not to produce an error on Ubuntu, and this also works well with the glib of RHEL6.x (and RHEL7.0):
>
> if (g_main_context_find_source_by_id(NULL, t1) != NULL) {
>     g_source_remove(t1);
> }
>
> I will send the stack trace to you after acquiring it.
>
> Many Thanks!
> Hideo Yamauchi.
Re: [Pacemaker] Master-slave master not promoted on Corosync restart
I think you forgot the attachments (and my eyes are going blind trying to read the word-wrapped logs :-)

On 26 Sep 2014, at 6:37 pm, Sékine Coulibaly <scoulib...@gmail.com> wrote:

Hi everyone,

I'm trying my best to diagnose a strange behaviour of my cluster. My cluster is basically a Master-Slave PostgreSQL cluster with a VIP and two nodes (clustera and clusterb). I'm running RHEL 6.5, Corosync 1.4.1-1 and Pacemaker 1.1.10. To keep the diagnostic simple, I took the slave node out.

My problem is that the cluster properly promotes the POSTGRESQL resource once (I issue a "resource cleanup MS_POSTGRESQL" to reset the failcount, and then all resources are mounted on clustera). After a Corosync restart, the POSTGRESQL resource is not promoted. I narrowed it down to the point where I add a location constraint (without this location constraint, POSTGRESQL is promoted after a Corosync restart):

location VIP_MGT_needs_gw VIP_MGT rule -inf: not_defined pingd or pingd lte 0

The logs show that the pingd attribute value is 1000 (the ping IP is pingable, and pinged [verified with tcpdump]). This attribute is set by:

primitive ping_eth1_mgt_gw ocf:pacemaker:ping params host_list=178.3.1.47 multiplier=1000 op monitor interval=10s meta migration-threshold=3

From corosync.log I can see:

Sep 26 09:49:36 [22188] clustera pengine: notice: LogActions: Start POSTGRESQL:0 (clustera)
Sep 26 09:49:36 [22188] clustera pengine: info: LogActions: Leave POSTGRESQL:1 (Stopped)
[...]
Sep 26 09:49:36 [22186] clustera lrmd: info: log_execute: executing - rsc:POSTGRESQL action:start call_id:20
[...]
Sep 26 09:49:37 [22187] clustera attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: master-POSTGRESQL (50)
[...]
Sep 26 09:49:37 [22189] clustera crmd: info: match_graph_event: Action POSTGRESQL_notify_0 (46) confirmed on clustera (rc=0)
[...]
Sep 26 09:49:38 [22186] clustera lrmd: info: log_finished: finished - rsc:ping_eth1_mgt_gw action:start call_id:22 pid:22352 exit-code:0 exec-time:2175ms queue-time:0ms
[...]
Sep 26 09:49:38 [22188] clustera pengine: info: clone_print: Master/Slave Set: MS_POSTGRESQL [POSTGRESQL]
Sep 26 09:49:38 [22188] clustera pengine: info: short_print: Slaves: [ clustera ]
Sep 26 09:49:38 [22188] clustera pengine: info: short_print: Stopped: [ clusterb ]
Sep 26 09:49:38 [22188] clustera pengine: info: native_print: VIP_MGT (ocf::heartbeat:IPaddr2): Stopped
Sep 26 09:49:38 [22188] clustera pengine: info: clone_print: Clone Set: cloned_ping_eth1_mgt_gw [ping_eth1_mgt_gw]
Sep 26 09:49:38 [22188] clustera pengine: info: short_print: Started: [ clustera ]
Sep 26 09:49:38 [22188] clustera pengine: info: short_print: Stopped: [ clusterb ]
Sep 26 09:49:38 [22188] clustera pengine: info: rsc_merge_weights: VIP_MGT: Rolling back scores from MS_POSTGRESQL
Sep 26 09:49:38 [22188] clustera pengine: info: native_color: Resource VIP_MGT cannot run anywhere
Sep 26 09:49:38 [22188] clustera pengine: info: native_color: POSTGRESQL:1: Rolling back scores from VIP_MGT
Sep 26 09:49:38 [22188] clustera pengine: info: native_color: Resource POSTGRESQL:1 cannot run anywhere
Sep 26 09:49:38 [22188] clustera pengine: info: master_color: MS_POSTGRESQL: Promoted 0 instances of a possible 1 to master
Sep 26 09:49:38 [22188] clustera pengine: info: native_color: Resource ping_eth1_mgt_gw:1 cannot run anywhere
Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp: Start recurring monitor (60s) for POSTGRESQL:0 on clustera
Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp: Start recurring monitor (60s) for POSTGRESQL:0 on clustera
Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp: Start recurring monitor (10s) for ping_eth1_mgt_gw:0 on clustera
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave POSTGRESQL:0 (Slave clustera)
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave POSTGRESQL:1 (Stopped)
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave VIP_MGT (Stopped)
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave ping_eth1_mgt_gw:0 (Started clustera)
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave ping_eth1_mgt_gw:1 (Stopped)
[...]

Then everything goes weird. POSTGRESQL is monitored and seen as master (rc=8). Since it is expected (???) to be OCF_RUNNING, the monitor operation failed. That's weird, isn't it, or am I missing something?

Sep 26 09:49:38 [22189] clustera crmd: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE