Re: [Pacemaker] Fencing of movable VirtualDomains

2014-10-06 Thread Daniel Dehennin
Andrew Beekhof and...@beekhof.net writes:

 It may be due to two “order” constraints:
 
 #+begin_src
 order ONE-Frontend-after-its-Stonith inf: Stonith-ONE-Frontend ONE-Frontend
 order Quorum-Node-after-its-Stonith inf: Stonith-Quorum-Node Quorum-Node
 #+end_src

 Probably. Any particular reason for them to exist?

Maybe not, the colocation should be sufficient, but even without the
orders, fencing of unclean VMs is attempted with other Stonith devices.

I'll switch to newer corosync/pacemaker and use the pacemaker_remote if
I can manage dlm/cLVM/OCFS2 with it.

Regards.

-- 
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Y should pacemaker be started simultaneously.

2014-10-06 Thread Digimer

On 06/10/14 02:11 AM, Andrei Borzenkov wrote:

On Mon, Oct 6, 2014 at 9:03 AM, Digimer li...@alteeve.ca wrote:

If stonith was configured, after the time out, the first node would fence
the second node (unable to reach != off).

Alternatively, you can set corosync to 'wait_for_all' and have the first
node do nothing until it sees the peer.
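
For reference, wait_for_all is a votequorum option in corosync 2.x. A minimal sketch of the relevant corosync.conf section (option names per the votequorum(5) conventions, values illustrative):

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}

Note that, per votequorum(5), setting two_node: 1 already enables wait_for_all by default.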



Am I right that wait_for_all is available only in corosync 2.x and not in 1.x?


You are correct, yes.


To do otherwise would be to risk a split-brain. Each node needs to know the
state of its peer in order to run services safely. By having both start at
the same time, each knows what the other is doing. By disabling quorum,
you allow one node to continue to operate when the other leaves, but it
needs that initial connection to know for sure what it's doing.



Does it apply to both corosync 1.x and 2.x or only to 2.x with
wait_for_all? Because I actually also was confused about precise
meaning of disabling quorum in pacemaker (setting no-quorum-policy:
ignore). So if I have two node cluster with pacemaker 1.x and corosync
1.x with no-quorum-policy=ignore and no fencing - what happens when
one single node starts?


Quorum tells the cluster that if a peer leaves (gracefully or was 
fenced), the remaining node is allowed to continue providing services.


Stonith is needed to put a node that is in an unknown state into a known 
state, be it because the cluster couldn't reach the node when starting or 
because the node stopped responding.


So quorum and stonith play rather different roles.

Without stonith, regardless of quorum, you risk split-brain and/or data 
corruption. Operating a cluster without stonith is operating it in an 
undefined state, and should never be done.
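
For illustration, a minimal stonith sketch in crm shell syntax (as used elsewhere in this digest); the fence agent's parameters, addresses, and credentials below are made-up placeholders:

primitive st-node1 stonith:fence_ipmilan \
    params ipaddr=192.168.1.10 login=admin passwd=secret \
           pcmk_host_list=node1 \
    op monitor interval=60s
location st-node1-not-on-node1 st-node1 -inf: node1
property stonith-enabled=true

The location constraint keeps a node from running its own fence device.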


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?




Re: [Pacemaker] runing abitrary script when resource fails

2014-10-06 Thread Ken Gaillot

On 10/06/2014 06:20 AM, Alex Samad - Yieldbroker wrote:

Is it possible to do this ?

Or even on any major fail, I would like to send a signal to my zabbix server

Alex


Hi Alex,

This sort of thing has been discussed before, for example see 
http://oss.clusterlabs.org/pipermail/pacemaker/2014-August/022418.html


At Gleim, we use an active monitoring approach -- instead of waiting for 
a notification, our monitor polls the cluster regularly. In our case, 
we're using the check_crm nagios plugin available at 
https://github.com/dnsmichi/icinga-plugins/blob/master/scripts/check_crm. It's 
a fairly simple Perl script utilizing crm_mon, so you could probably 
tweak the output to fit something zabbix expects, if there isn't an 
equivalent for zabbix already.
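
As a rough illustration of that polling approach, a sketch in Python. The FAILED-line heuristic, item key, and server name are assumptions; a real deployment would parse full `crm_mon -1` output and ship the value with zabbix_sender:

```python
import subprocess

def count_failed(crm_mon_text):
    """Count lines in crm_mon-style output that report a FAILED resource."""
    return sum(1 for line in crm_mon_text.splitlines() if "FAILED" in line)

def poll_cluster():
    # Real use: `crm_mon -1` prints the cluster status once and exits.
    out = subprocess.run(["crm_mon", "-1"], capture_output=True, text=True).stdout
    return count_failed(out)

# Offline demonstration on sample crm_mon-style output:
sample = "\n".join([
    "Online: [ node1 node2 ]",
    " rsc_ip (ocf::heartbeat:IPaddr2): FAILED node1",
])
print(count_failed(sample))  # -> 1
# The count could then be pushed to a pre-created trapper item, e.g.:
#   zabbix_sender -z zabbix.example.com -s $(hostname) -k pacemaker.failed -o 1
```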


And of course you can configure zabbix to monitor the services running 
on the cluster as well.


-- Ken Gaillot kjgai...@gleim.com
Network Operations Center, Gleim Publications



Re: [Pacemaker] Fencing of movable VirtualDomains

2014-10-06 Thread Andrew Beekhof

On 6 Oct 2014, at 8:14 pm, Daniel Dehennin daniel.dehen...@baby-gnu.org wrote:

 Andrew Beekhof and...@beekhof.net writes:
 
 It may be due to two “order” constraints:
 
 #+begin_src
 order ONE-Frontend-after-its-Stonith inf: Stonith-ONE-Frontend ONE-Frontend
 order Quorum-Node-after-its-Stonith inf: Stonith-Quorum-Node Quorum-Node
 #+end_src
 
 Probably. Any particular reason for them to exist?
 
 Maybe not, the colocation should be sufficient, but even without the
 orders, fencing of unclean VMs is attempted with other Stonith devices.

Which other devices?  The config you sent through didn't have any others.

 
 I'll switch to newer corosync/pacemaker and use the pacemaker_remote if
 I can manage dlm/cLVM/OCFS2 with it.

No can do.  All three services require corosync on the node. 

 
 Regards.
 
 -- 
 Daniel Dehennin
 Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
 Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF





Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

2014-10-06 Thread Andrew Beekhof

On 6 Oct 2014, at 4:09 pm, renayama19661...@ybb.ne.jp wrote:

 Hi All,
 
 When I run the following sample on RHEL6.5 (glib2-2.22.5-7.el6) and 
 Ubuntu14.04 (libglib2.0-0:amd64 2.40.0-2), the behaviour differs.
 
  * Sample : test2.c
 {{{
 #include <stdio.h>
 #include <stdlib.h>
 #include <glib.h>
 #include <sys/times.h>
 guint t1, t2, t3;
 gboolean timer_func2(gpointer data){
         printf("TIMER EXPIRE!2\n");
         fflush(stdout);
         return FALSE;
 }
 gboolean timer_func1(gpointer data){
         clock_t ret;
         struct tms buff;
 
         ret = times(&buff);
         printf("TIMER EXPIRE!1 %d\n", (int)ret);
         fflush(stdout);
         return FALSE;
 }
 gboolean timer_func3(gpointer data){
         printf("TIMER EXPIRE 3!\n");
         fflush(stdout);
         printf("remove timer1!\n");
         fflush(stdout);
         g_source_remove(t1);
         printf("remove timer2!\n");
         fflush(stdout);
         g_source_remove(t2);
         printf("remove timer3!\n");
         fflush(stdout);
         g_source_remove(t3);
         return FALSE;
 }
 int main(int argc, char** argv){
         GMainLoop *m;
         clock_t ret;
         struct tms buff;
         gint64 t;
         m = g_main_new(FALSE);
         t1 = g_timeout_add(1000, timer_func1, NULL);
         t2 = g_timeout_add(60000, timer_func2, NULL); /* long enough that t3 removes it first */
         t3 = g_timeout_add(5000, timer_func3, NULL);
         ret = times(&buff);
         printf("START! %d\n", (int)ret);
         g_main_run(m);
 }
 }}}
  * Result
  RHEL6.5(glib2-2.22.5-7.el6)  
 [root@snmp1 ~]# ./test2
 START! 429576012
 TIMER EXPIRE!1 429576112
 TIMER EXPIRE 3!
 remove timer1!
 remove timer2!
 remove timer3!
 
  Ubuntu14.04(libglib2.0-0:amd64 2.40.0-2) 
 root@a1be102:~# ./test2
 START! 1718163089
 TIMER EXPIRE!1 1718163189
 TIMER EXPIRE 3!
 remove timer1!
 
 (process:1410): GLib-CRITICAL **: Source ID 1 was not found when attempting 
 to remove it
 remove timer2!
 remove timer3!
 
 
 These problems seem to be caused by the following glib change:
  * 
 https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c

The glib behaviour on Ubuntu seems reasonable; removing a source multiple 
times IS a valid error.
I need the stack trace to know where/how this situation can occur in pacemaker.

 
 Before the change, g_source_remove() could delete a timer whose callback had 
 already completed; after the change, it raises an error.
 
 As a result, we get the following crit error when Pacemaker runs with a new 
 version of glib.
 
 lrmd[1632]:error: crm_abort: crm_glib_handler: Forked child 1840 to 
 record non-fatal assert at logging.c:73 : Source ID 51 was not found when 
 attempting to remove it
 lrmd[1632]:crit: crm_glib_handler: GLib: Source ID 51 was not found 
 when attempting to remove it
 
 It seems that Pacemaker will need some kind of workaround, considering:
  * Distributions using a newer version of glib, including Ubuntu.
  * Future glib version upgrades in RHEL.
 
 A similar problem is reported in the ML.
  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/91333#91333
  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/92408
 
 Best Regards,
 Hideo Yamauchi.
 





Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

2014-10-06 Thread renayama19661014
Hi Andrew,

 These problems seem to be due to a correction of next glib somehow or 
 other.
   * 
 https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
 
 The glib behaviour on Ubuntu seems reasonable; removing a source multiple 
 times IS a valid error.
 I need the stack trace to know where/how this situation can occur in 
 pacemaker.


As far as I could confirm, Pacemaker does not remove sources several times.
In Ubuntu (glib 2.40), the error occurs even on the first removal of a source.

Checking for the source before deleting it seems necessary to avoid the error 
in Ubuntu, and this also works with the glib of RHEL6.x (and RHEL7.0):

        if (g_main_context_find_source_by_id (NULL, t1) != NULL) {
                g_source_remove(t1);
        }

I send it to you after acquiring stack trace.

Many Thanks!
Hideo Yamauchi.

- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager 
 pacemaker@oss.clusterlabs.org
 Cc: 
 Date: 2014/10/7, Tue 09:44
 Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, 
 g_source_remove fails.
 
 
 On 6 Oct 2014, at 4:09 pm, renayama19661...@ybb.ne.jp wrote:
 [...]
 


Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

2014-10-06 Thread Andrew Beekhof

On 7 Oct 2014, at 1:03 pm, renayama19661...@ybb.ne.jp wrote:

 Hi Andrew,
 
 These problems seem to be due to a correction of next glib somehow or 
 other.
   * 
 https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
  
 The glib behaviour on Ubuntu seems reasonable; removing a source multiple 
 times IS a valid error.
 I need the stack trace to know where/how this situation can occur in 
 pacemaker.
 
 
 As far as I could confirm, Pacemaker does not remove sources several times.
 In Ubuntu (glib 2.40), the error occurs even on the first removal of a source.

Not quite. Returning FALSE from the callback also removes the source from glib.
So your test case effectively removes t1 twice: once implicitly by returning 
FALSE in timer_func1() and then again explicitly in timer_func3()
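
A GLib-free sketch of those semantics (the SourceTable class below is an illustrative stand-in, not a GLib API): a callback returning False retires its own source, so a later explicit removal of the same ID is exactly the double remove that newer glib reports as a critical error.

```python
class SourceTable:
    """Toy stand-in for GLib's table of event sources."""
    def __init__(self):
        self._next_id = 1
        self._sources = {}

    def timeout_add(self, callback):
        sid = self._next_id
        self._next_id += 1
        self._sources[sid] = callback
        return sid

    def dispatch(self, sid):
        # Mimic glib: a callback returning False removes its own source.
        if self._sources[sid]() is False:
            del self._sources[sid]

    def source_remove(self, sid):
        # Newer glib treats removal of an unknown ID as a critical error.
        if sid not in self._sources:
            print("CRITICAL: Source ID %d was not found" % sid)
            return False
        del self._sources[sid]
        return True

table = SourceTable()
t1 = table.timeout_add(lambda: False)  # like timer_func1 returning FALSE
table.dispatch(t1)                     # implicit removal on dispatch
print(table.source_remove(t1))         # explicit removal now fails -> False
```

Guarding the removal, as in the g_main_context_find_source_by_id() snippet quoted above, avoids the error by checking the ID is still known before removing it.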

 
 Confirmation and the deletion of resources seem to be necessary not to 
 produce an error in Ubuntu.
 And this works well in glib of RHEL6.x.(and RHEL7.0)
 
 if (g_main_context_find_source_by_id (NULL, t1) != NULL) {
 g_source_remove(t1);
 }
 
 I send it to you after acquiring stack trace.
 
 Many Thanks!
 Hideo Yamauchi.
 
 - Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager 
 pacemaker@oss.clusterlabs.org
 Cc: 
 Date: 2014/10/7, Tue 09:44
 Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, 
 g_source_remove fails.
 
 
 On 6 Oct 2014, at 4:09 pm, renayama19661...@ybb.ne.jp wrote:
 [...]
 
 

Re: [Pacemaker] Master-slave master not promoted on Corosync restart

2014-10-06 Thread Andrew Beekhof
I think you forgot the attachments (and my eyes are going blind trying to read 
the word-wrapped logs :-)

On 26 Sep 2014, at 6:37 pm, Sékine Coulibaly scoulib...@gmail.com wrote:

 Hi everyone,
 
 I'm trying my best to diagnose a strange behaviour of my cluster.
 
 My cluster is basically a Master-Slave PostgreSQL cluster, with a VIP.
 Two nodes (clustera and clusterb). I'm running RHEL 6.5, Corosync
 1.4.1-1 and Pacemaker 1.1.10.
 
 For the sake of simplifying the diagnosis, I took the slave node out.
 
 My problem is that the cluster properly promotes the POSTGRESQL
 resource once (I issue a resource cleanup MS_POSTGRESQL to reset
 failcount counter, and then all resources are mounted on clustera).
 After a Corosync restart, the POSTGRESQL resource is not promoted.
 
 I narrowed it down to the point where I add a location constraint
 (without this location constraint, the POSTGRESQL resource is
 promoted after a Corosync restart):
 
 location VIP_MGT_needs_gw VIP_MGT rule -inf: not_defined pingd or pingd lte 0
 
 The logs show that the pingd attribute value is 1000 (the ping IP is
 pingable, and the pings were verified with tcpdump). This attribute is set by:
 primitive ping_eth1_mgt_gw ocf:pacemaker:ping \
     params host_list=178.3.1.47 multiplier=1000 \
     op monitor interval=10s \
     meta migration-threshold=3
 
 From corosync.log I can see :
 Sep 26 09:49:36 [22188] clusterapengine:   notice: LogActions:
 Start   POSTGRESQL:0(clustera)
 Sep 26 09:49:36 [22188] clusterapengine: info: LogActions:
 Leave   POSTGRESQL:1(Stopped)
 [...]
 Sep 26 09:49:36 [22186] clustera   lrmd: info: log_execute:
 executing - rsc:POSTGRESQL action:start call_id:20
 [...]
 Sep 26 09:49:37 [22187] clustera  attrd:   notice:
 attrd_trigger_update:Sending flush op to all hosts for:
 master-POSTGRESQL (50)
 [...]
 Sep 26 09:49:37 [22189] clustera   crmd: info:
 match_graph_event:   Action POSTGRESQL_notify_0 (46) confirmed on
 clustera (rc=0)
 [...]
 Sep 26 09:49:38 [22186] clustera   lrmd: info: log_finished:
 finished - rsc:ping_eth1_mgt_gw action:start call_id:22 pid:22352
 exit-code:0 exec-time:2175ms queue-time:0ms
 [...]
 Sep 26 09:49:38 [22188] clusterapengine: info: clone_print:
  Master/Slave Set: MS_POSTGRESQL [POSTGRESQL]
 Sep 26 09:49:38 [22188] clusterapengine: info: short_print:
  Slaves: [ clustera ]
 Sep 26 09:49:38 [22188] clusterapengine: info: short_print:
  Stopped: [ clusterb ]
 Sep 26 09:49:38 [22188] clusterapengine: info: native_print:
 VIP_MGT (ocf::heartbeat:IPaddr2):   Stopped
 Sep 26 09:49:38 [22188] clusterapengine: info: clone_print:
  Clone Set: cloned_ping_eth1_mgt_gw [ping_eth1_mgt_gw]
 Sep 26 09:49:38 [22188] clusterapengine: info: short_print:
  Started: [ clustera ]
 Sep 26 09:49:38 [22188] clusterapengine: info: short_print:
  Stopped: [ clusterb ]
 Sep 26 09:49:38 [22188] clusterapengine: info:
 rsc_merge_weights:   VIP_MGT: Rolling back scores from
 MS_POSTGRESQL
 Sep 26 09:49:38 [22188] clusterapengine: info: native_color:
 Resource VIP_MGT cannot run anywhere
 Sep 26 09:49:38 [22188] clusterapengine: info: native_color:
 POSTGRESQL:1: Rolling back scores from VIP_MGT
 Sep 26 09:49:38 [22188] clusterapengine: info: native_color:
 Resource POSTGRESQL:1 cannot run anywhere
 Sep 26 09:49:38 [22188] clusterapengine: info: master_color:
 MS_POSTGRESQL: Promoted 0 instances of a possible 1 to master
 Sep 26 09:49:38 [22188] clusterapengine: info: native_color:
 Resource ping_eth1_mgt_gw:1 cannot run anywhere
 Sep 26 09:49:38 [22188] clusterapengine: info: RecurringOp:
  Start recurring monitor (60s) for POSTGRESQL:0 on clustera
 Sep 26 09:49:38 [22188] clusterapengine: info: RecurringOp:
  Start recurring monitor (60s) for POSTGRESQL:0 on clustera
 Sep 26 09:49:38 [22188] clusterapengine: info: RecurringOp:
  Start recurring monitor (10s) for ping_eth1_mgt_gw:0 on clustera
 Sep 26 09:49:38 [22188] clusterapengine: info: LogActions:
 Leave   POSTGRESQL:0(Slave clustera)
 Sep 26 09:49:38 [22188] clusterapengine: info: LogActions:
 Leave   POSTGRESQL:1(Stopped)
 Sep 26 09:49:38 [22188] clusterapengine: info: LogActions:
 Leave   VIP_MGT (Stopped)
 Sep 26 09:49:38 [22188] clusterapengine: info: LogActions:
 Leave   ping_eth1_mgt_gw:0  (Started clustera)
 Sep 26 09:49:38 [22188] clusterapengine: info: LogActions:
 Leave   ping_eth1_mgt_gw:1  (Stopped)
 [...]
 
 Then everything goes weird. POSTGRESQL is monitored and seen as
 master (rc=8). Since it is expected (???) to be OCF_RUNNING, the
 monitor operation failed. That's weird, isn't it, or am I missing
 something?
 
 Sep 26 09:49:38 [22189] clustera   crmd: info:
 do_state_transition: State transition S_POLICY_ENGINE -
 S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE