Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
   From time to time, something unexpected happens and suddenly the
   crm_mon output remains static.
   When I check the CPU usage, I see that one of the cores uses 100%
   CPU, but I cannot actually match it to either corosync or one of the
   pacemaker processes.
 
   In such a case, this high CPU usage is happening on all 7 nodes.
   I have to manually go to each node, stop pacemaker, restart
   corosync, then start pacemaker. Stopping pacemaker and corosync does
   not work in most cases; usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
   Using udpu as transport, two rings on Gigabit Ethernet, rrp_mode passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
   Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 
  OR
 
   Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (6)
   Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (6)
   Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (6)
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install from
  sources,
  but that would be very difficult to maintain and I'm not sure I would
  get rid of this issue.
 
  What do you recommend?
 
   The same thing as Lars, or switch to a distro that stays current with
   upstream (git shows 5 newer releases for that branch since it was
   released 3 years ago).
   If you do build from source, it's probably best to go with v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
 I swapped the 2 for a 1 somehow. A bit distracted, sorry.

I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but it is still the 
same issue - after some time the CPU gets to 100%, and the corosync log is flooded 
with messages like:

Mar 12 07:36:55 [4793] ctdb2   cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:56 [4793] ctdb2   cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2   cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2   cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)


Shall I try to downgrade to 1.4.6? What is the difference in that build? Or 
where should I start troubleshooting?

Thank you in advance.
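A few generic first checks for a wedged corosync, assuming the stock corosync 2.x
command-line tools are installed; a sketch, not something suggested in this thread:

    corosync-cfgtool -s      # ring/membership status as corosync sees it on this node
    corosync-quorumtool -s   # quorum state and member list
    corosync-cpgtool         # CPG groups and the processes attached to them
    top -b -n 1 | head -20   # which process is actually burning the CPU

"Try again (6)" is corosync returning CS_ERR_TRY_AGAIN to the pacemaker daemons,
i.e. corosync itself is not accepting or flushing the queued CPG messages.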






 

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
 
[snip]

Attila,

 Shall I try to downgrade to 1.4.6? What is the difference in that build? Or 
 where should I start troubleshooting?

First of all, the 1.x branch (flatiron) is maintained, so even though it looks like
an old version, it's quite new. It contains more or less only 

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
Hello Jan,

Thank you very much for your help so far.

[snip]

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
[snip]

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-12 Thread Vladislav Bogdanov
12.03.2014 00:40, Andrew Beekhof wrote:
 
 On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 07.03.2014 10:30, Vladislav Bogdanov wrote:
 07.03.2014 05:43, Andrew Beekhof wrote:

 On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com 
 wrote:

 18.02.2014 03:49, Andrew Beekhof wrote:

 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

 Hi, all

 I measure the performance of Pacemaker in the following combinations.
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2

 All nodes are KVM virtual machines.

 After starting 14 nodes, I forcibly stopped the vm01 node from the inside;
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes are 
 separated from the cluster.

 Corosync then outputs the Retransmit List: log message in large 
 quantities.

 Probably best to poke the corosync guys about this.

 However, <= .11 is known to cause significant CPU usage with that many 
 nodes.
 I can easily imagine this starving corosync of resources and causing 
 breakage.

 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.

 Andrew, current git master (ee094a2) almost works; the only issue is
 that crm_diff calculates an incorrect diff digest. If I replace the digest in
 the diff by hand with what the cib calculates, it applies
 correctly. Otherwise - -206.

 More details?

 Hmmm...
 seems to be crmsh-specific,
 Cannot reproduce with pure-XML editing.
 Kristoffer, does 
 http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

 The problem seems to be caused by the fact that crmsh does not provide
 the status section in either the orig or the new XML it passes to crm_diff, and digest
 generation seems to rely on that, so crm_diff and the cib daemon produce
 different digests.

 Attached are two sets of XML files, one (orig.xml, new.xml, patch.xml)
 are related to the full CIB operation (with status section included),
 another (orig-edited.xml, new-edited.xml, patch-edited.xml) have that
 section removed like crmsh does do.

 Resulting diffs differ only by digest, and that seems to be the exact issue.
 
 This should help.  As long as crmsh isn't passing -c to crm_diff, then the 
 digest will no longer be present.
 
   https://github.com/beekhof/pacemaker/commit/c8d443d

Yep, that helped.
Thank you!
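For reference, the two crm_diff modes involved, using the file names attached earlier
in the thread; a sketch, assuming the post-c8d443d behaviour described above:

    # plain XML diff: with the fix above, no digest is added to the patch
    crm_diff --original orig.xml --new new.xml > patch.xml

    # CIB-aware diff (-c / --cib): the digest is still generated and included
    crm_diff -c --original orig.xml --new new.xml > patch-with-digest.xml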




Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
[snip]

Re: [Pacemaker] fencing question

2014-03-12 Thread Lars Marowsky-Bree
On 2014-03-12T15:17:13, Karl Rößmann k.roessm...@fkf.mpg.de wrote:

 Hi,
 
 we have a two-node HA cluster using SUSE SLES 11 HA Extension SP3,
 latest release value.
 A resource (Xen) was manually stopped; the shutdown_timeout is 120s,
 but after 60s the node was fenced and shut down by the other node.
 
 Should I change some timeout value?
 
 This is a part of our configuration:
 ...
 primitive fkflmw ocf:heartbeat:Xen \
 meta target-role=Started is-managed=true allow-migrate=true \
 op monitor interval=10 timeout=30 \
 op migrate_from interval=0 timeout=600 \
 op migrate_to interval=0 timeout=600 \
 params xmfile=/etc/xen/vm/fkflmw shutdown_timeout=120

You need to set a 120s timeout for the stop operation too:
op stop timeout=150

 default-action-timeout=60s

Or set this to, say, 150s.
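For illustration, the primitive from above with an explicit stop timeout added would
look roughly like this in crm syntax (untested sketch):

    primitive fkflmw ocf:heartbeat:Xen \
            meta target-role=Started is-managed=true allow-migrate=true \
            op monitor interval=10 timeout=30 \
            op migrate_from interval=0 timeout=600 \
            op migrate_to interval=0 timeout=600 \
            op stop interval=0 timeout=150 \
            params xmfile=/etc/xen/vm/fkflmw shutdown_timeout=120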


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde




Re: [Pacemaker] fencing question

2014-03-12 Thread Karl Rößmann

Hi.


primitive fkflmw ocf:heartbeat:Xen \
meta target-role=Started is-managed=true allow-migrate=true \
op monitor interval=10 timeout=30 \
op migrate_from interval=0 timeout=600 \
op migrate_to interval=0 timeout=600 \
params xmfile=/etc/xen/vm/fkflmw shutdown_timeout=120


You need to set a 120s timeout for the stop operation too:
op stop timeout=150


default-action-timeout=60s


Or set this to, say, 150s.



Can I do this while the resource (the Xen VM) is running?



Karl



--
Karl RößmannTel. +49-711-689-1657
Max-Planck-Institut FKF Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart email k.roessm...@fkf.mpg.de



Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
[snip]

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri


[snip]

[Pacemaker] missing init scripts for corosync and pacemaker

2014-03-12 Thread Jay G. Scott

OS = RHEL 6

Because my machines are behind a firewall, I can't install
via yum.  I had to bring down the RPMs and install them.
Here are the RPMs I installed.  Yeah, it bothers me that
they say fc20, but that's what I got when I used the
pacemaker.repo file I found online.

corosync-2.3.3-1.fc20.x86_64.rpm
corosynclib-2.3.3-1.fc20.x86_64.rpm
libibverbs-1.1.7-3.fc20.x86_64.rpm
libqb-0.17.0-1.fc20.x86_64.rpm
librdmacm-1.0.17-2.fc20.x86_64.rpm
pacemaker-1.1.11-1.fc20.x86_64.rpm
pacemaker-cli-1.1.11-1.fc20.x86_64.rpm
pacemaker-cluster-libs-1.1.11-1.fc20.x86_64.rpm
pacemaker-libs-1.1.11-1.fc20.x86_64.rpm
resource-agents-3.9.5-9.fc20.x86_64.rpm

I have all of these installed.  I lack an /etc/init.d
script for corosync and pacemaker.

How come?
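One quick way to see what service files the installed packages actually provide,
assuming standard rpm tooling (sketch):

    rpm -ql corosync pacemaker | grep -E 'init\.d|systemd'

Fedora 20 packages ship systemd unit files rather than SysV init scripts, and RHEL 6
does not use systemd, so fc20 builds will not drop anything into /etc/init.d.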

j.


-- 
Jay Scott   512-835-3553g...@arlut.utexas.edu
Head of Sun Support, Sr. System Administrator
Applied Research Labs, Computer Science Div.   S224
University of Texas at Austin



Re: [Pacemaker] fencing question

2014-03-12 Thread Lars Marowsky-Bree
On 2014-03-12T16:16:54, Karl Rößmann k.roessm...@fkf.mpg.de wrote:

 primitive fkflmw ocf:heartbeat:Xen \
 meta target-role=Started is-managed=true allow-migrate=true \
 op monitor interval=10 timeout=30 \
 op migrate_from interval=0 timeout=600 \
 op migrate_to interval=0 timeout=600 \
 params xmfile=/etc/xen/vm/fkflmw shutdown_timeout=120
 
 You need to set a 120s timeout for the stop operation too:
  op stop timeout=150
 
 default-action-timeout=60s
 
 Or set this to, say, 150s.
 can I do this while the resource (the xen VM) is running ?

Yes, changing the stop timeout should not have a negative impact on your
resource.

You can also check how the cluster would react:

# crm configure
crm(live)configure# edit
(Make all changes you want here)
crm(live)configure# simulate actions nograph

before you type commit.

Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde




[Pacemaker] pacemaker depdendancy on samba

2014-03-12 Thread Alex Samad - Yieldbroker
Hi

Just going through my cluster build; it seems like

yum install pacemaker

wants to bring in samba. I have recently migrated up to samba4, and I am wondering if I 
can find a pacemaker that is dependent on samba4?

I'm on CentOS 6.5; on a quick look I am guessing this might not be a pacemaker 
issue, it might be a dep of a dep ..


Thanks
Alex 



Re: [Pacemaker] pacemaker depdendancy on samba

2014-03-12 Thread Trevor Hemsley
On 12/03/14 23:18, Alex Samad - Yieldbroker wrote:
 Hi

 Just going through my cluster build; it seems like

 yum install pacemaker

 wants to bring in samba. I have recently migrated up to samba4, and I am wondering if 
 I can find a pacemaker that is dependent on samba4?

 I'm on CentOS 6.5; on a quick look I am guessing this might not be a pacemaker 
 issue, it might be a dep of a dep ..

Pacemaker wants to install resource-agents, resource-agents has a
dependency on /sbin/mount.cifs and then it goes on from there...
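One way to trace that chain yourself, assuming the yum-utils package is installed
(sketch):

    repoquery --requires --resolve resource-agents   # requirements resolved to packages
    yum deplist pacemaker | less                      # walk the whole dependency tree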

T



[Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs

2014-03-12 Thread Alex Samad - Yieldbroker
Hi

So this is what I used to do to setup my cluster
crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore
crm configure rsc_defaults resource-stickiness=100
crm configure primitive ybrpip ocf:heartbeat:IPaddr2 params ip=10.32.21.30 
cidr_netmask=24 op monitor interval=5s
crm configure primitive ybrpstat ocf:yb:ybrp op monitor interval=5s
crm configure colocation ybrp INFINITY: ybrpip ybrpstat
crm configure group ybrpgrp ybrpip ybrpstat
crm_resource --meta --resource ybrpstat --set-parameter migration-threshold 
--parameter-value 2
crm_resource --meta --resource ybrpstat --set-parameter failure-timeout 
--parameter-value 2m
 

I have written my own ybrp resource (/usr/lib/ocf/resource.d/yb/ybrp)

So basically what I want to do is have 2 nodes with a floating VIP (I was 
looking at moving forward with the IP load balancing).
I run an application on both nodes; it doesn't need to be started, it should start 
at server start-up.
I need the VIP or the load balancing to move from node to node.

Normal operation would be
50% on node A and 50% on node B (I realise this depends on the IP hash).
If the app fails on one node then all the traffic should move to the other node. 
The cluster should not try to restart the application.
Once the application comes back on the broken node, the VIP should be allowed to 
move back, or the load balancing should accept traffic back there.
Simple?

I was trying to use the above commands to program the new pacemaker, but I 
can't find an easy transform of crm to pcs... so I thought I would ask the 
list for help configuring the load-balanced VIP.

Alex





Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs

2014-03-12 Thread Andrew Beekhof

On 13 Mar 2014, at 11:56 am, Alex Samad - Yieldbroker 
alex.sa...@yieldbroker.com wrote:

 [snip]

 I was trying to use the above commands to programme up the new pacemaker, but 
 I can't find the easy transform of crm to pcs...

Does 
https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md 
help?
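For what it's worth, a rough pcs translation of the crm commands above might look
like this (untested sketch; exact syntax differs a little between pcs versions):

    pcs property set stonith-enabled=false
    pcs property set no-quorum-policy=ignore
    pcs resource defaults resource-stickiness=100
    pcs resource create ybrpip ocf:heartbeat:IPaddr2 ip=10.32.21.30 cidr_netmask=24 op monitor interval=5s
    pcs resource create ybrpstat ocf:yb:ybrp op monitor interval=5s
    pcs constraint colocation add ybrpip ybrpstat INFINITY
    pcs resource group add ybrpgrp ybrpip ybrpstat
    pcs resource meta ybrpstat migration-threshold=2 failure-timeout=2m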

 so I thought I would ask the list for help to configure up with the load 
 balance VIP.
 
 Alex
 
 
 


Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs

2014-03-12 Thread Alex Samad - Yieldbroker


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Thursday, 13 March 2014 1:39 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] help migrating over cluster config from pacemaker
 plugin into corosync to pcs
[snip]

  I was trying to use the above commands to programme up the new
 pacemaker, but I can't find the easy transform of crm to pcs...
 
 Does https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-
 crmsh-quick-ref.md help?

Looks like it does
thanks

 
[snip]




[Pacemaker] help building 2 node config

2014-03-12 Thread Alex Samad - Yieldbroker
Hi

I sent out an email asking for help converting an old config. I thought it might be 
better to start from scratch.

I have 2 nodes, which run an application (sort of a reverse proxy).
Node A
Node B

I would like to use OCF:IPaddr2 so that I can load balance IP

# Create ybrp ip address  
pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50 
cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
op start interval=0s timeout=60s \
op monitor interval=5s timeout=20s \
op stop interval=0s timeout=60s \

# Clone it
pcs resource clone ybrpip2 ybrpip meta master-max=2 master-node-max=2 
clone-max=2 clone-node-max=1 notify=true interleave=true


This seems to work okay, but I tested it.
On node B I ran this:
crm_mon -1 ; iptables -nvL INPUT | head -5 ; ip a ; echo -n [ ; cat 
/proc/net/ipt_CLUSTERIP/10.172.214.50 ; echo ]

In particular I was watching /proc/net/ipt_CLUSTERIP/10.172.214.50,

and I rebooted node A. I noticed ipt_CLUSTERIP didn't fail over? I would 
have expected to see 1,2 in there on node B when node A failed.

In fact, when I reboot node A it comes back with 2 in there ... that's not good!


pcs resource show ybrpip-clone
 Clone: ybrpip-clone
  Meta Attrs: master-max=2 master-node-max=2 clone-max=2 clone-node-max=1 
notify=true interleave=true 
  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 
clusterip_hash=sourceip-sourceport 
   Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
   monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
   stop interval=0s timeout=60s (ybrpip-stop-interval-0s)

pcs resource show ybrpip  
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 
clusterip_hash=sourceip-sourceport 
  Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
  monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
  stop interval=0s timeout=60s (ybrpip-stop-interval-0s)



so I think this has something to do with metadata.



I have another resource:
pcs resource create ybrpstat ocf:yb:ybrp op monitor interval=5s

I want 2 of these, one for node A and one for node B.

I want the IP address to be dependent on whether this resource is available on the 
node.  How can I do that?
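One way to express that dependency with pcs is to colocate and order the IP with the
status resource, as the follow-up below also arrives at (sketch):

    pcs constraint colocation add ybrpip-clone ybrpstat-clone INFINITY
    pcs constraint order ybrpstat-clone then ybrpip-clone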

Alex







Re: [Pacemaker] help building 2 node config

2014-03-12 Thread Alex Samad - Yieldbroker
Well I think I have worked it out 


# Create ybrp ip address  
pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50 
cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
op start interval=0s timeout=60s \
op monitor interval=5s timeout=20s \
op stop interval=0s timeout=60s \

# Clone it
#pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2

# Create status
pcs resource create ybrpstat ocf:yb:ybrp op \
op start interval=10s timeout=60s \
op monitor interval=5s timeout=20s \
op stop interval=10s timeout=60s \





# clone it it
pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2
pcs resource clone ybrpstat globally-unique=false clone-max=2 clone-node-max=2

pcs constraint colocation add ybrpip ybrpstat INFINITY
pcs constraint colocation add ybrpip-clone ybrpstat-clone INFINITY
pcs constraint order ybrpstat then ybrpip
pcs constraint order ybrpstat-clone then ybrpip-clone
pcs constraint location ybrpip prefers devrp1
pcs constraint location ybrpip-clone prefers devrp2


Have I done anything silly ?

Also, as I don't have the application actually running on my nodes, I notice 
fails occur very fast, more than 1 sec. Where is that configured, and how do I 
configure it so that it only fails over to the other node after 2 or 3, 4 or 5 
attempts? I also want the resources to move back to the original nodes when they 
come back.

So I tried the config above, and when I rebooted node A the IP address on A went 
to node B, but when A came back it didn't move back to node A.
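The failure-count and fail-back behaviour is controlled by per-resource meta
attributes plus stickiness; a pcs sketch, with example values only:

    # give up on a node only after 3 local failures, and expire those failures
    # after 60s so the resource may run there again
    pcs resource meta ybrpstat migration-threshold=3 failure-timeout=60s

    # stickiness > 0 keeps resources where they are; leave it at 0 (or unset)
    # if fail-back to the preferred node is wanted
    pcs resource defaults resource-stickiness=0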




pcs config
Cluster Name: ybrp
Corosync Nodes:
 
Pacemaker Nodes:
 devrp1 devrp2 

Resources: 
 Clone: ybrpip-clone
  Meta Attrs: globally-unique=true clone-max=2 clone-node-max=2 
  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 
clusterip_hash=sourceip-sourceport 
   Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
   monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
   stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=2 
  Resource: ybrpstat (class=ocf provider=yb type=ybrp)
   Operations: start interval=10s timeout=60s (ybrpstat-start-interval-10s)
   monitor interval=5s timeout=20s (ybrpstat-monitor-interval-5s)
   stop interval=10s timeout=60s (ybrpstat-stop-interval-10s)

Stonith Devices: 
Fencing Levels: 

Location Constraints:
  Resource: ybrpip
Enabled on: devrp1 (score:INFINITY) (id:location-ybrpip-devrp1-INFINITY)
  Resource: ybrpip-clone
Enabled on: devrp2 (score:INFINITY) 
(id:location-ybrpip-clone-devrp2-INFINITY)
Ordering Constraints:
  start ybrpstat then start ybrpip (Mandatory) 
(id:order-ybrpstat-ybrpip-mandatory)
  start ybrpstat-clone then start ybrpip-clone (Mandatory) 
(id:order-ybrpstat-clone-ybrpip-clone-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat (INFINITY) (id:colocation-ybrpip-ybrpstat-INFINITY)
  ybrpip-clone with ybrpstat-clone (INFINITY) 
(id:colocation-ybrpip-clone-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6-368c726
 last-lrm-refresh: 1394682724
 no-quorum-policy: ignore
 stonith-enabled: false

the constraints should have moved it back to node A ???

pcs status
Cluster name: ybrp
Last updated: Thu Mar 13 16:13:40 2014
Last change: Thu Mar 13 16:06:21 2014 via cibadmin on devrp1
Stack: cman
Current DC: devrp2 - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
4 Resources configured


Online: [ devrp1 devrp2 ]

Full list of resources:

 Clone Set: ybrpip-clone [ybrpip] (unique)
 ybrpip:0   (ocf::heartbeat:IPaddr2):   Started devrp2 
 ybrpip:1   (ocf::heartbeat:IPaddr2):   Started devrp2 
 Clone Set: ybrpstat-clone [ybrpstat]
 Started: [ devrp1 devrp2 ]




 -Original Message-
 From: Alex Samad - Yieldbroker [mailto:alex.sa...@yieldbroker.com]
 Sent: Thursday, 13 March 2014 2:07 PM
 To: pacemaker@oss.clusterlabs.org
 Subject: [Pacemaker] help building 2 node config
 
 [snip]

[Pacemaker] RESTful API support

2014-03-12 Thread John Wei
Currently, management of Pacemaker is done through the CLI or XML. Is there any plan to
provide a RESTful API to support cloud software?

John


