Re: [Pacemaker] Pacemaker/corosync freeze
-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Tuesday, March 11, 2014 10:27 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote:

Attila Megyeri (original report, 7 Mar 2014):

Hello,

We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the CPU usage, I see that one of the cores is at 100%, but I cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage happens on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most of the cases; usually a kill -9 is needed.

We use corosync 2.3.0 and pacemaker 1.1.10 on Ubuntu trusty, with udpu as transport, two rings on Gigabit ETH, rrp_mode passive.

Logs are usually flooded with CPG-related messages, such as:

Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)

OR

Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (

Andrew Beekhof:

That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though.

Attila Megyeri:

Thanks for the quick response! As I wrote, I use Ubuntu trusty; the exact package versions are:
corosync 2.3.0-1ubuntu5
pacemaker 1.1.10+git20130802-1ubuntu2

Andrew Beekhof:

Ah sorry, I seem to have missed that part.

Attila Megyeri:

There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend?

Andrew Beekhof:

The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, it's probably best to go with v1.4.6.

Attila Megyeri:

Hm, I am a bit confused here. We are using 2.3.0.

Andrew Beekhof:

I swapped the 2 for a 1 somehow. A bit distracted, sorry.

Attila Megyeri:

I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but it is still the same issue - after some time the CPU goes to 100% and the corosync log is flooded with messages like:

Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
Mar 12 07:36:56 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)

Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting? Thank you in advance.

which was released approx. a year ago (you mention 3
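The "Try again (6)" in these messages is corosync's CS_ERR_TRY_AGAIN return code: corosync is refusing CPG sends, so outgoing messages queue up inside the pacemaker daemons. The thread does not show which diagnostics were run at this point, but a minimal, generic set of checks on an affected node might look like the following (illustrative only; it assumes the standard corosync 2.x command-line tools are installed):

# which thread is actually spinning? corosync's main loop is effectively
# single-threaded, so one pegged thread is a strong hint
top -H -b -n 1 | head -20

# ring status as this node's corosync sees it
corosync-cfgtool -s

# quorum / membership view
corosync-quorumtool -s

# CPG groups currently registered (the pacemaker daemons should be listed)
corosync-cpgtool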
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a):

[snip]

Attila,

> Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting?

First of all, the 1.x branch (flatiron) is maintained, so even though it looks like an old version, it is quite new. It contains more or less only
Re: [Pacemaker] Pacemaker/corosync freeze
Hello Jan,

Thank you very much for your help so far.

-Original Message-
From: Jan Friesse [mailto:jfrie...@redhat.com]
Sent: Wednesday, March 12, 2014 9:51 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

[snip]
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a):

[snip]
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
12.03.2014 00:40, Andrew Beekhof wrote:
On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
07.03.2014 10:30, Vladislav Bogdanov wrote:
07.03.2014 05:43, Andrew Beekhof wrote:
On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
18.02.2014 03:49, Andrew Beekhof wrote:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all

I measure the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. I stopped the node vm01 forcibly from the inside after starting 14 nodes; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes are separated from the cluster. "Retransmit List:" log messages are then output in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU.

Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib calculates, it applies correctly as expected. Otherwise: -206.

More details?

Hmmm... seems to be crmsh-specific; I cannot reproduce it with pure-XML editing.

Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

The problem seems to be caused by the fact that crmsh does not provide the status section in both the orig and new XMLs passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests. Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml) is for the full CIB operation (with the status section included), the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that section removed, like crmsh does. The resulting diffs differ only by digest, and that seems to be the exact issue.

This should help. As long as crmsh isn't passing -c to crm_diff, the digest will no longer be present. https://github.com/beekhof/pacemaker/commit/c8d443d

Yep, that helped. Thank you!
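For readers following along, the crm_diff behaviour discussed above can be seen directly from the command line. This is an illustrative invocation only (orig.xml and new.xml stand in for the attached files, which are not part of this archive), assuming the -o/--original, -n/--new and -c/--cib options of pacemaker 1.1.x crm_diff:

# plain diff between two CIB snapshots - no digest in the output
crm_diff -o orig.xml -n new.xml > patch.xml

# "CIB" mode adds version details and a digest to the patch; this is the
# mode in which the crmsh-generated digest mismatched the cib daemon's
crm_diff -c -o orig.xml -n new.xml > patch-cib.xml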
Re: [Pacemaker] Pacemaker/corosync freeze
-Original Message-
From: Jan Friesse [mailto:jfrie...@redhat.com]
Sent: Wednesday, March 12, 2014 2:27 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

[snip]
Re: [Pacemaker] fencing question
On 2014-03-12T15:17:13, Karl Rößmann k.roessm...@fkf.mpg.de wrote:

Hi,

we have a two node HA cluster using SuSE SLES 11 HA Extension SP3, latest release. A resource (Xen) was manually stopped; the shutdown_timeout is 120s, but after 60s the node was fenced and shut down by the other node. Should I change some timeout value? This is a part of our configuration:

...
primitive fkflmw ocf:heartbeat:Xen \
  meta target-role=Started is-managed=true allow-migrate=true \
  op monitor interval=10 timeout=30 \
  op migrate_from interval=0 timeout=600 \
  op migrate_to interval=0 timeout=600 \
  params xmfile=/etc/xen/vm/fkflmw shutdown_timeout=120

You need to set a 120s timeout for the stop operation too:

  op stop timeout=150

> default-action-timeout=60s

Or set this to, say, 150s.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
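Putting the advice together, a sketch (not from the thread, untested) of what the adjusted primitive could look like in the same crm shell syntax; the point is that the cluster's stop timeout must exceed the agent's shutdown_timeout of 120s, otherwise the stop action is declared failed and the node is fenced while the guest is still shutting down:

primitive fkflmw ocf:heartbeat:Xen \
  meta target-role=Started is-managed=true allow-migrate=true \
  op monitor interval=10 timeout=30 \
  op migrate_from interval=0 timeout=600 \
  op migrate_to interval=0 timeout=600 \
  op stop interval=0 timeout=150 \
  params xmfile=/etc/xen/vm/fkflmw shutdown_timeout=120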
Re: [Pacemaker] fencing question
Hi.

> You need to set a 120s timeout for the stop operation too:
>
>   op stop timeout=150
>
> Or set this to, say, 150s.

can I do this while the resource (the xen VM) is running?

Karl

--
Karl Rößmann                Tel. +49-711-689-1657
Max-Planck-Institut FKF     Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart             email k.roessm...@fkf.mpg.de
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a):

[snip]
Re: [Pacemaker] Pacemaker/corosync freeze
-Original Message-
From: Jan Friesse [mailto:jfrie...@redhat.com]
Sent: Wednesday, March 12, 2014 4:31 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

[snip]
[Pacemaker] missing init scripts for corosync and pacemaker
OS = RHEL 6

because my machines are behind a firewall, i can't install via yum. i had to bring down the rpms and install them. here are the rpms i installed. yeah, it bothers me that they say fc20 but that's what i got when i used the pacemaker.repo file i found online.

corosync-2.3.3-1.fc20.x86_64.rpm
corosynclib-2.3.3-1.fc20.x86_64.rpm
libibverbs-1.1.7-3.fc20.x86_64.rpm
libqb-0.17.0-1.fc20.x86_64.rpm
librdmacm-1.0.17-2.fc20.x86_64.rpm
pacemaker-1.1.11-1.fc20.x86_64.rpm
pacemaker-cli-1.1.11-1.fc20.x86_64.rpm
pacemaker-cluster-libs-1.1.11-1.fc20.x86_64.rpm
pacemaker-libs-1.1.11-1.fc20.x86_64.rpm
resource-agents-3.9.5-9.fc20.x86_64.rpm

i have all of these installed. i lack an /etc/init.d script for corosync and pacemaker. how come?

j.

--
Jay Scott        512-835-3553        g...@arlut.utexas.edu
Head of Sun Support, Sr. System Administrator
Applied Research Labs, Computer Science Div. S224
University of Texas at Austin
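The thread itself does not answer this, but the fc20 builds listed above are Fedora 20 packages, and Fedora 20 ships systemd unit files rather than SysV init scripts, which is easy to verify with a generic RPM query (illustrative only, not specific to these exact builds):

# list everything the packages installed, filtered to start-up files
rpm -ql pacemaker corosync | grep -E 'init\.d|systemd|\.service'

# check whether anything owns a SysV init script at all
rpm -qf /etc/init.d/corosync /etc/init.d/pacemaker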
Re: [Pacemaker] fencing question
On 2014-03-12T16:16:54, Karl Rößmann k.roessm...@fkf.mpg.de wrote:

> can I do this while the resource (the xen VM) is running?

Yes, changing the stop timeout should not have a negative impact on your resource.

You can also check how the cluster would react:

# crm configure
crm(live)configure# edit
(Make all changes you want here)
crm(live)configure# simulate actions nograph

before you type commit.

Regards,
    Lars
[Pacemaker] pacemaker depdendancy on samba
Hi

Just going through my cluster build; it seems like "yum install pacemaker" wants to bring in samba. I have recently migrated up to samba4, so I am wondering if I can find a pacemaker that depends on samba4? I'm on CentOS 6.5. On a quick look I am guessing this might not be a pacemaker issue - it might be a dep of a dep...

Thanks
Alex
Re: [Pacemaker] pacemaker depdendancy on samba
On 12/03/14 23:18, Alex Samad - Yieldbroker wrote:

> wondering if I can find a pacemaker that is dependant on samba4 ?

Pacemaker wants to install resource-agents, resource-agents has a dependency on /sbin/mount.cifs, and then it goes on from there...

T
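For anyone wanting to confirm the chain on their own machine, it can be traced with stock yum tooling. This is a generic sketch (repoquery comes from the yum-utils package); the output on the actual system is the authority:

# what pacemaker itself pulls in
yum deplist pacemaker | grep provider | sort -u

# what drags in the cifs mount helper, and who provides it
repoquery --requires resource-agents | grep -i cifs
yum whatprovides '*/mount.cifs'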
[Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs
Hi

So this is what I used to do to set up my cluster:

crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore
crm configure rsc_defaults resource-stickiness=100

crm configure primitive ybrpip ocf:heartbeat:IPaddr2 params ip=10.32.21.30 cidr_netmask=24 op monitor interval=5s
crm configure primitive ybrpstat ocf:yb:ybrp op monitor interval=5s

crm configure colocation ybrp INFINITY: ybrpip ybrpstat
crm configure group ybrpgrp ybrpip ybrpstat

crm_resource --meta --resource ybrpstat --set-parameter migration-threshold --parameter-value 2
crm_resource --meta --resource ybrpstat --set-parameter failure-timeout --parameter-value 2m

I have written my own ybrp resource agent (/usr/lib/ocf/resource.d/yb/ybrp).

So basically what I want to do is have 2 nodes with a floating VIP (I was looking at moving forward with IP load balancing). I run an application on both nodes; it doesn't need to be started by the cluster - it should start at server start-up. I need the VIP or the load balancing to move from node to node. Normal operation would be 50% on node A and 50% on node B (I realise this depends on the IP hash). If the app fails on one node then all the traffic should move to the other node. The cluster should not try to restart the application. Once the application comes back on the broken node, the VIP should be allowed to move back, or the load balancing should accept traffic back there. Simple?

I was trying to use the above commands to program up the new pacemaker, but I can't find an easy transform of crm to pcs... so I thought I would ask the list for help to configure the load-balanced VIP.

Alex
Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs
On 13 Mar 2014, at 11:56 am, Alex Samad - Yieldbroker alex.sa...@yieldbroker.com wrote:

> I was trying to use the above commands to programme up the new pacemaker, but I can't find the easy transform of crm to pcs...

Does https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md help?
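To make that quick reference concrete, here is an untested sketch of pcs equivalents for the crm commands quoted at the start of this thread; pcs syntax differs slightly between versions, so check "pcs help" on the installed release before relying on it:

pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
pcs resource defaults resource-stickiness=100

pcs resource create ybrpip ocf:heartbeat:IPaddr2 ip=10.32.21.30 cidr_netmask=24 op monitor interval=5s
pcs resource create ybrpstat ocf:yb:ybrp op monitor interval=5s

# a group already implies ordering and colocation of its members,
# so the separate colocation from the crm config is not strictly needed
pcs resource group add ybrpgrp ybrpip ybrpstat

pcs resource meta ybrpstat migration-threshold=2 failure-timeout=2m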
Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs
-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Thursday, 13 March 2014 1:39 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs

[snip]

> Does https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md help?

Looks like it does, thanks.

[snip]
[Pacemaker] help building 2 node config
Hi

I sent out an email asking for help converting an old config; I thought it might be better to start from scratch.

I have 2 nodes, which run an application (sort of a reverse proxy): node A and node B. I would like to use ocf:heartbeat:IPaddr2 so that I can load-balance the IP.

# Create ybrp ip address
pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
 op start interval=0s timeout=60s \
 op monitor interval=5s timeout=20s \
 op stop interval=0s timeout=60s

# Clone it
pcs resource clone ybrpip2 ybrpip meta master-max=2 master-node-max=2 clone-max=2 clone-node-max=1 notify=true interleave=true

This seems to work okay, but then I tested it. On node B I ran:

crm_mon -1 ; iptables -nvL INPUT | head -5 ; ip a ; echo -n [ ; cat /proc/net/ipt_CLUSTERIP/10.172.214.50 ; echo ]

In particular I was watching /proc/net/ipt_CLUSTERIP/10.172.214.50, and I rebooted node A. I noticed ipt_CLUSTERIP didn't fail over? I would have expected to see 1,2 in there on node B when node A failed. In fact when I reboot node A it comes back with 2 in there... that's not good!

pcs resource show ybrpip-clone
 Clone: ybrpip-clone
  Meta Attrs: master-max=2 master-node-max=2 clone-max=2 clone-node-max=1 notify=true interleave=true
  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
   Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
               monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
               stop interval=0s timeout=60s (ybrpip-stop-interval-0s)

pcs resource show ybrpip
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
  Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
              monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
              stop interval=0s timeout=60s (ybrpip-stop-interval-0s)

So I think this has something to do with the meta data.

I have another resource:

pcs resource create ybrpstat ocf:yb:ybrp op monitor interval=5s

I want 2 of these, one for node A and one for node B, and I want the IP address to be dependent on whether this resource is available on the node. How can I do that?

Alex
Re: [Pacemaker] help building 2 node config
Well, I think I have worked it out:

# Create ybrp ip address
pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
 op start interval=0s timeout=60s \
 op monitor interval=5s timeout=20s \
 op stop interval=0s timeout=60s

# Clone it
#pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2

# Create status
pcs resource create ybrpstat ocf:yb:ybrp \
 op start interval=10s timeout=60s \
 op monitor interval=5s timeout=20s \
 op stop interval=10s timeout=60s

# clone them
pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2
pcs resource clone ybrpstat globally-unique=false clone-max=2 clone-node-max=2

pcs constraint colocation add ybrpip ybrpstat INFINITY
pcs constraint colocation add ybrpip-clone ybrpstat-clone INFINITY
pcs constraint order ybrpstat then ybrpip
pcs constraint order ybrpstat-clone then ybrpip-clone
pcs constraint location ybrpip prefers devrp1
pcs constraint location ybrpip-clone prefers devrp2

Have I done anything silly?

Also, as I don't have the application actually running on my nodes, I notice failures occur very fast, within about a second. Where is that configured, and how do I configure it so that it only fails over to the other node after 2, 3, 4 or 5 attempts? I also want the resources to move back to the original nodes when they come back.

So I tried the config above, and when I rebooted node A the IP address on A went to node B, but when A came back it didn't move back to node A.

pcs config

Cluster Name: ybrp
Corosync Nodes:
Pacemaker Nodes:
 devrp1 devrp2

Resources:
 Clone: ybrpip-clone
  Meta Attrs: globally-unique=true clone-max=2 clone-node-max=2
  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
   Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
               monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
               stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=2
  Resource: ybrpstat (class=ocf provider=yb type=ybrp)
   Operations: start interval=10s timeout=60s (ybrpstat-start-interval-10s)
               monitor interval=5s timeout=20s (ybrpstat-monitor-interval-5s)
               stop interval=10s timeout=60s (ybrpstat-stop-interval-10s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: ybrpip
    Enabled on: devrp1 (score:INFINITY) (id:location-ybrpip-devrp1-INFINITY)
  Resource: ybrpip-clone
    Enabled on: devrp2 (score:INFINITY) (id:location-ybrpip-clone-devrp2-INFINITY)
Ordering Constraints:
  start ybrpstat then start ybrpip (Mandatory) (id:order-ybrpstat-ybrpip-mandatory)
  start ybrpstat-clone then start ybrpip-clone (Mandatory) (id:order-ybrpstat-clone-ybrpip-clone-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat (INFINITY) (id:colocation-ybrpip-ybrpstat-INFINITY)
  ybrpip-clone with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-clone-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6-368c726
 last-lrm-refresh: 1394682724
 no-quorum-policy: ignore
 stonith-enabled: false

The constraints should have moved it back to node A???

pcs status

Cluster name: ybrp
Last updated: Thu Mar 13 16:13:40 2014
Last change: Thu Mar 13 16:06:21 2014 via cibadmin on devrp1
Stack: cman
Current DC: devrp2 - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
4 Resources configured

Online: [ devrp1 devrp2 ]

Full list of resources:

 Clone Set: ybrpip-clone [ybrpip] (unique)
     ybrpip:0 (ocf::heartbeat:IPaddr2): Started devrp2
     ybrpip:1 (ocf::heartbeat:IPaddr2): Started devrp2
 Clone Set: ybrpstat-clone [ybrpstat]
     Started: [ devrp1 devrp2 ]

-Original Message-
From: Alex Samad - Yieldbroker [mailto:alex.sa...@yieldbroker.com]
Sent: Thursday, 13 March 2014 2:07 PM
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] help building 2 node config

[snip]
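The two questions left open above (how many monitor failures before an instance moves, and whether it moves back afterwards) are normally controlled with migration-threshold, failure-timeout and resource-stickiness rather than with the location/colocation constraints shown. A hedged, untested sketch in the same pcs syntax (the threshold and timeout values are examples only, not taken from the thread):

# move a ybrpstat instance away only after 3 consecutive failures,
# and expire those failures after 60s so the instance may return
pcs resource meta ybrpstat-clone migration-threshold=3 failure-timeout=60s

# stickiness 0 lets instances move back to a preferred node once it returns;
# any positive value makes them stay where they currently run
pcs resource defaults resource-stickiness=0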
[Pacemaker] RESTful API support
Currently, management of Pacemaker is done through the CLI or XML. Is there any plan to provide a RESTful API to support cloud software?

John