Re: [Pacemaker] RESTful API support
On 13/03/14 01:29 AM, John Wei wrote:
> Currently, management of pacemaker is done through CLI or xml. Any plan to
> provide RESTful api to support cloud software?
>
> John

pcsd is REST-like.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
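For anyone who wants to poke at pcsd's interface directly, a minimal sketch; the port is pcsd's usual HTTPS port, but the endpoint name and authentication requirements are assumptions to verify against your pcsd version:

    # Sketch only -- endpoint and auth details are assumptions; check your pcsd version.
    # -k accepts pcsd's self-signed certificate; pcsd normally listens on TCP 2224.
    curl -k https://node1:2224/remote/status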
[Pacemaker] RESTful API support
Currently, management of Pacemaker is done through the CLI or XML. Any plan to provide a RESTful API to support cloud software?

John
Re: [Pacemaker] help building 2 node config
Well, I think I have worked it out:

# Create ybrp ip address
pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
   op start interval="0s" timeout="60s" \
   op monitor interval="5s" timeout="20s" \
   op stop interval="0s" timeout="60s" \

# Clone it
#pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2

# Create status
pcs resource create ybrpstat ocf:yb:ybrp op \
   op start interval="10s" timeout="60s" \
   op monitor interval="5s" timeout="20s" \
   op stop interval="10s" timeout="60s" \

# clone it
pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2
pcs resource clone ybrpstat globally-unique=false clone-max=2 clone-node-max=2

pcs constraint colocation add ybrpip ybrpstat INFINITY
pcs constraint colocation add ybrpip-clone ybrpstat-clone INFINITY
pcs constraint order ybrpstat then ybrpip
pcs constraint order ybrpstat-clone then ybrpip-clone
pcs constraint location ybrpip prefers devrp1
pcs constraint location ybrpip-clone prefers devrp2

Have I done anything silly?

Also, as I don't have the application actually running on my nodes, I notice failures occur very fast, more than 1 sec. Where is that configured, and how do I configure it so that after 2, 3, 4 or 5 attempts it fails over to the other node? I also want the resources to move back to the original nodes when they come back.

So I tried the config above, and when I rebooted node A the IP address on A went to node B, but when A came back it didn't move back to node A.

pcs config

Cluster Name: ybrp
Corosync Nodes:
Pacemaker Nodes:
 devrp1 devrp2

Resources:
 Clone: ybrpip-clone
  Meta Attrs: globally-unique=true clone-max=2 clone-node-max=2
  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
   Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
               monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
               stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=2
  Resource: ybrpstat (class=ocf provider=yb type=ybrp)
   Operations: start interval=10s timeout=60s (ybrpstat-start-interval-10s)
               monitor interval=5s timeout=20s (ybrpstat-monitor-interval-5s)
               stop interval=10s timeout=60s (ybrpstat-stop-interval-10s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: ybrpip
    Enabled on: devrp1 (score:INFINITY) (id:location-ybrpip-devrp1-INFINITY)
  Resource: ybrpip-clone
    Enabled on: devrp2 (score:INFINITY) (id:location-ybrpip-clone-devrp2-INFINITY)
Ordering Constraints:
  start ybrpstat then start ybrpip (Mandatory) (id:order-ybrpstat-ybrpip-mandatory)
  start ybrpstat-clone then start ybrpip-clone (Mandatory) (id:order-ybrpstat-clone-ybrpip-clone-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat (INFINITY) (id:colocation-ybrpip-ybrpstat-INFINITY)
  ybrpip-clone with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-clone-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6-368c726
 last-lrm-refresh: 1394682724
 no-quorum-policy: ignore
 stonith-enabled: false

The constraints should have moved it back to node A ???
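On the questions above about how quickly failures are detected, how many failures it takes to fail over, and moving back afterwards: detection speed is the monitor interval (5s here), while the other two are controlled by meta attributes and stickiness. A minimal sketch with illustrative values:

    # Move the resource away after 3 failed monitors, and clear the failure
    # history after 60s so it may return once the node is healthy again.
    pcs resource meta ybrpstat migration-threshold=3 failure-timeout=60s

    # Failback behaviour: stickiness 0 lets resources follow their location
    # preferences back; a positive value keeps them where they are.
    pcs resource defaults resource-stickiness=0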
pcs status

Cluster name: ybrp
Last updated: Thu Mar 13 16:13:40 2014
Last change: Thu Mar 13 16:06:21 2014 via cibadmin on devrp1
Stack: cman
Current DC: devrp2 - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
4 Resources configured

Online: [ devrp1 devrp2 ]

Full list of resources:

 Clone Set: ybrpip-clone [ybrpip] (unique)
     ybrpip:0   (ocf::heartbeat:IPaddr2):  Started devrp2
     ybrpip:1   (ocf::heartbeat:IPaddr2):  Started devrp2
 Clone Set: ybrpstat-clone [ybrpstat]
     Started: [ devrp1 devrp2 ]

> -Original Message-
> From: Alex Samad - Yieldbroker [mailto:alex.sa...@yieldbroker.com]
> Sent: Thursday, 13 March 2014 2:07 PM
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] help building 2 node config
>
> Hi
>
> I sent out an email to help convert an old config. Thought it might better to
> start from scratch.
>
> I have 2 nodes, which run an application (sort of a reverse proxy).
> Node A
> Node B
>
> I would like to use OCF:IPaddr2 so that I can load balance IP
>
> # Create ybrp ip address
> pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50
> cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
>    op start interval="0s" timeout="60s" \
>    op monitor interval="5s" timeout="20s" \
>    op stop interval="0s" timeout="60s" \
>
> # Clone it
> pcs resource clone ybrpip2 ybrpip meta master-max="2" master-node-
> max="2" clone-max="2" clone-node-max="1" notify="true"
> interleave="true"
>
>
> This
[Pacemaker] help building 2 node config
Hi

I sent out an email to help convert an old config. Thought it might be better to start from scratch.

I have 2 nodes, which run an application (sort of a reverse proxy).
Node A
Node B

I would like to use OCF:IPaddr2 so that I can load balance the IP.

# Create ybrp ip address
pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
   op start interval="0s" timeout="60s" \
   op monitor interval="5s" timeout="20s" \
   op stop interval="0s" timeout="60s" \

# Clone it
pcs resource clone ybrpip2 ybrpip meta master-max="2" master-node-max="2" clone-max="2" clone-node-max="1" notify="true" interleave="true"

This seems to work okay, but I tested it. On node B I ran this:

crm_mon -1 ; iptables -nvL INPUT | head -5 ; ip a ; echo -n [ ; cat /proc/net/ipt_CLUSTERIP/10.172.214.50 ; echo ]

In particular I was watching /proc/net/ipt_CLUSTERIP/10.172.214.50, and I rebooted node A. I noticed ipt_CLUSTERIP didn't fail over? I would have expected to see 1,2 in there on node B when node A failed. In fact, when I reboot node A it comes back with 2 in there ... that's not good!

pcs resource show ybrpip-clone
 Clone: ybrpip-clone
  Meta Attrs: master-max=2 master-node-max=2 clone-max=2 clone-node-max=1 notify=true interleave=true
  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
   Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
               monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
               stop interval=0s timeout=60s (ybrpip-stop-interval-0s)

pcs resource show ybrpip
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
  Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
              monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
              stop interval=0s timeout=60s (ybrpip-stop-interval-0s)

So I think this has something to do with metadata.

I have another resource:

pcs resource create ybrpstat ocf:yb:ybrp op monitor interval=5s

I want 2 of these, one for node A and 1 for node B. I want the IP address to be dependent on whether this resource is available on the node. How can I do that?

Alex
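One way to express "the IP may only run where ybrpstat is healthy", mirroring the worked-out configuration posted later in this thread: clone both resources, then colocate and order the clones. A sketch using the names above:

    # One ybrpstat per node, two movable (globally unique) ybrpip instances.
    pcs resource clone ybrpstat globally-unique=false clone-max=2 clone-node-max=1
    pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2

    # Keep the IP with the status resource, and start the status resource first.
    pcs constraint colocation add ybrpip-clone ybrpstat-clone INFINITY
    pcs constraint order ybrpstat-clone then ybrpip-clone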
Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs
> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Thursday, 13 March 2014 1:39 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] help migrating over cluster config from pacemaker
> plugin into corosync to pcs

[snip]

> > I was trying to use the above commands to programme up the new
> > pacemaker, but I can't find the easy transform of crm to pcs...
>
> Does https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md
> help?

Looks like it does, thanks.

> [snip]
Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs
On 13 Mar 2014, at 11:56 am, Alex Samad - Yieldbroker wrote: > Hi > > So this is what I used to do to setup my cluster > crm configure property stonith-enabled=false > crm configure property no-quorum-policy=ignore > crm configure rsc_defaults resource-stickiness=100 > crm configure primitive ybrpip ocf:heartbeat:IPaddr2 params ip=10.32.21.30 > cidr_netmask=24 op monitor interval=5s > crm configure primitive ybrpstat ocf:yb:ybrp op monitor interval=5s > crm configure colocation ybrp INFINITY: ybrpip ybrpstat > crm configure group ybrpgrp ybrpip ybrpstat > crm_resource --meta --resource ybrpstat --set-parameter migration-threshold > --parameter-value 2 > crm_resource --meta --resource ybrpstat --set-parameter failure-timeout > --parameter-value 2m > > > I have written my own ybrp resource (/usr/lib/ocf/resource.d/yb/ybrp) > > So basically what I want to do is have 2 nodes have a floating VIP (I was > looking at moving forward with the IP load balancing ) > I run an application on both nodes it doesn't need to be started, should > start at server start up. > I need the VIP or the loading balancing to move from node to node. > > Normal operation would be > 50% on node A and 50% on node B (I realise this depends on IP & hash) > If app fails on one node then all the traffic should move to the other node. > The cluster should not try and restart the application > Once the application comes back on the broken node the VIP should be allowed > to move back or the load balancing should accept traffic back there. > Simple ? > > I was trying to use the above commands to programme up the new pacemaker, but > I can't find the easy transform of crm to pcs... Does https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md help? > so I thought I would ask the list for help to configure up with the load > balance VIP. > > Alex > > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org signature.asc Description: Message signed with OpenPGP using GPGMail ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs
Hi

So this is what I used to do to set up my cluster:

crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore
crm configure rsc_defaults resource-stickiness=100
crm configure primitive ybrpip ocf:heartbeat:IPaddr2 params ip=10.32.21.30 cidr_netmask=24 op monitor interval=5s
crm configure primitive ybrpstat ocf:yb:ybrp op monitor interval=5s
crm configure colocation ybrp INFINITY: ybrpip ybrpstat
crm configure group ybrpgrp ybrpip ybrpstat
crm_resource --meta --resource ybrpstat --set-parameter migration-threshold --parameter-value 2
crm_resource --meta --resource ybrpstat --set-parameter failure-timeout --parameter-value 2m

I have written my own ybrp resource (/usr/lib/ocf/resource.d/yb/ybrp).

So basically what I want to do is have 2 nodes with a floating VIP (I was looking at moving forward with the IP load balancing).
I run an application on both nodes; it doesn't need to be started (it should start at server start-up).
I need the VIP or the load balancing to move from node to node.

Normal operation would be:
50% on node A and 50% on node B (I realise this depends on IP & hash).
If the app fails on one node then all the traffic should move to the other node. The cluster should not try and restart the application.
Once the application comes back on the broken node, the VIP should be allowed to move back, or the load balancing should accept traffic back there.
Simple?

I was trying to use the above commands to programme up the new pacemaker, but I can't find the easy transform of crm to pcs... so I thought I would ask the list for help to configure up with the load balance VIP.

Alex
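For reference, a rough pcs equivalent of the crm commands above (a sketch only; check the exact syntax against your pcs version, and note that the group already implies colocation and ordering, so the separate colocation is usually redundant):

    pcs property set stonith-enabled=false
    pcs property set no-quorum-policy=ignore
    pcs resource defaults resource-stickiness=100
    pcs resource create ybrpip ocf:heartbeat:IPaddr2 ip=10.32.21.30 cidr_netmask=24 op monitor interval=5s
    pcs resource create ybrpstat ocf:yb:ybrp op monitor interval=5s
    pcs constraint colocation add ybrpip ybrpstat INFINITY
    pcs resource group add ybrpgrp ybrpip ybrpstat
    pcs resource meta ybrpstat migration-threshold=2 failure-timeout=2m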
Re: [Pacemaker] pacemaker dependency on samba
On 12/03/14 23:18, Alex Samad - Yieldbroker wrote:
> Hi
>
> Just going through my cluster build, seems like
>
> yum install pacemaker
>
> wants to bring in samba, I have recently migrated up to samba4, wondering if
> I can find a pacemaker that is dependant on samba4 ?
>
> Im on centos 6.5, on a quick look I am guessing this might not be a pacemaker
> issue, might be a dep of a dep ..

Pacemaker wants to install resource-agents, resource-agents has a dependency on /sbin/mount.cifs, and then it goes on from there...

T
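To see exactly where the samba pull-in comes from on CentOS 6, the dependency chain can be inspected before installing (a sketch; repoquery ships in the yum-utils package):

    # What resource-agents itself requires (look for /sbin/mount.cifs):
    repoquery --requires resource-agents

    # Walk the chain from pacemaker downwards:
    yum deplist pacemaker | less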
[Pacemaker] pacemaker dependency on samba
Hi

Just going through my cluster build, seems like

yum install pacemaker

wants to bring in samba. I have recently migrated up to samba4, wondering if I can find a pacemaker that is dependent on samba4?

I'm on CentOS 6.5. On a quick look I am guessing this might not be a pacemaker issue, might be a dep of a dep ..

Thanks

Alex
Re: [Pacemaker] fencing question
On 2014-03-12T16:16:54, Karl Rößmann wrote:

> >> primitive fkflmw ocf:heartbeat:Xen \
> >>   meta target-role="Started" is-managed="true" allow-migrate="true" \
> >>   op monitor interval="10" timeout="30" \
> >>   op migrate_from interval="0" timeout="600" \
> >>   op migrate_to interval="0" timeout="600" \
> >>   params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"
> >
> > You need to set a >120s timeout for the stop operation too:
> >   op stop timeout="150"
> >
> >> default-action-timeout="60s"
> >
> > Or set this to, say, 150s.
>
> can I do this while the resource (the xen VM) is running ?

Yes, changing the stop timeout should not have a negative impact on your resource.

You can also check how the cluster would react:

# crm configure
crm(live)configure# edit
(Make all changes you want here)
crm(live)configure# simulate actions nograph

before you type "commit".

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
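Putting Lars's suggestion into the quoted primitive, the edited definition would look roughly like this (a sketch based on the configuration in this thread; 150s is simply a value larger than the 120s shutdown_timeout):

    primitive fkflmw ocf:heartbeat:Xen \
        meta target-role="Started" is-managed="true" allow-migrate="true" \
        op monitor interval="10" timeout="30" \
        op migrate_from interval="0" timeout="600" \
        op migrate_to interval="0" timeout="600" \
        op stop interval="0" timeout="150" \
        params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"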
Re: [Pacemaker] missing init scripts for corosync and pacemaker
On 13 Mar 2014, at 9:29 am, Jay G. Scott wrote:

> OS = RHEL 6
>
> because my machines are behind a firewall, i can't install
> via yum. i had to bring down the rpms and install them.
> here are the rpms i installed. yeah, it bothers me that
> they say fc20 but that's what i got when i used the
> pacemaker.repo file i found online.
>
> corosync-2.3.3-1.fc20.x86_64.rpm
> corosynclib-2.3.3-1.fc20.x86_64.rpm
> libibverbs-1.1.7-3.fc20.x86_64.rpm
> libqb-0.17.0-1.fc20.x86_64.rpm
> librdmacm-1.0.17-2.fc20.x86_64.rpm
> pacemaker-1.1.11-1.fc20.x86_64.rpm
> pacemaker-cli-1.1.11-1.fc20.x86_64.rpm
> pacemaker-cluster-libs-1.1.11-1.fc20.x86_64.rpm
> pacemaker-libs-1.1.11-1.fc20.x86_64.rpm
> resource-agents-3.9.5-9.fc20.x86_64.rpm
>
> i have all of these installed. i lack an /etc/init.d
> script for corosync and pacemaker.
>
> how come?

You're installing Fedora packages, and Fedora uses systemd .service files.

> j.
>
> --
> Jay Scott   512-835-3553   g...@arlut.utexas.edu
> Head of Sun Support, Sr. System Administrator
> Applied Research Labs, Computer Science Div. S224
> University of Texas at Austin
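A quick way to confirm what the installed packages ship instead of SysV init scripts (a sketch; on RHEL 6 you would want packages built for el6, which do carry /etc/init.d scripts):

    # Fedora 20 builds ship systemd units rather than init scripts:
    rpm -ql corosync pacemaker | grep -E '/etc/init\.d/|\.service$'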
[Pacemaker] missing init scripts for corosync and pacemaker
OS = RHEL 6

Because my machines are behind a firewall, I can't install via yum. I had to bring down the RPMs and install them. Here are the RPMs I installed. Yeah, it bothers me that they say fc20, but that's what I got when I used the pacemaker.repo file I found online.

corosync-2.3.3-1.fc20.x86_64.rpm
corosynclib-2.3.3-1.fc20.x86_64.rpm
libibverbs-1.1.7-3.fc20.x86_64.rpm
libqb-0.17.0-1.fc20.x86_64.rpm
librdmacm-1.0.17-2.fc20.x86_64.rpm
pacemaker-1.1.11-1.fc20.x86_64.rpm
pacemaker-cli-1.1.11-1.fc20.x86_64.rpm
pacemaker-cluster-libs-1.1.11-1.fc20.x86_64.rpm
pacemaker-libs-1.1.11-1.fc20.x86_64.rpm
resource-agents-3.9.5-9.fc20.x86_64.rpm

I have all of these installed. I lack an /etc/init.d script for corosync and pacemaker.

How come?

j.

--
Jay Scott   512-835-3553   g...@arlut.utexas.edu
Head of Sun Support, Sr. System Administrator
Applied Research Labs, Computer Science Div. S224
University of Texas at Austin
Re: [Pacemaker] Pacemaker/corosync freeze
> -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Wednesday, March 12, 2014 4:31 PM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Attila Megyeri napsal(a): > >> -Original Message- > >> From: Jan Friesse [mailto:jfrie...@redhat.com] > >> Sent: Wednesday, March 12, 2014 2:27 PM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> Attila Megyeri napsal(a): > >>> Hello Jan, > >>> > >>> Thank you very much for your help so far. > >>> > -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Wednesday, March 12, 2014 9:51 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Attila Megyeri napsal(a): > > > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Tuesday, March 11, 2014 10:27 PM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 12 Mar 2014, at 1:54 am, Attila Megyeri > >> > >> wrote: > >> > > -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Tuesday, March 11, 2014 12:48 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 7 Mar 2014, at 5:54 pm, Attila Megyeri > > wrote: > > > Thanks for the quick response! > > > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Friday, March 07, 2014 3:48 AM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 7 Mar 2014, at 5:31 am, Attila Megyeri > >> > >> wrote: > >> > >>> Hello, > >>> > >>> We have a strange issue with Corosync/Pacemaker. > >>> From time to time, something unexpected happens and > >> suddenly > the > >> crm_mon output remains static. > >>> When I check the cpu usage, I see that one of the cores uses > >>> 100% cpu, but > >> cannot actually match it to either the corosync or one of the > >> pacemaker processes. > >>> > >>> In such a case, this high CPU usage is happening on all 7 nodes. > >>> I have to manually go to each node, stop pacemaker, restart > >>> corosync, then > >> start pacemeker. Stoping pacemaker and corosync does not > work > >> in most of the cases, usually a kill -9 is needed. > >>> > >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > >>> > >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode > passive. 
> >>> > >>> Logs are usually flooded with CPG related messages, such as: > >>> > >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >> Sent > 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >> Sent > 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >> Sent > 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >> Sent > 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> > >>> OR > >>> > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >> Sent 0 > >> CPG > >> messages (1 remaining, last=10933): Try again ( > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >> Sent 0 > >> CPG > >> messages (1 remaining, last=10933): Try again ( > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >> Sent 0 > >> CPG > >> messages (1 remaining, last=10933): Try again ( > >> > >> That is usually a symptom of corosync getting into a horribly > >> confused > state. > >> Version? Distro? Have you checked for an update? > >> Odd that the user of all that CPU isn't showing up though. > >> > >>> > > > > As I wrote I use Ubuntu trusty, the exact package versions are: > > > > corosync 2.3.0-1ubuntu5 > > pacemaker 1.1.10+git20130802-1ubuntu2 > > Ah sorry, I seem to have missed that part. > > > > > There are no update
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): >> -Original Message- >> From: Jan Friesse [mailto:jfrie...@redhat.com] >> Sent: Wednesday, March 12, 2014 2:27 PM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> Attila Megyeri napsal(a): >>> Hello Jan, >>> >>> Thank you very much for your help so far. >>> -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 9:51 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): > >> -Original Message- >> From: Andrew Beekhof [mailto:and...@beekhof.net] >> Sent: Tuesday, March 11, 2014 10:27 PM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> >> On 12 Mar 2014, at 1:54 am, Attila Megyeri >> >> wrote: >> -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri wrote: > Thanks for the quick response! > >> -Original Message- >> From: Andrew Beekhof [mailto:and...@beekhof.net] >> Sent: Friday, March 07, 2014 3:48 AM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> >> On 7 Mar 2014, at 5:31 am, Attila Megyeri >> >> wrote: >> >>> Hello, >>> >>> We have a strange issue with Corosync/Pacemaker. >>> From time to time, something unexpected happens and >> suddenly the >> crm_mon output remains static. >>> When I check the cpu usage, I see that one of the cores uses >>> 100% cpu, but >> cannot actually match it to either the corosync or one of the >> pacemaker processes. >>> >>> In such a case, this high CPU usage is happening on all 7 nodes. >>> I have to manually go to each node, stop pacemaker, restart >>> corosync, then >> start pacemeker. Stoping pacemaker and corosync does not work >> in most of the cases, usually a kill -9 is needed. >>> >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. >>> >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. >>> >>> Logs are usually flooded with CPG related messages, such as: >>> >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> >>> OR >>> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >> Sent 0 >> CPG >> messages (1 remaining, last=10933): Try again ( >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >> Sent 0 >> CPG >> messages (1 remaining, last=10933): Try again ( >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >> Sent 0 >> CPG >> messages (1 remaining, last=10933): Try again ( >> >> That is usually a symptom of corosync getting into a horribly >> confused state. >> Version? Distro? Have you checked for an update? >> Odd that the user of all that CPU isn't showing up though. >> >>> > > As I wrote I use Ubuntu trusty, the exact package versions are: > > corosync 2.3.0-1ubuntu5 > pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. 
> > There are no updates available. The only option is to install > from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. > > What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 >>
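Where the thread above discusses the flood of "Try again (6)" on CPG sends, it usually helps to establish first whether corosync's own view of the rings and membership is healthy on each node before digging into Pacemaker. A sketch of the standard corosync 2.x diagnostics (output will vary):

    corosync-cfgtool -s                # status of each configured ring
    corosync-quorumtool -s             # quorum state and member list
    corosync-cmapctl | grep members    # runtime membership as corosync sees it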
Re: [Pacemaker] fencing question
Hi.

>> primitive fkflmw ocf:heartbeat:Xen \
>>   meta target-role="Started" is-managed="true" allow-migrate="true" \
>>   op monitor interval="10" timeout="30" \
>>   op migrate_from interval="0" timeout="600" \
>>   op migrate_to interval="0" timeout="600" \
>>   params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"
>
> You need to set a >120s timeout for the stop operation too:
>   op stop timeout="150"
>
>> default-action-timeout="60s"
>
> Or set this to, say, 150s.

can I do this while the resource (the xen VM) is running ?

Karl

--
Karl Rößmann                 Tel. +49-711-689-1657
Max-Planck-Institut FKF      Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart              email k.roessm...@fkf.mpg.de
Re: [Pacemaker] fencing question
On 2014-03-12T15:17:13, Karl Rößmann wrote:

> Hi,
>
> we have a two node HA cluster using SuSE SLES 11 HA Extension SP3,
> latest release value.
> A resource (xen) was manually stopped, the shutdown_timeout is 120s
> but after 60s the node was fenced and shut down by the other node.
>
> should I change some timeout value ?
>
> This is a part of our configuration:
> ...
> primitive fkflmw ocf:heartbeat:Xen \
>   meta target-role="Started" is-managed="true" allow-migrate="true" \
>   op monitor interval="10" timeout="30" \
>   op migrate_from interval="0" timeout="600" \
>   op migrate_to interval="0" timeout="600" \
>   params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"

You need to set a >120s timeout for the stop operation too:

  op stop timeout="150"

> default-action-timeout="60s"

Or set this to, say, 150s.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Pacemaker/corosync freeze
> -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Wednesday, March 12, 2014 2:27 PM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Attila Megyeri napsal(a): > > Hello Jan, > > > > Thank you very much for your help so far. > > > >> -Original Message- > >> From: Jan Friesse [mailto:jfrie...@redhat.com] > >> Sent: Wednesday, March 12, 2014 9:51 AM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> Attila Megyeri napsal(a): > >>> > -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Tuesday, March 11, 2014 10:27 PM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 12 Mar 2014, at 1:54 am, Attila Megyeri > > wrote: > > >> > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Tuesday, March 11, 2014 12:48 AM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri > >> > >> wrote: > >> > >>> Thanks for the quick response! > >>> > -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Friday, March 07, 2014 3:48 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 7 Mar 2014, at 5:31 am, Attila Megyeri > > wrote: > > > Hello, > > > > We have a strange issue with Corosync/Pacemaker. > > From time to time, something unexpected happens and > suddenly > >> the > crm_mon output remains static. > > When I check the cpu usage, I see that one of the cores uses > > 100% cpu, but > cannot actually match it to either the corosync or one of the > pacemaker processes. > > > > In such a case, this high CPU usage is happening on all 7 nodes. > > I have to manually go to each node, stop pacemaker, restart > > corosync, then > start pacemeker. Stoping pacemaker and corosync does not work > in most of the cases, usually a kill -9 is needed. > > > > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > > > > Using udpu as transport, two rings on Gigabit ETH, rro_mode > >> passive. > > > > Logs are usually flooded with CPG related messages, such as: > > > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent > >> 0 > >> CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent > >> 0 > >> CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent > >> 0 > >> CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent > >> 0 > >> CPG > messages (1 remaining, last=8): Try again (6) > > > > OR > > > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > Sent 0 > CPG > messages (1 remaining, last=10933): Try again ( > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > Sent 0 > CPG > messages (1 remaining, last=10933): Try again ( > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > Sent 0 > CPG > messages (1 remaining, last=10933): Try again ( > > That is usually a symptom of corosync getting into a horribly > confused > >> state. > Version? Distro? Have you checked for an update? > Odd that the user of all that CPU isn't showing up though. 
> > > > >>> > >>> As I wrote I use Ubuntu trusty, the exact package versions are: > >>> > >>> corosync 2.3.0-1ubuntu5 > >>> pacemaker 1.1.10+git20130802-1ubuntu2 > >> > >> Ah sorry, I seem to have missed that part. > >> > >>> > >>> There are no updates available. The only option is to install > >>> from sources, > >> but that would be very difficult to maintain and I'm not sure I > >> would get rid of this issue. > >>> > >>> What do you recommend? > >> > >> The same thing as Lars, or switch to a distro that stays current > >> with upstream (git shows 5 newer releases for that branch since > >> it was released 3 years ago). > >> If you do build from source, its probably best to go with v1.4.6 > > > > Hm, I am a bit confused her
[Pacemaker] fencing question
Hi,

we have a two node HA cluster using SuSE SLES 11 HA Extension SP3, latest release value.
A resource (xen) was manually stopped, the shutdown_timeout is 120s, but after 60s the node was fenced and shut down by the other node.

Should I change some timeout value?

This is a part of our configuration:
...
primitive fkflmw ocf:heartbeat:Xen \
  meta target-role="Started" is-managed="true" allow-migrate="true" \
  op monitor interval="10" timeout="30" \
  op migrate_from interval="0" timeout="600" \
  op migrate_to interval="0" timeout="600" \
  params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"
...
...
property $id="cib-bootstrap-options" \
  dc-version="1.1.10-f3eeaf4" \
  cluster-infrastructure="classic openais (with plugin)" \
  expected-quorum-votes="2" \
  no-quorum-policy="ignore" \
  last-lrm-refresh="1394533475" \
  default-action-timeout="60s"
rsc_defaults $id="rsc_defaults-options" \
  resource-stickiness="10" \
  migration-threshold="3"

We had this scenario, on node ha2infra:

Mar 12 11:59:59 ha2infra pengine[25631]:   notice: LogActions: Stop fkflmw (ha2infra)   <--- Resource fkflmw was stopped manually
Mar 12 11:59:59 ha2infra pengine[25631]:   notice: process_pe_message: Calculated Transition 105: /var/lib/pacemaker/pengine/pe-input-519.bz2
Mar 12 11:59:59 ha2infra crmd[25632]:   notice: do_te_invoke: Processing graph 105 (ref=pe_calc-dc-1394621999-178) derived from /var/lib/pacemaker/pengine/pe-input-519.bz2
Mar 12 11:59:59 ha2infra crmd[25632]:   notice: te_rsc_command: Initiating action 60: stop fkflmw_stop_0 on ha2infra (local)
Mar 12 11:59:59 ha2infra Xen(fkflmw)[22718]: INFO: Xen domain fkflmw will be stopped (timeout: 120s)   <--- stopping fkflmw
Mar 12 12:00:00 ha2infra mgmtd: [25633]: info: CIB query: cib
Mar 12 12:00:00 ha2infra mgmtd: [25633]: info: CIB query: cib
Mar 12 12:00:59 ha2infra sshd[24992]: Connection closed by 134.105.232.21 [preauth]
Mar 12 12:00:59 ha2infra lrmd[25629]:  warning: child_timeout_callback: fkflmw_stop_0 process (PID 22718) timed out
Mar 12 12:00:59 ha2infra lrmd[25629]:  warning: operation_finished: fkflmw_stop_0:22718 - timed out after 60000ms   <--- Stop timed out after 60s (not 120s)
Mar 12 12:00:59 ha2infra crmd[25632]:    error: process_lrm_event: LRM operation fkflmw_stop_0 (136) Timed Out (timeout=60000ms)
Mar 12 12:00:59 ha2infra crmd[25632]:  warning: status_from_rc: Action 60 (fkflmw_stop_0) on ha2infra failed (target: 0 vs. rc: 1): Error
Mar 12 12:00:59 ha2infra pengine[25631]:  warning: unpack_rsc_op_failure: Processing failed op stop for fkflmw on ha2infra: unknown error (1)
Mar 12 12:00:59 ha2infra pengine[25631]:  warning: pe_fence_node: Node ha2infra will be fenced because of resource failure(s)   <--- is this normal ?
Mar 12 12:00:59 ha2infra pengine[25631]:  warning: stage6: Scheduling Node ha2infra for STONITH

Node ha1infra:

Mar 12 12:00:59 ha1infra stonith-ng[21808]:   notice: can_fence_host_with_device: stonith_1 can fence ha2infra: dynamic-list
Mar 12 12:01:01 ha1infra stonith-ng[21808]:   notice: log_operation: Operation 'reboot' [23984] (call 2 from crmd.25632) for host 'ha2infra' with device 'stonith_1' returned: 0 (OK)
Mar 12 12:01:05 ha1infra corosync[21794]:  [TOTEM ] A processor failed, forming new configuration.

Karl Roessmann

--
Karl Rößmann                 Tel. +49-711-689-1657
Max-Planck-Institut FKF      Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart              email k.roessm...@fkf.mpg.de
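The timing in these logs follows directly from the configuration: the stop operation has no explicit timeout, so it falls back to default-action-timeout (60s), which is shorter than the Xen agent's 120s shutdown_timeout; the stop is killed at 60s, and the failed stop escalates to fencing. A sketch of the two fixes discussed in the replies above (crm shell; 150s is just a value above 120s):

    # Either give the stop operation more time than shutdown_timeout ...
    crm configure edit fkflmw      # add:  op stop interval="0" timeout="150"

    # ... or raise the cluster-wide default action timeout:
    crm configure property default-action-timeout="150s"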
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?
12.03.2014 00:40, Andrew Beekhof wrote: > > On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov wrote: > >> 07.03.2014 10:30, Vladislav Bogdanov wrote: >>> 07.03.2014 05:43, Andrew Beekhof wrote: On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov wrote: > 18.02.2014 03:49, Andrew Beekhof wrote: >> >> On 31 Jan 2014, at 6:20 pm, yusuke iida wrote: >> >>> Hi, all >>> >>> I measure the performance of Pacemaker in the following combinations. >>> Pacemaker-1.1.11.rc1 >>> libqb-0.16.0 >>> corosync-2.3.2 >>> >>> All nodes are KVM virtual machines. >>> >>> stopped the node of vm01 compulsorily from the inside, after starting >>> 14 nodes. >>> "virsh destroy vm01" was used for the stop. >>> Then, in addition to the compulsorily stopped node, other nodes are >>> separated from a cluster. >>> >>> The log of "Retransmit List:" is then outputted in large quantities >>> from corosync. >> >> Probably best to poke the corosync guys about this. >> >> However, <= .11 is known to cause significant CPU usage with that many >> nodes. >> I can easily imagine this staving corosync of resources and causing >> breakage. >> >> I would _highly_ recommend retesting with the current git master of >> pacemaker. >> I merged the new cib code last week which is faster by _two_ orders of >> magnitude and uses significantly less CPU. > > Andrew, current git master (ee094a2) almost works, the only issue is > that crm_diff calculates incorrect diff digest. If I replace digest in > diff by hands with what cib calculates as "expected". it applies > correctly. Otherwise - -206. More details? >>> >>> Hmmm... >>> seems to be crmsh-specific, >>> Cannot reproduce with pure-XML editing. >>> Kristoffer, does >>> http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this? >> >> The problem seems to be caused by the fact that crmsh does not provide >> section in both orig and new XMLs to crm_diff, and digest >> generation seems to rely on that, so crm_diff and cib daemon produce >> different digests. >> >> Attached are two sets of XML files, one (orig.xml, new.xml, patch.xml) >> are related to the full CIB operation (with status section included), >> another (orig-edited.xml, new-edited.xml, patch-edited.xml) have that >> section removed like crmsh does do. >> >> Resulting diffs differ only by digest, and that seems to be the exact issue. > > This should help. As long as crmsh isn't passing -c to crm_diff, then the > digest will no longer be present. > > https://github.com/beekhof/pacemaker/commit/c8d443d Yep, that helped. Thank you! ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
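For anyone hitting the same -206 on apply, the flow under discussion looks roughly like this once the referenced fix is in place (a sketch; without -c/--cib, crm_diff no longer embeds a digest in the patch):

    # Diff two CIB snapshots without an embedded digest:
    crm_diff -o orig.xml -n new.xml > patch.xml

    # Apply the patch to the live CIB:
    cibadmin --patch --xml-file patch.xml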
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): > Hello Jan, > > Thank you very much for your help so far. > >> -Original Message- >> From: Jan Friesse [mailto:jfrie...@redhat.com] >> Sent: Wednesday, March 12, 2014 9:51 AM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> Attila Megyeri napsal(a): >>> -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri wrote: >> >> -Original Message- >> From: Andrew Beekhof [mailto:and...@beekhof.net] >> Sent: Tuesday, March 11, 2014 12:48 AM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri >> >> wrote: >> >>> Thanks for the quick response! >>> -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri wrote: > Hello, > > We have a strange issue with Corosync/Pacemaker. > From time to time, something unexpected happens and suddenly >> the crm_mon output remains static. > When I check the cpu usage, I see that one of the cores uses > 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. > > In such a case, this high CPU usage is happening on all 7 nodes. > I have to manually go to each node, stop pacemaker, restart > corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. > > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > > Using udpu as transport, two rings on Gigabit ETH, rro_mode >> passive. > > Logs are usually flooded with CPG related messages, such as: > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent >> 0 >> CPG messages (1 remaining, last=8): Try again (6) > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent >> 0 >> CPG messages (1 remaining, last=8): Try again (6) > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent >> 0 >> CPG messages (1 remaining, last=8): Try again (6) > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent >> 0 >> CPG messages (1 remaining, last=8): Try again (6) > > OR > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > Sent 0 CPG messages (1 remaining, last=10933): Try again ( > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > Sent 0 CPG messages (1 remaining, last=10933): Try again ( > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused >> state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. > >>> >>> As I wrote I use Ubuntu trusty, the exact package versions are: >>> >>> corosync 2.3.0-1ubuntu5 >>> pacemaker 1.1.10+git20130802-1ubuntu2 >> >> Ah sorry, I seem to have missed that part. >> >>> >>> There are no updates available. The only option is to install from >>> sources, >> but that would be very difficult to maintain and I'm not sure I >> would get rid of this issue. >>> >>> What do you recommend? 
>> >> The same thing as Lars, or switch to a distro that stays current >> with upstream (git shows 5 newer releases for that branch since it >> was released 3 years ago). >> If you do build from source, its probably best to go with v1.4.6 > > Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. >>> >>> I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still >>> the >> same issue - after some time CPU gets to 100%, and the corosync log is >> flooded with messages like: >>> >>> Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: >>> Se
Re: [Pacemaker] Pacemaker/corosync freeze
Hello Jan, Thank you very much for your help so far. > -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Wednesday, March 12, 2014 9:51 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Attila Megyeri napsal(a): > > > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Tuesday, March 11, 2014 10:27 PM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 12 Mar 2014, at 1:54 am, Attila Megyeri > >> > >> wrote: > >> > > -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Tuesday, March 11, 2014 12:48 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 7 Mar 2014, at 5:54 pm, Attila Megyeri > > wrote: > > > Thanks for the quick response! > > > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Friday, March 07, 2014 3:48 AM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 7 Mar 2014, at 5:31 am, Attila Megyeri > >> > >> wrote: > >> > >>> Hello, > >>> > >>> We have a strange issue with Corosync/Pacemaker. > >>> From time to time, something unexpected happens and suddenly > the > >> crm_mon output remains static. > >>> When I check the cpu usage, I see that one of the cores uses > >>> 100% cpu, but > >> cannot actually match it to either the corosync or one of the > >> pacemaker processes. > >>> > >>> In such a case, this high CPU usage is happening on all 7 nodes. > >>> I have to manually go to each node, stop pacemaker, restart > >>> corosync, then > >> start pacemeker. Stoping pacemaker and corosync does not work in > >> most of the cases, usually a kill -9 is needed. > >>> > >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > >>> > >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode > passive. > >>> > >>> Logs are usually flooded with CPG related messages, such as: > >>> > >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >>> Sent > 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >>> Sent > 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >>> Sent > 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >>> Sent > 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> > >>> OR > >>> > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>> Sent 0 > >> CPG > >> messages (1 remaining, last=10933): Try again ( > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>> Sent 0 > >> CPG > >> messages (1 remaining, last=10933): Try again ( > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>> Sent 0 > >> CPG > >> messages (1 remaining, last=10933): Try again ( > >> > >> That is usually a symptom of corosync getting into a horribly > >> confused > state. > >> Version? Distro? Have you checked for an update? > >> Odd that the user of all that CPU isn't showing up though. > >> > >>> > > > > As I wrote I use Ubuntu trusty, the exact package versions are: > > > > corosync 2.3.0-1ubuntu5 > > pacemaker 1.1.10+git20130802-1ubuntu2 > > Ah sorry, I seem to have missed that part. > > > > > There are no updates available. 
The only option is to install from > > sources, > but that would be very difficult to maintain and I'm not sure I > would get rid of this issue. > > > > What do you recommend? > > The same thing as Lars, or switch to a distro that stays current > with upstream (git shows 5 newer releases for that branch since it > was released 3 years ago). > If you do build from source, its probably best to go with v1.4.6 > >>> > >>> Hm, I am a bit confused here. We are using 2.3.0, > >> > >> I swapped the 2 for a 1 somehow. A bit distracted, sorry. > > > > I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still > > the > same issue - after some time CPU gets to 100%, and the corosync log is > flooded with messages like: > > > > Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: > > Sent 0 CPG > messages (48 remaining, last=3671):
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): > >> -Original Message- >> From: Andrew Beekhof [mailto:and...@beekhof.net] >> Sent: Tuesday, March 11, 2014 10:27 PM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> >> On 12 Mar 2014, at 1:54 am, Attila Megyeri >> wrote: >> -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri wrote: > Thanks for the quick response! > >> -Original Message- >> From: Andrew Beekhof [mailto:and...@beekhof.net] >> Sent: Friday, March 07, 2014 3:48 AM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> >> On 7 Mar 2014, at 5:31 am, Attila Megyeri >> >> wrote: >> >>> Hello, >>> >>> We have a strange issue with Corosync/Pacemaker. >>> From time to time, something unexpected happens and suddenly the >> crm_mon output remains static. >>> When I check the cpu usage, I see that one of the cores uses 100% >>> cpu, but >> cannot actually match it to either the corosync or one of the >> pacemaker processes. >>> >>> In such a case, this high CPU usage is happening on all 7 nodes. >>> I have to manually go to each node, stop pacemaker, restart >>> corosync, then >> start pacemeker. Stoping pacemaker and corosync does not work in >> most of the cases, usually a kill -9 is needed. >>> >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. >>> >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. >>> >>> Logs are usually flooded with CPG related messages, such as: >>> >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> >>> OR >>> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>> Sent 0 >> CPG >> messages (1 remaining, last=10933): Try again ( >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>> Sent 0 >> CPG >> messages (1 remaining, last=10933): Try again ( >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>> Sent 0 >> CPG >> messages (1 remaining, last=10933): Try again ( >> >> That is usually a symptom of corosync getting into a horribly >> confused state. >> Version? Distro? Have you checked for an update? >> Odd that the user of all that CPU isn't showing up though. >> >>> > > As I wrote I use Ubuntu trusty, the exact package versions are: > > corosync 2.3.0-1ubuntu5 > pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. > > There are no updates available. The only option is to install from > sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. > > What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 >>> >>> Hm, I am a bit confused here. We are using 2.3.0, >> >> I swapped the 2 for a 1 somehow. A bit distracted, sorry. 
> > I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the > same issue - after some time CPU gets to 100%, and the corosync log is > flooded with messages like: > > Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent > 0 CPG messages (48 remaining, last=3671): Try again (6) > Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush:Sent > 0 CPG messages (51 remaining, last=3995): Try again (6) > Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent > 0 CPG messages (48 remaining, last=3671): Try again (6) > Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush:Sent > 0 CPG messages (51 remaining, last=3995): Try again (6) > Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent > 0 CPG messages (48 remaining, last=3671)
Re: [Pacemaker] Pacemaker/corosync freeze
> -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Tuesday, March 11, 2014 10:27 PM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 12 Mar 2014, at 1:54 am, Attila Megyeri > wrote: > > >> > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Tuesday, March 11, 2014 12:48 AM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri > >> wrote: > >> > >>> Thanks for the quick response! > >>> > -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Friday, March 07, 2014 3:48 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 7 Mar 2014, at 5:31 am, Attila Megyeri > > wrote: > > > Hello, > > > > We have a strange issue with Corosync/Pacemaker. > > From time to time, something unexpected happens and suddenly the > crm_mon output remains static. > > When I check the cpu usage, I see that one of the cores uses 100% > > cpu, but > cannot actually match it to either the corosync or one of the > pacemaker processes. > > > > In such a case, this high CPU usage is happening on all 7 nodes. > > I have to manually go to each node, stop pacemaker, restart > > corosync, then > start pacemeker. Stoping pacemaker and corosync does not work in > most of the cases, usually a kill -9 is needed. > > > > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > > > > Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. > > > > Logs are usually flooded with CPG related messages, such as: > > > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > > Sent 0 > >> CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > > Sent 0 > >> CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > > Sent 0 > >> CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > > Sent 0 > >> CPG > messages (1 remaining, last=8): Try again (6) > > > > OR > > > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > Sent 0 > CPG > messages (1 remaining, last=10933): Try again ( > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > Sent 0 > CPG > messages (1 remaining, last=10933): Try again ( > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > Sent 0 > CPG > messages (1 remaining, last=10933): Try again ( > > That is usually a symptom of corosync getting into a horribly > confused > >> state. > Version? Distro? Have you checked for an update? > Odd that the user of all that CPU isn't showing up though. > > > > >>> > >>> As I wrote I use Ubuntu trusty, the exact package versions are: > >>> > >>> corosync 2.3.0-1ubuntu5 > >>> pacemaker 1.1.10+git20130802-1ubuntu2 > >> > >> Ah sorry, I seem to have missed that part. > >> > >>> > >>> There are no updates available. The only option is to install from > >>> sources, > >> but that would be very difficult to maintain and I'm not sure I would > >> get rid of this issue. > >>> > >>> What do you recommend? > >> > >> The same thing as Lars, or switch to a distro that stays current with > >> upstream (git shows 5 newer releases for that branch since it was > >> released 3 years ago). 
> >> If you do build from source, its probably best to go with v1.4.6 > > > > Hm, I am a bit confused here. We are using 2.3.0, > > I swapped the 2 for a 1 somehow. A bit distracted, sorry. I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd: info: cr