Re: [Pacemaker] RESTful API support

2014-03-12 Thread Digimer

On 13/03/14 01:29 AM, John Wei wrote:

Currently, management of Pacemaker is done through the CLI or XML. Any plans to
provide a RESTful API to support cloud software?

John


pcsd is REST-like.
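
(For anyone who wants to poke at it: pcsd listens on HTTPS port 2224. A rough
sketch, untested - the node name and password are placeholders, and the
/remote/ endpoints are pcsd internals rather than a documented, stable API:)

# query pcsd's status endpoint as the hacluster user
curl -k -u hacluster:secret https://node1:2224/remote/status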

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] RESTful API support

2014-03-12 Thread John Wei
Currently, management of Pacemaker is done through the CLI or XML. Any plans to
provide a RESTful API to support cloud software?

John





Re: [Pacemaker] help building 2 node config

2014-03-12 Thread Alex Samad - Yieldbroker
Well, I think I have worked it out.


# Create ybrp ip address  
pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50 
cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
op start interval="0s" timeout="60s" \
op monitor interval="5s" timeout="20s" \
op stop interval="0s" timeout="60s"

# Clone it
#pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2

# Create status
pcs resource create ybrpstat ocf:yb:ybrp \
op start interval="10s" timeout="60s" \
op monitor interval="5s" timeout="20s" \
op stop interval="10s" timeout="60s"





# Clone it
pcs resource clone ybrpip globally-unique=true clone-max=2 clone-node-max=2
pcs resource clone ybrpstat globally-unique=false clone-max=2 clone-node-max=2

pcs constraint colocation add ybrpip ybrpstat INFINITY
pcs constraint colocation add ybrpip-clone ybrpstat-clone INFINITY
pcs constraint order ybrpstat then ybrpip
pcs constraint order ybrpstat-clone then ybrpip-clone
pcs constraint location ybrpip prefers devrp1
pcs constraint location ybrpip-clone prefers devrp2


Have I done anything silly?

Also, as I don't have the application actually running on my nodes, I notice
failures occur very fast (more than one a second). Where is that configured,
and how do I configure it so that after 2, 3, 4 or 5 attempts it fails over to
the other node? I also want the resources to move back to the original nodes
when they come back.
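
(For reference, this is controlled by resource meta attributes; a sketch with
pcs, untested, using the resource names above: fail over after 3 monitor
failures, and let the failure count expire so the resource may move back.)

pcs resource meta ybrpstat migration-threshold=3 failure-timeout=60s
# failback also requires that resource-stickiness stays below the score of
# the location constraint preferring the original node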

So I tried the config above, and when I rebooted node A the IP address on A
went to node B; but when A came back, the IP didn't move back to node A.
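
(One way to see why a resource stays put is to dump the allocation scores the
policy engine computed; a sketch:)

crm_simulate -sL | grep ybrpip   # -L = live cluster, -s = show scores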




pcs config
Cluster Name: ybrp
Corosync Nodes:
 
Pacemaker Nodes:
 devrp1 devrp2 

Resources: 
 Clone: ybrpip-clone
  Meta Attrs: globally-unique=true clone-max=2 clone-node-max=2 
  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 
clusterip_hash=sourceip-sourceport 
   Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
   monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
   stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=2 
  Resource: ybrpstat (class=ocf provider=yb type=ybrp)
   Operations: start interval=10s timeout=60s (ybrpstat-start-interval-10s)
   monitor interval=5s timeout=20s (ybrpstat-monitor-interval-5s)
   stop interval=10s timeout=60s (ybrpstat-stop-interval-10s)

Stonith Devices: 
Fencing Levels: 

Location Constraints:
  Resource: ybrpip
Enabled on: devrp1 (score:INFINITY) (id:location-ybrpip-devrp1-INFINITY)
  Resource: ybrpip-clone
Enabled on: devrp2 (score:INFINITY) 
(id:location-ybrpip-clone-devrp2-INFINITY)
Ordering Constraints:
  start ybrpstat then start ybrpip (Mandatory) 
(id:order-ybrpstat-ybrpip-mandatory)
  start ybrpstat-clone then start ybrpip-clone (Mandatory) 
(id:order-ybrpstat-clone-ybrpip-clone-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat (INFINITY) (id:colocation-ybrpip-ybrpstat-INFINITY)
  ybrpip-clone with ybrpstat-clone (INFINITY) 
(id:colocation-ybrpip-clone-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6-368c726
 last-lrm-refresh: 1394682724
 no-quorum-policy: ignore
 stonith-enabled: false

Shouldn't the constraints have moved it back to node A?

pcs status
Cluster name: ybrp
Last updated: Thu Mar 13 16:13:40 2014
Last change: Thu Mar 13 16:06:21 2014 via cibadmin on devrp1
Stack: cman
Current DC: devrp2 - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
4 Resources configured


Online: [ devrp1 devrp2 ]

Full list of resources:

 Clone Set: ybrpip-clone [ybrpip] (unique)
 ybrpip:0   (ocf::heartbeat:IPaddr2):   Started devrp2 
 ybrpip:1   (ocf::heartbeat:IPaddr2):   Started devrp2 
 Clone Set: ybrpstat-clone [ybrpstat]
 Started: [ devrp1 devrp2 ]




> -Original Message-
> From: Alex Samad - Yieldbroker [mailto:alex.sa...@yieldbroker.com]
> Sent: Thursday, 13 March 2014 2:07 PM
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] help building 2 node config
> 
> Hi
> 
> I sent out an email to help convert an old config. Thought it might better to
> start from scratch.
> 
> I have 2 nodes, which run an application (sort of a reverse proxy).
> Node A
> Node B
> 
> I would like to use OCF:IPaddr2 so that I can load balance IP
> 
> # Create ybrp ip address
> pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50
> cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
> op start interval="0s" timeout="60s" \
> op monitor interval="5s" timeout="20s" \
> op stop interval="0s" timeout="60s" \
> 
> # Clone it
> pcs resource clone ybrpip2 ybrpip meta master-max="2" master-node-
> max="2" clone-max="2" clone-node-max="1" notify="true"
> interleave="true"
> 
> 
> This 

[Pacemaker] help building 2 node config

2014-03-12 Thread Alex Samad - Yieldbroker
Hi

I sent out an email to help convert an old config. Thought it might be better
to start from scratch.

I have 2 nodes, which run an application (sort of a reverse proxy).
Node A
Node B

I would like to use OCF:IPaddr2 so that I can load balance IP

# Create ybrp ip address  
pcs resource create ybrpip ocf:heartbeat:IPaddr2 params ip=10.172.214.50 
cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport \
op start interval="0s" timeout="60s" \
op monitor interval="5s" timeout="20s" \
op stop interval="0s" timeout="60s"

# Clone it
pcs resource clone ybrpip2 ybrpip meta master-max="2" master-node-max="2" 
clone-max="2" clone-node-max="1" notify="true" interleave="true"


This seems to work okay, but I tested it.
On node B I ran this:
crm_mon -1 ; iptables -nvL INPUT | head -5 ; ip a ; echo -n [ ; cat 
/proc/net/ipt_CLUSTERIP/10.172.214.50 ; echo ]

in particular I was watching /proc/net/ipt_CLUSTERIP/10.172.214.50

and I rebooted node A. I noticed ipt_CLUSTERIP didn't fail over? I would
have expected to see 1,2 in there on node B when node A failed.

In fact, when I reboot node A it comes back with 2 in there ... that's not good!
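
(For reference, the kernel's CLUSTERIP proc interface can be driven by hand,
which is roughly what has to happen when a node takes over another's share;
a sketch, assuming the IP above:)

cat /proc/net/ipt_CLUSTERIP/10.172.214.50           # node numbers answered here
echo "+1" > /proc/net/ipt_CLUSTERIP/10.172.214.50   # claim node 1's buckets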


pcs resource show ybrpip-clone
 Clone: ybrpip-clone
  Meta Attrs: master-max=2 master-node-max=2 clone-max=2 clone-node-max=1 
notify=true interleave=true 
  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 
clusterip_hash=sourceip-sourceport 
   Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
   monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
   stop interval=0s timeout=60s (ybrpip-stop-interval-0s)

pcs resource show ybrpip  
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 
clusterip_hash=sourceip-sourceport 
  Operations: start interval=0s timeout=60s (ybrpip-start-interval-0s)
  monitor interval=5s timeout=20s (ybrpip-monitor-interval-5s)
  stop interval=0s timeout=60s (ybrpip-stop-interval-0s)



So I think this has something to do with the metadata...



I have another resource
pcs resource create  ybrpstat ocf:yb:ybrp op monitor interval=5s

I want two of these: one for node A and one for node B.

I want the IP address to be dependent on whether this resource is available on
the node. How can I do that?
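
(One way, as worked out in the follow-up above: clone both resources, then tie
the clones together; a sketch reusing the syntax from that message:)

pcs resource clone ybrpstat clone-max=2 clone-node-max=1
pcs constraint colocation add ybrpip-clone ybrpstat-clone INFINITY
pcs constraint order ybrpstat-clone then ybrpip-clone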

Alex







Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs

2014-03-12 Thread Alex Samad - Yieldbroker


> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Thursday, 13 March 2014 1:39 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] help migrating over cluster config from pacemaker
> plugin into corosync to pcs
[snip]

> > I was trying to use the above commands to programme up the new
> pacemaker, but I can't find the easy transform of crm to pcs...
> 
> Does https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-
> crmsh-quick-ref.md help?

Looks like it does
thanks

> 
[snip]




Re: [Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs

2014-03-12 Thread Andrew Beekhof

On 13 Mar 2014, at 11:56 am, Alex Samad - Yieldbroker 
 wrote:

> Hi
> 
> So this is what I used to do to setup my cluster
> crm configure property stonith-enabled=false
> crm configure property no-quorum-policy=ignore
> crm configure rsc_defaults resource-stickiness=100
> crm configure primitive ybrpip ocf:heartbeat:IPaddr2 params ip=10.32.21.30 
> cidr_netmask=24 op monitor interval=5s
> crm configure primitive ybrpstat ocf:yb:ybrp op monitor interval=5s
> crm configure colocation ybrp INFINITY: ybrpip ybrpstat
> crm configure group ybrpgrp ybrpip ybrpstat
> crm_resource --meta --resource ybrpstat --set-parameter migration-threshold 
> --parameter-value 2
> crm_resource --meta --resource ybrpstat --set-parameter failure-timeout 
> --parameter-value 2m
> 
> 
> I have written my own ybrp resource (/usr/lib/ocf/resource.d/yb/ybrp)
> 
> So basically what I want to do is have 2 nodes have a floating VIP (I was 
> looking at moving forward with the IP load balancing )
> I run an application on both nodes it doesn't need to be started, should 
> start at server start up.
> I need the VIP or the loading balancing to move from node to node.
> 
> Normal operation would be
> 50% on node A and 50% on node B (I realise this depends on IP & hash)
> If app fails on one node then all the traffic should move to the other node. 
> The cluster should not try and restart the application
> Once the application comes back on the broken node the VIP should be allowed 
> to move back or the load balancing should accept traffic back there.
> Simple ?
> 
> I was trying to use the above commands to programme up the new pacemaker, but 
> I can't find the easy transform of crm to pcs...

Does 
https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md 
help?
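
For example, the first few crm commands above map roughly to (a sketch based
on that page; pcs syntax varies a little between versions):

crm configure property stonith-enabled=false
  ->  pcs property set stonith-enabled=false
crm configure rsc_defaults resource-stickiness=100
  ->  pcs resource defaults resource-stickiness=100
crm configure primitive ybrpstat ocf:yb:ybrp op monitor interval=5s
  ->  pcs resource create ybrpstat ocf:yb:ybrp op monitor interval=5s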

> so I thought I would ask the list for help to configure up with the load 
> balance VIP.
> 
> Alex
> 
> 
> 





[Pacemaker] help migrating over cluster config from pacemaker plugin into corosync to pcs

2014-03-12 Thread Alex Samad - Yieldbroker
Hi

So this is what I used to do to setup my cluster
crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore
crm configure rsc_defaults resource-stickiness=100
crm configure primitive ybrpip ocf:heartbeat:IPaddr2 params ip=10.32.21.30 
cidr_netmask=24 op monitor interval=5s
crm configure primitive ybrpstat ocf:yb:ybrp op monitor interval=5s
crm configure colocation ybrp INFINITY: ybrpip ybrpstat
crm configure group ybrpgrp ybrpip ybrpstat
crm_resource --meta --resource ybrpstat --set-parameter migration-threshold 
--parameter-value 2
crm_resource --meta --resource ybrpstat --set-parameter failure-timeout 
--parameter-value 2m
 

I have written my own ybrp resource (/usr/lib/ocf/resource.d/yb/ybrp)

So basically what I want to do is have 2 nodes with a floating VIP (I was
looking at moving forward with the IP load balancing).
I run an application on both nodes; it doesn't need to be started by the
cluster, it should start at server start-up.
I need the VIP or the load balancing to move from node to node.

Normal operation would be
50% on node A and 50% on node B (I realise this depends on IP & hash).
If the app fails on one node, then all the traffic should move to the other
node. The cluster should not try to restart the application.
Once the application comes back on the broken node, the VIP should be allowed
to move back, or the load balancing should accept traffic back there.
Simple?

I was trying to use the above commands to programme up the new pacemaker, but I
can't find an easy transform of crm to pcs... so I thought I would ask the
list for help configuring the load-balanced VIP.

Alex





Re: [Pacemaker] pacemaker depdendancy on samba

2014-03-12 Thread Trevor Hemsley
On 12/03/14 23:18, Alex Samad - Yieldbroker wrote:
> Hi
>
> Just going through my cluster build, seems like
>
> yum install pacemaker
>
> wants to bring in samba, I have recently migrated up to samba4, wondering  if 
> I can find a pacemaker that is dependant on samba4 ?
>
> Im on centos 6.5, on a quick look I am guessing this might not be a pacemaker 
> issue, might be a dep of a dep ..

Pacemaker wants to install resource-agents; resource-agents has a dependency
on /sbin/mount.cifs, and then it goes on from there...
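
(One way to confirm the chain on CentOS 6, assuming yum-utils is installed;
a sketch:)

repoquery --requires resource-agents | grep -i cifs
repoquery --whatprovides /sbin/mount.cifs    # which package provides it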

T



[Pacemaker] pacemaker depdendancy on samba

2014-03-12 Thread Alex Samad - Yieldbroker
Hi

Just going through my cluster build, seems like

yum install pacemaker

wants to bring in samba. I have recently migrated up to samba4, wondering if I
can find a pacemaker that is dependent on samba4?

I'm on CentOS 6.5; on a quick look I am guessing this might not be a pacemaker
issue, might be a dep of a dep ..


Thanks
Alex 



Re: [Pacemaker] fencing question

2014-03-12 Thread Lars Marowsky-Bree
On 2014-03-12T16:16:54, Karl Rößmann  wrote:

> >>primitive fkflmw ocf:heartbeat:Xen \
> >>meta target-role="Started" is-managed="true" allow-migrate="true" \
> >>op monitor interval="10" timeout="30" \
> >>op migrate_from interval="0" timeout="600" \
> >>op migrate_to interval="0" timeout="600" \
> >>params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"
> >
> >You need to set a >120s timeout for the stop operation too:
> > op stop timeout="150"
> >
> >>default-action-timeout="60s"
> >
> >Or set this to, say, 150s.
> can I do this while the resource (the xen VM) is running ?

Yes, changing the stop timeout should not have a negative impact on your
resource.

You can also check how the cluster would react:

# crm configure
crm(live)configure# edit
(Make all changes you want here)
crm(live)configure# simulate actions nograph

before you type "commit".

Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] missing init scripts for corosync and pacemaker

2014-03-12 Thread Andrew Beekhof

On 13 Mar 2014, at 9:29 am, Jay G. Scott  wrote:

> 
> OS = RHEL 6
> 
> because my machines are behind a firewall, i can't install
> via yum.  i had to bring down the rpms and install them.
> here are the rpms i installed.  yeah, it bothers me that
> they say fc20 but that's what i got when i used the
> pacemaker.repo file i found online.
> 
> corosync-2.3.3-1.fc20.x86_64.rpm
> corosynclib-2.3.3-1.fc20.x86_64.rpm
> libibverbs-1.1.7-3.fc20.x86_64.rpm
> libqb-0.17.0-1.fc20.x86_64.rpm
> librdmacm-1.0.17-2.fc20.x86_64.rpm
> pacemaker-1.1.11-1.fc20.x86_64.rpm
> pacemaker-cli-1.1.11-1.fc20.x86_64.rpm
> pacemaker-cluster-libs-1.1.11-1.fc20.x86_64.rpm
> pacemaker-libs-1.1.11-1.fc20.x86_64.rpm
> resource-agents-3.9.5-9.fc20.x86_64.rpm
> 
> i have all of these installed.  i lack an /etc/init.d
> script for corosync and pacemaker.
> 
> how come?

You're installing Fedora packages, and Fedora uses systemd .service files.
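
On Fedora you'd start them via the systemd units instead, e.g.:

systemctl start corosync.service
systemctl start pacemaker.service

For RHEL 6 you want packages built for el6, which do ship /etc/init.d scripts.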

> 
> j.
> 
> 
> -- 
> Jay Scott  512-835-3553  g...@arlut.utexas.edu
> Head of Sun Support, Sr. System Administrator
> Applied Research Labs, Computer Science Div.   S224
> University of Texas at Austin
> 


[Pacemaker] missing init scripts for corosync and pacemaker

2014-03-12 Thread Jay G. Scott

OS = RHEL 6

Because my machines are behind a firewall, I can't install
via yum. I had to bring down the RPMs and install them.
Here are the RPMs I installed. Yeah, it bothers me that
they say fc20, but that's what I got when I used the
pacemaker.repo file I found online.

corosync-2.3.3-1.fc20.x86_64.rpm
corosynclib-2.3.3-1.fc20.x86_64.rpm
libibverbs-1.1.7-3.fc20.x86_64.rpm
libqb-0.17.0-1.fc20.x86_64.rpm
librdmacm-1.0.17-2.fc20.x86_64.rpm
pacemaker-1.1.11-1.fc20.x86_64.rpm
pacemaker-cli-1.1.11-1.fc20.x86_64.rpm
pacemaker-cluster-libs-1.1.11-1.fc20.x86_64.rpm
pacemaker-libs-1.1.11-1.fc20.x86_64.rpm
resource-agents-3.9.5-9.fc20.x86_64.rpm

I have all of these installed. I lack an /etc/init.d
script for corosync and pacemaker.

How come?

j.


-- 
Jay Scott  512-835-3553  g...@arlut.utexas.edu
Head of Sun Support, Sr. System Administrator
Applied Research Labs, Computer Science Div.   S224
University of Texas at Austin



Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri


> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Wednesday, March 12, 2014 4:31 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> Attila Megyeri napsal(a):
> >> -Original Message-
> >> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >> Sent: Wednesday, March 12, 2014 2:27 PM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >> Attila Megyeri napsal(a):
> >>> Hello Jan,
> >>>
> >>> Thank you very much for your help so far.
> >>>
>  -Original Message-
>  From: Jan Friesse [mailto:jfrie...@redhat.com]
>  Sent: Wednesday, March 12, 2014 9:51 AM
>  To: The Pacemaker cluster resource manager
>  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
>  Attila Megyeri napsal(a):
> >
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 11, 2014 10:27 PM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 12 Mar 2014, at 1:54 am, Attila Megyeri
> >> 
> >> wrote:
> >>
> 
>  -Original Message-
>  From: Andrew Beekhof [mailto:and...@beekhof.net]
>  Sent: Tuesday, March 11, 2014 12:48 AM
>  To: The Pacemaker cluster resource manager
>  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
>  On 7 Mar 2014, at 5:54 pm, Attila Megyeri
>  
>  wrote:
> 
> > Thanks for the quick response!
> >
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Friday, March 07, 2014 3:48 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >> 
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> We have a strange issue with Corosync/Pacemaker.
> >>> From time to time, something unexpected happens and
> >> suddenly
>  the
> >> crm_mon output remains static.
> >>> When I check the cpu usage, I see that one of the cores uses
> >>> 100% cpu, but
> >> cannot actually match it to either the corosync or one of the
> >> pacemaker processes.
> >>>
> >>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>> I have to manually go to each node, stop pacemaker, restart
> >>> corosync, then
> >> start pacemeker. Stoping pacemaker and corosync does not
> work
> >> in most of the cases, usually a kill -9 is needed.
> >>>
> >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>
> >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode
>  passive.
> >>>
> >>> Logs are usually flooded with CPG related messages, such as:
> >>>
> >>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
> >> Sent
>  0
>  CPG
> >> messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
> >> Sent
>  0
>  CPG
> >> messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
> >> Sent
>  0
>  CPG
> >> messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
> >> Sent
>  0
>  CPG
> >> messages  (1 remaining, last=8): Try again (6)
> >>>
> >>> OR
> >>>
> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> >> Sent 0
> >> CPG
> >> messages  (1 remaining, last=10933): Try again (
> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> >> Sent 0
> >> CPG
> >> messages  (1 remaining, last=10933): Try again (
> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> >> Sent 0
> >> CPG
> >> messages  (1 remaining, last=10933): Try again (
> >>
> >> That is usually a symptom of corosync getting into a horribly
> >> confused
>  state.
> >> Version? Distro? Have you checked for an update?
> >> Odd that the user of all that CPU isn't showing up though.
> >>
> >>>
> >
> > As I wrote I use Ubuntu trusty, the exact package versions are:
> >
> > corosync 2.3.0-1ubuntu5
> > pacemaker 1.1.10+git20130802-1ubuntu2
> 
>  Ah sorry, I seem to have missed that part.
> 
> >
> > There are no update

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri napsal(a):
>> -Original Message-
>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>> Sent: Wednesday, March 12, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> Attila Megyeri napsal(a):
>>> Hello Jan,
>>>
>>> Thank you very much for your help so far.
>>>
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 9:51 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Attila Megyeri napsal(a):
>
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, March 11, 2014 10:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 12 Mar 2014, at 1:54 am, Attila Megyeri
>> 
>> wrote:
>>

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:54 pm, Attila Megyeri
 
 wrote:

> Thanks for the quick response!
>
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Friday, March 07, 2014 3:48 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>> 
>> wrote:
>>
>>> Hello,
>>>
>>> We have a strange issue with Corosync/Pacemaker.
>>> From time to time, something unexpected happens and
>> suddenly
 the
>> crm_mon output remains static.
>>> When I check the cpu usage, I see that one of the cores uses
>>> 100% cpu, but
>> cannot actually match it to either the corosync or one of the
>> pacemaker processes.
>>>
>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>> I have to manually go to each node, stop pacemaker, restart
>>> corosync, then
>> start pacemeker. Stoping pacemaker and corosync does not work
>> in most of the cases, usually a kill -9 is needed.
>>>
>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>
>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode
 passive.
>>>
>>> Logs are usually flooded with CPG related messages, such as:
>>>
>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
>> Sent
 0
 CPG
>> messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
>> Sent
 0
 CPG
>> messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
>> Sent
 0
 CPG
>> messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
>> Sent
 0
 CPG
>> messages  (1 remaining, last=8): Try again (6)
>>>
>>> OR
>>>
>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
>> Sent 0
>> CPG
>> messages  (1 remaining, last=10933): Try again (
>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
>> Sent 0
>> CPG
>> messages  (1 remaining, last=10933): Try again (
>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
>> Sent 0
>> CPG
>> messages  (1 remaining, last=10933): Try again (
>>
>> That is usually a symptom of corosync getting into a horribly
>> confused
 state.
>> Version? Distro? Have you checked for an update?
>> Odd that the user of all that CPU isn't showing up though.
>>
>>>
>
> As I wrote I use Ubuntu trusty, the exact package versions are:
>
> corosync 2.3.0-1ubuntu5
> pacemaker 1.1.10+git20130802-1ubuntu2

 Ah sorry, I seem to have missed that part.

>
> There are no updates available. The only option is to install
> from sources,
 but that would be very difficult to maintain and I'm not sure I
 would get rid of this issue.
>
> What do you recommend?

 The same thing as Lars, or switch to a distro that stays current
 with upstream (git shows 5 newer releases for that branch since
 it was released 3 years ago).
 If you do build from source, its probably best to go with v1.4.6
>>

Re: [Pacemaker] fencing question

2014-03-12 Thread Karl Rößmann

Hi.


>> primitive fkflmw ocf:heartbeat:Xen \
>> meta target-role="Started" is-managed="true" allow-migrate="true" \
>> op monitor interval="10" timeout="30" \
>> op migrate_from interval="0" timeout="600" \
>> op migrate_to interval="0" timeout="600" \
>> params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"
>
> You need to set a >120s timeout for the stop operation too:
> op stop timeout="150"
>
>> default-action-timeout="60s"
>
> Or set this to, say, 150s.

Can I do this while the resource (the Xen VM) is running?



Karl



--
Karl Rößmann  Tel. +49-711-689-1657
Max-Planck-Institut FKF Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart email k.roessm...@fkf.mpg.de



Re: [Pacemaker] fencing question

2014-03-12 Thread Lars Marowsky-Bree
On 2014-03-12T15:17:13, Karl Rößmann  wrote:

> Hi,
> 
> we have a two node HA cluster using SuSE SlES 11 HA Extension SP3,
> latest release value.
> A resource (xen) was manually stopped, the shutdown_timeout is 120s
> but after 60s the node was fenced and shut down by the other node.
> 
> should I change some timeout value ?
> 
> This is a part of our configuration:
> ...
> primitive fkflmw ocf:heartbeat:Xen \
> meta target-role="Started" is-managed="true" allow-migrate="true" \
> op monitor interval="10" timeout="30" \
> op migrate_from interval="0" timeout="600" \
> op migrate_to interval="0" timeout="600" \
> params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"

You need to set a >120s timeout for the stop operation too:
op stop timeout="150"

> default-action-timeout="60s"

Or set this to, say, 150s.
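
In crm syntax that means adding an explicit stop op to the primitive
(a sketch, untested):

op stop interval="0" timeout="150" \

or raising the cluster-wide default named above:

crm configure property default-action-timeout="150s"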


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Wednesday, March 12, 2014 2:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> Attila Megyeri napsal(a):
> > Hello Jan,
> >
> > Thank you very much for your help so far.
> >
> >> -Original Message-
> >> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >> Sent: Wednesday, March 12, 2014 9:51 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >> Attila Megyeri napsal(a):
> >>>
>  -Original Message-
>  From: Andrew Beekhof [mailto:and...@beekhof.net]
>  Sent: Tuesday, March 11, 2014 10:27 PM
>  To: The Pacemaker cluster resource manager
>  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
>  On 12 Mar 2014, at 1:54 am, Attila Megyeri
>  
>  wrote:
> 
> >>
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 11, 2014 12:48 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
> >> 
> >> wrote:
> >>
> >>> Thanks for the quick response!
> >>>
>  -Original Message-
>  From: Andrew Beekhof [mailto:and...@beekhof.net]
>  Sent: Friday, March 07, 2014 3:48 AM
>  To: The Pacemaker cluster resource manager
>  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
>  On 7 Mar 2014, at 5:31 am, Attila Megyeri
>  
>  wrote:
> 
> > Hello,
> >
> > We have a strange issue with Corosync/Pacemaker.
> > From time to time, something unexpected happens and
> suddenly
> >> the
>  crm_mon output remains static.
> > When I check the cpu usage, I see that one of the cores uses
> > 100% cpu, but
>  cannot actually match it to either the corosync or one of the
>  pacemaker processes.
> >
> > In such a case, this high CPU usage is happening on all 7 nodes.
> > I have to manually go to each node, stop pacemaker, restart
> > corosync, then
>  start pacemeker. Stoping pacemaker and corosync does not work
>  in most of the cases, usually a kill -9 is needed.
> >
> > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >
> > Using udpu as transport, two rings on Gigabit ETH, rro_mode
> >> passive.
> >
> > Logs are usually flooded with CPG related messages, such as:
> >
> > Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
> Sent
> >> 0
> >> CPG
>  messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
> Sent
> >> 0
> >> CPG
>  messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
> Sent
> >> 0
> >> CPG
>  messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
> Sent
> >> 0
> >> CPG
>  messages  (1 remaining, last=8): Try again (6)
> >
> > OR
> >
> > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> Sent 0
>  CPG
>  messages  (1 remaining, last=10933): Try again (
> > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> Sent 0
>  CPG
>  messages  (1 remaining, last=10933): Try again (
> > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> Sent 0
>  CPG
>  messages  (1 remaining, last=10933): Try again (
> 
>  That is usually a symptom of corosync getting into a horribly
>  confused
> >> state.
>  Version? Distro? Have you checked for an update?
>  Odd that the user of all that CPU isn't showing up though.
> 
> >
> >>>
> >>> As I wrote I use Ubuntu trusty, the exact package versions are:
> >>>
> >>> corosync 2.3.0-1ubuntu5
> >>> pacemaker 1.1.10+git20130802-1ubuntu2
> >>
> >> Ah sorry, I seem to have missed that part.
> >>
> >>>
> >>> There are no updates available. The only option is to install
> >>> from sources,
> >> but that would be very difficult to maintain and I'm not sure I
> >> would get rid of this issue.
> >>>
> >>> What do you recommend?
> >>
> >> The same thing as Lars, or switch to a distro that stays current
> >> with upstream (git shows 5 newer releases for that branch since
> >> it was released 3 years ago).
> >> If you do build from source, its probably best to go with v1.4.6
> >
> > Hm, I am a bit confused her

[Pacemaker] fencing question

2014-03-12 Thread Karl Rößmann

Hi,

we have a two-node HA cluster using SUSE SLES 11 HA Extension SP3,
latest release level.
A resource (Xen) was manually stopped; the shutdown_timeout is 120s,
but after 60s the node was fenced and shut down by the other node.

Should I change some timeout value?

This is a part of our configuration:
...
primitive fkflmw ocf:heartbeat:Xen \
meta target-role="Started" is-managed="true" allow-migrate="true" \
op monitor interval="10" timeout="30" \
op migrate_from interval="0" timeout="600" \
op migrate_to interval="0" timeout="600" \
params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"
...
...
property $id="cib-bootstrap-options" \
dc-version="1.1.10-f3eeaf4" \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes="2" \
no-quorum-policy="ignore" \
last-lrm-refresh="1394533475" \
default-action-timeout="60s"
rsc_defaults $id="rsc_defaults-options" \
resource-stickiness="10" \
migration-threshold="3"


we had this scenario:

on Node ha2infra:

Mar 12 11:59:59 ha2infra pengine[25631]:   notice: LogActions: Stop 
fkflmw   (ha2infra)   <--- Resource fkflmw was stopped  
manually
Mar 12 11:59:59 ha2infra pengine[25631]:   notice: process_pe_message:  
Calculated Transition 105: /var/lib/pacemaker/pengine/pe-input-519.bz2
Mar 12 11:59:59 ha2infra crmd[25632]:   notice: do_te_invoke:  
Processing graph 105 (ref=pe_calc-dc-1394621999-178) derived from  
/var/lib/pacemaker/pengine/pe-input-519.bz2
Mar 12 11:59:59 ha2infra crmd[25632]:   notice: te_rsc_command:  
Initiating action 60: stop fkflmw_stop_0 on ha2infra (local)
Mar 12 11:59:59 ha2infra Xen(fkflmw)[22718]: INFO: Xen domain fkflmw  
will be stopped (timeout: 120s)   <--- stopping fkflmw

Mar 12 12:00:00 ha2infra mgmtd: [25633]: info: CIB query: cib
Mar 12 12:00:00 ha2infra mgmtd: [25633]: info: CIB query: cib
Mar 12 12:00:59 ha2infra sshd[24992]: Connection closed by  
134.105.232.21 [preauth]
Mar 12 12:00:59 ha2infra lrmd[25629]:  warning:  
child_timeout_callback: fkflmw_stop_0 process (PID 22718) timed out
Mar 12 12:00:59 ha2infra lrmd[25629]:  warning: operation_finished:
fkflmw_stop_0:22718 - timed out after 60000ms   <--- Stop
timed out after 60s (not 120s)
Mar 12 12:00:59 ha2infra crmd[25632]:    error: process_lrm_event: LRM
operation fkflmw_stop_0 (136) Timed Out (timeout=60000ms)
Mar 12 12:00:59 ha2infra crmd[25632]:  warning: status_from_rc: Action  
60 (fkflmw_stop_0) on ha2infra failed (target: 0 vs. rc: 1): Error


Mar 12 12:00:59 ha2infra pengine[25631]:  warning:  
unpack_rsc_op_failure: Processing failed op stop for fkflmw on  
ha2infra: unknown error (1)
Mar 12 12:00:59 ha2infra pengine[25631]:  warning: pe_fence_node: Node  
ha2infra will be fenced because of resource failure(s)
<--- is this normal?
Mar 12 12:00:59 ha2infra pengine[25631]:  warning: stage6: Scheduling  
Node ha2infra for STONITH


Node ha1infra:

Mar 12 12:00:59 ha1infra stonith-ng[21808]:   notice:  
can_fence_host_with_device: stonith_1 can fence ha2infra: dynamic-list
Mar 12 12:01:01 ha1infra stonith-ng[21808]:   notice: log_operation:  
Operation 'reboot' [23984] (call 2 from crmd.25632) for host  
'ha2infra' with device 'stonith_1' returned: 0 (OK)
Mar 12 12:01:05 ha1infra corosync[21794]:  [TOTEM ] A processor  
failed, forming new configuration.





Karl Roessmann
--
Karl Rößmann  Tel. +49-711-689-1657
Max-Planck-Institut FKF Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart email k.roessm...@fkf.mpg.de




Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-03-12 Thread Vladislav Bogdanov
12.03.2014 00:40, Andrew Beekhof wrote:
> 
> On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov  wrote:
> 
>> 07.03.2014 10:30, Vladislav Bogdanov wrote:
>>> 07.03.2014 05:43, Andrew Beekhof wrote:

 On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov  
 wrote:

> 18.02.2014 03:49, Andrew Beekhof wrote:
>>
>> On 31 Jan 2014, at 6:20 pm, yusuke iida  wrote:
>>
>>> Hi, all
>>>
>>> I measure the performance of Pacemaker in the following combinations.
>>> Pacemaker-1.1.11.rc1
>>> libqb-0.16.0
>>> corosync-2.3.2
>>>
>>> All nodes are KVM virtual machines.
>>>
>>> stopped the node of vm01 compulsorily from the inside, after starting 
>>> 14 nodes.
>>> "virsh destroy vm01" was used for the stop.
>>> Then, in addition to the compulsorily stopped node, other nodes are 
>>> separated from a cluster.
>>>
>>> The log of "Retransmit List:" is then outputted in large quantities 
>>> from corosync.
>>
>> Probably best to poke the corosync guys about this.
>>
>> However, <= .11 is known to cause significant CPU usage with that many 
>> nodes.
>> I can easily imagine this staving corosync of resources and causing 
>> breakage.
>>
>> I would _highly_ recommend retesting with the current git master of 
>> pacemaker.
>> I merged the new cib code last week which is faster by _two_ orders of 
>> magnitude and uses significantly less CPU.
>
> Andrew, current git master (ee094a2) almost works, the only issue is
> that crm_diff calculates incorrect diff digest. If I replace digest in
> diff by hands with what cib calculates as "expected". it applies
> correctly. Otherwise - -206.

 More details?
>>>
>>> Hmmm...
>>> seems to be crmsh-specific,
>>> Cannot reproduce with pure-XML editing.
>>> Kristoffer, does 
>>> http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?
>>
>> The problem seems to be caused by the fact that crmsh does not provide
>> the status section in both orig and new XMLs to crm_diff, and digest
>> generation seems to rely on that, so crm_diff and cib daemon produce
>> different digests.
>>
>> Attached are two sets of XML files, one (orig.xml, new.xml, patch.xml)
>> are related to the full CIB operation (with status section included),
>> another (orig-edited.xml, new-edited.xml, patch-edited.xml) have that
>> section removed like crmsh does do.
>>
>> Resulting diffs differ only by digest, and that seems to be the exact issue.
> 
> This should help.  As long as crmsh isn't passing -c to crm_diff, then the 
> digest will no longer be present.
> 
>   https://github.com/beekhof/pacemaker/commit/c8d443d

Yep, that helped.
Thank you!
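
(For reference, a sketch of the distinction that commit draws; only
crm_diff's --cib mode keeps the version/digest handling:)

crm_diff --original orig.xml --new new.xml > patch.xml        # no digest now
crm_diff --original orig.xml --new new.xml --cib > patch.xml  # digest included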




Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri napsal(a):
> Hello Jan,
> 
> Thank you very much for your help so far.
> 
>> -Original Message-
>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>> Sent: Wednesday, March 12, 2014 9:51 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> Attila Megyeri napsal(a):
>>>
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 12 Mar 2014, at 1:54 am, Attila Megyeri
 
 wrote:

>>
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, March 11, 2014 12:48 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
>> 
>> wrote:
>>
>>> Thanks for the quick response!
>>>
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:31 am, Attila Megyeri
 
 wrote:

> Hello,
>
> We have a strange issue with Corosync/Pacemaker.
> From time to time, something unexpected happens and suddenly
>> the
 crm_mon output remains static.
> When I check the cpu usage, I see that one of the cores uses
> 100% cpu, but
 cannot actually match it to either the corosync or one of the
 pacemaker processes.
>
> In such a case, this high CPU usage is happening on all 7 nodes.
> I have to manually go to each node, stop pacemaker, restart
> corosync, then
 start pacemeker. Stoping pacemaker and corosync does not work in
 most of the cases, usually a kill -9 is needed.
>
> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>
> Using udpu as transport, two rings on Gigabit ETH, rro_mode
>> passive.
>
> Logs are usually flooded with CPG related messages, such as:
>
> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
>   Sent
>> 0
>> CPG
 messages  (1 remaining, last=8): Try again (6)
> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
>   Sent
>> 0
>> CPG
 messages  (1 remaining, last=8): Try again (6)
> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
>   Sent
>> 0
>> CPG
 messages  (1 remaining, last=8): Try again (6)
> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
>   Sent
>> 0
>> CPG
 messages  (1 remaining, last=8): Try again (6)
>
> OR
>
> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
>   Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
>   Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
>   Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (

 That is usually a symptom of corosync getting into a horribly
 confused
>> state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.

>
>>>
>>> As I wrote I use Ubuntu trusty, the exact package versions are:
>>>
>>> corosync 2.3.0-1ubuntu5
>>> pacemaker 1.1.10+git20130802-1ubuntu2
>>
>> Ah sorry, I seem to have missed that part.
>>
>>>
>>> There are no updates available. The only option is to install from
>>> sources,
>> but that would be very difficult to maintain and I'm not sure I
>> would get rid of this issue.
>>>
>>> What do you recommend?
>>
>> The same thing as Lars, or switch to a distro that stays current
>> with upstream (git shows 5 newer releases for that branch since it
>> was released 3 years ago).
>> If you do build from source, its probably best to go with v1.4.6
>
> Hm, I am a bit confused here. We are using 2.3.0,

 I swapped the 2 for a 1 somehow. A bit distracted, sorry.
>>>
>>> I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still 
>>> the
>> same issue - after some time CPU gets to 100%, and the corosync log is
>> flooded with messages like:
>>>
>>> Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:
>>> Se

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
Hello Jan,

Thank you very much for your help so far.

> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Wednesday, March 12, 2014 9:51 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> Attila Megyeri napsal(a):
> >
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 11, 2014 10:27 PM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 12 Mar 2014, at 1:54 am, Attila Megyeri
> >> 
> >> wrote:
> >>
> 
>  -Original Message-
>  From: Andrew Beekhof [mailto:and...@beekhof.net]
>  Sent: Tuesday, March 11, 2014 12:48 AM
>  To: The Pacemaker cluster resource manager
>  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
>  On 7 Mar 2014, at 5:54 pm, Attila Megyeri
>  
>  wrote:
> 
> > Thanks for the quick response!
> >
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Friday, March 07, 2014 3:48 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >> 
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> We have a strange issue with Corosync/Pacemaker.
> >>> From time to time, something unexpected happens and suddenly
> the
> >> crm_mon output remains static.
> >>> When I check the cpu usage, I see that one of the cores uses
> >>> 100% cpu, but
> >> cannot actually match it to either the corosync or one of the
> >> pacemaker processes.
> >>>
> >>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>> I have to manually go to each node, stop pacemaker, restart
> >>> corosync, then
> >> start pacemeker. Stoping pacemaker and corosync does not work in
> >> most of the cases, usually a kill -9 is needed.
> >>>
> >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>
> >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode
> passive.
> >>>
> >>> Logs are usually flooded with CPG related messages, such as:
> >>>
> >>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
> >>>   Sent
> 0
>  CPG
> >> messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
> >>>   Sent
> 0
>  CPG
> >> messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
> >>>   Sent
> 0
>  CPG
> >> messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
> >>>   Sent
> 0
>  CPG
> >> messages  (1 remaining, last=8): Try again (6)
> >>>
> >>> OR
> >>>
> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
> >>>   Sent 0
> >> CPG
> >> messages  (1 remaining, last=10933): Try again (
> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
> >>>   Sent 0
> >> CPG
> >> messages  (1 remaining, last=10933): Try again (
> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
> >>>   Sent 0
> >> CPG
> >> messages  (1 remaining, last=10933): Try again (
> >>
> >> That is usually a symptom of corosync getting into a horribly
> >> confused
>  state.
> >> Version? Distro? Have you checked for an update?
> >> Odd that the user of all that CPU isn't showing up though.
> >>
> >>>
> >
> > As I wrote I use Ubuntu trusty, the exact package versions are:
> >
> > corosync 2.3.0-1ubuntu5
> > pacemaker 1.1.10+git20130802-1ubuntu2
> 
>  Ah sorry, I seem to have missed that part.
> 
> >
> > There are no updates available. The only option is to install from
> > sources,
>  but that would be very difficult to maintain and I'm not sure I
>  would get rid of this issue.
> >
> > What do you recommend?
> 
>  The same thing as Lars, or switch to a distro that stays current
>  with upstream (git shows 5 newer releases for that branch since it
>  was released 3 years ago).
>  If you do build from source, its probably best to go with v1.4.6
> >>>
> >>> Hm, I am a bit confused here. We are using 2.3.0,
> >>
> >> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
> >
> > I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still 
> > the
> same issue - after some time CPU gets to 100%, and the corosync log is
> flooded with messages like:
> >
> > Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:
> > Sent 0 CPG
> messages  (48 remaining, last=3671): 

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri napsal(a):
> 
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, March 11, 2014 10:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 12 Mar 2014, at 1:54 am, Attila Megyeri 
>> wrote:
>>

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:54 pm, Attila Megyeri 
 wrote:

> Thanks for the quick response!
>
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Friday, March 07, 2014 3:48 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>> 
>> wrote:
>>
>>> Hello,
>>>
>>> We have a strange issue with Corosync/Pacemaker.
>>> From time to time, something unexpected happens and suddenly the
>> crm_mon output remains static.
>>> When I check the cpu usage, I see that one of the cores uses 100%
>>> cpu, but
>> cannot actually match it to either the corosync or one of the
>> pacemaker processes.
>>>
>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>> I have to manually go to each node, stop pacemaker, restart
>>> corosync, then
>> start pacemeker. Stoping pacemaker and corosync does not work in
>> most of the cases, usually a kill -9 is needed.
>>>
>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>
>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
>>>
>>> Logs are usually flooded with CPG related messages, such as:
>>>
>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
>>> Sent 0
 CPG
>> messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
>>> Sent 0
 CPG
>> messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
>>> Sent 0
 CPG
>> messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
>>> Sent 0
 CPG
>> messages  (1 remaining, last=8): Try again (6)
>>>
>>> OR
>>>
>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
>>> Sent 0
>> CPG
>> messages  (1 remaining, last=10933): Try again (
>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
>>> Sent 0
>> CPG
>> messages  (1 remaining, last=10933): Try again (
>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
>>> Sent 0
>> CPG
>> messages  (1 remaining, last=10933): Try again (
>>
>> That is usually a symptom of corosync getting into a horribly
>> confused
 state.
>> Version? Distro? Have you checked for an update?
>> Odd that the user of all that CPU isn't showing up though.
>>
>>>
>
> As I wrote I use Ubuntu trusty, the exact package versions are:
>
> corosync 2.3.0-1ubuntu5
> pacemaker 1.1.10+git20130802-1ubuntu2

 Ah sorry, I seem to have missed that part.

>
> There are no updates available. The only option is to install from
> sources,
 but that would be very difficult to maintain and I'm not sure I would
 get rid of this issue.
>
> What do you recommend?

 The same thing as Lars, or switch to a distro that stays current with
 upstream (git shows 5 newer releases for that branch since it was
 released 3 years ago).
 If you do build from source, its probably best to go with v1.4.6
>>>
>>> Hm, I am a bit confused here. We are using 2.3.0,
>>
>> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
> 
> I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the 
> same issue - after some time CPU gets to 100%, and the corosync log is 
> flooded with messages like:
> 
> Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 
> 0 CPG messages  (48 remaining, last=3671): Try again (6)
> Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 
> 0 CPG messages  (51 remaining, last=3995): Try again (6)
> Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 
> 0 CPG messages  (48 remaining, last=3671): Try again (6)
> Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 
> 0 CPG messages  (51 remaining, last=3995): Try again (6)
> Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 
> 0 CPG messages  (48 remaining, last=3671)

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri

> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 11, 2014 10:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> On 12 Mar 2014, at 1:54 am, Attila Megyeri 
> wrote:
> 
> >>
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 11, 2014 12:48 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri 
> >> wrote:
> >>
> >>> Thanks for the quick response!
> >>>
>  -Original Message-
>  From: Andrew Beekhof [mailto:and...@beekhof.net]
>  Sent: Friday, March 07, 2014 3:48 AM
>  To: The Pacemaker cluster resource manager
>  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
>  On 7 Mar 2014, at 5:31 am, Attila Megyeri
>  
>  wrote:
> 
> > Hello,
> >
> > We have a strange issue with Corosync/Pacemaker.
> > From time to time, something unexpected happens and suddenly the
>  crm_mon output remains static.
> > When I check the cpu usage, I see that one of the cores uses 100%
> > cpu, but
>  cannot actually match it to either the corosync or one of the
>  pacemaker processes.
> >
> > In such a case, this high CPU usage is happening on all 7 nodes.
> > I have to manually go to each node, stop pacemaker, restart
> > corosync, then
>  start pacemeker. Stoping pacemaker and corosync does not work in
>  most of the cases, usually a kill -9 is needed.
> >
> > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >
> > Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
> >
> > Logs are usually flooded with CPG related messages, such as:
> >
> > Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
> > Sent 0
> >> CPG
>  messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
> > Sent 0
> >> CPG
>  messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
> > Sent 0
> >> CPG
>  messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
> > Sent 0
> >> CPG
>  messages  (1 remaining, last=8): Try again (6)
> >
> > OR
> >
> > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> > Sent 0
> CPG
>  messages  (1 remaining, last=10933): Try again (
> > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> > Sent 0
> CPG
>  messages  (1 remaining, last=10933): Try again (
> > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
> > Sent 0
> CPG
>  messages  (1 remaining, last=10933): Try again (
> 
>  That is usually a symptom of corosync getting into a horribly
>  confused
> >> state.
>  Version? Distro? Have you checked for an update?
>  Odd that the user of all that CPU isn't showing up though.
> 
> >
> >>>
> >>> As I wrote I use Ubuntu trusty, the exact package versions are:
> >>>
> >>> corosync 2.3.0-1ubuntu5
> >>> pacemaker 1.1.10+git20130802-1ubuntu2
> >>
> >> Ah sorry, I seem to have missed that part.
> >>
> >>>
> >>> There are no updates available. The only option is to install from
> >>> sources,
> >> but that would be very difficult to maintain and I'm not sure I would
> >> get rid of this issue.
> >>>
> >>> What do you recommend?
> >>
> >> The same thing as Lars, or switch to a distro that stays current with
> >> upstream (git shows 5 newer releases for that branch since it was
> >> released 3 years ago).
> >> If you do build from source, its probably best to go with v1.4.6
> >
> > Hm, I am a bit confused here. We are using 2.3.0,
> 
> I swapped the 2 for a 1 somehow. A bit distracted, sorry.

I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but still the
same issue - after some time CPU gets to 100%, and the corosync log is flooded
with messages like:

Mar 12 07:36:55 [4793] ctdb2    cib: info: crm_cs_flush: Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush: Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:56 [4793] ctdb2    cib: info: crm_cs_flush: Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush: Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2    cib: info: crm_cs_flush: Sent 0 CPG messages  (48 remaining, last=3671)
Mar 12 07:36:57 [4798] ctdb2   crmd: info: cr