Re: [Pacemaker] Suggestions for managing HA of containers from within a Pacemaker container?

2015-02-25 Thread David Vossel


- Original Message -
 Pacemaker as a scheduler in Mesos or Kubernetes does sound like a very
 interesting idea. Packaging corosync into super privileged containers still
 doesn't make too much sense to me. What's the reason for isolating something
 and then giving it all permissions on a host machine?

because soon everything will run in containers. Take a look at rhel atomic and
the stuff coreos is doing. The only way pacemaker will exist on those distributions
is if it lives in a super privileged container.

 On Mon, Feb 23, 2015 at 5:20 PM, Andrew Beekhof  and...@beekhof.net  wrote:
 
 
 
  On 10 Feb 2015, at 1:45 pm, Serge Dubrouski  serge...@gmail.com  wrote:
  
  Hello Steve,
  
  Are you sure that Pacemaker is the right product for your project? Have you
  checked Mesos/Marathon or Kubernetes? Those are frameworks being developed
  for managing containers.
 
 And in a few years they'll work out that they need some HA features and try
 to retrofit them :-)
 In the meantime pacemaker is actually rather good at managing containers
 already and knows a thing or two about HA and how to bring up a complex
 stack of services.
 
 The one thing that would be really interesting in this area is using our
 policy engine as the Kubernetes scheduler.
 So many ideas and so little time.
 
  
  On Sat Feb 07 2015 at 1:19:15 PM Steven Dake (stdake)  std...@cisco.com 
  wrote:
  Hi,
  
  I am working on Containerizing OpenStack in the Kolla project (
  http://launchpad.net/kolla ). One of the key things we want to do over the
  next few months is add H/A support to our container tech. David Vossel had
  suggested using systemctl to monitor the containers themselves by running
  healthchecking scripts within the containers. That idea is sound.
  
  There is another technology called “super-privileged containers”.
  Essentially it allows more host access for the container, allowing the
  treatment of Pacemaker as a container rather than a RPM or DEB file. I’d
  like corosync to run in a separate container. These containers will
  communicate using their normal mechanisms in a super-privileged mode. We
  will implement this in Kolla.
  
  Where I am stuck is how does Pacemaker within a container control other
  containers in the host os. One way I have considered is using the docker
  --pid=host flag, allowing pacemaker to communicate directly with the host
  systemctl process. Where I am stuck is our containers don’t run via
  systemctl, but instead via shell scripts that are executed by third party
  deployment software.
  
  An example:
  Lets say a rabbitmq container wants to run:
  
  The user would run
  kolla-mgr deploy messaging
  
  This would run a small bit of code to launch the docker container set for
  messaging.
  
  Could pacemaker run something like
  
  Kolla-mgr status messaging
  
  To control the lifecycle of the processes?
  
  Or would we be better off with some systemd integration with kolla-mgr?
  
  Thoughts welcome
  
  Regards,
  -steve
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 
 --
 Serge Dubrouski.
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestions for managing HA of containers from within a Pacemaker container?

2015-02-23 Thread David Vossel


- Original Message -
 Hi,

Hey Steve, Good to see you around :)

 I am working on Containerizing OpenStack in the Kolla project (
 http://launchpad.net/kolla ). One of the key things we want to do over the
 next few months is add H/A support to our container tech. David Vossel had
 suggested using systemctl to monitor the containers themselves by running
 healthchecking scripts within the containers. That idea is sound.

Knowing what I know about OpenStack HA now, that is a bad choice.

 
 There is another technology called “super-privileged containers”. Essentially
 it allows more host access for the container, allowing the treatment of

Yep, this is the way to do it. My plan is to have pacemaker running in a 
container,
and have pacemaker capable of launching resources within containers.

We already have a Docker resource agent. You can find it here,
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/docker

Using that agent, pacemaker can launch a docker container, and then monitor
the container by performing health checks within the container. Here's an
example of how I'm using this technique to manage a containerized apache
instance.
https://github.com/davidvossel/phd/blob/master/scenarios/docker-apache-ap.scenario#L96
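
For illustration, a rough pcs sketch of what that ends up looking like (the
resource and image names are made up, and the parameter names should be checked
against the agent's meta-data):

pcs resource create web-site ocf:heartbeat:docker \
    image=httpd:2.4 run_opts="-p 80:80" \
    monitor_cmd="curl -s http://localhost/" \
    op monitor interval=30s timeout=30s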


 Pacemaker as a container rather than a RPM or DEB file. I’d like corosync to
 run in a separate container. These containers will communicate using their


I actually already got pacemaker+corosync running in a container for testing
purposes. If you're interested you can checkout some of that work here,
https://github.com/davidvossel/phd/tree/master/lib/docker . The 
phd_docker_utils.sh
file holds most of the interesting parts. 
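
Roughly speaking, the container there gets started with most isolation turned
off so corosync and pacemaker can see the host; something along these lines
(the flags are illustrative, not the exact phd invocation, and the image name
is made up):

docker run -d --privileged --net=host \
    -v /dev:/dev -v /sys/fs/cgroup:/sys/fs/cgroup \
    -v /etc/corosync:/etc/corosync \
    --name pcmk-node my-pacemaker-image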

 normal mechanisms in a super-privileged mode. We will implement this in
 Kolla.
 
 Where I am stuck is how does Pacemaker within a container control other
 containers in the host os. One way I have considered is using the docker
 --pid=host flag, allowing pacemaker to communicate directly with the host
 systemctl process. Where I am stuck is our containers don’t run via
 systemctl, but instead via shell scripts that are executed by third party
 deployment software.
 
 An example:
 Lets say a rabbitmq container wants to run:
 
 The user would run
 kolla-mgr deploy messaging

yes, and from there kolla-mgr hands the containers off to pacemaker to manage.

kolla is the orchestration, pacemaker is the scheduler for performing those 
tasks.

 This would run a small bit of code to launch the docker container set for
 messaging.
 
 Could pacemaker run something like
 
 Kolla-mgr status messaging
 
 To control the lifecycle of the processes?
 
 Or would we be better off with some systemd integration with kolla-mgr?
 
 Thoughts welcome
 
 Regards,
 -steve
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] please help

2015-01-30 Thread David Vossel


- Original Message -
 
 
 Pacemaker is only running on one node.
 
 Before, it was running on both nodes.

Run 'service pacemaker start' on the ams2 node.
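
That is, on ams2 (assuming the stack is started via init scripts):

service corosync start     # start the messaging layer first if it isn't running
service pacemaker start
crm_mon -1                 # run on either node to confirm both show Online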

 
 
 
 
 Thank you  Best Regards
 
 Perminus,
 
 IT
 
 
 
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Gracefully failing reload operation

2015-01-29 Thread David Vossel


- Original Message -
 Hi,
 
 is there a way for resource agent to tell pacemaker that in some cases
 reload operation is insufficient to apply new resource definition and
 restart is required?
 
 I tried to return OCF_ERR_GENERIC, but that prevents resource from being
 started until failure-timeout lapses and cluster is rechecked.

I believe if the resource instance attribute that is being updated is
marked as 'unique' by the specific resource's metadata, that pacemaker
will force a start/stop instead of allowing the reload.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm254549695664
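
For example, a minimal sketch of what that flag looks like in an agent's
meta-data (not taken from a real agent):

meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<resource_agent name="myagent" version="0.1">
  <parameters>
    <!-- unique="1" is the flag referred to above: changes to this parameter
         should force a stop/start rather than a reload -->
    <parameter name="config" unique="1" required="1">
      <content type="string"/>
    </parameter>
  </parameters>
</resource_agent>
EOF
}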

-- David

 
 Best,
 Vladislav
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Version of libqb is too old: v0.13 or greater requried

2015-01-28 Thread David Vossel


- Original Message -
 Hi Everybody,
 
 I have compiled libqb 0.17.1 under Debian Jessie/testing amd64 as:
 
 tar zxvf libqb-v0.17.1.tar.gz
 cd libqb-0.17.1/
 ./autogen.sh
 ./configure
 make -j8
 make -j8 install
 
 Then after successful builds of COROSYNC 2.3.4, CLUSTER-GLUE 1.0.12 and
 RESOURCE-AGENTS 3.9.5, compiling PACEMAKER 1.1.12 fails with:
 
 unzip Pacemaker-1.1.12.zip
 cd pacemaker-Pacemaker-1.1.12/
 addgroup --system haclient
 ./autogen.sh
 ./configure
 [...]
 configure: error: in `/home/alexis/pacemaker-Pacemaker-1.1.12':
 configure: error: Version of libqb is too old: v0.13 or greater requried
 
 I have tried to pass some flags to ./configure, but I still get this error.

what do you get when you run: pkg-config --modversion libqb

also, make sure you don't have an old version of libqb installed that you
ran 'make install' over the top of.
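
Something like the following usually narrows it down (the paths are just the
common defaults):

pkg-config --modversion libqb          # the version configure will see
pkg-config --variable=libdir libqb     # where that copy lives
# if the self-built 0.17.1 went to /usr/local but an old distro copy sits in /usr:
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./configure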

-- Vossel

 
 What am I doing wrong?
 
 Thanks for your help,
 
 --
 Alexis de BRUYN alexis.mailingl...@de-bruyn.fr
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-remote not listening

2015-01-27 Thread David Vossel


- Original Message -
 Hi,
 my os is debian-wheezy
 i compiled and installed pacemaker-remote.
 Startup log:
 Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: crm_log_init: Changed
 active directory to /var/lib/heartbeat/cores/root
 Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: qb_ipcs_us_publish:
 server name: lrmd
 Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: main: Starting
 My problem is that pacemaker_remote is not listening on port 3121

By default pacemaker_remote should listen on 3121. This is odd.

One thing I can think of. Take a look at /etc/sysconfig/pacemaker on the
node running pacemaker_remote. Make sure there isn't a custom port set
using the PCMK_remote_port variable.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html#_pacemaker_and_pacemaker_remote_options
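
Something along these lines should show it quickly:

grep -v '^#' /etc/sysconfig/pacemaker | grep PCMK_remote_port
netstat -tlnp | grep 3121    # or: ss -tlnp | grep pacemaker_remote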

-- Vossel

 netstat -tulpen | grep 3121
 netstat -alpen
 Proto RefCnt Flags Type State I-Node PID/Program name Path
 unix 2 [ ACC ] STREAM LISTENING 6635 2859/pacemaker_remo @lrmd
 unix 2 [ ] DGRAM 6634 2859/pacemaker_remo
 ...
 ...
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker memory usage

2015-01-23 Thread David Vossel


- Original Message -
 
 
 Hi,
 
 
 
 We are trying to introduce clustering to our embedded environment using
 pacemaker and corosync.
 
 Our environment includes following packages and modules for the initial test:
 
 1. Pacemaker – version 1.1.10
 
 2. Corosync - version 2.3.4
 
 3. Corosync.conf file find it attached to this email thread
 
 4. 2 Nodes in the cluster (master and slave)
 
 5. Test app1 that publishes some sensor data (only Master node publishes,
 slave node just receives the data)
 
 6. Test app2 that receives the published data and outputs on the screen (only
 master node outputs the data on the screen)
 
 
 
 Our test has been successful and we are able to create a cluster with 2 nodes
 and everything seems to be working fine. However we observed that pacemaker
 is consuming approx. 80MB of RAM when both the test applications are alive.

80M is more than I would expect.

One thing I know you can do is lower the IPC buffer size. That can be done in
/etc/sysconfig/pacemaker. Set the PCMK_ipc_buffer to something smaller than
whatever it defaults to.
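
For example (the value is in bytes; the exact default depends on the build, so
treat this as a starting point to experiment with):

# /etc/sysconfig/pacemaker
PCMK_ipc_buffer=65536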

-- Vossel



 
 We would like to know if there are any configuration settings or fine tuning
 that we need to perform to reduce the memory usage. It is very critical to
 reduce the memory consumption as we are running it in embedded environment.
 
 
 
 Thanks,
 Santosh Bidaralli
 
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Avoid one node from being a target for resources migration

2015-01-12 Thread David Vossel


- Original Message -
 Hello.
 
 I have 3-node cluster managed by corosync+pacemaker+crm. Node1 and Node2 are
 DRBD master-slave, also they have a number of other services installed
 (postgresql, nginx, ...). Node3 is just a corosync node (for quorum), no
 DRBD/postgresql/... are installed at it, only corosync+pacemaker.
 
 But when I add resources to the cluster, a part of them are somehow moved to
 node3 and since then fail. Note that I have a colocation directive to
 place these resources to the DRBD master only and location with -inf for
 node3, but this does not help - why? How to make pacemaker not run anything
 at node3?
 
 All the resources are added in a single transaction: cat config.txt | crm -w
 -f- configure where config.txt contains directives and commit statement
 at the end.
 
 Below are crm status (error messages) and crm configure show outputs.
 
 
 root@node3:~# crm status
 Current DC: node2 (1017525950) - partition with quorum
 3 Nodes configured
 6 Resources configured
 Online: [ node1 node2 node3 ]
 Master/Slave Set: ms_drbd [drbd]
 Masters: [ node1 ]
 Slaves: [ node2 ]
 Resource Group: server
 fs (ocf::heartbeat:Filesystem): Started node1
 postgresql (lsb:postgresql): Started node3 FAILED
 bind9 (lsb:bind9): Started node3 FAILED
 nginx (lsb:nginx): Started node3 (unmanaged) FAILED
 Failed actions:
 drbd_monitor_0 (node=node3, call=744, rc=5, status=complete,
 last-rc-change=Mon Jan 12 11:16:43 2015, queued=2ms, exec=0ms): not
 installed
 postgresql_monitor_0 (node=node3, call=753, rc=1, status=complete,
 last-rc-change=Mon Jan 12 11:16:43 2015, queued=8ms, exec=0ms): unknown
 error
 bind9_monitor_0 (node=node3, call=757, rc=1, status=complete,
 last-rc-change=Mon Jan 12 11:16:43 2015, queued=11ms, exec=0ms): unknown
 error
 nginx_stop_0 (node=node3, call=767, rc=5, status=complete, last-rc-change=Mon
 Jan 12 11:16:44 2015, queued=1ms, exec=0ms): not installed

Here's what is going on. Even when you say "never run this resource on node3",
pacemaker is going to probe for the resource on node3 regardless, just to verify
the resource isn't running.

The failures you are seeing ("monitor_0" failed) indicate that pacemaker could
not verify whether the resources are running on node3, because the related
packages for the resources are not installed. Given pacemaker's default
behavior I'd expect this.

You have two options.

1. install the resource related packages on node3 even though you never want
them to run there. This will allow the resource-agents to verify the resource
is in fact inactive.

2. If you are using the current master branch of pacemaker, there's a new
location constraint option called 'resource-discovery=always|never|exclusive'.
If you add the 'resource-discovery=never' option to your location constraint
that attempts to keep resources from node3, you'll avoid having pacemaker
perform the 'monitor_0' actions on node3 as well.
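
In the raw XML, that option sits directly on the location constraint. Loading it
with cibadmin would look something like this (the id is arbitrary, and it assumes
a build new enough to understand the attribute):

cibadmin --create -o constraints -X '<rsc_location id="loc_server_no_probe"
    rsc="server" node="node3" score="-INFINITY" resource-discovery="never"/>'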

-- Vossel

 
 root@node3:~# crm configure show | cat
 node $id=1017525950 node2
 node $id=13071578 node3
 node $id=1760315215 node1
 primitive drbd ocf:linbit:drbd \
 params drbd_resource=vlv \
 op start interval=0 timeout=240 \
 op stop interval=0 timeout=120
 primitive fs ocf:heartbeat:Filesystem \
 params device=/dev/drbd0 directory=/var/lib/vlv.drbd/root
 options=noatime,nodiratime fstype=xfs \
 op start interval=0 timeout=300 \
 op stop interval=0 timeout=300
 primitive postgresql lsb:postgresql \
 op monitor interval=10 timeout=60 \
 op start interval=0 timeout=60 \
 op stop interval=0 timeout=60
 primitive bind9 lsb:bind9 \
 op monitor interval=10 timeout=60 \
 op start interval=0 timeout=60 \
 op stop interval=0 timeout=60
 primitive nginx lsb:nginx \
 op monitor interval=10 timeout=60 \
 op start interval=0 timeout=60 \
 op stop interval=0 timeout=60
 group server fs postgresql bind9 nginx
 ms ms_drbd drbd meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true
 location loc_server server rule $id=loc_server-rule -inf: #uname eq node3
 colocation col_server inf: server ms_drbd:Master
 order ord_server inf: ms_drbd:promote server:start
 property $id=cib-bootstrap-options \
 stonith-enabled=false \
 last-lrm-refresh=1421079189 \
 maintenance-mode=false
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Some questions on the currenct state

2015-01-12 Thread David Vossel


- Original Message -
 Hi Trevor,
 
 thank you for answering so fast.
 
 2) Besides the fact that rpm packages are available do
 you know how to make rpm packages from git repository?

./autogen.sh && ./configure && make rpm

That will generate rpms from the source tree.
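
A full run from a fresh checkout looks roughly like this (the tag is just an
example):

git clone https://github.com/ClusterLabs/pacemaker.git
cd pacemaker
git checkout Pacemaker-1.1.12     # or whatever tag/branch you want to build
./autogen.sh && ./configure && make rpm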

 4) Is RHEL 7.x using corosync 2.x and pacemaker plugin
 for cluster membership?

no. RHEL 7.x uses corosync 2.x and the new corosync vote quorum api.
The plugins are a thing of the past for rhel7.

 Best regards
 Andreas Mock
 
 
  -----Original Message-----
  From: Trevor Hemsley [mailto:thems...@voiceflex.com]
  Sent: Monday, 12 January 2015 16:42
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Some questions on the currenct state
  
  On 12/01/15 15:09, Andreas Mock wrote:
   Hi all,
  
   almost allways when I'm forced to do some major upgrades
   to our core machines in terms of hardware and/or software (OS)
   I'm forced to have a look at the current state of pacemaker
   based HA. Things are going on and things change. Projects
   converge and diverge, tool(s)/chains come and go and
   distributions marketing strategies change. Therefor I want
   to ask the following question in the hope list members
   deeply involved can answer easily.
  
   1) Are there pacemaker packages für RHEL 6.6 and clones?
   When yes where?
  
  In the CentOS (etc) base/updates repos. For RHEL they're in the HA
  channel.
  
  
   2) How can I create a pacemaker package 1.1.12 on my own from
   the git sources?
  It's already in base/updates.
  
  
   3) How can I get the current versions of pcs and/or crmsh?
   Is pcs competitive to crmsh meanwhile?
  pcs is in el6.6 and now includes pcsd. You can get crmsh from an
  opensuse build repo for el6.
  
   4) Is the pacemaker HA solution of RHEL 7.x still bound to use
   of cman?
  No
  
   5) Where can I find a currenct workable version of the agents
   for RHEL 6.6 (and clones) and RHEL 7.x?
  Probably you want the resource-agents package.
  
  T
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-remote debian wheezy

2015-01-12 Thread David Vossel


- Original Message -
 Hi,
 what is the best way, to install in a debian wheezy vm the package
 pacemaker-remote? This package is in the debian repository not available.

I have no clue.

I just want to point out, if your host OS is debian wheezy and the 
pacemaker-remote
package is in fact unavailable, it is possible the version of pacemaker shipped
with wheezy doesn't even have the capability of managing pacemaker_remote nodes.

-- Vossel 

 Thanks!
 Regards,
 Thomas
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker error after a couple week or month (David Vossel)

2014-12-22 Thread David Vossel


- Original Message -
 Hello David,
 
 I think I use the latest version from ubuntu, it is version 1.1.10
 Do you think it has bug on it?

There have been a number of fixes to the lrmd since v1.1.10. It is possible
a couple of them could result in crashes. Again, without a backtrace from
the lrmd core dump, it is difficult for me to advise whether or not your
specific issue has been fixed. Building from source could yield better results
for you. The pacemaker master branch is stable at the moment.

lrmd related changes since 1.1.10

# git log --oneline Pacemaker-1.1.10^..HEAD | grep -e lrmd:
71b429c Low: lrmd: fix regression test LSBdummy install
fb94901 Test: lrmd: Ensure the lsb script is executable
30d978e Low: lrmd: systemd stress tests
568e41d Fix: lrmd: Prevent glib assert triggered by timers being removed from 
mainloop more than once
977de97 High: lrmd: cancel pending async connection during disconnect
d2d0cba Low: lrmd: ensures systemd python package is available when systemd 
tests run
f0fe737 Fix: lrmd: fix rescheduling of systemd monitor op during start
c0e8e6a Low: lrmd: prevent \n from being printed in exit reason output
2342835 High: lrmd: pass exit reason prefix to ocf scripts as env variable
412631c High: lrmd: store failed operation exit reason in cib
ad083a8 Fix: lrmd: Log with the correct personality
718bf5b Test: lrmd: Update the systemd agent to test long running actions
c78b4b8 Fix: lrmd: Handle systemd reporting 'done' before a resource is 
actually stopped
3bd6c30 Fix: lrmd: Handle systemd reporting 'done' before a resource is 
actually stopped
574fc49 Fix: lrmd: Prevent OCF agents from logging to random files due to 
value of setenv() being NULL
155c6aa Low: lrmd: wider use of defined literals
fa8bd56 Fix: lrmd: Expose logging variables expected by OCF agents
d9cc751 Fix: lrmd: Provide stderr output from agents if available, otherwise 
fall back to stdout
3adc781 Low: lrmd: clean up the agent's entire process group
348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed
fa2954e Low: lrmd: Warning msg to indicate duplicate op merge has occurred
b94d0e9 Low: lrmd: recurring op merger regression tests
c29ab27 High: lrmd: Merge duplicate recurring monitor operations
c1a326d Test: lrmd: Bump the lrmd test timeouts to avoid transient travis 
failures
deead39 Low: lrmd: Install ping agent during lrmd regression test.
aad79e2 Low: lrmd: Make ocf dummy agents executable with regression test in src 
tree
5c8c7a5 Test: lrmd: Kill uninstalled daemons by the correct name
8e90200 Test: lrmd: Fix upstart metadata test and install required OCF agents
bbdd6e1 Test: lrmd: Allow regression tests to run from the source tree
87f4091 Low: lrmd: Send event alerting estabilished clients that a new client 
connection is created.
644752e Fix: lrmd: Correctly calculate metadata for the 'service' class
ea7991f Fix: lrmd: Do not interrogate NULL replies from the server
1c14b9d Fix: lrmd: Correctly cancel monitor actions for lsb/systemd/service 
resources on cleaning up
eca Doc: lrmd: Indicate which function recieves the proxied command
ad4056f Test: lrmd: Drop the default verbosity for lrmd regression tests
eb40d6a Fix: lrmd: Do not overwrite any existing operation status error


-- Vossel

 Should I compile from the source?
 
 Best Regards,
 
 
 Ariee
 
 
 On Fri, Dec 19, 2014 at 8:27 PM,  pacemaker-requ...@oss.clusterlabs.org 
 wrote:
 
 
 Message: 2
 Date: Fri, 19 Dec 2014 14:21:59 -0500 (EST)
 From: David Vossel  dvos...@redhat.com 
 To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org 
 Subject: Re: [Pacemaker] pacemaker error after a couple week or month
 Message-ID:
  102420175.739708.1419016919246.javamail.zim...@redhat.com 
 Content-Type: text/plain; charset=utf-8
 
 
 
 - Original Message -
  Hello,
  
  I have 2 active-passive fail over system with corosync and drbd.
  One system using 2 debian server and the other using 2 ubuntu server.
  The debian servers are for web server fail over and the ubuntu servers are
  for database server fail over.
  
  I applied the same configuration in the pacemaker. Everything works fine,
  fail over can be done nicely and also the file system synchronization, but
  in the ubuntu server, it was always has error after a couple week or month.
  The pacemaker in ubuntu1 had different status with ubuntu2, ubuntu1 assumed
  that ubuntu2 was down and ubuntu2 assumed that something happened with
  ubuntu1 but still alive and took over the resources. It made the drbd
  resource cannot be taken over, thus no fail over happened and we must
  manually restart the server because restarting pacemaker and corosync
  didn't
  help. I have changed the configuration of pacemaker a couple time, but the
  problem still exist.
  
  has anyone experienced it? I use Ubuntu 14.04.1 LTS.
  
  I got this error in apport.log
  
  ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable:
  /usr/lib/pacemaker/lrmd

Re: [Pacemaker] [Patch]Memory leak of Pacemakerd.

2014-12-22 Thread David Vossel


- Original Message -
 Hi All,
 
 Whenever a node that is part of the cluster repeatedly starts and stops,
 pacemakerd on the node that does not stop leaks memory.
 I attached a patch.

this patch looks correct. Can you create a pull request to our master branch?
https://github.com/ClusterLabs/pacemaker

-- Vossel


 Best Regards,
 Hideo Yamauchi.
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [libqb]Unlink of files bound to sockets

2014-12-19 Thread David Vossel


- Original Message -
 I used the current trunk.
 I could not find the unlink calls.
 
 If domain sockets are used this two methods are used.
 
 In ./lib/ipc_socket.c
 static void
 qb_ipcc_us_disconnect(struct qb_ipcc_connection *c)
 {
    munmap(c->request.u.us.shared_data, SHM_CONTROL_SIZE);
    unlink(c->request.u.us.shared_file_name);

right, so here we're doing the unlinking of the shared file.
There's some trick we're using to only have a single file created
for all 3 sockets.

Is this not working for solaris?


    qb_ipcc_us_sock_close(c->event.u.us.sock);
    qb_ipcc_us_sock_close(c->request.u.us.sock);
    qb_ipcc_us_sock_close(c->setup.u.us.sock);
 }
 
 In ./lib/ipc_setup.c
 void
 qb_ipcc_us_sock_close(int32_t sock)
 {
 shutdown(sock, SHUT_RDWR);
 close(sock);
 }
 
 I added in the latter the unlink calls.
 
 -----Original Message-----
 From: David Vossel [mailto:dvos...@redhat.com]
 Sent: Thursday, 18 December 2014 18:13
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] [libqb]Unlink of files bound to sockets
 
 
 
 - Original Message -
  
  
  I sent this email yesterday to the libqb mailing list
  'quarterback-de...@lists.fedorahosted.org'.
  
  But there is nearly no activity since august.
 
 i saw the email. i flagged it so it would get a response.
 
  
  I use the current trunk of libqb.
  
  In qb_ipcc_us_sock_close nd qb_ipcs_us_withdraw of lib/ipc_setup.c
  sockets are closed.
  
  Is there a reason why the files bound to the sockets are not deleted
  with unlink?
  
  Is unlinking not necessary with Linux?
 
 Unlinking is required for linux.
 
 For client/server connections.
 
 qb_ipcc_us_disconnect unlinks on the client side.
 qb_ipcs_us_disconnect unlinks on the server side.
 
 
  I found thousands of files in statedir=/var/corosync/run after a while.
 
 What version of corosync are you using? There were some reference leaks for
 ipc connections in the corosync code we fixed a year or so ago that should
 have fixed this.
 
 -- David
 
  
  
  I tried this and it seems to work without errors.
  
  
  
  e.g.
  
  void
  
  qb_ipcc_us_sock_close(int32_t sock)
  
  {
  
  #ifdef QB_SOLARIS
  
  struct sockaddr_un un_addr;
  
  socklen_t un_addr_len = sizeof(struct sockaddr_un);
  
  #endif
  
  shutdown(sock, SHUT_RDWR);
  
  #ifdef QB_SOLARIS
  
   if (getsockname(sock, (struct sockaddr *)&un_addr, &un_addr_len) == 0)
  {
  
   if (strstr(un_addr.sun_path, "-") != NULL) {
   
   qb_util_log(LOG_DEBUG, "un_addr.sun_path=%s", un_addr.sun_path);
  
  unlink(un_addr.sun_path);
  
  }
  
  } else {
  
   qb_util_log(LOG_DEBUG, "getsockname returned errno=%d", errno);
  
  }
  
  #endif
  
  close(sock);
  
  }
  
  
  
  Regards
  
  
  
  Andreas
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org Getting started:
  http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
  
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker error after a couple week or month

2014-12-19 Thread David Vossel


- Original Message -
 Hello,
 
 I have 2 active-passive fail over system with corosync and drbd.
 One system using 2 debian server and the other using 2 ubuntu server.
 The debian servers are for web server fail over and the ubuntu servers are
 for database server fail over.
 
 I applied the same configuration in the pacemaker. Everything works fine,
 fail over can be done nicely and also the file system synchronization, but
 in the ubuntu server, it always has errors after a couple of weeks or months.
 The pacemaker in ubuntu1 had different status with ubuntu2, ubuntu1 assumed
 that ubuntu2 was down and ubuntu2 assumed that something happened with
 ubuntu1 but still alive and took over the resources. It made the drbd
 resource cannot be taken over, thus no fail over happened and we must
 manually restart the server because restarting pacemaker and corosync didn't
 help. I have changed the configuration of pacemaker a couple time, but the
 problem still exist.
 
 has anyone experienced it? I use Ubuntu 14.04.1 LTS.
 
 I got this error in apport.log
 
 ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable:
 /usr/lib/pacemaker/lrmd (command line /usr/lib/pacemaker/lrmd)

wow, it looks like the lrmd is crashing on you. I haven't seen this occur
in the wild before. Without a backtrace it will be nearly impossible to 
determine
what is happening.

Do you have the ability to upgrade pacemaker to a newer version?

-- Vossel

 ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: is_closing_session(): no
 DBUS_SESSION_BUS_ADDRESS in environment
 ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: wrote report
 /var/crash/_usr_lib_pacemaker_lrmd.0.crash
 
 my pacemaker configuration:
 
 node $id=1 db \
 attributes standby=off
 node $id=2 db2 \
 attributes standby=off
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
 params ip=192.168.0.100 cidr_netmask=24 \
 op monitor interval=30s
 primitive DBase ocf:heartbeat:mysql \
 meta target-role=Started \
 op start timeout=120s interval=0 \
 op stop timeout=120s interval=0 \
 op monitor interval=20s timeout=30s
 primitive DbFS ocf:heartbeat:Filesystem \
 params device=/dev/drbd0 directory=/sync fstype=ext4 \
 op start timeout=60s interval=0 \
 op stop timeout=180s interval=0 \
 op monitor interval=60s timeout=60s
 primitive Links lsb:drbdlinks
 primitive r0 ocf:linbit:drbd \
 params drbd_resource=r0 \
 op monitor interval=29s role=Master \
 op start timeout=240s interval=0 \
 op stop timeout=180s interval=0 \
 op promote timeout=180s interval=0 \
 op demote timeout=180s interval=0 \
 op monitor interval=30s role=Slave
 group DbServer ClusterIP DbFS Links DBase
 ms ms_r0 r0 \
 meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
 notify=true target-role=Master
 location prefer-db DbServer 50: db
 colocation DbServer-with-ms_ro inf: DbServer ms_r0:Master
 order DbServer-after-ms_ro inf: ms_r0:promote DbServer:start
 property $id=cib-bootstrap-options \
 dc-version=1.1.10-42f2063 \
 cluster-infrastructure=corosync \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore \
 last-lrm-refresh=1363370585
 
 my corosync config:
 
 totem {
 version: 2
 token: 3000
 token_retransmits_before_loss_const: 10
 join: 60
 consensus: 3600
 vsftype: none
 max_messages: 20
 clear_node_high_bit: yes
 secauth: off
 threads: 0
 rrp_mode: none
 transport: udpu
 cluster_name: Dbcluster
 }
 
 nodelist {
 node {
 ring0_addr: db
 nodeid: 1
 }
 node {
 ring0_addr: db2
 nodeid: 2
 }
 }
 
 quorum {
 provider: corosync_votequorum
 }
 
 amf {
 mode: disabled
 }
 
 service {
 ver: 0
 name: pacemaker
 }
 
 aisexec {
 user: root
 group: root
 }
 
 logging {
 fileline: off
 to_stderr: yes
 to_logfile: yes
 logfile: /var/log/corosync/corosync.log
 to_syslog: no
 syslog_facility: daemon
 debug: off
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: off
 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
 }
 }
 
 my drbd.conf:
 
 global {
 usage-count no;
 }
 
 common {
 protocol C;
 
 handlers {
 pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
 /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
 reboot -f";
 pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
 /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
 reboot -f";
 local-io-error "/usr/lib/drbd/notify-io-error.sh;
 /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ;
 halt -f";
 }
 
 startup {
 degr-wfc-timeout 120;
 }
 
 disk {
 on-io-error detach;
 }
 
 syncer {
 rate 100M;
 al-extents 257;
 }
 }
 
 resource r0 {
 protocol C;
 flexible-meta-disk internal;
 
 on db2 {
 address 192.168.0.10:7801 ;
 device /dev/drbd0 minor 0;
 disk /dev/sdb1;
 }
 on db {
 device /dev/drbd0 minor 0;
 disk /dev/db/sync;
 address 192.168.0.20:7801 ;
 }
 handlers {
 split-brain "/usr/lib/drbd/notify-split-brain.sh root";
 }
 net {
 after-sb-0pri discard-younger-primary; #discard-zero-changes;
 after-sb-1pri discard-secondary;
 after-sb-2pri 

Re: [Pacemaker] Fencing of bare-metal remote nodes

2014-12-19 Thread David Vossel


- Original Message -
 25.11.2014 23:41, David Vossel wrote:
 
 
  - Original Message -
  Hi!
 
  is subj implemented?
 
  Trying echo c > /proc/sysrq-trigger on remote nodes and no fencing occurs.
 
  Yes, fencing remote-nodes works. Are you certain your fencing devices can
  handle
  fencing the remote-node? Fencing a remote-node requires a cluster node to
  invoke the agent that actually performs the fencing action on the
  remote-node.
 
 David, a couple of questions.
 
 I see that in your fencing tests you just stop systemd unit.
 Shouldn't pacemaker_remoted somehow notify crmd that it is being
 shutdown? And shouldn't crmd stop all resources on that remote node
 before granting that shutdown?

yes, this needs to happen at some point.

Right now the shutdown method for a remote-node is to disable the connection
resource and wait for all the resources to stop before killing pacemaker_remoted
on the remote node. That isn't exactly ideal.
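
In pcs terms that amounts to something like this (the connection resource and
node names here are made up):

pcs resource disable remote-rnode001    # the ocf:pacemaker:remote connection resource
crm_mon -1                              # wait until nothing is left running on rnode001
ssh rnode001 systemctl stop pacemaker_remote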


 Also, from what I see now it would be natural to hide current
 implementation of remote node configuration under <node/> syntax. Now
 remote nodes do have almost all features of normal nodes, including node
 attributes. What do you think about it?

ha, well. yes. at this point that might make sense. I had originally never
planned on remote-nodes entering the actual nodes section, but eventually
that changed. I'd like for usage of remote nodes to mature a bit before I
commit to changing something like this though. I'm still a bit uncertain how
people are going to use baremetal remote nodes. The use cases people come
up with keep surprising me.  Keeping the remote node definition as a resource
gives us a bit more flexibility for configuration.

-- Vossel

 
 Best,
 Vladislav
 
 
  -- Vossel
 
 
  Best,
  Vladislav
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Fencing of bare-metal remote nodes

2014-11-26 Thread David Vossel


- Original Message -
 26.11.2014 18:36, David Vossel wrote:
  
  
  - Original Message -
  25.11.2014 23:41, David Vossel wrote:
 
 
  - Original Message -
  Hi!
 
  is subj implemented?
 
   Trying echo c > /proc/sysrq-trigger on remote nodes and no fencing
  occurs.
 
  Yes, fencing remote-nodes works. Are you certain your fencing devices can
  handle
  fencing the remote-node? Fencing a remote-node requires a cluster node to
  invoke the agent that actually performs the fencing action on the
  remote-node.
 
  Yes, if I invoke fencing action manually ('crm node fence rnode' in
  crmsh syntax), node is fenced. So the issue seems to be related to the
  detection of a need fencing.
 
  Comments in related git commits are a little bit terse in this area. So
  could you please explain what exactly needs to happen on a remote node
  to initiate fencing?
 
  I tried so far:
  * kill pacemaker_remoted when no resources are running. systemd restated
  it and crmd reconnected after some time.

This should definitely cause the remote-node to be fenced. I tested this
earlier today after reading you were having problems and my setup fenced
the remote-node correctly.

  * crash kernel when no resources are running

If a remote-node connection is lost and pacemaker was able to verify the
node is clean before the connection is lost, pacemaker will attempt to
reconnect to the remote-node without issuing a fencing request.

I could see why both fencing and not fencing in this situation could be desired.
Maybe i should make an option.

  * crash kernel during massive start of resources

This should definitely cause the remote node to be fenced.

  
  this last one should definitely cause fencing. What version of pacemaker
  are
  you using? I've made changes in this area recently. Can you provide a
  crm_report.
 
 It's c191bf3.
 crm_report is ready, but I still wait an approval from a customer to
 send it.

Great. I really need to see what you all are doing. Outside of my own setup I
have not seen many setups where pacemaker remote is deployed on baremetal nodes.
It is possible something in your configuration exposes some edge case I haven't
encountered yet.

There's a US holiday Thursday and Friday, so I won't be able to look at this
until next week.

-- Vossel

 
  
  -- David
  
 
   No fencing happened. In the last case the start actions 'hung' and were
  failed by timeout (it is rather long), node was not even listed as
  failed. My customer asked me to stop crashing nodes because one of them
  does not boot anymore (I like that modern UEFI hardware very much.),
  so it is hard for me to play more with that.
 
  Best,
  Vladislav
 
 
 
  -- Vossel
 
 
  Best,
  Vladislav
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
  
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Cluster-devel] [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015

2014-11-25 Thread David Vossel


- Original Message -
 
  On 25 Nov 2014, at 8:54 pm, Lars Marowsky-Bree l...@suse.com wrote:
  
  On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote:
  
  Yeah, well, devconf.cz is not such an interesting event for those who do
  not wear the fedora ;-)
  That would be the perfect opportunity for you to convert users to Suse ;)
  
   I'd prefer, at least for this round, to keep dates/location and explore
  the option to allow people to join remotely. Afterall there are tons of
  tools between google hangouts and others that would allow that.
  That is, in my experience, the absolute worst. It creates second class
  participants and is a PITA for everyone.
  I agree, it is still a way for people to join in tho.
  
  I personally disagree. In my experience, one either does a face-to-face
  meeting, or a virtual one that puts everyone on the same footing.
  Mixing both works really badly unless the team already knows each
  other.
  
  I know that an in-person meeting is useful, but we have a large team in
  Beijing, the US, Tasmania (OK, one crazy guy), various countries in
  Europe etc.
  Yes same here. No difference.. we have one crazy guy in Australia..
  
  Yeah, but you're already bringing him for your personal conference.
  That's a bit different. ;-)
  
  OK, let's switch tracks a bit. What *topics* do we actually have? Can we
  fill two days? Where would we want to collect them?
 
 Personally I'm interested in talking about scaling - with pacemaker-remoted
 and/or a new messaging/membership layer.

If we're going to talk about scaling, we should throw in our new docker support
in the same discussion. Docker lends itself well to the pet vs cattle analogy.
I see management of docker with pacemaker making quite a bit of sense now that
we have the ability to scale into the cattle territory.

 Other design-y topics:
 - SBD
 - degraded mode
 - improved notifications
 - containerisation of services (cgroups, docker, virt)
 - resource-agents (upstream releases, handling of pull requests, testing)

Yep, We definitely need to talk about the resource-agents.

 
 User-facing topics could include recent features (ie. pacemaker-remoted,
 crm_resource --restart) and common deployment scenarios (eg. NFS) that
 people get wrong.

Adding to the list, it would be a good idea to talk about deployment
integration testing, what's going on with the phd project, and why it's
important regardless of whether you're interested in what the project
functionally does.

-- Vossel

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] exportfs resource agent modifications

2014-11-17 Thread David Vossel


- Original Message -
 Hi there,
 
 we are using exportfs for building datastores for VMWare.
 
 After migrating 1 NFS resource to another node (after successful fencing
 e.g.),
 VMWare doesn't see that datastore until I manually fire _exportfs -f_ on the
 new cluster node.
 
 I tried to modify the resource agent itself like:
 
 247 restore_rmtab
 248
 249 ocf_log info File system exported
 250
 251 sleep 5 #added
 252
 253 ocf_run exportfs -f || exit $OCF_ERR_GENERIC #added
 254
 255 ocf_log info kernel table flushed #added
 256
 257 return $OCF_SUCCESS
 
 but this didn't do the trick.
 
 Does anyone has an idea how to resolve that issue?

HA NFS is tricky and requires a very specific resource startup/shutdown
order to work correctly.

Here's some information about use cases I test. At this point, the active/passive
use case is well understood. If you are able to, I would recommend modeling
deployments using the A/P use case guidelines.

HA NFS Active/Passive:
https://github.com/davidvossel/phd/blob/master/doc/presentations/nfs-ap-overview.pdf?raw=true
https://github.com/davidvossel/phd/blob/master/scenarios/nfs-active-passive.scenario

HA NFS Active/Active:
https://github.com/davidvossel/phd/blob/master/doc/presentations/nfs-aa-overview.pdf?raw=true
https://github.com/davidvossel/phd/blob/master/scenarios/nfs-active-active.scenario
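
The short version of the ordering those scenarios use: the export and the
floating IP come up last and go down first. As a rough pcs sketch (resource
names invented; the scenario files above are the authoritative reference):

# group members start left-to-right and stop in reverse, so the IP is released
# before the export is torn down on the old node
pcs resource group add nfs-group nfs-fs nfs-daemon nfs-export nfs-ip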

-- Vossel

 
 Cheers,
 Hauke
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] HA setup of MySQL service using Pacemaker/DRBD

2014-11-17 Thread David Vossel


- Original Message -
 Hi,
 
 I have a working 2 node HA setup running on CentOS 6.5 with a very simple
 Apache webserver with replicated index.html using DRBD 8.4. The setup is
 configured based on the Clusters from Scratch Edition 5 with Fedora 13.
 
 I now wish to replace Apache with a MySQL database, or just simply add it.
 How can I do so? I'm guessing the following:
 1 . Add MySQL service to the cluster with a crm configure primitive
 command. I'm not sure what the params should be though, e.g. the configfile.
 2. Set the same colocation/order rules.
 3. Create/initialize a separate DRBD partition for MySQL (or can I reuse the
 same partition as Apache assuming I'll never exceed its capacity?)
 4. Copy the database/table into the mounted DRBD partition.
 5. Configure the cluster for DRBD as per Chapter 7.4 of the guide.
 
 Is this correct? Step by step instructions would be appreciated, I have some
 experience in RHEL/CentOS but not in HA nor MySQL.

You've got the right idea. I don't have a step-by-step guide for crmsh, but
here's a basic MySQL deploy on shared storage that I test with pcs. I just
mount the shared storage partition to the /var/lib/mysql directory, then start
mysql. From there the database mysql uses lives on shared storage and can follow
the mysql instance wherever it goes in the cluster.

https://github.com/davidvossel/phd/blob/master/scenarios/mariadb-basic.scenario
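
A rough crm shell equivalent, assuming a shared device at /dev/drbd1 and the
default mysql paths (the parameter names should be checked against the
ocf:heartbeat:mysql meta-data):

crm configure primitive mysql-fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd1 directory=/var/lib/mysql fstype=ext4 \
    op start interval=0 timeout=60s op stop interval=0 timeout=60s
crm configure primitive mysql-db ocf:heartbeat:mysql \
    params config=/etc/my.cnf \
    op start interval=0 timeout=120s op stop interval=0 timeout=120s \
    op monitor interval=20s timeout=30s
crm configure group mysql-group mysql-fs mysql-db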

-- Vossel

 Thanks!
 
 --
 - Goi Sihan
 gois...@gmail.com
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] resource-stickiness not working?

2014-11-14 Thread David Vossel


- Original Message -
 Here is a simple Active/Passive configuration with a single Dummy resource
 (see end of message). The resource-stickiness default is set to 100. I was
 assuming that this would be enough to keep the Dummy resource on the active
 node as long as the active node stays healthy. However, stickiness is not
 working as I expected in the following scenario:
 
 1) The node testnode1, which is running the Dummy resource, reboots or
 crashes
 2) Dummy resource fails over to node testnode2
 3) testnode1 comes back up after reboot or crash
 4) Dummy resource fails back to testnode1
 
 I don't want the resource to failback to the original node in step 4. That is
 why resource-stickiness is set to 100. The only way I can get the resource
 to not to fail back is to set resource-stickiness to INFINITY. Is this the
 correct behavior of resource-stickiness? What am I missing? This is not what
 I understand from the documentation from clusterlabs.org. BTW, after reading
 various postings on fail back issues, I played with setting on-fail to
 standby, but that doesn't seem to help either. Any help is appreciated!

I agree, this is curious.

Can you attach a crm_report? Then we can walk through the transitions to
figure out why this is happening.
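
In the meantime, the allocation scores pacemaker is working from are visible on
a live cluster with something like:

crm_simulate -sL | grep dummy    # -s shows allocation scores, -L uses the live CIB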

-- Vossel

 Scott
 
 node testnode1
 node testnode2
 primitive dummy ocf:heartbeat:Dummy \
 op start timeout=180s interval=0 \
 op stop timeout=180s interval=0 \
 op monitor interval=60s timeout=60s migration-threshold=5
 xml rsc_location id=cli-prefer-dummy rsc=dummy role=Started
 node=testnode2 score=INFINITY/
 property $id=cib-bootstrap-options \
 dc-version=1.1.10-14.el6-368c726 \
 cluster-infrastructure=classic openais (with plugin) \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 stonith-action=reboot \
 no-quorum-policy=ignore \
 last-lrm-refresh=1413378119
 rsc_defaults $id=rsc-options \
 resource-stickiness=100 \
 migration-threshold=5
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Operation attribute change leads to resource restart

2014-11-14 Thread David Vossel


- Original Message -
 Hi!
 
 Just noticed that deletion of a trace_ra op attribute forces resource
 to be restarted (that RA does not support reload).
 
 Logs show:
 Nov 13 09:06:05 [6633] node01cib: info: cib_process_request:
 Forwarding cib_apply_diff operation for section 'all' to master
 (origin=local/cibadmin/2)
 Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: Diff:
 --- 0.641.96 2
 Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: Diff:
 +++ 0.643.0 98ecbda94c7e87250cf2262bf89f43e8
 Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: --
 /cib/configuration/resources/clone[@id='cl-test-instance']/primitive[@id='test-instance']/operations/op[@id='test-instance-start-0']/instance_attributes[@id='test-instance-start-0-instance_attributes']
 Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: +
 /cib:  @epoch=643, @num_updates=0
 Nov 13 09:06:05 [6633] node01cib: info: cib_process_request:
 Completed cib_apply_diff operation for section 'all': OK (rc=0,
 origin=node01/cibadmin/2, version=0.643.0)
 Nov 13 09:06:05 [6638] node01   crmd: info: abort_transition_graph:
 Transition aborted by deletion of
 instance_attributes[@id='test-instance-start-0-instance_attributes']:
 Non-status change (cib=0.643.0, source=te_update_diff:383,
 path=/cib/configuration/resources/clone[@id='cl-test-instance']/primitive[@id='test-instance']/operations/op[@id='test-instance-start-0']/instance_attributes[@id='test-instance-start-0-instance_attributes'],
 1)
 Nov 13 09:06:05 [6638] node01   crmd:   notice: do_state_transition:
 State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC
 cause=C_FSA_INTERNAL origin=abort_transition_graph]
 Nov 13 09:06:05 [6634] node01 stonith-ng: info: xml_apply_patchset:
 v2 digest mis-match: expected 98ecbda94c7e87250cf2262bf89f43e8,
 calculated 0b344571f3e1bb852e3d10ca23183688
 Nov 13 09:06:05 [6634] node01 stonith-ng:   notice: update_cib_cache_cb:
 [cib_diff_notify] Patch aborted: Application of an update diff failed
 (-206)
 ...
 Nov 13 09:06:05 [6637] node01pengine: info: check_action_definition:
 params:reload   parameters boot_directory=/var/lib/libvirt/boot
 config_uri=http://192.168.168.10:8080/cgi-bin/manage_config.cgi?action=%aamp;resource=%namp;instance=%i;
 start_vm=1 vlan_id_start=2 per_vlan_ip_prefix_len=24
 base_img=http://192.168.168.10:8080/pre45-mguard-virt.x86_64.default.qcow2;
 pool_name=default outer_phy=eth0 ip_range_prefix=10.101.0.0/16/
 Nov 13 09:06:05 [6637] node01pengine: info: check_action_definition:
 Parameters to test-instance:0_start_0 on rnode001 changed: was
 6f9eb6bd1f87a2b9b542c31cf1b9c57e vs. now 02256597297dbb42aadc55d8d94e8c7f
 (reload:3.0.9) 0:0;41:3:0:95e66b6a-a190-4e61-83a7-47165fb0105d
 ...
 Nov 13 09:06:05 [6637] node01pengine:   notice: LogActions: Restart
 test-instance:0 (Started rnode001)
 
 That is not what I'd expect to see.

Any time an instance attribute is changed for a resource, the resource is 
restarted/reloaded.
This is expected.
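As a side note, the trace_ra attribute in question is usually added and removed through the crm shell's trace commands; a rough sketch, assuming a reasonably recent crmsh and using the resource and operation from this thread:

  # crm resource trace test-instance start
  # crm resource untrace test-instance start

Removing the trace attribute is a change to the operation's instance attributes, which is why the restart above is triggered for agents without reload support.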

-- Vossel

 Is it intended or just a minor bug(s)?
 
 Best,
 Vladislav
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] resource-discovery question

2014-11-13 Thread David Vossel


- Original Message -
 12.11.2014 22:57, David Vossel wrote:
  
  
  - Original Message -
  12.11.2014 22:04, Vladislav Bogdanov wrote:
  Hi David, all,
 
  I'm trying to get resource-discovery=never working with cd7c9ab, but
  still
  get Not installed probe failures from nodes which does not have
  corresponding resource agents installed.
 
  The only difference in my location constraints comparing to what is
  committed in #589
  is that they are rule-based (to match #kind). Is that supposed to work
  with
  the
  current master or still TBD?
 
  Yep, after I modified constraint to a rule-less syntax, it works:
  
  ahh, good catch. I'll take a look!
  
 
   <rsc_location id="vlan003-on-cluster-nodes" rsc="vlan003" score="-INFINITY"
                 node="rnode001" resource-discovery="never"/>
 
  But I'd prefer to that killer feature to work with rules too :)
  Although resource-discovery=exclusive with score 0 for multiple nodes
  should probably
  also work for me, correct?
  
  yep it should.
  
  I cannot test that on a cluster with one cluster
  node and one
  remote node.
  
  this feature should work the same with remote nodes and cluster nodes.
  
  I'll get a patch out for the rule issue. I'm also pushing out some
  documentation
  for the resource-discovery option. It seems like you've got a good handle
  on it
  already though :)
 
 Oh, I see new pull-request, thank you very much!
 
 One side question: Is default value for clone-max influenced by
 resource-discovery value(s)?

kind of.

with 'exclusive' if the number of nodes in the exclusive set is smaller
than clone-max, clone-max is effectively reduced to the node count in
the exclusive set.

'never' and 'always' do not directly influence resource placement, only
'exclusive'
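For reference, the 'exclusive' variant mentioned above could look roughly like the following, with score 0 on each node in the set (only rnode001 is taken from this thread; the other node name is a placeholder):

  <rsc_location id="vlan003-ex-1" rsc="vlan003" score="0" node="rnode001" resource-discovery="exclusive"/>
  <rsc_location id="vlan003-ex-2" rsc="vlan003" score="0" node="cluster1" resource-discovery="exclusive"/>

With only these constraints, probes and placement are limited to the listed nodes, which is also why clone-max is effectively capped as described above.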


 
 
  
 
  My location constraints look like:
 
   <rsc_location id="vlan003-on-cluster-nodes" rsc="vlan003"
                 resource-discovery="never">
     <rule score="-INFINITY" id="vlan003-on-cluster-nodes-rule">
       <expression attribute="#kind" operation="ne" value="cluster"
                   id="vlan003-on-cluster-nodes-rule-expression"/>
     </rule>
   </rsc_location>
 
  Do I miss something?
 
  Best,
  Vladislav
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 
 
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] resource-discovery question

2014-11-12 Thread David Vossel


- Original Message -
 12.11.2014 22:04, Vladislav Bogdanov wrote:
  Hi David, all,
  
  I'm trying to get resource-discovery=never working with cd7c9ab, but
  still
  get Not installed probe failures from nodes which does not have
  corresponding resource agents installed.
  
  The only difference in my location constraints comparing to what is
  committed in #589
  is that they are rule-based (to match #kind). Is that supposed to work with
  the
  current master or still TBD?
 
 Yep, after I modified constraint to a rule-less syntax, it works:

ahh, good catch. I'll take a look!

 
  <rsc_location id="vlan003-on-cluster-nodes" rsc="vlan003" score="-INFINITY"
                node="rnode001" resource-discovery="never"/>
 
 But I'd prefer to that killer feature to work with rules too :)
 Although resource-discovery=exclusive with score 0 for multiple nodes
 should probably
 also work for me, correct?

yep it should.

 I cannot test that on a cluster with one cluster
 node and one
 remote node.

this feature should work the same with remote nodes and cluster nodes.

I'll get a patch out for the rule issue. I'm also pushing out some documentation
for the resource-discovery option. It seems like you've got a good handle on it
already though :)

  
  My location constraints look like:
  
    <rsc_location id="vlan003-on-cluster-nodes" rsc="vlan003"
                  resource-discovery="never">
      <rule score="-INFINITY" id="vlan003-on-cluster-nodes-rule">
        <expression attribute="#kind" operation="ne" value="cluster"
                    id="vlan003-on-cluster-nodes-rule-expression"/>
      </rule>
    </rsc_location>
  
  Do I miss something?
  
  Best,
  Vladislav
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
  
 
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] #kind eq container matches bare-metal nodes

2014-10-24 Thread David Vossel


- Original Message -
 23.10.2014 22:39, David Vossel wrote:
  
  
  - Original Message -
  21.10.2014 06:25, Vladislav Bogdanov wrote:
  21.10.2014 05:15, Andrew Beekhof wrote:
 
  On 20 Oct 2014, at 8:52 pm, Vladislav Bogdanov bub...@hoster-ok.com
  wrote:
 
  Hi Andrew, David, all,
 
  It seems like #kind was introduced before bare-metal remote node
  support, and now it is matched against cluster and container.
  Bare-metal remote nodes match container (they are remote), but
  strictly speaking they are not containers.
  Could/should that attribute be extended to the bare-metal use case?
 
  Unclear, the intent was 'nodes that aren't really cluster nodes'.
  Whats the usecase for wanting to tell them apart? (I can think of some,
  just want to hear yours)
 
  I want VM resources to be placed only on bare-metal remote nodes.
  -inf: #kind ne container looks a little bit strange.
   #kind ne remote would be more descriptive (now that they are listed in the CIB
   with the 'remote' type).
 
  One more case (which is what I'd like to use in the mid-future) is a
  mixed remote-node environment, where VMs run on bare-metal remote nodes
  using storage from cluster nodes (f.e. sheepdog), and some of that VMs
  are whitebox containers themselves (they run services controlled by
  pacemaker via pacemaker_remoted). Having constraint '-inf: #kind ne
  container' is not enough to not try to run VMs inside of VMs - both
  bare-metal remote nodes and whitebox containers match 'container'.
  
  remember, you can't run remote-nodes nested within remote-nodes... so
  container nodes on baremetal remote-nodes won't work.
 
 Good to know, thanks.
 That imho should go into the documentation in bold red :)

yep, I'm seeing that now.

  Is that a conceptual limitation or is it just not yet supported?

I'm not sure yet. Nested pacemaker_remote is complex. Perhaps I'll find a clever
way of doing it at some point in the future. Right now all my solutions are
too complex to be useful, which is why the limitation exists.
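For context, the kind of rule-based placement discussed above can be written in crm shell roughly as follows (the resource name vm1 is a placeholder; the expression is the one Vladislav quotes, which currently matches both container and bare-metal remote nodes):

  location vm1-on-remote-nodes vm1 rule -inf: #kind ne cluster

This keeps vm1 off cluster nodes, so it can only be placed on remote nodes.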

-- David

 
  
  You don't have to be careful about not messing this up or anything.
  You can mix container nodes and baremetal remote-nodes and everything
  should
  work fine. The policy engine will never allow you to place a container node
  on a baremetal remote-node though.
  
  -- David
  
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
  
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker-remote with KVM - start timeout not working

2014-09-15 Thread David Vossel


- Original Message -
 Hi!
 
 I guess it would be better to start a separate thread on this.
 
 I have a VM with pacemaker-remote installed.
 
 Stack: cman
 Current DC: wings1 - partition with quorum
 Version: 1.1.10-14.el6-368c726
 3 Nodes configured
 2 Resources configured
 
 
 Online: [ oracle-test:vm-oracle-test wings1 wings2 ]

The remote-node in this case is named 'oracle-test'. The remote-node's container
resource is 'vm-oracle-test'. Internally pacemaker makes a connection resource
named after the remote-node. That resource represents the pacemaker_remote
connection.

Kind of confusing, I know. Here's the point: the connection resource 'oracle-test'
is what is timing out here, not the vm itself. By default the connection resource
has a 60 second timeout. If you want to increase that timeout, use the
remote-connect-timeout resource metadata option. You don't have to fully
understand how all this works, just know that the remote-connect-timeout option
needs to be greater than the time it takes for the virtual machine to fully
initialize.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options
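A rough sketch of raising that timeout with pcs, assuming the meta attribute name above and using the resource from this thread (the 300s value is illustrative):

  # pcs resource meta vm-oracle-test remote-connect-timeout=300s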

Hope that helps!

-- Vossel

 
 vm-oracle-test (ocf::heartbeat:VirtualDomain): Started wings2
 
 2 resources configured...
 
 However,
 
 # pcs resource show
 vm-oracle-test (ocf::heartbeat:VirtualDomain): Started
 
 As I understand it, pacemaker treats pacemaker-remote on the VM as some sort
 of 'virtual resource' (called 'oracle-test' in my case), since I have only
 one 'primitive' section (VirtualDomain) in my CIB.
 
 Well, the problem is here:
 
 Sep 15 12:28:13 wings1 crmd[13553]: error: process_lrm_event: LRM operation
 oracle-test_start_0 (8397) Timed Out (timeout=60000ms)
 Sep 15 12:28:13 wings1 crmd[13553]: warning: status_from_rc: Action 7
 (oracle-test_start_0) on wings1 failed (target: 0 vs. rc: 1): Error
 Sep 15 12:28:13 wings1 crmd[13553]: warning: update_failcount: Updating
 failcount for oracle-test on wings1 after failed start: rc=1
 (update=INFINITY, time=1
 410769693)
 
 Timeout is 60 seconds! Even though I have:
 
 <primitive class="ocf" id="vm-oracle-test" provider="heartbeat" type="VirtualDomain">
   <instance_attributes id="vm-oracle-test-instance_attributes">
     <nvpair id="vm-oracle-test-instance_attributes-hypervisor" name="hypervisor"
             value="qemu:///system"/>
     <nvpair id="vm-oracle-test-instance_attributes-config" name="config"
             value="/etc/libvirt/qemu/oracle-test.xml"/>
   </instance_attributes>
   <operations>
     <op id="vm-oracle-test-monitor-interval-60s" interval="60s" name="monitor"/>
     <op id="vm-oracle-test-start-timeout-300s-interval-0s-on-fail-restart"
         interval="0s" name="start" on-fail="restart" timeout="300s"/>
     <op id="vm-oracle-test-stop-timeout-60s-interval-0s-on-fail-block"
         interval="0s" name="stop" on-fail="block" timeout="60s"/>
   </operations>
 
 Moreover, VirtualDomain RA has this:
 
 <actions>
   <action name="start" timeout="90"/>
   <action name="stop" timeout="90"/>
   <action name="status" depth="0" timeout="30" interval="10"/>
   <action name="monitor" depth="0" timeout="30" interval="10"/>
   <action name="migrate_from" timeout="60"/>
   <action name="migrate_to" timeout="120"/>
 
 
 My VM is unable to start in 60 seconds. What could be done here?
 
 --
 Best regards,
 Alexandr A. Alexandrov
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Notification when a node is down

2014-09-15 Thread David Vossel


- Original Message -
 Hi,
 
 Is there any way for a Pacemaker/Corosync/PCS setup to send a notification
 when it detects that a node in a cluster is down? I read that Pacemaker and
 Corosync logs events to syslog, but where is the syslog file in CentOS? Do
 they log events such as a failover occurrence?

This might be a useful reference.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm207039249856
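On CentOS 6, syslog output from pacemaker and corosync typically lands in /var/log/messages. For active notification, one common approach is the ClusterMon resource agent driving an external script; a rough sketch in pcs syntax, assuming your crm_mon build supports the external-agent option and with a hypothetical script path:

  # pcs resource create cluster-notify ocf:pacemaker:ClusterMon \
        extra_options="-E /usr/local/bin/cluster_notify.sh" --clone

The script then receives details of each event (node, resource, operation, status) via environment variables set by crm_mon.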

-- Vossel

 Thanks.
 
 --
 - Goi Sihan
 gois...@gmail.com
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-remote location constraint

2014-09-09 Thread David Vossel


- Original Message -
 Is it possible to put a location constraint on a resource such that it will
 only run on a pacemaker-remote node? Or vice-versa, so that a resource will
 not run on a pacemaker-remote node?
 
 At a glance, this doesn't seem possible as the pacemaker-remote node does not
 exist as a node entry in the CIB, so there's nothing to match on. Is it
 possible to match on the absence of that node entry?
 
 The only other way I can think of doing this is to set utilization
 attributes, such that the remote nodes provide a remote utilization
 attribute, and configure the resource such that it needs 1 unit of remote.

There is definitely a way to do this already. I can't remember how though.

Andrew, I know we discussed this a few months ago and you immediately rattled
off how we allow this.  Do you remember?

-- Vossel
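One approach that comes up elsewhere in this archive (see the '#kind eq container' thread above) is a rule on the #kind node attribute; a crm shell sketch with a placeholder resource name:

  location app-on-cluster-only app-rsc rule -inf: #kind ne cluster

This would keep app-rsc off pacemaker-remote nodes; inverting the expression keeps it off cluster nodes instead.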




 -Patrick
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-remote container as a clone resource

2014-09-09 Thread David Vossel


- Original Message -
 I'm interested in creating a resource that will control host containers
 running pacemaker-remote. The catch is that I want this resource to be a
 clone, so that if I want more containers, I simply increase the `clone-max`
 property. The problem is that the `remote-node` meta parameter is set on the
 clone resource, and not the children. So I can't see a way to tell pacemaker
 the address of all the clone children containers that get created.
 The only way I can see something like this being possible is if the resource
 agent set an attribute on the clone child when the child was started.
 
 Is there any way to accomplish this? Or will I have to create separate
 resources for every single container? If this isn't possible, would this be
 considered as a future feature?

I've been keeping up with this thread and just wanted to give my thoughts.

First, let's forget the clone part initially. Clones and pacemaker_remote don't
mix. I'm not convinced it is advantageous for us to take that conversation much
further right now. Perhaps after we start testing this next part we can come back
to the cloned remote-node discussion.

The interesting part here is that you want to define a remote-node that doesn't
have an address. I see this as being highly beneficial. For instance, you want 
to
make a docker container a remote-node, but docker assigns a random IP to the 
container
every time it starts... Right now there'd be no (feasible) way to make the 
docker
container a remote-node because there's no static IP associated with the 
container.

I think you are dead on about how this should be done. We need a way for the
container's resource-agent to update pacemaker, via an attribute, with the address
of the container after it has started. From there pacemaker uses that address
when attempting to connect to the container's pacemaker_remote instance.
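For comparison, today a remote-node container needs its address known up front, e.g. something like the following pcs sketch for a VM-based remote node (all names, paths and the address are placeholders):

  # pcs resource create vm-guest1 ocf:heartbeat:VirtualDomain \
        hypervisor="qemu:///system" config="/etc/libvirt/qemu/guest1.xml" \
        meta remote-node=guest1 remote-addr=192.168.122.50

The attribute-update idea above would remove the need to hard-code remote-addr for containers that get a dynamic IP.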

I've made an issue in the Red Hat bug tracker related to this so I'll remember
to do it.
https://bugzilla.redhat.com/show_bug.cgi?id=1139843


-- Vossel






 
 Thanks
 
 -Patrick
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Libqb v0.17.1 release candidate

2014-08-22 Thread David Vossel
Hey,

I'm gearing up for a new libqb release.

https://github.com/ClusterLabs/libqb/releases/tag/v0.17.1-rc1

If you weren't aware, libqb is the library used for ipc, logging, and event loops
in Pacemaker/Corosync.

This release addresses the bug fixes that have been submitted since the v0.17.0
release. If you have any patches you want to get into the next release, let me
know as soon as possible. Assuming we don't discover any issues during release
candidate testing, rc1 will become the 0.17.1 release mid next week.

Thanks,
-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] On recovery of failed node, pengine fails to correctly monitor 'dirty' resources

2014-08-12 Thread David Vossel
, Skipped=0, Incomplete=0,
 Source=/var/lib/pacemaker/pengine/pe-input-15.bz2): Complete
 Aug 12 11:28:14 ti14-demo1 crmd[3147]:   notice: do_state_transition: State
 transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS
 cause=C_FSA_INTERNAL origin=notify_crmd ]
 
 
 
 On Monday, August 11, 2014 4:06 PM, David Vossel dvos...@redhat.com wrote:
 
 
 
 
 
 - Original Message -
  Greetings,
  
  We are using pacemaker and cman in a two-node cluster with
  no-quorum-policy:
  ignore and stonith-enabled: false on a Centos 6 system (pacemaker related
  RPM versions are listed below). We are seeing some bizarre (to us) behavior
  when a node is fully lost (e.g. reboot -nf ). Here's the scenario we have:
  
  1) Fail a resource named some-resource started with the
  ocf:heartbeat:anything script (or others) on node01 (in our case, it's a
  master/slave resource we're pulling observations from, but it can happen on
  normal ones).
  2) Wait for Resource to recover.
  3) Fail node02 (reboot -nf, or power loss)
  4) When node02 recovers, we see in /var/log/messages:
  - Quorum is recovered
  - Sending flush op to all hosts for master-some-resource,
  last-failure-some-resource, probe_complete(true),
  fail-count-some-resource(1)
  - pengine Processing failed op monitor for some-resource on node01: unknown
  error (1)
   * After adding a simple `date` call logging $@ to /tmp/log.rsc, we do not
   see the resource agent being called at this time, on either node.
  * Sometimes, we see other operations happen that are also not being sent to
  the RA, including stop/start
   * The resource is actually happily running on node01 throughout this whole
   process, so there's no reason we should be seeing this failure here.
  * This issue is only seen on resources that had not yet been cleaned up.
  Resources that were 'clean' when both nodes were last online do not have
  this issue.
  
  We noticed this originally because we are using the ClusterMon RA to report
  on different types of errors, and this is giving us false positives. Any
  thoughts on configuration issues we could be having, or if this sounds like
  a bug in pacemaker somewhere?
 
 This is likely a bug in whatever resource-agent you are using.  There's no
 way
 for us to know for sure without logs.
 
 -- Vossel
 
 
  
  Thanks!
  
  
  Versions:
  ccs-0.16.2-69.el6_5.1.x86_64
  clusterlib-3.0.12.1-59.el6_5.2.x86_64
  cman-3.0.12.1-59.el6_5.2.x86_64
  corosync-1.4.1-17.el6_5.1.x86_64
  corosynclib-1.4.1-17.el6_5.1.x86_64
  fence-virt-0.2.3-15.el6.x86_64
  libqb-0.16.0-2.el6.x86_64
  modcluster-0.16.2-28.el6.x86_64
  openais-1.1.1-7.el6.x86_64
  openaislib-1.1.1-7.el6.x86_64
  pacemaker-1.1.10-14.el6_5.3.x86_64
  pacemaker-cli-1.1.10-14.el6_5.3.x86_64
  pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
  pacemaker-libs-1.1.10-14.el6_5.3.x86_64
  pcs-0.9.90-2.el6.centos.3.noarch
  resource-agents-3.9.2-40.el6_5.7.x86_64
  ricci-0.16.2-69.el6_5.1.x86_64
  
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Configuration recommandations for (very?) large cluster

2014-08-12 Thread David Vossel


- Original Message -
 On 12/08/14 07:52, Andrew Beekhof wrote:
  On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute
  cedric.duf...@idiap.ch wrote:
 
  Hello,
 
  Thanks to Pacemaker 1.1.12, I have been able to setup a (very?) large
  cluster:
  Thats certainly up there as one of the biggest :)
 
 Well, actually, I sized it down from 444 to 277 resources by merging
 'VirtualDomain' and 'MailTo' RA/primitives into a custom/single
 'LibvirtQemu' one.
 CIB is now ~3MiB uncompressed / ~100kiB compressed. (also avoids the
 informational-only 'MailTo' RA to come burden the cluster)
 'PCMK_ipc_buffer' at 2MiB might be overkill now... but I'd rather stay on the
 safe side.
 
 Q: Are there adverse effects in keeping 'PCMK_ipc_buffer' high?

More system memory will be required for ipc connections. Unless you're running
low on ram, you should be fine with the buffer you set.
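For reference, a sketch of how the buffer is typically raised, assuming a sysconfig-style environment file (on Debian-based systems the path is usually /etc/default/pacemaker; the value is in bytes):

  # /etc/sysconfig/pacemaker
  PCMK_ipc_buffer=2097152

The pacemaker daemons need to be restarted to pick the new value up.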

 277 resources are:
  - 22 (cloned) network-health (ping) resources
  - 88 (cloned) stonith resources (I have 4 stonith devices)
  - 167 LibvirtQemu resources (83 general-purpose servers and 84 SGE-driven
  computation nodes)
 (and more LibvirtQemu resources are expected to come)
 
  Have you checked pacemaker's CPU usage during startup/failover?  I'd be
  interested in your results.
 
 I finally set 'batch-limit' to 22 (the number of nodes), as it makes
 sense when enabling a new primitive, since all monitor operations get
 dispatched immediately to all nodes at once.
 
 When bringing a standby node to life:
 
  - On the waking node (E5-2690v2): 167+5 resources monitoring operations
  get dispatched; the CPU load of the 'cib' process remains below 100% as the
  operations are executed, batched by 22 (though one can not see that
  batching, the monitoring operations succeeding very quickly), and
  complete in ~2 seconds. With Pacemaker 1.1.7, the 'cib' load would have
  peaked to 100% even before the first monitoring operation started (because
  of the CIB refresh, I guess) and would remain so for several tens of
  seconds (often resulting in timeouts and monitoring operations failure)
 
  - On the DC node (E5-2690v2): the CPU would also remain below 100%,
  alternating between the 'cib', 'pengine' and 'crmd' process. The DC is back
  to IDLE within ~4 seconds.
 
 I tried raising the 'batch-limit' to 50 and witnessed CPU load peaking at
 100% while carrying out the same procedure, but all went well nonetheless.
 
 While I still had the ~450 resources, I also accidentally brought all 22
 nodes back to life together (well, actually started the DC alone and then
 started the remaining 21 nodes together). As could be expected, the DC got
 quite busy (dispatching/executing the ~450*22 monitoring operations on all
 nodes). It took 40 minutes for the cluster to stabilize. But it did
 stabilize, with no timeouts and no monitor operation failures! A few "high
 CIB load detected / throttle down mode" messages popped up but all went
 well.
 
 Q: Is there a way to favor more powerful nodes for the DC (in other words, push
 the DC election process in a preferred direction)?
 
 
  Last updated: Mon Aug 11 13:40:14 2014
  Last change: Mon Aug 11 13:37:55 2014
  Stack: classic openais (with plugin)
  I would at least try running it with corosync 2.x (no plugin)
  That will use CPG for messaging which should perform even better.
 
 I'm running into a deadline now and will have to stick to 1.4.x for the
 moment. But as soon as I can free an old test Intel modular chassis I have
 around, I'll try backporting Coro 2.x from Debian/Experimental to
 Debian/Wheezy and see what gives.
 
 
  Current DC: bc1hx5a05 - partition with quorum
  Version: 1.1.12-561c4cf
  22 Nodes configured, 22 expected votes
  444 Resources configured
 
  PS: 'corosync' (1.4.7) traffic goes through a 10GbE network, with strict
  QoS priority over all other traffic.
 
  Are there recommended configuration tweaks I should not miss in such
  situation?
 
  So far, I have:
  - Raised the 'PCMK_ipc_buffer' size to 2MiB
  - Lowered the 'batch-limit' to 10 (though I believe my setup could sustain
  the default 30)
  Yep, definitely worth trying the higher value.
  We _should_ automatically start throttling ourselves if things get too
  intense.
 
 Yep. As mentioned above, I did see "high CIB load detected / throttle down
 mode" messages pop up. Is this what you are thinking of?
 
 
  Other than that, I would be making sure all the corosync.conf timeouts and
  other settings are appropriate.
 
 Never paid much attention to it so far. But it seems to me the Debian
 defaults are quite conservative, especially more so given my 10GbE (~0.2ms
 latency) interconnect and the care I took in prioritizing Corosync traffic
 (thanks to switches QoS/GMB and Linux 'tc'):
 
 token: 3000
 token_retransmits_before_loss_const: 10
 join: 60
 consensus: 3600
 vsftype: none
 max_messages: 20
 secauth: off
 amf: disabled
 
 Am I right?
 
 PS: this work is being done within 

Re: [Pacemaker] On recovery of failed node, pengine fails to correctly monitor 'dirty' resources

2014-08-11 Thread David Vossel


- Original Message -
 Greetings,
 
 We are using pacemaker and cman in a two-node cluster with no-quorum-policy:
 ignore and stonith-enabled: false on a Centos 6 system (pacemaker related
 RPM versions are listed below). We are seeing some bizarre (to us) behavior
 when a node is fully lost (e.g. reboot -nf ). Here's the scenario we have:
 
 1) Fail a resource named some-resource started with the
 ocf:heartbeat:anything script (or others) on node01 (in our case, it's a
 master/slave resource we're pulling observations from, but it can happen on
 normal ones).
 2) Wait for Resource to recover.
 3) Fail node02 (reboot -nf, or power loss)
 4) When node02 recovers, we see in /var/log/messages:
 - Quorum is recovered
 - Sending flush op to all hosts for master-some-resource,
 last-failure-some-resource, probe_complete(true),
 fail-count-some-resource(1)
 - pengine Processing failed op monitor for some-resource on node01: unknown
 error (1)
  * After adding a simple `date` call logging $@ to /tmp/log.rsc, we do not
  see the resource agent being called at this time, on either node.
 * Sometimes, we see other operations happen that are also not being sent to
 the RA, including stop/start
  * The resource is actually happily running on node01 throughout this whole
 process, so there's no reason we should be seeing this failure here.
 * This issue is only seen on resources that had not yet been cleaned up.
 Resources that were 'clean' when both nodes were last online do not have
 this issue.
 
 We noticed this originally because we are using the ClusterMon RA to report
 on different types of errors, and this is giving us false positives. Any
 thoughts on configuration issues we could be having, or if this sounds like
 a bug in pacemaker somewhere?

This is likely a bug in whatever resource-agent you are using.  There's no way
for us to know for sure without logs.
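When a custom or third-party agent is suspected, it can be exercised outside the cluster with ocf-tester from the resource-agents package; a rough sketch (agent path and parameters are placeholders matching the ocf:heartbeat:anything example above):

  # ocf-tester -n some-resource -o binfile=/usr/local/bin/some-daemon \
        /usr/lib/ocf/resource.d/heartbeat/anything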

-- Vossel

 
 Thanks!
 
 
 Versions:
 ccs-0.16.2-69.el6_5.1.x86_64
 clusterlib-3.0.12.1-59.el6_5.2.x86_64
 cman-3.0.12.1-59.el6_5.2.x86_64
 corosync-1.4.1-17.el6_5.1.x86_64
 corosynclib-1.4.1-17.el6_5.1.x86_64
 fence-virt-0.2.3-15.el6.x86_64
 libqb-0.16.0-2.el6.x86_64
 modcluster-0.16.2-28.el6.x86_64
 openais-1.1.1-7.el6.x86_64
 openaislib-1.1.1-7.el6.x86_64
 pacemaker-1.1.10-14.el6_5.3.x86_64
 pacemaker-cli-1.1.10-14.el6_5.3.x86_64
 pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
 pacemaker-libs-1.1.10-14.el6_5.3.x86_64
 pcs-0.9.90-2.el6.centos.3.noarch
 resource-agents-3.9.2-40.el6_5.7.x86_64
 ricci-0.16.2-69.el6_5.1.x86_64
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Why location can't work as expected

2014-08-06 Thread David Vossel


- Original Message -
 Hello everyone:
 
 my tools version
 pacemaker: 1.1.10
 corosync: 1.4.5
 crmsh-2.0
 
 I have 2 nodes, node1 and node2. The resource Test must run on node1,
 and Test should not run on node2 if node1 is offline. So I do the following
 config:
 location TestOnNode1 Test INFINITY: node1
 
 If node1 and node2 are both online, Test runs on node1.
 But if node1 is offline, the resource Test is switched to node2.
 I think that doesn't obey my config. My question:
 Is that pacemaker's feature? or I have missing some config?
 
 I find the following config can work as expected:
 location TestOnNode1 Test -INFINITY: node2
 But I think that is not as direct.

Yes, this is expected behavior.

Take a look at the symmetric-cluster option and learn about opt-in clusters 
here.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_asymmetrical_opt_in_clusters
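A rough sketch of the opt-in approach in crm shell (which the original config uses): make the cluster asymmetric and keep only positive location constraints, so the existing TestOnNode1 constraint is enough on its own:

  crm configure property symmetric-cluster=false

With this, Test is allowed only on node1 and stays stopped if node1 is offline, without needing a -INFINITY constraint for node2.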

-- Vossel


 thanks
 --
 宝存科技 david
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] NFS concurrency

2014-08-05 Thread David Vossel


- Original Message -
 
 
 Hi,
 
 I’m working with SLES 11 SP3 and pacemaker 1.1.10-0.15.25
 
 
 
 I’m looking to define constraints that can allow multiple NFSV4
 filesystem/exports to be started concurrently (they belong to the same LVM).
 I also have multiple access points. My model looks like this:
 
 
 
                  |--- FS1 --- Exportfs1
                  |
                  |--- FS2 --- Exportfs2
                  |
                  |--- FSn --- Exportfsn
                  |
 rootfs --- LVM1 -|--- IP1
                  |
                  |--- IPn
 
 
 
 Essentially, rootfs starts first and then the LVM. After the LVM, I’m looking
 to start the filesystems and IPs next. Once each filesystem is started then
 the export that belongs to it.
 
 
 
 I’d like to do this since I’m using NFSv4 and the grace time and lease time
 (which I have set to 10 seconds) cause increasingly long stop times on failover.

you don't have to observe the lease time during stop if you stop the nfs server
first before doing the fs umount.

I used groups to do something similar to this in my NFS active-active scenario
and I don't wait for lease time during stop. After failover the scenario will
wait the grace period on the node the export moved to. The grace period should
be >= the lease time on the node the export moved from.

presentation:
https://github.com/davidvossel/phd/raw/master/doc/presentations/nfs-aa-overview.pdf

Sample scenario:
https://github.com/davidvossel/phd/blob/master/scenarios/nfs-active-active.scenario
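On the question of a more concise definition: if your crm shell supports resource sets in order constraints, the per-LVM ordering can be collapsed into something like the following sketch (names taken from the configuration below; note that a single set makes every export wait for all filesystems, so keep per-filesystem orders if that matters):

  order NFS-order-all inf: lvm1 ( fs1 fs2 ) ( exportfs1 exportfs2 )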


-- Vossel

 
 
 
 I managed to define the above using individual colocations and order
 constraints, but was wondering if there was a more concise definition that
 would work. My system may support many LVMs and many shares/exports per LVM,
 so manageability may get out of control.
 
 
 
 My constraints look like this:
 
 
 colocation c2 inf: ( fs1 fs2 ip1 ) lvm1
 
 colocation c3 inf: fs1 exportfs1
 
 colocation c4 inf: fs2 exportfs2
 
 order NFS-order1 inf: lvm1 fs1 exportfs1
 
 order NFS-order2 inf: lvm1 fs2 exportfs2
 
 order NFS-order3 inf: exportfs lvm1:start
 
 order NFS-order4 inf: lvm1 ip1
 
 
 
 The XML version is
 
 <rsc_colocation id="c2" score="INFINITY">
   <resource_set id="c2-0" sequential="false">
     <resource_ref id="fs1"/>
     <resource_ref id="fs2"/>
     <resource_ref id="ip1"/>
   </resource_set>
   <resource_set id="c2-1">
     <resource_ref id="lvm1"/>
   </resource_set>
 </rsc_colocation>
 <rsc_colocation id="c3" score="INFINITY" rsc="fs1" with-rsc="exportfs1"/>
 <rsc_colocation id="c4" score="INFINITY" rsc="fs2" with-rsc="exportfs2"/>
 <rsc_order id="nfssrv_order" score="INFINITY" first="nfsserver"
            then="root_exportfs"/>
 <rsc_order id="NFS-order3" score="INFINITY" first="root_exportfs" then="lvm1"
            then-action="start"/>
 <rsc_order id="NFS-order1" score="INFINITY">
   <resource_set id="NFS-order1-0">
     <resource_ref id="lvm1"/>
     <resource_ref id="fs1"/>
     <resource_ref id="exportfs1"/>
   </resource_set>
 </rsc_order>
 <rsc_order id="NFS-order2" score="INFINITY">
   <resource_set id="NFS-order2-0">
     <resource_ref id="lvm1"/>
     <resource_ref id="fs2"/>
     <resource_ref id="exportfs2"/>
   </resource_set>
 </rsc_order>
 <rsc_order id="NFS-order4" score="INFINITY" first="lvm1" then="ip1"/>
 
 
 
 Thanks,
 
 Diane Schaefer
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Announcing 1.1.12 - Final

2014-07-22 Thread David Vossel


- Original Message -
 I am pleased to report that 1.1.12 is finally done.

 This is a really great release and includes three key improvements:
 
 - ACLs are now on by default
 - pacemaker-remote now works for bare-metal nodes
 - Thanks to a new algorithm, the CIB is now two orders of magnitude faster.

Great work Andrew, those CIB improvements are insane!  Pacemaker is about to 
find its way into some deployments we never thought were possible.

   This means less CPU usage by the cluster itself and faster failover times.
 
 I will be building for Fedora shortly, those on other distros (and the
 impatient) can build their own rpm packages with the instructions below.
 
 1. Clone the current sources:
 
 # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
 # cd pacemaker
 
 2. Install dependencies (if you haven't already)
 
 [Fedora] # sudo yum install -y yum-utils
 [ALL] # make rpm-dep
 
 3. Build Pacemaker
 
 # make release
 
 4. Copy and deploy as needed
 
 
 Some stats for this release:
 
 - Changesets: 795
 - Diff:   195 files changed, 13772 insertions(+), 6176 deletions(-)
 
 Features added since Pacemaker-1.1.11
 
   • Changes to the ACL schema to support nodes and unix groups
   • cib: Check ACLs prior to making the update instead of parsing the diff
   afterwards
   • cib: Default ACL support to on
   • cib: Enable the more efficient xml patchset format
   • cib: Implement zero-copy status update
   • cib: Send all r/w operations via the cluster connection and have all 
 nodes
   process them
   • crmd: Set cluster-name property to corosync's cluster_name by 
 default
   for corosync-2
   • crm_mon: Display brief output if -b/--brief is supplied or 'b' is
   toggled
   • crm_report: Allow ssh alternatives to be used
   • crm_ticket: Support multiple modifications for a ticket in an atomic
   operation
   • extra: Add logrotate configuration file for /var/log/pacemaker.log
   • Fencing: Add the ability to call stonith_api_time() from stonith_admin
   • logging: daemons always get a log file, unless explicitly set to
   configured 'none'
   • logging: allows the user to specify a log level that is output to 
 syslog
   • PE: Automatically re-unfence a node if the fencing device definition
   changes
   • pengine: cl#5174 - Allow resource sets and templates for location
   constraints
   • pengine: Support cib object tags
   • pengine: Support cluster-specific instance attributes based on rules
   • pengine: Support id-ref in nvpair with optional name
   • pengine: Support per-resource maintenance mode
   • pengine: Support site-specific instance attributes based on rules
   • tools: Allow crm_shadow to create older configuration versions
   • tools: Display pending state in crm_mon/crm_resource/crm_simulate if
   --pending/-j is supplied (cl#5178)
   • xml: Add the ability to have lightweight schema revisions
   • xml: Enable resource sets in location constraints for 1.2 schema
   • xml: Support resources that require unfencing
 
 You can get the full details at
https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.12
 
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker unnecessarily (?) restarts a vm on active node when other node brought out of standby

2014-05-13 Thread David Vossel




- Original Message -
 From: Ian cl-3...@jusme.com
 To: Clusterlabs (pacemaker) mailing list pacemaker@oss.clusterlabs.org
 Sent: Monday, May 12, 2014 3:02:50 PM
 Subject: [Pacemaker] Pacemaker unnecessarily (?) restarts a vm on active node 
 when other node brought out of standby
 
 Hi,
 
 First message here and pretty new to this, so apologies if this is the
 wrong place/approach for this question. I'm struggling to describe this
 problem so searching for previous answers is tricky. Hopefully someone
 can give me a pointer...

does setting resource-stickiness help?

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options
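A minimal sketch of raising the default stickiness with pcs (the value is illustrative; it can also be set per resource via meta attributes):

  # pcs resource defaults resource-stickiness=100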

-- Vossel



 Brief summary:
 --
 
 Situation is: dual node cluster (CentOS 6.5), running drbd+gfs2 to
 provide active/active filestore for a libvirt domain (vm).
 
 With both nodes up, all is fine, active/active filesystem available on
 both nodes, vm running on node 1
 
 Place node 2 into standby, vm is unaffected. Good.
 
 Bring node 2 back online, pacemaker chooses to stop the vm and gfs on
 node 1 while it promotes drbd to master on node 2. Bad (not very HA!)
 
 Hopefully I've just got a constraint missing/wrong (or the whole
 structure!). I know there is a constraint linking the promotion of the
 drbd resource to the starting of the gfs2 filesystem, but I wouldn't
 expect this to trigger on node 1 in the above scenario as it's already
 promoted?
 
 
 Versions:
 -
 
 Linux sv07 2.6.32-431.11.2.el6.centos.plus.x86_64 #1 SMP Tue Mar 25
 21:36:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 
 pacemaker-libs-1.1.10-14.el6_5.2.x86_64
 pacemaker-cli-1.1.10-14.el6_5.2.x86_64
 pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
 pacemaker-1.1.10-14.el6_5.2.x86_64
 
 
 Configuration (abridged):
 -
 
 (I can provide full configs/logs if it isn't obvious and anyone cares to
 dig deeper)
 
 res_drbd_vm1 is the drbd resource
 vm_storage_core_dev is a group containing the drbd resources (just
 res_drbd_vm1 in this config)
 vm_storage_core_dev-master is the master/slave resource for drbd
 
 res_fs_vm1 is the gfs2 filesystem resource
 vm_storage_core is a group containing the gfs2 resources (just
 res_fs_vm1 in this config)
 vm_storage_core-clone is the clone resource to get the gfs2 filesystem
 active on both nodes
 
 res_vm_nfs_server is the libvirt domain (vm)
 
 (NB, The nfs filestore this server is sharing isn't from the gfs2
 filesystem, but another drbd volume that is always active/passive)
 
 
 # pcs resource show --full
   Master: vm_storage_core_dev-master
Meta Attrs: master-max=2 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true
Group: vm_storage_core_dev
 Resource: res_drbd_vm1 (class=ocf provider=linbit type=drbd)
  Attributes: drbd_resource=vm1
  Operations: monitor interval=60s (res_drbd_vm1-monitor-interval-60s)
 
   Clone: vm_storage_core-clone
Group: vm_storage_core
 Resource: res_fs_vm1 (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd/by-res/vm1 directory=/data/vm1
 fstype=gfs2 options=noatime,nodiratime
  Operations: monitor interval=60s (res_fs_vm1-monitor-interval-60s)
 
   Resource: res_vm_nfs_server (class=ocf provider=heartbeat
 type=VirtualDomain)
Attributes: config=/etc/libvirt/qemu/vm09.xml
Operations: monitor interval=60s
 (res_vm_nfs_server-monitor-interval-60s)
 
 
 # pcs constraint show
 
 Ordering Constraints:
promote vm_storage_core_dev-master then start vm_storage_core-clone
start vm_storage_core-clone then start res_vm_nfs_server
 
 Colocation Constraints:
vm_storage_core-clone with vm_storage_core_dev-master
 (rsc-role:Started) (with-rsc-role:Master)
res_vm_nfs_server with vm_storage_core-clone
 
 
 /var/log/messages on node 2 around the event:
 -
 
 May 12 19:23:02 sv07 attrd[3156]:   notice: attrd_trigger_update:
 Sending flush op to all hosts for: master-res_drbd_vm1 (1)
 May 12 19:23:02 sv07 attrd[3156]:   notice: attrd_perform_update: Sent
 update 1067: master-res_drbd_vm1=1
 May 12 19:23:02 sv07 crmd[3158]:   notice: do_state_transition: State
 transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC
 cause=C_FSA_INTERNAL origin=abort_transition_graph ]
 May 12 19:23:02 sv07 attrd[3156]:   notice: attrd_trigger_update:
 Sending flush op to all hosts for: master-res_drbd_live (1)
 May 12 19:23:02 sv07 attrd[3156]:   notice: attrd_perform_update: Sent
 update 1070: master-res_drbd_live=1
 May 12 19:23:02 sv07 pengine[3157]:   notice: LogActions: Promote
 res_drbd_vm1:0#011(Slave - Master sv07)
 May 12 19:23:02 sv07 pengine[3157]:   notice: LogActions: Restart
 res_fs_vm1:0#011(Started sv06)
 May 12 19:23:02 sv07 pengine[3157]:   notice: LogActions: Start
 res_fs_vm1:1#011(sv07)
 May 12 19:23:02 sv07 pengine[3157]:   notice: LogActions: Restart
 res_vm_nfs_server#011(Started sv06)
 May 12 19:23:02 sv07 

Re: [Pacemaker] pacemaker unmanage resource

2014-05-09 Thread David Vossel




- Original Message -
 From: emmanuel segura emi2f...@gmail.com
 To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org
 Sent: Friday, May 9, 2014 12:44:10 PM
 Subject: Re: [Pacemaker] pacemaker unmanage resource
 
 I found the monitor operation remains active if the resource is in the unmanaged
 state. My stupid question: what is the difference between monitor and
 unmanaged? Sorry, I looked around and didn't find any documentation about this.

monitor = checking to see if a resource is active or not.
unmanaged resource = a resource pacemaker will only perform status checks 
(monitors) on. These resources will not be stopped/started by pacemaker.
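If the recurring monitor itself needs to be silenced while a resource is unmanaged, the operation can be disabled as well; a rough sketch in pcs syntax, assuming pcs accepts operation options via update and using a placeholder resource name:

  # pcs resource unmanage my-rsc
  # pcs resource update my-rsc op monitor interval=60s enabled=false

Re-enabling the monitor and managing the resource again reverses both steps.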

-- Vossel

 
 2014-05-09 17:13 GMT+02:00 emmanuel segura  emi2f...@gmail.com  :
 
 
 
 Hello List,
 
 I would like to know if it's normal that pacemaker performs the monitor action
 while the resource is in the unmanaged state.

yes

 Thanks
 
 
 
 --
 esta es mi vida e me la vivo hasta que dios quiera
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker 1.1.12 - Release Candidate 1

2014-05-07 Thread David Vossel
- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, May 7, 2014 1:31:27 AM
 Subject: [Pacemaker] Pacemaker 1.1.12 - Release Candidate 1
 
 As promised, this announcement brings the first release candidate for
 Pacemaker 1.1.12
 
https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.12-rc1
 
 This release primarily focuses on important but mostly invisible changes
 under-the-hood:
 
 - The CIB is now O(2) faster.  That's 100x for those not familiar with Big-O
 notation :-)
 
   This has massively reduced the cluster's use of system resources,
   allowing us to scale further on the same hardware, and dramatically
   reduced failover times for large clusters.
 
 - Support for ACLs are is enabled by default.
 
   The new implementation can restrict cluster access for containers
   where pacemaker-remoted is used and is also more efficient.
 
 - All CIB updates are now serialized and pre-synchronized via the
   corosync CPG interface.  This makes it impossible for updates to be
   lost, even when the cluster is electing a new DC.
 
 - Schema versioning changes
 
   New features are no longer silently added to the schema.  Instead
   the ${Y} in pacemaker-${X}-${Y} will be incremented for simple
   additions, and ${X} will be bumped for removals or other changes
   requiring an XSL transformation.
 
   To take advantage of new features, you will need to updates all the
   nodes and then run the equivalent of `cibadmin --upgrade`.
 
 
 Thankyou to everyone that has tested out the new CIB and ACL code
 already.  Please keep those bug reports coming in!

Also,
This release introduces permanent remote-node attributes.  That feature was the 
last thing that functionally kept remote-nodes (nodes running pacemaker_remote) 
from behaving just like cluster-nodes.

With these new CIB improvements pacemaker scales incredibly well. Couple those
CIB changes with pacemaker's ability to manage remote-nodes and we now have the
ability to scale to clusters spanning hundreds, possibly thousands, of nodes.

Exciting stuff. Thanks for everyone's hard work. This community is great!  It's 
hard to believe how far pacemaker has come over the past few years.

-- Vossel



 List of known bugs to be investigating during the RC phase:
 
  - 5206  Fileencoding broken
  - 5194  A resource starts with a standby node. (Latest attrd does not serve as
          the crmd-transition-delay parameter)
  - 5197  Fail-over is delayed. (State transition is not calculated.)
  - 5139  Each node fenced in its own transition during start-up fencing
  - 5200  target node is over-utilized with allow-migrate=true
  - 5184  Pending probe left in the cib
  - 5187  -INFINITY colocation constraint not fully respected
  - 5165  Add support for transient node utilization attributes
 
 
 To build `rpm` packages for testing:
 
 1. Clone the current sources:
 
   # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
   # cd pacemaker
 
 2. Install dependencies (if you haven't already)
 
   [Fedora] # sudo yum install -y yum-utils
   [ALL]   # make rpm-dep
 
 3. Build Pacemaker
 
   # make rc
 
 4. Copy and deploy as needed
 
 
 ## Details
 
 Changesets: 633
 Diff: 184 files changed, 12690 insertions(+), 5843 deletions(-)
 
 ## Highlights
 
 ### Features added since Pacemaker-1.1.11
   + Changes to the ACL schema to support nodes and unix groups
   + cib: Check ACLs prior to making the update instead of parsing the diff
   afterwards
   + cib: Default ACL support to on
   + cib: Enable the more efficient xml patchset format
   + cib: Implement zero-copy status update (performance)
   + cib: Send all r/w operations via the cluster connection and have all
   nodes process them
   + crm_mon: Display brief output if -b/--brief is supplied or 'b' is
   toggled
   + crm_ticket: Support multiple modifications for a ticket in an atomic
   operation
   + Fencing: Add the ability to call stonith_api_time() from stonith_admin
   + logging: daemons always get a log file, unless explicitly set to
   configured 'none'
   + PE: Automatically re-unfence a node if the fencing device definition
   changes
   + pengine: cl#5174 - Allow resource sets and templates for location
   constraints
   + pengine: Support cib object tags
   + pengine: Support cluster-specific instance attributes based on rules
   + pengine: Support id-ref in nvpair with optional name
   + pengine: Support per-resource maintenance mode
   + pengine: Support site-specific instance attributes based on rules
   + tools: Display pending state in crm_mon/crm_resource/crm_simulate if
   --pending/-j is supplied (cl#5178)
   + xml: Add the ability to have lightweight schema revisions
   + xml: Enable resource sets in location constraints for 1.2 schema
   + xml: Support resources that require unfencing
 
 See 

Re: [Pacemaker] Feedback when crm appears to do nothing

2014-04-25 Thread David Vossel




- Original Message -
 From: Iain Buchanan iain...@gmail.com
 To: Pacemaker pacemaker@oss.clusterlabs.org
 Sent: Wednesday, April 23, 2014 2:59:48 AM
 Subject: [Pacemaker] Feedback when crm appears to do nothing
 
 Hi,
 
 I hope this is the right list for corosync/pacemaker questions - apologies if
 it is not.
 
 I'm running pacemaker 1.1.10 and corosync 2.3.0 under Ubuntu 12.04.
 
 Occasionally I send a command using crm such as crm resource move RESOURCE
 SERVER and absolutely nothing appears to happen. When this occurs is there
 a way of seeing why - was there a constraint that prevented the move etc.?
 There doesn't seem to be anything at the info level in the log.

are you by any chance managing Upstart resources? If so you need to update to 
1.1.11. There was a bug in 1.1.10 that caused the crmd to block indefinitely 
when upstart resources existed in the cib.  It had to do with how the crmd 
communicated with the upstart daemon over dbus. 
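Independent of that bug, a quick way to see why the policy engine left a resource where it is, is to dump the allocation scores; a sketch (RESOURCE is a placeholder):

  # crm_simulate -sL | grep RESOURCE

Scores of -INFINITY on the target node usually point at the constraint that is blocking the move.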

-- Vossel

 Iain
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Best practice for quorum nodes

2014-04-21 Thread David Vossel




- Original Message -
 From: Andrew Martin amar...@xes-inc.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Friday, April 18, 2014 9:38:45 AM
 Subject: [Pacemaker] Best practice for quorum nodes
 
 Hello,
 
 I've read several guides about how to configure a 3-node cluster with one
 node that can't actually run the resources, but just serves as a quorum
 node. One practice for configuring this node is to put it in standby,
 which prevents it from running resources. In my experience, this seems to
 work pretty well, however from time-to-time I see these errors appear in my
 pacemaker logs:
 Preventing rsc from re-starting on host: operation monitor failed 'not
 installed' (rc=5)
 
 Is there a better way to designate a node as a quorum node, so that resources
 do not attempt to start or re-start on it? Perhaps a combination of setting
 it in standby mode and a resource constraint to prevent the resources from
 running on it? Or, is there a better way to set it up?

Pacemaker is going to probe the node for active resources regardless. If you
don't want to see these errors, make sure the resource-agents package and the
packages your agents depend on are installed on all nodes in the cluster.
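
(For illustration only, a minimal sketch of the standby-plus-constraint
approach discussed above; the node name quorum1 and resource id my_rsc are
placeholders, and probes will still run on the node as noted:)

  # keep the quorum-only node from being scheduled to run resources
  crm_standby -U quorum1 -v on

  # optionally also pin a specific resource away from it (crm shell syntax)
  crm configure location no-my_rsc-on-quorum1 my_rsc -inf: quorum1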

-- Vossel

 Thanks,
 
 Andrew
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] RHEL/centos6 - pacemaker - checking value of PCMK_ipc_buffer

2014-04-21 Thread David Vossel




- Original Message -
 From: Nikola Ciprich nikola.cipr...@linuxbox.cz
 To: pacema...@clusterlabs.org
 Sent: Friday, April 18, 2014 12:47:35 AM
 Subject: [Pacemaker] RHEL/centos6 - pacemaker - checking value of 
 PCMK_ipc_buffer
 
 Hello,
 
 I've hit internal limit of PCMK_ipc_buffer on one of my cluster.
 Unfortunately I'm using corosync + plugin configuration (which I
 know is discouraged, so I'll switch this production cluster to CMAN
 ASAP), however, I tried setting PCMK_ipc_buffer on my test cluster
 already running on CMAN + pacemaker by setting /etc/sysconfig/pacemaker
 and checking environ values of cib and other processes /proc/???/environ
 files
 and don't see variables set there..
 
 Therefore my question is, can I somehow check the limits I set are really
 applied? So I can be sure I've set it correctly?
 
 Thanks a lot in advance for reply

What version of libqb and pacemaker are you using?
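
(To double-check what the running daemons actually inherited, a small sketch
along the lines of what you already tried; 'cib' is used here as the daemon
process name, adjust as needed:)

  # dump the environment of the running cib process and look for PCMK_ variables
  pid=$(pidof cib)
  tr '\0' '\n' < /proc/$pid/environ | grep PCMK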

 best regards
 
 nik
 
 
 
 --
 -
 Ing. Nikola CIPRICH
 LinuxBox.cz, s.r.o.
 28.rijna 168, 709 00 Ostrava
 
 tel.:   +420 591 166 214
 fax:+420 596 621 273
 mobil:  +420 777 093 799
 www.linuxbox.cz
 
 mobil servis: +420 737 238 656
 email servis: ser...@linuxbox.cz
 -
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmd does abort if a stopped node is specified

2014-04-21 Thread David Vossel




- Original Message -
 From: Kazunori INOUE kazunori.ino...@gmail.com
 To: pm pacemaker@oss.clusterlabs.org
 Sent: Friday, April 18, 2014 4:49:42 AM
 Subject: [Pacemaker] crmd does abort if a stopped node is specified
 
 Hi,
 
 crmd does abort if I load CIB which specified a stopped node.
 
 # crm_mon -1
 Last updated: Fri Apr 18 11:51:36 2014
 Last change: Fri Apr 18 11:51:30 2014
 Stack: corosync
 Current DC: pm103 (3232261519) - partition WITHOUT quorum
 Version: 1.1.11-cf82673
 1 Nodes configured
 0 Resources configured
 
 Online: [ pm103 ]
 
 # cat test.cli
 node pm103
 node pm104
 
 # crm configure load update test.cli
 
 Apr 18 11:52:42 pm103 crmd[11672]:error: crm_int_helper:
 Characters left over after parsing 'pm104': 'pm104'
 Apr 18 11:52:42 pm103 crmd[11672]:error: crm_abort: crm_get_peer:
 Triggered fatal assert at membership.c:420 : id  0 || uname != NULL
 Apr 18 11:52:42 pm103 pacemakerd[11663]:error: child_waitpid:
 Managed process 11672 (crmd) dumped core
 
 (gdb) bt
 #0  0x0033da432925 in raise () from /lib64/libc.so.6
 #1  0x0033da434105 in abort () from /lib64/libc.so.6
 #2  0x7f30241b7027 in crm_abort (file=0x7f302440b0b3
 membership.c, function=0x7f302440b5d0 crm_get_peer, line=420,
 assert_condition=0x7f302440b27e id  0 || uname != NULL, do_core=1,
 do_fork=0) at utils.c:1177
 #3  0x7f30244048ee in crm_get_peer (id=0, uname=0x0) at membership.c:420
 #4  0x7f3024402238 in crm_peer_uname (uuid=0x113e7c0 pm104) at

Is the UUID for your cluster nodes supposed to be the same as the uname? We're
treating the UUID in this situation as if it should be a number, which it
clearly is not.

-- Vossel


 cluster.c:386
 #5  0x0043afbd in abort_transition_graph
 (abort_priority=100, abort_action=tg_restart, abort_text=0x44d2f4
 Non-status change, reason=0x113e4b0, fn=0x44df07 te_update_diff,
 line=382) at te_utils.c:518
 #6  0x0043caa4 in te_update_diff (event=0x10f2240
 cib_diff_notify, msg=0x1137660) at te_callbacks.c:382
 #7  0x7f302461d1bc in cib_native_notify (data=0x10ef750,
 user_data=0x1137660) at cib_utils.c:733
 #8  0x0033db83d6bc in g_list_foreach () from /lib64/libglib-2.0.so.0
 #9  0x7f3024620191 in cib_native_dispatch_internal
 (buffer=0xe61ea8 notify t=\cib_notify\ subt=\cib_diff_notify\
 cib_op=\cib_apply_diff\ cib_rc=\0\
 cib_object_type=\diff\cib_generationgeneration_tuple epoch=\4\
 num_updates=\0\ admin_epoch=\0\ validate-with=\pacem...,
 length=1708, userdata=0xe5eb90) at cib_native.c:123
 #10 0x7f30241dee72 in mainloop_gio_callback (gio=0xf61ea0,
 condition=G_IO_IN, data=0xe601b0) at mainloop.c:639
 #11 0x0033db83feb2 in g_main_context_dispatch () from
 /lib64/libglib-2.0.so.0
 #12 0x0033db843d68 in ?? () from /lib64/libglib-2.0.so.0
 #13 0x0033db844275 in g_main_loop_run () from /lib64/libglib-2.0.so.0
 #14 0x00406469 in crmd_init () at main.c:154
 #15 0x004062b0 in main (argc=1, argv=0x7fff908829f8) at main.c:121
 
 Is this all right?
 
 Best Regards,
 Kazunori INOUE
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pcs resource create with one script

2014-04-16 Thread David Vossel




- Original Message -
 From: Dvorak Andreas andreas.dvo...@baaderbank.de
 To: pacemaker@oss.clusterlabs.org
 Sent: Wednesday, April 16, 2014 12:36:14 PM
 Subject: [Pacemaker] pcs resource create with one script
 
 
 
 Dear all,
 
 
 
 I want to create a resource with an own script, but it does not work.
 
 
 
 pcs resource create MYSQLFS ocf:baader:MYSQLFS op monitor interval=30s
 
 Error: unable to locate command: /usr/lib/ocf/resource.d/baader/MYSQLFS

Does the OCF script implement the meta-data action? I think pcs will throw
errors if it can't retrieve the metadata. You could try the --force option.
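
(A quick way to check that by hand, as a hedged sketch; the agent path is
taken from the message above:)

  # run the agent's meta-data action directly; pcs needs this to succeed
  OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/baader/MYSQLFS meta-data
  echo $?   # should print well-formed <resource-agent> XML above and exit 0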

-- Vossel

 
 But the script does exit.
 
 
 
 ls -l /usr/lib/ocf/resource.d/baader/MYSQLFS
 
 -rwxr-xr-x 1 root root 2548 Apr 16 19:16
 /usr/lib/ocf/resource.d/baader/MYSQLFS
 
 
 
 Can somebody please explain who to solve this problem?
 
 
 
 pcs status
 
 Cluster name: mysql-int-prod
 
 Last updated: Wed Apr 16 19:27:07 2014
 
 Last change: Fri Dec 13 15:54:04 2013 via crmd on sv2828-p1
 
 Stack: cman
 
 Current DC: sv2827-p1 - partition with quorum
 
 Version: 1.1.10-1.el6_4.4-368c726
 
 4 Nodes configured
 
 2 Resources configured
 
 
 
 Online: [ sv2827-p1 sv2828-p1 ]
 
 OFFLINE: [ sv2827 sv2828 ]
 
 
 
 Full list of resources:
 
 
 
 ipmi-fencing-sv2827 (stonith:fence_ipmilan): Started sv2827-p1
 
 ipmi-fencing-sv2828 (stonith:fence_ipmilan): Started sv2828-p1
 
 
 
 Best regards
 
 
 
 Andreas Dvorak
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] redhat 7 pacemaker compiled without acl

2014-04-04 Thread David Vossel




- Original Message -
 From: emmanuel segura emi2f...@gmail.com
 To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org
 Sent: Friday, April 4, 2014 1:12:26 PM
 Subject: [Pacemaker] redhat 7 pacemaker compiled without acl
 
 
 Hello List,
 
 I trying to install a virtual cluster using redhat 7 beta, the first thing i
 noticed is this.
 
 :::
 [root@localhost ~]# cibadmin -!
 Pacemaker 1.1.10-19.el7 (Build: 368c726): generated-manpages agent-manpages
 ascii-docs publican-docs ncurses libqb-logging libqb-ipc upstart systemd
 nagios corosync-native
 :::
 
 Pacemaker was compiled without acl options?
 if the answer is yes, why? redhat doens't support pacemaker acl?

There is a new pacemaker ACL implementation underway right now upstream. The
current pacemaker ACL support was held out of the RHEL 7 build with the
expectation of picking up the new implementation in a later RHEL release.

-- Vossel

 Thanks
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pcs and lsb resource

2014-03-21 Thread David Vossel
- Original Message -
 From: Dori Seliskar d...@delo.si
 To: pacemaker@oss.clusterlabs.org
 Sent: Friday, March 21, 2014 7:46:37 AM
 Subject: [Pacemaker] pcs and lsb resource
 
 Hi all,
 
 I'm trying to create a lsb resource with pcs on fedora 20 and I'm failing
 miserably every time: (have sucessfully created ocf resources though)
 
 # pcs resource create impulz_dosemu lsb:impulz op stop interval=0 timeout=60s
 monitor interval=60s timeout=5s start interval=0 timeout=60s
 Error: Unable to create resource 'lsb:impulz', it is not installed on this
 system (use --force to override)

This could be a pcs bug. If using --force works, this is definitely a pcs bug.
pcs is supposed to look through /etc/init.d/* on the local machine for valid
LSB scripts before adding an LSB resource to the cluster config. Maybe the
matching isn't working right in the Fedora 20 pcs package.
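
(As a sketch of the workaround, this is simply the original command from above
with --force appended; if it succeeds, that points at the detection bug:)

  pcs resource create impulz_dosemu lsb:impulz \
      op stop interval=0 timeout=60s monitor interval=60s timeout=5s \
      start interval=0 timeout=60s --force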

-- Vossel

 
 Manually starting/stoping resource from /etc/init.d works just fine and I
 tested the script for compatibility against
 http://www.linux-ha.org/wiki/LSB_Resource_Agents
 
 What am I missing here ?
 
 Thanks!
 
 Best regards,
 dori
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11

2014-03-18 Thread David Vossel
- Original Message -
 From: Kazunori INOUE kazunori.ino...@gmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Tuesday, March 18, 2014 12:30:01 AM
 Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
 
 2014-03-18 8:03 GMT+09:00 David Vossel dvos...@redhat.com:
 
  - Original Message -
  From: Kazunori INOUE kazunori.ino...@gmail.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Monday, March 17, 2014 4:51:11 AM
  Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
 
  2014-03-17 16:37 GMT+09:00 Kazunori INOUE kazunori.ino...@gmail.com:
   2014-03-15 4:08 GMT+09:00 David Vossel dvos...@redhat.com:
  
  
   - Original Message -
   From: Kazunori INOUE kazunori.ino...@gmail.com
   To: pm pacemaker@oss.clusterlabs.org
   Sent: Friday, March 14, 2014 5:52:38 AM
   Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
  
   Hi,
  
   When specifying the node name in UPPER case and performing
   crm_resource, crmd was aborted.
   (The real node name is a LOWER case.)
  
   https://github.com/ClusterLabs/pacemaker/pull/462
  
   does that fix it?
  
  
   Since behavior of glib is strange somehow, the result is NO.
   I tested this branch.
   https://github.com/davidvossel/pacemaker/tree/lrm-segfault
   * Red Hat Enterprise Linux Server release 6.4 (Santiago)
   * glib2-2.22.5-7.el6.x86_64
  
   strcase_equal() is not called from g_hash_table_lookup().
  
   [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
   ...snip...
   (gdb) b lrm.c:1232
   Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
   (gdb) b strcase_equal
   Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
   (gdb) c
   Continuing.
  
   Breakpoint 1, do_lrm_invoke (action=288230376151711744,
   cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
   msg_data=0x7fff8d679540) at lrm.c:1232
   1232lrm_state = lrm_state_find(target_node);
   (gdb) s
   lrm_state_find (node_name=0x1d4c650 X3650H) at lrm_state.c:267
   267 {
   (gdb) n
   268 if (!node_name) {
   (gdb) n
   271 return g_hash_table_lookup(lrm_state_table, node_name);
   (gdb) p g_hash_table_size(lrm_state_table)
   $1 = 1
   (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))-data
   $2 = 0x1c791a0 x3650h
   (gdb) p node_name
   $3 = 0x1d4c650 X3650H
   (gdb) n
   272 }
   (gdb) n
   do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
   cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
   at lrm.c:1234
   1234if (lrm_state == NULL  is_remote_node) {
   (gdb) n
   1240CRM_ASSERT(lrm_state != NULL);
   (gdb) n
  
   Program received signal SIGABRT, Aborted.
   0x003787e328a5 in raise () from /lib64/libc.so.6
   (gdb)
  
  
   I wonder why... so I will continue investigation.
  
  
 
  I read the code of g_hash_table_lookup().
  Key is compared by the hash value generated by crm_str_hash before
  strcase_equal() is performed.
 
  good catch. I've updated the patch in this pull request. Can you give it a
  go?
 
  https://github.com/ClusterLabs/pacemaker/pull/462
 
 fail-count is not cleared only in this.
 
 $ crm_resource -C -r p1 -N X3650H
 Cleaning up p1 on X3650H
 Waiting for 1 replies from the CRMd. OK
 
 $ grep fail-count /var/log/ha-log
 Mar 18 13:53:36 x3650g attrd[3610]:debug: attrd_client_message:
 Broadcasting fail-count-p1[X3650H] = (null)
 $
 
 $ crm_mon -rf1
 Last updated: Tue Mar 18 13:54:51 2014
 Last change: Tue Mar 18 13:53:36 2014 by hacluster via crmd on x3650h
 Stack: corosync
 Current DC: x3650h (3232261384) - partition with quorum
 Version: 1.1.10-83553fa
 2 Nodes configured
 1 Resources configured
 
 
 Online: [ x3650g x3650h ]
 
 Full list of resources:
 
  p1 (ocf::pacemaker:Dummy): Stopped
 
 Migration summary:
 * Node x3650h:
p1: migration-threshold=1 fail-count=1 last-failure='Tue Mar 18
 13:53:19 2014'
 * Node x3650g:
 $
 
 
 So this change also seems to be necessary.

Yep, I added your patch to the pull request:
https://github.com/davidvossel/pacemaker/commit/c118ac5b5244890c19e4c7b2f5a39208d362b61d

I found another one in stonith that I fixed.

https://github.com/ClusterLabs/pacemaker/pull/462

Are we good for merging this now?

-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11

2014-03-17 Thread David Vossel




- Original Message -
 From: Kazunori INOUE kazunori.ino...@gmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Monday, March 17, 2014 4:51:11 AM
 Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
 
 2014-03-17 16:37 GMT+09:00 Kazunori INOUE kazunori.ino...@gmail.com:
  2014-03-15 4:08 GMT+09:00 David Vossel dvos...@redhat.com:
 
 
  - Original Message -
  From: Kazunori INOUE kazunori.ino...@gmail.com
  To: pm pacemaker@oss.clusterlabs.org
  Sent: Friday, March 14, 2014 5:52:38 AM
  Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
 
  Hi,
 
  When specifying the node name in UPPER case and performing
  crm_resource, crmd was aborted.
  (The real node name is a LOWER case.)
 
  https://github.com/ClusterLabs/pacemaker/pull/462
 
  does that fix it?
 
 
  Since behavior of glib is strange somehow, the result is NO.
  I tested this branch.
  https://github.com/davidvossel/pacemaker/tree/lrm-segfault
  * Red Hat Enterprise Linux Server release 6.4 (Santiago)
  * glib2-2.22.5-7.el6.x86_64
 
  strcase_equal() is not called from g_hash_table_lookup().
 
  [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
  ...snip...
  (gdb) b lrm.c:1232
  Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
  (gdb) b strcase_equal
  Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
  (gdb) c
  Continuing.
 
  Breakpoint 1, do_lrm_invoke (action=288230376151711744,
  cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
  msg_data=0x7fff8d679540) at lrm.c:1232
  1232lrm_state = lrm_state_find(target_node);
  (gdb) s
  lrm_state_find (node_name=0x1d4c650 X3650H) at lrm_state.c:267
  267 {
  (gdb) n
  268 if (!node_name) {
  (gdb) n
  271 return g_hash_table_lookup(lrm_state_table, node_name);
  (gdb) p g_hash_table_size(lrm_state_table)
  $1 = 1
  (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))-data
  $2 = 0x1c791a0 x3650h
  (gdb) p node_name
  $3 = 0x1d4c650 X3650H
  (gdb) n
  272 }
  (gdb) n
  do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
  cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
  at lrm.c:1234
  1234if (lrm_state == NULL  is_remote_node) {
  (gdb) n
  1240CRM_ASSERT(lrm_state != NULL);
  (gdb) n
 
  Program received signal SIGABRT, Aborted.
  0x003787e328a5 in raise () from /lib64/libc.so.6
  (gdb)
 
 
  I wonder why... so I will continue investigation.
 
 
 
 I read the code of g_hash_table_lookup().
 Key is compared by the hash value generated by crm_str_hash before
 strcase_equal() is performed.

Good catch. I've updated the patch in this pull request. Can you give it a go?

https://github.com/ClusterLabs/pacemaker/pull/462


 
 *** This is quick-fix solution. ***
 
  crmd/lrm_state.c   |4 ++--
  include/crm/crm.h  |2 ++
  lib/common/utils.c |   11 +++
  3 files changed, 15 insertions(+), 2 deletions(-)
 
 diff --git a/crmd/lrm_state.c b/crmd/lrm_state.c
 index d20d74a..ae036fd 100644
 --- a/crmd/lrm_state.c
 +++ b/crmd/lrm_state.c
 @@ -234,13 +234,13 @@ lrm_state_init_local(void)
  }
 
  lrm_state_table =
 -g_hash_table_new_full(crm_str_hash, strcase_equal, NULL,
 internal_lrm_state_destroy);
 +g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL,
 internal_lrm_state_destroy);
  if (!lrm_state_table) {
  return FALSE;
  }
 
  proxy_table =
 -g_hash_table_new_full(crm_str_hash, strcase_equal, NULL,
 remote_proxy_free);
 +g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL,
 remote_proxy_free);
  if (!proxy_table) {
   g_hash_table_destroy(lrm_state_table);
  return FALSE;
 diff --git a/include/crm/crm.h b/include/crm/crm.h
 index b763cc0..46fe5df 100644
 --- a/include/crm/crm.h
 +++ b/include/crm/crm.h
 @@ -195,7 +195,9 @@ typedef GList *GListPtr;
  #  include crm/error.h
 
  #  define crm_str_hash g_str_hash_traditional
 +#  define crm_str_hash2 g_str_hash_traditional2
 
  guint g_str_hash_traditional(gconstpointer v);
 +guint g_str_hash_traditional2(gconstpointer v);
 
  #endif
 diff --git a/lib/common/utils.c b/lib/common/utils.c
 index 29d7965..50fa6c0 100644
 --- a/lib/common/utils.c
 +++ b/lib/common/utils.c
 @@ -2368,6 +2368,17 @@ g_str_hash_traditional(gconstpointer v)
 
  return h;
  }
 +guint
 +g_str_hash_traditional2(gconstpointer v)
 +{
 +const signed char *p;
 +guint32 h = 0;
 +
 +for (p = v; *p != '\0'; p++)
 +h = (h  5) - h + g_ascii_tolower(*p);
 +
 +return h;
 +}
 
  void *
  find_library_function(void **handle, const char *lib, const char *fn,
 gboolean fatal)
 
 
  # crm_resource -C -r p1 -N X3650H
  Cleaning up p1 on X3650H
  Waiting for 1 replies from the CRMdNo messages received in 60 seconds..
  aborting
 
  Mar 14 18:33:10 x3650h crmd[10718]:error: crm_abort:
  do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state !=
  NULL
  ...snip...
  Mar 14 18:33:10

Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11

2014-03-14 Thread David Vossel




- Original Message -
 From: Kazunori INOUE kazunori.ino...@gmail.com
 To: pm pacemaker@oss.clusterlabs.org
 Sent: Friday, March 14, 2014 5:52:38 AM
 Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
 
 Hi,
 
 When specifying the node name in UPPER case and performing
 crm_resource, crmd was aborted.
 (The real node name is a LOWER case.)

https://github.com/ClusterLabs/pacemaker/pull/462

does that fix it?

 # crm_resource -C -r p1 -N X3650H
 Cleaning up p1 on X3650H
 Waiting for 1 replies from the CRMdNo messages received in 60 seconds..
 aborting
 
 Mar 14 18:33:10 x3650h crmd[10718]:error: crm_abort:
 do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state !=
 NULL
 ...snip...
 Mar 14 18:33:10 x3650h pacemakerd[10708]:error: child_waitpid:
 Managed process 10718 (crmd) dumped core
 
 
 * The state before performing crm_resource.
 
 Stack: corosync
 Current DC: x3650g (3232261383) - partition with quorum
 Version: 1.1.10-38c5972
 2 Nodes configured
 3 Resources configured
 
 
 Online: [ x3650g x3650h ]
 
 Full list of resources:
 
 f-g (stonith:external/ibmrsa-telnet):   Started x3650h
 f-h (stonith:external/ibmrsa-telnet):   Started x3650g
 p1  (ocf::pacemaker:Dummy): Stopped
 
 Migration summary:
 * Node x3650g:
 * Node x3650h:
p1: migration-threshold=1 fail-count=1 last-failure='Fri Mar 14
 18:32:48 2014'
 
 Failed actions:
 p1_monitor_1 on x3650h 'not running' (7): call=16,
 status=complete, last-rc-change='Fri Mar 14 18:32:48 2014',
 queued=0ms, exec=0ms
 
 
 Just for reference, similar phenomenon did not occur by crm_standby.
 $ crm_standby -U X3650H -v on
 
 
 Best Regards,
 Kazunori INOUE
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] drbd + lvm

2014-03-14 Thread David Vossel




- Original Message -
 From: Infoomatic infooma...@gmx.at
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, March 13, 2014 5:28:19 PM
 Subject: Re: [Pacemaker] drbd + lvm
 
   Has anyone had this issue and resolved it? Any ideas? Thanks in advance!
  
  Yep, i've hit this as well. Use the latest LVM agent. I already fixed all
  of this.
  
  https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/LVM
  
  Keep your volume_list the way it is and use the 'exclusive=true' LVM
  option.   This will allow the LVM agent to activate volumes that don't
  exist in the volume_list.
  
  Hope that helps
 
 Thanks for the fast response. I upgraded LVM to the backports
 (2.02.95-4ubuntu1.1~precise1) and used this script, but I am getting errors
 when one of the nodes tries to activate the VG.
 
 The log:
 Mar 13 23:21:03 lxc02 LVM[7235]: INFO: 0 logical volume(s) in volume group
 replicated now active
 Mar 13 23:21:03 lxc02 LVM[7235]: INFO: LVM Volume replicated is not available
 (stopped)
 
 exclusive is true and the tag is pacemaker. Someone got hints? tia!

Yeah, those aren't errors. It's just telling you that the LVM agent stopped 
successfully. I would expect to see these after you did a failover or resource 
recovery.

Is the resource not starting and stopping correctly for you? If not, I'll need 
more logs.

-- Vossel

 infoomatic
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] drbd + lvm

2014-03-13 Thread David Vossel




- Original Message -
 From: Infoomatic infooma...@gmx.at
 To: pacemaker@oss.clusterlabs.org
 Sent: Thursday, March 13, 2014 2:26:00 PM
 Subject: [Pacemaker] drbd + lvm
 
 Hi list,
 
 I am having troubles with pacemaker and lvm and stacked drbd resources.
 The system consists of 2 Ubuntu 12 LTS servers, each having two partitions of
 an underlying raid 1+0 as volume group with one LV each as a drbd backing
 device. The purpose is for usage with VMs and adjusting needed disk space
 flexible, so on top of the drbd resources there are LVs for each VM.
 I created a stack with LCMC, which is like:
 
 DRBD-LV-libvirt and
 DRBD-LV-Filesystem-lxc
 
 The problem now: the system has hickups - when VM01 runs on HOST01 (being
 primary DRBD) and HOST02 is restarting, lvm is reloaded (at boot time) and
 the LVs are being activated. This of course results in an error, the log
 entry:
 
 Mar 13 17:58:42 host01 pengine: [27563]: ERROR: native_create_actions:
 Resource res_LVM_1 (ocf::LVM) is active on 2 nodes attempting recovery
 
 Therefore, as configured, the resource is stopped and started again (on only
 one node). Thus, all VMs and containers relying on this are also restared.
 
 When I disable the LVs that use the DRBD resource at boot (lvm.conf:
 volume_list only containing the VG from the partitions of the raidsystem) a
 reboot of the secondary does not restart the VMs running on the primary.
 However, if the primary goes down (e.g. power interruption), the secondary
 cannot activate the LVs of the VMs because they are not in the list of
 lvm.conf to be activated.
 
 Has anyone had this issue and resolved it? Any ideas? Thanks in advance!

Yep, I've hit this as well. Use the latest LVM agent. I already fixed all of
this.

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/LVM

Keep your volume_list the way it is and use the 'exclusive=true' LVM option.   
This will allow the LVM agent to activate volumes that don't exist in the 
volume_list.
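
(A minimal sketch of what that looks like in crm shell syntax; the volume
group name 'replicated' and the resource id 'res_LVM_1' are taken from the
logs in this thread, so adjust them, or edit your existing primitive, to match
your setup:)

  # let the agent activate the VG on the node running the resource,
  # even though volume_list filters it out at boot
  crm configure primitive res_LVM_1 ocf:heartbeat:LVM \
      params volgrpname="replicated" exclusive="true" \
      op monitor interval="30s"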

Hope that helps

-- Vossel





 
 infoomatic
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] process/service watcher

2014-03-13 Thread David Vossel
- Original Message -
 From: Yair Ogen (yaogen) yao...@cisco.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, March 13, 2014 9:22:44 AM
 Subject: Re: [Pacemaker] process/service watcher
 
 
 
 Thanks Frank, so you confirm that pacemaker doesn’t offer this?

Yes, you can run a single node cluster. It sounds like it doesn't make any 
sense, but I've actually seen this used in ways I wouldn't have expected.  It 
has valid use-cases.
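
(For what it's worth, a hedged sketch of what a one-node "process watcher"
cluster can look like with pcs; the resource name my_daemon is a placeholder,
and the property settings only make sense because there is a single node and
nothing to fence:)

  # a lone node has no peers to fence or to form quorum with
  pcs property set stonith-enabled=false no-quorum-policy=ignore

  # restart my_daemon whenever its status check fails
  pcs resource create my_daemon lsb:my_daemon op monitor interval=10s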

-- Vossel

 
 
 Yair
 
 
 
 
 From: Frank Brendel [mailto:frank.bren...@eurolog.com]
 Sent: Thursday, March 13, 2014 16:05
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] process/service watcher
 
 
 
 
 Hi Yair,
 
 try monit http://mmonit.com/monit/
 
 
 Regards
 Frank
 
 
 Am 13.03.2014 14:24, schrieb Yair Ogen (yaogen):
 
 
 
 
 Does pacemaker have an option to act as a process / service watcher
 regardless to being part of a cluster? i.e. watch a process and identify
 when it’s down and re-start it.
 
 
 
 I am looking for a software solution that does this even a non-clustered
 environment.
 
 
 
 Thanks.
 
 
 
 Regards,
 
 
 
 Yair
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] How to delay first monitor op upon resource start?

2014-03-13 Thread David Vossel
- Original Message -
 From: Gianluca Cecchi gianluca.cec...@gmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, March 13, 2014 12:00:16 PM
 Subject: [Pacemaker] How to delay first monitor op upon resource start?
 
 Hello,
 I have some init based scripts that I configure as lsb resources.
 They are java based (in this case ovirt-engine and
 ovirt-websocket-proxy from oVirt project) and they are started through
 the rhel daemon function.
 Basically it needs a few seconds before the scripts exit and the
 status option returns ok.
 So most of times when used as resources in pacemaker, their start is
 registered as FAILED because the status call happens too quickly.
 In the mean time I solved the problem putting a sleep 5 before the
 exit, but I would like to know if I can set a resource or cluster
 parameter so that the first status monitor after start is delayed.
 So I don't need to ask maintainer to make the change to the script and
 I don't need after every update to remember to re-modify the script.


This is a problem with the LSB script. No script that pacemaker manages should
ever return from its start action until its status action passes. A passing
status should be a precondition for a successful start. You should add a loop
at the end of the start function that waits for status to pass before
returning.
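
(A sketch of that pattern, assuming the stock RHEL daemon/status init
functions mentioned above; EXEC and PROG are placeholders for whatever the
script already uses:)

  start() {
      daemon $EXEC
      # do not report success until the status check actually passes
      for i in $(seq 1 30); do
          status $PROG >/dev/null 2>&1 && return 0
          sleep 1
      done
      return 1
  }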

With that said... there is a way to delay the monitor operation in pacemaker
like you are wanting. This is a terrible idea, I don't recommend it, and I
don't guarantee it won't get deprecated entirely someday. The option is called
'start-delay' and you set it within the monitor operation section (the same
place interval and timeout are set). Set that option to the number of
milliseconds you want to delay the operation's execution.
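
(If you do go that route anyway, a hedged example in crm shell syntax; the
resource name is only illustrative and 5000 means five seconds expressed in
milliseconds:)

  crm configure primitive ovirt-engine lsb:ovirt-engine \
      op monitor interval=30s start-delay=5000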

-- Vossel

 Another option would be to try the status after start more than once
 so that eventually the first time is not ok, but it is so the second
 one
 
 Thanks in advance,
 Gianluca
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Don't want to stop lsb resource on migration

2014-03-13 Thread David Vossel




- Original Message -
 From: Bingham knee-jerk-react...@hotmail.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Thursday, March 13, 2014 9:00:55 AM
 Subject: [Pacemaker] Don't want to stop lsb resource on migration
 
 Hello,
 
 My setup:
 I have a 2 node cluster using pacemaker and heartbeat. I have 2 resources,
 ocf::heartbeat:IPaddr and lsb:rabbitmq-server.
 I have these 2 resources grouped together and they will fail over to the
 other node.
 
 
 
 question:
 When rabbitmq is migrated to node1 from node2 I would like to 'not' have the
 the /etc/init.d/rabbitmq-server stop happen on the failed server (node1 in
 this example).
 
 Is it possible to do this in crm?
 I realize that I could hack the initscript's case statement for stop to just
 exit 0, but I am hoping there is a way to do this in crm.

There isn't.

 Thanks for any help,
 Steve
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread David Vossel




- Original Message -
 From: Jan Friesse jfrie...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, March 13, 2014 4:03:28 AM
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 ...
 
 
  Also can you please try to set debug: on in corosync.conf and paste
  full corosync.log then?
 
  I set debug to on, and did a few restarts but could not reproduce the
  issue
  yet - will post the logs as soon as I manage to reproduce.
 
 
  Perfect.
 
  Another option you can try to set is netmtu (1200 is usually safe).
  
  Finally I was able to reproduce the issue.
  I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not
  when node was up again).
  
  The corosync log with debug on is available at:
  http://pastebin.com/kTpDqqtm
  
  
  To be honest, I had to wait much longer for this reproduction as before,
  even though there was no change in the corosync configuration - just
  potentially some system updates. But anyway, the issue is unfortunately
  still there.
  Previously, when this issue came, cpu was at 100% on all nodes - this time
  only on ctmgr, which was the DC...
  
  I hope you can find some useful details in the log.
  
 
 Attila,
 what seems to be interesting is
 
 Configuration ERRORs found during PE processing.  Please run crm_verify
 -L to identify issues.
 
 I'm unsure how much is this problem but I'm really not pacemaker expert.
 
 Anyway, I have theory what may happening and it looks like related with
 IPC (and probably not related to network). But to make sure we will not
 try fixing already fixed bug, can you please build:
 - New libqb (0.17.0). There are plenty of fixes in IPC
 - Corosync 2.3.3 (already plenty IPC fixes)

Yes, there was a libqb/corosync interoperation problem that showed these same
symptoms last year. Updating to the latest corosync and libqb will likely
resolve this.

 - And maybe also newer pacemaker
 
 I know you were not very happy using hand-compiled sources, but please
 give them at least a try.
 
 Thanks,
   Honza
 
  Thanks,
  Attila
  
  
  
 
  Regards,
Honza
 
 
  There are also a few things that might or might not be related:
 
  1) Whenever I want to edit the configuration with crm configure edit,
 
 ...
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker remote and persistent remote node attributes

2014-03-06 Thread David Vossel




- Original Message -
 From: Покотиленко Костик cas...@meteor.dp.ua
 To: pacemaker@oss.clusterlabs.org
 Sent: Thursday, March 6, 2014 1:42:25 PM
 Subject: [Pacemaker] Pacemaker remote and persistent remote node attributes
 
 Hi, I'm new here.
 

awesome, welcome :D

 I'm looking for ways of migrating to pacemaker of the current setup
 which is ~10 KVM hypervisors with ~20 VMs each. There are few VM classes
 each running it's set of services. Each service is run on =2 VMs on
 different HV for balancing. Failover and management is to be added.
 
 As far as I know there are 3 ways to manage this setup in pacemaker:
 1. make cluster of VMs, use libvirt fencing, have problems
 2. make cluster of VMs, make different cluster of HVs, propagate fencing
 from VM cluster to HV cluster to do real fencing of VMs and HVs. Not
 sure how to do this. I've found a solution for XEN from RedHat, but not
 for KVM
 3. use pacemaker with pacemaker-remote to manage VMs and services on
 both HVs and VMs, have fun

Maybe I'm biased, but I like option 3.
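
(As a hedged sketch of what option 3 looks like in practice with pcs: the
cluster starts the guest with VirtualDomain, and the remote-node meta
attribute turns the VM into a node that can itself run resources. The name
vm01 and the paths are placeholders, and pacemaker_remoted has to be installed
and running inside the guest:)

  pcs resource create vm01 ocf:heartbeat:VirtualDomain \
      hypervisor="qemu:///system" config="/etc/libvirt/qemu/vm01.xml" \
      meta remote-node=vm01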

 Tell me what I missed.
 
 I research pacemaker-remote option now. It seems to be best, but it
 doesn't support remote node persistent attributes for now (1.1.11).
 
 With node attributes I can place services by node class and location
 rules in this case are very simple.
 
 So the question is: are the remote node persistent attributes going to
 be implemented

Yes, I plan on doing this for 1.1.12.

 or what are the workarounds?

I'm not aware of a good one :/

 
 P.S. Some other questions:
 - what are the reliable versions of pacemaker and corosync? Now I use
 pacemaker 1.1.11 and corosync 2.3.0 backported for Ubuntu 12.04 LTS
 - what is the preferred cli now? crmsh, pcs or what?

Ha, that's a loaded question.  pcs is gaining a lot of traction and it is what 
I use exclusively now.

 - is pacemaker-mgmt  (pygui) 2.1.2 supposed to work with pacemaker
 1.1.11 and corosync 2.3.0? I have it installed and enabled (use_mgmtd:
 yes), but it isn't loading with no errors. I had it working with stock
 Ubuntu pacemaker/corosync.
 - is cman a requirement for DRBD+OCFS2 or it can be safely run with
 corosync on Ubuntu/Debian?

I don't use Debian; someone else would have to give their input.


Good Luck! I'm interested to hear how your setup goes.

-- Vossel

 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Stopping resource using pcs

2014-02-28 Thread David Vossel




- Original Message -
 From: K Mehta kiranmehta1...@gmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Friday, February 28, 2014 7:05:47 AM
 Subject: Re: [Pacemaker] Stopping resource using pcs
 
 Can anyone tell me why --wait parameter always causes pcs resource disable to
 return failure though resource actually stops within time ?

Does it only show an error with multi-state resources? It is probably a bug.

-- Vossel

 
 
 On Wed, Feb 26, 2014 at 10:45 PM, K Mehta  kiranmehta1...@gmail.com  wrote:
 
 
 
 Deleting master resource id does not work. I see the same issue.
 However, uncloning helps. Delete works after disabling and uncloning.
 
 I see anissue in using --wait option with disable. Resources moves into
 stopped state but still error an error message is printed.
 When --wait option is not provided, error message is not seen
 
 [root@sys11 ~]# pcs resource
 Master/Slave Set: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 [vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8]
 Masters: [ sys11 ]
 Slaves: [ sys12 ]
 [root@sys11 ~]# pcs resource disable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 --wait
 Error: unable to stop: 'ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8', please
 check logs for failure information
 [root@sys11 ~]# pcs resource
 Master/Slave Set: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 [vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8]
 Stopped: [ vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8:0
 vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8:1 ]
 [root@sys11 ~]# pcs resource disable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 --wait
 Error: unable to stop: 'ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8', please
 check logs for failure information error message
 [root@sys11 ~]# pcs resource enable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 [root@sys11 ~]# pcs resource
 Master/Slave Set: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 [vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8]
 Masters: [ sys11 ]
 Slaves: [ sys12 ]
 [root@sys11 ~]# pcs resource disable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 [root@sys11 ~]# pcs resource
 Master/Slave Set: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 [vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8]
 Stopped: [ vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8:0
 vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8:1 ]
 
 
 
 
 
 On Wed, Feb 26, 2014 at 8:55 PM, David Vossel  dvos...@redhat.com  wrote:
 
 
 
 - Original Message -
  From: Frank Brendel  frank.bren...@eurolog.com 
  To: pacemaker@oss.clusterlabs.org
  Sent: Wednesday, February 26, 2014 8:53:19 AM
  Subject: Re: [Pacemaker] Stopping resource using pcs
  
  I guess we need some real experts here.
  
  I think it's because you're attempting to delete the resource and not the
  Master.
  Try deleting the Master instead of the resource.
 
 Yes, delete the Master resource id, not the primitive resource within the
 master. When using pcs, you should always refer to the resource's top most
 parent id, not the id of the children resources within the parent. If you
 make a resource a clone, start using the clone id. Same with master. If you
 add a resource to a group, reference the group id from then on and not any
 of the children resources within the group.
 
 As a general practice, it is always better to stop a resource (pcs resource
 disable) and only delete the resource after the stop has completed.
 
 This is especially important for group resources where stop order matters. If
 you delete a group, then we have no information on what order to stop the
 resources in that group. This can cause stop failures when the orphaned
 resources are cleaned up.
 
 Recently pcs gained the ability to attempt to stop resources before deleting
 them in order to avoid scenarios like i described above. Pcs will block for
 a period of time waiting for the resource to stop before deleting it. Even
 with this logic in place it is preferred to stop the resource manually then
 delete the resource once you have verified it stopped.
 
 -- Vossel
 
  
  I had a similar problem with a cloned group and solved it by un-cloning
  before deleting the group.
  Maybe un-cloning the multi-state resource could help too.
  It's easy to reproduce.
  
  # pcs resource create resPing ping host_list=10.0.0.1 10.0.0.2 op monitor
  on-fail=restart
  # pcs resource group add groupPing resPing
  # pcs resource clone groupPing clone-max=3 clone-node-max=1
  # pcs resource
  Clone Set: groupPing-clone [groupPing]
  Started: [ node1 node2 node3 ]
  # pcs resource delete groupPing-clone
  Deleting Resource (and group) - resPing
  Error: Unable to remove resource 'resPing' (do constraints exist?)
  # pcs resource unclone groupPing
  # pcs resource delete groupPing
  Removing group: groupPing (and all resources within group)
  Stopping all resources in group: groupPing...
  Deleting Resource (and group) - resPing
  
  Log:
  Feb 26 15:43:16 node1 cibadmin[2368]: notice: crm_log_args: Invoked:
  /usr/sbin/cibadmin -o resources -D --xml-text group id=groupPing#012

Re: [Pacemaker] Stopping resource using pcs

2014-02-26 Thread David Vossel
- Original Message -
 From: Frank Brendel frank.bren...@eurolog.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Wednesday, February 26, 2014 8:53:19 AM
 Subject: Re: [Pacemaker] Stopping resource using pcs
 
 I guess we need some real experts here.
 
 I think it's because you're attempting to delete the resource and not the
 Master.
 Try deleting the Master instead of the resource.

Yes, delete the Master resource id, not the primitive resource within the
master. When using pcs, you should always refer to the resource's topmost
parent id, not the ids of the child resources within the parent. If you make
a resource a clone, start using the clone id. The same goes for masters. If
you add a resource to a group, reference the group id from then on and not
any of the child resources within the group.

As a general practice, it is always better to stop a resource (pcs resource 
disable) and only delete the resource after the stop has completed.

This is especially important for group resources where stop order matters.  If 
you delete a group, then we have no information on what order to stop the 
resources in that group. This can cause stop failures when the orphaned 
resources are cleaned up.

Recently pcs gained the ability to attempt to stop resources before deleting
them in order to avoid scenarios like I described above. pcs will block for a
period of time waiting for the resource to stop before deleting it. Even with
this logic in place, it is preferable to stop the resource manually and delete
it only after you have verified it stopped.
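
(In concrete terms, a sketch of that sequence, using the master id that
appears in this thread:)

  pcs resource disable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
  pcs resource    # confirm the Master/Slave set shows Stopped before going on
  pcs resource delete ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8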

-- Vossel

 
 I had a similar problem with a cloned group and solved it by un-cloning
 before deleting the group.
 Maybe un-cloning the multi-state resource could help too.
 It's easy to reproduce.
 
 # pcs resource create resPing ping host_list=10.0.0.1 10.0.0.2 op monitor
 on-fail=restart
 # pcs resource group add groupPing resPing
 # pcs resource clone groupPing clone-max=3 clone-node-max=1
 # pcs resource
 Clone Set: groupPing-clone [groupPing]
 Started: [ node1 node2 node3 ]
 # pcs resource delete groupPing-clone
 Deleting Resource (and group) - resPing
 Error: Unable to remove resource 'resPing' (do constraints exist?)
 # pcs resource unclone groupPing
 # pcs resource delete groupPing
 Removing group: groupPing (and all resources within group)
 Stopping all resources in group: groupPing...
 Deleting Resource (and group) - resPing
 
 Log:
 Feb 26 15:43:16 node1 cibadmin[2368]: notice: crm_log_args: Invoked:
 /usr/sbin/cibadmin -o resources -D --xml-text group id=groupPing#012
 primitive class=ocf id=resPing provider=pacemaker type=ping#012
 instance_attributes id=resPing-instance_attributes#012 nvpair
 id=resPing-instance_attributes-host_list name=host_list value=10.0.0.1
 10.0.0.2/#012 /instance_attributes#012 operations#012 op
 id=resPing-monitor-on-fail-restart interval=60s name=monitor
 on-fail=restart/#012 /operations#012 /primi
 Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Expecting an element
 meta_attributes, got nothing
 Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Invalid sequence in
 interleave
 Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Element clone failed to
 validate content
 Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Element resources has extra
 content: primitive
 Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Invalid sequence in
 interleave
 Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Element cib failed to
 validate content
 Feb 26 15:43:16 node1 cib[1820]: warning: cib_perform_op: Updated CIB does
 not validate against pacemaker-1.2 schema/dtd
 Feb 26 15:43:16 node1 cib[1820]: warning: cib_diff_notify: Update (client:
 cibadmin, call:2): 0.516.7 - 0.517.1 (Update does not conform to the
 configured schema)
 Feb 26 15:43:16 node1 stonith-ng[1821]: warning: update_cib_cache_cb:
 [cib_diff_notify] ABORTED: Update does not conform to the configured schema
 (-203)
 Feb 26 15:43:16 node1 cib[1820]: warning: cib_process_request: Completed
 cib_delete operation for section resources: Update does not conform to the
 configured schema (rc=-203, origin=local/cibadmin/2, version=0.516.7)
 
 
 Frank
 
 Am 26.02.2014 15:00, schrieb K Mehta:
 
 
 
 Here is the config and output of few commands
 
 [root@sys11 ~]# pcs config
 Cluster Name: kpacemaker1.1
 Corosync Nodes:
 
 Pacemaker Nodes:
 sys11 sys12
 
 Resources:
 Master: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 Meta Attrs: clone-max=2 globally-unique=false target-role=Started
 Resource: vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 (class=ocf
 provider=heartbeat type=vgc-cm-agent.ocf)
 Attributes: cluster_uuid=de5566b1-c2a3-4dc6-9712-c82bb43f19d8
 Operations: monitor interval=30s role=Master timeout=100s
 (vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8-monitor-interval-30s)
 monitor interval=31s role=Slave timeout=100s
 (vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8-monitor-interval-31s)
 
 Stonith Devices:
 Fencing Levels:
 
 Location Constraints:
 Resource: 

Re: [Pacemaker] getting started with development

2014-02-26 Thread David Vossel
- Original Message -
 From: Tasim Noor tasimn...@gmail.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Tuesday, February 25, 2014 5:10:33 PM
 Subject: [Pacemaker] getting started with development
 
 Hi All,
 I would be interested in contributing to the pacemaker/linux HA codebase. I
 did look through the TODO but it doesn't say which of topics are currently
 worked on and which ones are open to be taken up. i would appreciate if
 somebody can point me to a starting point i.e some feature that i can start
 looking at to get my hands dirty along with some pointers to specific source
 files as a starting point.

Awesome. One of the things we recommend to pacemaker developers is to learn
our test suites. Specifically, being able to run CTS in a virtualized
environment with 3 or more nodes is a good exercise.

I run CTS on KVM instances using libvirt. For fencing I use fence_virtd on the
host machine and the fence_xvm agent within the guest VMs.

After you get fence_virtd running and accessible from the guests ('fence_xvm -o
list' should list all the running guest VMs when executed from inside a guest)
you can use the steps I have outlined in the scenario file below to execute CTS.

https://github.com/davidvossel/phd/blob/master/scenarios/cts-virt.scenario
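
(A hedged sketch of the fencing plumbing described above; the exact packages
and service commands differ by distro, and the key file path is the usual
default rather than something this thread specifies:)

  # on the KVM host: configure and start the listener (choose the libvirt
  # backend and the multicast listener when prompted)
  fence_virtd -c
  service fence_virtd start

  # copy /etc/cluster/fence_xvm.key from the host into each guest, then
  # from inside any guest this should list every running VM on the host:
  fence_xvm -o list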

CTS is not strictly required for all pacemaker development, but depending on
how deep you want to go, it is very helpful for verifying invasive changes.

-- Vossel

 Thanks for your help.
 Kind Regards,
 Tasim
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Possible error in RA invocation

2014-02-18 Thread David Vossel




- Original Message -
 From: Santiago Pérez santiago.pe...@entertainment-solutions.eu
 To: pacemaker@oss.clusterlabs.org
 Sent: Thursday, January 30, 2014 1:50:41 PM
 Subject: [Pacemaker] Possible error in RA invocation
 
 Hi everyone,
 
 I am running a two-node cluster which hosts two Xen VMs. We're using
 DRBD, but it's managed directly from Xen.
 
 The configuration of one of this resources is as follows:
 
 primitive xen-vm1 ocf:heartbeat:Xen
  params xmfile=/etc/xen/vm1.cfg
  op monitor interval=30s
  op start interval=0 timeout=60s
  op stop interval=0 timeout=300s
  op migrate_from interval=0 timeout=240 ingerval=0
  op migrate_to interval=0 timeout=240
  meta allow-migrate=true target-role=Started
  meta target-role=Started
 
 
 I have a problem with the monitor operation. It seems to be working
 fine... until it doesn't. The cluster can be running for weeks without
 any failure, but sometimes the monitor operation fails with a really
 strange error from the resource agent. This is an excerpt of one of the
 failures:
 
 Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
 (pid 11756)
 Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: operation monitor[71] on
 xen-vm1 for client 3825: pid 11756 exited with return code 0
 Jan 28 15:40:26 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
 (pid 18065)
 Jan 28 15:40:27 xenhost1 lrmd: [3822]: info: operation monitor[71] on
 xen-vm1 for client 3825: pid 18065 exited with return code 0
 Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
 (pid 24373)
 Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: operation monitor[71] on
 xen-vm1 for client 3825: pid 24373 exited with return code 0
 Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
 (pid 30686)
 Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: operation monitor[71] on
 xen-vm1 for client 3825: pid 30686 exited with return code 0
 Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
 (pid 4593)
 Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: operation monitor[71] on
 xen-vm1 for client 3825: pid 4593 exited with return code 0
 Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output:
 (xen-vm1:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
 Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output:
 (xen-vm1:monitor:stderr) en-list: bad variable name

This is weird. It is almost like your shell environment is borked.  I'm not 
sure what is causing this.

-- Vossel

 Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output:
 (xen-vm1:monitor:stderr)
 Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: cancel_op: operation
 monitor[71] on xen-vm1 for client 3825, its parameters:
 crm_feature_set=[3.0.6] xmfile=[/etc/xen/vm1.cfg]
 CRM_meta_name=[monitor] CRM_meta_interval=[3]
 CRM_meta_timeout=[2]  cancelled
 Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 stop[72] (pid 6219)

 The machines are very low on resources, and this unnecessary migration
 is causing problems.
 
 The systems are running Debian Wheezy with pacemaker 1.1.7-1 and
 resource-agents 3.9.2-5+deb7u1. I don't know yet if there's a problem
 with the Xen RA, the lrmd service itself or my configuration. I wasn't
 able to find any information related to this issue. Do you have any idea
 of what could be causing this? Any help will be appreciated.
 
 Regards,
 Santiago
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS

2014-02-18 Thread David Vossel




- Original Message -
 From: Jefferson Carlos Machado lista.li...@results.com.br
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Tuesday, February 11, 2014 7:03:50 AM
 Subject: Re: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS
 
 Hi Vossel,
 
 I allready do this.
 
   Resource: home (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=localhost:/home_gv directory=/home fstype=glusterfs
Operations: start interval=0 timeout=60 (home-start-interval-0)
stop interval=0 timeout=240 (home-stop-interval-0)
monitor interval=30s role=Started (home-monitor-interval-0)
 
 But when I try to start it I get the error below, and I can see the error in the log.
 
 Operation start for home (ocf:heartbeat:Filesystem) returned 1
 stdout: Mount failed. Please check the log file for more details.
 stderr: INFO: Running start for localhost:/home_gv on /home
 stderr: ERROR: Couldn't mount filesystem localhost:/home_gv on /home
 
 It works fine with fstab and mount -a.

Ah, it looks like the Filesystem agent needs to be updated to work correctly 
with gluster then. 
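
One way to narrow it down (a sketch, assuming the usual OCF layout under
/usr/lib/ocf) is to run the agent by hand, as root, with the same parameters
Pacemaker passes, so you see the full mount error rather than the truncated
log line:

    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_device="localhost:/home_gv"
    export OCF_RESKEY_directory="/home"
    export OCF_RESKEY_fstype="glusterfs"
    /usr/lib/ocf/resource.d/heartbeat/Filesystem start

and double-check that the device string matches the volume name that your
working 'mount -a' uses.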

-- Vossel

 [root@srvmail0 ~]# mount -a
 [root@srvmail0 ~]# df
 Filesystem   1K-blocks Used Available Use% Mounted on
 /dev/mapper/VolGroup-lv_root   1491664  1298072117816  92% /
 tmpfs   31223644080268156  15% /dev/shm
 /dev/xvda1  49584475560394684  17% /boot
 /dev/xvdb120954552 16049876   4904676  77% /gv
 /dev/xvdc1 2063504  1133824824860  58% /var
 localhost:/gv_home20954496 16049920   4904576  77% /home
 [root@srvmail0 ~]# cat /etc/fstab
 
 #
 # /etc/fstab
 # Created by anaconda on Wed Dec 19 18:01:54 2012
 #
 # Accessible filesystems, by reference, are maintained under '/dev/disk'
 # See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
 #
 /dev/mapper/VolGroup-lv_root /   ext4
 defaults1 1
 UUID=a7af8398-cbea-495f-80cd-1a642d94d9f4 /boot ext4defaults1 2
 /dev/mapper/VolGroup-lv_swap swapswap
 defaults0 0
 tmpfs   /dev/shmtmpfs defaults0 0
 devpts  /dev/ptsdevpts gid=5,mode=620  0 0
 sysfs   /syssysfs defaults0 0
 proc/proc   proc defaults0 0
 /dev/xvdb1 /gv   xfsdefaults1 1
 /dev/xvdc1 /var  ext4defaults1 1
 localhost:/gv_home   /home   glusterfs   _netdev 0 0
 [root@srvmail0 ~]#
 
 
 
 Regards,
 
 Em 07-02-2014 17:53, David Vossel escreveu:
 
 
 
  - Original Message -
  From: Jefferson Carlos Machado lista.li...@results.com.br
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org, gluster-us...@gluster.org
  Sent: Friday, February 7, 2014 11:55:37 AM
  Subject: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS
 
  Hi,
 
  What is the best way to create a Filesystem resource that manages a glusterfs mount?
  I suppose using the Filesystem resource agent.
  https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem
 
  -- Vossel
 
  Regards,
  ___
  Gluster-users mailing list
  gluster-us...@gluster.org
  http://supercolony.gluster.org/mailman/listinfo/gluster-users
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Problem] Fail-over is delayed.(State transition is not calculated.)

2014-02-18 Thread David Vossel

- Original Message -
 From: renayama19661...@ybb.ne.jp
 To: PaceMaker-ML pacemaker@oss.clusterlabs.org
 Sent: Monday, February 17, 2014 7:06:53 PM
 Subject: [Pacemaker] [Problem] Fail-over is delayed.(State transition is not  
 calculated.)
 
 Hi All,
 
 I observed the behaviour at the time of a failure of one of the Master/Slave
 resources in Pacemaker 1.1.11.
 
 -
 
 Step1) Constitute a cluster.
 
 [root@srv01 ~]# crm_mon -1 -Af
 Last updated: Tue Feb 18 18:07:24 2014
 Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01
 Stack: corosync
 Current DC: srv01 (3232238180) - partition with quorum
 Version: 1.1.10-9d39a6b
 2 Nodes configured
 6 Resources configured
 
 
 Online: [ srv01 srv02 ]
 
  vip-master (ocf::heartbeat:Dummy): Started srv01
  vip-rep(ocf::heartbeat:Dummy): Started srv01
  Master/Slave Set: msPostgresql [pgsql]
  Masters: [ srv01 ]
  Slaves: [ srv02 ]
  Clone Set: clnPingd [prmPingd]
  Started: [ srv01 srv02 ]
 
 Node Attributes:
 * Node srv01:
 + default_ping_set  : 100
 + master-pgsql  : 10
 * Node srv02:
 + default_ping_set  : 100
 + master-pgsql  : 5
 
 Migration summary:
 * Node srv01:
 * Node srv02:
 
 Step2) Monitor error in vip-master.
 
 [root@srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state
 
 [root@srv01 ~]# crm_mon -1 -Af
 Last updated: Tue Feb 18 18:07:58 2014
 Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01
 Stack: corosync
 Current DC: srv01 (3232238180) - partition with quorum
 Version: 1.1.10-9d39a6b
 2 Nodes configured
 6 Resources configured
 
 
 Online: [ srv01 srv02 ]
 
  Master/Slave Set: msPostgresql [pgsql]
  Masters: [ srv01 ]
  Slaves: [ srv02 ]
  Clone Set: clnPingd [prmPingd]
  Started: [ srv01 srv02 ]
 
 Node Attributes:
 * Node srv01:
 + default_ping_set  : 100
 + master-pgsql  : 10
 * Node srv02:
 + default_ping_set  : 100
 + master-pgsql  : 5
 
 Migration summary:
 * Node srv01:
vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18
18:07:50 2014'
 * Node srv02:
 
 Failed actions:
 vip-master_monitor_1 on srv01 'not running' (7): call=30,
 status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms,
 exec=0ms
 -
 
 However, the resource does not fail over.
 
 But fail-over is calculated when I check the CIB with crm_simulate at this
 point in time.
 
 -
 [root@srv01 ~]# crm_simulate -L -s
 
 Current cluster status:
 Online: [ srv01 srv02 ]
 
  vip-master (ocf::heartbeat:Dummy): Stopped
  vip-rep(ocf::heartbeat:Dummy): Stopped
  Master/Slave Set: msPostgresql [pgsql]
  Masters: [ srv01 ]
  Slaves: [ srv02 ]
  Clone Set: clnPingd [prmPingd]
  Started: [ srv01 srv02 ]
 
 Allocation scores:
 clone_color: clnPingd allocation score on srv01: 0
 clone_color: clnPingd allocation score on srv02: 0
 clone_color: prmPingd:0 allocation score on srv01: INFINITY
 clone_color: prmPingd:0 allocation score on srv02: 0
 clone_color: prmPingd:1 allocation score on srv01: 0
 clone_color: prmPingd:1 allocation score on srv02: INFINITY
 native_color: prmPingd:0 allocation score on srv01: INFINITY
 native_color: prmPingd:0 allocation score on srv02: 0
 native_color: prmPingd:1 allocation score on srv01: -INFINITY
 native_color: prmPingd:1 allocation score on srv02: INFINITY
 clone_color: msPostgresql allocation score on srv01: 0
 clone_color: msPostgresql allocation score on srv02: 0
 clone_color: pgsql:0 allocation score on srv01: INFINITY
 clone_color: pgsql:0 allocation score on srv02: 0
 clone_color: pgsql:1 allocation score on srv01: 0
 clone_color: pgsql:1 allocation score on srv02: INFINITY
 native_color: pgsql:0 allocation score on srv01: INFINITY
 native_color: pgsql:0 allocation score on srv02: 0
 native_color: pgsql:1 allocation score on srv01: -INFINITY
 native_color: pgsql:1 allocation score on srv02: INFINITY
 pgsql:1 promotion score on srv02: 5
 pgsql:0 promotion score on srv01: 1
 native_color: vip-master allocation score on srv01: -INFINITY
 native_color: vip-master allocation score on srv02: INFINITY
 native_color: vip-rep allocation score on srv01: -INFINITY
 native_color: vip-rep allocation score on srv02: INFINITY
 
 Transition Summary:
  * Start   vip-master   (srv02)
  * Start   vip-rep  (srv02)
  * Demote  pgsql:0  (Master - Slave srv01)
  * Promote pgsql:1  (Slave - Master srv02)
 
 -
 
 In addition, fail-over is calculated when cluster_recheck_interval fires.
 
 Fail-over is also carried out if I run cibadmin -B.
 
 -
 [root@srv01 ~]# cibadmin -B
 
 [root@srv01 ~]# crm_mon -1 -Af
 Last updated: Tue Feb 18 18:21:15 2014
 Last change: Tue Feb 18 

Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-18 Thread David Vossel




- Original Message -
 From: Vladislav Bogdanov bub...@hoster-ok.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Tuesday, February 18, 2014 1:02:09 PM
 Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
 
 18.02.2014 19:49, Asgaroth wrote:
  i sometimes have the same situation. sleep ~30 seconds between startup
  cman and clvmd helps a lot.
 
  
  Thanks for the tip, I just tried this (added sleep 30 in the start
  section of case statement in cman script, but this did not resolve the
  issue for me), for some reason clvmd just refuses to start, I don’t see
  much debugging errors shooting up, so I cannot say for sure what clvmd
  is trying to do :(

I actually just made a patch related to this. If you are managing the dlm with 
pacemaker, you'll want to use this patch. It disables startup fencing in the 
dlm and has pacemaker perform the fencing instead. The agent is what checks the 
startup fencing condition, so you'll need that piece as well, rather than just 
disabling startup fencing in the dlm.
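
For reference, the general shape of that setup (a sketch, assuming pcs and the
ocf:pacemaker:controld agent; resource and option names here are illustrative)
is to let pacemaker manage dlm_controld as a clone and own the fencing
decisions:

    pcs resource create dlm ocf:pacemaker:controld \
        op monitor interval=30s on-fail=fence \
        clone interleave=true ordered=true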

-- Vossel

 Just a guess. Do you have startup fencing enabled in dlm-controld (I
 actually do not remember if it is applicable to cman's version, but it
 exists in dlm-4) or cman?
 If yes, then that may play its evil game, because imho it is not
 intended to use with pacemaker which has its own startup fencing policy
 (if you redirect fencing to pacemaker).
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] nfs4 cluster fail-over stops working once I introduce ipaddr2 resource

2014-02-14 Thread David Vossel




- Original Message -
 From: Dennis Jacobfeuerborn denni...@conversis.de
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, February 13, 2014 11:18:04 PM
 Subject: Re: [Pacemaker] nfs4 cluster fail-over stops working once I 
 introduce ipaddr2 resource
 
 On 14.02.2014 02:50, Dennis Jacobfeuerborn wrote:
  Hi,
  I'm still working on my NFSv4 cluster and things are working as
  expected...as long as I don't add an IPAddr2 resource.
 
  The DRBD, filesystem and exportfs resources work fine and when I put the
  active node into standby everything fails over as expected.
 
  Once I add a VIP as a IPAddr2 resource however I seem to get monitor
  problems with the p_exportfs_root resource.
 
  I've attached the configuration, status and a log file.
 
  The transition status is the status a moment after I take nfs1
  (192.168.100.41) offline. It looks like the stopping of p_ip_nfs does
  something to the p_exportfs_root resource although I have no idea what
  that could be.
 
  The final status is the status after the cluster has settled. The
  fail-over finished but the failed action is still present and cannot be
  cleared with a crm resource cleanup p_exportfs_root.
 
  The log is the result of a tail -f on the corosync.log from the moment
  before I issued the crm node standby nfs1 to when the cluster has
  settled.
 
  Does anybody know what the issue could be here? At first I thought that
  using a VIP from the same network as the cluster nodes could be an issue
  but when I change this to use an IP in a different network
  192.168.101.43/24 the same thing happens.
 
  The moment I remove p_ip_nfs from the configuration again fail-over back
  and forth works without a hitch.
 
 So after a lot of digging I think I pinpointed the issue: A race between
 the monitoring and stop actions of the exportfs resource script.
 
 When wait_for_leasetime_on_stop is set the following happens for the
 stop action and in this specific order:
 
 1. The directory is unexported
 2. Sleep nfs lease time + 2 seconds
 
 The problem seems to be that during the sleep phase the monitoring
 action is still invoked and since the directory has already been
 unexported it reports a failure.
 
 Once I add enabled=false to the monitoring action of the exportfs
 resource the problem disappears.
 
 The question is how to ensure that the monitoring action is not called
 while the stop action is still sleeping?
 
 Would it be a solution to create a lock file for the duration of the
 sleep and check for that lock file in the monitoring action?
 
 I'm not 100% sure if this analysis is correct because if monitoring

right, I doubt that is happening.

What happens if you put the IP before the NFS server?

group g_nfs p_ip_nfs p_fs_data p_exportfs_root p_exportfs_data

Without DRBD, I have an active/passive NFS server scenario here that works for me: 
https://github.com/davidvossel/phd/blob/master/scenarios/nfs-basic.scenario
I'm using the actual nfsserver OCF script from the latest resource-agents github 
branch.

-- Vossel


 calls are still made while the stop action is running this sounds
 inherently racy and would probably be an issue for almost all resource
 scripts not just exportfs.
 
 Regards,
Dennis
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Announcing Pacemaker v1.1.11

2014-02-13 Thread David Vossel

I am excited to announce the release of Pacemaker v1.1.11

  https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11

There were no changes between the final release and rc5.

This has been a very successful release process.  I'm proud of the testing
and contributions the community put into this release.  Thank you all for
your support, this community is great :D

Looking forward to Pacemaker 1.1.12, we have a lot of new functionality on
the horizon. Scaling Pacemaker from a dozen or so nodes to hundreds, possibly
thousands, of nodes is a very real and attainable goal for us this year.  An
announcement about 1.1.12 features and beta testing should arrive in the next
few months.

If you are a user of `pacemaker_remoted`, you should take the time to read 
about changes to the online wire protocol that are present in this release.
http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/

To build `rpm` packages:

1. Clone the current sources:

   # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
   # cd pacemaker

2. Install dependencies (if you haven't already)

   [Fedora] # sudo yum install -y yum-utils
   [ALL]# make rpm-dep

3. Build Pacemaker

   # make release

4. Copy and deploy as needed

## Details - 1.1.11 - final

Changesets: 462
Diff:   147 files changed, 6810 insertions(+), 4057 deletions(-)

## Highlights

### Features added since Pacemaker-1.1.10

  + attrd: A truly atomic version of attrd for use where CPG is used for 
cluster communication
  + cib: Allow values to be added/updated and removed in a single update
  + cib: Support XML comments in diffs
  + Core: Allow blackbox logging to be disabled with SIGUSR2
  + crmd: Do not block on proxied calls from pacemaker_remoted
  + crmd: Enable cluster-wide throttling when the cib heavily exceeds its 
target load
  + crmd: Make the per-node action limit directly configurable in the CIB
  + crmd: Slow down recovery on nodes with IO load
  + crmd: Track CPU usage on cluster nodes and slow down recovery on nodes with 
high CPU/IO load
  + crm_mon: add --hide-headers option to hide all headers
  + crm_node: Display partition output in sorted order
  + crm_report: Collect logs directly from journald if available
  + Fencing: On timeout, clean up the agent's entire process group
  + Fencing: Support agents that need the host to be unfenced at startup
  + ipc: Raise the default buffer size to 128k
  + PE: Add a special attribute for distinguishing between real nodes and 
containers in constraint rules
  + PE: Allow location constraints to take a regex pattern to match against 
resource IDs
  + pengine: Distinguish between the agent being missing and something the 
agent needs being missing
  + remote: Properly version the remote connection protocol

### Changes since Pacemaker-1.1.10

  +  Bug rhbz#1011618 - Consistently use 'Slave' as the role for unpromoted 
master/slave resources
  +  Bug rhbz#1057697 - Use native DBus library for systemd and upstart support 
to avoid problematic use of threads
  + attrd: Any variable called 'cluster' makes the daemon crash before reaching 
main()
  + attrd: Avoid infinite write loop for unknown peers
  + attrd: Drop all attributes for peers that left the cluster
  + attrd: Give remote-nodes ability to set attributes with attrd
  + attrd: Prevent inflation of attribute dampen intervals
  + attrd: Support SI units for attribute dampening
  + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent 
resources
  + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not 
already known
  + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned 
integers
  + Bug rhbz#902407 - crm_resource: Handle --ban for master/slave resources as 
advertised
  + cib: Correctly check for archived configuration files
  + cib: Correctly log short-form xml diffs
  + cib: Fix remote cib based on TLS
  + cibadmin: Report errors during sign-off
  + cli: Do not enable blackbox for cli tools
  + cluster: Fix segfault on removing a node
  + cman: Do not start pacemaker if cman startup fails
  + cman: Start clvmd and friends from the init script if enabled
  + Command-line tools should stop after an assertion failure
  + controld: Use the correct variant of dlm_controld for corosync-2 clusters
  + cpg: Correctly set the group name length
  + cpg: Ensure the CPG group is always null-terminated
  + cpg: Only process one message at a time to allow other priority jobs to be 
performed
  + crmd: Correctly observe the configured batch-limit
  + crmd: Correctly update expected state when the previous DC shuts down
  + crmd: Correctly update the history cache when recurring ops change their 
return code
  + crmd: Don't add node_state to cib, if we have not seen or fenced this node 
yet
  + crmd: don't segfault on shutdown when using heartbeat
  + crmd: Prevent recurring monitors being cancelled due to notify operations
  + 

Re: [Pacemaker] ocf:lvm2:clvmd resource agent

2014-02-12 Thread David Vossel




- Original Message -
 From: Andrew Daugherity adaugher...@tamu.edu
 To: pacemaker@oss.clusterlabs.org pacemaker@oss.clusterlabs.org
 Sent: Wednesday, February 12, 2014 4:56:18 PM
 Subject: [Pacemaker] ocf:lvm2:clvmd resource agent
 
 I noticed in recent discussions on this list that this RA is apparently a
 SUSE thing and not upstreamed into resource-agents.  This was news to me,
 but apparently is indeed the case.

I've just introduced (as of today) a clvmd agent for review upstream. It is not 
the SUSE agent. I would like to encourage SUSE to merge their features into 
this agent and support the upstream effort here.

https://github.com/ClusterLabs/resource-agents/pull/382

 I guess it's SUSE's decision whether to push it upstream but IMO that would
 be the best way to go, so it could become the standard by-the-book way to
 use clvmd with pacemaker.  Right now it lives in the lvm2-clvm RPM, which is
 in the SLES 11 HAE add-on and also in the standard OSS repo for openSUSE
 [1].
 
 The rest of this message is directed more at the SUSE developers  engineers
 who read this list; hopefully this is a more eyeballs = bugs are shallow
 thing than an annoyance...
 
 
 For now, is there a github repo or equivalent for this package, or do you
 just want people to file bugs with openSUSE/support requests with Novell?
 Reason I ask is, I noticed lots of log spamming by clvmd after upgrading
 from SLES 11 SP2 to SP3.  Indeed clvmd is now running with the '-d2' option,
 which is the new default:
 
 # Parameter defaults
 : ${OCF_RESKEY_CRM_meta_globally_unique:=false}
 : ${OCF_RESKEY_daemon_timeout:=80}
 : ${OCF_RESKEY_daemon_options:=-d2}
 
 In SP2 it read ': ${OCF_RESKEY_daemon_options:=-d0}'.
 
 After adjusting my clvmd cluster resource to silence this, by adding
 daemon_options like so:
 
 primitive clvm ocf:lvm2:clvmd \
 params daemon_options=-d0 \
 op start interval=0 timeout=90 \
 op stop interval=0 timeout=100
 
 syslog is back to normal.
 
 In the RPM changelog it looks like this was intentional, but the bug in
 question is marked private, so I have no idea why this was done:
 
 * Tue Jan 15 2013 dmzh...@suse.com
 - clvmd update to 2.20.98,fix colletive bugs.
 - fate#314367, cLVM should support option mirrored
   in a clustered environment
 - Fix debugging level set in clvmd_set_debug by using the correct
   variable (bnc#785467),change default -d0 to -d2
 
 Can someone who has access explain why full -d2 debug mode is now the
 default?  This doesn't seem like a sensible default.
 
 
 Thanks,
 
 Andrew Daugherity
 Systems Analyst
 Division of Research, Texas A&M University
 
 
 [1]
 https://build.opensuse.org/package/show?project=openSUSE%3AFactorypackage=lvm2
 Specifically, see clvmd.ocf.
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS

2014-02-07 Thread David Vossel




- Original Message -
 From: Jefferson Carlos Machado lista.li...@results.com.br
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org, 
 gluster-us...@gluster.org
 Sent: Friday, February 7, 2014 11:55:37 AM
 Subject: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS
 
 Hi,
 
 What is the best way to create a Filesystem resource that manages a glusterfs mount?

I suppose using the Filesystem resource agent. 
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem

-- Vossel

 
 Regards,
 ___
 Gluster-users mailing list
 gluster-us...@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-users
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Time to get ready for 1.1.11

2014-02-06 Thread David Vossel




- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Tuesday, January 28, 2014 11:32:32 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 
 On 25 Jan 2014, at 2:36 am, David Vossel dvos...@redhat.com wrote:
 
  - Original Message -
  From: David Vossel dvos...@redhat.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Thursday, January 23, 2014 10:08:35 PM
  Subject: Re: [Pacemaker] Time to get ready for 1.1.11
  
  I ran into a nasty bug where the crmd can infinitely block while retrieving
  metadata for systemd resources. This could affect other resource types as
  well, but I've only encountered it with systemd. There will be an RC5 so
  we can get this patch in.
  https://github.com/ClusterLabs/pacemaker/commit/b0ab1ccdb55dbead40fae097e4f84e445878afb1
 
 David worked out that the cause is the fact that glib uses threads for its
 GDBus code.
 The fix (nearly complete) is to use the dbus library directly.

Andrew sorted out the whole GDBus threads issue, so it's finally time for a new 
(hopefully final) release candidate. If no issues are encountered, RC5 will 
become the final Pacemaker 1.1.11 release.

Pacemaker-1.1.11-rc5
https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc5

CHANGES Between RC4 and RC5
+ Low: services: fix building dbus support in buildbot.
+ Low: services: Fix building dbus
+ Low: services: Keep dbus build support optional
+ Refactor: dbus: Use native function for adding arguments to messages
+ Fix: Bug rhbz#1057697 - Use native DBus library for upstart support to avoid 
problematic use of threads
+ Fix: Portability: Use basic types for DBus compatibility struct
+ Build: Add dbus as an rpm dependency
+ Refactor: systemd: Simplify dbus API usage
+ Fix: Bug rhbz#1057697 - Use native DBus library for systemd async support to 
avoid problematic use of threads
+ Fix: Bug rhbz#1057697 - Use native DBus library for systemd support to avoid 
problematic use of threads
+ Low: pengine: Regression test update for record-pending=true on migrate_to
+ Fix: xml: Fix segfault in find_entity()
+ Fix: cib: Fix remote cib based on TLS
+ High: services: Do not block synced service executions
+ Fix: cluster: Fix segfault on removing a node
+ Fix: services: Reset the scheduling policy and priority for lrmd's children 
without relying on SCHED_RESET_ON_FORK
+ Fix: services: Correctly reset the nice value for lrmd's children
+ High: pengine: Force record pending for migrate_to actions
+ High: pengine: cl#5186 - Avoid running rsc on two nodes when node is fenced 
during migration


-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] error with pcs resource group command

2014-01-28 Thread David Vossel
- Original Message -
 From: Parveen Jain parveenj...@live.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Thursday, January 23, 2014 9:24:39 AM
 Subject: [Pacemaker] error with pcs resource group command
 
 
 
 Hi Team,
 
 I was trying to add a group while converting from my CRM commands to pcs
 commands:
 
 following is the previous crm command:
 
 group vip-group vip-prim \
 
 meta target-role=Started
 
 
 
 the command which I am trying to use is:
 
 pcs resource group add vip-group vip-prim meta target-role=Started
 
 but whenever I use this command, I get following output:
 
 
 
 
  Unable to find resource: meta
 
 Unable to find resource: target-role=Started 

pcs does not have a one-to-one mapping to crmsh commands. The 'pcs resource 
group add' command does not accept metadata.

Use 'pcs resource meta <group id> target-role=Started'

or 

'pcs resource enable <group id>' will do the same thing.

The pcs tool tells you what arguments the different commands take. You can view 
this for yourself. Use 'pcs resource help' to see resource options.  You can 
look at the man page as well 'man pcs' and it has a detailed list.
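
Applied to your example (a sketch reusing the names from your crm config), that
would be:

    pcs resource group add vip-group vip-prim
    pcs resource meta vip-group target-role=Started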

-- Vossel


 
 
 
 I even consulted the documentation, but it also gives the syntax I am using:
 
 https://access.redhat.com/site/documentation//en-US/Red_Hat_Enterprise_Linux/7-Beta/html/High_Availability_Add-On_Reference/s1-resourceopts-HAAR.html#tb-resource-options-HAAR
 
 
 
 
 Can anyone guide where I am doing wrong ?
 
 
 
 
 
 
 
 Thanks,
 
 Parveen
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Having a really hard time with clvmd on RHEL 7 beta

2014-01-27 Thread David Vossel
- Original Message -
 From: Digimer li...@alteeve.ca
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Monday, January 27, 2014 12:15:23 PM
 Subject: [Pacemaker] Having a really hard time with clvmd on RHEL 7 beta
 
 Hi all,
 
I'm having one heck of a time trying to get clvmd working with
 pacemaker 1.1.10 on RHEL 7 beta... I can configure DRBD dual-primary
 just fine. I can also configure DLM to start on both nodes just fine as
 well.
 
However, once I try to add clvmd using lsb::clvmd, the cluster fails
 randomly.
 
Here is the good config:

snip/

Looking at your config, unless I'm missing something I don't see ordering 
constraints between dlm and clvmd.  You need a 'start dlm-clone then start 
clvmd-clone' order constraint as well as a 'colocate clvmd-clone with 
dlm-clone' colocation constraint.  Otherwise, you are going to run into random 
start and stop errors.
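
With pcs that would look something like this (assuming dlm-clone and
clvmd-clone are the clone ids in your config):

    pcs constraint order start dlm-clone then start clvmd-clone
    pcs constraint colocation add clvmd-clone with dlm-clone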

Even after you do this, you may still have problems. The lsb:clvmd init script 
performs some unnecessary blocking operations during the 'status' operation. I 
have a patch attached to this issue, 
https://bugzilla.redhat.com/show_bug.cgi?id=1040670, which should resolve that 
issue if you hit it.

Hope that helps :)

-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Having a really hard time with clvmd on RHEL 7 beta

2014-01-27 Thread David Vossel
- Original Message -
 From: Lars Marowsky-Bree l...@suse.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Monday, January 27, 2014 2:17:32 PM
 Subject: Re: [Pacemaker] Having a really hard time with clvmd on RHEL 7 beta
 
 On 2014-01-27T13:15:23, Digimer li...@alteeve.ca wrote:
 
  I try to configure clvmd this way:
  
  
  pcs cluster cib clvmd_cfg
  pcs -f clvmd_cfg resource create clvmd lsb:clvmd params daemon_timeout=30s
  op monitor interval=60s
 
 Hmmm. Something is not matching up here. lsb resources can't take
 parameters, can they?

You are right.  Parameters and LSB agents don't mix.  The parameters will get 
stored in the CIB, I suppose, but the lrmd doesn't do anything with them for 
LSB agents.
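
Roughly speaking, this is what the lrmd ends up invoking (a simplified sketch;
the agent path is an assumption based on the ocf:lvm2:clvmd naming):

    # OCF: parameters are exported as OCF_RESKEY_* environment variables
    OCF_RESKEY_daemon_timeout=30s /usr/lib/ocf/resource.d/lvm2/clvmd start

    # LSB: the init script is simply called with the action, nothing else
    /etc/init.d/clvmd start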

-- Vossel

 
 SUSE actually ships a separate ocf:lvm2:clvmd RA (within the lvm2-clvm
 package), which, I'm displeased to notice, wasn't contributed back
 upstream (or at least not merged). Wouldn't that make more sense than
 the LSB script - especially if parameters need to be specified?
 
 
 Regards,
 Lars
 
 --
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
 HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Time to get ready for 1.1.11

2014-01-24 Thread David Vossel
- Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, January 23, 2014 10:08:35 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 - Original Message -
  From: David Vossel dvos...@redhat.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Wednesday, January 15, 2014 5:16:40 PM
  Subject: Re: [Pacemaker] Time to get ready for 1.1.11
  
  - Original Message -
   From: David Vossel dvos...@redhat.com
   To: The Pacemaker cluster resource manager
   pacemaker@oss.clusterlabs.org
   Sent: Tuesday, January 7, 2014 4:50:11 PM
   Subject: Re: [Pacemaker] Time to get ready for 1.1.11
   
   - Original Message -
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager
pacemaker@oss.clusterlabs.org
Sent: Thursday, December 19, 2013 2:25:00 PM
Subject: Re: [Pacemaker] Time to get ready for 1.1.11


On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:

 David/Andrew,
 
 Once 1.1.11 final is released, is it considered the new stable series
 of
 Pacemaker,

yes

 or should 1.1.10 still be used in very stable/critical production
 environments?
 
 Thanks,
 
 Andrew
 
 - Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Wednesday, December 11, 2013 3:33:46 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 - Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Wednesday, November 20, 2013 9:02:40 PM
 Subject: [Pacemaker] Time to get ready for 1.1.11
 
 With over 400 updates since the release of 1.1.10, its time to
 start
 thinking about a new release.
 
 Today I have tagged release candidate 1[1].
 The most notable fixes include:
 
  + attrd: Implementation of a truely atomic attrd for use with
  corosync
  2.x
  + cib: Allow values to be added/updated and removed in a single
  update
  + cib: Support XML comments in diffs
  + Core: Allow blackbox logging to be disabled with SIGUSR2
  + crmd: Do not block on proxied calls from pacemaker_remoted
  + crmd: Enable cluster-wide throttling when the cib heavily
  exceeds
  its
  target load
  + crmd: Use the load on our peers to know how many jobs to send
  them
  + crm_mon: add --hide-headers option to hide all headers
  + crm_report: Collect logs directly from journald if available
  + Fencing: On timeout, clean up the agent's entire process group
  + Fencing: Support agents that need the host to be unfenced at
  startup
  + ipc: Raise the default buffer size to 128k
  + PE: Add a special attribute for distinguishing between real
  nodes
  and
  containers in constraint rules
  + PE: Allow location constraints to take a regex pattern to match
  against
  resource IDs
  + pengine: Distinguish between the agent being missing and
  something
  the
  agent needs being missing
  + remote: Properly version the remote connection protocol
  + services: Detect missing agents and permission errors before
  forking
  + Bug cl#5171 - pengine: Don't prevent clones from running due to
  dependant
  resources
  + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name
  if
  it
  is
  not already known
  + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB
  as
  unsigned integers
 
 If you are a user of `pacemaker_remoted`, you should take the time
 to
 read
 about changes to the online wire protocol[2] that are present in
 this
 release.
 
 [1]
 https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
 [2]
 http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
 
 To build `rpm` packages for testing:
 
 1. Clone the current sources:
 
   # git clone --depth 0
   git://github.com/ClusterLabs/pacemaker.git
   # cd pacemaker
 
 1. If you haven't already, install Pacemaker's dependancies
 
   [Fedora] # sudo yum install -y yum-utils
   [ALL] # make rpm-dep
 
 1. Build Pacemaker
 
   # make rc
 
 1. Copy the rpms and deploy as needed
 
 
 A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2
 
 Assuming no major regressions are encountered during testing, this
 tag
 will
 become the final Pacemaker

Re: [Pacemaker] Time to get ready for 1.1.11

2014-01-23 Thread David Vossel
- Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, January 15, 2014 5:16:40 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 - Original Message -
  From: David Vossel dvos...@redhat.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Tuesday, January 7, 2014 4:50:11 PM
  Subject: Re: [Pacemaker] Time to get ready for 1.1.11
  
  - Original Message -
   From: Andrew Beekhof and...@beekhof.net
   To: The Pacemaker cluster resource manager
   pacemaker@oss.clusterlabs.org
   Sent: Thursday, December 19, 2013 2:25:00 PM
   Subject: Re: [Pacemaker] Time to get ready for 1.1.11
   
   
   On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:
   
David/Andrew,

Once 1.1.11 final is released, is it considered the new stable series
of
Pacemaker,
   
   yes
   
or should 1.1.10 still be used in very stable/critical production
environments?

Thanks,

Andrew

- Original Message -
From: David Vossel dvos...@redhat.com
To: The Pacemaker cluster resource manager
pacemaker@oss.clusterlabs.org
Sent: Wednesday, December 11, 2013 3:33:46 PM
Subject: Re: [Pacemaker] Time to get ready for 1.1.11

- Original Message -
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager
pacemaker@oss.clusterlabs.org
Sent: Wednesday, November 20, 2013 9:02:40 PM
Subject: [Pacemaker] Time to get ready for 1.1.11

With over 400 updates since the release of 1.1.10, its time to start
thinking about a new release.

Today I have tagged release candidate 1[1].
The most notable fixes include:

 + attrd: Implementation of a truely atomic attrd for use with
 corosync
 2.x
 + cib: Allow values to be added/updated and removed in a single
 update
 + cib: Support XML comments in diffs
 + Core: Allow blackbox logging to be disabled with SIGUSR2
 + crmd: Do not block on proxied calls from pacemaker_remoted
 + crmd: Enable cluster-wide throttling when the cib heavily exceeds
 its
 target load
 + crmd: Use the load on our peers to know how many jobs to send them
 + crm_mon: add --hide-headers option to hide all headers
 + crm_report: Collect logs directly from journald if available
 + Fencing: On timeout, clean up the agent's entire process group
 + Fencing: Support agents that need the host to be unfenced at
 startup
 + ipc: Raise the default buffer size to 128k
 + PE: Add a special attribute for distinguishing between real nodes
 and
 containers in constraint rules
 + PE: Allow location constraints to take a regex pattern to match
 against
 resource IDs
 + pengine: Distinguish between the agent being missing and something
 the
 agent needs being missing
 + remote: Properly version the remote connection protocol
 + services: Detect missing agents and permission errors before
 forking
 + Bug cl#5171 - pengine: Don't prevent clones from running due to
 dependant
 resources
 + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if
 it
 is
 not already known
 + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as
 unsigned integers

If you are a user of `pacemaker_remoted`, you should take the time to
read
about changes to the online wire protocol[2] that are present in this
release.

[1]
https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
[2]
http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/

To build `rpm` packages for testing:

1. Clone the current sources:

  # git clone --depth 0
  git://github.com/ClusterLabs/pacemaker.git
  # cd pacemaker

1. If you haven't already, install Pacemaker's dependancies

  [Fedora] # sudo yum install -y yum-utils
  [ALL]   # make rpm-dep

1. Build Pacemaker

  # make rc

1. Copy the rpms and deploy as needed


A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2

Assuming no major regressions are encountered during testing, this tag
will
become the final Pacemaker-1.1.11 release a week from today.

-- Vossel
  
  Alright, New RC time. Pacemaker-1.1.11-rc3.
  
  If no regressions are encountered, rc3 will become the 1.1.11 final release
  a
  week from today.
  
  https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc3
  
  CHANGES RC2 vs RC3
  Fix: ipc: fix memory leak for failed ipc client connections.
  Fix: pengine: Fixes memory leak in regex pattern matching code

Re: [Pacemaker] Preventing Automatic Failback

2014-01-22 Thread David Vossel
- Original Message -
 From: Michael Monette mmone...@2keys.ca
 To: Michael Monette mmone...@2keys.ca, The Pacemaker cluster resource 
 manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, January 22, 2014 12:08:09 AM
 Subject: Re: [Pacemaker] Preventing Automatic Failback
 
 This is the last time I'll update this thread. I made some guesses in my last
 one, but everything is clear now. I am still learning lots.
 
 I had two problems. I thought they were related but they were not.
 
 The DRBD problem was that I had the wfc-timeout value set to 30 in drbd.conf
 while Pacemaker's default is 20 seconds.
 
 The second problem was that I was missing one of the requirements of a
 compatible LSB script. There was no status option, so I made one using some if
 statements, grepping for the process and returning 0 if running, 3 if not
 (whatever... I'm just experimenting for now).
 
 After modifying that script, and raising the DRBD start timeout to 120 in
 pacemaker, everything is working perfectly.
 
 Hope this helps someone down the road, thanks for your help Vossel.
 
 Mike.

Great! Sounds like you worked it out :)
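
For anyone finding this in the archives later, a minimal LSB-style status
handler of the kind described above might look like this (hypothetical; the
actual script was not posted, and the pgrep pattern is only an example):

    status)
        if pgrep -f jira >/dev/null 2>&1; then
            exit 0   # running
        else
            exit 3   # not running, per the LSB init script exit codes
        fi
        ;;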

-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Preventing Automatic Failback

2014-01-21 Thread David Vossel
- Original Message -
 From: Michael Monette mmone...@2keys.ca
 To: pacemaker@oss.clusterlabs.org
 Sent: Monday, January 20, 2014 8:22:25 AM
 Subject: [Pacemaker] Preventing Automatic Failback
 
 Hi,
 
 I posted this question before but my question was a bit unclear.
 
 I have 2 nodes with DRBD with Postgresql.
 
 When node-1 fails, everything fails to node-2 . But when node 1 is recovered,
 things try to failback to node-1 and all the services running on node-2 get
 disrupted(things don't ACTUALLY fail back to node-1..they try, fail, and
 then all services on node-2 are simply restarted..very annoying). This does
 not happen if I perform the same tests on node-2! I can reboot node-2,
 things fail to node-1 and node-2 comes online and waits until he is
 needed(this is what I want!) It seems to only affect my node-1's.
 
 I have tried to set resource stickiness, I have tried everything I can really
 think of, but whenever the Primary has recovered, it will always disrupt
 services running on node-2.
 
 Also I tried removing things from this config to try and isolate this. At one
 point I removed the atlassian_jira and drbd2_var primitives and only had a
 failover-ip and drbd1_opt, but still had the same problem. Hopefully someone
 can pinpoint this out for me. If I can't really avoid this, I would at least
 like to make this bug or whatever happen on node-2 instead of the actives.

I bet this is due to the drbd resource's master score value on node1 being 
higher than node2's.  When you recover node1, are you actually rebooting that 
node?  If node1 doesn't lose membership from the cluster (reboot), those 
transient attributes that the drbd agent uses to specify which node will be the 
master instance will stick around.  Otherwise, if you are just putting node1 in 
standby and then bringing the node back online, then I believe the resources 
will come back if the drbd master was originally on node1.

If you provide a policy engine file that shows the unwanted transition from 
node2 back to node1, we'll be able to tell you exactly why it is occurring.
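
The pengine inputs are kept on the DC, typically under
/var/lib/pacemaker/pengine/. A sketch of how to find and replay the one that
covers the unwanted move (the file number here is only a placeholder):

    ls -t /var/lib/pacemaker/pengine/pe-input-*.bz2 | head
    crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-123.bz2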

-- Vossel


 
 Here is my config:
 
 node node-1.comp.com \
 attributes standby=off
 node node-1.comp.com \
 attributes standby=off
 primitive atlassian_jira lsb:jira \
 op start interval=0 timeout=240 \
 op stop interval=0 timeout=240
 primitive drbd1_opt ocf:heartbeat:Filesystem \
 params device=/dev/drbd1 directory=/opt/atlassian fstype=ext4
 primitive drbd2_var ocf:heartbeat:Filesystem \
 params device=/dev/drbd2 directory=/var/atlassian fstype=ext4
 primitive drbd_data ocf:linbit:drbd \
 params drbd_resource=r0 \
 op monitor interval=29s role=Master \
 op monitor interval=31s role=Slave
 primitive failover-ip ocf:heartbeat:IPaddr2 \
 params ip=10.199.0.13
 group jira_services drbd1_opt drbd2_var failover-ip atlassian_jira
 ms ms_drbd_data drbd_data \
 meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true
 colocation jira_services_on_drbd inf: atlassian_jira ms_drbd_data:Master
 order jira_services_after_drbd inf: ms_drbd_data:promote jira_services:start
 property $id=cib-bootstrap-options \
 dc-version=1.1.10-14.el6_5.1-368c726 \
 cluster-infrastructure=classic openais (with plugin) \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore \
 last-lrm-refresh=1390183165 \
 default-resource-stickiness=INFINITY
 rsc_defaults $id=rsc-options \
 resource-stickiness=INFINITY
 
 Thanks
 
 Mike
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Resource Status from the CIB

2014-01-20 Thread David Vossel
- Original Message -
 From: Michael Schwartzkopff m...@sys4.de
 To: pacemaker@oss.clusterlabs.org
 Sent: Monday, January 20, 2014 8:23:25 AM
 Subject: [Pacemaker] Resource Status from the CIB
 
 Hi,
 
 is it possible to read the status of a resource from the status part of the
 CIB?
 
 I can see an attribute rc-code=0 in lrm_rsc_op_resourceID...
 on an active node and direct

Yes, you can read the status if you want, but it would be better to let crm_mon 
interpret the status for you.

Here's some info on the status section.
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_operation_history
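
A quick sketch of both approaches:

    crm_mon -1              # let pacemaker interpret the status section for you
    cibadmin -Q -o status   # dump the raw status section, rc-code values and all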

-- Vossel

 lrm_rsc_op - rc-code=7
 on a node where the resource is stopped.
 
 Is this the correct way to check it? Is there any documentation?
 Thanks.
 
 
 
 Mit freundlichen Grüßen,
 
 Michael Schwartzkopff
 
 --
 [*] sys4 AG
 
 http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
 Franziskanerstraße 15, 81669 München
 
 Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
 Vorstand: Patrick Ben Koetter, Marc Schiffbauer
 Aufsichtsratsvorsitzender: Florian Kirstein
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Question about new migration

2014-01-15 Thread David Vossel
- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, January 15, 2014 3:42:23 AM
 Subject: Re: [Pacemaker] Question about new migration
 
 
 On 15 Jan 2014, at 7:12 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote:
 
  Hi David,
  
  With new migration logic, when VM was migrated by 'node standby',
  start was performed in migrate_target. (migrate_from was not performed.)
  
  Is this the designed behavior?

No, this is a bug. In this instance the partial migration should have continued 
regardless of whether the transition was aborted or not.

I don't think this is a new bug, though; I think this existed in the previous 
migration logic as well.  I think I understand what is going on.  I'll 
make a patch.

-- Vossel


  
  # crm_mon -rf1
  Stack: corosync
  Current DC: bl460g1n6 (3232261592) - partition with quorum
  Version: 1.1.11-0.27.b48276b.git.el6-b48276b
  2 Nodes configured
  3 Resources configured
  
  Online: [ bl460g1n6 bl460g1n7 ]
  
  Full list of resources:
  
  prmVM2  (ocf::heartbeat:VirtualDomain): Started bl460g1n6
  Clone Set: clnPing [prmPing]
  Started: [ bl460g1n6 bl460g1n7 ]
  
  Node Attributes:
  * Node bl460g1n6:
 + default_ping_set  : 100
  * Node bl460g1n7:
 + default_ping_set  : 100
  
  # crm node standby bl460g1n6
  # egrep do_lrm_rsc_op:|process_lrm_event: ha-log | grep prmVM2
  Jan 15 15:39:22 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op:
  Performing key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b
  op=prmVM2_migrate_to_0
  Jan 15 15:39:28 bl460g1n6 crmd[30795]:   notice: process_lrm_event:
  LRM operation prmVM2_migrate_to_0 (call=16, rc=0, cib-update=66,
  confirmed=true) ok
  Jan 15 15:39:30 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op:
  Performing key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b
  op=prmVM2_stop_0
 
 
 Looks like the transition was aborted (5) and another (6) calculated.
 
 Compare action:transition:expected_rc:uuid
 
  key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b
 
 and
 
  key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b
 
 
 
  Jan 15 15:39:30 bl460g1n6 crmd[30795]:   notice: process_lrm_event:
  LRM operation prmVM2_stop_0 (call=19, rc=0, cib-update=68,
  confirmed=true) ok
  
  Jan 15 15:39:30 bl460g1n7 crmd[29923]: info: do_lrm_rsc_op:
  Performing key=8:6:0:be72ea63-75a9-4de4-a591-e716f960743b
  op=prmVM2_start_0
  Jan 15 15:39:30 bl460g1n7 crmd[29923]:   notice: process_lrm_event:
  LRM operation prmVM2_start_0 (call=13, rc=0, cib-update=17,
  confirmed=true) ok
  
  
  Best Regards,
  Kazunori INOUE
  pcmk-Wed-15-Jan-2014.tar.bz2___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Question about new migration

2014-01-15 Thread David Vossel




- Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, January 15, 2014 10:27:49 AM
 Subject: Re: [Pacemaker] Question about new migration
 
 - Original Message -
  From: Andrew Beekhof and...@beekhof.net
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Wednesday, January 15, 2014 3:42:23 AM
  Subject: Re: [Pacemaker] Question about new migration
  
  
  On 15 Jan 2014, at 7:12 pm, Kazunori INOUE kazunori.ino...@gmail.com
  wrote:
  
   Hi David,
   
   With new migration logic, when VM was migrated by 'node standby',
   start was performed in migrate_target. (migrate_from was not performed.)
   
   Is this the designed behavior?
 
 no, this is a bug. In this instance the partial migration should have
 continued regardless if the transition was aborted or not.
 
 I don't think this is a new bug though, I think this existed in the previous
 migration logic as well.  I think I understand what is going on though.
 I'll make a patch

This fixes the problem
https://github.com/ClusterLabs/pacemaker/commit/b578680e4a16d915c130d5928cf9d9af296f2414

Thanks for testing the new migration logic out :D

 -- Vossel
 
 
   
   # crm_mon -rf1
   Stack: corosync
   Current DC: bl460g1n6 (3232261592) - partition with quorum
   Version: 1.1.11-0.27.b48276b.git.el6-b48276b
   2 Nodes configured
   3 Resources configured
   
   Online: [ bl460g1n6 bl460g1n7 ]
   
   Full list of resources:
   
   prmVM2  (ocf::heartbeat:VirtualDomain): Started bl460g1n6
   Clone Set: clnPing [prmPing]
   Started: [ bl460g1n6 bl460g1n7 ]
   
   Node Attributes:
   * Node bl460g1n6:
  + default_ping_set  : 100
   * Node bl460g1n7:
  + default_ping_set  : 100
   
   # crm node standby bl460g1n6
   # egrep do_lrm_rsc_op:|process_lrm_event: ha-log | grep prmVM2
   Jan 15 15:39:22 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op:
   Performing key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b
   op=prmVM2_migrate_to_0
   Jan 15 15:39:28 bl460g1n6 crmd[30795]:   notice: process_lrm_event:
   LRM operation prmVM2_migrate_to_0 (call=16, rc=0, cib-update=66,
   confirmed=true) ok
   Jan 15 15:39:30 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op:
   Performing key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b
   op=prmVM2_stop_0
  
  
  Looks like the transition was aborted (5) and another (6) calculated.
  
  Compare action:transition:expected_rc:uuid
  
   key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b
  
  and
  
   key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b
  
  
  
   Jan 15 15:39:30 bl460g1n6 crmd[30795]:   notice: process_lrm_event:
   LRM operation prmVM2_stop_0 (call=19, rc=0, cib-update=68,
   confirmed=true) ok
   
   Jan 15 15:39:30 bl460g1n7 crmd[29923]: info: do_lrm_rsc_op:
   Performing key=8:6:0:be72ea63-75a9-4de4-a591-e716f960743b
   op=prmVM2_start_0
   Jan 15 15:39:30 bl460g1n7 crmd[29923]:   notice: process_lrm_event:
   LRM operation prmVM2_start_0 (call=13, rc=0, cib-update=17,
   confirmed=true) ok
   
   
   Best Regards,
   Kazunori INOUE
   pcmk-Wed-15-Jan-2014.tar.bz2___
   Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
   
   Project Home: http://www.clusterlabs.org
   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
   Bugs: http://bugs.clusterlabs.org
  
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
  
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Time to get ready for 1.1.11

2014-01-15 Thread David Vossel
- Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Tuesday, January 7, 2014 4:50:11 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 - Original Message -
  From: Andrew Beekhof and...@beekhof.net
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Thursday, December 19, 2013 2:25:00 PM
  Subject: Re: [Pacemaker] Time to get ready for 1.1.11
  
  
  On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:
  
   David/Andrew,
   
   Once 1.1.11 final is released, is it considered the new stable series of
   Pacemaker,
  
  yes
  
   or should 1.1.10 still be used in very stable/critical production
   environments?
   
   Thanks,
   
   Andrew
   
   - Original Message -
   From: David Vossel dvos...@redhat.com
   To: The Pacemaker cluster resource manager
   pacemaker@oss.clusterlabs.org
   Sent: Wednesday, December 11, 2013 3:33:46 PM
   Subject: Re: [Pacemaker] Time to get ready for 1.1.11
   
   - Original Message -
   From: Andrew Beekhof and...@beekhof.net
   To: The Pacemaker cluster resource manager
   pacemaker@oss.clusterlabs.org
   Sent: Wednesday, November 20, 2013 9:02:40 PM
   Subject: [Pacemaker] Time to get ready for 1.1.11
   
    With over 400 updates since the release of 1.1.10, it's time to start
   thinking about a new release.
   
   Today I have tagged release candidate 1[1].
   The most notable fixes include:
   
+ attrd: Implementation of a truly atomic attrd for use with corosync 2.x
+ cib: Allow values to be added/updated and removed in a single update
+ cib: Support XML comments in diffs
+ Core: Allow blackbox logging to be disabled with SIGUSR2
+ crmd: Do not block on proxied calls from pacemaker_remoted
+ crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
+ crmd: Use the load on our peers to know how many jobs to send them
+ crm_mon: add --hide-headers option to hide all headers
+ crm_report: Collect logs directly from journald if available
+ Fencing: On timeout, clean up the agent's entire process group
+ Fencing: Support agents that need the host to be unfenced at startup
+ ipc: Raise the default buffer size to 128k
+ PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
+ PE: Allow location constraints to take a regex pattern to match against resource IDs
+ pengine: Distinguish between the agent being missing and something the agent needs being missing
+ remote: Properly version the remote connection protocol
+ services: Detect missing agents and permission errors before forking
+ Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
+ Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known
+ Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers
   
   If you are a user of `pacemaker_remoted`, you should take the time to
   read
   about changes to the online wire protocol[2] that are present in this
   release.
   
   [1]
   https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
   [2]
   http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
   
   To build `rpm` packages for testing:
   
   1. Clone the current sources:
   
 # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
 # cd pacemaker
   
    1. If you haven't already, install Pacemaker's dependencies
   
 [Fedora] # sudo yum install -y yum-utils
 [ALL] # make rpm-dep
   
   1. Build Pacemaker
   
 # make rc
   
   1. Copy the rpms and deploy as needed
   
   
   A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
   https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2
   
   Assuming no major regressions are encountered during testing, this tag
   will
   become the final Pacemaker-1.1.11 release a week from today.
   
   -- Vossel
 
 Alright, New RC time. Pacemaker-1.1.11-rc3.
 
 If no regressions are encountered, rc3 will become the 1.1.11 final release a
 week from today.
 
 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc3
 
 CHANGES RC2 vs RC3
 Fix: ipc: fix memory leak for failed ipc client connections.
 Fix: pengine: Fixes memory leak in regex pattern matching code for
 constraints.
 Low: Avoid potentially misleading and inaccurate compression time log msg
 Fix: crm_report: Suppress logging errors after the target directory has been
 compressed
 Fix: crm_attribute: Do not swallow hostname lookup failures
 Fix: crmd: Avoid deleting the 'shutdown' attribute
 Log: attrd: Quote attribute names
 Doc: Pacemaker_Explained: Fix formatting
 


Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1

2014-01-10 Thread David Vossel
- Original Message -
 From: Kazunori INOUE kazunori.ino...@gmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Friday, January 10, 2014 5:23:04 AM
 Subject: Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
 
 2014/1/9 Andrew Beekhof and...@beekhof.net:
 
  On 8 Jan 2014, at 9:15 pm, Kazunori INOUE kazunori.ino...@gmail.com
  wrote:
 
  2014/1/8 Andrew Beekhof and...@beekhof.net:
 
  On 18 Dec 2013, at 9:50 pm, Kazunori INOUE kazunori.ino...@gmail.com
  wrote:
 
  Hi David,
 
  2013/12/18 David Vossel dvos...@redhat.com:
 
   That's a really weird one... I don't see how it is possible for op->id
  to be NULL there.   You might need to give valgrind a shot to detect
  whatever is really going on here.
 
  -- Vossel
 
  Thank you for advice. I try it.
 
  Any update on this?
 
 
   We are still investigating the cause. It was not reproduced when I ran it
   under valgrind.
  And it was reproduced in RC3.
 
   So it happened with RC3 without valgrind, but not with RC3 under valgrind?
   That's concerning.
 
  Nothing in the valgrind output?
 
 
 The cause was found.
 
  230 gboolean
  231 operation_finalize(svc_action_t * op)
  232 {
  233     int recurring = 0;
  234
  235     if (op->interval) {
  236         if (op->cancel) {
  237             op->status = PCMK_LRM_OP_CANCELLED;
  238             cancel_recurring_action(op);
  239         } else {
  240             recurring = 1;
  241             op->opaque->repeat_timer = g_timeout_add(op->interval,
  242                                                      recurring_action_timer, (void *)op);
  243         }
  244     }
  245
  246     if (op->opaque->callback) {
  247         op->opaque->callback(op);
  248     }
  249
  250     op->pid = 0;
  251
  252     if (!recurring) {
  253         /*
  254          * If this is a recurring action, do not free explicitly.
  255          * It will get freed whenever the action gets cancelled.
  256          */
  257         services_action_free(op);
  258         return TRUE;
  259     }
  260     return FALSE;
  261 }
 
  When op->id is not 0, op is not removed from the hash table in the
  cancel_recurring_action function (l.238).
  However, op is freed in the services_action_free function (l.257).  That
  is, the freed data remains in the hash table, and the g_hash_table_lookup
  function may later look up the freed data.
 
  Therefore, I added a change so that the g_hash_table_remove function is
  always called when the g_hash_table_replace function is called (in the
  services_action_async function).
  So far, the segfault has not happened again.

Awesome, thanks for tracking this down.  I created a modified version of your 
patch and put it up for review as a pacemaker pull request.
https://github.com/ClusterLabs/pacemaker/pull/408

-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Time to get ready for 1.1.11

2014-01-07 Thread David Vossel
- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, December 19, 2013 2:25:00 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 
 On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:
 
  David/Andrew,
  
  Once 1.1.11 final is released, is it considered the new stable series of
  Pacemaker,
 
 yes
 
  or should 1.1.10 still be used in very stable/critical production
  environments?
  
  Thanks,
  
  Andrew
  
  - Original Message -
  From: David Vossel dvos...@redhat.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Wednesday, December 11, 2013 3:33:46 PM
  Subject: Re: [Pacemaker] Time to get ready for 1.1.11
  
  - Original Message -
  From: Andrew Beekhof and...@beekhof.net
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Wednesday, November 20, 2013 9:02:40 PM
  Subject: [Pacemaker] Time to get ready for 1.1.11
  
   With over 400 updates since the release of 1.1.10, it's time to start
  thinking about a new release.
  
  Today I have tagged release candidate 1[1].
  The most notable fixes include:
  
    + attrd: Implementation of a truly atomic attrd for use with corosync 2.x
    + cib: Allow values to be added/updated and removed in a single update
    + cib: Support XML comments in diffs
    + Core: Allow blackbox logging to be disabled with SIGUSR2
    + crmd: Do not block on proxied calls from pacemaker_remoted
    + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
    + crmd: Use the load on our peers to know how many jobs to send them
    + crm_mon: add --hide-headers option to hide all headers
    + crm_report: Collect logs directly from journald if available
    + Fencing: On timeout, clean up the agent's entire process group
    + Fencing: Support agents that need the host to be unfenced at startup
    + ipc: Raise the default buffer size to 128k
    + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
    + PE: Allow location constraints to take a regex pattern to match against resource IDs
    + pengine: Distinguish between the agent being missing and something the agent needs being missing
    + remote: Properly version the remote connection protocol
    + services: Detect missing agents and permission errors before forking
    + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
    + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known
    + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers
  
  If you are a user of `pacemaker_remoted`, you should take the time to
  read
  about changes to the online wire protocol[2] that are present in this
  release.
  
  [1]
  https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
  [2]
  http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
  
  To build `rpm` packages for testing:
  
  1. Clone the current sources:
  
# git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
# cd pacemaker
  
   1. If you haven't already, install Pacemaker's dependencies
  
[Fedora] # sudo yum install -y yum-utils
[ALL]   # make rpm-dep
  
  1. Build Pacemaker
  
# make rc
  
  1. Copy the rpms and deploy as needed
  
  
  A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
  https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2
  
  Assuming no major regressions are encountered during testing, this tag
  will
  become the final Pacemaker-1.1.11 release a week from today.
  
  -- Vossel

Alright, New RC time. Pacemaker-1.1.11-rc3.

If no regressions are encountered, rc3 will become the 1.1.11 final release a 
week from today.

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc3

CHANGES RC2 vs RC3 
Fix: ipc: fix memory leak for failed ipc client connections.
Fix: pengine: Fixes memory leak in regex pattern matching code for constraints.
Low: Avoid potentially misleading and inaccurate compression time log msg
Fix: crm_report: Suppress logging errors after the target directory has been 
compressed
Fix: crm_attribute: Do not swallow hostname lookup failures
Fix: crmd: Avoid deleting the 'shutdown' attribute
Log: attrd: Quote attribute names
Doc: Pacemaker_Explained: Fix formatting

-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker 1.1.10 + RHEL 7 beta issues

2014-01-02 Thread David Vossel
- Original Message -
 From: Digimer li...@alteeve.ca
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, January 1, 2014 5:38:47 PM
 Subject: Re: [Pacemaker] pacemaker 1.1.10 + RHEL 7 beta issues
 
 
  Is this a bug?

There's too much going on here for me to tell. Can you provide a crm_report 
that contains the pengine files before and after that constraint gets deleted 
allowing apache to start again?  

-- Vossel

 
 
 
 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Forming single node cluster?

2014-01-02 Thread David Vossel




- Original Message -
 From: Digimer li...@alteeve.ca
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, January 2, 2014 2:03:26 PM
 Subject: Re: [Pacemaker] Forming single node cluster?
 
 On 02/01/14 02:36 PM, John Wei wrote:
  Any way to form a single node cluster? I am evaluating pacemaker. Hope I
  can do this with just single server.
 
  John
 
 I'm not sure how much you can evaluate with just one node, but
 technically, I see no reason why you couldn't. You would have to disable
 both quorum and stonith, and there would be no resource recovery or
 migration of course.

These pcs commands will accomplish this.

pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
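
For completeness, actually creating and starting the one-node cluster with pcs
could look something like this (a sketch; the cluster and node names are
placeholders, and the exact syntax depends on your pcs version):

    # placeholder names; adjust for your host
    pcs cluster setup --name onenode node1
    pcs cluster start node1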

-- Vossel

 
 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh

2013-12-23 Thread David Vossel
- Original Message -
 From: Digimer li...@alteeve.ca
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Saturday, December 21, 2013 2:39:46 PM
 Subject: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 +   
 fence_virsh
 
 Hi all,
 
I'm trying to learn pacemaker (still) using a pair of RHEL 7 beta
 VMs. I've got stonith configured and it technically works (crashed node
 reboots), but pacemaker hangs...
 
 Here is the config:
 
 
 Cluster Name: rhel7-pcmk
 Corosync Nodes:
   rhel7-01.alteeve.ca rhel7-02.alteeve.ca
 Pacemaker Nodes:
   rhel7-01.alteeve.ca rhel7-02.alteeve.ca
 
 Resources:
 
 Stonith Devices:
   Resource: fence_n01_virsh (class=stonith type=fence_virsh)
Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot
 login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01
Operations: monitor interval=60s (fence_n01_virsh-monitor-interval-60s)
   Resource: fence_n02_virsh (class=stonith type=fence_virsh)
Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot


When using fence_virt, the easiest way to configure everything is to name the 
actual virtual machines the same as what their corosync node names are going to 
be.

If you run this command in a virtual machine, you can see the names fence_virt 
thinks all the nodes are.
fence_xvm -o list
node1  c4dbe904-f51a-d53f-7ef0-2b03361c6401 on
node2  c4dbe904-f51a-d53f-7ef0-2b03361c6402 on
node3  c4dbe904-f51a-d53f-7ef0-2b03361c6403 on

If you name the vm the same as the node name, you don't even have to list the 
static host list. Stonith will do all that magic behind the scenes. If the node 
names do not match, try the 'pcmk_host_map' option. I believe you should be 
able to map the corosync node name to the vm's name using that option.
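
As a sketch, reusing the node and VM names from your config (the device name
here is made up, and the fence_virsh parameters should be adjusted for your
host), the mapping could be expressed like this:

    pcs stonith create fence_virsh_all fence_virsh \
        ipaddr=lemass login=root passwd_script=/root/lemass.pw \
        pcmk_host_map="rhel7-01.alteeve.ca:rhel7_01;rhel7-02.alteeve.ca:rhel7_02" \
        op monitor interval=60s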

Hope that helps :)

-- Vossel


 login=root passwd_script=/root/lemass.pw port=rhel7_02
Operations: monitor interval=60s (fence_n02_virsh-monitor-interval-60s)
 Fencing Levels:
 
 Location Constraints:
 Ordering Constraints:
 Colocation Constraints:
 
 Cluster Properties:
   cluster-infrastructure: corosync
   dc-version: 1.1.10-19.el7-368c726
   no-quorum-policy: ignore
   stonith-enabled: true
 
 
 Here are the logs:
 
 
 Dec 21 14:36:07 rhel7-01 corosync[1709]: [TOTEM ] A processor failed,
 forming new configuration.
 Dec 21 14:36:09 rhel7-01 corosync[1709]: [TOTEM ] A new membership
 (192.168.122.101:24) was formed. Members left: 2
 Dec 21 14:36:09 rhel7-01 corosync[1709]: [QUORUM] Members[1]: 1
 Dec 21 14:36:09 rhel7-01 corosync[1709]: [MAIN  ] Completed service
 synchronization, ready to provide service.
 Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: crm_update_peer_state:
 pcmk_quorum_notification: Node rhel7-02.alteeve.ca[2] - state is now
 lost (was member)
 Dec 21 14:36:09 rhel7-01 crmd[1730]: warning: reap_dead_nodes: Our DC
 node (rhel7-02.alteeve.ca) left the cluster
 Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State
 transition S_NOT_DC - S_ELECTION [ input=I_ELECTION
 cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
 Dec 21 14:36:09 rhel7-01 pacemakerd[1724]: notice:
 crm_update_peer_state: pcmk_quorum_notification: Node
 rhel7-02.alteeve.ca[2] - state is now lost (was member)
 Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State
 transition S_ELECTION - S_INTEGRATION [ input=I_ELECTION_DC
 cause=C_FSA_INTERNAL origin=do_election_check ]
 Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_local_callback:
 Sending full refresh (origin=crmd)
 Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_trigger_update:
 Sending flush op to all hosts for: probe_complete (true)
 Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: unpack_config: On loss
 of CCM Quorum: Ignore
 Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: pe_fence_node: Node
 rhel7-02.alteeve.ca will be fenced because fence_n02_virsh is thought to
 be active there
 Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: custom_action: Action
 fence_n02_virsh_stop_0 on rhel7-02.alteeve.ca is unrunnable (offline)
 Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: stage6: Scheduling Node
 rhel7-02.alteeve.ca for STONITH
 Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: LogActions: Move
 fence_n02_virsh   (Started rhel7-02.alteeve.ca - rhel7-01.alteeve.ca)
 Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: process_pe_message:
 Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-2.bz2
 Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: te_fence_node: Executing
 reboot fencing operation (11) on rhel7-02.alteeve.ca (timeout=6)
 Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: handle_request:
 Client crmd.1730.4f6ea9e1 wants to fence (reboot) 'rhel7-02.alteeve.ca'
 with device '(any)'
 Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
 initiate_remote_stonith_op: Initiating remote operation reboot for
 rhel7-02.alteeve.ca: ea720bbf-aeab-43bb-a196-3a4c091dea75 (0)
 Dec 21 14:36:11 rhel7-01 

Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh

2013-12-23 Thread David Vossel




- Original Message -
 From: Digimer li...@alteeve.ca
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Monday, December 23, 2013 12:42:23 PM
 Subject: Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 +   
 fence_virsh
 
 On 23/12/13 01:30 PM, David Vossel wrote:
  - Original Message -
  From: Digimer li...@alteeve.ca
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Saturday, December 21, 2013 2:39:46 PM
  Subject: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 +
 fence_virsh
 
  Hi all,
 
  I'm trying to learn pacemaker (still) using a pair of RHEL 7 beta
  VMs. I've got stonith configured and it technically works (crashed node
  reboots), but pacemaker hangs...
 
  Here is the config:
 
  
  Cluster Name: rhel7-pcmk
  Corosync Nodes:
 rhel7-01.alteeve.ca rhel7-02.alteeve.ca
  Pacemaker Nodes:
 rhel7-01.alteeve.ca rhel7-02.alteeve.ca
 
  Resources:
 
  Stonith Devices:
 Resource: fence_n01_virsh (class=stonith type=fence_virsh)
  Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot
  login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01
  Operations: monitor interval=60s
  (fence_n01_virsh-monitor-interval-60s)
 Resource: fence_n02_virsh (class=stonith type=fence_virsh)
  Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot
 
 
  When using fence_virt, the easiest way to configure everything is to name
  the actual virtual machines the same as what their corosync node names are
  going to be.
 
  If you run this command in a virtual machine, you can see the names
  fence_virt thinks all the nodes are.
  fence_xvm -o list
  node1  c4dbe904-f51a-d53f-7ef0-2b03361c6401 on
  node2  c4dbe904-f51a-d53f-7ef0-2b03361c6402 on
  node3  c4dbe904-f51a-d53f-7ef0-2b03361c6403 on
 
  If you name the vm the same as the node name, you don't even have to list
  the static host list. Stonith will do all that magic behind the scenes. If
  the node names do not match, try the 'pcmk_host_map' option. I believe you
  should be able to map the corosync node name to the vm's name using that
  option.
 
  Hope that helps :)
 
  -- Vossel
 
 Hi David,
 
I'm using fence_virsh,

ah sorry, missed that.

 not fence_virtd/fence_xvm. For reasons I've
 not been able to resolve, fence_xvm has been unreliable on Fedora for
 some time now.

the multicast bug :(

 
 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Typo? pcs cluster push question....

2013-12-19 Thread David Vossel




- Original Message -
 From: Steven Silk - NOAA Affiliate steven.s...@noaa.gov
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, December 19, 2013 11:54:36 AM
 Subject: [Pacemaker] Typo? pcs cluster push question
 
 I have been working from
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_example.html
 
 at the bottom of the page is :
 
 Now push the configuration into the cluster
 
 # pcs cluster push cib stonith_cfg
 
 This does not work - I believe the proper syntax is
 
 # pcs cluster cib-push stonith_cfg
 
 Or am I working with a different version of pcs?

You are using a newer version of pcs than that document is based off of.  We 
need to go through and update some command syntax to reflect recent changes to 
pcs.
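
For reference, with the newer pcs syntax the offline-edit-and-push workflow
from that document looks roughly like this (sketch):

    # dump the live CIB into a working file
    pcs cluster cib stonith_cfg
    # make changes against the file instead of the live cluster
    pcs -f stonith_cfg property set stonith-enabled=true
    # push the result back to the cluster
    pcs cluster cib-push stonith_cfg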

-- Vossel

 thanks!
 
 
 Steven Silk
 
 303 497 3112
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Time to get ready for 1.1.11

2013-12-18 Thread David Vossel




- Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, December 11, 2013 3:33:46 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 - Original Message -
  From: Andrew Beekhof and...@beekhof.net
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Wednesday, November 20, 2013 9:02:40 PM
  Subject: [Pacemaker] Time to get ready for 1.1.11
  
   With over 400 updates since the release of 1.1.10, it's time to start
  thinking about a new release.
  
  Today I have tagged release candidate 1[1].
  The most notable fixes include:
  
 + attrd: Implementation of a truly atomic attrd for use with corosync 2.x
 + cib: Allow values to be added/updated and removed in a single update
 + cib: Support XML comments in diffs
 + Core: Allow blackbox logging to be disabled with SIGUSR2
 + crmd: Do not block on proxied calls from pacemaker_remoted
 + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
 + crmd: Use the load on our peers to know how many jobs to send them
 + crm_mon: add --hide-headers option to hide all headers
 + crm_report: Collect logs directly from journald if available
 + Fencing: On timeout, clean up the agent's entire process group
 + Fencing: Support agents that need the host to be unfenced at startup
 + ipc: Raise the default buffer size to 128k
 + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
 + PE: Allow location constraints to take a regex pattern to match against resource IDs
 + pengine: Distinguish between the agent being missing and something the agent needs being missing
 + remote: Properly version the remote connection protocol
 + services: Detect missing agents and permission errors before forking
 + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
 + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known
 + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers
  
  If you are a user of `pacemaker_remoted`, you should take the time to read
  about changes to the online wire protocol[2] that are present in this
  release.
  
  [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
  [2]
  http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
  
  To build `rpm` packages for testing:
  
  1. Clone the current sources:
  
 # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
 # cd pacemaker
  
  1. If you haven't already, install Pacemaker's dependencies
  
 [Fedora] # sudo yum install -y yum-utils
 [ALL]# make rpm-dep
  
  1. Build Pacemaker
  
 # make rc
  
  1. Copy the rpms and deploy as needed
  
 
 A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2
 
 Assuming no major regressions are encountered during testing, this tag will
 become the final Pacemaker-1.1.11 release a week from today.

I have found a compile-time error with 1.1.11 on rhel6-based systems and there 
has been an lrmd crash reported this week.  After these issues get resolved 
there will be an rc3.  This will result in 1.1.11's release being pushed to 
after the new year.

Thanks for all the continued help in shaping v1.1.11 into a solid release. We 
are getting very close :)

-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1

2013-12-17 Thread David Vossel
- Original Message -
 From: Kazunori INOUE kazunori.ino...@gmail.com
 To: pm pacemaker@oss.clusterlabs.org
 Sent: Tuesday, December 17, 2013 5:43:53 AM
 Subject: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
 
 Hi,
 
 When repeated 'node standby' and 'node online', lrmd crashed with
 SIGSEGV because op->id in cancel_recurring_action() was NULL.

That's a really weird one... I don't see how it is possible for op->id to be 
NULL there.   You might need to give valgrind a shot to detect whatever is 
really going on here.

-- Vossel

 
 Dec 17 19:01:21 vm3 crmd[2433]: info: do_state_transition: State
 transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
 cause=C_IPC_MESSAGE origin=handle_response ]
 Dec 17 19:01:21 vm3 crmd[2433]: info: do_te_invoke: Processing
 graph 437 (ref=pe_calc-dc-1387274481-5672) derived from
 /var/lib/pacemaker/pengine/pe-input-437.bz2
 Dec 17 19:01:21 vm3 crmd[2433]:   notice: te_rsc_command: Initiating
 action 17: stop prmStonith4_stop_0 on vm3 (local)
 Dec 17 19:01:21 vm3 crmd[2433]: info: do_lrm_rsc_op: Performing
 key=17:437:0:40d7b9a2-c373-4459-a811-9c225d1a9555
 op=prmStonith4_stop_0
 Dec 17 19:01:21 vm3 lrmd[2430]: info: log_execute: executing -
 rsc:prmStonith4 action:stop call_id:3487
 Dec 17 19:01:21 vm3 stonith-ng[2429]: info: stonith_command:
 Processed st_device_remove from lrmd.2430: OK (0)
 Dec 17 19:01:21 vm3 lrmd[2430]: info: log_finished: finished -
 rsc:prmStonith4 action:stop call_id:3487  exit-code:0 exec-time:0ms
 queue-time:0ms
 Dec 17 19:01:21 vm3 pengine[2432]:   notice: process_pe_message:
 Calculated Transition 437: /var/lib/pacemaker/pengine/pe-input-437.bz2
 Dec 17 19:01:21 vm3 crmd[2433]:   notice: te_rsc_command: Initiating
 action 33: stop prmPg_stop_0 on vm3 (local)
 Dec 17 19:01:21 vm3 lrmd[2430]: info: cancel_recurring_action:
 Cancelling operation prmPg_monitor_1
 Dec 17 19:01:21 vm3 crmd[2433]: info: do_lrm_rsc_op: Performing
 key=33:437:0:40d7b9a2-c373-4459-a811-9c225d1a9555 op=prmPg_stop_0
 Dec 17 19:01:21 vm3 lrmd[2430]: info: log_execute: executing -
 rsc:prmPg action:stop call_id:3489
 Dec 17 19:01:21 vm3 crmd[2433]: info: process_lrm_event: LRM
 operation prmStonith4_monitor_360 (call=3473, status=1,
 cib-update=0, confirmed=true) Cancelled
 Dec 17 19:01:21 vm3 crmd[2433]:   notice: process_lrm_event: LRM
 operation prmStonith4_stop_0 (call=3487, rc=0, cib-update=3090,
 confirmed=true) ok
 Dec 17 19:01:21 vm3 crmd[2433]: info: process_lrm_event: LRM
 operation prmPg_monitor_1 (call=3485, status=1, cib-update=0,
 confirmed=true) Cancelled
 Dec 17 19:01:21 vm3 crmd[2433]: info: match_graph_event: Action
 prmStonith4_stop_0 (17) confirmed on vm3 (rc=0)
 Dec 17 19:01:21 vm3 crmd[2433]:   notice: te_rsc_command: Initiating
 action 40: stop prmPing_stop_0 on vm3 (local)
 Dec 17 19:01:21 vm3 cib[2428]: info: cib_process_request:
 Completed cib_modify operation for section status: OK (rc=0,
 origin=local/crmd/3090, version=0.440.2)
 Dec 17 19:01:21 vm3 stonith-ng[2429]: info: crm_client_destroy:
 Destroying 0 events
 Dec 17 19:01:21 vm3 pacemakerd[2424]:error: child_death_dispatch:
 Managed process 2430 (lrmd) dumped core
 Dec 17 19:01:21 vm3 pacemakerd[2424]:   notice: pcmk_child_exit: Child
 process lrmd terminated with signal 11 (pid=2430, core=1)
 Dec 17 19:01:21 vm3 pacemakerd[2424]:   notice: pcmk_process_exit:
 Respawning failed child process: lrmd
 Dec 17 19:01:21 vm3 pacemakerd[2424]:error: pcmk_process_exit:
 Rebooting system
 Dec 17 19:10:40 vm3 root: Mark:pcmk:1387275040
 
 $ gdb /usr/libexec/pacemaker/lrmd core.2430
 (gdb) bt
 #0  0x00323f8480ac in vfprintf () from /lib64/libc.so.6
 #1  0x00323f86f9d2 in vsnprintf () from /lib64/libc.so.6
 #2  0x003fcb81726d in qb_log_real_va_ (cs=0x3fcf208658,
 ap=0x76f5fc80) at log.c:230
 #3  0x003fcb8173ea in qb_log_real_ (cs=0x3fcf208658) at log.c:255
 #4  0x003fcf003a9c in cancel_recurring_action (op=0xb9fae0) at
 services.c:356
 #5  0x003fcf003bc6 in services_action_cancel (name=0xb9f350
 prmPing, action=0xb9ee90 monitor, interval=1) at
 services.c:381
 #6  0x00406595 in cancel_op (rsc_id=0xb9f350 prmPing,
 action=0xb9ee90 monitor, interval=1) at lrmd.c:1197
 #7  0x004067aa in process_lrmd_rsc_cancel (client=0xb926c0,
 id=7030, request=0xb95ad0) at lrmd.c:1261
 #8  0x00406a51 in process_lrmd_message (client=0xb926c0,
 id=7030, request=0xb95ad0) at lrmd.c:1300
 #9  0x00402a06 in lrmd_ipc_dispatch (c=0xb91af0,
 data=0x7f9f30acbc08, size=362) at main.c:141
 #10 0x003fcb8126f8 in _process_request_ (c=0xb91af0,
 ms_timeout=10) at ipcs.c:698
 #11 0x003fcb812ad5 in qb_ipcs_dispatch_connection_request (fd=5,
 revents=1, data=0xb91af0) at ipcs.c:801
 #12 0x003fcc0327b1 in gio_read_socket (gio=0xb92880,
 condition=G_IO_IN, data=0xb91138) at mainloop.c:437
 #13 0x003fc9c3feb2 in g_main_context_dispatch () from
 

Re: [Pacemaker] Time to get ready for 1.1.11

2013-12-13 Thread David Vossel




- Original Message -
 From: Andrey Groshev gre...@yandex.ru
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, December 12, 2013 4:14:20 AM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 And why not include it?
 https://github.com/beekhof/pacemaker/commit/a4bdc9a

That commit is in the release candidate and will be included in the final 
1.1.11 release :)

https://github.com/ClusterLabs/pacemaker/commit/a4bdc9a

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] 1.1.10 problems on CentOS 6.5

2013-12-12 Thread David Vossel
- Original Message -
 From: Diego Remolina diego.remol...@physics.gatech.edu
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, December 12, 2013 7:39:37 AM
 Subject: [Pacemaker] 1.1.10 problems on CentOS 6.5
 
 I was successfully running 1.1.8 on a pair of CentOS 6.4 servers and
 after updating to CentOS 6.5 and 1.1.10, pacemaker misbehaves.
 
 The first symptoms appeared with the 1.1.10-14.el6 packages. About 20
 hours after the upgrade, the first drbd_monitor issues came out.
 
 Dec 09 18:50:12 Updated: pacemaker-libs-1.1.10-14.el6.x86_64
 Dec 09 18:50:13 Updated: pacemaker-cli-1.1.10-14.el6.x86_64
 Dec 09 18:50:13 Updated: pacemaker-cluster-libs-1.1.10-14.el6.x86_64
 Dec 09 18:50:13 Updated: pacemaker-1.1.10-14.el6.x86_64
 
 Dec 10 15:27:55 ysmha01 lrmd[3076]:  warning: child_timeout_callback:
 drbd_export_monitor_29000 process (PID 19608) timed out
 Dec 10 15:27:55 ysmha01 lrmd[3076]:  warning: operation_finished:
 drbd_export_monitor_29000:19608 - timed out after 2ms
 Dec 10 15:27:55 ysmha01 crmd[3079]:error: process_lrm_event: LRM
 operation drbd_export_monitor_29000 (77) Timed Out (timeout=2ms)
 Dec 10 15:27:56 ysmha01 crmd[3079]:   notice: process_lrm_event: LRM
 operation drbd_export_notify_0 (call=99, rc=0, cib-update=0,
 confirmed=true) ok

These errors look like a real resource failure.  Pacemaker appears to be doing 
its job here. In this case the drbd script is being called, but never exiting 
(which results in the timeout).  Your update of pacemaker likely has nothing to 
do with this. An update of anything DRBD related would make more sense.
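
One way to confirm that is to run the monitor action by hand with the same
parameters the cluster uses and see whether it hangs. A rough sketch, assuming
the ocf:linbit:drbd agent and guessing the drbd_resource value from the
resource name (substitute your real configuration):

    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_drbd_resource=export   # assumption; check 'pcs resource show drbd_export'
    time /usr/lib/ocf/resource.d/linbit/drbd monitor; echo "rc=$?"

If that call also hangs outside of pacemaker, the problem is in DRBD or the
agent, not in the cluster manager.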

 At this point, I tried taking the node to standby and back to online and
 cleaning the resources to no avail. I tried stopping pacemaker without
 luck. I rebooted both servers and on Dec 11, the failure started with
 failure to monitor pingd, then drbd_monitor.
 
 Dec 11 16:12:10 ysmha01 lrmd[3060]:  warning: child_timeout_callback:
 pingd_monitor_2 process (PID 26237) timed out
 Dec 11 16:12:10 ysmha01 lrmd[3060]:  warning: operation_finished:
 pingd_monitor_2:26237 - timed out after 15000ms
 Dec 11 16:12:10 ysmha01 crmd[3063]:error: process_lrm_event: LRM
 operation pingd_monitor_2 (35) Timed Out (timeout=15000ms)
 
 Dec 11 16:12:19 ysmha01 lrmd[3060]:  warning: child_timeout_callback:
 drbd_export_monitor_29000 process (PID 26268) timed out
 Dec 11 16:12:19 ysmha01 lrmd[3060]:  warning: operation_finished:
 drbd_export_monitor_29000:26268 - timed out after 2ms
 Dec 11 16:12:19 ysmha01 crmd[3063]:error: process_lrm_event: LRM
 operation drbd_export_monitor_29000 (62) Timed Out (timeout=2ms)

 I upgraded to the latest rpms yesterday afternoon (1.1.10-14.el6_5.1).
 Right before 1 am, there were issues again.
 
 Dec 12 00:49:39 ysmha01 pengine[3149]:   notice: process_pe_message:
 Calculated Transition 41: /var/lib/pacemaker/pengine/pe-input-173.bz2
 Dec 12 00:50:03 ysmha01 lrmd[3147]:  warning: child_timeout_callback:
 drbd_export_monitor_29000 process (PID 18496) timed out
 Dec 12 00:50:03 ysmha01 lrmd[3147]:  warning: operation_finished:
 drbd_export_monitor_29000:18496 - timed out after 2ms
 Dec 12 00:50:03 ysmha01 crmd[3150]:error: process_lrm_event: LRM
 operation drbd_export_monitor_29000 (60) Timed Out (timeout=2ms)
 
 I am for now manually running the machines without pacemaker. What
 suggestions do you have for me? What should I try first?

Manually running the commands works? Something weird is going on.
 
 - Revert to 1.1.8?
 - Could be something related to drbd in the new kernel? Downgrade kernel
 rpm?
 
 I can post logs on request, where would be a good place to do that?

make a crm_report, attach the crm_report file here.
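
For example (adjust the time window so it brackets one of the failures):

    crm_report --from "2013-12-12 00:30:00" --to "2013-12-12 01:30:00" drbd-timeouts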

 
 Thanks,
 
 Diego
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Time to get ready for 1.1.11

2013-12-11 Thread David Vossel
- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, November 20, 2013 9:02:40 PM
 Subject: [Pacemaker] Time to get ready for 1.1.11
 
  With over 400 updates since the release of 1.1.10, it's time to start
 thinking about a new release.
 
 Today I have tagged release candidate 1[1].
 The most notable fixes include:
 
    + attrd: Implementation of a truly atomic attrd for use with corosync 2.x
    + cib: Allow values to be added/updated and removed in a single update
    + cib: Support XML comments in diffs
    + Core: Allow blackbox logging to be disabled with SIGUSR2
    + crmd: Do not block on proxied calls from pacemaker_remoted
    + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
    + crmd: Use the load on our peers to know how many jobs to send them
    + crm_mon: add --hide-headers option to hide all headers
    + crm_report: Collect logs directly from journald if available
    + Fencing: On timeout, clean up the agent's entire process group
    + Fencing: Support agents that need the host to be unfenced at startup
    + ipc: Raise the default buffer size to 128k
    + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
    + PE: Allow location constraints to take a regex pattern to match against resource IDs
    + pengine: Distinguish between the agent being missing and something the agent needs being missing
    + remote: Properly version the remote connection protocol
    + services: Detect missing agents and permission errors before forking
    + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
    + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known
    + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers
 
 If you are a user of `pacemaker_remoted`, you should take the time to read
 about changes to the online wire protocol[2] that are present in this
 release.
 
 [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
 [2]
 http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
 
 To build `rpm` packages for testing:
 
 1. Clone the current sources:
 
# git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
# cd pacemaker
 
  1. If you haven't already, install Pacemaker's dependencies
 
[Fedora] # sudo yum install -y yum-utils
[ALL]  # make rpm-dep
 
 1. Build Pacemaker
 
# make rc
 
 1. Copy the rpms and deploy as needed
 

A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing. 
https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2

Assuming no major regressions are encountered during testing, this tag will 
become the final Pacemaker-1.1.11 release a week from today.

-- Vossel

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-03 Thread David Vossel
- Original Message -
 From: Brian J. Murrell br...@interlinx.bc.ca
 To: pacema...@clusterlabs.org
 Sent: Monday, December 2, 2013 2:50:41 PM
 Subject: [Pacemaker] catch-22: can't fence node A because node A has the  
 fencing resource
 
 So, I'm migrating my working pacemaker configuration from 1.1.7 to
 1.1.10 and am finding what appears to be a new behavior in 1.1.10.
 
 If a given node is running a fencing resource and that node goes AWOL,
 it needs to be fenced (of course).  But any other node trying to take
 over the fencing resource to fence it appears to first want the current
 owner of the fencing resource to fence the node.  Of course that can't
 happen since the node that needs to do the fencing is AWOL.
 
 So while I can buy into the general policy that a node needs to be
 fenced in order to take over its resources, fencing resources have to
 be excepted from this or there can be this catch-22.

We did away with all of the policy engine logic involved with trying to move 
fencing devices off of the target node before executing the fencing action. 
Behind the scenes all fencing devices are now essentially clones.  If the 
target node to be fenced has a fencing device running on it, that device can 
execute anywhere in the cluster to avoid the suicide situation.

When you are looking at crm_mon output and see a fencing device is running on a 
specific node, all that really means is that we are going to attempt to execute 
fencing actions for that device from that node first. If that node is 
unavailable, we'll try that same device anywhere in the cluster we can get it 
to work (unless you've specifically built some location constraint that 
prevents the fencing device from ever running on a specific node)
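
As an illustration, using the names from your config, such a restriction would
only exist if something like the following had been added explicitly; it is
normally unnecessary for fencing devices:

    pcs constraint location st-fencing avoids host1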

Hope that helps.

-- Vossel

 
 I believe that is how things were working in 1.1.7 but now that I'm on
 1.1.10[-1.el6_4.4] this no longer seems to be the case.
 
 Or perhaps there is some additional configuration that 1.1.10 needs to
 effect this behavior.  Here is my configuration:
 
 Cluster Name:
 Corosync Nodes:
  
 Pacemaker Nodes:
  host1 host2
 
 Resources:
  Resource: rsc1 (class=ocf provider=foo type=Target)
   Attributes: target=111bad0a-a86a-40e3-b056-c5c93168aa0d
   Meta Attrs: target-role=Started
   Operations: monitor interval=5 timeout=60 (rsc1-monitor-5)
   start interval=0 timeout=300 (rsc1-start-0)
   stop interval=0 timeout=300 (rsc1-stop-0)
  Resource: rsc2 (class=ocf provider=chroma type=Target)
   Attributes: target=a8efa349-4c73-4efc-90d3-d6be7d73c515
   Meta Attrs: target-role=Started
   Operations: monitor interval=5 timeout=60 (rsc2-monitor-5)
   start interval=0 timeout=300 (rsc2-start-0)
   stop interval=0 timeout=300 (rsc2-stop-0)
 
 Stonith Devices:
  Resource: st-fencing (class=stonith type=fence_foo)
 Fencing Levels:
 
 Location Constraints:
   Resource: rsc1
 Enabled on: host1 (score:20) (id:rsc1-primary)
 Enabled on: host2 (score:10) (id:rsc1-secondary)
   Resource: rsc2
 Enabled on: host2 (score:20) (id:rsc2-primary)
 Enabled on: host1 (score:10) (id:rsc2-secondary)
 Ordering Constraints:
 Colocation Constraints:
 
 Cluster Properties:
  cluster-infrastructure: classic openais (with plugin)
  dc-version: 1.1.10-1.el6_4.4-368c726
  expected-quorum-votes: 2
  no-quorum-policy: ignore
  stonith-enabled: true
  symmetric-cluster: true
 
 One thing that PCS is not showing that might be relevant here is that I
 have a resource-stickiness value set to 1000 to prevent resources from
 failing back to nodes after a failover.
 
 With the above configuration if host1 is shut down, host2 just spins in
 a loop doing:
 
 Dec  2 20:00:02 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will
 be fenced because the node is no longer part of the cluster
 Dec  2 20:00:02 host2 pengine[8923]:  warning: determine_online_status: Node
 host1 is unclean
 Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action
 st-fencing_stop_0 on host1 is unrunnable (offline)
 Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action
 rsc1_stop_0 on host1 is unrunnable (offline)
 Dec  2 20:00:02 host2 pengine[8923]:  warning: stage6: Scheduling Node host1
 for STONITH
 Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move
 st-fencing#011(Started host1 - host2)
 Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move
 rsc1#011(Started host1 - host2)
 Dec  2 20:00:02 host2 crmd[8924]:   notice: te_fence_node: Executing reboot
 fencing operation (13) on host1 (timeout=6)
 Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: handle_request: Client
 crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
 Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op:
 Initiating remote operation reboot for host1:
 ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
 Dec  2 20:00:02 host2 pengine[8923]:  warning: process_pe_message: Calculated
 Transition 22: 

Re: [Pacemaker] Fencing: Where?

2013-12-03 Thread David Vossel
- Original Message -
 From: Michael Schwartzkopff m...@sys4.de
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Monday, December 2, 2013 4:38:12 AM
 Subject: Re: [Pacemaker] Fencing: Where?
 
 Am Montag, 2. Dezember 2013, 13:15:23 schrieb Nikita Staroverov:
   Hi,
   
    as far as I understood, RH is going to do all infrastructure in the cman
    layer and use pacemaker only for resource management. With this setup,
    fencing will also be a job of fenced from the cman package.
   
    This design has its advantages: if decisions are taken at a low level,
    all
    parts of the cluster have the chance to know about these decisions.
   
    BUT: if pacemaker needs to fence a node, it has no way to do so
    any
    more. Imagine a resource will not stop and pacemaker would like to fence
    that node in order to continue. How would that situation be handled with
    fencing in cman?
   
    Is there any way pacemaker can tell cman about its wish to fence
    another
    node?
   
    The only solution I can think of is to delegate the fencing in cman to
    pacemaker. So both layers are able to fence. But this cannot be a good
    solution, since we have to set up fencing in two places.
   
   Thanks for fruitful discussion.
   
   
   Mit freundlichen Grüßen,
   
   Michael Schwartzkopff
  
  fence_pcmk does the job. Have you got any troubles with it?
 
  No. I use it now. But setting up fencing in two places
  - cman, which links to pacemaker
  - pacemaker

  is not very nice to configure.

You aren't really configuring fencing devices in two places.  The fence_pcmk 
device in cluster.conf is just telling 'fenced' to forward all its fencing 
requests to stonith-ng.  This allows us to configure all the real fencing 
devices in one place now using pacemaker.
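
Concretely, the cman side is only the pass-through; something like the
following (node name is a placeholder) is all cluster.conf needs, while the
real devices are configured in pacemaker:

    ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
    ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node1
    ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node1 pcmk-redirect port=node1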

-- Vossel

 
  Is there any way that pacemaker can tell cman about its fencing decisions?

 
 
 --
 Mit freundlichen Grüßen,
 
 Michael Schwartzkopff
 
 --
 [*] sys4 AG
 
 http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
 Franziskanerstraße 15, 81669 München
 
 Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
 Aufsichtsratsvorsitzender: Florian Kirstein
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Can in pass commands to Pacemaker from a Remote Machine

2013-11-26 Thread David Vossel
- Original Message -
 From: Puneet Jindal puneetjin...@drishti-soft.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Sunday, November 24, 2013 2:58:35 AM
 Subject: [Pacemaker] Can in pass commands to Pacemaker from a Remote Machine
 
 Hello,
 
  I want to build a GUI on top of pacemaker. I configured remote-tls-port in
  the CIB, and now the CIB is listening on that port. What commands can I send
  to the CIB, and how? Can anyone provide some examples?

look at the 'cibadmin' tool
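
For example, a few typical cibadmin calls a GUI could issue (a sketch; see
'cibadmin --help' for the full option set):

    # read the whole CIB, or just one section of it
    cibadmin --query > cib.xml
    cibadmin --query --scope resources
    # push a modified copy back
    cibadmin --replace --xml-file cib.xml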

 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Use of Pacemaker to configure a new resource

2013-11-20 Thread David Vossel




- Original Message -
 From: Aarti Sawant aartipsawan...@gmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, November 20, 2013 3:55:40 AM
 Subject: [Pacemaker] Use of Pacemaker to configure a new resource
 
 Hello,
 
  I have a PostgreSQL replication setup. It consists of master A and
  standbys B and C, which connect directly to the master. I am using a
  tool called WANProxy for compression of data to be transferred between
 master and standby. In case of failure of master, I want WANProxy to
 run on any of the standbys.
 
  As far as I understand, this can be done by scripting an OCF resource
  agent for WANProxy?

Yes, you could make your own custom OCF script to manage WANProxy, or you could 
potentially have pacemaker start WANProxy using whatever init script is shipped 
with the daemon as well.
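
For reference, here is a minimal sketch of what such an agent could look like.
Everything in it (agent name, pid file, config path) is an illustrative
assumption rather than a tested agent, and it assumes wanproxy stays in the
foreground when started:

    #!/bin/sh
    # minimal OCF-style agent sketch for wanproxy (illustration only)
    : ${OCF_ROOT:=/usr/lib/ocf}
    . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

    CONF="${OCF_RESKEY_config:-/home/user_name/server_v2.conf}"
    PIDFILE="/var/run/wanproxy.pid"

    wanproxy_monitor() {
        # running if the recorded pid still exists
        [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }
    wanproxy_start() {
        wanproxy_monitor && return $OCF_SUCCESS
        wanproxy -c "$CONF" &
        echo $! > "$PIDFILE"
        wanproxy_monitor
    }
    wanproxy_stop() {
        [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" 2>/dev/null
        rm -f "$PIDFILE"
        return $OCF_SUCCESS
    }

    case "$1" in
        start)        wanproxy_start ;;
        stop)         wanproxy_stop ;;
        monitor)      wanproxy_monitor ;;
        meta-data)    exit $OCF_SUCCESS ;;   # a real agent must print its XML metadata here
        validate-all) exit $OCF_SUCCESS ;;
        *)            exit $OCF_ERR_UNIMPLEMENTED ;;
    esac

Installed under /usr/lib/ocf/resource.d/<your-provider>/wanproxy and made
executable, it could then be added with something like
'pcs resource create wanproxy ocf:<your-provider>:wanproxy'.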

 So my primary question is that can Pacemaker be used to start WANProxy
 compression on different machine when one of my machine fails?

Yes, this sounds like basic resource failover, which is the primary reason 
people use Pacemaker.

 
 The command to start WANProxy is
 wanproxy -c /home/user_name/server_v2.conf
 
  Also, I want to know: by using some parameters in resource agents, can
  pacemaker also modify configuration files of compression tools like
  WANProxy, ssl, ssh?

Pacemaker simply executes resource-agent scripts and is not concerned with what 
those scripts do.  If you want to modify configuration files for some reason, 
build that logic into your custom OCF scripts and pacemaker will execute them.

-- Vossel

 
 Thanks,
 Aarti Sawant
 NTTDATA OSS Center Pune
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Cannot use ocf::heartbeat:IPsrcaddr (RTNETLINK answers: No such process)

2013-11-06 Thread David Vossel
- Original Message -
 From: Mathieu Peltier mathieu.pelt...@gmail.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Wednesday, November 6, 2013 11:27:50 AM
 Subject: [Pacemaker] Cannot use ocf::heartbeat:IPsrcaddr (RTNETLINK answers:  
 No such process)
 
 Hi,
 I am trying to set up a simple cluster of 2 machines on CentOS 6.4:
  pacemaker-cli-1.1.10-1.el6_4.4.x86_64
  pacemaker-1.1.10-1.el6_4.4.x86_64
  pacemaker-libs-1.1.10-1.el6_4.4.x86_64
  pacemaker-cluster-libs-1.1.10-1.el6_4.4.x86_64
  corosync-1.4.1-15.el6_4.1.x86_64
  corosynclib-1.4.1-15.el6_4.1.x86_64
  pcs-0.9.90-1.el6_4.noarch
  cman-3.0.12.1-49.el6_4.2.x86_64
  resource-agents-3.9.2-21.el6_4.8.x86_64
 
 I am using the following script to configure the cluster:
 --
 #!/bin/bash
 
 CLUSTER_NAME=test
 CONFIG_FILE=/etc/cluster/cluster.conf
 NODE1_EM1=node1
 NODE2_EM1=node2
 NODE1_EM2=node1-priv
 NODE2_EM2=node2-priv
 VIP=192.168.0.6
 MONITOR_INTERVAL=60s
 
 # Make sure that pacemaker is stopped on both nodes
 # NOT INCLUDED HERE
 
 # Delete existing configuration
 rm -rf /var/log/cluster/*
 ssh root@$NODE2_EM2 'rm -rf /var/log/cluster/*'
 rm -rf /var/lib/pacemaker/cib/* /var/lib/pacemaker/cores/*
 /var/lib/pacemaker/pengine/* /var/lib/corosync/* /var/lib/cluster/*
 ssh root@$NODE2_EM2 'rm -rf /var/lib/pacemaker/cib/*
 /var/lib/pacemaker/cores/* /var/lib/pacemaker/pengine/*
 /var/lib/corosync/* /var/lib/cluster/*'
 
 # Create the cluster
 ccs -f $CONFIG_FILE --createcluster $CLUSTER_NAME
 
 # Add nodes to the cluster
 ccs -f $CONFIG_FILE --addnode $NODE1_EM1
 ccs -f $CONFIG_FILE --addnode $NODE2_EM1
 ccs -f $CONFIG_FILE --setcman two_node=1 expected_votes=1
 
 # Add alternative nodes name so that both network interfaces are used
 ccs -f $CONFIG_FILE --addalt $NODE1_EM1 $NODE1_EM2
 ccs -f $CONFIG_FILE --addalt $NODE2_EM1 $NODE2_EM2
 ccs -f $CONFIG_FILE --setdlm protocol=sctp
 
 # Teach CMAN how to send it's fencing requests to Pacemaker
 ccs -f $CONFIG_FILE --addfencedev pcmk agent=fence_pcmk
 ccs -f $CONFIG_FILE --addmethod pcmk-redirect $NODE1_EM1
 ccs -f $CONFIG_FILE --addmethod pcmk-redirect $NODE2_EM1
 ccs -f $CONFIG_FILE --addfenceinst pcmk $NODE1_EM1 pcmk-redirect
 port=$NODE1_EM1
 ccs -f $CONFIG_FILE --addfenceinst pcmk $NODE2_EM1 pcmk-redirect
 port=$NODE2_EM1
 
 # Deploy configuration to node2
 scp /etc/cluster/cluster.conf root@$NODE2_EM2:/etc/cluster/cluster.conf
 
 # Start pacemaker on main node
 /etc/init.d/pacemaker start
 sleep 30
 
 # Disable stonith
 pcs property set stonith-enabled=false
 
 # Disable quorum
 pcs property set no-quorum-policy=ignore
 
 # Define ressources
 pcs resource create VIP_EM1 ocf:heartbeat:IPaddr params nic=em1
 ip=$VIP_EM1 cidr_netmask=24 op monitor interval=$MONITOR_INTERVAL
 pcs resource create PREFERRED_SRC_IP ocf:heartbeat:IPsrcaddr params
 ipaddress=$VIP_EM1 op monitor interval=$MONITOR_INTERVAL
 
 # Define initial location and prevent ressources to go back to initial
 server after a failure
 pcs resource defaults resource-stickiness=100
 pcs constraint location VIP_EM1 prefers $NODE1_EM1=50
 --
 
 After running this script from node1:
 
 root@node1# pcs status
 Cluster name:
 Last updated: Wed Nov  6 17:17:30 2013
 Last change: Wed Nov  6 17:06:20 2013 via crm_attribute on node1
 Stack: cman
 Current DC: node1 - partition with quorum
 Version: 1.1.10-1.el6_4.4-368c726
 2 Nodes configured
 2 Resources configured
 
 Online: [ node1 ]
 OFFLINE: [ node2 ]
 
 Full list of resources:
 
  VIP_EM1 (ocf::heartbeat:IPaddr): Stopped
  PREFERRED_SRC_IP (ocf::heartbeat:IPsrcaddr): Stopped
 
 Failed actions:
 PREFERRED_SRC_IP_start_0 on node1 'unknown error' (1): call=19,
 status=complete, last-rc-change='Wed Nov  6 17:06:20 2013',
 queued=67ms, exec=0ms
 
 PCSD Status:
 Error: no nodes found in corosync.conf
 
 root@node1# ip route show
 192.168.8.0/24 dev em2  proto kernel  scope link  src 192.168.8.1
 default via 192.168.0.1 dev em1
 
 Error in /var/log/cluster/corosync.log:
 ...
 IPsrcaddr(PREFERRED_SRC_IP)[638]:   2013/11/06_16:50:32 ERROR:
 command 'ip route change to  default via 192.168.0.1 dev em1 src
 192.168.0.6' failed
 Nov 06 16:50:32 [32461] node1.domain.org   lrmd:   notice:
 operation_finished:   PREFERRED_SRC_IP_start_0:638:stderr [
 RTNETLINK answers: No such process ]
 ...
 
 If I run the command manually when pacemaker is not started (after
 rebooting the machine for example), the default route is modified as
 expected (I use 192.168.0.106 because the alias 192.168.0.6 is not
 started)
 
 # ip route show
 192.168.0.0/24 dev em1  proto kernel  scope link  src 192.168.0.106
 192.168.8.0/24 dev em3  proto kernel  scope link  src 192.168.8.1
 default via 192.168.0.1 dev em1
 
 # ip route change to  default via 192.168.0.1 dev em1 src 192.168.0.106
 
 # ip route show
 192.168.0.0/24 dev em1  proto kernel  scope link  src 192.168.0.106
 192.168.8.0/24 dev em3  proto kernel  scope link  src 192.168.8.1
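
Before retrying through the cluster, a quick sanity check of the pieces the agent relies on can help; the following plain iproute2 commands are only a sketch for the addresses used above:

 # Is the address actually configured on em1? IPsrcaddr can only point the
 # default route at an address the node currently owns.
 ip -o addr show dev em1
 # What does the default route look like right now? This is the route the agent modifies.
 ip route show default
 # The same change the agent attempts, run by hand for the address that is really present:
 ip route change to default via 192.168.0.1 dev em1 src 192.168.0.106
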

Re: [Pacemaker] Question about new migration logic

2013-11-05 Thread David Vossel
- Original Message -
 From: Kazunori INOUE kazunori.ino...@gmail.com
 To: pm pacemaker@oss.clusterlabs.org
 Sent: Friday, November 1, 2013 4:43:53 AM
 Subject: [Pacemaker] Question about new migration logic
 
 Hi David,
 
 Because I have a plan to test the function of migration in
 pacemaker-1.1, I am interested in this commit.
 https://github.com/davidvossel/pacemaker/commit/673e8599e4
 
 If this new migration logic is merged into ClusterLabs,
 when do you think it will be merged?

That patch is still incomplete.  I'm pretty sure the idea I'm working with 
there will work, but it's not quite there yet.

As far as when it will make it into ClusterLabs, that's hard to say.  It has 
been a challenge to free up the day or two I need to finish it up.  Hopefully 
I'll be able to work on it soon.

-- Vossel

 Best Regards,
 Kazunori INOUE
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Asymmetric cluster, clones, and location constraints

2013-10-31 Thread David Vossel
- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, October 30, 2013 1:08:12 AM
 Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location constraints
 
 
 On 25 Oct 2013, at 9:40 am, David Vossel dvos...@redhat.com wrote:
 
  
  
  
  
  - Original Message -
  From: Lindsay Todd rltodd@gmail.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Wednesday, October 23, 2013 2:38:17 PM
  Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location
  constraints
  
  David,
  
  The Infiniband network takes a nondeterministic amount of time to actually
  finish initializing, so we use ethmonitor to watch it; the OS is supposed
  to
  bring it up at boot time, but it moves on through the boot sequence
  without
  actually waiting for it. So in self defense we watch it with pacemaker. I
  guess I could restructure this to use a resource that brings up IB (with a
  really long time out) and use ordering to wait for that complete, but it
  seems that ethmonitor would be more adaptive to short-term IB network
  issues. Since ethmonitor works by setting an attribute (the RA running
  means
  it is watching the network, not that the network is up), I've used
  location
  constraints instead of ordering constraints.
  
  So I have completely restarted my cluster. Right now the physical nodes
  see
  each other, and the fencing agents are running. The first thing that
  should
  start are the ethmonitor resource agents on the VM hosts (the c-watch-ib0
  clones of the p-watch-ib0 primitive). They are not starting (like they
  used
  to).
  
  I see.  Your cib generates an invalid transition.  I'll try and look into
  it in more detail soon to understand the cause.
 
 According to git bisect, the winner is:

I always knew I was a winner

 
 15a86e501a57b50fdb3b8ce0ed432b183c343c74 is the first bad commit
 commit 15a86e501a57b50fdb3b8ce0ed432b183c343c74
 Author: David Vossel dvos...@redhat.com
 Date:   Mon Sep 23 18:55:21 2013 -0500
 
 High: pengine: Probe container nodes
 
 
 I'll take a look in the morning unless David beats me to it :-)

This is a tough one.  I enabled probing container nodes, but didn't anticipate 
the scenario where there's an ordering constraint involving a container node's 
container resource (the VM).

I have an idea of how to fix this, but the end result might make probing 
containers useless.  I'll give this some thought.

Until then, there is a really easy workaround for this: set the 
'enable-container-probes' global config option to false.
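
For reference, such an option would be set like any other cluster property; a sketch only, assuming the option exists under that name in the build in question:

 # Property name taken from the message above; whether this build recognises it
 # depends on the pacemaker version in use.
 pcs property set enable-container-probes=false
 # or, with the crm shell:
 crm configure property enable-container-probes=false
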

-- Vossel
  


  
  One completely unrelated thought I had while looking at your config
  involves your fencing agents. You shouldn't have to use location
  constraints at on the fencing agents. I believe stonith is smart enough
  now to execute the agent on a node that isn't the target regardless of
  where the policy engine puts it.
  
  -- Vossel
  
  The cib snapshot can be seen in http://pastebin.com/TccTHQPS (some
  slight editing to hide passwords in fencing agents).
  
  /Lindsay
  
  
  On Wed, Oct 23, 2013 at 11:20 AM, David Vossel  dvos...@redhat.com 
  wrote:
  
  
  
  - Original Message -
  From: Lindsay Todd  rltodd@gmail.com 
  To: The Pacemaker cluster resource manager 
  Pacemaker@oss.clusterlabs.org 
  Sent: Tuesday, October 22, 2013 4:19:11 PM
  Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints
  
  I am getting rather unexpected behavior when I combine clones, location
  constraints, and remote nodes in an asymmetric cluster. My cluster is
  configured to be asymmetric, distinguishing between vmhosts and various
  sorts of remote nodes. Currently I am running upstream version b6d42ed. I
  am
  simplifying my description to avoid confusion, hoping in so doing I don't
  miss any salient points...
  
  My physical cluster nodes, also the VM hosts, have the attribute
  nodetype=vmhost. They also have Infiniband interfaces, which take some
  time to come up. I don't want my shared file system (which needs IB), or
  libvirtd (which needs the file system), to come up before IB... So I have
  this in my configuration:
  
  
  
  
  primitive p-watch-ib0 ocf:heartbeat:ethmonitor \
  params \
  interface=ib0 \
  op monitor timeout=100s interval=10s
  clone c-watch-ib0 p-watch-ib0 \
  meta interleave=true
  #
  location loc-watch-ib-only-vmhosts c-watch-ib0 \
  rule 0: nodetype eq vmhost
  
  Something broke between upstream versions 0a2570a and c68919f -- the
  c-watch-ib0 clone never starts. I've found that if I run crm_resource
  --force-start -r p-watch-ib0 when IB is running, the ethmonitor-ib0
  attribute is not set like it used to be. Oh well, I can set it manually.
  So
  let's.
  
  A re-write of the attrd component was introduced around that time period.
  This should have been resolved at this point in the b6d42ed build

Re: [Pacemaker] Asymmetric cluster, clones, and location constraints

2013-10-24 Thread David Vossel




- Original Message -
 From: Lindsay Todd rltodd@gmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, October 23, 2013 2:38:17 PM
 Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location constraints
 
 David,
 
 The Infiniband network takes a nondeterministic amount of time to actually
 finish initializing, so we use ethmonitor to watch it; the OS is supposed to
 bring it up at boot time, but it moves on through the boot sequence without
 actually waiting for it. So in self defense we watch it with pacemaker. I
 guess I could restructure this to use a resource that brings up IB (with a
 really long time out) and use ordering to wait for that complete, but it
 seems that ethmonitor would be more adaptive to short-term IB network
 issues. Since ethmonitor works by setting an attribute (the RA running means
 it is watching the network, not that the network is up), I've used location
 constraints instead of ordering constraints.
 
 So I have completely restarted my cluster. Right now the physical nodes see
 each other, and the fencing agents are running. The first thing that should
 start are the ethmonitor resource agents on the VM hosts (the c-watch-ib0
 clones of the p-watch-ib0 primitive). They are not starting (like they used
 to).

I see.  Your cib generates an invalid transition.  I'll try and look into it in 
more detail soon to understand the cause.

One completely unrelated thought I had while looking at your config involves 
your fencing agents. You shouldn't have to use location constraints at all on the 
fencing agents. I believe stonith is smart enough now to execute the agent on a 
node that isn't the target regardless of where the policy engine puts it.

-- Vossel

 The cib snapshot can be seen in http://pastebin.com/TccTHQPS (some
 slight editing to hide passwords in fencing agents).
 
 /Lindsay
 
 
 On Wed, Oct 23, 2013 at 11:20 AM, David Vossel  dvos...@redhat.com  wrote:
 
 
 
 - Original Message -
  From: Lindsay Todd  rltodd@gmail.com 
  To: The Pacemaker cluster resource manager 
  Pacemaker@oss.clusterlabs.org 
  Sent: Tuesday, October 22, 2013 4:19:11 PM
  Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints
  
  I am getting rather unexpected behavior when I combine clones, location
  constraints, and remote nodes in an asymmetric cluster. My cluster is
  configured to be asymmetric, distinguishing between vmhosts and various
  sorts of remote nodes. Currently I am running upstream version b6d42ed. I
  am
  simplifying my description to avoid confusion, hoping in so doing I don't
  miss any salient points...
  
  My physical cluster nodes, also the VM hosts, have the attribute
  nodetype=vmhost. They also have Infiniband interfaces, which take some
  time to come up. I don't want my shared file system (which needs IB), or
  libvirtd (which needs the file system), to come up before IB... So I have
  this in my configuration:
  
  
  
  
  primitive p-watch-ib0 ocf:heartbeat:ethmonitor \
  params \
  interface=ib0 \
  op monitor timeout=100s interval=10s
  clone c-watch-ib0 p-watch-ib0 \
  meta interleave=true
  #
  location loc-watch-ib-only-vmhosts c-watch-ib0 \
  rule 0: nodetype eq vmhost
  
  Something broke between upstream versions 0a2570a and c68919f -- the
  c-watch-ib0 clone never starts. I've found that if I run crm_resource
  --force-start -r p-watch-ib0 when IB is running, the ethmonitor-ib0
  attribute is not set like it used to be. Oh well, I can set it manually. So
  let's.
 
 A re-write of the attrd component was introduced around that time period.
 This should have been resolved at this point in the b6d42ed build.
 
  We use GPFS for a shared file system, so I have an agent to start it and
  wait
  for a file system to mount. It should only run on VM hosts, and only when
  IB
  is running. So I have this:
 
 So the IB resource is setting some attribute that enables the fs to run? Why
  can't an ordering constraint be used here between IB and FS?
 
  
  
  
  
  primitive p-fs-gpfs ocf:ccni:gpfs \
  params \
  fspath=/gpfs/lb/utility \
  op monitor timeout=20s interval=30s \
  op start timeout=180s \
  op stop timeout=120s
  clone c-fs-gpfs p-fs-gpfs \
  meta interleave=true
  location loc-fs-gpfs-needs-ib0 c-fs-gpfs \
  rule -inf: not_defined ethmonitor-ib0 or ethmonitor-ib0 eq 0
  location loc-fs-gpfs-on-vmhosts c-fs-gpfs \
  rule 0: nodetype eq vmhost
  
  That all used to start nicely. Now even if I set the ethmonitor-ib0
  attribute, it doesn't. However, I can use crm_resource --force-start -r
  p-fs-gpfs on each of my VM hosts, then issue crm resource cleanup
  c-fs-gpfs, and all is well. I can use crm status to see something like:
  
  
  
  Last updated: Tue Oct 22 16:35:43 2013
  Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
  Stack: cman
  Current DC: cvmh04 - partition with quorum
  Version: 1.1.10-19.el6.ccni-b6d42ed
  8 Nodes configured
  92 Resources configured

Re: [Pacemaker] libqb-0.16 instability with standby/unstandby ?

2013-10-23 Thread David Vossel




- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, October 23, 2013 12:12:32 AM
 Subject: Re: [Pacemaker] libqb-0.16 instability with standby/unstandby ?
 
 
 On 23 Oct 2013, at 8:39 am, David Vossel dvos...@redhat.com wrote:
 
  - Original Message -
  From: Mike Pomraning m...@pilcrow.madison.wi.us
  To: pacemaker@oss.clusterlabs.org
  Sent: Tuesday, October 22, 2013 10:49:28 AM
  Subject: [Pacemaker] libqb-0.16 instability with standby/unstandby ?
  
  Regarding Justin Burnham's recent Pacemaker crash on node
  unstandby/standby[0] message, is anyone else seeing this behavior with
  libqb-0.16?
  
  I'm getting anecdotal reports of the same behavior from a team at work
  using
  RHEL-derived pcmk-1.1.8 and corosync-1.4.1 with libqb-0.16. Reverting to
  libqb-0.14 appears to have solved the issue. Sorry, I don't have enough to
  reproduce yet, but the similarities in symptoms are suggestive.
  
  FWIW, Justin also noted off list that his problems appear to have begun
  after
  updating to 0.16 a short time ago.
  
  I've tracked this down. Don't use libqb v0.16.0 with any pacemaker version
  less than 1.1.10.
 
 So this isn't just a question of older pacemaker versions needing a rebuild?

A rebuild won't help

  
   There are multiple elements involved with this problem. Libqb had
   reference count leaks in 0.14.4; once those got resolved, we discovered a
   race condition in pacemaker 1.1.8 that caused a double free... Ultimately
   the reference count leaks appear to have covered up the problem in
   pacemaker... Updating to libqb 0.16.0 when using pacemaker 1.1.8 exposes
   that race condition, which is what you are all seeing.
  
  -- Vossel
  
  -Mike
  
  [0] http://comments.gmane.org/gmane.linux.highavailability.pacemaker/19289
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
  
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Asymmetric cluster, clones, and location constraints

2013-10-23 Thread David Vossel
- Original Message -
 From: Lindsay Todd rltodd@gmail.com
 To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org
 Sent: Tuesday, October 22, 2013 4:19:11 PM
 Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints
 
 I am getting rather unexpected behavior when I combine clones, location
 constraints, and remote nodes in an asymmetric cluster. My cluster is
 configured to be asymmetric, distinguishing between vmhosts and various
 sorts of remote nodes. Currently I am running upstream version b6d42ed. I am
 simplifying my description to avoid confusion, hoping in so doing I don't
 miss any salient points...
 
 My physical cluster nodes, also the VM hosts, have the attribute
 nodetype=vmhost. They also have Infiniband interfaces, which take some
 time to come up. I don't want my shared file system (which needs IB), or
 libvirtd (which needs the file system), to come up before IB... So I have
 this in my configuration:
 
 
 
 
 primitive p-watch-ib0 ocf:heartbeat:ethmonitor \
 params \
 interface=ib0 \
 op monitor timeout=100s interval=10s
 clone c-watch-ib0 p-watch-ib0 \
 meta interleave=true
 #
 location loc-watch-ib-only-vmhosts c-watch-ib0 \
 rule 0: nodetype eq vmhost
 
 Something broke between upstream versions 0a2570a and c68919f -- the
 c-watch-ib0 clone never starts. I've found that if I run crm_resource
 --force-start -r p-watch-ib0 when IB is running, the ethmonitor-ib0
 attribute is not set like it used to be. Oh well, I can set it manually. So
 let's.

A re-write of the attrd component was introduced around that time period.  This 
should have been resolved at this point in the b6d42ed build.
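
For anyone following along, "setting it manually" here would typically mean publishing the transient node attribute that ethmonitor normally maintains (1 when the link is up, 0 when it is down). A rough sketch, with the attribute name taken from the configuration above:

 # Sketch: push the transient attribute the ethmonitor clone would normally set.
 # Run on the node whose ib0 is known to be up; adjust the name if the RA is
 # configured for a different interface.
 attrd_updater --name ethmonitor-ib0 --update 1
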

 We use GPFS for a shared file system, so I have an agent to start it and wait
 for a file system to mount. It should only run on VM hosts, and only when IB
 is running. So I have this:

So the IB resource is setting some attribute that enables the fs to run?  Why 
can't an ordering constraint be used here between IB and FS?
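
For illustration, the kind of ordering constraint being suggested would look something like the following in crm shell syntax (resource names taken from the quoted configuration; a sketch only, since ethmonitor exposes link state through an attribute rather than through start/stop):

 # Sketch: start the GPFS clone only after the ethmonitor clone is running.
 # Note this only orders against the RA starting, not against ib0 actually being up.
 order o-fs-gpfs-after-watch-ib0 inf: c-watch-ib0 c-fs-gpfs
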

 
 
 
 
 primitive p-fs-gpfs ocf:ccni:gpfs \
 params \
 fspath=/gpfs/lb/utility \
 op monitor timeout=20s interval=30s \
 op start timeout=180s \
 op stop timeout=120s
 clone c-fs-gpfs p-fs-gpfs \
 meta interleave=true
 location loc-fs-gpfs-needs-ib0 c-fs-gpfs \
 rule -inf: not_defined ethmonitor-ib0 or ethmonitor-ib0 eq 0
 location loc-fs-gpfs-on-vmhosts c-fs-gpfs \
 rule 0: nodetype eq vmhost
 
 That all used to start nicely. Now even if I set the ethmonitor-ib0
 attribute, it doesn't. However, I can use crm_resource --force-start -r
 p-fs-gpfs on each of my VM hosts, then issue crm resource cleanup
 c-fs-gpfs, and all is well. I can use crm status to see something like:
 
 
 
 Last updated: Tue Oct 22 16:35:43 2013
 Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
 Stack: cman
 Current DC: cvmh04 - partition with quorum
 Version: 1.1.10-19.el6.ccni-b6d42ed
 8 Nodes configured
 92 Resources configured
 
 
 Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
 
 fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
 fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
 fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
 fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
 Clone Set: c-fs-gpfs [p-fs-gpfs]
 Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
 which is what I would expect (other than I expect pacemaker to have started
 these for me, like it used to).
 
 Now I also have clone resources to NFS-mount another file system, and
 actually do a bind mount out of the GPFS file system, which behave like the
 GPFS resource -- they used to just work, now I need to use crm_resource
 --force-start and clean up. That finally lets me start libvirtd, using this
 configuration:
 
 
 
 
 primitive p-libvirtd lsb:libvirtd \
 op monitor interval=30s
 clone c-p-libvirtd p-libvirtd \
 meta interleave=true
 order o-libvirtd-after-storage inf: \
 ( c-fs-libvirt-VM-xcm c-fs-bind-libvirt-VM-cvmh ) \
 c-p-libvirtd
 location loc-libvirtd-on-vmhosts c-p-libvirtd \
 rule 0: nodetype eq vmhost
 
 Of course that used to just work, but now, like the other clones, I need to
 force-start libvirtd on the VM hosts, and clean up. Once I do that, all my
 VM resources, which are not clones, just start up like they are supposed to!
 Several of these are configured as remote nodes, and they have services
 configured to run in them. But now other strange things happen:
 
 
 
 
 Last updated: Tue Oct 22 16:46:29 2013
 Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
 Stack: cman
 Current DC: cvmh04 - partition with quorum
 Version: 1.1.10-19.el6.ccni-b6d42ed
 8 Nodes configured
 92 Resources configured
 
 
 ContainerNode slurmdb02:vm-slurmdb02: UNCLEAN (offline)
 Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
 Containers: [ db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
 
 fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
 fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
 fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
 fence-cvmh04 (stonith:fence_ipmilan): Started 
